Extracting emails from a file, string or URL with PHP and regex

July 31, 2009

Sometimes in life you have a crazy boss who comes up to you and asks you to do odd things, like export a folder with a few thousand emails to a CSV file and then extract all the email addresses from that file. Luckily, this is possible using PHP and regex. Below I’ll show you how I did it.

After a little research online, I found that the Wikipedia entry on email addresses lists all the characters allowed in an email address:

E-mail addresses are formally defined in RFC 5322 (mostly section 3.4.1) and to a lesser degree RFC 5321. An e-mail address is a string of a subset of ASCII characters (see however the internationalize addresses below) separated into 2 parts by an “@” (at sign), a “local-part” and a domain, that is, local-part@domain.

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters. However, the maximum length of a forward or reverse path length of 256 characters restricts the entire e-mail address to be no more than 254 characters.[1] Some mail protocols, such as X.400, may require larger objects, however. The SMTP specification recommends that software implementations impose no limits for the lengths of such objects.

The local-part of the e-mail address may use any of these ASCII characters:

  • Uppercase and lowercase English letters (a-z, A-Z)
  • Digits 0 through 9
  • Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
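
To make the quoted rules concrete, here is a minimal sketch of a checker for the local-part on its own. The function name isPlausibleLocalPart is made up for illustration, and it only covers the characters, the dot rule and the 64-character limit quoted above, not every form the RFCs permit.

```php
<?php
// Hypothetical helper (not part of the original script): checks a local-part
// against the character list, the dot rule and the 64-character limit.
function isPlausibleLocalPart(string $local): bool
{
    // Length limit quoted above: at most 64 characters, and not empty.
    if ($local === '' || strlen($local) > 64) {
        return false;
    }
    // Letters, digits, the listed special characters, and the dot.
    // The D modifier stops "$" from matching before a trailing newline.
    if (!preg_match('/^[A-Za-z0-9!#$%&\'*+\-\/=?^_`{|}~.]+$/D', $local)) {
        return false;
    }
    // A dot may not be first or last, and may not appear twice in a row.
    if ($local[0] === '.' || substr($local, -1) === '.' || strpos($local, '..') !== false) {
        return false;
    }
    return true;
}

var_dump(isPlausibleLocalPart('john.doe'));  // bool(true)
var_dump(isPlausibleLocalPart('john..doe')); // bool(false)
```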

 

Armed with this knowledge, we can create our regular expression. By the way, if you look online there are tons of sites with email validation regexes. Unfortunately none of those worked for what I wanted. After enough testing, the regular expression I whipped up does the job we need here. As an aside, I’m not a regex guru, but I did stay at a Holiday Inn Express.

/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.-_]+)(\.[A-Za-z]{2,5})/

Now if you’re not familiar with regular expressions this looks like a bunch of nonsense. So let me explain with the help of RegexBuddy.

Match the regular expression below and capture its match into backreference number 1 «([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)»
   Match a single character present in the list below «[A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A character in the range between “A” and “Z” «A-Z»
      A character in the range between “a” and “z” «a-z»
      A character in the range between “0” and “9” «0-9»
      A . character «\.»
      A - character «\-»
      A _ character «\_»
      A ! character «\!»
      A # character «\#»
      A $ character «\$»
      A % character «\%»
      A & character «\&»
      A ' character «\'»
      A * character «\*»
      A + character «\+»
      A / character «\/»
      A = character «\=»
      A ? character «\?»
      A ^ character «\^»
      A ` character «\`»
      A { character «\{»
      A | character «\|»
      A } character «\}»
Match the character “@” literally «\@»
Match the regular expression below and capture its match into backreference number 2 «([A-Za-z0-9.-_]+)»
   Match a single character present in the list below «[A-Za-z0-9.-_]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A character in the range between “A” and “Z” «A-Z»
      A character in the range between “a” and “z” «a-z»
      A character in the range between “0” and “9” «0-9»
      A character in the range between “.” and “_” «.-_»
Match the regular expression below and capture its match into backreference number 3 «(\.[A-Za-z]{2,5})»
   Match the character “.” literally «\.»
   Match a single character present in the list below «[A-Za-z]{2,5}»
      Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
      A character in the range between “A” and “Z” «A-Z»
      A character in the range between “a” and “z” «a-z»
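
One thing the breakdown makes visible: in the domain class, «.-_» is a range from “.” to “_” (ASCII 46 through 95), so it lets in characters like ; < = > and @ but leaves out the hyphen. The local-part class is also missing the ~ from the Wikipedia list, and {2,5} cuts off longer top-level domains. The sketch below compares the pattern from this post with an adjusted one. The adjusted pattern and the findEmails helper are my own assumptions for illustration, not a complete RFC 5322 matcher.

```php
<?php
// The pattern from this post, unchanged.
$original = '/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.-_]+)(\.[A-Za-z]{2,5})/';

// Hypothetical adjusted pattern: hyphen placed last in each class so it is a
// literal, "~" added to the local-part, and no upper limit on the final label.
$adjusted = '/[A-Za-z0-9._!#$%&\'*+\/=?^`{|}~-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';

// Hypothetical helper: returns only the full matches.
function findEmails(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = 'jane@example.com;joe@test.org, alice@my-host.org';

print_r(findEmails($original, $text));
// One match: "jane@example.com;joe@test.org" (";" and "@" fall inside the range,
// and the hyphenated domain is skipped).

print_r(findEmails($adjusted, $text));
// Three matches: jane@example.com, joe@test.org, alice@my-host.org
```

For a comma-separated CSV the original pattern behaves better, because the comma sits outside the accidental range.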

 

I know, I know. Even explained, it’s a little tough to figure out if you’re not used to regular expressions. Google “regex” or “regular expressions tutorial” for more help.

Now that we have our regex all set up, we can use the PHP function file_get_contents to read some content into a string. file_get_contents can also read a URL if we want. All you email spammers will probably love this tutorial, by the way.

// open the file and get contents.
$file = file_get_contents('content.csv', FILE_USE_INCLUDE_PATH);
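
Since the input can be a file, a URL or just a string, here is a rough sketch of one way to handle all three. The loadContent function is a made-up name, and the rule it uses to tell them apart is my assumption. Reading a URL this way only works when allow_url_fopen is enabled in php.ini.

```php
<?php
// Hypothetical helper: returns the text to search, whatever the source is.
function loadContent(string $source): string
{
    // Assumption: anything starting with http:// or https:// is a URL,
    // anything that exists on disk is a file, and the rest is raw text.
    $isUrl = preg_match('#^https?://#i', $source) === 1;

    if ($isUrl || is_file($source)) {
        $content = @file_get_contents($source);
        if ($content === false) {
            throw new RuntimeException("Could not read: $source");
        }
        return $content;
    }

    return $source;
}

// Usage: a raw string is passed straight through.
$file = loadContent('Contact jane@example.com for details');
```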

Let’s put the regex into a variable to use.

// define our search pattern
$pattern = "/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.-_]+)(\.[A-Za-z]{2,5})/";

We’ll use the function preg_match_all, the most important function here. Unlike preg_match, which stops at the first occurrence, preg_match_all searches the entire string from beginning to end. We pass in our regex pattern, the string we want to parse, and a variable that receives the matches: the emails array.

// preg match all in the string
preg_match_all($pattern,$file,$emails);
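
It helps to know what ends up in $emails. With the default flags, index 0 holds the full matches and indexes 1 to 3 hold the three capture groups from the pattern. A small sketch on a made-up two-address string:

```php
<?php
// Same pattern as above, written with single quotes.
$pattern = '/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.-_]+)(\.[A-Za-z]{2,5})/';

// Made-up sample input for illustration.
$text = 'jane@example.com,joe@test.org';

preg_match_all($pattern, $text, $emails);

// $emails[0] => full matches:       jane@example.com, joe@test.org
// $emails[1] => group 1 (local):    jane, joe
// $emails[2] => group 2 (domain):   example, test
// $emails[3] => group 3 (ending):   .com, .org
print_r($emails);
```

That is why the next step only looks at $emails[0].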

Now the file will probably have a bunch of duplicate emails so we’ll need to delete them from our results. We can accomplish this by using array_unique to make a new array with distinct email addresses.

// remove duplicate emails
$new_emails = array_unique($emails[0]);
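
One caveat: array_unique compares strings exactly, so Jane@Example.com and jane@example.com count as two different addresses, and the original array keys are kept. If you would rather treat them as the same, a sketch like this lowercases everything first. Lowercasing the whole address is my assumption here; strictly speaking the local-part can be case-sensitive, though in practice it rarely is.

```php
<?php
// Made-up sample data standing in for $emails[0].
$found = ['Jane@Example.com', 'jane@example.com', 'joe@test.org', 'jane@example.com'];

// Exact comparison: the mixed-case copy survives, keys 0, 1 and 2 are kept.
$caseSensitive = array_unique($found);

// Lowercase first, dedupe, then reindex the array from 0.
$new_emails = array_values(array_unique(array_map('strtolower', $found)));

print_r($new_emails); // jane@example.com, joe@test.org
```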

Finally, let’s run a foreach and echo out the $new_emails array.

// echo out the emails
foreach ($new_emails as $key => $val){
      echo "$val\n";
}

Now we run the script and you should get a nice list of emails. I hope you guys enjoyed this tutorial. If you find this tutorial useful please donate a few bucks via PayPal. Daddy needs a new laptop. ;)

Here’s the script in its full glory for those who just want to copy and paste.

<?php
/*
E-mail addresses are formally defined in RFC 5322 (mostly section 3.4.1) and to a lesser degree RFC 5321.
An e-mail address is a string of a subset of ASCII characters (see however the internationalize addresses below)
separated into 2 parts by an “@” (at sign), a “local-part” and a domain, that is, local-part@domain.

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.
However, the maximum length of a forward or reverse path length of 256 characters restricts the entire e-mail address to be no more than 254 characters.
[1] Some mail protocols, such as X.400, may require larger objects, however.
The SMTP specification recommends that software implementations impose no limits for the lengths of such objects.

The local-part of the e-mail address may use any of these ASCII characters:

Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 through 9
Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
*/

// open the file and get contents.
$file = file_get_contents('investors.CSV', FILE_USE_INCLUDE_PATH);

// define our search pattern
$pattern = "/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.-_]+)(\.[A-Za-z]{2,5})/";

// preg match all in the string
preg_match_all($pattern,$file,$emails);

// remove duplicate emails
$new_emails = array_unique($emails[0]);

// echo out the emails
foreach ($new_emails as $key => $val){
 echo "$val\n";
}

 

?>

Comments, criticism or cash? Just send them my way.

Peace!
