Help with Regular Expression syntax

Marcel · Post by **Marcel** » Tue Oct 28, 2003 5:16 pm

Hi,

Please could someone help me achieve something in TextPad? I have a huge great log file from an e-mail server, and I would like to filter it so that only e-mail addresses for our domain remain in the file, i.e.:

<anyusername>@outdomain.com

There are lots of other addresses and other text either side of what I want to filter out. I have tried various macros, but I think I need to use a regular expression to find what I need.

Any suggestions/tips please?

Thanks,
Marcel

skaemper · Post by **skaemper** » Tue Oct 28, 2003 5:49 pm

Marcel wrote:Hi,
..., and I would like to filter it so that only e-mail addresses for our domain remain in the file, i.e.:

<anyusername>@outdomain.com

There are lots of other addresses and other text either side of what I want to filter out....

Any suggestions/tips please?

Thanks,
Marcel

Is there always only one email on each line? In that case "all" you need is the appropriate regex for "Search & Replace". Otherwise I'd think that using a small script in Ruby (or Python or Perl) might serve you better (especially if that task comes up regularly).

Now, if you just need a regex to describe an email address, you'd do well to have a close look at Friedl's "Mastering Regular Expressions" ( http://www.oreilly.com/catalog/regex2/index.html ). [Which I really, really recommend if you're going to work with regexes regularly.]
Note that a regexp that matches all possible allowed email addresses
and doesn't match any invalid email address is a pretty complex beast....
Check your TextPad help for the usage of character classes and grouping of regexes (click the Help button when the search or replace dialog is shown and just follow the links in the document).

Hope that helped for now...

CyberSlug · Post by **CyberSlug** » Wed Oct 29, 2003 2:22 am

Sounds as if you need an inverse regular expression (i.e. everything but some well-defined stuff....) I concur that writing your own program is the way to go, but here's a tip from another post:

1) Create a regular expression for the address you want. I *think* it might be something like this:
([[:word:]]*[[:digit:]]*)*@outdomain.com
Adjust if the emails are always surrounded by whitespace or angle brackets or something.

2) Search > Find > ... > click the Mark All button

3) Edit > Copy Other > Bookmarked lines

4) Open a new document and Edit > Paste the text

Hope that helps.

P.S. Actually, you don't need to open a new document.
After step 2, you can click Search > Invert All Bookmarks
then you can go to Edit > Delete bookmarked lines
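
For anyone who would rather script it, the mark-and-keep steps above can be sketched in Python. This is only a rough equivalent, not what TextPad does internally: the pattern is an assumed simplification (Python's re module does not understand POSIX classes such as [[:word:]]), and keep_matching_lines is a made-up helper name.

```python
import re

# Assumed pattern: a run of word characters, dots or hyphens, then the domain.
# "outdomain.com" is the placeholder domain used in this thread.
# The trailing \b stops "outdomain.community" from matching.
ADDRESS = re.compile(r"[\w.-]+@outdomain\.com\b", re.IGNORECASE)

def keep_matching_lines(lines):
    """Return only the lines that contain an address at our domain."""
    return [line for line in lines if ADDRESS.search(line)]

sample = [
    "10:01 delivered to alice@outdomain.com ok",
    "10:02 delivered to bob@elsewhere.org ok",
    "10:03 rejected <carol.smith@OUTDOMAIN.COM>",
]
kept = keep_matching_lines(sample)
for line in kept:
    print(line)
```

To run it over a real log, read the lines from the file instead of the sample list and write the kept lines back out.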

talleyrand · Post by **talleyrand** » Wed Oct 29, 2003 5:18 am

Unfortunately for me, Regular expressions and I don't get along. Bob will probably post a kick butt solution in the morning but if you're impatient, here's a program that'll get you in the ballpark. Testing it against my own mailbox, the only thing I saw was that it'll pull back something like <myuserid@mydomain.com> in addition to myuserid@mydomain.com

If you'd like to see a frequency count instead of just occurrences, search this forum for my username; I think the thread was something about unique occurrences in a file. Something like that. Anyway, what you'll want to do is change the l.append part to what's probably listed there as d[key] = d[key] + 1. If you need help, post away.

Install Python and you're good to go.

Copy this code and paste into Textpad.
Save it to something like C:\marcel.py
Update the filename variable and the domain name.

C:\>python marcel.py

You can also run it straight out of TextPad. Set the command to wherever you installed Python (probably C:\python23\python.exe).

Code:

def emailSlurp(fileName, ourAddress="@mydomain.com"):
   """
   Iterate through a text file looking for occurrences of ourAddress.
   Break that line containing ourAddress apart by whitespace and
   create a unique list of all the email addresses containing the
   value of ourAddress.
   """
   #lines are lowercased below, so lowercase the address we compare against
   ourAddress = ourAddress.lower()
   #create an empty list
   l = []
   #iterate through each line of the file
   for currentLine in open(fileName, 'r'):
      #force it to lowercase
      currentLine = currentLine.lower()
      #if the current line contains our address
      if (currentLine.find(ourAddress) >= 0):
         #break the current line apart by whitespace
         words = currentLine.split()
         #iterate through the words in the line
         for token in words:
            #find the tokens that contain our address
            if (token.find(ourAddress) >= 0):
               #add it to our list only if it isn't already there
               if (token not in l):
                  l.append(token)
   return l

def main():
   #the r before the quotes means it's a raw string which means backslashes do not need to be escaped
   fileName = r"f:\python\mailbox.mbox"
   domain = r"@somedomain.com"

   #get a list of unique email addresses in a file
   emailList = emailSlurp(fileName, domain)
   #print each address (the parentheses work in both Python 2 and Python 3)
   for address in emailList:
      print(address)

if (__name__ == "__main__"):
   main()
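
The frequency-count variation mentioned above might look something like this. It is a sketch only: count_addresses is a hypothetical name, it uses collections.Counter rather than the d[key] = d[key] + 1 dictionary from the other thread, and the stripping of angle brackets and punctuation is my own addition to deal with the <myuserid@mydomain.com> duplicates.

```python
from collections import Counter

def count_addresses(lines, our_address="@mydomain.com"):
    """Count how often each whitespace-separated token containing our_address appears."""
    our_address = our_address.lower()
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            if our_address in token:
                # Strip surrounding angle brackets and common punctuation
                # so "<a@b.com>" and "a@b.com," count as the same address.
                counts[token.strip("<>,;:.\"'()[]")] += 1
    return counts

sample = [
    "From: <Alice@mydomain.com>",
    "To: alice@mydomain.com, bob@mydomain.com",
    "Cc: carol@elsewhere.org",
]
counts = count_addresses(sample)
for address, n in counts.most_common():
    print(address, n)
```

Swap the sample list for the lines of the real file to count a whole mailbox or log.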

Bob Hansen · Post by **Bob Hansen** » Wed Oct 29, 2003 8:43 am

Hi skaemper

You can try this using POSIX:

Search for:

^.*[: :]([-a-zA-Z0-9\._]+@[-a-zA-Z0-9]+(\.[-a-zA-Z0-9]+)*\.(com|edu|org|info|gov|int|mil|biz|net)).*$

Replace with:

\1

=======================
Explanation of Search Regex:
^ .....................=beginning of line anchor
.*.......................=any number of characters
[: :]...................= a space character
(........................=start of first tagged expression
[-a-zA-Z0-9\._]+...=one or more alpha-numeric, hyphen, period, or underscore characters
@.........................=@ symbol
[-a-zA-Z0-9]+...........= one or more alpha-numeric characters
(..........................=start of second tagged expression
\. ......................=period
[-a-zA-Z0-9]+............=one or more alpha-numeric characters
)........................= end of second tagged expression
* ......................= any number of previous expression
\. ......................=period
(com|edu|org|info|gov|int|mil|biz|net).....= choice of valid domain extension ID values
)............................=end of first tagged expression
.* ........................=any number of characters
$ ........................=end of line anchor

=======================
Explanation of Replace Regex:
\1.................=first tagged expression:
([-a-zA-Z0-9\._]+@[-a-zA-Z0-9]+(\.[-a-zA-Z0-9]+)*\.(com|edu|org|info|gov|int|mil|biz|net))
=======================

Here are some test lines that I used:
emailname@domain.com. This is at beginning of a sentence, with a space.
This is at the end of a sentence: emailname@Domain.Com
This email address: myname@AOL.org is in the middle of the line.
This name has periods in the name: your.name@myplace.com
And this address: His-Name@thatplace.edu uses a hyphen
And you may also see underscores: your_test@mine.ABC.gov
===========================================

This appears to pick up most email names that I have tested (more than shown above), but....

A few cautions:
1. This requires a leading space, so it may be necessary to add a space to the beginning of every line (or at least to those where the email address starts at the very beginning). You may be able to do that by replacing "\n" with "\n ". (The quotes are only there to show the extra space in the replacement value.)
2. You may want to add more extensions to the end of the domain, I don't think this is all-inclusive.
3. You may need to add more punctuation to the group after the leading space [: :], I don't think this is all-inclusive.
4. This will only handle one email address per line because of the line start/end anchors.

WARNING: This first-pass expression will probably miss some valid email addresses. It could use more work to make it better.
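
Cautions 1 and 4 come from matching whole lines with anchors. Outside TextPad, one way around them is to search for the address itself. Below is a rough Python sketch, assuming the first tagged expression carries over to Python's re module more or less unchanged; the inner groups are made non-capturing, a \b is added after the extension, and extract_addresses is a made-up name.

```python
import re

# The first tagged expression from the post above, with the inner groups
# made non-capturing so that findall() returns whole addresses.
EMAIL = re.compile(
    r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9]+(?:\.[-a-zA-Z0-9]+)*"
    r"\.(?:com|edu|org|info|gov|int|mil|biz|net)\b",
    re.IGNORECASE,
)

def extract_addresses(text):
    """Return every address found in text, in order of appearance."""
    return EMAIL.findall(text)

line = "emailname@domain.com. Also your.name@myplace.com and His-Name@thatplace.edu"
found = extract_addresses(line)
print(found)
```

Because nothing is anchored to the start or end of the line, an address at the very beginning needs no leading space and several addresses on one line are all returned.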

Post by **MudGuard** » Wed Oct 29, 2003 12:22 pm

2. You may want to add more extensions to the end of the domain, I don't think this is all-inclusive.

No, definitely not all-inclusive - ALL the country code top level domains (like us, de, uk, it, au, ...) are missing, and even the list of non-country top level domains is incomplete - at least museum and aero are missing...

Bob Hansen · Post by **Bob Hansen** » Wed Oct 29, 2003 5:39 pm

Hey MudGuard, I told you so! Thanks for the challenge.

Here is an updated version:

^.*[: :]([-a-zA-Z0-9\._]+@[-a-zA-Z0-9]+(\.[-a-zA-Z0-9]+)*\.
(com|edu|org|info|gov|int|mil|biz|net|aero|coop|museum|name|[a-zA-Z][a-zA-Z] )(\.[a-zA-Z]{2})*).*$

Added (\.[a-zA-Z]{2})* inside the first tagged expression to include two letter id for country

Explanation of Regex terms:
(................................=start of tagged expression
\. ............................=period
[a-zA-Z]{2}...............= group of 2 successive letters
)................................=end of tagged expression
*...............................= any number of previous expression
=============================================
I have also added the following domain IDs
aero|coop|museum|name|[a-zA-Z][a-zA-Z]

Modified test lines:
This is at the end of a sentence: emailname@Domain.Com
This email address: myname@AOL.org is in the middle of the line.
This name has periods in the name: your.name@myplace.com.de from Germany.
And this address: His-Name@thatplace.edu.se uses a hyphen from Sweden.
And you may also see underscores: your_test@mine.ABC.gov
Don't forget anothername@domain.us as distinguished from somename@domain.com.us
============================
Previous warning is still valid! I am sure that other combinations can still be found.

But please don't take this as a challenge by me. I am doing this as a self-tutorial. And my instructor is intolerable!
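
Since the list of extensions keeps growing, another option is to stop listing them. The sketch below is my own assumption rather than the expression from this post: it accepts any final label of two or more letters, which also covers the two-letter country codes, and runs it in Python against a few of the test lines above.

```python
import re

# Assumption: any run of two or more letters is accepted as the final label,
# instead of listing com, edu, aero, museum, ... one by one.
EMAIL = re.compile(
    r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9]+(?:\.[-a-zA-Z0-9]+)*\.[a-zA-Z]{2,}"
)

test_lines = [
    "This is at the end of a sentence: emailname@Domain.Com",
    "This email address: myname@AOL.org is in the middle of the line.",
    "And you may also see underscores: your_test@mine.ABC.gov",
    "Don't forget anothername@domain.us as distinguished from somename@domain.com.us",
]
results = [EMAIL.findall(line) for line in test_lines]
for line, addresses in zip(test_lines, results):
    print(addresses)
```

The trade-off is that this accepts made-up extensions too, so it is looser than the explicit list, not stricter.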

talleyrand · Post by **talleyrand** » Wed Oct 29, 2003 6:25 pm

You forgot pro. Sheesh, for the money we pay to use this service you think they'd produce better results.

This site seems to detail what's out there quite nicely.

Bob Hansen · Post by **Bob Hansen** » Wed Oct 29, 2003 9:18 pm

I think I also left out a necessary trailing space after the two-letter country id........will have to recheck again later.

And I think I also have to remove some periods and underscores from portions of the code. These are invalid characters. Grrr, did it too fast...... stay tuned for more modifications.

Feel free to add pro in there yourself. No extra charge.

I have provided full refunds to all who asked after their purchase. We aim to please, customer service is number one.
Strange thing though, no refund requests. And you can't beat the price!
