rants from the dark side of marketing

Regex tips and tricks

I knew at first glance that regular expressions were very, very beautiful. As of today, I feel that I have a pretty good grasp of them. Pay attention and one day you might be a regex god like me.

I’m sure you all have that buddy who told you that regex is easy and all you need to know is ‘<tag>(.+?)</tag>’. Trust me, that won’t get you very far. I’m gonna start with one of the most common examples, HTML links, aka <a> tags. I’m using PCRE regex (PHP), so you may need to adjust for the regex implementation you are using.

Scroll to the bottom of the post for the tips.

Examples

RegexNoob: Hey, I already know how to do that!
RegexPro: Really?
RegexNoob: That’s easy, /<a href="(.+?)">(.+?)<\/a>/. Look I grabbed both the url and the anchor!
RegexPro: Good luck with that.

There are a few points to make. First, HTML tags have attributes (rel, id, class, href, target etc.) and their order inside the tag doesn’t matter. Second, attribute values can be wrapped in double quotes, single quotes or nothing at all, with optional whitespace around them. Also the whole thing may be in mixed case ( <a HREF='#'></A> ) and sloppy markup may contain whitespace just about everywhere ( < a HREF = '#' >< / A > ). Below is a better example of a link matching regex.

@<[\s]*a[\s]*[^>]*?href[\s]*=[\"\'\s]*([^\"\'\s>]*)[\"\'\s]*[^>]*?>(.*?)<[\s]*/[\s]*a[\s]*>@is

RegexNoob: WTF???
RegexPro: That’s right.

Ok, let’s break this down.

Example 1

First thing is the ‘@…@is’. You need to enclose your regex in delimiters. Whatever you use as a delimiter needs to be escaped inside the expression. A lot of people habitually use ‘/…/’, which is bad because you have to escape all the closing HTML tags like so, ‘/<p>(.+?)<\/p>/’. By using ‘@…@’, I don’t have to escape the ‘/’ in closing tags such as ‘</html>’.

Then there is ‘is’ after the closing delimiter. Various modifiers can go there; ‘i’ is the most common and it makes your expression case insensitive. The ‘s’ makes the dot inside ‘(.*?)’ match newlines, which might be present inside the anchor.
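Here is a quick sketch of what the two modifiers change. It is written in Python rather than PHP (my choice for illustration only): Python's re module has no delimiters, so ‘i’ and ‘s’ become the re.I and re.S flags. The sample markup is made up.

```python
import re

# Made-up sample: upper-case tag names and a newline inside the anchor text.
html = '<A HREF="/home">Home\npage</A>'

pattern = r'<a href="(.+?)">(.*?)</a>'

no_flags = re.search(pattern, html)            # case-sensitive, dot stops at \n
only_i = re.search(pattern, html, re.I)        # case handled, dot still stops at \n
both = re.search(pattern, html, re.I | re.S)   # counterpart of the 'is' modifiers

print(no_flags)        # None
print(only_i)          # None
print(both.groups())   # ('/home', 'Home\npage')
```

Only the combination of both flags gets through, which is why the link pattern carries both modifiers.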

Then there is the opening ‘<’ and after that ‘[\s]*’. That means zero or more whitespace characters (\n, \r, \t etc.). As I said, whitespace can turn up almost anywhere in markup found in the wild, so the pattern tolerates it. I’ve enclosed all the \s in square brackets for extra readability but it’s not necessary. As you will notice, ‘[\s]*’ is used in multiple places in the expression string.

Next comes ‘a’, then ‘[\s]*’ again. Ah, there’s something new, ‘[^>]*?’. This one matches any run of characters that are not ‘>’. The ‘^’ char at the start of ‘[]’ negates the character class. We have added a ‘?’ after the ‘*’ to make it lazy, so it stops at the first ‘href’ instead of running on to the last one inside the tag. This will match any HTML attributes between ‘a’ and ‘href’.
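A small sketch of both ideas, again in Python's re as a stand-in for PCRE (the sample tags are invented). The first pair compares a dot with the negated class, the second pair compares the greedy and lazy forms of the same class.

```python
import re

# Made-up sample with two links on one line.
html = '<a class="x" href="/one">1</a> <a id="y" href="/two">2</a>'

# The dot happily crosses '>' and runs to the last href on the line.
greedy_dot = re.search(r'<a.*href', html).group(0)

# The negated class cannot cross '>', so the match stays inside the first tag.
negated = re.search(r'<a[^>]*?href', html).group(0)

# Inside a single tag, lazy stops at the first 'href', greedy at the last.
tag = '<a href="/a" title="href trick">'
lazy = re.match(r'<a[^>]*?href', tag).group(0)
greedy = re.match(r'<a[^>]*href', tag).group(0)

print(greedy_dot)  # runs into the second tag
print(negated)     # <a class="x" href
print(lazy)        # <a href
print(greedy)      # <a href="/a" title="href
```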

The ‘[\"\'\s]*’ matches any mix of single quotes, double quotes and whitespace. The ‘*’ covers all the combinations we might encounter, including none at all. Take note, as I’ve seen similar regexes that use ‘+’ instead of ‘*’. Those fail when there’s nothing between the ‘=’ and the URI, as in an unquoted ‘href=page.html’.
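To make the ‘*’ versus ‘+’ point concrete, here is a rough Python sketch. The two patterns are cut-down fragments I wrote for this comparison, not the full link regex, and the tags are made up.

```python
import re

# Identical apart from the quantifier on the quote/whitespace class.
star = re.compile(r'href\s*=[\"\'\s]*([^\"\'\s>]+)', re.I)
plus = re.compile(r'href\s*=[\"\'\s]+([^\"\'\s>]+)', re.I)

# Double-quoted, single-quoted with spaces, and unquoted values.
samples = ['<a href="/a">', "<a href = '/b' >", '<a href=/c>']

def first_group(pattern, text):
    """Return the captured URL, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None

star_results = [first_group(star, s) for s in samples]
plus_results = [first_group(plus, s) for s in samples]

print(star_results)  # ['/a', '/b', '/c']
print(plus_results)  # ['/a', '/b', None]
```

The ‘+’ version silently drops the unquoted link, which is exactly the kind of miss you only notice after the scraper has run.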

After the URL we have another set of ‘[\"\'\s]*’ and ‘[^>]*?’, which will match the closing quote and any HTML attributes that come after the ‘href’.

Next is the anchor of the link, ‘(.*?)’. We use ‘*’ instead of ‘+’ because it would be valid HTML to have an empty anchor. We are almost finished, with just the closing tag, which includes some more ‘[\s]*’ to account for whitespace. We didn’t have to escape ‘/’ because we use ‘@’ as the delimiter.
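Putting the pieces together, here is a sketch of the whole link pattern ported to Python's re (an assumption on my part, the post itself targets PHP). The delimiters are dropped and ‘is’ becomes re.I | re.S. The URL subpattern here is a negated class: a lazy ‘(.*?)’ in that spot would be followed only by optional pieces and could legally capture nothing, which the last lines demonstrate. All sample tags are made up.

```python
import re

# Port of the '@...@is' link pattern: no delimiters in Python, and the
# 'is' modifiers become re.I | re.S.
LINK = re.compile(
    r'<[\s]*a[\s]*[^>]*?href[\s]*=[\"\'\s]*'
    r'([^\"\'\s>]*)'          # URL: everything up to a quote, space or '>'
    r'[\"\'\s]*[^>]*?>'       # closing quote plus any later attributes
    r'(.*?)'                  # anchor text, may be empty or span lines
    r'<[\s]*/[\s]*a[\s]*>',
    re.I | re.S,
)

# RegexNoob's version, for comparison.
NAIVE = re.compile(r'<a href="(.+?)">(.+?)</a>')

# Made-up samples covering the cases from the breakdown.
samples = [
    '<a href="/plain">Plain</a>',
    "<A CLASS='nav' HREF='/mixed' TARGET=_blank>Mixed\ncase</A>",
    '< a href = /loose >Loose< / a >',
    '<a href="/empty"></a>',
]

robust = [LINK.search(s).groups() for s in samples]
naive = [bool(NAIVE.search(s)) for s in samples]

# Same idea, but with a lazy '(.*?)' in the URL slot.
LAZY_URL = re.compile(
    r'<a[^>]*?href\s*=[\"\'\s]*(.*?)[\"\'\s]*[^>]*?>', re.I | re.S
)
lazy_url = LAZY_URL.search(samples[0]).group(1)

print(robust)
print(naive)           # [True, False, False, False]
print(repr(lazy_url))  # ''
```

The naive pattern only survives the first sample, while the longer one pulls the URL and anchor out of all four.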

Example 2

‘@<[\s]*br[\s/]*>@’
If you want to match both ‘<br>’ and ‘<br/>’, this takes care of whitespace plus the extra ‘/’ for XHTML. Add the ‘i’ modifier if you also need to catch ‘<BR>’.
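A short Python sketch of that pattern against a few invented variants. Note that, as written, it has no ‘i’ modifier, so upper-case tags only match once the flag is added.

```python
import re

BR = re.compile(r'<[\s]*br[\s/]*>')
BR_ANY_CASE = re.compile(r'<[\s]*br[\s/]*>', re.I)

# Made-up variants; '<break>' is there as a negative case.
samples = ['<br>', '<br/>', '<br />', '< br / >', '<BR>', '<break>']

matches = [bool(BR.fullmatch(s)) for s in samples]
matches_i = [bool(BR_ANY_CASE.fullmatch(s)) for s in samples]

print(matches)    # [True, True, True, True, False, False]
print(matches_i)  # [True, True, True, True, True, False]
```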

Example 3

‘@<font[^>]*>Results <b>[0-9]*</b> – <b>[0-9]*</b> of about <b>([0-9,]+)</b> for@’
Matches the number of Google results for a query. Notice how I included a ‘,’ in the subpattern ‘([0-9,]+)’, so that counts with thousands separators such as ‘1,230,000’ are captured whole.
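For illustration, here is how that capture might be used in Python. The HTML snippet is invented to mirror the pattern, not copied from a real results page, and stripping the commas before converting to a number is my own addition.

```python
import re

RESULTS = re.compile(
    r'<font[^>]*>Results <b>[0-9]*</b> – <b>[0-9]*</b> '
    r'of about <b>([0-9,]+)</b> for'
)

# Hypothetical markup shaped like the pattern expects.
sample = ('<font size=-1>Results <b>1</b> – <b>10</b> '
          'of about <b>1,230,000</b> for <b>regex</b></font>')

m = RESULTS.search(sample)
raw_count = m.group(1)                    # still contains the commas
count = int(raw_count.replace(',', ''))   # plain integer for later use

print(raw_count)  # 1,230,000
print(count)      # 1230000
```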

Example 4

‘/@import (?:url)?[("]+(.+?)[)"]+[^;]*;/i’
Matches any imported stylesheets. Sometimes you need a subpattern that won’t capture anything (so matches[1] is still the ‘(.+?)’). That’s what ‘(?:expression)’ does, so ‘(?:url)?’ matches ‘url’ or an empty string. If you use something like ‘url?’, it will match ‘ur’ or ‘url’. You can add options between ‘?’ and ‘:’, the same ones you can add after the ending delimiter, e.g. ‘(?i:url)’.
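A Python sketch of the same pattern (the ‘/…/i’ delimiters and modifier become the re.I flag; the CSS lines are made up). It also shows what happens to the group numbering if the ‘url’ group is allowed to capture, and the ‘url?’ pitfall.

```python
import re

IMPORT = re.compile(r'@import (?:url)?[("]+(.+?)[)"]+[^;]*;', re.I)

# Made-up stylesheet lines in three common spellings.
css = '''@import url("base.css");
@IMPORT "print.css" print;
@import url(theme.css) screen;'''

stylesheets = IMPORT.findall(css)

# With a capturing group instead, the stylesheet moves to group 2.
capturing = re.search(r'@import (url)?[("]+(.+?)[)"]+', css).groups()

# 'url?' makes only the 'l' optional; '(?:url)?' makes the whole word optional.
ur_matches = re.fullmatch(r'url?', 'ur') is not None
group_matches_ur = re.fullmatch(r'(?:url)?', 'ur') is not None
group_matches_empty = re.fullmatch(r'(?:url)?', '') is not None

print(stylesheets)  # ['base.css', 'print.css', 'theme.css']
print(capturing)    # ('url', 'base.css')
print(ur_matches, group_matches_ur, group_matches_empty)  # True False True
```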

Example 5

‘@(?>[0-9]+)bar@’
If you have the string ‘123456789foo’, a plain ‘[0-9]+bar’ will match all 9 digits, then fail to match ‘bar’, then match 8 digits, fail again, match 7 digits and so on. By using ‘(?>[0-9]+)’ this behaviour is prevented: after it matches the max number of digits and fails to match further, it does not backtrack and retry with fewer digits.
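The no-backtracking behaviour can be made visible with a pattern where giving a digit back would change the result. Python's re only accepts ‘(?>…)’ from version 3.11, so this sketch swaps in a common emulation instead of the PCRE syntax: a lookahead captures the digit run and a backreference consumes it, and the engine never re-enters the lookahead. The group name ‘digits’ and the sample string are mine.

```python
import re

text = '123456789bar'

# Ordinary greedy digits: the engine hands the final '9' back so that the
# literal '9bar' can match.
plain = re.search(r'[0-9]+9bar', text)

# Once-only behaviour, emulated: the lookahead grabs the longest digit run,
# the backreference consumes exactly that run, and nothing is handed back.
once_only = re.search(r'(?=(?P<digits>[0-9]+))(?P=digits)9bar', text)

# When no digit has to be given back, both forms agree.
once_only_bar = re.search(r'(?=(?P<digits>[0-9]+))(?P=digits)bar', text)

print(plain.group(0))          # 123456789bar
print(once_only)               # None
print(once_only_bar.group(0))  # 123456789bar
```

So a once-only subpattern is not just a speed-up: it can turn a match into a non-match when the rest of the pattern needs characters the subpattern already swallowed.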

TIPS

Tip 1. Use a delimiter that is not used inside your expression, ‘@…@’.
Tip 2. HTML can be written in many different ways. Account for mixed case, whitespace, attributes, XHTML vs HTML (<br> and <br/>), double, single or no quotes, and empty tags or attributes (use * instead of +).
Tip 3. Use negative matches, ‘[^"]+’ ‘[^>]+’.
Tip 4. Non-capturing subpatterns, ‘(?:expr)’.
Tip 5. Once-only subpatterns, ‘(?>[0-9]+)bar’. These are also non-capturing.

Posted on Monday, May 14th, 2007 at 10:08 am under Rants.

4 Comments

Alan Says:

On my top 10 of essential softwares for Black-hatters : RegexBuddy
You can create and test easily you regular expressions!
Saved me lots of times when writing scrapers scripts!!!

Andreas Says:

Really helpful for editing scripts, I saved half of my time using RegexBuddy – thanx!

Plepco Says:

Nice. I’ve been reading a lot of regex tutorials stuff lately, but this is one of the better posts. Some really good advice here – thanks. This will help me.

SEO Techniques Says:

Sometimes I absolutely hate working with Regex, you know? But you got some really neat stuff, so thanks for sharing.

And we’ll see, maybe I’ll become a “Regex God” just like you :P

 
Copyright 2008, blackhat-seo.com