I knew from the first glance that regular expressions were very, very beautiful. As of today, I feel that I have a pretty good grasp of them. Pay attention and one day you might be a regex god like me.
I’m sure you all have that buddy who told you that regex is easy and all you need to know is ‘<tag>(.+?)</tag>’. Trust me, that won’t get you very far. I’m gonna start with one of the most common examples: HTML links, aka <a> tags. I’m using PCRE regex (PHP), so you may need to adjust for the regex implementation you are using.
Scroll to the bottom of the post for the tips.
RegexNoob: Hey, I already know how to do that!
RegexNoob: That’s easy, /<a href="(.+?)">(.+?)<\/a>/. Look I grabbed both the url and the anchor!
RegexPro: Good luck with that.
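Here is a quick sketch of why RegexPro is not impressed. It is written with Python's `re` module rather than PHP (my choice for illustration; the constructs used here behave the same in PCRE), and the sample tags and the `naive_links` helper are made up.

```python
import re

# RegexNoob's pattern, minus the PHP delimiters.
NAIVE = re.compile(r'<a href="(.+?)">(.+?)</a>')

# Hypothetical sample tags, all of which a browser treats as links.
samples = [
    '<a href="/home">Home</a>',                  # the one shape the pattern expects
    '<a class="nav" href="/home">Home</a>',      # another attribute before href
    "<a href='/home'>Home</a>",                  # single quotes
    '<A HREF="/home">Home</A>',                  # upper case
    '<a href="/home" target="_blank">Home</a>',  # attribute after href
]

def naive_links(html):
    """Return (url, anchor) pairs found by the naive pattern."""
    return NAIVE.findall(html)

for tag in samples:
    print(tag, '->', naive_links(tag))
```

Only the first sample comes out clean. The next three are missed entirely, and the last one "matches" but the URL group swallows the target attribute as well.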
There are a few points to make. First, HTML tags have attributes (rel, id, class, href, target etc.) and their order inside the tag doesn’t matter. Second, attribute values can be wrapped in "", '', or nothing at all (an unquoted value just ends at the next whitespace). Also, the whole thing may be in mixed case ( <a HREF='#'></A> ) and it may contain extra whitespace in many places ( <a HREF = '#' ></A > ). Below is a better example of a link matching regex.
RegexPro: That’s right.
Ok, let’s break this down.
First thing is the ‘@…@is’. You need to enclose your regex in delimiters. Whatever you use as a delimiter needs to be escaped inside the expression. A lot of people habitually use ‘/…/’, which is a bad fit for HTML because you have to escape every closing tag like so: ‘/<p>(.+?)<\/p>/’. By using ‘@…@’, I don’t have to escape closing tags such as ‘</html>’.
Then there is ‘is’ after the closing delimiter. There are various modifiers you can use there. ‘i’ is the most common and makes your expression case insensitive. ‘s’ makes the dot inside ‘(.*?)’ match newlines as well, which might be present inside the anchor text.
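To make the two modifiers concrete, here is a small sketch. It uses Python's `re` module instead of PHP (an assumption for illustration), where the flags `re.I` and `re.S` play the role of ‘i’ and ‘s’; the input string is made up.

```python
import re

# Made-up input: upper-case tag, anchor text spanning two lines.
html = '<A HREF="#">first line\nsecond line</A>'

pattern = r'<a href="#">(.*?)</a>'

plain = re.search(pattern, html)              # no modifiers
only_i = re.search(pattern, html, re.I)       # roughly @...@i
both = re.search(pattern, html, re.I | re.S)  # roughly @...@is

print(plain)           # None: the case does not match
print(only_i)          # None: '.' refuses to cross the newline
print(both.group(1))   # the two-line anchor text
```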
Then there is the opening ‘<’ and after that ‘[\s]*’. That means zero or more whitespace characters (space, \n, \r, \t etc.). Strictly speaking, browsers don’t treat ‘< a’ as a tag, but whitespace is perfectly valid in other spots (around the ‘=’, before the closing ‘>’), and being lenient costs little when you are scraping. I’ve enclosed all the \s in square brackets for extra readability, but it’s not necessary; a bare ‘\s*’ means the same thing. As you will notice, ‘[\s]*’ is used in multiple places in the expression.
Next comes ‘a’, then ‘[\s]*’ again. Ah, there’s something new: ‘[^>]*?’. This one matches any run of characters that are not ‘>’. The ‘^’ as the first character inside ‘[]’ negates the class. We have added a ‘?’ after the ‘*’ to make it lazy, so it stops at the first ‘href’ it meets instead of grabbing everything up to the ‘>’ and backtracking. This will match any HTML attributes between ‘a’ and ‘href’.
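A hedged sketch of what the negated class buys you over a plain dot, again in Python's `re` (my choice for illustration) with made-up input. The simplified patterns below cover only the start of the tag.

```python
import re

# Skip whatever sits between 'a' and 'href', without leaving the tag.
negated = re.compile(r'<a\s[^>]*?href', re.I)
# Same idea with a dot, which is happy to run past the closing '>'.
dotted = re.compile(r'<a\s.*?href', re.I)

inside = '<a class="nav" id="x" href="/home">'  # href is an attribute
outside = '<a name="top">see href below</a>'    # 'href' only appears in the text

print(negated.search(inside).group(0))  # <a class="nav" id="x" href
print(negated.search(outside))          # None
print(dotted.search(outside).group(0))  # <a name="top">see href
```

The dot version finds an "href" that is not an attribute at all, which is exactly the kind of false match ‘[^>]’ guards against.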
The ‘[\"\'\s]*’ matches any mix of single quotes, double quotes and whitespace. The ‘*’ covers all the combinations we might encounter. Take note, as I’ve seen similar regexes that use ‘+’ instead of ‘*’; those fail when there’s nothing between the ‘=’ and the URI.
After the URL we have another set of ‘[\"\'\s]*’ and ‘[^>]*?’, which will match any HTML attributes that come after the ‘href’.
Next is the anchor text of the link, ‘(.*?)’. We use ‘*’ instead of ‘+’ because it would be valid HTML to have an empty anchor. We are almost finished, with just the closing tag left, which includes some more ‘[\s]*’ to account for whitespace. We didn’t have to escape the ‘/’ because we use ‘@’ as the delimiter.
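A small sketch of the ‘*’ versus ‘+’ difference, in Python's `re` (an assumption for illustration). The URL subpattern ‘([^\"\'\s>]+)’ is my own stand-in, and the sample tags are made up.

```python
import re

# Same pattern twice; only the quantifier after the quote class differs.
with_star = re.compile(r'href\s*=[\"\'\s]*([^\"\'\s>]+)', re.I)
with_plus = re.compile(r'href\s*=[\"\'\s]+([^\"\'\s>]+)', re.I)

samples = [
    '<a href="/a">',   # double quotes
    "<a href='/a'>",   # single quotes
    '<a href = /a>',   # whitespace, no quotes
    '<a href=/a>',     # nothing at all between '=' and the URL
]

for tag in samples:
    star = with_star.search(tag)
    plus = with_plus.search(tag)
    print(tag, '| * ->', star.group(1), '| + ->', plus.group(1) if plus else None)
```

The ‘*’ version pulls ‘/a’ out of all four tags, while the ‘+’ version comes back empty on the last one.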
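Putting the pieces from the breakdown together gives something like the sketch below. Treat it as a reconstruction under assumptions: it is written for Python's `re` (so no ‘@…@’ delimiters, and ‘is’ becomes `re.I | re.S`), the URL subpattern ‘([^\"\'\s>]+)’ is my own guess, and `find_links` plus the sample markup are hypothetical.

```python
import re

LINK = re.compile(
    r'<[\s]*a[\s]*'         # opening '<', tag name, optional whitespace
    r'[^>]*?'               # any attributes before href
    r'href[\s]*='           # the href attribute itself
    r'[\"\'\s]*'            # optional quotes or whitespace
    r'([^\"\'\s>]+)'        # group 1: the URL (assumed subpattern)
    r'[\"\'\s]*'            # optional closing quote or whitespace
    r'[^>]*?>'              # any attributes after href, then '>'
    r'(.*?)'                # group 2: anchor text, may be empty
    r'<[\s]*/[\s]*a[\s]*>', # closing tag with optional whitespace
    re.I | re.S,
)

def find_links(html):
    """Return a list of (url, anchor_text) tuples."""
    return LINK.findall(html)

html = '\n'.join([
    "<A class=\"nav\" HREF = '/one' >One</A >",
    '<a href=/two target=_blank>Two\nlines</a>',
    '<a href="/three"></a>',
])
print(find_links(html))
```

With this input the three links come back as ‘/one’, ‘/two’ and ‘/three’, with the anchor text of the second spanning two lines and the third being empty.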
If you wanted to match both ‘<br>’ and ‘<br/>’, this would take care of whitespace plus the extra ‘/’ for XHTML.
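One way such a pattern could look, as a hedged sketch in Python's `re` (my choice for illustration). The exact shape is my own guess, built from the same ‘[\s]*’ trick plus an optional ‘/’.

```python
import re

# Optional whitespace everywhere, plus an optional '/' for XHTML.
BR = re.compile(r'<[\s]*br[\s]*/?[\s]*>', re.I)

for tag in ['<br>', '<br/>', '<br />', '<BR>', '<b>']:
    print(repr(tag), bool(BR.fullmatch(tag)))

# Typical use: turn every flavour of line break into a newline.
text = BR.sub('\n', 'one<br>two<BR />three')
print(repr(text))  # 'one\ntwo\nthree'
```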
‘@<font[^>]*>Results <b>[0-9]*</b> – <b>[0-9]*</b> of about <b>([0-9,]+)</b> for@’
Matches the number of Google results for a query. Notice how I included a ‘,’ in the subpattern ‘([0-9,]+)’, since large counts are written with thousands separators.
Matches any imported stylesheets. Sometimes you need a subpattern that won’t capture anything (so the first captured group is still the ‘(.+?)’). That’s what ‘(?:expression)’ does, so ‘(?:url)?’ matches ‘url’ or the empty string. If you use something like ‘url?’ instead, it will match ‘ur’ or ‘url’. You can also put modifiers between the ‘?’ and the ‘:’, e.g. ‘(?i:expression)’, just like the ones you add after the closing delimiter.
If you have the string ‘123456789foo’, then an expression like ‘[0-9]+bar’ will match all 9 digits, then fail to match ‘bar’, then match 8 digits, fail again, match 7 digits and so on. Using ‘(?>[0-9]+)’ prevents this behaviour: after it matches the maximum number of digits and fails to match further, it does not backtrack and retry with fewer digits.
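A sketch of the pattern in use, written for Python's `re` (an assumption for illustration). The sample line is invented to fit the pattern and is not real Google output, and `result_count` is a hypothetical helper.

```python
import re

RESULTS = re.compile(
    r'<font[^>]*>Results <b>[0-9]*</b> – <b>[0-9]*</b> of about <b>([0-9,]+)</b> for'
)

# Invented sample shaped to fit the pattern.
sample = ('<font size=-1>Results <b>1</b> – <b>10</b> of about '
          '<b>1,230,000</b> for <b>regex</b></font>')

def result_count(html):
    """Return the result count as an int, or None if the pattern misses."""
    m = RESULTS.search(html)
    return int(m.group(1).replace(',', '')) if m else None

print(result_count(sample))  # 1230000

# Without the ',' in the class, the count is not matched at all.
digits_only = re.search(r'of about <b>([0-9]+)</b>', sample)
print(digits_only)  # None
```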
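Here is a hedged sketch of a non-capturing group at work, in Python's `re` (my choice for illustration). The import pattern is my own simplified guess, and the CSS snippet and helper names are made up.

```python
import re

# Made-up stylesheet snippet covering both @import spellings.
css = '@import url("base.css");\n@import "print.css";'

# (?:url\()? is optional and non-capturing, so the path stays in group 1.
IMPORT = re.compile(r'@import\s+(?:url\()?\s*["\']?([^"\')\s;]+)', re.I)

def imported_sheets(text):
    """Return the stylesheet paths named in @import rules."""
    return IMPORT.findall(text)

print(imported_sheets(css))  # ['base.css', 'print.css']

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r'url?', 'ur'))               # True: only the 'l' is optional
print(whole(r'(?:url)?', 'ur'))           # False: all of 'url' or nothing
print(whole(r'(?:url)?', ''))             # True
print(whole(r'(?i:url)\(x\)', 'URL(x)'))  # True: 'i' applies inside the group
print(whole(r'(?i:url)\(x\)', 'URL(X)'))  # False: 'x' is outside the group
```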
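A sketch of the "no giving back" behaviour. Python's `re` only gained ‘(?>…)’ in version 3.11, so to stay portable this uses a well-known emulation instead: a lookahead captures the digits and a backreference consumes them, and the lookahead is never retried with fewer digits. That swap is my own choice, not PCRE syntax, and the test strings are made up.

```python
import re

# Ordinary greedy digits: will hand digits back one at a time if needed.
greedy_bar = re.compile(r'[0-9]+bar')
# Emulated once-only group, behaving like (?>[0-9]+)bar in PCRE.
once_bar = re.compile(r'(?=([0-9]+))\1bar')

for text in ['123bar', '123456789foo']:
    print(text, bool(greedy_bar.search(text)), bool(once_bar.search(text)))

# The difference becomes visible when a match NEEDS a digit handed back.
greedy_nine = re.compile(r'[0-9]+[9]')
once_nine = re.compile(r'(?=([0-9]+))\1[9]')

print(bool(greedy_nine.search('1239')))  # True: '123' then '9'
print(bool(once_nine.search('1239')))    # False: the run keeps its final '9'
```

On ‘123456789foo’ both versions fail, as they should; the once-only one simply gets there without retrying shorter runs of digits from the same starting point.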
Tip 1. Use a delimiter that does not appear inside your expression, e.g. ‘@…@’.
Tip 2. HTML can be written in many different ways. Account for mixed case, whitespace, attributes, XHTML vs HTML (<br> and <br/>), double, single or no quotes, and empty tags or attributes (use ‘*’ instead of ‘+’).
Tip 3. Use negative matches (negated character classes), e.g. ‘[^"]+’ or ‘[^>]+’.
Tip 4. Non-capturing subpatterns, ‘(?:expr)’.
Tip 5. Once-only subpatterns, ‘(?>[0-9]+)bar’. These are also non-capturing.