Python 2.7 Tutorial

On this page: re module, re.findall(), re.compile(), re.search(), re.search().group().

The re Module

So you have learned all about regular expressions and are ready to use them in Python. Let’s get to it! The re module is the part of Python’s standard library that handles all things regular expression. Like any other module, you start by importing it.

 

>>> import re

Finding All Matches in a String

Suppose you want to find all words starting with ‘wo’ in the very short text below. The tool for this is the re.findall() function. It takes two arguments: (1) the regular expression pattern, and (2) the target string to find matches in.

 

>>> wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
>>> re.findall(r'wo\w+', wood)
['wood', 'would', 'woodchuck', 'woodchuck', 'wood']

r'wo\w+' is written as a raw string, as indicated by the r'...' string prefix. That is because regular expressions, as you are aware by now, use the backslash “\” as their own special escape character, and without 'r' the backslash gets interpreted as *Python’s* special escape character. Basically, on Python’s string object level, that “\” in “\w” should be interpreted as a literal backslash character so that it can later be interpreted as a regular expression’s special escape character when the string is processed by the re module. If this all sounds too complicated, just remember to ALWAYS PREFIX YOUR REGULAR EXPRESSION WITH 'r'.
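To see the difference the r prefix makes, here is a minimal sketch using \b, which means "word boundary" to the re module but "backspace character" to an ordinary Python string. The variable names are made up for illustration, and the single-argument print() calls are written so that they run under both Python 2.7 and Python 3.

```python
import re

text = 'wood would'

# In an ordinary string, '\b' is ONE character (a backspace).
# In a raw string, r'\b' is TWO characters: a backslash and the letter b.
plain = '\b'
raw = r'\b'
print(len(plain))
print(len(raw))

# Raw string: the re module receives \b and reads it as a word boundary.
raw_matches = re.findall(r'\bwo\w+', text)

# No r prefix: Python converts \b to a backspace before re ever sees it,
# so the pattern now asks for a backspace character followed by 'wo'.
# (\\w is doubled here only so this line does not rely on an invalid escape.)
plain_matches = re.findall('\bwo\\w+', text)

print(raw_matches)
print(plain_matches)
```

The raw pattern finds both words, while the non-raw pattern finds nothing, because the text contains no backspace character.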

Back to re.findall(). It returns all matched string portions as a list. If there are no matches, it will simply return an empty list:

 

>>> re.findall(r'o+', wood)
['o', 'oo', 'o', 'oo', 'oo', 'o', 'oo']
>>> re.findall(r'e+', wood)
[]

By default, matching is case-sensitive. To ignore case, pass the optional third argument re.IGNORECASE.

 

>>> foo = 'This and that and those'
>>> re.findall(r'th\w+', foo)
['that', 'those']
>>> re.findall(r'th\w+', foo, re.IGNORECASE)
['This', 'that', 'those']
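There is more than one way to spell the case-insensitive option. The sketch below shows three equivalent forms; the variable names are illustrative only.

```python
import re

foo = 'This and that and those'

# Passing the flag by keyword makes the intent explicit.
by_keyword = re.findall(r'th\w+', foo, flags=re.IGNORECASE)

# re.I is the built-in short alias for re.IGNORECASE.
by_alias = re.findall(r'th\w+', foo, re.I)

# The inline flag (?i) at the start of the pattern does the same job
# without any extra argument.
inline = re.findall(r'(?i)th\w+', foo)

print(by_keyword)
print(by_alias)
print(inline)
```

All three calls return the same list as the re.IGNORECASE example above.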

Compiling a Regular Expression Object

If you have to match a regular expression against many different strings, it is a good idea to construct the regular expression as a Python object. That way, the pattern is compiled into its internal matching form once and reused. Since compiling a pattern is relatively expensive, this lightens the processing load. To do this, use the re.compile() function:

 

>>> myre = re.compile(r'\w+ou\w+')
>>> myre.findall(wood)
['would', 'could']
>>> myre.findall('Colorless green ideas sleep furiously')
['furiously']
>>> myre.findall('The thirty-three thieves thought that they thrilled the throne throughout Thursday.')
['thought', 'throughout']

Once the pattern is compiled, you call the re method directly on the regular expression object. In the example above, myre is the compiled regular expression object corresponding to r'\w+ou\w+', and you call .findall() on it as myre.findall(). In doing so, you now need to specify only one argument, which is the target string: myre.findall(wood).
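In a script, the usual shape is to compile once outside a loop and reuse the object inside it. The following is a minimal sketch; the names sentences, ou_word and th_word are made up for illustration. It also shows that options such as re.IGNORECASE are given to re.compile(), since the compiled object's .findall() does not take a flags argument.

```python
import re

# Sample data reusing sentences from the examples above.
sentences = [
    'How much wood would a woodchuck chuck if a woodchuck could chuck wood?',
    'Colorless green ideas sleep furiously',
]

# Compile once, before the loop.
ou_word = re.compile(r'\w+ou\w+')

# Reuse the same compiled object for every string.
results = [ou_word.findall(s) for s in sentences]
print(results)

# Flags are supplied at compile time.
th_word = re.compile(r'th\w+', re.IGNORECASE)
th_results = th_word.findall('This and that and those')
print(th_results)
```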

Testing if a Match Exists

Sometimes, we are only interested in confirming whether or not there is a match within the given string. For that, re.findall() is overkill, because it scans the entire string to produce *every* matching substring. This is fine when you are dealing with a few short strings like we are here, but in the real world your strings might be much longer, or you might be doing the matching thousands or even millions of times, so the difference adds up.

In this context, re.search() is a good alternative. This function only finds the first match and then quits. If a match is found, it returns a “match object”. But if not, it returns None, which the interactive shell displays as… nothing. Below, r'e+' is successfully matched in the ‘Colorless…’ string, so a match object is returned. Funnily enough, there is not a single ‘e’ in our wood, so the same search returns None.

 

>>> re.search(r'e+', 'Colorless green ideas sleep furiously')
<_sre.SRE_Match object at 0x02D9CB48>
>>> re.search(r'e+', wood)
>>>
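When all you need is a yes/no answer, the test can be wrapped in a small helper. This is a minimal sketch; has_match is a hypothetical name, not something provided by the re module.

```python
import re

wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'


def has_match(pattern, text):
    """Return True if pattern matches anywhere in text, else False."""
    # re.search() returns a match object on success and None on failure.
    return re.search(pattern, text) is not None


found_in_ideas = has_match(r'e+', 'Colorless green ideas sleep furiously')
found_in_wood = has_match(r'e+', wood)

print(found_in_ideas)
print(found_in_wood)
```

The first call reports a match and the second does not, mirroring the two re.search() calls above.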

If you want to see the actual matched portion, you can use the .group() method defined on the match object. There’s a problem though: it works fine when there is a match and therefore a match object has been returned, but when there is no match, there is no returned object, so…

 

>>> re.search(r'e+', 'Colorless green ideas sleep furiously').group()
'e'
>>> re.search(r'e+', wood).group()

That second call raises an AttributeError: the search returned None, and None has no .group() method. The safe approach is to use re.search() in the context of an if statement. Below, the if ... line checks whether .search() returned a match object, and only then do you print out the matched portion and the matching line. (NOTE: if someobj succeeds as long as someobj is not one of the “falsy” values, which include None, False, the integer 0, an empty string "", an empty list [], and an empty dictionary {}.)

 

>>> f = open('D:\\Lab\\ling1330\\bible-kjv.txt')
>>> blines = f.readlines()
>>> f.close()
>>> smite = re.compile(r'sm(i|o)te\w*')
>>> for b in blines:
...     matchobj = smite.search(b)
...     if matchobj:
...         print matchobj.group(), '-', b,
...
smite - again smite any more every thing living, as I have done.
smote - were with him, and smote the Rephaims in Ashteroth Karnaim, and the
smite - hand of Esau: for I fear him, lest he will come and smite me, and the
smote - 36:35 And Husham died, and Hadad the son of Bedad, who smote Midian in
smitest - Wherefore smitest thou thy fellow? 2:14 And he said, Who made thee a
smite - 3:20 And I will stretch out my hand, and smite Egypt with all my
smite - behold, I will smite with the rod that is in mine hand upon the waters
smote - up the rod, and smote the waters that were in the river, in the sight
smite - 8:2 And if thou refuse to let them go, behold, I will smite all thy
smite - rod, and smite the dust of the land, that it may become lice
smotest - with thee of the elders of Israel; and thy rod, wherewith thou smotest
smite - and thou shalt smite the rock, and there shall come water out of it,
smiteth - 21:12 He that smiteth a man, so that he die, shall be surely put to
smiteth - 21:15 And he that smiteth his father, or his mother, shall be surely
smite - 21:18 And if men strive together, and one smite another with a stone,
...

We first open the text file and read its lines into a list with the .readlines() method. And then, because we will be doing the matching many times over, we do the smart thing of compiling our regular expression. Then, for-looping through the Bible lines, we create a match object through .search(), and print out the matched portion and the line only if a match object exists.
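To try the same pattern without the text file, here is a self-contained sketch that swaps in a short in-memory list. Two of the lines are taken from the output above; the middle one is made up to show a non-match. The names lines and hits are illustrative only, and the single-argument print() call is written so that it runs under both Python 2.7 and Python 3.

```python
import re

# Stand-in for the list that f.readlines() would return.
lines = [
    'rod, and smite the dust of the land, that it may become lice\n',
    'nothing to see on this line\n',   # made-up line with no match
    'Wherefore smitest thou thy fellow?\n',
]

smite = re.compile(r'sm(i|o)te\w*')

hits = []
for line in lines:
    matchobj = smite.search(line)
    if matchobj:  # None (falsy) when the line has no match
        # .group() with no argument returns the whole matched portion,
        # even though the pattern contains a parenthesized (i|o) group.
        hits.append((matchobj.group(), line.rstrip('\n')))

for word, line in hits:
    print(word + ' - ' + line)
```

Only the first and third lines are reported; the if check quietly skips the line where .search() returned None.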