
Python - RegEx



A RegEx, or Regular Expression, is a sequence of characters that defines a search pattern. It is used to check whether a string contains the specified search pattern. Consider the following search pattern:

^P....n$

The above search pattern checks whether a string consists of exactly six characters, starting with P and ending with n.

Python has a built-in RegEx module called re, which needs to be imported to work with regular expressions.

Example:

In the example below, the ^P....n$ search pattern is tested against the string MyString, first with a value that matches and then with a value that does not.

import re

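# "Python" has exactly six characters, starting with P and ending with n
MyString = "Python"
x = re.search("^P....n$", MyString)
if x:
    print("Pattern found.")
else:
    print("Pattern not found.")

# "Python!." is longer than six characters, so the anchored pattern fails
MyString = "Python!."
x = re.search("^P....n$", MyString)
if x:
    print("Pattern found.")
else:
    print("Pattern not found.")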

The output of the above code will be:

Pattern found.
Pattern not found.
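
As a complementary sketch, re.fullmatch() can express the same "whole string" check without the ^ and $ anchors. The candidates list and the results dictionary are illustrative names, not part of the original example. The third candidate shows one subtle difference: $ also matches just before a trailing newline, while fullmatch() requires the pattern to cover the entire string.

```python
import re

# Illustrative inputs: a match, a non-match, and a string with a trailing newline
candidates = ["Python", "Python!.", "Python\n"]

results = {}
for text in candidates:
    # Anchored search: ^ at the start, $ at the end (or before a final newline)
    anchored = re.search(r"^P....n$", text) is not None
    # fullmatch: the pattern must consume the whole string
    full = re.fullmatch(r"P....n", text) is not None
    results[text] = (anchored, full)
    print(repr(text), "search:", anchored, "fullmatch:", full)
```

When trailing newlines must be rejected, fullmatch() is the stricter of the two checks.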

MetaCharacters

Metacharacters are characters that the RegEx engine interprets in a special way rather than literally. The metacharacters are:

Character   Description                                                   Example
[]          A set of characters                                           "[a-z]"
.           Any character except a newline                                "He..o"
^           Start of the string                                           "^Hello"
$           End of the string                                             "World$"
*           Zero or more occurrences of the preceding element             "Helx*"
+           One or more occurrences of the preceding element              "Helx+"
{}          The specified number of occurrences of the preceding element  "Hel{2}"
?           Zero or one occurrence of the preceding element               "He?l"
|           Either one alternative or the other                           "go|come"
()          Groups sub-patterns                                           "(x|y|z)abc"
\           Escapes special characters, including all metacharacters      "\$"
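
The following sketch runs several of the example patterns from the table above against short sample strings. The sample strings and the names checks and results are assumptions made for illustration. The patterns are written as raw strings (r"...") so that Python passes the backslash through to the RegEx engine unchanged.

```python
import re

# Each pair is (pattern, sample text); the sample texts are illustrative
checks = [
    (r"He..o", "Hello"),          # . matches any character except a newline
    (r"^Hello", "Hello World"),   # ^ anchors the match at the start
    (r"World$", "Hello World"),   # $ anchors the match at the end
    (r"Helx*", "Hel"),            # * allows zero occurrences of x
    (r"Helx+", "Hel"),            # + needs at least one x, so this fails
    (r"Hel{2}", "Hello"),         # {2} needs exactly two l characters
    (r"go|come", "Please come"),  # | accepts either alternative
    (r"\$", "Price: $5"),         # \ turns $ into a literal dollar sign
]

results = [bool(re.search(pattern, text)) for pattern, text in checks]
for (pattern, text), found in zip(checks, results):
    print(pattern, "in", repr(text), "->", found)
```

Only the Helx+ check reports False, because the sample string contains no x at all.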

Special Sequences

A special sequence is a backslash (\) followed by a character, and the pair has a special meaning. The special sequences are:

Character   Description                                             Example
\A          Matches only at the start of the string                 "\AThe"
\b          Matches at the beginning or end of a word               "\bain", "ain\b"
\B          Matches anywhere except the beginning or end of a word  "\Bain", "ain\B"
\d          Matches any digit (0-9)                                 "\d"
\D          Matches any character that is not a digit               "\D"
\s          Matches any whitespace character                        "\s"
\S          Matches any character that is not whitespace            "\S"
\w          Matches any word character (a-z, A-Z, 0-9, and _)       "\w"
\W          Matches any character that is not a word character      "\W"
\Z          Matches only at the end of the string                   "rain\Z"
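
A short sketch of a few special sequences in use follows. The sample sentence and the variable names are assumptions chosen for illustration. Note that the patterns are raw strings: in an ordinary Python string literal "\b" is a backspace character, so "\bain" would not reach the RegEx engine as a word-boundary pattern.

```python
import re

# Illustrative sample text
text = "The rain in Spain costs 25 euros"

digits = re.findall(r"\d", text)                   # each single digit
words = re.findall(r"\w+", text)                   # runs of word characters
starts_with_the = bool(re.search(r"\AThe", text))  # "The" at the very start
word_initial_ain = re.findall(r"\bain", text)      # "ain" at the start of a word
word_final_ain = re.findall(r"ain\b", text)        # "ain" at the end of a word

print(digits)
print(words)
print(starts_with_the)
print(word_initial_ain)
print(word_final_ain)
```

No word in the sample begins with "ain", so the \bain search returns an empty list, while ain\b finds the endings of "rain" and "Spain".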

Sets

A set is a collection of characters inside a pair of square brackets [] with a special meaning:

Set         Description
[abc]       Matches any one of the specified characters (a, b, or c).
[a-d]       Matches any lowercase character alphabetically between a and d.
[^abc]      Matches any character EXCEPT a, b, and c.
[123]       Matches any one of the specified digits (1, 2, or 3).
[0-9]       Matches any digit between 0 and 9.
[1-8][0-9]  Matches any two-digit number from 10 to 89.
[a-zA-Z]    Matches any letter between a and z, lowercase or uppercase.
[+]         Inside a set, +, *, ., |, (), $, and {} have no special meaning, so [+] matches any + character in the string.
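
The sets above can be tried out with a small sketch like the one below. The sample string and the variable names are illustrative assumptions.

```python
import re

# Illustrative sample text
text = "Room 42, Floor 7, code a+b"

abc_chars = re.findall(r"[abc]", text)          # any of a, b, or c
two_digit = re.findall(r"[1-8][0-9]", text)     # two-digit numbers from 10 to 89
plus_signs = re.findall(r"[+]", text)           # a literal + inside a set
not_letters = re.findall(r"[^a-zA-Z ,]", text)  # anything except letters, spaces, commas

print(abc_chars)
print(two_digit)
print(plus_signs)
print(not_letters)
```

The single digit 7 is not matched by [1-8][0-9], because that pattern needs two consecutive digits.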

The findall() Function

The findall() function returns a list containing all non-overlapping matches of the pattern in the string. The list contains the matches in the order they are found. If no matches are found, an empty list is returned.

Example:

In the example below, the findall() function is used to find all matches of comma (,) and ampersand (&) in the given string.

import re

MyString = "31 January, 28 February, 31 March"

#find all matches of comma (,)
x = re.findall(",", MyString)
print(x)

#find all matches of ampersand (&)
y = re.findall("&", MyString)
print(y)

The output of the above code will be:

[',', ',']
[]
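
findall() is not limited to literal characters. As a minimal sketch on the same string, the patterns below (chosen here for illustration) collect the numbers, and then day and month pairs. When a pattern contains more than one group, findall() returns a list of tuples, one tuple per match.

```python
import re

MyString = "31 January, 28 February, 31 March"

# one or more digits in a row
numbers = re.findall(r"\d+", MyString)
print(numbers)

# two groups: the day number and the month name
pairs = re.findall(r"(\d+) (\w+)", MyString)
print(pairs)
```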

The search() Function

The search() function is used to search the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match is returned. In case of no match, None is returned.

Example:

In the example below, the search() function is used to find the first match of a comma (,) and of an ampersand (&) in the given string. The ampersand does not occur in the string, so the second call returns None.

import re

MyString = "31 January, 28 February, 31 March"

#find first match of comma (,)
x = re.search(",", MyString)
print("First comma starting point:", x.start())

#find first match of ampersand (&)
x = re.search("&", MyString)
print("First ampersand starting point:", x)

The output of the above code will be:

First comma starting point: 10
First ampersand starting point: None    
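
Besides start(), a Match object offers methods such as group() and span(). The sketch below uses a pattern chosen for illustration and checks the result before using it, since calling a method on None raises an AttributeError.

```python
import re

MyString = "31 January, 28 February, 31 March"

match = re.search(r"(\d+) (\w+)", MyString)
if match:  # search() returns None when nothing matches
    whole = match.group()     # the full matched text
    day = match.group(1)      # first group: the digits
    month = match.group(2)    # second group: the word
    position = match.span()   # (start, end) indexes of the match
    print(whole, day, month, position)

# No ampersand in the string, so there is no Match object to inspect
missing = re.search("&", MyString)
print(missing is None)
```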

The split() Function

The split() function returns a list where the string has been split at each match. The number of splits can be limited with the maxsplit parameter.

Example:

In the example below, the split() function splits the string at each comma, first without a limit and then with maxsplit set to 1.

import re

MyString = "31 January, 28 February, 31 March"

#create list containing elements split using comma (,)
x = re.split(",", MyString)
print("The List contains: ", x)

#create list containing elements split using comma (,)
#maximum number of splits is specified as 1
y = re.split(",", MyString, maxsplit=1)
print("The List contains: ", y)

The output of the above code will be:

The List contains:  ['31 January', ' 28 February', ' 31 March']
The List contains:  ['31 January', ' 28 February, 31 March']   
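
Because the separator is a pattern, split() can handle several delimiters at once and absorb the surrounding whitespace. The sketch below assumes a variant of the string that mixes commas and semicolons; the pattern and variable names are illustrative.

```python
import re

# Illustrative variant of the string with mixed separators
MyString = "31 January, 28 February; 31 March"

# split on a comma or semicolon, dropping any whitespace around it
parts = re.split(r"\s*[,;]\s*", MyString)
print(parts)

# same pattern, but stop after the first split
first_only = re.split(r"\s*[,;]\s*", MyString, maxsplit=1)
print(first_only)
```

Unlike the plain comma split, the resulting items carry no leading spaces.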

The sub() Function

The sub() function replaces the matches with the specified text. The number of replacements can be limited with the count parameter.

Example:

In the example below, the sub() function is used to replace each comma (,) with an asterisk (*), first everywhere and then only for the first occurrence.

import re

MyString = "31 January, 28 February, 31 March"

#replacing comma (,) with asterisk (*)
x = re.sub(",", "*", MyString)
print("The String contains: ", x)

#replacing comma (,) with asterisk (*)
#maximum number of replacements is specified as 1
y = re.sub(",", "*", MyString, count=1)
print("The String contains: ", y)

The output of the above code will be:

The String contains:  31 January* 28 February* 31 March
The String contains:  31 January* 28 February, 31 March 
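
The replacement passed to sub() can also refer to groups in the pattern, or be a function that receives each Match object. The following sketch uses patterns and names chosen for illustration: \2 and \1 in the replacement stand for the second and first group, and the lambda returns the upper-cased match.

```python
import re

MyString = "31 January, 28 February, 31 March"

# swap day and month using backreferences to the two groups
swapped = re.sub(r"(\d+) (\w+)", r"\2 \1", MyString)
print(swapped)

# upper-case every run of letters using a replacement function
shouted = re.sub(r"[A-Za-z]+", lambda m: m.group().upper(), MyString)
print(shouted)
```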
