Python RegEx (Regular Expressions)

A regular expression (RegEx) is a sequence of characters that defines a pattern to be matched in a given text.

In this tutorial, we’ll learn why we use regular expressions in Python and how to use them.

As we just saw, a regular expression describes a pattern. With it we can match, find or replace text in a string.

You may be wondering why regular expressions are needed when ordinary string methods can already do this.

Why Regular Expressions?

Let’s compare the two approaches using an example.

x = "I like Mango"
print(x.replace("Mango", "Cherry"))


y = "That tree across the street"
print(y.replace("tree", "bike"))

Output :

I like Cherry
That bike across the sbiket

In the first example there was no problem, because the word we wanted to replace appeared only once. In the second string y, however, the characters tree appear twice: once as the word tree itself and once inside street. Since replace() substitutes every occurrence of the substring, street became sbiket.

You could work around this by combining slicing with replace(), but that quickly becomes tedious when a word has to be replaced in many places. Regular expressions make this much simpler.
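As a minimal sketch of the regular-expression approach, the same replacement can be done with re.sub() and the \b word-boundary sequence, so that only the standalone word tree is changed. The variable name fixed is only illustrative.

```python
import re

y = "That tree across the street"

# \b marks a word boundary, so "tree" inside "street" is left alone
fixed = re.sub(r"\btree\b", "bike", y)
print(fixed)  # That bike across the street
```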

How to use Regular Expressions?

To work with RegEx in Python, we need to import the re module.

import re

The re module provides several functions, each with a different use.

findall() – Returns a list containing all matches.
search() – Returns a Match object for the first match, or None if there is no match.
split() – Returns a list of the strings obtained by splitting at each match.
sub() – Returns the string after replacing the matches.
match() – Returns a Match object if the pattern matches at the beginning of the string, and None otherwise.

These are some of the functions in the re module. Before proceeding further, let’s understand some basics of Python RegEx.

Metacharacters

Metacharacters are characters with a special meaning in RegEx.

^ – Caret

This symbol is used to check if a string starts with the given characters.

import re

pattern = "^Apple"
string = "Apple is Red"

# search() scans the whole string, so it is the ^ that anchors the match to the start
result = re.search(pattern, string)

if result:
  print("Match")
else:
  print("Not a match")

$ – Dollar

This will check if a string ends with the given pattern.

import re

pattern = "Red$"
string = "Apple is Red"

result = re.findall(pattern, string)

if result:
  print("Match")
else:
  print("Not a match")
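To see both anchors succeed and fail on the same text, here is a small sketch. The variable names are only illustrative, and search() is used so that the anchors alone decide where the pattern may match.

```python
import re

text = "Apple is Red"

starts_with_apple = re.search(r"^Apple", text) is not None
starts_with_red = re.search(r"^Red", text) is not None
ends_with_red = re.search(r"Red$", text) is not None
ends_with_apple = re.search(r"Apple$", text) is not None

print(starts_with_apple, starts_with_red)  # True False
print(ends_with_red, ends_with_apple)      # True False
```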

. – Period

It matches any single character except a newline ("\n").

import re

pattern = "R.d"
string = "Apple is Red"

result = re.findall(pattern, string)

if result:
  print("Match")
else:
  print("Not a match")
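The following sketch shows the newline exception in practice. The sample text is made up for illustration: it contains R and d separated by an ordinary letter, by a newline, and by nothing at all.

```python
import re

# "R\nd" has a newline between R and d; "Rd" has nothing between them
text = "Red Rod R\nd Rd"

matches = re.findall(r"R.d", text)
print(matches)  # ['Red', 'Rod']
```

Only the first two candidates match, because the period needs exactly one character and that character cannot be a newline.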

[] – Square Brackets

Square brackets match any one of the characters listed inside them. They match a single character, not a complete word or string.

import re

pattern1 = "[a-p]"
string1 = "Apple is Red"
result1 = re.findall(pattern1, string1)
print("Result 1: ", result1)

pattern2 = "[1-4]"
string2 = "99445678345123"
result2 = re.findall(pattern2, string2)
print("Result 2: ", result2)

Output :

Result 1:  ['p', 'p', 'l', 'e', 'i', 'e', 'd']
Result 2:  ['4', '4', '3', '4', '1', '2', '3']

In the above example we used ranges of letters and digits, but you can also list individual characters, as in [1234] or [abcd].

To match any character except the ones listed in the square brackets, put the ^ symbol at the start of the set.

[^a-e] – Matches any character except a, b, c, d and e.

[^2] – Matches any character except 2.
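A short sketch of the two negated sets described above, run on made-up sample strings:

```python
import re

# Every character that is NOT in the range a-e
not_a_to_e = re.findall(r"[^a-e]", "badge")
print(not_a_to_e)  # ['g']

# Every character that is NOT the digit 2
not_two = re.findall(r"[^2]", "1223")
print(not_two)  # ['1', '3']
```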

* – Star

This matches zero or more occurrences of the preceding character.

import re

pattern = "ed*"
string = "Apple is Red"

result = re.findall(pattern, string)
print(result)
if result:
  print("Match")
else:
  print("Not a match")

+ – Plus

This matches one or more occurrences of the preceding character.

import re

pattern = "ed+"
string = "Apple is Red"

result = re.findall(pattern, string)
print(result)
if result:
  print("Match")
else:
  print("Not a match")

? – Question Mark

This matches zero or one occurrence of the preceding character.

import re

pattern = "e?d"
string = "Apple is Red"

result = re.findall(pattern, string)
print(result)

Output :

['ed']
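To compare the three quantifiers side by side, this sketch applies *, + and ? to the same made-up text. Only the quantifier after b changes between the patterns.

```python
import re

text = "a ab abb"

star = re.findall(r"ab*", text)      # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", text)      # 'a' followed by one or more 'b'
optional = re.findall(r"ab?", text)  # 'a' followed by zero or one 'b'

print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['a', 'ab', 'ab']
```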

{} – Braces

The syntax is {x,y}, which means at least x and at most y occurrences of the preceding character. You can also write {x}, which means exactly x occurrences.

import re

# Check if it contains 'r' followed by exactly two 'e'
pattern1 = "re{2}"
string = "That tree across the street"

result1 = re.findall(pattern1, string)
print(result1)

pattern2 = "re{3}"
result2 = re.findall(pattern2, string)
print(result2)

Output :

['ree', 'ree']
[]
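The example above uses only the {x} form. As a sketch of the {x,y} range form, the pattern below looks for two or three consecutive o characters in a made-up string.

```python
import re

text = "o oo ooo oooo"

# At least 2 and at most 3 consecutive 'o' characters
runs = re.findall(r"o{2,3}", text)
print(runs)  # ['oo', 'ooo', 'ooo']
```

The single o is too short to match, and from oooo only the first three characters are taken, which leaves one o that cannot match on its own.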

| – Vertical Bar

The vertical bar means either-or: the pattern matches if the expression on either side of the bar matches.

import re

# Check if it contains either 'across' or 'under'
pattern = "across|under"
string = "That tree across the street"

result = re.findall(pattern, string)
print(result)

Output :

['across']

() – Group

Parentheses are used to group a sub-pattern.

For example, (1|2|3)ab matches 1, 2 or 3 followed by ab.
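Here is a small sketch of the (1|2|3)ab pattern on a made-up string. It also shows a detail that is easy to miss: when a pattern contains one group, findall() returns what the group captured rather than the whole match.

```python
import re

text = "1ab 4ab 3ab"

# group(0) is the whole match, group(1) is what the parentheses captured
first = re.search(r"(1|2|3)ab", text)
print(first.group(0))  # 1ab
print(first.group(1))  # 1

# With one capturing group, findall() returns only the captured parts
captured = re.findall(r"(1|2|3)ab", text)
print(captured)  # ['1', '3']
```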

\ – Backslash

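A backslash placed before a metacharacter makes it match literally, so use it when you want to search for the metacharacter itself.

import re

string = "That will be $100 dollars"

# A raw string (r"...") passes the backslash to the regex engine unchanged
result = re.findall(r"\$...", string)
print(result)

Output :

['$100']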
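The same idea applies to every metacharacter. As a sketch with a made-up string, compare an unescaped period, which matches any character, with an escaped one, which matches only a literal period.

```python
import re

text = "3.14 3x14"

unescaped = re.findall(r"3.14", text)  # . matches any character
escaped = re.findall(r"3\.14", text)   # \. matches only a literal period

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
```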

Special Sequences

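\A – Returns a match if the specified characters are at the beginning of the string. Example: r"\AApple"
\b – Returns a match if the specified characters are at the beginning or at the end of a word. Examples: r"\bApple", r"Red\b"
\B – Returns a match if the specified characters are present but NOT at the beginning (or at the end) of a word. Examples: r"\Btree", r"tree\B"
\d – Returns a match where the string contains a digit (0-9). Example: r"\d"
\D – Returns a match where the string contains a character that is not a digit. Example: r"\D"
\s – Returns a match where the string contains a whitespace character. Example: r"\s"
\S – Returns a match where the string contains a character that is not whitespace. Example: r"\S"
\w – Returns a match where the string contains a word character (a letter, a digit or an underscore). Example: r"\w"
\W – Returns a match where the string contains a character that is not a word character. Example: r"\W"
\Z – Returns a match if the specified characters are at the end of the string. Example: r"Red\Z"

The examples are written as raw strings (r"...") so that Python passes the backslash through to the regex engine. Without the r prefix, "\b" would be read as a backspace character.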
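The following sketch tries a few of these sequences on one made-up sentence. The variable names are only illustrative.

```python
import re

text = "That tree across the street, no. 42"

whole_word = re.findall(r"\btree\b", text)  # "tree" as a separate word
inside_word = re.findall(r"\Btree", text)   # "tree" not at the start of a word
digits = re.findall(r"\d", text)            # every single digit
starts = re.search(r"\AThat", text) is not None
ends = re.search(r"42\Z", text) is not None

print(whole_word)    # ['tree']
print(inside_word)   # ['tree']
print(digits)        # ['4', '2']
print(starts, ends)  # True True
```

Both of the first two results are ['tree'], but they come from different places: \btree\b finds the standalone word, while \Btree finds the one inside street.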

These were the basics of Python RegEx. Let’s move on to the functions used to work with regular expressions.

findall() Function

This function returns a list of all non-overlapping matches of the given pattern, in the order they are found. If there is no match, it returns an empty list.

import re

# Return a list containing every occurrence of a two-digit number

string = "I am 32 and i live at 19, Kingsway Road"
x = re.findall(r"\d\d", string)
print(x)

Output :

['32', '19']

search() Function

This function scans the entire string and returns a Match object for the first occurrence of the pattern. If no match is found, it returns None.

import re

string = "I am 32 and I live at 19, Kingsway Road"
x = re.search(r"\d\d", string)

if x:
  print("Match found at position", x.start())
else:
  print("Match not Found")

Output :

Match found at position 5

split() Function

This function splits the string at every match of the given pattern and returns the resulting list of strings.

import re

# Split the string at every white-space character

string = "I am a Python Enthusiast"
result = re.split(r"\s", string)
print(result)

Output :

['I', 'am', 'a', 'Python', 'Enthusiast']
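re.split() also accepts an optional maxsplit argument, which is not covered above. As a brief sketch, limiting the number of splits leaves the rest of the string in the last element.

```python
import re

string = "I am a Python Enthusiast"

# Split at white space at most twice; the remainder stays in one piece
limited = re.split(r"\s", string, maxsplit=2)
print(limited)  # ['I', 'am', 'a Python Enthusiast']
```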

sub() Function

This function replaces every match of the given pattern with a replacement string and returns the new string.

import re

# Return the string after replacing every two-digit number with "xx"

string = "I am 32 and i live at 19, Kingsway Road"
x = re.sub(r"\d\d", "xx", string)
print(x)

Output :

I am xx and i live at xx, Kingsway Road
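re.sub() takes an optional count argument as well, which is not covered above. A minimal sketch, replacing only the first match:

```python
import re

string = "I am 32 and i live at 19, Kingsway Road"

# count=1 stops after the first replacement
first_only = re.sub(r"\d\d", "xx", string, count=1)
print(first_only)  # I am xx and i live at 19, Kingsway Road
```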

match() Function

This function is used to test if the given string matches the given pattern. It returns a match object on success and None otherwise.

This is similar to the search() function, but match() matches the pattern only at the beginning of the string, whereas search() scans the whole string. Let’s understand using a few examples.

import re

string = "Python is more popular than Java"
x1 = re.match("Python", string)
print(x1)

x2 = re.match("Java", string)
print(x2)

Output :

<re.Match object; span=(0, 6), match='Python'>
None
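To make the difference from search() concrete, this sketch runs both functions with the same pattern on the same string. The variable names are only illustrative.

```python
import re

string = "Python is more popular than Java"

at_start = re.match(r"Java", string)   # None: "Java" is not at the beginning
anywhere = re.search(r"Java", string)  # found further along the string

print(at_start)         # None
print(anywhere.span())  # (28, 32)
```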

Match Object

We’ve seen that the match() and search() functions return a Match object. This object has a few methods and attributes that we can use to inspect the result.

.group() – Returns the part of the string where there is a match.
.start() – Returns the index at which the match starts.
.end() – Returns the index just after the last matched character.
.span() – Returns a tuple containing the start and end index values.
.string – Returns the string passed to the function.

import re

string = "I am 32 and i live at 19, Kingsway Road"
x = re.search(r"\d\d", string)
print(x)

print(x.group())
print(x.start())
print(x.end())
print(x.span())
print(x.string)

Output :

<re.Match object; span=(5, 7), match='32'>
32
5
7
(5, 7)
I am 32 and i live at 19, Kingsway Road
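Because search() and match() return None when nothing matches, calling .group() on the result without checking it first raises an AttributeError. One possible way to guard against that is sketched below; first_number is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    found = re.search(r"\d+", text)
    # Check for None before calling .group()
    return found.group() if found else None

with_digits = first_number("I am 32 and i live at 19, Kingsway Road")
without_digits = first_number("No digits here")

print(with_digits)     # 32
print(without_digits)  # None
```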