Regular Expressions (REs) are patterns used to match specific combinations of characters within a string. Despite being a very basic concept in the field of Natural Language Processing, REs are widely used in a variety of applications. This tutorial is the first part of a series of articles and explains the basics of Regular Expressions, the various operators used to construct custom regular expressions, and simple examples of their implementation in Python. The source for the theory described in this article is the book “Speech and Language Processing 3rd Edition Draft by Daniel Jurafsky & James H. Martin”. I have also followed a convention similar to the one used in that book to represent REs. Some basic programming knowledge in Python is also assumed.
What are Regular Expressions?
Definition: A Regular Expression is an algebraic notation for characterising a set of strings.
Requirements: A Regular Expression search takes two inputs, namely:
- A Pattern (to search)
- A Corpus (of texts) to search through
Note: A corpus can be a single document or a collection of documents.
Implementation of Regular Expressions
The programming language used to illustrate the implementation of these examples is Python, which is one of the most popular languages at present. Python’s re module provides an optimised implementation to perform various operations on regular expressions such as pattern matching, pattern searching and pattern substitution.
Basic Regular Expressions
Consider a simple regular expression that consists of a sequence of characters. An example could be any word, such as apple. Let us observe the following sentences:
1. “An apple a day keeps the doctor away.”
2. “Apple is a fruit.”
These strings can be assigned to variables in Python.
# Declare two string variables
search_text = "An apple a day keeps the doctor away."
match_text = "Apple is a fruit."
A regular expression that can be used to match the word “apple” in the above text is given as: /apple/.
Note: The forward slashes are NOT a part of the expression but are just used as a convention to enclose the regular expression.
The above regular expression is implemented as follows:
# Import the re module
import re
# Initialise a variable with its value as the pattern to match
pattern = r'apple'
Note: The pattern variable is initialised as r’apple’ instead of ‘apple’. The ‘r’ before the single quote denotes a raw string, in which Python does not treat backslashes as escape characters.
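To see why the raw string matters, here is a minimal sketch. It uses \b, which the re module reads as a word boundary; the variable names are illustrative and are not part of the original example.

```python
import re

# In a normal string, Python turns '\b' into a single backspace character.
normal_pattern = '\bapple\b'
# In a raw string the backslash survives, so re sees \b (a word boundary).
raw_pattern = r'\bapple\b'

print(len(normal_pattern))  # 7
print(len(raw_pattern))     # 9

sample_text = "An apple a day keeps the doctor away."
normal_result = re.search(normal_pattern, sample_text)
raw_result = re.search(raw_pattern, sample_text)
print(normal_result)        # None, the text contains no backspace characters
print(raw_result.group())   # apple
```

For a plain word such as apple the two forms behave identically, but using the raw prefix for every pattern avoids surprises once backslashes appear.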
There are four basic functions in the re module that can be used to check the presence of specific strings within the given text, namely:
1. match() : The match() function returns a match if the pattern specified in the input is detected at the start of the string.
For example, if we execute the following lines of code:
pattern_capital = r'Apple'
print(re.match(pattern_capital, match_text))
# <re.Match object; span=(0, 5), match='Apple'>
The above code, upon execution returns a match at the start of the string.
Note: We initialised a new variable named ‘pattern_capital’ as r’Apple’ because REs are case sensitive.
2. search() : The search() function returns a match if the pattern specified in the input for the function is detected anywhere within the string.
print(re.search(pattern, search_text))
# <re.Match object; span=(3, 8), match='apple'>
Upon executing the above line of code, we obtain a result that displays a pattern match at index 3 of the given text.
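The difference between match() and search() can be sketched by running both on the same sentence. This is an illustrative example with assumed variable names; it also shows that a failed match returns None, so the result should be checked before it is used.

```python
import re

text = "An apple a day keeps the doctor away."

# match() only succeeds at position 0, so this returns None.
at_start = re.match(r'apple', text)
# search() scans the whole string and returns the first occurrence.
anywhere = re.search(r'apple', text)

print(at_start)          # None
print(anywhere.group())  # apple
print(anywhere.span())   # (3, 8)

# A failed match is None, so guard before calling methods on the result.
if at_start is None:
    print("no match at the start of the string")
```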
3. findall() : A major drawback of the match() and search() functions is that these return only a single match from the given text input. To detect multiple instances of the search pattern, we use the findall() function that returns every instance of the pattern detected within the given input.
# Initialise a new variable
multi_text = "An apple a day keeps the doctor away. Eat apples to stay healthy"
print(re.findall(pattern, multi_text))
# ['apple', 'apple']
Upon execution, we obtain a list of matches of the pattern to be searched within the text (2 matches in the above example).
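findall() returns only the matched strings. If the positions are needed as well, one option is the re module's finditer() function, which yields a match object per occurrence. A minimal sketch, with the variable name spans chosen for illustration:

```python
import re

multi_text = "An apple a day keeps the doctor away. Eat apples to stay healthy"

# finditer() yields one match object per occurrence, so each span is available.
spans = [m.span() for m in re.finditer(r'apple', multi_text)]
print(spans)  # [(3, 8), (42, 47)]
```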
4. sub() : The sub() or substitution function takes three input arguments which are:
i. A Pattern present within the text input that the user wishes to replace.
ii. A Pattern that the user wishes to replace the previous pattern with.
iii. The input text itself.
For example, consider the sentence “My email ID is email@example.com”. Suppose we wish to substitute the character ‘x’ with ‘d’ in this sentence. The pattern which we wish to replace will be /x/ and the pattern with which we wish to replace it will be /d/.
sub_text = "My email ID is email@example.com"
initial_pattern = r'x'
final_pattern = r'd'
print(re.sub(initial_pattern, final_pattern, sub_text))
# My email ID is email@edample.com
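By default sub() replaces every occurrence of the pattern. The sketch below, on a made-up sentence, contrasts that with the optional count argument of re.sub(), which limits the number of replacements.

```python
import re

sample_text = "xx marks the spot: x"

# Without count, every 'x' is replaced.
replace_all = re.sub(r'x', 'd', sample_text)
# With count=1, only the first 'x' is replaced.
replace_first = re.sub(r'x', 'd', sample_text, count=1)

print(replace_all)    # dd marks the spot: d
print(replace_first)  # dx marks the spot: x
```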
Operators & their uses in Regular Expressions
1. Brackets ([]) : The characters between the brackets specify a disjunction of characters to be matched. For example, /[aA]pple/ matches both ‘apple’ and ‘Apple’.
# Concatenate the match_text and search_text variables into a single new variable
disj_text = match_text + ' ' + search_text
# Declare pattern
disj_pattern = r'[aA]pple'
# Find all instances of the new pattern within the text
print(re.findall(disj_pattern, disj_text))
# ['Apple', 'apple']
We observe that upon executing the above lines of code the regular expression is now able to detect both the strings ‘Apple’ and ‘apple’.
Brackets can also be used with a hyphen (-) to specify any single character within a well defined range of characters. For example, /[2-5]/ matches any one of the digits 2, 3, 4 or 5.
dir_text = "There are 4 directions and 8 hemispheres."
dir_pattern = r'[2-5]'
# Can also use search() but findall() is better since it considers all numbers within the text.
print(re.findall(dir_pattern, dir_text))
# ['4']
Thus we get a match for the pattern at number 4 present in the input text whereas the number 8 is ignored.
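Ranges are not limited to digits. The following sketch, on a made-up sentence, shows letter ranges and how several ranges can be combined inside one pair of brackets.

```python
import re

room_text = "Room B12 opens at 9am"

# One uppercase letter at a time.
upper = re.findall(r'[A-Z]', room_text)
# One digit at a time.
digits = re.findall(r'[0-9]', room_text)
# Ranges can be placed side by side within the same brackets.
upper_or_digit = re.findall(r'[A-Z0-9]', room_text)

print(upper)           # ['R', 'B']
print(digits)          # ['1', '2', '9']
print(upper_or_digit)  # ['R', 'B', '1', '2', '9']
```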
2. Caret (^) : When it is the first character inside brackets, a caret specifies what a single character cannot be. For example, /[^a-z]/ matches any single character, including special ones, except the lowercase letters ‘a’ to ‘z’.
exclude_text = "A bird sits on a tree."
exclude_pattern = r'[^\sa-z]'
print(re.findall(exclude_pattern, exclude_text))
# ['A', '.']
Note: \s is an escape sequence that is used to represent whitespace. Since whitespace characters often occur as single characters within text, the findall() function treats them in the same way as any other single character. Hence, to prevent whitespace from being detected, we need to add \s to the negated set.
It is observed that the result includes the uppercase ‘A’ and the full stop, but ignores every lowercase letter, including ‘a’.
Another use case of the caret, when it is placed outside brackets, is to match a regular expression at the start of a line. For example, /^[tT]he/ matches ‘the’ and ‘The’ occurring at the start of a line.
start_text = "The lion is the king of the forest."
start_pattern = r'^[tT]he'
print(re.findall(start_pattern, start_text))
# ['The']
The above code results in a list that contains ‘The’ from the start of the text but ignores both later occurrences of ‘the’.
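The two roles of the caret can be put side by side in a short sketch. The sample strings are made up, and the re.MULTILINE flag is an addition not used in the example above: by default ^ anchors only at the start of the whole string, and the flag makes it match at the start of every line.

```python
import re

lines = "The lion roars.\nthe cub listens.\nA bird sings."

# Outside brackets, ^ anchors the match; by default only at the start of the string.
string_start = re.findall(r'^[tT]he', lines)
# With re.MULTILINE, ^ also matches immediately after each newline.
line_starts = re.findall(r'^[tT]he', lines, re.MULTILINE)

# Inside brackets, a leading ^ negates the set instead.
not_lower_or_space = re.findall(r'[^a-z\s]', "Hi there!")

print(string_start)        # ['The']
print(line_starts)         # ['The', 'the']
print(not_lower_or_space)  # ['H', '!']
```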
3. Question Mark (?) : The question mark makes the character that immediately precedes it optional, so the pattern matches strings with or without that character. For example, /apples?/ matches both ‘apple’ and ‘apples’.
last_char_text = "Remove a stick from the bundle of sticks."
last_char_pattern = r'sticks?'
print(re.findall(last_char_pattern, last_char_text))
# ['stick', 'sticks']
The above code results in the expression matching with both stick and sticks upon execution.
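The optional character does not have to be the last one in the pattern. A small sketch with made-up sentences:

```python
import re

spelling_text = "The color red and the colour blue"
# The 'u' is optional, so both spellings match.
spellings = re.findall(r'colou?r', spelling_text)

scheme_text = "Visit http or https pages"
# The trailing 's' is optional.
schemes = re.findall(r'https?', scheme_text)

print(spellings)  # ['color', 'colour']
print(schemes)    # ['http', 'https']
```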
4. Period (.) : A period is a wildcard expression that matches any single character (except a newline, by default). For example, /./ matches every character in a given sentence.
period_text = "A ball of yarn."
period_pattern = r'.'
print(re.findall(period_pattern, period_text))
# ['A', ' ', 'b', 'a', 'l', 'l', ' ', 'o', 'f', ' ', 'y', 'a', 'r', 'n', '.']
The result is a list of every character within the text.
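The period is more useful when combined with other characters. The sketch below uses made-up strings to show a wildcard in the middle of a word, and the default behaviour around newlines, which can be changed with the re.DOTALL flag.

```python
import re

wildcard_text = "begin, began, begun and beg3n"
# The period stands for exactly one character of any kind.
wildcard_matches = re.findall(r'beg.n', wildcard_text)

# By default the period does not match a newline character.
without_newline = re.findall(r'.', "a\nb")
# With re.DOTALL, the period matches newlines as well.
with_newline = re.findall(r'.', "a\nb", re.DOTALL)

print(wildcard_matches)  # ['begin', 'began', 'begun', 'beg3n']
print(without_newline)   # ['a', 'b']
print(with_newline)      # ['a', '\n', 'b']
```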
5. Asterisk (*) : An asterisk matches zero or more occurrences of the immediately preceding character or regular expression. For example, /a*/ matches a string of zero or more ‘a’s.
sentence = "An aardvark is a burrowing mammal"
asterisk_pattern = r'a*'
print(re.findall(asterisk_pattern, sentence))
# ['', '', '', 'aa', '', '', '', 'a', '', '', '', '', '', '', 'a', '', '', '', '', '', '', '', '', '', '', '', '', 'a', '', '', 'a', '', '']
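We observe that the result not only includes mentions of ‘a’ and ‘aa’ but also a large number of empty strings. This is because /a*/ also matches zero occurrences of ‘a’, which happens at every position in the text where there is no ‘a’.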
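If only the non-empty runs of ‘a’ are of interest, two simple options are to filter the list or to require at least one ‘a’ with /aa*/. A minimal sketch with illustrative variable names:

```python
import re

sentence = "An aardvark is a burrowing mammal"

# Option 1: keep only the non-empty matches produced by a*.
non_empty = [m for m in re.findall(r'a*', sentence) if m]
# Option 2: aa* means one 'a' followed by zero or more 'a's.
at_least_one = re.findall(r'aa*', sentence)

print(non_empty)     # ['aa', 'a', 'a', 'a', 'a']
print(at_least_one)  # ['aa', 'a', 'a', 'a', 'a']
```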
6. Plus (+) : A plus matches one or more occurrences of the immediately preceding character or expression. For example, /a+/ matches a string of one or more ‘a’s, and /[0-9]+/ matches a sequence of one or more digits.
plus_text = "The largest 3 digit number is 999."
plus_pattern = r'[0-9]+'
print(re.findall(plus_pattern, plus_text))
# ['3', '999']
The resulting list upon executing the above code includes both 3 & 999.
7. Dollar ($) : The dollar symbol is used to match the end of a line. For example, /friend$/ matches the string ‘friend’ at the end of a line.
# Do not include a full stop at the end of the text
dollar_text = "'A friend in need is a friend indeed', describes the qualities of a true friend"
dollar_pattern = r'friend$'
# Can also use findall() but search() gives location of pattern as a part of the result.
print(re.search(dollar_pattern, dollar_text))
# <re.Match object; span=(73, 79), match='friend'>
The above code upon execution detects the pattern ‘friend’ only at the end of the string and not the occurrences within the quoted proverb.
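As with the caret, the dollar symbol anchors only at the end of the whole string by default. The sketch below uses a made-up multi-line string and the re.MULTILINE flag (an addition to the example above) to match at the end of every line, and combines ^ and $ to test an entire string.

```python
import re

friend_lines = "a true friend\nfriend or foe\nmy friend"

# Default: $ matches only at the end of the whole string.
string_end = re.findall(r'friend$', friend_lines)
# With re.MULTILINE, $ also matches just before each newline.
line_ends = re.findall(r'friend$', friend_lines, re.MULTILINE)

# ^ and $ together require the entire string to match the pattern.
whole_ok = re.search(r'^friend$', "friend") is not None
whole_fail = re.search(r'^friend$', "my friend") is not None

print(string_end)  # ['friend']
print(line_ends)   # ['friend', 'friend']
print(whole_ok)    # True
print(whole_fail)  # False
```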
8. Disjunction (‘|’) : The pipe, or disjunction operator, matches either a given string or an alternative string within the given text input.
General Format : r‘<string1>|<string2>’ i.e. <string1> or <string2>
For example, /football|handball/ matches either ‘football’ or ‘handball’.
pipe_text = "He plays football and handball as well."
pipe_pattern = r'football|handball'
print(re.findall(pipe_pattern, pipe_text))
# ['football', 'handball']
The above code segment, upon execution results in a list containing both the strings ‘football’ and ‘handball’.
Another way to use the ‘|’ operator is to enclose the alternatives within parentheses. This limits the disjunction to the enclosed part of the pattern, which is then treated as a single unit. For example, /gupp(y|ies)/ applies the disjunction only to the suffixes ‘y’ and ‘ies’, so it matches both ‘guppy’ and ‘guppies’.
plural_text = "A butterfly is an insect. Many butterflies help in the process of pollination."
plural_pattern = r'butterfl(y|ies)'
print(re.findall(plural_pattern, plural_text))
# ['y', 'ies']
print(re.search(plural_pattern, plural_text))
# <re.Match object; span=(2, 11), match='butterfly'>
Note: When a pattern contains a pair of parentheses, findall() returns only the text matched inside them, whereas search() reports the full match.
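If the complete words are wanted from findall(), one option is a non-capturing group, written (?:...), which groups the alternatives without changing what findall() returns. Another is to read the full match from each match object. A minimal sketch:

```python
import re

plural_text = "A butterfly is an insect. Many butterflies help in the process of pollination."

# (?:...) groups the alternatives without capturing them,
# so findall() returns the whole match.
full_words = re.findall(r'butterfl(?:y|ies)', plural_text)

# Alternatively, keep the capturing group and ask each match object for group().
via_finditer = [m.group() for m in re.finditer(r'butterfl(y|ies)', plural_text)]

print(full_words)    # ['butterfly', 'butterflies']
print(via_finditer)  # ['butterfly', 'butterflies']
```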
We observe that each of these punctuation marks acts as an operator within a pattern. However, what if we need to match one of them literally? To do so, we place a backslash ‘\’ before the symbol that we need to detect within the sentence. For example, to search for a literal ‘.’, we use the expression ‘\.’.
escape_text = "The coffee costs $2.50."
escape_pattern = r'\.'
# Use search() here, since match() only checks the start of the string and would return None.
print(re.search(escape_pattern, escape_text))
# <re.Match object; span=(19, 20), match='.'>
print(re.findall(escape_pattern, escape_text))
# ['.', '.']
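When the literal text contains several special characters, escaping each one by hand is error-prone. The re module provides re.escape() for this purpose. The sketch below assumes we want to find the literal price in the sentence; the variable names are illustrative.

```python
import re

escape_text = "The coffee costs $2.50."
price = "$2.50"

# Unescaped, $ acts as an end-of-line anchor and . as a wildcard,
# so the literal price is not found.
unescaped_result = re.search(price, escape_text)

# re.escape() adds the backslashes needed to match the text literally.
escaped_result = re.search(re.escape(price), escape_text)

print(unescaped_result)        # None
print(escaped_result.group())  # $2.50
print(escaped_result.span())   # (17, 22)
```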
Operator Precedence Hierarchy
Regular Expressions always match the longest string they can, hence they are called greedy. Separately, one operator may take precedence over another, which sometimes requires us to use parentheses to specify what we mean. This idea is formalised by the Operator Precedence Hierarchy.
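The Operator Precedence Hierarchy in Regular Expressions, from highest to lowest precedence, is as follows:
1. Parentheses : ()
2. Counters : * + ? {}
3. Sequences and anchors : the ^my end$
4. Disjunction : |
Source: Speech and Language Processing by Daniel Jurafsky & James H. Martin 3rd Edition Draft, Chapter 2, Page 7.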
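Both ideas can be checked with a few small patterns. The sketch below uses re.fullmatch(), which succeeds only if the entire string matches, to show that counters bind more tightly than sequences and sequences more tightly than disjunction. It then shows greedy matching next to the non-greedy +? form. The test strings are made up for illustration.

```python
import re

# Counters bind tighter than sequences: in the*, the * applies only to 'e'.
counter_on_char = re.fullmatch(r'the*', 'theee') is not None
counter_not_word = re.fullmatch(r'the*', 'thethe') is not None
# Parentheses make the counter apply to the whole sequence.
counter_on_group = re.fullmatch(r'(the)*', 'thethe') is not None

# Sequences bind tighter than disjunction: the|any means 'the' or 'any'.
plain_disjunction = re.fullmatch(r'the|any', 'thany') is not None
grouped_disjunction = re.fullmatch(r'th(e|a)ny', 'thany') is not None

# Greedy matching takes the longest possible run of letters.
greedy = re.search(r'[a-z]+', 'once upon a time').group()
# Adding ? after the counter makes it non-greedy (shortest possible match).
non_greedy = re.search(r'[a-z]+?', 'once upon a time').group()

print(counter_on_char, counter_not_word, counter_on_group)  # True False True
print(plain_disjunction, grouped_disjunction)               # False True
print(greedy)      # once
print(non_greedy)  # o
```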
Regular Expressions are thus a simple but efficient tool to search for specific character sequences within a given input document or corpus. They can be designed to return only the first observed match or all the matches for the input pattern to be searched. Thus, Regular Expressions form a language of their own for specifying text search patterns. In the next part of this article, we’ll cover an example of constructing a regular expression built to match a specific pattern and optimising it to match more generalised use cases. We’ll also explore the types of errors encountered in the example and a variety of escape sequences used to represent specific groups of characters (similar to \s).