Introduction to regular expressions

2023-02-24 3047 words 15 minutes

Introduction

Regular expressions (or regex) are sequences of characters that describe a search pattern in a string of characters. They are used in many programming languages to search for, extract, and manipulate textual data. Regular expressions allow for precise descriptions of complex search patterns, such as email addresses or phone numbers, and can be much faster and more efficient than manually traversing each line of text to extract specific information.

In this article, we will explore the various features of regular expressions, focusing on practical examples in Python, Bash, and JavaScript. We will start with the basics of regular expressions, and then progress to more advanced techniques such as lookaround, quantifiers, and alternation. Finally, we will examine how to use regular expressions for substitutions and modify search behavior using flags.

Whether you are a beginner or experienced developer, this article should provide you with a solid foundation for understanding and using regular expressions in your projects.

Here’s an example of a simple regular expression:

abc+

In this example, the regular expression abc+ matches the string “ab” followed by one or more “c” characters. This means that the regular expression will match “abc”, “abcc”, “abccc”, and so on.
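As a quick check, the pattern can be tried with Python's standard re module. This is only a minimal sketch: the list of candidate strings is illustrative and not part of the original example.

```python
import re

# Candidate strings to test against the pattern abc+ (illustrative values).
candidates = ["ab", "abc", "abcc", "abccc", "xabcx"]

pattern = re.compile(r"abc+")

# fullmatch() succeeds only if the whole string matches the pattern.
full = [s for s in candidates if pattern.fullmatch(s)]
# search() succeeds if the pattern matches anywhere in the string.
partial = [s for s in candidates if pattern.search(s)]

print(full)     # ['abc', 'abcc', 'abccc']
print(partial)  # ['abc', 'abcc', 'abccc', 'xabcx']
```

Note that "ab" is rejected in both cases, because + requires at least one "c".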

Character Classes

Character classes are sets of characters that can be used to describe specific search patterns. A character class is written between brackets ([]) and matches a single character from the specified set.

Here are some examples of character classes:

[abc] : matches any character that is either ‘a’, ‘b’, or ‘c’
[a-z] : matches any character that is a lowercase letter
[A-Z] : matches any character that is an uppercase letter
[0-9] : matches any character that is a digit
. : matches any single character except a newline (this metacharacter is written without brackets)

Character classes can also be negated, meaning they match any character that is not in the specified set. Negated character classes are specified by using the ^ symbol inside the brackets.

Here’s an example of a negative character class:

[^0-9]: matches any character that is not a digit

It is also possible to combine character classes by using parentheses to create groups. The groups can be combined with quantifiers to search for more complex text patterns.

([a-z]|[A-Z]): matches any character that is a letter, whether it is uppercase or lowercase. The single class [a-zA-Z] is equivalent.

In Python, character classes can be used with the re.search() function:

import re

text = "The quick brown fox jumps over the lazy dog."
match = re.search("[aeiou]", text)

if match:
    print("The first vowel found is:", match.group())
else:
    print("No vowels found.")
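Ranges and negated classes can be tried the same way. The following sketch assumes an illustrative input string (not from the original article) and uses only the standard re module.

```python
import re

# Illustrative input string (an assumption, not from the article).
text = "Order #A-1042 shipped on 2023-02-24."

# [0-9]+ matches runs of one or more digits.
numbers = re.findall(r"[0-9]+", text)
# [^0-9] is the negated class: deleting every non-digit keeps only the digits.
digits_only = re.sub(r"[^0-9]", "", text)
# [a-zA-Z] matches a single letter, like ([a-z]|[A-Z]).
first_letter = re.search(r"[a-zA-Z]", text).group()

print(numbers)       # ['1042', '2023', '02', '24']
print(digits_only)   # 104220230224
print(first_letter)  # O
```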

In JavaScript, character classes can be used with the test() method of the RegExp object:

const text = "The quick brown fox jumps over the lazy dog.";
const match = /[aeiou]/.test(text);

if (match) {
    console.log("A vowel was found.");
} else {
    console.log("No vowel found.");
}

In Bash, character classes can be used with the grep tool:

text="The quick brown fox jumps over the lazy dog."
match=$(echo "$text" | grep "[aeiou]")

if [ -n "$match" ]; then
    echo "A vowel has been found."
else
    echo "No vowel found."
fi

Anchors

Anchors are metacharacters that match a position in the text rather than a character, which lets you pin a search pattern to a precise location. The two most commonly used anchors are ^ and $.

The anchor ^ corresponds to the beginning of a string, while the anchor $ corresponds to the end of a string.

They can also correspond to the beginning and end of a line if the multiline flag m is enabled. We will revisit the concept of flags later in the article.

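There are also two other anchors, \b and \B. \b matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string next to a word character). \B matches any position that is not a word boundary.

Here are some examples:

\bcode : matches “code” only where it begins a word, as in “code” or “codec”, but not in “decode”
tuto\B : matches “tuto” only when it is immediately followed by another word character, as in “tutorial”, but not in the standalone word “tuto”

Here are some examples with ^ and $:

^Hello: matches any string that starts with “Hello”
Goodbye!$: matches any string that ends with “Goodbye!”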

In Python, anchors can be used with the re.search() function:

import re

text = "Hello everyone.\nHow are you?\nGoodbye!"
match = re.search("^Hello", text)
if match:
    print("The text starts with 'Hello'")

match = re.search("Goodbye!$", text)
if match:
    print("The text ends with 'Goodbye!'")
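The word-boundary anchors \b and \B can be explored in the same way. The sketch below uses an illustrative word list (an assumption, not from the article) and reports the offsets at which "code" is found under each condition.

```python
import re

# Illustrative text: "code" appears at offsets 0, 8, 14 and 23.
text = "code, decode, codec, encoder"

# \bcode: "code" only where a word starts ("code", "codec").
at_word_start = [m.start() for m in re.finditer(r"\bcode", text)]
# code\b: "code" only where a word ends ("code", "decode").
at_word_end = [m.start() for m in re.finditer(r"code\b", text)]
# \Bcode\B: "code" strictly inside a longer word ("encoder").
inside_word = [m.start() for m in re.finditer(r"\Bcode\B", text)]

print(at_word_start)  # [0, 14]
print(at_word_end)    # [0, 8]
print(inside_word)    # [23]
```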

In JavaScript, anchors can be used with the test() method of the RegExp object:

const text = "Hello everyone.\nHow are you?\nGoodbye!";
const regex1 = /^Hello/;
const regex2 = /Goodbye!$/;

if (regex1.test(text)) {
    console.log("The text starts with 'Hello'");
}

if (regex2.test(text)) {
    console.log("The text ends with 'Goodbye!'");
}

In Bash, anchors can be used with the grep tool:

text="Hello everyone.
How are you?
Goodbye!"

if echo "$text" | grep -q "^Hello"; then
    echo "The text starts with 'Hello'"
fi

if echo "$text" | grep -q "Goodbye!$"; then
    echo "The text ends with 'Goodbye!'"
fi

Note that grep processes its input line by line, so ^ and $ always apply to each line and there is no multiline flag m.

Escaped Characters

Some characters have a special meaning in regular expressions and cannot be used directly. To match them literally, they must be escaped by prefixing them with a backslash (\). The backslash also introduces shorthand sequences such as \d or \w. Here are some commonly used escape sequences:

\.: matches a literal period
\\: matches a literal backslash
\d: matches any digit
\D: matches any character that is not a digit
\w: matches any alphanumeric character, including underscore
\W: matches any character that is not alphanumeric or underscore
\s: matches any whitespace character (space, tab, newline, etc.)
\S: matches any character that is not a whitespace character
\b: matches a word boundary (the beginning or end of a word)
\B: matches any position that is not a word boundary

In Python, escaped characters can be used with the re.search() function:

import re

text = "The price is 5€."
match = re.search(r"\d", text)

if match:
    print("The price is : ", match.group(), "euros.")
else:
    print("No price found.")
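The difference between an escaped and an unescaped metacharacter is easy to see with the dot. This is a small sketch on an illustrative string (an assumption, not from the article); re.escape() is the standard-library helper that escapes a literal string for you.

```python
import re

# Illustrative input (an assumption, not from the article).
text = "version 3.14 or 3x14"

# An unescaped dot matches any character, so "3x14" also matches.
unescaped = re.findall(r"3.14", text)
# An escaped dot matches only a literal period.
escaped = re.findall(r"3\.14", text)
# re.escape() escapes the special characters of a literal string.
built = re.escape("3.14")

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
print(built)      # 3\.14
```

Raw strings (r"...") keep the backslashes intact, which avoids Python interpreting them before the regex engine sees them.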

In JavaScript, escaped characters can be used with the exec() method of the RegExp object:

const text = "The price is 5€.";
const re = /\d/;
const match = re.exec(text);

console.log(match);  // the match array, or null if nothing matched
if (match) {
    console.log("The price is : " + match[0] + " euros.");
} else {
    console.log("No price found.");
}

In Bash, escaped characters can be used with the grep tool:

text="The price is 5€."
match=$(echo "$text" | grep -Po "\d")

if [ -n "$match" ]; then
    echo "The price is : $match euros."
else
    echo "No price found."
fi

-o: displays only the matching characters

-P: the pattern is a Perl-compatible regular expression (a GNU grep option), which is what makes \d available.

Quantifiers & Alternation

Quantifiers are symbols that allow you to specify how many times an element should appear in a regular expression. The most commonly used quantifiers are:

*: matches zero or more occurrences of the preceding element.
+: matches one or more occurrences of the preceding element.
?: matches zero or one occurrence of the preceding element.
{n}: matches exactly n occurrences of the preceding element.
{n,}: matches at least n occurrences of the preceding element.
{n,m}: matches between n and m occurrences of the preceding element.
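The quantifiers listed above can be compared side by side in Python. This is a minimal sketch with illustrative sample strings (assumptions, not from the article); fullmatch() is used so that the whole string must satisfy the pattern.

```python
import re

# Illustrative samples (assumptions, not from the article).
samples = ["color", "colour", "colouur", "2023", "20", "123456"]

# u? makes the "u" optional: zero or one occurrence.
optional_u = [s for s in samples if re.fullmatch(r"colou?r", s)]
# \d{4} requires exactly four digits.
exactly_four = [s for s in samples if re.fullmatch(r"\d{4}", s)]
# \d{2,4} accepts two, three, or four digits.
two_to_four = [s for s in samples if re.fullmatch(r"\d{2,4}", s)]
# \d{3,} accepts three digits or more.
three_or_more = [s for s in samples if re.fullmatch(r"\d{3,}", s)]

print(optional_u)     # ['color', 'colour']
print(exactly_four)   # ['2023']
print(two_to_four)    # ['2023', '20']
print(three_or_more)  # ['2023', '123456']
```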

Alternation is another important concept in regular expressions that allows specifying multiple alternatives for the same element. Alternation is specified using the | symbol, which means “or”.

Here’s an example of using quantifiers and alternation in Python:

import re

text = "The quick brown fox jumps over the lazy dog."
matches = re.findall("q.*?k|fox|dog", text)

print(matches)  # ['quick', 'fox', 'dog']

In this example, the regular expression "q.*?k|fox|dog" searches for all occurrences that match either "q.*?k", "fox", or "dog". The .*? matches as few characters as possible between the letters "q" and "k" (the ? after * makes the quantifier lazy), while the alternation | specifies the alternatives "fox" and "dog".
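The difference between a greedy quantifier (.*) and its lazy form (.*?) only shows when the closing character occurs more than once. The sketch below uses an illustrative HTML-like string, which is an assumption and not from the article.

```python
import re

# Illustrative input with several ">" characters (an assumption).
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", text)
# Lazy: .*? takes as little as possible, so each tag is matched separately.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```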

Here’s an example of using quantifiers and alternation in JavaScript:

const text = "The quick brown fox jumps over the lazy dog.";
const matches = text.match(/q.*?k|fox|dog/g);

console.log(matches);  // ['quick', 'fox', 'dog']

In this example, the regular expression /q.*?k|fox|dog/g searches for all occurrences that match either "q.*?k", "fox", or "dog". The .*? matches any character that appears between the letters "q" and "k", while the alternation | specifies the alternatives "fox" and "dog". The g flag allows searching for all occurrences in the text.

Here is an example of using quantifiers and alternation in Bash:

text="The quick brown fox jumps over the lazy dog."
matches=$(echo "$text" | grep -oE "q.*?k|fox|dog")

echo $matches  # quick fox dog

In this example, the regular expression "q.*?k|fox|dog" searches for all occurrences that match either "q.*?k", "fox", or "dog". Note that POSIX extended regular expressions do not define lazy quantifiers, so .*? should not be expected to match lazily with grep -E; the result is the same here only because the text contains a single "k". For true lazy matching, use grep -P instead.

The option -E means “use extended regular expressions” instead of using basic regular expressions. Extended regular expressions offer more advanced features than basic regular expressions, such as the ability to use quantifiers like +, ?, and {}, or alternation |.

The option -o means “print only the matched parts”. It displays only the parts of the input that match the regular expression, instead of printing each matching line in full.

Groups and References

Groups and references are advanced features of regular expressions that allow you to capture subparts of a match and reuse them in the same regular expression or in code.

A group is defined by enclosing a part of the regular expression in parentheses. Groups can be used to capture subparts of the string that match the regular expression. For example, the regular expression "(\d{2})-(\d{2})-(\d{4})" captures three groups corresponding to the date in the form "dd-mm-yyyy". Each group can then be referenced using the syntax \N, where N is the group number.
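Numbered references can be used in two places: inside the pattern itself, and in a replacement string. The following sketch illustrates both with Python's re module; the input sentences are illustrative assumptions, while the date pattern is the one given above.

```python
import re

# In-pattern backreference: \1 must repeat whatever group 1 captured,
# which finds accidentally doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test case")

# Backreferences in a replacement: reorder dd-mm-yyyy into yyyy-mm-dd.
date_pattern = r"(\d{2})-(\d{2})-(\d{4})"
iso = re.sub(date_pattern, r"\3-\2-\1", "Released on 24-02-2023.")

print(doubled)  # ['is', 'test']
print(iso)      # Released on 2023-02-24.
```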

Here is an example of a group used to capture email addresses in a text:

([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)

In this example, the group is used to capture the complete email address, consisting of three parts: the username, the domain name, and the extension. The + after each character class means that there is at least one character that matches that class.
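Applied in Python, the pattern above can pull every address out of a sentence. This is a sketch only: the sample text and addresses are illustrative assumptions, and the pattern is a simplification that does not cover every valid email address.

```python
import re

# The pattern from the text above, compiled once for reuse.
email_pattern = re.compile(r"([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)")

# Illustrative input (an assumption, not from the article).
text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# With a single group, findall() returns the text captured by that group.
emails = email_pattern.findall(text)

print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```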

Here are examples of regular expressions using groups and references in Python, JavaScript, and Bash:

import re

text = "John Doe, 35 years old"
matches = re.findall(r"(\w+) (\w+), (\d+) years old", text)

for match in matches:
    print(f"First name: {match[0]}, Last name: {match[1]}, Age: {match[2]}")

const text = "John Doe, 35 years old";
const regex = /(\w+) (\w+), (\d+) years old/;
const match = regex.exec(text);

console.log(`First name: ${match[1]}, Last name: ${match[2]}, Age: ${match[3]}`);

text='John Doe, 35 years old'
regex="([a-zA-Z]+) ([a-zA-Z]+), ([0-9]+) years old"

if [[ $text =~ $regex ]]; then
    echo "First name: ${BASH_REMATCH[1]}, Last name: ${BASH_REMATCH[2]}, Age: ${BASH_REMATCH[3]}"
fi

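Groups can also be given names, which makes the captured parts easier to identify. Here is an example with named groups in Python:

import re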

text = "On 01/01/2022, something happened. On 25-12-21, something else happened. On 30 03 2023, something will happen."

regex = r'(?P<day>\d{1,2})[/\- ](?P<month>\w{2})[/\- ](?P<year>\d{2,4})'

matches = re.findall(regex, text)

for match in matches:
    day, month, year = match
    print(f"The date is {day} {month} {year}")

Explanation of the regex (?P<day>\d{1,2})[/\- ](?P<month>\w{2})[/\- ](?P<year>\d{2,4}):

(?P<day>\d{1,2}): This uses the Python syntax (?P<name>pattern) to create a named group "day". \d{1,2} matches one or two digits.
[/\- ]: Matches one of the following characters: /, -, or space.
(?P<month>\w{2}): This uses the syntax (?P<name>pattern) to create a named group "month". \w{2} matches exactly two word characters (letters, digits, or underscore).
[/\- ]: Matches one of the following characters: /, -, or space.
(?P<year>\d{2,4}): This uses the syntax (?P<name>pattern) to create a named group "year". \d{2,4} matches two, three, or four digits.

In JavaScript, the same named groups are written (?<name>pattern), without the P.
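Named groups become more useful when they are read back by name rather than by position. The sketch below reuses the same pattern with re.finditer(), which yields match objects; the shortened input text is an illustrative assumption.

```python
import re

# Shortened, illustrative input (an assumption).
text = "On 01/01/2022, something happened. On 25-12-21, something else happened."
regex = re.compile(r"(?P<day>\d{1,2})[/\- ](?P<month>\w{2})[/\- ](?P<year>\d{2,4})")

# groupdict() returns the named groups of one match as a dictionary.
dates = [m.groupdict() for m in regex.finditer(text)]

# group("name") reads a single named group from a match object.
for m in regex.finditer(text):
    print(m.group("day"), m.group("month"), m.group("year"))

print(dates[0])  # {'day': '01', 'month': '01', 'year': '2022'}
```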

const text = "On 01/01/2022, something happened. On 25-12-21, something else happened. On 30 03 2023, something will happen.";

const regex = /(?<day>\d{1,2})[/\- ](?<month>\w{2})[/\- ](?<year>\d{2,4})/g;

let matches;
while ((matches = regex.exec(text)) !== null) {
  const { day, month, year } = matches.groups;
  console.log(`The date is ${day} ${month} ${year}`);
}

text="On 01/01/2022, something happened. On 25-12-21, something else happened. On 30 03 2023, something will happen."
regex='([0-9]{1,2})[-/ ]([0-9]{2})[-/ ]([0-9]{2,4})'

# grep -o prints each matched date on its own line; sed then extracts one group at a time.
# "#" is used as the sed delimiter because the regex itself contains "/".
echo "$text" | grep -oE "$regex" | while read -r line; do
  day=$(echo "$line" | sed -E 's#'"$regex"'#\1#')
  month=$(echo "$line" | sed -E 's#'"$regex"'#\2#')
  year=$(echo "$line" | sed -E 's#'"$regex"'#\3#')

  echo "The date is $day $month $year"
done

This code extracts dates from a given text using a regular expression, stored in the regex variable, that identifies the different possible date formats (day, month, and year separated by /, -, or a space).

The code then uses the grep command to find all matches of the regular expression in the text variable, and uses the sed command to extract the different parts of the date (day, month, year) from each match and store them in variables.

Finally, it displays the extracted dates in a message:

The date is 01 01 2022
The date is 25 12 21
The date is 30 03 2023

Lookaround

Lookarounds are assertions that check for the presence (or absence) of an expression to the left or right of the current position, without making it part of the overall match. They let you specify conditions under which a match is accepted or rejected.

There are two types of lookaround: lookahead and lookbehind. Lookaheads specify conditions to be checked after the current position, while lookbehinds specify conditions to be checked before the current position.

Positive lookaheads ((?=expression)) check that the expression is present after the current position, without including this expression in the overall match. Negative lookaheads ((?!expression)) check the opposite: they verify that the expression is not present after the current position.

Here is an example of using lookaheads in Python to find all words that are followed by the word “fox”:

import re

text = "The quick brown fox jumps over the lazy dog."
matches = re.findall(r'\b\w+(?= fox)', text)

print(matches)  # ['brown']

In this example, the regular expression \b\w+(?= fox) searches for all words (\b\w+) that are followed by the word “fox” ((?= fox)). The \b matches a word boundary. The lookahead checks that “ fox” follows the current position without including it in the overall match.
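A negative lookahead inverts the condition. As a sketch on the same sentence, the pattern below keeps every word except the one followed by “ fox”; the closing \b is an addition of this sketch, needed so that the engine cannot settle for a shorter piece of the rejected word.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# \b\w+\b matches a whole word; (?! fox) rejects it when " fox" comes next.
matches = re.findall(r"\b\w+\b(?! fox)", text)

print(matches)  # ['The', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```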

Positive lookbehinds ((?<=expression)) verify that the expression is located before the current position, without including that expression in the global match. Negative lookbehinds ((?<!expression)) verify the opposite: they verify that the expression is not located before the current position.

Here’s an example of using positive lookbehinds in JavaScript to find all numbers that are preceded by the “$” sign:

const text = "The price is $10.50, but the discount makes it $8.99.";
const regex = /(?<=\$)\d+(\.\d{2})?/g;

const matches = text.match(regex);
console.log(matches);  // ["10.50", "8.99"]

In this example, the regular expression /(?<=\$)\d+(\.\d{2})?/g searches for all numbers (\d+(\.\d{2})?) that are preceded by the “$” sign ((?<=\$)). The positive lookbehind (?<=\$) ensures that the “$” sign is present before the current position, but without including it in the match. The g flag is used to find all matches in the input string.
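The same idea can be sketched in Python, including a negative lookbehind. Two details differ from the JavaScript version and are worth noting: the optional decimal part uses a non-capturing group (?:...), because re.findall() would otherwise return only the captured group, and Python requires lookbehind patterns to have a fixed width. The second input string is an illustrative assumption.

```python
import re

prices = "The price is $10.50, but the discount makes it $8.99."
# Illustrative second input (an assumption, not from the article).
items = "Item 42 costs $10 and item 15 costs $9"

# Positive lookbehind: numbers immediately preceded by "$".
dollars = re.findall(r"(?<=\$)\d+(?:\.\d{2})?", prices)
# Negative lookbehind: numbers NOT preceded by "$". The \d inside the class
# stops the match from starting in the middle of a number such as "10".
plain = re.findall(r"(?<![$\d])\d+", items)

print(dollars)  # ['10.50', '8.99']
print(plain)    # ['42', '15']
```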

Finally, lookbehinds are not supported in all regular expression implementations. For example, lookbehinds are not supported in JavaScript regular expressions before ECMAScript 2018.

Flags

In addition to the symbols and concepts we have already discussed, regular expressions can be accompanied by “flags” or “options”. Flags are indicators that modify how regular expressions work. The three most commonly used flags are:

i: this flag makes the search case-insensitive. This means the search will match uppercase or lowercase letters interchangeably. For example, the regular expression /hello/i will match “Hello”, “hello”, and “hElLo”.
g: this flag stands for “global” and allows searching for all occurrences of a match in the text string. By default, a regular expression will only match the first occurrence of the match. For example, the regular expression /hello/g will match all occurrences of “hello” in the text.
m: this flag stands for “multiline” and makes the anchors ^ and $ match at the beginning and end of each line, rather than only at the beginning and end of the whole string.

Here is an example of using the m flag in Python:

import re

text = """The quick brown fox
jumps over
the lazy dog."""

matches = re.findall(r"^t\w+", text, flags=re.MULTILINE)

print(matches)  # ['the']

The re.MULTILINE (or re.M) flag is a compilation flag that changes the behavior of the anchors ^ and $. Without the re.MULTILINE flag, ^ only matches at the beginning of the string and $ only at the end of the string (or just before a final newline). With the re.MULTILINE flag, ^ also matches at the beginning of each line and $ at the end of each line.

In this example, the regular expression ^t\w+ searches for a “t” at the beginning of a line followed by one or more word characters (\w+). Thanks to the re.MULTILINE flag, it finds “the” at the start of the third line; the first line is not matched because it starts with an uppercase “T”.
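To see what each flag contributes, the same pattern can be run with and without them. This sketch reuses the text above; in Python, flags are combined with the | operator, and there is no g flag because re.findall() already returns every match.

```python
import re

text = """The quick brown fox
jumps over
the lazy dog."""

# Without re.MULTILINE, ^ matches only at the start of the whole string.
single_line = re.findall(r"^t\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multi_line = re.findall(r"^t\w+", text, flags=re.MULTILINE)
# Adding re.IGNORECASE lets the lowercase "t" match "The" as well.
both = re.findall(r"^t\w+", text, flags=re.MULTILINE | re.IGNORECASE)

print(single_line)  # []
print(multi_line)   # ['the']
print(both)         # ['The', 'the']
```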

Here is an example of using the g flag in JavaScript:

const text = "hello world, hello universe";
const matches = text.match(/hello/g);

console.log(matches);  // ['hello', 'hello']

In this example, the g flag is used to search for all occurrences of the “hello” match in the text.

Here’s an example of using the i flag in Bash:

text="The quick brown fox jumps over the lazy dog."
matches=$(echo "$text" | grep -oi "FOX")

echo $matches  # fox

In this example, the grep command searches for all occurrences of "FOX" in the $text string, ignoring case thanks to the -i flag. The -o flag displays only the parts of the input that match the regular expression, instead of printing each matching line in full. The output is the matched text as it appears in the input, so it prints "fox" in lowercase.