Master Regular Expressions Part 2: Literal Characters, Global Mode, Meta-characters and escaping character
In Part 2 of the Master Regular Expressions series, we will discuss literal characters, global mode, metacharacters, and escaping characters.
Literal Characters
Let’s start learning the syntax of regular expressions with the simplest match of all, a literal character. In other words, letter b matches the letter b and letter C matches the letter C. It simply matches what we typed in.
We are going to work with strings. A string is simply a finite sequence of characters: letters, numbers, symbols, and punctuation marks. If we match the regular expression /web/ against the text web, it matches the full text. /web/ also matches the first three letters of the string website. It does not match the entire string website, only the web at its start.
It’s like searching in a word processor. It’s the simplest match available: it just says, look for the literal text “web” inside the string.
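As a minimal sketch of literal matching, assuming Python's built-in `re` module (the article itself is language-neutral, and Python writes patterns as strings rather than between forward slashes):

```python
import re

# The pattern "web" consists only of literal characters.
full = re.search(r"web", "web")
partial = re.search(r"web", "website")

print(full.group())    # web
print(partial.span())  # (0, 3): only the first three letters of "website"
```

`re.search` looks for the pattern anywhere inside the string, which is close to how a word processor's find box behaves.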
Case Sensitivity
Searches in regular expressions are case-sensitive by default, which may not be how your word processor works. For example, if we match the regex /web/ against the string Web, it will not match, because the regex is looking for a lowercase w. If we instead match it against web, it will match.
If you want to make these searches case-insensitive, you can use the i flag. You can try this at regex101.com.
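The same behaviour can be sketched in Python, assuming its `re` module, where the i flag is spelled `re.IGNORECASE`:

```python
import re

text = "Web"

# Default matching is case-sensitive: lowercase "w" does not match "W".
case_sensitive = re.search(r"web", text)

# With the ignore-case flag, "web" matches "Web".
case_insensitive = re.search(r"web", text, re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # Web
```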
White spaces
Whitespace matters in regular expressions. For example, the regex /web/ does not match the string w eb.
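A short Python sketch of this point, assuming the `re` module with its default settings; a space in the pattern is just another literal character:

```python
import re

spaced_text = "w eb"

# "web" has no space, so it cannot match "w eb".
without_space = re.search(r"web", spaced_text)

# A pattern that includes the space matches it literally.
with_space = re.search(r"w eb", spaced_text)

print(without_space)       # None
print(with_space.group())  # w eb
```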
Global mode
In a regex editor, you can turn on the global flag by clicking the flag icon and selecting global. In a programming language, you can use global mode by adding g after the regex or by calling the appropriate function or method.
By default, global mode is turned off when using regex in programming languages. In standard (non-global) matching, the leftmost, or earliest, match is always preferred: as soon as the engine finds its first match, it returns. For example, suppose we match the string aalaallaa against the regex /aa/. With global mode turned off, it matches only the first aa in the string. With global mode turned on, the engine keeps scanning from left to right after each match, so /aa/ matches all three occurrences of aa in aalaallaa.
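Python has no g flag, so this sketch assumes the usual Python equivalents: `re.search` returns only the leftmost match (non-global behaviour), while `re.finditer` yields every non-overlapping match from left to right (global behaviour).

```python
import re

text = "aalaallaa"

# Non-global: stop at the leftmost match.
first = re.search(r"aa", text)

# Global: collect every non-overlapping match, scanning left to right.
all_spans = [m.span() for m in re.finditer(r"aa", text)]

print(first.span())  # (0, 2)
print(all_spans)     # [(0, 2), (3, 5), (7, 9)]
```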
Metacharacters
Let’s start with a definition.
A metacharacter is a character that has a special meaning instead of a literal meaning.
Metacharacters transform literal characters into powerful expressions. There are only a few metacharacters to learn, and we will look at most of them in the upcoming parts.
Metacharacters can be tricky because some have more than one meaning depending on the context, and they vary somewhat between regex engines.
Now that we have the idea of metacharacters, let’s learn the first one: the wildcard metacharacter.
Wildcard Metacharacter
The wildcard metacharacter is the period (.). It matches any single character except the newline character. If we add the s (single line) flag, it also matches the newline character.
The wildcard metacharacter doesn’t match the newline character because the original UNIX regex tools were line-based. For example, take the regular expression /c.t/. This regex matches “cat” and “cut” but not “cast”, because the period matches exactly one character. It is the broadest match possible, and it is the most commonly used metacharacter.
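Here is a small Python sketch of the wildcard, assuming the `re` module, where the s (single line) flag is called `re.DOTALL`:

```python
import re

pattern = re.compile(r"c.t")

# The period stands for exactly one character between "c" and "t".
results = {word: bool(pattern.search(word)) for word in ["cat", "cut", "cast"]}
print(results)  # {'cat': True, 'cut': True, 'cast': False}

# By default the period does not match a newline...
no_newline = re.search(r"c.t", "c\nt")
# ...but with the single-line (DOTALL) flag it does.
with_dotall = re.search(r"c.t", "c\nt", re.DOTALL)

print(no_newline)               # None
print(with_dotall is not None)  # True
```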
It also causes a lot of mistakes. To match a decimal number like 1.12, we might write a regex like /1.12/. It matches not only 1.12 but also 1312 and even 1a12. When working with regular expressions, we need to be careful to match what we want, and only what we want.
Escaping Metacharacters
If a metacharacter has a special meaning, what if we want its literal meaning? As we saw in the previous section with /1.12/, we do not want the wildcard metacharacter. We want a literal period, a decimal point. To get that, we use another metacharacter, the backslash \. It escapes the next character and gives it a different meaning.
A literal period is written as \. (a backslash followed by a period). In the example discussed previously, the regex becomes /1\.12/. It matches only 1.12.
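To compare the unescaped and escaped period side by side, here is a Python sketch assuming the `re` module. The `candidates` list is just illustrative test data, and `fullmatch` is used so that each string must match the pattern in its entirety.

```python
import re

candidates = ["1.12", "1312", "1a12"]

unescaped = re.compile(r"1.12")   # "." is the wildcard
escaped = re.compile(r"1\.12")    # "\." is a literal period

loose = [s for s in candidates if unescaped.fullmatch(s)]
strict = [s for s in candidates if escaped.fullmatch(s)]

print(loose)   # ['1.12', '1312', '1a12']
print(strict)  # ['1.12']
```

The raw-string prefix `r"..."` keeps Python from interpreting the backslash itself, so the regex engine receives `\.` unchanged.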
This is true for all metacharacters. To match a backslash itself, escape it with another backslash, as in /\\/. This regex matches a single backslash character (\).
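Matching a backslash is easy to get wrong in code, because many languages also use the backslash for escaping inside string literals. A sketch in Python, assuming the `re` module and a raw string for the pattern:

```python
import re

# In a normal Python string literal, "\\" is ONE backslash character.
single_backslash = "\\"

# In a raw string, r"\\" is two characters: the regex for one literal backslash.
match = re.search(r"\\", single_backslash)

print(len(single_backslash))              # 1
print(match.group() == single_backslash)  # True
```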
Literal characters should not be escaped, because escaping often gives them a different meaning: a backslash followed by w (\w) means something different from a literal w. Quotation marks are not metacharacters, so they do not need to be escaped. Forward slashes do need to be escaped when the regex is written between forward slashes, because an unescaped forward slash would be read as the end of the regular expression.
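The difference between w and \w can be sketched in Python, assuming the `re` module. Note that Python patterns are plain strings rather than slash-delimited, so in this particular setting a forward slash needs no escaping; the sample strings are illustrative only.

```python
import re

sample = "wow_1"

# A literal w matches only the letter w.
literal_w = re.findall(r"w", sample)

# \w is a shorthand that matches any word character (letter, digit, underscore).
word_chars = re.findall(r"\w", sample)

print(literal_w)   # ['w', 'w']
print(word_chars)  # ['w', 'o', 'w', '_', '1']

# Quotation marks and (in Python) forward slashes match literally without escaping.
quoted = re.search(r'"a/b"', 'say "a/b"')
print(quoted.group())  # "a/b"
```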
Read Part 1 of the series Master Regular Expressions Part 1: introduction, history, engines, notation, and modes.
Part 3 will be available soon.