Master Regular Expressions Part 1: introduction, history, engines, notation and modes
In this tutorial series, we will master regular expressions from scratch. Let’s start with an introduction.
Introduction to Regular Expressions
Let’s start with the definition of Regular Expressions:
A regular expression is a sequence of symbols and characters expressing a string or pattern to be searched for within a larger piece of text.
What are Regular Expressions
Regular expressions are all about text. They are a tool that lets us work with text by describing text patterns, whether that text is email, news, stories, text messages, computer code, contacts, or any other sort of data. When we see the term pluralized, we are talking about the formal language of these symbols, which needs to be interpreted by a regular expression processor. A processor lets us match, search, and manipulate text by using those symbols. We will talk about regular expression processors in a bit more detail later.
Regular expressions are not a programming language, although they may seem similar because they have a defined set of rules. Most programming languages support regular expressions. There are no constants, variables, or loops, and you can’t add one plus one. They are just symbols that describe a search pattern. You will frequently hear them called regex or regexp. I will be using regex in these tutorials. The plural of regex is regexes.
Usage
Here are some common uses of regex:
- Test if an email address is valid in format
- Test if a postal code has the correct number of digits
- Replace all occurrences of Hello with Hola, or of work with working
- Count the number of times “development” is preceded by “web”, “software”, or “android”
- Search a file or a text document for any one of several words
- Find duplicate words in a text
- Convert all newline characters into spaces
In each of these cases, you will be using regular expressions. A regular expression matches text if it correctly describes that text.
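As a rough illustration of a few of the uses listed above, here is a minimal sketch using Python's re module. Python is an assumption made for this example (the series itself uses the PHP engine), and the sample strings as well as the deliberately simplistic email and postal code patterns are hypothetical.

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "Hello world. Hello again.\nThis is is a test."
jobs = "web development and software development, not game development"

# Test whether a string has the rough shape of an email address.
# This pattern is intentionally simple and is not a full validator.
looks_like_email = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", "user@example.com") is not None

# Test whether a postal code has exactly five digits (an assumed format).
is_five_digits = re.fullmatch(r"\d{5}", "90210") is not None

# Replace all occurrences of Hello with Hola.
replaced = re.sub(r"Hello", "Hola", text)

# Count how often "development" is preceded by web, software, or android.
count = len(re.findall(r"(?:web|software|android) development", jobs))

# Find duplicate words: a word, whitespace, then the same word again.
duplicates = re.findall(r"\b(\w+)\s+\1\b", text)

# Convert all newline characters into spaces.
single_line = re.sub(r"\n", " ", text)

print(looks_like_email, is_five_digits, count, duplicates)
```

Every symbol used in these patterns is explained later in the series; for now, the point is only that each task boils down to describing a text pattern.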
History of Regular Expressions
Let’s take a quick look at the history of regular expressions before we move on. If you are not interested in the history of regex, feel free to skip this lesson and continue learning.
Regular expressions got their start in the field of neuroscience in the 1940s. In 1943, McCulloch and Pitts developed models that describe how the human nervous system works, or how a machine or a computer could be built to act more like a human brain. In 1956, mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets. It was not until 1968 that regular expressions reached software: Ken Thompson, one of the key developers of UNIX, published a regular expression search algorithm and implemented it in the QED text editor, and it then carried over into ed, the UNIX text editor. This is where regular expressions enter the computer world, right around the birth of UNIX.
Using Regex in UNIX
In the ed text editor, you search for a regular expression by typing g, a forward slash, and then the regular expression you want to search for. You end it with a forward slash and a p. It looks like this: g/Regular Expression/p.
g and p are the modifiers. g tells the processor to search globally, meaning everywhere, and p tells the processor to print the results on the screen. This command gave its name to grep, which stands for Global Regular Expression Print.
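To make the behavior of g/re/p concrete, here is a minimal sketch in Python that mimics it: look at every line (global), test it against the regular expression, and print the lines that match. The function name and the sample lines are hypothetical, and this is only an approximation of the idea, not how ed or grep is implemented.

```python
import re

def global_regex_print(pattern, lines):
    """Return every line that contains a match for pattern, like g/re/p."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical editor buffer, one string per line.
sample = [
    "Regular Expression basics",
    "nothing to see here",
    "a Regular Expression again",
]

for line in global_regex_print(r"Regular Expression", sample):
    print(line)
```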
grep became popular, and regular expressions came to be widely used in other Unix programs such as awk, emacs, vi, and others. Later, changes were made to grep, and egrep (extended grep) was released.
Standardization of Regex
In 1986, POSIX (Portable Operating System Interface) was introduced as a standard to ensure compatibility across different operating systems.
It defines two types of regular expressions:
- Basic Regular Expressions (BREs)
- Extended Regular Expressions (EREs)
BREs were maintained for backward compatibility, and EREs are what most modern tools use. This effort standardized regular expressions.
In 1986, Henry Spencer released a regular expression library written in C. This library could be included in other programs. It provided consistency, because expressions work the same way for everyone who uses the library.
Perl Regular Expressions
In 1987, Larry Wall released the Perl programming language. It used Spencer’s regex library. Many powerful features were added over time, which made Perl the gold standard that people want their own libraries to match. As a result, Perl-compatible languages and programs came out.
Apache, C#, VB.NET, Java, JavaScript, MySQL, PHP, Python, Ruby, and C/C++ all offer Perl-style regular expressions, either through the PCRE (Perl Compatible Regular Expressions) library or through their own engines modeled on Perl’s regular expression features.
Regular Expression Engines
Most languages attempted to offer Perl-compatible regular expressions, but they are all slightly different because each one was implemented by a different group with slightly different specifications. Each implementation is known as a regex engine. The most common regex engines are:
- JavaScript/ActionScript (ECMAScript)
- .NET
- Perl
- C/C++
- Apache
- PHP (PCRE)
- Ruby
- Python
- UNIX
- Java
It’s hard to talk about these engines in general because the differences between them can be very subtle. Most of the basic features of regular expressions are the same across all of these engines. Most engines are based on either EREs or Perl-compatible syntax. I’ll highlight the differences and the important exceptions among these engines, but you need to consult the documentation for your language or engine to get the real facts about how things work. Here we will be talking about the general principles of regular expressions, regardless of which of these languages you work with.
Using an engine
In these tutorials, we will be using the PHP (PCRE) engine. Regex101 is a very useful online resource for testing regular expressions.
Notation Convention and Modes
By convention, a regular expression is written between two forward slashes. The convention comes from grep: as mentioned in the history section, regular expressions were written like g/re/p, so it goes back to UNIX.
When testing in Regex101, do not type the forward slashes, because they are added automatically. Do write them when using a regex inside a programming language that expects delimiters, such as PHP.
Regular expressions come in several different modes:
- Standard /re/ which is the regular expression by itself.
- Global /re/g which is written with a g after the closing slash. It finds every match instead of stopping at the first one.
- Case-Insensitive /re/i which makes the search case insensitive. By default, a regular expression search is case sensitive.
- Multi-line /re/m which lets the start and end anchors (^ and $) match at the beginning and end of each line rather than only of the whole text.
- Dot matches all /re/s which lets the dot also match newline characters.
We will talk about all of these in depth later. The mode comes after the regular expression. It is not part of the regex; it is a modifier that changes how the regex is handled. In Regex101, select the flags icon to add or remove modifiers. Modifiers are also called flags.
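To see what each mode changes, here is a small sketch in Python. This is an assumption made for illustration: Python does not use the /re/flags notation, so each mode is shown through its closest equivalent in the re module, and the sample text is hypothetical. In particular, Python has no g flag; re.search returns the first match and re.findall returns all of them.

```python
import re

text = "Cat sat.\ncat ran."  # hypothetical two-line sample

# Standard (/re/): first match only, case sensitive.
first = re.search(r"cat", text)  # skips "Cat" and finds "cat" on line 2

# Global (/re/g): every match; in Python this is re.findall.
all_matches = re.findall(r"at", text)

# Case-insensitive (/re/i): re.IGNORECASE.
any_case = re.findall(r"cat", text, re.IGNORECASE)

# Multi-line (/re/m): ^ also matches at the start of each line.
default_starts = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Dot matches all (/re/s): the dot also matches newline characters.
without_s = re.search(r"sat.+cat", text)
with_s = re.search(r"sat.+cat", text, re.DOTALL)

print(first.start(), all_matches, any_case)
print(default_starts, line_starts)
print(without_s, with_s.group())
```

Flags can be combined in Python with the | operator, for example re.IGNORECASE | re.MULTILINE, which corresponds to writing /re/im in the slash notation.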
Part 2 of the series: Master Regular Expressions Part 2: Literal Characters, Global Mode, Meta-characters and escaping character