Regular Expressions

CS305 -- Formal Language Theory

"The algebraic notation for regular languages"

Pattern --> Language --> Machine


The Big Picture: Three Equivalent Views

Regular Expressions, DFAs, and NFAs all describe the exact same class of languages: the regular languages.

          Regular Expressions (RE)
              /              \
   convert to NFA        convert from DFA
            /                  \
           v                    v
         NFA <--------------> DFA
      subset construction & vice versa

RE === NFA === DFA

All three recognize EXACTLY the regular languages

Key Idea

Any language you can describe with an RE, you can build a DFA for -- and vice versa. They are equally powerful. The proofs go around the triangle: RE --> NFA --> DFA --> RE.


What Is a Regular Expression?

A regular expression is a compact, algebraic notation for describing a set of strings (i.e., a language).

Every RE defines a language L(R)
Built from a small set of operations
Finite memory only, no unbounded counting -- just patterns
Equivalent in power to finite automata

Analogy

Regex is to languages what arithmetic is to numbers. Just as 3 + 5 x 2 is a compact way to describe the number 13, the expression (0|1)*01 is a compact way to describe "all binary strings ending in 01."

Quick Examples

RE	Language
`0`	{ "0" }
`0|1`	{ "0", "1" }
`01`	{ "01" }
`0*`	{ "", "0", "00", "000", ... }
`(0|1)*`	All binary strings
`(0|1)*01`	Binary strings ending in 01


The Three Basic Operations

Every regular expression is built from just three operations:

1. Union (|)

R1 | R2 = "either R1 or R2"

L(R1 | R2) = L(R1) ∪ L(R2)

Example: L(0 | 1) = { "0", "1" }

2. Concatenation (.)

R1 R2 = "R1 followed by R2"

L(R1 . R2) = { xy : x ∈ L(R1), y ∈ L(R2) }

Example: L(0 . 1) = { "01" }

3. Kleene Star (*)

R* = "zero or more strings from R, concatenated"

L(R*) = { w1 w2 ... wk : k ≥ 0, each wi ∈ L(R) }

The wi need not be equal, and k = 0 gives ε.

Example: L(0*) = { "", "0", "00", "000", ... }

That's It!

Union, Concatenation, and Kleene Star -- these three operations are all you need to build every regular expression. They are like LEGO bricks: simple pieces, infinite combinations.
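The three definitions above can be tried out directly on finite sets of strings. This is a minimal Python sketch, not a regex engine: the function names are illustrative, and `star` takes an assumed `max_len` cutoff because L(R*) is usually infinite.

```python
def union(a, b):
    """L(R1 | R2): every string in either language."""
    return a | b


def concat(a, b):
    """L(R1 R2): every x from the first language followed by every y from the second."""
    return {x + y for x in a for y in b}


def star(a, max_len):
    """L(R*) restricted to strings of length <= max_len (assumed cutoff)."""
    result = {""}      # zero copies: the empty string
    frontier = {""}
    while frontier:
        # Append one more string from `a` to everything found in the last round.
        frontier = {
            x + y
            for x in frontier
            for y in a
            if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


zero, one = {"0"}, {"1"}
print(sorted(union(zero, one)))           # ['0', '1']
print(sorted(concat(zero, one)))          # ['01']
print(sorted(star(zero, 3)))              # ['', '0', '00', '000']
print(sorted(star(union(zero, one), 2)))  # ['', '0', '00', '01', '1', '10', '11']
```

The last line shows that star may mix different strings from L(R): "01" and "10" are in L((0|1)*) even though neither is a repetition of a single string.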


Formal Definition (Recursive)

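A regular expression over alphabet Σ is defined inductively:

Base cases: ∅, ε, and each symbol a ∈ Σ are regular expressions, with L(∅) = { }, L(ε) = { "" }, and L(a) = { "a" }.
Inductive cases: if R1 and R2 are regular expressions, then so are R1 | R2, R1 R2, and R1*.
Nothing else is a regular expression.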

Watch Out: ∅ vs ε

∅ matches nothing at all (the empty language -- no strings). ε matches one thing: the empty string "". They are NOT the same!

• L(∅) = { } (0 elements)
• L(ε) = { "" } (1 element)


Operator Precedence

Just like arithmetic has PEMDAS, regular expressions have a precedence order: star (*) binds tightest, then concatenation, then union (|). Parentheses override the default order.

Analogy: Arithmetic Parallel

Arithmetic	Regex
Exponent (^)	Star (*)
Multiply (x)	Concat (.)
Add (+)	Union (|)

Just as 2+3x4 = 14 (not 20),
a|bc = {a, bc} (not {ac, bc}).
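Python's `re` module follows the same precedence for these three operators, so the rule can be checked by experiment. A small sketch, assuming only `|`, concatenation, and `*` are used (Python's regex dialect has many extras that are not part of the theoretical notation); `accepts` is an illustrative helper name.

```python
import re


def accepts(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# a|bc parses as a|(bc), not (a|b)c.
print(accepts(r"a|bc", "a"))     # True
print(accepts(r"a|bc", "bc"))    # True
print(accepts(r"a|bc", "ac"))    # False
print(accepts(r"(a|b)c", "ac"))  # True

# ab* parses as a(b*), not (ab)*.
print(accepts(r"ab*", "abbb"))    # True
print(accepts(r"ab*", "abab"))    # False
print(accepts(r"(ab)*", "abab"))  # True
```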


RE Parse Trees

Every regular expression has a parse tree whose shape follows the precedence rules: the lowest-precedence operator that is not inside parentheses sits at the root.

Reading Strategy

1. Identify the outermost operation (lowest precedence).
2. Break into sub-expressions.
3. Describe each part in English.
4. Combine the descriptions.
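Step 1 of the strategy is what a parser does mechanically. The sketch below is one possible recursive-descent parser, with one function per precedence level; the nested-tuple tree format and the node names ("union", "concat", "star", "sym") are assumptions made for illustration, and ε and ∅ are not handled.

```python
def parse(regex):
    """Parse a regular expression into a nested-tuple parse tree.

    Grammar, lowest precedence first:
        union  := concat ('|' concat)*
        concat := star star*
        star   := atom '*'*
        atom   := symbol | '(' union ')'
    """
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def take():
        nonlocal pos
        ch = regex[pos]
        pos += 1
        return ch

    def union():
        node = concat()
        while peek() == "|":
            take()
            node = ("union", node, concat())
        return node

    def concat():
        node = star()
        # Keep concatenating until the sub-expression ends.
        while peek() is not None and peek() not in "|)":
            node = ("concat", node, star())
        return node

    def star():
        node = atom()
        while peek() == "*":
            take()
            node = ("star", node)
        return node

    def atom():
        if peek() is None or peek() in "|)*":
            raise ValueError(f"unexpected input at position {pos}")
        ch = take()
        if ch == "(":
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            take()
            return node
        return ("sym", ch)

    tree = union()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return tree


print(parse("a|bc"))
# ('union', ('sym', 'a'), ('concat', ('sym', 'b'), ('sym', 'c')))
print(parse("(0|1)*01")[0])  # concat
```

The root of the tree for (0|1)*01 is a concatenation, so the English reading starts with "something, followed by 0, followed by 1".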


Examples: Writing Regular Expressions

Given a language description, write the RE:

Language 1

"All binary strings of length exactly 3"

Think: 3 symbols, each is 0 or 1
Answer: (0|1)(0|1)(0|1)
Check: 000 ✓ 010 ✓ 111 ✓ 0 ✗ 0011 ✗

Language 2

"All binary strings containing 010"

Think: something, then 010, then something
Answer: (0|1)*010(0|1)*
Check: 010 ✓ 10101 ✓ 111 ✗

Language 3

"Strings over {a,b} with even length"

Think: pairs of symbols, repeated
Answer: ((a|b)(a|b))*
Check: "" ✓ "ab" ✓ "a" ✗ "abba" ✓

Language 4

"Strings over {0,1} with no consecutive 1s"

Think: after each 1, must see 0 or end
Answer: (0|10)*(1|ε)
Check: "" ✓ "0" ✓ "101" ✓ "11" ✗ "010" ✓

Writing Strategy

Think in building blocks: (1) What must appear? (2) What can repeat? (3) What are the choices? Combine using concat for "then," union for "or," and star for "repeat."
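A handful of hand checks can miss a case, so it can help to compare an answer against the plain-English description on every short string. The sketch below does this for the four languages above using Python's `re`; the helper names and the length cutoff of 8 are assumptions, and passing is evidence rather than a proof.

```python
import re
from itertools import product


def strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def agrees(pattern, predicate, alphabet, max_len=8):
    """True if `pattern` matches exactly the strings satisfying `predicate`."""
    return all(
        (re.fullmatch(pattern, s) is not None) == predicate(s)
        for s in strings(alphabet, max_len)
    )


checks = [
    (r"(0|1)(0|1)(0|1)", lambda s: len(s) == 3, "01"),
    (r"(0|1)*010(0|1)*", lambda s: "010" in s, "01"),
    (r"((a|b)(a|b))*", lambda s: len(s) % 2 == 0, "ab"),
    # Python has no ε symbol; the empty alternative in (1|) plays that role.
    (r"(0|10)*(1|)", lambda s: "11" not in s, "01"),
]
for pattern, predicate, alphabet in checks:
    print(pattern, agrees(pattern, predicate, alphabet))  # True for all four
```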


RE to ε-NFA: Thompson's Construction

We can systematically convert any RE into an equivalent NFA using Thompson's Construction (1968).

The Idea: Build the NFA like LEGO

1. Each sub-expression gets its own small NFA "fragment":

        +-------------------+
   ---->| start      accept |---->
        +-------------------+

2. Fragments have EXACTLY:
   - One start state (no incoming edges within the fragment)
   - One accept state (no outgoing edges within the fragment)
3. Combine fragments using rules for |, ., and *
4. The structure mirrors the parse tree of the RE!

Analogy: LEGO Bricks

Think of each base case (symbol, ε) as a basic LEGO brick. Union, concatenation, and star are ways to snap bricks together. You build bottom-up, combining small NFAs into bigger ones, until you have one NFA for the whole RE.


Thompson's Construction: Base Cases

Empty String ε

L = { "" } -- Accepts only the empty string.

Empty Set ∅

No transition! L = { } -- Accepts nothing.

Single Symbol a

L = { "a" } -- Accepts only the string "a".

Properties of Every Fragment

Exactly one start state
Exactly one accept state
Start state has no incoming edges
Accept state has no outgoing edges

These invariants let us compose fragments cleanly!


Thompson's Construction: Union (R1 | R2)

"Accept if either R1 or R2 matches."

Why This Works

From the new start, the NFA nondeterministically guesses which branch (R1 or R2) will match. If either branch reaches its old accept state, the ε-transition carries us to the new accept state.


Thompson's Construction: Concatenation (R1 R2)

"Accept if R1 matches a prefix and R2 matches the rest."

Analogy: Train Cars

Concatenation is like coupling train cars. The output of the first car (R1) connects directly to the input of the second car (R2). A string must ride through both to be accepted.


Thompson's Construction: Kleene Star (R*)

"Accept zero or more repetitions of R."

Critical: The ε-bypass

The bottom ε-transition from start directly to accept is what makes R* accept the empty string. Without it, we would need at least one copy of R (that would be R⁺, not R*).
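The base cases and the three combination rules fit in one short recursive function. The following is a sketch of a common textbook variant, not necessarily the exact diagrams from these slides: concatenation links the two fragments with an ε-transition instead of merging states, and the nested-tuple input format and all names are assumptions for illustration.

```python
from itertools import count

EPS = None  # label used for ε-transitions


def thompson(tree, edges, ids):
    """Build a fragment for `tree`; return its (start, accept) states.

    `tree` is a nested tuple: ("sym", a), ("eps",), ("empty",),
    ("union", l, r), ("concat", l, r) or ("star", r).
    `edges` collects (source, label, target) triples.
    """
    kind = tree[0]
    if kind == "concat":
        # Couple the fragments: accept of the first feeds start of the second.
        s1, a1 = thompson(tree[1], edges, ids)
        s2, a2 = thompson(tree[2], edges, ids)
        edges.append((a1, EPS, s2))
        return s1, a2
    start, accept = next(ids), next(ids)
    if kind == "sym":
        edges.append((start, tree[1], accept))
    elif kind == "eps":
        edges.append((start, EPS, accept))
    elif kind == "empty":
        pass  # no transition at all: accept is unreachable
    elif kind == "union":
        for branch in tree[1:]:
            s, a = thompson(branch, edges, ids)
            edges.append((start, EPS, s))   # guess this branch
            edges.append((a, EPS, accept))  # branch matched
    elif kind == "star":
        s, a = thompson(tree[1], edges, ids)
        edges.append((start, EPS, s))       # enter R
        edges.append((a, EPS, s))           # loop back for another copy
        edges.append((a, EPS, accept))      # leave after a copy
        edges.append((start, EPS, accept))  # ε-bypass: zero copies
    else:
        raise ValueError(f"unknown node {kind!r}")
    return start, accept


def closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(tree, text):
    """Run the ε-NFA built from `tree` on `text`."""
    edges = []
    start, accept = thompson(tree, edges, count())
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current


# (0|1)*01 written out as a parse tree.
any_bits = ("star", ("union", ("sym", "0"), ("sym", "1")))
ends_in_01 = ("concat", ("concat", any_bits, ("sym", "0")), ("sym", "1"))
print(accepts(ends_in_01, "1101"))  # True
print(accepts(ends_in_01, "0110"))  # False
```

Every branch returns a fragment whose start has no incoming edge and whose accept has no outgoing edge, which is why the recursive calls can be wired together without checking anything.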


Thompson's Construction: Step-Through for (0|1)*01

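The NFA is built bottom-up, following the parse tree: fragments for 0 and 1, their union (0|1), the star (0|1)*, then fragments for the final 0 and 1, joined by two concatenations.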


DFA to RE: State Elimination Method

To convert a DFA (or NFA) back to a regular expression, we use state elimination.

The Idea:

1. Start with the DFA.
2. Add a NEW unique start state: (s) --ε--> old start.
   Add a NEW unique accept state: old accepts --ε--> (f).
3. Remove states ONE BY ONE (not s or f).
   - When removing state q, replace transitions through q with RE-labeled transitions.
4. When only s and f remain, the label on the single edge s --> f is the answer RE!

Before:

              R2 (self-loop)
              +--+
              v  |
  qi --R1--> q_rip --R3--> qj

After removing q_rip:

  qi --R1 R2* R3--> qj    (plus any direct qi-->qj edge)

Core Rule for Removing State q

For every pair (qi, qj) that both connect through q: the new label on qi-->qj becomes: (old label qi-->qj) | (label qi-->q)(label q-->q)*(label q-->qj)
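The core rule translates almost line for line into code. The sketch below is one way to do it, with edge labels kept as Python regex strings, `None` standing for ∅ (no edge), and `""` for ε. The two-state DFA (A = "last symbol was not 1", B = "last symbol was 1") is a hypothetical example for binary strings ending in 1, and symbols are assumed to be plain characters with no special meaning in `re`.

```python
import re
from itertools import product


def alt(r, s):
    """Union of two labels; None stands for ∅ (no edge)."""
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"({r}|{s})"


def cat(*parts):
    """Concatenation of labels; ∅ annihilates, "" (ε) is the identity."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def star(r):
    """Kleene star of a label, using ∅* = ε and ε* = ε."""
    return "" if not r else f"({r})*"


def dfa_to_regex(states, start, accepts, delta):
    """Eliminate every DFA state and return a regex string (None if ∅).

    `delta` maps (state, symbol) -> state. "s" and "f" are the added
    start and accept states, assumed not to clash with names in `states`.
    """
    label = {}  # (source, target) -> regex string; a missing key means ∅
    for (q, symbol), target in delta.items():
        label[q, target] = alt(label.get((q, target)), symbol)
    label["s", start] = ""   # s --ε--> old start
    for q in accepts:
        label[q, "f"] = ""   # old accepts --ε--> f

    remaining = list(states)
    while remaining:
        q = remaining.pop()
        loop = star(label.get((q, q)))
        sources = [qi for qi in remaining + ["s"] if (qi, q) in label]
        targets = [qj for qj in remaining + ["f"] if (q, qj) in label]
        for qi, qj in product(sources, targets):
            # (old label qi-->qj) | (qi-->q)(q-->q)*(q-->qj)
            through = cat(label[qi, q], loop, label[q, qj])
            label[qi, qj] = alt(label.get((qi, qj)), through)
        # Drop every edge that touches the eliminated state.
        label = {k: v for k, v in label.items() if q not in k}
    return label.get(("s", "f"))


delta = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "A", ("B", "1"): "B"}
regex = dfa_to_regex(["A", "B"], "A", ["B"], delta)
print(regex)

# Cross-check against the intended language on all short binary strings.
words = ("".join(w) for n in range(9) for w in product("01", repeat=n))
print(all((re.fullmatch(regex, w) is not None) == w.endswith("1") for w in words))  # True
```

The printed expression is usually longer than a hand-written one such as (0|1)*1, and it depends on the order in which states are removed; only the language is guaranteed to be the same.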


State Elimination: Interactive Step-Through

Example: convert a DFA for binary strings ending in 1 to a regular expression. Whatever order the states are removed in, the result is equivalent to (0|1)*1.



Algebraic Laws of Regular Expressions

REs obey many useful identities. These help simplify expressions.

Union Laws

Law	Rule
Commutative	R | S = S | R
Associative	(R|S)|T = R|(S|T)
Idempotent	R | R = R
Identity	R | ∅ = R

Concatenation Laws

Law	Rule
Associative	(RS)T = R(ST)
Identity	Rε = εR = R
Annihilator	R∅ = ∅R = ∅

Distributivity

Law	Rule
Left dist.	R(S|T) = RS | RT
Right dist.	(S|T)R = SR | TR

Star Laws

Law	Rule
Star idem.	(R*)* = R*
Star of ε	ε* = ε
Star of ∅	∅* = ε

Not Commutative!

Concatenation is NOT commutative. ab ≠ ba. Order matters!
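An identity can be sanity-checked by comparing both sides on every short string. The sketch below does this with Python's `re` for a few of the laws; the sample expressions R, S, T and the length cutoff are arbitrary assumptions, the laws involving ∅ are left out because `re` has no direct symbol for it, and agreement up to a cutoff is evidence, not a proof.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=6):
    """Strings over `alphabet` up to `max_len` that fully match `pattern`."""
    return {
        "".join(w)
        for n in range(max_len + 1)
        for w in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(w))
    }


R, S, T = "(a|bb)", "(ab)*", "b"  # arbitrary sample expressions

laws = {
    "union commutes": (f"{R}|{S}", f"{S}|{R}"),
    "union idempotent": (f"{R}|{R}", R),
    "concat identity": (f"{R}()", R),  # () is the empty string, i.e. ε
    "left distributivity": (f"{R}({S}|{T})", f"{R}{S}|{R}{T}"),
    "right distributivity": (f"({S}|{T}){R}", f"{S}{R}|{T}{R}"),
    "star idempotent": (f"({R}*)*", f"{R}*"),
}
for name, (left, right) in laws.items():
    print(name, language(left) == language(right))  # True for every law

print(language("ab") == language("ba"))  # False: concatenation does not commute
```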


Theory vs Practice: Regex in Programming

The "regex" in Python/Java/etc. is more powerful than theoretical REs!

Theoretical RE (CS305)

Only: union, concat, star
Describes exactly the regular languages
Equivalent to DFA/NFA
Can always be matched in O(n) time (run the equivalent DFA over an input of length n)
Cannot count or match patterns like aⁿbⁿ

Practical Regex (Programming)

Adds: +, ?, {n,m}, [a-z], \d, \w
. for any character
Backreferences: \1, \2 (NOT regular!)
Lookahead/lookbehind
Can cause exponential blowup!

Feature	Theoretical	Practical	Still Regular?
`R+` (one or more)	Write as RR*	Built-in	Yes
`R?` (optional)	Write as R|ε	Built-in	Yes
`[a-z]`	Write as a|b|...|z	Built-in	Yes
`R{3,5}`	Write as RRR|RRRR|RRRRR	Built-in	Yes
`\1` backreference	Cannot express	Built-in	NO!
Lookahead `(?=...)`	No direct operator, but expressible (regular languages are closed under intersection and complement)	Built-in	Yes (if no backreferences)
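The "Write as" column can be checked against Python's `re`, and a backreference can be seen matching a language that no theoretical RE describes. A small sketch; `same_on` is an illustrative helper, and the word list is an arbitrary sample.

```python
import re


def same_on(patterns, words):
    """True if all `patterns` agree (match / no match) on every word."""
    return all(
        len({re.fullmatch(p, w) is not None for p in patterns}) == 1
        for w in words
    )


words = ["a" * n for n in range(10)]
print(same_on([r"a+", r"aa*"], words))                 # True
print(same_on([r"a?", r"a|"], words))                  # True (empty alternative = ε)
print(same_on([r"a{3,5}", r"aaa|aaaa|aaaaa"], words))  # True

# A backreference must repeat whatever group 1 captured, so this pattern
# matches a^n b a^n, a language that is not regular.
mirror = r"(a*)b\1"
print(re.fullmatch(mirror, "aaabaaa") is not None)  # True
print(re.fullmatch(mirror, "aaabaa") is not None)   # False
```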

ReDoS: When Regex Goes Wrong

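Matching with backreferences is NP-hard in general. Separately, backtracking engines can take exponential time even on patterns without backreferences: a nested quantifier like (a+)+, followed by something that fails to match, causes catastrophic backtracking -- a real security vulnerability called ReDoS.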


Common RE Patterns

Useful regular expression patterns for common language descriptions.

Over Σ = {0, 1}

Language	RE
All strings	`(0|1)*`
Strings starting with 1	`1(0|1)*`
Strings ending in 00	`(0|1)*00`
Strings containing 101	`(0|1)*101(0|1)*`
Even-length strings	`((0|1)(0|1))*`
Strings of only 0s	`0*`
Non-empty strings	`(0|1)(0|1)*`
Exactly three characters	`(0|1)(0|1)(0|1)`

Over Σ = {a, b}

Language	RE
Starts and ends with a	`a(a|b)*a | a`
Contains at least two b's	`(a|b)*b(a|b)*b(a|b)*`
No two consecutive a's	`(b|ab)*(a|ε)`
Every a followed by b	`(b|ab)*`
Alternating a's and b's	`(ab)*(a|ε) | (ba)*(b|ε)`

Building Block Patterns

Σ* = anything,
Σ* w Σ* = contains w,
w Σ* = starts with w,
Σ* w = ends with w
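These building blocks are simple enough to generate mechanically. A minimal sketch for Σ = {0, 1}; the helper names are illustrative, and `w` is assumed to be a plain string of symbols with no operators in it.

```python
import re

SIGMA_STAR = "(0|1)*"  # Σ* for the assumed alphabet Σ = {0, 1}


def contains(w):
    return f"{SIGMA_STAR}{w}{SIGMA_STAR}"


def starts_with(w):
    return f"{w}{SIGMA_STAR}"


def ends_with(w):
    return f"{SIGMA_STAR}{w}"


print(contains("101"))                                        # (0|1)*101(0|1)*
print(re.fullmatch(contains("101"), "0010100") is not None)   # True
print(re.fullmatch(contains("101"), "0011000") is not None)   # False
print(re.fullmatch(starts_with("1"), "100") is not None)      # True
print(re.fullmatch(ends_with("00"), "0100") is not None)      # True
```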


Summary & Cheat Sheet

Core Concepts

RE = DFA = NFA (same power!)

Three operations:

+--------+---------------------------+
| R|S    | Union: R or S             |
| RS     | Concat: R then S          |
| R*     | Star: 0 or more R         |
+--------+---------------------------+

Precedence: * > concat > |

Base cases:
∅ --> empty language { }
ε --> empty string { "" }
a --> single symbol { "a" }

Conversions

RE --Thompson's Construction--> ε-NFA
DFA --State Elimination--> RE

Key Identities

R | S = S | R (union commutes)
R | R = R (idempotent)
Rε = R (concat identity)
R∅ = ∅ (annihilator)
(R*)* = R* (star is idempotent)
∅* = ε (star of empty)
R | ∅ = R (union identity)
R(S|T) = RS | RT (distribute)

Common Mistakes

∅ ≠ ε (empty lang vs empty string)
ab* ≠ (ab)* (star binds tightest)
R* always includes ε
Concat does NOT commute
Practical regex ≠ theoretical RE

Remember

Regular expressions are the algebra of regular languages. Master the three operations, know the precedence, and you can describe any regular language.
