Regular Expressions

CS305 -- Formal Language Theory

"The algebraic notation for regular languages"

Pattern --> Language --> Machine


1 / 21

The Big Picture: Three Equivalent Views

Regular Expressions, DFAs, and NFAs all describe the exact same class of languages: the regular languages.

Conversions around the triangle:

  RE  --> NFA     (convert to NFA)
  NFA <--> DFA    (subset construction & vice versa)
  DFA --> RE      (convert from DFA)

  RE === NFA === DFA

All three recognize EXACTLY the regular languages.

Key Idea

Any language you can describe with an RE, you can build a DFA for -- and vice versa. They are equally powerful. The proofs go around the triangle: RE --> NFA --> DFA --> RE.
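The NFA --> DFA edge of the triangle is the subset construction. The following is a minimal sketch, assuming a hand-written NFA (without ε-moves) for "binary strings ending in 01"; the state numbering and the helper names `subset_construction` and `dfa_accepts` are illustrative and do not come from the slides.

```python
from collections import deque

# Hypothetical NFA for "binary strings ending in 01" (no epsilon moves).
# State 0 loops on every symbol and guesses where the final "01" begins.
NFA = {
    (0, "0"): {0, 1},
    (0, "1"): {0},
    (1, "1"): {2},
}
NFA_START, NFA_ACCEPT = 0, {2}
ALPHABET = "01"

def subset_construction(nfa, start, accepting, alphabet):
    """Return (transitions, start, accepting) of a DFA whose states are frozensets."""
    dfa_start = frozenset({start})
    transitions = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of all NFA moves from every state in the current subset.
            target = frozenset(
                t for q in current for t in nfa.get((q, symbol), ())
            )
            transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset is accepting if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & accepting}
    return transitions, dfa_start, dfa_accepting

def dfa_accepts(transitions, start, accepting, word):
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting

DFA, DFA_START, DFA_ACCEPT = subset_construction(NFA, NFA_START, NFA_ACCEPT, ALPHABET)
print(dfa_accepts(DFA, DFA_START, DFA_ACCEPT, "1101"))
print(dfa_accepts(DFA, DFA_START, DFA_ACCEPT, "10"))
```

Only the subsets reachable from the start are generated, which is why this small NFA yields a DFA with just three states instead of all eight possible subsets.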

2 / 21

What Is a Regular Expression?

A regular expression is a compact, algebraic notation for describing a set of strings (i.e., a language).

  • Every RE defines a language L(R)
  • Built from a small set of operations
  • No unbounded memory, no unbounded counting -- just patterns
  • Equivalent in power to finite automata

Analogy

Regex is to languages what arithmetic is to numbers. Just as 3 + 5 x 2 is a compact way to describe the number 13, the expression (0|1)*01 is a compact way to describe "all binary strings ending in 01."
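To make the analogy concrete, here is a small sketch that lists the short strings a pattern describes. It assumes Python's `re` module as a stand-in for textbook REs (the syntax agrees for `|`, `*`, and parentheses); the helper `language_up_to` is a made-up name for illustration.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet="01", max_len=3):
    """List every string over `alphabet` of length <= max_len that matches fully."""
    compiled = re.compile(pattern)
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            word = "".join(chars)
            if compiled.fullmatch(word):
                matches.append(word)
    return matches

for pattern in ["0", "0|1", "01", "0*", "(0|1)*", "(0|1)*01"]:
    print(f"{pattern:10} {language_up_to(pattern)}")
```

`fullmatch` is used rather than `search` because an RE in the theoretical sense describes whole strings, not substrings.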

Quick Examples

RE          Language
0           { "0" }
0|1         { "0", "1" }
01          { "01" }
0*          { "", "0", "00", "000", ... }
(0|1)*      All binary strings
(0|1)*01    Binary strings ending in 01

3 / 21

The Three Basic Operations

Every regular expression is built from just three operations:

1. Union (|)

R1 | R2 = "either R1 or R2"

L(R1 | R2) = L(R1) ∪ L(R2)

Example: 0 | 1 = { "0", "1" }

2. Concatenation (.)

R1 R2 = "R1 followed by R2"

L(R1 . R2) = { xy : x ∈ L(R1), y ∈ L(R2) }

Example: 0 . 1 = { "01" }

3. Kleene Star (*)

R* = "zero or more copies of R"

L(R*) = { w1 w2 ... wk : k ≥ 0, each wi ∈ L(R) }   (k = 0 gives ε; the wi need not be equal)

Example: 0* = { "", "0", "00", "000", ... }

That's It!

Union, Concatenation, and Kleene Star -- these three operations are all you need to build every regular expression. They are like LEGO bricks: simple pieces, infinite combinations.
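The three operations can be sketched directly on finite sets of strings. This is an illustration only: the function names are invented here, and because L(R*) is usually infinite, `star` takes an assumed `max_len` cutoff.

```python
def union(lang1, lang2):
    """L(R1 | R2) = L(R1) ∪ L(R2)."""
    return lang1 | lang2

def concat(lang1, lang2):
    """L(R1 R2) = { xy : x in L(R1), y in L(R2) }."""
    return {x + y for x in lang1 for y in lang2}

def star(lang, max_len):
    """Strings of L* with length <= max_len (the full L* is usually infinite)."""
    result = {""}        # zero copies
    frontier = {""}
    while frontier:
        # Append one more piece to each newest string; keep only new, short results.
        frontier = {
            w for w in concat(frontier, lang)
            if len(w) <= max_len and w not in result
        }
        result |= frontier
    return result

print(union({"0"}, {"1"}))
print(concat({"0"}, {"1"}))
print(sorted(star({"0"}, 3)))
print(sorted(star({"0", "1"}, 2)))
```

The last call shows that the pieces under a star may differ from one another: "01" is in (0|1)* even though it is not a repetition of a single string.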

4 / 21

Formal Definition (Recursive)

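A regular expression over alphabet Σ is defined inductively.

Base cases:

  • ∅ is an RE, with L(∅) = { }
  • ε is an RE, with L(ε) = { "" }
  • a is an RE for each symbol a ∈ Σ, with L(a) = { "a" }

Inductive cases (R1 and R2 are REs):

  • R1 | R2 is an RE, with L(R1 | R2) = L(R1) ∪ L(R2)
  • R1 R2 is an RE, with L(R1 R2) = { xy : x ∈ L(R1), y ∈ L(R2) }
  • R1* is an RE, with L(R1*) = { w1 w2 ... wk : k ≥ 0, each wi ∈ L(R1) }
  • (R1) is an RE, with L((R1)) = L(R1)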

Watch Out: ∅ vs ε

∅ matches nothing at all (the empty language -- no strings). ε matches exactly one thing: the empty string "". They are NOT the same!

• L(∅) = { }   (0 elements)     • L(ε) = { "" }   (1 element)
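The difference shows up as soon as the two are concatenated with something. A small sketch, modelling languages as Python sets (the constant and function names are illustrative):

```python
EMPTY_LANGUAGE = set()     # L(∅): no strings at all
EPSILON_LANGUAGE = {""}    # L(ε): exactly one string, the empty one

def concat(lang1, lang2):
    return {x + y for x in lang1 for y in lang2}

sample = {"0", "01"}
print(len(EMPTY_LANGUAGE), len(EPSILON_LANGUAGE))
print(concat(sample, EPSILON_LANGUAGE) == sample)   # ε leaves a language unchanged
print(concat(sample, EMPTY_LANGUAGE) == set())      # ∅ wipes it out
```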

5 / 21

Operator Precedence

Just like arithmetic has PEMDAS, regular expressions have a precedence order:

Highest priority    * Star      Binds tightest (like exponents)
Middle              . Concat    (like multiplication)
Lowest priority     | Union     Binds loosest (like addition)

Analogy: Arithmetic Parallel

Arithmetic      Regex
Exponent (^)    Star (*)
Multiply (x)    Concat (.)
Add (+)         Union (|)

Just as 2+3x4 = 14 (not 20),
a|bc = {a, bc} (not {ac, bc}).
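The precedence rules can be checked by experiment. The sketch below assumes Python's `re` module, whose precedence for `*`, concatenation, and `|` matches the textbook order; `accepted` is a hypothetical helper.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

# Union binds loosest: a|bc is a|(bc), not (a|b)c.
print(accepted("a|bc", ["a", "bc", "ac"]))
print(accepted("(a|b)c", ["a", "bc", "ac"]))

# Star binds tightest: ab* is a(b*), not (ab)*.
print(accepted("ab*", ["a", "ab", "abb", "abab"]))
print(accepted("(ab)*", ["a", "ab", "abb", "abab"]))
```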

6 / 21

Interactive RE Parse Tree Explorer

Reading Strategy

1. Identify the outermost operation (lowest precedence).
2. Break into sub-expressions.
3. Describe each part in English.
4. Combine the descriptions.
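Steps 1 and 2 of the strategy amount to building a parse tree. Below is a minimal recursive-descent sketch, assuming single-character symbols and only `|`, `*`, and parentheses; the tuple encoding (`"union"`, `"concat"`, `"star"`, `"sym"`, `"eps"`) is an illustrative choice, not the slides' notation.

```python
def parse(regex):
    """Parse a textbook RE (symbols, |, *, parentheses) into nested tuples."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_union():              # lowest precedence: R1 | R2
        nonlocal pos
        node = parse_concat()
        while peek() == "|":
            pos += 1
            node = ("union", node, parse_concat())
        return node

    def parse_concat():             # middle precedence: R1 R2
        node = None
        while peek() is not None and peek() not in "|)":
            factor = parse_star()
            node = factor if node is None else ("concat", node, factor)
        return node if node is not None else ("eps",)

    def parse_star():               # highest precedence: R*
        nonlocal pos
        node = parse_atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def parse_atom():               # a symbol or a parenthesized RE
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = parse_union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if ch == "*":
            raise ValueError("'*' has nothing to repeat")
        pos += 1
        return ("sym", ch)

    tree = parse_union()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return tree

for regex in ["a|bc", "(0|1)*01"]:
    tree = parse(regex)
    print(regex, "outermost:", tree[0])
    print("   ", tree)
```

The root of the tree is the outermost operation: a union for a|bc, and a concatenation for (0|1)*01.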

7 / 21

Examples: Writing Regular Expressions

Given a language description, write the RE:

Language 1

"All binary strings of length exactly 3"

Think: 3 symbols, each is 0 or 1
Answer: (0|1)(0|1)(0|1)
Check: 000 ✓   010 ✓   111 ✓   0 ✗   0011 ✗

Language 2

"All binary strings containing 010"

Think: something, then 010, then something
Answer: (0|1)*010(0|1)*
Check: 010 ✓   10101 ✓   111 ✗

Language 3

"Strings over {a,b} with even length"

Think: pairs of symbols, repeated
Answer: ((a|b)(a|b))*
Check: "" ✓   "ab" ✓   "a" ✗   "abba" ✓

Language 4

"Strings over {0,1} with no consecutive 1s"

Think: after each 1, must see 0 or end
Answer: (0|10)*(1|ε)
Check: "" ✓   "0" ✓   "101" ✓   "11" ✗   "010" ✓

Writing Strategy

Think in building blocks: (1) What must appear? (2) What can repeat? (3) What are the choices? Combine using concat for "then," union for "or," and star for "repeat."
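A hand-written RE is easy to get subtly wrong, so it helps to compare it against a plain description of the language on every short string. The sketch below does this for the four answers above; it assumes Python's `re` module, writes ε as an empty alternative (`(1|)`), and uses invented helper names. Agreement up to a length bound is evidence, not a proof.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(pattern, predicate, alphabet, max_len=8):
    """True if the RE and the plain-Python predicate accept the same short strings."""
    compiled = re.compile(pattern)
    return all(
        (compiled.fullmatch(w) is not None) == predicate(w)
        for w in words(alphabet, max_len)
    )

CHECKS = [
    ("(0|1)(0|1)(0|1)", lambda w: len(w) == 3, "01"),      # length exactly 3
    ("(0|1)*010(0|1)*", lambda w: "010" in w, "01"),       # contains 010
    ("((a|b)(a|b))*", lambda w: len(w) % 2 == 0, "ab"),    # even length
    ("(0|10)*(1|)", lambda w: "11" not in w, "01"),        # no consecutive 1s
]

for pattern, predicate, alphabet in CHECKS:
    print(f"{pattern:18} agrees: {agrees(pattern, predicate, alphabet)}")
```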

8 / 21

RE to ε-NFA: Thompson's Construction

We can systematically convert any RE into an equivalent NFA using Thompson's Construction (1968).

The Idea: Build the NFA like LEGO

  1. Each sub-expression gets its own small NFA "fragment":
       ----> [ start ... accept ] ---->
  2. Fragments have EXACTLY:
       - One start state (no incoming edges)
       - One accept state (no outgoing edges)
  3. Combine fragments using rules for |, ., and *
  4. The structure mirrors the parse tree of the RE!

Analogy: LEGO Bricks

Think of each base case (symbol, ε) as a basic LEGO brick. Union, concatenation, and star are ways to snap bricks together. You build bottom-up, combining small NFAs into bigger ones, until you have one NFA for the whole RE.

9 / 21

Thompson's Construction: Base Cases

Empty String ε

(start) q0 --ε--> q1 (accept)

L = { "" } -- Accepts only the empty string.

Empty Set ∅

(start) q0        q1 (accept)

No transition! L = { } -- Accepts nothing.

Single Symbol a

(start) q0 --a--> q1 (accept)

L = { "a" } -- Accepts only the string "a".

Properties of Every Fragment

  • Exactly one start state
  • Exactly one accept state
  • Start state has no incoming edges
  • Accept state has no outgoing edges

These invariants let us compose fragments cleanly!

10 / 21

Thompson's Construction: Union (R1 | R2)

"Accept if either R1 or R2 matches."

new start --ε--> s1 [ NFA(R1) ] f1 --ε--> new accept
new start --ε--> s2 [ NFA(R2) ] f2 --ε--> new accept

Why This Works

From the new start, the NFA nondeterministically guesses which branch (R1 or R2) will match. If either branch reaches its old accept state, the ε-transition carries us to the new accept state.

11 / 21

Thompson's Construction: Concatenation (R1 R2)

"Accept if R1 matches a prefix and R2 matches the rest."

s1 [ NFA(R1) ] f1 --ε--> s2 [ NFA(R2) ] f2

start = s1, accept = f2

Analogy: Train Cars

Concatenation is like coupling train cars. The output of the first car (R1) connects directly to the input of the second car (R2). A string must ride through both to be accepted.

12 / 21

Thompson's Construction: Kleene Star (R*)

"Accept zero or more repetitions of R."

new start --ε--> s1 [ NFA(R) ] f1 --ε--> new accept
f1 --ε--> s1                   (loop: another copy)
new start --ε--> new accept    (bypass: zero copies)

Critical: The ε-bypass

The bottom ε-transition from start directly to accept is what makes R* accept the empty string. Without it, we would need at least one copy of R (that would be R+, not R*).
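The base cases and the three combination rules fit in a short program. The sketch below is one possible rendering, not a reference implementation: it assumes the RE is already given as nested tuples (`"sym"`, `"eps"`, `"empty"`, `"union"`, `"concat"`, `"star"`, an encoding chosen here for illustration) and simulates the resulting ε-NFA by tracking ε-closures.

```python
from itertools import count

def thompson(tree):
    """Build an ε-NFA from a nested-tuple RE; returns (start, accept, edges).

    edges maps state -> list of (label, target); label None means ε.
    """
    ids = count()
    edges = {}

    def new_state():
        state = next(ids)
        edges[state] = []
        return state

    def build(node):
        kind = node[0]
        if kind in ("sym", "eps", "empty"):
            start, accept = new_state(), new_state()
            if kind == "sym":
                edges[start].append((node[1], accept))
            elif kind == "eps":
                edges[start].append((None, accept))
            return start, accept              # "empty": no transition at all
        if kind == "concat":
            s1, f1 = build(node[1])
            s2, f2 = build(node[2])
            edges[f1].append((None, s2))      # couple f1 to s2
            return s1, f2
        if kind not in ("union", "star"):
            raise ValueError(f"unknown node {kind!r}")
        start, accept = new_state(), new_state()
        if kind == "union":
            for sub in node[1:]:
                s, f = build(sub)
                edges[start].append((None, s))
                edges[f].append((None, accept))
        else:
            s, f = build(node[1])
            edges[start] += [(None, s), (None, accept)]   # enter, or bypass
            edges[f] += [(None, s), (None, accept)]       # loop, or leave
        return start, accept

    start, accept = build(tree)
    return start, accept, edges

def eps_closure(states, edges):
    stack, closure = list(states), set(states)
    while stack:
        for label, target in edges[stack.pop()]:
            if label is None and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def accepts(nfa, word):
    start, accept, edges = nfa
    current = eps_closure({start}, edges)
    for symbol in word:
        moved = {t for q in current for label, t in edges[q] if label == symbol}
        current = eps_closure(moved, edges)
    return accept in current

# (0|1)*01 written as a tree
RE_TREE = ("concat",
           ("concat", ("star", ("union", ("sym", "0"), ("sym", "1"))), ("sym", "0")),
           ("sym", "1"))
NFA = thompson(RE_TREE)
print(len(NFA[2]), "states")
print(accepts(NFA, "1101"), accepts(NFA, "10"))
```

Every rule adds at most two new states, so the NFA grows linearly with the size of the RE.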

13 / 21

Thompson's Construction: Step-Through for (0|1)*01

Watch the NFA build step by step using Thompson's rules:

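Fragments, in build order (state names as in the diagram):

  Symbol 0:         A --0--> B
  Symbol 1:         C --1--> D
  Union 0|1:        q_u --ε--> A,  q_u --ε--> C,  B --ε--> q_ua,  D --ε--> q_ua
  Star (0|1)*:      q_s --ε--> q_u,  q_ua --ε--> q_sa,  q_ua --ε--> q_u (loop),  q_s --ε--> q_sa (bypass)
  Symbol 0:         E --0--> F
  Symbol 1:         G --1--> H
  Concatenation:    q_sa --ε--> E,  F --ε--> G

START = q_s, ACCEPT = H

14 / 21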

DFA to RE: State Elimination Method

To convert a DFA (or NFA) back to a regular expression, we use state elimination.

The Idea:

  1. Start with the DFA.
  2. Add a NEW unique start state: (s) --ε--> old start.
     Add a NEW unique accept state: old accepts --ε--> (f).
  3. Remove states ONE BY ONE (never s or f).
     When removing state q, replace transitions through q with RE-labeled transitions.
  4. When only s and f remain, the label on the single edge s --> f is the answer RE!

Before removing q_rip:

  qi --R1--> q_rip --R3--> qj        (q_rip has a self-loop labeled R2)

After removing q_rip:

  qi --R1 R2* R3--> qj               (plus any direct qi-->qj edge)

Core Rule for Removing State q

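For every pair (qi, qj) with an edge qi-->q and an edge q-->qj, the new label on qi-->qj becomes: (old label qi-->qj) | (label qi-->q)(label q-->q)*(label q-->qj). If q has no self-loop, the starred part is simply ε; if there was no direct edge qi-->qj, the union has only the second term.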
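The core rule translates almost literally into code. The sketch below is an illustration under stated assumptions: the automaton is a dict from (source, target) pairs to regex strings, ε is written as the empty string, and the state names `s`, `f`, `q0`, `q1` describe a hypothetical DFA for "binary strings ending in 1". The helper wraps every part in parentheses, so the result is correct but more heavily parenthesized than a hand-simplified answer.

```python
import re
from itertools import product

def eliminate(edges, state):
    """Remove `state` from a GNFA given as {(src, dst): regex_string}.

    The empty string "" stands for an ε-label; a missing key means "no edge".
    """
    loop = edges.get((state, state))
    loop_part = f"({loop})*" if loop is not None else ""
    incoming = [(p, r) for (p, q), r in edges.items() if q == state and p != state]
    outgoing = [(q, r) for (p, q), r in edges.items() if p == state and q != state]
    result = {k: v for k, v in edges.items() if state not in k}
    for (src, r_in), (dst, r_out) in product(incoming, outgoing):
        detour = f"({r_in}){loop_part}({r_out})"       # R1 R2* R3
        old = result.get((src, dst))
        result[(src, dst)] = detour if old is None else f"({old})|{detour}"
    return result

def same_up_to(p1, p2, alphabet="01", max_len=8):
    """Compare two patterns on all short strings (evidence, not a proof)."""
    c1, c2 = re.compile(p1), re.compile(p2)
    return all(
        bool(c1.fullmatch(w)) == bool(c2.fullmatch(w))
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
    )

# DFA for "ends in 1", plus the new start s and new accept f.
EDGES = {
    ("s", "q0"): "",
    ("q0", "q0"): "0", ("q0", "q1"): "1",
    ("q1", "q1"): "1", ("q1", "q0"): "0",
    ("q1", "f"): "",
}
after_q0 = eliminate(EDGES, "q0")
after_q1 = eliminate(after_q0, "q1")
FINAL_RE = after_q1[("s", "f")]
print(FINAL_RE)
print(same_up_to(FINAL_RE, "(0|1)*1"))
```

Eliminating the states in a different order gives a different-looking expression for the same language.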

15 / 21

State Elimination: Interactive Step-Through

Convert a DFA (binary strings ending in 1) to a regular expression:

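Starting DFA (start q0, accept q1):  q0 --0--> q0,  q0 --1--> q1,  q1 --1--> q1,  q1 --0--> q0

  1. Add s --ε--> q0 and q1 --ε--> f.
  2. Eliminate q0:  s --0*1--> q1, and the self-loop on q1 becomes 1|00*1.
  3. Eliminate q1:  s --0*1(1|00*1)*--> f.

Final RE: 0*1(1|00*1)* = (0|1)*1

16 / 21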


17 / 21

Algebraic Laws of Regular Expressions

REs obey many useful identities. These help simplify expressions.

Union Laws

Law            Rule
Commutative    R | S = S | R
Associative    (R|S)|T = R|(S|T)
Idempotent     R | R = R
Identity       R | ∅ = R

Concatenation Laws

Law            Rule
Associative    (RS)T = R(ST)
Identity       Rε = εR = R
Annihilator    R∅ = ∅R = ∅

Distributivity

Law            Rule
Left dist.     R(S|T) = RS | RT
Right dist.    (S|T)R = SR | TR

Star Laws

Law            Rule
Star idem.     (R*)* = R*
Star of ε      ε* = ε
Star of ∅      ∅* = ε

Not Commutative!

Concatenation is NOT commutative. ab ≠ ba. Order matters!
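Identities like these can be spot-checked by comparing both sides on every short string. The sketch below assumes Python's `re` module and three arbitrary sample sub-expressions `R`, `S`, `T` chosen for illustration; ε is written as an empty group `()`. Passing the check is evidence for a law, not a proof of it.

```python
import re
from itertools import product

def same_language(p1, p2, alphabet="ab", max_len=6):
    """Compare two patterns on every string up to max_len."""
    c1, c2 = re.compile(p1), re.compile(p2)
    return all(
        bool(c1.fullmatch(w)) == bool(c2.fullmatch(w))
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
    )

R, S, T = "(ab|b)", "(a)", "(ba*)"    # arbitrary sample sub-expressions
LAWS = {
    "union commutes":       (f"{R}|{S}", f"{S}|{R}"),
    "union idempotent":     (f"{R}|{R}", R),
    "concat identity":      (f"{R}()", R),
    "left distributivity":  (f"{R}({S}|{T})", f"{R}{S}|{R}{T}"),
    "right distributivity": (f"({S}|{T}){R}", f"{S}{R}|{T}{R}"),
    "star idempotent":      (f"({R}*)*", f"{R}*"),
}
LAW_RESULTS = {name: same_language(lhs, rhs) for name, (lhs, rhs) in LAWS.items()}
for name, holds in LAW_RESULTS.items():
    print(f"{name:22} {holds}")

# Concatenation does not commute: ab and ba are different languages.
print("ab = ba?", same_language("ab", "ba"))
```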

18 / 21

Theory vs Practice: Regex in Programming

The "regex" in Python/Java/etc. is more powerful than theoretical REs!

Theoretical RE (CS305)

  • Only: union, concat, star
  • Describes exactly the regular languages
  • Equivalent to DFA/NFA
  • Can always be matched in O(n) time (run the equivalent DFA)
  • Cannot count or match patterns like a^n b^n

Practical Regex (Programming)

  • Adds: +, ?, {n,m}, [a-z], \d, \w
  • . for any character
  • Backreferences: \1, \2 (NOT regular!)
  • Lookahead/lookbehind
  • Can cause exponential blowup in backtracking engines!

Feature              Theoretical                        Practical    Still Regular?
R+ (one or more)     Write as RR*                       Built-in     Yes
R? (optional)        Write as R|ε                       Built-in     Yes
[a-z]                Write as a|b|...|z                 Built-in     Yes
R{3,5}               Write as RRR|RRRR|RRRRR            Built-in     Yes
\1 backreference     Cannot express                     Built-in     NO!
Lookahead (?=...)    No direct operator (rewrite the    Built-in     Yes (regular languages are
                     whole pattern instead)                          closed under intersection)
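The "write as" column can be tested, and the backreference row can be seen in action. A small sketch using Python's `re` module; the helper `same_on` and the sample strings are illustrative choices.

```python
import re

def same_on(p1, p2, words):
    """True if both patterns accept exactly the same strings from `words`."""
    return all(
        bool(re.fullmatch(p1, w)) == bool(re.fullmatch(p2, w)) for w in words
    )

RUNS = ["a" * n for n in range(9)]          # "", "a", ..., "aaaaaaaa"
REWRITES = [
    ("a+", "aa*"),                          # R+      as RR*
    ("a?", "a|"),                           # R?      as R|ε (empty alternative)
    ("a{3,5}", "aaa|aaaa|aaaaa"),           # R{3,5}  as RRR|RRRR|RRRRR
]
for sugar, core in REWRITES:
    print(f"{sugar:8} == {core:16} {same_on(sugar, core, RUNS)}")

# A backreference has no such rewrite: (a*)b\1 matches a^n b a^n,
# which forces the two runs of a's to have equal length.
BACKREF = re.compile(r"(a*)b\1")
print(bool(BACKREF.fullmatch("aaabaaa")))   # True: 3 a's on each side
print(bool(BACKREF.fullmatch("aaabaa")))    # False: 3 vs 2
```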

ReDoS: When Regex Goes Wrong

Backreferences can make regex matching NP-hard. Separately, in backtracking engines, poorly written patterns like (a+)+ can cause catastrophic backtracking on inputs that almost match -- a real security vulnerability called ReDoS.
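The blowup for (a+)+ comes from ambiguity: a run of n a's can be cut into non-empty blocks in many ways, and a naive backtracking matcher may try all of them before reporting failure on an input such as a...ab. The sketch below only counts those ways; it is a simplified model, and real engines apply optimizations, so actual step counts differ. The function name is invented for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Ways to cut a run of n a's (n >= 1) into one or more non-empty blocks."""
    if n == 0:
        return 1    # nothing left to split: exactly one way to finish
    # Choose the length of the first block, then split the rest.
    return sum(count_splits(n - first) for first in range(1, n + 1))

for n in (5, 10, 20, 30):
    print(n, count_splits(n))
```

The count doubles with every extra character (it equals 2^(n-1)), which is why a few dozen a's are enough to stall a vulnerable matcher.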

19 / 21

Common RE Patterns

Useful regular expression patterns for common language descriptions.

Over Σ = {0, 1}

Language                    RE
All strings                 (0|1)*
Strings starting with 1     1(0|1)*
Strings ending in 00        (0|1)*00
Strings containing 101      (0|1)*101(0|1)*
Even-length strings         ((0|1)(0|1))*
Strings of only 0s          0*
Non-empty strings           (0|1)(0|1)*
Exactly three characters    (0|1)(0|1)(0|1)

Over Σ = {a, b}

Language                     RE
Starts and ends with a       a(a|b)*a | a
Contains at least two b's    (a|b)*b(a|b)*b(a|b)*
No two consecutive a's       (b|ab)*(a|ε)
Every a followed by b        (b|ab)*
Alternating a's and b's      (ab)*(a|ε) | (ba)*(b|ε)

Building Block Patterns

Σ* = anything,
Σ* w Σ* = contains w,
w Σ* = starts with w,
Σ* w = ends with w
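These building blocks can be written as small pattern generators. A sketch for Σ = {0, 1}, assuming `w` is a plain string of symbols with no operators in it; the function names are illustrative.

```python
import re

SIGMA = "(0|1)"             # one symbol of Σ = {0, 1}
SIGMA_STAR = f"{SIGMA}*"    # Σ*: anything

def contains(w):
    return f"{SIGMA_STAR}{w}{SIGMA_STAR}"

def starts_with(w):
    return f"{w}{SIGMA_STAR}"

def ends_with(w):
    return f"{SIGMA_STAR}{w}"

print(contains("101"))
print(starts_with("1"))
print(ends_with("00"))
print(bool(re.fullmatch(ends_with("00"), "1100")))
```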

20 / 21

Summary & Cheat Sheet

Core Concepts

RE = DFA = NFA (same power!)

Three operations:

  R|S    Union: R or S
  RS     Concat: R then S
  R*     Star: 0 or more R

Precedence: * > concat > |

Base cases:

  ∅ --> empty language { }
  ε --> empty string { "" }
  a --> single symbol { "a" }

Conversions

RE  --Thompson's Construction--> ε-NFA
DFA --State Elimination--------> RE

Key Identities

R | S = S | R         (union commutes)
R | R = R             (idempotent)
Rε = R                (concat identity)
R∅ = ∅                (annihilator)
(R*)* = R*            (star is idempotent)
∅* = ε                (star of empty)
R | ∅ = R             (union identity)
R(S|T) = RS | RT      (distribute)

Common Mistakes

  • ∅ ≠ ε  (empty lang vs empty string)
  • ab* ≠ (ab)*  (star binds tightest)
  • R* always includes ε
  • Concat does NOT commute
  • Practical regex ≠ theoretical RE

Remember

Regular expressions are the algebra of regular languages. Master the three operations, know the precedence, and you can describe any regular language concisely.

21 / 21