Regular Expressions

CS305 — Formal Language Theory

The Big Picture: Three Equivalent Views

Regular Expressions, DFAs, and NFAs all describe the exact same class of languages.

Key Idea

Any language you can describe with an RE, you can build a DFA for — and vice versa. The proofs go around the triangle:

  • RE → ε-NFA: Thompson's Construction
  • NFA → DFA: Subset Construction
  • DFA → RE: State Elimination

Analogy

It's like having three different maps of the same city — street map, satellite view, and transit map. Different formats, same territory.

What Is a Regular Expression?

A regular expression is a compact, algebraic notation for describing a set of strings (a language).

  • Every RE defines a language L(R)
  • Built from three simple operations
  • No memory, no counting — just patterns
  • Equivalent in power to finite automata

Analogy

Regex : languages :: arithmetic : numbers
Just as 3 + 5 × 2 compactly describes 13, (0|1)*01 compactly describes "all binary strings ending in 01."

Quick Examples

RE          Language L(R)
0           { "0" }
0|1         { "0", "1" }
01          { "01" }
0*          { ε, "0", "00", ... }
(0|1)*      All binary strings
(0|1)*01    Ends in 01
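
To try the last row yourself, here is a minimal sketch that tests strings against (0|1)*01 using Python's re module. The helper name matches is an assumption for illustration; the pattern carries over unchanged because it uses only union, concatenation, and star.

```python
import re

# (0|1)*01 uses only union, concatenation, and star, so it is also a
# valid Python pattern with the same meaning.
ENDS_IN_01 = re.compile(r"(0|1)*01")

def matches(text: str) -> bool:
    # fullmatch requires the WHOLE string to be in L(R), not just a substring
    return ENDS_IN_01.fullmatch(text) is not None

for s in ["01", "1101", "10", ""]:
    print(repr(s), matches(s))
# '01' True
# '1101' True
# '10' False
# '' False
```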

The Three Basic Operations

Every RE is built from just three operations.

That's It!

Union, Concatenation, and Kleene Star — these three are all you need to build every regular expression. Like LEGO bricks: simple pieces, infinite combinations.
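
One way to see the three operations concretely is to apply them to languages stored as Python sets of strings. This is only a sketch: the function names are illustrative, and star is cut off at a maximum length (an assumption made here, since R* is usually infinite).

```python
def union(A, B):
    # R|S: a string is in the result if it is in either language
    return A | B

def concat(A, B):
    # RS: every string of A followed by every string of B
    return {x + y for x in A for y in B}

def star(A, max_len):
    """Strings of A* with length <= max_len."""
    result = {""}            # zero repetitions always gives the empty string
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, A) if len(w) <= max_len}
        frontier = longer - result   # keep only strings not seen before
        result |= frontier
    return result

print(sorted(union({"0"}, {"1"})))           # ['0', '1']
print(sorted(concat({"0"}, {"1"})))          # ['01']
print(sorted(star({"0"}, 3)))                # ['', '0', '00', '000']
```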

Formal Definition (Recursive)

A regular expression over alphabet Σ is defined inductively:

  • Base cases: ∅ (the empty language), ε (the empty string), and a for each symbol a in Σ
  • Inductive cases: if R and S are regular expressions, so are R|S (union), RS (concatenation), and R* (star)

Watch Out: ∅ vs ε

∅ matches nothing at all (the empty language: 0 elements). ε matches the empty string "" (1 element). They are NOT the same!
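
The difference shows up immediately if languages are modeled as Python sets. In this small sketch (variable and function names are assumptions for illustration), L(∅) is the empty set and L(ε) is a set holding one string of length zero.

```python
EMPTY_LANGUAGE = set()       # L(∅): no strings at all
EMPTY_STRING_ONLY = {""}     # L(ε): exactly one string, the empty one

def concat(A, B):
    # language concatenation: every x in A followed by every y in B
    return {x + y for x in A for y in B}

L = {"0", "01"}
print(len(EMPTY_LANGUAGE), len(EMPTY_STRING_ONLY))   # 0 1
print(concat(L, EMPTY_STRING_ONLY) == L)             # True: ε is the identity
print(concat(L, EMPTY_LANGUAGE) == set())            # True: ∅ wipes everything out
```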

Operator Precedence

Just like arithmetic has PEMDAS, regular expressions have a precedence order:

Priority   Regex       Like Arithmetic
Highest    * Star      Exponent (^)
Middle     · Concat    Multiply (×)
Lowest     | Union     Add (+)

Arithmetic Parallel

Just as 2+3×4 = 14 (not 20), a|bc = {a, bc} (not {ac, bc}).
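
Python's re module follows the same precedence for these three operators, so the parallel can be checked directly. A small sketch; the helper lang and the candidate word list are assumptions chosen for illustration.

```python
import re

def lang(pattern, candidates):
    # keep the candidates that the whole pattern matches
    return sorted(w for w in candidates if re.fullmatch(pattern, w))

words = ["a", "ac", "bc", "ab", "abb", "abab"]

print(lang(r"a|bc", words))     # ['a', 'bc']        union binds loosest
print(lang(r"(a|b)c", words))   # ['ac', 'bc']       parentheses override
print(lang(r"ab*", words))      # ['a', 'ab', 'abb'] star applies to b only
print(lang(r"(ab)*", words))    # ['ab', 'abab']     star applies to ab
```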

Interactive RE Parse Tree Explorer

Reading Strategy

1. Find the outermost (lowest precedence) operation
2. Break into sub-expressions
3. Describe each part in English
4. Combine
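
The reading strategy amounts to finding the root of the parse tree and then recursing. Below is a minimal recursive-descent sketch that builds such a tree as nested tuples. The function name parse and the node labels are assumptions; every character other than |, *, ( and ) is treated as a plain symbol.

```python
def parse(regex):
    """Parse a theoretical RE into a nested-tuple parse tree."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def union():                     # lowest precedence: R|S
        nonlocal pos
        node = concat()
        while peek() == "|":
            pos += 1
            node = ("union", node, concat())
        return node

    def concat():                    # middle precedence: RS
        node = starred()
        while peek() is not None and peek() not in "|)":
            node = ("concat", node, starred())
        return node

    def starred():                   # highest precedence: R*
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():                      # a single symbol or a parenthesized RE
        nonlocal pos
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at position {pos}")
        pos += 1
        if ch == "(":
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        return ("sym", ch)

    tree = union()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return tree

print(parse("a|bc"))
# ('union', ('sym', 'a'), ('concat', ('sym', 'b'), ('sym', 'c')))
```

The root of the returned tuple is the outermost operation, which is step 1 of the strategy.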

Writing Regular Expressions

Given a language description, write the RE:

"Binary strings of length exactly 3"

"Binary strings containing 010"

"Strings over {a,b} with even length"

"Over {0,1} with no consecutive 1s"

Writing Strategy

Think in building blocks: (1) What must appear? → concat. (2) What can repeat? → star. (3) What are the choices? → union. Combine these three to express any regular language.
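
A candidate RE can be sanity-checked by brute force: enumerate all short strings and compare the RE against a plain-English predicate. The sketch below does this for the four exercises. The candidate REs are assumptions offered as possible answers (they are not taken from the slides), ε is written as an empty alternative, and agreement up to a length bound is evidence rather than a proof.

```python
import re
from itertools import product

def words(alphabet, max_len):
    # every string over the alphabet with length 0..max_len
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def agrees(pattern, predicate, alphabet, max_len=8):
    # True if the RE and the predicate accept exactly the same short strings
    return all((re.fullmatch(pattern, w) is not None) == predicate(w)
               for w in words(alphabet, max_len))

# (candidate RE, membership predicate, alphabet); candidates are assumptions
checks = [
    (r"(0|1)(0|1)(0|1)", lambda w: len(w) == 3,      "01"),
    (r"(0|1)*010(0|1)*", lambda w: "010" in w,       "01"),
    (r"((a|b)(a|b))*",   lambda w: len(w) % 2 == 0,  "ab"),
    (r"(0|10)*(1|)",     lambda w: "11" not in w,    "01"),  # (1|) means 1|ε
]

for pattern, predicate, alphabet in checks:
    print(pattern, agrees(pattern, predicate, alphabet))   # each prints True
```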

RE → ε-NFA: Thompson's Construction

We can systematically convert any RE into an equivalent ε-NFA using Thompson's Construction (1968).

The Idea: Build Like LEGO

  1. Each sub-expression gets a small NFA "fragment"
  2. Every fragment has exactly one start and one accept state
  3. Combine fragments using rules for |, ·, and *
  4. The structure mirrors the parse tree!

Analogy

Each base case (symbol, ε) is a basic LEGO brick. Union, concat, and star are ways to snap bricks together. Build bottom-up until you have one NFA for the whole RE.
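
The brick-snapping idea can be sketched in a few lines. In the version below, the parse tree is given directly as nested tuples, every fragment is a (start, accept) pair, and all combinations are wired with ε-edges. These choices, the names build and accepts, and the tuple format are assumptions for illustration. In particular, concatenation here links the two fragments with an ε-edge instead of merging states, so state counts can differ from other presentations of the construction.

```python
from itertools import count

EPS = None  # label used for ε-transitions

def build(node, edges, fresh):
    """Add the fragment for `node` to `edges`; return its (start, accept)."""
    kind = node[0]
    if kind == "sym":                       # base case: start --a--> accept
        s, f = next(fresh), next(fresh)
        edges.append((s, node[1], f))
        return s, f
    if kind == "cat":                       # first fragment, then second
        s1, f1 = build(node[1], edges, fresh)
        s2, f2 = build(node[2], edges, fresh)
        edges.append((f1, EPS, s2))
        return s1, f2
    if kind == "alt":                       # new start/accept around both
        s1, f1 = build(node[1], edges, fresh)
        s2, f2 = build(node[2], edges, fresh)
        s, f = next(fresh), next(fresh)
        edges += [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]
        return s, f
    if kind == "star":                      # skip edge plus loop-back edge
        s1, f1 = build(node[1], edges, fresh)
        s, f = next(fresh), next(fresh)
        edges += [(s, EPS, s1), (s, EPS, f), (f1, EPS, s1), (f1, EPS, f)]
        return s, f
    raise ValueError(kind)

def closure(states, edges):
    # all states reachable through ε-edges alone
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p, a, r in edges:
            if p == q and a is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(tree, text):
    edges, fresh = [], count()
    start, accept = build(tree, edges, fresh)
    current = closure({start}, edges)
    for ch in text:
        moved = {r for p, a, r in edges if p in current and a == ch}
        current = closure(moved, edges)
    return accept in current

# Parse tree of (0|1)*01
TREE = ("cat",
        ("cat", ("star", ("alt", ("sym", "0"), ("sym", "1"))), ("sym", "0")),
        ("sym", "1"))

print(accepts(TREE, "1101"), accepts(TREE, "10"))   # True False
```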

Thompson's Construction Rules

Thompson's Construction: (0|1)*01

Challenge: Predict the Thompson NFA

Given the RE a(b|c), how many states will Thompson's Construction produce?

Think about it step by step:

  1. Base case for a: 2 states
  2. Base case for b: 2 states
  3. Base case for c: 2 states
  4. Union b|c: adds 2 new states
  5. Concat a·(b|c): merges, no new states

DFA → RE: State Elimination

To convert a DFA back to a regular expression, we eliminate states one by one.

The Algorithm

  1. Add new unique start s → old start (ε)
  2. Add new unique accept f ← old accepts (ε)
  3. Remove states one by one (not s or f)
  4. When only s and f remain, the edge label is the RE!

Core Rule for Removing State q

For every pair (qi, qj) through q:
New label = (old qi→qj) | (qi→q)(q→q)*(q→qj)
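
The core rule translates almost line for line into code. The sketch below stores edge labels as Python regex strings ("" stands for ε, a missing edge for ∅) and over-parenthesizes freely, so the result is correct but not simplified. All function names and the dictionary layout are assumptions for illustration. The example automaton is the "binary strings ending in 1" machine with s and f already added.

```python
import re
from itertools import product

def alt(r, s):                     # R|S, where None plays the role of ∅
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"{r}|{s}"

def paren(r):                      # protect unions before concatenating
    return f"({r})" if "|" in r else r

def cat(r, s):                     # RS, where ∅ annihilates
    if r is None or s is None:
        return None
    return paren(r) + paren(s)

def star(r):                       # ∅* = ε* = ε
    return "" if r is None or r == "" else f"({r})*"

def eliminate(edges, order):
    """edges: {(p, q): label}. Remove the states in `order` (never s or f)."""
    edges = dict(edges)
    for q in order:
        loop = star(edges.get((q, q)))
        ins = [(p, r) for (p, t), r in edges.items() if t == q and p != q]
        outs = [(t, r) for (p, t), r in edges.items() if p == q and t != q]
        edges = {k: v for k, v in edges.items() if q not in k}
        for p, r_in in ins:
            for t, r_out in outs:
                # new label = (old p->t) | (p->q)(q->q)*(q->t)
                bypass = cat(cat(r_in, loop), r_out)
                edges[(p, t)] = alt(edges.get((p, t)), bypass)
    return edges.get(("s", "f"))

ENDS_IN_1 = {("s", "q0"): "", ("q0", "q0"): "0", ("q0", "q1"): "1",
             ("q1", "q1"): "1", ("q1", "q0"): "0", ("q1", "f"): ""}

result = eliminate(ENDS_IN_1, ["q0", "q1"])
print(result)

# Compare against the intended language on all strings up to length 8
ok = all((re.fullmatch(result, "".join(w)) is not None) == "".join(w).endswith("1")
         for n in range(9) for w in product("01", repeat=n))
print(ok)   # True
```

The printed expression is heavily parenthesized; simplifying it by hand is a separate step.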

State Elimination: Step-Through

Convert a DFA (binary strings ending in 1) to RE:

Challenge: Fix the Bug

A student tried to convert this automaton to an RE by state elimination but got the wrong result. Find the mistake:

Automaton (after adding s and f): s --ε→ q0, q0 --0→ q0, q0 --1→ q1, q1 --1→ q1, q1 --0→ q0, q1 --ε→ f

Student eliminated q0 first and wrote:

s --0*1→ q1

q1 self-loop: 1 ← BUG?

q1 --ε→ f

What's wrong with the q1 self-loop label?

Algebraic Laws of Regular Expressions

REs obey useful identities:

Category   Law           Rule
Union      Commutative   R|S = S|R
           Associative   (R|S)|T = R|(S|T)
           Idempotent    R|R = R
           Identity      R|∅ = R
Concat     Associative   (RS)T = R(ST)
           Identity      Rε = εR = R
           Annihilator   R∅ = ∅R = ∅
Distrib.   Left          R(S|T) = RS|RT
           Right         (S|T)R = SR|TR
Star       Idempotent    (R*)* = R*
           Star of ε     ε* = ε
           Star of ∅     ∅* = ε

Not Commutative!

Concatenation is NOT commutative. ab ≠ ba. Order matters!

Why These Matter

These laws let you simplify complex REs. For example: 0*1(1|00*1)* = (0|1)*1 can be verified algebraically.
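
Before attempting the algebra, it can help to check a claimed identity numerically. The sketch below compares both sides on every binary string up to a length bound; the helper name bounded_language and the bound of 10 are assumptions. Agreement on short strings supports the identity but does not replace the algebraic proof.

```python
import re
from itertools import product

def bounded_language(pattern, alphabet="01", max_len=10):
    # all strings up to max_len that the pattern matches in full
    compiled = re.compile(pattern)
    found = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if compiled.fullmatch(w):
                found.add(w)
    return found

lhs = bounded_language(r"0*1(1|00*1)*")
rhs = bounded_language(r"(0|1)*1")
print(lhs == rhs)                      # True

# Concatenation does not commute: ab and ba are different languages
print(bounded_language(r"ab", "ab", 2) == bounded_language(r"ba", "ab", 2))   # False
```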

Theory vs Practice: Regex in Programming

The "regex" in Python/Java is more powerful than theoretical REs!

Theoretical RE (CS305)

  • Only: union, concat, star
  • Describes exactly the regular languages
  • Equivalent to DFA/NFA
  • Matching runs in O(n) time once compiled to a DFA

Practical Regex (Programming)

  • Adds: +, ?, {n,m}, [a-z], \d
  • Backreferences: \1 (NOT regular!)
  • Lookahead: (?=...) (still regular, but not one of the three operations)
  • Backtracking engines can take exponential time!

Feature      Theoretical                             Still Regular?
R+           Write as RR*                            Yes
R?           Write as R|ε                            Yes
[a-z]        Write as a|b|...|z                      Yes
R{3,5}       Expand manually                         Yes
\1 backref   Cannot express                          NO!
(?=...)      No direct form (needs intersection)     Yes
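
The "Yes" rows can be spot-checked by comparing each shorthand with its expansion on all short strings. A sketch under assumptions: the helper name same_language, the sample patterns, and the length bound are chosen for illustration, and ε is written as an empty alternative.

```python
import re
from itertools import product

def same_language(p, q, alphabet="ab", max_len=7):
    # True if p and q agree on every string up to max_len
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if (re.fullmatch(p, w) is None) != (re.fullmatch(q, w) is None):
                return False
    return True

pairs = [
    (r"a+",     r"aa*"),       # R+     = RR*
    (r"a?b",    r"(a|)b"),     # R?     = R|ε
    (r"[ab]b",  r"(a|b)b"),    # [ab]   = a|b
    (r"a{2,3}", r"aa(a|)"),    # R{2,3} = RR(R|ε)
]

for shorthand, expanded in pairs:
    print(shorthand, expanded, same_language(shorthand, expanded))   # all True
```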

ReDoS

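Matching with backreferences is NP-hard in general. Separately, backtracking engines can take exponential time on patterns with nested quantifiers such as (a+)+ when the overall match fails. This catastrophic backtracking is a real security vulnerability.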

Common RE Patterns

Useful building blocks for writing regular expressions:

Over Σ = {0, 1}

Language          RE
All strings       (0|1)*
Starts with 1     1(0|1)*
Ends in 00        (0|1)*00
Contains 101      (0|1)*101(0|1)*
Even length       ((0|1)(0|1))*
Only 0s           0*
Exactly 3 chars   (0|1)(0|1)(0|1)

Over Σ = {a, b}

Language                RE
Starts & ends with a    a(a|b)*a | a
≥2 b's                  (a|b)*b(a|b)*b(a|b)*
No consecutive a's      (b|ab)*(a|ε)
Every a followed by b   (b|ab)*

Building Block Patterns

Σ* = anything
Σ* w Σ* = contains w
w Σ* = starts with w
Σ* w = ends with w
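
These four templates are mechanical enough to generate. The sketch below builds the RE strings from an alphabet and a word w; the function names are hypothetical, and it assumes every symbol is a single character with no special meaning in Python's re (true for 0, 1, a, b).

```python
import re

def sigma_star(alphabet):
    # Σ* written as a union of all symbols under a star, e.g. (0|1)*
    return "(" + "|".join(alphabet) + ")*"

def contains(w, alphabet):
    return sigma_star(alphabet) + w + sigma_star(alphabet)

def starts_with(w, alphabet):
    return w + sigma_star(alphabet)

def ends_with(w, alphabet):
    return sigma_star(alphabet) + w

print(contains("101", "01"))      # (0|1)*101(0|1)*
print(starts_with("1", "01"))     # 1(0|1)*
print(ends_with("00", "01"))      # (0|1)*00

print(re.fullmatch(contains("101", "01"), "0010100") is not None)   # True
```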

Interactive RE Tester

Enter a simple RE and test strings against it:

How This Works

This tester converts the theoretical RE into a JavaScript regex. It supports: 0, 1, | (union), concatenation, * (star), ε (empty string), and parentheses.

Reminder

This is a theoretical RE tester — only the three basic operations (union, concat, star) plus base cases. No +, ?, [...], or backreferences.
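
The slide's tester targets JavaScript; the same idea can be sketched in Python. The version below is an assumption about how such a translation might look, not the tester's actual code: it rejects anything outside the supported symbols and rewrites ε as an empty non-capturing group. The names to_python_regex and re_matches are hypothetical.

```python
import re

ALLOWED = set("01|*()ε")

def to_python_regex(theoretical):
    """Translate a theoretical RE over {0,1} into a Python re pattern."""
    source = theoretical.replace(" ", "")
    bad = set(source) - ALLOWED
    if bad:
        raise ValueError(f"unsupported symbols: {sorted(bad)}")
    # ε matches only the empty string; an empty group (?:) does the same
    return source.replace("ε", "(?:)")

def re_matches(theoretical, text):
    # the whole string must be in the language, hence fullmatch
    return re.fullmatch(to_python_regex(theoretical), text) is not None

print(re_matches("(0|1)*01", "1101"))   # True
print(re_matches("0|ε", ""))            # True
print(re_matches("0|ε", "00"))          # False
```

Rejecting unsupported symbols up front keeps practical-regex features such as + and ? from slipping through.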

Challenge: Match Language to RE

For each language description, select the correct regular expression:

1. Binary strings with at least one 0

2. Strings over {a,b} of odd length

3. Binary strings NOT ending in 11

Summary & Cheat Sheet

Precedence: * > concat > |

Base: ∅ (empty lang), ε (empty string), a (symbol)

Operations: R|S (union), RS (concat), R* (star)

Key Identities

R|S = S|R      union commutes
R|R = R        idempotent
Rε = R         concat identity
R∅ = ∅         annihilator
(R*)* = R*     star idempotent
∅* = ε         star of empty

Common Mistakes

  • ∅ ≠ ε (empty lang vs empty string)
  • ab* ≠ (ab)* (star binds tightest)
  • R* always includes ε
  • Concat does NOT commute
  • Practical regex ≠ theoretical RE

Quiz: Multiple Choice

Q1: What is L(∅*)?

Q2: Which is NOT a valid RE simplification?

Q3: Thompson's Construction for a symbol 'a' produces how many states?

Quiz: Trace Exercise

Given the RE (0|1)*00, determine which strings are accepted:

For each string, select Accept or Reject:

The RE: (0|1)*00

This accepts all binary strings ending in 00.

(0|1)* matches any prefix, then 00 requires the string to end with two zeros.

Quiz: Build the RE

Write the regular expression for each language description. Type your answer and check:

1. Binary strings starting with 1 and ending with 0

2. Strings over {a,b} with exactly one b

3. Binary strings with an even number of 0s

Answers