CS305 -- Formal Language Theory
Regular Expressions, DFAs, and NFAs all describe the exact same class of languages: the regular languages.
Any language you can describe with an RE, you can build a DFA for -- and vice versa. They are equally powerful. The proofs go around the triangle: RE --> NFA --> DFA --> RE.
A regular expression is a compact, algebraic notation for describing a set of strings (i.e., a language).
Regex is to languages what arithmetic is to numbers. Just as 3 + 5 x 2 is a compact way to describe the number 13, the expression (0|1)*01 is a compact way to describe "all binary strings ending in 01."
| RE | Language |
|---|---|
| 0 | { "0" } |
| 0|1 | { "0", "1" } |
| 01 | { "01" } |
| 0* | { "", "0", "00", "000", ... } |
| (0|1)* | All binary strings |
| (0|1)*01 | Binary strings ending in 01 |
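The table can be reproduced by brute force. The sketch below assumes Python's `re` module as a stand-in for theoretical REs (its `|`, concatenation, and `*` behave the same way for these patterns); the helper name `language` is made up for illustration. It lists every binary string up to a length bound that a pattern matches in full.

```python
import re
from itertools import product

def language(pattern, alphabet="01", max_len=3):
    """Return every string over `alphabet` of length <= max_len that the pattern matches in full."""
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                matches.append(s)
    return matches

print(language(r"0*"))        # ['', '0', '00', '000']
print(language(r"(0|1)*01"))  # ['01', '001', '101']
```

Because most regular languages are infinite, the length bound only shows a finite window of the language.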
Every regular expression is built from just three operations:
R1 | R2 = "either R1 or R2"
L(R1 | R2) = L(R1) ∪ L(R2)
Example: 0 | 1 = { "0", "1" }
R1 R2 = "R1 followed by R2"
L(R1 . R2) = { xy : x ∈ L(R1), y ∈ L(R2) }
Example: 0 . 1 = { "01" }
R* = "zero or more strings from R, concatenated"
L(R*) = { w1 w2 ... wk : k ≥ 0 and each wi ∈ L(R) } (k = 0 gives ε; the wi need not be equal)
Example: 0* = { "", "0", "00", "000", ... }
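The three operations can also be read as plain set operations on languages. The following is a minimal sketch, assuming languages are finite Python sets of strings; the function names are chosen for illustration, and `star` takes a length bound because L(R*) is usually infinite.

```python
def union(A, B):
    """L(R1 | R2): strings in either language."""
    return A | B

def concat(A, B):
    """L(R1 R2): every string of A followed by every string of B."""
    return {x + y for x in A for y in B}

def star(A, max_len):
    """Strings of L(R*) with length <= max_len."""
    result = {""}          # zero strings concatenated gives the empty string
    frontier = {""}
    while frontier:
        # Append one more string from A to everything found in the last round.
        frontier = {x + y for x in frontier for y in A if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(concat({"0"}, {"1"})))   # ['01']
print(sorted(star({"0"}, 3)))         # ['', '0', '00', '000']
print(sorted(star({"0", "1"}, 2)))    # ['', '0', '00', '01', '1', '10', '11']
print(star(set(), 3))                 # {''}
```

The third call shows that star may mix different strings from L(R), and the last call shows that the star of the empty language still contains the empty string.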
Union, Concatenation, and Kleene Star -- these three operations are all you need to build every regular expression. They are like LEGO bricks: simple pieces, infinite combinations.
A regular expression over alphabet Σ is defined inductively. Base cases: ∅, ε, and a for each symbol a ∈ Σ. Inductive cases: if R1 and R2 are regular expressions, then so are R1 | R2, R1 R2, and R1*.
∅ matches nothing at all (the empty language -- no strings). ε matches one thing: the empty string "". They are NOT the same!
• L(∅) = { } (0 elements) • L(ε) = { "" } (1 element)
Just like arithmetic has PEMDAS, regular expressions have a precedence order:
| Arithmetic | Regex |
|---|---|
| Exponent (^) | Star (*) |
| Multiply (x) | Concat (.) |
| Add (+) | Union (|) |
Just as 2+3x4 = 14 (not 20), a|bc = {a, bc} (not {ac, bc}).
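The precedence rules can be checked directly. This sketch assumes Python's `re` module, which uses the same precedence for `*`, concatenation, and `|`; the helper `accepts` is a made-up name.

```python
import re

def accepts(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# Union binds loosest: a|bc parses as a|(bc), not (a|b)c.
for s in ["a", "bc", "ac"]:
    print(s, accepts(r"a|bc", s), accepts(r"(a|b)c", s))
# a True False
# bc True True
# ac False True

# Star binds tightest: ab* parses as a(b*), not (ab)*.
for s in ["abb", "abab"]:
    print(s, accepts(r"ab*", s), accepts(r"(ab)*", s))
# abb True False
# abab False True
```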
To read a regular expression:
1. Identify the outermost operation (lowest precedence).
2. Break into sub-expressions.
3. Describe each part in English.
4. Combine the descriptions.
Given a language description, write the RE:
"All binary strings of length exactly 3"
"All binary strings containing 010"
"Strings over {a,b} with even length"
"Strings over {0,1} with no consecutive 1s"
Think in building blocks: (1) What must appear? (2) What can repeat? (3) What are the choices? Combine using concat for "then," union for "or," and star for "repeat."
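One way to check an answer is to compare it with a direct test of the description on every short string. The candidate REs below are possible answers to the four exercises, not the only correct ones, and the helper names are hypothetical. The check is bounded, so passing it is evidence rather than a proof.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Yield every string over `alphabet` with length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# (description, alphabet, candidate RE, reference predicate)
EXERCISES = [
    ("length exactly 3",  "01", r"(0|1)(0|1)(0|1)", lambda s: len(s) == 3),
    ("contains 010",      "01", r"(0|1)*010(0|1)*", lambda s: "010" in s),
    ("even length",       "ab", r"((a|b)(a|b))*",   lambda s: len(s) % 2 == 0),
    ("no consecutive 1s", "01", r"(0|10)*(1|)",     lambda s: "11" not in s),  # (1|) plays the role of (1|ε)
]

def check(alphabet, pattern, predicate, max_len=8):
    """True if the RE and the predicate agree on every string up to max_len."""
    return all(
        (re.fullmatch(pattern, s) is not None) == predicate(s)
        for s in strings(alphabet, max_len)
    )

for desc, alphabet, pattern, predicate in EXERCISES:
    print(desc, pattern, check(alphabet, pattern, predicate))
```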
We can systematically convert any RE into an equivalent NFA using Thompson's Construction (1968).
Think of each base case (symbol, ε) as a basic LEGO brick. Union, concatenation, and star are ways to snap bricks together. You build bottom-up, combining small NFAs into bigger ones, until you have one NFA for the whole RE.
ε: L = { "" } -- Accepts only the empty string.
∅: No transition! L = { } -- Accepts nothing.
a: L = { "a" } -- Accepts only the string "a".
Each fragment has exactly one start state and exactly one accept state. These invariants let us compose fragments cleanly!
Union (R1 | R2): "Accept if either R1 or R2 matches."
From the new start, the NFA nondeterministically guesses which branch (R1 or R2) will match. If either branch reaches its old accept state, the ε-transition carries us to the new accept state.
Concatenation (R1 R2): "Accept if R1 matches a prefix and R2 matches the rest."
Concatenation is like coupling train cars. The output of the first car (R1) connects directly to the input of the second car (R2). A string must ride through both to be accepted.
Star (R*): "Accept zero or more repetitions of R."
The bottom ε-transition from start directly to accept is what makes R* accept the empty string. Without it, we would need at least one copy of R (that would be R+, not R*).
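The rules above can be sketched in code. This is a minimal illustration, not a reference implementation: the class name `Builder`, its methods, and the representation of a fragment as a `(start, accept)` pair are all assumptions. It links concatenated fragments with an ε-transition instead of merging states, a common variant, and it expects each fragment to be used only once.

```python
from collections import defaultdict
from itertools import count

EPS = None  # label used for ε-transitions

class Builder:
    def __init__(self):
        self.edges = defaultdict(list)  # state -> list of (label, target)
        self._ids = count()

    def _new(self):
        return next(self._ids)

    def symbol(self, a):
        """Base case: one transition labeled a (or ε when a is EPS)."""
        s, t = self._new(), self._new()
        self.edges[s].append((a, t))
        return s, t

    def union(self, f, g):
        """New start branches into both fragments; both accepts lead to the new accept."""
        s, t = self._new(), self._new()
        for start, accept in (f, g):
            self.edges[s].append((EPS, start))
            self.edges[accept].append((EPS, t))
        return s, t

    def concat(self, f, g):
        """Accept state of f feeds the start state of g."""
        self.edges[f[1]].append((EPS, g[0]))
        return f[0], g[1]

    def star(self, f):
        s, t = self._new(), self._new()
        self.edges[s] += [(EPS, f[0]), (EPS, t)]     # enter R, or skip it entirely
        self.edges[f[1]] += [(EPS, f[0]), (EPS, t)]  # repeat R, or leave
        return s, t

    def closure(self, states):
        """All states reachable from `states` through ε-transitions only."""
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for label, r in self.edges[q]:
                if label is EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def accepts(self, frag, text):
        """Simulate the NFA by tracking the set of states it could be in."""
        current = self.closure({frag[0]})
        for ch in text:
            moved = {r for q in current for label, r in self.edges[q] if label == ch}
            current = self.closure(moved)
        return frag[1] in current

# Build the NFA for (0|1)*01 bottom-up.
b = Builder()
any_bit = b.union(b.symbol("0"), b.symbol("1"))
nfa = b.concat(b.concat(b.star(any_bit), b.symbol("0")), b.symbol("1"))
print(b.accepts(nfa, "1101"))  # True
print(b.accepts(nfa, "10"))    # False
```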
To convert a DFA (or NFA) back to a regular expression, we use state elimination.
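To eliminate a state q, consider every pair (qi, qj) with a transition qi-->q and a transition q-->qj. The new label on qi-->qj becomes: (old label qi-->qj) | (label qi-->q)(label q-->q)*(label q-->qj). A missing transition counts as ∅, so a missing self-loop on q contributes ∅* = ε.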
Convert a DFA (binary strings ending in 1) to a regular expression:
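A possible walk-through in code, under stated assumptions: the DFA is taken to be the usual two-state machine (q0 = last symbol was not 1, q1 = last symbol was 1, accepting), and a new start state `S` and final state `F` are added with ε-transitions. State names, helper functions, and the use of Python regex syntax for labels are all illustrative choices. Here `None` stands for ∅ and the empty string stands for ε.

```python
import re
from itertools import product

def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def concat(*parts):
    # Concatenating with ∅ gives ∅.
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def star(r):
    # ∅* = ε and ε* = ε.
    return "" if r is None or r == "" else f"(?:{r})*"

def eliminate(edges, q):
    """Remove state q, rerouting every path qi --> q --> qj."""
    loop = star(edges.get((q, q)))
    ins = [(p, r) for (p, t), r in edges.items() if t == q and p != q]
    outs = [(t, r) for (p, t), r in edges.items() if p == q and t != q]
    new = {k: r for k, r in edges.items() if q not in k}
    for (qi, r_in), (qj, r_out) in product(ins, outs):
        new[(qi, qj)] = union(new.get((qi, qj)), concat(r_in, loop, r_out))
    return new

edges = {
    ("S", "q0"): "",    # ε from the new start state into the old start
    ("q0", "q0"): "0",
    ("q0", "q1"): "1",
    ("q1", "q0"): "0",
    ("q1", "q1"): "1",
    ("q1", "F"): "",    # ε from the old accept state to the new final state
}
for q in ("q1", "q0"):
    edges = eliminate(edges, q)

regex = edges[("S", "F")]
print(regex)

# Bounded sanity check against the intended language.
ok = all(
    (re.fullmatch(regex, "".join(w)) is not None) == "".join(w).endswith("1")
    for n in range(9)
    for w in product("01", repeat=n)
)
print(ok)
```

With this elimination order the result has the shape (0|11*0)*11*, which is longer than (0|1)*1 but describes the same language. A different elimination order generally yields a different, equivalent expression.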
REs obey many useful identities. These help simplify expressions.
| Union law | Rule |
|---|---|
| Commutative | R | S = S | R |
| Associative | (R|S)|T = R|(S|T) |
| Idempotent | R | R = R |
| Identity | R | ∅ = R |
| Concatenation law | Rule |
|---|---|
| Associative | (RS)T = R(ST) |
| Identity | Rε = εR = R |
| Annihilator | R∅ = ∅R = ∅ |
| Distributive law | Rule |
|---|---|
| Left dist. | R(S|T) = RS | RT |
| Right dist. | (S|T)R = SR | TR |
| Star law | Rule |
|---|---|
| Star idem. | (R*)* = R* |
| Star of ε | ε* = ε |
| Star of ∅ | ∅* = ε |
Concatenation is NOT commutative. ab ≠ ba. Order matters!
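Individual instances of these laws can be spot-checked by comparing two expressions on every short string. The sketch below assumes Python's `re` module and a made-up helper `equivalent`. Agreement up to a length bound is evidence, not a proof, but a single disagreement is a real counterexample.

```python
import re
from itertools import product

def equivalent(r, s, alphabet="ab", max_len=6):
    """True if r and s accept the same strings over `alphabet` up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if bool(re.fullmatch(r, w)) != bool(re.fullmatch(s, w)):
                return False
    return True

print(equivalent(r"a(b|ab)", r"ab|aab"))   # left distributivity: True
print(equivalent(r"(a|b)|(a|b)", r"a|b"))  # idempotence of union: True
print(equivalent(r"a|b", r"b|a"))          # union is commutative: True
print(equivalent(r"ab", r"ba"))            # concatenation is not: False
```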
The "regex" in Python/Java/etc. is more powerful than theoretical REs!
+, ?, {n,m}, [a-z], \d, \w
. for any character
\1, \2 backreferences (NOT regular!)
| Feature | Theoretical | Practical | Still Regular? |
|---|---|---|---|
| R+ (one or more) | Write as RR* | Built-in | Yes |
| R? (optional) | Write as R|ε | Built-in | Yes |
| [a-z] | Write as a|b|...|z | Built-in | Yes |
| R{3,5} | Write as RRR|RRRR|RRRRR | Built-in | Yes |
| \1 backreference | Cannot express | Built-in | NO! |
| Lookahead (?=...) | No direct operator | Built-in | Yes (regular languages are closed under intersection) |
Backreferences can make regex matching NP-hard. Poorly written patterns like (a+)+ can cause catastrophic backtracking -- a real security vulnerability called ReDoS.
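A short illustration of the difference between sugar and extra power, assuming Python's `re` module. The first part checks that each shorthand agrees with its expansion on a handful of sample strings (chosen arbitrarily); the second part uses a backreference to match "squares" ww, a language that no theoretical RE can describe once the alphabet has at least two symbols.

```python
import re

# Syntactic sugar: each shorthand has an expansion that uses only |, concatenation, and *.
pairs = [("a+", "aa*"), ("a?", "a|"), ("a{3,5}", "aaa|aaaa|aaaaa"), ("[a-c]", "a|b|c")]
samples = ["", "a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "b", "c", "d"]
for sugar, plain in pairs:
    same = all(bool(re.fullmatch(sugar, s)) == bool(re.fullmatch(plain, s)) for s in samples)
    print(sugar, plain, same)

# Backreference: \1 must repeat exactly the text captured by group 1.
square = re.compile(r"(.+)\1")
print(bool(square.fullmatch("abcabc")))  # True
print(bool(square.fullmatch("abcabd")))  # False
```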
Useful regular expression patterns for common language descriptions.
| Language over {0,1} | RE |
|---|---|
| All strings | (0|1)* |
| Strings starting with 1 | 1(0|1)* |
| Strings ending in 00 | (0|1)*00 |
| Strings containing 101 | (0|1)*101(0|1)* |
| Even-length strings | ((0|1)(0|1))* |
| Strings of only 0s | 0* |
| Non-empty strings | (0|1)(0|1)* |
| Exactly three characters | (0|1)(0|1)(0|1) |
| Language over {a,b} | RE |
|---|---|
| Starts and ends with a | a(a|b)*a | a |
| Contains at least two b's | (a|b)*b(a|b)*b(a|b)* |
| No two consecutive a's | (b|ab)*(a|ε) |
| Every a followed by b | (b|ab)* |
| Alternating a's and b's | (ab)*(a|ε) | (ba)*(b|ε) |
Σ* = anything,
Σ* w Σ* = contains w,
w Σ* = starts with w,
Σ* w = ends with w
ab* ≠ (ab)* (star binds tightest)
R* always includes ε
Regular expressions are the algebra of regular languages. Master the three operations, know the precedence, and you can describe any regular language concisely.