Hash Tables

O(1) Average Lookup

key: "alice"   value: 95
        |
        v
+============+      +=========+      +========+
| Hash Func  |----->| Index 3 |----->| Bucket |
+============+      +=========+      +========+

put("alice", 95)  --> O(1) average
get("alice")      --> O(1) average
remove("alice")   --> O(1) average

CS205 Data Structures


The Problem: Fast Key-Value Access

We want O(1) for get, put, and remove.

Arrays give O(1) access by integer index:

arr[3] = "hello"   // O(1)
arr[3]             // O(1)

But what if keys are strings, objects, or arbitrary data?

map["alice"]  = 95   // ???
map["bob"]    = 87   // ???
map["carlos"] = 72   // ???

What other structures offer:

Structure      | get      | put
Unsorted Array | O(n)     | O(1)
Sorted Array   | O(log n) | O(n)
Linked List    | O(n)     | O(1)
BST (balanced) | O(log n) | O(log n)
Hash Table     | O(1)*    | O(1)*

* average case

Key Idea

Convert any key into an array index using a hash function, then store the value at that index.


The Hash Function Idea

Transform any key into a valid array index in two steps:

HASH TABLE PIPELINE
========================================================================
         hash code                 compression
Key ------------------> Integer ------------------> Index (0 to N-1)

"alice"  ---> hashCode() ---> 97_429_159 ---> mod 11 ---> 3
"bob"    ---> hashCode() --->     66_843 ---> mod 11 ---> 7
"carlos" ---> hashCode() ---> 84_201_557 ---> mod 11 ---> 0
========================================================================
(The hash codes are illustrative, not the values Java's String.hashCode() returns.)

Step 1: Hash Code   - Turn the key into a (large) integer
Step 2: Compression - Squeeze that integer into range [0, N-1]
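A minimal Java sketch of the two-step pipeline, assuming the key's own hashCode() is step 1 and plain modulo is step 2. The names HashPipeline and indexFor are illustrative, not from the slides. Math.floorMod is used instead of % because hashCode() may be negative. The printed indices come from Java's real String.hashCode(), so they will not match the illustrative numbers in the diagram.

```java
final class HashPipeline {
    /** Step 1 (hash code) followed by step 2 (compression into [0, n-1]). */
    static int indexFor(Object key, int n) {
        int hashCode = key.hashCode();       // step 1: any key -> int (may be negative)
        return Math.floorMod(hashCode, n);   // step 2: int -> index in [0, n-1]
    }

    public static void main(String[] args) {
        int n = 11; // table size (hypothetical choice for the demo)
        for (String key : new String[] {"alice", "bob", "carlos"}) {
            System.out.println(key + " -> " + indexFor(key, n));
        }
    }
}
```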

Analogy: Coat Check Room

You hand your coat (key-value pair) to the attendant. They give you a ticket number (the hash). When you return with the ticket, they go directly to the right hook -- no searching through all coats. The ticket number is the position. That is exactly what a hash function does: it computes a "ticket number" (index) for each key so you can retrieve it instantly.


Hash Function Requirements

Three Requirements

  • Deterministic -- Same key always produces the same hash code. If h("alice") = 42 now, it must be 42 forever.
  • Uniform Distribution -- Spread keys evenly across all indices. Avoid clustering many keys into the same bucket.
  • Fast to Compute -- The whole point is O(1); a slow hash function defeats the purpose.

Warning

A bad hash function that maps everything to index 0 turns the hash table into a linked list -- O(n) for everything!

Common Compression Methods

Division Method (Modulo)

index = hashCode % N

Simple and fast. Works best when N is prime.

MAD Method (Multiply-Add-Divide)

index = ((a * hashCode + b) mod p) mod N

Here p is a prime larger than N, and a and b are random integers in [0, p-1] with a > 0. This tends to give a better distribution than simple modulo.
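The MAD formula can be sketched in Java as follows. MadCompression and compress are hypothetical names, and the parameters in main are small illustrative values, not recommendations. The sketch uses long arithmetic and reduces the hash code mod p first so the multiplication cannot overflow for moderately sized p.

```java
final class MadCompression {
    /**
     * MAD compression: ((a * hashCode + b) mod p) mod n.
     * Assumes p is prime, p > n, 0 < a < p and 0 <= b < p.
     */
    static int compress(int hashCode, long a, long b, long p, int n) {
        long h = Math.floorMod((long) hashCode, p);    // reduce first; also handles negatives
        long scrambled = Math.floorMod(a * h + b, p);  // multiply-add, then mod p
        return (int) (scrambled % n);                  // final squeeze into [0, n-1]
    }

    public static void main(String[] args) {
        // (3*42 + 5) mod 11 = 10, then 10 mod 7 = 3
        System.out.println(compress(42, 3, 5, 11, 7)); // prints 3
    }
}
```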


Hash Codes for Different Types

Integers

Use the integer itself (or i mod N).

h(42) = 42
h(-7) = -7   (handle the sign: the final index must be non-negative)

Strings: Polynomial Hash

Treat each character as a coefficient in a polynomial:

h(s) = s[0]*x^(n-1) + s[1]*x^(n-2) + ... + s[n-1]*x^0
where x is a constant (e.g., 31, 37, 41)

Example: h("abc") with x = 31
  = 'a'*31^2 + 'b'*31^1 + 'c'*31^0
  = 97*961 + 98*31 + 99*1
  = 93217 + 3038 + 99
  = 96354

Use Horner's method to evaluate efficiently: ((97*31 + 98)*31 + 99)
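Horner's method translates into a short loop. The sketch below uses the hypothetical names PolynomialHash and hash, and lets int overflow wrap around, which is also how Java's own String.hashCode() behaves.

```java
final class PolynomialHash {
    /** Evaluates s[0]*x^(n-1) + ... + s[n-1]*x^0 by Horner's method. */
    static int hash(String s, int x) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * x + s.charAt(i); // one multiply and one add per character
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc", 31));   // prints 96354
        System.out.println("abc".hashCode());  // prints 96354 as well
    }
}
```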

Why Polynomial Hashing?

  • Uses position of characters, not just content
  • "abc" and "cba" get different hashes
  • Java's String.hashCode() uses x = 31

Objects: Combine Field Hashes

class Student {
    String name;
    int id;

    @Override
    public int hashCode() {
        int h = 17;                    // non-zero starting value
        h = 31 * h + name.hashCode();  // mix in the name
        h = 31 * h + id;               // mix in the id
        return h;
    }
}

Warning

If two objects are equals(), they must have the same hashCode(). The reverse is not required (collisions are allowed).
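A sketch of a key class that honors this contract, assuming equality is defined by name and id. StudentKey is a hypothetical class for illustration; Objects.hash is used as a convenient way to combine the same fields that equals() compares.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class StudentKey {
    private final String name;
    private final int id;

    StudentKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StudentKey)) return false;
        StudentKey other = (StudentKey) o;
        return id == other.id && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, id); // built from the same fields equals() compares
    }

    public static void main(String[] args) {
        Map<StudentKey, Integer> grades = new HashMap<>();
        grades.put(new StudentKey("alice", 1), 95);
        // A distinct but equal object finds the entry because the hash codes match.
        System.out.println(grades.get(new StudentKey("alice", 1))); // prints 95
    }
}
```

If hashCode() were left at the default from Object, the second lookup would usually miss, because the two equal objects would land in different buckets.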


Compression Functions in Detail

Simple Modulo: h(k) mod N

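Hash codes: 96354, 42, 10007, 555
Table size N = 7:

96354 mod 7 = 6
   42 mod 7 = 0
10007 mod 7 = 4
  555 mod 7 = 2

No two of these collide, but any two hash codes that differ by a multiple of 7 would: for example, 562 mod 7 = 2, the same index as 555.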

Why N Should Be Prime

If N = 10 and keys are multiples of 5: {5, 10, 15, 20, 25, ...} all map to indices {0, 5} -- only 2 of 10 buckets used!

A prime N (e.g., 7, 11, 13, 97) makes it less likely that regular patterns in the keys turn into clusters of indices.
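The effect is easy to check by counting how many distinct indices a patterned key set occupies. In this sketch (PrimeTableSize and bucketsUsed are hypothetical names) the keys are the first 20 multiples of 5.

```java
import java.util.HashSet;
import java.util.Set;

final class PrimeTableSize {
    /** Counts the distinct indices used by keys step, 2*step, ..., count*step in a table of size n. */
    static int bucketsUsed(int step, int count, int n) {
        Set<Integer> used = new HashSet<>();
        for (int i = 1; i <= count; i++) {
            used.add((i * step) % n);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(bucketsUsed(5, 20, 10)); // prints 2  (only indices 0 and 5)
        System.out.println(bucketsUsed(5, 20, 11)); // prints 11 (every index is used)
    }
}
```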

MAD: ((a*h(k)+b) mod p) mod N

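Parameters: N = 7 (table size), p = 11 (prime > N), a = 3, b = 5

h(k) = 42:     (3*42 + 5) mod 11    = 131 mod 11    = 10     10 mod 7 = 3
h(k) = 555:    (3*555 + 5) mod 11   = 1670 mod 11   = 9       9 mod 7 = 2
h(k) = 96354:  (3*96354 + 5) mod 11 = 289067 mod 11 = 9       9 mod 7 = 2

555 and 96354 collide here because they are congruent mod 11. With p this small only 11 intermediate values are possible; p = 11 is used just to keep the arithmetic readable.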

Key Idea

MAD tends to spread keys more uniformly because the multiply-add step "scrambles" the hash code before compressing.


Collisions Are Inevitable

A collision occurs when two different keys map to the same index.

h("alice") = 3 ------+
                     |
h("dave")  = 3 ------+
                     v
Index: [ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]
                           ^
                           |
                      COLLISION!
                Both keys want slot 3

Pigeonhole Principle

If you have more keys than array slots, at least two keys must share a slot. Even with fewer keys, collisions are likely (cf. the Birthday Paradox: with 23 people, there is roughly a 50% chance that two share a birthday).
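The birthday figure can be reproduced with a few lines of arithmetic, assuming each key lands uniformly and independently in one of n slots. BirthdayParadox and collisionProbability are hypothetical names.

```java
final class BirthdayParadox {
    /** Probability that at least two of k keys share a slot when there are n equally likely slots. */
    static double collisionProbability(int k, int n) {
        double allDistinct = 1.0;
        for (int i = 0; i < k; i++) {
            allDistinct *= (double) (n - i) / n; // the (i+1)-th key avoids the i slots already taken
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365)); // about 0.507
    }
}
```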

Two Main Solutions:

  1. Separate Chaining -- Store a list at each bucket
  2. Open Addressing -- Find another open slot in the array

Chaining:            Open Addressing:
[3] -> A -> D        [3] = A
                     [4] = D (probed)

Collision Handling 1: Separate Chaining

Each bucket holds a linked list (or chain) of all entries that hash to that index.

Index
+-----+
|  0  | --> [ "carlos":72 ] --> null
+-----+
|  1  | --> null
+-----+
|  2  | --> [ "eve":91 ] --> [ "frank":68 ] --> null
+-----+
|  3  | --> [ "alice":95 ] --> [ "dave":80 ] --> null
+-----+
|  4  | --> null
+-----+
|  5  | --> [ "bob":87 ] --> null
+-----+
|  6  | --> null
+-----+

Operations

  • put(k, v): hash k to index i; if k is already in the list at bucket[i], update its value, otherwise prepend (k,v) to the list
  • get(k): hash k to index i, search list at bucket[i] for key k
  • remove(k): hash k to index i, remove node with key k from list at bucket[i]

Key Idea

The array itself never "fills up." Each bucket can hold an unlimited number of entries. The tradeoff: long chains degrade to O(n) search within that chain.
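The three operations can be sketched as a small generic class. This is a teaching sketch under simplifying assumptions (fixed table size, no rehashing, null keys not supported); ChainedMap and its members are hypothetical names. put overwrites the value when the key is already present and prepends otherwise.

```java
import java.util.Iterator;
import java.util.LinkedList;

final class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMap(int n) {
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
    }

    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = bucketFor(key);
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) { e.value = value; return; } // existing key: overwrite
        }
        chain.addFirst(new Entry<>(key, value));                 // new key: prepend
    }

    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // not found
    }

    V remove(K key) {
        Iterator<Entry<K, V>> it = bucketFor(key).iterator();
        while (it.hasNext()) {
            Entry<K, V> e = it.next();
            if (e.key.equals(key)) { it.remove(); return e.value; }
        }
        return null; // not found
    }
}
```

With new ChainedMap<Integer, String>(7), the keys 10 and 31 both land in bucket 3 and are kept apart only by the equals() check while walking the chain.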


Separate Chaining: Step-by-Step Example

Table size N = 7. Hash function: h(k) = k mod 7. Insert keys: 10, 22, 31, 4, 15

Insert 10: 10 mod 7 = 3              Insert 22: 22 mod 7 = 1
+---+                                +---+
| 0 |-> null                         | 0 |-> null
| 1 |-> null                         | 1 |-> [22] -> null
| 2 |-> null                         | 2 |-> null
| 3 |-> [10] -> null                 | 3 |-> [10] -> null
| 4 |-> null                         | 4 |-> null
| 5 |-> null                         | 5 |-> null
| 6 |-> null                         | 6 |-> null
+---+                                +---+

Insert 31: 31 mod 7 = 3 (!)          Insert 4: 4 mod 7 = 4
+---+                                +---+
| 0 |-> null                         | 0 |-> null
| 1 |-> [22] -> null                 | 1 |-> [22] -> null
| 2 |-> null                         | 2 |-> null
| 3 |-> [31] -> [10] -> null         | 3 |-> [31] -> [10] -> null
| 4 |-> null                         | 4 |-> [4] -> null
| 5 |-> null                         | 5 |-> null
| 6 |-> null                         | 6 |-> null
+---+                                +---+

Insert 15: 15 mod 7 = 1 (!)          FINAL STATE:
+---+                                Bucket 1 has chain: [15] -> [22]
| 0 |-> null                         Bucket 3 has chain: [31] -> [10]
| 1 |-> [15] -> [22] -> null         All others: single or empty
| 2 |-> null
| 3 |-> [31] -> [10] -> null         get(31): hash to 3, traverse chain,
| 4 |-> [4] -> null                  found at 1st node. O(1)!
| 5 |-> null
| 6 |-> null
+---+

Collision Handling 2: Open Addressing

Linear Probing

All entries live directly in the array. If the target slot is taken, try the next slot (wrapping around).

Probe sequence for h(k) = 3:
Try index 3 --> occupied? --> Try index 4 --> occupied? --> Try index 5 --> empty!

+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |  [X]  |  [X]  |       |       |
+-------+-------+-------+-------+-------+-------+-------+
    0       1       2       3       4       5       6
                            ^       ^       ^
                            |       |       |
                         try #1  try #2  INSERT HERE

Probe Formula

probe(k, i) = (h(k) + i) mod N,   i = 0, 1, 2, 3, ...

Search: probe until you find the key or an empty slot.

Clustering Problem

Primary clustering: occupied slots form long contiguous runs. New keys that hash anywhere in the cluster must probe to the end of it, making the cluster even longer. Performance degrades.
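A minimal sketch of linear-probing insert and search over an Integer array, assuming h(k) = k mod N and null as the "empty" marker. LinearProbingDemo, insert, and find are hypothetical names; deletion is deliberately left out.

```java
import java.util.Arrays;

final class LinearProbingDemo {
    /** Inserts key using probe(k, i) = (h(k) + i) mod N; returns the slot used. */
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n);
        for (int i = 0; i < n; i++) {
            int slot = (home + i) % n;
            if (table[slot] == null || table[slot] == key) {
                table[slot] = key;
                return slot;
            }
        }
        throw new IllegalStateException("table is full");
    }

    /** Returns the slot holding key, or -1 if an empty slot is reached first. */
    static int find(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n);
        for (int i = 0; i < n; i++) {
            int slot = (home + i) % n;
            if (table[slot] == null) return -1;   // empty slot ends the search
            if (table[slot] == key) return slot;
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[7];
        for (int key : new int[] {10, 22, 31, 4, 15}) insert(table, key);
        System.out.println(Arrays.toString(table)); // [null, 22, 15, 10, 31, 4, null]
    }
}
```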


Linear Probing: Step-by-Step

N = 7, h(k) = k mod 7. Insert: 10, 22, 31, 4, 15

Step 1: Insert 10. h(10) = 3. Slot 3 is empty --> place it.
+------+------+------+------+------+------+------+
|      |      |      |  10  |      |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Step 2: Insert 22. h(22) = 1. Slot 1 is empty --> place it.
+------+------+------+------+------+------+------+
|      |  22  |      |  10  |      |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Step 3: Insert 31. h(31) = 3. Slot 3 is OCCUPIED (10). Try slot 4 --> empty --> place it.
+------+------+------+------+------+------+------+
|      |  22  |      |  10  |  31  |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Step 4: Insert 4. h(4) = 4. Slot 4 is OCCUPIED (31). Try slot 5 --> empty --> place it.
+------+------+------+------+------+------+------+
|      |  22  |      |  10  |  31  |  4   |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Step 5: Insert 15. h(15) = 1. Slot 1 OCCUPIED (22). Try slot 2 --> empty --> place it.
+------+------+------+------+------+------+------+
|      |  22  |  15  |  10  |  31  |  4   |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       PRIMARY CLUSTER (slots 1-5)

Warning

Notice how slots 1-5 form a cluster. Any future key hashing to 1, 2, 3, 4, or 5 must probe to slot 6, extending the cluster further.


Open Addressing: Quadratic Probing

Instead of probing the next slot, probe at increasing squared offsets.

Probe Formula

probe(k, i) = (h(k) + i^2) mod N

i = 0:  h(k) + 0 = h(k)
i = 1:  h(k) + 1
i = 2:  h(k) + 4
i = 3:  h(k) + 9
i = 4:  h(k) + 16
...

Key Idea

Quadratic probing jumps farther with each attempt, breaking up clusters. Reduces primary clustering.

Linear vs Quadratic Comparison

h(k) = 3, table size = 11

Linear probing:     3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2
                    Consecutive! Creates clusters.

Quadratic probing:  3, 4, 7, 1, 8, 6, ...
                    Jumps around! Breaks clusters.

Offsets:     +0   +1   +4   +9   +16   +25
h(k) + i^2:   3    4    7   12    19    28
mod 11:       3    4    7    1     8     6

Warning

Quadratic probing may not visit all slots. It is guaranteed to find an empty slot when N is prime and the table is less than half full.
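A short sketch makes the limited coverage concrete. QuadraticProbing, probes, and reachableSlots are hypothetical names. For h(k) = 3 and N = 11 it reproduces the sequence 3, 4, 7, 1, 8, 6 and shows that only 6 of the 11 slots are ever reached.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

final class QuadraticProbing {
    /** First `count` probe positions: (home + i^2) mod n for i = 0, 1, 2, ... */
    static List<Integer> probes(int home, int n, int count) {
        List<Integer> seq = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            seq.add((home + i * i) % n);
        }
        return seq;
    }

    /** Number of distinct slots reachable from home; i^2 mod n repeats with period n, so n probes suffice. */
    static int reachableSlots(int home, int n) {
        return new HashSet<>(probes(home, n, n)).size();
    }

    public static void main(String[] args) {
        System.out.println(probes(3, 11, 6));      // [3, 4, 7, 1, 8, 6]
        System.out.println(reachableSlots(3, 11)); // 6
    }
}
```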


Open Addressing: Double Hashing

Use a second hash function to determine the step size. Keys that collide on the first hash will usually follow different probe sequences.

Probe Formula

probe(k, i) = (h1(k) + i * h2(k)) mod N

Common choice:
  h1(k) = k mod N
  h2(k) = q - (k mod q)    where q is a prime < N

Example: N = 11, q = 7

key = 20:  h1(20) = 20 mod 11 = 9
           h2(20) = 7 - (20 mod 7) = 7 - 6 = 1
           Probe: 9, 10, 0, 1, 2, ...

key = 31:  h1(31) = 31 mod 11 = 9
           h2(31) = 7 - (31 mod 7) = 7 - 3 = 4
           Probe: 9, 2, 6, 10, 3, ...

Key Idea

Even though keys 20 and 31 both start at index 9, their step sizes differ (1 vs 4), so their probe sequences diverge after the first probe. This avoids secondary clustering.
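The two probe sequences can be generated with a few lines of Java. DoubleHashing and its methods are hypothetical names; the sketch assumes integer keys and the common choice h2(k) = q - (k mod q), which never returns 0.

```java
import java.util.ArrayList;
import java.util.List;

final class DoubleHashing {
    static int h1(int k, int n) { return Math.floorMod(k, n); }

    static int h2(int k, int q) { return q - Math.floorMod(k, q); } // always in [1, q]

    /** First `count` positions of probe(k, i) = (h1(k) + i * h2(k)) mod n. */
    static List<Integer> probes(int k, int n, int q, int count) {
        List<Integer> seq = new ArrayList<>();
        int start = h1(k, n);
        int step = h2(k, q);
        for (int i = 0; i < count; i++) {
            seq.add((start + i * step) % n);
        }
        return seq;
    }

    public static void main(String[] args) {
        System.out.println(probes(20, 11, 7, 5)); // [9, 10, 0, 1, 2]
        System.out.println(probes(31, 11, 7, 5)); // [9, 2, 6, 10, 3]
    }
}
```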

Comparison of Probing Strategies

Method      | Primary Clustering | Secondary Clustering
Linear      | Yes                | Yes
Quadratic   | No                 | Yes
Double Hash | No                 | No

Deletion in Open Addressing

You cannot simply empty a slot -- it breaks the probe chain for other keys!

PROBLEM (N = 7, h(k) = k mod 7): Insert 10 (->3), Insert 31 (->3, probe to 4). Then delete 10.

Before delete 10:
+------+------+------+------+------+------+------+
|      |      |      |  10  |  31  |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Search for 31: h(31)=3 -> found 10, not 31 -> try 4 -> found 31!

Naive delete of 10:
+------+------+------+------+------+------+------+
|      |      |      |      |  31  |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Search for 31: h(31)=3 -> EMPTY -> "not found"!
WRONG! 31 is at index 4, but we stopped because slot 3 is empty.

Solution: Tombstones (DELETED markers)

After marking 10 as DELETED:
+------+------+------+------+------+------+------+
|      |      |      | DEL  |  31  |      |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

Search for 31: h(31)=3 -> DEL (skip, keep going) -> try 4 -> found 31! Correct!

Key Idea

A DELETED (tombstone) marker means: "a key was here; keep probing." It is treated as empty for inserts (you can reuse the slot) but as occupied for searches (don't stop here).

Warning

Too many tombstones degrade performance. Periodic rehashing cleans them out.
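A sketch of tombstone-based deletion for a linear-probing set of int keys, assuming h(k) = k mod N and a fixed table size. TombstoneSet and the State enum are hypothetical names. Searches stop only at EMPTY; inserts may reuse a DELETED slot after first checking that the key is not stored further along the probe path.

```java
import java.util.Arrays;

final class TombstoneSet {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;

    TombstoneSet(int n) {
        keys = new int[n];
        states = new State[n];
        Arrays.fill(states, State.EMPTY);
    }

    /** Returns the slot holding key, or -1. Tombstones are skipped; only EMPTY stops the search. */
    private int find(int key) {
        int n = keys.length;
        int home = Math.floorMod(key, n);
        for (int i = 0; i < n; i++) {
            int slot = (home + i) % n;
            if (states[slot] == State.EMPTY) return -1;
            if (states[slot] == State.OCCUPIED && keys[slot] == key) return slot;
        }
        return -1;
    }

    boolean contains(int key) { return find(key) >= 0; }

    boolean add(int key) {
        if (contains(key)) return false;           // already stored somewhere on the probe path
        int n = keys.length;
        int home = Math.floorMod(key, n);
        for (int i = 0; i < n; i++) {
            int slot = (home + i) % n;
            if (states[slot] != State.OCCUPIED) {  // EMPTY or DELETED: the slot can be reused
                keys[slot] = key;
                states[slot] = State.OCCUPIED;
                return true;
            }
        }
        throw new IllegalStateException("table is full");
    }

    boolean remove(int key) {
        int slot = find(key);
        if (slot < 0) return false;
        states[slot] = State.DELETED;              // leave a tombstone instead of EMPTY
        return true;
    }
}
```

With a table of size 7, adding 10 and 31, then removing 10, still leaves contains(31) returning true, because the search walks past the tombstone in slot 3.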


Load Factor

Definition

α = n / N

n = number of entries stored
N = number of buckets (table size)

α = 0.0    Table is empty
α = 0.5    Half full
α = 0.75   Three-quarters full
α = 1.0    Completely full
α > 1.0    Only possible with chaining

Expected Probes (Linear Probing)

α    | Successful | Unsuccessful
0.25 | 1.17       | 1.39
0.50 | 1.50       | 2.50
0.75 | 2.50       | 8.50
0.90 | 5.50       | 50.50
[Figure: Performance vs Load Factor. y-axis: Probes; x-axis: α (0.25, 0.5, 0.75, 0.9). The curve rises steeply, reaching about 50 probes at α = 0.9.]

Recommended Thresholds

  • Separate Chaining: rehash when α > 0.75
  • Open Addressing: rehash when α > 0.5

Open addressing needs a lower threshold because clusters form and probes increase steeply.
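The probe counts in the table are consistent with Knuth's classic approximations for linear probing: about (1/2)(1 + 1/(1-α)) probes for a successful search and (1/2)(1 + 1/(1-α)^2) for an unsuccessful one. The sketch below simply evaluates those two formulas; LinearProbingCost and its method names are hypothetical.

```java
final class LinearProbingCost {
    /** Approximate expected probes for a successful search, 0 <= alpha < 1. */
    static double successful(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    /** Approximate expected probes for an unsuccessful search (or an insert), 0 <= alpha < 1. */
    static double unsuccessful(double alpha) {
        return 0.5 * (1 + 1 / ((1 - alpha) * (1 - alpha)));
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.25, 0.50, 0.75, 0.90}) {
            System.out.printf("alpha=%.2f  successful=%.2f  unsuccessful=%.2f%n",
                    alpha, successful(alpha), unsuccessful(alpha));
        }
    }
}
```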


Rehashing

When the load factor exceeds the threshold, grow the table and reinsert everything.

BEFORE REHASH (N=7, n=5, alpha = 5/7 = 0.71)
+------+------+------+------+------+------+------+
|      |  22  |  15  |  10  |  31  |  4   |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6

alpha > 0.5 threshold --> REHASH!

1. Create new array of size 2*7+1 = 15 (next prime: 17)
2. Recompute h(k) = k mod 17 for every key
3. Insert each key into new table

AFTER REHASH (N=17, n=5, alpha = 5/17 = 0.29)
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|    |    |    |    | 4  | 22 |    |    |    |    | 10 |    |    |    | 31 | 15 |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16

4 mod 17=4   22 mod 17=5   10 mod 17=10   31 mod 17=14   15 mod 17=15

Key Idea

Rehashing costs O(n) for that one operation, but it happens so rarely that the amortized cost per insert remains O(1). Same idea as dynamic array doubling.

Warning

You must recompute all indices because the table size N has changed. Old indices are no longer valid. You cannot just copy entries to the same slots!
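The rehash steps can be sketched as follows, assuming a linear-probing table of Integer keys with null for empty slots and a new size equal to the next prime at or above 2N+1. Rehash, nextPrime, and rehash are hypothetical names.

```java
final class Rehash {
    /** Smallest prime >= n, by trial division (fine for small tables). */
    static int nextPrime(int n) {
        for (int c = Math.max(n, 2); ; c++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= c; d++) {
                if (c % d == 0) { prime = false; break; }
            }
            if (prime) return c;
        }
    }

    /** Builds a larger table and reinserts every key with h(k) = k mod newSize and linear probing. */
    static Integer[] rehash(Integer[] old) {
        Integer[] bigger = new Integer[nextPrime(2 * old.length + 1)];
        for (Integer key : old) {
            if (key == null) continue;                      // skip empty slots
            int slot = Math.floorMod(key, bigger.length);   // index must be recomputed for the new size
            while (bigger[slot] != null) slot = (slot + 1) % bigger.length;
            bigger[slot] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] old = {null, 22, 15, 10, 31, 4, null};
        Integer[] bigger = rehash(old);
        System.out.println(bigger.length); // 17
    }
}
```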


Time Complexity Analysis

Average Case (good hash, low α)

Operation | Chaining | Open Addr.
put(k,v)  | O(1)     | O(1)
get(k)    | O(1)     | O(1)
remove(k) | O(1)     | O(1)
Space     | O(n)     | O(n)

Worst Case (all keys collide)

Operation | Chaining | Open Addr.
put(k,v)  | O(n)     | O(n)
get(k)    | O(n)     | O(n)
remove(k) | O(n)     | O(n)

BEST CASE: Uniform distribution
+---+
| 0 |-> [A]
| 1 |-> [B]
| 2 |-> [C]        Every chain
| 3 |-> [D]        has length ~1
| 4 |-> [E]
| 5 |-> [F]
+---+
Each get/put: O(1)

WORST CASE: All hash to same index
+---+
| 0 |-> [A]->[B]->[C]->[D]->[E]->[F]
| 1 |-> null
| 2 |-> null       One chain has
| 3 |-> null       ALL n entries
| 4 |-> null
| 5 |-> null
+---+
Each get/put: O(n) -- it's a linked list!

Analogy

A good hash function is like assigning students to exam rooms evenly. A bad one puts everyone in Room 1 and leaves Rooms 2-10 empty.


Java's HashMap Internals

Architecture

HashMap<String, Integer>
Default capacity: 16
Load factor: 0.75
Rehash threshold: 16 * 0.75 = 12

+---+
| 0 |-> null
| 1 |-> [K1:V1] -> [K2:V2] -> null
| 2 |-> null
| 3 |-> [K3:V3] -> null
| . |
| . |        When chain length >= 8:
| . |        LinkedList -> Red-Black Tree
|12 |-> [K4:V4] -> ... (7 more)
|   |        |
|   |        v  TREEIFY!
|   |   [Tree Root]
|   |     /     \
|   | [left]  [right]
|   |   ...     ...
|13 |-> null
|14 |-> null
|15 |-> null
+---+

Key Details

  • Initial capacity: 16 (always a power of 2)
  • Load factor: 0.75 by default
  • Rehash: double size when entries > capacity * 0.75
  • Bucket structure:
    • Chain length < 8 : linked list -- O(n) search within chain
    • Chain length >= 8 (and table capacity >= 64) : Red-Black Tree -- O(log n) search within chain; smaller tables are resized instead
    • Untreeify when a tree bucket shrinks to 6 or fewer entries
  • Index computation: (n-1) & hash, where n is the table length (bitwise AND; because n is a power of 2, this is equivalent to mod but faster)

Key Idea

Java's HashMap evolved: before Java 8 it was pure chaining. The tree upgrade brings the worst case per bucket down to O(log n) even with many collisions (most effectively when keys are Comparable), which protects against hash-flooding attacks.
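The power-of-two index trick is easy to verify in isolation. The sketch below is not HashMap's source; PowerOfTwoIndex is a hypothetical class that mirrors two ideas from the OpenJDK implementation: folding the high bits into the low bits with h ^ (h >>> 16), and masking with (n - 1).

```java
final class PowerOfTwoIndex {
    /** Folds the high 16 bits into the low 16 so they influence small tables too. */
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    /** Index for a table whose length n is a power of two; always in [0, n-1], even for negative hashes. */
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int h = spread("alice".hashCode());
        System.out.println(indexFor(h, n) == Math.floorMod(h, n)); // prints true
    }
}
```

The mask only equals a modulo when n is a power of two, which is one reason HashMap keeps its capacity at powers of two instead of primes.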


Real-World Applications

1. Spell Checker

Dictionary: HashSet<String>
+----------------------------+
| "apple" "banana" "cherry"  |
| "date"  "elder"  "fig" ... |
+----------------------------+

User types: "banan"
dict.contains("banan")
  -> hash("banan") -> index 5
  -> not found -> RED UNDERLINE!

O(1) lookup per word. Even checking an entire document is fast.

2. Caching (Memoization)

HTTP Cache: HashMap<URL, Response>

Request: GET /api/users/42
cache.get("/api/users/42")
  -> HIT  -> return cached response
  -> MISS -> fetch from server, store in cache
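The hit/miss flow can be sketched with a plain HashMap. ResponseCache is a hypothetical class, and the "server" is just a function passed in by the caller; a real HTTP cache would also need expiry and size limits, which are omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ResponseCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher; // stands in for the real server call
    private int misses = 0;

    ResponseCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    String get(String url) {
        String cached = cache.get(url);
        if (cached != null) return cached;  // HIT: return the cached response
        misses++;                           // MISS: fetch from the server, then store
        String fresh = fetcher.apply(url);
        cache.put(url, fresh);
        return fresh;
    }

    int misses() { return misses; }

    public static void main(String[] args) {
        ResponseCache c = new ResponseCache(url -> "response for " + url);
        c.get("/api/users/42");
        c.get("/api/users/42");
        System.out.println(c.misses()); // prints 1: the second call was a hit
    }
}
```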

3. Database Indexing

Table: Students
+----+--------+-------+
| id | name   | grade |
+----+--------+-------+
| 1  | Alice  | A     |
| 2  | Bob    | B     |
| 3  | Carlos | A     |
+----+--------+-------+

Hash index on "id": h(1)=5  h(2)=2  h(3)=0

SELECT * FROM Students WHERE id=2
  -> hash(2) = 2 -> direct lookup!
  -> No full table scan needed.

4. More Uses

  • Compilers: symbol tables (variable name -> type, scope)
  • Networking: routing tables, DNS caches
  • Deduplication: detect duplicate files by content hash
  • Counting: word frequency, vote tallying
  • Blockchain: transaction verification via hash chains (these rely on cryptographic hash functions rather than hash tables)

Summary & Cheat Sheet

HASH TABLE AT A GLANCE
==============================================
key --[hash func]--> index --[bucket]--> value
==============================================
Average: O(1) get / put / remove
Worst:   O(n) if all keys collide

COLLISION STRATEGIES:
+------------------+------------------+
| Separate         | Open             |
| Chaining         | Addressing       |
+------------------+------------------+
| Linked list per  | Store in array   |
| bucket           | itself           |
| alpha can be > 1 | alpha must be <1 |
| Simpler deletes  | Needs tombstones |
| Extra memory     | Better cache     |
| (pointers)       | locality         |
+------------------+------------------+

OPEN ADDRESSING VARIANTS:
+----------+--------+-----------+
| Linear   | Quadr. | Double    |
| h+i      | h+i^2  | h+i*h2    |
| Clusters | Better | Best      |
+----------+--------+-----------+

5 Things to Remember

  1. Hash function: must be deterministic, uniform, and fast
  2. Collisions: inevitable (pigeonhole principle) -- you must handle them
  3. Load factor α = n/N controls performance; rehash before it gets too high
  4. Tombstones: needed for deletion in open addressing
  5. Average O(1) for all operations -- the best we can hope for with unsorted data

Final Analogy

A hash table is like a well-organized filing cabinet. The hash function is the labeling system that tells you exactly which drawer to open. When two files share a drawer (collision), you either stack them in that drawer (chaining) or find the next empty drawer (open addressing). And when the cabinet gets too full, you buy a bigger one and refile everything (rehash).
