Convert any key into an array index using a hash function, then store the value at that index.
The Hash Function Idea
Transform any key into a valid array index in two steps:
HASH TABLE PIPELINE
========================================================================
           hash code                   compression
  Key ------------------> Integer ------------------> Index
                                                      (0 to N-1)
  "alice"  ---> hashCode() ---> 97_429_158 ---> mod 11 ---> 2
  "bob"    ---> hashCode() --->     66_837 ---> mod 11 ---> 1
  "carlos" ---> hashCode() ---> 84_201_559 ---> mod 11 ---> 2
  (hash code values are illustrative; "alice" and "carlos" happen to collide)
========================================================================
Step 1: Hash Code - Turn the key into a (large) integer
Step 2: Compression - Squeeze that integer into range [0, N-1]
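The two steps can be sketched in a few lines of Java. This is a minimal illustration, not a full hash table: the class and method names are hypothetical, and Math.floorMod is used for compression (an assumption) because hashCode() may return a negative value. The real String.hashCode() values it prints differ from the illustrative numbers in the pipeline diagram.

```java
class HashPipeline {
    // Maps any key to an index in [0, n-1].
    static int indexFor(Object key, int n) {
        int hashCode = key.hashCode();      // Step 1: key -> integer (may be negative)
        return Math.floorMod(hashCode, n);  // Step 2: integer -> index in [0, n-1]
    }

    public static void main(String[] args) {
        for (String key : new String[] {"alice", "bob", "carlos"}) {
            System.out.println(key + " -> " + key.hashCode() + " -> " + indexFor(key, 11));
        }
    }
}
```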
Analogy: Coat Check Room
You hand your coat (key-value pair) to the attendant. They give you a ticket number (the hash). When you return with the ticket, they go directly to the right hook -- no searching through all coats. The ticket number is the position. That is exactly what a hash function does: it computes a "ticket number" (index) for each key so you can retrieve it instantly.
Hash Function Requirements
Three Requirements
Deterministic -- Same key always produces the same hash code. If h("alice") = 42 when the key is inserted, it must still be 42 when the key is looked up.
Uniform Distribution -- Spread keys evenly across all indices. Avoid clustering many keys into the same bucket.
Fast to Compute -- The whole point is O(1); a slow hash function defeats the purpose.
Warning
A bad hash function that maps everything to index 0 turns the hash table into a linked list -- O(n) for everything!
Common Compression Methods
Division Method (Modulo)
index = hashCode % N
Simple and fast. Works best when N is prime.
MAD Method (Multiply-Add-Divide)
index = ((a * hashCode + b) mod p) mod N
Where p is a prime > N, and a, b are random integers in [0, p-1] with a > 0. Usually gives a better distribution than simple modulo.
Hash Codes for Different Types
Integers
Use the integer itself (or i mod N).
h(42) = 42
h(-7) = -7 (handle the sign: compression must still produce an index in [0, N-1])
Strings: Polynomial Hash
Treat each character as a coefficient in a polynomial:
h(s) = s[0]*x^(n-1) + s[1]*x^(n-2)
+ ... + s[n-1]*x^0
where x is a constant (e.g., 31, 37, 41)
Example: h("abc") with x = 31
= 'a'*31^2 + 'b'*31^1 + 'c'*31^0
= 97*961 + 98*31 + 99*1
= 93217 + 3038 + 99
= 96354
Use Horner's method to evaluate efficiently: ((97*31 + 98)*31 + 99)
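Horner's method can be sketched as a single loop. The class and method names below are hypothetical; with x = 31 the result matches Java's String.hashCode(), including integer overflow wrapping around for long strings.

```java
class PolynomialHash {
    // Evaluates s[0]*x^(n-1) + ... + s[n-1]*x^0 with one multiply and one add per character.
    static int hash(String s, int x) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = x * h + s.charAt(i); // int overflow wraps around, which is fine for hashing
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc", 31)); // ((97*31 + 98)*31 + 99)
        System.out.println(hash("cba", 31)); // same characters, different order
    }
}
```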
Why Polynomial Hashing?
Uses position of characters, not just content
"abc" and "cba" get different hashes
Java's String.hashCode() uses x = 31
Objects: Combine Field Hashes
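class Student {
    String name;   // assumed non-null here
    int id;

    // Must be public: it overrides Object.hashCode().
    // equals() should be overridden on the same fields to stay consistent.
    @Override
    public int hashCode() {
        int h = 17;                    // arbitrary non-zero start value
        h = 31 * h + name.hashCode();  // mix in each field in turn
        h = 31 * h + id;
        return h;
    }
}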
Warning
If two objects are equals(), they must have the same hashCode(). The reverse is not required (collisions are allowed).
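A minimal sketch of keeping equals() and hashCode() consistent follows. StudentKey is a hypothetical class introduced only for illustration, and Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class StudentKey {
    private final String name;
    private final int id;

    StudentKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof StudentKey)) return false;
        StudentKey that = (StudentKey) other;
        return id == that.id && Objects.equals(name, that.name);
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals() compares, so equal objects hash alike.
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        Set<StudentKey> enrolled = new HashSet<>();
        enrolled.add(new StudentKey("alice", 1));
        // A distinct but equal object is found because both methods agree.
        System.out.println(enrolled.contains(new StudentKey("alice", 1)));
    }
}
```

If hashCode() were left at its default while equals() was overridden, the lookup in main would usually miss, because the two equal objects would land in different buckets.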
Compression Functions in Detail
Simple Modulo: h(k) mod N
Hash codes: 96354, 42, 10007, 555
Table size N = 7:
96354 mod 7 = 6
42 mod 7 = 0
10007 mod 7 = 4
555 mod 7 = 2
No two of these four collide, but any other code congruent to 2 mod 7 (e.g., 9) would collide with 555.
Why N Should Be Prime
If N = 10 and keys are multiples of 5:
{5, 10, 15, 20, 25, ...} all map to indices {0, 5} -- only 2 of 10 buckets used!
A prime N (e.g., 7, 11, 13, 97) shares no common factor with a regular key spacing (unless the spacing is a multiple of N), so patterned keys still spread across all buckets.
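The effect of the table size on patterned keys can be checked by counting how many distinct buckets get used. This is a small experiment with hypothetical names, using multiples of a fixed step as the keys.

```java
import java.util.HashSet;
import java.util.Set;

class PrimeTableSize {
    // Counts distinct indices hit by the keys step, 2*step, ..., count*step under "key mod n".
    static int bucketsUsed(int step, int count, int n) {
        Set<Integer> used = new HashSet<>();
        for (int i = 1; i <= count; i++) {
            used.add((i * step) % n);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("N = 10: " + bucketsUsed(5, 100, 10) + " buckets used");
        System.out.println("N = 11: " + bucketsUsed(5, 100, 11) + " buckets used");
    }
}
```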
MAD: ((a*h(k)+b) mod p) mod N
Parameters:
N = 7 (table size)
p = 11 (prime > N)
a = 3, b = 5
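h(k) = 42:
(3*42 + 5) mod 11 = 131 mod 11 = 10
10 mod 7 = 3
h(k) = 555:
(3*555 + 5) mod 11 = 1670 mod 11 = 9
9 mod 7 = 2
h(k) = 96354:
(3*96354 + 5) mod 11 = 289067 mod 11 = 9
9 mod 7 = 2
555 and 96354 collide here because they are congruent mod 11; with p this small, only 11 distinct values survive the first step. In practice p is chosen much larger than N.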
Key Idea
MAD spreads keys more uniformly because the multiply-add step "scrambles" the hash code before compressing.
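A minimal sketch of MAD compression, useful for checking worked examples by hand. The names are hypothetical, and the use of long arithmetic is an assumption made here so that a * hashCode cannot overflow.

```java
class MadCompression {
    // index = ((a * hashCode + b) mod p) mod n
    static int compress(int hashCode, long a, long b, long p, int n) {
        long scrambled = Math.floorMod(a * hashCode + b, p); // multiply-add, then mod p
        return (int) (scrambled % n);                         // squeeze into [0, n-1]
    }

    public static void main(String[] args) {
        // Parameters from the worked example: N = 7, p = 11, a = 3, b = 5.
        for (int h : new int[] {42, 555, 96354}) {
            System.out.println(h + " -> " + compress(h, 3, 5, 11, 7));
        }
    }
}
```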
Collisions Are Inevitable
A collision occurs when two different keys map to the same index.
If you have more keys than array slots, at least two keys must share a slot (pigeonhole principle). Even with fewer keys, collisions are likely (cf. Birthday Paradox: with 23 people, there is already a slightly better than 50% chance that two share a birthday).
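The birthday figure can be reproduced directly. The sketch below assumes every slot is equally likely for every key (an idealized uniform hash); the names are hypothetical.

```java
class BirthdayCollision {
    // Probability that at least two of k keys land in the same one of n equally likely slots.
    static double collisionProbability(int k, int n) {
        double allDistinct = 1.0;
        for (int i = 0; i < k; i++) {
            allDistinct *= (double) (n - i) / n; // the i-th key must avoid i taken slots
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("23 keys, 365 slots: %.3f%n", collisionProbability(23, 365));
    }
}
```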
Two Main Solutions:
1. Separate Chaining -- Store a list at each bucket
2. Open Addressing -- Find another open slot in the array
Chaining:            Open Addressing:
[3] -> A -> D        [3] = A
                     [4] = D (probed)
Collision Handling 1: Separate Chaining
Each bucket holds a linked list (or chain) of all entries that hash to that index.
put(k, v): hash k to index i; if k is already in the list at bucket[i], replace its value, otherwise prepend (k,v)
get(k): hash k to index i, search list at bucket[i] for key k
remove(k): hash k to index i, remove node with key k from list at bucket[i]
Key Idea
The array itself never "fills up." Each bucket can hold an unlimited number of entries. The tradeoff: long chains degrade to O(n) search within that chain.
Separate Chaining: Step-by-Step Example
Table size N = 7. Hash function: h(k) = k mod 7. Insert keys: 10, 22, 31, 4, 15
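The example can be traced with a small chained table. This is a sketch under stated assumptions: integer keys, string values, h(k) = k mod N, and new entries prepended to the chain. ChainedHashMap and keysAt are hypothetical names, not a library API.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedHashMap {
    private static final class Entry {
        final int key;
        String value;
        Entry(int key, String value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();
    private final int n;

    ChainedHashMap(int n) {
        this.n = n;
        for (int i = 0; i < n; i++) buckets.add(new LinkedList<>());
    }

    private int index(int key) { return Math.floorMod(key, n); }

    void put(int key, String value) {
        LinkedList<Entry> chain = buckets.get(index(key));
        for (Entry e : chain) {
            if (e.key == key) { e.value = value; return; } // existing key: update in place
        }
        chain.addFirst(new Entry(key, value));              // new key: prepend to the chain
    }

    String get(int key) {
        for (Entry e : buckets.get(index(key))) {
            if (e.key == key) return e.value;
        }
        return null; // not found
    }

    boolean remove(int key) {
        return buckets.get(index(key)).removeIf(e -> e.key == key);
    }

    // Keys in bucket i, front of the chain first (for inspection only).
    List<Integer> keysAt(int i) {
        List<Integer> keys = new ArrayList<>();
        for (Entry e : buckets.get(i)) keys.add(e.key);
        return keys;
    }

    public static void main(String[] args) {
        ChainedHashMap map = new ChainedHashMap(7);
        for (int k : new int[] {10, 22, 31, 4, 15}) map.put(k, "v" + k);
        for (int i = 0; i < 7; i++) System.out.println("[" + i + "] -> " + map.keysAt(i));
    }
}
```

With these five keys, 22 and 15 share bucket 1 and 10 and 31 share bucket 3; because of prepending, the later key sits at the front of each chain.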
Open Addressing: Linear Probing
Probe Formula
probe(k, i) = (h(k) + i) mod N
i = 0, 1, 2, 3, ...
Search: probe until you find the key or an empty slot.
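The probe formula (h(k) + i) mod N translates into a short loop. The sketch below is a set of integer keys without deletion or resizing; the class name and the choice h(k) = k mod N are assumptions for illustration.

```java
class LinearProbingSet {
    private final Integer[] slots; // null means the slot is empty

    LinearProbingSet(int n) { slots = new Integer[n]; }

    private int probe(int key, int i) {
        return (Math.floorMod(key, slots.length) + i) % slots.length;
    }

    // Returns the index where the key ends up, or -1 if the table is full.
    int insert(int key) {
        for (int i = 0; i < slots.length; i++) {
            int j = probe(key, i);
            if (slots[j] == null || slots[j] == key) {
                slots[j] = key;
                return j;
            }
        }
        return -1;
    }

    boolean contains(int key) {
        for (int i = 0; i < slots.length; i++) {
            int j = probe(key, i);
            if (slots[j] == null) return false; // empty slot: the key cannot be further along
            if (slots[j] == key) return true;
        }
        return false; // probed every slot
    }

    public static void main(String[] args) {
        LinearProbingSet set = new LinearProbingSet(7);
        for (int k : new int[] {10, 22, 31, 4, 15}) {
            System.out.println(k + " stored at index " + set.insert(k));
        }
    }
}
```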
Clustering Problem
Primary clustering (linear probing): occupied slots form long contiguous runs. A new key that hashes anywhere into a run must probe to its end, which makes the run even longer, so performance degrades as the table fills.
Quadratic probing, probe(k, i) = (h(k) + i^2) mod N, breaks up these runs but may not visit all slots. It is guaranteed to find an empty slot when N is prime and the table is less than half full.
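The limited reach of quadratic probing is easy to observe. The sketch below assumes the usual form (h(k) + i^2) mod N and simply collects the slots one probe sequence touches; the names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

class QuadraticProbing {
    // Distinct slots reached by the first n probes starting from index `start`.
    static Set<Integer> slotsVisited(int start, int n) {
        Set<Integer> visited = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            visited.add((start + i * i) % n);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("N = 7:  " + slotsVisited(0, 7));
        System.out.println("N = 11: " + slotsVisited(0, 11));
        System.out.println("N = 8:  " + slotsVisited(0, 8));
    }
}
```

For prime N the sequence reaches (N+1)/2 distinct slots, which is why keeping the table less than half full guarantees that an empty one is found.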
Open Addressing: Double Hashing
Use a second hash function to determine the step size, so that keys starting at the same index usually follow different probe sequences.
Probe Formula
probe(k, i) = (h1(k) + i * h2(k)) mod N
Common choice:
h1(k) = k mod N
h2(k) = q - (k mod q)
where q is a prime < N
Example: N = 11, q = 7
key = 20:
h1(20) = 20 mod 11 = 9
h2(20) = 7 - (20 mod 7) = 7 - 6 = 1
Probe: 9, 10, 0, 1, 2, ...
key = 31:
h1(31) = 31 mod 11 = 9
h2(31) = 7 - (31 mod 7) = 7 - 3 = 4
Probe: 9, 2, 6, 10, 3, ...
Key Idea
Even though keys 20 and 31 both start at index 9, their step sizes differ (1 vs 4), so their probe sequences diverge immediately. This avoids secondary clustering, except for the rare keys that agree on both h1 and h2.
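The two probe sequences can be generated mechanically. The sketch below uses h1(k) = k mod N and h2(k) = q - (k mod q) as in the example; the class and method names are hypothetical.

```java
import java.util.Arrays;

class DoubleHashing {
    // First `count` slots of the probe sequence (h1(k) + i * h2(k)) mod n.
    static int[] probeSequence(int key, int n, int q, int count) {
        int h1 = Math.floorMod(key, n);
        int h2 = q - Math.floorMod(key, q); // always in [1, q], so the step is never 0
        int[] sequence = new int[count];
        for (int i = 0; i < count; i++) {
            sequence[i] = (h1 + i * h2) % n;
        }
        return sequence;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probeSequence(20, 11, 7, 5)));
        System.out.println(Arrays.toString(probeSequence(31, 11, 7, 5)));
    }
}
```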
Comparison of Probing Strategies
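+-------------+--------------------+----------------------+
| Method      | Primary Clustering | Secondary Clustering |
+-------------+--------------------+----------------------+
| Linear      | Yes                | Yes                  |
| Quadratic   | No                 | Yes                  |
| Double Hash | No                 | No                   |
+-------------+--------------------+----------------------+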
Deletion in Open Addressing
You cannot simply empty a slot -- it breaks the probe chain for other keys!
PROBLEM: Insert 10 (->3), Insert 31 (->3, probe to 4). Then delete 10.
Before delete 10:                    Naive delete of 10:
+------+------+------+------+        +------+------+------+------+
|      |      |  10  |  31  |        |      |      |      |  31  |
+------+------+------+------+        +------+------+------+------+
   1      2      3      4               1      2      3      4
Now search for 31:                   Now search for 31:
h(31)=3 -> found 10, not 31          h(31)=3 -> EMPTY -> "not found"!
-> try 4 -> found 31!                WRONG! 31 is at index 4 but we
                                     stopped because slot 3 is empty.
Solution: Tombstones (DELETED markers)
After marking 10 as DELETED:
+------+------+------+------+
|      |      | DEL  |  31  |
+------+------+------+------+
   1      2      3      4
Search for 31:
h(31)=3 -> DEL (skip, keep going)
-> try 4 -> found 31! Correct!
Key Idea
A DELETED (tombstone) marker means: "a key was here; keep probing." It is treated as empty for inserts (you can reuse the slot) but as occupied for searches (don't stop here).
Warning
Too many tombstones degrade performance. Periodic rehashing cleans them out.
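A minimal sketch of tombstone deletion on top of linear probing. It assumes integer keys, h(k) = k mod N, and a separate per-slot state array; TombstoneSet is a hypothetical name, and the periodic cleanup by rehashing is left out.

```java
import java.util.Arrays;

class TombstoneSet {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;

    TombstoneSet(int n) {
        keys = new int[n];
        states = new State[n];
        Arrays.fill(states, State.EMPTY);
    }

    private int slot(int key, int i) {
        return (Math.floorMod(key, keys.length) + i) % keys.length;
    }

    // Index of the slot holding the key, or -1 if it is absent.
    private int find(int key) {
        for (int i = 0; i < keys.length; i++) {
            int j = slot(key, i);
            if (states[j] == State.EMPTY) return -1; // never used: stop searching
            if (states[j] == State.OCCUPIED && keys[j] == key) return j;
            // DELETED or a different key: keep probing
        }
        return -1;
    }

    boolean contains(int key) { return find(key) >= 0; }

    boolean insert(int key) {
        if (find(key) >= 0) return false;            // already present
        for (int i = 0; i < keys.length; i++) {
            int j = slot(key, i);
            if (states[j] != State.OCCUPIED) {       // EMPTY or DELETED: reuse the slot
                keys[j] = key;
                states[j] = State.OCCUPIED;
                return true;
            }
        }
        return false;                                // table is full
    }

    boolean remove(int key) {
        int j = find(key);
        if (j < 0) return false;
        states[j] = State.DELETED;                   // leave a tombstone, not an empty slot
        return true;
    }
}
```

A short usage check: after inserting 10 and 31 into a table of size 7 and removing 10, contains(31) still returns true because the search skips the tombstone left in slot 3.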
Load Factor
Definition
α = n / N
n = number of entries stored
N = number of buckets (table size)
alpha = 0.0 Table is empty
alpha = 0.5 Half full
alpha = 0.75 Three-quarters full
alpha = 1.0 Completely full
alpha > 1.0 Only possible with chaining
Open addressing needs a lower load-factor threshold than chaining because clusters form and the number of probes rises steeply as the table fills.
Rehashing
When the load factor exceeds the threshold, grow the table and reinsert everything.
BEFORE REHASH (N=7, n=5, alpha = 5/7 = 0.71)
+------+------+------+------+------+------+------+
|      |  22  |  15  |  10  |  31  |  4   |      |
+------+------+------+------+------+------+------+
   0      1      2      3      4      5      6
alpha > 0.5 threshold --> REHASH!
1. Create new array of size 2*7+1 = 15 (next prime: 17)
2. Recompute h(k) = k mod 17 for every key
3. Insert each key into new table
AFTER REHASH (N=17, n=5, alpha = 5/17 = 0.29)
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|    |    |    |    | 4  | 22 |    |    |    |    | 10 |    |    |    | 31 | 15 |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
4 mod 17=4   22 mod 17=5   10 mod 17=10   31 mod 17=14   15 mod 17=15
Key Idea
Rehashing costs O(n) for that one operation, but it happens so rarely that the amortized cost per insert remains O(1). Same idea as dynamic array doubling.
Warning
You must recompute all indices because the table size N has changed. Old indices are no longer valid. You cannot just copy entries to the same slots!
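The three rehash steps can be sketched for a linear-probing table of integer keys. The sizing rule (next prime at or above 2N+1) follows the example; the class and method names are hypothetical.

```java
class Rehash {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    static int nextPrimeAtLeast(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    // Builds a larger table and re-inserts every key at its newly computed index.
    static Integer[] rehash(Integer[] old) {
        int newSize = nextPrimeAtLeast(2 * old.length + 1);   // 1. create the new array
        Integer[] fresh = new Integer[newSize];
        for (Integer key : old) {
            if (key == null) continue;                         // skip empty slots
            int j = Math.floorMod(key, newSize);               // 2. recompute the index
            while (fresh[j] != null) j = (j + 1) % newSize;    // 3. insert with linear probing
            fresh[j] = key;
        }
        return fresh;
    }

    public static void main(String[] args) {
        Integer[] old = {null, 22, 15, 10, 31, 4, null};
        System.out.println(java.util.Arrays.toString(rehash(old)));
    }
}
```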
Time Complexity Analysis
Average Case (good hash, low α)
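+-----------+----------+------------+
| Operation | Chaining | Open Addr. |
+-----------+----------+------------+
| put(k,v)  | O(1)     | O(1)       |
| get(k)    | O(1)     | O(1)       |
| remove(k) | O(1)     | O(1)       |
| Space     | O(n)     | O(n)       |
+-----------+----------+------------+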
Worst Case (all keys collide)
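+-----------+----------+------------+
| Operation | Chaining | Open Addr. |
+-----------+----------+------------+
| put(k,v)  | O(n)     | O(n)       |
| get(k)    | O(n)     | O(n)       |
| remove(k) | O(n)     | O(n)       |
+-----------+----------+------------+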
BEST CASE: Uniform distribution
+---+
| 0 |-> [A]
| 1 |-> [B]
| 2 |-> [C] Every chain
| 3 |-> [D] has length ~1
| 4 |-> [E]
| 5 |-> [F]
+---+
Each get/put: O(1)
WORST CASE: All hash to same index
+---+
| 0 |-> [A]->[B]->[C]->[D]->[E]->[F]
| 1 |-> null
| 2 |-> null One chain has
| 3 |-> null ALL n entries
| 4 |-> null
| 5 |-> null
+---+
Each get/put: O(n) -- it's a linked list!
Analogy
A good hash function is like assigning students to exam rooms evenly. A bad one puts everyone in Room 1 and leaves Rooms 2-10 empty.
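Java's HashMap (Java 8 and later):
Rehash: double the capacity when entries > capacity * 0.75
Bucket structure:
Chain length < 8 : linked list -- O(chain length) search within chain
Chain length >= 8 : Red-Black Tree -- O(log n) search within chain (only once the table has at least 64 buckets; smaller tables are resized instead)
Untreeify when chain shrinks to 6 or fewer
Index computation: (n-1) & hash (bitwise AND; here n is the table capacity, which is a power of 2, so this is equivalent to mod but faster)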
Key Idea
Java's HashMap evolved: before Java 8 it was pure chaining. The tree upgrade bounds the worst case per bucket at O(log n) when keys have distinct hash codes or are Comparable, which protects against hash-flooding attacks.
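The index computation can be checked against an ordinary modulo. In the sketch below, spread mirrors the bit-mixing step in OpenJDK's HashMap.hash() (XOR of the high 16 bits into the low 16); the class and method names are otherwise hypothetical, and capacity is assumed to be a power of 2.

```java
class PowerOfTwoIndex {
    // Folds the high 16 bits into the low 16 so they influence small tables too.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Same result as Math.floorMod(hash, capacity) when capacity is a power of 2.
    static int indexFor(int hash, int capacity) {
        return (capacity - 1) & hash;
    }

    public static void main(String[] args) {
        int h = spread("alice".hashCode());
        System.out.println(indexFor(h, 16) + " == " + Math.floorMod(h, 16));
    }
}
```

The mask also handles negative hash codes correctly, which the % operator on its own does not.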
Real-World Applications
1. Spell Checker
Dictionary: HashSet<String>
+---------------------------+
| "apple" "banana" "cherry" |
| "date" "elder" "fig" ... |
+---------------------------+
User types: "banan"
dict.contains("banan")
-> hash("banan") -> index 5
-> not found -> RED UNDERLINE!
O(1) lookup per word. Even checking an entire document is fast.
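A minimal spell-checker sketch built on HashSet. The tokenization (lowercasing and splitting on non-word characters) is an assumption chosen for brevity, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SpellChecker {
    // Returns the words of `text` that are not in the dictionary, in order of appearance.
    static List<String> misspelled(Set<String> dictionary, String text) {
        List<String> unknown = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty() && !dictionary.contains(word)) { // average O(1) per lookup
                unknown.add(word);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("apple", "banana", "cherry"));
        System.out.println(misspelled(dict, "Apple, banan and cherry"));
    }
}
```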
2. Caching (Memoization)
HTTP Cache:
HashMap<URL, Response>
Request: GET /api/users/42
cache.get("/api/users/42")
-> HIT -> return cached response
-> MISS -> fetch from server,
store in cache
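The hit/miss logic can be sketched with a HashMap and a loader function standing in for the server fetch. SimpleCache and its loader are hypothetical; a real HTTP cache would also need expiry and size limits, which are omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SimpleCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for "fetch from server"
    private int misses = 0;

    SimpleCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) return cached; // HIT: return the cached response
        misses++;
        V loaded = loader.apply(key);      // MISS: fetch ...
        cache.put(key, loaded);            // ... and store for next time
        return loaded;
    }

    int misses() { return misses; }

    public static void main(String[] args) {
        SimpleCache<String, String> cache = new SimpleCache<>(path -> "response for " + path);
        cache.get("/api/users/42");
        cache.get("/api/users/42");
        System.out.println("misses: " + cache.misses());
    }
}
```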
3. Database Indexing
Table: Students
+----+--------+-------+
| id | name | grade |
+----+--------+-------+
| 1 | Alice | A |
| 2 | Bob | B |
| 3 | Carlos | A |
+----+--------+-------+
Hash index on "id":
h(1)=5 h(2)=2 h(3)=0
SELECT * FROM Students WHERE id=2
-> hash(2) = 2 -> direct lookup!
-> No full table scan needed.
4. More Uses
Compilers: symbol tables (variable name -> type, scope)
Networking: routing tables, DNS caches
Deduplication: detect duplicate files by content hash
Counting: word frequency, vote tallying
Blockchain: transaction verification via hash chains (cryptographic hashing, a related idea rather than a hash table)
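As one example from this list, word-frequency counting takes only a few lines with HashMap.merge. The tokenization is an assumption and the names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordCount {
    // Counts how often each lowercase word occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                frequency.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        System.out.println(count("the cat and the hat"));
    }
}
```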
Summary & Cheat Sheet
HASH TABLE AT A GLANCE
==============================================
key --[hash func]--> index --[bucket]--> value
==============================================
Average: O(1) get / put / remove
Worst: O(n) if all keys collide
COLLISION STRATEGIES:
+------------------+------------------+
| Separate | Open |
| Chaining | Addressing |
+------------------+------------------+
| Linked list per | Store in array |
| bucket | itself |
| alpha can be > 1 | alpha must be <1 |
| Simpler deletes | Needs tombstones |
| Extra memory | Better cache |
| (pointers) | locality |
+------------------+------------------+
OPEN ADDRESSING VARIANTS:
+----------+--------+-----------+
| Linear | Quadr. | Double |
| h+i | h+i^2 | h+i*h2 |
| Clusters | Better | Best |
+----------+--------+-----------+
5 Things to Remember
Hash function: must be deterministic, uniform, and fast
Collisions: inevitable (pigeonhole principle) -- you must handle them
Load factor α = n/N controls performance; rehash before it gets too high
Tombstones: needed for deletion in open addressing
Average O(1) for all operations -- the best we can hope for with unsorted data
Final Analogy
A hash table is like a well-organized filing cabinet. The hash function is the labeling system that tells you exactly which drawer to open. When two files share a drawer (collision), you either stack them in that drawer (chaining) or find the next empty drawer (open addressing). And when the cabinet gets too full, you buy a bigger one and refile everything (rehash).