Hash Maps/Dynamic Resizing: Difference between revisions
From charlesreid1
(Rewrite for clarity and accuracy: introduce load factor, fix garbled deletion/thrashing section, clarify amortization, improve overall structure (via update-page on MediaWiki MCP Server)) |
|||
| Line 1: | Line 1: | ||
==Hash Maps: Dynamic Resizing== | ==Hash Maps: Dynamic Resizing== | ||
Once we have our hash table implemented, we'll want to start adding lots of items. But | Once we have our hash table implemented, we'll want to start adding lots of items. But we have to strike a balance. | ||
The | The '''load factor''' α = n/m relates the number of items stored (n) to the number of slots (m, the output space of the hash function). | ||
The smaller the output space, the smaller our map is in memory. But with chaining, a smaller m means a higher load factor, leading to longer linked lists (average length α = n/m). As α grows, operations drift from O(1) toward O(n). | |||
On the other hand, if we make m too large, we waste memory on empty slots. | |||
Dynamic resizing maintains the balance on-the-fly by growing and shrinking the table as items are inserted and deleted. The concept is similar to the dynamic array resizing used by the [[Arrays/Java/PythonList|PythonList]] class in Java, which implements a dynamically expanding and shrinking array. | |||
===Growing the table=== | |||
Growing the table from size m to size m' requires the following steps: | |||
* Allocate a new table of size m' | |||
* | * Build a new hash function h' suited to output space m' | ||
* Rehash every item: for each key-value pair in T, compute h'(key) and insert into T' | |||
* m' | |||
* | |||
The cost of a single resize is: | |||
* O(n) at least (to rehash all n items) | |||
* In general: O(n + m + m') | |||
A new hash function is necessary because the original hash function was designed for output space m. Each key must be re-hashed to map into the new range. | |||
* When | '''How much to grow?''' | ||
* | If we grow the table by a constant amount each time (e.g., m' = m + c), the cost of n insertions becomes a triangular number: | ||
<math>\Theta(1 + 2 + 3 + \dots + n) = \Theta(n^2)</math> | |||
We resize frequently, and each resize touches every item — hence the quadratic total. | |||
'''Doubling strategy:''' If we instead double the table size (m' = 2m), the cost of n insertions is dramatically better. The resize costs form a geometric series: | |||
<math>\Theta(1 + 2 + 4 + 8 + \dots + n) = \Theta(n)</math> | |||
The series sums to at most 2n. Spread across n insertions, this gives O(1) amortized cost per insertion. Only O(log n) of the operations (the resizes themselves) are expensive. | |||
===Shrinking the table=== | |||
When items are deleted and the load factor drops too low, the table should shrink to reclaim memory. | |||
A naive approach would shrink by half whenever the table is less than half full (n < m/2). However, this causes '''thrashing:''' with alternating insertions and deletions near the boundary, every operation could trigger a Θ(n) resize. | |||
'''Solution: shrink when the table is one-quarter full.''' | |||
* When n < m/4, shrink the table by half (m' = m/2) | |||
After shrinking from m to m/2 with n = m/4 items, the load factor becomes (m/4) / (m/2) = 1/2. To trigger another resize from this state: | |||
* A grow would require n to exceed 3m'/4 = 3m/8 — inserting m/8 more items. | |||
* A shrink would require n to drop below m'/4 = m/8 — deleting m/8 more items. | |||
Either way, Θ(m) operations must occur before the next resize, preventing the thrashing scenario. | |||
With this strategy, both insertions and deletions have amortized cost Θ(1). | |||
===Amortization=== | ===Amortization=== | ||
An operation | An operation has ''amortized cost'' T(n) if any sequence of k such operations takes at most <math>k \cdot T(n)</math> time. | ||
k | With the doubling/halving strategy described above, k insertions (or k deletions) take Θ(k) total time, yielding O(1) amortized cost per operation. | ||
For a more detailed treatment including the accounting method and potential method, see [[Amortization]]. | |||
==Flags== | ==Flags== | ||
Latest revision as of 04:30, 8 June 2026
Hash Maps: Dynamic Resizing
Once we have our hash table implemented, we'll want to start adding lots of items. But we have to strike a balance.
The load factor α = n/m relates the number of items stored (n) to the number of slots (m, the output space of the hash function).
The smaller the output space, the smaller our map is in memory. But with chaining, a smaller m means a higher load factor, leading to longer linked lists (average length α = n/m). As α grows, operations drift from O(1) toward O(n).
On the other hand, if we make m too large, we waste memory on empty slots.
Dynamic resizing maintains the balance on-the-fly by growing and shrinking the table as items are inserted and deleted. The concept is similar to the dynamic array resizing used by the PythonList class in Java, which implements a dynamically expanding and shrinking array.
Growing the table
Growing the table from size m to size m' requires the following steps:
- Allocate a new table of size m'
- Build a new hash function h' suited to output space m'
- Rehash every item: for each key-value pair in T, compute h'(key) and insert into T'
The cost of a single resize is:
- O(n) at least (to rehash all n items)
- In general: O(n + m + m')
A new hash function is necessary because the original hash function was designed for output space m. Each key must be re-hashed to map into the new range.
How much to grow?
If we grow the table by a constant amount each time (e.g., m' = m + c), the cost of n insertions becomes a triangular number:
$ \Theta(1 + 2 + 3 + \dots + n) = \Theta(n^2) $
We resize frequently, and each resize touches every item — hence the quadratic total.
Doubling strategy: If we instead double the table size (m' = 2m), the cost of n insertions is dramatically better. The resize costs form a geometric series:
$ \Theta(1 + 2 + 4 + 8 + \dots + n) = \Theta(n) $
The series sums to at most 2n. Spread across n insertions, this gives O(1) amortized cost per insertion. Only O(log n) of the operations (the resizes themselves) are expensive.
Shrinking the table
When items are deleted and the load factor drops too low, the table should shrink to reclaim memory.
A naive approach would shrink by half whenever the table is less than half full (n < m/2). However, this causes thrashing: with alternating insertions and deletions near the boundary, every operation could trigger a Θ(n) resize.
Solution: shrink when the table is one-quarter full.
- When n < m/4, shrink the table by half (m' = m/2)
After shrinking from m to m/2 with n = m/4 items, the load factor becomes (m/4) / (m/2) = 1/2. To trigger another resize from this state:
- A grow would require n to exceed 3m'/4 = 3m/8 — inserting m/8 more items.
- A shrink would require n to drop below m'/4 = m/8 — deleting m/8 more items.
Either way, Θ(m) operations must occur before the next resize, preventing the thrashing scenario.
With this strategy, both insertions and deletions have amortized cost Θ(1).
Amortization
An operation has amortized cost T(n) if any sequence of k such operations takes at most $ k \cdot T(n) $ time.
With the doubling/halving strategy described above, k insertions (or k deletions) take Θ(k) total time, yielding O(1) amortized cost per operation.
For a more detailed treatment including the accounting method and potential method, see Amortization.
Flags
| Maps and Dictionaries Part of Computer Science Notes
Series on Data Structures
Maps/Dictionaries Maps · Maps/ADT · Maps in Java · Maps/OOP · Maps/Operations and Performance Map implementations: Maps/AbstractMap · Maps/UnsortedArrayMap · Maps/SortedArrayMap Dictionary implementations: Dictionaries/LinkedDict · Dictionaries/ArrayDict
Hashes Hash Maps/OOP · Hash Maps/Operations and Performance Hash Maps/Dynamic Resizing · Hash Maps/Collision Handling with Chaining Hash functions: Hash Functions · Hash Functions/Cyclic Permutation Hash map implementations: Hash Maps/AbstractHashMap · Hash Maps/ChainedHashMap
Skip Lists · Java/ConcurrentSkipList · Java implementations: SkipList
Sets Sets · Sets/ADT · Sets in Java · Sets/OOP · Multisets
|