From charlesreid1

(Created page with "==Hash Maps: Dynamic Resizing== Once we have our hash table implemented, we'll want to start adding lots of items. But then, we have to strike a balance. The smaller the out...")
 
(Rewrite for clarity and accuracy: introduce load factor, fix garbled deletion/thrashing section, clarify amortization, improve overall structure (via update-page on MediaWiki MCP Server))
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
==Hash Maps: Dynamic Resizing==
==Hash Maps: Dynamic Resizing==


Once we have our hash table implemented, we'll want to start adding lots of items. But then, we have to strike a balance.
Once we have our hash table implemented, we'll want to start adding lots of items. But we have to strike a balance.


The smaller the output space of the hash function, the smaller our map is in memory. But if we are using chaining, that leads to longer and longer linked lists, linked lists of size n/m, which gets further from O(1) as the linked lists get longer.
The '''load factor''' α = n/m relates the number of items stored (n) to the number of slots (m, the output space of the hash function).


On the other hand, if we make m too big, and make the output space of our hash function too large, we end up wasting space in our hash table.
The smaller the output space, the smaller our map is in memory. But with chaining, a smaller m means a higher load factor, leading to longer linked lists (average length α = n/m). As α grows, operations drift from O(1) toward O(n).


The concept behind expanding the hash table size dynamically is to maintain the balance on-the-fly. The concept is similar to the array resizing algorithm used by the [[Arrays/Java/PythonList]] PythonList class in Java, which implements a dynamically expanding/shrinking array.
On the other hand, if we make m too large, we waste memory on empty slots.


Growing the table size, from m to m', requires the following steps:
Dynamic resizing maintains the balance on-the-fly by growing and shrinking the table as items are inserted and deleted. The concept is similar to the dynamic array resizing used by the [[Arrays/Java/PythonList|PythonList]] class in Java, which implements a dynamically expanding and shrinking array.
* make table of size m'
* build new hash h'
* rehash
* for item in T: T'.insert(item)


How much time?
===Growing the table===
* O(n) at least
* In general, O(n + m + m')


Why new hash?
Growing the table from size m to size m' requires the following steps:
* Hash function is designed to have the specified output space of m
* Each key needs to be re-hashed


If n > m, we need to make the table bigger.
* Allocate a new table of size m'
* If we make the table a constant amount larger, the cost of n inserts is a triangular number, theta(1+2+3+4+...+n) = theta(n^2)
* Build a new hash function h' suited to output space m'
* If we make the table twice as large, cost of n insertions is amortized
* Rehash every item: for each key-value pair in T, compute h'(key) and insert into T'
* m' = 2m: cost of n inserts
* theta(1 + 2 + 4 + 8 + ... + n)
* theta(n)
* Only a few - log N - operations are expensive


Similarly, deletions should happen when we get to n < m/4 quarter-full table
The cost of a single resize is:
* Should shrink table size by half
 
* As we saw before, shrinking table size by quarter, we can have single table size or add/remove operation that costs O(N)
* O(n) at least (to rehash all n items)
* Cost of n inserts
* In general: O(n + m + m')
* Slow operation if we shrink by half when table is empty by half:
 
* slow operation is 2^k <--insert/delete--> (2^k) + 1
A new hash function is necessary because the original hash function was designed for output space m. Each key must be re-hashed to map into the new range.
* This is a Theta(n) operation
 
* When you are a quarter full, you shrink by half
'''How much to grow?'''
* Amortized time is Theta(1)
 
* Little bit tricky to prove...
If we grow the table by a constant amount each time (e.g., m' = m + c), the cost of n insertions becomes a triangular number:
 
<math>\Theta(1 + 2 + 3 + \dots + n) = \Theta(n^2)</math>
 
We resize frequently, and each resize touches every item — hence the quadratic total.
 
'''Doubling strategy:''' If we instead double the table size (m' = 2m), the cost of n insertions is dramatically better. The resize costs form a geometric series:
 
<math>\Theta(1 + 2 + 4 + 8 + \dots + n) = \Theta(n)</math>
 
The series sums to at most 2n. Spread across n insertions, this gives O(1) amortized cost per insertion. Only O(log n) of the operations (the resizes themselves) are expensive.
 
===Shrinking the table===
 
When items are deleted and the load factor drops too low, the table should shrink to reclaim memory.
 
A naive approach would shrink by half whenever the table is less than half full (n < m/2). However, this causes '''thrashing:''' with alternating insertions and deletions near the boundary, every operation could trigger a Θ(n) resize.
 
'''Solution: shrink when the table is one-quarter full.'''
 
* When n < m/4, shrink the table by half (m' = m/2)
 
After shrinking from m to m/2 with n = m/4 items, the load factor becomes (m/4) / (m/2) = 1/2. To trigger another resize from this state:
 
* A grow would require n to exceed 3m'/4 = 3m/8 — inserting m/8 more items.
* A shrink would require n to drop below m'/4 = m/8 — deleting m/8 more items.
 
Either way, Θ(m) operations must occur before the next resize, preventing the thrashing scenario.
 
With this strategy, both insertions and deletions have amortized cost Θ(1).


===Amortization===
===Amortization===


An operation takes "T(n) amortized" if k operations take <math> \leq k \cdot T(n)</math> time
An operation has ''amortized cost'' T(n) if any sequence of k such operations takes at most <math>k \cdot T(n)</math> time.
 
With the doubling/halving strategy described above, k insertions (or k deletions) take Θ(k) total time, yielding O(1) amortized cost per operation.


k inserts take theta(k) time, so this is O(1) amortized insert
For a more detailed treatment including the accounting method and potential method, see [[Amortization]].


==Flags==
==Flags==

Latest revision as of 04:30, 8 June 2026

Hash Maps: Dynamic Resizing

Once we have our hash table implemented, we'll want to start adding lots of items. But we have to strike a balance.

The load factor α = n/m relates the number of items stored (n) to the number of slots (m, the output space of the hash function).

The smaller the output space, the smaller our map is in memory. But with chaining, a smaller m means a higher load factor, leading to longer linked lists (average length α = n/m). As α grows, operations drift from O(1) toward O(n).

On the other hand, if we make m too large, we waste memory on empty slots.

Dynamic resizing maintains the balance on-the-fly by growing and shrinking the table as items are inserted and deleted. The concept is similar to the dynamic array resizing used by the PythonList class in Java, which implements a dynamically expanding and shrinking array.

Growing the table

Growing the table from size m to size m' requires the following steps:

  • Allocate a new table of size m'
  • Build a new hash function h' suited to output space m'
  • Rehash every item: for each key-value pair in T, compute h'(key) and insert into T'

The cost of a single resize is:

  • O(n) at least (to rehash all n items)
  • In general: O(n + m + m')

A new hash function is necessary because the original hash function was designed for output space m. Each key must be re-hashed to map into the new range.

How much to grow?

If we grow the table by a constant amount each time (e.g., m' = m + c), the cost of n insertions becomes a triangular number:

$ \Theta(1 + 2 + 3 + \dots + n) = \Theta(n^2) $

We resize frequently, and each resize touches every item — hence the quadratic total.

Doubling strategy: If we instead double the table size (m' = 2m), the cost of n insertions is dramatically better. The resize costs form a geometric series:

$ \Theta(1 + 2 + 4 + 8 + \dots + n) = \Theta(n) $

The series sums to at most 2n. Spread across n insertions, this gives O(1) amortized cost per insertion. Only O(log n) of the operations (the resizes themselves) are expensive.

Shrinking the table

When items are deleted and the load factor drops too low, the table should shrink to reclaim memory.

A naive approach would shrink by half whenever the table is less than half full (n < m/2). However, this causes thrashing: with alternating insertions and deletions near the boundary, every operation could trigger a Θ(n) resize.

Solution: shrink when the table is one-quarter full.

  • When n < m/4, shrink the table by half (m' = m/2)

After shrinking from m to m/2 with n = m/4 items, the load factor becomes (m/4) / (m/2) = 1/2. To trigger another resize from this state:

  • A grow would require n to exceed 3m'/4 = 3m/8 — inserting m/8 more items.
  • A shrink would require n to drop below m'/4 = m/8 — deleting m/8 more items.

Either way, Θ(m) operations must occur before the next resize, preventing the thrashing scenario.

With this strategy, both insertions and deletions have amortized cost Θ(1).

Amortization

An operation has amortized cost T(n) if any sequence of k such operations takes at most $ k \cdot T(n) $ time.

With the doubling/halving strategy described above, k insertions (or k deletions) take Θ(k) total time, yielding O(1) amortized cost per operation.

For a more detailed treatment including the accounting method and potential method, see Amortization.

Flags