From charlesreid1

Hash Maps: Collision Handling with Chaining

Chaining is a way of using linked lists to deal with the problem of turning a huge keyspace with a tiny number of keys into actual usable slots in an array.

Normally, if you turn a keyspace into usable slots in an array, you have a tradeoff between a large range (meaning a larger number of potential slots, and therefore higher memory usage) and a small range (meaning a smaller number of potential slots, and therefore a higher probability of colliding hashes).

Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.

So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same two keys in the array (say, the integer 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.

Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is \Theta(n).

In the worst case, hashing sucks. But normally, it works really well, because of randomization.

Note: see Theta versus Big O for a discussion of when to use \Theta( f(N) ) versus O( f(N) ).

Expected length of chain

Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?

Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m.

The probability of one key going into one slot is 1/m. The probability of each key is independent. Each slot gets a 1/m chance of receiving the ith key, so they see a probability of getting a key of (1/m) + (1/m) + (1/m) + (1/m) + ...

This is something we want to control. We call this the load factor.

The runtime of retrieving all the keys from a slot therefore becomes (n/m), so if m = \Theta(n) we can say that retrieval is \Theta(1)

More correctly, retrieval is \Theta(1 + \lambda) where lambda is the load factor

Simple uniform hashing

Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.

This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.

Good hash functions (compression functions)

Most common:

  • Division method
  • Multiplication method
  • MAD (multiply-and-divide) method/Universal hashing

Division method:

h(k) = k \mod m

We require m to be prime to minimize collisions, and be far from powers of 2 or 10 (more commonly occurring)

Multiplication method:

h(k) = a k \mod 2^w > > (w-r)

w bit machine. right arrows mean shift it right (w-r) bits.

MAD Method/Universal hashing:

Map a key to:

h(k) = \left( (ak + b) \mod p \right) \mod m

where a and b are integers chosen at random from interval 0 to p-1, p is a prime number larger than N.

For the worst case keys, k1 != k2,

Pr \left( h(k_1) = h(k_2) \right) = \frac{1}{m}


We implement a chained collision handling hash table here: link:

In this particular implementation, we did not, strictly speaking, follow the "traditional" chained hash table by using a linked list to store items in different buckets - instead, we used the UnsortedArrayMap class, which is basically (internally) an ArrayList of MapItem objects (which are the key-value pairs that the map stores).