From charlesreid1

(Fix mathematical inaccuracy (maps are functions, not bijections); fix hash example syntax; fix Gödel reference; correct etymology of "hash" (via update-page on MediaWiki MCP Server))
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
See also: [[Dictionaries]]
See also: [[Dictionaries]]


Maps are value-based data structures that create a bijection from one set onto another, such that one input key yields one corresponding output value.
Maps are value-based data structures that create a mapping from one set (the domain) onto another (the range), such that one input key yields one corresponding output value.


=Notes=
=Notes=


Using the term "map" in place of the term "dictionary" is a hat tip to the mathematical nature of a map. The mathematical definition of a function is, a mapping from one set (the domain) onto another (the range) such that one input corresponds to exactly one output. This is a bijective mapping between two sets. This is all that is required of a function - one input gives one output. Likewise, a programming map is define as a key-value data structure, in which an input key can be used to find the resulting value in O(1) time.
Using the term "map" in place of the term "dictionary" is a hat tip to the mathematical nature of a map. The mathematical definition of a function is, a mapping from one set (the domain) onto another (the range) such that one input corresponds to exactly one output. This is all that is required of a function - one input gives one output. Likewise, a programming map is defined as a key-value data structure, in which an input key can be used to find the resulting value in O(1) time.
 
Note that a map is a function (each key maps to exactly one value), but it is not necessarily a bijection: multiple distinct keys may map to the same value, and not every value in the codomain need have a corresponding key. The term "bijection" would require a one-to-one correspondence in both directions, which maps do not guarantee.


Maps can be implemented multiple ways. They can be stored as a simple unsorted array of key-value item objects. They can be stored in a sorted array of key-value items. They can be stored with a tree. They can be stored with a hash table. etc.
Maps can be implemented multiple ways. They can be stored as a simple unsorted array of key-value item objects. They can be stored in a sorted array of key-value items. They can be stored with a tree. They can be stored with a hash table. etc.
Line 33: Line 35:
* 0s and 1s are numbers, in binary, so we can convert them to numbers, in decimal.
* 0s and 1s are numbers, in binary, so we can convert them to numbers, in decimal.


(Note the similarities between this approach and Gödel's approach with his [[Incompleteness Theorem]] of assigning numbers to mathematical proofs/theories.)
(Note the similarities between this approach and Gödel's approach with his [[Incompleteness Theorems]] of assigning numbers to mathematical proofs/theories.)


This gives us a way of turning arbitrary objects/data into arbitrary integers.
This gives us a way of turning arbitrary objects/data into arbitrary integers.


<pre>
<pre>
hash('\0B') == hash'\0\0C') == 64
hash('\0B') == hash('\0\0C') == 64
</pre>
</pre>


Line 66: Line 68:
* If we pass a mutable type as a key, we are going to change the input to the hash table/map, and therefore change the output location, and therefore lose our data.
* If we pass a mutable type as a key, we are going to change the input to the hash table/map, and therefore change the output location, and therefore lose our data.
* To prevent this, need to define an immutable list, or use an immutable object, or define a __hash__ function for any key types that we use that will return the same hash regardless of content.
* To prevent this, need to define an immutable list, or use an immutable object, or define a __hash__ function for any key types that we use that will return the same hash regardless of content.
* Example: we might pass a customer object to a database application. The customer may change (e.g., phone number, address, etc.), but this should not affect hash of object, otherwise looks like entirely new "Customer". Could luse customer ID only when computing hash function.
* Example: we might pass a customer object to a database application. The customer may change (e.g., phone number, address, etc.), but this should not affect hash of object, otherwise looks like entirely new "Customer". Could use customer ID only when computing hash function.


===Solving memory hog issue===
===Solving memory hog issue===
Line 72: Line 74:
To solve the issue of hash tables being memory hogs, we have a second step that we call hashing. (Prehashing is turning keys into numbers; hashing is turning numbers into tinier sets of numbers.) Like turning a potato into hash browns. Like turning tofu into tofu crumble. Like turning large boulders into sand.  
To solve the issue of hash tables being memory hogs, we have a second step that we call hashing. (Prehashing is turning keys into numbers; hashing is turning numbers into tinier sets of numbers.) Like turning a potato into hash browns. Like turning tofu into tofu crumble. Like turning large boulders into sand.  


Arabic hash - grass
The word "hash" comes from the French ''hacher'' ("to chop into small pieces"), from Old French ''hache'' ("axe"), which is of Germanic origin and related to "hatchet."
 
German hash - hashae - hatchet


Want to map a giant key space to a hash table with slots 0 to n-1. Need a hash function that will turn key 0 into position 0, key 1 into position 1, etc.
Want to map a giant key space to a hash table with slots 0 to n-1. Need a hash function that will turn key 0 into position 0, key 1 into position 1, etc.
Line 90: Line 90:
==Collision Handling with Chaining==
==Collision Handling with Chaining==


Chaining is a way of using linked lists to deal with the problem of turning a huge keyspace with a tiny number of keys into actual usable slots in an array.
{{Main|Hash Maps/Collision Handling with Chaining}}
 
Normally, if you turn a keyspace into usable slots in an array, you have a tradeoff between a large range (meaning a larger number of potential slots, and therefore higher memory usage) and a small range (meaning a smaller number of potential slots, and therefore a higher probability of colliding hashes).
 
Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.
 
So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same two keys in the array (say, the integer 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.
 
Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is <math>\Theta(n)</math>.
 
In the worst case, hashing sucks. But normally, it works really well, because of randomization.
 
===Expected length of chain===
 
Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?
 
Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m.
 
The probability of one key going into one slot is 1/m. The probability of each key is independent. Each slot gets a 1/m chance of receiving the ith key, so they see a probability of getting a key of (1/m) + (1/m) + (1/m) + (1/m) + ...
 
This is something we want to control. We call this the ''load factor''.
 
The runtime of retrieving all the keys from a slot therefore becomes (n/m), so if <math>m = \Theta(n)</math> we can say that retrieval is <math>\Theta(1)</math>
 
More correctly, retrieval is <math>\Theta(1 + \lambda)</math> where lambda is the load factor
 
===Simple uniform hashing===
 
Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.
 
This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.
 
===Good hash functions (compression functions)===
 
Most common:
* Division method
* Multiplication method
* MAD (multiply-and-divide) method/Universal hashing
 
Division method:
 
<math>
h(k) = k \mod m
</math>
 
We require m to be prime to minimize collisions, and be far from powers of 2 or 10 (more commonly occurring)
 
Multiplication method:
 
<math>
h(k) = a k \mod 2^w > > (w-r)
</math>
 
w bit machine. right arrows mean shift it right (w-r) bits.
 
MAD Method/Universal hashing:
 
Map a key to:
 
<math>
h(k) = \left( (ak + b) \mod p \right) \mod m
</math>
 
where a and b are integers chosen at random from interval 0 to p-1, p is a prime number larger than N.
 
For the worst case keys, k1 != k2,
 
<math>
Pr \left( h(k_1) = h(k_2) \right) = \frac{1}{m}
</math>


==Hash Table Expansion==
==Hash Table Expansion==


Once we have our hash table implemented, we'll want to start adding lots of items. But then, we have to strike a balance.
{{Main|Hash Maps/Dynamic Resizing}}
 
The smaller the output space of the hash function, the smaller our map is in memory. But if we are using chaining, that leads to longer and longer linked lists, linked lists of size n/m, which gets further from O(1) as the linked lists get longer.
 
On the other hand, if we make m too big, and make the output space of our hash function too large, we end up wasting space in our hash table.
 
The concept behind expanding the hash table size dynamically is to maintain the balance on-the-fly. The concept is similar to the array resizing algorithm used by the [[Arrays/Java/PythonList]] PythonList class in Java, which implements a dynamically expanding/shrinking array.
 
Growing the table size, from m to m', requires the following steps:
* make table of size m'
* build new hash h'
* rehash
* for item in T: T'.insert(item)
 
How much time?
* O(n) at least
* In general, O(n + m + m')
 
Why new hash?
* Hash function is designed to have the specified output space of m
* Each key needs to be re-hashed
 
If n > m, we need to make the table bigger.
* If we make the table a constant amount larger, the cost of n inserts is a triangular number, theta(1+2+3+4+...+n) = theta(n^2)
* If we make the table twice as large, cost of n insertions is amortized
* m' = 2m: cost of n inserts
* theta(1 + 2 + 4 + 8 + ... + n)
* theta(n)
* Only a few - log N - operations are expensive
 
Similarly, deletions should happen when we get to n < m/4 quarter-full table
* Should shrink table size by half
* As we saw before, shrinking table size by quarter, we can have single table size or add/remove operation that costs O(N)
* Cost of n inserts
* Slow operation if we shrink by half when table is empty by half:
* slow operation is 2^k <--insert/delete--> (2^k) + 1
* This is a Theta(n) operation
* When you are a quarter full, you shrink by half
* Amortized time is Theta(1)
* Little bit tricky to prove...
 
===Amortization===
 
An operation takes "T(n) amortized" if k operations take <math> \leq k \cdot T(n)</math> time
 
k inserts take theta(k) time, so this is O(1) amortized insert


==Maps ADT==
==Maps ADT==
Line 245: Line 131:


Map abstract data type:
Map abstract data type:
* [[Maps/AbstractMap]]
* [[Maps/ADT]] - Map interface class
 
* [[Maps/AbstractMap]] - AbstractMap abstract class
Map base class (implements most of the ADT above):
* [[Maps/MapBase]]


Map implementations using arrays:
Map implementations using arrays:
* [[Maps/UnsortedArrayMap]]
* [[Maps/UnsortedArrayMap]] - concrete implementation of Map using unsorted ArrayList
* [[Maps/SortedArrayMap]]
* [[Maps/SortedArrayMap]] - concrete implementation of Map using sorted ArrayList


Map implementations using hash tables:
Map implementations using hash tables:
* [[Maps/HashMapBase]]
* [[Hash Maps/AbstractHashMap]] - abstract class defining further details of implementing a map using a hash table
* [[Maps/ChainHashMap]]
* [[Hash Maps/ChainHashMap]] - concrete implementation of hash map using a chained hash map
* [[Maps/ProbeHashMap]]
* [[Hash Maps/ProbeHashMap]] - concrete implementation of hash map using linear probing.


=Flags=
=Flags=


{{MapsFlag}}
{{MapsFlag}}

Latest revision as of 04:06, 8 June 2026

See also: Dictionaries

Maps are value-based data structures that create a mapping from one set (the domain) onto another (the range), such that one input key yields one corresponding output value.

Notes

Using the term "map" in place of the term "dictionary" is a hat tip to the mathematical nature of a map. The mathematical definition of a function is, a mapping from one set (the domain) onto another (the range) such that one input corresponds to exactly one output. This is all that is required of a function - one input gives one output. Likewise, a programming map is defined as a key-value data structure, in which an input key can be used to find the resulting value in O(1) time.

Note that a map is a function (each key maps to exactly one value), but it is not necessarily a bijection: multiple distinct keys may map to the same value, and not every value in the codomain need have a corresponding key. The term "bijection" would require a one-to-one correspondence in both directions, which maps do not guarantee.

Maps can be implemented multiple ways. They can be stored as a simple unsorted array of key-value item objects. They can be stored in a sorted array of key-value items. They can be stored with a tree. They can be stored with a hash table. etc.

While these data structures are nominally a little bit different from Dictionaries - dictionaries return positions in a data structure given an input object x) - they are ultimately a generalization of them (the position can be thought of as the value stored by the map, while the input object x is the key) and therefore identical.

Good MIT lecture on hashing with chaining: https://www.youtube.com/watch?v=0M_kIqhwbFo

The Motivation

With a map or dictionary, our goal is to create a data structure that can be searched in O(1) time.

The simplest possible map, if we could design the perfect map, would simply have keys that were integers 0 to N-1, and the values stored in the array at the index of its key. This would give O(1) lookup: the user says "give me item corresponding to key 1" and programming language says "here is array[1]". This is great, because by using an array and using the key as the index of the array, we have made everything O(1) fast. This is not so great, because there will be huge memory problems, not to mention, we didn't decide how to assign keys.

Problems to resolve:

1. Real keys may not be small non-negative integers

2. Potentially gigantic memory hog (if we want to store two items, one with key 2 and one with key 65,110,245,336,002

Prehash - assigning integers to objects

Assigning unique keys to items that are not necessarily nonnegative integers can be solved by solving the reverse problem: turn arbitrary objects (keys) into nonnegative integers.

We can argue that this is "easy" from a theoretical standpoint:

  • Any item in memory can be represented (written down) as a sequence of 0's and 1's (existential statement about objects/variables/data)
  • 0s and 1s are numbers, in binary, so we can convert them to numbers, in decimal.

(Note the similarities between this approach and Gödel's approach with his Incompleteness Theorems of assigning numbers to mathematical proofs/theories.)

This gives us a way of turning arbitrary objects/data into arbitrary integers.

hash('\0B') == hash('\0\0C') == 64

Ideally: hash(x) = hash(y) <==> x = y

Python details:

  • hash(x) is the prehash of x
  • Can use __hash__ builtin to adjust the way the hash is computed
  • Challenges with adding mutable items to dictionaries - your hash function must not change over time (Python cannot protect you from that)
  • You cannot use a MUTABLE LIST as a key value in a Python dictionary, because it can potentially change over time

Hashes and lists and Python

Here's an interview question that requires packing up a lot of knowledge into a single answer: "Why doesn't the following code work?"

a = [1,2,3]
b = [4,5,6]
d = {}
d[a] = 'The first list'
d[b] = 'The second list'

Why doesn't this work?

  • Doesn't work because lists are mutable.
  • Basics of dictionaries in Python: they are key-value data structures where the key can be any type we'd like.
  • If we pass a mutable type as a key, we are going to change the input to the hash table/map, and therefore change the output location, and therefore lose our data.
  • To prevent this, need to define an immutable list, or use an immutable object, or define a __hash__ function for any key types that we use that will return the same hash regardless of content.
  • Example: we might pass a customer object to a database application. The customer may change (e.g., phone number, address, etc.), but this should not affect hash of object, otherwise looks like entirely new "Customer". Could use customer ID only when computing hash function.

Solving memory hog issue

To solve the issue of hash tables being memory hogs, we have a second step that we call hashing. (Prehashing is turning keys into numbers; hashing is turning numbers into tinier sets of numbers.) Like turning a potato into hash browns. Like turning tofu into tofu crumble. Like turning large boulders into sand.

The word "hash" comes from the French hacher ("to chop into small pieces"), from Old French hache ("axe"), which is of Germanic origin and related to "hatchet."

Want to map a giant key space to a hash table with slots 0 to n-1. Need a hash function that will turn key 0 into position 0, key 1 into position 1, etc.

Idea: $ m = \Theta(n) $

where n is number of keys being stored in dictionary. This is the big challenge. We solve it using hash functions.

Hash function:

$ h = \mathbb{U} \rightarrow \mathbb{K}, \mbox{ where } \mathbb{K} = set( 0, 1, \dots, n-1 ) $

Collision Handling with Chaining

Hash Table Expansion

Maps ADT

See Maps/ADT for abstract data type/interface specification for Maps.

See Java API documentation for interface class Map: http://docs.oracle.com/javase/8/docs/api/java/util/Map.html

Example: word counts

A canonical example of maps is performing word counts for text files. Each word is added to the dictionary as a key, and the word count is the value. Keys are added with a default key value of 1, and if a word already exists as a key its value is incremented.

Pseudocode:

create empty map
for each word in input file:
    if(word in map keys):
        increment value at key word
    else:
        add new pair (word, 1) to map

Map classification

Map types are organized as follows:

  • Unsorted maps
  • Hash maps
  • Sorted maps

See implementations below.

Map Implementations

For implementations of maps, see the following:

Map abstract data type:

Map implementations using arrays:

Map implementations using hash tables:

Flags