Maps: Difference between revisions
From charlesreid1
| Line 117: | Line 117: | ||
Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis. | Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis. | ||
This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios. | |||
===Good hash functions=== | ===Good hash functions=== | ||
Revision as of 20:39, 21 June 2017
See also: Dictionaries
Maps are value-based data structures that create a bijection from one set onto another, such that one input key yields one corresponding output value.
Notes
Using the term "map" in place of the term "dictionary" is a hat tip to the mathematical nature of a map. The mathematical definition of a function is, a mapping from one set (the domain) onto another (the range) such that one input corresponds to exactly one output. This is a bijective mapping between two sets. This is all that is required of a function - one input gives one output. Likewise, a programming map is define as a key-value data structure, in which an input key can be used to find the resulting value in O(1) time.
Maps can be implemented multiple ways. They can be stored as a simple unsorted array of key-value item objects. They can be stored in a sorted array of key-value items. They can be stored with a tree. They can be stored with a hash table. etc.
While these data structures are nominally a little bit different from Dictionaries - dictionaries return positions in a data structure given an input object x) - they are ultimately a generalization of them (the position can be thought of as the value stored by the map, while the input object x is the key) and therefore identical.
The Motivation
With a map or dictionary, our goal is to create a data structure that can be searched in O(1) time.
The simplest possible map, if we could design the perfect map, would simply have keys that were integers 0 to N-1, and the values stored in the array at the index of its key. This would give O(1) lookup: the user says "give me item corresponding to key 1" and programming language says "here is array[1]". This is great, because by using an array and using the key as the index of the array, we have made everything O(1) fast. This is not so great, because there will be huge memory problems, not to mention, we didn't decide how to assign keys.
Problems to resolve:
1. Real keys may not be small non-negative integers
2. Potentially gigantic memory hog (if we want to store two items, one with key 2 and one with key 65,110,245,336,002
Prehash - assigning integers to objects
Assigning unique keys to items that are not necessarily nonnegative integers can be solved by solving the reverse problem: turn arbitrary objects (keys) into nonnegative integers.
We can argue that this is "easy" from a theoretical standpoint:
- Any item in memory can be represented (written down) as a sequence of 0's and 1's (existential statement about objects/variables/data)
- 0s and 1s are numbers, in binary, so we can convert them to numbers, in decimal.
(Note the similarities between this approach and Gödel's approach with his Incompleteness Theorem of assigning numbers to mathematical proofs/theories.)
This gives us a way of turning arbitrary objects/data into arbitrary integers.
hash('\0B') == hash'\0\0C') == 64
Ideally: hash(x) = hash(y) <==> x = y
Python details:
- hash(x) is the prehash of x
- Can use
__hash__builtin to adjust the way the hash is computed - Challenges with adding mutable items to dictionaries - your hash function must not change over time (Python cannot protect you from that)
- You cannot use a MUTABLE LIST as a key value in a Python dictionary, because it can potentially change over time
Hashes and lists and Python
Here's an interview question that requires packing up a lot of knowledge into a single answer: "Why doesn't the following code work?"
a = [1,2,3]
b = [4,5,6]
d = {}
d[a] = 'The first list'
d[b] = 'The second list'
Why doesn't this work?
- Doesn't work because lists are mutable.
- Basics of dictionaries in Python: they are key-value data structures where the key can be any type we'd like.
- If we pass a mutable type as a key, we are going to change the input to the hash table/map, and therefore change the output location, and therefore lose our data.
- To prevent this, need to define an immutable list, or use an immutable object, or define a __hash__ function for any key types that we use that will return the same hash regardless of content.
- Example: we might pass a customer object to a database application. The customer may change (e.g., phone number, address, etc.), but this should not affect hash of object, otherwise looks like entirely new "Customer". Could luse customer ID only when computing hash function.
Solving memory hog issue
To solve the issue of hash tables being memory hogs, we have a second step that we call hashing. (Prehashing is turning keys into numbers; hashing is turning numbers into tinier sets of numbers.) Like turning a potato into hash browns. Like turning tofu into tofu crumble. Like turning large boulders into sand.
Arabic hash - grass
German hash - hashae - hatchet
Want to map a giant key space to a hash table with slots 0 to n-1. Need a hash function that will turn key 0 into position 0, key 1 into position 1, etc.
Idea: $ m = \Theta(n) $
where n is number of keys being stored in dictionary. This is the big challenge. We solve it using hash functions.
Hash function:
$ h = \mathbb{U} \rightarrow \mathbb{K}, \mbox{ where } \mathbb{K} = set( 0, 1, \dots, n-1 ) $
Collision Handling with Chaining
Chaining is a way of using linked lists to deal with the problem of turning a huge keyspace with a tiny number of keys into actual usable slots in an array.
Normally, if you turn a keyspace into usable slots in an array, you have a tradeoff between a large range (meaning a larger number of potential slots, and therefore higher memory usage) and a small range (meaning a smaller number of potential slots, and therefore a higher probability of colliding hashes).
Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.
So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same two keys in the array (say, the integer 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.
Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is $ \Theta(n) $.
In the worst case, hashing sucks. But normally, it works really well, because of randomization.
Expected length of chain
Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?
Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m.
The probability of one key going into one slot is 1/m. The probability of each key is independent. Each slot gets a 1/m chance of receiving the ith key, so they see a probability of getting a key of (1/m) + (1/m) + (1/m) + (1/m) + ...
This is something we want to control. We call this the load factor.
The runtime of retrieving all the keys from a slot therefore becomes (n/m), so if $ m = \Theta(n) $ we can say that retrieval is $ \Theta(1) $
More correctly, retrieval is $ \Theta(1 + \alpha) $ where alpha is the load factor
Simple uniform hashing
Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.
This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.
Good hash functions
Maps ADT
See Maps/ADT for abstract data type/interface specification for Maps.
See Java API docs page for Map class for Java Collections interface class Map: http://docs.oracle.com/javase/8/docs/api/java/util/Map.html
Example: word counts
A canonical example of maps is performing word counts for text files. Each word is added to the dictionary as a key, and the word count is the value. Keys are added with a default key value of 1, and if a word already exists as a key its value is incremented.
Pseudocode:
create empty map
for each word in input file:
if(word in map keys):
increment value at key word
else:
add new pair (word, 1) to map
Map classification
Map types are organized as follows:
- Unsorted maps
- Hash maps
- Sorted maps
See implementations below.
Map Implementations
For implementations of maps, see the following:
Map abstract data type:
Map base class (implements most of the ADT above):
Map implementations using arrays:
Map implementations using hash tables:
Flags
| Maps and Dictionaries Part of Computer Science Notes
Series on Data Structures
Maps/Dictionaries Maps · Maps/ADT · Maps in Java · Maps/OOP · Maps/Operations and Performance Map implementations: Maps/AbstractMap · Maps/UnsortedArrayMap · Maps/SortedArrayMap Dictionary implementations: Dictionaries/LinkedDict · Dictionaries/ArrayDict
Hashes Hash Maps/OOP · Hash Maps/Operations and Performance Hash Maps/Dynamic Resizing · Hash Maps/Collision Handling with Chaining Hash functions: Hash Functions · Hash Functions/Cyclic Permutation Hash map implementations: Hash Maps/AbstractHashMap · Hash Maps/ChainedHashMap
Skip Lists · Java/ConcurrentSkipList · Java implementations: SkipList
Sets Sets · Sets/ADT · Sets in Java · Sets/OOP · Multisets
|