Revision as of 20:39, 21 June 2017

Notes

Using the term "map" in place of the term "dictionary" is a hat tip to the mathematical nature of a map. The mathematical definition of a function is, a mapping from one set (the domain) onto another (the range) such that one input corresponds to exactly one output. This is a bijective mapping between two sets. This is all that is required of a function - one input gives one output. Likewise, a programming map is define as a key-value data structure, in which an input key can be used to find the resulting value in O(1) time.

Maps can be implemented multiple ways. They can be stored as a simple unsorted array of key-value item objects. They can be stored in a sorted array of key-value items. They can be stored with a tree. They can be stored with a hash table. etc.

While these data structures are nominally a little bit different from Dictionaries - dictionaries return positions in a data structure given an input object x) - they are ultimately a generalization of them (the position can be thought of as the value stored by the map, while the input object x is the key) and therefore identical.

The Motivation

With a map or dictionary, our goal is to create a data structure that can be searched in O(1) time.

The simplest possible map, if we could design the perfect map, would simply have keys that were integers 0 to N-1, and the values stored in the array at the index of its key. This would give O(1) lookup: the user says "give me item corresponding to key 1" and programming language says "here is array[1]". This is great, because by using an array and using the key as the index of the array, we have made everything O(1) fast. This is not so great, because there will be huge memory problems, not to mention, we didn't decide how to assign keys.

Problems to resolve:

1. Real keys may not be small non-negative integers

2. Potentially gigantic memory hog (if we want to store two items, one with key 2 and one with key 65,110,245,336,002

Prehash - assigning integers to objects

Assigning unique keys to items that are not necessarily nonnegative integers can be solved by solving the reverse problem: turn arbitrary objects (keys) into nonnegative integers.

We can argue that this is "easy" from a theoretical standpoint:

Any item in memory can be represented (written down) as a sequence of 0's and 1's (existential statement about objects/variables/data)
0s and 1s are numbers, in binary, so we can convert them to numbers, in decimal.

(Note the similarities between this approach and Gödel's approach with his Incompleteness Theorem of assigning numbers to mathematical proofs/theories.)

This gives us a way of turning arbitrary objects/data into arbitrary integers.

hash('\0B') == hash'\0\0C') == 64

Ideally: hash(x) = hash(y) <==> x = y

Python details:

hash(x) is the prehash of x
Can use __hash__ builtin to adjust the way the hash is computed
Challenges with adding mutable items to dictionaries - your hash function must not change over time (Python cannot protect you from that)
You cannot use a MUTABLE LIST as a key value in a Python dictionary, because it can potentially change over time

Hashes and lists and Python

Here's an interview question that requires packing up a lot of knowledge into a single answer: "Why doesn't the following code work?"

a = [1,2,3]
b = [4,5,6]
d = {}
d[a] = 'The first list'
d[b] = 'The second list'

Why doesn't this work?

Doesn't work because lists are mutable.
Basics of dictionaries in Python: they are key-value data structures where the key can be any type we'd like.
If we pass a mutable type as a key, we are going to change the input to the hash table/map, and therefore change the output location, and therefore lose our data.
To prevent this, need to define an immutable list, or use an immutable object, or define a __hash__ function for any key types that we use that will return the same hash regardless of content.
Example: we might pass a customer object to a database application. The customer may change (e.g., phone number, address, etc.), but this should not affect hash of object, otherwise looks like entirely new "Customer". Could luse customer ID only when computing hash function.

Solving memory hog issue

To solve the issue of hash tables being memory hogs, we have a second step that we call hashing. (Prehashing is turning keys into numbers; hashing is turning numbers into tinier sets of numbers.) Like turning a potato into hash browns. Like turning tofu into tofu crumble. Like turning large boulders into sand.

Arabic hash - grass

German hash - hashae - hatchet

Want to map a giant key space to a hash table with slots 0 to n-1. Need a hash function that will turn key 0 into position 0, key 1 into position 1, etc.

Idea: $m = \Theta(n)$

where n is number of keys being stored in dictionary. This is the big challenge. We solve it using hash functions.

Hash function:

$h = \mathbb{U} \rightarrow \mathbb{K}, \mbox{ where } \mathbb{K} = set( 0, 1, \dots, n-1 )$

Collision Handling with Chaining

Chaining is a way of using linked lists to deal with the problem of turning a huge keyspace with a tiny number of keys into actual usable slots in an array.

Normally, if you turn a keyspace into usable slots in an array, you have a tradeoff between a large range (meaning a larger number of potential slots, and therefore higher memory usage) and a small range (meaning a smaller number of potential slots, and therefore a higher probability of colliding hashes).

Chaining deals with this problem by erring on the side of a small range (and therefore more efficient memory usage), and dealing with collisions by storing buckets of values in linked lists, rather than storing the single values themselves.

So, I have a high probability of two keys (say, the strings "peanut butter" and "jelly") mapping to the same two keys in the array (say, the integer 10). But I can deal with this by placing the data corresponding to the key "peanut butter" in a linked list node and add it to a linked list stored at array index 10. Then, when I pass in "jelly" data, I get the same slot in the array (10), but create another linked list node with data corresponding to key "jelly" and add it to the same linked list at array index 10.

Worst case: all your keys have exactly the same hash, so you just have one big long linked list of size n. Then the complexity class is $\Theta(n)$ .

In the worst case, hashing sucks. But normally, it works really well, because of randomization.

Expected length of chain

Analysis: What is the expected length of a chain of hashed items, if we have n keys and m slots?

Don't overthink this! The expected length of a chain of hashed items assumes you have a good hash function, so it assumes a perfectly uniform distribution of n items into m slots. So each of the m slots gets n/m.

The probability of one key going into one slot is 1/m. The probability of each key is independent. Each slot gets a 1/m chance of receiving the ith key, so they see a probability of getting a key of (1/m) + (1/m) + (1/m) + (1/m) + ...

This is something we want to control. We call this the load factor.

The runtime of retrieving all the keys from a slot therefore becomes (n/m), so if $m = \Theta(n)$ we can say that retrieval is $\Theta(1)$

More correctly, retrieval is $\Theta(1 + \alpha)$ where alpha is the load factor

Simple uniform hashing

Simple uniform hashing is basically the hashing equivalent of the assumption that running any single operation has a constant O(1) cost. It makes big O analysis easier. Likewise, to analyze the performance of chaining, we want to be able to use simple uniform hashing to simplify analysis.

This gives us an average case. If we assume anything less than uniform, we end up with worse- to worst-case scenarios.

Good hash functions

Maps ADT

See Maps/ADT for abstract data type/interface specification for Maps.

See Java API docs page for Map class for Java Collections interface class Map: http://docs.oracle.com/javase/8/docs/api/java/util/Map.html

Example: word counts

A canonical example of maps is performing word counts for text files. Each word is added to the dictionary as a key, and the word count is the value. Keys are added with a default key value of 1, and if a word already exists as a key its value is incremented.

Pseudocode:

create empty map
for each word in input file:
    if(word in map keys):
        increment value at key word
    else:
        add new pair (word, 1) to map

Map classification

Map types are organized as follows:

Unsorted maps
Hash maps
Sorted maps

See implementations below.

Map Implementations

For implementations of maps, see the following:

Map abstract data type:

Maps/AbstractMap

Map base class (implements most of the ADT above):

Maps/MapBase

Map implementations using arrays:

Map implementations using hash tables:

Flags

Maps: Difference between revisions

From charlesreid1