Hash Functions

Hash functions have two principal applications: cryptographic hash functions, and non-cryptographic hash functions.

Cryptographic hash functions must have a much lower probability of collision, since a collision could result in identity theft or forged documents.

Non-cryptographic hash functions are mainly used for mapping keys to values, so a collision is an inconvenience rather than a problem.

Non-Crypto Hash Functions

Trivial Hash Functions

Trivial hash functions use the data itself as the hash value. This works if the data are small in value or limited in length. (Example: hashing integers below 100. Can just use the integers themselves as their own index in the hash collection.)

Checksums

For longer sequences of data, be they strings or bytes, the approach is to break the sequence into small parts, and to make the output dependent on each piece, and each piece dependent on the others. This can be done with an iterative algorithm that maintains a "state" (integer or set of integers) modified by each chunk of the stream.

For example:

# Start by initializing the state
S = S0

# For each piece of data,
for k in range(1,m):
    # fold the current chunk of data into the state
    S = modifyHash(S, b[k])
# Extract the hash value from the state
return returnHash(S, n)

Perfect Hashing

Perfect hash functions, as indicated by the term "perfect," fulfill everything that hash functions must do. Specifically, perfect hash functions are injective and map each valid input to a different valid output.

(If the map were a function, we would say the function is a one-to-one function.)

Cyclic Permutation Hash Function

Also see Hash Functions/Cyclic Permutation

The cyclic permutation hash function reduces data to a binary representation, then cycles through the digits, for example shifting 10011 two digits to the right results in 11100.

Minimal Perfect Hashing

A minimal perfect hash maps every valid input to a different valid output, but do so in a way that requires minimal space, so that all of the outputs are consecutive.

A minimal perfect hash is very challenging to find.

Avoiding Collisions

If you're using a simpler hash function, it is possible to determine the effect that similar inputs will have on the output. For example, if hashing phone numbers, most of the leading digits will be similar. Any method that depends heavily on the leading digits is likely to lead to higher collisions.

Therefore, the distribution of data can have an affect on the hash function and make different hash functions better or worse choices. Ultimately it is important to test your hash function of interest on your data set of interest.

What Hash Function Does HashMap Use?

You may be wondering what hash function the HashMap class in the Java Collections API uses. We can take a look at the OpenJDK source code for this class here: https://github.com/openjdk/jdk7-jdk/blob/master/src/share/classes/java/util/HashMap.java

In particular, shortly after the constructors, on line 264, the hash function is defined: https://github.com/openjdk/jdk7-jdk/blob/master/src/share/classes/java/util/HashMap.java#L264

As the comment states, this hash function is applied to the object's built-in hashCode result, and not to the object itself.

    /**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because HashMap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

Immediately after the hash function is the indexFor function, which is used to transform a hash code in a table into an array index, the critical mapping step:

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
}

Finally, these methods come together in the get() method of the HashMap, which is defined shortly after on line 314: https://github.com/openjdk/jdk7-jdk/blob/master/src/share/classes/java/util/HashMap.java#L314

    /**
     * Returns the value to which the specified key is mapped,
     * or {@code null} if this map contains no mapping for the key.
     *
     * <p>More formally, if this map contains a mapping from a key
     * {@code k} to a value {@code v} such that {@code (key==null ? k==null :
     * key.equals(k))}, then this method returns {@code v}; otherwise
     * it returns {@code null}.  (There can be at most one such mapping.)
     *
     * <p>A return value of {@code null} does not <i>necessarily</i>
     * indicate that the map contains no mapping for the key; it's also
     * possible that the map explicitly maps the key to {@code null}.
     * The {@link #containsKey containsKey} operation may be used to
     * distinguish these two cases.
     *
     * @see #put(Object, Object)
     */
    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
}

Finding Collisions for Built-In Hash Function

To test how good the hash function built into the HashMap class is, I wrote a short class that would load a large number of strings (7,000) from a file, and compute a hash for each, counting the duplicates. First, the output - then the code.

The output:

$ javac BuiltinHashFunction.java && java BuiltinHashFunction

Builtin Hash Function: 	Hash codes: 7144		Collisions: 3

The code:

Link to git.charlesreid1.com: https://git.charlesreid1.com/cs/java/src/master/hash/BuiltinHashFunction.java

mport java.util.Random;
import java.util.List;
import java.util.ArrayList;
import java.util.Set;
import java.util.HashSet;
import java.util.Scanner;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.FileNotFoundException;

public class BuiltinHashFunction {
	public static void main(String[] args) throws FileNotFoundException {
		//shortTest();
		manyCollisions();
	}


	public static void manyCollisions() throws FileNotFoundException {

		Scanner s = new Scanner(new BufferedReader(new FileReader("14oxenofthesun.txt")));

		// Populate word list
		Set<String> wordset = new HashSet<>();
		while(s.hasNext()) {
			wordset.add(s.next());
		}

		List<Integer> list = new ArrayList<>();
		Set<Integer> set = new HashSet<>();
		for(String word : wordset) {
			int next = hash(word.hashCode());
			list.add(next);
			set.add(next);
		}
		int diff = list.size() - set.size();
		System.out.printf("Builtin Hash Function: \tHash codes: %d\t\tCollisions: %d\n", list.size(), diff);
	}



	public static void shortTest() {
		int n = 100;
		for(int i=0; i<n; i++) {
			String s = randomString();
			System.out.println(s);
			System.out.println(s.hashCode());
			System.out.println(hash(s.hashCode()));
			System.out.println();
		}
	}

	public static String randomString() {
		Random r = new Random();
		char[] rand = new char[2+r.nextInt(7)];
		for(int i=0; i<rand.length; i++) {
			rand[i] = (char)('A' + r.nextInt(26));
		}
		return new String(rand);
	}

    public static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

}

Flags