Lab 12: Maps and Hashing Part 1

A. Intro

Download the code for Lab 12 and create a new Eclipse project out of it.

B. Get Acquainted with Maps

Tables

Before this class you probably worked with tables that stored collections of associations between symbols and other objects. For example, a function to translates a Roman numeral to decimal might use a table that stores the following associations:

Roman digit Corresponding Decimal Value
I 1
V 5
X 10
L 50
C 100
D 500
M 1000

In CS 61A, you first encountered tables in the context of dictionaries. Other applications of tables in that course included memoization of recursive processes (we'll revisit that application shortly).

A table has two important features.

The Map Interface

Java's name for an association table is a Map. Map is an interface in java.util, not an actual class; an actual class that stores associations would implement Map and provide definitions for the methods it prescribes. Among these methods, the most interesting are

Object put(Object key, Object value)

which adds the association between key and value to the table (and returns the previously existing one, if there was one), and

Object get(Object key)

which retrieves the value associated with key

The Object returned by put is the value previously associated with the given key, or null if the key had no previously associated value. (Recall that tables contain at most one associated value for each key.) The example below, in which table is assumed to be an instance of a class implementing Map, demonstrates how put works. (The table stores associations between people and blood types.)

Statement Effect
Object value; value = table.put ("mike", "a-"); value contains null; the set of associations represented by table is {("mike", "a-")}.
value = table.put ("sara", "a+"); value contains null; table represents a set of two associations, {("mike", "a-"), ("sara", "a+")}.
value = table.put ("mike", "ab-"); value originally contains "a-" since that was the value that was originally associated with "mike"; the set of associations represented by the resulting table is {("mike", "ab-"), ("sara", "a+")}.
value = table.put ("david", "o-"); value contains null; table represents a set of three associations, {("mike", "ab-"), ("sara", "a+"), ("david", "o-")}.

Discussion: A Class that Implements Map

Link to the discussion

Try to imagine what a class that implements Map would look like. For example, what would its private instance variable(s) be? Explain how the get and put operations would work with those instance variables. (There are several reasonable answers.)

Finally, what are the runtimes of your get and put methods with respect to the total number of keys in the map?

java.util Classes that Implement Map

Several classes in java.util implement the Map interface. One involves a binary search tree, which we'll discuss in a later lab. Two others involve hash tables, the subject of the rest of this lab.

C. The Hash Map

A Faster Map

In the previous discussion, did you come up with run times for put and get that were had constant runtime with respect to the number of keys in the map? Likely not! If your method called something like ArrayList's indexOf(Object o) method, then it isn't constant time. Finding an object in an ArrayList requires iterating through the whole thing, which is linear time.

But one observation that you may have come up with is that if the keys in our map are integers, then we can use just a normal array as a map with constant time access. All we do is store the values in an array and access them immediately using array indexing. For example, if the table were an array of 100 elements that stores the entries (5, "Clancy"), (17, "Wei"), (42, "Waliany"), and (83, "Hale"), the value associated with 5 is just table[5], and we can access that value in constant time.

Cool, right? Too bad it only works for integer keys. But hashing is a technique for extending this constant-time access to non-integer keys.

Hashing

The idea underlying hashing is to add a step to the getting process: namely the transformation of the key into an int that can be used to index an array. If this transformation is fast and different keys get transformed into different index values, then we can approximate the direct access that an array provides.

The transforming function is called a hash function, and the int it returns is the hash value. (The reason for the term "hash" will appear shortly.)

Here's an example with first and last names. Let's say there are the following key-value pairs: ("Alice", "Sheng"), ("Amit", "Akula"), ("Amruta", "Yelamanchili"), ("Armani", "Ferrante"), ("Daniel", "Nguyen"), ("Gilbert", "Ghang"), ("Ross", "Teixeria"), ("Rudy", "Laprade"), and ("Yeseon", "Lee"). The keys start with a variety of different first letters. Thus, a reasonable hash function for this set of keys could be based on the index of the first letter of the key in the alphabet. Applying this hash function to the four keys gives us the following results:

key hash value
"Alice" 0
"Amit" 0
"Amruta" 0
"Armani" 0
"Daniel" 3
"Gilbert" 6
"Ross" 17
"Rudy" 17
"Yeseon" 24

With a 26-element array named table, we could go directly to Daniel's associated value by accessing table[3].

An array that's not very full uses a lot of extra space, but that is okay. We are trading memory efficiency for search efficiency.

Collisions

In the previous example, the keys had mostly different values, but there were four collisions (keys with the same hash value). "Alice", "Amit", "Amruta", and "Armani" all had a hash value of 0, and "Ross" and "Rudy" had hash values of 17. Other hash functions, like one depending on the length of the first name, would produce even more collisions.

There are two common methods of dealing with collisions:

  1. Linear Probing: Store the colliding keys elsewhere in the array, potentially in the next open array space. We will encounter it later in this lab.
  2. Chaining: An easier, more common solution is to store all the keys with a given hash value together in a collection of their own, such as a linked list. This collection is often called a bucket.

Here are the example entries from the previous step, hashed into a 26-element table of linked lists using the function hash(key) = ((int) key.charAt(0)) - 65.

HashMapWithCollisions3

Assume for a moment that all of the keys in the hash table have different hash codes and end up in different places in the array. What is the runtime of getting a key with respect to the total number of keys in the map?

Toggle Solution

O(1). We can index into the array in constant time. Then we search for the key in the bucket. But there is only one item in the bucket, so this search also runs in O(1) time.

Now assume instead that all the keys in the hash table happen to have the same hash code (despite being different keys), and they end up in the same bucket. What is the runtime of getting a key with respect to the total number of keys in the map?

Toggle Solution

O(n) in the worst case, where there are n keys in the map. To find a key, we first index into the array at the proper location in constant time. But then we have to search through all of the items in the bucket to find the appropriate key. If the bucket is a linked list, we just have to iterate through the linked list until we find the key. In the worst case, all the keys end up in the same bucket, so this could take up to n steps.

The Meaning of 'Hashing'

Clearly we are better off if different keys have different hash codes, and collisions are low, since this produces the best average running time.

To determine the average case in any problem, you would need to determine or approximate the probability distribution of possible inputs; in the case of hashing, the possible inputs are the keys. Given this distribution, you want to design your hash function to scatter the keys throughout the table as much as possible to minimize clumps produced by keys that are similar in some way to one another. One definition of "hash" is "to muddle or confuse", and a good hash function will "muddle" whatever ordering exists in a set of keys. This is where the hash function got its name.

Note also that the performance of hashing depends very much on average case behavior. No matter what the hash function, we can always produce a set of keys that make it perform as badly as possible.

Load Factor

No matter what, if the underlying array of our hash table is small, and we add a lot of keys to it, then we will start getting more and more collisions. Because of this, a hash tables should expand its underlying array once it starts to get filled (much like how an ArrayList expands once it fills up).

A hash table should have an associated load factor, which is a fraction between 0 and 1. Once the number of keys / total size of the underlying array is greater than the load factor, then the hash table should resize to be bigger. The load factor is commonly around 0.7.

An Animation of Hashing

Phew! That's a lot of text describing hashing. Visualizing the process will probably be more helpful for really understanding what's going on.

This demo illustrates the hashing process, using the hashCode method of the String class. Collisions are stored in a linked list as just described.

You can try inserting the strings "h", "12", "z", "m", "a", "14", "1", and "za" to yield interesting behaviour. In addition to insert, don't forget to experiment with delete and find!

java.util.HashMap

The class java.util.HashMap is implemented as a hash table, that is, an array of collections as just diagrammed. The put method adds an entry to the HashMap:

Object put (Object key, Object value);

put returns either the value previously associated with key, or null if there was no previous association with key.

The put method requires that keys support two operations:

public boolean equals(Object obj)

and

public int hashCode()

Here is how it uses them:

  1. First, put calls key.hashCode() automatically to find out which collection to search for key. The programmer does not call hashCode themselves. The relevant array index in the table maintained by the HashMap class is key.hashCode() % the length of the bucket array.
  2. Then, it searches the bucket, the collection of entries whose keys all have the same remainder when dividing the hash value by the table size. It does this first by comparing hash values, then using equals for keys with the same hash code. If it finds an entry with the given key, it replaces the associated value. Otherwise it adds the key/value entry to the collection.

The get method, which returns the value associated with a given key, works similarly. It (not the programmer) calls the key's hashCode method to find out which collection to search, then compares the key with the keys in the collection using equals.

A full description of this class can be found on the Java API here.

Note: A possible ambiguity arises when get returns null: either the key is not in the table, or it is in the table but is associated with the value null. We resolve the ambiguity with the containsKey method, which returns true if its key argument is associated with something and returns false otherwise.

Constructor

As we have seen in other Java collection classes, we are allowed to restrict the keys and values in the table to certain types. To declare a java.util.HashMap whose keys are all Strings and whose associated values are all Integers, we provide the relevant types in brackets:

HashMap<String, Integer> table = new HashMap <String, Integer>();

Some Additional (not super important) Points about the Constructor

The HashMap constructor has two optional arguments. One is the initial size of the bucket array. Normally, the array is expanded (doubling its size, similar to ArrayList) when the table gets too full. If we know approximately how many entries the table will contain, however, we can avoid the intermediate expansions. The other argument is the maximum load factor , which is a percentage specifying how full the HashMap can be before it expands its internal array. The load factor is defined as the number of entries in the map divided by the number of buckets (i.e. the size of the internal array).

Self-Test: Relation of .equals and .hashCode

True or false: If for two objects a and b, the expression a.equals(b) is true, then it must be the case that a.hashCode() == b.hashCode() is also true.

True
Correct!
False
Incorrect.
Check Solution

True or false: If for two objects a and b, the expression a.hashCode() == b.hashCode() is true, then it must be the case that a.equals(b) is also true.

True
Incorrect.
False
Correct!
Check Solution

Exercise: PhoneBook with HashMap

Add code to complete the PhoneBook class and possibly to the Person and PhoneNumber class. You must use HashMap from java.util! The point of this exercise is to give you practice with it.

Run the test cases in PhoneBookTest.java to check that the code works. Then answer the following questions to check your understanding of the desired behavior.

What type was the Key in the HashMap book used in the PhoneBook?

PhoneNumber Object
Incorrect.
String representing the name
Incorrect.
String representing the phone number
Incorrect.
PhoneBook Object
Incorrect.
Person Object
Correct!
Check Solution

What type was the Value in the HashMap book used in the PhoneBook?

PhoneNumber Object
Correct!
String representing the name
Incorrect.
String representing the phone number
Incorrect.
PhoneBook Object
Incorrect.
Person Object
Incorrect.
Check Solution

What happens when you try to add the same person to the phone book twice?

An exception is thrown
Incorrect.
The second entry is ignored because an entry already exists
Incorrect.
Duplicate entries are added to the HashMap
Incorrect.
The second entry's value replaces the first's
Correct!
Check Solution

What happens if you modify a PhoneNumber that is already stored in the Book?

You are now no longer able to look up the PhoneNumber for that Person object.
Incorrect.
The PhoneBook remembers the old entry and is not changed because they reference different objects
Incorrect.
An exception is thrown
Incorrect.
The entry is reflected in the Phone book because they reference the same object
Correct!
Check Solution

What happens if you modify a Person that is already stored in the Book?

The entry is reflected in the Phone book because they reference the same object and you can still look up the PhoneNumber
Incorrect.
The entry is reflected in the Phone book because they reference the same object, but you can no longer look up the PhoneNumber
Correct!
The PhoneBook creates a second entry for the second person and still references the old Object as well
Incorrect.
An exception is thrown
Incorrect.
Check Solution

Note: For each question ensure that you understand why the answer is so. Consult with your neighbors and TA, especially for the forth and fifth questions.

Keeping Track of Hash Table Contents

Consider the following program. We think drawing a rough sketch of the Hash Table contents will be helpful. Note: a HashSet works just like a HashMap, except instead of storing key-value pairs, it just stores items with no values associated with them.

import java.awt.*;
import java.util.*;

public class TestPoint {

    public static void main (String [ ] args) {
        HashSet<Point> points = new HashSet<Point> ( );
        Point p = new Point (0, 5);
        for (int k = 0; k < 3; k++) {
            p.x = k;
            points.add (p);
        }
        for (int k = 0; k < 3; k++) {
            System.out.println ("Does the hash table contain (" + k + ",5)? "
              + points.contains (new Point(k,5)));
        }
    }
}

When run, it produces the following output:

Does the hash table contain (0,5)? false
Does the hash table contain (1,5)? false
Does the hash table contain (2,5)? true

Think about why for the next question.

Discussion: What Happened?

Link to the discussion

Why did the code return false when asked whether it contained the points (0,5) and (1,5)? Earlier in the code we put added those points to the HashSet .

Linear Probing

Hashing into "buckets" is not the only method for dealing with collisions. In the "linear probing" strategy of handling collisions, keys are stored in the array rather than in a separate collection. A key's hash value is found, and then the array is searched linearly (wrapping around to the beginning of the array when it reaches the end) for the key. The search stops either when the table is full, when an empty space is found (meaning that the key is not yet in the table), or when the key is found.

This Demo lets you play with the linear probing algorithm with insertion, deleting, and find.

Discussion: Linear Probing

Link to the discussion

Think about the kinds of instance variables that a linear probing hash map would have in comparison with those of a bucketing hash map. What's one major pro and one major con of using a linear probing hash map instead of a bucketing hash map?

Let's Make our own Hash Map!

If you haven't yet switched which partner is coding in this lab, please do so now.

By this point in the lab, you should have a strong understanding of the following concepts:

This should be enough to make our own hash map! We have created a template class for you, MyHashMap.java. Implement the core functionality of your hash map. You can decide how to implement the hash map (how to handle collisions, etc.) so add what instance variables you need. This means you can either use a bucketing or a linear probing approach.

Discussion: Inventing a Hash Function

Link to the discussion

Here are eleven keys.

10 100 32 45 58 126 1 29 200 400 15

Find a hash function that, when used with an empty chained table of size 11, produces three or fewer collisions when the eleven keys are added to the table. Make a post describing your function. Then check your hash function with your neighbors and trade tactics for handling these sorts of exercises.

hashCode Methods of Builtin Objects

Print the hash values for several instances of builtin classes such as java.awt.Point, java.lang.String, and java.lang.Integer, just to get a feel for what kind of values a hashCode method typically produces.

Also, produce evidence for whether or not the ArrayList.hashCode method uses the hashCode methods of its elements.

Discussion: ArrayList Hash Code

Link to the discussion

Does the hashCode of an ArrayList depend upon hashCode s of the objects in the ArrayList ? How does your evidence confirm of disconfirm this?

D. Some real-world hash functions

Information presented here is taken from the article "Selecting a Hashing Algorithm", B.J. McKenzie et al., Software Practice & Experience, vol. 20, no. 2, February 1990. The authors experimented with eight hash functions implemented in a variety of widely distributed software.

Hashing Algorithms

All of the algorithms compute a hash value H for a string of length n whose characters are c1, c2, ..., cn. The hash value is determined from successive partial results h0, h1, h2, ..., hn, with each hk computed from hk?1 as given in the formulas below. The hash table size is the value used in the mod operation at the end of each algorithm.

  1. Amsterdam Compiler Kit (ACK)

    There is a "mask" for characters, built as follows:

    `m1 = 171; mk = rightmost 8 bits of 77mk?1+153`

    The hash value H is then the last 8 bits of hn, where h0 = 0 and hk = hk?1 + XOR(ck, mk).

  2. Eidgenossische Technische Hochschule Modula-2 Cross Compiler (ETH)

    h0 = 1; hk = ck*((hk?1 mod 257)+1); H = hn mod 1699
  3. GNU C preprocessor (GNU-cpp)

    h0 = 0; hk = 4hk?1+ck; H = last 31 bits of hn, mod 1403
  4. GNU compiler front end (GNU-cc1)

    h0 = n; hk = 613hk?1+ck; H = last 30 bits of hn, mod 1008
  5. Portable C Compiler front end (PCC)

    h0 = 0; hk = 2hk?1+ck; H = last 15 bits of hn, mod 1013
  6. Unix 4.3 BSD C preprocessor (CPP)

    h0 = 0; hk = 2hk?1+ck; H = hn mod 2000
  7. AT&T C++ compiler (C++)

    h0 = 0; hk = 2hk?1+ck; H = hn mod 257
  8. Icon translator (Icon)

    h0 = 0; hk = hk?1+ck; H = hn mod 128

Performance

Algorithms were tested on 36,376 identifiers from a large bunch of C programs, and 24,473 words from a UNIX dictionary.

ACK is a loser (U-shaped distribution). Icon, C++, GNU-cc1, and GNU-cpp seem to distribute the words well. Theoretical results suggest that an algorithm of the form hk = A*hk?1+ck; H = hn mod N will be good, with A a power of 2 for speed and N chosen appropriately. The authors note:

"[A] and N need to be selected with care. Although it may seem unlikely that anyone would choose one of the really bad combinations, the facts ... indicate that far-from-optimal choices are made and persisted with. The experiments have shown that very small variations in N can produce large variations in the efficiency of the hash table lookup, and that the popular view, that choice of a prime number will automatically ensure a good result, is not well founded."

E. Conclusion

Please turn in:

Submit these files as lab12.