Collisions
In the previous example, most keys had distinct hash values, but there were two collisions (pairs of keys with the same hash value). "Alan" and "Amit" both had a hash value of 0, and "Jimmy" and "Joseph" both had a hash value of 9. Other hash functions, like one depending on the length of the first name, would produce even more collisions.
There are two common methods of dealing with collisions:
  1. Linear Probing: Store the colliding keys elsewhere in the array. We will encounter this technique later in this lab.
  2. Chaining: An easier, more common solution is to store all the keys with a given hash value together in a collection of their own. This collection is often called a bucket.
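Here are the example entries from the previous step, hashed into a 26-element table of linked lists using the function below. Since 65 is the character code of 'A', a name starting with 'A' hashes to 0 and a name starting with 'Z' hashes to 25.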
hash(key) = ((int) key.charAt(0)) - 65;
In the figure below, we use chaining to deal with collisions.
[Figure: HashMapWithCollisions2]
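The chaining scheme described above can be sketched in Java roughly as follows. This is a minimal illustration, not the lab's actual implementation: the class name `ChainedNameTable` and its methods are hypothetical, and the sketch assumes every key is a non-empty string starting with an uppercase letter A-Z, as the hash function above requires.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of a 26-bucket table that resolves collisions by chaining.
public class ChainedNameTable {
    private static final int NUM_BUCKETS = 26;
    private final List<LinkedList<String>> buckets = new ArrayList<>();

    public ChainedNameTable() {
        // Every bucket starts out as an empty linked list.
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // The hash function from the text; assumes the first character is 'A'..'Z'.
    public static int hash(String key) {
        return ((int) key.charAt(0)) - 65;
    }

    // Adds the key to the bucket chosen by its hash value, skipping duplicates.
    public void add(String key) {
        LinkedList<String> bucket = buckets.get(hash(key));
        if (!bucket.contains(key)) {
            bucket.add(key);
        }
    }

    // Only the one bucket that could hold the key is searched.
    public boolean contains(String key) {
        return buckets.get(hash(key)).contains(key);
    }

    public List<String> bucket(int index) {
        return buckets.get(index);
    }

    public static void main(String[] args) {
        ChainedNameTable table = new ChainedNameTable();
        for (String name : new String[] {"Alan", "Amit", "Jimmy", "Joseph"}) {
            table.add(name);
        }
        System.out.println(table.bucket(0)); // [Alan, Amit]
        System.out.println(table.bucket(9)); // [Jimmy, Joseph]
    }
}
```

Colliding keys simply share a list, so a lookup costs one hash computation plus a scan of a single bucket.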
Average Case Performance
We would like the buckets to be as close in size as possible, since this produces the best average running time. (To see why, consider the worst case, where one bucket holds all the keys and the rest hold none. Looking up a key then boils down to linear search, which we want to avoid.)
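To make the effect of bucket sizes concrete, the sketch below computes the average number of key comparisons for a successful lookup from a list of bucket sizes. It rests on assumptions that are not in the lab text: every stored key is equally likely to be looked up, and each bucket is scanned front to back, so a bucket of size k costs 1 + 2 + ... + k comparisons in total across its keys. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper comparing evenly filled buckets with a lopsided table.
public class BucketBalance {
    // Average comparisons per successful lookup, assuming each stored key is
    // equally likely to be searched for and buckets are scanned front to back.
    public static double averageComparisons(int[] bucketSizes) {
        long totalKeys = 0;
        long totalComparisons = 0;
        for (int size : bucketSizes) {
            totalKeys += size;
            // Finding the 1st, 2nd, ..., size-th key costs 1 + 2 + ... + size.
            totalComparisons += (long) size * (size + 1) / 2;
        }
        if (totalKeys == 0) {
            return 0.0; // an empty table has no successful lookups
        }
        return (double) totalComparisons / totalKeys;
    }

    public static void main(String[] args) {
        int[] even = new int[26];
        Arrays.fill(even, 1);      // 26 keys, one per bucket
        int[] lopsided = new int[26];
        lopsided[0] = 26;          // the same 26 keys, all in one bucket

        System.out.println(averageComparisons(even));     // 1.0
        System.out.println(averageComparisons(lopsided)); // 13.5
    }
}
```

With the same 26 keys, the evenly filled table needs one comparison per lookup on average, while the single-bucket table needs 13.5, which is the linear-search cost.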
To determine the average case in any problem, you would need to determine or approximate the probability distribution of possible inputs; in the case of hashing, the possible inputs are the keys. Given this distribution, you want to design your hash function to scatter the keys throughout the table as much as possible to minimize clumps produced by keys that are similar in some way to one another. One definition of "hash" is "to muddle or confuse", and a good hash function will "muddle" whatever ordering exists in a set of keys. This is where the hash function got its name.
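Note also that the performance of hashing depends very much on average case behavior. No matter what the hash function is, we can always produce a set of keys that makes it perform as badly as possible: as long as there are many more possible keys than buckets, many keys must share some bucket, and choosing only those keys puts everything in one bucket.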
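For the first-letter hash function used earlier, such a worst-case key set is easy to construct: any keys that share a first letter land in the same bucket. The sketch below illustrates this; the class name, the method names, and the generated keys ("A0", "A1", ...) are made up for this example.

```java
// Hypothetical demonstration of keys chosen to defeat the first-letter hash.
public class AdversarialKeys {
    // The hash function from the text; assumes the first character is 'A'..'Z'.
    public static int hash(String key) {
        return ((int) key.charAt(0)) - 65;
    }

    // Builds n distinct keys that all start with 'A', so all hash to 0.
    public static String[] worstCaseKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = "A" + i;
        }
        return keys;
    }

    // Counts how many of the given keys fall into each of the 26 buckets.
    public static int[] bucketSizes(String[] keys) {
        int[] sizes = new int[26];
        for (String key : keys) {
            sizes[hash(key)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = bucketSizes(worstCaseKeys(100));
        System.out.println(sizes[0]); // 100: every key collides in bucket 0
        System.out.println(sizes[1]); // 0
    }
}
```

A different hash function would need a different adversarial key set, but the same idea applies: pick keys that the function maps to a single bucket.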