Tuesday 23 July 2013

Examples of a HashSet

Introduction


Some Real Life Examples where we use a Hash and Buckets.

Let me first shortly introduce how a hash works by means of the first example, an AddressBook, and as a followup how it is implemented in Java's HashSet.

Addressbook


A HashSet is a specific implementation of a Set. It uses the Hash of the stored objects to determine the position in an array of what are called Buckets. Every bucket in the array is, if filled, filled with another collection. So a HashSet is actually a collection of collections.

Let's see some examples of hashes people use when finding addresses in an AddressBook:
perfect hash: returns number in a numeral system with radix 26
in this case each name in the address book has a unique hashcode. Every list contains exactly one entry. If you use this as an address book, each page contains at most one address. You are going to need a huge amount of pages, if you have a lot of friends and relatives. With this behaviour it would make more sense to store addresses via a key-value store (for example an implementation of Map).
average hash: return first letter of the last name, capitalised
in this case, the behaviour is the same as your average normal address book. Each list contains all address of people whos names start with the same letter.
bad hash: return 'A'
in this case, the behaviour is terrible! You could compare it to an address book, on which all addresses are written on the first page. This way, the hashset will behave like a normal linked list.
Let's assume the hash used is the middle one. The Hashcode of entries in an AddressBook takes the first letter of the last name (capitalised).


Java's HashSet

We assume a class Address, containing the hashCode() method that returns the first letter of the last name.


The HashSet<Address> is implemented by using a HashMap<Address, Object> internally.

Now, the HashMap has an underlying data structure consisting of a Entry<Address, Object>[] table. This is the array that is going to contain all the first elements of each linked list. In other words, this is the Array containing the buckets.


The index of a bucket in the array is almost equivalent to the hashCode of the Address instance.

When I say almost, I mean that some additional conversions take place on the original hashCode, for instance:
  • In the case of using Strings Sun (Oracle?) uses a special hashing function (sometimes)
  • Some bit twiddling takes place to prevent collisions due to hashCodes that do not use lower bits, and therefore would cause the start of the bucket array to never be used.
  • The result is "binary and"ed, ofcourse, with the table length (which is always a power of two) as a quick sort of modulo function to prevent ArrayIndexOutOfBounds problems.

For this reason, the buckets are not at the positions in the bucket array as you would expect, given the answers of the hashCode() function.

Every bucket contains the first Entry (Entry<Address, Object>) of a linked list.

A new entry is added to the top of the Linked list.

Where each Entry has a pointer to the next Entry in the bucket, or null if it is the only or last entry, effectively creating/ending the linked list.

So the contents should look like follows:

Supermarket


The Hashcode used in a supermarket is infinitely worse than my Addressbook. To start a search, accost an employee of said supermarket and inquire after the location of an article, say washing detergent. You'll receive the hashcode "aisle 14" or some such. When at the location of said hashbucket, you find all articles jumbled together. Your senses assailed by every colour swirl, bubble and geometric shape imaginable with letters and numbers in wild and disturbing images.

If you are lucky, the hashcode is maginally improved with additional information like "on the left", or "at the very end".

Conclusion


Apparently, in real life, there are HashSets in use as well. We can only come to the conclusion that any logical data structure that we use in real life without thinking, has it's corresponding implementation in the Software Realm.

I hope you've enjoyed this journey of discovery into the Java source code, as much as I did.

References

Why always override hashcode() if overriding equals()?
http://www.xyzws.com/javafaq/why-always-override-hashcode-if-overriding-equals/20
Java 6 HashMap JavaDoc
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
Java 6 HashSet JavaDoc
http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html
Java Generics and Collections
Maurice Naftalin & Philip Wadler

1 comment: