2. Bucket Array and hash functions #

Created Tuesday 18 February 2020

We don’t want to learn about something and make it a black box.

We want to implement maps using hash tables:

O(1) access
O(1) insert
O(1) remove

Arrays are the only data structure with O(1) insert and O(1) access.
As we are just storing pairs. Suppose that we are concerned only about keys here.
We use an array (called bucket array) for collecting the keys. This is the functional part of the hash table. We can add value functionality later.

Just dream that we could somehow convert the key to an index value(i.e integer).
To realize this, we make a hashing function. Our problem breaks down to designing the function.

The hash function has two functions:

hashcode - takes the key and generates a number for it. It is a function. For the same input we always get the same output.
compression function - this helps us assign an index which is within our bucket size. i.e makes it practical. e.g “abc” by some h() returns 17645678. If our bucket size is 56. We need an index between 0 and 55. A perfect compression bucket is h() % bucket_size; **A very good compression function. **Don’t worry about this. This can be dealt with easily.

Note: Some keys may have the same index, because bucket size is limited. This are called collisions. There are ways we can deal with them. This is the basis for hash table.

For different keys different types of hashcodes can be proposed, depending on the nature of the key. e.g

suppose we have an infinte array, if the key is an integer. Then h(key) = key; identity functions works here perfectly.
for strings - “abc”, there are lots of ways. Among them:
1. Sum of ascii values - not good because because it is the same for all permutations of a string. So it is problematic.
2. Take sum of first 3 characters - permutations can be handles, still it is limited to ASCII range.
3. This is used practically - treat the word as a word with base p. 128 for ASCII. This is acceptable because most words are not that long. We take ‘p’ as prime because they work very well for making the distribution normal, according to research.
4. We can define our own hashcode for the new key if required. For our **classes. **e.g We can use the address of the object, as a number. We can use it as input for the hashCode.

Characteristics of a good hashCode:

Should be a normal distribution.