Hashing
- The motivation behind hashing is that insertion, removal, and contain checks all occur in $O(1)$ time in the best case.
- This offers efficiency bonuses greater than that of the tree-based map as the tree structure requires us to have elements that can be compared to each other.
- The principle of the idea behind Hashing is that it allows us to create data-based indices for an array structure.
DataIndexedIntegerSet
DataIndexedIntegerSet
is a structure that stores and searches for ints in constant time.- Design:
- Create a boolean type
ArrayList
that has size 2 billion. - The
add(int x)
method sets thex
position in the arraylist to true in $\Theta(1)$ time. - The
contains(int x)
checks if the value in thex
position in the array list is true in $\Theta(1)$ time as well.
- Create a boolean type
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public class DataIndexedIntegerSet {
private boolean[] array;
public DataIndexedIntegerSet() {
array = new boolean[2000000000];
}
public add(int x) {
array[x] = true;
}
public contains(int x) {
return array[x];
}
}
- However there are major flaws with this preliminary implementation.
- The structure only works for integer types.
- The struture is very wasteful, as if each boolean takes up 1 byte, then the entire array would consume 2 GB of memory.
DataIndexedWordSet
- We may attempt to adapt our data structure such that it works for strings.
- We can build on our integer based map by representing each word as a unique string.
- We want to be able to generate a number for any string.
- One idea is to use the first letter of the string, but this may result in a collision if two strings have the same starting letter. In that case, the two strings would have the same hashcode.
- To avoid collisions, we may represent strings in base 26, where each letter is assigned a number from 1 to 26. This way, we may represent any character in base 26 and have a unique integer representation for any string.
- Ex: \(``cat"=3\cdot26^2+26^1+20\cdot26^0\)
DataIndexedStringSet
- We may generalize this to any string by relying on character formattings such as ASCII or Unicode.
- This may work for a small number of characters, but the moment we work with complex scripts such as Chinese, we require massively large arrays to keep track of each individual character, of which there are over 40,000 to choose from.
1
2
3
4
5
6
7
8
public static int asciiToInt(String s) {
int intRep = 0;
for (int i = 0; i < s.length(); i += 1) {
intRep = intRep * 126;
intRep = intRep + s.charAt(i);
}
return intRep;
}
HashCode
- We want to create a hashing function to generate values for different keys.
- We can then convert the hashcode into an index for the array
- The values that we generate for many different characters could be massively large.
- We want to wrap this number around to a smaller range, say, 0-9.
- We use a modulo (%). Which gives us the remainder in fractional division.
- Our index calculation code would look like this:
1
Math.floorMod(key.hashCode(), array.length)
Valid Hashcode
- There are 2 properties for a valid hashcode:
- *Deterministic: The
hashCode()
function of two equal object must be equal to each other. - Consistent: The
hashCode()
function returns the same integer every time for the same instance of an object.hashCode()
must be independent of time-varying attributes, RNG, and any other source of inconsistency.
- *Deterministic: The
- The hashcode does not have to be unique for two objects.
Good Hashcodes
- A good hashcode as 3 properties.
- The
hashCode()
function must be valid. - The
hashCode()
function values hsould be spread as uniformly as possible over the set of all integers - The
hashCode()
function should be quick to compute with O(1) constant time operation.
- The
Handling COllisions:
- Collisions occur in hashing when multiple leements have the same index.
- Two common methods to deal with collision:
- Linear Probing: Store the colliding key somewhere else in the array, maybe with the next open array space. This is implemented in distributed hash tables.
- External Chaining: We may store all keys with the same hash value in a collection of their own. This is commonly done with a
LinkedList
implementation. When we make a collection of entires sharing a single index, that collection is called a bucket.
Resizing and Hash Table Performance
- As we add more and more elements, the hash table should expand it’s underlying array to free up more possible indicies to maintain efficiency and lessen collisions.
- The load factor is a ratio that describes how full a hash table is. \(\text{load factor} = \frac{\text{size()}}{\text{array.length}}\)
- Once the hashtable exceeds the maximum load factor, then the hash table should resize.
- Typically done geometrically by a factor of 2.
- Resizing does not appear if we add a key-value pair that already exists in the hashmap.
- Since the index is calculated under the modulous of the array length, after resizing has been done, all items would need to be relocated.