Bad Hash Functions Examples
Bad Hash Function - Example 1
keys = phone numbers (10 digits)
Size of universe = |U| = 1010. Suppose we won’t have more than 500 friends or phone numbers which we have to track, so our |S| = 500. So, we choose 1000 buckets, just doubling it. So, n=1000. So our hash function should take mobile number and produce 3 digit hash function.
Terrible hash function
h(x) = floor(x/1000 )
So, suppose we use 3 most significant digits of mobile number given the number, and suppose first 3 digits in the mobile number reflects the operator or area code, which is normally the case. Then this is a terrible idea.
Mediocre hash function
h(x) = x % 1000
Using the last 3 digits as the hash value is better since there is no specific pattern related to the operator or area code. However, collisions are still possible, which can make this hash function vulnerable if patterns exist in those last 3 digits.
Bad Hash Function - Example 2
key = Memory location of the object
Memory locations are usually powers of 2, and thus even.
keys = Memory location, and have power of 2 and hence even.
Terrible hash function
h(x) = x % 1000
This hash function is incapable of producing odd hash, and hence half of the buckets are wasted. So, the mediocre hash function in example 1, is terrible here.
Good hash function
Answering what is good hash function is quite tricky. So, what we discuss now is okay-ish hash function and not good. For good hash function we have to study much more that :)
Suppose the object type is a string (e.g., a name). We first convert the string into an integer/hash code, and then use a compression function.
int getHashCode(String s) {
int sum = 0;
for (char c : s.toCharArray()) {
int x = (int) c; // get ASCII value of char c
sum += x;
}
return sum;
}
Once you convert the object to integer, then we have to use some compression function like mod (n) where n is the size of the hash table.
int compress(int num, int size) {
return num % size; // returns remainder after dividing num by size
}
So overall we do following:
int getBucketIndex(Object o, int sizeOfHashTable) {
int hashCode = getHashCode((String) o);
int compressedValue = compress(hashCode, sizeOfHashTable);
return compressedValue;
}
So, this gives us the index of the object o, where it has to be inserted.
Bad Hash Function Example 3
Let’s understand the need for a good hash function with an example. Assume that you have to store strings in a hash table using the following strings: ["abcd", "bcda", "cdab", "dabc"]
.
Bad Hash Function 1 - Sum of ascii values
To compute the index for storing these strings, consider a hash function that calculates the index as the sum of the ASCII values of the characters modulo 344
.
Since the ASCII values of the characters a
, b
, c
, and d
are 97, 98, 99, and 100 respectively, the sum of these values for each string would produce the same result.
H("abcd") = (97 + 98 + 99 + 100)
= 394
Bad Hash Function 2 - Sum of ascii values modulo 10
String: “abcd”
H("abcd") = (97 + 98 + 99 + 100) % 10
= 394 % 10
= 4
String: “bcda”
H("bcda") = (98 + 99 + 100 + 97) % 10
= 394 % 10
= 4
String: “cdab”
H("cdab") = (99 + 100 + 97 + 98) % 10
= 394 % 10
= 4
Since all the strings contain the same characters in different permutations, the sum will always be the same, resulting in the same index for all strings. Consequently, all the strings will be hashed to the same index, leading to poor performance due to excessive collisions.
Better Hash Function
Now, let’s try a better hash function. This new hash function will compute the index as the sum of ASCII values of the characters multiplied by their respective order in the string, followed by a modulo operation with 11 (a prime number).
String: “abcd”
H("abcd") = (97*1 + 98*2 + 99*3 + 100*4) % 11
= (97 + 196 + 297 + 400) % 11
= 990 % 11
= 0
String: “bcda”
H("bcda") = (98*1 + 99*2 + 100*3 + 97*4) % 11
= (98 + 198 + 300 + 388) % 11
= 984 % 11
= 5
String: “cdab”
H("cdab") = (99*1 + 100*2 + 97*3 + 98*4) % 11
= (99 + 200 + 291 + 392) % 11
= 982 % 11
= 3
String: “dabc”
H("dabc") = (100*1 + 97*2 + 98*3 + 99*4) % 11
= (100 + 194 + 294 + 396) % 11
= 984 % 11
= 5
In this case, even though there may still be collisions (as seen with “bcda” and “dabc”), the distribution is significantly better, and we do not have all strings mapping to the same index.
How to Choose the N - Number of Buckets ?
When choosing the number of buckets (N):
- Choose N to be a prime number close to the number of objects in the table.
- Avoid primes that are close to powers of 2 or 10, as they can lead to suboptimal distributions.
Note that there is no single solution to what constitutes a “good” hash function.
Here is How hashCode() method is implemented in java for strings.
Summary
A good hash function should meet the following criteria:
- Type-dependent: The function must be aware of the type to compute a suitable hash value.
- Within Array Range: It should produce values within the range of the array to avoid index errors.
- Efficient: It should not involve too many operations, ensuring efficiency.
- Key Utilization: It should use as much of the key as possible to minimize collisions.
- Avoid Common Factors: It should avoid common factors to further minimize collisions.
By using suitable hash functions and numbers of buckets, we can significantly improve the performance of our hash tables and reduce the number of collisions, ensuring quick lookups and efficient storage.