Understanding Hash Tables

Hash tables, its features & functions
Hash tables, its features & functions

A hash table is an unordered collection of key-value pairs, where each key is unique.

Hash tables offer a combination of efficient lookupinsert and delete operations. They are used to implement map and set data structures in most common programming languages. In C++ and Java they are part of the standard libraries, while Python and Go have built-in dictionaries and maps.

Neither arrays nor linked lists can achieve this:

  • A lookup in an unsorted array takes linear worst-case time
  • In a sorted array, a lookup using binary search is very fast, but insertions become inefficient
  • In a linked list an insertion can be efficient, but lookups take linear time

For example, student records for a class could be stored in an array C of dimension 10000 by truncating the student’s ID number to its last four digits: H(IDNum) = IDNum % 10000 Given an ID number X, the corresponding record would be inserted at C[H(X)].

The simplest way the hash table may be explained as a data structure that divides every element into equal-sized categories, or buckets, to permit quick access to the elements. The hash function finds that which element belongs to which bucket. They can also be definite as data structure those acquaintances keys with values. The basic procedure is to support powerfully and finds the consequent value.

Basically it is the one-dimensional array indexed by an integer value computed by an index function called a hash function. They are sometimes referred to as scatter tables. They are containers that represent a group of objects inserted at computed index locations. Each object inserted in this table is related with a hash index. The process of hashing involves the computation of an integer index (the hash index) for a given object (such as a string).

If calculated correctly, the hash calculation (1) should be quick, and (2) when finished frequently for a set of keys to be inserted in a table should create hash indices consistently spread crossways the variety of index values designed for the table. In algorithms a hash table or hash map is a data structure that uses a hash function to efficiently map certain identifiers or keys (student name) to associated values (e.g. that students enrolment no.).The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

A hash function is used to map each key to different address space but practically it’s not possible to create such a hash function that is able to do this and the problem called collision occurs. But still, it is done up to the greatest feasible case so that the chances of collision should be kept minimum. In a well-dimensioned table, the regular cost (number of instructions) for each lookup is self-determining of the number of elements stored in the table.

But still have all the problems, they are much more proficient in many cases comparative to all other data structures like search trees or any other table lookup structure. That the reason behind using these tables in all kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

Let’s try to understand hashing with some examples as follows:

Sample Python Implementation: This sample is a minimum implementation of a hash table whose keys must be strings. It uses DJB2 (xor variant) as its hashing function. Buckets are implemented with linked lists. (Note that Python’s built-in Dictionary type is implemented with a hash table. This sample implementation is purely for academic purposes.)

blog banner 1

class _Node(object):
def init(self, key, value):
self.key = key
self.value = value
self.next_node = None

def __str__(self):
    return "'{}': '{}'".format(self.key, self.value)

a simple hashing algorithm with acceptable collision rate

def _djb2x_hash(string):
hash = 5381
byte_array = string.encode(‘utf-8’)

for byte in byte_array:
    # the modulus keeps it 32-bit, python ints don't overflow
    hash = ((hash * 33) ^ byte) % 0x100000000

return hash

class HashTable(object):
def init(self, capacity):
self.bucket_array = [None for i in range(capacity)]
self.capacity = capacity

def insert(self, key, value):
    key_hash = _djb2x_hash(key)
    bucket_index = key_hash % self.capacity

    new_node = _Node(key, value)
    existing_node = self.bucket_array[bucket_index]

    if existing_node:
        last_node = None
        while existing_node:
            if existing_node.key == key:
                # found existing key, replace value
                existing_node.value = value
                return
            last_node = existing_node
            existing_node = existing_node.next_node
        # if we get this far, we didn't find an existing key
        # so just append the new node to the end of the bucket
        last_node.next_node = new_node
    else:
        self.bucket_array[bucket_index] = new_node

def lookup(self, key):
    key_hash = _djb2x_hash(key)
    bucket_index = key_hash % self.capacity

    existing_node = self.bucket_array[bucket_index]
    if existing_node:
        while existing_node:
            if existing_node.pair.key == key:
                return existing_node.pair.value
            existing_node = existing_node.next_node

    return None

def delete(self, key):
    key_hash = _djb2x_hash(key)
    bucket_index = key_hash % self.capacity

    existing_node = self.bucket_array[bucket_index]
    if existing_node:
        last_node = None
        while existing_node:
            if existing_node.key == key:
                if last_node:
                    last_node.next_node = existing_node.next_node
                else:
                    self.bucket_array[bucket_index] = existing_node.next_node
            last_node = existing_node
            existing_node = existing_node.next_node

def debug_print(self):
    for i in range(self.capacity):
        node = self.bucket_array[i]
        print('Bucket {}'.format(i))
        if node:
            while node:
                print('    {}'.format(node))
                node = node.next_node
        else:
            print('    Empty')

Output:

hash_table = HashTable(6)
hash_table.insert(‘lion’, ‘yellow’)
hash_table.insert(‘lizard’, ‘green’)
hash_table.insert(‘horse’, ‘brown’)
hash_table.insert(‘tiger’, ‘orange’)
hash_table.insert(‘cardinal’, ‘red’)
hash_table.insert(‘bear’, ‘black’)
hash_table.insert(‘elephant’, ‘grey’)
hash_table.insert(‘moose’, ‘brown’)
hash_table.debug_print()

Bucket 0
Empty
Bucket 1
‘lion’: ‘yellow’
Bucket 2
‘tiger’: ‘orange’
‘elephant’: ‘grey’
‘moose’: ‘brown’
Bucket 3
‘lizard’: ‘green’
‘cardinal’: ‘red’
Bucket 4
‘horse’: ‘brown’
Bucket 5
‘bear’: ‘black’

Hash Function Techniques:

Division:

  • The first order of business for a hash function is to compute an integer value
  • If we expect the hash function to produce a valid index for our chosen table size, that integer will probably be out of range
  • That is easily remedied by modding the integer by the table size
  • There is some reason to believe that it is better if the table size is a prime, or at least has no small prime factors

Folding:

  • Portions of the key are often recombined, or folded together
  • Shift folding: 123-45-6789 à 123 + 456 + 789
  • Boundary folding: 123-45-6789 à 123 + 654 + 789
  • Can be efficiently performed using bitwise operations
  • The characters of a string can be xor’d together, but small numbers result
  • Chunks of characters can be xor’d instead, say in integer-sized chunks

Mid-Square Function:

  • Square the key, then use the middle part as the result, example: 3121 à 9740641 à 406 (with a table size of 1000)
  • A string would first be transformed into a number, say by folding
  • Idea is to let all of the key influence the result
  • If the table size is a power of 2, this can be done efficiently at the bit level: 3121 à 100101001010000101100001 à 0101000010 (with a table size of 1024)

Extraction:

  • Use only part of the key to computing the result
  • Motivation may be related to the distribution of the actual key values, e.g., VT student IDs almost all begin with 904, so it would contribute no useful separation

Radix Transformation:

  • Change the base-of-representation of the numeric key, mod by table size
  • Not much of a rationale for it

A good hash function should:

  • Be easy and quick to compute
  • Achieve an even distribution of the key values that actually occur across the index range supported by the table
  • Ideally, be mathematically one-to-one on the set of relevant key values

Note: Hash functions are NOT random in any sense.

Collision: It is a situation in which the hash function returns the same hash key for more than one record, it is called a collision. Sometimes when we are going to resolve the collision it may lead to an overflow condition and this overflow and collision condition makes the poor hash function.

What if we insert an element say 15 to existing hash table?

Insert : 15
Key = element % key
Key = 15 % 7
Key = 1

But already arr[1] has element 8. Here, two or more different elements pointing to the same index under modulo size, This is called a collision.

Collision Resolution Technique: If there is a problem of collision occurs then it can be handled by applying some technique. These techniques are called as collision resolution techniques. There are generally four techniques which are described below.

  • Chaining: It is a method in which additional field with data i.e. chain is introduced. A chain is maintained at the home bucket. In this when a collision occurs then a linked list is maintained for colliding data.

Example: Let us consider a hash table of size 10 and we apply a hash function of H(key)=key % size of the table. Let us take the keys to be inserted are 31,33,77,61. In the above diagram we can see at same bucket 1 there are two records which are maintained by linked list or we can say by chaining method.

  • Linear Probing: It is a very easy and simple method to resolve or to handle the collision. In this collision can be solved by placing the second record linearly down, whenever the empty place is found. In this method there is a problem of clustering which means at some place block of a data is formed in a hash table.

Example: Let us consider a hash table of size 10 and the hash function is defined as H(key)=key % table size. Consider that following keys are to be inserted that are 56,64,36,71.

In this diagram we can see that 56 and 36 need to be placed at same bucket but by linear probing technique the records linearly placed downward if place is empty i.e. it can be seen 36 is placed at index 7.

  • Quadratic Probing This is a method in which solving of clustering problem is done. In this method the hash function is defined by the H(key)=(H(key)+x*x)%table size. Let us consider we have to insert following elements that are:-67, 90,55,17,49.

In this, we can see if we insert 67, 90, and 55 it can be inserted easily but at the case of 17 hash function is used in such a manner that :-(17+0*0)%10=17 (when x=0 it provide the index value 7 only) by making the increment in the value of x. let x =1 so (17+1*1)%10=8.in this case bucket, 8 is empty hence we will place 17 at index 8.

  • Double Hashing: It is a technique in which two hash function is used when there is an occurrence of a collision. In this method, 1 hash function is simple as same as the division method. But for the second hash function, there are two important rules which are:
  • It must never evaluate to zero
  • Must sure about the buckets, that they are probed

The hash functions for this technique are:

    H1(key)=key % table size
    H2(key)=P-(key mod P)

Where, p is a prime number which should be taken smaller than the size of a hash table.

Example: Let us consider we have to insert 67, 90,55,17,49.

In this, we can see 67, 90 and 55 can be inserted in a hash table by using the first hash function but in case of 17 again the bucket is full and in this case, we have to use the second hash function which is H2(key)=P-(key mode P) where p is a prime number which should be taken smaller than the hash table so the value of p will be the 7. i.e. H2(17)=7-(17%7)=7-3=4 that means we have to take 4 jumps for placing the 17. Therefore 17 will be placed at index 1.

Basic Operations

Following are the basic primary operations of a hash table.

  • Search: Searches an element in a hash table
  • Insert: Inserts an element in a hash table
  • Delete: Deletes an element from a hash table

Initialise the hash bucket:

Before inserting elements into an array. Let’s make array default value as -1. -1 indicates element not present or the particular index is available to insert.

Inserting elements in the hash table:

  • Insert 24
  • Insert 8

Searching elements from the hash table:

  • Search 19
  • Search 8

Deleting an element from the hash table:

Here, we are not going to remove the element. We just mark the index as -1. It indirectly deletes the element from an array.

  • Delete 24

To explore our courses, click here.

By Yogesh Kumar