Load Factor and Rehashing
HashMap is a popular data structure for storing key-value pairs and can be used to solve a wide variety of problems. On average, both search() and insert() operations on a HashMap run in constant O(1) time. One of the most frequently asked questions about HashMap concerns rehashing. To understand rehashing, we must first understand the load factor and why it is used.
Briefly, the load factor measures how full the hash table is allowed to get before its capacity is automatically increased. It is important because it directly affects the performance of our HashMap.
Before coming to rehashing and an in-depth analysis of the load factor, let's brush up on the basic hashing principle used by HashMap.
The following operations are performed to insert a key-value pair into HashMap:
- First, the hash code of the key 'K' is calculated using the internal hash() function.
- The hash code is then used to determine the bucket's index position (hashCode % sizeOfBucket). The bucket array internally holds the Nodes.
- If no node with that key exists at that index position, a new node is inserted; otherwise, the existing node's value is updated. Storing multiple nodes in a list at the same index is referred to as chaining.
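The steps above can be sketched in code. The class and field names below (SimpleHashMap, Node, BUCKET_COUNT) are illustrative assumptions, not the JDK's actual internals:

```java
import java.util.LinkedList;

// A minimal sketch of HashMap-style insertion with chaining.
public class SimpleHashMap {
    static class Node {
        String key;
        int value;
        Node(String key, int value) { this.key = key; this.value = value; }
    }

    static final int BUCKET_COUNT = 16;
    @SuppressWarnings("unchecked")
    final LinkedList<Node>[] buckets = new LinkedList[BUCKET_COUNT];

    public void insert(String key, int value) {
        // Step 1: compute the hash code of the key.
        int hash = key.hashCode();
        // Step 2: map the hash code to a bucket index (hashCode % sizeOfBucket).
        int index = Math.floorMod(hash, BUCKET_COUNT);
        if (buckets[index] == null) buckets[index] = new LinkedList<>();
        // Step 3: update the value if the key already exists,
        // otherwise chain a new node onto the bucket's list.
        for (Node n : buckets[index]) {
            if (n.key.equals(key)) { n.value = value; return; }
        }
        buckets[index].add(new Node(key, value));
    }

    public Integer search(String key) {
        int index = Math.floorMod(key.hashCode(), BUCKET_COUNT);
        if (buckets[index] == null) return null;
        for (Node n : buckets[index]) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleHashMap map = new SimpleHashMap();
        map.insert("apple", 1);
        map.insert("apple", 2); // updates the existing node instead of adding one
        System.out.println(map.search("apple")); // prints 2
    }
}
```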
Image showing how a key is mapped to a value
The Load Factor and Complexity
The time taken by the first step depends on the key 'K' and the hash function.
Suppose the key is a string like "abcd" and the hash function processes every character, for example by adding the ASCII values of all characters and taking that sum modulo 100. The cost of hashing then grows with the length of the key.
However, for very large values of 'N' (the number of entries in the map), the key length is negligible compared to 'N'. Hence, hash computation can be assumed to run in constant time, i.e., O(1).
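A toy hash function of the kind described above might look like this (this is an assumption for illustration only, not the JDK's actual hash):

```java
// Sums the ASCII values of all characters and takes the result modulo 100.
// The loop makes the cost proportional to the key's length, which is
// effectively constant for short keys.
public class ToyHash {
    static int toyHash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c; // each character contributes its ASCII/char value
        }
        return sum % 100;
    }

    public static void main(String[] args) {
        // 'a' + 'b' + 'c' + 'd' = 97 + 98 + 99 + 100 = 394, and 394 % 100 = 94
        System.out.println(toyHash("abcd")); // prints 94
    }
}
```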
The second step consists of traversing the list of key-value pairs at that index. In the worst case, all 'N' entries land at the same index, giving a time complexity of O(N). However, hash functions are well studied and designed to distribute keys evenly, so this almost never happens in practice.
So, on average, if there are ‘N’ entries and ‘B’ is the array size, each index will have N/B entries. This N/B figure is known as the load factor, and it represents the load on our map.
This load factor should be kept low so that the number of entries at each index stays small and the complexity remains roughly constant, i.e., O(1).
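The N/B figure is just a division, but it is worth seeing concretely. A minimal sketch (names are illustrative):

```java
// Load factor = N / B: with N entries spread over B buckets, each bucket
// holds about N/B entries on average.
public class LoadFactorDemo {
    static double loadFactor(int entries, int buckets) {
        return (double) entries / buckets;
    }

    public static void main(String[] args) {
        // 12 entries in a 16-bucket array give a load factor of 0.75
        System.out.println(loadFactor(12, 16)); // prints 0.75
    }
}
```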
What is Rehashing?
Rehashing is the process of recalculating the hash codes of already-stored entries (key-value pairs) and moving them to a larger hash table when the threshold is reached or crossed.
When the number of elements in a hash map reaches the maximum threshold value, it is rehashed.
According to the Java specification, a good load factor value is 0.75, and HashMap's default initial capacity is 16. When the number of elements reaches or exceeds 0.75 times the capacity, the complexity begins to rise. To combat this, the array size is doubled, and all the existing values are hashed again and stored in the new, double-sized array. This keeps the load factor, and hence the cost of each operation, low.
So, with the defaults, rehashing begins when the number of items reaches 12 (since 12 = 0.75 * 16).
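The threshold rule can be sketched as follows (method and constant names are assumptions for illustration):

```java
// Rehash when size reaches or exceeds loadFactor * capacity.
// With the defaults (capacity 16, load factor 0.75), that threshold is 12.
public class ThresholdDemo {
    static final double DEFAULT_LOAD_FACTOR = 0.75;

    static boolean needsRehash(int size, int capacity) {
        return size >= DEFAULT_LOAD_FACTOR * capacity;
    }

    public static void main(String[] args) {
        int capacity = 16;
        System.out.println(needsRehash(11, capacity)); // prints false
        System.out.println(needsRehash(12, capacity)); // prints true: 12 = 0.75 * 16
        // When the threshold is crossed, the array size is doubled:
        capacity = capacity * 2; // 32, so the new threshold becomes 24
        System.out.println(needsRehash(12, capacity)); // prints false again
    }
}
```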
Rehashing in Action, Source: learningSolo
But what is the need for rehashing?
Why is Rehashing done?
The reason for rehashing is that every time a new key-value pair is inserted into the map, the load factor rises, and with it the expected cost of each operation; left unchecked, HashMap would no longer offer constant O(1) time complexity. Rehashing disperses the items across a larger array, lowering both the load factor and the complexity, so that search() and insert() keep their constant O(1) average time. One should note that an existing entry may fall into the same bucket or a different one after rehashing.
Now that we know what rehashing is and why it is important, it is equally important to know how rehashing is done.
How is Rehashing done?
Rehashing can be done in the following way:
- Check the load factor after adding a new entry to the map.
- Rehash if it exceeds the pre-defined value (or the default value of 0.75 if none is specified).
- To rehash, make a new bucket array that is double the size of the previous one.
- Then go through each element in the old bucket array and call insert() on each to add it to the new, larger bucket array.
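The steps above can be sketched as a small chained hash table that rehashes itself. This is an illustrative sketch (Entry, RehashDemo, and the load-factor check are assumptions), not the JDK's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// A chained hash table that doubles its bucket array and re-inserts every
// entry once the load factor reaches the threshold.
public class RehashDemo {
    static class Entry {
        final String key;
        final int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    static final double LOAD_FACTOR = 0.75;
    List<Entry>[] buckets;
    int size = 0;

    @SuppressWarnings("unchecked")
    RehashDemo(int capacity) {
        buckets = new List[capacity];
    }

    void insert(String key, int value) {
        int index = Math.floorMod(key.hashCode(), buckets.length);
        if (buckets[index] == null) buckets[index] = new ArrayList<>();
        buckets[index].add(new Entry(key, value));
        size++;
        // Steps 1-2: after adding, check the load factor against the threshold.
        if ((double) size / buckets.length >= LOAD_FACTOR) {
            rehash();
        }
    }

    @SuppressWarnings("unchecked")
    void rehash() {
        // Step 3: make a new bucket array double the size of the old one.
        List<Entry>[] old = buckets;
        buckets = new List[old.length * 2];
        size = 0;
        // Step 4: re-insert every old entry so it lands in its new bucket.
        for (List<Entry> bucket : old) {
            if (bucket == null) continue;
            for (Entry e : bucket) {
                insert(e.key, e.value);
            }
        }
    }

    public static void main(String[] args) {
        RehashDemo map = new RehashDemo(4); // threshold = 0.75 * 4 = 3 entries
        map.insert("a", 1);
        map.insert("b", 2);
        map.insert("c", 3); // third insert triggers a rehash to capacity 8
        System.out.println(map.buckets.length); // prints 8
    }
}
```

Note that rehash() reuses insert(), so each moved entry is re-indexed against the new, larger array, which is exactly why an entry may end up in a different bucket than before.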
I hope you have gotten the hang of these topics and are thrilled about learning something new. But learning never stops, and a good coder keeps practising no matter what. So head over to our practice platform CodeStudio for top problems, mock tests and much more. Till then, Happy Coding!
By - Saksham Gupta