Huffman Encoding
Introduction
Huffman encoding algorithm is a data compression algorithm. It is a common type of entropy encoder that encodes fixed-length data objects into variable-length codes. Its purpose is to find the most efficient code possible for a block of data, which reduces the need for padding or other methods used to pad fixed-length codes with zeroes. Huffman coding also produces code rates similar to those produced by Rabin decoding, meaning it can be used as an entropy decoder as well.
The Huffman algorithm was proposed in 1952 by Donald E.
The following is a list of the topics that this blog post will cover:
- What Huffman Encoding is.
- Why is it used?
- How it works.
- Examples of apps that use Huffman Encoding.
This article uses the concepts of MinHeap and Priority Queue. Readers with no prior knowledge can go to Heap and Priority Queue to know more.
General Idea of Huffman Encoding
Huffman Encoding is a method for encoding a list of symbols into a binary string. It is named after David Huffman, the Stanford professor who invented it, and is based on the idea that shorter sequences of symbols are more useful than longer ones. For example, if you want to encode "a," "b," and "c" as binary numbers, you might use the placement values 0, 1, 2 as 128 2 122.
In this way, each symbol has been encoded as an odd number from zero to two times its maximum value, with an even number meaning that it's not in the alphabet we're encoding.
Example
INPUT:
charList[] = { 'c', 'o', 'd', 'i', 'n', 'g' }
charFreq[] = { 7, 11, 13, 17, 19, 23 }
OUTPUT:
Character Code
c 000
o 001
n 01
g 10
d 110
i 111
We get a set of characters and their frequencies as input. We are supposed to encode the characters and return the unique codes for each character.
Approach and Explanation
The Huffman encoding algorithm is an efficient, unambiguous code that analyzes the frequencies of certain characters that appear in a message. Characters that occur more often will be encoded as a shorter-bit string, while symbols with less frequency will be encoded as longer strings. Since different messages have different character frequencies, the Huffman encoding algorithm has more than one variation. This means that the Huffman encoding used in message X will indeed differ from the Huffman encoding used in another message Y. To implement this logic, follow the steps given below:
- We will be maintaining a MinHeap for the nodes of each unique character.
- We will be using a priority queue to implement MinHeap. The order of priority of each character will be their frequency.
- First, create a heap with all the characters and their frequencies.
- To build the MinHeap, we will:
- First, extract two nodes with minimum frequency from the MinHeap.
- Then create a new internal node whose frequency equals the sum of the two node’s frequencies.
- Repeat these two steps until the heap contains only one node. Make this node the root of the tree.
- For getting the codes, follow a simple pattern- 0 when going to left child, 1 when going to right child.
- For more clarity, let us discuss our example.
We input the characters {c,o,d,i,n,g} with frequencies {7,11,13,17,19,23}. Our heap will look like this:
Nodes | Frequency |
c | 7 |
o | 11 |
d | 13 |
i | 17 |
n | 19 |
g | 23 |
We pick the two characters with the smallest frequency. They are ‘c’ and ‘o’ with frequencies 7 and 11. We create an internal node (IN) to store their sum and create a tree. It will look like this:
Our table is updated to:
Nodes | Frequency |
d | 13 |
i | 17 |
Internal Node (IN_1) | 18 |
n | 19 |
g | 23 |
The next set of iterations and the gradual tree formation is shown below:
- Iteration 2: choose from the heap d, i
Nodes | Frequency |
Internal Node (IN_1) | 18 |
n | 19 |
g | 23 |
Internal Node (IN_2) | 30 |
- Iteration 3: choose from the heap IN_1 and n
Nodes | Frequency |
g | 23 |
Internal Node (IN_2) | 30 |
Internal Node (IN_3) | 37 |
- Iteration 4: choose from the heap: g and IN_2
Nodes | Frequency |
Internal Node (IN_3) | 37 |
Internal Node (IN_4) | 53 |
- Iteration 5: choose from the heap: IN_3 and IN_4
Nodes | Frequency |
Internal Node (IN_5) | 90 |
Final tree after all the calculations:
Now for the traversal, follow the given rule, that is 0 = left-child, 1 = right-child. On doing so, we get the tree below:
From this tree, we get the required codes for each character. To get the codes, we simply travel to the character from the root and store the incoming 0s and 1s. The characters and their codes are:
Characters | Codes |
c | 000 |
o | 001 |
n | 01 |
g | 10 |
d | 110 |
i | 111 |
Hopefully, after seeing the explanation of the example, you will be able to formulate a logic and implement it first.
Recommended: First, try to solve the Huffman encoding algorithm on your own. You can go to this link Huffman Coding to try it on our CodeStudio.
C++ implementation
#include <bits/stdc++.h>
using namespace std;
struct MinHeap{
char data;
unsigned freq;
MinHeap *leftChild, *rightCHild;
MinHeap(char data, unsigned freq)
{
leftChild = rightCHild = NULL;
this->data = data;
this->freq = freq;
}
};
struct compare {
bool operator()(MinHeap* leftChild, MinHeap* rightCHild)
{
return (leftChild->freq > rightCHild->freq);
}
};
void printCharCodes(struct MinHeap* root, string str)
{
if (!root)
return;
if (root->data != '$')
cout << " " << root->data << "\t\t" << str << "\n";
printCharCodes(root->leftChild, str + "0");
printCharCodes(root->rightCHild, str + "1");
}
void HuffmanCodes(char data[], int freq[], int size)
{
struct MinHeap *leftChild, *rightCHild, *top;
priority_queue<MinHeap*, vector<MinHeap*>, compare> minHeap;
for (int i = 0; i < size; ++i)
minHeap.push(new MinHeap(data[i], freq[i]));
while (minHeap.size() != 1) {
leftChild = minHeap.top();
minHeap.pop();
rightCHild = minHeap.top();
minHeap.pop();
top = new MinHeap('$', leftChild->freq + rightCHild->freq);
top->leftChild = leftChild;
top->rightCHild = rightCHild;
minHeap.push(top);
}
cout << "Character\t" << "Code" << endl;
printCharCodes(minHeap.top(), "");
}
int main()
{
char charList[] = { 'c', 'o', 'd', 'i', 'n', 'g' };
int charFreq[] = { 7, 11, 13, 17, 19, 23 };
int size = sizeof(charList) / sizeof(charList[0]);
HuffmanCodes(charList, charFreq, size);
return 0;
}
OUTPUT
Character Code
c 000
o 001
n 01
g 10
d 110
i 111
Complexity Analysis
Time Complexity
In the given implementation, we extract the minimum node from the heap in (log n) time, and we do this for all the n nodes. Thus, the time complexity is,
T(n) = O(n log n),
where n is the number of nodes.
Space Complexity
In the given implementation, we have n symbols to encode and store. Thus,
Space complexity = O(n),
where n is the number of nodes.
Frequently Asked Questions
- What are the applications of the Huffman Encoding algorithm?
Ans. Some of the real-life applications include:
a. Huffman encoding is used in compression formats like PKZIP, GZIP.
b. Multimedia codecs like JPEG, PNG, and MP3 use the Huffman encoding algorithm. - Can we solve the Huffman encoding algorithm in linear time?
Ans. Yes, we can do so. If the input array is in sorted order, then creating the MinHeap takes linear time.
Key Takeaways
To summarize the article, we have thoroughly discussed the Huffman encoding algorithm. We first saw the general idea of the algorithm, followed by an example, then the approach to take and the explanation of the example. We also saw the C++ implementation, its time and space complexity. In the end, we summed it up with a few FAQs.
Want to ace the coding rounds of big tech companies? Try our Attempt Unlimited Online Mock Test Series to start your preparation.
Learn various topics from Web Technologies, Programming Fundamentals, Data Structures, and Algorithms from our Library.
Happy Coding!
Comments
No comments yet
Be the first to share what you think