Probabilistic Data Structures: Bloom Filter

Probabilistic Data Structures: Bloom Filter
Probabilistic Data Structures: Bloom Filter

Bloom filters are one of that concept that always confused me for the longest time in Computer Science. I’m going to take few minutes to actually explain it to you guys and not what are they but why do they exist so I’m going to flip the question well bet if you’re interested to stay tuned.

So here is a problem, forget about bloom filters here’s the problem today we know how to solve but we can do it better. I’m going to write a service a web service Express, Node.JS right that essentially check if my username exists or not and if you think about it a little bit then this feature is very simple to build right?

Build a database with all usernames as you start writing your usernames if you want to build this interface you make a good request does user name with “Arun” exists. You make a request to the server Express/Django anything and then you execute a query against your database like select a username from this table.

Hopefully, you have an index there and if the record comes back that means the username exists if not then it doesn’t exist right?

The problem is this is very slow right and this feature is going to be very popular. All the users going to this web page and timing hey does test123 exist?

I mean everybody wants a fancy nickname right?

So here’s the problem this is going to be really slow.

Introduction

According to Wikipedia, a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. There is a possibility of False positive matches, but false negatives are not – in other words, a query returns either “possibly in a set” or “definitely not in set”.

You can also add elements to the set, but not remove them (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.

One more thing to note is because of its probabilistic nature, the Bloom Filter trades space and performance for accuracy which is very similar to the CAP theorem, we choose performance over accuracy. Bloom filters have some interesting use cases. For example, Bloom filters can also be placed on top of a datastore.

When there is a need for a key for its existence and the filter does not have it, we can skip querying the datastore entirely.

So problem is in front of us now what we are going to do. Well I heard about this Redis (Redis) thing i.e. actually in-memory database. So let’s take it from disk and put it in Redis well that’s fine you’re going to do the same thing execute the same get request but this time going to hit the database right and if it’s not there I might sometimes need to go to the actual database because these two can get out of sync.

So you created some inefficiency and you actually doubled your memory footprint because you’re storing the data here and storing the data here just to solve this simple problem. So we know how to solve this thing but some smart people, computer science Professors came up with a solution, very efficient solution and they called it bloom filter so let’s explain what these things are?

How to store a value?

Basically, an empty Bloom Filter is a bit array of m bits in which all bits are set to 0. For storing a value,  there must also be either:

k different hash functions which define that each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

Typically, k is a constant, which is very small in comparison to m (k << m), which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter.

Feed the value to each of the k hash functions to get k array positions.

  • or one hash function which hashes bits of the value and returns k array positions.
    • there’re several ways to achieve this, we’ll talk about this later

To add an element, Set the bits at all these positions to 1.

How to query a value?

To query for an element (test whether it is in the set), get the k array positions by either

  • Feeding it to each of the k hash functions.
  • The single hash function returns k positions.

If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted.

If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom Filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.

Applications of a Bloom Filter:

  • Bloom filters are used in spell checking so that it can track the words of the dictionary
  • For determining the strength of created password when setting up a new account
  • Websites can also track their traffic with the help of a bloom filter based on the IP address of a user

Frequently Asked Questions

Is it a space-efficient probabilistic data structure?

You might wonder but bloom filter actually doesn’t store any elements, it only stores the membership of elements.

What is difference in using tries and bloom filters?

Tries can only be used for strings while bloom filters can be used for any type of data type.

Conclusion

Alright, guys, that’s a bloom filter, in a nutshell, I know the actual implementation of bloom filters are a little bit fancier they use like three locations and all that stuff right sometimes they have more bits. They use three hash functions just to make the odds harder to get right but that’s just to me that such implementation. But if you understand how that works and why it exists right so a lot of limitation of bloom filters is you can get into a case where all of these bits can become 1 and in this case your bloom filter is useless.

You became the first case where you’re always going to hit the database. It’s not really harmful, it’s not going to slow you down but it won’t benefit you per se. So you have to think about this like the bigger you make this thing you kind of interfere with your memory footprint. I mean it depends on how big it is right really but the shorter it is you will get all these false-positive cases where you can always hit the database.

To know more about web development, please refer to our course pages.

By Yogesh Kumar