Sharding vs. Partitioning: Understanding and Implementing Scalability in Modern Databases (Part 2)

Introduction

Imagine you have a massive list—say, millions of email addresses—and you need to quickly check if a specific email is on that list. Storing and searching through every single email address can be slow and take up a lot of memory. This is where Bloom Filters come in handy.

In this guide, I'll walk you through how the Bloom Filter works and show you how to build one in Java.

If you're looking for a clear, hands-on guide to Bloom Filter in Java introductory guide, keep reading.

What is a Bloom Filter?

A Bloom filter quickly tells you if an item might be present or is definitely absent. Instead of saving every element, it uses a fixed-size bit array and several hash functions to mark the presence of data.

Bloom filter is great for situations where you can tolerate a little uncertainty (a false positive) but cannot afford a missed item. In many applications, this trade-off works perfectly.

For example, a Bloom filter can quickly tell you if an email address is definitely not on your list. However, if it indicates that an email address might be present, you might need a secondary check to confirm.

Key points of Bloom Filters:

Saves memory: Instead of storing actual elements, a Bloom filter uses a small bit array.
Delivers fast lookups: The time it takes to check remains the same, no matter how many elements you process.
Avoids false negatives: If a Bloom filter tells you an element is missing, you can trust that result.
May produce false positives: At times, it might indicate an element exists when it does not. In many cases, this small chance is an acceptable trade-off.

How Does a Bloom Filter Work?

The Setup

You start with a long row of boxes (a bit array) where each box is empty (all set to 0). You also have several simple rules (hash functions) that take an item and tell you which boxes to check.

Adding an Item

When you add an item—say, an email address—the Bloom filter uses each of its hash functions to decide which boxes to mark. Each chosen box gets filled in (set to 1). This step is fast and uses little memory.

Checking for an Item

To see if an item is in the list, the filter applies the same hash functions to that item.

If every box it points to is filled (1), the item might be in the list.
If even one box is still empty (0), you know for sure the item was never added.

Understanding False Positives in Bloom Filters

Imagine you're signing up for a new account on a popular website like GitHub. You enter a username, and almost instantly, it tells you if the name is already taken or available.

But how does it check millions of usernames so quickly? Instead of searching a huge database every time, GitHub might use a Bloom filter. It can quickly guess if a username is already taken, but sometimes, it wrongly says a username exists when it actually doesn’t—this is called a false positive!

A false positive occurs when a Bloom filter mistakenly indicates that an element is present even though it was never added. This happens because different elements can set some of the same bits in the bit array due to hash collisions.

Example of a False Positive

To see how this works, let’s suppose a website is using a Bloom filter to check usernames:

You create a new account with the username "devMaster".
- The Bloom filter hashes "devMaster" and marks positions {3, 7, 12} in the bit array.
Someone else registers "codeNinja", which maps to {2, 7, 15}.
Now, another user tries "techWizard", which hashes to {2, 3, 7}.
Since positions 2, 3, and 7 were already marked, the Bloom filter may incorrectly indicate that "techWizard" is taken—even though it isn’t!

Remember, Bloom filters don’t guarantee 100% accuracy—they are best suited for scenarios where occasional false positives are acceptable.

Bloom filters never say "available" if a username is truly taken (no false negatives), but they can falsely say a username is taken when it isn’t (false positives).

Step 1: What Are We Building?

We're going to build a simple Bloom filter in Java that can:

Quickly check if an item might exist
Handle hash collisions efficiently
Be used for tasks like filtering duplicate requests or caching

How It Works

A Bloom filter is just a bit array and a few hash functions. When we add an element, we flip bits at positions given by hash functions. Later, when we check for an element, we look at those same positions.

If all bits are set → The item might exist (but could be a false positive).
If any bit is NOT set → The item definitely does NOT exist.

Step 2: Setting Up the Bloom Filter Class

First, let's create our BloomFilter class. We'll use BitSet to store our bits and an array of hash functions.


    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.BitSet;

    public class BloomFilter {
        private final BitSet bitSet;
        private final int size;
        private final int hashFunctionCount;

        public BloomFilter(int size, int hashFunctionCount) {
            this.size = size;
            this.hashFunctionCount = Math.min(hashFunctionCount, 8); // Ensure we don't exceed 8 hash functions
            this.bitSet = new BitSet(size);
        }

👆 What’s Happening Here?

BitSet bitSet → Stores our bit array.
size → Defines how big our Bloom filter is.
hashFunctionCount → Controls how many hash functions we use.

💡 Limiting hash functions to 8 ensures we don't exceed what SHA-256 can provide.

Step 3: Creating Hash Functions

Since SHA-256 produces 256 bits, we can split it into 8 different hashes.


    private int[] getHashes(String element) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(element.getBytes());
            int[] hashes = new int[hashFunctionCount];

            for (int i = 0; i < hashFunctionCount; i++) {
                hashes[i] = ByteBuffer.wrap(hash, i * 4, 4).getInt();
            }
            return hashes;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("SHA-256 algorithm not found", e);
        }
    }

💡 How It Works:

MessageDigest.getInstance("SHA-256") → Creates a SHA-256 hash of the input.
ByteBuffer.wrap(hash, i * 4, 4).getInt(); → Extracts 8 different hash values from SHA-256.

Step 4: Adding an Element to the Bloom Filter

When adding an item, we set bits at each hash-generated index.


    private int getIndexFromHash(int hash) {
        return Math.abs(hash % size); // Ensures index is within valid range
    }

    public void add(String element) {
        for (int hash : getHashes(element)) {
            bitSet.set(getIndexFromHash(hash));
        }
    }

💡 Why This Works:

The hash function generates indices for each input.
bitSet.set(index) → Marks those positions as "occupied".

Step 5: Checking for an Element

To check if an element might exist, we verify if all its hash positions are set.


    public boolean mightContain(String element) {
        for (int hash : getHashes(element)) {
            if (!bitSet.get(getIndexFromHash(hash))) {
                return false; // If any bit is 0, element is definitely NOT in filter
            }
        }
        return true; // Otherwise, it MIGHT be in the filter
    }

💡 Key Takeaways:

If any bit is unset, the item was never added.
If all bits are set, the item might exist (but could be a false positive).

Step 6: Testing the Bloom Filter

Let's add some items.


    class BloomFilterTest {

        @Test
        void shouldAddToBloomFilterAndCheckIfItemMightBeContained() {
            // Given
            BloomFilter bloomFilter = new BloomFilter(100, 4);

            // When
            bloomFilter.add("https://www.aniefiok.com");
            bloomFilter.add("https://www.codeninja.com");

            // Then
            assertThat(bloomFilter.mightContain("https://www.aniefiok.com")).isTrue(); // true
            assertThat(bloomFilter.mightContain("https://www.codeninja.com")).isTrue(); // true
            assertThat(bloomFilter.mightContain("https://www.google.com")).isFalse(); // false (most likely)
        }
    }

Step 7: Bloom Filter in a Web Application (Optional)

I built a sample BloomFilter web service that exposes 3 endpoints via REST APIs in Spring Boot.


    @RestController
    @RequestMapping("/bloomfilter")
    public class BloomFilterController {
        private BloomFilter bloomFilter = new BloomFilter(100, 4);

        @PostMapping("/add")
        public String addElement(@RequestParam String element) {
            bloomFilter.add(element);
            return "Added: " + element;
        }

        @GetMapping("/check")
        public String checkElement(@RequestParam String element) {
            return bloomFilter.mightContain(element)
                   ? "Might exist in Bloom filter."
                   : "Definitely NOT in Bloom filter.";
        }
    }

Step 8: Tuning Bloom Filter Parameters

Bloom filter accuracy depends on:

Size of the bit array → Larger means fewer false positives.
Number of hash functions → More hashes = better accuracy (but slower).

Reconfiguring the Bloom Filter


    @PostMapping("/reconfigure")
    public String reconfigure(@RequestParam int size, @RequestParam int hashFunctionCount) {
        if (hashFunctionCount > 8) {
            return "Max hash functions: 8";
        }
        bloomFilter = new BloomFilter(size, hashFunctionCount);
        return "Reconfigured Bloom Filter with size " + size + " and " + hashFunctionCount + " hash functions.";
    }

💡 Why This Matters?

Increasing size reduces false positives but uses more memory.
More hash functions improve accuracy but increase processing time.

Use Cases of Bloom Filters

Bloom filters make systems faster and more efficient without consuming too much memory. They are incredibly useful when we need to quickly check if something exists while saving memory.

That’s why they’re widely used in high-performance applications where a few false positives are okay—but speed and scalability are critical.

Let’s look at some real-world examples:

Web Crawlers (Google, Bing, etc.)

Search engines like Google and Bing use Bloom filters to avoid crawling the same webpage twice. Instead of keeping a giant list of visited URLs, they use a Bloom filter to quickly check if a page was already processed—saving both time and storage.

Databases (Cassandra, Bigtable, etc.)

Databases often need to check if a record exists before looking it up on disk. Bloom filters help databases like Apache Cassandra and Bigtable skip unnecessary reads, making queries much faster.

Security & Firewalls (Cloudflare, Firewalls, etc.)

Bloom filters help security systems block malicious IP addresses efficiently. Services like Cloudflare use them to track known attackers and prevent spam requests, all while using minimal memory.

Spell Checkers (Google Docs, MS Word, etc.)

When you type in Google Docs or MS Word, the spell checker needs to quickly verify if a word is valid. Instead of storing a massive dictionary in RAM, it uses a Bloom filter to rapidly check common words while handling rare words separately.

Bitcoin & Blockchain

Bloom filters play a key role in Bitcoin and other blockchain networks by speeding up transaction searches. Instead of scanning an entire blockchain, they allow lightweight wallets to filter relevant transactions efficiently.

Limitations of Bloom Filters

Bloom filters are fast and memory-efficient, but they’re not perfect. Here are some key drawbacks to keep in mind:

False Positives: Bloom filters don’t guarantee 100% accuracy. Sometimes, they mistakenly say, "Yes, this item exists" when it actually doesn’t. This happens because multiple items can overlap in the bit array due to hash collisions. That’s why Bloom filters are great when occasional mistakes are acceptable but not when 100% accuracy is required.
No Deletions: Once an element is added to a Bloom filter, you can’t remove it. Since bits in the array are shared across multiple items, clearing one bit could accidentally remove data for another element. Some advanced variations (like Counting Bloom Filters) solve this, but the standard Bloom filter only supports additions.
Requires Careful Tuning: A Bloom filter’s effectiveness depends on choosing the right size (m) and number of hash functions (k). Too small, and you’ll get too many false positives. Too large, and you’ll waste memory. Picking the right balance is crucial for performance.

Despite these limitations, Bloom filters are still widely used in places where speed and memory efficiency matter more than 100% accuracy.

Bloom Filter Alternatives

Bloom filters are a powerful tool, but they aren’t the only way to handle membership checks. Depending on your needs, other data structures might work better. Let’s explore a few alternatives and when to use them.

Counting Bloom Filters

Standard Bloom filters don’t support removals, but Counting Bloom Filters fix that by using small counters instead of simple bits. This makes them great for applications where items come and go, like cache eviction or real-time analytics.

Cuckoo Filters

Cuckoo filters are a more advanced alternative that reduces false positives while still being space-efficient. They’re often used in security applications, like Google Safe Browsing, where accuracy matters.

Hash Tables

If you need 100% accuracy and have enough memory, a hash table (like Java’s HashSet) is a simpler and more reliable choice. Unlike Bloom filters, it never produces false positives—but it does consume more space.

Tries

If you’re dealing with prefix searches (like autocompleting search queries or checking domain names), tries can be a better choice. They allow for fast lookups with exact matches, making them useful for DNS resolvers and predictive text systems.

Wrapping It Up

Bloom filters are a smart, space-efficient way to check if something exists in a large dataset without storing everything. They trade a little accuracy (false positives) for blazing-fast performance and minimal memory use, making them perfect for cases where speed matters more than perfection.

If you're working on large-scale systems in Java, adding a Bloom filter can help reduce unnecessary lookups, speed up searches, and optimize performance.

Want to take it further? Try using Bloom filters in distributed caching systems like Redis or in Java-based databases—you’ll see just how powerful they can be!

Find the GitHub link of the code used in this guide here .