Class BloomFilterBuilder


  • public final class BloomFilterBuilder
    extends Object

    This class provides methods to help estimate the correct parameters when creating a Bloom filter, and methods to create the filter using those values.

    The underlying math is described in the Wikipedia article on Bloom filters.

    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static BloomFilter createByAccuracy​(long maxDistinctItems, double targetFalsePositiveProb)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
      static BloomFilter createByAccuracy​(long maxDistinctItems, double targetFalsePositiveProb, long seed)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
      static BloomFilter createBySize​(long numBits, int numHashes)
      Creates a BloomFilter with given number of bits and number of hash functions, using a rnadom base seed for the hash function.
      static BloomFilter createBySize​(long numBits, int numHashes, long seed)
      Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
      static long getSerializedFilterSize​(long numBits)
      Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with the given number of bits.
      static long getSerializedFilterSizeByAccuracy​(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs.
      static BloomFilter initializeByAccuracy​(long maxDistinctItems, double targetFalsePositiveProb, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided WritableMemory.
      static BloomFilter initializeByAccuracy​(long maxDistinctItems, double targetFalsePositiveProb, org.apache.datasketches.memory.WritableMemory dstMem)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided WritableMemory.
      static BloomFilter initializeBySize​(long numBits, int numHashes, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
      Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided WritableMemory.
      static BloomFilter initializeBySize​(long numBits, int numHashes, org.apache.datasketches.memory.WritableMemory dstMem)
      Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided WritableMemory.
      static long suggestNumFilterBits​(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
      static short suggestNumHashes​(double targetFalsePositiveProb)
      Returns the optimal number of hash functions to achieve a target false positive probability.
      static short suggestNumHashes​(long maxDistinctItems, long numFilterBits)
      Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits.
    • Constructor Detail

      • BloomFilterBuilder

        public BloomFilterBuilder()
    • Method Detail

      • suggestNumHashes

        public static short suggestNumHashes​(long maxDistinctItems,
                                             long numFilterBits)
        Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits. This function will provide a result even if the input values exceed the capacity of a single BloomFilter.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        numFilterBits - The intended size of the Bloom Filter in bits
        Returns:
        The suggested number of hash functions to use with the filter
      • suggestNumHashes

        public static short suggestNumHashes​(double targetFalsePositiveProb)
        Returns the optimal number of hash functions to achieve a target false positive probability.
        Parameters:
        targetFalsePositiveProb - A desired false positive probability per item
        Returns:
        The suggested number of hash functions to use with the filter.
      • suggestNumFilterBits

        public static long suggestNumFilterBits​(long maxDistinctItems,
                                                double targetFalsePositiveProb)
        Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        Returns:
        The suggested number of bits to use with the filter
      • getSerializedFilterSizeByAccuracy

        public static long getSerializedFilterSizeByAccuracy​(long maxDistinctItems,
                                                             double targetFalsePositiveProb)
        Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs. This is also the minimum size of a WritableMemory for in-place filter initialization.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        Returns:
        The size, in bytes, required to hold the specified BloomFilter when serialized
      • getSerializedFilterSize

        public static long getSerializedFilterSize​(long numBits)
        Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with the given number of bits. This is also the minimum size of a WritableMemory for in-place filter initialization.
        Parameters:
        numBits - The number of bits in the target BloomFilter's bit array.
        Returns:
        The size, in bytes, required to hold the specified BloomFilter when serialized
      • createByAccuracy

        public static BloomFilter createByAccuracy​(long maxDistinctItems,
                                                   double targetFalsePositiveProb)
        Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        Returns:
        A new BloomFilter configured for the given input parameters
      • createByAccuracy

        public static BloomFilter createByAccuracy​(long maxDistinctItems,
                                                   double targetFalsePositiveProb,
                                                   long seed)
        Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        seed - A base hash seed
        Returns:
        A new BloomFilter configured for the given input parameters
      • createBySize

        public static BloomFilter createBySize​(long numBits,
                                               int numHashes)
        Creates a BloomFilter with given number of bits and number of hash functions, using a rnadom base seed for the hash function.
        Parameters:
        numBits - The size of the BloomFilter, in bits
        numHashes - The number of hash functions to apply to items
        Returns:
        A new BloomFilter configured for the given input parameters
      • createBySize

        public static BloomFilter createBySize​(long numBits,
                                               int numHashes,
                                               long seed)
        Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
        Parameters:
        numBits - The size of the BloomFilter, in bits
        numHashes - The number of hash functions to apply to items
        seed - A base hash seed
        Returns:
        A new BloomFilter configured for the given input parameters
      • initializeByAccuracy

        public static BloomFilter initializeByAccuracy​(long maxDistinctItems,
                                                       double targetFalsePositiveProb,
                                                       org.apache.datasketches.memory.WritableMemory dstMem)
        Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided WritableMemory.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        dstMem - A WritableMemory to hold the initialized filter
        Returns:
        A new BloomFilter configured for the given input parameters
      • initializeByAccuracy

        public static BloomFilter initializeByAccuracy​(long maxDistinctItems,
                                                       double targetFalsePositiveProb,
                                                       long seed,
                                                       org.apache.datasketches.memory.WritableMemory dstMem)
        Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided WritableMemory.
        Parameters:
        maxDistinctItems - The maximum expected number of distinct items to add to the filter
        targetFalsePositiveProb - A desired false positive probability per item
        seed - A base hash seed
        dstMem - A WritableMemory to hold the initialized filter
        Returns:
        A new BloomFilter configured for the given input parameters
      • initializeBySize

        public static BloomFilter initializeBySize​(long numBits,
                                                   int numHashes,
                                                   org.apache.datasketches.memory.WritableMemory dstMem)
        Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided WritableMemory.
        Parameters:
        numBits - The size of the BloomFilter, in bits
        numHashes - The number of hash functions to apply to items
        dstMem - A WritableMemory to hold the initialized filter
        Returns:
        A new BloomFilter configured for the given input parameters
      • initializeBySize

        public static BloomFilter initializeBySize​(long numBits,
                                                   int numHashes,
                                                   long seed,
                                                   org.apache.datasketches.memory.WritableMemory dstMem)
        Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided WritableMemory.
        Parameters:
        numBits - The size of the BloomFilter, in bits
        numHashes - The number of hash functions to apply to items
        seed - A base hash seed
        dstMem - A WritableMemory to hold the initialized filter
        Returns:
        A new BloomFilter configured for the given input parameters