Class BloomFilterBuilder

java.lang.Object
org.apache.datasketches.filters.bloomfilter.BloomFilterBuilder

public final class BloomFilterBuilder extends Object
This class provides methods to help estimate the correct parameters when creating a Bloom filter, and methods to create the filter using those values.

The underlying math is described in the Wikipedia article on Bloom filters.

  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
    createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
    createBySize(long numBits, int numHashes)
    Creates a BloomFilter with given number of bits and number of hash functions, using a rnadom base seed for the hash function.
    createBySize(long numBits, int numHashes, long seed)
    Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
    static long
    Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with the given number of bits.
    static long
    getSerializedFilterSizeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
    Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs.
    initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided WritableMemory.
    initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, org.apache.datasketches.memory.WritableMemory dstMem)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided WritableMemory.
    initializeBySize(long numBits, int numHashes, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
    Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided WritableMemory.
    initializeBySize(long numBits, int numHashes, org.apache.datasketches.memory.WritableMemory dstMem)
    Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided WritableMemory.
    static long
    suggestNumFilterBits(long maxDistinctItems, double targetFalsePositiveProb)
    Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
    static short
    suggestNumHashes(double targetFalsePositiveProb)
    Returns the optimal number of hash functions to achieve a target false positive probability.
    static short
    suggestNumHashes(long maxDistinctItems, long numFilterBits)
    Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • BloomFilterBuilder

      public BloomFilterBuilder()
  • Method Details

    • suggestNumHashes

      public static short suggestNumHashes(long maxDistinctItems, long numFilterBits)
      Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits. This function will provide a result even if the input values exceed the capacity of a single BloomFilter.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      numFilterBits - The intended size of the Bloom Filter in bits
      Returns:
      The suggested number of hash functions to use with the filter
    • suggestNumHashes

      public static short suggestNumHashes(double targetFalsePositiveProb)
      Returns the optimal number of hash functions to achieve a target false positive probability.
      Parameters:
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The suggested number of hash functions to use with the filter.
    • suggestNumFilterBits

      public static long suggestNumFilterBits(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The suggested number of bits to use with the filter
    • getSerializedFilterSizeByAccuracy

      public static long getSerializedFilterSizeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs. This is also the minimum size of a WritableMemory for in-place filter initialization.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The size, in bytes, required to hold the specified BloomFilter when serialized
    • getSerializedFilterSize

      public static long getSerializedFilterSize(long numBits)
      Returns the minimum memory size, in bytes, needed for a serialized BloomFilter with the given number of bits. This is also the minimum size of a WritableMemory for in-place filter initialization.
      Parameters:
      numBits - The number of bits in the target BloomFilter's bit array.
      Returns:
      The size, in bytes, required to hold the specified BloomFilter when serialized
    • createByAccuracy

      public static BloomFilter createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      A new BloomFilter configured for the given input parameters
    • createByAccuracy

      public static BloomFilter createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      seed - A base hash seed
      Returns:
      A new BloomFilter configured for the given input parameters
    • createBySize

      public static BloomFilter createBySize(long numBits, int numHashes)
      Creates a BloomFilter with given number of bits and number of hash functions, using a rnadom base seed for the hash function.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      Returns:
      A new BloomFilter configured for the given input parameters
    • createBySize

      public static BloomFilter createBySize(long numBits, int numHashes, long seed)
      Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      seed - A base hash seed
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeByAccuracy

      public static BloomFilter initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, org.apache.datasketches.memory.WritableMemory dstMem)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided WritableMemory.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      dstMem - A WritableMemory to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeByAccuracy

      public static BloomFilter initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided WritableMemory.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      seed - A base hash seed
      dstMem - A WritableMemory to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeBySize

      public static BloomFilter initializeBySize(long numBits, int numHashes, org.apache.datasketches.memory.WritableMemory dstMem)
      Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided WritableMemory.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      dstMem - A WritableMemory to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeBySize

      public static BloomFilter initializeBySize(long numBits, int numHashes, long seed, org.apache.datasketches.memory.WritableMemory dstMem)
      Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided WritableMemory.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      seed - A base hash seed
      dstMem - A WritableMemory to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters