Class BloomFilter

java.lang.Object
org.apache.datasketches.filters.bloomfilter.BloomFilter

public final class BloomFilter extends Object
A Bloom filter is a data structure that can be used for probabilistic set membership.

When querying a Bloom filter, there are no false positives. Specifically: When querying an item that has already been inserted to the filter, the filter will always indicate that the item is present. There is a chance of false positives, where querying an item that has never been presented to the filter will indicate that the item has already been seen. Consequently, any query should be interpreted as "might have seen."

A standard Bloom filter is unlike typical sketches in that it is not sub-linear in size and does not resize itself. A Bloom filter will work up to a target number of distinct items, beyond which it will saturate and the false positive rate will start to increase. The size of a Bloom filter will be linear in the expected number of distinct items.

See the BloomFilterBuilder class for methods to create a filter, especially one sized correctly for a target number of distinct elements and a target false positive probability.

This implementation uses xxHash64 and follows the approach in Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter," Wiley Interscience, 2008, pp. 187-218.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final long
    The maximum size of a bloom filter in bits.
  • Method Summary

    Modifier and Type
    Method
    Description
    long
    Returns the number of bits in the BloomFilter that are set to 1.
    long
    Returns the total number of bits in the BloomFilter.
    double
    Returns the percentage of all bits in the BloomFilter set to 1.
    short
    Returns the configured number of hash functions for this BloomFilter
    long
    Returns the hash seed for this BloomFilter.
    static long
    getSerializedSize(long numBits)
    Returns the serialized length of a non-empty BloomFilter of the given size, in bytes
    long
    Returns the length of this BloomFilter when serialized, in bytes
    boolean
    Returns whether the filter has a backing Memory object
    heapify(org.apache.datasketches.memory.Memory mem)
    Reads a serialized image of a BloomFilter from the provided Memory
    void
    Intersects two BloomFilters by applying a logical AND.
    void
    Inverts all the bits of the BloomFilter.
    boolean
    Helps identify if two BloomFilters may be unioned or intersected.
    boolean
    Returns whether the filter is a direct (off-heap) or on-heap object.
    boolean
    Checks if the BloomFilter has processed any items
    boolean
    Returns whether the filter is in read-only mode.
    boolean
    query(byte[] data)
    Queries the filter with the provided byte[] and returns whether the array might have been seen previously.
    boolean
    query(char[] data)
    Queries the filter with the provided char[] and returns whether the array might have been seen previously.
    boolean
    query(double item)
    Queries the filter with the provided double and returns whether the value might have been seen previously.
    boolean
    query(int[] data)
    Queries the filter with the provided int[] and returns whether the array might have been seen previously.
    boolean
    query(long item)
    Queries the filter with the provided long and returns whether the value might have been seen previously.
    boolean
    query(long[] data)
    Queries the filter with the provided long[] and returns whether the array might have been seen previously.
    boolean
    query(short[] data)
    Queries the filter with the provided short[] and returns whether the array might have been seen previously.
    boolean
    query(String item)
    Queries the filter with the provided double and returns whether the value might have been seen previously.
    boolean
    query(org.apache.datasketches.memory.Memory mem)
    Queries the filter with the provided Memory and returns whether the data might have been seen previously.
    boolean
    queryAndUpdate(byte[] data)
    Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.
    boolean
    queryAndUpdate(char[] data)
    Updates the filter with the provided char[] and returns the result from quering that array prior to the update.
    boolean
    queryAndUpdate(double item)
    Updates the filter with the provided double and returns the result from quering that value prior to the update.
    boolean
    queryAndUpdate(int[] data)
    Updates the filter with the provided int[] and returns the result from quering that array prior to the update.
    boolean
    queryAndUpdate(long item)
    Updates the filter with the provided long and returns the result from quering that value prior to the update.
    boolean
    queryAndUpdate(long[] data)
    Updates the filter with the provided long[] and returns the result from quering that array prior to the update.
    boolean
    queryAndUpdate(short[] data)
    Updates the filter with the provided short[] and returns the result from quering that array prior to the update.
    boolean
    Updates the filter with the provided String and returns the result from quering that value prior to the update.
    boolean
    queryAndUpdate(org.apache.datasketches.memory.Memory mem)
    Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.
    void
    Resets the BloomFilter to an empty state
    byte[]
    Serializes the current BloomFilter to an array of bytes.
    long[]
    Serializes the current BloomFilter to an array of longs.
     
    void
    Unions two BloomFilters by applying a logical OR.
    void
    update(byte[] data)
    Updates the filter with the provided byte[].
    void
    update(char[] data)
    Updates the filter with the provided char[].
    void
    update(double item)
    Updates the filter with the provided double value.
    void
    update(int[] data)
    Updates the filter with the provided int[].
    void
    update(long item)
    Updates the filter with the provided long value.
    void
    update(long[] data)
    Updates the filter with the provided long[].
    void
    update(short[] data)
    Updates the filter with the provided short[].
    void
    update(String item)
    Updates the filter with the provided String.
    void
    update(org.apache.datasketches.memory.Memory mem)
    Updates the filter with the data in the provided Memory.
    wrap(org.apache.datasketches.memory.Memory mem)
    Wraps the given Memory into this filter class.
    writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
    Wraps the given WritableMemory into this filter class.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • MAX_SIZE_BITS

      public static final long MAX_SIZE_BITS
      The maximum size of a bloom filter in bits.
  • Method Details

    • heapify

      public static BloomFilter heapify(org.apache.datasketches.memory.Memory mem)
      Reads a serialized image of a BloomFilter from the provided Memory
      Parameters:
      mem - Memory containing a previously serialized BloomFilter
      Returns:
      a BloomFilter object
    • wrap

      public static BloomFilter wrap(org.apache.datasketches.memory.Memory mem)
      Wraps the given Memory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.
      Parameters:
      mem - the given Memory object
      Returns:
      the wrapping BloomFilter class.
    • writableWrap

      public static BloomFilter writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
      Wraps the given WritableMemory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.
      Parameters:
      wmem - the given WritableMemory object
      Returns:
      the wrapping BloomFilter class.
    • reset

      public void reset()
      Resets the BloomFilter to an empty state
    • isEmpty

      public boolean isEmpty()
      Checks if the BloomFilter has processed any items
      Returns:
      True if the BloomFilter is empty, otherwise False
    • getBitsUsed

      public long getBitsUsed()
      Returns the number of bits in the BloomFilter that are set to 1.
      Returns:
      The number of bits in use in this filter
    • getCapacity

      public long getCapacity()
      Returns the total number of bits in the BloomFilter.
      Returns:
      The total size of the BloomFilter
    • getNumHashes

      public short getNumHashes()
      Returns the configured number of hash functions for this BloomFilter
      Returns:
      The number of hash functions to apply to inputs
    • getSeed

      public long getSeed()
      Returns the hash seed for this BloomFilter.
      Returns:
      The hash seed for this filter
    • hasMemory

      public boolean hasMemory()
      Returns whether the filter has a backing Memory object
      Returns:
      true if backed by Memory, otherwise false
    • isReadOnly

      public boolean isReadOnly()
      Returns whether the filter is in read-only mode. That is possible only if there is a backing Memory in read-only mode.
      Returns:
      true if read-only, otherwise false
    • isDirect

      public boolean isDirect()
      Returns whether the filter is a direct (off-heap) or on-heap object. That is possible only if there is a backing Memory.
      Returns:
      true if using direct memory access, otherwise false
    • getFillPercentage

      public double getFillPercentage()
      Returns the percentage of all bits in the BloomFilter set to 1.
      Returns:
      the percentage of bits in the filter set to 1
    • update

      public void update(long item)
      Updates the filter with the provided long value.
      Parameters:
      item - an item with which to update the filter
    • update

      public void update(double item)
      Updates the filter with the provided double value. The value is canonicalized (NaN and infinities) prior to updating.
      Parameters:
      item - an item with which to update the filter
    • update

      public void update(String item)
      Updates the filter with the provided String. The string is converted to a byte array using UTF8 encoding.

      Note: this will not produce the same output hash values as the update(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

      Parameters:
      item - an item with which to update the filter
    • update

      public void update(byte[] data)
      Updates the filter with the provided byte[].
      Parameters:
      data - an array with which to update the filter
    • update

      public void update(char[] data)
      Updates the filter with the provided char[].
      Parameters:
      data - an array with which to update the filter
    • update

      public void update(short[] data)
      Updates the filter with the provided short[].
      Parameters:
      data - an array with which to update the filter
    • update

      public void update(int[] data)
      Updates the filter with the provided int[].
      Parameters:
      data - an array with which to update the filter
    • update

      public void update(long[] data)
      Updates the filter with the provided long[].
      Parameters:
      data - an array with which to update the filter
    • update

      public void update(org.apache.datasketches.memory.Memory mem)
      Updates the filter with the data in the provided Memory.
      Parameters:
      mem - a Memory object with which to update the filter
    • queryAndUpdate

      public boolean queryAndUpdate(long item)
      Updates the filter with the provided long and returns the result from quering that value prior to the update.
      Parameters:
      item - an item with which to update the filter
      Returns:
      The query result prior to applying the update
    • queryAndUpdate

      public boolean queryAndUpdate(double item)
      Updates the filter with the provided double and returns the result from quering that value prior to the update. The double is canonicalized (NaN and +/- infinity) in the call.
      Parameters:
      item - an item with which to update the filter
      Returns:
      The query result prior to applying the update
    • queryAndUpdate

      public boolean queryAndUpdate(String item)
      Updates the filter with the provided String and returns the result from quering that value prior to the update. The string is converted to a byte array using UTF8 encoding.

      Note: this will not produce the same output hash values as the queryAndUpdate(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

      Parameters:
      item - an item with which to update the filter
      Returns:
      The query result prior to applying the update, or false if item is null
    • queryAndUpdate

      public boolean queryAndUpdate(byte[] data)
      Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.
      Parameters:
      data - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if data is null
    • queryAndUpdate

      public boolean queryAndUpdate(char[] data)
      Updates the filter with the provided char[] and returns the result from quering that array prior to the update.
      Parameters:
      data - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if data is null
    • queryAndUpdate

      public boolean queryAndUpdate(short[] data)
      Updates the filter with the provided short[] and returns the result from quering that array prior to the update.
      Parameters:
      data - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if data is null
    • queryAndUpdate

      public boolean queryAndUpdate(int[] data)
      Updates the filter with the provided int[] and returns the result from quering that array prior to the update.
      Parameters:
      data - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if data is null
    • queryAndUpdate

      public boolean queryAndUpdate(long[] data)
      Updates the filter with the provided long[] and returns the result from quering that array prior to the update.
      Parameters:
      data - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if data is null
    • queryAndUpdate

      public boolean queryAndUpdate(org.apache.datasketches.memory.Memory mem)
      Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.
      Parameters:
      mem - an array with which to update the filter
      Returns:
      The query result prior to applying the update, or false if mem is null
    • query

      public boolean query(long item)
      Queries the filter with the provided long and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      item - an item with which to query the filter
      Returns:
      The result of querying the filter with the given item
    • query

      public boolean query(double item)
      Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. Double values are canonicalized (NaN and +/- infinity) before querying.
      Parameters:
      item - an item with which to query the filter
      Returns:
      The result of querying the filter with the given item
    • query

      public boolean query(String item)
      Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. The string is converted to a byte array using UTF8 encoding.

      Note: this will not produce the same output hash values as the update(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

      Parameters:
      item - an item with which to query the filter
      Returns:
      The result of querying the filter with the given item, or false if item is null
    • query

      public boolean query(byte[] data)
      Queries the filter with the provided byte[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      data - an array with which to query the filter
      Returns:
      The result of querying the filter with the given data, or false if data is null
    • query

      public boolean query(char[] data)
      Queries the filter with the provided char[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      data - an array with which to query the filter
      Returns:
      The result of querying the filter with the given data, or false if data is null
    • query

      public boolean query(short[] data)
      Queries the filter with the provided short[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      data - an array with which to query the filter
      Returns:
      The result of querying the filter with the given data, or false if data is null
    • query

      public boolean query(int[] data)
      Queries the filter with the provided int[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      data - an array with which to query the filter
      Returns:
      The result of querying the filter with the given data, or false if data is null
    • query

      public boolean query(long[] data)
      Queries the filter with the provided long[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      data - an array with which to query the filter
      Returns:
      The result of querying the filter with the given data, or false if data is null
    • query

      public boolean query(org.apache.datasketches.memory.Memory mem)
      Queries the filter with the provided Memory and returns whether the data might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
      Parameters:
      mem - a Memory array with which to query the filter
      Returns:
      The result of querying the filter with the given Memory, or false if data is null
    • union

      public void union(BloomFilter other)
      Unions two BloomFilters by applying a logical OR. The result will recognized any values seen by either filter (as well as false positives).
      Parameters:
      other - A BloomFilter to union with this one
    • intersect

      public void intersect(BloomFilter other)
      Intersects two BloomFilters by applying a logical AND. The result will recognize only values seen by both filters (as well as false positives).
      Parameters:
      other - A BloomFilter to union with this one
    • invert

      public void invert()
      Inverts all the bits of the BloomFilter. Approximately inverts the notion of set-membership.
    • isCompatible

      public boolean isCompatible(BloomFilter other)
      Helps identify if two BloomFilters may be unioned or intersected.
      Parameters:
      other - A BloomFilter to check for compatibility with this one
      Returns:
      True if the filters are compatible, otherwise false
    • getSerializedSizeBytes

      public long getSerializedSizeBytes()
      Returns the length of this BloomFilter when serialized, in bytes
      Returns:
      The length of this BloomFilter when serialized, in bytes
    • getSerializedSize

      public static long getSerializedSize(long numBits)
      Returns the serialized length of a non-empty BloomFilter of the given size, in bytes
      Parameters:
      numBits - The number of bits of to use for size computation
      Returns:
      The serialized length of a non-empty BloomFilter of the given size, in bytes
    • toByteArray

      public byte[] toByteArray()
      Serializes the current BloomFilter to an array of bytes.

      Note: Method throws if the serialized size exceeds Integer.MAX_VALUE.

      Returns:
      A serialized image of the current BloomFilter as byte[]
    • toLongArray

      public long[] toLongArray()
      Serializes the current BloomFilter to an array of longs. Unlike toByteArray(), this method can handle any size filter.
      Returns:
      A serialized image of the current BloomFilter as long[]
    • toString

      public String toString()
      Overrides:
      toString in class Object