Class BloomFilter


  • public final class BloomFilter
    extends Object

    A Bloom filter is a data structure that can be used for probabilistic set membership.

    When querying a Bloom filter, there are no false positives. Specifically: When querying an item that has already been inserted to the filter, the filter will always indicate that the item is present. There is a chance of false positives, where querying an item that has never been presented to the filter will indicate that the item has already been seen. Consequently, any query should be interpreted as "might have seen."

    A standard Bloom filter is unlike typical sketches in that it is not sub-linear in size and does not resize itself. A Bloom filter will work up to a target number of distinct items, beyond which it will saturate and the false positive rate will start to increase. The size of a Bloom filter will be linear in the expected number of distinct items.

    See the BloomFilterBuilder class for methods to create a filter, especially one sized correctly for a target number of distinct elements and a target false positive probability.

    This implementation uses xxHash64 and follows the approach in Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter," Wiley Interscience, 2008, pp. 187-218.

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static long MAX_SIZE_BITS
      The maximum size of a bloom filter in bits.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      long getBitsUsed()
      Returns the number of bits in the BloomFilter that are set to 1.
      long getCapacity()
      Returns the total number of bits in the BloomFilter.
      double getFillPercentage()
      Returns the percentage of all bits in the BloomFilter set to 1.
      short getNumHashes()
      Returns the configured number of hash functions for this BloomFilter
      long getSeed()
      Returns the hash seed for this BloomFilter.
      static long getSerializedSize​(long numBits)
      Returns the serialized length of a non-empty BloomFilter of the given size, in bytes
      long getSerializedSizeBytes()
      Returns the length of this BloomFilter when serialized, in bytes
      boolean hasMemory()
      Returns whether the filter has a backing Memory object
      static BloomFilter heapify​(org.apache.datasketches.memory.Memory mem)
      Reads a serialized image of a BloomFilter from the provided Memory
      void intersect​(BloomFilter other)
      Intersects two BloomFilters by applying a logical AND.
      void invert()
      Inverts all the bits of the BloomFilter.
      boolean isCompatible​(BloomFilter other)
      Helps identify if two BloomFilters may be unioned or intersected.
      boolean isDirect()
      Returns whether the filter is a direct (off-heap) or on-heap object.
      boolean isEmpty()
      Checks if the BloomFilter has processed any items
      boolean isReadOnly()
      Returns whether the filter is in read-only mode.
      boolean query​(byte[] data)
      Queries the filter with the provided byte[] and returns whether the array might have been seen previously.
      boolean query​(char[] data)
      Queries the filter with the provided char[] and returns whether the array might have been seen previously.
      boolean query​(double item)
      Queries the filter with the provided double and returns whether the value might have been seen previously.
      boolean query​(int[] data)
      Queries the filter with the provided int[] and returns whether the array might have been seen previously.
      boolean query​(long item)
      Queries the filter with the provided long and returns whether the value might have been seen previously.
      boolean query​(long[] data)
      Queries the filter with the provided long[] and returns whether the array might have been seen previously.
      boolean query​(short[] data)
      Queries the filter with the provided short[] and returns whether the array might have been seen previously.
      boolean query​(String item)
      Queries the filter with the provided double and returns whether the value might have been seen previously.
      boolean query​(org.apache.datasketches.memory.Memory mem)
      Queries the filter with the provided Memory and returns whether the data might have been seen previously.
      boolean queryAndUpdate​(byte[] data)
      Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.
      boolean queryAndUpdate​(char[] data)
      Updates the filter with the provided char[] and returns the result from quering that array prior to the update.
      boolean queryAndUpdate​(double item)
      Updates the filter with the provided double and returns the result from quering that value prior to the update.
      boolean queryAndUpdate​(int[] data)
      Updates the filter with the provided int[] and returns the result from quering that array prior to the update.
      boolean queryAndUpdate​(long item)
      Updates the filter with the provided long and returns the result from quering that value prior to the update.
      boolean queryAndUpdate​(long[] data)
      Updates the filter with the provided long[] and returns the result from quering that array prior to the update.
      boolean queryAndUpdate​(short[] data)
      Updates the filter with the provided short[] and returns the result from quering that array prior to the update.
      boolean queryAndUpdate​(String item)
      Updates the filter with the provided String and returns the result from quering that value prior to the update.
      boolean queryAndUpdate​(org.apache.datasketches.memory.Memory mem)
      Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.
      void reset()
      Resets the BloomFilter to an empty state
      byte[] toByteArray()
      Serializes the current BloomFilter to an array of bytes.
      long[] toLongArray()
      Serializes the current BloomFilter to an array of longs.
      String toString()  
      void union​(BloomFilter other)
      Unions two BloomFilters by applying a logical OR.
      void update​(byte[] data)
      Updates the filter with the provided byte[].
      void update​(char[] data)
      Updates the filter with the provided char[].
      void update​(double item)
      Updates the filter with the provided double value.
      void update​(int[] data)
      Updates the filter with the provided int[].
      void update​(long item)
      Updates the filter with the provided long value.
      void update​(long[] data)
      Updates the filter with the provided long[].
      void update​(short[] data)
      Updates the filter with the provided short[].
      void update​(String item)
      Updates the filter with the provided String.
      void update​(org.apache.datasketches.memory.Memory mem)
      Updates the filter with the data in the provided Memory.
      static BloomFilter wrap​(org.apache.datasketches.memory.Memory mem)
      Wraps the given Memory into this filter class.
      static BloomFilter writableWrap​(org.apache.datasketches.memory.WritableMemory wmem)
      Wraps the given WritableMemory into this filter class.
    • Field Detail

      • MAX_SIZE_BITS

        public static final long MAX_SIZE_BITS
        The maximum size of a bloom filter in bits.
    • Method Detail

      • heapify

        public static BloomFilter heapify​(org.apache.datasketches.memory.Memory mem)
        Reads a serialized image of a BloomFilter from the provided Memory
        Parameters:
        mem - Memory containing a previously serialized BloomFilter
        Returns:
        a BloomFilter object
      • wrap

        public static BloomFilter wrap​(org.apache.datasketches.memory.Memory mem)
        Wraps the given Memory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.
        Parameters:
        mem - the given Memory object
        Returns:
        the wrapping BloomFilter class.
      • writableWrap

        public static BloomFilter writableWrap​(org.apache.datasketches.memory.WritableMemory wmem)
        Wraps the given WritableMemory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.
        Parameters:
        wmem - the given WritableMemory object
        Returns:
        the wrapping BloomFilter class.
      • reset

        public void reset()
        Resets the BloomFilter to an empty state
      • isEmpty

        public boolean isEmpty()
        Checks if the BloomFilter has processed any items
        Returns:
        True if the BloomFilter is empty, otherwise False
      • getBitsUsed

        public long getBitsUsed()
        Returns the number of bits in the BloomFilter that are set to 1.
        Returns:
        The number of bits in use in this filter
      • getCapacity

        public long getCapacity()
        Returns the total number of bits in the BloomFilter.
        Returns:
        The total size of the BloomFilter
      • getNumHashes

        public short getNumHashes()
        Returns the configured number of hash functions for this BloomFilter
        Returns:
        The number of hash functions to apply to inputs
      • getSeed

        public long getSeed()
        Returns the hash seed for this BloomFilter.
        Returns:
        The hash seed for this filter
      • hasMemory

        public boolean hasMemory()
        Returns whether the filter has a backing Memory object
        Returns:
        true if backed by Memory, otherwise false
      • isReadOnly

        public boolean isReadOnly()
        Returns whether the filter is in read-only mode. That is possible only if there is a backing Memory in read-only mode.
        Returns:
        true if read-only, otherwise false
      • isDirect

        public boolean isDirect()
        Returns whether the filter is a direct (off-heap) or on-heap object. That is possible only if there is a backing Memory.
        Returns:
        true if using direct memory access, otherwise false
      • getFillPercentage

        public double getFillPercentage()
        Returns the percentage of all bits in the BloomFilter set to 1.
        Returns:
        the percentage of bits in the filter set to 1
      • update

        public void update​(long item)
        Updates the filter with the provided long value.
        Parameters:
        item - an item with which to update the filter
      • update

        public void update​(double item)
        Updates the filter with the provided double value. The value is canonicalized (NaN and infinities) prior to updating.
        Parameters:
        item - an item with which to update the filter
      • update

        public void update​(String item)
        Updates the filter with the provided String. The string is converted to a byte array using UTF8 encoding.

        Note: this will not produce the same output hash values as the update(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

        Parameters:
        item - an item with which to update the filter
      • update

        public void update​(byte[] data)
        Updates the filter with the provided byte[].
        Parameters:
        data - an array with which to update the filter
      • update

        public void update​(char[] data)
        Updates the filter with the provided char[].
        Parameters:
        data - an array with which to update the filter
      • update

        public void update​(short[] data)
        Updates the filter with the provided short[].
        Parameters:
        data - an array with which to update the filter
      • update

        public void update​(int[] data)
        Updates the filter with the provided int[].
        Parameters:
        data - an array with which to update the filter
      • update

        public void update​(long[] data)
        Updates the filter with the provided long[].
        Parameters:
        data - an array with which to update the filter
      • update

        public void update​(org.apache.datasketches.memory.Memory mem)
        Updates the filter with the data in the provided Memory.
        Parameters:
        mem - a Memory object with which to update the filter
      • queryAndUpdate

        public boolean queryAndUpdate​(long item)
        Updates the filter with the provided long and returns the result from quering that value prior to the update.
        Parameters:
        item - an item with which to update the filter
        Returns:
        The query result prior to applying the update
      • queryAndUpdate

        public boolean queryAndUpdate​(double item)
        Updates the filter with the provided double and returns the result from quering that value prior to the update. The double is canonicalized (NaN and +/- infinity) in the call.
        Parameters:
        item - an item with which to update the filter
        Returns:
        The query result prior to applying the update
      • queryAndUpdate

        public boolean queryAndUpdate​(String item)
        Updates the filter with the provided String and returns the result from quering that value prior to the update. The string is converted to a byte array using UTF8 encoding.

        Note: this will not produce the same output hash values as the queryAndUpdate(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

        Parameters:
        item - an item with which to update the filter
        Returns:
        The query result prior to applying the update, or false if item is null
      • queryAndUpdate

        public boolean queryAndUpdate​(byte[] data)
        Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.
        Parameters:
        data - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if data is null
      • queryAndUpdate

        public boolean queryAndUpdate​(char[] data)
        Updates the filter with the provided char[] and returns the result from quering that array prior to the update.
        Parameters:
        data - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if data is null
      • queryAndUpdate

        public boolean queryAndUpdate​(short[] data)
        Updates the filter with the provided short[] and returns the result from quering that array prior to the update.
        Parameters:
        data - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if data is null
      • queryAndUpdate

        public boolean queryAndUpdate​(int[] data)
        Updates the filter with the provided int[] and returns the result from quering that array prior to the update.
        Parameters:
        data - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if data is null
      • queryAndUpdate

        public boolean queryAndUpdate​(long[] data)
        Updates the filter with the provided long[] and returns the result from quering that array prior to the update.
        Parameters:
        data - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if data is null
      • queryAndUpdate

        public boolean queryAndUpdate​(org.apache.datasketches.memory.Memory mem)
        Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.
        Parameters:
        mem - an array with which to update the filter
        Returns:
        The query result prior to applying the update, or false if mem is null
      • query

        public boolean query​(long item)
        Queries the filter with the provided long and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        item - an item with which to query the filter
        Returns:
        The result of querying the filter with the given item
      • query

        public boolean query​(double item)
        Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. Double values are canonicalized (NaN and +/- infinity) before querying.
        Parameters:
        item - an item with which to query the filter
        Returns:
        The result of querying the filter with the given item
      • query

        public boolean query​(String item)
        Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. The string is converted to a byte array using UTF8 encoding.

        Note: this will not produce the same output hash values as the update(char[]) method and will generally be a little slower depending on the complexity of the UTF8 encoding.

        Parameters:
        item - an item with which to query the filter
        Returns:
        The result of querying the filter with the given item, or false if item is null
      • query

        public boolean query​(byte[] data)
        Queries the filter with the provided byte[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        data - an array with which to query the filter
        Returns:
        The result of querying the filter with the given data, or false if data is null
      • query

        public boolean query​(char[] data)
        Queries the filter with the provided char[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        data - an array with which to query the filter
        Returns:
        The result of querying the filter with the given data, or false if data is null
      • query

        public boolean query​(short[] data)
        Queries the filter with the provided short[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        data - an array with which to query the filter
        Returns:
        The result of querying the filter with the given data, or false if data is null
      • query

        public boolean query​(int[] data)
        Queries the filter with the provided int[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        data - an array with which to query the filter
        Returns:
        The result of querying the filter with the given data, or false if data is null
      • query

        public boolean query​(long[] data)
        Queries the filter with the provided long[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        data - an array with which to query the filter
        Returns:
        The result of querying the filter with the given data, or false if data is null
      • query

        public boolean query​(org.apache.datasketches.memory.Memory mem)
        Queries the filter with the provided Memory and returns whether the data might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.
        Parameters:
        mem - a Memory array with which to query the filter
        Returns:
        The result of querying the filter with the given Memory, or false if data is null
      • union

        public void union​(BloomFilter other)
        Unions two BloomFilters by applying a logical OR. The result will recognized any values seen by either filter (as well as false positives).
        Parameters:
        other - A BloomFilter to union with this one
      • intersect

        public void intersect​(BloomFilter other)
        Intersects two BloomFilters by applying a logical AND. The result will recognize only values seen by both filters (as well as false positives).
        Parameters:
        other - A BloomFilter to union with this one
      • invert

        public void invert()
        Inverts all the bits of the BloomFilter. Approximately inverts the notion of set-membership.
      • isCompatible

        public boolean isCompatible​(BloomFilter other)
        Helps identify if two BloomFilters may be unioned or intersected.
        Parameters:
        other - A BloomFilter to check for compatibility with this one
        Returns:
        True if the filters are compatible, otherwise false
      • getSerializedSizeBytes

        public long getSerializedSizeBytes()
        Returns the length of this BloomFilter when serialized, in bytes
        Returns:
        The length of this BloomFilter when serialized, in bytes
      • getSerializedSize

        public static long getSerializedSize​(long numBits)
        Returns the serialized length of a non-empty BloomFilter of the given size, in bytes
        Parameters:
        numBits - The number of bits of to use for size computation
        Returns:
        The serialized length of a non-empty BloomFilter of the given size, in bytes
      • toByteArray

        public byte[] toByteArray()
        Serializes the current BloomFilter to an array of bytes.

        Note: Method throws if the serialized size exceeds Integer.MAX_VALUE.

        Returns:
        A serialized image of the current BloomFilter as byte[]
      • toLongArray

        public long[] toLongArray()
        Serializes the current BloomFilter to an array of longs. Unlike toByteArray(), this method can handle any size filter.
        Returns:
        A serialized image of the current BloomFilter as long[]