Class BloomFilter
- java.lang.Object
-
- org.apache.datasketches.filters.bloomfilter.BloomFilter
-
public final class BloomFilter extends Object
A Bloom filter is a data structure that can be used for probabilistic set membership.
When querying a Bloom filter, there are no false positives. Specifically: When querying an item that has already been inserted to the filter, the filter will always indicate that the item is present. There is a chance of false positives, where querying an item that has never been presented to the filter will indicate that the item has already been seen. Consequently, any query should be interpreted as "might have seen."
A standard Bloom filter is unlike typical sketches in that it is not sub-linear in size and does not resize itself. A Bloom filter will work up to a target number of distinct items, beyond which it will saturate and the false positive rate will start to increase. The size of a Bloom filter will be linear in the expected number of distinct items.
See the BloomFilterBuilder class for methods to create a filter, especially one sized correctly for a target number of distinct elements and a target false positive probability.
This implementation uses xxHash64 and follows the approach in Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter," Wiley Interscience, 2008, pp. 187-218.
-
-
Field Summary
Fields Modifier and Type Field Description static long
MAX_SIZE_BITS
The maximum size of a bloom filter in bits.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description long
getBitsUsed()
Returns the number of bits in the BloomFilter that are set to 1.long
getCapacity()
Returns the total number of bits in the BloomFilter.double
getFillPercentage()
Returns the percentage of all bits in the BloomFilter set to 1.short
getNumHashes()
Returns the configured number of hash functions for this BloomFilterlong
getSeed()
Returns the hash seed for this BloomFilter.static long
getSerializedSize(long numBits)
Returns the serialized length of a non-empty BloomFilter of the given size, in byteslong
getSerializedSizeBytes()
Returns the length of this BloomFilter when serialized, in bytesboolean
hasMemory()
Returns whether the filter has a backing Memory objectstatic BloomFilter
heapify(org.apache.datasketches.memory.Memory mem)
Reads a serialized image of a BloomFilter from the provided Memoryvoid
intersect(BloomFilter other)
Intersects two BloomFilters by applying a logical AND.void
invert()
Inverts all the bits of the BloomFilter.boolean
isCompatible(BloomFilter other)
Helps identify if two BloomFilters may be unioned or intersected.boolean
isDirect()
Returns whether the filter is a direct (off-heap) or on-heap object.boolean
isEmpty()
Checks if the BloomFilter has processed any itemsboolean
isReadOnly()
Returns whether the filter is in read-only mode.boolean
query(byte[] data)
Queries the filter with the provided byte[] and returns whether the array might have been seen previously.boolean
query(char[] data)
Queries the filter with the provided char[] and returns whether the array might have been seen previously.boolean
query(double item)
Queries the filter with the provided double and returns whether the value might have been seen previously.boolean
query(int[] data)
Queries the filter with the provided int[] and returns whether the array might have been seen previously.boolean
query(long item)
Queries the filter with the provided long and returns whether the value might have been seen previously.boolean
query(long[] data)
Queries the filter with the provided long[] and returns whether the array might have been seen previously.boolean
query(short[] data)
Queries the filter with the provided short[] and returns whether the array might have been seen previously.boolean
query(String item)
Queries the filter with the provided double and returns whether the value might have been seen previously.boolean
query(org.apache.datasketches.memory.Memory mem)
Queries the filter with the provided Memory and returns whether the data might have been seen previously.boolean
queryAndUpdate(byte[] data)
Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.boolean
queryAndUpdate(char[] data)
Updates the filter with the provided char[] and returns the result from quering that array prior to the update.boolean
queryAndUpdate(double item)
Updates the filter with the provided double and returns the result from quering that value prior to the update.boolean
queryAndUpdate(int[] data)
Updates the filter with the provided int[] and returns the result from quering that array prior to the update.boolean
queryAndUpdate(long item)
Updates the filter with the provided long and returns the result from quering that value prior to the update.boolean
queryAndUpdate(long[] data)
Updates the filter with the provided long[] and returns the result from quering that array prior to the update.boolean
queryAndUpdate(short[] data)
Updates the filter with the provided short[] and returns the result from quering that array prior to the update.boolean
queryAndUpdate(String item)
Updates the filter with the provided String and returns the result from quering that value prior to the update.boolean
queryAndUpdate(org.apache.datasketches.memory.Memory mem)
Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.void
reset()
Resets the BloomFilter to an empty statebyte[]
toByteArray()
Serializes the current BloomFilter to an array of bytes.long[]
toLongArray()
Serializes the current BloomFilter to an array of longs.String
toString()
void
union(BloomFilter other)
Unions two BloomFilters by applying a logical OR.void
update(byte[] data)
Updates the filter with the provided byte[].void
update(char[] data)
Updates the filter with the provided char[].void
update(double item)
Updates the filter with the provided double value.void
update(int[] data)
Updates the filter with the provided int[].void
update(long item)
Updates the filter with the provided long value.void
update(long[] data)
Updates the filter with the provided long[].void
update(short[] data)
Updates the filter with the provided short[].void
update(String item)
Updates the filter with the provided String.void
update(org.apache.datasketches.memory.Memory mem)
Updates the filter with the data in the provided Memory.static BloomFilter
wrap(org.apache.datasketches.memory.Memory mem)
Wraps the given Memory into this filter class.static BloomFilter
writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
Wraps the given WritableMemory into this filter class.
-
-
-
Method Detail
-
heapify
public static BloomFilter heapify(org.apache.datasketches.memory.Memory mem)
Reads a serialized image of a BloomFilter from the provided Memory- Parameters:
mem
- Memory containing a previously serialized BloomFilter- Returns:
- a BloomFilter object
-
wrap
public static BloomFilter wrap(org.apache.datasketches.memory.Memory mem)
Wraps the given Memory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.- Parameters:
mem
- the given Memory object- Returns:
- the wrapping BloomFilter class.
-
writableWrap
public static BloomFilter writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
Wraps the given WritableMemory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.- Parameters:
wmem
- the given WritableMemory object- Returns:
- the wrapping BloomFilter class.
-
reset
public void reset()
Resets the BloomFilter to an empty state
-
isEmpty
public boolean isEmpty()
Checks if the BloomFilter has processed any items- Returns:
- True if the BloomFilter is empty, otherwise False
-
getBitsUsed
public long getBitsUsed()
Returns the number of bits in the BloomFilter that are set to 1.- Returns:
- The number of bits in use in this filter
-
getCapacity
public long getCapacity()
Returns the total number of bits in the BloomFilter.- Returns:
- The total size of the BloomFilter
-
getNumHashes
public short getNumHashes()
Returns the configured number of hash functions for this BloomFilter- Returns:
- The number of hash functions to apply to inputs
-
getSeed
public long getSeed()
Returns the hash seed for this BloomFilter.- Returns:
- The hash seed for this filter
-
hasMemory
public boolean hasMemory()
Returns whether the filter has a backing Memory object- Returns:
- true if backed by Memory, otherwise false
-
isReadOnly
public boolean isReadOnly()
Returns whether the filter is in read-only mode. That is possible only if there is a backing Memory in read-only mode.- Returns:
- true if read-only, otherwise false
-
isDirect
public boolean isDirect()
Returns whether the filter is a direct (off-heap) or on-heap object. That is possible only if there is a backing Memory.- Returns:
- true if using direct memory access, otherwise false
-
getFillPercentage
public double getFillPercentage()
Returns the percentage of all bits in the BloomFilter set to 1.- Returns:
- the percentage of bits in the filter set to 1
-
update
public void update(long item)
Updates the filter with the provided long value.- Parameters:
item
- an item with which to update the filter
-
update
public void update(double item)
Updates the filter with the provided double value. The value is canonicalized (NaN and infinities) prior to updating.- Parameters:
item
- an item with which to update the filter
-
update
public void update(String item)
Updates the filter with the provided String. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
update(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item
- an item with which to update the filter
-
update
public void update(byte[] data)
Updates the filter with the provided byte[].- Parameters:
data
- an array with which to update the filter
-
update
public void update(char[] data)
Updates the filter with the provided char[].- Parameters:
data
- an array with which to update the filter
-
update
public void update(short[] data)
Updates the filter with the provided short[].- Parameters:
data
- an array with which to update the filter
-
update
public void update(int[] data)
Updates the filter with the provided int[].- Parameters:
data
- an array with which to update the filter
-
update
public void update(long[] data)
Updates the filter with the provided long[].- Parameters:
data
- an array with which to update the filter
-
update
public void update(org.apache.datasketches.memory.Memory mem)
Updates the filter with the data in the provided Memory.- Parameters:
mem
- a Memory object with which to update the filter
-
queryAndUpdate
public boolean queryAndUpdate(long item)
Updates the filter with the provided long and returns the result from quering that value prior to the update.- Parameters:
item
- an item with which to update the filter- Returns:
- The query result prior to applying the update
-
queryAndUpdate
public boolean queryAndUpdate(double item)
Updates the filter with the provided double and returns the result from quering that value prior to the update. The double is canonicalized (NaN and +/- infinity) in the call.- Parameters:
item
- an item with which to update the filter- Returns:
- The query result prior to applying the update
-
queryAndUpdate
public boolean queryAndUpdate(String item)
Updates the filter with the provided String and returns the result from quering that value prior to the update. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
queryAndUpdate(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item
- an item with which to update the filter- Returns:
- The query result prior to applying the update, or false if item is null
-
queryAndUpdate
public boolean queryAndUpdate(byte[] data)
Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.- Parameters:
data
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(char[] data)
Updates the filter with the provided char[] and returns the result from quering that array prior to the update.- Parameters:
data
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(short[] data)
Updates the filter with the provided short[] and returns the result from quering that array prior to the update.- Parameters:
data
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(int[] data)
Updates the filter with the provided int[] and returns the result from quering that array prior to the update.- Parameters:
data
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(long[] data)
Updates the filter with the provided long[] and returns the result from quering that array prior to the update.- Parameters:
data
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(org.apache.datasketches.memory.Memory mem)
Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.- Parameters:
mem
- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if mem is null
-
query
public boolean query(long item)
Queries the filter with the provided long and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
item
- an item with which to query the filter- Returns:
- The result of querying the filter with the given item
-
query
public boolean query(double item)
Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. Double values are canonicalized (NaN and +/- infinity) before querying.- Parameters:
item
- an item with which to query the filter- Returns:
- The result of querying the filter with the given item
-
query
public boolean query(String item)
Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
update(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item
- an item with which to query the filter- Returns:
- The result of querying the filter with the given item, or false if item is null
-
query
public boolean query(byte[] data)
Queries the filter with the provided byte[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data
- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(char[] data)
Queries the filter with the provided char[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data
- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(short[] data)
Queries the filter with the provided short[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data
- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(int[] data)
Queries the filter with the provided int[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data
- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(long[] data)
Queries the filter with the provided long[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data
- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(org.apache.datasketches.memory.Memory mem)
Queries the filter with the provided Memory and returns whether the data might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
mem
- a Memory array with which to query the filter- Returns:
- The result of querying the filter with the given Memory, or false if data is null
-
union
public void union(BloomFilter other)
Unions two BloomFilters by applying a logical OR. The result will recognized any values seen by either filter (as well as false positives).- Parameters:
other
- A BloomFilter to union with this one
-
intersect
public void intersect(BloomFilter other)
Intersects two BloomFilters by applying a logical AND. The result will recognize only values seen by both filters (as well as false positives).- Parameters:
other
- A BloomFilter to union with this one
-
invert
public void invert()
Inverts all the bits of the BloomFilter. Approximately inverts the notion of set-membership.
-
isCompatible
public boolean isCompatible(BloomFilter other)
Helps identify if two BloomFilters may be unioned or intersected.- Parameters:
other
- A BloomFilter to check for compatibility with this one- Returns:
- True if the filters are compatible, otherwise false
-
getSerializedSizeBytes
public long getSerializedSizeBytes()
Returns the length of this BloomFilter when serialized, in bytes- Returns:
- The length of this BloomFilter when serialized, in bytes
-
getSerializedSize
public static long getSerializedSize(long numBits)
Returns the serialized length of a non-empty BloomFilter of the given size, in bytes- Parameters:
numBits
- The number of bits of to use for size computation- Returns:
- The serialized length of a non-empty BloomFilter of the given size, in bytes
-
toByteArray
public byte[] toByteArray()
Serializes the current BloomFilter to an array of bytes.Note: Method throws if the serialized size exceeds
Integer.MAX_VALUE
.- Returns:
- A serialized image of the current BloomFilter as byte[]
-
toLongArray
public long[] toLongArray()
Serializes the current BloomFilter to an array of longs. UnliketoByteArray()
, this method can handle any size filter.- Returns:
- A serialized image of the current BloomFilter as long[]
-
-