Frequent Items
This sketch is useful for tracking approximate frequencies of items (object
or string
) with optional associated
integer counts that are members of a multiset of such items.
The true frequency of an item is defined to be the sum of associated counts.
This implementation provides the following capabilities:
Estimate the frequency of an item.
Return upper and lower bounds of any item, such that the true frequency is always between the upper and lower bounds.
Return a global maximum error that holds for all items in the stream.
Return an array of frequent items that qualify either a NO_FALSE_POSITIVES or a NO_FALSE_NEGATIVES error type.
Merge itself with another sketch object created from this class.
Serialize/Deserialize to/from a byte array.
Space Usage
The sketch is initialized with a maximum map size, maxMapSize
, that specifies the maximum physical length of the internal hash map of the form (object item, int count)
.
The maximum map size is always a power of 2, defined through the variables lg_max_map_size
.
The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
Excluding external space required for the item objects, the internal memory space usage of this sketch is 18 * mapSize
bytes (assuming 8 bytes for each reference),
plus a small constant number of additional bytes.
The internal memory space usage of this sketch will never exceed 18 * maxMapSize
bytes, plus a small constant number of additional bytes.
Maximum Capacity of the Sketch
The LOAD_FACTOR
for the hash map is internally set at \(75\%\), which means at any time the map capacity of (item, count)
pairs is mapCap = 0.75 * mapSize
.
The maximum capacity of (item, count)
pairs of the sketch is maxMapCap = 0.75 * maxMapSize
.
Updating the sketch with ``(item, count)`` pairs
If the item is found in the hash map, the mapped count field (the “counter”) is incremented by the incoming count; otherwise, a new counter (item, count)
pair is created.
If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
and removes any non-positive counters.
Accuracy
If fewer than 0.75 * maxMapSize
different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
The logic of the frequent items sketch is such that the stored counts and true counts are never too different. More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).
For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
Specifically, (UB- LB) ≤ W * epsilon
, where \(W\) denotes the sum of all item counts, and \(epsilon = 3.5/M\), where \(epsilon = M\) is the maxMapSize.
This is a worst-case guarantee that applies to arbitrary inputs.
For inputs typically seen in practice, (UB-LB)
is usually much smaller.
Background
This code implements a variant of what is commonly known as the “Misra-Gries algorithm”. Variants of it were discovered and rediscovered and redesigned several times over the years:
Finding repeated elements, Misra, Gries, 1982
Frequency estimation of Internet packet streams with limited space Demaine, Lopez-Ortiz, Munro, 2002
A simple algorithm for finding frequent elements in streams and bags Karp, Shenker, Papadimitriou, 2003
Efficient Computation of Frequent and Top-k Elements in Data Streams Metwally, Agrawal, Abbadi, 2006
For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
However, we have ensured that this probability is extremely small.
For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least 1 - 1E-14.
If the stream causes 1E9
purges, our proof applies with a probability of at least 1 - 1E-5
.
There are two flavors of Frequent Items Sketches, one with generic items (objects) and another specific to strings. The string version is a legacy name from before the library supported generic objects and is retained only for backwards compatibility.
Note
The frequent_items_sketch
uses an input object’s __hash__
and __eq__
methods.
Note
Serializing and deserializing the frequent_items_sketch
requires the use of a PyObjectSerDe
.
- class frequent_items_error_type(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
- NO_FALSE_POSITIVES : Returns only true positives but may miss some heavy hitters.
- NO_FALSE_NEGATIVES : Does not miss any heavy hitters but may return false positives.
- class frequent_items_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.frequent_items_sketch
Reads a bytes object using the provided serde and returns the corresponding frequent_strings_sketch.
- get_epsilon_for_lg_size(lg_max_map_size: int) float
Returns the epsilon value used to compute a priori error for a given log2(max_map_size)
- get_apriori_error(lg_max_map_size: int, estimated_total_weight: int) float
Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.
Non-static Methods:
- __init__(self, lg_max_k: int) None
Creates an instance of the sketch
- Parameters:
lg_max_k (int) – base 2 logarithm of the maximum size of the internal hash map of the sketch. Maximum capacity is 0.75 of this value, which is the maximum number of distinct items the sketch can contain.
- property epsilon
The epsilon value used by the sketch to compute error
- get_estimate
Returns the estimate of the weight (frequency) of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.
- get_frequent_items
- get_lower_bound
Returns the guaranteed lower bound weight (frequency) of the given item.
- get_serialized_size_bytes
Computes the size needed to serialize the current state of the sketch using the provided serde. This can be expensive since every item needs to be looked at.
- get_upper_bound
Returns the guaranteed upper bound weight (frequency) of the given item.
- is_empty
Returns True if the sketch is empty, otherwise False
- merge
Merges the given sketch into this one
- property num_active_items
The number of active items in the sketch
- serialize
Serializes the sketch into a bytes object using the provided serde.
- to_string
Produces a string summary of the sketch
- property total_weight
The sum of the weights (frequencies) in the stream seen so far by the sketch
- update
Updates the sketch with the given string and, optionally, a weight
- class frequent_strings_sketch(*args, **kwargs)
Static Methods:
- deserialize(bytes: bytes) _datasketches.frequent_strings_sketch
Reads a bytes object and returns the corresponding frequent_strings_sketch.
- get_epsilon_for_lg_size(lg_max_map_size: int) float
Returns the epsilon value used to compute a priori error for a given log2(max_map_size)
- get_apriori_error(lg_max_map_size: int, estimated_total_weight: int) float
Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.
Non-static Methods:
- __init__(self, lg_max_k: int) None
Creates an instance of the sketch
- Parameters:
lg_max_k (int) – base 2 logarithm of the maximum size of the internal hash map of the sketch. Maximum capacity is 0.75 of this value, which is the maximum number of distinct items the sketch can contain.
- property epsilon
The epsilon value used by the sketch to compute error
- get_estimate
Returns the estimate of the weight (frequency) of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.
- get_frequent_items
- get_lower_bound
Returns the guaranteed lower bound weight (frequency) of the given item.
- get_serialized_size_bytes
Computes the size needed to serialize the current state of the sketch. This can be expensive since every item needs to be looked at.
- get_upper_bound
Returns the guaranteed upper bound weight (frequency) of the given item.
- is_empty
Returns True if the sketch is empty, otherwise False
- merge
Merges the given sketch into this one
- property num_active_items
The number of active items in the sketch
- serialize
Serializes the sketch into a bytes object.
- to_string
Produces a string summary of the sketch
- property total_weight
The sum of the weights (frequencies) in the stream seen so far by the sketch
- update
Updates the sketch with the given string and, optionally, a weight