Frequent Items

This sketch is useful for tracking approximate frequencies of items (object or string) with optional associated integer counts that are members of a multiset of such items. The true frequency of an item is defined to be the sum of associated counts.

This implementation provides the following capabilities:

  • Estimate the frequency of an item.

  • Return upper and lower bounds of any item, such that the true frequency is always between the upper and lower bounds.

  • Return a global maximum error that holds for all items in the stream.

  • Return an array of frequent items that qualify either a NO_FALSE_POSITIVES or a NO_FALSE_NEGATIVES error type.

  • Merge itself with another sketch object created from this class.

  • Serialize/Deserialize to/from a byte array.

Space Usage

The sketch is initialized with a maximum map size, maxMapSize, that specifies the maximum physical length of the internal hash map of the form (object item, int count). The maximum map size is always a power of 2, defined through the variables lg_max_map_size.

The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.

Excluding external space required for the item objects, the internal memory space usage of this sketch is 18 * mapSize bytes (assuming 8 bytes for each reference), plus a small constant number of additional bytes. The internal memory space usage of this sketch will never exceed 18 * maxMapSize bytes, plus a small constant number of additional bytes.

Maximum Capacity of the Sketch

The LOAD_FACTOR for the hash map is internally set at \(75\%\), which means at any time the map capacity of (item, count) pairs is mapCap = 0.75 * mapSize. The maximum capacity of (item, count) pairs of the sketch is maxMapCap = 0.75 * maxMapSize.

Updating the sketch with ``(item, count)`` pairs

If the item is found in the hash map, the mapped count field (the “counter”) is incremented by the incoming count; otherwise, a new counter (item, count) pair is created. If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median) and removes any non-positive counters.

Accuracy

If fewer than 0.75 * maxMapSize different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.

The logic of the frequent items sketch is such that the stored counts and true counts are never too different. More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).

For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item. Specifically, (UB- LB) W * epsilon, where \(W\) denotes the sum of all item counts, and \(epsilon = 3.5/M\), where \(epsilon = M\) is the maxMapSize.

This is a worst-case guarantee that applies to arbitrary inputs. For inputs typically seen in practice, (UB-LB) is usually much smaller.

Background

This code implements a variant of what is commonly known as the “Misra-Gries algorithm”. Variants of it were discovered and rediscovered and redesigned several times over the years:

  • Finding repeated elements, Misra, Gries, 1982

  • Frequency estimation of Internet packet streams with limited space Demaine, Lopez-Ortiz, Munro, 2002

  • A simple algorithm for finding frequent elements in streams and bags Karp, Shenker, Papadimitriou, 2003

  • Efficient Computation of Frequent and Top-k Elements in Data Streams Metwally, Agrawal, Abbadi, 2006

For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run. However, we have ensured that this probability is extremely small. For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least 1 - 1E-14. If the stream causes 1E9 purges, our proof applies with a probability of at least 1 - 1E-5.

There are two flavors of Frequent Items Sketches, one with generic items (objects) and another specific to strings. The string version is a legacy name from before the library supported generic objects and is retained only for backwards compatibility.

Note

The frequent_items_sketch uses an input object’s __hash__ and __eq__ methods.

Note

Serializing and deserializing the frequent_items_sketch requires the use of a PyObjectSerDe.

class frequent_items_error_type(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
NO_FALSE_POSITIVES : Returns only true positives but may miss some heavy hitters.
NO_FALSE_NEGATIVES : Does not miss any heavy hitters but may return false positives.
class frequent_items_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes, serde: _datasketches.PyObjectSerDe) _datasketches.frequent_items_sketch

Reads a bytes object using the provided serde and returns the corresponding frequent_strings_sketch.

get_epsilon_for_lg_size(lg_max_map_size: int) float

Returns the epsilon value used to compute a priori error for a given log2(max_map_size)

get_apriori_error(lg_max_map_size: int, estimated_total_weight: int) float

Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.

Non-static Methods:

__init__(self, lg_max_k: int) None

Creates an instance of the sketch

Parameters:

lg_max_k (int) – base 2 logarithm of the maximum size of the internal hash map of the sketch. Maximum capacity is 0.75 of this value, which is the maximum number of distinct items the sketch can contain.

property epsilon

The epsilon value used by the sketch to compute error

get_estimate

Returns the estimate of the weight (frequency) of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.

get_frequent_items
get_lower_bound

Returns the guaranteed lower bound weight (frequency) of the given item.

get_serialized_size_bytes

Computes the size needed to serialize the current state of the sketch using the provided serde. This can be expensive since every item needs to be looked at.

get_upper_bound

Returns the guaranteed upper bound weight (frequency) of the given item.

is_empty

Returns True if the sketch is empty, otherwise False

merge

Merges the given sketch into this one

property num_active_items

The number of active items in the sketch

serialize

Serializes the sketch into a bytes object using the provided serde.

to_string

Produces a string summary of the sketch

property total_weight

The sum of the weights (frequencies) in the stream seen so far by the sketch

update

Updates the sketch with the given string and, optionally, a weight

class frequent_strings_sketch(*args, **kwargs)

Static Methods:

deserialize(bytes: bytes) _datasketches.frequent_strings_sketch

Reads a bytes object and returns the corresponding frequent_strings_sketch.

get_epsilon_for_lg_size(lg_max_map_size: int) float

Returns the epsilon value used to compute a priori error for a given log2(max_map_size)

get_apriori_error(lg_max_map_size: int, estimated_total_weight: int) float

Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.

Non-static Methods:

__init__(self, lg_max_k: int) None

Creates an instance of the sketch

Parameters:

lg_max_k (int) – base 2 logarithm of the maximum size of the internal hash map of the sketch. Maximum capacity is 0.75 of this value, which is the maximum number of distinct items the sketch can contain.

property epsilon

The epsilon value used by the sketch to compute error

get_estimate

Returns the estimate of the weight (frequency) of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.

get_frequent_items
get_lower_bound

Returns the guaranteed lower bound weight (frequency) of the given item.

get_serialized_size_bytes

Computes the size needed to serialize the current state of the sketch. This can be expensive since every item needs to be looked at.

get_upper_bound

Returns the guaranteed upper bound weight (frequency) of the given item.

is_empty

Returns True if the sketch is empty, otherwise False

merge

Merges the given sketch into this one

property num_active_items

The number of active items in the sketch

serialize

Serializes the sketch into a bytes object.

to_string

Produces a string summary of the sketch

property total_weight

The sum of the weights (frequencies) in the stream seen so far by the sketch

update

Updates the sketch with the given string and, optionally, a weight