datasketches-cpp
Classes | Public Member Functions | Static Public Member Functions | List of all members
frequent_items_sketch< T, W, H, E, A > Class Template Reference

Frequent Items sketch. More...

#include <frequent_items_sketch.hpp>

Classes

class  row
 Row in the output from get_frequent_items. More...
 

Public Member Functions

 frequent_items_sketch (uint8_t lg_max_map_size, uint8_t lg_start_map_size=LG_MIN_MAP_SIZE, const E &equal=E(), const A &allocator=A())
 Construct this sketch with parameters lg_max_map_size and lg_start_map_size. More...
 
void update (const T &item, W weight=1)
 Update this sketch with an item and a positive weight (frequency count). More...
 
void update (T &&item, W weight=1)
 Update this sketch with an item and a positive weight (frequency count). More...
 
void merge (const frequent_items_sketch &other)
 This function merges the other sketch into this one. More...
 
void merge (frequent_items_sketch &&other)
 This function merges the other sketch into this one. More...
 
bool is_empty () const
 
uint32_t get_num_active_items () const
 
get_total_weight () const
 Returns the sum of the weights (frequencies) in the stream seen so far by the sketch. More...
 
get_estimate (const T &item) const
 Returns the estimate of the weight (frequency) of the given item. More...
 
get_lower_bound (const T &item) const
 Returns the guaranteed lower bound weight (frequency) of the given item. More...
 
get_upper_bound (const T &item) const
 Returns the guaranteed upper bound weight (frequency) of the given item. More...
 
get_maximum_error () const
 
double get_epsilon () const
 Returns epsilon value of this sketch. More...
 
vector_row get_frequent_items (frequent_items_error_type err_type) const
 Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and using get_maximum_error() as a threshold. More...
 
vector_row get_frequent_items (frequent_items_error_type err_type, W threshold) const
 Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and a threshold. More...
 
template<typename SerDe = serde<T>>
size_t get_serialized_size_bytes (const SerDe &sd=SerDe()) const
 Computes size needed to serialize the current state of the sketch. More...
 
template<typename SerDe = serde<T>>
void serialize (std::ostream &os, const SerDe &sd=SerDe()) const
 This method serializes the sketch into a given stream in a binary form. More...
 
template<typename SerDe = serde<T>>
vector_bytes serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const
 This method serializes the sketch as a vector of bytes. More...
 
string< A > to_string (bool print_items=false) const
 Returns a human readable summary of this sketch. More...
 

Static Public Member Functions

static double get_epsilon (uint8_t lg_max_map_size)
 Returns epsilon used to compute a priori error. More...
 
static double get_apriori_error (uint8_t lg_max_map_size, W estimated_total_weight)
 Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight. More...
 
template<typename SerDe = serde<T>>
static frequent_items_sketch deserialize (std::istream &is, const SerDe &sd=SerDe(), const E &equal=E(), const A &allocator=A())
 This method deserializes a sketch from a given stream. More...
 
template<typename SerDe = serde<T>>
static frequent_items_sketch deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const E &equal=E(), const A &allocator=A())
 This method deserializes a sketch from a given array of bytes. More...
 

Detailed Description

template<typename T, typename W = uint64_t, typename H = std::hash<T>, typename E = std::equal_to<T>, typename A = std::allocator<T>>
class datasketches::frequent_items_sketch< T, W, H, E, A >

Frequent Items sketch.

Based on Java implementation here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/frequencies/ItemsSketch.java

Author
Alexander Saydakov

Constructor & Destructor Documentation

◆ frequent_items_sketch()

frequent_items_sketch ( uint8_t  lg_max_map_size,
uint8_t  lg_start_map_size = LG_MIN_MAP_SIZE,
const E &  equal = E(),
const A &  allocator = A() 
)
explicit

Construct this sketch with parameters lg_max_map_size and lg_start_map_size.

Parameters
lg_max_map_sizeLog2 of the physical size of the internal hash map managed by this sketch. The maximum capacity of this internal hash map is 0.75 times 2^lg_max_map_size. Both the ultimate accuracy and size of this sketch are functions of lg_max_map_size.
lg_start_map_sizeLog2 of the starting physical size of the internal hash map managed by this sketch.
equalinstance of Equality operator
allocatorinstance of an Allocator

Member Function Documentation

◆ update() [1/2]

void update ( const T &  item,
weight = 1 
)

Update this sketch with an item and a positive weight (frequency count).

Parameters
itemfor which the weight should be increased (lvalue)
weightthe amount by which the weight of the item should be increased A count of zero is a no-op, and a negative count will throw an exception.

◆ update() [2/2]

void update ( T &&  item,
weight = 1 
)

Update this sketch with an item and a positive weight (frequency count).

Parameters
itemfor which the weight should be increased (rvalue)
weightthe amount by which the weight of the item should be increased A count of zero is a no-op, and a negative count will throw an exception.

◆ merge() [1/2]

void merge ( const frequent_items_sketch< T, W, H, E, A > &  other)

This function merges the other sketch into this one.

The other sketch may be of a different size.

Parameters
othersketch to be merged into this (lvalue)

◆ merge() [2/2]

void merge ( frequent_items_sketch< T, W, H, E, A > &&  other)

This function merges the other sketch into this one.

The other sketch may be of a different size.

Parameters
othersketch to be merged into this (rvalue)

◆ is_empty()

bool is_empty
Returns
true if this sketch is empty

◆ get_num_active_items()

uint32_t get_num_active_items
Returns
the number of active items in the sketch

◆ get_total_weight()

W get_total_weight

Returns the sum of the weights (frequencies) in the stream seen so far by the sketch.

Returns
the total weight of all items in the stream seen so far by the sketch

◆ get_estimate()

W get_estimate ( const T &  item) const

Returns the estimate of the weight (frequency) of the given item.

Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.

Parameters
itemthe given item
Returns
the estimate of the weight (frequency) of the given item

◆ get_lower_bound()

W get_lower_bound ( const T &  item) const

Returns the guaranteed lower bound weight (frequency) of the given item.

Parameters
itemthe given item.
Returns
the guaranteed lower bound weight of the given item. That is, a number which is guaranteed to be no larger than the real weight.

◆ get_upper_bound()

W get_upper_bound ( const T &  item) const

Returns the guaranteed upper bound weight (frequency) of the given item.

Parameters
itemthe given item
Returns
the guaranteed upper bound weight of the given item. That is, a number which is guaranteed to be no smaller than the real frequency.

◆ get_maximum_error()

W get_maximum_error
Returns
An upper bound on the maximum error of get_estimate(item) for any item. This is equivalent to the maximum distance between the upper bound and the lower bound for any item.

◆ get_epsilon() [1/2]

double get_epsilon

Returns epsilon value of this sketch.

This is just the value 3.5 / max_map_size.

Returns
epsilon used by the sketch to compute error.

◆ get_epsilon() [2/2]

double get_epsilon ( uint8_t  lg_max_map_size)
static

Returns epsilon used to compute a priori error.

This is just the value 3.5 / maxMapSize.

Parameters
lg_max_map_sizethe planned map size to be used when constructing this sketch.
Returns
epsilon used to compute a priori error.

◆ get_apriori_error()

double get_apriori_error ( uint8_t  lg_max_map_size,
estimated_total_weight 
)
static

Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.

Parameters
lg_max_map_sizethe planned map size to be used when constructing this sketch.
estimated_total_weightthe estimated total stream weight.
Returns
the estimated a priori error.

◆ get_frequent_items() [1/2]

auto get_frequent_items ( frequent_items_error_type  err_type) const

Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and using get_maximum_error() as a threshold.

The method first examines all active items in the sketch (items that have a counter).

If error_type = NO_FALSE_NEGATIVES, this will include an item in the result list if get_upper_bound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).

If error_type = NO_FALSE_POSITIVES, this will include an item in the result list if get_lower_bound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives).

Parameters
err_typedetermines whether no false positives or no false negatives are desired.
Returns
an array of frequent items

◆ get_frequent_items() [2/2]

auto get_frequent_items ( frequent_items_error_type  err_type,
threshold 
) const

Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and a threshold.

The method first examines all active items in the sketch (items that have a counter).

If error_type = NO_FALSE_NEGATIVES, this will include an item in the result list if get_upper_bound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).

If error_type = NO_FALSE_POSITIVES, this will include an item in the result list if get_lower_bound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives).

Parameters
err_typedetermines whether no false positives or no false negatives are desired.
thresholdto include items in the result list
Returns
an array of frequent items

◆ get_serialized_size_bytes()

size_t get_serialized_size_bytes ( const SerDe &  sd = SerDe()) const

Computes size needed to serialize the current state of the sketch.

This can be expensive since every item needs to be looked at.

Parameters
sdinstance of a SerDe
Returns
size in bytes needed to serialize this sketch

◆ serialize() [1/2]

void serialize ( std::ostream &  os,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch into a given stream in a binary form.

Parameters
osoutput stream
sdinstance of a SerDe

◆ serialize() [2/2]

vector_bytes serialize ( unsigned  header_size_bytes = 0,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch as a vector of bytes.

An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in Datasketches PostgreSQL extension.

Parameters
header_size_bytesspace to reserve in front of the sketch
sdinstance of a SerDe
Returns
serialized sketch as a vector of bytes

◆ deserialize() [1/2]

static frequent_items_sketch deserialize ( std::istream &  is,
const SerDe &  sd = SerDe(),
const E &  equal = E(),
const A &  allocator = A() 
)
static

This method deserializes a sketch from a given stream.

Parameters
isinput stream
sdinstance of a SerDe
equalinstance of Equality operator
allocatorinstance of an Allocator
Returns
an instance of the sketch

◆ deserialize() [2/2]

static frequent_items_sketch deserialize ( const void *  bytes,
size_t  size,
const SerDe &  sd = SerDe(),
const E &  equal = E(),
const A &  allocator = A() 
)
static

This method deserializes a sketch from a given array of bytes.

Parameters
bytespointer to the array of bytes
sizethe size of the array
sdinstance of a SerDe
equalinstance of Equality operator
allocatorinstance of an Allocator
Returns
an instance of the sketch

◆ to_string()

string< A > to_string ( bool  print_items = false) const

Returns a human readable summary of this sketch.

Parameters
print_itemsif true include the list of items retained by the sketch

The documentation for this class was generated from the following files: