datasketches-cpp
|
C++ implementation of the CountMin sketch data structure of Cormode and Muthukrishnan. More...
#include <count_min.hpp>
Public Member Functions | |
count_min_sketch (uint8_t num_hashes, uint32_t num_buckets, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator()) | |
Creates an instance of the sketch given parameters _num_hashes, _num_buckets and hash seed, seed . | |
uint8_t | get_num_hashes () const |
uint32_t | get_num_buckets () const |
uint64_t | get_seed () const |
double | get_relative_error () const |
W | get_total_weight () const |
W | get_estimate (uint64_t item) const |
Specific get_estimate function for uint64_t type see generic get_estimate function. | |
W | get_estimate (int64_t item) const |
Specific get_estimate function for int64_t type see generic get_estimate function. | |
W | get_estimate (const std::string &item) const |
Specific get_estimate function for std::string type see generic get_estimate function. | |
W | get_estimate (const void *item, size_t size) const |
This is the generic estimate query function for any of the given datatypes. | |
W | get_upper_bound (const void *item, size_t size) const |
Query the sketch for the upper bound of a given item. | |
W | get_upper_bound (int64_t item) const |
Query the sketch for the upper bound of a given item. | |
W | get_upper_bound (uint64_t item) const |
Query the sketch for the upper bound of a given item. | |
W | get_upper_bound (const std::string &item) const |
Query the sketch for the upper bound of a given item. | |
W | get_lower_bound (const void *item, size_t size) const |
Query the sketch for the lower bound of a given item. | |
W | get_lower_bound (int64_t item) const |
Query the sketch for the lower bound of a given item. | |
W | get_lower_bound (uint64_t item) const |
Query the sketch for the lower bound of a given item. | |
W | get_lower_bound (const std::string &item) const |
Query the sketch for the lower bound of a given item. | |
void | update (const void *item, size_t size, W weight) |
Update this sketch with given data of any type. | |
void | update (uint64_t item, W weight=1) |
Update this sketch with a given item. | |
void | update (int64_t item, W weight=1) |
Update this sketch with a given item. | |
void | update (const std::string &item, W weight=1) |
Update this sketch with a given string. | |
void | merge (const count_min_sketch &other_sketch) |
Merges another count_min_sketch into this count_min_sketch. | |
bool | is_empty () const |
Returns true if this sketch is empty. | |
string< Allocator > | to_string () const |
Returns a string describing the sketch. | |
const_iterator | begin () const |
Iterator pointing to the first item in the sketch. | |
const_iterator | end () const |
Iterator pointing to the past-the-end item in the sketch. | |
size_t | get_serialized_size_bytes () const |
Computes size needed to serialize the current state of the sketch. | |
void | serialize (std::ostream &os) const |
This method serializes the sketch into a given stream in a binary form. | |
vector_bytes | serialize (unsigned header_size_bytes=0) const |
This method serializes the sketch as a vector of bytes. | |
allocator_type | get_allocator () const |
Static Public Member Functions | |
static uint32_t | suggest_num_buckets (double relative_error) |
Suggests the number of buckets required to achieve the given relative error. | |
static uint8_t | suggest_num_hashes (double confidence) |
Suggests the number of hash functions required to achieve the given confidence. | |
static count_min_sketch | deserialize (std::istream &is, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given stream. | |
static count_min_sketch | deserialize (const void *bytes, size_t size, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given array of bytes. | |
C++ implementation of the CountMin sketch data structure of Cormode and Muthukrishnan.
[1] - http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf The template type W is the type of the vector that contains the weights of the objects inserted into the sketch, not the type of the input items themselves.
count_min_sketch | ( | uint8_t | num_hashes, |
uint32_t | num_buckets, | ||
uint64_t | seed = DEFAULT_SEED , |
||
const Allocator & | allocator = Allocator() |
||
) |
Creates an instance of the sketch given parameters _num_hashes, _num_buckets and hash seed, seed
.
num_hashes | number of hash functions in the sketch. Equivalently the number of rows in the array |
num_buckets | number of buckets that hash functions map into. Equivalently the number of columns in the array |
seed | for hash function |
allocator | to acquire and release memory |
The items inserted into the sketch can be arbitrary type, so long as they are hashable via murmurhash. Only update and estimate methods are added for uint64_t and string types.
uint8_t get_num_hashes | ( | ) | const |
uint32_t get_num_buckets | ( | ) | const |
uint64_t get_seed | ( | ) | const |
double get_relative_error | ( | ) | const |
W get_total_weight | ( | ) | const |
|
static |
Suggests the number of buckets required to achieve the given relative error.
relative_error | the desired accuracy within which estimates should lie. For example, when relative_error = 0.05, then the returned frequency estimates satisfy the relative_error guarantee that never overestimates the weights but may underestimate the weights by 5% of the total weight in the sketch. |
|
static |
Suggests the number of hash functions required to achieve the given confidence.
confidence | the desired confidence with which estimates should be correct. For example, with 95% confidence, frequency estimates satisfy the relative_error guarantee. |
W get_estimate | ( | uint64_t | item | ) | const |
Specific get_estimate function for uint64_t type see generic get_estimate function.
item | uint64_t type. |
W get_estimate | ( | int64_t | item | ) | const |
Specific get_estimate function for int64_t type see generic get_estimate function.
item | int64_t type. |
W get_estimate | ( | const std::string & | item | ) | const |
Specific get_estimate function for std::string type see generic get_estimate function.
item | std::string type |
W get_estimate | ( | const void * | item, |
size_t | size | ||
) | const |
This is the generic estimate query function for any of the given datatypes.
Query the sketch for the estimate of a given item.
item | pointer to the data item to be query from the sketch. |
size | size of the item in bytes |
W get_upper_bound | ( | const void * | item, |
size_t | size | ||
) | const |
Query the sketch for the upper bound of a given item.
item | to query |
size | of the item in bytes |
W get_upper_bound | ( | int64_t | item | ) | const |
Query the sketch for the upper bound of a given item.
item | to query |
W get_upper_bound | ( | uint64_t | item | ) | const |
Query the sketch for the upper bound of a given item.
item | to query |
W get_upper_bound | ( | const std::string & | item | ) | const |
Query the sketch for the upper bound of a given item.
item | to query |
W get_lower_bound | ( | const void * | item, |
size_t | size | ||
) | const |
Query the sketch for the lower bound of a given item.
item | to query |
size | of the item in bytes |
W get_lower_bound | ( | int64_t | item | ) | const |
Query the sketch for the lower bound of a given item.
item | to query |
W get_lower_bound | ( | uint64_t | item | ) | const |
Query the sketch for the lower bound of a given item.
item | to query |
W get_lower_bound | ( | const std::string & | item | ) | const |
Query the sketch for the lower bound of a given item.
item | to query |
void update | ( | const void * | item, |
size_t | size, | ||
W | weight | ||
) |
Update this sketch with given data of any type.
This is a "universal" update that covers all cases, but may produce different hashes compared to specialized update methods.
item | pointer to the data item to be inserted into the sketch. |
size | of the data in bytes |
weight | arithmetic type |
void update | ( | uint64_t | item, |
W | weight = 1 |
||
) |
Update this sketch with a given item.
item | to update the sketch with |
weight | arithmetic type |
void update | ( | int64_t | item, |
W | weight = 1 |
||
) |
Update this sketch with a given item.
item | to update the sketch with |
weight | arithmetic type |
void update | ( | const std::string & | item, |
W | weight = 1 |
||
) |
Update this sketch with a given string.
item | string to update the sketch with |
weight | arithmetic type |
void merge | ( | const count_min_sketch< W, Allocator > & | other_sketch | ) |
Merges another count_min_sketch into this count_min_sketch.
other_sketch |
bool is_empty | ( | ) | const |
Returns true if this sketch is empty.
A Count Min Sketch is defined to be empty iff weight == 0 This can only ever happen if all items inserted to the sketch have weights that cancel each other out.
string< A > to_string | ( | ) | const |
Returns a string describing the sketch.
count_min_sketch< W, A >::const_iterator begin | ( | ) | const |
Iterator pointing to the first item in the sketch.
If the sketch is empty, the returned iterator must not be dereferenced or incremented.
count_min_sketch< W, A >::const_iterator end | ( | ) | const |
Iterator pointing to the past-the-end item in the sketch.
The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.
size_t get_serialized_size_bytes | ( | ) | const |
Computes size needed to serialize the current state of the sketch.
void serialize | ( | std::ostream & | os | ) | const |
This method serializes the sketch into a given stream in a binary form.
os | output stream |
auto serialize | ( | unsigned | header_size_bytes = 0 | ) | const |
This method serializes the sketch as a vector of bytes.
An optional header can be reserved in front of the sketch. It is an uninitialized space of a given size. This header is used in Datasketches PostgreSQL extension.
header_size_bytes | space to reserve in front of the sketch |
|
static |
This method deserializes a sketch from a given stream.
is | input stream |
seed | the seed for the hash function that was used to create the sketch |
allocator | instance of an Allocator |
|
static |
This method deserializes a sketch from a given array of bytes.
bytes | pointer to the array of bytes |
size | the size of the array |
seed | the seed for the hash function that was used to create the sketch |
allocator | instance of an Allocator |
allocator_type get_allocator | ( | ) | const |