datasketches-cpp
Loading...
Searching...
No Matches
Public Member Functions | Static Public Member Functions | List of all members
count_min_sketch< W, Allocator > Class Template Reference

C++ implementation of the CountMin sketch data structure of Cormode and Muthukrishnan. More...

#include <count_min.hpp>

Public Member Functions

 count_min_sketch (uint8_t num_hashes, uint32_t num_buckets, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator())
 Creates an instance of the sketch given parameters _num_hashes, _num_buckets and hash seed, seed.
 
uint8_t get_num_hashes () const
 
uint32_t get_num_buckets () const
 
uint64_t get_seed () const
 
double get_relative_error () const
 
get_total_weight () const
 
get_estimate (uint64_t item) const
 Specific get_estimate function for uint64_t type see generic get_estimate function.
 
get_estimate (int64_t item) const
 Specific get_estimate function for int64_t type see generic get_estimate function.
 
get_estimate (const std::string &item) const
 Specific get_estimate function for std::string type see generic get_estimate function.
 
get_estimate (const void *item, size_t size) const
 This is the generic estimate query function for any of the given datatypes.
 
get_upper_bound (const void *item, size_t size) const
 Query the sketch for the upper bound of a given item.
 
get_upper_bound (int64_t item) const
 Query the sketch for the upper bound of a given item.
 
get_upper_bound (uint64_t item) const
 Query the sketch for the upper bound of a given item.
 
get_upper_bound (const std::string &item) const
 Query the sketch for the upper bound of a given item.
 
get_lower_bound (const void *item, size_t size) const
 Query the sketch for the lower bound of a given item.
 
get_lower_bound (int64_t item) const
 Query the sketch for the lower bound of a given item.
 
get_lower_bound (uint64_t item) const
 Query the sketch for the lower bound of a given item.
 
get_lower_bound (const std::string &item) const
 Query the sketch for the lower bound of a given item.
 
void update (const void *item, size_t size, W weight)
 Update this sketch with given data of any type.
 
void update (uint64_t item, W weight=1)
 Update this sketch with a given item.
 
void update (int64_t item, W weight=1)
 Update this sketch with a given item.
 
void update (const std::string &item, W weight=1)
 Update this sketch with a given string.
 
void merge (const count_min_sketch &other_sketch)
 Merges another count_min_sketch into this count_min_sketch.
 
bool is_empty () const
 Returns true if this sketch is empty.
 
string< Allocator > to_string () const
 Returns a string describing the sketch.
 
const_iterator begin () const
 Iterator pointing to the first item in the sketch.
 
const_iterator end () const
 Iterator pointing to the past-the-end item in the sketch.
 
size_t get_serialized_size_bytes () const
 Computes size needed to serialize the current state of the sketch.
 
void serialize (std::ostream &os) const
 This method serializes the sketch into a given stream in a binary form.
 
vector_bytes serialize (unsigned header_size_bytes=0) const
 This method serializes the sketch as a vector of bytes.
 
allocator_type get_allocator () const
 

Static Public Member Functions

static uint32_t suggest_num_buckets (double relative_error)
 Suggests the number of buckets required to achieve the given relative error.
 
static uint8_t suggest_num_hashes (double confidence)
 Suggests the number of hash functions required to achieve the given confidence.
 
static count_min_sketch deserialize (std::istream &is, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator())
 This method deserializes a sketch from a given stream.
 
static count_min_sketch deserialize (const void *bytes, size_t size, uint64_t seed=DEFAULT_SEED, const Allocator &allocator=Allocator())
 This method deserializes a sketch from a given array of bytes.
 

Detailed Description

template<typename W, typename Allocator = std::allocator<W>>
class datasketches::count_min_sketch< W, Allocator >

C++ implementation of the CountMin sketch data structure of Cormode and Muthukrishnan.

[1] - http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf The template type W is the type of the vector that contains the weights of the objects inserted into the sketch, not the type of the input items themselves.

Author
Charlie Dickens

Constructor & Destructor Documentation

◆ count_min_sketch()

template<typename W , typename Allocator = std::allocator<W>>
count_min_sketch ( uint8_t  num_hashes,
uint32_t  num_buckets,
uint64_t  seed = DEFAULT_SEED,
const Allocator &  allocator = Allocator() 
)

Creates an instance of the sketch given parameters _num_hashes, _num_buckets and hash seed, seed.

Parameters
num_hashesnumber of hash functions in the sketch. Equivalently the number of rows in the array
num_bucketsnumber of buckets that hash functions map into. Equivalently the number of columns in the array
seedfor hash function
allocatorto acquire and release memory

The items inserted into the sketch can be arbitrary type, so long as they are hashable via murmurhash. Only update and estimate methods are added for uint64_t and string types.

Member Function Documentation

◆ get_num_hashes()

template<typename W , typename A >
uint8_t get_num_hashes ( ) const
Returns
configured _num_hashes of this sketch

◆ get_num_buckets()

template<typename W , typename A >
uint32_t get_num_buckets ( ) const
Returns
configured _num_buckets of this sketch

◆ get_seed()

template<typename W , typename A >
uint64_t get_seed ( ) const
Returns
configured seed of this sketch

◆ get_relative_error()

template<typename W , typename A >
double get_relative_error ( ) const
Returns
epsilon The maximum permissible error for any frequency estimate query. epsilon = ceil(e / _num_buckets)

◆ get_total_weight()

template<typename W , typename A >
W get_total_weight ( ) const
Returns
_total_weight The total weight currently inserted into the stream.

◆ suggest_num_buckets()

template<typename W , typename A >
uint32_t suggest_num_buckets ( double  relative_error)
static

Suggests the number of buckets required to achieve the given relative error.

Parameters
relative_errorthe desired accuracy within which estimates should lie. For example, when relative_error = 0.05, then the returned frequency estimates satisfy the relative_error guarantee that never overestimates the weights but may underestimate the weights by 5% of the total weight in the sketch.
Returns
the number of hash buckets at every level of the sketch required in order to obtain the specified relative error. [1] - Section 3 `‘Data Structure’', page 6.

◆ suggest_num_hashes()

template<typename W , typename A >
uint8_t suggest_num_hashes ( double  confidence)
static

Suggests the number of hash functions required to achieve the given confidence.

Parameters
confidencethe desired confidence with which estimates should be correct. For example, with 95% confidence, frequency estimates satisfy the relative_error guarantee.
Returns
the number of hash functions that are required in order to achieve the specified confidence of the sketch. confidence = 1 - delta, with delta denoting the sketch failure probability in the literature. [1] - Section 3 `‘Data Structure’', page 6.

◆ get_estimate() [1/4]

template<typename W , typename A >
W get_estimate ( uint64_t  item) const

Specific get_estimate function for uint64_t type see generic get_estimate function.

Parameters
itemuint64_t type.
Returns
an estimate of the item's frequency.

◆ get_estimate() [2/4]

template<typename W , typename A >
W get_estimate ( int64_t  item) const

Specific get_estimate function for int64_t type see generic get_estimate function.

Parameters
itemint64_t type.
Returns
an estimate of the item's frequency.

◆ get_estimate() [3/4]

template<typename W , typename A >
W get_estimate ( const std::string &  item) const

Specific get_estimate function for std::string type see generic get_estimate function.

Parameters
itemstd::string type
Returns
an estimate of the item's frequency.

◆ get_estimate() [4/4]

template<typename W , typename A >
W get_estimate ( const void *  item,
size_t  size 
) const

This is the generic estimate query function for any of the given datatypes.

Query the sketch for the estimate of a given item.

Parameters
itempointer to the data item to be query from the sketch.
sizesize of the item in bytes
Returns
the estimated frequency of the item denoted f_est satisfying f_true - relative_error*_total_weight <= f_est <= f_true

◆ get_upper_bound() [1/4]

template<typename W , typename A >
W get_upper_bound ( const void *  item,
size_t  size 
) const

Query the sketch for the upper bound of a given item.

Parameters
itemto query
sizeof the item in bytes
Returns
the upper bound on the true frequency of the item f_true <= f_est + relative_error*_total_weight

◆ get_upper_bound() [2/4]

template<typename W , typename A >
W get_upper_bound ( int64_t  item) const

Query the sketch for the upper bound of a given item.

Parameters
itemto query
Returns
the upper bound on the true frequency of the item f_true <= f_est + relative_error*_total_weight

◆ get_upper_bound() [3/4]

template<typename W , typename A >
W get_upper_bound ( uint64_t  item) const

Query the sketch for the upper bound of a given item.

Parameters
itemto query
Returns
the upper bound on the true frequency of the item f_true <= f_est + relative_error*_total_weight

◆ get_upper_bound() [4/4]

template<typename W , typename A >
W get_upper_bound ( const std::string &  item) const

Query the sketch for the upper bound of a given item.

Parameters
itemto query
Returns
the upper bound on the true frequency of the item f_true <= f_est + relative_error*_total_weight

◆ get_lower_bound() [1/4]

template<typename W , typename A >
W get_lower_bound ( const void *  item,
size_t  size 
) const

Query the sketch for the lower bound of a given item.

Parameters
itemto query
sizeof the item in bytes
Returns
the lower bound for the query result, f_est, on the true frequency, f_est of the item f_true - relative_error*_total_weight <= f_est

◆ get_lower_bound() [2/4]

template<typename W , typename A >
W get_lower_bound ( int64_t  item) const

Query the sketch for the lower bound of a given item.

Parameters
itemto query
Returns
the lower bound for the query result, f_est, on the true frequency, f_est of the item f_true - relative_error*_total_weight <= f_est

◆ get_lower_bound() [3/4]

template<typename W , typename A >
W get_lower_bound ( uint64_t  item) const

Query the sketch for the lower bound of a given item.

Parameters
itemto query
Returns
the lower bound for the query result, f_est, on the true frequency, f_est of the item f_true - relative_error*_total_weight <= f_est

◆ get_lower_bound() [4/4]

template<typename W , typename A >
W get_lower_bound ( const std::string &  item) const

Query the sketch for the lower bound of a given item.

Parameters
itemto query
Returns
the lower bound for the query result, f_est, on the true frequency, f_est of the item f_true - relative_error*_total_weight <= f_est

◆ update() [1/4]

template<typename W , typename A >
void update ( const void *  item,
size_t  size,
weight 
)

Update this sketch with given data of any type.

This is a "universal" update that covers all cases, but may produce different hashes compared to specialized update methods.

Parameters
itempointer to the data item to be inserted into the sketch.
sizeof the data in bytes
weightarithmetic type

◆ update() [2/4]

template<typename W , typename A >
void update ( uint64_t  item,
weight = 1 
)

Update this sketch with a given item.

Parameters
itemto update the sketch with
weightarithmetic type

◆ update() [3/4]

template<typename W , typename A >
void update ( int64_t  item,
weight = 1 
)

Update this sketch with a given item.

Parameters
itemto update the sketch with
weightarithmetic type

◆ update() [4/4]

template<typename W , typename A >
void update ( const std::string &  item,
weight = 1 
)

Update this sketch with a given string.

Parameters
itemstring to update the sketch with
weightarithmetic type

◆ merge()

template<typename W , typename A >
void merge ( const count_min_sketch< W, Allocator > &  other_sketch)

Merges another count_min_sketch into this count_min_sketch.

Parameters
other_sketch

◆ is_empty()

template<typename W , typename A >
bool is_empty ( ) const

Returns true if this sketch is empty.

A Count Min Sketch is defined to be empty iff weight == 0 This can only ever happen if all items inserted to the sketch have weights that cancel each other out.

Returns
empty flag

◆ to_string()

template<typename W , typename A >
string< A > to_string ( ) const

Returns a string describing the sketch.

Returns
A string with a human-readable description of the sketch

◆ begin()

template<typename W , typename A >
count_min_sketch< W, A >::const_iterator begin ( ) const

Iterator pointing to the first item in the sketch.

If the sketch is empty, the returned iterator must not be dereferenced or incremented.

Returns
iterator pointing to the first item in the sketch

◆ end()

template<typename W , typename A >
count_min_sketch< W, A >::const_iterator end ( ) const

Iterator pointing to the past-the-end item in the sketch.

The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.

Returns
iterator pointing to the past-the-end item in the sketch

◆ get_serialized_size_bytes()

template<typename W , typename A >
size_t get_serialized_size_bytes ( ) const

Computes size needed to serialize the current state of the sketch.

Returns
size in bytes needed to serialize this sketch

◆ serialize() [1/2]

template<typename W , typename A >
void serialize ( std::ostream &  os) const

This method serializes the sketch into a given stream in a binary form.

Parameters
osoutput stream

◆ serialize() [2/2]

template<typename W , typename A >
auto serialize ( unsigned  header_size_bytes = 0) const

This method serializes the sketch as a vector of bytes.

An optional header can be reserved in front of the sketch. It is an uninitialized space of a given size. This header is used in Datasketches PostgreSQL extension.

Parameters
header_size_bytesspace to reserve in front of the sketch

◆ deserialize() [1/2]

template<typename W , typename Allocator = std::allocator<W>>
static count_min_sketch deserialize ( std::istream &  is,
uint64_t  seed = DEFAULT_SEED,
const Allocator &  allocator = Allocator() 
)
static

This method deserializes a sketch from a given stream.

Parameters
isinput stream
seedthe seed for the hash function that was used to create the sketch
allocatorinstance of an Allocator
Returns
an instance of a sketch

◆ deserialize() [2/2]

template<typename W , typename Allocator = std::allocator<W>>
static count_min_sketch deserialize ( const void *  bytes,
size_t  size,
uint64_t  seed = DEFAULT_SEED,
const Allocator &  allocator = Allocator() 
)
static

This method deserializes a sketch from a given array of bytes.

Parameters
bytespointer to the array of bytes
sizethe size of the array
seedthe seed for the hash function that was used to create the sketch
allocatorinstance of an Allocator
Returns
an instance of the sketch

◆ get_allocator()

template<typename W , typename Allocator = std::allocator<W>>
allocator_type get_allocator ( ) const
Returns
allocator

The documentation for this class was generated from the following files: