datasketches-cpp
Implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per retained item.
#include <kll_sketch.hpp>
Public Types | |
using | quantile_return_type = typename quantiles_sorted_view< T, C, A >::quantile_return_type |
Quantile return type. | |
Public Member Functions | |
kll_sketch (uint16_t k=kll_constants::DEFAULT_K, const C &comparator=C(), const A &allocator=A()) | |
Constructor. | |
kll_sketch (const kll_sketch &other) | |
Copy constructor. | |
kll_sketch (kll_sketch &&other) noexcept | |
Move constructor. | |
kll_sketch & | operator= (const kll_sketch &other) |
Copy assignment. | |
kll_sketch & | operator= (kll_sketch &&other) |
Move assignment. | |
template<typename FwdT > | |
void | update (FwdT &&item) |
Updates this sketch with the given data item. | |
template<typename FwdSk > | |
void | merge (FwdSk &&other) |
Merges another sketch into this one. | |
bool | is_empty () const |
Returns true if this sketch is empty. | |
uint16_t | get_k () const |
Returns configured parameter k. | |
uint64_t | get_n () const |
Returns the length of the input stream. | |
uint32_t | get_num_retained () const |
Returns the number of retained items (samples) in the sketch. | |
bool | is_estimation_mode () const |
Returns true if this sketch is in estimation mode. | |
T | get_min_item () const |
Returns the min item of the stream. | |
T | get_max_item () const |
Returns the max item of the stream. | |
C | get_comparator () const |
Returns an instance of the comparator for this sketch. | |
A | get_allocator () const |
Returns an instance of the allocator for this sketch. | |
quantile_return_type | get_quantile (double rank, bool inclusive=true) const |
Returns an item from the sketch that is the best approximation to an item from the original stream with the given rank. | |
double | get_rank (const T &item, bool inclusive=true) const |
Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive. | |
vector_double | get_PMF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items). | |
vector_double | get_CDF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items). | |
double | get_normalized_rank_error (bool pmf) const |
Gets the approximate rank error of this sketch normalized as a fraction between zero and one. | |
template<typename TT = T, typename SerDe = serde<T>, typename std::enable_if< std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. | |
template<typename TT = T, typename SerDe = serde<T>, typename std::enable_if<!std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. | |
template<typename SerDe = serde<T>> | |
void | serialize (std::ostream &os, const SerDe &sd=SerDe()) const |
This method serializes the sketch into a given stream in a binary form. | |
template<typename SerDe = serde<T>> | |
vector_bytes | serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const |
This method serializes the sketch as a vector of bytes. | |
string< A > | to_string (bool print_levels=false, bool print_items=false) const |
Prints a summary of the sketch. | |
const_iterator | begin () const |
Iterator pointing to the first item in the sketch. | |
const_iterator | end () const |
Iterator pointing to the past-the-end item in the sketch. | |
quantiles_sorted_view< T, C, A > | get_sorted_view () const |
Gets the sorted view of this sketch. | |
Static Public Member Functions | |
template<typename TT = T, typename std::enable_if< std::is_arithmetic< TT >::value, int >::type = 0> | |
static size_t | get_max_serialized_size_bytes (uint16_t k, uint64_t n) |
Returns upper bound on the serialized size of a sketch given a parameter k and stream length. | |
template<typename TT = T, typename std::enable_if<!std::is_arithmetic< TT >::value, int >::type = 0> | |
static size_t | get_max_serialized_size_bytes (uint16_t k, uint64_t n, size_t max_item_size_bytes) |
Returns upper bound on the serialized size of a sketch given a parameter k and stream length. | |
template<typename SerDe = serde<T>> | |
static kll_sketch | deserialize (std::istream &is, const SerDe &sd=SerDe(), const C &comparator=C(), const A &allocator=A()) |
This method deserializes a sketch from a given stream. | |
template<typename SerDe = serde<T>> | |
static kll_sketch | deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const C &comparator=C(), const A &allocator=A()) |
This method deserializes a sketch from a given array of bytes. | |
Implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per retained item.
See Optimal Quantile Approximation in Streams.
This is a stochastic streaming sketch that enables near real-time analysis of the approximate distribution of items from a very large stream in a single pass, requiring only that the items are comparable. The analysis is obtained using the get_quantile() function or the inverse functions get_rank(), get_PMF() (Probability Mass Function), and get_CDF() (Cumulative Distribution Function).
As of May 2020, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = float (32-bit single precision values).
Given an input stream of N items, the natural rank of any specific item is defined as its index (1 to N) in inclusive mode or (0 to N-1) in exclusive mode in the hypothetical sorted stream of all N input items.
The normalized rank (rank) of any specific item is defined as its natural rank divided by N. Thus, the normalized rank is between zero and one. In the documentation for this sketch, natural rank is never used, so any reference to just rank should be interpreted to mean normalized rank.
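The two rank modes can be illustrated with an exact computation over a fully sorted stream. This is only a reference sketch of the definition, not how the sketch itself computes ranks; the helper name exact_normalized_rank and the use of int items are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): exact normalized rank of
// `item` in an already sorted stream.
// inclusive == true  : fraction of items that are <= item
// inclusive == false : fraction of items that are <  item
inline double exact_normalized_rank(const std::vector<int>& sorted, int item,
                                    bool inclusive = true) {
  assert(!sorted.empty());
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  const auto pos = inclusive
      ? std::upper_bound(sorted.begin(), sorted.end(), item)
      : std::lower_bound(sorted.begin(), sorted.end(), item);
  // Natural rank divided by the stream length N.
  return static_cast<double>(pos - sorted.begin()) /
         static_cast<double>(sorted.size());
}
```

For the sorted stream {10, 20, 20, 30, 40}, the item 20 has an inclusive rank of 3/5 and an exclusive rank of 1/5.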
This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.
The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding items.
The relationship between the normalized rank and the corresponding items can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding items on the other axis. If the y-axis is specified as the item-axis and the x-axis as the normalized rank, then y = get_quantile(x) is a monotonically increasing function.
The function get_quantile(rank) translates ranks into corresponding quantiles. The functions get_rank(item), get_CDF(...) (Cumulative Distribution Function), and get_PMF(...) (Probability Mass Function) perform the opposite operation and translate items into ranks.
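To make the rank-to-quantile direction concrete, here is an exact counterpart computed on fully sorted data. It assumes the inclusive convention "smallest item whose inclusive normalized rank is at least the requested rank"; that convention, the name exact_quantile, and the int item type are assumptions for illustration and do not describe the sketch's internal algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): exact quantile of a sorted
// stream under the assumed inclusive convention.
inline int exact_quantile(const std::vector<int>& sorted, double rank) {
  assert(!sorted.empty());
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  assert(rank >= 0.0 && rank <= 1.0);
  const double n = static_cast<double>(sorted.size());
  // ceil(rank * N) is the 1-based natural rank of the answer.
  const std::size_t natural = static_cast<std::size_t>(std::ceil(rank * n));
  // Rank 0 maps to the first item; clamp to guard against rounding.
  const std::size_t index = natural == 0 ? 0 : natural - 1;
  return sorted[std::min(index, sorted.size() - 1)];
}
```

Because the index never decreases as the rank grows, the result is monotonically non-decreasing in the rank, which matches the monotonic plot described above.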
The get_PMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries. The mass of each PMF "bin" is obtained by subtracting the ranks at its lower and upper edges, so it carries "double-sided" error: the errors from the two edges can sometimes add.
The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.
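The quoted figures can be reproduced with an empirical power-law fit. The constants below are recalled from the library's implementation and should be treated as an assumption, not as documented behavior; the authoritative value is whatever get_normalized_rank_error() returns. The function name approx_normalized_rank_error is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for get_normalized_rank_error(pmf).
// ASSUMPTION: the fit constants are recalled from the library source and are
// not stated anywhere on this page.
inline double approx_normalized_rank_error(std::uint16_t k, bool pmf) {
  assert(k > 0);
  const double kd = static_cast<double>(k);
  return pmf ? 2.446 / std::pow(kd, 0.9433)   // "double-sided" (PMF) error
             : 2.296 / std::pow(kd, 0.9723);  // "single-sided" error
}
```

With k = 200 this gives roughly 0.0133 for the single-sided case and roughly 0.0165 for the PMF case, consistent with the percentages above.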
A get_quantile(rank) query has the following guarantees:
A get_rank(item) query has the following guarantees:
A get_PMF() query has the following guarantees:
A get_CDF(...) query has the following guarantees:
From the above, it might seem like we could make some estimates to bound the item returned from a call to get_quantile(). The sketch, however, does not let us derive error bounds or confidences around items. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.
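A minimal sketch of the bracketing idea: query the quantile at rank - eps and at rank + eps, clamped to [0, 1]. The helper bracket_quantile and its callable parameter are hypothetical; with a real sketch, the callable would wrap get_quantile() and eps would come from get_normalized_rank_error(false). The returned pair is only an approximate bracket and carries no confidence statement of its own.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper (not part of the library).
// T          : item type returned by the quantile query
// QuantileFn : any callable taking a normalized rank and returning an item
template <typename T, typename QuantileFn>
std::pair<T, T> bracket_quantile(QuantileFn&& quantile_of, double rank,
                                 double eps) {
  assert(rank >= 0.0 && rank <= 1.0);
  assert(eps >= 0.0);
  // Widen the rank by eps on each side, staying inside [0, 1].
  const double lo = std::max(0.0, rank - eps);
  const double hi = std::min(1.0, rank + eps);
  return std::pair<T, T>(quantile_of(lo), quantile_of(hi));
}
```

Passing a callable instead of a sketch keeps the helper independent of the item type and of the sketch class. For skewed or heavy-tailed distributions the two returned items can be far apart.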
Authors: Kevin Lang, Alexander Saydakov, Lee Rhodes
using quantile_return_type = typename quantiles_sorted_view<T, C, A>::quantile_return_type |
Quantile return type.
Quantiles are returned by value for arithmetic types and by const reference for all other types.
explicit kll_sketch (uint16_t k=kll_constants::DEFAULT_K, const C &comparator=C(), const A &allocator=A())
Constructor.
k | affects the size of the sketch and its estimation error |
comparator | strict weak ordering function (see C++ named requirements: Compare) |
allocator | used by this sketch to allocate memory |
kll_sketch | ( | const kll_sketch< T, C, A > & | other | ) |
Copy constructor.
other | sketch to be copied |
kll_sketch (kll_sketch &&other) noexcept
Move constructor.
other | sketch to be moved |
kll_sketch< T, C, A > & operator= | ( | const kll_sketch< T, C, A > & | other | ) |
Copy assignment.
other | sketch to be copied |
kll_sketch< T, C, A > & operator= | ( | kll_sketch< T, C, A > && | other | ) |
Move assignment.
other | sketch to be moved |
void update | ( | FwdT && | item | ) |
Updates this sketch with the given data item.
item | from a stream of items |
void merge | ( | FwdSk && | other | ) |
Merges another sketch into this one.
other | sketch to merge into this one |
bool is_empty |
Returns true if this sketch is empty.
uint16_t get_k |
Returns configured parameter k.
uint64_t get_n |
Returns the length of the input stream.
uint32_t get_num_retained |
Returns the number of retained items (samples) in the sketch.
bool is_estimation_mode |
Returns true if this sketch is in estimation mode.
T get_min_item |
Returns the min item of the stream.
If the sketch is empty this throws std::runtime_error.
T get_max_item |
Returns the max item of the stream.
If the sketch is empty this throws std::runtime_error.
C get_comparator |
Returns an instance of the comparator for this sketch.
A get_allocator |
Returns an instance of the allocator for this sketch.
auto get_quantile | ( | double | rank, |
bool | inclusive = true |
||
) | const |
Returns an item from the sketch that is the best approximation to an item from the original stream with the given rank.
If the sketch is empty this throws std::runtime_error.
rank | of an item in the hypothetical sorted stream. |
inclusive | if true, the given rank is considered inclusive (includes weight of an item) |
double get_rank | ( | const T & | item, |
bool | inclusive = true |
||
) | const |
Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive.
The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.
If the sketch is empty this throws std::runtime_error.
item | to be ranked. |
inclusive | if true, the weight of the given item is included in the rank. Otherwise, the rank equals the sum of the weights of all items that are less than the given item according to the comparator C. |
auto get_PMF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items).
The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(true) function.
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals (bins). |
size | the number of split points in the array |
inclusive | if true, the rank of an item includes its own weight, so if the sketch contains items equal to a split point, those items are counted in the PMF interval to the left of that split point. Otherwise, they are counted in the interval to the right of the split point. |
auto get_CDF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items).
The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals. |
size | the number of split points in the array |
inclusive | if true, the rank of an item includes its own weight, so if the sketch contains items equal to a split point, those items are counted in the CDF interval to the left of that split point. Otherwise, they are counted in the interval to the right of the split point. |
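The relationship between split points, the CDF, and the PMF can be shown with an exact computation on fully sorted data. This is a reference sketch under stated assumptions, not the sketch's algorithm: the names exact_cdf and exact_pmf and the int item type are hypothetical, and the result is assumed to hold m+1 values for m split points, with the last CDF value equal to 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: exact CDF at each split point, plus a final 1.0.
inline std::vector<double> exact_cdf(const std::vector<int>& sorted,
                                     const std::vector<int>& split_points,
                                     bool inclusive = true) {
  assert(!sorted.empty());
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  assert(std::is_sorted(split_points.begin(), split_points.end()));
  std::vector<double> cdf;
  cdf.reserve(split_points.size() + 1);
  for (int sp : split_points) {
    // Inclusive mode counts items equal to the split point on its left side.
    const auto pos = inclusive
        ? std::upper_bound(sorted.begin(), sorted.end(), sp)
        : std::lower_bound(sorted.begin(), sorted.end(), sp);
    cdf.push_back(static_cast<double>(pos - sorted.begin()) /
                  static_cast<double>(sorted.size()));
  }
  cdf.push_back(1.0);  // everything lies at or below the top of the domain
  return cdf;
}

// Hypothetical helper: the PMF is the difference of adjacent CDF values.
inline std::vector<double> exact_pmf(const std::vector<int>& sorted,
                                     const std::vector<int>& split_points,
                                     bool inclusive = true) {
  std::vector<double> pmf = exact_cdf(sorted, split_points, inclusive);
  for (std::size_t i = pmf.size() - 1; i > 0; --i) pmf[i] -= pmf[i - 1];
  return pmf;
}
```

For the stream {1, 2, 3, 4} and the single split point 2, inclusive mode puts the item 2 in the left bin (masses 0.5 and 0.5), while exclusive mode puts it in the right bin (masses 0.25 and 0.75). The subtraction in exact_pmf is also where the "double-sided" error of get_PMF() comes from when the ranks are approximate.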
double get_normalized_rank_error | ( | bool | pmf | ) | const |
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
pmf | if true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries. |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for fixed-size arithmetic types (integral and floating point).
sd | instance of a SerDe |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for all other types and can be expensive since every item needs to be looked at.
sd | instance of a SerDe |
static size_t get_max_serialized_size_bytes (uint16_t k, uint64_t n)
Returns upper bound on the serialized size of a sketch given a parameter k and stream length.
The resulting size is an overestimate to make sure actual sketches don't exceed it. This method can be used if storage must be allocated beforehand, but the bound is not tight. This method is for arithmetic types (integral and floating point).
k | parameter that controls size of the sketch and accuracy of estimates |
n | stream length |
static size_t get_max_serialized_size_bytes (uint16_t k, uint64_t n, size_t max_item_size_bytes)
Returns upper bound on the serialized size of a sketch given a parameter k and stream length.
The resulting size is an overestimate to make sure actual sketches don't exceed it. This method can be used if allocation of storage is necessary beforehand, but it is not optimal. This method is for all other non-arithmetic types, and it takes a max size of an item as input.
k | parameter that controls size of the sketch and accuracy of estimates |
n | stream length |
max_item_size_bytes | maximum size of an item in bytes |
void serialize | ( | std::ostream & | os, |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch into a given stream in a binary form.
os | output stream |
sd | instance of a SerDe |
vector_bytes serialize | ( | unsigned | header_size_bytes = 0 , |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch as a vector of bytes.
An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in the DataSketches PostgreSQL extension.
header_size_bytes | space to reserve in front of the sketch |
sd | instance of a SerDe |
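static kll_sketch deserialize (std::istream &is, const SerDe &sd=SerDe(), const C &comparator=C(), const A &allocator=A())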
This method deserializes a sketch from a given stream.
is | input stream |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
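static kll_sketch deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const C &comparator=C(), const A &allocator=A())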
This method deserializes a sketch from a given array of bytes.
bytes | pointer to the array of bytes |
size | the size of the array |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
string< A > to_string | ( | bool | print_levels = false , |
bool | print_items = false |
||
) | const |
Prints a summary of the sketch.
print_levels | if true include information about levels |
print_items | if true include sketch data |
kll_sketch< T, C, A >::const_iterator begin |
Iterator pointing to the first item in the sketch.
If the sketch is empty, the returned iterator must not be dereferenced or incremented.
kll_sketch< T, C, A >::const_iterator end |
Iterator pointing to the past-the-end item in the sketch.
The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.
quantiles_sorted_view< T, C, A > get_sorted_view |
Gets the sorted view of this sketch.