datasketches-cpp
|
Relative Error Quantiles Sketch. More...
#include <req_sketch.hpp>
Public Types | |
using | quantile_return_type = typename quantiles_sorted_view< T, Comparator, Allocator >::quantile_return_type |
Quantile return type. | |
Public Member Functions | |
req_sketch (uint16_t k, bool hra=true, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) | |
Constructor. | |
req_sketch (const req_sketch &other) | |
Copy constructor. | |
req_sketch (req_sketch &&other) noexcept | |
Move constructor. | |
req_sketch & | operator= (const req_sketch &other) |
Copy assignment. | |
req_sketch & | operator= (req_sketch &&other) |
Move assignment. | |
template<typename TT , typename CC , typename AA > | |
req_sketch (const req_sketch< TT, CC, AA > &other, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) | |
Type converting constructor. | |
uint16_t | get_k () const |
Returns configured parameter K. | |
bool | is_HRA () const |
Returns configured parameter High Rank Accuracy. | |
bool | is_empty () const |
Returns true if this sketch is empty. | |
uint64_t | get_n () const |
Returns the length of the input stream. | |
uint32_t | get_num_retained () const |
Returns the number of retained items in the sketch. | |
bool | is_estimation_mode () const |
Returns true if this sketch is in estimation mode. | |
template<typename FwdT > | |
void | update (FwdT &&item) |
Updates this sketch with the given data item. | |
template<typename FwdSk > | |
void | merge (FwdSk &&other) |
Merges another sketch into this one. | |
const T & | get_min_item () const |
Returns the min item of the stream. | |
const T & | get_max_item () const |
Returns the max item of the stream. | |
Comparator | get_comparator () const |
Returns an instance of the comparator for this sketch. | |
Allocator | get_allocator () const |
Returns an instance of the allocator for this sketch. | |
double | get_rank (const T &item, bool inclusive=true) const |
Returns an approximation to the normalized rank of the given item from 0 to 1 inclusive. | |
vector_double | get_PMF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items). | |
vector_double | get_CDF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items). | |
quantile_return_type | get_quantile (double rank, bool inclusive=true) const |
Returns an approximate quantile of the given normalized rank. | |
double | get_rank_lower_bound (double rank, uint8_t num_std_dev) const |
Returns an approximate lower bound of the given normalized rank. | |
double | get_rank_upper_bound (double rank, uint8_t num_std_dev) const |
Returns an approximate upper bound of the given normalized rank. | |
template<typename TT = T, typename SerDe = serde<T>, typename std::enable_if< std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. | |
template<typename TT = T, typename SerDe = serde<T>, typename std::enable_if<!std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. | |
template<typename SerDe = serde<T>> | |
void | serialize (std::ostream &os, const SerDe &sd=SerDe()) const |
This method serializes the sketch into a given stream in a binary form. | |
template<typename SerDe = serde<T>> | |
vector_bytes | serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const |
This method serializes the sketch as a vector of bytes. | |
string< Allocator > | to_string (bool print_levels=false, bool print_items=false) const |
Prints a summary of the sketch. | |
const_iterator | begin () const |
Iterator pointing to the first item in the sketch. | |
const_iterator | end () const |
Iterator pointing to the past-the-end item in the sketch. | |
quantiles_sorted_view< T, Comparator, Allocator > | get_sorted_view () const |
Gets the sorted view of this sketch. | |
Static Public Member Functions | |
static double | get_RSE (uint16_t k, double rank, bool hra, uint64_t n) |
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). | |
template<typename SerDe = serde<T>> | |
static req_sketch | deserialize (std::istream &is, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given stream. | |
template<typename SerDe = serde<T>> | |
static req_sketch | deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given array of bytes. | |
Relative Error Quantiles Sketch.
This is an implementation based on the paper "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel VeselĂ˝, and loosely derived from a Python prototype written by Pavel VeselĂ˝.
Reference: https://arxiv.org/abs/2004.01668
This implementation differs from the algorithm described in the paper in the following:
The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
The size of each section (variable k and section_size in the code and parameter k in the paper) is initialized with a number set by the user via variable k. When the number of sections doubles, we decrease section_size by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.
using quantile_return_type = typename quantiles_sorted_view<T, Comparator, Allocator>::quantile_return_type |
Quantile return type.
This is to return quantiles either by value (for arithmetic types) or by const reference (for all other types)
|
explicit |
Constructor.
k | Controls the size and error of the sketch. It must be even and in the range [4, 1024], inclusive. Value of 12 roughly corresponds to 1% relative error guarantee at 95% confidence. |
hra | if true, the default, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. |
comparator | strict weak ordering function (see C++ named requirements: Compare) |
allocator | used by this sketch to allocate memory |
req_sketch | ( | const req_sketch< T, Comparator, Allocator > & | other | ) |
Copy constructor.
other | sketch to be copied |
|
noexcept |
Move constructor.
other | sketch to be moved |
|
explicit |
Type converting constructor.
other | sketch of a different type |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
req_sketch< T, C, A > & operator= | ( | const req_sketch< T, Comparator, Allocator > & | other | ) |
Copy assignment.
other | sketch to be copied |
req_sketch< T, C, A > & operator= | ( | req_sketch< T, Comparator, Allocator > && | other | ) |
Move assignment.
other | sketch to be moved |
uint16_t get_k | ( | ) | const |
Returns configured parameter K.
bool is_HRA | ( | ) | const |
Returns configured parameter High Rank Accuracy.
bool is_empty | ( | ) | const |
Returns true if this sketch is empty.
uint64_t get_n | ( | ) | const |
Returns the length of the input stream.
uint32_t get_num_retained | ( | ) | const |
Returns the number of retained items in the sketch.
bool is_estimation_mode | ( | ) | const |
Returns true if this sketch is in estimation mode.
void update | ( | FwdT && | item | ) |
Updates this sketch with the given data item.
item | from a stream of items |
void merge | ( | FwdSk && | other | ) |
Merges another sketch into this one.
other | sketch to merge into this one |
const T & get_min_item | ( | ) | const |
Returns the min item of the stream.
If the sketch is empty this throws std::runtime_error.
const T & get_max_item | ( | ) | const |
Returns the max item of the stream.
If the sketch is empty this throws std::runtime_error.
C get_comparator | ( | ) | const |
Returns an instance of the comparator for this sketch.
A get_allocator | ( | ) | const |
Returns an instance of the allocator for this sketch.
double get_rank | ( | const T & | item, |
bool | inclusive = true |
||
) | const |
Returns an approximation to the normalized rank of the given item from 0 to 1 inclusive.
If the sketch is empty this throws std::runtime_error.
item | to be ranked. |
inclusive | if true the weight of the given item is included into the rank. Otherwise the rank equals the sum of the weights of all items that are less than the given item according to the comparator C. |
auto get_PMF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items).
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals (bins). |
size | the number of split points in the array |
inclusive | if true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in PMF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point. |
auto get_CDF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items).
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals. |
size | the number of split points in the array |
inclusive | if true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in CDF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point. |
auto get_quantile | ( | double | rank, |
bool | inclusive = true |
||
) | const |
Returns an approximate quantile of the given normalized rank.
The normalized rank must be in the range [0.0, 1.0] (both inclusive).
If the sketch is empty this throws std::runtime_error.
rank | of an item in the hypothetical sorted stream. |
inclusive | if true, the given rank is considered inclusive (includes weight of an item) |
double get_rank_lower_bound | ( | double | rank, |
uint8_t | num_std_dev | ||
) | const |
Returns an approximate lower bound of the given normalized rank.
rank | the given rank, a value between 0 and 1.0. |
num_std_dev | the number of standard deviations. Must be 1, 2, or 3. |
double get_rank_upper_bound | ( | double | rank, |
uint8_t | num_std_dev | ||
) | const |
Returns an approximate upper bound of the given normalized rank.
rank | the given rank, a value between 0 and 1.0. |
num_std_dev | the number of standard deviations. Must be 1, 2, or 3. |
|
static |
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were modified based on empirical measurements.
k | the given value of k |
rank | the given normalized rank, a number in [0,1]. |
hra | if true High Rank Accuracy mode is being selected, otherwise, Low Rank Accuracy. |
n | an estimate of the total number of items submitted to the sketch. |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for fixed-size arithmetic types (integral and floating point).
sd | instance of a SerDe |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for all other types and can be expensive since every item needs to be looked at.
sd | instance of a SerDe |
void serialize | ( | std::ostream & | os, |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch into a given stream in a binary form.
os | output stream |
sd | instance of a SerDe |
vector_bytes serialize | ( | unsigned | header_size_bytes = 0 , |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch as a vector of bytes.
An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in Datasketches PostgreSQL extension.
header_size_bytes | space to reserve in front of the sketch |
sd | instance of a SerDe |
|
static |
This method deserializes a sketch from a given stream.
is | input stream |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
|
static |
This method deserializes a sketch from a given array of bytes.
bytes | pointer to the array of bytes |
size | the size of the array |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
string< A > to_string | ( | bool | print_levels = false , |
bool | print_items = false |
||
) | const |
Prints a summary of the sketch.
print_levels | if true include information about levels |
print_items | if true include sketch data |
auto begin | ( | ) | const |
Iterator pointing to the first item in the sketch.
If the sketch is empty, the returned iterator must not be dereferenced or incremented.
auto end | ( | ) | const |
Iterator pointing to the past-the-end item in the sketch.
The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.
quantiles_sorted_view< T, C, A > get_sorted_view | ( | ) | const |
Gets the sorted view of this sketch.