datasketches-cpp
|
This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass. More...
#include <quantiles_sketch.hpp>
Public Member Functions | |
quantiles_sketch (uint16_t k=quantiles_constants::DEFAULT_K, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) | |
Constructor. More... | |
quantiles_sketch (const quantiles_sketch &other) | |
Copy constructor. More... | |
quantiles_sketch (quantiles_sketch &&other) noexcept | |
Move constructor. More... | |
quantiles_sketch & | operator= (const quantiles_sketch &other) |
Copy assignment. More... | |
quantiles_sketch & | operator= (quantiles_sketch &&other) noexcept |
Move assignment. More... | |
template<typename From , typename FC , typename FA > | |
quantiles_sketch (const quantiles_sketch< From, FC, FA > &other, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) | |
Type converting constructor. More... | |
template<typename FwdT > | |
void | update (FwdT &&item) |
Updates this sketch with the given data item. More... | |
template<typename FwdSk > | |
void | merge (FwdSk &&other) |
Merges another sketch into this one. More... | |
bool | is_empty () const |
Returns true if this sketch is empty. More... | |
uint16_t | get_k () const |
Returns configured parameter k. More... | |
uint64_t | get_n () const |
Returns the length of the input stream. More... | |
uint32_t | get_num_retained () const |
Returns the number of retained items (samples) in the sketch. More... | |
bool | is_estimation_mode () const |
Returns true if this sketch is in estimation mode. More... | |
const T & | get_min_item () const |
Returns the min item of the stream. More... | |
const T & | get_max_item () const |
Returns the max item of the stream. More... | |
Comparator | get_comparator () const |
Returns an instance of the comparator for this sketch. More... | |
allocator_type | get_allocator () const |
Returns the allocator for this sketch. More... | |
quantile_return_type | get_quantile (double rank, bool inclusive=true) const |
Returns an approximation to the data item associated with the given rank of a hypothetical sorted version of the input stream so far. More... | |
double | get_rank (const T &item, bool inclusive=true) const |
Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive. More... | |
vector_double | get_PMF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items). More... | |
vector_double | get_CDF (const T *split_points, uint32_t size, bool inclusive=true) const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items). More... | |
template<typename SerDe = serde<T>, typename TT = T, typename std::enable_if< std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. More... | |
template<typename SerDe = serde<T>, typename TT = T, typename std::enable_if<!std::is_arithmetic< TT >::value, int >::type = 0> | |
size_t | get_serialized_size_bytes (const SerDe &sd=SerDe()) const |
Computes size needed to serialize the current state of the sketch. More... | |
template<typename SerDe = serde<T>> | |
void | serialize (std::ostream &os, const SerDe &sd=SerDe()) const |
This method serializes the sketch into a given stream in a binary form. More... | |
template<typename SerDe = serde<T>> | |
vector_bytes | serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const |
This method serializes the sketch as a vector of bytes. More... | |
double | get_normalized_rank_error (bool is_pmf) const |
Gets the normalized rank error for this sketch. More... | |
string< Allocator > | to_string (bool print_levels=false, bool print_items=false) const |
Prints a summary of the sketch. More... | |
const_iterator | begin () const |
Iterator pointing to the first item in the sketch. More... | |
const_iterator | end () const |
Iterator pointing to the past-the-end item in the sketch. More... | |
quantiles_sorted_view< T, Comparator, Allocator > | get_sorted_view () const |
Gets the sorted view of this sketch. More... | |
Static Public Member Functions | |
template<typename SerDe = serde<T>> | |
static quantiles_sketch | deserialize (std::istream &is, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given stream. More... | |
template<typename SerDe = serde<T>> | |
static quantiles_sketch | deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator()) |
This method deserializes a sketch from a given array of bytes. More... | |
static double | get_normalized_rank_error (uint16_t k, bool is_pmf) |
Gets the normalized rank error given k and pmf. More... | |
This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass.
The analysis is obtained using get_rank() and get_quantile() functions, the Probability Mass Function from get_PMF() and the Cumulative Distribution Function from get_CDF().
Consider a large stream of one million values such as packet sizes coming into a network node. The natural rank of any specific size value is its index in the hypothetical sorted array of values. The normalized rank is the natural rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or get_quantile(0.5). Similarly, the 95th percentile is obtained from get_quantile(0.95).
From the min and max values, for example, 1 and 1000 bytes, you can obtain the PMF from get_PMF(100, 500, 900) that will result in an array of 4 fractional values such as {.4, .3, .2, .1}, which means that
A frequency histogram can be obtained by multiplying these fractions by get_n(), which is the total count of values received. The get_CDF() works similarly, but produces the cumulative distribution instead.
As of November 2021, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = double (64-bit double precision values).
The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized, rank error of about 1.7%. For example, the median item returned from getQuantile(0.5) will be between the actual items from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.
Table Guide for DoublesSketch Size in Bytes and Approximate Error: K => | 16 32 64 128 256 512 1,024 ~ Error => | 12.145% 6.359% 3.317% 1.725% 0.894% 0.463% 0.239% N | Size in Bytes -> ------------------------------------------------------------------------ 0 | 8 8 8 8 8 8 8 1 | 72 72 72 72 72 72 72 3 | 72 72 72 72 72 72 72 7 | 104 104 104 104 104 104 104 15 | 168 168 168 168 168 168 168 31 | 296 296 296 296 296 296 296 63 | 424 552 552 552 552 552 552 127 | 552 808 1,064 1,064 1,064 1,064 1,064 255 | 680 1,064 1,576 2,088 2,088 2,088 2,088 511 | 808 1,320 2,088 3,112 4,136 4,136 4,136 1,023 | 936 1,576 2,600 4,136 6,184 8,232 8,232 2,047 | 1,064 1,832 3,112 5,160 8,232 12,328 16,424 4,095 | 1,192 2,088 3,624 6,184 10,280 16,424 24,616 8,191 | 1,320 2,344 4,136 7,208 12,328 20,520 32,808 16,383 | 1,448 2,600 4,648 8,232 14,376 24,616 41,000 32,767 | 1,576 2,856 5,160 9,256 16,424 28,712 49,192 65,535 | 1,704 3,112 5,672 10,280 18,472 32,808 57,384 131,071 | 1,832 3,368 6,184 11,304 20,520 36,904 65,576 262,143 | 1,960 3,624 6,696 12,328 22,568 41,000 73,768 524,287 | 2,088 3,880 7,208 13,352 24,616 45,096 81,960 1,048,575 | 2,216 4,136 7,720 14,376 26,664 49,192 90,152 2,097,151 | 2,344 4,392 8,232 15,400 28,712 53,288 98,344 4,194,303 | 2,472 4,648 8,744 16,424 30,760 57,384 106,536 8,388,607 | 2,600 4,904 9,256 17,448 32,808 61,480 114,728 16,777,215 | 2,728 5,160 9,768 18,472 34,856 65,576 122,920 33,554,431 | 2,856 5,416 10,280 19,496 36,904 69,672 131,112 67,108,863 | 2,984 5,672 10,792 20,520 38,952 73,768 139,304 134,217,727 | 3,112 5,928 11,304 21,544 41,000 77,864 147,496 268,435,455 | 3,240 6,184 11,816 22,568 43,048 81,960 155,688 536,870,911 | 3,368 6,440 12,328 23,592 45,096 86,056 163,880 1,073,741,823 | 3,496 6,696 12,840 24,616 47,144 90,152 172,072 2,147,483,647 | 3,624 6,952 13,352 25,640 49,192 94,248 180,264 4,294,967,295 | 3,752 7,208 13,864 26,664 51,240 98,344 188,456
There is more documentation available on datasketches.apache.org.
This is an implementation of the Low Discrepancy Mergeable Quantiles Sketch described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.
This algorithm is independent of the distribution of items and requires only that the items be comparable.
This algorithm intentionally inserts randomness into the sampling process for items that ultimately get retained in the sketch. The results produced by this algorithm are not deterministic. For example, if the same stream is inserted into two different instances of this sketch, the answers obtained from the two sketches may not be identical.
Similarly, there may be directional inconsistencies. For example, the result obtained from get_quantile(rank) input into the reverse directional query get_rank(item) may not result in the original item.
|
explicit |
Constructor.
k | affects the size of the sketch and its estimation error |
comparator | strict weak ordering function (see C++ named requirements: Compare) |
allocator | used to allocate memory |
quantiles_sketch | ( | const quantiles_sketch< T, Comparator, Allocator > & | other | ) |
Copy constructor.
other | sketch to be copied |
|
noexcept |
Move constructor.
other | sketch to be moved |
|
explicit |
Type converting constructor.
other | quantiles sketch of a different type |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
quantiles_sketch< T, C, A > & operator= | ( | const quantiles_sketch< T, Comparator, Allocator > & | other | ) |
Copy assignment.
other | sketch to be copied |
|
noexcept |
Move assignment.
other | sketch to be moved |
void update | ( | FwdT && | item | ) |
Updates this sketch with the given data item.
item | from a stream of items |
void merge | ( | FwdSk && | other | ) |
Merges another sketch into this one.
other | sketch to merge into this one |
bool is_empty |
Returns true if this sketch is empty.
uint16_t get_k |
Returns configured parameter k.
uint64_t get_n |
Returns the length of the input stream.
uint32_t get_num_retained |
Returns the number of retained items (samples) in the sketch.
bool is_estimation_mode |
Returns true if this sketch is in estimation mode.
const T & get_min_item |
Returns the min item of the stream.
If the sketch is empty this throws std::runtime_error.
const T & get_max_item |
Returns the max item of the stream.
If the sketch is empty this throws std::runtime_error.
C get_comparator |
Returns an instance of the comparator for this sketch.
A get_allocator |
Returns the allocator for this sketch.
auto get_quantile | ( | double | rank, |
bool | inclusive = true |
||
) | const |
Returns an approximation to the data item associated with the given rank of a hypothetical sorted version of the input stream so far.
If the sketch is empty this throws std::runtime_error.
rank | the specified normalized rank in the hypothetical sorted stream. |
inclusive | if true the weight of the given item is included into the rank. Otherwise the rank equals the sum of the weights of all items that are less than the given item according to the Comparator. |
double get_rank | ( | const T & | item, |
bool | inclusive = true |
||
) | const |
Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive.
The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.
If the sketch is empty this throws std::runtime_error.
item | to be ranked |
inclusive | if true the weight of the given item is included into the rank. Otherwise the rank equals the sum of the weights of all items that are less than the given item according to the Comparator. |
auto get_PMF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items).
The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(true) function.
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals (bins). |
size | of the array of split points. |
inclusive | if true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in PMF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point. |
auto get_CDF | ( | const T * | split_points, |
uint32_t | size, | ||
bool | inclusive = true |
||
) | const |
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items).
The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.
If the sketch is empty this throws std::runtime_error.
split_points | an array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals. |
size | of the array of split points. |
inclusive | if true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in CDF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point. |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for fixed-size arithmetic types (integral and floating point).
sd | instance of a SerDe |
size_t get_serialized_size_bytes | ( | const SerDe & | sd = SerDe() | ) | const |
Computes size needed to serialize the current state of the sketch.
This version is for all other types and can be expensive since every item needs to be looked at.
sd | instance of a SerDe |
void serialize | ( | std::ostream & | os, |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch into a given stream in a binary form.
os | output stream |
sd | instance of a SerDe |
vector_bytes serialize | ( | unsigned | header_size_bytes = 0 , |
const SerDe & | sd = SerDe() |
||
) | const |
This method serializes the sketch as a vector of bytes.
An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in Datasketches PostgreSQL extension.
header_size_bytes | space to reserve in front of the sketch |
sd | instance of a SerDe |
|
static |
This method deserializes a sketch from a given stream.
is | input stream |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
|
static |
This method deserializes a sketch from a given array of bytes.
bytes | pointer to the array of bytes |
size | the size of the array |
sd | instance of a SerDe |
comparator | instance of a Comparator |
allocator | instance of an Allocator |
double get_normalized_rank_error | ( | bool | is_pmf | ) | const |
Gets the normalized rank error for this sketch.
Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials.
is_pmf | if true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries. |
|
static |
Gets the normalized rank error given k and pmf.
Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials.
k | the configuration parameter |
is_pmf | if true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries. |
string< A > to_string | ( | bool | print_levels = false , |
bool | print_items = false |
||
) | const |
Prints a summary of the sketch.
print_levels | if true include information about levels |
print_items | if true include sketch data |
quantiles_sketch< T, C, A >::const_iterator begin |
Iterator pointing to the first item in the sketch.
If the sketch is empty, the returned iterator must not be dereferenced or incremented.
quantiles_sketch< T, C, A >::const_iterator end |
Iterator pointing to the past-the-end item in the sketch.
The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.
quantiles_sorted_view< T, C, A > get_sorted_view |
Gets the sorted view of this sketch.