datasketches-cpp
Public Member Functions | Static Public Member Functions | List of all members
quantiles_sketch< T, Comparator, Allocator > Class Template Reference

This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass. More...

#include <quantiles_sketch.hpp>

Public Member Functions

 quantiles_sketch (uint16_t k=quantiles_constants::DEFAULT_K, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator())
 Constructor. More...
 
 quantiles_sketch (const quantiles_sketch &other)
 Copy constructor. More...
 
 quantiles_sketch (quantiles_sketch &&other) noexcept
 Move constructor. More...
 
quantiles_sketchoperator= (const quantiles_sketch &other)
 Copy assignment. More...
 
quantiles_sketchoperator= (quantiles_sketch &&other) noexcept
 Move assignment. More...
 
template<typename From , typename FC , typename FA >
 quantiles_sketch (const quantiles_sketch< From, FC, FA > &other, const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator())
 Type converting constructor. More...
 
template<typename FwdT >
void update (FwdT &&item)
 Updates this sketch with the given data item. More...
 
template<typename FwdSk >
void merge (FwdSk &&other)
 Merges another sketch into this one. More...
 
bool is_empty () const
 Returns true if this sketch is empty. More...
 
uint16_t get_k () const
 Returns configured parameter k. More...
 
uint64_t get_n () const
 Returns the length of the input stream. More...
 
uint32_t get_num_retained () const
 Returns the number of retained items (samples) in the sketch. More...
 
bool is_estimation_mode () const
 Returns true if this sketch is in estimation mode. More...
 
const T & get_min_item () const
 Returns the min item of the stream. More...
 
const T & get_max_item () const
 Returns the max item of the stream. More...
 
Comparator get_comparator () const
 Returns an instance of the comparator for this sketch. More...
 
allocator_type get_allocator () const
 Returns the allocator for this sketch. More...
 
quantile_return_type get_quantile (double rank, bool inclusive=true) const
 Returns an approximation to the data item associated with the given rank of a hypothetical sorted version of the input stream so far. More...
 
double get_rank (const T &item, bool inclusive=true) const
 Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive. More...
 
vector_double get_PMF (const T *split_points, uint32_t size, bool inclusive=true) const
 Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items). More...
 
vector_double get_CDF (const T *split_points, uint32_t size, bool inclusive=true) const
 Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items). More...
 
template<typename SerDe = serde<T>, typename TT = T, typename std::enable_if< std::is_arithmetic< TT >::value, int >::type = 0>
size_t get_serialized_size_bytes (const SerDe &sd=SerDe()) const
 Computes size needed to serialize the current state of the sketch. More...
 
template<typename SerDe = serde<T>, typename TT = T, typename std::enable_if<!std::is_arithmetic< TT >::value, int >::type = 0>
size_t get_serialized_size_bytes (const SerDe &sd=SerDe()) const
 Computes size needed to serialize the current state of the sketch. More...
 
template<typename SerDe = serde<T>>
void serialize (std::ostream &os, const SerDe &sd=SerDe()) const
 This method serializes the sketch into a given stream in a binary form. More...
 
template<typename SerDe = serde<T>>
vector_bytes serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const
 This method serializes the sketch as a vector of bytes. More...
 
double get_normalized_rank_error (bool is_pmf) const
 Gets the normalized rank error for this sketch. More...
 
string< Allocator > to_string (bool print_levels=false, bool print_items=false) const
 Prints a summary of the sketch. More...
 
const_iterator begin () const
 Iterator pointing to the first item in the sketch. More...
 
const_iterator end () const
 Iterator pointing to the past-the-end item in the sketch. More...
 
quantiles_sorted_view< T, Comparator, Allocator > get_sorted_view () const
 Gets the sorted view of this sketch. More...
 

Static Public Member Functions

template<typename SerDe = serde<T>>
static quantiles_sketch deserialize (std::istream &is, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator())
 This method deserializes a sketch from a given stream. More...
 
template<typename SerDe = serde<T>>
static quantiles_sketch deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const Comparator &comparator=Comparator(), const Allocator &allocator=Allocator())
 This method deserializes a sketch from a given array of bytes. More...
 
static double get_normalized_rank_error (uint16_t k, bool is_pmf)
 Gets the normalized rank error given k and pmf. More...
 

Detailed Description

template<typename T, typename Comparator = std::less<T>, typename Allocator = std::allocator<T>>
class datasketches::quantiles_sketch< T, Comparator, Allocator >

This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution from a very large stream in a single pass.

The analysis is obtained using get_rank() and get_quantile() functions, the Probability Mass Function from get_PMF() and the Cumulative Distribution Function from get_CDF().

Consider a large stream of one million values such as packet sizes coming into a network node. The natural rank of any specific size value is its index in the hypothetical sorted array of values. The normalized rank is the natural rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or get_quantile(0.5). Similarly, the 95th percentile is obtained from get_quantile(0.95).

From the min and max values, for example, 1 and 1000 bytes, you can obtain the PMF from get_PMF(100, 500, 900) that will result in an array of 4 fractional values such as {.4, .3, .2, .1}, which means that

A frequency histogram can be obtained by multiplying these fractions by get_n(), which is the total count of values received. The get_CDF() works similarly, but produces the cumulative distribution instead.

As of November 2021, this implementation produces serialized sketches which are binary-compatible with the equivalent Java implementation only when template parameter T = double (64-bit double precision values).

The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized, rank error of about 1.7%. For example, the median item returned from getQuantile(0.5) will be between the actual items from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.

Table Guide for DoublesSketch Size in Bytes and Approximate Error:
          K => |      16      32      64     128     256     512   1,024
    ~ Error => | 12.145%  6.359%  3.317%  1.725%  0.894%  0.463%  0.239%
             N | Size in Bytes ->
------------------------------------------------------------------------
             0 |       8       8       8       8       8       8       8
             1 |      72      72      72      72      72      72      72
             3 |      72      72      72      72      72      72      72
             7 |     104     104     104     104     104     104     104
            15 |     168     168     168     168     168     168     168
            31 |     296     296     296     296     296     296     296
            63 |     424     552     552     552     552     552     552
           127 |     552     808   1,064   1,064   1,064   1,064   1,064
           255 |     680   1,064   1,576   2,088   2,088   2,088   2,088
           511 |     808   1,320   2,088   3,112   4,136   4,136   4,136
         1,023 |     936   1,576   2,600   4,136   6,184   8,232   8,232
         2,047 |   1,064   1,832   3,112   5,160   8,232  12,328  16,424
         4,095 |   1,192   2,088   3,624   6,184  10,280  16,424  24,616
         8,191 |   1,320   2,344   4,136   7,208  12,328  20,520  32,808
        16,383 |   1,448   2,600   4,648   8,232  14,376  24,616  41,000
        32,767 |   1,576   2,856   5,160   9,256  16,424  28,712  49,192
        65,535 |   1,704   3,112   5,672  10,280  18,472  32,808  57,384
       131,071 |   1,832   3,368   6,184  11,304  20,520  36,904  65,576
       262,143 |   1,960   3,624   6,696  12,328  22,568  41,000  73,768
       524,287 |   2,088   3,880   7,208  13,352  24,616  45,096  81,960
     1,048,575 |   2,216   4,136   7,720  14,376  26,664  49,192  90,152
     2,097,151 |   2,344   4,392   8,232  15,400  28,712  53,288  98,344
     4,194,303 |   2,472   4,648   8,744  16,424  30,760  57,384 106,536
     8,388,607 |   2,600   4,904   9,256  17,448  32,808  61,480 114,728
    16,777,215 |   2,728   5,160   9,768  18,472  34,856  65,576 122,920
    33,554,431 |   2,856   5,416  10,280  19,496  36,904  69,672 131,112
    67,108,863 |   2,984   5,672  10,792  20,520  38,952  73,768 139,304
   134,217,727 |   3,112   5,928  11,304  21,544  41,000  77,864 147,496
   268,435,455 |   3,240   6,184  11,816  22,568  43,048  81,960 155,688
   536,870,911 |   3,368   6,440  12,328  23,592  45,096  86,056 163,880
 1,073,741,823 |   3,496   6,696  12,840  24,616  47,144  90,152 172,072
 2,147,483,647 |   3,624   6,952  13,352  25,640  49,192  94,248 180,264
 4,294,967,295 |   3,752   7,208  13,864  26,664  51,240  98,344 188,456

   

There is more documentation available on datasketches.apache.org.

This is an implementation of the Low Discrepancy Mergeable Quantiles Sketch described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.

This algorithm is independent of the distribution of items and requires only that the items be comparable.

This algorithm intentionally inserts randomness into the sampling process for items that ultimately get retained in the sketch. The results produced by this algorithm are not deterministic. For example, if the same stream is inserted into two different instances of this sketch, the answers obtained from the two sketches may not be identical.

Similarly, there may be directional inconsistencies. For example, the result obtained from get_quantile(rank) input into the reverse directional query get_rank(item) may not result in the original item.

Author
Kevin Lang
Lee Rhodes
Alexander Saydakov
Jon Malkin

Constructor & Destructor Documentation

◆ quantiles_sketch() [1/4]

quantiles_sketch ( uint16_t  k = quantiles_constants::DEFAULT_K,
const Comparator &  comparator = Comparator(),
const Allocator &  allocator = Allocator() 
)
explicit

Constructor.

Parameters
kaffects the size of the sketch and its estimation error
comparatorstrict weak ordering function (see C++ named requirements: Compare)
allocatorused to allocate memory

◆ quantiles_sketch() [2/4]

quantiles_sketch ( const quantiles_sketch< T, Comparator, Allocator > &  other)

Copy constructor.

Parameters
othersketch to be copied

◆ quantiles_sketch() [3/4]

quantiles_sketch ( quantiles_sketch< T, Comparator, Allocator > &&  other)
noexcept

Move constructor.

Parameters
othersketch to be moved

◆ quantiles_sketch() [4/4]

quantiles_sketch ( const quantiles_sketch< From, FC, FA > &  other,
const Comparator &  comparator = Comparator(),
const Allocator &  allocator = Allocator() 
)
explicit

Type converting constructor.

Parameters
otherquantiles sketch of a different type
comparatorinstance of a Comparator
allocatorinstance of an Allocator

Member Function Documentation

◆ operator=() [1/2]

quantiles_sketch< T, C, A > & operator= ( const quantiles_sketch< T, Comparator, Allocator > &  other)

Copy assignment.

Parameters
othersketch to be copied
Returns
reference to this sketch

◆ operator=() [2/2]

quantiles_sketch< T, C, A > & operator= ( quantiles_sketch< T, Comparator, Allocator > &&  other)
noexcept

Move assignment.

Parameters
othersketch to be moved
Returns
reference to this sketch

◆ update()

void update ( FwdT &&  item)

Updates this sketch with the given data item.

Parameters
itemfrom a stream of items

◆ merge()

void merge ( FwdSk &&  other)

Merges another sketch into this one.

Parameters
othersketch to merge into this one

◆ is_empty()

bool is_empty

Returns true if this sketch is empty.

Returns
empty flag

◆ get_k()

uint16_t get_k

Returns configured parameter k.

Returns
parameter k

◆ get_n()

uint64_t get_n

Returns the length of the input stream.

Returns
stream length

◆ get_num_retained()

uint32_t get_num_retained

Returns the number of retained items (samples) in the sketch.

Returns
the number of retained items

◆ is_estimation_mode()

bool is_estimation_mode

Returns true if this sketch is in estimation mode.

Returns
estimation mode flag

◆ get_min_item()

const T & get_min_item

Returns the min item of the stream.

If the sketch is empty this throws std::runtime_error.

Returns
the min item of the stream

◆ get_max_item()

const T & get_max_item

Returns the max item of the stream.

If the sketch is empty this throws std::runtime_error.

Returns
the max item of the stream

◆ get_comparator()

C get_comparator

Returns an instance of the comparator for this sketch.

Returns
comparator

◆ get_allocator()

A get_allocator

Returns the allocator for this sketch.

Returns
allocator

◆ get_quantile()

auto get_quantile ( double  rank,
bool  inclusive = true 
) const

Returns an approximation to the data item associated with the given rank of a hypothetical sorted version of the input stream so far.

If the sketch is empty this throws std::runtime_error.

Parameters
rankthe specified normalized rank in the hypothetical sorted stream.
inclusiveif true the weight of the given item is included into the rank. Otherwise the rank equals the sum of the weights of all items that are less than the given item according to the Comparator.
Returns
the approximation to the item at the given rank

◆ get_rank()

double get_rank ( const T &  item,
bool  inclusive = true 
) const

Returns an approximation to the normalized rank of the given item from 0 to 1, inclusive.

The resulting approximation has a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.

If the sketch is empty this throws std::runtime_error.

Parameters
itemto be ranked
inclusiveif true the weight of the given item is included into the rank. Otherwise the rank equals the sum of the weights of all items that are less than the given item according to the Comparator.
Returns
an approximate normalized rank of the given item

◆ get_PMF()

auto get_PMF ( const T *  split_points,
uint32_t  size,
bool  inclusive = true 
) const

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (items).

The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(true) function.

If the sketch is empty this throws std::runtime_error.

Parameters
split_pointsan array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals (bins).
sizeof the array of split points.
inclusiveif true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in PMF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point.
Returns
an array of m+1 doubles each of which is an approximation to the fraction of the input stream items (the mass) that fall into one of those intervals.

◆ get_CDF()

auto get_CDF ( const T *  split_points,
uint32_t  size,
bool  inclusive = true 
) const

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of split points (items).

The resulting approximations have a probabilistic guarantee that can be obtained from the get_normalized_rank_error(false) function.

If the sketch is empty this throws std::runtime_error.

Parameters
split_pointsan array of m unique, monotonically increasing items that divide the input domain into m+1 consecutive disjoint intervals.
sizeof the array of split points.
inclusiveif true the rank of an item includes its own weight, and therefore if the sketch contains items equal to a slit point, then in CDF such items are included into the interval to the left of split point. Otherwise they are included into the interval to the right of split point.
Returns
an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the split_points. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array. This can be viewed as array of ranks of the given split points plus one more value that is always 1.

◆ get_serialized_size_bytes() [1/2]

size_t get_serialized_size_bytes ( const SerDe &  sd = SerDe()) const

Computes size needed to serialize the current state of the sketch.

This version is for fixed-size arithmetic types (integral and floating point).

Parameters
sdinstance of a SerDe
Returns
size in bytes needed to serialize this sketch

◆ get_serialized_size_bytes() [2/2]

size_t get_serialized_size_bytes ( const SerDe &  sd = SerDe()) const

Computes size needed to serialize the current state of the sketch.

This version is for all other types and can be expensive since every item needs to be looked at.

Parameters
sdinstance of a SerDe
Returns
size in bytes needed to serialize this sketch

◆ serialize() [1/2]

void serialize ( std::ostream &  os,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch into a given stream in a binary form.

Parameters
osoutput stream
sdinstance of a SerDe

◆ serialize() [2/2]

vector_bytes serialize ( unsigned  header_size_bytes = 0,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch as a vector of bytes.

An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in Datasketches PostgreSQL extension.

Parameters
header_size_bytesspace to reserve in front of the sketch
sdinstance of a SerDe
Returns
serialized sketch as a vector of bytes

◆ deserialize() [1/2]

static quantiles_sketch deserialize ( std::istream &  is,
const SerDe &  sd = SerDe(),
const Comparator &  comparator = Comparator(),
const Allocator &  allocator = Allocator() 
)
static

This method deserializes a sketch from a given stream.

Parameters
isinput stream
sdinstance of a SerDe
comparatorinstance of a Comparator
allocatorinstance of an Allocator
Returns
an instance of a sketch

◆ deserialize() [2/2]

static quantiles_sketch deserialize ( const void *  bytes,
size_t  size,
const SerDe &  sd = SerDe(),
const Comparator &  comparator = Comparator(),
const Allocator &  allocator = Allocator() 
)
static

This method deserializes a sketch from a given array of bytes.

Parameters
bytespointer to the array of bytes
sizethe size of the array
sdinstance of a SerDe
comparatorinstance of a Comparator
allocatorinstance of an Allocator
Returns
an instance of a sketch

◆ get_normalized_rank_error() [1/2]

double get_normalized_rank_error ( bool  is_pmf) const

Gets the normalized rank error for this sketch.

Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials.

Parameters
is_pmfif true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
Returns
the normalized rank error for the sketch

◆ get_normalized_rank_error() [2/2]

double get_normalized_rank_error ( uint16_t  k,
bool  is_pmf 
)
static

Gets the normalized rank error given k and pmf.

Constants were derived as the best fit to 99 percentile empirically measured max error in thousands of trials.

Parameters
kthe configuration parameter
is_pmfif true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
Returns
the normalized rank error for the given parameters

◆ to_string()

string< A > to_string ( bool  print_levels = false,
bool  print_items = false 
) const

Prints a summary of the sketch.

Parameters
print_levelsif true include information about levels
print_itemsif true include sketch data

◆ begin()

quantiles_sketch< T, C, A >::const_iterator begin

Iterator pointing to the first item in the sketch.

If the sketch is empty, the returned iterator must not be dereferenced or incremented.

Returns
iterator pointing to the first item in the sketch

◆ end()

quantiles_sketch< T, C, A >::const_iterator end

Iterator pointing to the past-the-end item in the sketch.

The past-the-end item is the hypothetical item that would follow the last item. It does not point to any item, and must not be dereferenced or incremented.

Returns
iterator pointing to the past-the-end item in the sketch

◆ get_sorted_view()

quantiles_sorted_view< T, C, A > get_sorted_view

Gets the sorted view of this sketch.

Returns
the sorted view of this sketch

The documentation for this class was generated from the following files: