Sketch Library Dictionary

Sketch Accuracy

Refers to sketch accuracy...

Alpha TCF

The Alpha Theta Choosing Function (TCF) and the theory behind it is fully described in the Theta Sketch Framework paper. The alpha algorithm is optimized for speed and accuracy in a real-time sketch building / estimating environment.

One of the properties of the Alpha Algorithm used for cache management within a sketch is that the value of Theta Long is always up-to-date. There may be dirty hash values in the cache, but the Alpha Algorithm estimation function ignores them.

Default Nominal Entries

In Theta Sketches, the default nominal entries of 4096 is provided as a convenience for those cases where the number of entries is not provided. A sketch of 4096 entries has a Relative Standard Error (RSE) of +/- 1.56% at a confidence of 68%; or equivalently, a Relative Error of +/- 3.1% at a confidence of 95.4%.

Default Update Seed

In Theta Sketches, the default Update Hash Seed 9001 is a prime number that was chosen very early on in experimental testing. Choosing a seed is somewhat arbitrary, and this particular seed is not superior to other seeds.

In performing set operations on two sketches it is critical that the same hash function and seed was used for both sketches, otherwise the assumed 1:1 relationship between the original source key value and the hashed bit string would be violated. Once you have developed a history of stored sketches you are stuck with your chosen seed. So don't change it! Update Hash Seed

Degenerate Sketch

A sketch is considered degenerate if it has never reached reached Estimation Mode where the count of its retained hash values is exact and less than Nominal Entries (or k entries) and no estimation is needed. This is also referred to as Exact Mode.

Destination Memory

The destination Memory for this object. If not required it may be null. If not null and large enough the returned object will be backed by this Memory. If null, the returned object will be on the Java heap.

Destination Ordered

In Theta Sketches, this refers to the ordering of hash values in a returned CompactSketch. If true, the internal hash values will be ordered. This will enable much higher performance during union (merge) operations but at the cost of a sort, which may not be desirable in some applications, especially real-time.

Dirty Hash

For the Theta Alpha Sketch, a retained hash value is considered dirty if it is ≥ Theta Long or < 0. See Valid Hash.

isEmpty()

In Theta Sketches, the state isEmpty() for a sketch means that the sketch cache has zero hash values and that none of the update methods have been called with valid data. In other words, the sketch has never seen any data. This state is equivalent to "null" in the sense that it is safe to exclude empty sketches from union operations. However, an empty sketch will impact intersections and difference set operations.

Note that isEmpty() does not always mean that theta is 1.0 because if p < 1.0, theta will be set equal to p during construction. Also, a cache of zero values (getRetainedEntries(true) = 0) does not mean that the sketch is Empty since set intersection or difference operations can result in a sketch with zero values. If the sourcing sketches had seen data then a resulting intersection or difference sketch will be not Empty and have valid upper and lower bounds even if the cache has zero values. In other words, the resulting sketch represents a valid distribution of data that just happens to have zero samples collected from it.

Note also that a virgin Intersection object will return isEmpty() == false. This is because a virgin Intersection object represents the Universe Set, which is clearly not empty.

These are subtle distinctions and exist for mathematical correctness. Excluding sketches that just have getRetainedEntries(true) = 0 from set operations them could result in impacting the accuracy of results.

Estimation Mode

Once a Theta Sketch exceeds the configured Nominal Entries, or k, number of retained hash values, the sketch transitions into estimation mode where it must now estimate the population of uniques that the retained entries represent. A sketch can also be in estimation mode if the sketch was configured with a Sampling Probability < 1.0. Estimation Mode = (θ < 1.0) & NOT empty.

lgNomLongs

For Theta Sketches, the Log base 2 of the number of Nominal Entries.

lgArrLongs

For Theta Sketches, the Log base 2 of the size of the internal hash table in 64-bit longs.

Memory

The backing Memory object which may be a source or destination.

Nominal Entries

For Theta Sketches and depending on the specific sketch, the constructor data type is int or String. Acceptable values for constructing new sketches are positive integers that are powers of two. This parameter specifies the target nominal number of entries (a.k.a. k) that will be retained in the internal cache and ultimately determines the accuracy of the sketch (see Sketch Accuracy). Internally each entry is retained as a long of 8 bytes.

The reason it is called "nominal" is that depending on the specific algorithm of the sketch, the actual number of entries retained by the sketch after the sketch goes into estimation mode may differ from the configured value. For the AlphaSketch the number of retained entries statistically varies but has a mean of k. For the QuickSelect sketches, the number of retained entries will statistically vary from k to almost 2*k.

Each sketch type also has a minimum acceptable value for this value. For QuickSelect Sketches this value is 16 and for Alpha Sketches this value is 512. Specifying a value less than this minimum value just results in the minimum value being used.

Number of Standard Deviations

This is a positive number, which may be either an integer (1, 2, or 3) or a double ≤ 3.0. This value is used in the getUpperBounds(int numStdDev) and getLowerBounds(int numStdDev) methods and represents (theoretically) the +/- standard deviation from the center of the Standard Normal Gaussian Distribution. For example:

getUpperBound(1) returns the estimated quantile(0.841) of the distribution.
getLowerBound(1) returns the estimated quantile(0.158) of the distribution.
getUpperBound(2) returns the estimated quantile(0.977) of the distribution.
getLowerBound(2) returns the estimated quantile(0.023) of the distribution.
getUpperBound(3) returns the estimated quantile(0.9986) of the distribution.
getLowerBound(3) returns the estimated quantile(0.0013) of the distribution.

However, for sketches with small configured values of Nominal Entries < 4096 for Theta or lgConfigK < 12 for HLL, the error distribution of the sketch becomes quite asymmetric and cannot be approximated with a Gaussian. In these cases the interpretation of numStdDev is that of an index that returns the quantile of the sketch error distribution that corresponds to fractional normalized rank of the standard normal distribution at the specified numStdDev.

Thus, getUpperBound(1) and getLowerBound(2) represent the 68.3% confidence bounds, getUpperBound(2) and getLowerBound(2) represent the 95.4% confidence bounds, and getUpperBound(3) and getLowerBound(3) represent the 99.7% confidence bounds.

For some sketches where the error distribution is not Gaussian, special mathematical approximation methods are used. See Sketch Accuracy.

Quick Select TCF

The fundamental Theta Sketch QuickSelect algorithm is described in classic algorithm texts by Sedgewick and is the Theta Choosing Function (TCF) for the QuickSelect Sketches. When the internal hash table of the sketch reaches its internal refresh threshold, the quick select algorithm is used to select the (k+1)th order statistic from the hash table with a complexity of O(n). The value of the selected hash becomes the new Theta Long and immediately makes some number of entries in the table dirty. The rebuild() method is called that rebuilds the hash table removing the dirty values. Since the value of Theta Long is only changed when the hash table needs to be rebuilt, the values in the hash table are only ever dirty briefly during the rebuild process. Thus, all the values in the hash table are always valid during normal updating of the sketch.

One of the benefits of using the QuickSelect algorithm for the cache management of the sketch is that the number of valid hashes ranges from nominal entries to the current REBUILD_THRESHOLD, which is nominally 15/16 * cacheSize. This means that without the user forcing a rebuild(), the sketch, on average, may be about 50% larger than nominal entries, about 19% more accurate, and faster.

Resize Factor

For Theta Sketches, the Resize Factor is a dynamic, speed performance vs. memory size tradeoff. The sketches created on-heap and configured with a Resize Factor of > X1 start out with an internal hash table size that is the smallest submultiple of the the target Nominal Entries and larger than the minimum required hash table size for that sketch. When the sketch needs to be resized larger, then the Resize Factor is used as a multiplier of the current sketch cache array size.
"X1" means no resizing is allowed and the sketch will be intialized at full size.
"X2" means the internal cache will start very small and double in size until the target size is reached.
Similarly, "X4" is a factor of 4 and "X8 is a factor of 8.

Sampling Probability p

For Theta Sketches, the uniform random pre-sketching sampling probability. Depending on the specific sketch, the constructor data type is float or String. Incoming hashed data values are sampled by this probability factor before being submitted to the sketching algorithm. For example, if p were set to 0.25, then on average, only one forth of the incoming values, selected uniformly and at random, would be evaluated by the sketching algorithm to be retained by the sketch. Its default value is 1.0 (no sampling). Its value must be in the range: 0 < p ≤ 1.0.

This mode is particularly useful when merging large numbers of degenerate sketches.

Seed

For Theta Sketches, the long (64-bit) seed is required by the Update Hash Function. This seed value is intentionally not serialized along with this sketch in order to provide some security and protection against "dictionary attacks".

In order to provide some protection against accidental mixing of sketches that were generated with different seeds a short, 16-bit, Seed Hash is stored with the sketch image. When heapifying or wrapping an UpdateSketch image, which can be either a byte array or a Memory object, the user must provide the original seed either directly or indirectly by assuming the DEFAULT_UPDATE_SEED. The provided seed will be hashed and validated against the internal short Seed Hash and an error will be thrown if the seed hashes do not match. The Set Operations classes, Union, Intersection and AnotB also require the user to provide the seed either directly or indirectly.

An internal check will be made to make sure that the provided seed does not hash to a 16-bit value of zero. If it does produce zero, an error will be thrown and the user must provide a different seed value.
See also Default Update Seed.

Seed Hash

For Theta and Tuple Sketches, a 16-bit hash of the Update Hash Seed used internally to validate (1) that two sketches undergoing set operations were, in fact, created using matching Update Hash Seeds; or (2) that when deserializing or wrapping a sketch image that the caller has the correct seed.

Snow Plow Effect

When coordinated hash tables are merged and if the merging process does not update the target sketch with sufficient randomness, clustering in the target hash table can be greatly exaggerated causing poor speed performance for both updates and searches. This is called the "snowplow" effect because of the analogy of visualizing the clusters in a hash table as piles of snow that grow larger and larger. Since the size of the clusters are only represented by their width (not height like piles of snow), the clusters push themselves out horizontally and merge together as if they were pushed together with a snowplow.

Theta Choosing Function (TCF)

For Theta Sketches, the Theta Choosing Function (TCF) and the theory behind it is fully described in the Theta Sketch Framework paper.

Theta, θ

For Theta Sketches, refers to the mathematical random variable θ that represents the current probability that the next, non-duplicate unique input value presented to the sketch will change the state of the sketch. Given N uniquified inputs to a sketch configured to retain at most k hashes of the inputs, θ ≈ k/N, 0 < θ ≤ 1.0, and θ = (double) thetaLong / Long.MAX_VALUE;. See thetaLong.

thetaLong

For Theta Sketches, the 64-bit, positive long equivalent of theta where
0 < thetaLongLong.MAX_VALUE, and thetaLong = θ * Long.MAX_VALUE.

Theta Sketch Framework

This framework enables sketches with different algorithms and Theta Choosing Functions to be arguments to the Union, Intersection and AnotB Set Operations. This framework also enables the sketches to share estimation, upper and lower bounds algorithms and a common serialization data structure. The Theta Sketch Framework, Theta Choosing Functions and the theory behind them is fully described in the Theta Sketch Framework paper.

Update Return State

For Theta Sketches, this provides useful detail for sketch characterization and debugging. It is not required that any of these values be monitored during normal operation. The UpdateReturnState is defined as follows:

Valid Hash

For Theta Sketches, a retained hash value is considered valid if it is greater than zero and less than thetaLong. See Dirty Hash.