Class UniqueCountMap
 java.lang.Object

 org.apache.datasketches.hllmap.UniqueCountMap

public final class UniqueCountMap extends Object
This is a realtime, keyvalue HLL mapping sketch that tracks approximate unique counts of identifiers (the values) associated with each key. An example might be tracking the number of unique user identifiers associated with each IP address. This map has been specifically designed for the usecase where the number of keys is quite large (many millions) and the distribution of identifiers per key is very skewed. A typical distribution where this works well is a powerlaw distribution of identifiers per key of the form y = Cx^{α}, where α < 0.5, and C is roughly y_{max}. For example, with 100M keys, over 75% of the keys would have only one identifier, 99% of the keys would have less than 20 identifiers, 99.9% would have less than 200 identifiers, and a very tiny fraction might have identifiers in the thousands.The space consumed by this map is quite sensitive to the actual distribution of identifiers per key, so you should characterize and or experiment with your typical input streams. Nonetheless, our experiments on live streams of over 100M keys required about 1.4GB of space. This is about 14 bytes per key for key storage and unique count storage.
Given such highlyskewed distributions, using this map is far more efficient spacewise than the alternative of dedicating an HLL sketch per key. Based on our use cases, after subtracting the space required for key storage, the average bytes per key required for unique count estimation (
getAverageSketchMemoryPerKey()
) is about 10.Internally, this map is implemented as a hierarchy of internal hash maps with progressively increasing storage allocated for unique count estimation. As a key acquires more identifiers it is "promoted" up to a higher internal map. The final map of keys is a map of compact HLL sketches.
The unique values in all the internal maps, except the final HLL map, are stored in a special form called a coupon. A coupon is a 16bit value that fully describes a k=1024 HLL bin. It contains 10 bits of address and a 6bit HLL value.
All internal maps use a prime number size and Knuth's Open Addressing Double Hash (OADH) search algorithm.
The internal base map holds all the keys and each key is associated with one 16bit value. Initially, the value is a single coupon. Once the key is promoted, this 16bit field contains a reference to the internal map where the key is still active.
The intermediate maps between the base map and the final HLL map are of two types. The first few of these are called traverse maps where the coupons are stored as unsorted arrays. After the traverse maps are the coupon hash maps, where the coupons are stored in small OASH hash tables.
All the intermediate maps support deletes and can dynamically grow and shrink as required by the input stream.
The sketch estimator algorithms are unbiased with a Relative Standard Error (RSE) of about 2.6% with 68% confidence, or equivalently, about 5.2% with a 95% confidence.
In a parallel package in the sketchesmisc repository, there are 2 classes that can be used from the command line to feed this mapping sketch piped from standardin for experimental evaluation. The first is ProcessIpStream, which processes simple IP/ID pairs and the second, ProcessDistributionStream, which processes pairs that describe a distribution. In this same package is the VariousMapRSETest class that was used to generate the error plots for the web site. Please refer to the javadocs for those classes for more information.
 Author:
 Lee Rhodes, Alexander Saydakov, Kevin Lang


Constructor Summary
Constructors Constructor Description UniqueCountMap(int keySizeBytes)
Constructs a UniqueCountMap with an initial capacity of one million entries.UniqueCountMap(int initialNumEntries, int keySizeBytes)
Constructs a UniqueCountMap with a given initial number of entries.

Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description int
getActiveEntries()
Returns the number of active, unique keys across all internal mapsdouble
getAverageSketchMemoryPerKey()
Returns the average memory storage per key that is dedicated to sketching the unique counts.double
getEstimate(byte[] key)
Retrieves the current estimate of unique count for a given key.long
getKeyMemoryUsageBytes()
Returns total bytes used for key storagedouble
getLowerBound(byte[] key)
Returns the lower bound cardinality with respect togetEstimate(byte[])
associated with the given key.long
getMemoryUsageBytes()
Returns total bytes used by all internal mapsdouble
getUpperBound(byte[] key)
Returns the upper bound cardinality with respect togetEstimate(byte[])
associated with the given key.String
toString()
Returns a string with a humanreadable summary of the UniqueCountMap and all the internal mapsdouble
update(byte[] key, byte[] identifier)
Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key.



Constructor Detail

UniqueCountMap
public UniqueCountMap(int keySizeBytes)
Constructs a UniqueCountMap with an initial capacity of one million entries. Parameters:
keySizeBytes
 must be at least 4 bytes to have sufficient entropy.

UniqueCountMap
public UniqueCountMap(int initialNumEntries, int keySizeBytes)
Constructs a UniqueCountMap with a given initial number of entries. Parameters:
initialNumEntries
 The initial number of entries provides a tradeoff between wasted space, if too high, and wasted time resizing the table, if too low.keySizeBytes
 must be at least 4 bytes to have sufficient entropy


Method Detail

update
public double update(byte[] key, byte[] identifier)
Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key. Parameters:
key
 the given keyidentifier
 the given identifier for unique counting associated with the key Returns:
 the estimate of the number of unique identifiers encountered so far for the given key.

getEstimate
public double getEstimate(byte[] key)
Retrieves the current estimate of unique count for a given key. Parameters:
key
 given key Returns:
 estimate of unique count so far

getUpperBound
public double getUpperBound(byte[] key)
Returns the upper bound cardinality with respect togetEstimate(byte[])
associated with the given key. Parameters:
key
 the given key Returns:
 the upper bound cardinality with respect to
getEstimate(byte[])
associated with the given key.

getLowerBound
public double getLowerBound(byte[] key)
Returns the lower bound cardinality with respect togetEstimate(byte[])
associated with the given key. Parameters:
key
 the given key Returns:
 the lower bound cardinality with respect to
getEstimate(byte[])
associated with the given key.

getActiveEntries
public int getActiveEntries()
Returns the number of active, unique keys across all internal maps Returns:
 the number of active, unique keys across all internal maps

getMemoryUsageBytes
public long getMemoryUsageBytes()
Returns total bytes used by all internal maps Returns:
 total bytes used by all internal maps

getKeyMemoryUsageBytes
public long getKeyMemoryUsageBytes()
Returns total bytes used for key storage Returns:
 total bytes used for key storage

getAverageSketchMemoryPerKey
public double getAverageSketchMemoryPerKey()
Returns the average memory storage per key that is dedicated to sketching the unique counts. Returns:
 the average memory storage per key that is dedicated to sketching the unique counts.

