Class UniqueCountMap
- java.lang.Object
-
- org.apache.datasketches.hllmap.UniqueCountMap
-
public class UniqueCountMap extends Object
This is a real-time, key-value HLL mapping sketch that tracks approximate unique counts of identifiers (the values) associated with each key. An example might be tracking the number of unique user identifiers associated with each IP address. This map has been specifically designed for the use-case where the number of keys is quite large (many millions) and the distribution of identifiers per key is very skewed. A typical distribution where this works well is a power-law distribution of identifiers per key of the form y = Cx-α, where α < 0.5, and C is roughly ymax. For example, with 100M keys, over 75% of the keys would have only one identifier, 99% of the keys would have less than 20 identifiers, 99.9% would have less than 200 identifiers, and a very tiny fraction might have identifiers in the thousands.The space consumed by this map is quite sensitive to the actual distribution of identifiers per key, so you should characterize and or experiment with your typical input streams. Nonetheless, our experiments on live streams of over 100M keys required about 1.4GB of space.
Given such highly-skewed distributions, using this map is far more efficient space-wise than the alternative of dedicating an HLL sketch per key. Based on our use cases, after subtracting the space required for key storage, the average bytes per key required for unique count estimation (
getAverageSketchMemoryPerKey()
) is about 10.Internally, this map is implemented as a hierarchy of internal hash maps with progressively increasing storage allocated for unique count estimation. As a key acquires more identifiers it is "promoted" up to a higher internal map. The final map of keys is a map of compact HLL sketches.
The unique values in all the internal maps, except the final HLL map, are stored in a special form called a coupon. A coupon is a 16-bit value that fully describes a k=1024 HLL bin. It contains 10 bits of address and a 6-bit HLL value.
All internal maps use a prime number size and Knuth's Open Addressing Double Hash (OADH) search algorithm.
The internal base map holds all the keys and each key is associated with one 16-bit value. Initially, the value is a single coupon. Once the key is promoted, this 16-bit field contains a reference to the internal map where the key is still active.
The intermediate maps between the base map and the final HLL map are of two types. The first few of these are called traverse maps where the coupons are stored as unsorted arrays. After the traverse maps are the coupon hash maps, where the coupons are stored in small OASH hash tables.
All the intermediate maps support deletes and can dynamically grow and shrink as required by the input stream.
The sketch estimator algorithms are unbiased with a Relative Standard Error (RSE) of about 2.6% with 68% confidence, or equivalently, about 5.2% with a 95% confidence.
In a parallel package in the sketches-misc repository, there are 2 classes that can be used from the command line to feed this mapping sketch piped from standard-in for experimental evaluation. The first is ProcessIpStream, which processes simple IP/ID pairs and the second, ProcessDistributionStream, which processes pairs that describe a distribution. In this same package is the VariousMapRSETest class that was used to generate the error plots for the web site. Please refer to the javadocs for those classes for more information.
- Author:
- Lee Rhodes, Alexander Saydakov, Kevin Lang
-
-
Constructor Summary
Constructors Constructor Description UniqueCountMap(int keySizeBytes)
Constructs a UniqueCountMap with an initial capacity of one million entries.UniqueCountMap(int initialNumEntries, int keySizeBytes)
Constructs a UniqueCountMap with a given initial number of entries.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description int
getActiveEntries()
Returns the number of active, unique keys across all internal mapsdouble
getAverageSketchMemoryPerKey()
Returns the average memory storage per key that is dedicated to sketching the unique counts.double
getEstimate(byte[] key)
Retrieves the current estimate of unique count for a given key.long
getKeyMemoryUsageBytes()
Returns total bytes used for key storagedouble
getLowerBound(byte[] key)
Returns the lower bound cardinality with respect togetEstimate(byte[])
associated with the given key.long
getMemoryUsageBytes()
Returns total bytes used by all internal mapsdouble
getUpperBound(byte[] key)
Returns the upper bound cardinality with respect togetEstimate(byte[])
associated with the given key.String
toString()
Returns a string with a human-readable summary of the UniqueCountMap and all the internal mapsdouble
update(byte[] key, byte[] identifier)
Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key.
-
-
-
Constructor Detail
-
UniqueCountMap
public UniqueCountMap(int keySizeBytes)
Constructs a UniqueCountMap with an initial capacity of one million entries.- Parameters:
keySizeBytes
- must be at least 4 bytes to have sufficient entropy.
-
UniqueCountMap
public UniqueCountMap(int initialNumEntries, int keySizeBytes)
Constructs a UniqueCountMap with a given initial number of entries.- Parameters:
initialNumEntries
- The initial number of entries provides a tradeoff between wasted space, if too high, and wasted time resizing the table, if too low.keySizeBytes
- must be at least 4 bytes to have sufficient entropy
-
-
Method Detail
-
update
public double update(byte[] key, byte[] identifier)
Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key.- Parameters:
key
- the given keyidentifier
- the given identifier for unique counting associated with the key- Returns:
- the estimate of the number of unique identifiers encountered so far for the given key.
-
getEstimate
public double getEstimate(byte[] key)
Retrieves the current estimate of unique count for a given key.- Parameters:
key
- given key- Returns:
- estimate of unique count so far
-
getUpperBound
public double getUpperBound(byte[] key)
Returns the upper bound cardinality with respect togetEstimate(byte[])
associated with the given key.- Parameters:
key
- the given key- Returns:
- the upper bound cardinality with respect to
getEstimate(byte[])
associated with the given key.
-
getLowerBound
public double getLowerBound(byte[] key)
Returns the lower bound cardinality with respect togetEstimate(byte[])
associated with the given key.- Parameters:
key
- the given key- Returns:
- the lower bound cardinality with respect to
getEstimate(byte[])
associated with the given key.
-
getActiveEntries
public int getActiveEntries()
Returns the number of active, unique keys across all internal maps- Returns:
- the number of active, unique keys across all internal maps
-
getMemoryUsageBytes
public long getMemoryUsageBytes()
Returns total bytes used by all internal maps- Returns:
- total bytes used by all internal maps
-
getKeyMemoryUsageBytes
public long getKeyMemoryUsageBytes()
Returns total bytes used for key storage- Returns:
- total bytes used for key storage
-
getAverageSketchMemoryPerKey
public double getAverageSketchMemoryPerKey()
Returns the average memory storage per key that is dedicated to sketching the unique counts.- Returns:
- the average memory storage per key that is dedicated to sketching the unique counts.
-
-