Class UniqueCountMap

java.lang.Object
org.apache.datasketches.hllmap.UniqueCountMap

public final class UniqueCountMap extends Object
This is a real-time, key-value HLL mapping sketch that tracks approximate unique counts of identifiers (the values) associated with each key. An example might be tracking the number of unique user identifiers associated with each IP address. This map has been specifically designed for the use-case where the number of keys is quite large (many millions) and the distribution of identifiers per key is very skewed. A typical distribution where this works well is a power-law distribution of identifiers per key of the form y = Cx, where α < 0.5, and C is roughly ymax. For example, with 100M keys, over 75% of the keys would have only one identifier, 99% of the keys would have less than 20 identifiers, 99.9% would have less than 200 identifiers, and a very tiny fraction might have identifiers in the thousands.

The space consumed by this map is quite sensitive to the actual distribution of identifiers per key, so you should characterize and or experiment with your typical input streams. Nonetheless, our experiments on live streams of over 100M keys required about 1.4GB of space. This is about 14 bytes per key for key storage and unique count storage.

Given such highly-skewed distributions, using this map is far more efficient space-wise than the alternative of dedicating an HLL sketch per key. Based on our use cases, after subtracting the space required for key storage, the average bytes per key required for unique count estimation (getAverageSketchMemoryPerKey()) is about 10.

Internally, this map is implemented as a hierarchy of internal hash maps with progressively increasing storage allocated for unique count estimation. As a key acquires more identifiers it is "promoted" up to a higher internal map. The final map of keys is a map of compact HLL sketches.

The unique values in all the internal maps, except the final HLL map, are stored in a special form called a coupon. A coupon is a 16-bit value that fully describes a k=1024 HLL bin. It contains 10 bits of address and a 6-bit HLL value.

All internal maps use a prime number size and Knuth's Open Addressing Double Hash (OADH) search algorithm.

The internal base map holds all the keys and each key is associated with one 16-bit value. Initially, the value is a single coupon. Once the key is promoted, this 16-bit field contains a reference to the internal map where the key is still active.

The intermediate maps between the base map and the final HLL map are of two types. The first few of these are called traverse maps where the coupons are stored as unsorted arrays. After the traverse maps are the coupon hash maps, where the coupons are stored in small OASH hash tables.

All the intermediate maps support deletes and can dynamically grow and shrink as required by the input stream.

The sketch estimator algorithms are unbiased with a Relative Standard Error (RSE) of about 2.6% with 68% confidence, or equivalently, about 5.2% with a 95% confidence.

In a parallel package in the sketches-misc repository, there are 2 classes that can be used from the command line to feed this mapping sketch piped from standard-in for experimental evaluation. The first is ProcessIpStream, which processes simple IP/ID pairs and the second, ProcessDistributionStream, which processes pairs that describe a distribution. In this same package is the VariousMapRSETest class that was used to generate the error plots for the web site. Please refer to the javadocs for those classes for more information.

Author:
Lee Rhodes, Alexander Saydakov, Kevin Lang
  • Constructor Summary

    Constructors
    Constructor
    Description
    UniqueCountMap(int keySizeBytes)
    Constructs a UniqueCountMap with an initial capacity of one million entries.
    UniqueCountMap(int initialNumEntries, int keySizeBytes)
    Constructs a UniqueCountMap with a given initial number of entries.
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    Returns the number of active, unique keys across all internal maps
    double
    Returns the average memory storage per key that is dedicated to sketching the unique counts.
    double
    getEstimate(byte[] key)
    Retrieves the current estimate of unique count for a given key.
    long
    Returns total bytes used for key storage
    double
    getLowerBound(byte[] key)
    Returns the lower bound cardinality with respect to getEstimate(byte[]) associated with the given key.
    long
    Returns total bytes used by all internal maps
    double
    getUpperBound(byte[] key)
    Returns the upper bound cardinality with respect to getEstimate(byte[]) associated with the given key.
    Returns a string with a human-readable summary of the UniqueCountMap and all the internal maps
    double
    update(byte[] key, byte[] identifier)
    Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • UniqueCountMap

      public UniqueCountMap(int keySizeBytes)
      Constructs a UniqueCountMap with an initial capacity of one million entries.
      Parameters:
      keySizeBytes - must be at least 4 bytes to have sufficient entropy.
    • UniqueCountMap

      public UniqueCountMap(int initialNumEntries, int keySizeBytes)
      Constructs a UniqueCountMap with a given initial number of entries.
      Parameters:
      initialNumEntries - The initial number of entries provides a tradeoff between wasted space, if too high, and wasted time resizing the table, if too low.
      keySizeBytes - must be at least 4 bytes to have sufficient entropy
  • Method Details

    • update

      public double update(byte[] key, byte[] identifier)
      Updates the map with a given key and identifier and returns the estimate of the number of unique identifiers encountered so far for the given key.
      Parameters:
      key - the given key
      identifier - the given identifier for unique counting associated with the key
      Returns:
      the estimate of the number of unique identifiers encountered so far for the given key.
    • getEstimate

      public double getEstimate(byte[] key)
      Retrieves the current estimate of unique count for a given key.
      Parameters:
      key - given key
      Returns:
      estimate of unique count so far
    • getUpperBound

      public double getUpperBound(byte[] key)
      Returns the upper bound cardinality with respect to getEstimate(byte[]) associated with the given key.
      Parameters:
      key - the given key
      Returns:
      the upper bound cardinality with respect to getEstimate(byte[]) associated with the given key.
    • getLowerBound

      public double getLowerBound(byte[] key)
      Returns the lower bound cardinality with respect to getEstimate(byte[]) associated with the given key.
      Parameters:
      key - the given key
      Returns:
      the lower bound cardinality with respect to getEstimate(byte[]) associated with the given key.
    • getActiveEntries

      public int getActiveEntries()
      Returns the number of active, unique keys across all internal maps
      Returns:
      the number of active, unique keys across all internal maps
    • getMemoryUsageBytes

      public long getMemoryUsageBytes()
      Returns total bytes used by all internal maps
      Returns:
      total bytes used by all internal maps
    • getKeyMemoryUsageBytes

      public long getKeyMemoryUsageBytes()
      Returns total bytes used for key storage
      Returns:
      total bytes used for key storage
    • getAverageSketchMemoryPerKey

      public double getAverageSketchMemoryPerKey()
      Returns the average memory storage per key that is dedicated to sketching the unique counts.
      Returns:
      the average memory storage per key that is dedicated to sketching the unique counts.
    • toString

      public String toString()
      Returns a string with a human-readable summary of the UniqueCountMap and all the internal maps
      Overrides:
      toString in class Object
      Returns:
      human-readable summary