Class CountMinSketch

java.lang.Object
org.apache.datasketches.count.CountMinSketch

public class CountMinSketch extends Object
Java implementation of the CountMin sketch data structure of Cormode and Muthukrishnan. This implementation is inspired by and compatible with the datasketches-cpp version by Charlie Dickens. The CountMin sketch is a probabilistic data structure that provides frequency estimates for items in a data stream. It uses multiple hash functions to distribute items across a two-dimensional array, providing approximate counts with configurable error bounds. Reference: http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
  • Method Summary

    Modifier and Type
    Method
    Description
    deserialize(byte[] b, long seed)
    Deserializes a CountMinSketch from the provided byte array.
    long
    getEstimate(byte[] item)
    Returns the estimated frequency for the given item.
    long
    getEstimate(long item)
    Returns the estimated frequency for the given item.
    long
    Returns the estimated frequency for the given item.
    long
    getLowerBound(byte[] item)
    Returns the lower bound of the estimated frequency for the given item.
    long
    getLowerBound(long item)
    Returns the lower bound of the estimated frequency for the given item.
    long
    Returns the lower bound of the estimated frequency for the given item.
    int
    Returns the number of buckets per hash function.
    byte
    Returns the number of hash functions used in this sketch.
    double
    Returns the relative error of the sketch.
    long
    Returns the hash seed used by this sketch.
    long
    Returns the total weight of all items inserted into the sketch.
    long
    getUpperBound(byte[] item)
    Returns the upper bound of the estimated frequency for the given item.
    long
    getUpperBound(long item)
    Returns the upper bound of the estimated frequency for the given item.
    long
    Returns the upper bound of the estimated frequency for the given item.
    boolean
    Checks if the CountMinSketch has processed any items.
    void
    Merges another CountMinSketch into this one.
    static int
    suggestNumBuckets(double relativeError)
    Suggests an appropriate number of buckets per hash function for a given relative error.
    static byte
    suggestNumHashes(double confidence)
    Suggests an appropriate number of hash functions to use for a given confidence level.
    byte[]
    Returns the sketch as a byte array.
    void
    update(byte[] item, long weight)
    Updates the sketch with the provided item and weight.
    void
    update(long item, long weight)
    Updates the sketch with the provided item and weight.
    void
    update(String item, long weight)
    Updates the sketch with the provided item and weight.

    Methods inherited from class Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • isEmpty

      public boolean isEmpty()
      Checks if the CountMinSketch has processed any items.
      Returns:
      True if the sketch is empty, otherwise false.
    • getNumHashes_

      public byte getNumHashes_()
      Returns the number of hash functions used in this sketch.
      Returns:
      The number of hash functions.
    • getNumBuckets_

      public int getNumBuckets_()
      Returns the number of buckets per hash function.
      Returns:
      The number of buckets.
    • getSeed_

      public long getSeed_()
      Returns the hash seed used by this sketch.
      Returns:
      The seed value.
    • getTotalWeight_

      public long getTotalWeight_()
      Returns the total weight of all items inserted into the sketch.
      Returns:
      The total weight.
    • getRelativeError

      public double getRelativeError()
      Returns the relative error of the sketch.
      Returns:
      The relative error.
    • suggestNumHashes

      public static byte suggestNumHashes(double confidence)
      Suggests an appropriate number of hash functions to use for a given confidence level.
      Parameters:
      confidence - The desired confidence level between 0 and 1.
      Returns:
      Suggested number of hash functions.
    • suggestNumBuckets

      public static int suggestNumBuckets(double relativeError)
      Suggests an appropriate number of buckets per hash function for a given relative error.
      Parameters:
      relativeError - The desired relative error.
      Returns:
      Suggested number of buckets.
    • update

      public void update(long item, long weight)
      Updates the sketch with the provided item and weight.
      Parameters:
      item - The item to update.
      weight - The weight of the item.
    • update

      public void update(String item, long weight)
      Updates the sketch with the provided item and weight.
      Parameters:
      item - The item to update.
      weight - The weight of the item.
    • update

      public void update(byte[] item, long weight)
      Updates the sketch with the provided item and weight.
      Parameters:
      item - The item to update.
      weight - The weight of the item.
    • getEstimate

      public long getEstimate(long item)
      Returns the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Estimated frequency.
    • getEstimate

      public long getEstimate(String item)
      Returns the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Estimated frequency.
    • getEstimate

      public long getEstimate(byte[] item)
      Returns the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Estimated frequency.
    • getUpperBound

      public long getUpperBound(long item)
      Returns the upper bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Upper bound of estimated frequency.
    • getUpperBound

      public long getUpperBound(String item)
      Returns the upper bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Upper bound of estimated frequency.
    • getUpperBound

      public long getUpperBound(byte[] item)
      Returns the upper bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Upper bound of estimated frequency.
    • getLowerBound

      public long getLowerBound(long item)
      Returns the lower bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Lower bound of estimated frequency.
    • getLowerBound

      public long getLowerBound(String item)
      Returns the lower bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Lower bound of estimated frequency.
    • getLowerBound

      public long getLowerBound(byte[] item)
      Returns the lower bound of the estimated frequency for the given item.
      Parameters:
      item - The item to estimate.
      Returns:
      Lower bound of estimated frequency.
    • merge

      public void merge(CountMinSketch other)
      Merges another CountMinSketch into this one. The sketches must have the same configuration.
      Parameters:
      other - The other sketch to merge.
    • toByteArray

      public byte[] toByteArray()
      Returns the sketch as a byte array.
      Returns:
      the result byte array
    • deserialize

      public static CountMinSketch deserialize(byte[] b, long seed)
      Deserializes a CountMinSketch from the provided byte array.
      Parameters:
      b - The byte array containing the serialized sketch.
      seed - The seed used during serialization.
      Returns:
      The deserialized CountMinSketch.