Class LongsSketch


  • public class LongsSketch
    extends Object

    This sketch is useful for tracking approximate frequencies of long items with optional associated counts (long item, long count) that are members of a multiset of such items. The true frequency of an item is defined to be the sum of associated counts.

    This implementation provides the following capabilities:

    • Estimate the frequency of an item.
    • Return upper and lower bounds of any item, such that the true frequency is always between the upper and lower bounds.
    • Return a global maximum error that holds for all items in the stream.
    • Return an array of frequent items that qualify either a NO_FALSE_POSITIVES or a NO_FALSE_NEGATIVES error type.
    • Merge itself with another sketch object created from this class.
    • Serialize/Deserialize to/from a String or byte array.

    Space Usage

    The sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (long item, long count). The maxMapSize must be a power of 2.

    The hash map starts at a very small size (8 entries), and grows as needed up to the specified maxMapSize.

    At any moment the internal memory space usage of this sketch is 18 * mapSize bytes, plus a small constant number of additional bytes. The maximum internal memory space usage of this sketch will never exceed 18 * maxMapSize bytes, plus a small constant number of additional bytes.

    Maximum Capacity of the Sketch

    The LOAD_FACTOR for the hash map is internally set at 75%, which means at any time the map capacity of (item, count) pairs is mapCap = 0.75 * mapSize. The maximum capacity of (item, count) pairs of the sketch is maxMapCap = 0.75 * maxMapSize.

    Updating the sketch with (item, count) pairs

    If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count, otherwise, a new counter "(item, count) pair" is created. If the number of tracked counters reaches the maximum capacity of the hash map the sketch decrements all of the counters (by an approximately computed median), and removes any non-positive counters.

    Accuracy

    If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact.

    The logic of the frequent items sketch is such that the stored counts and true counts are never too different. More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).

    For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item. Specifically, (UB- LB) ≤ W * epsilon, where W denotes the sum of all item counts, and epsilon = 3.5/M, where M is the maxMapSize.

    This is a worst case guarantee that applies to arbitrary inputs.1 For inputs typically seen in practice (UB-LB) is usually much smaller.

    Background

    This code implements a variant of what is commonly known as the "Misra-Gries algorithm". Variants of it were discovered and rediscovered and redesigned several times over the years:

    • "Finding repeated elements", Misra, Gries, 1982
    • "Frequency estimation of Internet packet streams with limited space" Demaine, Lopez-Ortiz, Munro, 2002
    • "A simple algorithm for finding frequent elements in streams and bags" Karp, Shenker, Papadimitriou, 2003
    • "Efficient Computation of Frequent and Top-k Elements in Data Streams" Metwally, Agrawal, Abbadi, 2006
    1 For speed we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run. However, we have ensured that this probability is extremely small. For example, if the stream causes one table purge (rebuild), our proof of the worst case bound applies with probability at least 1 - 1E-14. If the stream causes 1E9 purges, our proof applies with probability at least 1 - 1E-5.
    Author:
    Justin Thaler, Lee Rhodes
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  LongsSketch.Row
      Row class that defines the return values from a getFrequentItems query.
    • Constructor Summary

      Constructors 
      Constructor Description
      LongsSketch​(int maxMapSize)
      Construct this sketch with the parameter maxMapSize and the default initialMapSize (8).
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static double getAprioriError​(int maxMapSize, long estimatedTotalStreamWeight)
      Returns the estimated a priori error given the maxMapSize for the sketch and the estimatedTotalStreamWeight.
      int getCurrentMapCapacity()
      Returns the current number of counters the sketch is configured to support.
      static double getEpsilon​(int maxMapSize)
      Returns epsilon used to compute a priori error.
      long getEstimate​(long item)
      Gets the estimate of the frequency of the given item.
      LongsSketch.Row[] getFrequentItems​(long threshold, ErrorType errorType)
      Returns an array of Rows that include frequent items, estimates, upper and lower bounds given a threshold and an ErrorCondition.
      LongsSketch.Row[] getFrequentItems​(ErrorType errorType)
      Returns an array of Rows that include frequent items, estimates, upper and lower bounds given an ErrorCondition and the default threshold.
      static LongsSketch getInstance​(String string)
      Returns a sketch instance of this class from the given String, which must be a String representation of this sketch class.
      static LongsSketch getInstance​(org.apache.datasketches.memory.Memory srcMem)
      Returns a sketch instance of this class from the given srcMem, which must be a Memory representation of this sketch class.
      long getLowerBound​(long item)
      Gets the guaranteed lower bound frequency of the given item, which can never be negative.
      long getMaximumError()  
      int getMaximumMapCapacity()
      Returns the maximum number of counters the sketch is configured to support.
      int getNumActiveItems()  
      int getStorageBytes()
      Returns the number of bytes required to store this sketch as an array of bytes.
      long getStreamLength()
      Returns the sum of the frequencies (weights or counts) in the stream seen so far by the sketch
      long getUpperBound​(long item)
      Gets the guaranteed upper bound frequency of the given item.
      boolean isEmpty()
      Returns true if this sketch is empty
      LongsSketch merge​(LongsSketch other)
      This function merges the other sketch into this one.
      void reset()
      Resets this sketch to a virgin state.
      String serializeToString()
      Returns a String representation of this sketch
      byte[] toByteArray()
      Returns a byte array representation of this sketch
      String toString()
      Returns a human readable summary of this sketch.
      static String toString​(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of a LongsSketch.
      static String toString​(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of a LongsSketch.
      void update​(long item)
      Update this sketch with an item and a frequency count of one.
      void update​(long item, long count)
      Update this sketch with a item and a positive frequency count (or weight).
    • Constructor Detail

      • LongsSketch

        public LongsSketch​(int maxMapSize)
        Construct this sketch with the parameter maxMapSize and the default initialMapSize (8).
        Parameters:
        maxMapSize - Determines the physical size of the internal hash map managed by this sketch and must be a power of 2. The maximum capacity of this internal hash map is 0.75 times * maxMapSize. Both the ultimate accuracy and size of this sketch are a function of maxMapSize.
    • Method Detail

      • getInstance

        public static LongsSketch getInstance​(org.apache.datasketches.memory.Memory srcMem)
        Returns a sketch instance of this class from the given srcMem, which must be a Memory representation of this sketch class.
        Parameters:
        srcMem - a Memory representation of a sketch of this class. See Memory
        Returns:
        a sketch instance of this class.
      • getInstance

        public static LongsSketch getInstance​(String string)
        Returns a sketch instance of this class from the given String, which must be a String representation of this sketch class.
        Parameters:
        string - a String representation of a sketch of this class.
        Returns:
        a sketch instance of this class.
      • getAprioriError

        public static double getAprioriError​(int maxMapSize,
                                             long estimatedTotalStreamWeight)
        Returns the estimated a priori error given the maxMapSize for the sketch and the estimatedTotalStreamWeight.
        Parameters:
        maxMapSize - the planned map size to be used when constructing this sketch.
        estimatedTotalStreamWeight - the estimated total stream weight.
        Returns:
        the estimated a priori error.
      • getCurrentMapCapacity

        public int getCurrentMapCapacity()
        Returns the current number of counters the sketch is configured to support.
        Returns:
        the current number of counters the sketch is configured to support.
      • getEpsilon

        public static double getEpsilon​(int maxMapSize)
        Returns epsilon used to compute a priori error. This is just the value 3.5 / maxMapSize.
        Parameters:
        maxMapSize - the planned map size to be used when constructing this sketch.
        Returns:
        epsilon used to compute a priori error.
      • getEstimate

        public long getEstimate​(long item)
        Gets the estimate of the frequency of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.
        Parameters:
        item - the given item
        Returns:
        the estimate of the frequency of the given item
      • getLowerBound

        public long getLowerBound​(long item)
        Gets the guaranteed lower bound frequency of the given item, which can never be negative.
        Parameters:
        item - the given item.
        Returns:
        the guaranteed lower bound frequency of the given item. That is, a number which is guaranteed to be no larger than the real frequency.
      • getFrequentItems

        public LongsSketch.Row[] getFrequentItems​(long threshold,
                                                  ErrorType errorType)
        Returns an array of Rows that include frequent items, estimates, upper and lower bounds given a threshold and an ErrorCondition. If the threshold is lower than getMaximumError(), then getMaximumError() will be used instead.

        The method first examines all active items in the sketch (items that have a counter).

        If ErrorType = NO_FALSE_NEGATIVES, this will include an item in the result list if getUpperBound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).

        If ErrorType = NO_FALSE_POSITIVES, this will include an item in the result list if getLowerBound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives). This is a subset of the NO_FALSE_NEGATIVES case.

        Parameters:
        threshold - to include items in the result list
        errorType - determines whether no false positives or no false negatives are desired.
        Returns:
        an array of frequent items
      • getFrequentItems

        public LongsSketch.Row[] getFrequentItems​(ErrorType errorType)
        Returns an array of Rows that include frequent items, estimates, upper and lower bounds given an ErrorCondition and the default threshold. This is the same as getFrequentItems(getMaximumError(), errorType)
        Parameters:
        errorType - determines whether no false positives or no false negatives are desired.
        Returns:
        an array of frequent items
      • getMaximumError

        public long getMaximumError()
        Returns:
        An upper bound on the maximum error of getEstimate(item) for any item. This is equivalent to the maximum distance between the upper bound and the lower bound for any item.
      • getMaximumMapCapacity

        public int getMaximumMapCapacity()
        Returns the maximum number of counters the sketch is configured to support.
        Returns:
        the maximum number of counters the sketch is configured to support.
      • getNumActiveItems

        public int getNumActiveItems()
        Returns:
        the number of active items in the sketch.
      • getStorageBytes

        public int getStorageBytes()
        Returns the number of bytes required to store this sketch as an array of bytes.
        Returns:
        the number of bytes required to store this sketch as an array of bytes.
      • getStreamLength

        public long getStreamLength()
        Returns the sum of the frequencies (weights or counts) in the stream seen so far by the sketch
        Returns:
        the sum of the frequencies in the stream seen so far by the sketch
      • getUpperBound

        public long getUpperBound​(long item)
        Gets the guaranteed upper bound frequency of the given item.
        Parameters:
        item - the given item
        Returns:
        the guaranteed upper bound frequency of the given item. That is, a number which is guaranteed to be no smaller than the real frequency.
      • isEmpty

        public boolean isEmpty()
        Returns true if this sketch is empty
        Returns:
        true if this sketch is empty
      • merge

        public LongsSketch merge​(LongsSketch other)
        This function merges the other sketch into this one. The other sketch may be of a different size.
        Parameters:
        other - sketch of this class
        Returns:
        a sketch whose estimates are within the guarantees of the largest error tolerance of the two merged sketches.
      • reset

        public void reset()
        Resets this sketch to a virgin state.
      • serializeToString

        public String serializeToString()
        Returns a String representation of this sketch
        Returns:
        a String representation of this sketch
      • toByteArray

        public byte[] toByteArray()
        Returns a byte array representation of this sketch
        Returns:
        a byte array representation of this sketch
      • toString

        public String toString()
        Returns a human readable summary of this sketch.
        Overrides:
        toString in class Object
        Returns:
        a human readable summary of this sketch.
      • toString

        public static String toString​(byte[] byteArr)
        Returns a human readable string of the preamble of a byte array image of a LongsSketch.
        Parameters:
        byteArr - the given byte array
        Returns:
        a human readable string of the preamble of a byte array image of a LongsSketch.
      • toString

        public static String toString​(org.apache.datasketches.memory.Memory mem)
        Returns a human readable string of the preamble of a Memory image of a LongsSketch.
        Parameters:
        mem - the given Memory object
        Returns:
        a human readable string of the preamble of a Memory image of a LongsSketch.
      • update

        public void update​(long item)
        Update this sketch with an item and a frequency count of one.
        Parameters:
        item - for which the frequency should be increased.
      • update

        public void update​(long item,
                           long count)
        Update this sketch with a item and a positive frequency count (or weight).
        Parameters:
        item - for which the frequency should be increased. The item can be any long value and is only used by the sketch to determine uniqueness.
        count - the amount by which the frequency of the item should be increased. An count of zero is a no-op, and a negative count will throw an exception.