Class Sketch

  • All Implemented Interfaces:
    MemoryStatus
    Direct Known Subclasses:
    CompactSketch, UpdateSketch

    public abstract class Sketch
    extends Object
    implements MemoryStatus
    The top-level class for all theta sketches. This class is never constructed directly. Use the UpdateSketch.builder() methods to create UpdateSketches.
    Author:
    Lee Rhodes
    • Method Summary

      Modifier and Type Method Description
      CompactSketch compact()
      Converts this sketch to an ordered CompactSketch.
      abstract CompactSketch compact​(boolean dstOrdered, org.apache.datasketches.memory.WritableMemory dstMem)
      Convert this sketch to a CompactSketch.
      abstract int getCompactBytes()
      Returns the number of storage bytes required for this Sketch if its current state were compacted.
      static int getCompactSketchMaxBytes​(int lgNomEntries)
      Returns the maximum number of storage bytes required for a CompactSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
      int getCountLessThanThetaLong​(long thetaLong)
      Gets the number of hash values less than the given theta expressed as a long.
      abstract int getCurrentBytes()
      Returns the number of storage bytes required for this sketch in its current state.
      abstract double getEstimate()
      Gets the unique count estimate.
      abstract Family getFamily()
      Returns the Family that this sketch belongs to.
      double getLowerBound​(int numStdDev)
      Gets the approximate lower error bound given the specified number of Standard Deviations.
      static int getMaxCompactSketchBytes​(int numberOfEntries)
      Returns the maximum number of storage bytes required for a CompactSketch with the given number of actual entries.
      static int getMaxUpdateSketchBytes​(int nomEntries)
      Returns the maximum number of storage bytes required for an UpdateSketch with the given number of nominal entries (power of 2).
      int getRetainedEntries()
      Returns the number of valid entries that have been retained by the sketch.
      abstract int getRetainedEntries​(boolean valid)
      Returns the number of entries that have been retained by the sketch.
      static int getSerializationVersion​(org.apache.datasketches.memory.Memory mem)
      Returns the serialization version from the given Memory.
      double getTheta()
      Gets the value of theta as a double between zero and one.
      abstract long getThetaLong()
      Gets the value of theta as a long.
      double getUpperBound​(int numStdDev)
      Gets the approximate upper error bound given the specified number of Standard Deviations.
      static Sketch heapify​(org.apache.datasketches.memory.Memory srcMem)
      Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
      static Sketch heapify​(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
      Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
      abstract boolean isCompact()
      Returns true if this sketch is in compact form.
      abstract boolean isEmpty()
      Returns true if this sketch is empty.
      boolean isEstimationMode()
      Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode).
      abstract boolean isOrdered()
      Returns true if the internal cache is ordered.
      abstract HashIterator iterator()
      Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
      abstract byte[] toByteArray()
      Serialize this sketch to a byte array form.
      String toString()
      Returns a human readable summary of the sketch.
      String toString​(boolean sketchSummary, boolean dataDetail, int width, boolean hexMode)
      Gets a human readable listing of contents and summary of the given sketch.
      static String toString​(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of a Theta Sketch.
      static String toString​(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of a Theta Sketch.
      static Sketch wrap​(org.apache.datasketches.memory.Memory srcMem)
      Wrap takes the sketch image in the given Memory and refers to it directly.
      static Sketch wrap​(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
      Wrap takes the sketch image in the given Memory and refers to it directly.
    • Method Detail

      • heapify

        public static Sketch heapify​(org.apache.datasketches.memory.Memory srcMem)
        Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.

        The resulting sketch will not retain any link to the source Memory.

        For Update Sketches this method checks if the Default Update Seed was used to create the source Memory image.

        For Compact Sketches this method assumes that the sketch image was created with the correct hash seed, so it is not checked.

        Parameters:
        srcMem - an image of a Sketch. See Memory.
        Returns:
        a Sketch on the heap.
      • heapify

        public static Sketch heapify​(org.apache.datasketches.memory.Memory srcMem,
                                     long expectedSeed)
        Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.

        The resulting sketch will not retain any link to the source Memory.

        For Update and Compact Sketches this method checks if the given expectedSeed was used to create the source Memory image. However, SerialVersion 1 sketches cannot be checked.

        Parameters:
        srcMem - an image of a Sketch that was created using the given expectedSeed. See Memory.
        expectedSeed - the seed used to validate the given Memory image. See Update Hash Seed. Compact sketches store a 16-bit hash of the seed, but not the seed itself.
        Returns:
        a Sketch on the heap.
      • wrap

        public static Sketch wrap​(org.apache.datasketches.memory.Memory srcMem)
        Wrap takes the sketch image in the given Memory and refers to it directly. There is no data copying onto the Java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

        Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have been explicitly stored as direct sketches can be wrapped. Wrapping earlier serial version sketches will result in an on-heap CompactSketch where all data will be copied to the heap. These early versions were never designed to "wrap".

        Wrapping any subclass of this class that is empty or contains only a single item will result in on-heap equivalent forms of empty and single item sketch respectively. This is actually faster and consumes less overall memory.

        For Update Sketches this method checks if the Default Update Seed was used to create the source Memory image.

        For Compact Sketches this method assumes that the sketch image was created with the correct hash seed, so it is not checked.

        Parameters:
        srcMem - an image of a Sketch. See Memory.
        Returns:
        a Sketch backed by the given Memory
      • wrap

        public static Sketch wrap​(org.apache.datasketches.memory.Memory srcMem,
                                  long expectedSeed)
        Wrap takes the sketch image in the given Memory and refers to it directly. There is no data copying onto the Java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

        Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have been explicitly stored as direct sketches can be wrapped. Wrapping earlier serial version sketches will result in an on-heap CompactSketch where all data will be copied to the heap. These early versions were never designed to "wrap".

        Wrapping any subclass of this class that is empty or contains only a single item will result in on-heap equivalent forms of empty and single item sketch respectively. This is actually faster and consumes less overall memory.

        For Update and Compact Sketches this method checks if the given expectedSeed was used to create the source Memory image. However, SerialVersion 1 sketches cannot be checked.

        Parameters:
        srcMem - an image of a Sketch. See Memory.
        expectedSeed - the seed used to validate the given Memory image. See Update Hash Seed.
        Returns:
        a Sketch backed by the given Memory, except as noted above.
      • compact

        public CompactSketch compact()
        Converts this sketch to an ordered CompactSketch.

        If this.isCompact() == true, this method returns this. Otherwise, this method is equivalent to compact(true, null).

        A CompactSketch is always immutable.

        Returns:
        this sketch as an ordered CompactSketch.
      • compact

        public abstract CompactSketch compact​(boolean dstOrdered,
                                              org.apache.datasketches.memory.WritableMemory dstMem)
        Convert this sketch to a CompactSketch.

        If this sketch is a type of UpdateSketch, the compacting process converts the hash table of the UpdateSketch to a simple list of the valid hash values. Any hash values that are zero, or that are greater than or equal to theta, are discarded. The number of valid values remaining in the CompactSketch depends on a number of factors, but may be larger or smaller than Nominal Entries (or k). It will never exceed 2k. If it is critical to always limit the size to no more than k, then rebuild() should be called on the UpdateSketch prior to calling this method.

        A CompactSketch is always immutable.

        A new CompactSketch object is created:

        • if dstMem != null
        • if dstMem == null and this.hasMemory() == true
        • if dstMem == null and this has more than 1 item and this.isOrdered() == false and dstOrdered == true.

        Otherwise, this operation returns this.

        Parameters:
        dstOrdered - assumed true if this sketch is empty or has only one value. See Destination Ordered.
        dstMem - See Destination Memory.
        Returns:
        this sketch as a CompactSketch.
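
The filtering step described for compact(dstOrdered, dstMem) can be pictured with a small stand-alone sketch. It does not use the library and is not its implementation: the class and method names (CompactFilterSketch, compactHashes) are hypothetical, and it assumes the hash table is a plain long[] in which zero marks an empty slot.

```java
import java.util.Arrays;

// Illustrative only: a stand-alone model of the compaction filter,
// not the DataSketches implementation.
final class CompactFilterSketch {

  /**
   * Keeps only hashes that are non-zero and strictly less than thetaLong.
   * If ordered is true the surviving hashes are sorted ascending,
   * which mirrors the idea behind dstOrdered == true.
   */
  static long[] compactHashes(long[] hashTable, long thetaLong, boolean ordered) {
    long[] valid = Arrays.stream(hashTable)
        .filter(h -> h != 0L && h < thetaLong) // drop empty slots and hashes >= theta
        .toArray();
    if (ordered) {
      Arrays.sort(valid);
    }
    return valid;
  }

  public static void main(String[] args) {
    long[] table = {0L, 42L, 0L, 7L, 900L, 15L}; // assumed toy hash table
    long thetaLong = 100L;                       // assumed toy theta
    System.out.println(Arrays.toString(compactHashes(table, thetaLong, true)));
    // prints [7, 15, 42]
  }
}
```

The input array is left untouched and a new array is returned, in the same spirit as a CompactSketch being immutable.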
      • getCompactBytes

        public abstract int getCompactBytes()
        Returns the number of storage bytes required for this Sketch if its current state were compacted. If this sketch is already in the compact form this is equivalent to calling getCurrentBytes().
        Returns:
        number of compact bytes
      • getCountLessThanThetaLong

        public int getCountLessThanThetaLong​(long thetaLong)
        Gets the number of hash values less than the given theta expressed as a long.
        Parameters:
        thetaLong - the given theta as a long between zero and Long.MAX_VALUE.
        Returns:
        the number of hash values less than the given thetaLong.
      • getCurrentBytes

        public abstract int getCurrentBytes()
        Returns the number of storage bytes required for this sketch in its current state.
        Returns:
        the number of storage bytes required for this sketch
      • getEstimate

        public abstract double getEstimate()
        Gets the unique count estimate.
        Returns:
        the sketch's best estimate of the cardinality of the input stream.
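
As a rough illustration of how a theta-style estimate relates to the other getters, the sketch below divides the number of valid retained entries by theta. This is the textbook form of the estimator and is an assumption here, not a statement of what this class computes internally; the names EstimateSketch and estimate are hypothetical.

```java
// Illustrative only: textbook theta estimator, assumed rather than
// taken from the library source.
final class EstimateSketch {

  /**
   * @param validEntries number of retained hashes below theta
   * @param theta        sampling fraction in (0, 1]
   * @param empty        true if no items were ever presented
   */
  static double estimate(int validEntries, double theta, boolean empty) {
    if (empty) {
      return 0.0; // nothing seen, nothing to estimate
    }
    return validEntries / theta; // theta == 1.0 gives the exact count
  }

  public static void main(String[] args) {
    System.out.println(estimate(1000, 1.0, false)); // exact mode: 1000.0
    System.out.println(estimate(1000, 0.5, false)); // half sampled: 2000.0
  }
}
```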
      • getFamily

        public abstract Family getFamily()
        Returns the Family that this sketch belongs to.
        Returns:
        the Family that this sketch belongs to
      • getLowerBound

        public double getLowerBound​(int numStdDev)
        Gets the approximate lower error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
        Parameters:
        numStdDev - See Number of Standard Deviations
        Returns:
        the lower bound.
      • getMaxCompactSketchBytes

        public static int getMaxCompactSketchBytes​(int numberOfEntries)
        Returns the maximum number of storage bytes required for a CompactSketch with the given number of actual entries.
        Parameters:
        numberOfEntries - the actual number of retained entries stored in the sketch.
        Returns:
        the maximum number of storage bytes required for a CompactSketch with the given number of retained entries.
      • getCompactSketchMaxBytes

        public static int getCompactSketchMaxBytes​(int lgNomEntries)
        Returns the maximum number of storage bytes required for a CompactSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
        Parameters:
        lgNomEntries - the log_base2 of Nominal Entries.
        Returns:
        the maximum number of storage bytes required for a CompactSketch with the given nomEntries.
      • getMaxUpdateSketchBytes

        public static int getMaxUpdateSketchBytes​(int nomEntries)
        Returns the maximum number of storage bytes required for an UpdateSketch with the given number of nominal entries (power of 2).
        Parameters:
        nomEntries - Nominal Entries. If this is not a power of 2, it is rounded up to the next power of 2.
        Returns:
        the maximum number of storage bytes required for an UpdateSketch with the given nomEntries
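
The "ceiling power of 2" adjustment applied to nomEntries, and its relation to the lgNomEntries argument of getCompactSketchMaxBytes, can be sketched as follows. The helper names are hypothetical, and capping the result at 2^30 is a choice made for this example so the shift cannot overflow an int.

```java
// Illustrative only: rounding a nominal-entries value up to a power of 2.
final class NominalEntriesSketch {

  /** Returns the smallest power of 2 that is >= n, capped at 2^30. */
  static int ceilingPowerOf2(int n) {
    if (n <= 1) {
      return 1;
    }
    final int cap = 1 << 30; // largest power of 2 that fits in a positive int
    if (n >= cap) {
      return cap;
    }
    return Integer.highestOneBit((n - 1) << 1);
  }

  /** Returns log_base2 of a value that is already a power of 2. */
  static int lgOfPowerOf2(int powerOf2) {
    return Integer.numberOfTrailingZeros(powerOf2);
  }

  public static void main(String[] args) {
    int k = ceilingPowerOf2(5000);
    System.out.println(k);               // 8192
    System.out.println(lgOfPowerOf2(k)); // 13
  }
}
```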
      • getRetainedEntries

        public int getRetainedEntries()
        Returns the number of valid entries that have been retained by the sketch.
        Returns:
        the number of valid retained entries
      • getRetainedEntries

        public abstract int getRetainedEntries​(boolean valid)
        Returns the number of entries that have been retained by the sketch.
        Parameters:
        valid - if true, returns the number of valid entries, which are less than theta and used for estimation. Otherwise, returns the number of all entries, valid or not, that are currently in the internal sketch cache.
        Returns:
        the number of retained entries
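
The difference between the two counts can be modeled on a toy cache. The code below is a stand-alone illustration with hypothetical names; it assumes the cache is a long[] where zero marks an empty slot and where some entries at or above theta may still be present (a "dirty" cache).

```java
// Illustrative only: counting valid versus all retained entries
// in a toy cache, not the library's internal bookkeeping.
final class RetainedEntriesSketch {

  static int retainedEntries(long[] cache, long thetaLong, boolean valid) {
    int count = 0;
    for (long h : cache) {
      if (h == 0L) {
        continue; // empty slot, never counted
      }
      if (!valid || h < thetaLong) {
        count++; // valid == false counts every occupied slot
      }
    }
    return count;
  }

  public static void main(String[] args) {
    long[] cache = {0L, 5L, 120L, 30L, 0L, 99L, 100L}; // assumed toy cache
    System.out.println(retainedEntries(cache, 100L, true));  // 3
    System.out.println(retainedEntries(cache, 100L, false)); // 5
  }
}
```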
      • getSerializationVersion

        public static int getSerializationVersion​(org.apache.datasketches.memory.Memory mem)
        Returns the serialization version from the given Memory.
        Parameters:
        mem - the sketch Memory
        Returns:
        the serialization version from the Memory
      • getTheta

        public double getTheta()
        Gets the value of theta as a double between zero and one.
        Returns:
        the value of theta as a double
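
One plausible way to relate the long and double forms of theta is to scale by Long.MAX_VALUE, so that Long.MAX_VALUE maps to 1.0. The sketch below assumes that scaling; the class and method names are hypothetical.

```java
// Illustrative only: assumes theta-as-double is thetaLong scaled by Long.MAX_VALUE.
final class ThetaConversionSketch {

  static double thetaAsDouble(long thetaLong) {
    return thetaLong / (double) Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    System.out.println(thetaAsDouble(Long.MAX_VALUE)); // 1.0
    System.out.println(thetaAsDouble(1L << 62));       // 0.5
    System.out.println(thetaAsDouble(0L));             // 0.0
  }
}
```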
      • getThetaLong

        public abstract long getThetaLong()
        Gets the value of theta as a long.
        Returns:
        the value of theta as a long
      • getUpperBound

        public double getUpperBound​(int numStdDev)
        Gets the approximate upper error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
        Parameters:
        numStdDev - See Number of Standard Deviations
        Returns:
        the upper bound.
      • isCompact

        public abstract boolean isCompact()
        Returns true if this sketch is in compact form.
        Returns:
        true if this sketch is in compact form.
      • isEmpty

        public abstract boolean isEmpty()
        Returns true if this sketch is empty.
        Returns:
        true if empty.
      • isEstimationMode

        public boolean isEstimationMode()
        Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode). This is true if theta < 1.0 AND isEmpty() is false.
        Returns:
        true if the sketch is in estimation mode.
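
The stated condition (theta < 1.0 AND isEmpty() is false) is simple enough to express directly. The stand-alone sketch below only restates that rule with hypothetical names; it does not call the library.

```java
// Illustrative only: the estimation-mode rule as stated in the text.
final class EstimationModeSketch {

  static boolean isEstimationMode(double theta, boolean empty) {
    return theta < 1.0 && !empty;
  }

  public static void main(String[] args) {
    System.out.println(isEstimationMode(1.0, false)); // false: exact mode
    System.out.println(isEstimationMode(0.5, false)); // true: sampling has started
    System.out.println(isEstimationMode(0.5, true));  // false: an empty sketch is never estimating
  }
}
```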
      • isOrdered

        public abstract boolean isOrdered()
        Returns true if the internal cache is ordered.
        Returns:
        true if internal cache is ordered
      • iterator

        public abstract HashIterator iterator()
        Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
        Returns:
        a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
      • toByteArray

        public abstract byte[] toByteArray()
        Serialize this sketch to a byte array form.
        Returns:
        byte array of this sketch
      • toString

        public String toString()
        Returns a human readable summary of the sketch. This method is equivalent to the parameterized call:
        toString(true, false, 8, true);
        Overrides:
        toString in class Object
        Returns:
        summary
      • toString

        public String toString​(boolean sketchSummary,
                               boolean dataDetail,
                               int width,
                               boolean hexMode)
        Gets a human readable listing of the contents and summary of this sketch. This can be a very long string. If this sketch is in a "dirty" state there may be values in the dataDetail view that are ≥ theta.
        Parameters:
        sketchSummary - If true the sketch summary will be output at the end.
        dataDetail - If true, includes all valid hash values in the sketch.
        width - The number of columns of hash values. Default is 8.
        hexMode - If true, hashes will be output in hex.
        Returns:
        The result string, which can be very long.
      • toString

        public static String toString​(byte[] byteArr)
        Returns a human readable string of the preamble of a byte array image of a Theta Sketch.
        Parameters:
        byteArr - the given byte array
        Returns:
        a human readable string of the preamble of a byte array image of a Theta Sketch.
      • toString

        public static String toString​(org.apache.datasketches.memory.Memory mem)
        Returns a human readable string of the preamble of a Memory image of a Theta Sketch.
        Parameters:
        mem - the given Memory object
        Returns:
        a human readable string of the preamble of a Memory image of a Theta Sketch.