Class Sketch

java.lang.Object
org.apache.datasketches.theta.Sketch
All Implemented Interfaces:
MemoryStatus
Direct Known Subclasses:
CompactSketch, UpdateSketch

public abstract class Sketch extends Object implements MemoryStatus
The top-level class for all theta sketches. This class is never constructed directly. Use the UpdateSketch.builder() methods to create UpdateSketches.
Author:
Lee Rhodes
  • Method Summary

    Modifier and Type
    Method
    Description
    CompactSketch
    compact()
    Converts this sketch to an ordered CompactSketch.
    abstract CompactSketch
    compact(boolean dstOrdered, org.apache.datasketches.memory.WritableMemory dstMem)
    Convert this sketch to a CompactSketch.
    abstract int
    getCompactBytes()
    Returns the number of storage bytes required for this Sketch if its current state were compacted.
    static int
    getCompactSketchMaxBytes(int lgNomEntries)
    Returns the maximum number of storage bytes required for a CompactSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
    int
    getCountLessThanThetaLong(long thetaLong)
    Gets the number of hash values less than the given theta expressed as a long.
    abstract int
    getCurrentBytes()
    Returns the number of storage bytes required for this sketch in its current state.
    abstract double
    getEstimate()
    Gets the unique count estimate.
    abstract Family
    getFamily()
    Returns the Family that this sketch belongs to.
    double
    getLowerBound(int numStdDev)
    Gets the approximate lower error bound given the specified number of Standard Deviations.
    static int
    getMaxCompactSketchBytes(int numberOfEntries)
    Returns the maximum number of storage bytes required for a CompactSketch with the given number of actual entries.
    static int
    getMaxUpdateSketchBytes(int nomEntries)
    Returns the maximum number of storage bytes required for an UpdateSketch with the given number of nominal entries (power of 2).
    int
    getRetainedEntries()
    Returns the number of valid entries that have been retained by the sketch.
    abstract int
    getRetainedEntries(boolean valid)
    Returns the number of entries that have been retained by the sketch.
    static int
    getSerializationVersion(org.apache.datasketches.memory.Memory mem)
    Returns the serialization version from the given Memory
    double
    getTheta()
    Gets the value of theta as a double with a value between zero and one.
    abstract long
    getThetaLong()
    Gets the value of theta as a long.
    double
    getUpperBound(int numStdDev)
    Gets the approximate upper error bound given the specified number of Standard Deviations.
    static Sketch
    heapify(org.apache.datasketches.memory.Memory srcMem)
    Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
    static Sketch
    heapify(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
    Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
    abstract boolean
    isCompact()
    Returns true if this sketch is in compact form.
    abstract boolean
    isEmpty()
    Returns true if this sketch is empty.
    boolean
    isEstimationMode()
    Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode).
    abstract boolean
    isOrdered()
    Returns true if internal cache is ordered.
    abstract HashIterator
    iterator()
    Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
    abstract byte[]
    toByteArray()
    Serialize this sketch to a byte array form.
    String
    toString()
    Returns a human readable summary of the sketch.
    String
    toString(boolean sketchSummary, boolean dataDetail, int width, boolean hexMode)
    Gets a human readable listing of contents and summary of this sketch.
    static String
    toString(byte[] byteArr)
    Returns a human readable string of the preamble of a byte array image of a Theta Sketch.
    static String
    toString(org.apache.datasketches.memory.Memory mem)
    Returns a human readable string of the preamble of a Memory image of a Theta Sketch.
    static Sketch
    wrap(org.apache.datasketches.memory.Memory srcMem)
    Wrap takes the sketch image in the given Memory and refers to it directly.
    static Sketch
    wrap(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
    Wrap takes the sketch image in the given Memory and refers to it directly.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

    Methods inherited from interface org.apache.datasketches.common.MemoryStatus

    hasMemory, isDirect, isSameResource
  • Method Details

    • heapify

      public static Sketch heapify(org.apache.datasketches.memory.Memory srcMem)
      Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.

      The resulting sketch will not retain any link to the source Memory.

      For Update Sketches this method checks if the Default Update Seed was used to create the source Memory image.

      For Compact Sketches this method assumes that the sketch image was created with the correct hash seed, so it is not checked.

      Parameters:
      srcMem - an image of a Sketch. See Memory.
      Returns:
      a Sketch on the heap.
    • heapify

      public static Sketch heapify(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
      Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.

      The resulting sketch will not retain any link to the source Memory.

      For Update and Compact Sketches this method checks if the given expectedSeed was used to create the source Memory image. However, SerialVersion 1 sketches cannot be checked.

      Parameters:
      srcMem - an image of a Sketch that was created using the given expectedSeed. See Memory.
      expectedSeed - the seed used to validate the given Memory image. See Update Hash Seed. Compact sketches store a 16-bit hash of the seed, but not the seed itself.
      Returns:
      a Sketch on the heap.
    • wrap

      public static Sketch wrap(org.apache.datasketches.memory.Memory srcMem)
      Wrap takes the sketch image in the given Memory and refers to it directly. There is no data copying onto the java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

      Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have been explicitly stored as direct sketches can be wrapped. Wrapping earlier serial version sketches will result in an on-heap CompactSketch where all data will be copied to the heap. These early versions were never designed to "wrap".

      Wrapping any subclass of this class that is empty or contains only a single item will result in the on-heap equivalent form of an empty or single-item sketch, respectively. This is actually faster and consumes less overall memory.

      For Update Sketches this method checks if the Default Update Seed was used to create the source Memory image.

      For Compact Sketches this method assumes that the sketch image was created with the correct hash seed, so it is not checked.

      Parameters:
      srcMem - an image of a Sketch. See Memory.
      Returns:
      a Sketch backed by the given Memory
    • wrap

      public static Sketch wrap(org.apache.datasketches.memory.Memory srcMem, long expectedSeed)
      Wrap takes the sketch image in the given Memory and refers to it directly. There is no data copying onto the java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

      Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have been explicitly stored as direct sketches can be wrapped. Wrapping earlier serial version sketches will result in an on-heap CompactSketch where all data will be copied to the heap. These early versions were never designed to "wrap".

      Wrapping any subclass of this class that is empty or contains only a single item will result in the on-heap equivalent form of an empty or single-item sketch, respectively. This is actually faster and consumes less overall memory.

      For Update and Compact Sketches this method checks if the given expectedSeed was used to create the source Memory image. However, SerialVersion 1 sketches cannot be checked.

      Parameters:
      srcMem - an image of a Sketch. See Memory.
      expectedSeed - the seed used to validate the given Memory image. See Update Hash Seed.
      Returns:
      an UpdateSketch backed by the given Memory, except as noted above.
    • compact

      public CompactSketch compact()
      Converts this sketch to an ordered CompactSketch.

      If this.isCompact() == true, this method returns this. Otherwise, this method is equivalent to compact(true, null).

      A CompactSketch is always immutable.

      Returns:
      this sketch as an ordered CompactSketch.
    • compact

      public abstract CompactSketch compact(boolean dstOrdered, org.apache.datasketches.memory.WritableMemory dstMem)
      Convert this sketch to a CompactSketch.

      If this sketch is a type of UpdateSketch, the compacting process converts the hash table of the UpdateSketch to a simple list of the valid hash values. Any hash values that are zero, or that are greater than or equal to theta, are discarded. The number of valid values remaining in the CompactSketch depends on a number of factors and may be larger or smaller than Nominal Entries (k), but it will never exceed 2k. If it is critical to limit the size to no more than k, call rebuild() on the UpdateSketch before calling this method.
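
The filtering step described above can be sketched with plain JDK code. This is only an illustration of the documented rule (drop zeros and values at or above theta, optionally sort); it does not call the library, and the class name CompactionDemo, the method compactHashes, and the sample threshold are hypothetical.

```java
import java.util.Arrays;

/** Stand-alone illustration; does not use the DataSketches library. */
final class CompactionDemo {

  /**
   * Keeps only the valid hash values from a hash-table-style cache:
   * values that are non-zero and strictly less than thetaLong.
   * Sorts the result when ordered is true.
   */
  static long[] compactHashes(long[] cache, long thetaLong, boolean ordered) {
    long[] valid = Arrays.stream(cache)
        .filter(h -> h > 0 && h < thetaLong)
        .toArray();
    if (ordered) {
      Arrays.sort(valid);
    }
    return valid;
  }

  public static void main(String[] args) {
    // Zeros stand for unused slots; 900 and 1200 are not below theta.
    long[] cache = {0L, 42L, 0L, 900L, 7L, 500L, 0L, 1200L};
    long thetaLong = 900L; // hypothetical threshold for this demo
    System.out.println(Arrays.toString(compactHashes(cache, thetaLong, true)));
    // prints [7, 42, 500]
  }
}
```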

      A CompactSketch is always immutable.

      A new CompactSketch object is created:

      • if dstMem != null
      • if dstMem == null and this.hasMemory() == true
      • if dstMem == null and this has more than 1 item and this.isOrdered() == false and dstOrdered == true.

      Otherwise, this operation returns this.
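
The three conditions listed above can be restated as a single predicate. The following is a literal transcription of that list for readers who prefer code, not the library's implementation; the class CompactResultRule and its parameter names are hypothetical.

```java
/** Stand-alone restatement of the documented rule; not library code. */
final class CompactResultRule {

  /**
   * Returns true when, per the documented conditions, a new CompactSketch
   * object would be created, and false when the call would return this.
   */
  static boolean createsNewObject(boolean dstMemProvided, boolean thisHasMemory,
      int retainedEntries, boolean thisIsOrdered, boolean dstOrdered) {
    if (dstMemProvided) {
      return true;              // dstMem != null
    }
    if (thisHasMemory) {
      return true;              // dstMem == null and this.hasMemory() == true
    }
    // dstMem == null, more than one item, currently unordered, ordered requested
    return retainedEntries > 1 && !thisIsOrdered && dstOrdered;
  }

  public static void main(String[] args) {
    // An on-heap, already ordered sketch with no destination Memory.
    System.out.println(createsNewObject(false, false, 5, true, true)); // prints false
  }
}
```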

      Parameters:
      dstOrdered - assumed true if this sketch is empty or has only one value. See Destination Ordered.
      dstMem - See Destination Memory.
      Returns:
      this sketch as a CompactSketch.
    • getCompactBytes

      public abstract int getCompactBytes()
      Returns the number of storage bytes required for this Sketch if its current state were compacted. If this sketch is already in the compact form, this is equivalent to calling getCurrentBytes().
      Returns:
      number of compact bytes
    • getCountLessThanThetaLong

      public int getCountLessThanThetaLong(long thetaLong)
      Gets the number of hash values less than the given theta expressed as a long.
      Parameters:
      thetaLong - the given theta as a long between zero and Long.MAX_VALUE.
      Returns:
      the number of hash values less than the given thetaLong.
    • getCurrentBytes

      public abstract int getCurrentBytes()
      Returns the number of storage bytes required for this sketch in its current state.
      Returns:
      the number of storage bytes required for this sketch
    • getEstimate

      public abstract double getEstimate()
      Gets the unique count estimate.
      Returns:
      the sketch's best estimate of the cardinality of the input stream.
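
As background, the standard theta-sketch estimator divides the number of valid retained entries by theta expressed as a fraction in (0, 1]. The sketch below shows that arithmetic only; the class ThetaEstimateDemo and its method are hypothetical, and the library's internal computation may be organized differently.

```java
/** Stand-alone illustration of the estimator arithmetic; not library code. */
final class ThetaEstimateDemo {

  /**
   * Estimates the number of distinct inputs from the count of valid
   * retained entries and theta given as a fraction in (0, 1].
   */
  static double estimate(int validRetainedEntries, double theta) {
    if (theta <= 0.0 || theta > 1.0) {
      throw new IllegalArgumentException("theta must be in (0, 1]: " + theta);
    }
    return validRetainedEntries / theta;
  }

  public static void main(String[] args) {
    // With theta = 1.0 nothing has been sampled out, so the count is exact.
    System.out.println(estimate(1000, 1.0)); // prints 1000.0
    // With theta = 0.5 each retained entry stands for about two inputs.
    System.out.println(estimate(1000, 0.5)); // prints 2000.0
  }
}
```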
    • getFamily

      public abstract Family getFamily()
      Returns the Family that this sketch belongs to
      Returns:
      the Family that this sketch belongs to
    • getLowerBound

      public double getLowerBound(int numStdDev)
      Gets the approximate lower error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      Returns:
      the lower bound.
    • getMaxCompactSketchBytes

      public static int getMaxCompactSketchBytes(int numberOfEntries)
      Returns the maximum number of storage bytes required for a CompactSketch with the given number of actual entries.
      Parameters:
      numberOfEntries - the actual number of retained entries stored in the sketch.
      Returns:
      the maximum number of storage bytes required for a CompactSketch with the given number of retained entries.
    • getCompactSketchMaxBytes

      public static int getCompactSketchMaxBytes(int lgNomEntries)
      Returns the maximum number of storage bytes required for a CompactSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
      Parameters:
      lgNomEntries - Nominal Entries
      Returns:
      the maximum number of storage bytes required for a CompactSketch with the given lgNomEntries.
    • getMaxUpdateSketchBytes

      public static int getMaxUpdateSketchBytes(int nomEntries)
      Returns the maximum number of storage bytes required for an UpdateSketch with the given number of nominal entries (power of 2).
      Parameters:
      nomEntries - Nominal Entries. This will be rounded up to the next power of 2 if it is not already a power of 2.
      Returns:
      the maximum number of storage bytes required for an UpdateSketch with the given nomEntries
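
The rounding of nomEntries to a power of 2, and the relationship between nominal entries and lgNomEntries, can be illustrated with JDK bit operations. This is a sketch under the assumption that "ceiling power of 2" means the smallest power of 2 that is greater than or equal to the input; the class NominalEntriesDemo and its methods are hypothetical and no byte-size formula is implied.

```java
/** Stand-alone illustration of power-of-2 rounding; not library code. */
final class NominalEntriesDemo {

  /** Smallest power of 2 that is greater than or equal to n (n up to 2^30). */
  static int ceilingPowerOf2(int n) {
    if (n <= 1) {
      return 1;
    }
    if (n > (1 << 30)) {
      throw new IllegalArgumentException("n too large: " + n);
    }
    return Integer.highestOneBit(n - 1) << 1;
  }

  /** Base-2 logarithm of a value that is already a power of 2. */
  static int lgOfPowerOf2(int powerOf2) {
    return Integer.numberOfTrailingZeros(powerOf2);
  }

  public static void main(String[] args) {
    int k = ceilingPowerOf2(1000);
    System.out.println(k);               // prints 1024
    System.out.println(lgOfPowerOf2(k)); // prints 10
  }
}
```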
    • getRetainedEntries

      public int getRetainedEntries()
      Returns the number of valid entries that have been retained by the sketch.
      Returns:
      the number of valid retained entries
    • getRetainedEntries

      public abstract int getRetainedEntries(boolean valid)
      Returns the number of entries that have been retained by the sketch.
      Parameters:
      valid - if true, returns the number of valid entries, which are less than theta and used for estimation. Otherwise, return the number of all entries, valid or not, that are currently in the internal sketch cache.
      Returns:
      the number of retained entries
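
The difference between the two counts can be seen in a small model of a cache that still holds values at or above theta. The code below is an illustration only, assuming zero marks an unused slot; RetainedEntriesDemo and countRetained are hypothetical names.

```java
/** Stand-alone illustration of valid versus all retained entries. */
final class RetainedEntriesDemo {

  /**
   * Counts entries in a cache. When valid is true, only non-zero values
   * strictly below thetaLong are counted; otherwise every non-zero value
   * is counted, including those at or above theta.
   */
  static int countRetained(long[] cache, long thetaLong, boolean valid) {
    int count = 0;
    for (long h : cache) {
      if (h <= 0) {
        continue; // unused slot in this model
      }
      if (!valid || h < thetaLong) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    long[] cache = {0L, 42L, 900L, 7L, 1200L};
    long thetaLong = 900L; // hypothetical threshold for this demo
    System.out.println(countRetained(cache, thetaLong, true));  // prints 2
    System.out.println(countRetained(cache, thetaLong, false)); // prints 4
  }
}
```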
    • getSerializationVersion

      public static int getSerializationVersion(org.apache.datasketches.memory.Memory mem)
      Returns the serialization version from the given Memory
      Parameters:
      mem - the sketch Memory
      Returns:
      the serialization version from the Memory
    • getTheta

      public double getTheta()
      Gets the value of theta as a double with a value between zero and one
      Returns:
      the value of theta as a double
    • getThetaLong

      public abstract long getThetaLong()
      Gets the value of theta as a long
      Returns:
      the value of theta as a long
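
The long and double forms of theta describe the same threshold on different scales: the long ranges from zero to Long.MAX_VALUE and the double from zero to one. A minimal sketch of the conversion, assuming a simple division by Long.MAX_VALUE, follows; ThetaConversionDemo is a hypothetical name.

```java
/** Stand-alone illustration of the theta scale conversion. */
final class ThetaConversionDemo {

  /** Maps a theta in [0, Long.MAX_VALUE] onto the interval [0.0, 1.0]. */
  static double thetaAsDouble(long thetaLong) {
    if (thetaLong < 0) {
      throw new IllegalArgumentException("thetaLong must be non-negative");
    }
    return thetaLong / (double) Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    System.out.println(thetaAsDouble(Long.MAX_VALUE));     // prints 1.0
    System.out.println(thetaAsDouble(Long.MAX_VALUE / 2)); // prints 0.5
  }
}
```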
    • getUpperBound

      public double getUpperBound(int numStdDev)
      Gets the approximate upper error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      Returns:
      the upper bound.
    • isCompact

      public abstract boolean isCompact()
      Returns true if this sketch is in compact form.
      Returns:
      true if this sketch is in compact form.
    • isEmpty

      public abstract boolean isEmpty()
      Returns:
      true if empty.
    • isEstimationMode

      public boolean isEstimationMode()
      Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode). This is true if theta < 1.0 AND isEmpty() is false.
      Returns:
      true if the sketch is in estimation mode.
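
The stated condition is small enough to write out directly. The snippet below restates it as a predicate over theta and the empty flag; EstimationModeDemo is a hypothetical name and the code does not touch the library.

```java
/** Stand-alone restatement of the documented condition. */
final class EstimationModeDemo {

  /** True when theta is below 1.0 and the sketch is not empty. */
  static boolean isEstimationMode(double theta, boolean empty) {
    return theta < 1.0 && !empty;
  }

  public static void main(String[] args) {
    System.out.println(isEstimationMode(1.0, false)); // prints false (exact mode)
    System.out.println(isEstimationMode(0.5, false)); // prints true
    System.out.println(isEstimationMode(0.5, true));  // prints false (empty)
  }
}
```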
    • isOrdered

      public abstract boolean isOrdered()
      Returns true if internal cache is ordered
      Returns:
      true if internal cache is ordered
    • iterator

      public abstract HashIterator iterator()
      Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
      Returns:
      a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
    • toByteArray

      public abstract byte[] toByteArray()
      Serialize this sketch to a byte array form.
      Returns:
      byte array of this sketch
    • toString

      public String toString()
      Returns a human readable summary of the sketch. This method is equivalent to the parameterized call:
      toString(true, false, 8, true);
      Overrides:
      toString in class Object
      Returns:
      summary
    • toString

      public String toString(boolean sketchSummary, boolean dataDetail, int width, boolean hexMode)
      Gets a human readable listing of contents and summary of this sketch. This can be a very long string. If this sketch is in a "dirty" state, there may be values in the dataDetail view that are ≥ theta.
      Parameters:
      sketchSummary - If true the sketch summary will be output at the end.
      dataDetail - If true, includes all valid hash values in the sketch.
      width - The number of columns of hash values. Default is 8.
      hexMode - If true, hashes will be output in hex.
      Returns:
      The result string, which can be very long.
    • toString

      public static String toString(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of a Theta Sketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human readable string of the preamble of a byte array image of a Theta Sketch.
    • toString

      public static String toString(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of a Theta Sketch.
      Parameters:
      mem - the given Memory object
      Returns:
      a human readable string of the preamble of a Memory image of a Theta Sketch.