Class ThetaSketch

java.lang.Object
org.apache.datasketches.theta.ThetaSketch
All Implemented Interfaces:
MemorySegmentStatus
Direct Known Subclasses:
CompactThetaSketch, UpdatableThetaSketch

public abstract class ThetaSketch extends Object implements MemorySegmentStatus
The top-level class for all theta sketches. This class is never constructed directly. Use the UpdatableThetaSketchBuilder() methods to create UpdatableThetaSketches.
Author:
Lee Rhodes
  • Method Summary

    Modifier and Type
    Method
    Description
    CompactThetaSketch
    compact()
    Converts this sketch to an ordered CompactThetaSketch.
    abstract CompactThetaSketch
    compact(boolean dstOrdered, MemorySegment dstSeg)
    Converts this sketch to a CompactThetaSketch.
    abstract int
    getCompactBytes()
    Returns the number of storage bytes required for this ThetaSketch if its current state were compacted.
    static int
    getCompactSketchMaxBytes(int lgNomEntries)
    Returns the maximum number of storage bytes required for a CompactThetaSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
    int
    getCountLessThanThetaLong(long thetaLong)
    Gets the number of hash values less than the given theta expressed as a long.
    abstract int
    getCurrentBytes()
    Returns the number of storage bytes required for this sketch in its current state.
    abstract double
    getEstimate()
    Gets the unique count estimate.
    static double
    getEstimate(MemorySegment srcSeg)
    Gets the estimate from the given MemorySegment.
    abstract Family
    getFamily()
    Returns the Family that this sketch belongs to.
    double
    getLowerBound(int numStdDev)
    Gets the approximate lower error bound given the specified number of Standard Deviations.
    static double
    getLowerBound(int numStdDev, MemorySegment srcSeg)
    Gets the approximate lower error bound from a valid MemorySegment image of a ThetaSketch given the specified number of Standard Deviations.
    static int
    getMaxCompactSketchBytes(int numberOfEntries)
    Returns the maximum number of storage bytes required for a CompactThetaSketch with the given number of actual entries.
    static int
    getMaxUpdateSketchBytes(int nomEntries)
    Returns the maximum number of storage bytes required for an UpdatableThetaSketch with the given number of nominal entries (power of 2).
    int
    getRetainedEntries()
    Returns the number of valid entries that have been retained by the sketch.
    abstract int
    getRetainedEntries(boolean valid)
    Returns the number of entries that have been retained by the sketch.
    static int
    getRetainedEntries(MemorySegment srcSeg)
    Returns the number of valid entries that have been retained by the sketch from the given MemorySegment.
    static int
    getSerializationVersion(MemorySegment seg)
    Returns the serialization version from the given MemorySegment.
    double
    getTheta()
    Gets the value of theta as a double with a value between zero and one.
    abstract long
    getThetaLong()
    Gets the value of theta as a long.
    static int
    getUpdateSketchMaxBytes(int lgNomEntries)
    Returns the maximum number of storage bytes required for an UpdatableThetaSketch with the given log_base2 of the nominal entries.
    double
    getUpperBound(int numStdDev)
    Gets the approximate upper error bound given the specified number of Standard Deviations.
    static double
    getUpperBound(int numStdDev, MemorySegment srcSeg)
    Gets the approximate upper error bound from a valid MemorySegment image of a ThetaSketch given the specified number of Standard Deviations.
    static ThetaSketch
    heapify(MemorySegment srcSeg)
    Heapify takes the sketch image in MemorySegment and instantiates an on-heap ThetaSketch.
    static ThetaSketch
    heapify(MemorySegment srcSeg, long expectedSeed)
    Heapify takes the sketch image in MemorySegment and instantiates an on-heap ThetaSketch.
    abstract boolean
    isCompact()
    Returns true if this sketch is in compact form.
    abstract boolean
    isEmpty()
    Returns true if this sketch is empty.
    boolean
    isEstimationMode()
    Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode).
    abstract boolean
    isOrdered()
    Returns true if internal cache is ordered.
    abstract HashIterator
    iterator()
    Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
    abstract byte[]
    toByteArray()
    Serialize this sketch to a byte array form.
    String
    toString()
    Returns a human readable summary of the sketch.
    String
    toString(boolean sketchSummary, boolean dataDetail, int width, boolean hexMode)
    Gets a human readable listing of contents and summary of this sketch.
    static String
    toString(byte[] byteArr)
    Returns a human readable string of the preamble of a byte array image of a ThetaSketch.
    static String
    toString(MemorySegment seg)
    Returns a human readable string of the preamble of a MemorySegment image of a ThetaSketch.
    static ThetaSketch
    wrap(MemorySegment srcSeg)
    Wrap takes the sketch image in the given MemorySegment and refers to it directly.
    static ThetaSketch
    wrap(MemorySegment srcSeg, long expectedSeed)
    Wrap takes the sketch image in the given MemorySegment and refers to it directly.

    Methods inherited from class Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

    Methods inherited from interface MemorySegmentStatus

    hasMemorySegment, isOffHeap, isSameResource
  • Method Details

    • heapify

      public static ThetaSketch heapify(MemorySegment srcSeg)
      Heapify takes the sketch image in MemorySegment and instantiates an on-heap ThetaSketch.

      The resulting sketch will not retain any link to the source MemorySegment.

      For UpdatableThetaSketches this method checks if the Default Update Seed was used to create the source MemorySegment image.
      Parameters:
      srcSeg - an image of a ThetaSketch.
      Returns:
      a ThetaSketch on the heap.
    • heapify

      public static ThetaSketch heapify(MemorySegment srcSeg, long expectedSeed)
      Heapify takes the sketch image in MemorySegment and instantiates an on-heap ThetaSketch.

      The resulting sketch will not retain any link to the source MemorySegment.

      For UpdatableThetaSketches this method checks if the expectedSeed was used to create the source MemorySegment image.

      Parameters:
      srcSeg - an image of a ThetaSketch that was created using the given expectedSeed.
      expectedSeed - the seed used to validate the given MemorySegment image. See Update Hash Seed. Compact sketches store a 16-bit hash of the seed, but not the seed itself.
      Returns:
      a ThetaSketch on the heap.
    • wrap

      public static ThetaSketch wrap(MemorySegment srcSeg)
      Wrap takes the sketch image in the given MemorySegment and refers to it directly. No data is copied onto the Java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

      Only sketches that have been explicitly stored as direct sketches can be wrapped.

      Wrapping any subclass of this class that is empty or contains only a single item will result in on-heap equivalent forms of empty and single item sketch respectively. This is actually faster and consumes less overall space.

      This method checks if the Default Update Seed was used to create the source MemorySegment image.

      Parameters:
      srcSeg - a MemorySegment with an image of a ThetaSketch.
      Returns:
      a read-only ThetaSketch backed by the given MemorySegment
    • wrap

      public static ThetaSketch wrap(MemorySegment srcSeg, long expectedSeed)
      Wrap takes the sketch image in the given MemorySegment and refers to it directly. No data is copied onto the Java heap. The wrap operation enables fast read-only merging and access to all the public read-only API.

      Only sketches that have been explicitly stored as direct sketches can be wrapped.

      Wrapping any subclass of this class that is empty or contains only a single item will result in on-heap equivalent forms of empty and single item sketch respectively. This is actually faster and consumes less overall space.

      This method checks if the given expectedSeed was used to create the source MemorySegment image.

      Parameters:
      srcSeg - a MemorySegment with an image of a ThetaSketch.
      expectedSeed - the seed used to validate the given MemorySegment image. See Update Hash Seed.
      Returns:
      a read-only ThetaSketch backed by the given MemorySegment.
    • compact

      public CompactThetaSketch compact()
      Converts this sketch to an ordered CompactThetaSketch.

      If this.isCompact() == true this method returns this, otherwise, this method is equivalent to compact(true, null).

      A CompactThetaSketch is always immutable.

      Returns:
      this sketch as an ordered CompactThetaSketch.
    • compact

      public abstract CompactThetaSketch compact(boolean dstOrdered, MemorySegment dstSeg)
      Converts this sketch to a CompactThetaSketch.

      If this sketch is a type of UpdatableThetaSketch, the compacting process converts the hash table of the UpdatableThetaSketch to a simple list of the valid hash values. Any hash values that are zero, or greater than or equal to theta, will be discarded. The number of valid values remaining in the CompactThetaSketch depends on a number of factors, but may be larger or smaller than Nominal Entries (or k). It will never exceed 2k. If it is critical to always limit the size to no more than k, then rebuild() should be called on the UpdatableThetaSketch prior to calling this method.

      A CompactThetaSketch is always immutable.

      A new CompactThetaSketch object is created:

      • if dstSeg != null
      • if dstSeg == null and this.hasMemorySegment() == true
      • if dstSeg == null and this has more than 1 item and this.isOrdered() == false and dstOrdered == true.

      Otherwise, this operation returns this.

      Parameters:
      dstOrdered - assumed true if this sketch is empty or has only one value. See Destination Ordered.
      dstSeg - See Destination MemorySegment.
      Returns:
      this sketch as a CompactThetaSketch.
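
      The compaction step described above can be sketched in plain Java. This is only an illustration of the filtering rule (drop zeros and values greater than or equal to theta, then optionally sort); the class CompactionSketch and the method compactHashes are hypothetical names and are not part of the library.

```java
import java.util.Arrays;

// Illustrative model only: not the library's implementation.
final class CompactionSketch {

  // Keeps every hash h with 0 < h < thetaLong, optionally sorted ascending.
  static long[] compactHashes(long[] hashTable, long thetaLong, boolean dstOrdered) {
    long[] out = new long[hashTable.length];
    int n = 0;
    for (long h : hashTable) {
      // Zero marks an empty slot; values >= theta are no longer valid.
      if (h > 0 && h < thetaLong) {
        out[n++] = h;
      }
    }
    long[] result = Arrays.copyOf(out, n);
    if (dstOrdered) {
      Arrays.sort(result);
    }
    return result;
  }

  public static void main(String[] args) {
    long[] table = {0L, 90L, 15L, 0L, 200L, 42L, 100L};
    // With theta = 100, the values 100 and 200 are discarded.
    System.out.println(Arrays.toString(compactHashes(table, 100L, true))); // [15, 42, 90]
  }
}
```

      With dstOrdered set to false the surviving values keep their hash-table order, which is why an unordered result can be cheaper to produce.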
    • getCompactBytes

      public abstract int getCompactBytes()
      Returns the number of storage bytes required for this ThetaSketch if its current state were compacted. If this sketch is already in the compact form this is equivalent to calling getCurrentBytes().
      Returns:
      number of compact bytes
    • getCountLessThanThetaLong

      public int getCountLessThanThetaLong(long thetaLong)
      Gets the number of hash values less than the given theta expressed as a long.
      Parameters:
      thetaLong - the given theta as a long between zero and Long.MAX_VALUE.
      Returns:
      the number of hash values less than the given thetaLong.
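
      As a rough model of what this count means, the following plain-Java sketch counts the retained hash values strictly below a given thetaLong. The class CountBelowTheta and its array-based input are assumptions made for illustration; the real method operates on the sketch's internal state.

```java
// Illustrative model only: not the library's implementation.
final class CountBelowTheta {

  // Counts non-zero hash values strictly less than thetaLong.
  static int countLessThan(long[] retainedHashes, long thetaLong) {
    int count = 0;
    for (long h : retainedHashes) {
      if (h > 0 && h < thetaLong) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    long[] hashes = {10L, 20L, 30L, 40L};
    System.out.println(countLessThan(hashes, 30L));            // 2
    System.out.println(countLessThan(hashes, Long.MAX_VALUE)); // 4
  }
}
```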
    • getCurrentBytes

      public abstract int getCurrentBytes()
      Returns the number of storage bytes required for this sketch in its current state.
      Returns:
      the number of storage bytes required for this sketch
    • getEstimate

      public abstract double getEstimate()
      Gets the unique count estimate.
      Returns:
      the sketch's best estimate of the cardinality of the input stream.
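
      A minimal sketch of how a theta-style estimate is commonly derived, assuming the estimate is the retained entry count divided by theta and that a thetaLong of Long.MAX_VALUE corresponds to theta = 1.0. The class ThetaEstimate is hypothetical and does not call the library.

```java
// Illustrative model only: not the library's implementation.
final class ThetaEstimate {

  // estimate = retainedEntries / theta, where theta = thetaLong / Long.MAX_VALUE.
  static double estimate(int retainedEntries, long thetaLong) {
    if (thetaLong == Long.MAX_VALUE) {
      return retainedEntries; // theta is 1.0, so the count is exact
    }
    double theta = thetaLong / (double) Long.MAX_VALUE;
    return retainedEntries / theta;
  }

  public static void main(String[] args) {
    System.out.println(estimate(1000, Long.MAX_VALUE)); // 1000.0
    // theta = 0.25: each retained entry stands for about four distinct inputs.
    System.out.println(estimate(4096, 1L << 61));       // 16384.0
  }
}
```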
    • getEstimate

      public static double getEstimate(MemorySegment srcSeg)
      Gets the estimate from the given MemorySegment
      Parameters:
      srcSeg - the given MemorySegment
      Returns:
      the result estimate
    • getFamily

      public abstract Family getFamily()
      Returns the Family that this sketch belongs to
      Returns:
      the Family that this sketch belongs to
    • getLowerBound

      public double getLowerBound(int numStdDev)
      Gets the approximate lower error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      Returns:
      the lower bound.
    • getMaxCompactSketchBytes

      public static int getMaxCompactSketchBytes(int numberOfEntries)
      Returns the maximum number of storage bytes required for a CompactThetaSketch with the given number of actual entries.
      Parameters:
      numberOfEntries - the actual number of retained entries stored in the sketch.
      Returns:
      the maximum number of storage bytes required for a CompactThetaSketch with the given number of retained entries.
    • getCompactSketchMaxBytes

      public static int getCompactSketchMaxBytes(int lgNomEntries)
      Returns the maximum number of storage bytes required for a CompactThetaSketch given the configured log_base2 of the number of nominal entries, which is a power of 2.
      Parameters:
      lgNomEntries - log_base2 of Nominal Entries
      Returns:
      the maximum number of storage bytes required for a CompactThetaSketch with the given lgNomEntries.
    • getMaxUpdateSketchBytes

      public static int getMaxUpdateSketchBytes(int nomEntries)
      Returns the maximum number of storage bytes required for an UpdatableThetaSketch with the given number of nominal entries (power of 2).
      Parameters:
      nomEntries - Nominal Entries. If this is not a power of 2, it will be rounded up to the next power of 2.
      Returns:
      the maximum number of storage bytes required for an UpdatableThetaSketch with the given nomEntries
    • getUpdateSketchMaxBytes

      public static int getUpdateSketchMaxBytes(int lgNomEntries)
      Returns the maximum number of storage bytes required for an UpdatableThetaSketch with the given log_base2 of the nominal entries.
      Parameters:
      lgNomEntries - log_base2 of Nominal Entries
      Returns:
      the maximum number of storage bytes required for an UpdatableThetaSketch with the given lgNomEntries
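
      The sizing methods above take either nominal entries or their log_base2. The following sketch shows how the two relate and what rounding up to the ceiling power of 2 means. The class NominalEntries and its helpers are hypothetical, inputs are assumed to be at most 2^30, and no byte-size formula is implied.

```java
// Illustrative helpers only: not part of the library.
final class NominalEntries {

  // Smallest power of 2 that is >= n (assumes 1 <= n <= 2^30).
  static int ceilingPowerOf2(int n) {
    if (n <= 1) {
      return 1;
    }
    int hi = Integer.highestOneBit(n);
    return (hi == n) ? n : hi << 1;
  }

  // log_base2 of a value that is already a power of 2.
  static int lgOf(int powerOf2) {
    return Integer.numberOfTrailingZeros(powerOf2);
  }

  public static void main(String[] args) {
    System.out.println(ceilingPowerOf2(4096));       // 4096
    System.out.println(ceilingPowerOf2(5000));       // 8192
    System.out.println(lgOf(ceilingPowerOf2(5000))); // 13
    System.out.println(1 << 12);                     // 4096
  }
}
```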
    • getRetainedEntries

      public int getRetainedEntries()
      Returns the number of valid entries that have been retained by the sketch. For the AlphaSketch this returns only valid entries.
      Returns:
      the number of valid retained entries.
    • getRetainedEntries

      public abstract int getRetainedEntries(boolean valid)
      Returns the number of entries that have been retained by the sketch.
      Parameters:
      valid - This parameter is only relevant for the AlphaSketch. If true, returns the number of valid entries, which are less than theta and used for estimation. Otherwise, returns the number of all entries, valid or not, that are currently in the internal sketch cache.
      Returns:
      the number of retained entries
    • getRetainedEntries

      public static int getRetainedEntries(MemorySegment srcSeg)
      Returns the number of valid entries that have been retained by the sketch from the given MemorySegment
      Parameters:
      srcSeg - the given MemorySegment that has an image of a ThetaSketch
      Returns:
      the number of valid retained entries
    • getSerializationVersion

      public static int getSerializationVersion(MemorySegment seg)
      Returns the serialization version from the given MemorySegment
      Parameters:
      seg - the sketch MemorySegment
      Returns:
      the serialization version from the MemorySegment
    • getTheta

      public double getTheta()
      Gets the value of theta as a double with a value between zero and one
      Returns:
      the value of theta as a double
    • getThetaLong

      public abstract long getThetaLong()
      Gets the value of theta as a long
      Returns:
      the value of theta as a long
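
      The two theta representations can be related as follows, assuming the long form is scaled so that Long.MAX_VALUE corresponds to 1.0. The class ThetaConversion is a hypothetical helper for illustration and is not part of the library.

```java
// Illustrative helpers only: not part of the library.
final class ThetaConversion {

  // Maps a thetaLong in [0, Long.MAX_VALUE] to a fraction in [0.0, 1.0].
  static double thetaAsDouble(long thetaLong) {
    return thetaLong / (double) Long.MAX_VALUE;
  }

  // Inverse mapping; the cast saturates at Long.MAX_VALUE for theta = 1.0.
  static long thetaAsLong(double theta) {
    return (long) (theta * Long.MAX_VALUE);
  }

  public static void main(String[] args) {
    System.out.println(thetaAsDouble(Long.MAX_VALUE)); // 1.0
    System.out.println(thetaAsDouble(1L << 62));       // 0.5
    System.out.println(thetaAsLong(1.0) == Long.MAX_VALUE); // true
  }
}
```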
    • getUpperBound

      public double getUpperBound(int numStdDev)
      Gets the approximate upper error bound given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      Returns:
      the upper bound.
    • isCompact

      public abstract boolean isCompact()
      Returns true if this sketch is in compact form.
      Returns:
      true if this sketch is in compact form.
    • isEmpty

      public abstract boolean isEmpty()
      Returns:
      true if empty.
    • isEstimationMode

      public boolean isEstimationMode()
      Returns true if the sketch is in Estimation Mode (as opposed to Exact Mode). This is true if theta < 1.0 AND isEmpty() is false.
      Returns:
      true if the sketch is in estimation mode.
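
      The stated rule can be written as a one-line predicate. The class EstimationMode below is a hypothetical stand-in that takes theta and the empty flag as plain arguments rather than reading them from a sketch.

```java
// Illustrative model only: not the library's implementation.
final class EstimationMode {

  // Estimation mode requires theta < 1.0 and a non-empty sketch.
  static boolean isEstimationMode(double theta, boolean isEmpty) {
    return theta < 1.0 && !isEmpty;
  }

  public static void main(String[] args) {
    System.out.println(isEstimationMode(1.0, false)); // false (exact mode)
    System.out.println(isEstimationMode(0.5, false)); // true
    System.out.println(isEstimationMode(0.5, true));  // false (empty)
  }
}
```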
    • isOrdered

      public abstract boolean isOrdered()
      Returns true if internal cache is ordered
      Returns:
      true if internal cache is ordered
    • iterator

      public abstract HashIterator iterator()
      Returns a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
      Returns:
      a HashIterator that can be used to iterate over the retained hash values of the Theta sketch.
    • toByteArray

      public abstract byte[] toByteArray()
      Serialize this sketch to a byte array form.
      Returns:
      byte array of this sketch
    • toString

      public String toString()
      Returns a human readable summary of the sketch. This method is equivalent to the parameterized call:
      toString(true, false, 8, true);
      Overrides:
      toString in class Object
      Returns:
      summary
    • toString

      public String toString(boolean sketchSummary, boolean dataDetail, int width, boolean hexMode)
      Gets a human readable listing of contents and summary of this sketch. This can be a very long string. If this sketch is in a "dirty" state there may be values in the dataDetail view that are ≥ theta.
      Parameters:
      sketchSummary - If true the sketch summary will be output at the end.
      dataDetail - If true, includes all valid hash values in the sketch.
      width - The number of columns of hash values. Default is 8.
      hexMode - If true, hashes will be output in hex.
      Returns:
      The result string, which can be very long.
    • toString

      public static String toString(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of a ThetaSketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human readable string of the preamble of a byte array image of a ThetaSketch.
    • toString

      public static String toString(MemorySegment seg)
      Returns a human readable string of the preamble of a MemorySegment image of a ThetaSketch.
      Parameters:
      seg - the given MemorySegment object
      Returns:
      a human readable string of the preamble of a MemorySegment image of a ThetaSketch.
    • getLowerBound

      public static double getLowerBound(int numStdDev, MemorySegment srcSeg)
      Gets the approximate lower error bound from a valid MemorySegment image of a ThetaSketch given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      srcSeg - the source MemorySegment
      Returns:
      the lower bound.
    • getUpperBound

      public static double getUpperBound(int numStdDev, MemorySegment srcSeg)
      Gets the approximate upper error bound from a valid MemorySegment image of a ThetaSketch given the specified number of Standard Deviations. This will return getEstimate() if isEmpty() is true.
      Parameters:
      numStdDev - See Number of Standard Deviations
      srcSeg - the source MemorySegment
      Returns:
      the upper bound.