Class ReservoirItemsSketch<T>

java.lang.Object
org.apache.datasketches.sampling.ReservoirItemsSketch<T>
Type Parameters:
T - The type of object held in the reservoir.

public final class ReservoirItemsSketch<T> extends Object
This sketch provides a reservoir sample over an input stream of items. The sketch contains a uniform random sample of unweighted items from the stream.
Author:
Jon Malkin, Kevin Lang
  • Method Summary

    Modifier and Type
    Method
    Description
    SampleSubsetSummary
    estimateSubsetSum(Predicate<T> predicate)
    Computes an estimated subset sum from the entire stream for objects matching a given predicate.
    int
    getK()
    Returns the sketch's value of k, the maximum number of samples stored in the reservoir.
    long
    getN()
    Returns the number of items processed from the input stream.
    int
    getNumSamples()
    Returns the current number of items in the reservoir, which may be smaller than the reservoir capacity.
    T[]
    getSamples()
    Returns a copy of the items in the reservoir, or null if empty.
    T[]
    getSamples(Class<?> clazz)
    Returns a copy of the items in the reservoir as members of Class clazz, or null if empty.
    static <T> ReservoirItemsSketch<T>
    heapify(org.apache.datasketches.memory.Memory srcMem, ArrayOfItemsSerDe<T> serDe)
    Returns a sketch instance of this class from the given srcMem, which must be a Memory representation of this sketch class.
    static <T> ReservoirItemsSketch<T>
    newInstance(int k)
    Construct a mergeable sampling sketch with up to k samples using the default resize factor (8).
    static <T> ReservoirItemsSketch<T>
    newInstance(int k, ResizeFactor rf)
    Construct a mergeable sampling sketch with up to k samples using a specified resize factor.
    void
    reset()
    Resets this sketch to the empty state, but retains the original value of k.
    byte[]
    toByteArray(ArrayOfItemsSerDe<? super T> serDe)
    Returns a byte array representation of this sketch.
    byte[]
    toByteArray(ArrayOfItemsSerDe<? super T> serDe, Class<?> clazz)
    Returns a byte array representation of this sketch.
    String
    toString()
    Returns a human-readable summary of the sketch, without items.
    static String
    toString(byte[] byteArr)
    Returns a human-readable string of the preamble of a byte array image of a ReservoirItemsSketch.
    static String
    toString(org.apache.datasketches.memory.Memory mem)
    Returns a human-readable string of the preamble of a Memory image of a ReservoirItemsSketch.
    void
    update(T item)
    Randomly decide whether or not to include an item in the sample set.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Method Details

    • newInstance

      public static <T> ReservoirItemsSketch<T> newInstance(int k)
      Construct a mergeable sampling sketch with up to k samples using the default resize factor (8).
      Type Parameters:
      T - The type of object held in the reservoir.
      Parameters:
      k - Maximum size of sampling. Allocated size may be smaller until reservoir fills. Unlike many sketches in this package, this value does not need to be a power of 2.
      Returns:
      A ReservoirItemsSketch initialized with maximum size k and the default resize factor.
    • newInstance

      public static <T> ReservoirItemsSketch<T> newInstance(int k, ResizeFactor rf)
      Construct a mergeable sampling sketch with up to k samples using a specified resize factor.
      Type Parameters:
      T - The type of object held in the reservoir.
      Parameters:
      k - Maximum size of sampling. Allocated size may be smaller until reservoir fills. Unlike many sketches in this package, this value does not need to be a power of 2.
      rf - See Resize Factor
      Returns:
      A ReservoirItemsSketch initialized with maximum size k and resize factor rf.
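The two newInstance variants note that the allocated size may be smaller than k until the reservoir fills, and that growth is governed by a resize factor. The following is a rough sketch of what geometric growth capped at k could look like. The class ResizeFactorSketch, the method capacitySteps, and the initial allocation passed in are hypothetical and chosen for illustration only; the library's actual starting size and growth rule may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the library. */
final class ResizeFactorSketch {

  /**
   * Lists the successive capacities of a buffer that starts at {@code initial}
   * (an assumed value) and multiplies by {@code factor} until it reaches k.
   */
  static List<Integer> capacitySteps(int initial, int k, int factor) {
    if (initial < 1 || k < 1 || factor < 2) {
      throw new IllegalArgumentException("initial and k must be >= 1, factor >= 2");
    }
    List<Integer> steps = new ArrayList<>();
    long capacity = Math.min(initial, k);   // never allocate more than k
    steps.add((int) capacity);
    while (capacity < k) {
      capacity = Math.min(capacity * factor, (long) k);  // grow, capped at k
      steps.add((int) capacity);
    }
    return steps;
  }

  public static void main(String[] args) {
    // Assumed initial allocation of 16 slots, k = 1000, factor 8.
    System.out.println(capacitySteps(16, 1000, 8));
  }
}
```

Under these assumptions a larger factor means fewer reallocations at the cost of more unused space early on, and k itself is used as the final capacity even when it is not a power of 2.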
    • heapify

      public static <T> ReservoirItemsSketch<T> heapify(org.apache.datasketches.memory.Memory srcMem, ArrayOfItemsSerDe<T> serDe)
      Returns a sketch instance of this class from the given srcMem, which must be a Memory representation of this sketch class.
      Type Parameters:
      T - The type of item this sketch contains
      Parameters:
      srcMem - a Memory representation of a sketch of this class. See Memory
      serDe - An instance of ArrayOfItemsSerDe
      Returns:
      a sketch instance of this class
    • getK

      public int getK()
      Returns the sketch's value of k, the maximum number of samples stored in the reservoir. The current number of items in the sketch may be lower.
      Returns:
      k, the maximum number of samples in the reservoir
    • getN

      public long getN()
      Returns the number of items processed from the input stream.
      Returns:
      n, the number of stream items the sketch has seen
    • getNumSamples

      public int getNumSamples()
      Returns the current number of items in the reservoir, which may be smaller than the reservoir capacity.
      Returns:
      the number of items currently in the reservoir
    • update

      public void update(T item)
      Randomly decide whether or not to include an item in the sample set.
      Parameters:
      item - a unit-weight (equivalently, unweighted) item of the set being sampled from
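The random inclusion decision made by update can be illustrated with the classic reservoir sampling scheme: fill the reservoir until it holds k items, then keep the n-th item with probability k/n and evict a uniformly chosen slot. This is a minimal sketch assuming that textbook scheme; SimpleReservoir is a hypothetical stand-in class, not the library's implementation, and the library's internals may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for illustration; not the library's class. */
final class SimpleReservoir<T> {
  private final int k;              // maximum number of samples
  private final List<T> samples;    // current reservoir contents
  private final Random random;
  private long n = 0;               // number of items seen so far

  SimpleReservoir(int k, long seed) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be at least 1");
    }
    this.k = k;
    this.samples = new ArrayList<>();
    this.random = new Random(seed);
  }

  /** Offers one unit-weight item to the reservoir. */
  void update(T item) {
    n++;
    if (samples.size() < k) {
      samples.add(item);            // reservoir not yet full: always keep
      return;
    }
    // Keep the new item with probability k/n, replacing a uniformly chosen slot.
    if (random.nextDouble() * n < k) {
      samples.set(random.nextInt(k), item);
    }
  }

  int getK() { return k; }
  long getN() { return n; }
  int getNumSamples() { return samples.size(); }
  List<T> getSamples() { return new ArrayList<>(samples); }

  public static void main(String[] args) {
    SimpleReservoir<Integer> reservoir = new SimpleReservoir<>(10, 42L);
    for (int i = 0; i < 1000; i++) {
      reservoir.update(i);
    }
    System.out.println(reservoir.getN() + " seen, " + reservoir.getNumSamples() + " kept");
  }
}
```

In this sketch the number of samples equals min(n, k), which mirrors the documented relationship between getN, getK, and getNumSamples.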
    • reset

      public void reset()
      Resets this sketch to the empty state, but retains the original value of k.
    • getSamples

      public T[] getSamples()
      Returns a copy of the items in the reservoir, or null if empty. The returned array length may be smaller than the reservoir capacity.

      To allocate an array of generic type T, this method uses the class of the first item in the reservoir. It may therefore throw an ArrayStoreException if the reservoir stores instances of a polymorphic base class.

      Returns:
      A copy of the reservoir array
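The caveat about polymorphic base classes can be reproduced in plain Java. The sketch below allocates an array from the runtime class of the first element, as the description above states, and then copies the remaining elements into it. FirstItemArrayDemo and copyUsingFirstItemClass are hypothetical names for illustration and are not taken from the library.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of the library. */
final class FirstItemArrayDemo {

  /** Copies items into an array whose component type is the class of the first item. */
  @SuppressWarnings("unchecked")
  static <T> T[] copyUsingFirstItemClass(List<T> items) {
    if (items.isEmpty()) {
      return null;
    }
    Class<?> clazz = items.get(0).getClass();
    T[] out = (T[]) Array.newInstance(clazz, items.size());
    for (int i = 0; i < items.size(); i++) {
      // The JVM checks each store against the array's runtime component type.
      out[i] = items.get(i);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Number> mixed = Arrays.<Number>asList(1, 2.5);  // an Integer, then a Double
    try {
      copyUsingFirstItemClass(mixed);
    } catch (ArrayStoreException e) {
      System.out.println("Cannot store a Double in an Integer[]: " + e.getMessage());
    }
  }
}
```

When every item shares the first item's concrete class the copy succeeds; the failure appears only once a sibling subclass is present.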
    • getSamples

      public T[] getSamples(Class<?> clazz)
      Returns a copy of the items in the reservoir as members of Class clazz, or null if empty. The returned array length may be smaller than the reservoir capacity.

      This method allocates an array of class clazz, which must either match or extend T. Use it when the items in the reservoir are all instances of T but do not necessarily share a single concrete class.

      Parameters:
      clazz - A class to which the items are cast before returning
      Returns:
      A copy of the reservoir array
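Passing the component class explicitly avoids guessing it from the stored items. A minimal plain-Java sketch of that idea follows; TypedArrayCopyDemo and copyAs are hypothetical names for illustration, not the library's code.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of the library. */
final class TypedArrayCopyDemo {

  /** Copies items into a new array whose component type is the given class. */
  @SuppressWarnings("unchecked")
  static <T> T[] copyAs(List<T> items, Class<?> clazz) {
    if (items.isEmpty()) {
      return null;
    }
    T[] out = (T[]) Array.newInstance(clazz, items.size());
    for (int i = 0; i < items.size(); i++) {
      out[i] = items.get(i);  // succeeds when every item is an instance of clazz
    }
    return out;
  }

  public static void main(String[] args) {
    List<Number> mixed = Arrays.<Number>asList(1, 2.5);  // an Integer and a Double
    Number[] copy = copyAs(mixed, Number.class);
    System.out.println(copy.getClass().getSimpleName() + " of length " + copy.length);
  }
}
```

The same mixed list would fail if Integer.class were passed instead, because the Double could not be stored in an Integer[].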
    • toString

      public String toString()
      Returns a human-readable summary of the sketch, without items.
      Overrides:
      toString in class Object
      Returns:
      A string version of the sketch summary
    • toString

      public static String toString(byte[] byteArr)
      Returns a human-readable string of the preamble of a byte array image of a ReservoirItemsSketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human-readable string of the preamble of a byte array image of a ReservoirItemsSketch.
    • toString

      public static String toString(org.apache.datasketches.memory.Memory mem)
      Returns a human-readable string of the preamble of a Memory image of a ReservoirItemsSketch.
      Parameters:
      mem - the given Memory
      Returns:
      a human-readable string of the preamble of a Memory image of a ReservoirItemsSketch.
    • toByteArray

      public byte[] toByteArray(ArrayOfItemsSerDe<? super T> serDe)
      Returns a byte array representation of this sketch. May fail for polymorphic item types.
      Parameters:
      serDe - An instance of ArrayOfItemsSerDe
      Returns:
      a byte array representation of this sketch
    • toByteArray

      public byte[] toByteArray(ArrayOfItemsSerDe<? super T> serDe, Class<?> clazz)
      Returns a byte array representation of this sketch. Copies contents into an array of the specified class for serialization to allow for polymorphic types.
      Parameters:
      serDe - An instance of ArrayOfItemsSerDe
      clazz - The class represented by <T>
      Returns:
      a byte array representation of this sketch
    • estimateSubsetSum

      public SampleSubsetSummary estimateSubsetSum(Predicate<T> predicate)
      Computes an estimated subset sum from the entire stream for objects matching a given predicate. Provides a lower bound, estimate, and upper bound using a target of 2 standard deviations.

      This is technically a heuristic method, and tries to err on the conservative side.

      Parameters:
      predicate - A predicate to use when identifying items.
      Returns:
      A summary object containing the estimate, upper and lower bounds, and the total sketch weight.
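To make the idea of a subset-sum estimate concrete, here is a naive plain-Java sketch. It scales the fraction of matching samples up to the stream length n and brackets it with two standard deviations of a simple binomial approximation. This is an assumption made for illustration: the class SubsetSumSketch, its estimate method, and the bound formula are hypothetical, and the library's own heuristic is described only as conservative, so its bounds will likely differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the library's bound computation. */
final class SubsetSumSketch {

  /** Returns {lowerBound, estimate, upperBound, totalWeight} for a uniform sample. */
  static <T> double[] estimate(List<T> samples, long n, Predicate<T> predicate) {
    int m = samples.size();
    if (m == 0) {
      return new double[] {0.0, 0.0, 0.0, 0.0};
    }
    long matches = samples.stream().filter(predicate).count();
    if (n <= m) {
      // The sample holds the whole stream, so the count is exact.
      return new double[] {matches, matches, matches, n};
    }
    double p = (double) matches / m;              // matching fraction in the sample
    double sigma = Math.sqrt(p * (1.0 - p) / m);  // naive binomial standard deviation
    double lower = Math.max(0.0, p - 2.0 * sigma) * n;
    double upper = Math.min(1.0, p + 2.0 * sigma) * n;
    return new double[] {lower, p * n, upper, n};
  }

  public static void main(String[] args) {
    List<Integer> samples = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
    double[] result = estimate(samples, 100L, x -> x % 2 == 0);
    System.out.println(Arrays.toString(result));
  }
}
```

Note that this naive formula collapses to zero-width bounds when no sample, or every sample, matches the predicate, which is one reason a production implementation would use a more conservative rule.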