Class ReservoirItemsUnion<T>

java.lang.Object
org.apache.datasketches.sampling.ReservoirItemsUnion<T>
Type Parameters:
T - The specific Java type for this sketch

public final class ReservoirItemsUnion<T> extends Object
Class to union reservoir samples of generic items.

For efficiency reasons, the unioning process picks one of the two sketches to use as the base. As a result, we provide only a stateful union. Using the same approach for a merge would result in unpredictable side effects on the underlying sketches.

A union object is created with a maximum value of k, represented using the ReservoirSize class. The unioning process may cause the actual number of samples to fall below that maximum value, but never to exceed it. The result of a union will be a reservoir where each item from the global input has a uniform probability of selection, but there are no claims about higher-order statistics. For instance, in general, not all possible permutations of the global input are equally likely.

When two reservoirs of different sizes are unioned, the output will contain no more than MIN(k_1, k_2) samples.
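The MIN(k_1, k_2) bound and the uniform-selection property can be illustrated with a small, standard-library-only sketch. This is a simplified illustration and not the DataSketches implementation: the class `SimpleReservoirUnion`, its `union` method, and all parameter names are hypothetical. It assumes the two samples come from disjoint streams and that each sample holds min(k, n) items for its own stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only; not the DataSketches implementation. */
final class SimpleReservoirUnion {

  /**
   * Merges two uniform without-replacement samples taken from disjoint streams.
   * Assumes each sample holds min(k, n) items for its own stream.
   */
  static <T> List<T> union(List<T> sampleA, long nA, int kA,
                           List<T> sampleB, long nB, int kB, Random rng) {
    int k = Math.min(kA, kB); // the result never exceeds the smaller capacity

    // Shuffle copies so that removing the last element yields a uniformly random item.
    List<T> a = new ArrayList<>(sampleA);
    List<T> b = new ArrayList<>(sampleB);
    Collections.shuffle(a, rng);
    Collections.shuffle(b, rng);

    long remA = nA; // stream items on side A not yet drawn
    long remB = nB; // stream items on side B not yet drawn
    List<T> out = new ArrayList<>(k);
    while (out.size() < k && remA + remB > 0) {
      // Pick a side with probability proportional to its remaining stream items,
      // which mimics drawing without replacement from the combined stream.
      long pick = (long) (rng.nextDouble() * (remA + remB));
      if (pick < remA) {
        out.add(a.remove(a.size() - 1));
        remA--;
      } else {
        out.add(b.remove(b.size() - 1));
        remB--;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> a = Arrays.asList(3, 14, 15, 92, 65); // k = 5 sample of a 50-item stream
    List<Integer> b = Arrays.asList(100, 101, 102);     // k = 3 sample of a 30-item stream
    List<Integer> merged = union(a, 50L, 5, b, 30L, 3, new Random(42));
    System.out.println(merged.size() + " items: " + merged); // size is min(5, 3) = 3
  }
}
```

Weighting each draw by the number of stream items remaining on each side, rather than by the sample sizes, is what keeps every item of the combined input equally likely to be selected in this sketch.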

Author:
Jon Malkin, Kevin Lang
  • Method Details

    • newInstance

      public static <T> ReservoirItemsUnion<T> newInstance(int maxK)
      Creates an empty Union with a maximum reservoir capacity of maxK.
      Type Parameters:
      T - The type of item this sketch contains
      Parameters:
      maxK - The maximum allowed reservoir capacity for any sketches in the union
      Returns:
      A new ReservoirItemsUnion
    • heapify

      public static <T> ReservoirItemsUnion<T> heapify(org.apache.datasketches.memory.Memory srcMem, ArrayOfItemsSerDe<T> serDe)
      Instantiates a Union from Memory.
      Type Parameters:
      T - The type of item this sketch contains
      Parameters:
      srcMem - Memory object containing a serialized union
      serDe - An instance of ArrayOfItemsSerDe
      Returns:
      A ReservoirItemsUnion created from the provided Memory
    • getMaxK

      public int getMaxK()
      Returns the maximum allowed reservoir capacity in this union. The current reservoir capacity may be lower.
      Returns:
      The maximum allowed reservoir capacity in this union.
    • update

      public void update(ReservoirItemsSketch<T> sketchIn)
      Unions the given sketch into this union. This method can be called repeatedly. If the given sketch is null, it is interpreted as an empty sketch.
      Parameters:
      sketchIn - The incoming sketch.
    • update

      public void update(org.apache.datasketches.memory.Memory mem, ArrayOfItemsSerDe<T> serDe)
      Unions the given Memory image of a sketch into this union.

      This method can be called repeatedly. If the given Memory is null, it is interpreted as an empty sketch.

      Parameters:
      mem - Memory image of sketch to be merged
      serDe - An instance of ArrayOfItemsSerDe
    • update

      public void update(T datum)
      Present this union with a single item to be added to the union.
      Parameters:
      datum - The given datum of type T.
    • update

      public void update(long n, int k, ArrayList<T> input)
      Present this union with the raw elements of a sketch. Useful when operating in a distributed environment like Pig Latin scripts, where an explicit SerDe may be overly complicated but keeping raw values is simple. Values are not copied, and the input list may be modified.
      Parameters:
      n - Total items seen
      k - Reservoir size
      input - Reservoir samples
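To make the three raw parameters concrete, here is a minimal, standard-library-only sketch of a reservoir that tracks exactly those values: total items seen, reservoir size, and the current samples. The class `RawReservoir` and its `offer` method are hypothetical names used for illustration only. The sketch uses the classic "Algorithm R" replacement rule, which is not necessarily how the library maintains its reservoirs.

```java
import java.util.ArrayList;
import java.util.Random;

/** Illustrative only: a bare-bones reservoir that exposes n, k, and its samples. */
final class RawReservoir<T> {
  final int k;                 // reservoir size
  long n = 0;                  // total items seen
  final ArrayList<T> samples;  // current reservoir samples
  private final Random rng;

  RawReservoir(int k, long seed) {
    this.k = k;
    this.samples = new ArrayList<>(k);
    this.rng = new Random(seed);
  }

  /** Offers one stream item using the classic "Algorithm R" replacement rule. */
  void offer(T item) {
    n++;
    if (samples.size() < k) {
      samples.add(item); // still filling: keep everything
    } else {
      long j = (long) (rng.nextDouble() * n); // uniform in [0, n)
      if (j < k) {
        samples.set((int) j, item); // replace a slot with probability k / n
      }
    }
  }

  public static void main(String[] args) {
    RawReservoir<String> r = new RawReservoir<>(4, 7L);
    for (int i = 0; i < 1000; i++) {
      r.offer("item-" + i);
    }
    System.out.println("n=" + r.n + " k=" + r.k + " samples=" + r.samples);
  }
}
```

In this sketch, `r.n`, `r.k`, and `r.samples` correspond to the `n`, `k`, and `input` parameters described above. Because the description states that values are not copied and the input may be modified, a caller that still needs the list afterwards would likely want to pass a copy.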
    • getResult

      public ReservoirItemsSketch<T> getResult()
      Returns a sketch representing the current state of the union.
      Returns:
      The result of any unions already processed.
    • toByteArray

      public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
      Returns a byte array representation of this union.
      Parameters:
      serDe - An instance of ArrayOfItemsSerDe
      Returns:
      A byte array representation of this union
    • toString

      public String toString()
      Returns a human-readable summary of the sketch, without items.
      Overrides:
      toString in class Object
      Returns:
      A string version of the sketch summary
    • toByteArray

      public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe, Class<?> clazz)
      Returns a byte array representation of this union. This method should be used when the array elements are subclasses of a common base class.
      Parameters:
      serDe - An instance of ArrayOfItemsSerDe
      clazz - A class to which the items are cast before serialization
      Returns:
      A byte array representation of this union