Class ReservoirItemsUnion<T>

  • Type Parameters:
    T - The specific Java type for this sketch

    public final class ReservoirItemsUnion<T>
    extends Object
    Class to union reservoir samples of generic items.

    For efficiency reasons, the unioning process picks one of the two sketches to use as the base. As a result, we provide only a stateful union. Using the same approach for a merge would result in unpredictable side effects on the underlying sketches.

    A union object is created with a maximum value of k, represented using the ReservoirSize class. The unioning process may cause the actual number of samples to fall below that maximum value, but never to exceed it. The result of a union will be a reservoir where each item from the global input has a uniform probability of selection, but there are no claims about higher order statistics. For instance, in general, not all permutations of the global input are equally likely.
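
    The uniform-selection property can be illustrated with a single-stream reservoir. The sketch below uses the textbook Algorithm R with only the Java standard library. The names ReservoirSketchDemo and sample are hypothetical, and nothing here is claimed to be how ReservoirItemsSketch or this union is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSketchDemo {
    /** Algorithm R: retains at most k items; each of the n items seen is kept with probability k/n. */
    static <T> List<T> sample(Iterable<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        long n = 0; // number of items seen so far
        for (T item : stream) {
            n++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill phase: keep everything until the reservoir is full
            } else {
                // Pick a slot in [0, n); replace only if it lands inside the reservoir.
                long slot = (long) (rng.nextDouble() * n);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(i);
        }
        List<Integer> result = sample(input, 10, new Random(42));
        System.out.println(result.size()); // 10
    }
}
```

    The reservoir size is min(n, k), which matches the statement that the number of samples never exceeds the configured maximum.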

    When taking the union of two reservoirs of different sizes, the output will contain no more than MIN(k_1, k_2) samples.
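
    To make the MIN(k_1, k_2) bound concrete, here is a minimal stdlib-only sketch of one textbook way to merge two reservoirs: each output slot is drawn from A or B in proportion to the number of input items each side still represents. This is an illustrative substitute, not the library's algorithm (which, as noted above, reuses one sketch as the base). The names ReservoirMergeDemo and merge are hypothetical, and the sketch assumes each sample list holds exactly min(k, n) items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirMergeDemo {
    /**
     * Merges two reservoirs. samplesA was drawn from nA items with capacity kA (likewise for B).
     * Assumes samplesA.size() == min(kA, nA) and samplesB.size() == min(kB, nB).
     * The inputs are copied, so the caller's lists are not modified.
     */
    static <T> List<T> merge(List<T> samplesA, long nA, int kA,
                             List<T> samplesB, long nB, int kB, Random rng) {
        List<T> a = new ArrayList<>(samplesA);
        List<T> b = new ArrayList<>(samplesB);
        // Output capacity is the smaller k, and never more than the total items seen.
        int outK = (int) Math.min(Math.min(kA, kB), nA + nB);
        List<T> out = new ArrayList<>(outK);
        long remA = nA; // input items still represented by A
        long remB = nB; // input items still represented by B
        for (int i = 0; i < outK; i++) {
            boolean fromA = remB == 0
                || (remA > 0 && rng.nextDouble() * (remA + remB) < remA);
            if (fromA) {
                out.add(a.remove(rng.nextInt(a.size())));
                remA--;
            } else {
                out.add(b.remove(rng.nextInt(b.size())));
                remB--;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>();
        for (int i = 0; i < 20; i++) a.add(i);        // reservoir with k = 20
        List<Integer> b = new ArrayList<>();
        for (int i = 100; i < 105; i++) b.add(i);     // reservoir with k = 5
        List<Integer> merged = merge(a, 1000, 20, b, 500, 5, new Random(1));
        System.out.println(merged.size()); // 5
    }
}
```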

    Author:
    Jon Malkin, Kevin Lang
    • Method Detail

      • newInstance

        public static <T> ReservoirItemsUnion<T> newInstance(int maxK)
        Creates an empty Union with a maximum reservoir capacity of maxK.
        Type Parameters:
        T - The type of item this sketch contains
        Parameters:
        maxK - The maximum allowed reservoir capacity for any sketches in the union
        Returns:
        A new ReservoirItemsUnion
      • heapify

        public static <T> ReservoirItemsUnion<T> heapify(org.apache.datasketches.memory.Memory srcMem,
                                                         ArrayOfItemsSerDe<T> serDe)
        Instantiates a Union from Memory.
        Type Parameters:
        T - The type of item this sketch contains
        Parameters:
        srcMem - Memory object containing a serialized union
        serDe - An instance of ArrayOfItemsSerDe
        Returns:
        A ReservoirItemsUnion created from the provided Memory
      • getMaxK

        public int getMaxK()
        Returns the maximum allowed reservoir capacity in this union. The current reservoir capacity may be lower.
        Returns:
        The maximum allowed reservoir capacity in this union.
      • update

        public void update(ReservoirItemsSketch<T> sketchIn)
        Unions the given sketch. This method can be called repeatedly. If the given sketch is null, it is interpreted as an empty sketch.
        Parameters:
        sketchIn - The incoming sketch.
      • update

        public void update(org.apache.datasketches.memory.Memory mem,
                           ArrayOfItemsSerDe<T> serDe)
        Unions the given Memory image of the sketch.

        This method can be called repeatedly. If the given Memory is null, it is interpreted as an empty sketch.

        Parameters:
        mem - Memory image of sketch to be merged
        serDe - An instance of ArrayOfItemsSerDe
      • update

        public void update(T datum)
        Presents a single item to be added to this union.
        Parameters:
        datum - The given datum of type T.
      • update

        public void update(long n,
                           int k,
                           ArrayList<T> input)
        Presents this union with the raw elements of a sketch. Useful when operating in a distributed environment such as Pig Latin scripts, where an explicit SerDe may be overly complicated but keeping raw values is simple. The values are not copied, and the input list may be modified.
        Parameters:
        n - Total items seen
        k - Reservoir size
        input - Reservoir samples
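
        Because the values are not copied and the input may be modified, a caller that still needs its own list can hand over a shallow copy. The sketch below shows the pattern with only the standard library. RawSamplesDemo and consumeInPlace are hypothetical stand-ins for any consumer that keeps and may change the list it receives; they are not part of this API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RawSamplesDemo {
    /** Hypothetical stand-in for a consumer that keeps and may modify the list it is given. */
    static <T> ArrayList<T> consumeInPlace(ArrayList<T> input) {
        Collections.reverse(input); // any in-place change would do for this demo
        return input;               // the same list instance is retained, not a copy
    }

    public static void main(String[] args) {
        ArrayList<String> samples = new ArrayList<>(List.of("a", "b", "c"));
        // Hand over a shallow copy so the caller's list is left untouched.
        ArrayList<String> handedOver = consumeInPlace(new ArrayList<>(samples));
        System.out.println(samples);    // [a, b, c]
        System.out.println(handedOver); // [c, b, a]
    }
}
```

        The copy is shallow: the list is duplicated, but the items themselves are still shared.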
      • getResult

        public ReservoirItemsSketch<T> getResult()
        Returns a sketch representing the current state of the union.
        Returns:
        The result of any unions already processed.
      • toByteArray

        public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
        Returns a byte array representation of this union.
        Parameters:
        serDe - An instance of ArrayOfItemsSerDe
        Returns:
        A byte array representation of this union
      • toString

        public String toString()
        Returns a human-readable summary of the sketch, without items.
        Overrides:
        toString in class Object
        Returns:
        A string version of the sketch summary
      • toByteArray

        public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe,
                                  Class<?> clazz)
        Returns a byte array representation of this union. This method should be used when the array elements are subclasses of a common base class.
        Parameters:
        serDe - An instance of ArrayOfItemsSerDe
        clazz - A class to which the items are cast before serialization
        Returns:
        A byte array representation of this union
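
        Why an explicit class matters for mixed subclasses can be seen with plain Java arrays. The sketch below assumes, without confirmation from this page, that serialization needs a typed array of the items. It shows that an array typed from the first item's runtime class rejects a sibling subclass, while an array typed from the common base class accepts both. TypedArrayDemo and toTypedArray are hypothetical names, not library methods.

```java
import java.lang.reflect.Array;
import java.util.List;

class TypedArrayDemo {
    /** Copies items into a new array whose runtime component type is clazz. */
    @SuppressWarnings("unchecked")
    static <T> T[] toTypedArray(List<? extends T> items, Class<?> clazz) {
        T[] out = (T[]) Array.newInstance(clazz, items.size());
        for (int i = 0; i < items.size(); i++) {
            out[i] = items.get(i); // throws ArrayStoreException if the item is not an instance of clazz
        }
        return out;
    }

    public static void main(String[] args) {
        List<Number> mixed = List.of(1, 2.5); // an Integer and a Double
        Number[] ok = TypedArrayDemo.<Number>toTypedArray(mixed, Number.class);
        System.out.println(ok.length); // 2
        try {
            // Using the first item's class yields an Integer[], which cannot hold the Double.
            TypedArrayDemo.<Number>toTypedArray(mixed, mixed.get(0).getClass());
        } catch (ArrayStoreException e) {
            System.out.println("Integer[] cannot hold a Double");
        }
    }
}
```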