Class Partitioner<T, S extends QuantilesGenericAPI<T> & PartitioningFeature<T>>

  • Type Parameters:
    T - the data type
    S - the quantiles sketch that implements both QuantilesGenericAPI and PartitioningFeature.

    public class Partitioner<T, S extends QuantilesGenericAPI<T> & PartitioningFeature<T>>
    extends Object
    A partitioning process that can partition very large data sets into thousands of partitions of approximately the same size.

    This code works well for moderate-sized partitioning tasks. For example, the test code in the test branch split a data set of 1 billion items into 324 partitions of about 3M items each in under 3 minutes on a single CPU. For much larger partitioning tasks, it is recommended that this code be adapted to run in a parallelized systems environment.

    • Constructor Detail

      • Partitioner

        public Partitioner(long tgtPartitionSize,
                           int maxPartsPerPass,
                           SketchFillRequest<T, S> fillReq)
        This constructor assumes a QuantileSearchCriteria of INCLUSIVE.
        Parameters:
        tgtPartitionSize - the target size of the resulting partitions in number of items.
        maxPartsPerPass - the maximum number of partitions to request from the sketch. The smaller this number, the smaller the variance of the resulting partitions will be, but the more passes over the source data set will be needed.
        fillReq - an implementation of the SketchFillRequest call-back interface, supplied by the user.
      • Partitioner

        public Partitioner(long tgtPartitionSize,
                           int maxPartsPerSk,
                           SketchFillRequest<T, S> fillReq,
                           QuantileSearchCriteria criteria)
        This constructor includes the QuantileSearchCriteria criteria as a parameter.
        Parameters:
        tgtPartitionSize - the target size of the resulting partitions in number of items.
        maxPartsPerSk - the maximum number of partitions to request from the sketch. The smaller this number, the smaller the variance of the resulting partitions will be, but the more passes over the source data set will be needed.
        fillReq - an implementation of the SketchFillRequest call-back interface, supplied by the user.
        criteria - This is the desired QuantileSearchCriteria to be used.
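
The trade-off between the maximum partitions per pass and the number of passes can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: it assumes the number of passes is chosen as ceil(log(parts) / log(maxPartsPerPass)) and that each pass splits its input into round(parts^(1/passes)) pieces. The class and method names are hypothetical and are not part of this API, and the planning rule is not confirmed by this page.

```java
// Hypothetical planning arithmetic; an assumption for illustration, not the documented behavior.
class PartitionPlanSketch {

  /** Estimated number of partitions: ceil(totalItems / tgtPartitionSize), at least 1. */
  static long estimatedParts(long totalItems, long tgtPartitionSize) {
    if (tgtPartitionSize < 1) {
      throw new IllegalArgumentException("tgtPartitionSize must be >= 1");
    }
    return Math.max(1L, (long) Math.ceil((double) totalItems / tgtPartitionSize));
  }

  /** Assumed number of passes so that no single pass requests more than maxPartsPerPass splits. */
  static int levels(long parts, int maxPartsPerPass) {
    if (maxPartsPerPass < 2) {
      throw new IllegalArgumentException("maxPartsPerPass must be >= 2");
    }
    return (int) Math.max(1.0, Math.ceil(Math.log(parts) / Math.log(maxPartsPerPass)));
  }

  /** Assumed number of splits requested in each pass: the levels-th root of parts, rounded. */
  static int partsPerPass(long parts, int levels) {
    return (int) Math.round(Math.pow(parts, 1.0 / levels));
  }

  public static void main(String[] args) {
    long parts = estimatedParts(1_000_000_000L, 3_000_000L); // 1 billion items, 3M target size
    int levels = levels(parts, 100);                         // 100 is a hypothetical limit
    int perPass = partsPerPass(parts, levels);
    long total = (long) Math.pow(perPass, levels);
    System.out.println("parts=" + parts + " levels=" + levels
        + " partsPerPass=" + perPass + " total=" + total);
  }
}
```

With 1 billion items, a 3M target size, and a hypothetical limit of 100 partitions per pass, this arithmetic gives 2 passes of 18 splits each, for 18 x 18 = 324 final partitions. Raising the limit to 1000 would give a single pass with 334 splits, which trades fewer passes for a larger request on one sketch.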
    • Method Detail

      • partition

        public List<Partitioner.PartitionBoundsRow<T>> partition(S sk)
        This initiates the partitioning process.
        Parameters:
        sk - A sketch of the entire data set.
        Returns:
        the final partitioning list
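
To show what a list of roughly equal-sized partitions looks like, the following stand-in partitions a small array exactly, by sorting it and cutting at evenly spaced ranks. It does not use a quantiles sketch or any class from this package; the class ExactPartitionSketch and its Row type are hypothetical names introduced only for this illustration, and Row is not the PartitionBoundsRow type returned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in: exact, sort-based partitioning of a small array.
// A sketch-based partitioner approximates these cut points without a full sort.
class ExactPartitionSketch {

  /** One partition, described by its smallest item, largest item, and item count. */
  static final class Row {
    final long lower;
    final long upper;
    final int count;

    Row(long lower, long upper, int count) {
      this.lower = lower;
      this.upper = upper;
      this.count = count;
    }
  }

  /** Splits data into at most numParts partitions whose sizes differ by at most one item. */
  static List<Row> partition(long[] data, int numParts) {
    if (numParts < 1) {
      throw new IllegalArgumentException("numParts must be >= 1");
    }
    long[] sorted = data.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    List<Row> rows = new ArrayList<>();
    int start = 0;
    for (int p = 1; p <= numParts; p++) {
      int end = (int) ((long) p * n / numParts); // exclusive end index of partition p
      if (end > start) {                         // skip empty partitions when n < numParts
        rows.add(new Row(sorted[start], sorted[end - 1], end - start));
      }
      start = end;
    }
    return rows;
  }

  public static void main(String[] args) {
    long[] data = {7, 3, 10, 1, 5, 9, 2, 8, 6, 4};
    for (Row r : partition(data, 3)) {
      System.out.println("[" + r.lower + ", " + r.upper + "] count=" + r.count);
    }
  }
}
```

For the ten items 1 through 10 and three partitions, the rows cover [1, 3], [4, 6], and [7, 10] with 3, 3, and 4 items. A sketch-based process cannot guarantee exact counts like these, which is why the class description speaks of partitions of approximately the same size.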