Class ThetaUtil


  • public final class ThetaUtil
    extends Object
    Utility methods for the Theta Family of sketches
    Author:
    Lee Rhodes
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int DEFAULT_NOMINAL_ENTRIES
      The default nominal entries is provided as a convenience for those cases where the nominal sketch size in number of entries is not provided.
      static long DEFAULT_UPDATE_SEED
      The seed 9001 used in the sketch update methods is a prime number that was chosen very early on in experimental testing.
      static int MAX_LG_NOM_LONGS
      The largest Log2 nom entries allowed: 26.
      static int MIN_LG_ARR_LONGS
      The smallest Log2 cache size allowed: 5.
      static int MIN_LG_NOM_LONGS
      The smallest Log2 nom entries allowed: 4.
      static double REBUILD_THRESHOLD
      The hash table rebuild threshold = 15.0/16.0.
      static double RESIZE_THRESHOLD
      The resize threshold = 0.5; tuned for speed.
    • Field Detail

      • MIN_LG_NOM_LONGS

        public static final int MIN_LG_NOM_LONGS
        The smallest Log2 nom entries allowed: 4.
        See Also:
        Constant Field Values
      • MAX_LG_NOM_LONGS

        public static final int MAX_LG_NOM_LONGS
        The largest Log2 nom entries allowed: 26.
        See Also:
        Constant Field Values
      • REBUILD_THRESHOLD

        public static final double REBUILD_THRESHOLD
        The hash table rebuild threshold = 15.0/16.0.
        See Also:
        Constant Field Values
      • RESIZE_THRESHOLD

        public static final double RESIZE_THRESHOLD
        The resize threshold = 0.5; tuned for speed.
        See Also:
        Constant Field Values
      • DEFAULT_NOMINAL_ENTRIES

        public static final int DEFAULT_NOMINAL_ENTRIES
        The default nominal entries is provided as a convenience for those cases where the nominal sketch size in number of entries is not provided. A sketch of 4096 entries has a Relative Standard Error (RSE) of +/- 1.56% at a confidence of 68%; or equivalently, a Relative Error of +/- 3.1% at a confidence of 95.4%. See Default Nominal Entries
        See Also:
        Constant Field Values
      • DEFAULT_UPDATE_SEED

        public static final long DEFAULT_UPDATE_SEED
        The seed 9001 used in the sketch update methods is a prime number that was chosen very early on in experimental testing. Choosing a seed is somewhat arbitrary, and the author cannot prove that this particular seed is somehow superior to other seeds. There was some early Internet discussion that a seed of 0 did not produce as clean avalanche diagrams as non-zero seeds, but this may have been more related to the MurmurHash2 release, which did have some issues. As far as the author can determine, MurmurHash3 does not have these problems.

        In order to perform set operations on two sketches it is critical that the same hash function and seed are identical for both sketches, otherwise the assumed 1:1 relationship between the original source key value and the hashed bit string would be violated. Once you have developed a history of stored sketches you are stuck with it.

        WARNING: This seed is used internally by library sketches in different packages and thus must be declared public. However, this seed value must not be used by library users with the MurmurHash3 function. It should be viewed as existing for exclusive, private use by the library.

        See Default Update Seed

        See Also:
        Constant Field Values
      • MIN_LG_ARR_LONGS

        public static final int MIN_LG_ARR_LONGS
        The smallest Log2 cache size allowed: 5.
        See Also:
        Constant Field Values
    • Method Detail

      • checkSeedHashes

        public static short checkSeedHashes​(short seedHashA,
                                            short seedHashB)
        Check if the two seed hashes are equal. If not, throw an SketchesArgumentException.
        Parameters:
        seedHashA - the seedHash A
        seedHashB - the seedHash B
        Returns:
        seedHashA if they are equal
      • computeSeedHash

        public static short computeSeedHash​(long seed)
        Computes and checks the 16-bit seed hash from the given long seed. The seed hash may not be zero in order to maintain compatibility with older serialized versions that did not have this concept.
        Parameters:
        seed - See Update Hash Seed
        Returns:
        the seed hash.
      • startingSubMultiple

        public static int startingSubMultiple​(int lgTarget,
                                              int lgRF,
                                              int lgMin)
        Gets the smallest allowed exponent of 2 that it is a sub-multiple of the target by zero, one or more resize factors.
        Parameters:
        lgTarget - Log2 of the target size
        lgRF - Log_base2 of Resize Factor. See Resize Factor
        lgMin - Log2 of the minimum allowed starting size
        Returns:
        The Log2 of the starting size
      • checkNomLongs

        public static int checkNomLongs​(int nomLongs)
        Checks that the given nomLongs is within bounds and returns the Log2 of the ceiling power of 2 of the given nomLongs.
        Parameters:
        nomLongs - the given number of nominal longs. This can be any value from 16 to 67108864, inclusive.
        Returns:
        The Log2 of the ceiling power of 2 of the given nomLongs.