Class ThetaUtil

java.lang.Object
org.apache.datasketches.thetacommon.ThetaUtil

public final class ThetaUtil extends Object
Utility methods for the Theta Family of sketches
Author:
Lee Rhodes
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    The default nominal entries is provided as a convenience for those cases where the nominal sketch size in number of entries is not provided.
    static final long
    The seed 9001 used in the sketch update methods is a prime number that was chosen very early on in experimental testing.
    static final int
    The largest Log2 nom entries allowed: 26.
    static final int
    The smallest Log2 cache size allowed: 5.
    static final int
    The smallest Log2 nom entries allowed: 4.
    static final double
    The hash table rebuild threshold = 15.0/16.0.
    static final double
    The resize threshold = 0.5; tuned for speed.
  • Method Summary

    Modifier and Type
    Method
    Description
    static int
    checkNomLongs(int nomLongs)
    Checks that the given nomLongs is within bounds and returns the Log2 of the ceiling power of 2 of the given nomLongs.
    static short
    checkSeedHashes(short seedHashA, short seedHashB)
    Check if the two seed hashes are equal.
    static short
    computeSeedHash(long seed)
    Computes and checks the 16-bit seed hash from the given long seed.
    static int
    startingSubMultiple(int lgTarget, int lgRF, int lgMin)
    Gets the smallest allowed exponent of 2 that it is a sub-multiple of the target by zero, one or more resize factors.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • MIN_LG_NOM_LONGS

      public static final int MIN_LG_NOM_LONGS
      The smallest Log2 nom entries allowed: 4.
      See Also:
    • MAX_LG_NOM_LONGS

      public static final int MAX_LG_NOM_LONGS
      The largest Log2 nom entries allowed: 26.
      See Also:
    • REBUILD_THRESHOLD

      public static final double REBUILD_THRESHOLD
      The hash table rebuild threshold = 15.0/16.0.
      See Also:
    • RESIZE_THRESHOLD

      public static final double RESIZE_THRESHOLD
      The resize threshold = 0.5; tuned for speed.
      See Also:
    • DEFAULT_NOMINAL_ENTRIES

      public static final int DEFAULT_NOMINAL_ENTRIES
      The default nominal entries is provided as a convenience for those cases where the nominal sketch size in number of entries is not provided. A sketch of 4096 entries has a Relative Standard Error (RSE) of +/- 1.56% at a confidence of 68%; or equivalently, a Relative Error of +/- 3.1% at a confidence of 95.4%. See Default Nominal Entries
      See Also:
    • DEFAULT_UPDATE_SEED

      public static final long DEFAULT_UPDATE_SEED
      The seed 9001 used in the sketch update methods is a prime number that was chosen very early on in experimental testing. Choosing a seed is somewhat arbitrary, and the author cannot prove that this particular seed is somehow superior to other seeds. There was some early Internet discussion that a seed of 0 did not produce as clean avalanche diagrams as non-zero seeds, but this may have been more related to the MurmurHash2 release, which did have some issues. As far as the author can determine, MurmurHash3 does not have these problems.

      In order to perform set operations on two sketches it is critical that the same hash function and seed are identical for both sketches, otherwise the assumed 1:1 relationship between the original source key value and the hashed bit string would be violated. Once you have developed a history of stored sketches you are stuck with it.

      WARNING: This seed is used internally by library sketches in different packages and thus must be declared public. However, this seed value must not be used by library users with the MurmurHash3 function. It should be viewed as existing for exclusive, private use by the library.

      See Default Update Seed

      See Also:
    • MIN_LG_ARR_LONGS

      public static final int MIN_LG_ARR_LONGS
      The smallest Log2 cache size allowed: 5.
      See Also:
  • Method Details

    • checkSeedHashes

      public static short checkSeedHashes(short seedHashA, short seedHashB)
      Check if the two seed hashes are equal. If not, throw an SketchesArgumentException.
      Parameters:
      seedHashA - the seedHash A
      seedHashB - the seedHash B
      Returns:
      seedHashA if they are equal
    • computeSeedHash

      public static short computeSeedHash(long seed)
      Computes and checks the 16-bit seed hash from the given long seed. The seed hash may not be zero in order to maintain compatibility with older serialized versions that did not have this concept.
      Parameters:
      seed - See Update Hash Seed
      Returns:
      the seed hash.
    • startingSubMultiple

      public static int startingSubMultiple(int lgTarget, int lgRF, int lgMin)
      Gets the smallest allowed exponent of 2 that it is a sub-multiple of the target by zero, one or more resize factors.
      Parameters:
      lgTarget - Log2 of the target size
      lgRF - Log_base2 of Resize Factor. See Resize Factor
      lgMin - Log2 of the minimum allowed starting size
      Returns:
      The Log2 of the starting size
    • checkNomLongs

      public static int checkNomLongs(int nomLongs)
      Checks that the given nomLongs is within bounds and returns the Log2 of the ceiling power of 2 of the given nomLongs.
      Parameters:
      nomLongs - the given number of nominal longs. This can be any value from 16 to 67108864, inclusive.
      Returns:
      The Log2 of the ceiling power of 2 of the given nomLongs.