Class JaccardSimilarity

java.lang.Object
org.apache.datasketches.theta.JaccardSimilarity

public final class JaccardSimilarity extends Object
Jaccard similarity of two Theta Sketches.
Author:
Lee Rhodes
  • Constructor Summary

    Constructors
    Constructor
    Description
    JaccardSimilarity()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean
    dissimilarityTest(Sketch measured, Sketch expected, double threshold)
    Tests dissimilarity of a measured Sketch against an expected Sketch.
    static boolean
    exactlyEqual(Sketch sketchA, Sketch sketchB)
    Returns true if the two given sketches have exactly the same hash values and the same theta values.
    static double[]
    jaccard(Sketch sketchA, Sketch sketchB)
    Computes the Jaccard similarity index with upper and lower bounds.
    static boolean
    similarityTest(Sketch measured, Sketch expected, double threshold)
    Tests similarity of a measured Sketch against an expected Sketch.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • JaccardSimilarity

      public JaccardSimilarity()
  • Method Details

    • jaccard

      public static double[] jaccard(Sketch sketchA, Sketch sketchB)
      Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches have no elements in common (they are disjoint). A Jaccard index of 0.95 means the overlap between the two populations is 95% of the union of the two populations.

      Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.

      Parameters:
      sketchA - given sketch A
      sketchB - given sketch B
      Returns:
      a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds define a 95.4% confidence interval (+/- 2 standard deviations).
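
To make the formula concrete, the following is a minimal sketch that computes the exact Jaccard index of two ordinary Java sets. It uses only java.util and is not how the library computes its estimate from sketches; the class name ExactJaccardExample and the convention of returning 1.0 for two empty sets are assumptions made for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class ExactJaccardExample {

    // Exact Jaccard index |A ∩ B| / |A ∪ B| of two finite sets.
    // Assumption for this example: two empty sets are treated as identical (J = 1.0).
    public static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Copy A so that retainAll does not modify the caller's set.
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Integer> a = Set.of(1, 2, 3, 4);
        Set<Integer> b = Set.of(3, 4, 5, 6);
        System.out.println(jaccard(a, b));         // 2 shared items out of 6 distinct
        System.out.println(jaccard(a, a));         // identical sets
        System.out.println(jaccard(a, Set.of(9))); // disjoint sets
    }
}
```

A sketch-based estimate approximates this same ratio from samples of hash values, which is why the method returns bounds around the estimate rather than a single exact number.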
    • exactlyEqual

      public static boolean exactlyEqual(Sketch sketchA, Sketch sketchB)
      Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
      Parameters:
      sketchA - the given sketch A
      sketchB - the given sketch B
      Returns:
      true if the two given sketches have exactly the same hash values and the same theta values.
    • similarityTest

      public static boolean similarityTest(Sketch measured, Sketch expected, double threshold)
      Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index, JLB, of the measured and expected sketches. If JLB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
      Parameters:
      measured - the sketch to be tested
      expected - the reference sketch that is considered to be correct.
      threshold - a real value between zero and one.
      Returns:
      true if the similarity of the two sketches is greater than or equal to the given threshold with at least 97.7% confidence.
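
The decision rule described above can be sketched on a plain {LowerBound, Estimate, UpperBound} array. The class SimilarityThresholdExample, the helper isSimilar, and the bounds values are hypothetical and exist only for this illustration; they are not part of the library. One way to read the 97.7% figure is that the 95.4% two-sided interval leaves about 2.3% in each tail, so the lower bound taken alone holds with about 97.7% confidence.

```java
public class SimilarityThresholdExample {

    // Index of the lower bound in a {LowerBound, Estimate, UpperBound} array.
    static final int LOWER = 0;

    // Hypothetical helper: reports "similar" only when even the pessimistic
    // end of the interval reaches the threshold.
    public static boolean isSimilar(double[] bounds, double threshold) {
        return bounds[LOWER] >= threshold;
    }

    public static void main(String[] args) {
        // Made-up bounds for illustration, not output from a real sketch.
        double[] bounds = {0.93, 0.95, 0.97};
        System.out.println(isSimilar(bounds, 0.90)); // lower bound 0.93 reaches 0.90
        System.out.println(isSimilar(bounds, 0.95)); // lower bound 0.93 falls short of 0.95
    }
}
```

Note that the second call returns false even though the estimate itself equals the threshold: the test is deliberately conservative and looks only at the lower bound.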
    • dissimilarityTest

      public static boolean dissimilarityTest(Sketch measured, Sketch expected, double threshold)
      Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index, JUB, of the measured and expected sketches. If JUB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
      Parameters:
      measured - the sketch to be tested
      expected - the reference sketch that is considered to be correct.
      threshold - a real value between zero and one.
      Returns:
      true if the Jaccard similarity of the two sketches is less than or equal to the given threshold with at least 97.7% confidence.
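
The dissimilarity rule mirrors the similarity rule but looks at the upper bound. Below is a minimal sketch on a plain {LowerBound, Estimate, UpperBound} array; the class DissimilarityThresholdExample, the helper isDissimilar, and the bounds values are hypothetical and not part of the library.

```java
public class DissimilarityThresholdExample {

    // Index of the upper bound in a {LowerBound, Estimate, UpperBound} array.
    static final int UPPER = 2;

    // Hypothetical helper: reports "dissimilar" only when even the optimistic
    // end of the interval stays at or below the threshold.
    public static boolean isDissimilar(double[] bounds, double threshold) {
        return bounds[UPPER] <= threshold;
    }

    public static void main(String[] args) {
        // Made-up bounds for illustration, not output from a real sketch.
        double[] bounds = {0.08, 0.10, 0.12};
        System.out.println(isDissimilar(bounds, 0.15)); // upper bound 0.12 is below 0.15
        System.out.println(isDissimilar(bounds, 0.10)); // upper bound 0.12 exceeds 0.10
    }
}
```

When a threshold falls strictly between the lower and upper bounds, a pair of sketches passes neither the similarity test nor the dissimilarity test for that threshold, so a false result from one test does not imply a true result from the other.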