Package org.apache.datasketches.tuple
Class JaccardSimilarity
 java.lang.Object

 org.apache.datasketches.tuple.JaccardSimilarity

public final class JaccardSimilarity extends Object
Jaccard similarity of two Tuple Sketches, or alternatively, of a Tuple and Theta Sketch.
Note: only retained hash values are compared, and the Tuple summary values are not accounted for in the similarity measure.
 Author:
 Lee Rhodes, David Cromberge


Constructor Summary
 JaccardSimilarity()

Method Summary
static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
 Tests dissimilarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
 Tests dissimilarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
 Returns true if the two given sketches have exactly the same hash values and the same theta values.
static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
 Returns true if the two given sketches have exactly the same hash values and the same theta values.
static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
 Computes the Jaccard similarity index with upper and lower bounds.
static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
 Computes the Jaccard similarity index with upper and lower bounds.
static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
 Tests similarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
 Tests similarity of a measured Sketch against an expected Sketch.



Method Detail

jaccard
public static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| (the size of the intersection divided by the size of the union) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are distinct from each other. A Jaccard of 0.95 means the overlap between the two populations is 95% of the union of the two populations.
Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.
 Type Parameters:
 S - Summary
 Parameters:
 sketchA - The first argument, a Tuple sketch with summary type S
 sketchB - The second argument, a Tuple sketch with summary type S
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 Returns:
 a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds are for a confidence interval of 95.4%, or +/- 2 standard deviations.
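
To make the formula concrete, here is a minimal sketch that computes the exact Jaccard index on ordinary Java sets. It is an illustration only and does not use the DataSketches API: the class ExactJaccardExample and its jaccard helper are hypothetical names, and treating two empty sets as identical (1.0) is an assumption of this example. A sketch-based computation estimates the same ratio from retained hash values and adds the bounds described above.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: exact Jaccard index on plain Java sets, not on sketches.
class ExactJaccardExample {

  // Returns |A ∩ B| / |A ∪ B|.
  // Assumption of this example: two empty sets are treated as identical (1.0).
  static <T> double jaccard(Set<T> a, Set<T> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    Set<T> intersection = new HashSet<>(a);
    intersection.retainAll(b); // keep only elements present in both sets
    Set<T> union = new HashSet<>(a);
    union.addAll(b); // all elements present in either set
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    Set<Integer> a = Set.of(1, 2, 3, 4);
    Set<Integer> b = Set.of(3, 4, 5, 6);
    // The intersection {3, 4} has 2 elements; the union {1, ..., 6} has 6.
    System.out.println(jaccard(a, b));
  }
}
```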

jaccard
public static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| (the size of the intersection divided by the size of the union) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are distinct from each other. A Jaccard of 0.95 means the overlap between the two populations is 95% of the union of the two populations.
Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.
 Type Parameters:
 S - Summary
 Parameters:
 sketchA - The first argument, a Tuple sketch with summary type S
 sketchB - The second argument, a Theta sketch
 summary - the given proxy summary for the Theta sketch, which doesn't have one. This may not be null.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 Returns:
 a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds are for a confidence interval of 95.4%, or +/- 2 standard deviations.

exactlyEqual
public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
 Type Parameters:
 S - Summary
 Parameters:
 sketchA - The first argument, a Tuple sketch with summary type S
 sketchB - The second argument, a Tuple sketch with summary type S
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 Returns:
 true if the two given sketches have exactly the same hash values and the same theta values.

exactlyEqual
public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
 Type Parameters:
 S - Summary
 Parameters:
 sketchA - The first argument, a Tuple sketch with summary type S
 sketchB - The second argument, a Theta sketch
 summary - the given proxy summary for the Theta sketch, which doesn't have one. This may not be null.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 Returns:
 true if the two given sketches have exactly the same hash values and the same theta values.

similarityTest
public static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index J_{LB} of the measured and expected sketches. If J_{LB} ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
 Type Parameters:
 S - Summary
 Parameters:
 measured - a Tuple sketch with summary type S to be tested
 expected - the reference Tuple sketch with summary type S that is considered to be correct.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 threshold - a real value between zero and one.
 Returns:
 true if the similarity of the two sketches is at least the given threshold with at least 97.7% confidence.
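
The decision rule above can be sketched on a bounds array laid out as {LowerBound, Estimate, UpperBound}, the layout given for the return value of jaccard. This is a hedged illustration, not the library's implementation: SimilarityDecisionExample and isSimilar are hypothetical names, and the numbers are made up.

```java
// Illustration only: the similarity decision rule applied to a bounds array.
class SimilarityDecisionExample {

  // Assumed array layout, taken from the jaccard() return description:
  // {LowerBound, Estimate, UpperBound}.
  static final int LOWER_BOUND = 0;

  // Hypothetical helper: similar only when the lower bound reaches the threshold.
  static boolean isSimilar(double[] bounds, double threshold) {
    return bounds[LOWER_BOUND] >= threshold;
  }

  public static void main(String[] args) {
    double[] bounds = {0.93, 0.95, 0.97}; // made-up values for illustration
    // Lower bound 0.93 is at least 0.90, so the test passes.
    System.out.println(isSimilar(bounds, 0.90));
    // Lower bound 0.93 is below 0.95, so the test fails even though the estimate is 0.95.
    System.out.println(isSimilar(bounds, 0.95));
  }
}
```

Testing the lower bound rather than the estimate is what makes the result conservative: a pair of sketches is reported as similar only when even the pessimistic end of the interval clears the threshold.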

similarityTest
public static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index J_{LB} of the measured and expected sketches. If J_{LB} ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
 Type Parameters:
 S - Summary
 Parameters:
 measured - a Tuple sketch with summary type S to be tested
 expected - the reference Theta sketch that is considered to be correct.
 summary - the given proxy summary for the Theta sketch, which doesn't have one. This may not be null.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 threshold - a real value between zero and one.
 Returns:
 true if the similarity of the two sketches is at least the given threshold with at least 97.7% confidence.

dissimilarityTest
public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index J_{UB} of the measured and expected sketches. If J_{UB} ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
 Type Parameters:
 S - Summary
 Parameters:
 measured - a Tuple sketch with summary type S to be tested
 expected - the reference Tuple sketch with summary type S that is considered to be correct.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 threshold - a real value between zero and one.
 Returns:
 true if the similarity of the two sketches is at most the given threshold with at least 97.7% confidence, meaning the sketches are considered dissimilar.
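
As a hedged illustration of the rule above, the following sketch applies the upper-bound comparison to a bounds array laid out as {LowerBound, Estimate, UpperBound}. DissimilarityDecisionExample and isDissimilar are hypothetical names, the values are made up, and this is not the library's implementation.

```java
// Illustration only: the dissimilarity decision rule applied to a bounds array.
class DissimilarityDecisionExample {

  // Assumed array layout, taken from the jaccard() return description:
  // {LowerBound, Estimate, UpperBound}.
  static final int UPPER_BOUND = 2;

  // Hypothetical helper: dissimilar only when the upper bound stays at or below the threshold.
  static boolean isDissimilar(double[] bounds, double threshold) {
    return bounds[UPPER_BOUND] <= threshold;
  }

  public static void main(String[] args) {
    double[] bounds = {0.08, 0.10, 0.12}; // made-up values for illustration
    // Upper bound 0.12 is at most 0.20, so the sketches count as dissimilar.
    System.out.println(isDissimilar(bounds, 0.20));
    // Upper bound 0.12 exceeds 0.10, so the test fails even though the estimate is 0.10.
    System.out.println(isDissimilar(bounds, 0.10));
  }
}
```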

dissimilarityTest
public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index J_{UB} of the measured and expected sketches. If J_{UB} ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
 Type Parameters:
 S - Summary
 Parameters:
 measured - a Tuple sketch with summary type S to be tested
 expected - the reference Theta sketch that is considered to be correct.
 summary - the given proxy summary for the Theta sketch, which doesn't have one. This may not be null.
 summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
 threshold - a real value between zero and one.
 Returns:
 true if the similarity of the two sketches is at most the given threshold with at least 97.7% confidence, meaning the sketches are considered dissimilar.

