Package org.apache.datasketches.tuple
Class JaccardSimilarity
java.lang.Object
org.apache.datasketches.tuple.JaccardSimilarity
Jaccard similarity of two Tuple Sketches, or alternatively, of a Tuple and Theta Sketch.
Note: only retained hash values are compared, and the Tuple summary values are not accounted for in the similarity measure.
- Author:
- Lee Rhodes, David Cromberge
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
static <S extends Summary> boolean | dissimilarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold) | Tests dissimilarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean | dissimilarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold) | Tests dissimilarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean | exactlyEqual(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps) | Returns true if the two given sketches have exactly the same hash values and the same theta values.
static <S extends Summary> boolean | exactlyEqual(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps) | Returns true if the two given sketches have exactly the same hash values and the same theta values.
static <S extends Summary> double[] | jaccard(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps) | Computes the Jaccard similarity index with upper and lower bounds.
static <S extends Summary> double[] | jaccard(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps) | Computes the Jaccard similarity index with upper and lower bounds.
static <S extends Summary> boolean | similarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold) | Tests similarity of a measured Sketch against an expected Sketch.
static <S extends Summary> boolean | similarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold) | Tests similarity of a measured Sketch against an expected Sketch.
-
Constructor Details
-
JaccardSimilarity
public JaccardSimilarity()
-
-
Method Details
-
jaccard
public static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard index of 0.95 means the overlap between the two populations is 95% of the union of the two populations.
Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.
- Type Parameters:
S - Summary
- Parameters:
sketchA - The first argument, a Tuple sketch with summary type S
sketchB - The second argument, a Tuple sketch with summary type S
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
- Returns:
a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds are for a confidence interval of 95.4%, or +/- 2 standard deviations.
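To make the definition of the index concrete, the following is a minimal sketch that computes the exact Jaccard index of two small sets of hash values in plain Java. It does not use the DataSketches API and produces no confidence bounds: the class ExactJaccard and its jaccard method are hypothetical names introduced only for illustration, and treating two empty sets as identical (J = 1.0) is a convention chosen for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of DataSketches: exact Jaccard index of two sets.
final class ExactJaccard {

  // Returns |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical (assumption).
  static double jaccard(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    Set<Long> intersection = new HashSet<>(a);
    intersection.retainAll(b); // keep only the values present in both sets
    Set<Long> union = new HashSet<>(a);
    union.addAll(b);           // all values present in either set
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L));
    Set<Long> b = new HashSet<>(Arrays.asList(2L, 3L, 4L));
    // Intersection {2, 3} has 2 values, union {1, 2, 3, 4} has 4 values.
    System.out.println(jaccard(a, b)); // prints 0.5
  }
}
```

A sketch-based computation estimates the same ratio from retained hash values only, which is why the library method returns a lower bound, an estimate, and an upper bound rather than a single exact number.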
-
jaccard
public static <S extends Summary> double[] jaccard(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard index of 0.95 means the overlap between the two populations is 95% of the union of the two populations.
Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.
- Type Parameters:
S - Summary
- Parameters:
sketchA - The first argument, a Tuple sketch with summary type S
sketchB - The second argument, a Theta sketch
summary - a proxy summary for the Theta sketch, which has no summaries of its own. Must not be null.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
- Returns:
a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds are for a confidence interval of 95.4%, or +/- 2 standard deviations.
-
exactlyEqual
public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch<S> sketchB, SummarySetOperations<S> summarySetOps)
Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
- Type Parameters:
S - Summary
- Parameters:
sketchA - The first argument, a Tuple sketch with summary type S
sketchB - The second argument, a Tuple sketch with summary type S
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
- Returns:
true if the two given sketches have exactly the same hash values and the same theta values.
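The equivalence condition can be illustrated with a simplified model. This is a sketch under stated assumptions, not the library's implementation: SketchModel is a hypothetical stand-in that holds only a theta value (modeled here as a long) and a set of retained hash values. Summaries are not modeled at all, in line with the class-level note that summary values do not enter the comparison.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a sketch: a theta value plus the retained hash values.
final class SketchModel {
  final long theta;
  final Set<Long> hashes;

  SketchModel(long theta, Long... hashes) {
    this.theta = theta;
    this.hashes = new HashSet<>(Arrays.asList(hashes));
  }

  // True when both the theta values and the retained hash values match.
  static boolean exactlyEqual(SketchModel a, SketchModel b) {
    if (a == b) {
      return true; // same object, trivially equivalent
    }
    return a.theta == b.theta && a.hashes.equals(b.hashes);
  }

  public static void main(String[] args) {
    SketchModel x = new SketchModel(Long.MAX_VALUE, 11L, 22L, 33L);
    SketchModel y = new SketchModel(Long.MAX_VALUE, 33L, 22L, 11L); // same hashes, other order
    SketchModel z = new SketchModel(Long.MAX_VALUE, 11L, 22L, 44L); // one hash differs
    System.out.println(exactlyEqual(x, y)); // prints true
    System.out.println(exactlyEqual(x, z)); // prints false
  }
}
```

Matching hash values alone are not sufficient in this model: two sketches with the same retained hashes but different theta values are reported as not equivalent.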
-
exactlyEqual
public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA, Sketch sketchB, S summary, SummarySetOperations<S> summarySetOps)
Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
- Type Parameters:
S - Summary
- Parameters:
sketchA - The first argument, a Tuple sketch with summary type S
sketchB - The second argument, a Theta sketch
summary - a proxy summary for the Theta sketch, which has no summaries of its own. Must not be null.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
- Returns:
true if the two given sketches have exactly the same hash values and the same theta values.
-
similarityTest
public static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index, J_LB, of the measured and expected sketches. If J_LB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
- Type Parameters:
S - Summary
- Parameters:
measured - a Tuple sketch with summary type S to be tested
expected - the reference Tuple sketch with summary type S that is considered to be correct.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
threshold - a real value between zero and one.
- Returns:
true if the lower bound of the Jaccard index is at or above the given threshold, that is, the sketches are similar with at least 97.7% confidence.
-
similarityTest
public static <S extends Summary> boolean similarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index, J_LB, of the measured and expected sketches. If J_LB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
- Type Parameters:
S - Summary
- Parameters:
measured - a Tuple sketch with summary type S to be tested
expected - the reference Theta sketch that is considered to be correct.
summary - a proxy summary for the Theta sketch, which has no summaries of its own. Must not be null.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
threshold - a real value between zero and one.
- Returns:
true if the lower bound of the Jaccard index is at or above the given threshold, that is, the sketches are similar with at least 97.7% confidence.
-
dissimilarityTest
public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch<S> expected, SummarySetOperations<S> summarySetOps, double threshold)
Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index, J_UB, of the measured and expected sketches. If J_UB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
- Type Parameters:
S - Summary
- Parameters:
measured - a Tuple sketch with summary type S to be tested
expected - the reference Tuple sketch with summary type S that is considered to be correct.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
threshold - a real value between zero and one.
- Returns:
true if the upper bound of the Jaccard index is at or below the given threshold, that is, the sketches are dissimilar with at least 97.7% confidence.
-
dissimilarityTest
public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured, Sketch expected, S summary, SummarySetOperations<S> summarySetOps, double threshold)
Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index, J_UB, of the measured and expected sketches. If J_UB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
- Type Parameters:
S - Summary
- Parameters:
measured - a Tuple sketch with summary type S to be tested
expected - the reference Theta sketch that is considered to be correct.
summary - a proxy summary for the Theta sketch, which has no summaries of its own. Must not be null.
summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
threshold - a real value between zero and one.
- Returns:
true if the upper bound of the Jaccard index is at or below the given threshold, that is, the sketches are dissimilar with at least 97.7% confidence.
-
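The decision rules of similarityTest and dissimilarityTest can be sketched in plain Java as follows. This is a minimal illustration, not the library's implementation: the class ThresholdRules and its methods are hypothetical, the bounds array is assumed to use the {LowerBound, Estimate, UpperBound} layout documented for jaccard, and the numeric values are made up. Each rule looks at only one end of the two-sided 95.4% interval, so about 2.3% of the probability remains in the single relevant tail, which is consistent with the stated 97.7% confidence.

```java
// Hypothetical helper, not part of DataSketches: threshold rules on Jaccard bounds.
final class ThresholdRules {

  // bounds = {lowerBound, estimate, upperBound}

  // Similar when even the lower bound reaches the threshold.
  static boolean similar(double[] bounds, double threshold) {
    return bounds[0] >= threshold;
  }

  // Dissimilar when even the upper bound stays at or below the threshold.
  static boolean dissimilar(double[] bounds, double threshold) {
    return bounds[2] <= threshold;
  }

  public static void main(String[] args) {
    double[] bounds = {0.90, 0.95, 0.98}; // made-up values for illustration

    System.out.println(similar(bounds, 0.85));    // prints true:  0.90 >= 0.85
    System.out.println(similar(bounds, 0.93));    // prints false: 0.90 <  0.93
    System.out.println(dissimilar(bounds, 0.99)); // prints true:  0.98 <= 0.99
    System.out.println(dissimilar(bounds, 0.93)); // prints false: 0.98 >  0.93
  }
}
```

With a threshold of 0.93 both rules return false in this example, because the threshold falls inside the interval between the lower and upper bounds. The two tests are therefore not complements of each other.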