Class JaccardSimilarity


  • public final class JaccardSimilarity
    extends Object
    Jaccard similarity of two Tuple Sketches, or alternatively, of a Tuple and Theta Sketch.

    Note: only retained hash values are compared, and the Tuple summary values are not accounted for in the similarity measure.

    Author:
    Lee Rhodes, David Cromberge
    • Constructor Detail

      • JaccardSimilarity

        public JaccardSimilarity()
    • Method Detail

      • jaccard

        public static <S extends Summary> double[] jaccard(Sketch<S> sketchA,
                                                           Sketch<S> sketchB,
                                                           SummarySetOperations<S> summarySetOps)
        Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard index of 0.95 means the overlap between the two populations is 95% of the union of the two populations.

        Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.

        Type Parameters:
        S - Summary
        Parameters:
        sketchA - The first argument, a Tuple sketch with summary type S
        sketchB - The second argument, a Tuple sketch with summary type S
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        Returns:
        a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds correspond to a confidence interval of 95.4%, or +/- 2 standard deviations.
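        For intuition, the ratio itself can be computed exactly on small sets. The following is a minimal sketch that uses only java.util collections; the class and method names (ExactJaccard, jaccard) are hypothetical, and the code does not call the sketch library, which estimates the same ratio from retained hash values and reports bounds around the estimate. Treating two empty sets as identical is an assumption of this illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustration only: exact Jaccard index on plain Java sets (hypothetical helper). */
final class ExactJaccard {

  /** Returns |A intersect B| / |A union B| for two sets of keys. */
  static double jaccard(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0; // assumption of this illustration: empty vs. empty counts as equal
    }
    Set<Long> intersection = new HashSet<>(a);
    intersection.retainAll(b); // keep only the keys present in both sets
    Set<Long> union = new HashSet<>(a);
    union.addAll(b); // all distinct keys from either set
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    Set<Long> a = Set.of(1L, 2L, 3L, 4L);
    Set<Long> b = Set.of(3L, 4L, 5L, 6L);
    // 2 shared keys out of 6 distinct keys
    System.out.println(jaccard(a, b));
  }
}
```

        Identical sets give 1.0 and sets with no common key give 0.0, which matches the two boundary cases described above.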
      • jaccard

        public static <S extends Summary> double[] jaccard(Sketch<S> sketchA,
                                                           Sketch sketchB,
                                                           S summary,
                                                           SummarySetOperations<S> summarySetOps)
        Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard index of 0.95 means the overlap between the two populations is 95% of the union of the two populations.

        Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.

        Type Parameters:
        S - Summary
        Parameters:
        sketchA - The first argument, a Tuple sketch with summary type S
        sketchB - The second argument, a Theta sketch
        summary - the proxy summary to use for the Theta sketch, which has no summaries of its own. This must not be null.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        Returns:
        a double array {LowerBound, Estimate, UpperBound} of the Jaccard index. The upper and lower bounds correspond to a confidence interval of 95.4%, or +/- 2 standard deviations.
      • exactlyEqual

        public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA,
                                                               Sketch<S> sketchB,
                                                               SummarySetOperations<S> summarySetOps)
        Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
        Type Parameters:
        S - Summary
        Parameters:
        sketchA - The first argument, a Tuple sketch with summary type S
        sketchB - The second argument, a Tuple sketch with summary type S
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        Returns:
        true if the two given sketches have exactly the same hash values and the same theta values.
      • exactlyEqual

        public static <S extends Summary> boolean exactlyEqual(Sketch<S> sketchA,
                                                               Sketch sketchB,
                                                               S summary,
                                                               SummarySetOperations<S> summarySetOps)
        Returns true if the two given sketches have exactly the same hash values and the same theta values. Thus, they are equivalent.
        Type Parameters:
        S - Summary
        Parameters:
        sketchA - The first argument, a Tuple sketch with summary type S
        sketchB - The second argument, a Theta sketch
        summary - the proxy summary to use for the Theta sketch, which has no summaries of its own. This must not be null.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        Returns:
        true if the two given sketches have exactly the same hash values and the same theta values.
      • similarityTest

        public static <S extends Summary> boolean similarityTest(Sketch<S> measured,
                                                                 Sketch<S> expected,
                                                                 SummarySetOperations<S> summarySetOps,
                                                                 double threshold)
        Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index, J_LB, of the measured and expected sketches. If J_LB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
        Type Parameters:
        S - Summary
        Parameters:
        measured - a Tuple sketch with summary type S to be tested
        expected - the reference Tuple sketch with summary type S that is considered to be correct.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        threshold - a real value between zero and one.
        Returns:
        true if the Jaccard similarity of the two sketches is at least the given threshold with at least 97.7% confidence.
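        The 97.7% figure is consistent with the test using only one side of the +/- 2 standard deviation interval returned by jaccard. As a rough check, and assuming an approximately normal estimator (an assumption made here for illustration, not a statement from the source), with Φ the standard normal cumulative distribution function:

```latex
P\left(J \ge J_{LB}\right) \approx \Phi(2) \approx 0.977,
\qquad
P\left(J_{LB} \le J \le J_{UB}\right) \approx 2\,\Phi(2) - 1 \approx 0.954
```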
      • similarityTest

        public static <S extends Summary> boolean similarityTest(Sketch<S> measured,
                                                                 Sketch expected,
                                                                 S summary,
                                                                 SummarySetOperations<S> summarySetOps,
                                                                 double threshold)
        Tests similarity of a measured Sketch against an expected Sketch. Computes the lower bound of the Jaccard index, J_LB, of the measured and expected sketches. If J_LB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.
        Type Parameters:
        S - Summary
        Parameters:
        measured - a Tuple sketch with summary type S to be tested
        expected - the reference Theta sketch that is considered to be correct.
        summary - the proxy summary to use for the Theta sketch, which has no summaries of its own. This must not be null.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        threshold - a real value between zero and one.
        Returns:
        true if the Jaccard similarity of the two sketches is at least the given threshold with at least 97.7% confidence.
      • dissimilarityTest

        public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured,
                                                                    Sketch<S> expected,
                                                                    SummarySetOperations<S> summarySetOps,
                                                                    double threshold)
        Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index, J_UB, of the measured and expected sketches. If J_UB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
        Type Parameters:
        S - Summary
        Parameters:
        measured - a Tuple sketch with summary type S to be tested
        expected - the reference Tuple sketch with summary type S that is considered to be correct.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        threshold - a real value between zero and one.
        Returns:
        true if the Jaccard similarity of the two sketches is at most the given threshold with at least 97.7% confidence.
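        The similarity and dissimilarity rules both read the same {LowerBound, Estimate, UpperBound} array, so for a given threshold neither test passes when the threshold falls strictly inside the interval. The following is a minimal sketch of that three-way outcome. The class, enum, and method names (JaccardDecision, Verdict, classify) are hypothetical, the bounds are made-up values rather than output from a real sketch, and the code only mirrors the documented comparisons without calling the library.

```java
/** Illustration only: hypothetical three-way decision on a {lower, estimate, upper} array. */
final class JaccardDecision {

  enum Verdict { SIMILAR, DISSIMILAR, INCONCLUSIVE }

  /**
   * Applies the two documented comparisons to one bounds array:
   * lower bound >= threshold means similar, upper bound <= threshold means dissimilar.
   */
  static Verdict classify(double[] bounds, double threshold) {
    double lowerBound = bounds[0];
    double upperBound = bounds[2];
    if (lowerBound >= threshold) {
      return Verdict.SIMILAR;
    }
    if (upperBound <= threshold) {
      return Verdict.DISSIMILAR;
    }
    return Verdict.INCONCLUSIVE; // threshold lies strictly between the two bounds
  }

  public static void main(String[] args) {
    double[] bounds = {0.90, 0.93, 0.96}; // made-up values for illustration
    System.out.println(classify(bounds, 0.85)); // SIMILAR
    System.out.println(classify(bounds, 0.97)); // DISSIMILAR
    System.out.println(classify(bounds, 0.93)); // INCONCLUSIVE
  }
}
```

        In practice a caller would typically pick a high threshold for the similarity test and a separate, lower threshold for the dissimilarity test, and treat everything in between as undecided.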
      • dissimilarityTest

        public static <S extends Summary> boolean dissimilarityTest(Sketch<S> measured,
                                                                    Sketch expected,
                                                                    S summary,
                                                                    SummarySetOperations<S> summarySetOps,
                                                                    double threshold)
        Tests dissimilarity of a measured Sketch against an expected Sketch. Computes the upper bound of the Jaccard index, J_UB, of the measured and expected sketches. If J_UB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.
        Type Parameters:
        S - Summary
        Parameters:
        measured - a Tuple sketch with summary type S to be tested
        expected - the reference Theta sketch that is considered to be correct.
        summary - the proxy summary to use for the Theta sketch, which has no summaries of its own. This must not be null.
        summarySetOps - instance of SummarySetOperations used to unify or intersect summaries.
        threshold - a real value between zero and one.
        Returns:
        true if the Jaccard similarity of the two sketches is at most the given threshold with at least 97.7% confidence.