datasketches-cpp
jaccard_similarity_base< Union, Intersection, ExtractKey > Class Template Reference

Base class for Jaccard similarity.

#include <theta_jaccard_similarity_base.hpp>

Static Public Member Functions

template<typename SketchA, typename SketchB>
static std::array<double, 3> jaccard(const SketchA& sketch_a, const SketchB& sketch_b, uint64_t seed = DEFAULT_SEED)
 Computes the Jaccard similarity index with upper and lower bounds.

template<typename SketchA, typename SketchB>
static bool exactly_equal(const SketchA& sketch_a, const SketchB& sketch_b, uint64_t seed = DEFAULT_SEED)
 Returns true if the two given sketches are equivalent.

template<typename SketchA, typename SketchB>
static bool similarity_test(const SketchA& actual, const SketchB& expected, double threshold, uint64_t seed = DEFAULT_SEED)
 Tests similarity of an actual Sketch against an expected Sketch.

template<typename SketchA, typename SketchB>
static bool dissimilarity_test(const SketchA& actual, const SketchB& expected, double threshold, uint64_t seed = DEFAULT_SEED)
 Tests dissimilarity of an actual Sketch against an expected Sketch.
 

Detailed Description

template<typename Union, typename Intersection, typename ExtractKey>
class datasketches::jaccard_similarity_base< Union, Intersection, ExtractKey >

Base class for Jaccard similarity.

Member Function Documentation

◆ jaccard()

static std::array<double, 3> jaccard(const SketchA& sketch_a, const SketchB& sketch_b, uint64_t seed = DEFAULT_SEED)
inline static

Computes the Jaccard similarity index with upper and lower bounds.

The Jaccard similarity index J(A,B) = |A ∩ B| / |A ∪ B| measures how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard index of 0.95 means that the overlap between the two sets is 95% of the union of the two sets.
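
To make the definition concrete, here is a minimal sketch that computes the exact Jaccard index of two plain `std::set<int>` values. It is an illustration of the formula only, not the sketch-based estimator this class provides; the function name `exact_jaccard` and the convention that two empty sets yield 1.0 are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>

// Exact Jaccard index of two plain sets (not sketches): |A n B| / |A u B|.
// Assumed convention for this example: two empty sets count as identical (J = 1).
double exact_jaccard(const std::set<int>& a, const std::set<int>& b) {
  if (a.empty() && b.empty()) return 1.0;
  std::size_t common = 0;
  for (int x : a) {
    if (b.count(x) > 0) ++common;  // x belongs to the intersection
  }
  // Inclusion-exclusion: |A u B| = |A| + |B| - |A n B|
  const std::size_t united = a.size() + b.size() - common;
  return static_cast<double>(common) / static_cast<double>(united);
}
```

For example, the sets {1, 2, 3, 4} and {3, 4, 5, 6} share two elements out of six distinct ones, so the index is 2/6.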

Note: For very large pairs of sketches, where the configured nominal entries of the sketches are 2^25 or 2^26, this method may produce unpredictable results.

Parameters
sketch_a  given sketch A
sketch_b  given sketch B
seed      for the hash function that was used to create the sketch
Returns
an array of three doubles {LowerBound, Estimate, UpperBound} for the Jaccard index. The lower and upper bounds correspond to a confidence interval of 95.4%, or +/- 2 standard deviations.
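
The 95.4% figure can be reproduced from the "+/- 2 standard deviations" statement if one assumes a normal approximation for the estimation error (an assumption of this example, not something this page states). The same calculation for a single one-sided bound gives roughly 97.7%, the confidence quoted for similarity_test() and dissimilarity_test(). The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that a normally distributed value lies within
// +/- k standard deviations of its mean (two-sided interval).
double two_sided_confidence(double k) {
  return std::erf(k / std::sqrt(2.0));
}

// Probability that a normally distributed value lies on the favourable
// side of a single bound placed k standard deviations from the mean.
double one_sided_confidence(double k) {
  return 0.5 * (1.0 + std::erf(k / std::sqrt(2.0)));
}
```

With k = 2, the two-sided value is about 0.9545 and the one-sided value is about 0.9772.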

◆ exactly_equal()

static bool exactly_equal(const SketchA& sketch_a, const SketchB& sketch_b, uint64_t seed = DEFAULT_SEED)
inline static

Returns true if the two given sketches are equivalent.

Parameters
sketch_a  the given sketch A
sketch_b  the given sketch B
seed      for the hash function that was used to create the sketch
Returns
true if the two given sketches are exactly equal
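
As a set-level analogue of this check, two finite sets are equal exactly when they have the same size and their union is no larger than either of them. The sketch below shows that idea on plain `std::set<int>` values. It is an illustration only: the library operates on sketches rather than raw sets, so its actual check may differ, and the name `sets_exactly_equal` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// A == B exactly when |A| == |B| == |A u B|.
bool sets_exactly_equal(const std::set<int>& a, const std::set<int>& b) {
  if (a.size() != b.size()) return false;
  std::set<int> united(a);             // start from a copy of A
  united.insert(b.begin(), b.end());   // add every element of B
  return united.size() == a.size();    // the union grew if B had anything new
}
```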

◆ similarity_test()

static bool similarity_test(const SketchA& actual, const SketchB& expected, double threshold, uint64_t seed = DEFAULT_SEED)
inline static

Tests similarity of an actual Sketch against an expected Sketch.

Computes the lower bound of the Jaccard index, J_LB, of the actual and expected sketches. If J_LB ≥ threshold, then the sketches are considered to be similar with a confidence of 97.7%.

Parameters
actual     the sketch to be tested
expected   the reference sketch that is considered to be correct
threshold  a real value between zero and one
seed       for the hash function that was used to create the sketch
Returns
true if the similarity of the two sketches is greater than or equal to the given threshold with at least 97.7% confidence
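
The decision rule described above can be sketched on the three-element array documented for jaccard(), {LowerBound, Estimate, UpperBound}. The function name `passes_similarity` is hypothetical, and the bounds in the usage note are made-up illustrative numbers rather than output from real sketches.

```cpp
#include <array>
#include <cassert>

// bounds = {lower bound, estimate, upper bound} of the Jaccard index.
// The pair counts as similar only if even the lower bound reaches the threshold.
bool passes_similarity(const std::array<double, 3>& bounds, double threshold) {
  return bounds[0] >= threshold;
}
```

With bounds {0.92, 0.95, 0.98}, a threshold of 0.90 passes, while a threshold of 0.95 does not, even though the estimate itself equals 0.95: the test is deliberately conservative.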

◆ dissimilarity_test()

static bool dissimilarity_test(const SketchA& actual, const SketchB& expected, double threshold, uint64_t seed = DEFAULT_SEED)
inline static

Tests dissimilarity of an actual Sketch against an expected Sketch.

Computes the upper bound of the Jaccard index, J_UB, of the actual and expected sketches. If J_UB ≤ threshold, then the sketches are considered to be dissimilar with a confidence of 97.7%.

Parameters
actual     the sketch to be tested
expected   the reference sketch that is considered to be correct
threshold  a real value between zero and one
seed       for the hash function that was used to create the sketch
Returns
true if the Jaccard index of the two sketches is less than or equal to the given threshold with at least 97.7% confidence
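
A minimal sketch of this rule, again on the {LowerBound, Estimate, UpperBound} array documented for jaccard(): the upper bound must not exceed the threshold. The name `passes_dissimilarity` is hypothetical and the numbers in the usage note are illustrative only.

```cpp
#include <array>
#include <cassert>

// bounds = {lower bound, estimate, upper bound} of the Jaccard index.
// The pair counts as dissimilar only if even the upper bound stays at or below the threshold.
bool passes_dissimilarity(const std::array<double, 3>& bounds, double threshold) {
  return bounds[2] <= threshold;
}
```

With bounds {0.02, 0.05, 0.08}, a threshold of 0.10 passes and a threshold of 0.05 does not. A pair whose interval straddles the threshold, such as {0.40, 0.50, 0.60} against 0.50, is not reported as dissimilar.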

The documentation for this class was generated from the following file: