Class FdtSketch
- java.lang.Object
-
- org.apache.datasketches.tuple.Sketch<S>
-
- org.apache.datasketches.tuple.UpdatableSketch<String[],ArrayOfStringsSummary>
-
- org.apache.datasketches.tuple.strings.ArrayOfStringsSketch
-
- org.apache.datasketches.fdt.FdtSketch
-
public final class FdtSketch extends ArrayOfStringsSketch
A Frequent Distinct Tuples sketch.
Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the IP addresses that have the most distinct User IDs. Or conversely, we would like to identify the User IDs that have the most distinct IP addresses. This is a common challenge in the analysis of big data and the FDT sketch helps solve this problem using probabilistic techniques.
More generally, given a multiset of tuples with N dimensions {d1, d2, d3, ..., dN} and a primary subset of M < N of those dimensions, our task is to identify the combinations of the M primary dimensions that have the largest number of distinct combinations of the N-M non-primary dimensions.
Please refer to the web page https://datasketches.apache.org/docs/Frequency/FrequentDistinctTuplesSketch.html for a more complete discussion about this sketch.
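To make the problem statement concrete, the following is a minimal exact (non-sketch) illustration of what is being estimated. It is not how FdtSketch works internally: it keeps every distinct tuple in memory, which is exactly what a probabilistic sketch avoids. The class name ExactFrequentDistinct and the method distinctCounts are hypothetical names chosen for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the DataSketches library.
final class ExactFrequentDistinct {

  /**
   * For each primary key (the tuple values at priKeyIndices joined by sep),
   * counts the distinct combinations of the remaining dimensions.
   * Assumes tuple values do not themselves contain the separator.
   */
  static Map<String, Integer> distinctCounts(List<String[]> tuples, int[] priKeyIndices, char sep) {
    Set<Integer> primary = new HashSet<>();
    for (int i : priKeyIndices) {
      primary.add(i);
    }
    String delim = String.valueOf(sep);

    // primary key -> set of distinct non-primary combinations seen with it
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] tuple : tuples) {
      List<String> key = new ArrayList<>();
      List<String> rest = new ArrayList<>();
      for (int i = 0; i < tuple.length; i++) {
        (primary.contains(i) ? key : rest).add(tuple[i]);
      }
      groups.computeIfAbsent(String.join(delim, key), k -> new HashSet<>())
          .add(String.join(delim, rest));
    }

    Map<String, Integer> counts = new HashMap<>();
    groups.forEach((k, v) -> counts.put(k, v.size()));
    return counts;
  }

  public static void main(String[] args) {
    // Stream of {IP address, User ID} pairs; the third pair is a duplicate.
    List<String[]> stream = Arrays.asList(
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.1", "bob"},
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.2", "alice"},
        new String[] {"10.0.0.1", "carol"});

    // Primary key = dimension 0 (IP address): distinct users per IP.
    Map<String, Integer> usersPerIp = distinctCounts(stream, new int[] {0}, '|');
    System.out.println("10.0.0.1 -> " + usersPerIp.get("10.0.0.1"));

    // Primary key = dimension 1 (User ID): distinct IPs per user.
    Map<String, Integer> ipsPerUser = distinctCounts(stream, new int[] {1}, '|');
    System.out.println("alice -> " + ipsPerUser.get("alice"));
  }
}
```

Sorting the resulting counts in descending order gives the exact answer that the sketch approximates with bounded memory and error bounds.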
- Author:
- Lee Rhodes
-
-
Field Summary
-
Fields inherited from class org.apache.datasketches.tuple.Sketch
PREAMBLE_LONGS, summaryFactory_
-
-
Constructor Summary
Constructors
Constructor Description
FdtSketch(double threshold, double rse)
Create a new instance of Frequent Distinct Tuples sketch with a size determined by the given threshold and rse.
FdtSketch(int lgK)
Create new instance of Frequent Distinct Tuples sketch with the given Log-base2 of required nominal entries.
FdtSketch(FdtSketch sketch)
Copy Constructor
-
Method Summary
Modifier and Type Method Description
CompactSketch<S>
compact()
Converts the current state of the sketch into a compact sketch.
FdtSketch
copy()
int
getCountLessThanThetaLong(long thetaLong)
Gets the number of hash values less than the given theta expressed as a long.
int
getCurrentCapacity()
Get current capacity
int
getLgK()
Get log_base2 of Nominal Entries
int
getNominalEntries()
Get configured nominal number of entries
PostProcessor
getPostProcessor()
Returns the PostProcessor that enables multiple queries against the sketch results.
PostProcessor
getPostProcessor(Group group, char sep)
Returns the PostProcessor that enables multiple queries against the sketch results.
ResizeFactor
getResizeFactor()
Get configured resize factor
List<Group>
getResult(int[] priKeyIndices, int limit, int numStdDev, char sep)
Returns an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
int
getRetainedEntries()
float
getSamplingProbability()
Get configured sampling probability
protected void
insertSummary(int index, S summary)
TupleSketchIterator<S>
iterator()
Returns a SketchIterator
void
reset()
Resets this sketch to an empty state.
byte[]
toByteArray()
Deprecated. As of 3.0.0, serializing an UpdatableSketch is deprecated.
void
trim()
Rebuilds reducing the actual number of entries to the nominal number of entries if needed
void
update(String[] tuple)
Update the sketch with the given string array tuple.
-
Methods inherited from class org.apache.datasketches.tuple.strings.ArrayOfStringsSketch
update
-
Methods inherited from class org.apache.datasketches.tuple.UpdatableSketch
update, update, update, update, update, update, update
-
Methods inherited from class org.apache.datasketches.tuple.Sketch
getEstimate, getEstimate, getLowerBound, getLowerBound, getSummaryFactory, getTheta, getThetaLong, getUpperBound, getUpperBound, isEmpty, isEstimationMode, toString
-
-
-
-
Constructor Detail
-
FdtSketch
public FdtSketch(int lgK)
Create new instance of Frequent Distinct Tuples sketch with the given Log-base2 of required nominal entries.
- Parameters:
lgK
- Log-base2 of required nominal entries.
-
FdtSketch
public FdtSketch(double threshold, double rse)
Create a new instance of Frequent Distinct Tuples sketch with a size determined by the given threshold and rse.
- Parameters:
threshold
- the fraction, between zero and 1.0, of the total distinct stream length that defines a "Frequent" (or heavy) item.
rse
- the maximum Relative Standard Error for the estimate of the distinct population of a reported tuple (selected with a primary key) at the threshold.
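As a rough illustration of how threshold and rse could translate into a sketch size, here is a back-of-envelope sizing rule. It assumes that a group holding a fraction `threshold` of the distinct tuples is represented by about threshold × K of the K retained entries, so its relative standard error is roughly 1 / sqrt(threshold × K). This page does not state the formula the constructor actually uses, so treat the rule, the class name FdtSizing, and the method lgKFor as assumptions made for this example.

```java
// Hypothetical sizing helper for illustration; not part of the DataSketches library.
final class FdtSizing {

  /**
   * Returns the smallest lgK such that 2^lgK >= 1 / (threshold * rse^2).
   * This follows from the assumed approximation rse ~ 1 / sqrt(threshold * K).
   */
  static int lgKFor(double threshold, double rse) {
    if (!(threshold > 0.0 && threshold <= 1.0) || !(rse > 0.0)) {
      throw new IllegalArgumentException("threshold must be in (0, 1] and rse must be > 0");
    }
    double minEntries = Math.ceil(1.0 / (threshold * rse * rse));
    int lgK = 0;
    // Cap the loop so that extreme inputs cannot overflow the shift.
    while (lgK < 62 && (1L << lgK) < minEntries) {
      lgK++;
    }
    return lgK;
  }

  public static void main(String[] args) {
    // A group must hold at least 1% of distinct tuples; target RSE of 5% at that threshold.
    System.out.println("lgK = " + lgKFor(0.01, 0.05));
  }
}
```

Under this assumed rule, halving rse quadruples the required number of entries, while halving threshold doubles it.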
-
FdtSketch
public FdtSketch(FdtSketch sketch)
Copy Constructor
- Parameters:
sketch
- the sketch to copy
-
-
Method Detail
-
copy
public FdtSketch copy()
- Overrides:
copy
in class ArrayOfStringsSketch
- Returns:
- a deep copy of this sketch
-
update
public void update(String[] tuple)
Update the sketch with the given string array tuple.
- Parameters:
tuple
- the given string array tuple.
-
getResult
public List<Group> getResult(int[] priKeyIndices, int limit, int numStdDev, char sep)
Returns an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
- Parameters:
priKeyIndices
- these indices define the dimensions used for the Primary Keys.
limit
- the maximum number of groups to return. If this value is ≤ 0, all groups will be returned.
numStdDev
- the number of standard deviations for the upper and lower error bounds. This value is an integer and must be one of 1, 2, or 3. See Number of Standard Deviations.
sep
- the separator character
- Returns:
- an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
-
getPostProcessor
public PostProcessor getPostProcessor()
Returns the PostProcessor that enables multiple queries against the sketch results. This assumes the default Group and the default separator character '|'.
- Returns:
- the PostProcessor
-
getPostProcessor
public PostProcessor getPostProcessor(Group group, char sep)
Returns the PostProcessor that enables multiple queries against the sketch results.
- Parameters:
group
- the Group class to use during post processing.
sep
- the separator character.
- Returns:
- the PostProcessor
-
getRetainedEntries
public int getRetainedEntries()
- Specified by:
getRetainedEntries
in class Sketch<S extends Summary>
- Returns:
- number of retained entries
-
getCountLessThanThetaLong
public int getCountLessThanThetaLong(long thetaLong)
Description copied from class: Sketch
Gets the number of hash values less than the given theta expressed as a long.
- Specified by:
getCountLessThanThetaLong
in class Sketch<S extends Summary>
- Parameters:
thetaLong
- the given theta as a long between zero and Long.MAX_VALUE.
- Returns:
- the number of hash values less than the given thetaLong.
-
getNominalEntries
public int getNominalEntries()
Get configured nominal number of entries
- Returns:
- nominal number of entries
-
getLgK
public int getLgK()
Get log_base2 of Nominal Entries
- Returns:
- log_base2 of Nominal Entries
-
getSamplingProbability
public float getSamplingProbability()
Get configured sampling probability
- Returns:
- sampling probability
-
getCurrentCapacity
public int getCurrentCapacity()
Get current capacity
- Returns:
- current capacity
-
getResizeFactor
public ResizeFactor getResizeFactor()
Get configured resize factor
- Returns:
- resize factor
-
trim
public void trim()
Rebuilds the sketch, reducing the actual number of entries to the nominal number of entries if needed.
-
reset
public void reset()
Resets this sketch to an empty state.
-
compact
public CompactSketch<S> compact()
Converts the current state of the sketch into a compact sketch
-
toByteArray
@Deprecated public byte[] toByteArray()
Deprecated. As of 3.0.0, serializing an UpdatableSketch is deprecated. This capability will be removed in a future release. Serializing a CompactSketch is not deprecated.
This serializes an UpdatableSketch (QuickSelectSketch).
- Specified by:
toByteArray
in class Sketch<S extends Summary>
- Returns:
- serialized representation of an UpdatableSketch (QuickSelectSketch).
-
insertSummary
protected void insertSummary(int index, S summary)
-
iterator
public TupleSketchIterator<S> iterator()
Description copied from class: Sketch
Returns a SketchIterator
-
-