Sketch-Based Approximate Functions

Spark’s SQL and DataFrame APIs provide a collection of sketch-based approximate functions powered by the Apache DataSketches library. These functions enable efficient probabilistic computations on large datasets with bounded memory usage and accuracy guarantees.

Sketches are compact data structures that summarize large datasets, supporting distributed aggregation through serialization and merging. This makes them ideal for use cases including (so far):

Approximate count distinct (HLL, Theta, and Tuple sketches)
Approximate quantile estimation (KLL sketches)
Approximate frequent items (Top-K sketches)
Set operations on distinct counts (Theta and Tuple sketches)
Distinct counting with aggregated summaries (Tuple sketches)

HyperLogLog (HLL) Sketch Functions
Theta Sketch Functions
Tuple Sketch Functions
- tuple_sketch_agg_*
- tuple_union_agg_*
- tuple_intersection_agg_*
- tuple_sketch_estimate_*
- tuple_sketch_summary_*
- tuple_sketch_theta_*
- tuple_union_*
- tuple_intersection_*
- tuple_difference_*
- tuple_union_theta_* (union with Theta sketches)
- tuple_intersection_theta_* (intersection with Theta sketches)
- tuple_difference_theta_* (difference with Theta sketches)
KLL Quantile Sketch Functions
Approximate Top-K Functions
Best Practices
Common Use Cases and Examples

HyperLogLog (HLL) Sketch Functions

HyperLogLog sketches provide approximate count distinct functionality with configurable accuracy and memory usage. They are well-suited for counting unique values in very large datasets.

See the Apache DataSketches HLL documentation for more information.

hll_sketch_agg

Creates an HLL sketch from input values that can later be used to estimate count distinct.

Syntax:

hll_sketch_agg(expr [, lgConfigK])

Argument	Type	Description
`expr`	INT, BIGINT, STRING, or BINARY	The expression whose distinct values will be counted
`lgConfigK`	INT (optional)	Log-base-2 of K, where K is the number of buckets. Range: 4-21. Default: 12. Higher values provide more accuracy but use more memory.

Returns a BINARY containing the HLL sketch in updatable binary representation.

Examples:

-- Basic usage: create a sketch and estimate distinct count
SELECT hll_sketch_estimate(hll_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
-- Result: 3

-- With custom lgConfigK for higher accuracy
SELECT hll_sketch_estimate(hll_sketch_agg(col, 16))
FROM VALUES (50), (60), (60), (60), (75), (100) tab(col);
-- Result: 4

-- With string values
SELECT hll_sketch_estimate(hll_sketch_agg(col))
FROM VALUES ('abc'), ('def'), ('abc'), ('ghi'), ('abc') tab(col);
-- Result: 3

Notes:

NULL values are ignored during aggregation.
Empty strings (for STRING type) and empty byte arrays (for BINARY type) are ignored.
The sketch can be stored and later merged with other sketches using hll_union or hll_union_agg.

hll_union_agg

Aggregates multiple HLL sketches into a single merged sketch.

Syntax:

hll_union_agg(sketch [, allowDifferentLgConfigK])

Argument	Type	Description
`sketch`	BINARY	An HLL sketch in binary format (produced by `hll_sketch_agg`)
`allowDifferentLgConfigK`	BOOLEAN (optional)	If true, allows merging sketches with different lgConfigK values. Default: false.

Returns a BINARY containing the merged HLL sketch.

Examples:

-- Merge sketches from different partitions
SELECT hll_sketch_estimate(hll_union_agg(sketch, true))
FROM (
  SELECT hll_sketch_agg(col) as sketch
  FROM VALUES (1) tab(col)
  UNION ALL
  SELECT hll_sketch_agg(col, 20) as sketch
  FROM VALUES (1) tab(col)
);
-- Result: 1

-- Standard merge (same lgConfigK)
SELECT hll_sketch_estimate(hll_union_agg(sketch))
FROM (
  SELECT hll_sketch_agg(col) as sketch
  FROM VALUES (1), (2) tab(col)
  UNION ALL
  SELECT hll_sketch_agg(col) as sketch
  FROM VALUES (3), (4) tab(col)
);
-- Result: 4

Notes:

If allowDifferentLgConfigK is false and sketches have different lgConfigK values, an error is thrown.
The output sketch uses the minimum lgConfigK value of all input sketches when merging sketches with different sizes.

hll_sketch_estimate

Estimates the number of unique values from an HLL sketch.

Syntax:

hll_sketch_estimate(sketch)

Argument	Type	Description
`sketch`	BINARY	An HLL sketch in binary format

Returns a BIGINT representing the estimated count of distinct values.

Examples:

SELECT hll_sketch_estimate(hll_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
-- Result: 3

Errors:

Throws an error if the input is not a valid HLL sketch binary representation.

hll_union

Merges two HLL sketches into one (scalar function).

Syntax:

hll_union(first, second [, allowDifferentLgConfigK])

Argument	Type	Description
`first`	BINARY	First HLL sketch
`second`	BINARY	Second HLL sketch
`allowDifferentLgConfigK`	BOOLEAN (optional)	Allow different lgConfigK values. Default: false.

Returns a BINARY containing the merged HLL sketch.

Examples:

SELECT hll_sketch_estimate(
  hll_union(
    hll_sketch_agg(col1),
    hll_sketch_agg(col2)))
FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
-- Result: 6

Theta Sketch Functions

Theta sketches provide approximate count distinct with support for set operations (union, intersection, and difference). This makes them ideal for computing unique counts across overlapping datasets.

Note: If you need to track both distinct counts AND aggregated metrics (like sum, min, or max of associated values), consider using Tuple Sketch Functions instead, which extend Theta sketches with numeric summary values.

See the Apache DataSketches Theta documentation for more information.

theta_sketch_agg

Creates a Theta sketch from input values.

Syntax:

theta_sketch_agg(expr [, lgNomEntries])

Argument	Type	Description
`expr`	INT, BIGINT, FLOAT, DOUBLE, STRING, BINARY, ARRAY<INT>, or ARRAY<BIGINT>	The expression whose distinct values will be counted
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.

Returns a BINARY containing the Theta sketch in compact binary representation.

Examples:

-- Basic distinct count
SELECT theta_sketch_estimate(theta_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
-- Result: 3

-- With custom lgNomEntries
SELECT theta_sketch_estimate(theta_sketch_agg(col, 22))
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: 7

-- With array values
SELECT theta_sketch_estimate(theta_sketch_agg(col))
FROM VALUES (ARRAY(1, 2)), (ARRAY(3, 4)), (ARRAY(1, 2)) tab(col);
-- Result: 2

Notes:

NULL values are ignored.
Supports a wider range of input types compared to HLL sketches.
Empty arrays, empty strings, and empty binary values are ignored.

theta_union_agg

Aggregates multiple Theta sketches using union operation.

Syntax:

theta_union_agg(sketch [, lgNomEntries])

Argument	Type	Description
`sketch`	BINARY	A Theta sketch in binary format
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.

Returns a BINARY containing the merged Theta sketch.

Examples:

SELECT theta_sketch_estimate(theta_union_agg(sketch, 15))
FROM (
  SELECT theta_sketch_agg(col1) as sketch
  FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col1)
  UNION ALL
  SELECT theta_sketch_agg(col2, 20) as sketch
  FROM VALUES (5), (6), (7), (8), (9), (10), (11) tab(col2)
);
-- Result: 11

theta_intersection_agg

Aggregates multiple Theta sketches using intersection operation (finds common distinct values).

Syntax:

theta_intersection_agg(sketch)

Argument	Type	Description
`sketch`	BINARY	A Theta sketch in binary format

Returns a BINARY containing the intersected Theta sketch.

Examples:

SELECT theta_sketch_estimate(theta_intersection_agg(sketch))
FROM (
  SELECT theta_sketch_agg(col1) as sketch
  FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col1)
  UNION ALL
  SELECT theta_sketch_agg(col2) as sketch
  FROM VALUES (5), (6), (7), (8), (9), (10), (11) tab(col2)
);
-- Result: 3 (values 5, 6, 7 are common)

theta_sketch_estimate

Estimates the number of unique values from a Theta sketch.

Syntax:

theta_sketch_estimate(sketch)

Argument	Type	Description
`sketch`	BINARY	A Theta sketch in binary format

Returns a BIGINT representing the estimated count of distinct values.

Examples:

SELECT theta_sketch_estimate(theta_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
-- Result: 3

theta_union

Merges two Theta sketches using union (scalar function).

Syntax:

theta_union(first, second [, lgNomEntries])

Argument	Type	Description
`first`	BINARY	First Theta sketch
`second`	BINARY	Second Theta sketch
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.

Returns a BINARY containing the merged Theta sketch.

Examples:

SELECT theta_sketch_estimate(
  theta_union(
    theta_sketch_agg(col1),
    theta_sketch_agg(col2)))
FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
-- Result: 6

theta_intersection

Computes the intersection of two Theta sketches (scalar function).

Syntax:

theta_intersection(first, second)

Argument	Type	Description
`first`	BINARY	First Theta sketch
`second`	BINARY	Second Theta sketch

Returns a BINARY containing the intersected Theta sketch.

Examples:

SELECT theta_sketch_estimate(
  theta_intersection(
    theta_sketch_agg(col1),
    theta_sketch_agg(col2)))
FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
-- Result: 2 (values 1 and 5 are common)

theta_difference

Computes the set difference of two Theta sketches (A - B).

Syntax:

theta_difference(first, second)

Argument	Type	Description
`first`	BINARY	First Theta sketch (A)
`second`	BINARY	Second Theta sketch (B)

Returns a BINARY containing a Theta sketch representing values in A but not in B.

Examples:

SELECT theta_sketch_estimate(
  theta_difference(
    theta_sketch_agg(col1),
    theta_sketch_agg(col2)))
FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
-- Result: 2 (values 2 and 3 are in col1 but not col2)

Tuple Sketch Functions

Tuple sketches extend Theta sketches by associating a numeric summary value with each distinct key. They provide approximate distinct counting with set operations (union, intersection, difference) while also aggregating summary values using configurable modes (sum, min, max, etc.). This makes them useful for scenarios where you need both cardinality estimation and aggregated metrics.

Note: If you only need distinct counting without aggregated metrics, consider using Theta Sketch Functions instead, which are more memory-efficient for pure cardinality estimation with set operations.

See the Apache DataSketches Tuple documentation for more information.

Tuple functions are type-specific:

DOUBLE variants: For double-precision floating-point summary values
INTEGER variants: For integer summary values

Summary Aggregation Modes: When multiple values are associated with the same key (during sketch creation or set operations), the mode parameter determines how summary values are combined:

sum: Adds all summary values together (default)
min: Keeps the minimum summary value
max: Keeps the maximum summary value
alwaysone: Sets all summary values to 1, effectively treating the sketch like a Theta sketch (useful for converting Tuple sketches to cardinality-only analysis)

tuple_sketch_agg_*

Creates a Tuple sketch from key-value pairs, aggregating summary values for distinct keys.

Syntax:

tuple_sketch_agg_double(key, summary [, lgNomEntries] [, mode])
tuple_sketch_agg_integer(key, summary [, lgNomEntries] [, mode])

Argument	Type	Description
`key`	INT, BIGINT, FLOAT, DOUBLE, STRING, BINARY, ARRAY<INT>, or ARRAY<BIGINT>	The key column for distinct counting
`summary`	DOUBLE (for _double) or INT (for _integer)	The numeric value to aggregate for each key
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns a BINARY containing the Tuple sketch in compact binary representation.

Examples:

-- Basic usage: count distinct keys and sum their values
SELECT
  tuple_sketch_estimate_integer(tuple_sketch_agg_integer(key, value)) as distinct_keys,
  tuple_sketch_summary_integer(tuple_sketch_agg_integer(key, value)) as total_value
FROM VALUES (1, 10), (2, 20), (2, 30) tab(key, value);
-- Result: distinct_keys=2, total_value=60

-- With custom lgNomEntries and mode
SELECT tuple_sketch_summary_double(
  tuple_sketch_agg_double(key, value, 16, 'sum'), 'max')
FROM VALUES (1, 10.0), (2, 20.0), (2, 30.0) tab(key, value);
-- Result: 50.0 (max value for distinct keys)

Notes:

NULL keys are ignored during aggregation.
When the same key appears multiple times, summary values are aggregated according to the mode parameter.
The sketch can be stored and later merged using tuple union/intersection functions.

tuple_union_agg_*

Aggregates multiple Tuple sketches using union operation, combining distinct keys from all sketches.

Syntax:

tuple_union_agg_double(sketch [, lgNomEntries] [, mode])
tuple_union_agg_integer(sketch [, lgNomEntries] [, mode])

Argument	Type	Description
`sketch`	BINARY	A Tuple sketch in binary format
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns a BINARY containing the merged Tuple sketch.

Examples:

-- Merge sketches from different data sources
SELECT tuple_sketch_estimate_double(tuple_union_agg_double(sketch))
FROM (
  SELECT tuple_sketch_agg_double(key, value) as sketch
  FROM VALUES (1, 10.0), (2, 20.0) tab(key, value)
  UNION ALL
  SELECT tuple_sketch_agg_double(key, value) as sketch
  FROM VALUES (3, 30.0), (4, 40.0) tab(key, value)
);
-- Result: 4.0 (union of keys 1,2,3,4)

tuple_intersection_agg_*

Aggregates multiple Tuple sketches using intersection operation, finding common distinct keys.

Syntax:

tuple_intersection_agg_double(sketch [, mode])
tuple_intersection_agg_integer(sketch [, mode])

Argument	Type	Description
`sketch`	BINARY	A Tuple sketch in binary format
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns a BINARY containing the intersected Tuple sketch.

Examples:

-- Find common keys across sketches
SELECT tuple_sketch_estimate_integer(tuple_intersection_agg_integer(sketch))
FROM (
  SELECT tuple_sketch_agg_integer(key, value) as sketch
  FROM VALUES (1, 10), (2, 20), (3, 30) tab(key, value)
  UNION ALL
  SELECT tuple_sketch_agg_integer(key, value) as sketch
  FROM VALUES (2, 40), (3, 50), (4, 60) tab(key, value)
);
-- Result: 2.0 (keys 2 and 3 are common)

tuple_sketch_estimate_*

Estimates the number of distinct keys from a Tuple sketch.

Syntax:

tuple_sketch_estimate_double(sketch)
tuple_sketch_estimate_integer(sketch)

Argument	Type	Description
`sketch`	BINARY	A Tuple sketch in binary format

Returns a DOUBLE representing the estimated count of distinct keys.

Examples:

SELECT tuple_sketch_estimate_double(
  tuple_sketch_agg_double(key, value))
FROM VALUES (1, 10.0), (2, 20.0), (2, 30.0) tab(key, value);
-- Result: 2.0

tuple_sketch_summary_*

Returns the aggregated summary value from a Tuple sketch.

Syntax:

tuple_sketch_summary_double(sketch [, mode])
tuple_sketch_summary_integer(sketch [, mode])

Argument	Type	Description
`sketch`	BINARY	A Tuple sketch in binary format
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns the aggregated summary value (DOUBLE or BIGINT depending on variant).

Examples:

-- Get the sum of all summary values
SELECT tuple_sketch_summary_integer(
  tuple_sketch_agg_integer(key, value))
FROM VALUES (1, 10), (2, 20), (2, 30) tab(key, value);
-- Result: 60 (10 + 20 + 30, with key 2's values summed)

tuple_sketch_theta_*

Returns the theta value from a Tuple sketch, indicating sampling rate.

Syntax:

tuple_sketch_theta_double(sketch)
tuple_sketch_theta_integer(sketch)

Argument	Type	Description
`sketch`	BINARY	A Tuple sketch in binary format

Returns a DOUBLE between 0.0 and 1.0 representing the theta value (1.0 means no sampling).

Examples:

SELECT tuple_sketch_theta_double(
  tuple_sketch_agg_double(key, value))
FROM VALUES (1, 10.0), (2, 20.0) tab(key, value);
-- Result: 1.0

tuple_union_*

Merges two Tuple sketches using union (scalar function).

Syntax:

tuple_union_double(first, second [, lgNomEntries] [, mode])
tuple_union_integer(first, second [, lgNomEntries] [, mode])

Argument	Type	Description
`first`	BINARY	First Tuple sketch
`second`	BINARY	Second Tuple sketch
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns a BINARY containing the merged Tuple sketch.

Examples:

-- Union of two sketches from different columns
SELECT tuple_sketch_estimate_double(
  tuple_union_double(
    tuple_sketch_agg_double(key1, v1),
    tuple_sketch_agg_double(key2, v2)))
FROM VALUES (1, 10.0, 3, 30.0), (2, 20.0, 4, 40.0) tab(key1, v1, key2, v2);
-- Result: 4.0 (keys 1,2,3,4 are in the union)

tuple_intersection_*

Computes the intersection of two Tuple sketches (scalar function).

Syntax:

tuple_intersection_double(first, second [, mode])
tuple_intersection_integer(first, second [, mode])

Argument	Type	Description
`first`	BINARY	First Tuple sketch
`second`	BINARY	Second Tuple sketch
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone” (sets all summary values to 1)

Returns a BINARY containing the intersected Tuple sketch.

Examples:

-- Find keys present in both sketches
SELECT tuple_sketch_estimate_integer(
  tuple_intersection_integer(
    tuple_sketch_agg_integer(key1, v1),
    tuple_sketch_agg_integer(key2, v2)))
FROM VALUES (1, 10, 2, 20), (2, 20, 3, 30), (3, 30, 4, 40) tab(key1, v1, key2, v2);
-- Result: 2.0 (keys 2 and 3 are common)

tuple_difference_*

Computes the set difference of two Tuple sketches (A - B), returning keys in the first sketch but not in the second.

Syntax:

tuple_difference_double(first, second)
tuple_difference_integer(first, second)

Argument	Type	Description
`first`	BINARY	First Tuple sketch (A)
`second`	BINARY	Second Tuple sketch (B)

Returns a BINARY containing a Tuple sketch representing keys in A but not in B.

Examples:

-- Find keys only in first sketch
SELECT tuple_sketch_estimate_double(
  tuple_difference_double(
    tuple_sketch_agg_double(key1, v1),
    tuple_sketch_agg_double(key2, v2)))
FROM VALUES (1, 10.0, 4, 40.0), (2, 20.0, 4, 40.0), (3, 30.0, 5, 50.0), (4, 40.0, 5, 50.0) tab(key1, v1, key2, v2);
-- Result: 3.0 (keys 1, 2, 3 are in key1 but not key2)

tuple_union_theta_*

Merges a Tuple sketch with a Theta sketch using union (scalar function). This combines distinct keys from both sketches, with Theta sketch entries assigned a default summary value.

Syntax:

tuple_union_theta_double(tupleSketch, thetaSketch [, lgNomEntries] [, mode])
tuple_union_theta_integer(tupleSketch, thetaSketch [, lgNomEntries] [, mode])

Argument	Type	Description
`tupleSketch`	BINARY	Tuple sketch with double/integer summaries
`thetaSketch`	BINARY	Theta sketch (keys only)
`lgNomEntries`	INT (optional)	Log-base-2 of nominal entries. Range: 4-26. Default: 12.
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone”

Returns a BINARY containing the merged Tuple sketch.

Notes:

When Theta sketch entries (keys without summary values) are merged with Tuple sketch entries, they receive default summary values based on the mode:
- sum mode: 0.0 (identity element for addition)
- min mode: +Infinity (identity element for minimum)
- max mode: -Infinity (identity element for maximum)
- alwaysone mode: 1.0 (all entries set to 1)

Examples:

-- Union a Tuple sketch with a Theta sketch
SELECT tuple_sketch_estimate_double(
  tuple_union_theta_double(
    tuple_sketch_agg_double(user_id, spend),
    theta_sketch_agg(visitor_id)))
FROM VALUES (1, 10.0), (2, 20.0) users(user_id, spend)
CROSS JOIN (VALUES (3), (4) visitors(visitor_id));
-- Result: 4.0 (all keys: 1, 2, 3, 4)

-- With custom lgNomEntries and mode
SELECT tuple_sketch_estimate_integer(
  tuple_union_theta_integer(
    tuple_sketch_agg_integer(user_id, count),
    theta_sketch_agg(visitor_id),
    16,
    'sum'))
FROM VALUES (1, 5), (2, 10) users(user_id, count)
CROSS JOIN (VALUES (3), (4) visitors(visitor_id));
-- Result: 4.0

Use Case: Useful for combining user activity data (Tuple sketch with metrics) with anonymous visitor data (Theta sketch) to get total distinct users across both groups.

tuple_intersection_theta_*

Computes the intersection of a Tuple sketch with a Theta sketch (scalar function). Returns keys that exist in both sketches, preserving the summary values from the Tuple sketch.

Syntax:

tuple_intersection_theta_double(tupleSketch, thetaSketch [, mode])
tuple_intersection_theta_integer(tupleSketch, thetaSketch [, mode])

Argument	Type	Description
`tupleSketch`	BINARY	Tuple sketch with double/integer summaries
`thetaSketch`	BINARY	Theta sketch (keys only)
`mode`	STRING (optional)	Summary aggregation mode: “sum” (default), “min”, “max”, “alwaysone”

Returns a BINARY containing the intersected Tuple sketch.

Notes:

When Theta sketch entries (keys without summary values) are intersected with Tuple sketch entries, they receive default summary values based on the mode:
- sum mode: 0.0 (identity element for addition)
- min mode: +Infinity (identity element for minimum)
- max mode: -Infinity (identity element for maximum)
- alwaysone mode: 1.0 (all entries set to 1)

Examples:

-- Find registered users who also appear as visitors
SELECT tuple_sketch_estimate_double(
  tuple_intersection_theta_double(
    tuple_sketch_agg_double(user_id, spend),
    theta_sketch_agg(visitor_id)))
FROM VALUES (1, 10.0), (2, 20.0), (3, 30.0) users(user_id, spend)
CROSS JOIN (VALUES (2), (3), (4) visitors(visitor_id));
-- Result: 2.0 (keys 2 and 3 are common)

-- With mode parameter
SELECT tuple_sketch_summary_integer(
  tuple_intersection_theta_integer(
    tuple_sketch_agg_integer(user_id, count),
    theta_sketch_agg(visitor_id),
    'sum'))
FROM VALUES (1, 5), (2, 10), (3, 15) users(user_id, count)
CROSS JOIN (VALUES (2), (3), (4) visitors(visitor_id));
-- Result: 25 (sum of counts for keys 2 and 3)

Use Case: Identify registered users who also appear in visitor logs, and aggregate their associated metrics (e.g., total spend of users who visited the site).

tuple_difference_theta_*

Computes the set difference between a Tuple sketch and a Theta sketch (A - B). Returns keys in the Tuple sketch that are not in the Theta sketch, preserving summary values.

Syntax:

tuple_difference_theta_double(tupleSketch, thetaSketch)
tuple_difference_theta_integer(tupleSketch, thetaSketch)

Argument	Type	Description
`tupleSketch`	BINARY	Tuple sketch with double/integer summaries (A)
`thetaSketch`	BINARY	Theta sketch (B)

Returns a BINARY containing a Tuple sketch representing keys in A but not in B.

Examples:

-- Find registered users who have not visited recently
SELECT tuple_sketch_estimate_double(
  tuple_difference_theta_double(
    tuple_sketch_agg_double(user_id, lifetime_spend),
    theta_sketch_agg(recent_visitor_id)))
FROM VALUES (1, 100.0), (2, 200.0), (3, 300.0), (5, 500.0) all_users(user_id, lifetime_spend)
CROSS JOIN (VALUES (1), (2), (4) recent_visitors(recent_visitor_id));
-- Result: 2.0 (users 3 and 5 haven't visited recently)

-- Get the total lifetime spend of users who haven't visited recently
SELECT tuple_sketch_summary_double(
  tuple_difference_theta_double(
    tuple_sketch_agg_double(user_id, lifetime_spend),
    theta_sketch_agg(recent_visitor_id)))
FROM VALUES (1, 100.0), (2, 200.0), (3, 300.0), (5, 500.0) all_users(user_id, lifetime_spend)
CROSS JOIN (VALUES (1), (2), (4) recent_visitors(recent_visitor_id));
-- Result: 800.0 (sum of 300.0 + 500.0 from users 3 and 5)

Use Case: Identify inactive users (registered users who haven’t visited recently) and calculate their aggregated metrics like total lifetime value to prioritize re-engagement campaigns.

KLL Quantile Sketch Functions

KLL (K-Linear-Logarithmic) sketches provide approximate quantile estimation. They are useful for computing percentiles, medians, and other order statistics on large datasets without sorting.

See the Apache DataSketches KLL documentation for more information.

KLL functions are type-specific to avoid precision loss:

BIGINT variants: For integer types (TINYINT, SMALLINT, INT, BIGINT)
FLOAT variants: For FLOAT values only
DOUBLE variants: For FLOAT and DOUBLE values

kll_sketch_agg_*

Creates a KLL sketch from numeric values for quantile estimation.

Syntax:

kll_sketch_agg_bigint(expr [, k])
kll_sketch_agg_float(expr [, k])
kll_sketch_agg_double(expr [, k])

Argument	Type	Description
`expr`	Numeric (see variants above)	The numeric column to summarize
`k`	INT (optional)	Controls accuracy and size. Range: 8-65535. Default: 200 (~1.65% normalized rank error).

Returns a BINARY containing the KLL sketch in compact binary representation.

Examples:

-- Get median (0.5 quantile)
SELECT kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(col), 0.5)
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: 4

-- With custom k for higher accuracy
SELECT kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(col, 400), 0.5)
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: 4

Notes:

Use the appropriate variant to avoid precision loss: use _bigint for integers, _float for floats, _double for doubles.
NULL values are ignored during aggregation.

kll_merge_agg_*

Aggregates multiple KLL sketches of the same type by merging them together. This is useful for combining sketches created in separate aggregations (e.g., from different partitions or time windows). These are aggregate functions.

Syntax:

kll_merge_agg_bigint(sketch [, k])
kll_merge_agg_float(sketch [, k])
kll_merge_agg_double(sketch [, k])

Argument	Type	Description
`sketch`	BINARY	A KLL sketch in binary format (e.g., from `kll_sketch_agg_*`)
`k`	INT (optional)	Controls accuracy and size of the merged sketch. Range: 8-65535. If not specified, the merged sketch adopts the k value from the first input sketch.

Returns a BINARY containing the merged KLL sketch.

Examples:

-- Merge sketches from different partitions
SELECT kll_sketch_get_quantile_bigint(
  kll_merge_agg_bigint(sketch),
  0.5
)
FROM (
  SELECT kll_sketch_agg_bigint(col) as sketch
  FROM VALUES (1), (2), (3) tab(col)
  UNION ALL
  SELECT kll_sketch_agg_bigint(col) as sketch
  FROM VALUES (4), (5), (6) tab(col)
);
-- Result: 3

-- Get the total count from merged sketches
SELECT kll_sketch_get_n_bigint(kll_merge_agg_bigint(sketch))
FROM (
  SELECT kll_sketch_agg_bigint(col) as sketch
  FROM VALUES (1), (2), (3) tab(col)
  UNION ALL
  SELECT kll_sketch_agg_bigint(col) as sketch
  FROM VALUES (4), (5), (6) tab(col)
);
-- Result: 6

Notes:

When k is not specified, the merged sketch adopts the k value from the first input sketch.
The merge operation can handle input sketches with different k values.
NULL values are ignored during aggregation.
Use this function when you need to merge multiple sketches in an aggregation context. For merging exactly two sketches, use the scalar kll_sketch_merge_* functions instead.

kll_sketch_to_string_*

Returns a human-readable summary of the sketch.

Syntax:

kll_sketch_to_string_bigint(sketch)
kll_sketch_to_string_float(sketch)
kll_sketch_to_string_double(sketch)

Argument	Type	Description
`sketch`	BINARY	A KLL sketch of the corresponding type

Returns a STRING containing a human-readable summary including sketch parameters and statistics.

kll_sketch_get_n_*

Returns the number of items collected in the sketch.

Syntax:

kll_sketch_get_n_bigint(sketch)
kll_sketch_get_n_float(sketch)
kll_sketch_get_n_double(sketch)

Argument	Type	Description
`sketch`	BINARY	A KLL sketch of the corresponding type

Returns a BIGINT representing the count of items in the sketch.

Examples:

SELECT kll_sketch_get_n_bigint(kll_sketch_agg_bigint(col))
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: 7

kll_sketch_merge_*

Merges two KLL sketches of the same type. These are scalar functions.

Syntax:

kll_sketch_merge_bigint(left, right)
kll_sketch_merge_float(left, right)
kll_sketch_merge_double(left, right)

Argument	Type	Description
`left`	BINARY	First KLL sketch
`right`	BINARY	Second KLL sketch (must be same type as left)

Returns a BINARY containing the merged KLL sketch.

Examples:

-- Merge two sketches from different data partitions
SELECT kll_sketch_get_quantile_bigint(
  kll_sketch_merge_bigint(
    kll_sketch_agg_bigint(col1),
    kll_sketch_agg_bigint(col2)), 0.5)
FROM VALUES (1, 6), (2, 7), (3, 8), (4, 9), (5, 10) tab(col1, col2);
-- Result: approximately 5 (median of 1-10)

Errors:

Throws an error if sketches are of incompatible types or formats.

Notes:

The merge operation can handle input sketches with different k values.
Use this function when you need to merge exactly two sketches in an scalar context. For merging multiple sketches in an aggregation context, use the aggregate kll_merge_agg_* functions instead.

kll_sketch_get_quantile_*

Gets the approximate value at a given quantile rank.

Syntax:

kll_sketch_get_quantile_bigint(sketch, rank)
kll_sketch_get_quantile_float(sketch, rank)
kll_sketch_get_quantile_double(sketch, rank)

Argument	Type	Description
`sketch`	BINARY	A KLL sketch of the corresponding type
`rank`	DOUBLE or ARRAY<DOUBLE>	Quantile rank(s) between 0.0 and 1.0. Use 0.5 for median, 0.95 for 95th percentile, etc.

Returns the approximate value at the given quantile:

If rank is a scalar: Returns the corresponding type (BIGINT, FLOAT, or DOUBLE)
If rank is an array: Returns ARRAY of the corresponding type

Examples:

-- Get the median
SELECT kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(col), 0.5)
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: 4

-- Get multiple percentiles at once
SELECT kll_sketch_get_quantile_bigint(
  kll_sketch_agg_bigint(col),
  ARRAY(0.25, 0.5, 0.75, 0.95))
FROM VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10) tab(col);
-- Result: Array of values at 25th, 50th, 75th, and 95th percentiles

Errors:

Throws an error if rank values are outside [0.0, 1.0].
Returns NULL if the input sketch is NULL.

kll_sketch_get_rank_*

Gets the normalized rank (0.0 to 1.0) of a given value in the sketch’s distribution.

Syntax:

kll_sketch_get_rank_bigint(sketch, value)
kll_sketch_get_rank_float(sketch, value)
kll_sketch_get_rank_double(sketch, value)

Argument	Type	Description
`sketch`	BINARY	A KLL sketch of the corresponding type
`value`	Corresponding type (BIGINT, FLOAT, or DOUBLE)	The value to find the rank for

Returns a DOUBLE representing the normalized rank between 0.0 and 1.0.

Examples:

-- Find what percentile the value 3 is at
SELECT kll_sketch_get_rank_bigint(kll_sketch_agg_bigint(col), 3)
FROM VALUES (1), (2), (3), (4), (5), (6), (7) tab(col);
-- Result: approximately 0.43 (3 is around the 43rd percentile)

Approximate Top-K Functions

Top-K functions estimate the most frequent items (heavy hitters) in a dataset using the DataSketches Frequent Items sketch.

See the Apache DataSketches Frequency documentation for more information.

approx_top_k_accumulate

Creates a sketch that can be stored and later combined or estimated. Useful for pre-aggregating data.

Syntax:

approx_top_k_accumulate(expr [, maxItemsTracked])

Argument	Type	Description
`expr`	Same as `approx_top_k`	The column to accumulate
`maxItemsTracked`	INT (optional)	Maximum items tracked. Range: 1 to 1,000,000. Default: 10,000.

Returns a STRUCT containing a sketch state that can be passed to approx_top_k_combine or approx_top_k_estimate.

Examples:

-- Accumulate then estimate
SELECT approx_top_k_estimate(approx_top_k_accumulate(expr))
FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) tab(expr);
-- Result: [{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]

approx_top_k_combine

Combines multiple sketches into a single sketch.

Syntax:

approx_top_k_combine(state [, maxItemsTracked])

Argument	Type	Description
`state`	STRUCT	A sketch state from `approx_top_k_accumulate` or `approx_top_k_combine`
`maxItemsTracked`	INT (optional)	If specified, sets the combined sketch size. If not specified, all input sketches must have the same maxItemsTracked.

Returns a STRUCT containing the combined sketch state.

Examples:

-- Combine sketches from different partitions
SELECT approx_top_k_estimate(approx_top_k_combine(sketch, 10000), 5)
FROM (
  SELECT approx_top_k_accumulate(expr) AS sketch
  FROM VALUES (0), (0), (1), (1) tab(expr)
  UNION ALL
  SELECT approx_top_k_accumulate(expr) AS sketch
  FROM VALUES (2), (3), (4), (4) tab(expr)
);
-- Result: [{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]

Errors:

Throws an error if input sketches have different maxItemsTracked values and no explicit value is provided.
Throws an error if input sketches have different item data types.

approx_top_k_estimate

Extracts the top K items from a sketch.

Syntax:

approx_top_k_estimate(state [, k])

Argument	Type	Description
`state`	STRUCT	A sketch state from `approx_top_k_accumulate` or `approx_top_k_combine`
`k`	INT (optional)	Number of top items to return. Default: 5.

Returns an ARRAY<STRUCT<item, count>> containing the frequent items sorted by count descending.

Examples:

SELECT approx_top_k_estimate(approx_top_k_accumulate(expr), 2)
FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' tab(expr);
-- Result: [{"item":"c","count":4},{"item":"d","count":2}]

Best Practices

Choosing Between HLL, Theta, and Tuple Sketches

Use Case	Recommended Sketch
Simple count distinct	HLL (most memory efficient)
Set operations (union, intersection, difference)	Theta or Tuple
Distinct count with aggregated metrics (sum, min, max)	Tuple
Very high cardinality with moderate accuracy	HLL with higher lgConfigK
Need to compute overlaps between datasets	Theta or Tuple
Counting distinct users with total spend	Tuple (key=user_id, summary=spend)

Accuracy vs. Memory Trade-offs

Sketch Type	Parameter	Effect of Increasing
HLL	lgConfigK	Higher accuracy, more memory (2^lgConfigK bytes)
Theta	lgNomEntries	Higher accuracy, more memory (8 * 2^lgNomEntries bytes)
Tuple	lgNomEntries	Higher accuracy, more memory (16 * 2^lgNomEntries bytes for double, 12 * 2^lgNomEntries bytes for integer)
KLL	k	Higher accuracy, more memory
Top-K	maxItemsTracked	Better heavy-hitter detection, more memory

Storing and Reusing Sketches

Sketches can be stored in BINARY columns and later merged:

-- Create a table to store daily sketches
CREATE TABLE daily_user_sketches (
  date DATE,
  user_sketch BINARY
);

-- Insert daily sketches
INSERT INTO daily_user_sketches
SELECT current_date(), hll_sketch_agg(user_id)
FROM events;

-- Compute weekly unique users by merging daily sketches
SELECT hll_sketch_estimate(hll_union_agg(user_sketch))
FROM daily_user_sketches
WHERE date BETWEEN '2024-01-01' AND '2024-01-07';

Common Use Cases and Examples

Sketches are particularly valuable for periodic ETL jobs where you need to maintain running statistics across multiple batches of data. The general workflow is:

Aggregate input values into a sketch using an aggregate function
Store the sketch (as BINARY) in a table
Merge new sketches with previously stored sketches
Query the final sketch to get approximate answers

Example: Tracking Daily Unique Users with HLL Sketches

This example shows how to maintain a running count of unique users across daily batches.

-- Create a table to store daily HLL sketches
CREATE TABLE daily_user_sketches (
  event_date DATE,
  user_sketch BINARY
) USING PARQUET;

-- Day 1: Process first batch of events and store the sketch
INSERT INTO daily_user_sketches
SELECT 
  DATE'2024-01-01' as event_date,
  hll_sketch_agg(user_id) as user_sketch
FROM day1_events;

-- Day 2: Process second batch and store its sketch
INSERT INTO daily_user_sketches
SELECT 
  DATE'2024-01-02' as event_date,
  hll_sketch_agg(user_id) as user_sketch
FROM day2_events;

-- Query: Get unique users for a single day
SELECT 
  event_date,
  hll_sketch_estimate(user_sketch) as unique_users
FROM daily_user_sketches
WHERE event_date = DATE'2024-01-01';

-- Query: Get unique users across a date range (merging sketches)
SELECT hll_sketch_estimate(hll_union_agg(user_sketch)) as unique_users_in_week
FROM daily_user_sketches
WHERE event_date BETWEEN DATE'2024-01-01' AND DATE'2024-01-07';

Example: Computing Percentiles Over Time with KLL Sketches

This example shows how to track response time percentiles across hourly batches.

-- Create a table to store hourly KLL sketches for response times
CREATE TABLE hourly_latency_sketches (
  hour_ts TIMESTAMP,
  latency_sketch BINARY
) USING PARQUET;

-- Process each hour's data and store the sketch
INSERT INTO hourly_latency_sketches
SELECT 
  DATE_TRUNC('hour', event_time) as hour_ts,
  kll_sketch_agg_bigint(response_time_ms) as latency_sketch
FROM hourly_events
GROUP BY DATE_TRUNC('hour', event_time);

-- Query: Get p50, p95, p99 for a specific hour
SELECT 
  hour_ts,
  kll_sketch_get_quantile_bigint(latency_sketch, 0.5) as p50_ms,
  kll_sketch_get_quantile_bigint(latency_sketch, 0.95) as p95_ms,
  kll_sketch_get_quantile_bigint(latency_sketch, 0.99) as p99_ms
FROM hourly_latency_sketches
WHERE hour_ts = TIMESTAMP'2024-01-15 14:00:00';

-- Query: Get percentiles across a full day by merging hourly sketches
WITH daily_sketch AS (
  SELECT kll_merge_agg_bigint(latency_sketch) as merged_sketch
  FROM hourly_latency_sketches
  WHERE DATE(hour_ts) = DATE'2024-01-15'
)
SELECT
  kll_sketch_get_quantile_bigint(merged_sketch, 0.5) as p50_ms,
  kll_sketch_get_quantile_bigint(merged_sketch, 0.95) as p95_ms,
  kll_sketch_get_quantile_bigint(merged_sketch, 0.99) as p99_ms
FROM daily_sketch;

Example: Set Operations with Theta Sketches

Theta sketches support set operations, making them useful for analyzing overlapping populations.

-- Create sketches for users who performed different actions
CREATE TABLE action_sketches (
  action_type STRING,
  user_sketch BINARY
) USING PARQUET;

-- Store sketches for each action type
INSERT INTO action_sketches
SELECT 'purchase', theta_sketch_agg(user_id) FROM purchases;

INSERT INTO action_sketches
SELECT 'add_to_cart', theta_sketch_agg(user_id) FROM cart_additions;

INSERT INTO action_sketches
SELECT 'page_view', theta_sketch_agg(user_id) FROM page_views;

-- Query: How many users purchased?
SELECT theta_sketch_estimate(user_sketch) as purchasers
FROM action_sketches WHERE action_type = 'purchase';

-- Query: How many users added to cart but did NOT purchase?
SELECT theta_sketch_estimate(
  theta_difference(
    (SELECT user_sketch FROM action_sketches WHERE action_type = 'add_to_cart'),
    (SELECT user_sketch FROM action_sketches WHERE action_type = 'purchase')
  )
) as cart_abandoners;

-- Query: How many users both viewed pages AND purchased (intersection)?
SELECT theta_sketch_estimate(
  theta_intersection(
    (SELECT user_sketch FROM action_sketches WHERE action_type = 'page_view'),
    (SELECT user_sketch FROM action_sketches WHERE action_type = 'purchase')
  )
) as engaged_purchasers;

Track the most frequently occurring items across batches.

-- Create a table to store hourly top-k sketches
CREATE TABLE hourly_search_sketches (
  hour_ts TIMESTAMP,
  search_sketch STRUCT<sketch: BINARY, maxItemsTracked: INT, itemDataType: STRING, itemDataTypeDDL: STRING>
) USING PARQUET;

-- Process each hour's search queries
INSERT INTO hourly_search_sketches
SELECT 
  DATE_TRUNC('hour', search_time) as hour_ts,
  approx_top_k_accumulate(search_term, 10000) as search_sketch
FROM search_logs
GROUP BY DATE_TRUNC('hour', search_time);

-- Query: Get top 10 searches for a specific hour
SELECT approx_top_k_estimate(search_sketch, 10) as top_searches
FROM hourly_search_sketches
WHERE hour_ts = TIMESTAMP'2024-01-15 14:00:00';

-- Query: Get top 10 searches across the full day by combining sketches
SELECT approx_top_k_estimate(
  approx_top_k_combine(search_sketch, 10000),
  10
) as daily_top_searches
FROM hourly_search_sketches
WHERE DATE(hour_ts) = DATE'2024-01-15';

Example: Distinct Users with Aggregated Metrics Using Tuple Sketches

Tuple sketches enable tracking both distinct counts and aggregated metrics simultaneously. This is useful for scenarios like counting unique users while also summing their activity metrics.

-- Create a table to store daily tuple sketches tracking users and their spend
CREATE TABLE daily_user_spend_sketches (
  event_date DATE,
  user_spend_sketch BINARY
) USING PARQUET;

-- Day 1: Track distinct users with their total spend
INSERT INTO daily_user_spend_sketches
SELECT
  DATE'2024-01-01' as event_date,
  tuple_sketch_agg_double(user_id, purchase_amount) as user_spend_sketch
FROM purchases_day1;

-- Day 2: Track next day's users and spend
INSERT INTO daily_user_spend_sketches
SELECT
  DATE'2024-01-02' as event_date,
  tuple_sketch_agg_double(user_id, purchase_amount) as user_spend_sketch
FROM purchases_day2;

-- Query: Get distinct users and total spend for a single day
SELECT
  event_date,
  tuple_sketch_estimate_double(user_spend_sketch) as unique_users,
  tuple_sketch_summary_double(user_spend_sketch) as total_spend
FROM daily_user_spend_sketches
WHERE event_date = DATE'2024-01-01';

-- Query: Get unique users and total spend across a week (merging sketches)
SELECT
  tuple_sketch_estimate_double(tuple_union_agg_double(user_spend_sketch)) as weekly_unique_users,
  tuple_sketch_summary_double(tuple_union_agg_double(user_spend_sketch)) as weekly_total_spend
FROM daily_user_spend_sketches
WHERE event_date BETWEEN DATE'2024-01-01' AND DATE'2024-01-07';

-- Query: Find users who made purchases in week 1 but not week 2
WITH week1_sketch AS (
  SELECT tuple_union_agg_double(user_spend_sketch) as sketch
  FROM daily_user_spend_sketches
  WHERE event_date BETWEEN DATE'2024-01-01' AND DATE'2024-01-07'
),
week2_sketch AS (
  SELECT tuple_union_agg_double(user_spend_sketch) as sketch
  FROM daily_user_spend_sketches
  WHERE event_date BETWEEN DATE'2024-01-08' AND DATE'2024-01-14'
)
SELECT tuple_sketch_estimate_double(
  tuple_difference_double(
    (SELECT sketch FROM week1_sketch),
    (SELECT sketch FROM week2_sketch)
  )
) as churned_users;

-- Query: Find users who purchased in both weeks (intersection)
WITH week1_sketch AS (
  SELECT tuple_union_agg_double(user_spend_sketch) as sketch
  FROM daily_user_spend_sketches
  WHERE event_date BETWEEN DATE'2024-01-01' AND DATE'2024-01-07'
),
week2_sketch AS (
  SELECT tuple_union_agg_double(user_spend_sketch) as sketch
  FROM daily_user_spend_sketches
  WHERE event_date BETWEEN DATE'2024-01-08' AND DATE'2024-01-14'
)
SELECT
  tuple_sketch_estimate_double(
    tuple_intersection_double(
      (SELECT sketch FROM week1_sketch),
      (SELECT sketch FROM week2_sketch)
    )
  ) as retained_users;

-- Query: Combining registered users (with spend) and anonymous visitors using theta variants
-- This demonstrates merging Tuple sketches (with metrics) and Theta sketches (IDs only)

-- Assume we have both registered user purchases and anonymous visitor data
CREATE TABLE registered_user_purchases (
  purchase_date DATE,
  user_purchase_sketch BINARY  -- Tuple sketch: user_id with purchase_amount
) USING PARQUET;

CREATE TABLE anonymous_visitor_logs (
  visit_date DATE,
  visitor_sketch BINARY  -- Theta sketch: visitor_id only
) USING PARQUET;

-- Query: Total distinct users (registered + anonymous)
SELECT
  tuple_sketch_estimate_double(
    tuple_union_theta_double(
      (SELECT user_purchase_sketch FROM registered_user_purchases WHERE purchase_date = DATE'2024-01-01'),
      (SELECT visitor_sketch FROM anonymous_visitor_logs WHERE visit_date = DATE'2024-01-01')
    )
  ) as total_unique_users;

-- Query: Find registered users who also appeared as anonymous visitors
-- (e.g., logged-in users who also browsed anonymously)
SELECT
  tuple_sketch_estimate_double(
    tuple_intersection_theta_double(
      (SELECT user_purchase_sketch FROM registered_user_purchases WHERE purchase_date = DATE'2024-01-01'),
      (SELECT visitor_sketch FROM anonymous_visitor_logs WHERE visit_date = DATE'2024-01-01')
    )
  ) as cross_session_users;

-- Query: Registered users who never appeared as anonymous visitors
-- (users who only access the site while logged in)
SELECT
  tuple_sketch_estimate_double(
    tuple_difference_theta_double(
      (SELECT user_purchase_sketch FROM registered_user_purchases WHERE purchase_date = DATE'2024-01-01'),
      (SELECT visitor_sketch FROM anonymous_visitor_logs WHERE visit_date = DATE'2024-01-01')
    )
  ) as logged_in_only_users,
  tuple_sketch_summary_double(
    tuple_difference_theta_double(
      (SELECT user_purchase_sketch FROM registered_user_purchases WHERE purchase_date = DATE'2024-01-01'),
      (SELECT visitor_sketch FROM anonymous_visitor_logs WHERE visit_date = DATE'2024-01-01')
    )
  ) as logged_in_only_revenue;

Spark SQL Guide

Sketch-Based Approximate Functions

Table of Contents

HyperLogLog (HLL) Sketch Functions

hll_sketch_agg

hll_union_agg

hll_sketch_estimate

hll_union

Theta Sketch Functions

theta_sketch_agg

theta_union_agg

theta_intersection_agg

theta_sketch_estimate

theta_union

theta_intersection

theta_difference

Tuple Sketch Functions

tuple_sketch_agg_*

tuple_union_agg_*

tuple_intersection_agg_*

tuple_sketch_estimate_*

tuple_sketch_summary_*

tuple_sketch_theta_*

tuple_union_*

tuple_intersection_*

tuple_difference_*

tuple_union_theta_*

tuple_intersection_theta_*

tuple_difference_theta_*

KLL Quantile Sketch Functions

kll_sketch_agg_*

kll_merge_agg_*

kll_sketch_to_string_*

kll_sketch_get_n_*

kll_sketch_merge_*

kll_sketch_get_quantile_*

kll_sketch_get_rank_*

Approximate Top-K Functions

approx_top_k_accumulate

approx_top_k_combine

approx_top_k_estimate

Best Practices

Choosing Between HLL, Theta, and Tuple Sketches

Accuracy vs. Memory Trade-offs

Storing and Reusing Sketches

Common Use Cases and Examples

Example: Tracking Daily Unique Users with HLL Sketches

Example: Computing Percentiles Over Time with KLL Sketches

Example: Set Operations with Theta Sketches

Example: Finding Trending Items with Top-K Sketches

Example: Distinct Users with Aggregated Metrics Using Tuple Sketches