Sketch Functions

This page lists all sketch functions available in Spark SQL.


approx_top_k_estimate

approx_top_k_estimate(state, k) - Returns top k items with their frequency. k An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5.

Examples:

> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr)) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
 [{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]

> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr), 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' tab(expr);
 [{"item":"c","count":4},{"item":"d","count":2}]

Since: 4.1.0


hll_sketch_estimate

hll_sketch_estimate(expr) - Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.

Examples:

> SELECT hll_sketch_estimate(hll_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
 3

Since: 3.5.0


hll_union

hll_union(first, second, allowDifferentLgConfigK) - Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Set allowDifferentLgConfigK to true to allow unions of sketches with different lgConfigK values (defaults to false).

Examples:

> SELECT hll_sketch_estimate(hll_union(hll_sketch_agg(col1), hll_sketch_agg(col2))) FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
 6

Since: 3.5.0


kll_sketch_get_n_bigint

kll_sketch_get_n_bigint(expr) - Returns the number of items collected in the sketch.

Examples:

> SELECT kll_sketch_get_n_bigint(kll_sketch_agg_bigint(col)) FROM VALUES (1), (2), (3), (4), (5) tab(col);
 5

Since: 4.1.0


kll_sketch_get_n_double

kll_sketch_get_n_double(expr) - Returns the number of items collected in the sketch.

Examples:

> SELECT kll_sketch_get_n_double(kll_sketch_agg_double(col)) FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
 5

Since: 4.1.0


kll_sketch_get_n_float

kll_sketch_get_n_float(expr) - Returns the number of items collected in the sketch.

Examples:

> SELECT kll_sketch_get_n_float(kll_sketch_agg_float(col)) FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
 5

Since: 4.1.0


kll_sketch_get_quantile_bigint

kll_sketch_get_quantile_bigint(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(col), 0.5) > 1 FROM VALUES (1), (2), (3), (4), (5) tab(col);
 true

Since: 4.1.0


kll_sketch_get_quantile_double

kll_sketch_get_quantile_double(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_quantile_double(kll_sketch_agg_double(col), 0.5) > 1 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
 true

Since: 4.1.0


kll_sketch_get_quantile_float

kll_sketch_get_quantile_float(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_quantile_float(kll_sketch_agg_float(col), 0.5) > 1 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
 true

Since: 4.1.0


kll_sketch_get_rank_bigint

kll_sketch_get_rank_bigint(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_rank_bigint(kll_sketch_agg_bigint(col), 3) > 0.3 FROM VALUES (1), (2), (3), (4), (5) tab(col);
 true

Since: 4.1.0


kll_sketch_get_rank_double

kll_sketch_get_rank_double(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_rank_double(kll_sketch_agg_double(col), 3.0) > 0.3 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
 true

Since: 4.1.0


kll_sketch_get_rank_float

kll_sketch_get_rank_float(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.

Examples:

> SELECT kll_sketch_get_rank_float(kll_sketch_agg_float(col), 3.0) > 0.3 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
 true

Since: 4.1.0


kll_sketch_merge_bigint

kll_sketch_merge_bigint(left, right) - Merges two sketch buffers together into one.

Examples:

> SELECT LENGTH(kll_sketch_to_string_bigint(kll_sketch_merge_bigint(kll_sketch_agg_bigint(col), kll_sketch_agg_bigint(col)))) > 0 FROM VALUES (1), (2), (3), (4), (5) tab(col);
 true

Since: 4.1.0


kll_sketch_merge_double

kll_sketch_merge_double(left, right) - Merges two sketch buffers together into one.

Examples:

> SELECT LENGTH(kll_sketch_to_string_double(kll_sketch_merge_double(kll_sketch_agg_double(col), kll_sketch_agg_double(col)))) > 0 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
 true

Since: 4.1.0


kll_sketch_merge_float

kll_sketch_merge_float(left, right) - Merges two sketch buffers together into one.

Examples:

> SELECT LENGTH(kll_sketch_to_string_float(kll_sketch_merge_float(kll_sketch_agg_float(col), kll_sketch_agg_float(col)))) > 0 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
 true

Since: 4.1.0


kll_sketch_to_string_bigint

kll_sketch_to_string_bigint(expr) - Returns human readable summary information about this sketch.

Examples:

> SELECT LENGTH(kll_sketch_to_string_bigint(kll_sketch_agg_bigint(col))) > 0 FROM VALUES (1), (2), (3), (4), (5) tab(col);
 true

Since: 4.1.0


kll_sketch_to_string_double

kll_sketch_to_string_double(expr) - Returns human readable summary information about this sketch.

Examples:

> SELECT LENGTH(kll_sketch_to_string_double(kll_sketch_agg_double(col))) > 0 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
 true

Since: 4.1.0


kll_sketch_to_string_float

kll_sketch_to_string_float(expr) - Returns human readable summary information about this sketch.

Examples:

> SELECT LENGTH(kll_sketch_to_string_float(kll_sketch_agg_float(col))) > 0 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
 true

Since: 4.1.0


theta_difference

theta_difference(first, second) - Subtracts two binary representations of Datasketches ThetaSketch objects from two input columns using a ThetaSketch AnotB object.

Examples:

> SELECT theta_sketch_estimate(theta_difference(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
 2

Since: 4.1.0


theta_intersection

theta_intersection(first, second) - Intersects two binary representations of Datasketches ThetaSketch objects from two input columns using a ThetaSketch Intersect object.

Examples:

> SELECT theta_sketch_estimate(theta_intersection(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
 2

Since: 4.1.0


theta_sketch_estimate

theta_sketch_estimate(expr) - Returns the estimated number of unique values given the binary representation of a Datasketches ThetaSketch.

Examples:

> SELECT theta_sketch_estimate(theta_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
 3

Since: 4.1.0


theta_union

theta_union(first, second, lgNomEntries) - Merges two binary representations of Datasketches ThetaSketch objects using a ThetaSketch Union object. Users can set lgNomEntries to a value between 4 and 26 to find the union of sketches with different union buffer size values (defaults to 12).

Examples:

> SELECT theta_sketch_estimate(theta_union(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
 6

Since: 4.1.0


tuple_difference_double

tuple_difference_double(tupleSketch1, tupleSketch2) - Subtracts two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch AnotB object. Returns elements in the first sketch that are not in the second sketch.

Examples:

> SELECT tuple_sketch_estimate_double(tuple_difference_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (5, 5.0D, 4, 4.0D), (1, 1.0D, 4, 4.0D), (2, 2.0D, 5, 5.0D), (3, 3.0D, 1, 1.0D) tab(col1, val1, col2, val2);
 2.0

Since: 4.2.0


tuple_difference_integer

tuple_difference_integer(tupleSketch1, tupleSketch2) - Subtracts two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch AnotB object. Returns elements in the first sketch that are not in the second sketch.

Examples:

> SELECT tuple_sketch_estimate_integer(tuple_difference_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (5, 5, 4, 4), (1, 1, 4, 4), (2, 2, 5, 5), (3, 3, 1, 1) tab(col1, val1, col2, val2);
 2.0

Since: 4.2.0


tuple_intersection_double

tuple_intersection_double(tupleSketch1, tupleSketch2, mode) - Intersects two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch Intersection object. Users can set mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_estimate_double(tuple_intersection_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (1, 1.0D, 1, 4.0D), (2, 2.0D, 2, 5.0D), (3, 3.0D, 4, 6.0D) tab(col1, val1, col2, val2);
 2.0

Since: 4.2.0


tuple_intersection_integer

tuple_intersection_integer(tupleSketch1, tupleSketch2, mode) - Intersects two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch Intersection object. Users can set mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_estimate_integer(tuple_intersection_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (1, 1, 1, 4), (2, 2, 2, 5), (3, 3, 4, 6) tab(col1, val1, col2, val2);
 2.0

Since: 4.2.0


tuple_sketch_estimate_double

tuple_sketch_estimate_double(child) - Returns the estimated number of unique values given the binary representation of a Datasketches TupleSketch. The sketch's summary type must be a double.

Examples:

> SELECT tuple_sketch_estimate_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (1, 2.0D), (2, 3.0D) tab(key, summary);
 2.0

Since: 4.2.0


tuple_sketch_estimate_integer

tuple_sketch_estimate_integer(child) - Returns the estimated number of unique values given the binary representation of a Datasketches TupleSketch. The sketch's summary type must be an integer.

Examples:

> SELECT tuple_sketch_estimate_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (1, 2), (2, 3) tab(key, summary);
 2.0

Since: 4.2.0


tuple_sketch_summary_double

tuple_sketch_summary_double(child, mode) - Aggregates the summary values from a double summary type Datasketches TupleSketch. The mode can be 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_summary_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (1, 2.0D), (2, 3.0D) tab(key, summary);
 6.0

Since: 4.2.0


tuple_sketch_summary_integer

tuple_sketch_summary_integer(child, mode) - Aggregates the summary values from a integer summary type Datasketches TupleSketch. The mode can be 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_summary_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (1, 2), (2, 3) tab(key, summary);
 6

Since: 4.2.0


tuple_sketch_theta_double

tuple_sketch_theta_double(child) - Returns the theta value (sampling rate) from a Datasketches TupleSketch. The theta value represents the effective sampling rate of the sketch, between 0.0 and 1.0. The sketch's summary type must be a double.

Examples:

> SELECT tuple_sketch_theta_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (2, 2.0D), (3, 3.0D) tab(key, summary);
 1.0

Since: 4.2.0


tuple_sketch_theta_integer

tuple_sketch_theta_integer(child) - Returns the theta value (sampling rate) from a Datasketches TupleSketch. The theta value represents the effective sampling rate of the sketch, between 0.0 and 1.0. The sketch's summary type must be an integer.

Examples:

> SELECT tuple_sketch_theta_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (2, 2), (3, 3) tab(key, summary);
 1.0

Since: 4.2.0


tuple_union_double

tuple_union_double(tupleSketch1, tupleSketch2, lgNomEntries, mode) - Merges two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch Union object. Users can set lgNomEntries to a value between 4 and 26 (defaults to 12) and mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_estimate_double(tuple_union_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (1, 1.0D, 4, 4.0D), (2, 2.0D, 5, 5.0D), (3, 3.0D, 6, 6.0D) tab(col1, val1, col2, val2);
 6.0

Since: 4.2.0


tuple_union_integer

tuple_union_integer(tupleSketch1, tupleSketch2, lgNomEntries, mode) - Merges two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch Union object. Users can set lgNomEntries to a value between 4 and 26 (defaults to 12) and mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').

Examples:

> SELECT tuple_sketch_estimate_integer(tuple_union_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (1, 1, 4, 4), (2, 2, 5, 5), (3, 3, 6, 6) tab(col1, val1, col2, val2);
 6.0

Since: 4.2.0