Sketch Functions¶
This page lists all sketch functions available in Spark SQL.
approx_top_k_estimate¶
approx_top_k_estimate(state, k) - Returns top k items with their frequency.
k An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5.
Examples:
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr)) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{"item":0,"count":2},{"item":4,"count":2},{"item":1,"count":2},{"item":2,"count":1},{"item":3,"count":1}]
> SELECT approx_top_k_estimate(approx_top_k_accumulate(expr), 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' tab(expr);
[{"item":"c","count":4},{"item":"d","count":2}]
Since: 4.1.0
hll_sketch_estimate¶
hll_sketch_estimate(expr) - Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.
Examples:
> SELECT hll_sketch_estimate(hll_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
3
Since: 3.5.0
hll_union¶
hll_union(first, second, allowDifferentLgConfigK) - Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Set allowDifferentLgConfigK to true to allow unions of sketches with different lgConfigK values (defaults to false).
Examples:
> SELECT hll_sketch_estimate(hll_union(hll_sketch_agg(col1), hll_sketch_agg(col2))) FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
6
Since: 3.5.0
kll_sketch_get_n_bigint¶
kll_sketch_get_n_bigint(expr) - Returns the number of items collected in the sketch.
Examples:
> SELECT kll_sketch_get_n_bigint(kll_sketch_agg_bigint(col)) FROM VALUES (1), (2), (3), (4), (5) tab(col);
5
Since: 4.1.0
kll_sketch_get_n_double¶
kll_sketch_get_n_double(expr) - Returns the number of items collected in the sketch.
Examples:
> SELECT kll_sketch_get_n_double(kll_sketch_agg_double(col)) FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
5
Since: 4.1.0
kll_sketch_get_n_float¶
kll_sketch_get_n_float(expr) - Returns the number of items collected in the sketch.
Examples:
> SELECT kll_sketch_get_n_float(kll_sketch_agg_float(col)) FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
5
Since: 4.1.0
kll_sketch_get_quantile_bigint¶
kll_sketch_get_quantile_bigint(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_quantile_bigint(kll_sketch_agg_bigint(col), 0.5) > 1 FROM VALUES (1), (2), (3), (4), (5) tab(col);
true
Since: 4.1.0
kll_sketch_get_quantile_double¶
kll_sketch_get_quantile_double(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_quantile_double(kll_sketch_agg_double(col), 0.5) > 1 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
true
Since: 4.1.0
kll_sketch_get_quantile_float¶
kll_sketch_get_quantile_float(left, right) - Extracts a single value from the quantiles sketch representing the desired quantile given the input rank. The desired quantile can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_quantile_float(kll_sketch_agg_float(col), 0.5) > 1 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
true
Since: 4.1.0
kll_sketch_get_rank_bigint¶
kll_sketch_get_rank_bigint(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_rank_bigint(kll_sketch_agg_bigint(col), 3) > 0.3 FROM VALUES (1), (2), (3), (4), (5) tab(col);
true
Since: 4.1.0
kll_sketch_get_rank_double¶
kll_sketch_get_rank_double(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_rank_double(kll_sketch_agg_double(col), 3.0) > 0.3 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
true
Since: 4.1.0
kll_sketch_get_rank_float¶
kll_sketch_get_rank_float(left, right) - Extracts a single value from the quantiles sketch representing the desired rank given the input quantile. The desired rank can either be a single value or an array. In the latter case, the function will return an array of results of equal length to the input array.
Examples:
> SELECT kll_sketch_get_rank_float(kll_sketch_agg_float(col), 3.0) > 0.3 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
true
Since: 4.1.0
kll_sketch_merge_bigint¶
kll_sketch_merge_bigint(left, right) - Merges two sketch buffers together into one.
Examples:
> SELECT LENGTH(kll_sketch_to_string_bigint(kll_sketch_merge_bigint(kll_sketch_agg_bigint(col), kll_sketch_agg_bigint(col)))) > 0 FROM VALUES (1), (2), (3), (4), (5) tab(col);
true
Since: 4.1.0
kll_sketch_merge_double¶
kll_sketch_merge_double(left, right) - Merges two sketch buffers together into one.
Examples:
> SELECT LENGTH(kll_sketch_to_string_double(kll_sketch_merge_double(kll_sketch_agg_double(col), kll_sketch_agg_double(col)))) > 0 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
true
Since: 4.1.0
kll_sketch_merge_float¶
kll_sketch_merge_float(left, right) - Merges two sketch buffers together into one.
Examples:
> SELECT LENGTH(kll_sketch_to_string_float(kll_sketch_merge_float(kll_sketch_agg_float(col), kll_sketch_agg_float(col)))) > 0 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
true
Since: 4.1.0
kll_sketch_to_string_bigint¶
kll_sketch_to_string_bigint(expr) - Returns human readable summary information about this sketch.
Examples:
> SELECT LENGTH(kll_sketch_to_string_bigint(kll_sketch_agg_bigint(col))) > 0 FROM VALUES (1), (2), (3), (4), (5) tab(col);
true
Since: 4.1.0
kll_sketch_to_string_double¶
kll_sketch_to_string_double(expr) - Returns human readable summary information about this sketch.
Examples:
> SELECT LENGTH(kll_sketch_to_string_double(kll_sketch_agg_double(col))) > 0 FROM VALUES (CAST(1.0 AS DOUBLE)), (CAST(2.0 AS DOUBLE)), (CAST(3.0 AS DOUBLE)), (CAST(4.0 AS DOUBLE)), (CAST(5.0 AS DOUBLE)) tab(col);
true
Since: 4.1.0
kll_sketch_to_string_float¶
kll_sketch_to_string_float(expr) - Returns human readable summary information about this sketch.
Examples:
> SELECT LENGTH(kll_sketch_to_string_float(kll_sketch_agg_float(col))) > 0 FROM VALUES (CAST(1.0 AS FLOAT)), (CAST(2.0 AS FLOAT)), (CAST(3.0 AS FLOAT)), (CAST(4.0 AS FLOAT)), (CAST(5.0 AS FLOAT)) tab(col);
true
Since: 4.1.0
theta_difference¶
theta_difference(first, second) - Subtracts two binary representations of Datasketches ThetaSketch objects from two input columns using a ThetaSketch AnotB object.
Examples:
> SELECT theta_sketch_estimate(theta_difference(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
2
Since: 4.1.0
theta_intersection¶
theta_intersection(first, second) - Intersects two binary representations of Datasketches ThetaSketch objects from two input columns using a ThetaSketch Intersect object.
Examples:
> SELECT theta_sketch_estimate(theta_intersection(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (5, 4), (1, 4), (2, 5), (2, 5), (3, 1) tab(col1, col2);
2
Since: 4.1.0
theta_sketch_estimate¶
theta_sketch_estimate(expr) - Returns the estimated number of unique values given the binary representation of a Datasketches ThetaSketch.
Examples:
> SELECT theta_sketch_estimate(theta_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
3
Since: 4.1.0
theta_union¶
theta_union(first, second, lgNomEntries) - Merges two binary representations of Datasketches ThetaSketch objects using a ThetaSketch Union object. Users can set lgNomEntries to a value between 4 and 26 to find the union of sketches with different union buffer size values (defaults to 12).
Examples:
> SELECT theta_sketch_estimate(theta_union(theta_sketch_agg(col1), theta_sketch_agg(col2))) FROM VALUES (1, 4), (1, 4), (2, 5), (2, 5), (3, 6) tab(col1, col2);
6
Since: 4.1.0
tuple_difference_double¶
tuple_difference_double(tupleSketch1, tupleSketch2) - Subtracts two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch AnotB object. Returns elements in the first sketch that are not in the second sketch.
Examples:
> SELECT tuple_sketch_estimate_double(tuple_difference_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (5, 5.0D, 4, 4.0D), (1, 1.0D, 4, 4.0D), (2, 2.0D, 5, 5.0D), (3, 3.0D, 1, 1.0D) tab(col1, val1, col2, val2);
2.0
Since: 4.2.0
tuple_difference_integer¶
tuple_difference_integer(tupleSketch1, tupleSketch2) - Subtracts two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch AnotB object. Returns elements in the first sketch that are not in the second sketch.
Examples:
> SELECT tuple_sketch_estimate_integer(tuple_difference_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (5, 5, 4, 4), (1, 1, 4, 4), (2, 2, 5, 5), (3, 3, 1, 1) tab(col1, val1, col2, val2);
2.0
Since: 4.2.0
tuple_intersection_double¶
tuple_intersection_double(tupleSketch1, tupleSketch2, mode) - Intersects two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch Intersection object. Users can set mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_estimate_double(tuple_intersection_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (1, 1.0D, 1, 4.0D), (2, 2.0D, 2, 5.0D), (3, 3.0D, 4, 6.0D) tab(col1, val1, col2, val2);
2.0
Since: 4.2.0
tuple_intersection_integer¶
tuple_intersection_integer(tupleSketch1, tupleSketch2, mode) - Intersects two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch Intersection object. Users can set mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_estimate_integer(tuple_intersection_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (1, 1, 1, 4), (2, 2, 2, 5), (3, 3, 4, 6) tab(col1, val1, col2, val2);
2.0
Since: 4.2.0
tuple_sketch_estimate_double¶
tuple_sketch_estimate_double(child) - Returns the estimated number of unique values given the binary representation of a Datasketches TupleSketch. The sketch's summary type must be a double.
Examples:
> SELECT tuple_sketch_estimate_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (1, 2.0D), (2, 3.0D) tab(key, summary);
2.0
Since: 4.2.0
tuple_sketch_estimate_integer¶
tuple_sketch_estimate_integer(child) - Returns the estimated number of unique values given the binary representation of a Datasketches TupleSketch. The sketch's summary type must be an integer.
Examples:
> SELECT tuple_sketch_estimate_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (1, 2), (2, 3) tab(key, summary);
2.0
Since: 4.2.0
tuple_sketch_summary_double¶
tuple_sketch_summary_double(child, mode) - Aggregates the summary values from a double summary type Datasketches TupleSketch. The mode can be 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_summary_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (1, 2.0D), (2, 3.0D) tab(key, summary);
6.0
Since: 4.2.0
tuple_sketch_summary_integer¶
tuple_sketch_summary_integer(child, mode) - Aggregates the summary values from a integer summary type Datasketches TupleSketch. The mode can be 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_summary_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (1, 2), (2, 3) tab(key, summary);
6
Since: 4.2.0
tuple_sketch_theta_double¶
tuple_sketch_theta_double(child) - Returns the theta value (sampling rate) from a Datasketches TupleSketch. The theta value represents the effective sampling rate of the sketch, between 0.0 and 1.0. The sketch's summary type must be a double.
Examples:
> SELECT tuple_sketch_theta_double(tuple_sketch_agg_double(key, summary)) FROM VALUES (1, 1.0D), (2, 2.0D), (3, 3.0D) tab(key, summary);
1.0
Since: 4.2.0
tuple_sketch_theta_integer¶
tuple_sketch_theta_integer(child) - Returns the theta value (sampling rate) from a Datasketches TupleSketch. The theta value represents the effective sampling rate of the sketch, between 0.0 and 1.0. The sketch's summary type must be an integer.
Examples:
> SELECT tuple_sketch_theta_integer(tuple_sketch_agg_integer(key, summary)) FROM VALUES (1, 1), (2, 2), (3, 3) tab(key, summary);
1.0
Since: 4.2.0
tuple_union_double¶
tuple_union_double(tupleSketch1, tupleSketch2, lgNomEntries, mode) - Merges two binary representations of Datasketches TupleSketch objects with double summary data type using a TupleSketch Union object. Users can set lgNomEntries to a value between 4 and 26 (defaults to 12) and mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_estimate_double(tuple_union_double(tuple_sketch_agg_double(col1, val1), tuple_sketch_agg_double(col2, val2))) FROM VALUES (1, 1.0D, 4, 4.0D), (2, 2.0D, 5, 5.0D), (3, 3.0D, 6, 6.0D) tab(col1, val1, col2, val2);
6.0
Since: 4.2.0
tuple_union_integer¶
tuple_union_integer(tupleSketch1, tupleSketch2, lgNomEntries, mode) - Merges two binary representations of Datasketches TupleSketch objects with integer summary data type using a TupleSketch Union object. Users can set lgNomEntries to a value between 4 and 26 (defaults to 12) and mode to 'sum', 'min', 'max', or 'alwaysone' (defaults to 'sum').
Examples:
> SELECT tuple_sketch_estimate_integer(tuple_union_integer(tuple_sketch_agg_integer(col1, val1), tuple_sketch_agg_integer(col2, val2))) FROM VALUES (1, 1, 4, 4), (2, 2, 5, 5), (3, 3, 6, 6) tab(col1, val1, col2, val2);
6.0
Since: 4.2.0