Window Functions

This page lists all window functions available in Spark SQL.


cume_dist

cume_dist() - Computes the position of a value relative to all values in the partition.

Examples:

> SELECT a, b, cume_dist() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   0.6666666666666666
 A1 1   0.6666666666666666
 A1 2   1.0
 A2 3   1.0

Since: 2.0.0


dense_rank

dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.

Arguments:

  • children - this is to base the rank on; a change in the value of one the children will trigger a change in rank. This is an internal parameter and will be assigned by the Analyser.

Examples:

> SELECT a, b, dense_rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   2
 A2 3   1

Since: 2.0.0


lag

lag(input[, offset[, default]]) - Returns the value of input at the offsetth row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned.

Arguments:

  • input - a string expression to evaluate offset rows before the current row.
  • offset - an int expression which is rows to jump back in the partition.
  • default - a string expression which is to use when the offset row does not exist.

Examples:

> SELECT a, b, lag(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   NULL
 A1 1   1
 A1 2   1
 A2 3   NULL

Since: 2.0.0


lead

lead(input[, offset[, default]]) - Returns the value of input at the offsetth row after the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of the window does not have any subsequent row), default is returned.

Arguments:

  • input - a string expression to evaluate offset rows after the current row.
  • offset - an int expression which is rows to jump ahead in the partition.
  • default - a string expression which is to use when the offset is larger than the window. The default value is null.

Examples:

> SELECT a, b, lead(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   2
 A1 2   NULL
 A2 3   NULL

Since: 2.0.0


nth_value

nth_value(input[, offset]) - Returns the value of input at the row that is the offsetth row from beginning of the window frame. Offset starts at 1. If ignoreNulls=true, we will skip nulls when finding the offsetth row. Otherwise, every row counts for the offset. If there is no such an offsetth row (e.g., when the offset is 10, size of the window frame is less than 10), null is returned.

Arguments:

  • input - the target column or expression that the function operates on.
  • offset - a positive int literal to indicate the offset in the window frame. It starts with 1.
  • ignoreNulls - an optional specification that indicates the NthValue should skip null values in the determination of which row to use.

Examples:

> SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   1
 A2 3   NULL

Since: 3.1.0


ntile

ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.

Arguments:

  • buckets - an int expression which is number of buckets to divide the rows in. Default value is 1.

Examples:

> SELECT a, b, ntile(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   2
 A2 3   1

Since: 2.0.0


percent_rank

percent_rank() - Computes the percentage ranking of a value in a group of values.

Arguments:

  • children - this is to base the rank on; a change in the value of one the children will trigger a change in rank. This is an internal parameter and will be assigned by the Analyser.

Examples:

> SELECT a, b, percent_rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   0.0
 A1 1   0.0
 A1 2   1.0
 A2 3   0.0

Since: 2.0.0


rank

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.

Arguments:

  • children - this is to base the rank on; a change in the value of one the children will trigger a change in rank. This is an internal parameter and will be assigned by the Analyser.

Examples:

> SELECT a, b, rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   3
 A2 3   1

Since: 2.0.0


row_number

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.

Examples:

> SELECT a, b, row_number() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   2
 A1 2   3
 A2 3   1

Since: 2.0.0