pyspark.sql.functions.collect_list
- pyspark.sql.functions.collect_list(col)
Aggregate function: Collects the values from a column into a list, maintaining duplicates, and returns this list of objects.
New in version 1.6.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - col : Column or column name
    The target column on which the function is computed.
- Returns
  - Column
    A new Column object representing a list of collected values, with duplicate values included.
Notes
The function is non-deterministic because the order of the collected results depends on the order of the rows, which may be non-deterministic after a shuffle operation.
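The ordering caveat can be illustrated with a small plain-Python model. This is only a sketch and not Spark's actual implementation: the helper `model_collect_list`, the partition layout, and the idea that partial results are simply concatenated in arrival order are all assumptions made for illustration. It shows that the same rows can yield differently ordered lists, and that sorting the collected list (as the examples on this page do with `sort_array`) gives one stable result.

```python
from itertools import permutations


def model_collect_list(partitions, arrival_order):
    """Hypothetical model: concatenate per-partition buffers in arrival order."""
    collected = []
    for index in arrival_order:
        collected.extend(partitions[index])
    return collected


# Assumed layout of the values 1, 2, 2, 3 across three partitions after a shuffle.
partitions = [[2, 1], [3], [2]]

# Every possible order in which the partial results could arrive.
results = {
    tuple(model_collect_list(partitions, order))
    for order in permutations(range(len(partitions)))
}

# Sorting each collected list removes the dependence on arrival order.
sorted_results = {tuple(sorted(result)) for result in results}

print(len(results))    # number of distinct orderings observed
print(sorted_results)  # a single canonical ordering remains
```

In this model the duplicates are always kept and only the order varies, which is why the examples on this page sort the collected list before displaying it.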
Examples
Example 1: Collect values from a DataFrame and sort the result in ascending order
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
>>> df.select(sf.sort_array(sf.collect_list('value')).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [1, 2, 2]|
+-----------+
Example 2: Collect values from a DataFrame and sort the result in descending order
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df.select(sf.sort_array(sf.collect_list('age'), asc=False).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [5, 5, 2]|
+-----------+
Example 3: Collect values from a DataFrame with multiple columns and sort the result
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list'))
>>> df.orderBy(sf.desc("name")).show()
+----+-----------+
|name|sorted_list|
+----+-----------+
|John|     [1, 2]|
| Ana|        [3]|
+----+-----------+