pyspark.sql.functions.array_distinct
- pyspark.sql.functions.array_distinct(col)
Array function: removes duplicate values from the array.
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - col : Column or str
    name of column or expression
- Returns
Column
A new column that is an array of unique values from the input column.
Examples
Example 1: Removing duplicate values from a simple array
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3, 2],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+
Example 2: Removing duplicate values from multiple arrays
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
|              [4, 5]|
+--------------------+
Example 3: Removing duplicate values from an array with all identical values
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 1, 1],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                 [1]|
+--------------------+
Example 4: Removing duplicate values from an array with no duplicate values
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+
Example 5: Removing duplicate values from an empty array
>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField
>>> schema = StructType([
...     StructField("data", ArrayType(IntegerType()), True)
... ])
>>> df = spark.createDataFrame([([],)], schema)
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                  []|
+--------------------+
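The per-row behaviour seen in the examples can be approximated in plain Python, which may help when reasoning about expected results without a Spark session. This is only a sketch: `array_distinct_model` is a hypothetical helper, not part of PySpark, and it assumes that the first occurrence of each value is kept in its original position, as the example outputs suggest. It does not model Spark's handling of nulls or of non-hashable element types.

```python
from typing import Hashable, List, Sequence


def array_distinct_model(values: Sequence[Hashable]) -> List[Hashable]:
    """Hypothetical pure-Python model of array_distinct for a single row.

    Assumption (not stated in the reference text): the first occurrence
    of each value is kept, in its original order.
    """
    seen = set()   # values already emitted
    result = []    # de-duplicated output, in first-seen order
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


if __name__ == "__main__":
    # Same inputs as the examples above, one list per row.
    rows = [[1, 2, 3, 2], [4, 5, 5, 4], [1, 1, 1], [1, 2, 3], []]
    for row in rows:
        print(row, "->", array_distinct_model(row))
```

The model is restricted to hashable elements such as integers or strings so that membership checks stay constant-time; nested arrays would need a different comparison strategy.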