pyspark.sql.DataFrame.collect
- DataFrame.collect()
Returns all the records in the DataFrame as a list of Row.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Returns
- list
A list of Row objects, each representing a row in the DataFrame.
See also
DataFrame.take
Returns the first n rows.
DataFrame.head
Returns the first n rows.
DataFrame.toPandas
Returns the data as a pandas DataFrame.
DataFrame.toArrow
Returns the data as a PyArrow Table.
Notes
This method should only be used if the resulting list is expected to be small, as all the data is loaded into the driver’s memory.
Examples
Example: Collecting all rows of a DataFrame
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.collect()
[Row(age=14, name='Tom'), Row(age=23, name='Alice'), Row(age=16, name='Bob')]
Example: Collecting all rows after filtering
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.filter(df.age > 15).collect()
[Row(age=23, name='Alice'), Row(age=16, name='Bob')]
Example: Collecting all rows after selecting specific columns
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select("name").collect()
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]
Example: Collecting all rows after applying a function to a column
>>> from pyspark.sql.functions import upper
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select(upper(df.name)).collect()
[Row(upper(name)='TOM'), Row(upper(name)='ALICE'), Row(upper(name)='BOB')]
Example: Collecting all rows from a DataFrame and converting a specific column to a list
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row["name"] for row in rows]
['Tom', 'Alice', 'Bob']
Example: Collecting all rows from a DataFrame and converting to a list of dictionaries
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row.asDict() for row in rows]
[{'age': 14, 'name': 'Tom'}, {'age': 23, 'name': 'Alice'}, {'age': 16, 'name': 'Bob'}]