pyspark.sql.DataFrame.collect

DataFrame.collect()

Returns all the records in the DataFrame as a list of Row.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns

list
    A list of Row objects, each representing a row in the DataFrame.
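The returned Row objects support access by field name, by attribute, and by position. A minimal sketch, assuming a running SparkSession bound to spark as in the examples below:

>>> df = spark.createDataFrame([(14, "Tom")], ["age", "name"])
>>> row = df.collect()[0]  # a single Row
>>> (row["name"], row.name, row[1])  # name lookup, attribute lookup, positional index
('Tom', 'Tom', 'Tom')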
 
See also

DataFrame.take
    Returns the first n rows.
DataFrame.head
    Returns the first n rows.
DataFrame.toPandas
    Returns the data as a pandas DataFrame.
DataFrame.toArrow
    Returns the data as a PyArrow Table.
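To make the trade-off among these alternatives concrete, here is a short sketch (assuming the same spark session used in the examples below) contrasting collect(), which pulls every row to the driver, with take() and head(), which fetch only a bounded prefix:

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.take(1)  # only one row is brought to the driver
[Row(age=14, name='Tom')]
>>> df.head(2)  # with an argument, head(n) also returns a list of Rows
[Row(age=14, name='Tom'), Row(age=23, name='Alice')]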
Notes

This method should only be used if the resulting list is expected to be small, as all the data is loaded into the driver’s memory. For results that may be too large to hold on the driver, see the sketch after the examples.

Examples

Example: Collecting all rows of a DataFrame

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.collect()
[Row(age=14, name='Tom'), Row(age=23, name='Alice'), Row(age=16, name='Bob')]

Example: Collecting all rows after filtering

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.filter(df.age > 15).collect()
[Row(age=23, name='Alice'), Row(age=16, name='Bob')]

Example: Collecting all rows after selecting specific columns

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select("name").collect()
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]

Example: Collecting all rows after applying a function to a column

>>> from pyspark.sql.functions import upper
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select(upper(df.name)).collect()
[Row(upper(name)='TOM'), Row(upper(name)='ALICE'), Row(upper(name)='BOB')]

Example: Collecting all rows from a DataFrame and converting a specific column to a list

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row["name"] for row in rows]
['Tom', 'Alice', 'Bob']

Example: Collecting all rows from a DataFrame and converting to a list of dictionaries

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row.asDict() for row in rows]
[{'age': 14, 'name': 'Tom'}, {'age': 23, 'name': 'Alice'}, {'age': 16, 'name': 'Bob'}]
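Because of the driver-memory caveat in the Notes, a common pattern for results that may be large is to stream rows with DataFrame.toLocalIterator(), which keeps at most one partition in driver memory at a time, or to bound the result with limit() before collecting. A sketch under the same spark session assumption as above:

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> [row["name"] for row in df.toLocalIterator()]  # rows arrive partition by partition
['Tom', 'Alice', 'Bob']
>>> df.limit(2).collect()  # cap how many rows reach the driver
[Row(age=14, name='Tom'), Row(age=23, name='Alice')]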