pyspark.sql.DataFrame

class pyspark.sql.DataFrame(jdf, sql_ctx)

A distributed collection of data grouped into named columns.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Notes

A DataFrame should only be created through the SparkSession APIs shown in the examples below; it should not be created directly via the constructor.

Examples

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

>>> people = spark.createDataFrame([
...     {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
...     {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
...     {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
...     {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
... ])

Once created, it can be manipulated using the various domain-specific language (DSL) functions defined in DataFrame and Column.

To select a column from the DataFrame, use attribute access (or item access with square brackets):

>>> age_col = people.age

A more concrete example:

>>> # To create a DataFrame using SparkSession
... department = spark.createDataFrame([
...     {"id": 1, "name": "PySpark"},
...     {"id": 2, "name": "ML"},
...     {"id": 3, "name": "Spark SQL"}
... ])
>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg(
...         {"salary": "avg", "age": "max"}).sort("max(age)").show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|PySpark|     M|       75.0|      50|
|     ML|     F|      150.0|      60|
+-------+------+-----------+--------+

Methods

agg(*exprs)

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
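
For example, a minimal sketch reusing the people DataFrame created above; the aggregate expressions come from pyspark.sql.functions, and the output follows from the sample data:

>>> from pyspark.sql import functions as F
>>> people.agg(F.max("age"), F.avg("salary")).show()
+--------+-----------+
|max(age)|avg(salary)|
+--------+-----------+
|      60|      125.0|
+--------+-----------+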

alias(alias)

Returns a new DataFrame with an alias set.

approxQuantile(col, probabilities, relativeError)

Calculates the approximate quantiles of numerical columns of a DataFrame.
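
As a quick illustration with the people DataFrame from the example above, the call returns a plain Python list with one approximate quantile per requested probability:

>>> quartiles = people.approxQuantile("salary", [0.25, 0.5, 0.75], 0.05)
>>> len(quartiles)
3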

cache()

Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).

checkpoint([eager])

Returns a checkpointed version of this DataFrame.

coalesce(numPartitions)

Returns a new DataFrame that has exactly numPartitions partitions.

colRegex(colName)

Selects column based on the column name specified as a regex and returns it as Column.

collect()

Returns all the records in the DataFrame as a list of Row.

corr(col1, col2[, method])

Calculates the correlation of two columns of a DataFrame as a double value.

count()

Returns the number of rows in this DataFrame.

cov(col1, col2)

Calculate the sample covariance for the given columns, specified by their names, as a double value.

createGlobalTempView(name)

Creates a global temporary view with this DataFrame.

createOrReplaceGlobalTempView(name)

Creates or replaces a global temporary view using the given name.

createOrReplaceTempView(name)

Creates or replaces a local temporary view with this DataFrame.
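
A typical use is registering the DataFrame so it can be queried with SQL; a small sketch reusing the people DataFrame defined above:

>>> people.createOrReplaceTempView("people")
>>> spark.sql("SELECT name FROM people WHERE age > 30").count()
3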

createTempView(name)

Creates a local temporary view with this DataFrame.

crossJoin(other)

Returns the cartesian product with another DataFrame.

crosstab(col1, col2)

Computes a pair-wise frequency table of the given columns.

cube(*cols)

Create a multi-dimensional cube for the current DataFrame using the specified columns, allowing aggregations to be performed on them.

describe(*cols)

Computes basic statistics for numeric and string columns.

distinct()

Returns a new DataFrame containing the distinct rows in this DataFrame.

drop(*cols)

Returns a new DataFrame without specified columns.

dropDuplicates([subset])

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
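
For example, keeping only one row per (deptId, gender) combination of the people DataFrame above:

>>> people.dropDuplicates(["deptId", "gender"]).count()
3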

dropDuplicatesWithinWatermark([subset])

Return a new DataFrame with duplicate rows removed within the event-time watermark, optionally only considering certain columns.

drop_duplicates([subset])

drop_duplicates() is an alias for dropDuplicates().

dropna([how, thresh, subset])

Returns a new DataFrame omitting rows with null or NaN values.

exceptAll(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.

exists()

Return a Column object for an EXISTS Subquery.

explain([extended, mode])

Prints the (logical and physical) plans to the console for debugging purposes.

fillna(value[, subset])

Returns a new DataFrame in which null values are replaced with the specified value.
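
The value may also be a dict mapping column names to replacement values; a purely illustrative sketch (the sample people data above contains no nulls):

>>> # replace nulls in salary with 0 and nulls in gender with "unknown"
>>> filled = people.fillna({"salary": 0, "gender": "unknown"})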

filter(condition)

Filters rows using the given condition.

first()

Returns the first row as a Row.

foreach(f)

Applies the f function to every Row of this DataFrame.

foreachPartition(f)

Applies the f function to each partition of this DataFrame.

freqItems(cols[, support])

Finds frequent items for columns, possibly with false positives.

groupBy(*cols)

Groups the DataFrame by the specified columns so that aggregation can be performed on them.

groupby(*cols)

groupby() is an alias for groupBy().

groupingSets(groupingSets, *cols)

Creates a multi-dimensional aggregation for the current DataFrame using the specified grouping sets, allowing aggregations to be performed on them.

head([n])

Returns the first n rows.

hint(name, *parameters)

Specifies some hint on the current DataFrame.

inputFiles()

Returns a best-effort snapshot of the files that compose this DataFrame.

intersect(other)

Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.

intersectAll(other)

Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.

isEmpty()

Checks if the DataFrame is empty and returns a boolean value.

isLocal()

Returns True if the collect() and take() methods can be run locally (without any Spark executors).

join(other[, on, how])

Joins with another DataFrame, using the given join expression.
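
The optional how argument selects the join type; for example, a left outer join of the people and department DataFrames defined above keeps all four people rows:

>>> people.join(department, people.deptId == department.id, "left").count()
4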

lateralJoin(other[, on, how])

Lateral joins with another DataFrame, using the given join expression.

limit(num)

Limits the result count to the number specified.

localCheckpoint([eager, storageLevel])

Returns a locally checkpointed version of this DataFrame.

mapInArrow(func, schema[, barrier, profile])

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs pyarrow.RecordBatch objects, and returns the result as a DataFrame.

mapInPandas(func, schema[, barrier, profile])

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs pandas DataFrames, and returns the result as a DataFrame.
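
A minimal sketch, assuming pandas and PyArrow are installed; filter_adults is just an illustrative name, and it filters each pandas batch of the people DataFrame above while keeping the original schema:

>>> def filter_adults(batches):
...     for pdf in batches:  # each pdf is a pandas.DataFrame
...         yield pdf[pdf["age"] > 30]
...
>>> people.mapInPandas(filter_adults, schema=people.schema).count()
3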

melt(ids, values, variableColumnName, ...)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

mergeInto(table, condition)

Merges a set of updates, insertions, and deletions based on a source table into a target table.

observe(observation, *exprs)

Define (named) metrics to observe on the DataFrame.
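
A small sketch using the Observation helper from pyspark.sql to collect a metric as a side effect of an action on the people DataFrame above:

>>> from pyspark.sql import Observation, functions as F
>>> obs = Observation("metrics")
>>> _ = people.observe(obs, F.count(F.lit(1)).alias("rows")).collect()
>>> obs.get
{'rows': 4}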

offset(num)

Returns a new DataFrame by skipping the first n rows.

orderBy(*cols, **kwargs)

Returns a new DataFrame sorted by the specified column(s).

pandas_api([index_col])

Converts the existing DataFrame into a pandas-on-Spark DataFrame.

persist([storageLevel])

Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.

printSchema([level])

Prints out the schema in the tree format.

randomSplit(weights[, seed])

Randomly splits this DataFrame with the provided weights.

registerTempTable(name)

Registers this DataFrame as a temporary table using the given name.

repartition(numPartitions, *cols)

Returns a new DataFrame partitioned by the given partitioning expressions.
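
For example, hash-partitioning the people DataFrame above into four partitions by deptId (the rdd attribute used to inspect the partition count is not available under Spark Connect):

>>> people.repartition(4, "deptId").rdd.getNumPartitions()
4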

repartitionByRange(numPartitions, *cols)

Returns a new DataFrame partitioned by the given partitioning expressions.

replace(to_replace[, value, subset])

Returns a new DataFrame replacing a value with another value.

rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame using the specified columns, allowing for aggregation on them.

sameSemantics(other)

Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.

sample([withReplacement, fraction, seed])

Returns a sampled subset of this DataFrame.

sampleBy(col, fractions[, seed])

Returns a stratified sample without replacement based on the fraction given on each stratum.

scalar()

Return a Column object for a SCALAR Subquery containing exactly one row and one column.

select(*cols)

Projects a set of expressions and returns a new DataFrame.

selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.
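
For example, SQL expressions can compute and alias new columns from the people DataFrame above:

>>> people.selectExpr("name", "salary * 12 AS annual").columns
['name', 'annual']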

semanticHash()

Returns a hash code of the logical query plan against this DataFrame.

show([n, truncate, vertical])

Prints the first n rows of the DataFrame to the console.

sort(*cols, **kwargs)

Returns a new DataFrame sorted by the specified column(s).

sortWithinPartitions(*cols, **kwargs)

Returns a new DataFrame with each partition sorted by the specified column(s).

subtract(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

summary(*statistics)

Computes specified statistics for numeric and string columns.

tail(num)

Returns the last num rows as a list of Row.

take(num)

Returns the first num rows as a list of Row.

to(schema)

Returns a new DataFrame where each row is reconciled to match the specified schema.

toArrow()

Returns the contents of this DataFrame as a PyArrow pyarrow.Table.

toDF(*cols)

Returns a new DataFrame with the new specified column names.

toJSON([use_unicode])

Converts a DataFrame into an RDD of strings.

toLocalIterator([prefetchPartitions])

Returns an iterator that contains all of the rows in this DataFrame.

toPandas()

Returns the contents of this DataFrame as a pandas pandas.DataFrame.
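
This collects all rows to the driver and requires pandas to be installed; a small sketch with the people DataFrame above:

>>> pdf = people.toPandas()
>>> pdf.shape
(4, 5)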

transform(func, *args, **kwargs)

Returns a new DataFrame by applying a user-supplied transformation function; a concise syntax for chaining custom transformations.
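
For example, a user-defined transformation (with_annual_salary is just an illustrative name) applied to the people DataFrame above:

>>> def with_annual_salary(df):
...     return df.withColumn("annual_salary", df.salary * 12)
...
>>> "annual_salary" in people.transform(with_annual_salary).columns
True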

transpose([indexColumn])

Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.

union(other)

Return a new DataFrame containing the union of rows in this and another DataFrame.

unionAll(other)

Return a new DataFrame containing the union of rows in this and another DataFrame.

unionByName(other[, allowMissingColumns])

Returns a new DataFrame containing union of rows in this and another DataFrame.

unpersist([blocking])

Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.

unpivot(ids, values, variableColumnName, ...)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
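
For example, turning the age and salary columns of the people DataFrame above into (metric, value) pairs, producing one output row per input row and value column:

>>> people.unpivot("name", ["age", "salary"], "metric", "value").count()
8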

where(condition)

where() is an alias for filter().

withColumn(colName, col)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
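
For example, adding a derived boolean column to the people DataFrame above:

>>> from pyspark.sql import functions as F
>>> "senior" in people.withColumn("senior", F.col("age") >= 50).columns
True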

withColumnRenamed(existing, new)

Returns a new DataFrame by renaming an existing column.

withColumns(*colsMap)

Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.

withColumnsRenamed(colsMap)

Returns a new DataFrame by renaming multiple columns.

withMetadata(columnName, metadata)

Returns a new DataFrame by updating an existing column with metadata.

withWatermark(eventTime, delayThreshold)

Defines an event time watermark for this DataFrame.

writeTo(table)

Create a write configuration builder for v2 sources.

Attributes

columns

Retrieves the names of all columns in the DataFrame as a list.

dtypes

Returns all column names and their data types as a list.

executionInfo

Returns an ExecutionInfo object after the query was executed.

isStreaming

Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.

na

Returns a DataFrameNaFunctions for handling missing values.

plot

Returns a PySparkPlotAccessor for plotting functions.

rdd

Returns the content as a pyspark.RDD of Row.

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

sparkSession

Returns the Spark session that created this DataFrame.

stat

Returns a DataFrameStatFunctions for statistic functions.

storageLevel

Get the DataFrame's current storage level.

write

Interface for saving the content of the non-streaming DataFrame out into external storage.

writeStream

Interface for saving the content of the streaming DataFrame out into external storage.

is_cached

Returns True if this DataFrame is cached.