Spark Session#

The entry point to programming Spark with the Dataset and DataFrame API. To create a Spark session, use the SparkSession.builder attribute. See also SparkSession.
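A minimal sketch of the typical builder chain; the application name, master URL, and config key/value below are illustrative, not required values:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a session. appName, master, and config are optional;
# the values shown here are only examples.
spark = (
    SparkSession.builder
    .appName("example-app")                       # shown in the Spark web UI
    .master("local[4]")                           # run locally with 4 cores
    .config("spark.sql.shuffle.partitions", "8")  # any runtime config key
    .getOrCreate()
)

print(spark.version)
```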

SparkSession.active()

Returns the active or default SparkSession for the current thread, returned by the builder.

SparkSession.builder.appName(name)

Sets a name for the application, which will be shown in the Spark web UI.

SparkSession.builder.config([key, value, ...])

Sets a config option.

SparkSession.builder.enableHiveSupport()

Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

SparkSession.builder.getOrCreate()

Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.

SparkSession.builder.master(master)

Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.

SparkSession.builder.remote(url)

Sets the Spark remote URL to connect to, such as "sc://host:port" to run it via Spark Connect server.
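For example, connecting through Spark Connect might look like the sketch below; the host and port are placeholders (15002 is the server's default port):

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect server instead of a local/standalone master.
# "localhost:15002" is a placeholder; use your server's address.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```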

SparkSession.catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
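A small sketch of typical catalog queries; the databases and tables returned depend entirely on your environment:

```python
# Inspect the metastore through the catalog interface.
print(spark.catalog.currentDatabase())
for table in spark.catalog.listTables():
    print(table.name, table.tableType)
```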

SparkSession.conf

Runtime configuration interface for Spark.
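For example, reading and setting runtime options (the key and value are illustrative):

```python
# Read and modify runtime configuration on an existing session.
spark.conf.set("spark.sql.shuffle.partitions", "16")
print(spark.conf.get("spark.sql.shuffle.partitions"))
```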

SparkSession.createDataFrame(data[, schema, ...])

Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.
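A quick sketch using a plain Python list with a DDL-style schema string; the column names and data are made up:

```python
# Create a DataFrame from local data with an explicit schema.
rows = [("Alice", 34), ("Bob", 29)]
df = spark.createDataFrame(rows, schema="name string, age int")
df.show()
```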

SparkSession.dataSource

Returns a DataSourceRegistration for data source registration.

SparkSession.getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder.

SparkSession.newSession()

Returns a new SparkSession that has separate SQLConf, registered temporary views, and UDFs, but shares the underlying SparkContext and table cache.

SparkSession.profile

Returns a Profile for performance/memory profiling.

SparkSession.range(start[, end, step, ...])

Creates a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with the given step value.
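For instance:

```python
# id column: 0, 2, 4, 6, 8 (end is exclusive)
spark.range(0, 10, 2).show()
```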

SparkSession.read

Returns a DataFrameReader that can be used to read data in as a DataFrame.
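A typical read might look like the following; the file path and options are placeholders:

```python
# Read a CSV file into a DataFrame (path and options are illustrative).
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/data.csv")
)
```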

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
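For example, the built-in rate source is a convenient way to try the streaming reader without any external data (the rows-per-second value is arbitrary):

```python
# A streaming DataFrame from the built-in "rate" source.
stream_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 5)
    .load()
)
print(stream_df.isStreaming)  # True
```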

SparkSession.sparkContext

Returns the underlying SparkContext.

SparkSession.sql(sqlQuery[, args])

Returns a DataFrame representing the result of the given query.
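Queries can also be parameterized through args in recent Spark versions (3.4+); the query and parameter name below are illustrative:

```python
# Named parameters are referenced as :name in the query text.
df = spark.sql(
    "SELECT * FROM range(10) WHERE id > :min_id",
    args={"min_id": 5},
)
df.show()
```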

SparkSession.stop()

Stop the underlying SparkContext.

SparkSession.streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

SparkSession.table(tableName)

Returns the specified table as a DataFrame.
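For example, assuming a view or table named "people" has been registered (the name and data here are made up):

```python
# Register a temporary view, then read it back by name.
df = spark.createDataFrame([("Alice", 34)], "name string, age int")
df.createOrReplaceTempView("people")
people = spark.table("people")
people.show()
```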

SparkSession.tvf

Returns a TableValuedFunction that can be used to call a table-valued function (TVF).

SparkSession.udf

Returns a UDFRegistration for UDF registration.
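A small sketch registering a Python function for use in SQL; the function name and logic are made up:

```python
from pyspark.sql.types import IntegerType

# Register a Python UDF so it can be called from SQL.
spark.udf.register("plus_one", lambda x: x + 1, IntegerType())
spark.sql("SELECT plus_one(id) AS next_id FROM range(3)").show()
```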

SparkSession.udtf

Returns a UDTFRegistration for UDTF registration.

SparkSession.version

The version of Spark on which this application is running.

is_remote()

Returns whether the current running environment is for Spark Connect.

Spark Connect Only#

SparkSession.builder.create()

Creates a new SparkSession.

SparkSession.addArtifact(*path[, pyfile, ...])

Add artifact(s) to the client session.

SparkSession.addArtifacts(*path[, pyfile, ...])

Add artifact(s) to the client session.

SparkSession.addTag(tag)

Add a tag to be assigned to all the operations started by this thread in this session.
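A sketch of the tag lifecycle on a Spark Connect session (the tag name is arbitrary); getTags, interruptTag, and removeTag are described further down in this list:

```python
# Tag the operations started by this thread, then interrupt them by tag.
spark.addTag("nightly-etl")
print(spark.getTags())             # {'nightly-etl'}

# ... launch long-running queries here ...

spark.interruptTag("nightly-etl")  # interrupt everything carrying the tag
spark.removeTag("nightly-etl")
```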

SparkSession.clearProgressHandlers()

Clear all registered progress handlers.

SparkSession.clearTags()

Clear the current thread's operation tags.

SparkSession.client

Gives access to the Spark Connect client.

SparkSession.copyFromLocalToFs(local_path, ...)

Copies a file from the local file system to the cloud storage file system.

SparkSession.getTags()

Get the tags that are currently set to be assigned to all the operations started by this thread.

SparkSession.interruptAll()

Interrupt all operations of this session currently running on the connected server.

SparkSession.interruptOperation(op_id)

Interrupt an operation of this session with the given operationId.

SparkSession.interruptTag(tag)

Interrupt all operations of this session with the given operation tag.

SparkSession.registerProgressHandler(handler)

Register a progress handler to be called when a progress update is received from the server.

SparkSession.removeProgressHandler(handler)

Remove a progress handler that was previously registered.

SparkSession.removeTag(tag)

Remove a tag previously added to be assigned to all the operations started by this thread in this session.