pyspark.sql.SparkSession.sql
- SparkSession.sql(sqlQuery, args=None, **kwargs)
Returns a DataFrame representing the result of the given query. When kwargs is specified, this method formats the given string by using the Python standard formatter. The method binds named parameters to SQL literals or positional parameters from args. It doesn't support named and positional parameters in the same SQL query.
New in version 2.0.0.
Changed in version 3.4.0: Supports Spark Connect and parameterized SQL.
Changed in version 3.5.0: Added positional parameters.
- Parameters
- sqlQuery : str
SQL query string.
- args : dict or list
A dictionary of parameter names to Python objects, or a list of Python objects, that can be converted to SQL literal expressions. See Supported Data Types for supported value types in Python. For example, dictionary keys: "rank", "name", "birthdate"; dictionary or list values: 1, "Steven", datetime.date(2023, 4, 2). A value can also be a Column of a literal or of collection constructor functions such as map(), array(), struct(); in that case it is taken as is.
New in version 3.4.0.
- kwargs : dict
Other variables that the user wants to set that can be referenced in the query.
Changed in version 3.3.0: Added optional argument kwargs to specify the mapping of variables in the query. This feature is experimental and unstable.
- Returns
- DataFrame
The result of the given query as a DataFrame.
Notes
In Spark Classic, a temporary view referenced in spark.sql is resolved immediately, while in Spark Connect it is lazily analyzed. So in Spark Connect if a view is dropped, modified or replaced after spark.sql, the execution may fail or generate different results.
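As a minimal sketch of this difference (the temporary view name "tv" below is made up for illustration; createOrReplaceTempView and spark.catalog.dropTempView are the standard temporary-view APIs, and the exact failure on Spark Connect may vary by version and deployment):

>>> spark.range(3).createOrReplaceTempView("tv")
>>> df = spark.sql("SELECT * FROM tv")  # Spark Classic resolves "tv" here; Spark Connect defers analysis
>>> spark.catalog.dropTempView("tv")    # drop the view before the query result is consumed
True
>>> df.show()  # Spark Classic: still shows 0, 1, 2; Spark Connect: may fail because the view no longer exists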
Examples
Executing a SQL query.
>>> spark.sql("SELECT * FROM range(10) where id > 7").show() +---+ | id| +---+ | 8| | 9| +---+
Executing a SQL query with variables substituted via the Python standard formatter.
>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9
... ).show()
+---+
| id|
+---+
|  8|
+---+
>>> mydf = spark.range(10)
>>> spark.sql(
...     "SELECT {col} FROM {mydf} WHERE id IN {x}",
...     col=mydf.id, mydf=mydf, x=tuple(range(4))).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
>>> spark.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=spark.createDataFrame([(1, "a"), (2, "b")], ["a", "key"]),
...   table2=spark.createDataFrame([(3, "a"), (4, "b"), (5, "b")], ["b", "key"])).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  2|  5|
+---+---+
Also, it is possible to query using Column instances from a DataFrame.

>>> mydf = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
>>> spark.sql("SELECT {df.A}, {df[B]} FROM {df}", df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  1|  4|
|  2|  4|
|  3|  6|
+---+---+
Named parameters, marked with the : prefix, can be substituted by SQL literals.
>>> from pyspark.sql.functions import create_map, lit
>>> spark.sql(
...     "SELECT *, element_at(:m, 'a') AS C FROM {df} WHERE {df[B]} > :minB",
...     {"minB" : 5, "m" : create_map(lit('a'), lit(1))}, df=mydf).show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  3|  6|  1|
+---+---+---+
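A minimal variant of named-parameter binding without any DataFrame substitution (a sketch added for illustration; it assumes only that range(10) behaves as in the first example):

>>> spark.sql("SELECT * FROM range(10) WHERE id > :bound", args={"bound": 7}).show()
+---+
| id|
+---+
|  8|
|  9|
+---+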
Positional parameters, marked by ? in the SQL query, can likewise be substituted by SQL literals.
>>> from pyspark.sql.functions import array, lit
>>> spark.sql(
...     "SELECT *, element_at(?, 1) AS C FROM {df} WHERE {df[B]} > ? and ? < {df[A]}",
...     args=[array(lit(1), lit(2), lit(3)), 5, 2], df=mydf).show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  3|  6|  1|
+---+---+---+
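Likewise, a minimal sketch of positional parameters alone, without any DataFrame substitution (again assuming only the behavior of range(10) shown above):

>>> spark.sql("SELECT * FROM range(10) WHERE id > ?", args=[7]).show()
+---+
| id|
+---+
|  8|
|  9|
+---+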