pyspark.sql.functions.substring

pyspark.sql.functions.substring(str, pos, len)

Returns the substring that starts at pos and has length len when str is of String type, or the slice of the byte array that starts at byte position pos and has length len when str is of Binary type.
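
To make the String versus Binary distinction concrete, here is a minimal plain-Python sketch. It does not call Spark; it only illustrates, under the assumption that String input is counted in characters and Binary input in bytes, why the same pos and len can select different data. The sample value "héllo" is made up for illustration.

```python
# Plain-Python illustration only; no Spark session is involved.
text = "héllo"                  # 5 characters
raw = text.encode("utf-8")      # 6 bytes, because "é" takes 2 bytes in UTF-8

pos, length = 2, 2              # 1-based start position and length
start = pos - 1                 # 0-based index for Python slicing

char_slice = text[start:start + length]   # counts characters
byte_slice = raw[start:start + length]    # counts bytes

print(char_slice)   # él
print(byte_slice)   # b'\xc3\xa9'
```

The character slice covers two characters, while the byte slice of the same position and length covers only the two bytes that encode "é".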

New in version 1.5.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
str : Column or column name

target column to work on.

pos : Column or column name or int

starting position in str (1-based).

Changed in version 4.0.0: pos now accepts column and column name.

len : Column or column name or int

length of the substring: number of characters for String input, number of bytes for Binary input.

Changed in version 4.0.0: len now accepts column and column name.

Returns
Column

substring of given value.

Notes

The position is 1-based, not 0-based: the first character is at position 1.
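
For readers used to Python's 0-based slicing, the following sketch shows how a 1-based position maps to a Python slice. The helper substring_model is hypothetical and is not part of PySpark; it only models positive positions and non-negative lengths.

```python
def substring_model(value, pos, length):
    """Hypothetical pure-Python model of a 1-based substring; not Spark code."""
    # Assumption: only pos >= 1 and length >= 0 are modeled in this sketch.
    if pos < 1 or length < 0:
        raise ValueError("this sketch only models pos >= 1 and length >= 0")
    start = pos - 1  # convert the 1-based position to a 0-based Python index
    return value[start:start + length]


print(substring_model("abcd", 1, 2))   # ab
print(substring_model("Spark", 2, 3))  # par
```

Position 1 in this model corresponds to Python index 0, so substring_model("abcd", 1, 2) matches "abcd"[0:2].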

Examples

Example 1: Using literal integers as arguments

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('abcd',)], ['s',])
>>> df.select('*', sf.substring(df.s, 1, 2)).show()
+----+------------------+
|   s|substring(s, 1, 2)|
+----+------------------+
|abcd|                ab|
+----+------------------+

Example 2: Using columns as arguments

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('Spark', 2, 3)], ['s', 'p', 'l'])
>>> df.select('*', sf.substring(df.s, 2, df.l)).show()
+-----+---+---+------------------+
|    s|  p|  l|substring(s, 2, l)|
+-----+---+---+------------------+
|Spark|  2|  3|               par|
+-----+---+---+------------------+
>>> df.select('*', sf.substring(df.s, df.p, 3)).show()
+-----+---+---+------------------+
|    s|  p|  l|substring(s, p, 3)|
+-----+---+---+------------------+
|Spark|  2|  3|               par|
+-----+---+---+------------------+
>>> df.select('*', sf.substring(df.s, df.p, df.l)).show()
+-----+---+---+------------------+
|    s|  p|  l|substring(s, p, l)|
+-----+---+---+------------------+
|Spark|  2|  3|               par|
+-----+---+---+------------------+

Example 3: Using column names as arguments

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('Spark', 2, 3)], ['s', 'p', 'l'])
>>> df.select('*', sf.substring(df.s, 2, 'l')).show()
+-----+---+---+------------------+
|    s|  p|  l|substring(s, 2, l)|
+-----+---+---+------------------+
|Spark|  2|  3|               par|
+-----+---+---+------------------+
>>> df.select('*', sf.substring('s', 'p', 'l')).show()
+-----+---+---+------------------+
|    s|  p|  l|substring(s, p, l)|
+-----+---+---+------------------+
|Spark|  2|  3|               par|
+-----+---+---+------------------+