String Functions ¶

This page lists all string functions available in Spark SQL.

char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT char_length('Spark SQL ');
 10
> SELECT char_length(x'537061726b2053514c');
 9
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

Since: 2.3.0

character_length ¶

character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT character_length('Spark SQL ');
 10
> SELECT character_length(x'537061726b2053514c');
 9
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

Since: 2.3.0

chr ¶

chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256)

Examples:

> SELECT chr(65);
 A

Since: 2.3.0

collate ¶

collate(expr, collationName) - Marks a given expression with the specified collation.

Arguments:

expr - String expression to perform collation on.
collationName - Foldable string expression that specifies the collation name.

Examples:

> SELECT COLLATION('Spark SQL' collate UTF8_LCASE);
SYSTEM.BUILTIN.UTF8_LCASE

Since: 4.0.0

collation ¶

collation(expr) - Returns the collation name of a given expression.

Arguments:

expr - String expression to perform collation on.

Examples:

> SELECT collation('Spark SQL');
SYSTEM.BUILTIN.UTF8_BINARY

Since: 4.0.0

concat_ws ¶

concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated by sep, skipping null values.

Examples:

> SELECT concat_ws(' ', 'Spark', 'SQL');
  Spark SQL
> SELECT concat_ws('s');

> SELECT concat_ws('/', 'foo', null, 'bar');
  foo/bar
> SELECT concat_ws(null, 'Spark', 'SQL');
  NULL

Since: 1.5.0

contains ¶

contains(left, right) - Returns a boolean. The value is True if right is found inside left. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.

Examples:

> SELECT contains('Spark SQL', 'Spark');
 true
> SELECT contains('Spark SQL', 'SPARK');
 false
> SELECT contains('Spark SQL', null);
 NULL
> SELECT contains(x'537061726b2053514c', x'537061726b');
 true

Since: 3.3.0

decode ¶

decode(bin, charset) - Decodes the first argument using the second argument character set. If either argument is null, the result will also be null.

decode(expr, search, result [, search, result ] ... [, default]) - Compares expr to each search value in order. If expr is equal to a search value, decode returns the corresponding result. If no match is found, then it returns default. If default is omitted, it returns null.

Arguments:

bin - a binary expression to decode
charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to decode bin into a STRING. It is case insensitive.

Examples:

> SELECT decode(encode('abc', 'utf-8'), 'utf-8');
 abc
> SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
 San Francisco
> SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
 Non domestic
> SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
 NULL
> SELECT decode(null, 6, 'Spark', NULL, 'SQL', 4, 'rocks');
 SQL

Note:

decode(expr, search, result [, search, result ] ... [, default]) is supported since 3.2.0

Since: 1.5.0

elt ¶

elt(n, input1, input2, ...) - Returns the n-th input, e.g., returns input2 when n is 2. The function returns NULL if the index exceeds the length of the array and spark.sql.ansi.enabled is set to false. If spark.sql.ansi.enabled is set to true, it throws ArrayIndexOutOfBoundsException for invalid indices.

Examples:

> SELECT elt(1, 'scala', 'java');
 scala
> SELECT elt(2, 'a', 1);
 1

Since: 2.0.0

encode ¶

encode(str, charset) - Encodes the first argument using the second argument character set. If either argument is null, the result will also be null.

Arguments:

str - a string expression
charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to encode str into a BINARY. It is case insensitive.

Examples:

> SELECT encode('abc', 'utf-8');
 abc

Since: 1.5.0

endswith ¶

endswith(left, right) - Returns a boolean. The value is True if left ends with right. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.

Examples:

> SELECT endswith('Spark SQL', 'SQL');
 true
> SELECT endswith('Spark SQL', 'Spark');
 false
> SELECT endswith('Spark SQL', null);
 NULL
> SELECT endswith(x'537061726b2053514c', x'537061726b');
 false
> SELECT endswith(x'537061726b2053514c', x'53514c');
 true

Since: 3.3.0

find_in_set ¶

find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0, if the string was not found or if the given string (str) contains a comma.

Examples:

> SELECT find_in_set('ab','abc,b,ab,c,def');
 3

Since: 1.5.0

format_number ¶

format_number(expr1, expr2) - Formats the number expr1 like '#,###,###.##', rounded to expr2 decimal places. If expr2 is 0, the result has no decimal point or fractional part. expr2 also accept a user specified format. This is supposed to function like MySQL's FORMAT.

Examples:

> SELECT format_number(12332.123456, 4);
 12,332.1235
> SELECT format_number(12332.123456, '##################.###');
 12332.123

Since: 1.5.0

format_string ¶

format_string(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.

Examples:

> SELECT format_string("Hello World %d %s", 100, "days");
 Hello World 100 days

Since: 1.5.0

initcap ¶

initcap(str) - Returns str with the first letter of each word in uppercase. All other letters are in lowercase. Words are delimited by white space.

Examples:

> SELECT initcap('sPark sql');
 Spark Sql

Since: 1.5.0

instr ¶

instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str.

Examples:

> SELECT instr('SparkSQL', 'SQL');
 6

Since: 1.5.0

is_valid_utf8 ¶

is_valid_utf8(str) - Returns true if str is a valid UTF-8 string, otherwise returns false.

Arguments:

str - a string expression

Examples:

> SELECT is_valid_utf8('Spark');
 true
> SELECT is_valid_utf8(x'61');
 true
> SELECT is_valid_utf8(x'80');
 false
> SELECT is_valid_utf8(x'61C262');
 false

Since: 4.0.0

lcase ¶

lcase(str) - Returns str with all characters changed to lowercase.

Examples:

> SELECT lcase('SparkSql');
 sparksql

Since: 1.0.1

left ¶

left(str, len) - Returns the leftmost len(len can be string type) characters from the string str,if len is less or equal than 0 the result is an empty string.

Examples:

> SELECT left('Spark SQL', 3);
 Spa
> SELECT left(encode('Spark SQL', 'utf-8'), 3);
 Spa

Since: 2.3.0

len ¶

len(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT len('Spark SQL ');
 10
> SELECT len(x'537061726b2053514c');
 9
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

Since: 3.4.0

length ¶

length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:

> SELECT length('Spark SQL ');
 10
> SELECT length(x'537061726b2053514c');
 9
> SELECT CHAR_LENGTH('Spark SQL ');
 10
> SELECT CHARACTER_LENGTH('Spark SQL ');
 10

Since: 1.5.0

levenshtein ¶

levenshtein(str1, str2[, threshold]) - Returns the Levenshtein distance between the two given strings. If threshold is set and distance more than it, return -1.

Examples:

> SELECT levenshtein('kitten', 'sitting');
 3
> SELECT levenshtein('kitten', 'sitting', 2);
 -1

Since: 1.5.0

locate ¶

locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

> SELECT locate('bar', 'foobarbar');
 4
> SELECT locate('bar', 'foobarbar', 5);
 7
> SELECT POSITION('bar' IN 'foobarbar');
 4

Since: 1.5.0

lower ¶

lower(str) - Returns str with all characters changed to lowercase.

Examples:

> SELECT lower('SparkSql');
 sparksql

Since: 1.0.1

lpad ¶

lpad(str, len[, pad]) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters or bytes. If pad is not specified, str will be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence.

Examples:

> SELECT lpad('hi', 5, '??');
 ???hi
> SELECT lpad('hi', 1, '??');
 h
> SELECT lpad('hi', 5);
    hi
> SELECT hex(lpad(unhex('aabb'), 5));
 000000AABB
> SELECT hex(lpad(unhex('aabb'), 5, unhex('1122')));
 112211AABB

Since: 1.5.0

ltrim ¶

ltrim(str) - Removes the leading space characters from str.

Arguments:

str - a string expression
trimStr - the trim string characters to trim, the default value is a single space

Examples:

> SELECT ltrim('    SparkSQL   ');
 SparkSQL

Since: 1.5.0

luhn_check ¶

luhn_check(str ) - Checks that a string of digits is valid according to the Luhn algorithm. This checksum function is widely applied on credit card numbers and government identification numbers to distinguish valid numbers from mistyped, incorrect numbers.

Examples:

> SELECT luhn_check('8112189876');
 true
> SELECT luhn_check('79927398713');
 true
> SELECT luhn_check('79927398714');
 false

Since: 3.5.0

make_valid_utf8 ¶

make_valid_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise returns a new string whose invalid UTF8 byte sequences are replaced using the UNICODE replacement character U+FFFD.

Arguments:

str - a string expression

Examples:

> SELECT make_valid_utf8('Spark');
 Spark
> SELECT make_valid_utf8(x'61');
 a
> SELECT make_valid_utf8(x'80');
 �
> SELECT make_valid_utf8(x'61C262');
 a�b

Since: 4.0.0

mask ¶

mask(input[, upperChar, lowerChar, digitChar, otherChar]) - masks the given string value. The function replaces characters with 'X' or 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed.

Arguments:

input - string value to mask. Supported types: STRING, VARCHAR, CHAR
upperChar - character to replace upper-case characters with. Specify NULL to retain original character. Default value: 'X'
lowerChar - character to replace lower-case characters with. Specify NULL to retain original character. Default value: 'x'
digitChar - character to replace digit characters with. Specify NULL to retain original character. Default value: 'n'
otherChar - character to replace all other characters with. Specify NULL to retain original character. Default value: NULL

Examples:

> SELECT mask('abcd-EFGH-8765-4321');
  xxxx-XXXX-nnnn-nnnn
> SELECT mask('abcd-EFGH-8765-4321', 'Q');
  xxxx-QQQQ-nnnn-nnnn
> SELECT mask('AbCD123-@$#', 'Q', 'q');
  QqQQnnn-@$#
> SELECT mask('AbCD123-@$#');
  XxXXnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q');
  QxQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q');
  QqQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd');
  QqQQddd-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd', 'o');
  QqQQdddoooo
> SELECT mask('AbCD123-@$#', NULL, 'q', 'd', 'o');
  AqCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, 'd', 'o');
  AbCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, 'o');
  AbCD123oooo
> SELECT mask(NULL, NULL, NULL, NULL, 'o');
  NULL
> SELECT mask(NULL);
  NULL
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, NULL);
  AbCD123-@$#

Since: 3.4.0

octet_length ¶

octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.

Examples:

> SELECT octet_length('Spark SQL');
 9
> SELECT octet_length(x'537061726b2053514c');
 9

Since: 2.3.0

overlay ¶

overlay(input, replace, pos[, len]) - Replace input with replace that starts at pos and is of length len.

Examples:

> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
 Spark_SQL
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
 Spark CORE
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
 Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
 Structured SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
 Spark_SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
 Spark CORE
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
 Spark ANSI SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
 Structured SQL

Since: 3.0.0

position ¶

position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

> SELECT position('bar', 'foobarbar');
 4
> SELECT position('bar', 'foobarbar', 5);
 7
> SELECT POSITION('bar' IN 'foobarbar');
 4

Since: 2.3.0

printf ¶

printf(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.

Examples:

> SELECT printf("Hello World %d %s", 100, "days");
 Hello World 100 days

Since: 1.5.0

quote ¶

quote(str) - Returns str enclosed by single quotes and each instance of single quote in it is preceded by a backslash.

Examples:

> SELECT quote('Don\'t');
 'Don\'t'

Since: 4.1.0

randstr ¶

randstr(length[, seed]) - Returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).

Examples:

> SELECT randstr(3, 0) AS result;
 ceV

Since: 4.0.0

regexp_count ¶

regexp_count(str, regexp) - Returns a count of the number of times that the regular expression pattern regexp is matched in the string str.

Arguments:

str - a string expression.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Examples:

> SELECT regexp_count('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
 2
> SELECT regexp_count('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
 8

Since: 3.4.0

regexp_extract ¶

regexp_extract(str, regexp[, idx]) - Extract the first string in the str that match the regexp expression and corresponding to the regex group index.

Arguments:

str - a string expression.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string if exists.
idx - an integer expression that representing the group index. The regex maybe contains multiple groups. idx indicates which regex group to extract. The group index should be non-negative. The minimum value of idx is 0, which means matching the entire regular expression. If idx is not specified, the default group index value is 1. The idx parameter is the Java regex Matcher group() method index.

Examples:

> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
 100
> SELECT regexp_extract('100-200', r'(\d+)-(\d+)', 1);
 100

Since: 1.5.0

regexp_extract_all ¶

regexp_extract_all(str, regexp[, idx]) - Extract all strings in the str that match the regexp expression and corresponding to the regex group index.

Arguments:

str - a string expression.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string if exists.
idx - an integer expression that representing the group index. The regex may contains multiple groups. idx indicates which regex group to extract. The group index should be non-negative. The minimum value of idx is 0, which means matching the entire regular expression. If idx is not specified, the default group index value is 1. The idx parameter is the Java regex Matcher group() method index.

Examples:

> SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
 ["100","300"]
> SELECT regexp_extract_all('100-200, 300-400', r'(\d+)-(\d+)', 1);
 ["100","300"]

Since: 3.1.0

regexp_instr ¶

regexp_instr(str, regexp) - Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0.

Arguments:

str - a string expression.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string if exists.

Examples:

> SELECT regexp_instr(r"\abc", r"^\\abc$");
 1
> SELECT regexp_instr('user@spark.apache.org', '@[^.]*');
 5

Since: 3.4.0

regexp_replace ¶

regexp_replace(str, regexp, rep[, position]) - Replaces all substrings of str that match regexp with rep.

Arguments:

str - a string expression to search for a regular expression pattern match.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string if exists.
rep - a string expression to replace matched substrings.
position - a positive integer literal that indicates the position within str to begin searching. The default is 1. If position is greater than the number of characters in str, the result is str.

Examples:

> SELECT regexp_replace('100-200', '(\\d+)', 'num');
 num-num
> SELECT regexp_replace('100-200', r'(\d+)', 'num');
 num-num

Since: 1.5.0

regexp_substr ¶

regexp_substr(str, regexp) - Returns the substring that matches the regular expression regexp within the string str. If the regular expression is not found, the result is null.

Arguments:

str - a string expression.
regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Examples:

> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
 Steven
> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Jeck');
 NULL

Since: 3.4.0

repeat ¶

repeat(str, n) - Returns the string which repeats the given string value n times.

Examples:

> SELECT repeat('123', 2);
 123123

Since: 1.5.0

replace ¶

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

str - a string expression
search - a string expression. If search is not found in str, str is returned unchanged.
replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

Examples:

> SELECT replace('ABCabc', 'abc', 'DEF');
 ABCDEF

Since: 2.3.0

right ¶

right(str, len) - Returns the rightmost len(len can be string type) characters from the string str,if len is less or equal than 0 the result is an empty string.

Examples:

> SELECT right('Spark SQL', 3);
 SQL

Since: 2.3.0

rpad ¶

rpad(str, len[, pad]) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. If pad is not specified, str will be padded to the right with space characters if it is a character string, and with zeros if it is a binary string.

Examples:

> SELECT rpad('hi', 5, '??');
 hi???
> SELECT rpad('hi', 1, '??');
 h
> SELECT rpad('hi', 5);
 hi
> SELECT hex(rpad(unhex('aabb'), 5));
 AABB000000
> SELECT hex(rpad(unhex('aabb'), 5, unhex('1122')));
 AABB112211

Since: 1.5.0

rtrim ¶

rtrim(str) - Removes the trailing space characters from str.

Arguments:

str - a string expression
trimStr - the trim string characters to trim, the default value is a single space

Examples:

> SELECT rtrim('    SparkSQL   ');
 SparkSQL

Since: 1.5.0

sentences ¶

sentences(str[, lang[, country]]) - Splits str into an array of array of words.

Arguments:

str - A STRING expression to be parsed.
lang - An optional STRING expression with a language code from ISO 639 Alpha-2 (e.g. 'DE'), Alpha-3, or a language subtag of up to 8 characters.
country - An optional STRING expression with a country code from ISO 3166 alpha-2 country code or a UN M.49 numeric-3 area code.

Examples:

> SELECT sentences('Hi there! Good morning.');
 [["Hi","there"],["Good","morning"]]
> SELECT sentences('Hi there! Good morning.', 'en');
 [["Hi","there"],["Good","morning"]]
> SELECT sentences('Hi there! Good morning.', 'en', 'US');
 [["Hi","there"],["Good","morning"]]

Since: 2.0.0

soundex ¶

soundex(str) - Returns Soundex code of the string.

Examples:

> SELECT soundex('Miller');
 M460

Since: 1.5.0

space ¶

space(n) - Returns a string consisting of n spaces.

Examples:

> SELECT concat(space(2), '1');
   1

Since: 1.5.0

split ¶

split(str, regex, limit) - Splits str around occurrences that match regex and returns an array with a length of at most limit

Arguments:

str - a string expression to split.
regex - a string representing a regular expression. The regex string should be a Java regular expression.
limit - an integer expression which controls the number of times the regex is applied.
- limit > 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched regex.
- limit <= 0: regex will be applied as many times as possible, and the resulting array can be of any size.

Examples:

> SELECT split('oneAtwoBthreeC', '[ABC]');
 ["one","two","three",""]
> SELECT split('oneAtwoBthreeC', '[ABC]', -1);
 ["one","two","three",""]
> SELECT split('oneAtwoBthreeC', '[ABC]', 2);
 ["one","twoBthreeC"]

Since: 1.5.0

split_part ¶

split_part(str, delimiter, partNum) - Splits str by delimiter and return requested part of the split (1-based). If any input is null, returns null. if partNum is out of range of split parts, returns empty string. If partNum is 0, throws an error. If partNum is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split.

Examples:

> SELECT split_part('11.12.13', '.', 3);
 13

Since: 3.3.0

startswith ¶

startswith(left, right) - Returns a boolean. The value is True if left starts with right. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.

Examples:

> SELECT startswith('Spark SQL', 'Spark');
 true
> SELECT startswith('Spark SQL', 'SQL');
 false
> SELECT startswith('Spark SQL', null);
 NULL
> SELECT startswith(x'537061726b2053514c', x'537061726b');
 true
> SELECT startswith(x'537061726b2053514c', x'53514c');
 false

Since: 3.3.0

substr ¶

substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substr(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

Examples:

> SELECT substr('Spark SQL', 5);
 k SQL
> SELECT substr('Spark SQL', -3);
 SQL
> SELECT substr('Spark SQL', 5, 1);
 k
> SELECT substr('Spark SQL' FROM 5);
 k SQL
> SELECT substr('Spark SQL' FROM -3);
 SQL
> SELECT substr('Spark SQL' FROM 5 FOR 1);
 k
> SELECT substr(encode('Spark SQL', 'utf-8'), 5);
 k SQL

Since: 1.5.0

substring ¶

substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substring(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

Examples:

> SELECT substring('Spark SQL', 5);
 k SQL
> SELECT substring('Spark SQL', -3);
 SQL
> SELECT substring('Spark SQL', 5, 1);
 k
> SELECT substring('Spark SQL' FROM 5);
 k SQL
> SELECT substring('Spark SQL' FROM -3);
 SQL
> SELECT substring('Spark SQL' FROM 5 FOR 1);
 k
> SELECT substring(encode('Spark SQL', 'utf-8'), 5);
 k SQL

Since: 1.5.0

substring_index ¶

substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

Examples:

> SELECT substring_index('www.apache.org', '.', 2);
 www.apache

Since: 1.5.0

to_binary ¶

to_binary(str[, fmt]) - Converts the input str to a binary value based on the supplied fmt. fmt can be a case-insensitive string literal of "hex", "utf-8", "utf8", or "base64". By default, the binary format for conversion is "hex" if fmt is omitted. The function returns NULL if at least one of the input parameters is NULL.

Examples:

> SELECT to_binary('abc', 'utf-8');
 abc

Since: 3.3.0

to_char ¶

to_char(expr, format) - Convert expr to a string based on the format. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. ('<1>'). If expr is a datetime, format shall be a valid datetime pattern, see Datetime Patterns. If expr is a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string.

Examples:

> SELECT to_char(454, '999');
 454
> SELECT to_char(454.00, '000D00');
 454.00
> SELECT to_char(12454, '99G999');
 12,454
> SELECT to_char(78.12, '$99.99');
 $78.12
> SELECT to_char(-12454.8, '99G999D9S');
 12,454.8-
> SELECT to_char(date'2016-04-08', 'y');
 2016
> SELECT to_char(x'537061726b2053514c', 'base64');
 U3BhcmsgU1FM
> SELECT to_char(x'537061726b2053514c', 'hex');
 537061726B2053514C
> SELECT to_char(encode('abc', 'utf-8'), 'utf-8');
 abc

Since: 3.4.0

to_number ¶

to_number(expr, fmt) - Convert string 'expr' to a number based on the string format 'fmt'. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input string. If the 0/9 sequence starts with 0 and is before the decimal point, it can only match a digit sequence of the same size. Otherwise, if the sequence starts with 9 or is after the decimal point, it can match a digit sequence that has the same or smaller size. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. 'expr' must match the grouping separator relevant for the size of the number. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' allows '-' but 'MI' does not. 'PR': Only allowed at the end of the format string; specifies that 'expr' indicates a negative number with wrapping angled brackets. ('<1>').

Examples:

> SELECT to_number('454', '999');
 454
> SELECT to_number('454.00', '000.00');
 454.00
> SELECT to_number('12,454', '99,999');
 12454
> SELECT to_number('$78.12', '$99.99');
 78.12
> SELECT to_number('12,454.8-', '99,999.9S');
 -12454.8

Since: 3.3.0

to_varchar ¶

to_varchar(expr, format) - Convert expr to a string based on the format. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. ('<1>'). If expr is a datetime, format shall be a valid datetime pattern, see Datetime Patterns. If expr is a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string.

Examples:

> SELECT to_varchar(454, '999');
 454
> SELECT to_varchar(454.00, '000D00');
 454.00
> SELECT to_varchar(12454, '99G999');
 12,454
> SELECT to_varchar(78.12, '$99.99');
 $78.12
> SELECT to_varchar(-12454.8, '99G999D9S');
 12,454.8-
> SELECT to_varchar(date'2016-04-08', 'y');
 2016
> SELECT to_varchar(x'537061726b2053514c', 'base64');
 U3BhcmsgU1FM
> SELECT to_varchar(x'537061726b2053514c', 'hex');
 537061726B2053514C
> SELECT to_varchar(encode('abc', 'utf-8'), 'utf-8');
 abc

Since: 3.5.0

translate ¶

translate(input, from, to) - Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string.

Examples:

> SELECT translate('AaBbCc', 'abc', '123');
 A1B2C3

Since: 1.5.0

trim ¶

trim(str) - Removes the leading and trailing space characters from str.

trim(BOTH FROM str) - Removes the leading and trailing space characters from str.

trim(LEADING FROM str) - Removes the leading space characters from str.

trim(TRAILING FROM str) - Removes the trailing space characters from str.

trim(trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str.

trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str.

Arguments:

str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string

Examples:

> SELECT trim('    SparkSQL   ');
 SparkSQL
> SELECT trim(BOTH FROM '    SparkSQL   ');
 SparkSQL
> SELECT trim(LEADING FROM '    SparkSQL   ');
 SparkSQL
> SELECT trim(TRAILING FROM '    SparkSQL   ');
     SparkSQL
> SELECT trim('SL' FROM 'SSparkSQLS');
 parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
 parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
 parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
 SSparkSQ

Since: 1.5.0

try_to_binary ¶

try_to_binary(str[, fmt]) - This is a special version of to_binary that performs the same operation, but returns a NULL value instead of raising an error if the conversion cannot be performed.

Examples:

> SELECT try_to_binary('abc', 'utf-8');
 abc
> select try_to_binary('a!', 'base64');
 NULL
> select try_to_binary('abc', 'invalidFormat');
 NULL

Since: 3.3.0

try_to_number ¶

try_to_number(expr, fmt) - Convert string 'expr' to a number based on the string format fmt. Returns NULL if the string 'expr' does not match the expected format. The format follows the same semantics as the to_number function.

Examples:

> SELECT try_to_number('454', '999');
 454
> SELECT try_to_number('454.00', '000.00');
 454.00
> SELECT try_to_number('12,454', '99,999');
 12454
> SELECT try_to_number('$78.12', '$99.99');
 78.12
> SELECT try_to_number('12,454.8-', '99,999.9S');
 -12454.8

Since: 3.3.0

try_validate_utf8 ¶

try_validate_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise returns NULL.

Arguments:

str - a string expression

Examples:

> SELECT try_validate_utf8('Spark');
 Spark
> SELECT try_validate_utf8(x'61');
 a
> SELECT try_validate_utf8(x'80');
 NULL
> SELECT try_validate_utf8(x'61C262');
 NULL

Since: 4.0.0

ucase ¶

ucase(str) - Returns str with all characters changed to uppercase.

Examples:

> SELECT ucase('SparkSql');
 SPARKSQL

Since: 1.0.1

unbase64 ¶

unbase64(str) - Converts the argument from a base 64 string str to a binary.

Examples:

> SELECT unbase64('U3BhcmsgU1FM');
 Spark SQL

Since: 1.5.0

upper ¶

upper(str) - Returns str with all characters changed to uppercase.

Examples:

> SELECT upper('SparkSql');
 SPARKSQL

Since: 1.0.1

validate_utf8 ¶

validate_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise throws an exception.

Arguments:

str - a string expression

Examples:

> SELECT validate_utf8('Spark');
 Spark
> SELECT validate_utf8(x'61');
 a

Since: 4.0.0

||¶

expr1 || expr2 - Returns the concatenation of expr1 and expr2.

Examples:

> SELECT 'Spark' || 'SQL';
 SparkSQL
> SELECT array(1, 2, 3) || array(4, 5) || array(6);
 [1,2,3,4,5,6]

Note:

|| for arrays is available since 2.4.0.

Since: 2.3.0