String Functions¶
This page lists all string functions available in Spark SQL.
ascii¶
ascii(str) - Returns the numeric value of the first character of str.
Examples:
> SELECT ascii('222');
50
> SELECT ascii(2);
50
Since: 1.5.0
base64¶
base64(bin) - Converts the argument from a binary bin to a base 64 string.
Examples:
> SELECT base64('Spark SQL');
U3BhcmsgU1FM
> SELECT base64(x'537061726b2053514c');
U3BhcmsgU1FM
Since: 1.5.0
bit_length¶
bit_length(expr) - Returns the bit length of string data or number of bits of binary data.
Examples:
> SELECT bit_length('Spark SQL');
72
> SELECT bit_length(x'537061726b2053514c');
72
Since: 2.3.0
btrim¶
btrim(str) - Removes the leading and trailing space characters from str.
btrim(str, trimStr) - Remove the leading and trailing trimStr characters from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT btrim(' SparkSQL ');
SparkSQL
> SELECT btrim(encode(' SparkSQL ', 'utf-8'));
SparkSQL
> SELECT btrim('SSparkSQLS', 'SL');
parkSQ
> SELECT btrim(encode('SSparkSQLS', 'utf-8'), encode('SL', 'utf-8'));
parkSQ
Since: 3.2.0
char¶
char(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256)
Examples:
> SELECT char(65);
A
Since: 2.3.0
char_length¶
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
> SELECT char_length('Spark SQL ');
10
> SELECT char_length(x'537061726b2053514c');
9
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10
Since: 2.3.0
character_length¶
character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
> SELECT character_length('Spark SQL ');
10
> SELECT character_length(x'537061726b2053514c');
9
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10
Since: 2.3.0
chr¶
chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256)
Examples:
> SELECT chr(65);
A
Since: 2.3.0
collate¶
collate(expr, collationName) - Marks a given expression with the specified collation.
Arguments:
- expr - String expression to perform collation on.
- collationName - Foldable string expression that specifies the collation name.
Examples:
> SELECT COLLATION('Spark SQL' collate UTF8_LCASE);
SYSTEM.BUILTIN.UTF8_LCASE
Since: 4.0.0
collation¶
collation(expr) - Returns the collation name of a given expression.
Arguments:
- expr - String expression to perform collation on.
Examples:
> SELECT collation('Spark SQL');
SYSTEM.BUILTIN.UTF8_BINARY
Since: 4.0.0
concat_ws¶
concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated by sep, skipping null values.
Examples:
> SELECT concat_ws(' ', 'Spark', 'SQL');
Spark SQL
> SELECT concat_ws('s');
> SELECT concat_ws('/', 'foo', null, 'bar');
foo/bar
> SELECT concat_ws(null, 'Spark', 'SQL');
NULL
Since: 1.5.0
contains¶
contains(left, right) - Returns a boolean. The value is True if right is found inside left. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.
Examples:
> SELECT contains('Spark SQL', 'Spark');
true
> SELECT contains('Spark SQL', 'SPARK');
false
> SELECT contains('Spark SQL', null);
NULL
> SELECT contains(x'537061726b2053514c', x'537061726b');
true
Since: 3.3.0
decode¶
decode(bin, charset) - Decodes the first argument using the second argument character set. If either argument is null, the result will also be null.
decode(expr, search, result [, search, result ] ... [, default]) - Compares expr to each search value in order. If expr is equal to a search value, decode returns the corresponding result. If no match is found, then it returns default. If default is omitted, it returns null.
Arguments:
- bin - a binary expression to decode
- charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to decode
bininto a STRING. It is case insensitive.
Examples:
> SELECT decode(encode('abc', 'utf-8'), 'utf-8');
abc
> SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
San Francisco
> SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
Non domestic
> SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
NULL
> SELECT decode(null, 6, 'Spark', NULL, 'SQL', 4, 'rocks');
SQL
Note:
decode(expr, search, result [, search, result ] ... [, default]) is supported since 3.2.0
Since: 1.5.0
elt¶
elt(n, input1, input2, ...) - Returns the n-th input, e.g., returns input2 when n is 2.
The function returns NULL if the index exceeds the length of the array
and spark.sql.ansi.enabled is set to false. If spark.sql.ansi.enabled is set to true,
it throws ArrayIndexOutOfBoundsException for invalid indices.
Examples:
> SELECT elt(1, 'scala', 'java');
scala
> SELECT elt(2, 'a', 1);
1
Since: 2.0.0
encode¶
encode(str, charset) - Encodes the first argument using the second argument character set. If either argument is null, the result will also be null.
Arguments:
- str - a string expression
- charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to encode
strinto a BINARY. It is case insensitive.
Examples:
> SELECT encode('abc', 'utf-8');
abc
Since: 1.5.0
endswith¶
endswith(left, right) - Returns a boolean. The value is True if left ends with right. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.
Examples:
> SELECT endswith('Spark SQL', 'SQL');
true
> SELECT endswith('Spark SQL', 'Spark');
false
> SELECT endswith('Spark SQL', null);
NULL
> SELECT endswith(x'537061726b2053514c', x'537061726b');
false
> SELECT endswith(x'537061726b2053514c', x'53514c');
true
Since: 3.3.0
find_in_set¶
find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array).
Returns 0, if the string was not found or if the given string (str) contains a comma.
Examples:
> SELECT find_in_set('ab','abc,b,ab,c,def');
3
Since: 1.5.0
format_number¶
format_number(expr1, expr2) - Formats the number expr1 like '#,###,###.##', rounded to expr2
decimal places. If expr2 is 0, the result has no decimal point or fractional part.
expr2 also accept a user specified format.
This is supposed to function like MySQL's FORMAT.
Examples:
> SELECT format_number(12332.123456, 4);
12,332.1235
> SELECT format_number(12332.123456, '##################.###');
12332.123
Since: 1.5.0
format_string¶
format_string(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.
Examples:
> SELECT format_string("Hello World %d %s", 100, "days");
Hello World 100 days
Since: 1.5.0
initcap¶
initcap(str) - Returns str with the first letter of each word in uppercase.
All other letters are in lowercase. Words are delimited by white space.
Examples:
> SELECT initcap('sPark sql');
Spark Sql
Since: 1.5.0
instr¶
instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str.
Examples:
> SELECT instr('SparkSQL', 'SQL');
6
Since: 1.5.0
is_valid_utf8¶
is_valid_utf8(str) - Returns true if str is a valid UTF-8 string, otherwise returns false.
Arguments:
- str - a string expression
Examples:
> SELECT is_valid_utf8('Spark');
true
> SELECT is_valid_utf8(x'61');
true
> SELECT is_valid_utf8(x'80');
false
> SELECT is_valid_utf8(x'61C262');
false
Since: 4.0.0
lcase¶
lcase(str) - Returns str with all characters changed to lowercase.
Examples:
> SELECT lcase('SparkSql');
sparksql
Since: 1.0.1
left¶
left(str, len) - Returns the leftmost len(len can be string type) characters from the string str,if len is less or equal than 0 the result is an empty string.
Examples:
> SELECT left('Spark SQL', 3);
Spa
> SELECT left(encode('Spark SQL', 'utf-8'), 3);
Spa
Since: 2.3.0
len¶
len(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
> SELECT len('Spark SQL ');
10
> SELECT len(x'537061726b2053514c');
9
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10
Since: 3.4.0
length¶
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
> SELECT length('Spark SQL ');
10
> SELECT length(x'537061726b2053514c');
9
> SELECT CHAR_LENGTH('Spark SQL ');
10
> SELECT CHARACTER_LENGTH('Spark SQL ');
10
Since: 1.5.0
levenshtein¶
levenshtein(str1, str2[, threshold]) - Returns the Levenshtein distance between the two given strings. If threshold is set and distance more than it, return -1.
Examples:
> SELECT levenshtein('kitten', 'sitting');
3
> SELECT levenshtein('kitten', 'sitting', 2);
-1
Since: 1.5.0
locate¶
locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos.
The given pos and return value are 1-based.
Examples:
> SELECT locate('bar', 'foobarbar');
4
> SELECT locate('bar', 'foobarbar', 5);
7
> SELECT POSITION('bar' IN 'foobarbar');
4
Since: 1.5.0
lower¶
lower(str) - Returns str with all characters changed to lowercase.
Examples:
> SELECT lower('SparkSql');
sparksql
Since: 1.0.1
lpad¶
lpad(str, len[, pad]) - Returns str, left-padded with pad to a length of len.
If str is longer than len, the return value is shortened to len characters or bytes.
If pad is not specified, str will be padded to the left with space characters if it is
a character string, and with zeros if it is a byte sequence.
Examples:
> SELECT lpad('hi', 5, '??');
???hi
> SELECT lpad('hi', 1, '??');
h
> SELECT lpad('hi', 5);
hi
> SELECT hex(lpad(unhex('aabb'), 5));
000000AABB
> SELECT hex(lpad(unhex('aabb'), 5, unhex('1122')));
112211AABB
Since: 1.5.0
ltrim¶
ltrim(str) - Removes the leading space characters from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT ltrim(' SparkSQL ');
SparkSQL
Since: 1.5.0
luhn_check¶
luhn_check(str ) - Checks that a string of digits is valid according to the Luhn algorithm. This checksum function is widely applied on credit card numbers and government identification numbers to distinguish valid numbers from mistyped, incorrect numbers.
Examples:
> SELECT luhn_check('8112189876');
true
> SELECT luhn_check('79927398713');
true
> SELECT luhn_check('79927398714');
false
Since: 3.5.0
make_valid_utf8¶
make_valid_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise returns a new string whose invalid UTF8 byte sequences are replaced using the UNICODE replacement character U+FFFD.
Arguments:
- str - a string expression
Examples:
> SELECT make_valid_utf8('Spark');
Spark
> SELECT make_valid_utf8(x'61');
a
> SELECT make_valid_utf8(x'80');
�
> SELECT make_valid_utf8(x'61C262');
a�b
Since: 4.0.0
mask¶
mask(input[, upperChar, lowerChar, digitChar, otherChar]) - masks the given string value. The function replaces characters with 'X' or 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed.
Arguments:
- input - string value to mask. Supported types: STRING, VARCHAR, CHAR
- upperChar - character to replace upper-case characters with. Specify NULL to retain original character. Default value: 'X'
- lowerChar - character to replace lower-case characters with. Specify NULL to retain original character. Default value: 'x'
- digitChar - character to replace digit characters with. Specify NULL to retain original character. Default value: 'n'
- otherChar - character to replace all other characters with. Specify NULL to retain original character. Default value: NULL
Examples:
> SELECT mask('abcd-EFGH-8765-4321');
xxxx-XXXX-nnnn-nnnn
> SELECT mask('abcd-EFGH-8765-4321', 'Q');
xxxx-QQQQ-nnnn-nnnn
> SELECT mask('AbCD123-@$#', 'Q', 'q');
QqQQnnn-@$#
> SELECT mask('AbCD123-@$#');
XxXXnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q');
QxQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q');
QqQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd');
QqQQddd-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd', 'o');
QqQQdddoooo
> SELECT mask('AbCD123-@$#', NULL, 'q', 'd', 'o');
AqCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, 'd', 'o');
AbCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, 'o');
AbCD123oooo
> SELECT mask(NULL, NULL, NULL, NULL, 'o');
NULL
> SELECT mask(NULL);
NULL
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, NULL);
AbCD123-@$#
Since: 3.4.0
octet_length¶
octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.
Examples:
> SELECT octet_length('Spark SQL');
9
> SELECT octet_length(x'537061726b2053514c');
9
Since: 2.3.0
overlay¶
overlay(input, replace, pos[, len]) - Replace input with replace that starts at pos and is of length len.
Examples:
> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
Spark_SQL
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
Spark CORE
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
Structured SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
Spark_SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
Spark CORE
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
Spark ANSI SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
Structured SQL
Since: 3.0.0
position¶
position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos.
The given pos and return value are 1-based.
Examples:
> SELECT position('bar', 'foobarbar');
4
> SELECT position('bar', 'foobarbar', 5);
7
> SELECT POSITION('bar' IN 'foobarbar');
4
Since: 2.3.0
printf¶
printf(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.
Examples:
> SELECT printf("Hello World %d %s", 100, "days");
Hello World 100 days
Since: 1.5.0
quote¶
quote(str) - Returns str enclosed by single quotes and each instance of single quote in it is preceded by a backslash.
Examples:
> SELECT quote('Don\'t');
'Don\'t'
Since: 4.1.0
randstr¶
randstr(length[, seed]) - Returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
Examples:
> SELECT randstr(3, 0) AS result;
ceV
Since: 4.0.0
regexp_count¶
regexp_count(str, regexp) - Returns a count of the number of times that the regular expression pattern regexp is matched in the string str.
Arguments:
- str - a string expression.
- regexp - a string representing a regular expression. The regex string should be a Java regular expression.
Examples:
> SELECT regexp_count('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
2
> SELECT regexp_count('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
8
Since: 3.4.0
regexp_extract¶
regexp_extract(str, regexp[, idx]) - Extract the first string in the str that match the regexp
expression and corresponding to the regex group index.
Arguments:
- str - a string expression.
- regexp - a string representing a regular expression. The regex string should be a
Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression forregexpcan be "^\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, theregexpthat can match "\abc" is "^\abc$".
It's recommended to use a raw string literal (with therprefix) to avoid escaping special characters in the pattern string if exists. - idx - an integer expression that representing the group index. The regex maybe contains
multiple groups.
idxindicates which regex group to extract. The group index should be non-negative. The minimum value ofidxis 0, which means matching the entire regular expression. Ifidxis not specified, the default group index value is 1. Theidxparameter is the Java regex Matcher group() method index.
Examples:
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
100
> SELECT regexp_extract('100-200', r'(\d+)-(\d+)', 1);
100
Since: 1.5.0
regexp_extract_all¶
regexp_extract_all(str, regexp[, idx]) - Extract all strings in the str that match the regexp
expression and corresponding to the regex group index.
Arguments:
- str - a string expression.
- regexp - a string representing a regular expression. The regex string should be a
Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression forregexpcan be "^\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, theregexpthat can match "\abc" is "^\abc$".
It's recommended to use a raw string literal (with therprefix) to avoid escaping special characters in the pattern string if exists. - idx - an integer expression that representing the group index. The regex may contains
multiple groups.
idxindicates which regex group to extract. The group index should be non-negative. The minimum value ofidxis 0, which means matching the entire regular expression. Ifidxis not specified, the default group index value is 1. Theidxparameter is the Java regex Matcher group() method index.
Examples:
> SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
["100","300"]
> SELECT regexp_extract_all('100-200, 300-400', r'(\d+)-(\d+)', 1);
["100","300"]
Since: 3.1.0
regexp_instr¶
regexp_instr(str, regexp) - Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0.
Arguments:
- str - a string expression.
- regexp - a string representing a regular expression. The regex string should be a
Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression forregexpcan be "^\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, theregexpthat can match "\abc" is "^\abc$".
It's recommended to use a raw string literal (with therprefix) to avoid escaping special characters in the pattern string if exists.
Examples:
> SELECT regexp_instr(r"\abc", r"^\\abc$");
1
> SELECT regexp_instr('user@spark.apache.org', '@[^.]*');
5
Since: 3.4.0
regexp_replace¶
regexp_replace(str, regexp, rep[, position]) - Replaces all substrings of str that match regexp with rep.
Arguments:
- str - a string expression to search for a regular expression pattern match.
- regexp - a string representing a regular expression. The regex string should be a
Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression forregexpcan be "^\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, theregexpthat can match "\abc" is "^\abc$".
It's recommended to use a raw string literal (with therprefix) to avoid escaping special characters in the pattern string if exists. - rep - a string expression to replace matched substrings.
- position - a positive integer literal that indicates the position within
strto begin searching. The default is 1. If position is greater than the number of characters instr, the result isstr.
Examples:
> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num
> SELECT regexp_replace('100-200', r'(\d+)', 'num');
num-num
Since: 1.5.0
regexp_substr¶
regexp_substr(str, regexp) - Returns the substring that matches the regular expression regexp within the string str. If the regular expression is not found, the result is null.
Arguments:
- str - a string expression.
- regexp - a string representing a regular expression. The regex string should be a Java regular expression.
Examples:
> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
Steven
> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Jeck');
NULL
Since: 3.4.0
repeat¶
repeat(str, n) - Returns the string which repeats the given string value n times.
Examples:
> SELECT repeat('123', 2);
123123
Since: 1.5.0
replace¶
replace(str, search[, replace]) - Replaces all occurrences of search with replace.
Arguments:
- str - a string expression
- search - a string expression. If
searchis not found instr,stris returned unchanged. - replace - a string expression. If
replaceis not specified or is an empty string, nothing replaces the string that is removed fromstr.
Examples:
> SELECT replace('ABCabc', 'abc', 'DEF');
ABCDEF
Since: 2.3.0
right¶
right(str, len) - Returns the rightmost len(len can be string type) characters from the string str,if len is less or equal than 0 the result is an empty string.
Examples:
> SELECT right('Spark SQL', 3);
SQL
Since: 2.3.0
rpad¶
rpad(str, len[, pad]) - Returns str, right-padded with pad to a length of len.
If str is longer than len, the return value is shortened to len characters.
If pad is not specified, str will be padded to the right with space characters if it is
a character string, and with zeros if it is a binary string.
Examples:
> SELECT rpad('hi', 5, '??');
hi???
> SELECT rpad('hi', 1, '??');
h
> SELECT rpad('hi', 5);
hi
> SELECT hex(rpad(unhex('aabb'), 5));
AABB000000
> SELECT hex(rpad(unhex('aabb'), 5, unhex('1122')));
AABB112211
Since: 1.5.0
rtrim¶
rtrim(str) - Removes the trailing space characters from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
Since: 1.5.0
sentences¶
sentences(str[, lang[, country]]) - Splits str into an array of array of words.
Arguments:
- str - A STRING expression to be parsed.
- lang - An optional STRING expression with a language code from ISO 639 Alpha-2 (e.g. 'DE'), Alpha-3, or a language subtag of up to 8 characters.
- country - An optional STRING expression with a country code from ISO 3166 alpha-2 country code or a UN M.49 numeric-3 area code.
Examples:
> SELECT sentences('Hi there! Good morning.');
[["Hi","there"],["Good","morning"]]
> SELECT sentences('Hi there! Good morning.', 'en');
[["Hi","there"],["Good","morning"]]
> SELECT sentences('Hi there! Good morning.', 'en', 'US');
[["Hi","there"],["Good","morning"]]
Since: 2.0.0
soundex¶
soundex(str) - Returns Soundex code of the string.
Examples:
> SELECT soundex('Miller');
M460
Since: 1.5.0
space¶
space(n) - Returns a string consisting of n spaces.
Examples:
> SELECT concat(space(2), '1');
1
Since: 1.5.0
split¶
split(str, regex, limit) - Splits str around occurrences that match regex and returns an array with a length of at most limit
Arguments:
- str - a string expression to split.
- regex - a string representing a regular expression. The regex string should be a Java regular expression.
- limit - an integer expression which controls the number of times the regex is applied.
- limit > 0: The resulting array's length will not be more than
limit, and the resulting array's last entry will contain all input beyond the last matched regex. - limit <= 0:
regexwill be applied as many times as possible, and the resulting array can be of any size.
- limit > 0: The resulting array's length will not be more than
Examples:
> SELECT split('oneAtwoBthreeC', '[ABC]');
["one","two","three",""]
> SELECT split('oneAtwoBthreeC', '[ABC]', -1);
["one","two","three",""]
> SELECT split('oneAtwoBthreeC', '[ABC]', 2);
["one","twoBthreeC"]
Since: 1.5.0
split_part¶
split_part(str, delimiter, partNum) - Splits str by delimiter and return
requested part of the split (1-based). If any input is null, returns null.
if partNum is out of range of split parts, returns empty string. If partNum is 0,
throws an error. If partNum is negative, the parts are counted backward from the
end of the string. If the delimiter is an empty string, the str is not split.
Examples:
> SELECT split_part('11.12.13', '.', 3);
13
Since: 3.3.0
startswith¶
startswith(left, right) - Returns a boolean. The value is True if left starts with right. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type.
Examples:
> SELECT startswith('Spark SQL', 'Spark');
true
> SELECT startswith('Spark SQL', 'SQL');
false
> SELECT startswith('Spark SQL', null);
NULL
> SELECT startswith(x'537061726b2053514c', x'537061726b');
true
> SELECT startswith(x'537061726b2053514c', x'53514c');
false
Since: 3.3.0
substr¶
substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
substr(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
Examples:
> SELECT substr('Spark SQL', 5);
k SQL
> SELECT substr('Spark SQL', -3);
SQL
> SELECT substr('Spark SQL', 5, 1);
k
> SELECT substr('Spark SQL' FROM 5);
k SQL
> SELECT substr('Spark SQL' FROM -3);
SQL
> SELECT substr('Spark SQL' FROM 5 FOR 1);
k
> SELECT substr(encode('Spark SQL', 'utf-8'), 5);
k SQL
Since: 1.5.0
substring¶
substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
substring(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
Examples:
> SELECT substring('Spark SQL', 5);
k SQL
> SELECT substring('Spark SQL', -3);
SQL
> SELECT substring('Spark SQL', 5, 1);
k
> SELECT substring('Spark SQL' FROM 5);
k SQL
> SELECT substring('Spark SQL' FROM -3);
SQL
> SELECT substring('Spark SQL' FROM 5 FOR 1);
k
> SELECT substring(encode('Spark SQL', 'utf-8'), 5);
k SQL
Since: 1.5.0
substring_index¶
substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim.
If count is positive, everything to the left of the final delimiter (counting from the
left) is returned. If count is negative, everything to the right of the final delimiter
(counting from the right) is returned. The function substring_index performs a case-sensitive match
when searching for delim.
Examples:
> SELECT substring_index('www.apache.org', '.', 2);
www.apache
Since: 1.5.0
to_binary¶
to_binary(str[, fmt]) - Converts the input str to a binary value based on the supplied fmt.
fmt can be a case-insensitive string literal of "hex", "utf-8", "utf8", or "base64".
By default, the binary format for conversion is "hex" if fmt is omitted.
The function returns NULL if at least one of the input parameters is NULL.
Examples:
> SELECT to_binary('abc', 'utf-8');
abc
Since: 3.3.0
to_char¶
to_char(expr, format) - Convert expr to a string based on the format.
Throws an exception if the conversion fails. The format can consist of the following
characters, case insensitive:
'0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format
string matches a sequence of digits in the input value, generating a result string of the
same length as the corresponding sequence in the format string. The result string is
left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of
the decimal value, starts with 0, and is before the decimal point. Otherwise, it is
padded with spaces.
'.' or 'D': Specifies the position of the decimal point (optional, only allowed once).
',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be
a 0 or 9 to the left and right of each grouping separator.
'$': Specifies the location of the $ currency sign. This character may only be specified
once.
'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at
the beginning or end of the format string). Note that 'S' prints '+' for positive values
but 'MI' prints a space.
'PR': Only allowed at the end of the format string; specifies that the result string will be
wrapped by angle brackets if the input value is negative.
('<1>').
If expr is a datetime, format shall be a valid datetime pattern, see Datetime Patterns.
If expr is a binary, it is converted to a string in one of the formats:
'base64': a base 64 string.
'hex': a string in the hexadecimal format.
'utf-8': the input binary is decoded to UTF-8 string.
Examples:
> SELECT to_char(454, '999');
454
> SELECT to_char(454.00, '000D00');
454.00
> SELECT to_char(12454, '99G999');
12,454
> SELECT to_char(78.12, '$99.99');
$78.12
> SELECT to_char(-12454.8, '99G999D9S');
12,454.8-
> SELECT to_char(date'2016-04-08', 'y');
2016
> SELECT to_char(x'537061726b2053514c', 'base64');
U3BhcmsgU1FM
> SELECT to_char(x'537061726b2053514c', 'hex');
537061726B2053514C
> SELECT to_char(encode('abc', 'utf-8'), 'utf-8');
abc
Since: 3.4.0
to_number¶
to_number(expr, fmt) - Convert string 'expr' to a number based on the string format 'fmt'. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input string. If the 0/9 sequence starts with 0 and is before the decimal point, it can only match a digit sequence of the same size. Otherwise, if the sequence starts with 9 or is after the decimal point, it can match a digit sequence that has the same or smaller size. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. 'expr' must match the grouping separator relevant for the size of the number. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' allows '-' but 'MI' does not. 'PR': Only allowed at the end of the format string; specifies that 'expr' indicates a negative number with wrapping angled brackets. ('<1>').
Examples:
> SELECT to_number('454', '999');
454
> SELECT to_number('454.00', '000.00');
454.00
> SELECT to_number('12,454', '99,999');
12454
> SELECT to_number('$78.12', '$99.99');
78.12
> SELECT to_number('12,454.8-', '99,999.9S');
-12454.8
Since: 3.3.0
to_varchar¶
to_varchar(expr, format) - Convert expr to a string based on the format.
Throws an exception if the conversion fails. The format can consist of the following
characters, case insensitive:
'0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format
string matches a sequence of digits in the input value, generating a result string of the
same length as the corresponding sequence in the format string. The result string is
left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of
the decimal value, starts with 0, and is before the decimal point. Otherwise, it is
padded with spaces.
'.' or 'D': Specifies the position of the decimal point (optional, only allowed once).
',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be
a 0 or 9 to the left and right of each grouping separator.
'$': Specifies the location of the $ currency sign. This character may only be specified
once.
'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at
the beginning or end of the format string). Note that 'S' prints '+' for positive values
but 'MI' prints a space.
'PR': Only allowed at the end of the format string; specifies that the result string will be
wrapped by angle brackets if the input value is negative.
('<1>').
If expr is a datetime, format shall be a valid datetime pattern, see Datetime Patterns.
If expr is a binary, it is converted to a string in one of the formats:
'base64': a base 64 string.
'hex': a string in the hexadecimal format.
'utf-8': the input binary is decoded to UTF-8 string.
Examples:
> SELECT to_varchar(454, '999');
454
> SELECT to_varchar(454.00, '000D00');
454.00
> SELECT to_varchar(12454, '99G999');
12,454
> SELECT to_varchar(78.12, '$99.99');
$78.12
> SELECT to_varchar(-12454.8, '99G999D9S');
12,454.8-
> SELECT to_varchar(date'2016-04-08', 'y');
2016
> SELECT to_varchar(x'537061726b2053514c', 'base64');
U3BhcmsgU1FM
> SELECT to_varchar(x'537061726b2053514c', 'hex');
537061726B2053514C
> SELECT to_varchar(encode('abc', 'utf-8'), 'utf-8');
abc
Since: 3.5.0
translate¶
translate(input, from, to) - Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string.
Examples:
> SELECT translate('AaBbCc', 'abc', '123');
A1B2C3
Since: 1.5.0
trim¶
trim(str) - Removes the leading and trailing space characters from str.
trim(BOTH FROM str) - Removes the leading and trailing space characters from str.
trim(LEADING FROM str) - Removes the leading space characters from str.
trim(TRAILING FROM str) - Removes the trailing space characters from str.
trim(trimStr FROM str) - Remove the leading and trailing trimStr characters from str.
trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str.
trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str.
trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str.
Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
- BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
- LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
- TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:
> SELECT trim(' SparkSQL ');
SparkSQL
> SELECT trim(BOTH FROM ' SparkSQL ');
SparkSQL
> SELECT trim(LEADING FROM ' SparkSQL ');
SparkSQL
> SELECT trim(TRAILING FROM ' SparkSQL ');
SparkSQL
> SELECT trim('SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
Since: 1.5.0
try_to_binary¶
try_to_binary(str[, fmt]) - This is a special version of to_binary that performs the same operation, but returns a NULL value instead of raising an error if the conversion cannot be performed.
Examples:
> SELECT try_to_binary('abc', 'utf-8');
abc
> select try_to_binary('a!', 'base64');
NULL
> select try_to_binary('abc', 'invalidFormat');
NULL
Since: 3.3.0
try_to_number¶
try_to_number(expr, fmt) - Convert string 'expr' to a number based on the string format fmt.
Returns NULL if the string 'expr' does not match the expected format. The format follows the
same semantics as the to_number function.
Examples:
> SELECT try_to_number('454', '999');
454
> SELECT try_to_number('454.00', '000.00');
454.00
> SELECT try_to_number('12,454', '99,999');
12454
> SELECT try_to_number('$78.12', '$99.99');
78.12
> SELECT try_to_number('12,454.8-', '99,999.9S');
-12454.8
Since: 3.3.0
try_validate_utf8¶
try_validate_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise returns NULL.
Arguments:
- str - a string expression
Examples:
> SELECT try_validate_utf8('Spark');
Spark
> SELECT try_validate_utf8(x'61');
a
> SELECT try_validate_utf8(x'80');
NULL
> SELECT try_validate_utf8(x'61C262');
NULL
Since: 4.0.0
ucase¶
ucase(str) - Returns str with all characters changed to uppercase.
Examples:
> SELECT ucase('SparkSql');
SPARKSQL
Since: 1.0.1
unbase64¶
unbase64(str) - Converts the argument from a base 64 string str to a binary.
Examples:
> SELECT unbase64('U3BhcmsgU1FM');
Spark SQL
Since: 1.5.0
upper¶
upper(str) - Returns str with all characters changed to uppercase.
Examples:
> SELECT upper('SparkSql');
SPARKSQL
Since: 1.0.1
validate_utf8¶
validate_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise throws an exception.
Arguments:
- str - a string expression
Examples:
> SELECT validate_utf8('Spark');
Spark
> SELECT validate_utf8(x'61');
a
Since: 4.0.0
||¶
expr1 || expr2 - Returns the concatenation of expr1 and expr2.
Examples:
> SELECT 'Spark' || 'SQL';
SparkSQL
> SELECT array(1, 2, 3) || array(4, 5) || array(6);
[1,2,3,4,5,6]
Note:
|| for arrays is available since 2.4.0.
Since: 2.3.0