pyspark.sql.functions.parse_url
- pyspark.sql.functions.parse_url(url, partToExtract, key=None)
URL function: Extracts the specified part from a URL. If a key is also provided and the part is QUERY, it returns the value of that query parameter instead of the whole query string.
New in version 3.5.0.
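- Parameters
url : Column or str
A column of strings, each representing a URL.
partToExtract : Column or str
A column of strings, each naming the part to extract from the URL (for example QUERY, PROTOCOL, HOST, or PATH, as in the examples below).
key : Column or str, optional
A column of strings, each naming the query parameter whose value should be returned.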
- Returns
Column
A new column of strings, each representing the value of the extracted part from the URL.
Examples
Example 1: Extracting the query part from a URL
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("https://spark.apache.org/path?query=1", "QUERY")],
...     ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|             query=1|
+--------------------+
Example 2: Extracting the value of a specific query parameter from a URL
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("https://spark.apache.org/path?query=1", "QUERY", "query")],
...     ["url", "part", "key"]
... )
>>> df.select(sf.parse_url(df.url, df.part, df.key)).show()
+-------------------------+
|parse_url(url, part, key)|
+-------------------------+
|                        1|
+-------------------------+
Example 3: Extracting the protocol part from a URL
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("https://spark.apache.org/path?query=1", "PROTOCOL")],
...     ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|               https|
+--------------------+
Example 4: Extracting the host part from a URL
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("https://spark.apache.org/path?query=1", "HOST")],
...     ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|    spark.apache.org|
+--------------------+
Example 5: Extracting the path part from a URL
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("https://spark.apache.org/path?query=1", "PATH")],
...     ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|               /path|
+--------------------+
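Cross-check without a Spark session (illustrative sketch)

The five results above can be sanity-checked locally with Python's standard library. The helper below, parse_url_sketch, is a hypothetical name and not part of PySpark. It only approximates parse_url for simple, well-formed URLs like the one used in the examples, and it is not how Spark implements the function. Edge cases such as percent-encoded values, repeated keys, or malformed URLs may behave differently in Spark.

```python
from urllib.parse import parse_qs, urlsplit


def parse_url_sketch(url, part, key=None):
    """Hypothetical pure-Python approximation of parse_url (not Spark's code)."""
    pieces = urlsplit(url)

    # A key only applies to the QUERY part: return that parameter's first value.
    if key is not None:
        if part != "QUERY":
            return None
        values = parse_qs(pieces.query).get(key)
        return values[0] if values else None

    # Map the part names used in the examples onto urlsplit's fields.
    extracted = {
        "PROTOCOL": pieces.scheme,
        "HOST": pieces.hostname,
        "PATH": pieces.path,
        "QUERY": pieces.query,
    }.get(part)

    # Treat an unknown part name or an empty component as "nothing to extract".
    return extracted or None
```

Calling parse_url_sketch("https://spark.apache.org/path?query=1", "HOST") gives "spark.apache.org", matching Example 4. Only the four part names shown in the examples are handled here.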