# Operators and Functions Support Progress
Gluten is still under active development. Here is a list of supported operators and functions.
Since the same function may have different semantics in Presto and Spark, Velox implements functions under a Presto category by default; when a function's semantics differ in Spark, it is implemented under a separate Spark category. Gluten therefore first looks up a function in Velox's Spark category and, if it is not found there, falls back to the Presto category.
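A minimal sketch of this two-level lookup, using hypothetical registry names and a hypothetical `Fn` type rather than Velox's actual interfaces:

```scala
// Hypothetical sketch of the lookup order described above; the registries
// and Fn type are illustrative, not Velox's real API.
object FunctionResolution {
  type Fn = Seq[Any] => Any

  // Functions re-implemented with Spark semantics take priority...
  val sparkCategory: Map[String, Fn] =
    Map("size" -> ((args: Seq[Any]) => args.head.asInstanceOf[Seq[_]].length))

  // ...over the default Presto-semantics implementations.
  val prestoCategory: Map[String, Fn] =
    Map("abs" -> ((args: Seq[Any]) => math.abs(args.head.asInstanceOf[Double])))

  // Look up the Spark category first; fall back to Presto when absent.
  def resolve(name: String): Option[Fn] =
    sparkCategory.get(name).orElse(prestoCategory.get(name))
}
```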
The tables below use the following notations to describe the support status of operators and functions:
| Value | Description |
|---|---|
| S | Supported. Gluten or Velox supports it fully. |
| S* | Supported; the expression is foldable and will be converted to an alias by Spark's optimizer (see the example after this table). |
| PS | Partial support. Velox only partially supports it. |
| NS | Not supported. The Velox backend does not support it. |
| [Blank Cell] | Not applicable, or support needs to be confirmed. |
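To illustrate the `S*` notation: a foldable expression such as `pi()` is replaced by Spark's constant-folding optimization with a literal under an alias before the plan reaches Gluten. A minimal sketch, runnable in `spark-shell`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// pi() is foldable: the optimizer folds it into a literal, so the projected
// column becomes a literal under an alias rather than a function call.
val df = spark.sql("SELECT pi() AS x")
println(df.queryExecution.optimizedPlan)
// e.g. Project [3.141592653589793 AS x#0] ...
```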
Additional notations describe restrictions of a function's implementation:
| Value | Description |
|---|---|
| Mismatched | Some functions are implemented by Velox but have different semantics from Apache Spark; we mark them as “Mismatched”. |
| ANSI OFF | Gluten does not support ANSI mode. If it is enabled, Gluten falls back to vanilla Spark. |
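For example, the ANSI restriction can be observed through the standard Spark configuration. A sketch, assuming a Gluten-enabled session `spark`:

```scala
// spark.sql.ansi.enabled is a standard Spark config; since Gluten does not
// support ANSI mode, enabling it makes queries fall back to vanilla Spark.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST(id AS INT) FROM range(10)").explain() // vanilla Spark operators

// Disabling ANSI mode re-enables Velox offload for supported operators.
spark.conf.set("spark.sql.ansi.enabled", "false")
```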
## Operator Map
Gluten supports 30+ operators (scroll right to see all data types):
| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateExecBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec | Sort-merge join; Gluten replaces it with a shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams, or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that the input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation appends a column with a unique value per row to the input columns | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N | ||||||||||||||||||
| CollectLimitExec | Reduce to single partition and apply limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| CollectTailExec | Collect the last N elements of the DataFrame | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N | ||||||||||||||||||
| SortAggregateExec | The backend for sort based aggregations | HashAggregateExecBaseTransformer (Partially supported) | N | ||||||||||||||||||
| CoalesceExec | Reduce the partition numbers | CoalesceExecTransformer | N | ||||||||||||||||||
| GenerateExec | The backend for operations that generate more output rows than input rows like explode | GenerateExecTransformer | UnnestNode | ||||||||||||||||||
| RangeExec | The backend for range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| SampleExec | The backend for the sample operator | SampleExecTransformer | N | ||||||||||||||||||
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | ||||||||||||||||||
| InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y | ||||||||||||||||||
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF, this accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| ArrowEvalPythonExec | The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| MapInPandasExec | The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y | ||||||||||||||||||
| InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y | ||||||||||||||||||
| Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
| Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
| WindowGroupLimitExec | Optimizes a window with a rank-like function that has a filter on it | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
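A practical way to check whether an operator was offloaded is to inspect the physical plan: when offload succeeds, the Gluten names from the table above (e.g. `FilterExecTransformer`, `ProjectExecTransformer`) appear in the `explain()` output; on fallback you see the vanilla Spark names instead. A sketch, assuming a Gluten-enabled session `spark`:

```scala
// Build a query that exercises filter and project; with Gluten active, the
// physical plan should contain FilterExecTransformer / ProjectExecTransformer.
val df = spark.range(100).filter("id > 10").selectExpr("id + 1 AS v")
df.explain()
// Offloaded:  ProjectExecTransformer ... FilterExecTransformer ...
// Fallback:   Project ... Filter ... (vanilla Spark row-based operators)
```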
## Function Support Status
Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions, Window Functions, and Generator Functions. In Gluten, the function support status is automatically generated by a script and maintained in separate files.
When running the script, the `--spark_home` argument should be set to either:

- the directory containing the Spark source code for the latest Spark version supported by Gluten (the Spark project must be built from source), or
- a directory prepared by the `install-spark-resources.sh` script, which downloads the necessary resource files:

  ```bash
  # Define a directory to use for the Spark files and the latest Spark version
  export spark_dir=/tmp/spark
  export spark_version=3.5

  # Run the install-spark-resources.sh script
  .github/workflows/util/install-spark-resources.sh ${spark_version} ${spark_dir}
  ```

  After running `install-spark-resources.sh`, the `--spark_home` for the document generation script will be something like:

  ```
  --spark_home=${spark_dir}/shims/spark35/spark_home
  ```
Use the following command to generate and update the support status:
```bash
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```
Please check the links below for the detailed support status of each category:
- Scalar Functions Support Status
- Aggregate Functions Support Status