# Operators and Functions Support Progress
Gluten is still under active development. Here is a list of supported operators and functions.
Since the same function may have different semantics in Presto and Spark, Velox implements functions under a Presto category by default; when a function's semantics differ in Spark, it is implemented under a separate Spark category. Gluten therefore first looks up a function in Velox's Spark category and, if it is not found there, falls back to the Presto category.
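A minimal sketch of this two-level lookup, using hypothetical registry names and a hypothetical `Fn` type rather than Velox's actual interfaces:

```scala
// Hypothetical sketch of the lookup order described above; the registries
// and Fn type are illustrative, not Velox's real API.
object FunctionResolution {
  type Fn = Seq[Any] => Any

  // Functions re-implemented with Spark semantics take priority...
  val sparkCategory: Map[String, Fn] =
    Map("size" -> ((args: Seq[Any]) => args.head.asInstanceOf[Seq[_]].length))

  // ...over the default Presto-semantics implementations.
  val prestoCategory: Map[String, Fn] =
    Map("abs" -> ((args: Seq[Any]) => math.abs(args.head.asInstanceOf[Double])))

  // Look up the Spark category first; fall back to Presto when absent.
  def resolve(name: String): Option[Fn] =
    sparkCategory.get(name).orElse(prestoCategory.get(name))
}
```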
The tables below use the following notations to describe the support status of operators and functions:
| Value | Description |
|---|---|
| S | Supported. Gluten or Velox supports it fully. |
| S* | Supported; the expression is foldable and will be converted to an alias by Spark's optimizer (see the example after this table). |
| PS | Partial support. Velox only partially supports it. |
| NS | Not supported. The Velox backend does not support it. |
| [Blank Cell] | Not applicable, or support needs to be confirmed. |
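To illustrate the `S*` notation: a foldable expression such as `pi()` is replaced by Spark's constant-folding optimization with a literal under an alias before the plan reaches Gluten. A minimal sketch, runnable in `spark-shell`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// pi() is foldable: the optimizer folds it into a literal, so the projected
// column becomes a literal under an alias rather than a function call.
val df = spark.sql("SELECT pi() AS x")
println(df.queryExecution.optimizedPlan)
// e.g. Project [3.141592653589793 AS x#0] ...
```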
Additional notations describe restrictions of a function's implementation:
| Value | Description |
|---|---|
| Mismatched | Some functions are implemented by Velox but have different semantics from Apache Spark; we mark them as “Mismatched”. |
| ANSI OFF | Gluten does not support ANSI mode. If it is enabled, Gluten falls back to vanilla Spark. |
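For example, the ANSI restriction can be observed through the standard Spark configuration. A sketch, assuming a Gluten-enabled session `spark`:

```scala
// spark.sql.ansi.enabled is a standard Spark config; since Gluten does not
// support ANSI mode, enabling it makes queries fall back to vanilla Spark.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST(id AS INT) FROM range(10)").explain() // vanilla Spark operators

// Disabling ANSI mode re-enables Velox offload for supported operators.
spark.conf.set("spark.sql.ansi.enabled", "false")
```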
## Operator Map
Gluten supports 30+ operators (scroll right to see all data types):
| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateExecBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec | Sort-merge join; Gluten replaces it with a shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams, or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that the input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation appends a column with a unique value per row to the input columns | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N | ||||||||||||||||||
| CollectLimitExec | Reduce to single partition and apply limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| CollectTailExec | Collect the last N elements of the DataFrame | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N | ||||||||||||||||||
| SortAggregateExec | The backend for sort based aggregations | HashAggregateExecBaseTransformer (Partially supported) | N | ||||||||||||||||||
| CoalesceExec | Reduce the partition numbers | CoalesceExecTransformer | N | ||||||||||||||||||
| GenerateExec | The backend for operations that generate more output rows than input rows like explode | GenerateExecTransformer | UnnestNode | ||||||||||||||||||
| RangeExec | The backend for range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| SampleExec | The backend for the sample operator | SampleExecTransformer | N | ||||||||||||||||||
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | ||||||||||||||||||
| InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y | ||||||||||||||||||
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF, this accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| ArrowEvalPythonExec | The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| MapInPandasExec | The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
| HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y | ||||||||||||||||||
| InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y | ||||||||||||||||||
| Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
| Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
| WindowGroupLimitExec | Optimizes a window with a rank-like function that has a filter on it | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
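A practical way to check whether an operator was offloaded is to inspect the physical plan: when offload succeeds, the Gluten names from the table above (e.g. `FilterExecTransformer`, `ProjectExecTransformer`) appear in the `explain()` output; on fallback you see the vanilla Spark names instead. A sketch, assuming a Gluten-enabled session `spark`:

```scala
// Build a query that exercises filter and project; with Gluten active, the
// physical plan should contain FilterExecTransformer / ProjectExecTransformer.
val df = spark.range(100).filter("id > 10").selectExpr("id + 1 AS v")
df.explain()
// Offloaded:  ProjectExecTransformer ... FilterExecTransformer ...
// Fallback:   Project ... Filter ... (vanilla Spark row-based operators)
```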
## Function Support Status
Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions, Window Functions, and Generator Functions. In Gluten, the function support status is automatically generated by a script and maintained in separate files.
When running the script, the `--spark_home` argument should be set to either:

- the directory containing the Spark source code for the latest Spark version supported by Gluten (the Spark project must be built from source), or
- a directory prepared by the `install-spark-resources.sh` script, which downloads the necessary resource files:

  ```bash
  # Define a directory to use for the Spark files and the latest Spark version
  export spark_dir=/tmp/spark
  export spark_version=3.5

  # Run the install-spark-resources.sh script
  .github/workflows/util/install-spark-resources.sh ${spark_version} ${spark_dir}
  ```

  After running `install-spark-resources.sh`, the `--spark_home` for the document generation script will be something like:

  ```
  --spark_home=${spark_dir}/shims/spark35/spark_home
  ```
Use the following command to generate and update the support status:
```bash
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```
Please check the links below for the detailed support status of each category:
- Scalar Functions Support Status
- Aggregate Functions Support Status