Iceberg Support in Velox Backend
Supported Spark version
All Spark versions are supported, although only Spark 3.4 is well tested. Gluten currently supports only reads.
Support Status
The following values indicate the Iceberg support status:
| Value | Description |
|---|---|
| Offload | Offloaded to the Velox backend |
| PartialOffload | Some operators are offloaded and some fall back |
| Fallback | Falls back to Spark for execution |
| Exception | Cannot fall back under some conditions; an exception is thrown |
| ResultMismatch | A hidden bug may cause mismatched results, especially in corner cases |
Adding catalogs
Fallback
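The examples in this document refer to a catalog named local. As a minimal sketch, assuming a Hadoop-type Iceberg catalog and a hypothetical warehouse path, the catalog settings could be assembled as shown here. The helper name icebergCatalogConf and the path are illustrative and are not part of Gluten or Iceberg; the property keys follow the Iceberg Spark catalog configuration.

```scala
// Hypothetical helper (not a Gluten or Iceberg API): builds the Spark
// properties that register an Iceberg catalog of type "hadoop".
def icebergCatalogConf(name: String, warehouse: String): Map[String, String] =
  Map(
    s"spark.sql.catalog.$name" -> "org.apache.iceberg.spark.SparkCatalog",
    s"spark.sql.catalog.$name.type" -> "hadoop",
    s"spark.sql.catalog.$name.warehouse" -> warehouse
  )

// The warehouse path is an assumption chosen for illustration.
val localCatalog: Map[String, String] =
  icebergCatalogConf("local", "/tmp/iceberg/warehouse")

// Print in spark-defaults.conf style; each pair can also be passed
// to spark-submit or spark-sql as --conf key=value.
localCatalog.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k $v") }
```

These settings are consumed by Spark and Iceberg themselves, which is consistent with the Fallback status above.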
Creating a table
Fallback
Writing
Fallback
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
PartialOffload
The write falls back while the read is offloaded.
INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1;
Reading
Read data
Offload/Fallback
| Table Type | No Delete | Position Delete | Equality Delete |
|---|---|---|---|
| unpartitioned | Offload | Offload | Fallback |
| partitioned | Fallback mostly | Fallback mostly | Fallback |
| metadata | Fallback | | |
Simple queries are offloaded.
SELECT count(1) as count, data
FROM local.db.table
GROUP BY data;
If rows are deleted by Spark in merge-on-read mode, position delete files are generated, and the query may still be offloaded.
If rows are deleted by Flink, equality delete files may be generated, and the query falls back in that case.
Currently only simple queries are offloaded. For partitioned tables, many operators fall back because of StaticInvoke expressions such as BucketFunction, which are not yet supported.
DataFrame reads are supported and can now reference tables by name using spark.table:
val df = spark.table("local.db.table")
df.count()
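To see which status a particular read ends up with, one option is to look at the physical plan. The sketch below assumes that offloaded operators appear in the plan with names containing Transformer; that naming convention is an assumption about Gluten and is not stated in this document. The helper offloadedOperators is hypothetical.

```scala
// Hypothetical helper: returns the plan lines that look like natively
// executed operators. The "Transformer" marker is an assumption about how
// Gluten names offloaded operators.
def offloadedOperators(planText: String, marker: String = "Transformer"): Seq[String] =
  planText.split("\n").map(_.trim).filter(_.contains(marker)).toSeq

// With a running session, the plan text could be obtained like this:
//   val df = spark.table("local.db.table")
//   val ops = offloadedOperators(df.queryExecution.executedPlan.toString)
//   if (ops.isEmpty) println("no offloaded operators found")
//   else ops.foreach(println)
```

An empty result would suggest Fallback, while a mix of matching and non-matching operators would suggest PartialOffload. This is a rough heuristic, not a definitive check.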
Read metadata
Fallback
SELECT data, _file FROM local.db.table;
DataType
Timestamptz in ORC format is not supported and throws an exception. The UUID and Fixed types fall back.
Format
PartialOffload
Parquet and ORC formats are supported. Avro format is not supported.
SQL
Only SELECT is supported.
Schema evolution
PartialOffload
Gluten matches columns in the Parquet file by name, so the scan falls back if a column is renamed or if an added column has the same name as a deleted column.
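The two fallback cases can be illustrated with a small conceptual model. Iceberg tracks columns by field ID rather than by name, so a name-based lookup can miss a renamed column or pick up a dropped one. The Column type and the two lookup functions below are hypothetical and exist only for this illustration; they are not Gluten or Iceberg classes.

```scala
// Conceptual model only; these types are not Gluten or Iceberg classes.
final case class Column(id: Int, name: String)

// Name-based lookup, as described for the native scan.
def findByName(fileColumns: Seq[Column], wanted: Column): Option[Column] =
  fileColumns.find(_.name == wanted.name)

// ID-based lookup, the way Iceberg tracks columns across schema changes.
def findById(fileColumns: Seq[Column], wanted: Column): Option[Column] =
  fileColumns.find(_.id == wanted.id)

// A data file written before any schema change.
val fileColumns = Seq(Column(1, "id"), Column(2, "data"))

// Case 1: column 2 is renamed from "data" to "payload".
val renamed = Column(2, "payload")
println(findByName(fileColumns, renamed)) // None: the name no longer exists in the file
println(findById(fileColumns, renamed))   // Some(Column(2,data)): same column, old name

// Case 2: "data" (id 2) is dropped, then a new column "data" (id 3) is added.
val readded = Column(3, "data")
println(findByName(fileColumns, readded)) // Some(Column(2,data)): the dropped column, which is wrong
println(findById(fileColumns, readded))   // None: the new column has no values in this file
```

In both cases the name-based result differs from the ID-based one, which is why falling back to Spark is the safe choice.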
Configuration
Catalogs
All catalog options are supported; they are not used by the native engine.
SQL Extensions
Fallback
The spark.sql.extensions option is supported; the SQL command CALL falls back.
Runtime configuration
Read options
| Spark option | Status |
|---|---|
| snapshot-id | Supported |
| as-of-timestamp | Supported |
| split-size | Supported |
| lookback | Supported |
| file-open-cost | Supported |
| vectorization-enabled | Not supported |
| batch-size | Not supported |
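Read options are passed on the DataFrame reader. The following is a minimal sketch, assuming a Spark session with the Iceberg runtime on the classpath and the local catalog used elsewhere in this document. The function names are hypothetical, and the snapshot ID and timestamp are supplied by the caller.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Time travel to a specific snapshot of the table.
def readSnapshot(spark: SparkSession, snapshotId: Long): DataFrame =
  spark.read
    .format("iceberg")
    .option("snapshot-id", snapshotId)
    .load("local.db.table")

// Time travel to a point in time, given in milliseconds since the epoch.
def readAsOf(spark: SparkSession, timestampMillis: Long): DataFrame =
  spark.read
    .format("iceberg")
    .option("as-of-timestamp", timestampMillis)
    .load("local.db.table")
```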