{ "cells": [ { "cell_type": "markdown", "id": "4fa81d13", "metadata": {}, "source": [ "# ANSI Migration Guide - Pandas API on Spark\n", "ANSI mode is now on by default for Pandas API on Spark. This guide helps you understand the key behavior differences you’ll see.\n", "In short, with ANSI mode on, Pandas API on Spark behavior matches native pandas in cases where Pandas API on Spark with ANSI off did not." ] }, { "cell_type": "markdown", "id": "6e1c7952", "metadata": {}, "source": [ "## Behavior Change\n", "### String Number Comparison\n", "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and `'1'` are considered equal.\n", "\n", "**ANSI on:** behaves like pandas, `1 == '1'` is False." ] }, { "cell_type": "markdown", "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee", "metadata": {}, "source": [ "Examples are as shown below:\n", "\n", "```python\n", ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n", ">>> psdf = ps.from_pandas(pdf)\n", "\n", "# ANSI on\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", ">>> psdf[\"int\"] == psdf[\"str\"]\n", "0 False\n", "1 False\n", "dtype: bool\n", "\n", "# ANSI off\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", ">>> psdf[\"int\"] == psdf[\"str\"]\n", "0 True\n", "1 True\n", "dtype: bool\n", "\n", "# Pandas\n", ">>> pdf[\"int\"] == pdf[\"str\"]\n", "0 False\n", "1 False\n", "dtype: bool\n", "```" ] }, { "cell_type": "markdown", "id": "90a4ea8d", "metadata": {}, "source": [ "### Strict Casting\n", "**ANSI off:** invalid casts (e.g., `'a' → int`) quietly became NULL.\n", "\n", "**ANSI on:** the same casts raise errors." ] }, { "cell_type": "markdown", "id": "b361febc-4435-4bd1-9ee1-4874413d770c", "metadata": {}, "source": [ "Examples are as shown below:\n", "\n", "```python\n", ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n", ">>> psdf = ps.from_pandas(pdf)\n", "\n", "# ANSI on\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", ">>> psdf[\"str\"].astype(int)\n", "Traceback (most recent call last):\n", "...\n", "pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] ...\n", "\n", "# ANSI off\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", ">>> psdf[\"str\"].astype(int)\n", "0 NaN\n", "Name: str, dtype: float64\n", "\n", "# Pandas\n", ">>> pdf[\"str\"].astype(int)\n", "Traceback (most recent call last):\n", "...\n", "ValueError: invalid literal for int() with base 10: 'a'\n", "```" ] }, { "cell_type": "markdown", "id": "e11583e2", "metadata": {}, "source": [ "### MultiIndex.to_series Return\n", "**ANSI off:** Each row is returned as an `ArrayType` value, e.g. `[1, red]`.\n", "\n", "**ANSI on:** Each row is returned as a `StructType` value, which appears as a tuple (e.g., `(1, red)`) if the Runtime SQL Configuration `spark.sql.execution.pandas.structHandlingMode` is set to `'row'`. Otherwise, the result may vary depending on whether Arrow is used. See more in the [Spark Runtime SQL Configuration docs](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)." 
] }, { "cell_type": "markdown", "id": "4671a895-ed40-4bc4-b1bc-fa9fbb86cc18", "metadata": {}, "source": [ "Examples are as shown below:\n", "\n", "```python\n", ">>> arrays = [[1, 2], [\"red\", \"blue\"]]\n", ">>> pidx = pd.MultiIndex.from_arrays(arrays, names=(\"number\", \"color\"))\n", ">>> psidx = ps.from_pandas(pidx)\n", "\n", "# ANSI on\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", ">>> spark.conf.set(\"spark.sql.execution.pandas.structHandlingMode\", \"row\")\n", ">>> psidx.to_series()\n", "number color\n", "1 red (1, red)\n", "2 blue (2, blue)\n", "dtype: object\n", "\n", "# ANSI off\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", ">>> psidx.to_series()\n", "number color\n", "1 red [1, red]\n", "2 blue [2, blue]\n", "dtype: object\n", "\n", "# Pandas\n", ">>> pidx.to_series()\n", "number color\n", "1 red (1, red)\n", "2 blue (2, blue)\n", "dtype: object\n", "```" ] }, { "cell_type": "markdown", "id": "a9ceb6cb-3bc4-4c23-b74b-84e60fd64e11", "metadata": {}, "source": [ "### Invalid Mixed-Type Operations\n", "**ANSI off:** Spark implicitly coerces so these operations succeed.\n", "\n", "**ANSI on:** Behaves like pandas, such operations are disallowed and raise errors.\n", "\n", "Operation types that show behavior changes under ANSI mode:\n", "\n", "- **Decimal–Float Arithmetic**: `/`, `//`, `*`, `%` \n", "- **Boolean vs. None**: `|`, `&`, `^`" ] }, { "cell_type": "markdown", "id": "2a8d5705-11ea-458c-8528-c7b1b7c88472", "metadata": {}, "source": [ "Example: Decimal–Float Arithmetic\n", "```python\n", ">>> import decimal\n", ">>> pser = pd.Series([decimal.Decimal(1), decimal.Decimal(2)])\n", ">>> psser = ps.from_pandas(pser)\n", "\n", "# ANSI on\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", ">>> psser * 0.1\n", "Traceback (most recent call last):\n", "...\n", "TypeError: Multiplication can not be applied to given types.\n", "\n", "# ANSI off\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", ">>> psser * 0.1\n", "0 0.1\n", "1 0.2\n", "dtype: float64\n", "\n", "# Pandas\n", ">>> pser * 0.1\n", "...\n", "TypeError: unsupported operand type(s) for *: 'decimal.Decimal' and 'float'\n", "```" ] }, { "cell_type": "markdown", "id": "0d2b8268-4b98-4239-95db-5269f9c658d2", "metadata": {}, "source": [ "Example: Boolean vs. None\n", "```python\n", "# ANSI on\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", ">>> ps.Series([True, False]) | None\n", "Traceback (most recent call last):\n", "...\n", "TypeError: OR can not be applied to given types.\n", "\n", "# ANSI off\n", ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", ">>> ps.Series([True, False]) | None\n", "0 False \n", "1 False\n", "dtype: bool\n", "\n", "# Pandas\n", ">>> pd.Series([True, False]) | None\n", "...\n", "TypeError: unsupported operand type(s) for |: 'bool' and 'NoneType'\n", "```" ] }, { "cell_type": "markdown", "id": "fe146afd", "metadata": {}, "source": [ "## Related Configurations\n", "\n", "### `spark.sql.ansi.enabled` (Spark config)\n", "- Native Spark setting that controls ANSI mode. \n", "- The most powerful config to control both SQL and pandas API behavior. \n", "- If set to **False**, Spark reverts to the old behavior, and the other configs are not effective.\n", "\n", "### `compute.ansi_mode_support` (Pandas API on Spark option)\n", "- Indicates whether ANSI mode is fully supported. \n", "- Effective only when ANSI is enabled. \n", "- If set to **False**, pandas API on Spark may hit unexpected results or errors. 
\n", "- Default is **True**.\n", "\n", "### `compute.fail_on_ansi_mode` (Pandas API on Spark option)\n", "- Controls whether pandas API on Spark fails immediately when ANSI mode is enabled. \n", "- Effective only when ANSI is enabled and `compute.ansi_mode_support` is **False**. \n", "- If set to **False**, forces pandas API on Spark to work with the old behavior even when ANSI is enabled." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 5 }