Geospatial (Geometry/Geography) Types
Spark SQL supports GEOMETRY and GEOGRAPHY types for spatial data, as defined in the Open Geospatial Consortium (OGC) Simple Feature Access specification. At runtime, values are represented as Well-Known Binary (WKB) and are associated with a Spatial Reference Identifier (SRID) that defines the coordinate system. How values are persisted is determined by each data source.
Overview
| Type | Coordinate system | Typical use and notes |
|---|---|---|
| GEOMETRY | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Default SRID in Spark is 4326. |
| GEOGRAPHY | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always SPHERICAL. Default SRID is 4326 (WGS 84). |
When to use GEOMETRY vs GEOGRAPHY
Choose GEOMETRY when:
- Data is in local or projected coordinates (e.g. engineering/CAD in meters, or map tiles in Web Mercator).
- You need planar operations on a small or regional area: intersections, unions, clipping, containment, or overlays where treating the surface as flat is acceptable.
- Vertices are closely spaced or the extent is small enough that Earth curvature is negligible.
Choose GEOGRAPHY when:
- Data is global or spans large extents (e.g. country boundaries, worldwide points of interest).
- Distances or areas must respect Earth curvature (e.g. the shortest path between two cities, or the area of a polygon on the globe).
- Use cases include aviation, maritime, or global mobility where great-circle or geodesic behavior matters.
Using the wrong type can give misleading results: for example, the shortest path between London and New York on a sphere crosses Canada, whereas a straight line between the same two points in planar GEOMETRY coordinates may suggest a path that does not.
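The difference between planar and spherical reasoning can be illustrated outside Spark. The following is a minimal Python sketch, not Spark code: the helper names `great_circle_midpoint` and `planar_midpoint` and the approximate city coordinates are assumptions made for illustration. It compares the midpoint of the great-circle arc between London and New York with the midpoint obtained by averaging longitude and latitude as if they were flat coordinates.

```python
import math

def great_circle_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of the great-circle arc between two lon/lat points, in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    l1 = math.radians(lon1)
    dl = math.radians(lon2 - lon1)
    # Cartesian components of the second point relative to the first meridian
    bx = math.cos(p2) * math.cos(dl)
    by = math.cos(p2) * math.sin(dl)
    lat_m = math.atan2(math.sin(p1) + math.sin(p2),
                       math.hypot(math.cos(p1) + bx, by))
    lon_m = l1 + math.atan2(by, math.cos(p1) + bx)
    return math.degrees(lon_m), math.degrees(lat_m)

def planar_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint when longitude/latitude are treated as flat x/y coordinates."""
    return (lon1 + lon2) / 2, (lat1 + lat2) / 2

# Approximate coordinates as (longitude, latitude); illustrative values only
LONDON = (-0.1278, 51.5074)
NEW_YORK = (-74.0060, 40.7128)

gc_lon, gc_lat = great_circle_midpoint(*LONDON, *NEW_YORK)
pl_lon, pl_lat = planar_midpoint(*LONDON, *NEW_YORK)
print(f"great-circle midpoint: lon={gc_lon:.2f}, lat={gc_lat:.2f}")
print(f"planar midpoint:       lon={pl_lon:.2f}, lat={pl_lat:.2f}")
```

The great-circle midpoint lies noticeably farther north than the planar one, which is why the spherical path bends toward Canada while the planar path stays over the open Atlantic.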
Type Syntax in SQL
In SQL you must specify the type with an SRID or ANY:
- Fixed SRID (all values in the column share one SRID):
  - GEOMETRY(srid): e.g. GEOMETRY(4326), GEOMETRY(3857)
  - GEOGRAPHY(srid): e.g. GEOGRAPHY(4326)
- Mixed SRID (values in the column may have different SRIDs):
  - GEOMETRY(ANY)
  - GEOGRAPHY(ANY)
Unparameterized GEOMETRY or GEOGRAPHY (without (srid) or (ANY)) is not supported in SQL.
Creating Tables with Geometry or Geography Columns
-- Fixed SRID: all values must use the given SRID (e.g. WGS 84)
CREATE TABLE points (
  id BIGINT,
  pt GEOMETRY(4326)
);
CREATE TABLE locations (
  id BIGINT,
  loc GEOGRAPHY(4326)
);
-- Mixed SRID: each row can have a different SRID
CREATE TABLE mixed_geoms (
  id BIGINT,
  geom GEOMETRY(ANY)
);
Constructing Geometry and Geography Values
Values are created from Well-Known Binary (WKB) using built-in functions. WKB is a standard binary encoding for spatial shapes (points, lines, polygons, etc.). See Well-known binary for the format.
From WKB (binary):
- ST_GeomFromWKB(wkb): returns GEOMETRY with default SRID 0.
- ST_GeomFromWKB(wkb, srid): returns GEOMETRY with the given SRID.
- ST_GeogFromWKB(wkb): returns GEOGRAPHY with SRID 4326.
Example (point in WKB, then use in a table):
-- Point (1, 2) in WKB (little-endian point, 2D)
SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040');
SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326);
SELECT ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040');
INSERT INTO points (id, pt)
VALUES (1, ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326));
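To make the hex literal in these examples less opaque, here is a small Python sketch of how a 2D point is laid out in WKB: one byte-order byte, a 4-byte geometry type, then x and y as 8-byte floats. It uses only the standard `struct` module; the helper name `point_wkb` is hypothetical and is not part of Spark.

```python
import struct

def point_wkb(x, y):
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # "<" = little-endian, no padding
    # B   = byte order flag (1 = little-endian)
    # I   = geometry type (1 = Point)
    # dd  = x and y as IEEE 754 doubles
    return struct.pack("<BIdd", 1, 1, x, y)

wkb = point_wkb(1.0, 2.0)
hex_literal = wkb.hex().upper()
print(f"X'{hex_literal}'")  # X'0101000000000000000000F03F0000000000000040'

# Decoding the same bytes recovers the fields
byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
print(byte_order, geom_type, x, y)  # 1 1 1.0 2.0
```

The printed literal matches the one passed to ST_GeomFromWKB and ST_GeogFromWKB in the SQL examples, so the same approach can produce test inputs for other coordinates.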
WKB coordinate handling
When parsing WKB, Spark applies the following rules. Violations result in a parse error.
- Empty points: For Point geometries (including points inside MultiPoint), NaN (Not a Number) coordinate values are allowed and represent an empty point (e.g. POINT EMPTY in Well-Known Text). LineString and Polygon (and points inside them) do not allow NaN in coordinate values.
- Non-point coordinates: Coordinate values in LineString, Polygon rings, and points that are part of those structures must be finite (no NaN, no positive or negative infinity).
- Infinity: Positive or negative infinity is never accepted in any coordinate value.
- Polygon rings: Each ring must be closed (first and last point equal) and have at least 4 points. A LineString must have at least 2 points.
- GEOGRAPHY bounds: When WKB is parsed as GEOGRAPHY (e.g. via ST_GeogFromWKB), longitude must be in [-180, 180] (inclusive) and latitude in [-90, 90] (inclusive). GEOMETRY does not enforce these bounds.
- Invalid WKB: Null or empty input, truncated bytes, invalid geometry class or byte order, or other malformed WKB.
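The rules above can be sketched as a small pre-validation step in Python. This is an illustration of the listed rules only, not Spark's parser: the helpers `check_point_wkb` and `check_ring` are hypothetical, only 2D points and rings are covered, and treating a point with any NaN coordinate as empty is an assumption (the rules do not say how a point with a single NaN coordinate is handled).

```python
import math
import struct

def check_point_wkb(wkb, geography=False):
    """Return (x, y) for a 2D WKB Point, None for an empty point,
    or raise ValueError. Hypothetical helper, not a Spark API."""
    if not wkb or len(wkb) != 21:
        raise ValueError("null, empty, or wrong-length point WKB")
    if wkb[0] not in (0, 1):
        raise ValueError("invalid byte order")
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack(endian + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError("not a 2D Point")
    x, y = struct.unpack(endian + "dd", wkb[5:21])
    # Infinity is never accepted in any coordinate value
    if math.isinf(x) or math.isinf(y):
        raise ValueError("infinite coordinate")
    # NaN is allowed for points and denotes an empty point (assumption: any NaN)
    if math.isnan(x) or math.isnan(y):
        return None
    # Longitude/latitude bounds apply only when parsing as GEOGRAPHY
    if geography and not (-180 <= x <= 180 and -90 <= y <= 90):
        raise ValueError("longitude/latitude out of bounds")
    return (x, y)

def check_ring(points):
    """Validate a polygon ring given as a list of (x, y) tuples."""
    if len(points) < 4:
        raise ValueError("ring needs at least 4 points")
    if points[0] != points[-1]:
        raise ValueError("ring is not closed")
    for x, y in points:
        if not (math.isfinite(x) and math.isfinite(y)):
            raise ValueError("ring coordinates must be finite")
    return True

# A point at x=200 is acceptable as GEOMETRY but not as GEOGRAPHY
far_point = struct.pack("<BIdd", 1, 1, 200.0, 2.0)
print(check_point_wkb(far_point))  # (200.0, 2.0)
try:
    check_point_wkb(far_point, geography=True)
except ValueError as err:
    print("rejected:", err)
```

Running a check like this before calling ST_GeomFromWKB or ST_GeogFromWKB can surface bad inputs earlier, but Spark's own parse errors remain the authoritative result.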
Built-in Geospatial (ST) Functions
Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. They are grouped under st_funcs in the Built-in Functions API.
| Function | Description |
|---|---|
| ST_AsBinary(geo) | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). |
| ST_GeomFromWKB(wkb) | Parses WKB and returns a GEOMETRY with default SRID 0. |
| ST_GeomFromWKB(wkb, srid) | Parses WKB and returns a GEOMETRY with the given SRID. |
| ST_GeogFromWKB(wkb) | Parses WKB and returns a GEOGRAPHY with SRID 4326. |
| ST_Srid(geo) | Returns the SRID of the GEOMETRY or GEOGRAPHY value (NULL if input is NULL). |
| ST_SetSrid(geo, srid) | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. |
Examples:
SELECT hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')));
-- 0101000000000000000000F03F0000000000000040
SELECT ST_Srid(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'));
-- 4326
SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'), 3857));
-- 3857
SRID and Stored Values
- Fixed-SRID columns: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error; to avoid this, use ST_SetSrid to set the value’s SRID to match the column before inserting.
- Mixed-SRID columns (GEOMETRY(ANY) or GEOGRAPHY(ANY)): Values can have different SRIDs. Only valid SRIDs are allowed.
- Storage: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required.
Data Types Reference
For the full list of supported data types and API usage in Scala, Java, Python, and SQL, see Data Types.