Apache Kudu Schema Design

Kudu tables have a structured data model similar to tables in a traditional RDBMS. Schema design is critical for achieving the best performance and operational stability from Kudu. Every workload is unique, and there is no single schema design that is best for every table. This document outlines effective schema design philosophies for Kudu, paying particular attention to where they differ from approaches used for traditional RDBMS schemas.

At a high level, there are three concerns in Kudu schema design: column design, primary keys, and data distribution. Of these, only data distribution will be a new concept for those familiar with traditional relational databases. The next sections discuss altering the schema of an existing table, and known limitations with regard to schema design.

The Perfect Schema

The perfect schema would accomplish the following:

  • Data would be distributed in such a way that reads and writes are spread evenly across tablet servers. This is impacted by the partition schema.

  • Tablets would grow at an even, predictable rate and load across tablets would remain steady over time. This is most impacted by the partition schema.

  • Scans would read the minimum amount of data necessary to fulfill a query. This is impacted mostly by primary key design, but partition design also plays a role via partition pruning.

The perfect schema depends on the characteristics of your data, what you need to do with it, and the topology of your cluster. Schema design is the single most important thing within your control to maximize the performance of your Kudu cluster.

Primary Keys

Each Kudu table must declare a primary key comprised of one or more columns. Primary key columns must be non-nullable, and may not be a boolean or floating-point type. Every row in a table must have a unique set of values for its primary key columns. As with a traditional RDBMS, primary key selection is critical to ensuring performant database operations.

Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so the application must always provide the full primary key during insert or ingestion. In addition, Kudu does not allow the primary key values of a row to be updated.

Within a tablet, rows are stored sorted lexicographically by primary key. Advanced schema designs can take advantage of this ordering to achieve good distribution of data among tablets, while retaining consistent ordering in intra-tablet scans. See Data Distribution for more information.

Data Distribution

Kudu tables, unlike traditional relational tables, are partitioned into tablets and distributed across many tablet servers. A row always belongs to a single tablet (and its replicas). The method of assigning rows to tablets is specified in a configurable partition schema for each table, during table creation.

Choosing a data distribution strategy requires you to understand the data model and expected workload of a table. For write-heavy workloads, it is important to design the distribution such that writes are spread across tablets in order to avoid overloading a single tablet. For workloads involving many short scans, performance can be improved if all of the data for the scan is located in the same tablet. Understanding these fundamental trade-offs is central to designing an effective partition schema.

No Default Partitioning
=== Kudu does not provide a default partitioning strategy when creating tables. It is strongly recommended to ensure that new tables have at least as many tablets as tablet servers (but Kudu can support many tablets per tablet server). ===

Kudu provides two types of partition schema: range partitioning and hash bucketing. These schema types can be used together or independently. Kudu does not yet allow tablets to be split after creation, so you must design your partition schema ahead of time to ensure that a sufficient number of tablets are created.

Range Partitioning

With range partitioning, rows are distributed into tablets using a totally-ordered distribution key. Each tablet is assigned a contiguous segment of the table’s distribution keyspace. Tables may be range partitioned on any subset of the primary key columns.

During table creation, tablet boundaries are specified as a sequence of split rows. Consider the following table schema (using SQL syntax for clarity):

CREATE TABLE customers (last_name STRING NOT NULL,
                        first_name STRING NOT NULL,
                        order_count INT32)
PRIMARY KEY (last_name, first_name)
DISTRIBUTE BY RANGE (last_name, first_name);

Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), .., ("z", "")) (25 split rows total) will result in the creation of 26 tablets, with each tablet containing a range of customer surnames all beginning with a given letter. This is an effective partition schema for a workload where customers are inserted and updated uniformly by last name, and scans are typically performed over a range of surnames.
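Expressed in the same illustrative pseudo-SQL style as the example above (the SPLIT ROWS clause is a sketch; in practice split rows are supplied through the client API used to create the table), the distribution clause might look like this:

DISTRIBUTE BY RANGE (last_name, first_name)
-- 25 split rows yield 26 tablets, one per leading surname letter
SPLIT ROWS (("b", ""), ("c", ""), ("d", ""), .., ("z", ""));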

It may make sense to partition a table by range using only a subset of the primary key columns, or with a different ordering than the primary key. For instance, you can change the above example to specify that the range partition should only include the last_name column. In that case, Kudu would guarantee that all customers with the same last name would fall into the same tablet, regardless of the provided split rows.
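As a sketch of that variation (again illustrative pseudo-SQL), only the column list in the range clause changes:

PRIMARY KEY (last_name, first_name)
DISTRIBUTE BY RANGE (last_name);  -- all rows sharing a last_name fall into the same tablet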

Range Partition Management

Kudu 0.10 introduces the ability to specify bounded range partitions during table creation, and the ability to add and drop range partitions on the fly. This is a good strategy for data that is always increasing, such as timestamps, or for categorical data, such as geographic regions.

For example, during table creation, bounded range partitions can be added for the regions 'US-EAST', 'US-WEST', and 'EUROPE'. If you attempt to insert a row with a region that does not match an existing range partition, the insertion will fail. Later, when a new region is needed it can be efficiently added as part of an ALTER TABLE operation. This feature is particularly useful for timeseries data, since it allows new range partitions for the current period to be added as needed, and old partitions covering historical periods to be dropped if necessary.
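Expressed in the same illustrative pseudo-SQL (in practice, bounded range partitions are added and dropped through the client ALTER TABLE APIs; the sales table name and the 'ASIA' region are hypothetical examples):

-- At table creation: one bounded range partition per known region.
CREATE TABLE sales (region STRING NOT NULL, ...)
PRIMARY KEY (region, ...)
DISTRIBUTE BY RANGE (region)
  PARTITION VALUE = 'US-EAST',
  PARTITION VALUE = 'US-WEST',
  PARTITION VALUE = 'EUROPE';

-- Later, when a new region is needed:
ALTER TABLE sales ADD RANGE PARTITION VALUE = 'ASIA';

-- And when an old partition is no longer needed:
ALTER TABLE sales DROP RANGE PARTITION VALUE = 'EUROPE';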

Hash Bucketing

Hash bucketing distributes rows by hash value into one of many buckets. Each tablet is responsible for the rows falling into a single bucket. The number of buckets (and therefore tablets) is specified during table creation. Typically, all of the primary key columns are used as the columns to hash, but as with range partitioning, any subset of the primary key columns can be used.

Hash partitioning is an effective strategy to increase the amount of parallelism for workloads that would otherwise skew writes into a small number of tablets. Consider the following table schema.

CREATE TABLE metrics (
  host STRING NOT NULL,
  metric STRING NOT NULL,
  time TIMESTAMP NOT NULL,
  measurement DOUBLE,
  PRIMARY KEY (time, metric, host)
);

If you use range partitioning over the primary key columns, inserts will tend to only go to the tablet covering the current time, which limits the maximum write throughput to the throughput of a single tablet. If you use hash partitioning, you can guarantee a number of parallel writes equal to the number of buckets specified when defining the partition schema. The trade-off is that a scan over a single time range now must touch each of these tablets, instead of (possibly) a single tablet. Hash bucketing can be an effective tool for mitigating other types of write skew as well, such as monotonically increasing values.
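In the same illustrative pseudo-SQL, hash partitioning the metrics table over all of its primary key columns might look like the following (the bucket count of 8 is an arbitrary example value):

DISTRIBUTE BY HASH (time, metric, host) INTO 8 BUCKETS;
-- Inserts are spread across 8 tablets regardless of the current time,
-- so write throughput is no longer limited to a single tablet.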

As an advanced optimization, you can create a table with more than one hash bucket component, as long as the column sets included in each are disjoint, and all hashed columns are part of the primary key. The total number of tablets created will be the product of the hash bucket counts. For example, the above metrics table could be created with two hash bucket components, one over the time column with 4 buckets, and one over the metric and host columns with 8 buckets. The total number of tablets will be 32. The advantage of using two separate hash bucket components is that scans which specify equality constraints on the metric and host columns will be able to skip 7/8 of the total tablets, leaving a total of just 4 tablets to scan.
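Expressed in the same illustrative pseudo-SQL, the two hash bucket components described above would look like this:

DISTRIBUTE BY HASH (time) INTO 4 BUCKETS,
              HASH (metric, host) INTO 8 BUCKETS;
-- 4 x 8 = 32 tablets in total; an equality predicate on metric and host
-- prunes the scan to the 4 tablets of a single (metric, host) bucket.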

Hash Bucketing and Range Partitioning

Hash bucketing can be combined with range partitioning. Adding hash bucketing to a range partitioned table has the effect of parallelizing operations that would otherwise operate sequentially over the range. The total number of tablets is the number of hash buckets multiplied by the number of range partitions (the number of split rows plus one).
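Continuing the metrics example, a combined schema might hash over metric and host and range partition over time (illustrative pseudo-SQL; the bucket count and the placeholder split values t1, t2, t3 are example values only):

DISTRIBUTE BY HASH (metric, host) INTO 8 BUCKETS,
              RANGE (time)
              SPLIT ROWS ((t1), (t2), (t3));
-- 8 buckets x 4 ranges = 32 tablets in total.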

Schema Alterations

You can alter a table’s schema in the following ways:

  • Rename the table

  • Rename, add, or drop columns

  • Rename (but not drop) primary key columns

You cannot modify the partition schema after table creation.
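For example, supported alterations on the metrics table might be expressed as follows (illustrative pseudo-SQL; the machine_metrics name and unit column are hypothetical):

ALTER TABLE metrics RENAME TO machine_metrics;            -- rename the table
ALTER TABLE machine_metrics ADD COLUMN unit STRING;       -- add a nullable non-key column
ALTER TABLE machine_metrics RENAME COLUMN measurement TO value;  -- rename a column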

Column Design

A Kudu table consists of one or more columns, each with a predefined type. Columns that are not part of the primary key may optionally be nullable. Supported column types include:

  • boolean

  • 8-bit signed integer

  • 16-bit signed integer

  • 32-bit signed integer

  • 64-bit signed integer

  • timestamp

  • single-precision (32-bit) IEEE-754 floating-point number

  • double-precision (64-bit) IEEE-754 floating-point number

  • UTF-8 encoded string

  • binary

Kudu takes advantage of strongly-typed columns and a columnar on-disk storage format to provide efficient encoding and serialization. To make the most of these features, columns must be specified as the appropriate type, rather than simulating a 'schemaless' table using string or binary columns for data which may otherwise be structured. In addition to encoding, Kudu optionally allows compression to be specified on a per-column basis.

Column Encoding

Each column in a Kudu table can be created with an encoding, based on the type of the column. Columns use plain encoding by default.

Table 1. Encoding Types

  Column Type          Encoding
  -------------------  ------------------------------
  integer, timestamp   plain, bitshuffle, run length
  float                plain, bitshuffle
  bool                 plain, dictionary, run length
  string, binary       plain, prefix, dictionary

Plain Encoding

Data is stored in its natural format. For example, int32 values are stored as fixed-size 32-bit little-endian integers.

Bitshuffle Encoding

Data is rearranged to store the most significant bit of every value, followed by the second most significant bit of every value, and so on. Finally, the result is LZ4 compressed. Bitshuffle encoding is a good choice for columns that have many repeated values, or values that change by small amounts when sorted by primary key. The bitshuffle project has a good overview of performance and use cases.

Run Length Encoding

Runs (consecutive repeated values) are compressed in a column by storing only the value and the count. Run length encoding is effective for columns with many consecutive repeated values when sorted by primary key.

Dictionary Encoding

A dictionary of unique values is built, and each column value is encoded as its corresponding index in the dictionary. Dictionary encoding is effective for columns with low cardinality. If the column values of a given row set are unable to be compressed because the number of unique values is too high, Kudu will transparently fall back to plain encoding for that row set. This is evaluated during flush.

Prefix Encoding

Common prefixes are compressed in consecutive column values. Prefix encoding can be effective for values that share common prefixes, or the first column of the primary key, since rows are sorted by primary key within tablets.

Column Compression

Kudu allows per-column compression using LZ4, snappy, or zlib compression codecs. By default, columns are stored uncompressed. Consider using compression if reducing storage space is more important than raw scan performance.

Every data set will compress differently, but in general LZ4 has the least effect on performance, while zlib will compress to the smallest data sizes. Bitshuffle-encoded columns are inherently compressed using LZ4, so it is not typically beneficial to apply additional compression on top of this encoding.
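As a sketch in the same pseudo-SQL style as the earlier examples, per-column encoding and compression might be declared at table creation like this (the ENCODING and COMPRESSION attributes are illustrative; the exact syntax depends on the client API or SQL front end used to create the table):

CREATE TABLE metrics (
  host STRING NOT NULL ENCODING dictionary COMPRESSION lz4,
  metric STRING NOT NULL ENCODING dictionary,
  time TIMESTAMP NOT NULL ENCODING bitshuffle,   -- bitshuffle output is already LZ4 compressed
  measurement DOUBLE ENCODING bitshuffle,
  PRIMARY KEY (time, metric, host)
);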

Known Limitations

Kudu currently has some known limitations that may factor into schema design. When designing your schema, consider these limitations together, not in isolation. If your own testing produces different results, please share your test cases and findings.

Number of Columns

Kudu has not been thoroughly tested with more than 200 columns, and we recommend schemas with fewer than 50 columns per table.

Size of Rows

Kudu has not been thoroughly tested with rows larger than 10 KB. Most testing has been done with rows of approximately 1 KB.

Size of Cells

There is no hard limit imposed by Kudu, but large values (10s of kilobytes and above) are likely to perform poorly and may cause stability issues in current Kudu releases.

Immutable Primary Keys

Kudu does not allow you to update the primary key of a row after insertion.

Non-alterable Primary Key

Kudu does not allow you to alter the primary key columns after table creation.

Non-alterable Partition Schema

Kudu does not allow you to alter the partition schema after table creation.

Partition Pruning

When tables use hash buckets, the Java client does not yet use scan predicates to prune tablets for scans over these tables. In the future, specifying an equality predicate on all columns in the hash bucket component will limit the scan to only the tablets corresponding to the hash bucket.

Tablet Splitting

You currently cannot split or merge tablets after table creation. You must create the appropriate number of tablets in the partition schema at table creation. As a workaround, you can copy the contents of one table to another: either create the new table with a CREATE TABLE ... AS SELECT statement, or create an empty table with the desired partition schema and populate it with an INSERT ... SELECT statement.
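As a sketch of that workaround (illustrative pseudo-SQL; metrics_repartitioned is a hypothetical target table and the bucket count is an example value):

-- Create a new table with the desired partition schema and copy the data:
CREATE TABLE metrics_repartitioned
  PRIMARY KEY (time, metric, host)
  DISTRIBUTE BY HASH (time, metric, host) INTO 16 BUCKETS
  AS SELECT * FROM metrics;

-- Or create the empty table first, then populate it:
INSERT INTO metrics_repartitioned SELECT * FROM metrics;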