Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.
Kudu’s design sets it apart. Some of Kudu’s benefits include:
Fast processing of OLAP workloads.
Integration with MapReduce, Spark, and other Hadoop ecosystem components.
Tight integration with Cloudera Impala, making it a good, mutable alternative to using HDFS with Parquet. See Kudu Impala Integration.
Strong but flexible consistency model.
Strong performance for running sequential and random workloads simultaneously.
Efficient utilization of hardware resources.
High availability. Tablet Servers and Masters use the Raft Consensus Algorithm.
Given a replication factor of 2f+1, a tablet remains available as long as no more than f of the tablet servers serving it fail. For example, with 3 replicas (f = 1), one of the tablet servers serving the tablet can fail and the tablet is still available. High availability for masters is not supported during the public beta.
By combining all of these properties, Kudu targets support for families of applications that are difficult or impossible to implement on current-generation Hadoop storage technologies.
CREATE TABLE
Impala supports creating and dropping tables using Kudu as the persistence layer. The tables follow the same internal / external approach as other tables in Impala, allowing for flexible data ingestion and querying.
INSERT
Data can be inserted into Kudu tables in Impala using the same mechanisms as any other table with HDFS or HBase persistence.
UPDATE / DELETE
Impala supports the UPDATE and DELETE SQL commands to modify existing data in a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen to be as compatible as possible with existing solutions. In addition to simple DELETE or UPDATE commands, you can specify complex joins in the FROM clause of the query using the same syntax as a regular SELECT statement.
Similar to partitioning of tables in Hive, Kudu allows you to dynamically pre-split tables by hash or range into a predefined number of tablets, in order to distribute writes and queries evenly across your cluster. You can partition by any number of primary key columns, by any number of hashes and an optional list of split rows. See Schema Design.
To achieve the highest possible performance on modern hardware, the Kudu client within Impala parallelizes scans to multiple tablets.
Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates are evaluated as close as possible to the data. Query performance is comparable to Parquet in many workloads.
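The sketch below illustrates how these operations might look in Impala_Kudu SQL during the beta. The table, columns, master address, and the Kudu-specific table properties shown (storage_handler, kudu.table_name, kudu.master_addresses, kudu.key_columns) are illustrative assumptions; consult the Kudu Impala Integration documentation for the exact syntax supported by your release.

    -- Create a Kudu-backed table, hash-partitioned on the primary key.
    -- All names and property values below are examples, not requirements.
    CREATE TABLE metrics (
      host STRING,
      ts BIGINT,
      value DOUBLE
    )
    DISTRIBUTE BY HASH (host, ts) INTO 16 BUCKETS
    TBLPROPERTIES(
      'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name' = 'metrics',
      'kudu.master_addresses' = 'kudu-master.example.com:7051',
      'kudu.key_columns' = 'host, ts'
    );

    -- Insert rows with the same syntax used for HDFS- or HBase-backed tables.
    INSERT INTO metrics VALUES ('host1.example.com', 12345, 1.5);

    -- Modify or remove existing rows in place.
    UPDATE metrics SET value = 2.0 WHERE host = 'host1.example.com' AND ts = 12345;
    DELETE FROM metrics WHERE ts < 12000;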
This release of Kudu is a public beta. Do not run this beta release on production clusters. During the public beta period, Kudu will be supported via a public JIRA and a public mailing list, which will be monitored by the Kudu development team and community members. Commercial support is not available at this time.
You can submit any issues or feedback related to your Kudu experience via either the JIRA system or the mailing list. The Kudu development team and community members will respond and assist as quickly as possible.
The Kudu team will work with early adopters to fix bugs and release new binary drops when fixes or features are ready. However, we cannot commit to issue resolution or bug fix delivery times during the public beta period, and it is possible that some fixes or enhancements will not be selected for a release.
We can’t guarantee time frames or contents for future beta code drops. However, they will be announced to the user group when they occur.
No guarantees are made regarding upgrades from this release to follow-on releases. While multiple drops of beta code are planned, we can’t guarantee their schedules or contents.
A Quickstart VM is provided to get you up and running quickly.
You can install Kudu using provided deb/yum packages.
In clusters managed by Cloudera Manager, you can install Kudu using parcels or deb/yum packages.
You can build Kudu from source.
For full installation details, see Kudu Installation.
RHEL 6.4 or newer, CentOS 6.4 or newer, and Ubuntu Trusty are the only operating systems supported for installation in the public beta. Others may work but have not been tested.
Kudu has been tested with up to 4 TB of data per tablet server. More testing is needed for denser storage configurations.
Testing with more than 20 columns has been limited.
Kudu is primarily designed for analytic use cases and, in the beta release, you are likely to encounter issues if a single row contains multiple kilobytes of data.
The columns which make up the primary key must be listed first in the schema.
Key columns cannot be altered. You must drop and recreate a table to change its keys.
Key columns must not be null.
Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.
Type and nullability of existing columns cannot be changed by altering the table.
A table’s primary key cannot be changed.
Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.
Ingest via Sqoop or Flume is not supported in the public beta. The recommended approach for bulk ingest is to use Impala's CREATE TABLE AS SELECT functionality or the Kudu Java or C++ APIs (see the sketch following this list).
Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. See Schema Design.
Tablets cannot currently be merged. Instead, create a new table with the contents of the old tables to be merged.
Replication and failover of Kudu masters is considered experimental. It is recommended to run a single master and periodically perform a manual backup of its data directories.
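As an illustration of the CREATE TABLE AS SELECT approach to bulk ingest mentioned above, the following sketch copies an existing HDFS-backed Impala table into a new Kudu table. The source table, column names, and Kudu table properties are hypothetical, and the exact property names may differ between releases; see the Kudu Impala Integration documentation.

    -- Bulk-load rows from an existing (hypothetical) Parquet table into Kudu.
    CREATE TABLE logs_kudu
    TBLPROPERTIES(
      'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name' = 'logs_kudu',
      'kudu.master_addresses' = 'kudu-master.example.com:7051',
      'kudu.key_columns' = 'id'
    )
    AS SELECT id, message FROM logs_parquet;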
To use Kudu with Impala, you must install a special release of Impala called Impala_Kudu. Obtaining and installing a compatible Impala release is detailed in Kudu’s Impala Integration documentation.
To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
Updates, inserts, and deletes via Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.
All queries will be distributed across all Impala nodes which host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made via Impala.
Timestamp and decimal types are not supported.
The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host or use large tables.
Impala is only able to push down predicates involving =, <=, >=, or BETWEEN comparisons between any column and a literal value, and < and > for integer columns only. For example, for a table with an integer key ts and a string key name, the predicate WHERE ts >= 12345 will convert into an efficient range scan, whereas WHERE name > 'lipcon' will currently fetch all data from the table and evaluate the predicate within Impala.
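As a concrete sketch of the behavior described above (the table and column names are hypothetical):

    -- Pushed down to Kudu as an efficient range scan on the integer key ts.
    SELECT * FROM metrics WHERE ts >= 12345;

    -- Not pushed down for non-integer columns in the beta: Impala fetches all
    -- rows and evaluates the predicate itself.
    SELECT * FROM metrics WHERE name > 'lipcon';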
Authentication and authorization are not included in the public beta.
Data encryption is not included in the public beta.
Potentially-incompatible C++ and Java API changes may be required during the public beta.
ALTER TABLE is not yet fully supported via the client APIs. More ALTER TABLE operations will become available in future betas.
The Python API is experimental and not supported.
The Spark DataFrame implementation is not yet complete.
The following are known bugs and issues with the current beta release. They will be addressed in later beta releases.
Building Kudu from source using gcc 4.6 causes runtime and test failures. Be sure you are using a different version of gcc if you build Kudu from source.
If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.