Kudu provides C++, Java and Python client APIs, as well as reference examples to illustrate their use.
Use of server-side or private interfaces is not supported, and interfaces which are not part of public APIs have no stability guarantees.
You can view the C++ client API documentation
online. Alternatively, after building Kudu from source,
you can build the doxygen
target (e.g., run make doxygen
if using make) and use the locally generated API documentation by opening
the docs/doxygen/client_api/html/index.html
file in your favorite Web browser.
In order to build the doxygen target, it's necessary to have
doxygen with Dot (graphviz) support installed on your build machine. If
you installed doxygen after building Kudu from source, you will need to run
cmake again to pick up the doxygen location and generate the appropriate
targets.
You can view the Java API documentation online. Alternatively,
after building the Java client, Java API documentation is available
in java/kudu-client/target/apidocs/index.html.
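To give a feel for the Java API, below is a minimal sketch of a client application that connects to a Kudu cluster, creates a table, and inserts a row. The master address kudu.master:7051 and the table name java_sketch are placeholders for this illustration, not values from the examples repository; adapt them to your deployment.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class JavaSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      // Define a simple schema: an int64 primary key and a string value.
      List<ColumnSchema> columns = new ArrayList<>();
      columns.add(new ColumnSchema.ColumnSchemaBuilder("key", Type.INT64)
          .key(true).build());
      columns.add(new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING)
          .nullable(true).build());
      Schema schema = new Schema(columns);

      // Hash-partition on the key into 3 buckets; use a single replica
      // so the sketch also works against a one-node test cluster.
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Collections.singletonList("key"), 3)
          .setNumReplicas(1);
      client.createTable("java_sketch", schema, options);

      // Insert one row; the default session mode applies it synchronously.
      KuduTable table = client.openTable("java_sketch");
      KuduSession session = client.newSession();
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addLong("key", 1L);
      row.addString("value", "hello");
      session.apply(insert);
      session.close();
    } finally {
      client.shutdown();
    }
  }
}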
Several example applications are provided in the
kudu-examples GitHub
repository. Each example includes a README
that shows how to compile and run
it. These examples illustrate correct usage of the Kudu APIs, as well as how to
set up a virtual machine to run Kudu. The following list includes some of the
examples that are available today. Check the repository itself in case this list goes
out of date.
java/java-example
A simple Java application which connects to a Kudu instance, creates a table, writes data to it, then drops the table.
java/collectl
A small Java application which listens on a TCP socket for time series data corresponding to the Collectl wire protocol. The commonly-available collectl tool can be used to send example data to the server.
java/insert-loadgen
A Java application that generates random insert load.
python/dstat-kudu
An example program that shows how to use the Kudu Python API to load data
generated by an external program, dstat in this case, into a new or existing
Kudu table.
python/graphite-kudu
An experimental plugin for using graphite-web with Kudu as a backend.
demo-vm-setup
Scripts to download and run a VirtualBox virtual machine with Kudu already installed. See Quickstart for more information.
These examples should serve as helpful starting points for your own Kudu applications and integrations.
The following Maven <dependency>
element is valid for the Apache Kudu public release
(since 1.0.0):
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.1.0</version>
</dependency>
Convenience binary artifacts for the Java client and various Java integrations (e.g. Spark, Flume) are also available via the ASF Maven repository and the Maven Central repository.
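For example, a Maven-built Spark application could declare the Spark integration with a <dependency> element like the following; these coordinates simply mirror the --packages examples shown below, and the artifact id must match your Spark and Scala versions.

<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.1.0</version>
</dependency>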
See Using Impala With Kudu for guidance on installing
and using Impala with Kudu, including several impala-shell
examples.
Kudu integrates with Spark through the Data Source API as of version 1.0.0. Include the kudu-spark dependency using the --packages option:
Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10
spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.1.0
Use the kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.1.0
then import kudu-spark and create a dataframe:
import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu._
// Read a table from Kudu
val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051","kudu.table" -> "kudu_table")).kudu
// Query using the Spark API...
df.select("id").filter("id" >= 5).show()
// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = sqlContext.sql("select id from kudu_table where id >= 5")
filteredDF.show()
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))
// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select($"id", ($"count" + 1).as("count"))
kuduContext.updateRows(alteredDF, "test_table")
// Data can also be inserted into the Kudu table using the data source, though the methods on KuduContext are preferred
// NB: The default is to upsert rows; to perform standard inserts instead, set operation = insert in the options map
// NB: Only mode Append is supported
df.write.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table")).mode("append").kudu
// Check for the existence of a Kudu table
kuduContext.tableExists("another_table")
// Delete a Kudu table
kuduContext.deleteTable("unwanted_table")
Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when registered as a temporary table.
Kudu tables with a column name containing upper case or non-ASCII characters may not be used with SparkSQL. Non-primary key columns may be renamed in Kudu to work around this issue, as shown in the sketch after this list.
NULL, NOT NULL, <>, OR, LIKE, and IN predicates are not pushed to
Kudu, and instead will be evaluated by the Spark task.
Kudu does not support all types supported by Spark SQL, such as Date,
Decimal, and complex types.
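As a workaround for the column name limitation above, a non-primary key column can be renamed through an alter table operation before the table is used with SparkSQL. Below is a minimal sketch using the Java client; the master address, the table name spark_table, and the column names are hypothetical.

import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class RenameColumnSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      // Rename a mixed-case, non-primary-key column to a SparkSQL-safe name.
      client.alterTable("spark_table",
          new AlterTableOptions().renameColumn("UpperCaseValue", "upper_case_value"));
    } finally {
      client.shutdown();
    }
  }
}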
The Kudu Python client provides a Python-friendly interface to the C++ client API. The sample below demonstrates the use of part of the Python client.
import kudu
from kudu.client import Partitioning
from datetime import datetime
# Connect to Kudu master server
client = kudu.connect(host='kudu.master', port=7051)
# Define a schema for a new table
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('ts_val', type_=kudu.unixtime_micros, nullable=False, compression='lz4')
schema = builder.build()
# Define partitioning schema
partitioning = Partitioning().add_hash_partitions(column_names=['key'], num_buckets=3)
# Create new table
client.create_table('python-example', schema, partitioning)
# Open a table
table = client.table('python-example')
# Create a new session so that we can apply write operations
session = client.new_session()
# Insert a row
op = table.new_insert({'key': 1, 'ts_val': datetime.utcnow()})
session.apply(op)
# Upsert a row
op = table.new_upsert({'key': 2, 'ts_val': "2016-01-01T00:00:00.000000"})
session.apply(op)
# Updating a row
op = table.new_update({'key': 1, 'ts_val': ("2017-01-01", "%Y-%m-%d")})
session.apply(op)
# Delete a row
op = table.new_delete({'key': 2})
session.apply(op)
# Flush write operations. If failures occur, capture and print them.
try:
    session.flush()
except kudu.KuduBadStatus as e:
    print(session.get_pending_errors())
# Create a scanner and add a predicate
scanner = table.scanner()
scanner.add_predicate(table['ts_val'] == datetime(2017, 1, 1))
# Open Scanner and read all tuples
# Note: This doesn't scale for large scans
result = scanner.open().read_all_tuples()
Kudu was designed to integrate with MapReduce, YARN, Spark, and other frameworks in the Hadoop ecosystem. See RowCounter.java and ImportCsv.java for examples on which you can model your own integrations. Stay tuned for more examples using YARN and Spark in the future.
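In the spirit of RowCounter.java, below is a minimal sketch that counts the rows of a Kudu table using only the plain Java client; the master address and the table name test_table are placeholders.

import java.util.Collections;

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;

public class SimpleRowCounter {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      KuduTable table = client.openTable("test_table");
      // Project no columns: counting rows does not require any cell data.
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Collections.<String>emptyList())
          .build();
      long count = 0;
      while (scanner.hasMoreRows()) {
        count += scanner.nextRows().getNumRows();
      }
      scanner.close();
      System.out.println("Rows in test_table: " + count);
    } finally {
      client.shutdown();
    }
  }
}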