While the Apache Kudu project provides client bindings that allow users to mutate and fetch data, more complex access patterns are often expressed through SQL and compute engines. This is a non-exhaustive list of projects that integrate with Kudu to enhance ingest, querying capabilities, and orchestration.
The following integrations are among the most commonly used with Apache Kudu (sorted alphabetically).
Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. See the Drill Kudu API documentation for more details.
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. See the Hive Kudu integration documentation for more details.
Apache Impala is the open source, native analytic database for Apache Hadoop. See the Kudu Impala integration documentation for more details.
Spark SQL is a Spark module for structured data processing. See the Kudu Spark integration documentation for more details.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. See the Presto Kudu connector documentation for more details.
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends. See the Beam Kudu source and sink documentation for more details.
Apache Spark is a unified analytics engine for large-scale data processing. See the Kudu Spark integration documentation for more details.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Kudu Python scanners can be converted to Pandas DataFrames. See Kudu’s Python tests for example usage.
Talend simplifies and automates big data integration projects with on-demand serverless Spark and machine learning. See Talend’s Kudu component documentation for more details.
Akka facilitates building highly concurrent, distributed, and resilient message-driven applications on the JVM. See the Alpakka Kudu connector documentation for more details.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. See the Flink Kudu connector documentation for more details.
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. See the PutKudu processor documentation for more details.
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. See Kudu’s Spark Streaming tests for example usage.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. See the Kafka Kudu connector documentation for more details.
StreamSets Data Collector is a lightweight, powerful engine that streams data in real time. See the StreamSets Data Collector Kudu destination documentation for more details.
Striim is real-time data integration software that enables continuous data ingestion, in-flight stream processing, and delivery. See the Striim Kudu Writer documentation for more details.
TIBCO StreamBase® is an event processing platform for applying mathematical and relational processing to real-time data streams. See the StreamBase Kudu operator documentation for more details.
Informatica® PowerExchange® is a family of products that enables retrieval of data from a variety of sources without having to develop custom data-access programs. See the PowerExchange for Kudu documentation for more details.
Apache Camel is an open source integration framework that lets you quickly and easily integrate various systems that consume or produce data. See the Camel Kudu component documentation for more details.
Cloudera Manager is an end-to-end application for managing CDH clusters. See the Cloudera Manager documentation for Kudu for more details.
Docker facilitates packaging software into standardized units for development, shipment, and deployment. See the official Apache Kudu Docker Hub repository and the Apache Kudu Docker Quickstart for more details.
Wavefront is a high-performance streaming analytics platform that supports 3D observability. See the Wavefront Kudu integration documentation for more details.
Zoomdata provides a high-performance BI engine and visually engaging, interactive dashboards. See Zoomdata’s Kudu page for more details.
While Kudu is an Apache-licensed open source project, software vendors may package and license it with other components to facilitate consumption. These offerings are typically bundled with support for tuning and administration.