Follow these instructions to set up and run the Kudu VM, and get started with Kudu, Kudu_Impala, and CDH in minutes.
Install Oracle VirtualBox. The VM has been tested with VirtualBox version 4.3 on Ubuntu 14.04 and VirtualBox version 5 on OS X 10.9. VirtualBox is also available through most package managers, such as apt-get and brew.
After the installation, make sure that VBoxManage is in your PATH by using the which VBoxManage command.
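The same check can be scripted so that setup stops early when VirtualBox is missing. This is a minimal sketch; the require_cmd helper name is an assumption made for illustration and is not part of the bootstrap script:

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found in PATH" >&2
    return 1
  fi
}

# Warn before downloading anything if VBoxManage is unavailable.
require_cmd VBoxManage || echo "Install VirtualBox or fix your PATH first."
```

command -v is used here instead of which because it is a POSIX shell builtin and behaves the same across shells.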
To download and start the VM, execute the following command in a terminal window.
$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash
This command downloads a shell script which clones the kudu-examples Git repository and then downloads a VM image of about 1.2 GB into the current working directory.[1] You can examine the script before running it by removing the | bash component of the command above, which prints the script to the terminal instead of executing it. Once the setup is complete, you can verify that everything works by connecting to the guest via SSH:
$ ssh demo@quickstart.cloudera
The username and password for the demo account are both demo. In addition, the demo user has password-less sudo privileges so that you can install additional software or manage the guest OS. You can also access the kudu-examples repository as a shared folder in /home/demo/kudu-examples/ on the guest or from your VirtualBox shared folder location on the host. This is a quick way to make scripts or data visible to the guest.
You can quickly verify that Kudu and Impala are running by executing the following commands:
$ ps aux | grep kudu
$ ps aux | grep impalad
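Note that ps aux | grep usually lists the grep process itself, so a match does not always mean the daemon is up. As a hedged alternative, the sketch below uses pgrep -x, which matches on the exact process name and never reports itself. The is_running helper is hypothetical, and the daemon names kudu-master, kudu-tserver, and impalad are assumptions about how the processes are named inside the VM:

```shell
# Report whether a process with exactly this name is running.
# pgrep never matches itself, unlike `ps aux | grep`.
is_running() {
  if pgrep -x "$1" >/dev/null 2>&1; then
    echo "$1: running"
  else
    echo "$1: NOT running"
    return 1
  fi
}

# Daemon names below are assumptions about how the VM names its processes.
for daemon in kudu-master kudu-tserver impalad; do
  is_running "$daemon" || true
done
```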
If you have issues connecting to the VM or one of the processes is not running, make sure to consult the Troubleshooting section.
To practice some typical operations with Kudu and Impala, we’ll use the San Francisco MTA GPS dataset. This dataset contains raw location data transmitted periodically from sensors installed on the buses in the SF MTA’s fleet.
Download the sample data and load it into HDFS
First we’ll download the sample dataset, prepare it, and upload it into the HDFS cluster.
The SF MTA’s site is often a bit slow, so we’ve mirrored a sample CSV file from the dataset at http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz
The original dataset uses DOS-style line endings, so we’ll convert it to UNIX-style during the upload process using tr.
$ wget http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz
$ hdfs dfs -mkdir /sfmta
$ zcat sfmtaAVLRawData01012013.csv.gz | tr -d '\r' | hadoop fs -put - /sfmta/data.csv
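To see what the tr -d '\r' step in the pipeline does, the following sketch applies it to a two-line sample instead of the real dataset. The sample values are made up for illustration:

```shell
# Build a tiny CSV with DOS (CRLF) line endings; the values are made up.
sample=$(mktemp)
printf 'REV,SPEED\r\n1506,12.5\r\n' > "$sample"

# Count carriage returns before and after the conversion.
before=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
after=$(tr -d '\r' < "$sample" | tr -cd '\r' | wc -c | tr -d ' ')
echo "carriage returns before: $before, after: $after"

rm -f "$sample"
```

Stripping the carriage returns before the upload matters because a trailing \r would otherwise become part of the last field on every row.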
Create a new external Impala table to access the plain text data. To connect to Impala in the virtual machine, issue the following command:
$ ssh demo@quickstart.cloudera -t impala-shell
Now, you can execute the following commands:
CREATE EXTERNAL TABLE sfmta_raw (
  revision int,
  report_time string,
  vehicle_tag int,
  longitude float,
  latitude float,
  speed float,
  heading float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/sfmta/'
TBLPROPERTIES ('skip.header.line.count'='1');
To validate that the data was actually loaded, run the following query:
SELECT count(*) FROM sfmta_raw;
+----------+
| count(*) |
+----------+
| 859086   |
+----------+
Next we’ll create a Kudu table and load the data. Note that we convert the string report_time field into a Unix-style timestamp for more efficient storage.
CREATE TABLE sfmta (
  report_time BIGINT NOT NULL,
  vehicle_tag STRING NOT NULL,
  longitude FLOAT NOT NULL,
  latitude FLOAT NOT NULL,
  speed FLOAT NOT NULL,
  heading FLOAT NOT NULL,
  PRIMARY KEY (report_time, vehicle_tag)
)
DISTRIBUTE BY HASH(report_time) INTO 8 BUCKETS
STORED AS KUDU;
INSERT INTO sfmta SELECT
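  UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss'),
  -- sfmta_raw declares vehicle_tag as int; Impala does not cast int to string implicitly
  CAST(vehicle_tag AS STRING),
  longitude,
  latitude,
  speed,
  heading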
FROM sfmta_raw;
-- Modified 859086 row(s), 0 row error(s) in 8.55s
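To sanity-check the timestamp conversion outside Impala, the same mapping can be reproduced from the shell. This is a sketch under two assumptions: that GNU date is available (the -d flag behaves differently on BSD and OS X), and that the report_time strings are interpreted as UTC. The helper names to_epoch and from_epoch are hypothetical:

```shell
# Assumes GNU date; BSD/macOS date uses different flags.
# Convert an MM/dd/yyyy HH:mm:ss string to epoch seconds, treating it as UTC.
to_epoch() { date -u -d "$1" +%s; }

# Convert epoch seconds back to the dataset's string format, in UTC.
from_epoch() { date -u -d "@$1" '+%m/%d/%Y %H:%M:%S'; }

to_epoch '01/01/2013 06:39:02'   # prints 1357022342
from_epoch 1357022342            # prints 01/01/2013 06:39:02
```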
The created table uses a composite primary key. See Kudu Impala Integration for a more detailed introduction to the extended SQL syntax for Impala.
Now that the data is stored in Kudu, you can run queries against it. The following query finds the data point containing the highest recorded vehicle speed.
SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1;
+-------------+-------------+--------------------+-------------------+-------------------+---------+
| report_time | vehicle_tag | longitude          | latitude          | speed             | heading |
+-------------+-------------+--------------------+-------------------+-------------------+---------+
| 1357022342  | 5411        | -122.3968811035156 | 37.76665878295898 | 68.33300018310547 | 82      |
+-------------+-------------+--------------------+-------------------+-------------------+---------+
With a quick Google search we can see that this bus was traveling east on 16th Street at 68 MPH. At first glance, this seems unlikely to be true. Suppose we do some research, find that this bus’s sensor equipment was broken, and decide to remove its data. With Kudu this is very easy to correct using standard SQL:
DELETE FROM sfmta WHERE vehicle_tag = '5411';
-- Modified 1169 row(s), 0 row error(s) in 0.25s
The above example showed how to load, query, and mutate a static dataset with Impala and Kudu. The real power of Kudu, however, is the ability to ingest and mutate data in a streaming fashion.
As an exercise to learn the Kudu programmatic APIs, try implementing a program that uses the SFMTA XML data feed to ingest this same dataset in real time into the Kudu table.
Make sure the host has an SSH client installed.
Make sure the VM is running by executing the following command and checking for a VM called kudu-demo:
$ VBoxManage list runningvms
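This check can also be scripted. The sketch below relies on VBoxManage list runningvms printing one line per VM in the form "vm-name" {uuid}; the vm_listed helper is a hypothetical name introduced for illustration:

```shell
# Reads `VBoxManage list runningvms` output on stdin; succeeds if the VM is listed.
# Each line looks like: "vm-name" {uuid}
vm_listed() {
  grep -q "^\"$1\" "
}

if command -v VBoxManage >/dev/null 2>&1 && VBoxManage list runningvms | vm_listed kudu-demo; then
  echo "kudu-demo is running"
else
  echo "kudu-demo is not running (or VBoxManage is unavailable)"
fi
```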
Verify that the VM’s IP address is included in the host’s /etc/hosts file. You should see a line that includes an IP address followed by the hostname quickstart.cloudera. To check the running VM’s IP address, use the VBoxManage command below.
$ VBoxManage guestproperty get kudu-demo /VirtualBox/GuestInfo/Net/0/V4/IP
Value: 192.168.56.100
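To compare that value against the hosts file without reading it by eye, a small lookup such as the following may help. It is a sketch only; hosts_ip_for is a hypothetical helper, and it takes an optional file argument so it can be tried against a copy of the file:

```shell
# Hypothetical helper: print the IP mapped to a hostname in a hosts-format file.
# Usage: hosts_ip_for HOSTNAME [FILE]   (FILE defaults to /etc/hosts)
hosts_ip_for() {
  awk -v h="$1" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == h) { print $1; exit } }
  ' "${2:-/etc/hosts}"
}

ip=$(hosts_ip_for quickstart.cloudera)
if [ -n "$ip" ]; then
  echo "quickstart.cloudera -> $ip"
else
  echo "no /etc/hosts entry for quickstart.cloudera"
fi
```

The printed address should match the value reported by VBoxManage guestproperty.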
If you’ve used a Cloudera Quickstart VM before, your .ssh/known_hosts file may contain references to the previous VM’s SSH credentials. Remove any references to quickstart.cloudera from this file.
Running Kudu currently requires a CPU that supports SSE4.2 (Nehalem or later for Intel). To pass through SSE4.2 support into the guest VM, refer to the VirtualBox documentation.
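One way to confirm that the flag is visible, on a Linux host or inside the guest, is to look for sse4_2 in /proc/cpuinfo. This is a sketch that assumes a Linux system; has_sse42 is a hypothetical helper, and /proc/cpuinfo does not exist on OS X:

```shell
# Succeeds if the given cpuinfo-style file lists the sse4_2 flag.
# Defaults to /proc/cpuinfo, which exists on Linux only.
has_sse42() {
  grep -qw 'sse4_2' "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_sse42; then
  echo "SSE4.2 supported"
else
  echo "SSE4.2 not detected (or /proc/cpuinfo unavailable)"
fi
```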
[1] The script also adds an entry to the host’s /etc/hosts file with the name quickstart.cloudera and the guest’s IP address.