Hadoop Docker

Running from existing setups

There are special branches for running Hadoop in Docker.

The docker-hadoop-runner* branches contain scripts that set up base images that can be used for running any Hadoop version.

The docker-hadoop* branches can be used for running a specific version.

Running from the source code

There is a setup under hadoop-dist that contains Docker Compose definitions for running the current version of Hadoop in a multi-node docker environment.

This is meant for testing code changes locally and debugging.

The base image used by the Docker setup is built as part of the Maven lifecycle. The distribution files generated while building the project with the -Pdist profile enabled will be used for running Hadoop inside the containers.

To start the docker environment, do the following:
  • Build the project, using the -Pdist profile

> mvn clean install -Dmaven.javadoc.skip=true -DskipTests -DskipShade -Pdist,src
  • From the project root, change to the Docker Compose directory under the generated dist directory
> cd hadoop-dist/target/hadoop-<current-version>/compose/hadoop
  • Start the docker environment
> docker-compose up -d --scale datanode=3
  • Connect to a container to execute commands
> docker exec -it hadoop_datanode_1 bash
bash-4.2$ hdfs dfs -mkdir /test
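
The build and start steps above can be sketched as a small shell helper. This is only an illustration, not part of the Hadoop source tree: the function names run and start_hadoop_docker and the DRY_RUN variable are hypothetical, and the version argument stands in for the <current-version> part of the dist directory name. The mvn, cd and docker-compose commands are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the manual steps above in one function.
# run, start_hadoop_docker and DRY_RUN are hypothetical names for this example.

# Print a command, and execute it unless DRY_RUN is set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Usage: start_hadoop_docker VERSION [DATANODES]
# Must be called from the project root.
start_hadoop_docker() {
  local version="$1"         # the <current-version> part of the dist directory
  local datanodes="${2:-3}"  # number of datanode containers, 3 as in the steps above

  if [ -z "$version" ]; then
    echo "usage: start_hadoop_docker VERSION [DATANODES]" >&2
    return 1
  fi

  # Build the project with the dist profile enabled.
  run mvn clean install -Dmaven.javadoc.skip=true -DskipTests -DskipShade -Pdist,src || return 1
  # Change to the compose directory inside the generated distribution.
  run cd "hadoop-dist/target/hadoop-${version}/compose/hadoop" || return 1
  # Start the containers in the background with the requested number of datanodes.
  run docker-compose up -d --scale "datanode=${datanodes}"
}
```

After sourcing the file, call start_hadoop_docker from the project root with the version string of your build as the first argument. Setting DRY_RUN=1 beforehand only prints the commands, which is a cheap way to check the paths before starting a long build. Once the containers are up, connect with docker exec as shown above.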

Config files

To add or remove properties from the core-site.xml, hdfs-site.xml, etc. files used in the docker environment, simply edit the config file before starting the containers. The changes will be persisted in the docker environment.