Compile and build specific Hadoop source code branch using Azure VM
Sometimes you may want to test a Hadoop feature that is available in a specific branch that is not available as a binary release. For example, in my case, I want to try accessing Azure Data Lake Store (ADLS) via its WebHDFS endpoint. Access to ADLS requires OAuth2, support for which was added in Hadoop 2.8 (HDFS-8155) but is not available in the current Hadoop 2.7.x releases.
The Hadoop source code is available in the mirrored GitHub repo https://github.com/apache/hadoop. The code for version 2.8 is in the branch appropriately called "branch-2.8".
Deploy Azure VM with Ubuntu 14.04-LTS
As described in the Building instructions for Hadoop, "the easiest way to get an environment with all the appropriate tools is by means of the provided Docker config" (for Linux or Mac). Since my primary laptop is running Windows 10, I will deploy an Ubuntu 14.04 LTS virtual machine in my Azure subscription, use it to build the Hadoop 2.8 binary tar.gz file, download the resulting file, and delete the VM once I am done.
I am using the Standard_DS2 VM size, and the VM is created from the Canonical Ubuntu 14.04 LTS Azure gallery image https://portal.azure.com/#create/Canonical.UbuntuServer1404LTS-ARM
Install Docker on Ubuntu 14.04
After the VM is deployed, I SSH into it using its public IP and quickly install Docker following the instructions for Ubuntu 14.04 from https://docs.docker.com/engine/installation/linux/ubuntulinux/
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee --append /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get purge lxc-docker
apt-cache policy docker-engine
sudo apt-get install linux-image-extra-$(uname -r)
sudo apt-get install docker-engine
sudo service docker start
sudo docker run hello-world
By default, I cannot run "docker run hello-world" from my user account (azureuser) without sudo. When I try, I get the message "docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?" This happens because, by default, the Docker daemon's Unix socket is owned by root, and other users can access it only with sudo.
To enable azureuser to run docker without sudo, I follow the instructions from Docker: create a group called "docker", add my user to that group, log out, log back in, and try docker run again.
sudo groupadd docker
sudo usermod -aG docker `whoami`
logout
After logging back in, I now can run "docker run hello-world" without problems.
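The reason for logging out and back in is that group membership is read when a session starts. A minimal sketch for checking whether the current session already has the group could look like this (session_has_group is a hypothetical helper name, not something provided by Docker):

```shell
# Sketch: check whether the CURRENT login session carries a given group.
# session_has_group is a hypothetical helper, not part of Docker.
session_has_group() {
  # "id -nG" with no user argument lists the group names of the
  # current process, one per line after the tr.
  id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if session_has_group docker; then
  echo "This session is in the docker group; sudo should not be needed."
else
  echo "This session is not in the docker group yet; log out and back in."
fi
```

Note that "id -nG azureuser" (with a user name) reads the group database instead, so it can already list docker before the new login takes effect.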
Clone Hadoop 2.8 Branch
Since I want to compile specifically the branch called "branch-2.8", I use Git to clone only that specific branch to my home directory (/home/azureuser/hadoop-2.8) using this command:
git clone -b branch-2.8 --single-branch https://github.com/apache/hadoop.git hadoop-2.8
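To double-check that the working copy really is on the intended branch before starting a long build, something like the following sketch may help. The helper names current_branch and on_branch are made up for illustration and are not part of Git or Hadoop.

```shell
# Sketch: confirm which branch a clone has checked out.
# current_branch and on_branch are hypothetical helpers.
current_branch() {
  # symbolic-ref --short prints the branch name that HEAD points to
  git -C "$1" symbolic-ref --short HEAD
}

on_branch() {
  # Succeeds only if the repository in $1 is on the branch named $2
  [ "$(current_branch "$1")" = "$2" ]
}
```

With the clone location used above, "on_branch ~/hadoop-2.8 branch-2.8 && echo OK" should print OK.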
Start Docker Container with Hadoop Build Environment
Following instructions from https://github.com/apache/hadoop/blob/trunk/BUILDING.txt, I start the Hadoop build environment using the provided script:
cd hadoop-2.8/
./start-build-env.sh
This process will take some time (~5-10 min) since it installs all of the required build environment tools (JDK, Maven, etc.) in the container.
Building Hadoop within the Docker Container
After the creation process is finished, I see my Hadoop Dev docker container running.
I try to start the Maven binary distribution build without native code, without running the tests, and without documentation.
mvn package -Pdist -DskipTests -Dtar
Resolving Permissions Error
However, I get a permissions error regarding the /home/azureuser/.m2 directory (used by Maven).
To fix this problem, I exit the docker container, and set the ownership of the /home/azureuser/.m2 directory to azureuser:azureuser.
sudo chown azureuser:azureuser ~/.m2
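A plausible cause, stated here as an assumption, is that ~/.m2 did not exist before the first run, so Docker created it as root when mounting it into the container. Under that assumption, creating the directory as the regular user before the first ./start-build-env.sh run would avoid the error. A small sketch (m2_owned_by_me is a hypothetical helper; stat -c is the GNU coreutils form found on Ubuntu):

```shell
# Sketch: make sure the Maven cache directory exists and belongs to the
# current user. m2_owned_by_me is a hypothetical helper name.
m2_owned_by_me() {
  dir="${1:-$HOME/.m2}"
  # Create the directory as the current user if it is missing
  mkdir -p "$dir" || return 1
  # stat -c '%u' prints the numeric UID of the directory's owner
  [ "$(stat -c '%u' "$dir")" = "$(id -u)" ]
}

if m2_owned_by_me; then
  echo "~/.m2 is owned by the current user."
else
  echo "~/.m2 is not owned by the current user; fix it with sudo chown."
fi
```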
Restarting Container and Starting Maven Build
After the permission problem is resolved, I restart the docker container:
cd hadoop-2.8/
./start-build-env.sh
Once within the container, I again try to start the Maven build and package:
mvn package -Pdist -DskipTests -Dtar
This process will take some time to complete. For me, on the Standard_DS2 Azure VM, it took about 9 minutes.
Download Binary Distribution File
After the build process is complete, the resultant files are found in the hadoop-dist/target directory.
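Before copying the tarball off the VM, it can be useful to confirm that it exists and to record a checksum to compare against after the download. The sketch below assumes the clone location used earlier (~/hadoop-2.8) and uses a hypothetical helper name, dist_checksum:

```shell
# Sketch: locate the distribution tarball and print its SHA-256 checksum.
# dist_checksum is a hypothetical helper; the default path assumes the
# repository was cloned to ~/hadoop-2.8.
dist_checksum() {
  target_dir="${1:-$HOME/hadoop-2.8/hadoop-dist/target}"
  for f in "$target_dir"/hadoop-*.tar.gz; do
    # If the glob matched nothing, $f is the literal pattern and -f fails
    [ -f "$f" ] || continue
    sha256sum "$f"
    return 0
  done
  echo "no hadoop-*.tar.gz found in $target_dir" >&2
  return 1
}
```

Running dist_checksum on the VM prints the checksum followed by the file path. The same checksum computed on the downloaded copy should match.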
I download the hadoop-2.8.0-SNAPSHOT.tar.gz file (about 200 MB) from the Ubuntu Azure VM to my local machine (e.g. using WinSCP, MobaXterm SFTP, etc.).
I also store this file as a block blob in an Azure Storage container so that I can quickly download it from there without rebuilding (https://avdatarepo1.blob.core.windows.net:443/hadoop/hadoop-2.8.0-SNAPSHOT.tar.gz).
Once I have the binary distribution file ready, I can go ahead and delete my Azure VM.
Conclusion
It is very convenient and quick to use an Azure VM running Ubuntu 14.04 LTS and Docker to set up a temporary Hadoop build environment. Although in this case I specifically built the "branch-2.8" branch, the same process can be used to build other Hadoop branches (or trunk) from source.
I’m looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad