Building a Multi-Node Hadoop v2 Cluster with Ubuntu on Windows Azure

With HDInsight, the Windows Azure platform provides a powerful Platform-as-a-Service (PaaS) offering for quickly spinning up and managing Hadoop clusters on top of Windows VMs. These clusters are based on the HortonWorks Data Platform (HDP) distribution. Currently, the newest version of HDP in HDInsight is 1.3.0, which is deployed with the HDInsight version 2.1 (go here for the Microsoft versioning story). For sure, HortonWorks will eventually release a 2.x version for HDInsight on Windows, but if you prefer Ubuntu or you need a Hadoop v2 cluster now – what to do …?

Well, the good news is that Windows Azure is a very flexible platform and does not only provide platform services like HDInsight and many others, but also a powerful Infrastructure-as-a-Service (IaaS) model which allows you to deploy virtual machines based on Windows or Linux and manage them in virtual networks. So, Windows Azure IaaS allows you to build your own Apache Hadoop v2 cluster running in Ubuntu. In this post I will walk you through the steps to get there. We will build up a Hadoop 2.2.0 cluster consisting of a single master node and 2 slave nodes, using a custom-built VM image for all Hadoop nodes. Basically, the post will enable you to build arbitrary-sized Hadoop clusters on top of Windows Azure.

Continue reading