Hadoop on demand with Openstack, Cloudinit and Salt

This post describes the whole process of providing hadoop on demand through Openstack, Cloudinit and Saltstack. We will use Openstack to provision servers on demand, cloudconfig to install salt-master and salt-minion, and saltstack to deploy the hadoop namenode and datanodes, all fully automated. This lets us deploy a 5-100+ node Hadoop HDFS cluster within 5 minutes (clusters of more than 20 nodes still need to be tested).

Deploy salt-master with Cloudconfig

We will install salt-master and configure it to auto_accept the keys of the minions. Please note that this is a security risk: any machine that can reach the master will be accepted as a minion.

Create the following master-cloudconfig file:

#cloud-config
apt_mirror: http://mirror.bjtu.edu.cn/ubuntu/

apt_sources:
 - source: "ppa:saltstack/salt"

packages:
 - python-software-properties
 - salt-master
 - git

runcmd:
 - bash /tmp/bootstrap-master.sh

write_files:
 - content: |
        #!/bin/bash
        sed -i 's/^#auto_accept.*$/auto_accept: True/g' /etc/salt/master
        sudo service salt-master restart
   path: /tmp/bootstrap-master.sh
   permissions: "0755"

Now launch an openstack instance with: nova boot --image ubuntu_14.04 --user-data master-cloudconfig --flavor m1.medium --key-name sam salt_master. In our case the salt-master we launched has the IP 192.168.111.66. You can ssh to this machine via ssh <user>@192.168.111.66 (on Ubuntu cloud images the default user is normally ubuntu).
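The sed substitution in bootstrap-master.sh can be tried on a scratch file before booting an instance. The sketch below is only an illustration: the function name enable_auto_accept and the two sample config lines are assumptions, not part of the setup above, and it relies on the -i flag of GNU sed as shipped with Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper: applies the same substitution as bootstrap-master.sh,
# but to whichever file is passed in, so it can be tried on a scratch copy.
enable_auto_accept() {
    sed -i 's/^#auto_accept.*$/auto_accept: True/g' "$1"
}

# Sample lines (assumed) standing in for /etc/salt/master.
scratch=$(mktemp)
printf '%s\n' '#open_mode: False' '#auto_accept: False' > "$scratch"

enable_auto_accept "$scratch"
cat "$scratch"
```

Only the commented-out auto_accept line should change; other commented settings are left alone.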

Configuring the salt master to deploy a hadoop cluster

In order to deploy hadoop from the salt-master we are going to manually configure the salt-master with the official hadoop formula. We could also automate this via cloudconfig, but for the sake of demonstration I'm going to list the manual steps here.

Configuring salt to use hadoop-formula and its dependencies

We are going to use the hadoop-formula, which depends on the hostsfile-formula and the sun-java-formula. The hostsfile-formula makes all nodes reachable by their hostname/fqdn, and the sun-java-formula automatically installs java from oracle.

ssh <user>@192.168.111.66 # SSH into your salt-master
sudo mkdir -p /srv/salt/formulas /srv/pillar
# cd is a shell builtin, so the whole chain has to run inside sh -c
sudo sh -c "cd /srv/salt/formulas &&
      git clone https://github.com/saltstack-formulas/hadoop-formula &&
      git clone https://github.com/saltstack-formulas/hostsfile-formula &&
      git clone https://github.com/saltstack-formulas/sun-java-formula"

# Writing below /etc/salt needs root, so pipe through sudo tee instead of using >
cat << EOF | sudo tee /etc/salt/master.d/file_roots.conf
file_roots:
  base:
    - /srv/salt
    - /srv/salt/formulas/hadoop-formula
    - /srv/salt/formulas/hostsfile-formula
    - /srv/salt/formulas/sun-java-formula
EOF

cat << EOF | sudo tee /etc/salt/master.d/pillar_roots.conf
pillar_roots:
  base:
    - /srv/pillar
EOF

sudo service salt-master restart
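
Each formula you add needs its own entry under file_roots. As a minimal sketch, the file could be generated from a list of formula names. The function name render_file_roots is hypothetical, and it assumes every formula is cloned under <base>/formulas as in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: prints a file_roots config for the given formulas.
# $1: base state directory; remaining arguments: formula directory names.
render_file_roots() {
    base_dir=$1
    shift
    printf 'file_roots:\n  base:\n    - %s\n' "$base_dir"
    for formula in "$@"; do
        printf '    - %s/formulas/%s\n' "$base_dir" "$formula"
    done
}

render_file_roots /srv/salt hadoop-formula hostsfile-formula sun-java-formula
```

Piping the output through sudo tee /etc/salt/master.d/file_roots.conf would write it to the location used above.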

Creating SSH keypairs for hadoop-formula

cd /srv/salt/formulas/hadoop-formula/hadoop/files && sudo ./generate-keypairs.sh

Configuring the top.sls state file to include hadoop and deps

On the salt-master create the following file /srv/salt/top.sls:

base:
  'G@roles:hadoop_slave or G@roles:hadoop_master':
    - match: compound
    - hostsfile
    - hostsfile.hostname
    - sun-java
    - sun-java.env
    - hadoop
    - hadoop.hdfs

By using pillar we can also change the parameters of hadoop. All parameters that can be changed are listed in hadoop-pillar.example.

Create the following top.sls pillar file /srv/pillar/top.sls:

base:
  'G@roles:hadoop_slave or G@roles:hadoop_master':
    - match: compound
    - hadoop

and the following hadoop specific pillar file /srv/pillar/hadoop.sls:

hadoop:
  version: hdp-2.2.0 # ['apache-1.2.1', 'apache-2.2.0', 'hdp-1.3.0', 'hdp-2.2.0', 'cdh-4.5.0', 'cdh-4.5.0-mr1']
  users:
    hadoop: 6000
    hdfs: 6001

Deploy salt-minions as hadoop nodes

Now that we have configured our salt-master we can start deploying the salt-minions, which will serve either as hadoop-master (namenode) or as hadoop-slave (datanode).

Launching hadoop-master(namenode) nodes

First create the cloudconfig file hadoopmaster-cloudconfig that openstack will use:

#cloud-config
apt_mirror: "http://mirror.bjtu.edu.cn/ubuntu/"
apt_sources: 
  - source: "ppa:saltstack/salt"
packages: 
  - python-software-properties
  - salt-minion

runcmd: 
 - bash /tmp/bootstrap-minion.sh

write_files: 
  - content: |
         roles:
            - hadoop_master
    path: /etc/salt/grains
  - content: |
        #!/bin/bash
        sed -i 's/^#master:.*$/master: 192.168.111.66/g' /etc/salt/minion
        sudo service salt-minion restart
    path: /tmp/bootstrap-minion.sh
    permissions: "0755"
  - content: |
        mine_functions:
            network.interfaces: []
            network.ip_addrs: []
            grains.items: []
    path: /etc/salt/minion.d/mine_functions.conf

Then create a new instance using openstack: nova boot --image ubuntu_14.04 --user-data hadoopmaster-cloudconfig --flavor m1.medium --key-name sam hadoop-master-1

Deploying hadoop-slave(datanode) nodes

Create a hadoopslave-cloudconfig:

#cloud-config
apt_mirror: "http://mirror.bjtu.edu.cn/ubuntu/"
apt_sources: 
  - source: "ppa:saltstack/salt"
packages: 
  - python-software-properties
  - salt-minion

runcmd: 
 - bash /tmp/bootstrap-minion.sh

write_files: 
  - content: |
         roles:
            - hadoop_slave
    path: /etc/salt/grains
  - content: |
        #!/bin/bash
        sed -i 's/^#master:.*$/master: 192.168.111.66/g' /etc/salt/minion
        sudo service salt-minion restart
    path: /tmp/bootstrap-minion.sh
    permissions: "0755"
  - content: |
        mine_functions:
            network.interfaces: []
            network.ip_addrs: []
            grains.items: []
    path: /etc/salt/minion.d/mine_functions.conf

Now launch 10 hadoop-slave nodes: nova boot --num-instances 10 --image ubuntu_14.04 --user-data hadoopslave-cloudconfig --flavor m1.medium --key-name sam hadoop-slave
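
The two minion cloudconfig files differ only in the role grain, so both could be rendered from one template. The following is a sketch under assumptions: the function name render_minion_cloudconfig and its two parameters are hypothetical, and the sed pattern is anchored on the colon (a small deviation) so that other commented #master_* settings are not rewritten.

```shell
#!/bin/sh
# Hypothetical helper: prints a minion cloud-config for one role.
# $1: role grain (hadoop_master or hadoop_slave); $2: IP of the salt-master
render_minion_cloudconfig() {
    role=$1
    master_ip=$2
    # Unquoted heredoc: ${role} and ${master_ip} expand, \$ stays a literal $.
    cat <<EOF
#cloud-config
apt_mirror: "http://mirror.bjtu.edu.cn/ubuntu/"
apt_sources:
  - source: "ppa:saltstack/salt"
packages:
  - python-software-properties
  - salt-minion

runcmd:
  - bash /tmp/bootstrap-minion.sh

write_files:
  - content: |
        roles:
          - ${role}
    path: /etc/salt/grains
  - content: |
        #!/bin/bash
        sed -i 's/^#master:.*\$/master: ${master_ip}/g' /etc/salt/minion
        sudo service salt-minion restart
    path: /tmp/bootstrap-minion.sh
    permissions: "0755"
  - content: |
        mine_functions:
            network.interfaces: []
            network.ip_addrs: []
            grains.items: []
    path: /etc/salt/minion.d/mine_functions.conf
EOF
}

render_minion_cloudconfig hadoop_slave 192.168.111.66
```

Redirecting the output to hadoopmaster-cloudconfig or hadoopslave-cloudconfig gives a file that can be passed to nova boot via --user-data.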

Let saltstack deploy hadoop

First make sure that the salt-master has detected all our hadoop-master and hadoop-slave nodes. Execute the following command on the salt-master: salt-run manage.up. It should show your 11 nodes.
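
Counting the nodes by eye gets tedious with larger clusters. As a rough sketch, the output of the salt-run manage.up runner could be piped into a small counter. The function name count_up_minions is hypothetical, and the sketch assumes the runner prints one "- <minion-id>" line per minion, which may differ between salt versions and outputters.

```shell
#!/bin/sh
# Hypothetical helper: counts minion ids in `salt-run manage.up` output
# read from stdin. Assumes one "- <minion-id>" line per minion.
count_up_minions() {
    grep -c '^- '
}

# Sample lines (assumed) standing in for the output of: salt-run manage.up
printf '%s\n' '- hadoop-master-1' '- hadoop-slave-1' '- hadoop-slave-2' | count_up_minions
```

On the salt-master, salt-run manage.up | count_up_minions should then print 11 once all nodes have connected.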

If all nodes are up and detected you can start deployment of hadoop with the following command:

salt '*' state.highstate

Accessing the hadoop HDFS cluster

After hadoop has been successfully deployed you can go to http://{hadoop_masterip}:50070 and check that 10 live nodes are active.

Now you can login to any of the nodes and save files to hdfs like this:

sudo -u hdfs -i
hadoop fs -mkdir -p /user/hdfs
hadoop fs -put /tmp/bootstrap-minion.sh /user/hdfs/test.sh

Hope this tutorial was useful!

Note: instead of using a bootstrap script, you can also configure salt-minion directly through cloudconfig.