Setting up a stand-alone spark cluster on OpenStack

This describes how start a stand alone Spark cluster on open stack, using two ansible playbooks. This has been tested on the Uppmax private cloud smog.

The open stack dymamic inventory code presented here is adapted from: https://github.com/lukaspustina/dynamic-inventory-for-ansible-with-openstack

How to start it?

Create a host from which to run ansible in your OpenStack dashboard and associate a floating IP to is so that you can ssh in to it.
ssh to the machine you just created.

Install the pre-requisites:

sudo apt-get install python-pip python-dev
sudo pip install ansible
sudo pip install python-novaclient

Clone this repository:

git clone https://github.com/johandahlberg/ansible_spark_openstack.git

Create a dir called files in the repo root dir and copy you ssh-keys (these cannot have a password) there. This is used to enable password-less ssh access between the nodes:
Download you OpenStack RC file from the OpenStack dashboard (it's available under "Access & Security -> API Access")
Source your OpenStack RC file: source <path to rc file>, and fill in your OpenStack password. This will load information about you OpenStack Setup into your environment.
Create the security group for spark. Since spark will start some services on random ports this will allow all tcp traffic within the security group:
```
nova secgroup-create spark "internal security group for spark"
nova secgroup-add-group-rule spark spark tcp 1 65535
```
Setup the name of your network. export OS_NETWORK_NAME="<name of your network>" If you like you can add this to your OpenStack RC file, or set it in your bash_rc. (You can find the name of your network in your OpenStack dashboard)
First run the playbook which creates your nodes. Open create_spark_cloud_playbook.yml and edit variables to set your ssh-key and how many workers you want to create, then run:
```
ansible-playbook -i localhost_inventory --private-key=<your_ssh_key> create_spark_cloud_playbook.yml
```
Open: deploy_spark_playbook.yml and set the ssh_keys_to_use variable to your ssh-key. Then install spark on the nodes (I've noticed that sometimes it takes a while for the ssh-server on the nodes to start, so if you get an initial ssh-error, wait a few minutes and try again).
```
ansible-playbook -i openstack_inventory.py --private-key=<your_ssh_key> deploy_spark_playbook.yml
```
Once this has finished successfully your spark cluster should be up and running! ssh into the spark-master node and try your new Spark cluster it by kicking of a shell:
```
./opt/spark-1.2.1-bin-hadoop2.4/bin/spark-shell --master spark://spark-master:7077 --executor-memory 6G
```

Tips

If you don't want to open the web-facing ports you can use ssh-forwarding to reach the web-interfaces, e.g

ssh -L 8080:spark-master:8080 -i <your key> ubuntu@<spark-master-ip>

Acknowledements

Mikael Huss for sharing his insights on Spark and collaborating with me on this
Zeeshan Ali Shah(@zeeshanali) for helping me get going with OpenStack