Building a Spark / SciPy / Cassandra “SparkLab” on AWS

I have just finished setting up a “SparkLab” for a client on a cluster of AWS machines; the setup is fully automated via a Bash script, which I have published to this public GitHub gist.

The following is a copy of the README file there. The script can also be used on any standalone Ubuntu Server (I recently used it on a VirtualBox VM to build a local development instance).

As usual, comments and suggestions welcome.

HOW-TO Build SparkLab

Author: Marco Massenzio
Date: 2015-01-22
Version: 1.0.0
Scope: Personal Use
License: Apache 2


Prerequisite

Create an AWS instance and install your SSH key on the machine so you can
ssh into it (as the ubuntu user), making sure that the security groups allow
access to the instance’s port 22 (SSH) from anywhere (0.0.0.0/0) or, at least,
from the subnet you will be accessing it from.

You may also want to place the instance on a private subnet and open a number
of ports to the subnet’s private LAN (e.g., 10.10.0.0/16) so you can
connect to Cassandra, Spark, IPython, etc.

Copy the files to that instance:

$ scp build-sparklab.sh ubuntu@ipaddress:./
$ scp sparklab-requirements.txt ubuntu@ipaddress:./

then execute it:

$ ssh ubuntu@ipaddress
...
$ ./build-sparklab.sh
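
The copy-and-run steps above can be wrapped in a small helper. This is only a sketch: the names `run`, `deploy_sparklab` and `DRY_RUN` are made up for illustration and are not part of build-sparklab.sh, and `ssh -t` is used on the assumption that the build may want a terminal for prompts.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the build script and its requirements file to the instance, then run it.
# $1 is the instance's IP address or hostname.
deploy_sparklab() {
    local host="${1:-}"
    if [ -z "$host" ]; then
        echo "usage: deploy_sparklab <ip-or-hostname>" >&2
        return 1
    fi
    run scp build-sparklab.sh sparklab-requirements.txt "ubuntu@${host}:./" || return 1
    run ssh -t "ubuntu@${host}" ./build-sparklab.sh
}
```

Running `DRY_RUN=1 deploy_sparklab ipaddress` first prints the two commands without touching the instance, which is a cheap way to check the address before the real run.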

Notes

It may be easier to scp a more complete .bashrc to the AWS instance before starting;
also .gitconfig and .gitignore may be useful (optional) [1]

One issue (particularly when accessing Cassandra) concerns the hostname: in the end, I found
it best to use Route53 and make the hostname resolvable via DNS.

See Cassandra documentation as to how to make the server reachable from clients other
than localhost.
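
To confirm that a hostname actually resolves from a client machine, a quick check along these lines may help. It is a sketch that assumes a Linux client where `getent` is available (as on Ubuntu); the function names are illustrative only.

```shell
# Return 0 if the given name resolves through the system resolver
# (/etc/hosts, DNS, ...), non-zero otherwise.
resolves() {
    getent hosts "$1" > /dev/null
}

# Report whether the given hostname resolves from this machine.
check_hostname() {
    local name="${1:-}"
    if [ -z "$name" ]; then
        echo "usage: check_hostname <hostname>" >&2
        return 2
    fi
    if resolves "$name"; then
        echo "OK: $name resolves"
    else
        echo "FAIL: $name does not resolve" >&2
        return 1
    fi
}
```

Run it on the client with the server's hostname as the argument. Resolving the name is necessary but not sufficient: Cassandra must also be configured to listen on a reachable address.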

Starting Spark master:

$SPARK_HOME/sbin/start-master.sh

Spark worker:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://ip-xx-xx-xxx-xxx:xxxx

$SPARK_HOME/sbin and $SPARK_HOME/bin should already be in PATH so you can execute
the above by just running the command.
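
As a sketch, the worker command can be wrapped so that the master's host and port are passed as arguments. The function names are hypothetical, and the snippet relies on `spark-class` being on the PATH as described above.

```shell
# Build the spark:// URL from the master's host ($1) and port ($2).
spark_master_url() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: spark_master_url <master-host> <master-port>" >&2
        return 1
    fi
    echo "spark://${1}:${2}"
}

# Start a worker in the foreground and attach it to the given master.
start_worker() {
    local url
    url="$(spark_master_url "${1:-}" "${2:-}")" || return 1
    spark-class org.apache.spark.deploy.worker.Worker "$url"
}
```

Use the host and port that the master reports when it starts, for example `start_worker ip-xx-xx-xxx-xxx xxxx`.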

Spark logs:

$SPARK_HOME/logs/spark-ubuntu-org.apache.spark.deploy.master.Master-1-ip-xx-xx-xxx-xxx.out

(still TODO: change spark logs to somewhere like /var/log/spark)
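
Because the log file name embeds the user and host, a glob is handier than typing the full path. The sketch below uses illustrative function names and assumes that the master writes its spark:// URL to this log on startup.

```shell
# Print the most recently modified master log under $SPARK_HOME/logs.
latest_master_log() {
    ls -t "${SPARK_HOME}"/logs/spark-*-org.apache.spark.deploy.master.Master-*.out 2>/dev/null | head -n 1
}

# Print the first spark:// URL found in the given log file.
master_url_from_log() {
    grep -o 'spark://[^ ]*' "$1" | head -n 1
}
```

For example, `tail -f "$(latest_master_log)"` follows the current master log, and `master_url_from_log "$(latest_master_log)"` prints the URL to hand to the workers.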

Starting Cassandra:

sudo service cassandra start

Cassandra logs:

/var/log/cassandra/system.log
/var/log/cassandra/output.log
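
When Cassandra does not come up cleanly, filtering the log for problems is usually quicker than reading it end to end. This sketch assumes that each log line starts with its level (possibly padded with spaces); `cassandra_problems` is an illustrative name.

```shell
# Print WARN and ERROR lines from a Cassandra log.
# $1 is the log file; it defaults to the system log.
cassandra_problems() {
    local log="${1:-/var/log/cassandra/system.log}"
    grep -E '^ *(WARN|ERROR)' "$log"
}
```

Called without arguments it scans /var/log/cassandra/system.log; pass another path to scan a different file. Stack-trace lines that follow an ERROR entry are not matched, so open the log itself for the full context.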

IPython notebook

Start with:

workon scipy3
ipython notebook --pylab=inline --profile=nbserver

See the IPython documentation on securing the notebook server.

NOTE: Currently the SSL connection uses a self-signed cert saved in ~/.ssh/ipy-cert.pem;
this will cause a warning in the browser, which is safe to ignore.

git

In order to clone projects from GitHub to any newly created SparkLab instance, you need
to generate a key-pair and then add the public part to GitHub’s SSH keys for the account.

If you don’t know how to do this, maybe you should not be reading this guide after all.

[1] Rename them from edit.* to their .* equivalent