I have just finished setting up a “SparkLab” for a client on a cluster of AWS machines: the setup is fully automated via a Bash script, which I have published to this public GitHub gist.
The following is a copy of the README file there; the script can also be used on any standalone Ubuntu Server (I recently used it on a VirtualBox VM to build a local development instance).
As usual, comments and suggestions are welcome.
HOW-TO Build SparkLab
Create an AWS instance and install your SSH key on the machine so you can
ssh into it (use the ubuntu user), making sure that the security groups allow
access to the instance’s port 22 (SSH) from anywhere (0.0.0.0/0) or, at least,
from the subnet you will be accessing it from.
You may also want to place the instance on a private subnet and open a number
of ports to the subnet’s private LAN (e.g., 10.10.0.0/16) so you can
connect to Cassandra, Spark, iPython, etc.
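As a rough sketch of the security group rules described above, the helper below prints (it does not run) one AWS CLI command per TCP port. The function name `sg_ingress_cmds` and the group id in the usage line are hypothetical, and the port numbers are only the usual defaults (7077 Spark master, 8080 Spark web UI, 9042 Cassandra CQL, 8888 iPython notebook); they are assumptions, so check them against your own configuration.

```shell
# Print one "authorize-security-group-ingress" command per TCP port.
# Nothing is executed: review the output, then paste or pipe it to a shell.
#   $1 = security group id (hypothetical placeholder in the usage below)
#   $2 = CIDR block allowed to connect
#   $3... = TCP ports to open
sg_ingress_cmds() {
  local group_id="$1" cidr="$2" port
  shift 2
  for port in "$@"; do
    echo "aws ec2 authorize-security-group-ingress --group-id ${group_id} --protocol tcp --port ${port} --cidr ${cidr}"
  done
}
```

For example, `sg_ingress_cmds sg-0123456789abcdef0 10.10.0.0/16 7077 8080 9042 8888` prints four commands that open those ports to the private LAN only.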
Copy the files to that instance:
$ scp build-sparklab.sh ubuntu@ipaddress:./
$ scp sparklab-requirements.txt ubuntu@ipaddress:./
then execute it:
$ ssh ubuntu@ipaddress
...
$ ./build-sparklab.sh
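If you build instances often, the copy-and-run steps can be wrapped in a small function. This is only a sketch: `run`, `deploy_sparklab` and the `DRY_RUN` variable are names made up for illustration, and the script is launched directly over ssh rather than from an interactive session.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the build script and its requirements file to the instance,
# then execute the script there as the ubuntu user.
#   $1 = IP address or hostname of the instance
deploy_sparklab() {
  local host="$1"
  run scp build-sparklab.sh sparklab-requirements.txt "ubuntu@${host}:./" &&
    run ssh "ubuntu@${host}" ./build-sparklab.sh
}
```

Running `DRY_RUN=1 deploy_sparklab ipaddress` first shows exactly what would be executed.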
It may be easier to scp a more complete .bashrc to the AWS instance before starting;
.gitconfig and .gitignore may also be useful (optional).
One issue (particularly when accessing Cassandra) is the hostname; in the end, I found that
it’s best to use Route53 and make the hostname DNS-resolvable.
See the Cassandra documentation on how to make the server reachable from other clients.
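A quick way to see whether a hostname resolves at all is to ask the system resolver. The helper below is a minimal sketch (the name `host_resolves` is made up here) and assumes `getent` is available, as it is on Ubuntu.

```shell
# Succeed if the given name resolves through the system resolver
# (/etc/hosts, DNS, ...). Defaults to this machine's own hostname.
host_resolves() {
  local name="${1:-$(hostname)}"
  getent hosts "$name" >/dev/null
}
```

Run `host_resolves || echo "hostname is not resolvable"` on the server, and `host_resolves <server-hostname>` on a client machine to check that it can find the server by name.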
Starting a Spark worker (it registers with the master at the given spark:// URL):
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://ip-xx-xx-xxx-xxx:xxxx
$SPARK_HOME/sbin and $SPARK_HOME/bin should already be in PATH, so you can run
the above by command name alone, without the full path.
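If a Spark command is not found, it is worth confirming that both directories really are in PATH. A minimal sketch, with the helper names `in_path` and `check_spark_path` invented for this example:

```shell
# Succeed if directory $1 is one of the entries in PATH.
in_path() {
  case ":${PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether $SPARK_HOME/bin and $SPARK_HOME/sbin are in PATH;
# returns non-zero if either one is missing.
check_spark_path() {
  local dir missing=0
  for dir in "${SPARK_HOME}/bin" "${SPARK_HOME}/sbin"; do
    if in_path "$dir"; then
      echo "ok: $dir"
    else
      echo "missing: $dir"
      missing=1
    fi
  done
  return "$missing"
}
```

This does an exact string match, so an entry written with a trailing slash (or via a symlink) would be reported as missing.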
(still TODO: change spark logs to somewhere like /var/log/spark)
Starting Cassandra:
$ sudo service cassandra start
Starting the iPython notebook server:
$ workon scipy3
$ ipython notebook --pylab=inline --profile=nbserver
See the iPython documentation on securing the notebook server.
NOTE: Currently the SSL connection uses a self-signed cert saved in ~/.ssh/ipy-cert.pem;
this will cause a warning in the browser, which is safe to ignore.
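For reference, a self-signed cert of this kind can be produced with openssl. The sketch below is not necessarily what the build script does: the function name, the 365-day validity, the RSA key size and the CN value are all assumptions chosen for illustration.

```shell
# Generate a self-signed certificate, with private key and certificate
# stored together in a single PEM file.
#   $1 = output file (defaults to the path mentioned above)
#   $2 = common name for the certificate (defaults to this hostname)
make_self_signed_cert() {
  local pem="${1:-$HOME/.ssh/ipy-cert.pem}" cn="${2:-$(hostname)}"
  openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -subj "/CN=${cn}" -keyout "$pem" -out "$pem" &&
    chmod 600 "$pem"   # the file contains the private key
}
```

Because the key is stored unencrypted (`-nodes`), the file is restricted to the owner.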
In order to clone projects from github to any newly created SparkLab instance, you need
to generate a key-pair and then add the public part to github’s SSH keys for the account.
If you don’t know how to do this, maybe you should not be reading this guide after all.
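For completeness, here is one way to script the key generation on a fresh instance. It is a sketch only: the function name, the key type (ed25519), the empty passphrase and the default comment are assumptions made for convenience, not something the build script is known to do.

```shell
# Create a new SSH key-pair and print the public part, ready to be
# pasted into the SSH keys page of the GitHub account.
#   $1 = private key path (default: ~/.ssh/id_ed25519)
#   $2 = comment stored in the public key (default: sparklab)
make_github_key() {
  local key="${1:-$HOME/.ssh/id_ed25519}" comment="${2:-sparklab}"
  if [ -e "$key" ]; then
    echo "refusing to overwrite existing key: $key" >&2
    return 1
  fi
  # -N "" sets an empty passphrase; use a real one if the instance is shared.
  ssh-keygen -q -t ed25519 -N "" -C "$comment" -f "$key" &&
    cat "${key}.pub"
}
```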
If you copy the dotfiles mentioned above, rename them from edit.* to their .* equivalent.
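The rename from edit.* to .* is easy to do in a loop. The sketch below assumes the files are named like edit.bashrc or edit.gitconfig and sit in the home directory; the function name `rename_dotfiles` is made up for this example.

```shell
# Rename every edit.* file in a directory to its dotfile equivalent,
# e.g. edit.bashrc -> .bashrc. Existing dotfiles are overwritten.
#   $1 = directory to process (default: $HOME)
rename_dotfiles() {
  local dir="${1:-$HOME}" path base
  for path in "$dir"/edit.*; do
    [ -e "$path" ] || continue      # no match: the glob stays literal
    base="${path##*/}"              # strip the directory part
    mv "$path" "$dir/${base#edit}"  # drop the leading "edit"
  done
}
```

Since this replaces the stock .bashrc without asking, keep a copy of it first if you want to compare later.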