Code Trips & Tips

Building a Spark / SciPy / Cassandra “SparkLab” on AWS

Marco

February 16, 2015

AWS, Cassandra, Python, Shell Scripts, Spark

I have just finished setting up a “SparkLab” for a client on a cluster of AWS machines. The setup is fully automated by a Bash script, which I have published to this public GitHub gist.

The following is a copy of the README file there; the script can also be used on any standalone Ubuntu Server (I recently used it on a VirtualBox VM to build a local development instance).

As usual, comments and suggestions welcome.

HOW-TO Build SparkLab

Author:	Marco Massenzio
Date:	2015-01-22
Version:	1.0.0
Scope:	Personal Use
License:	Apache 2

Prerequisite

Create an AWS instance and install your SSH key on the machine so you can
ssh into it (as the ubuntu user), making sure that the security groups allow
access to the instance’s port 22 (SSH) from anywhere (0.0.0.0/0) or, at least,
from the subnet you will be accessing it from.

You may also want to place the instance on a private subnet and open a number
of ports to the subnet’s private LAN (e.g., 10.10.0.0/16) so you can
connect to Cassandra, Spark, IPython, etc.
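
Once the security groups are in place, a quick reachability check from another
machine on the subnet can save some head-scratching. The following is a minimal
sketch, not part of the gist: the function names are hypothetical, and the port
numbers are only the usual defaults (22 for SSH, 7077 and 8080 for the Spark
master and its web UI, 8888 for the IPython notebook, 9042 for Cassandra's
native protocol). Adjust them to whatever your instance actually uses.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device and on coreutils `timeout`.
port_open() {
    local host="$1"
    local port="$2"
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_sparklab_ports HOST: report the state of the usual default ports.
# Assumption: the services run on their stock ports; edit the list otherwise.
check_sparklab_ports() {
    local host="$1"
    local entry name port
    for entry in ssh:22 spark-master:7077 spark-ui:8080 ipython:8888 cassandra:9042; do
        name="${entry%%:*}"
        port="${entry##*:}"
        if port_open "$host" "$port"; then
            echo "$name ($port): open"
        else
            echo "$name ($port): closed or filtered"
        fi
    done
}
```

Running `check_sparklab_ports 10.10.0.5` (a made-up private address) prints one
line per service; a port reported as closed points at either the security group
or a service that is not listening on that interface.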

Copy the files to that instance:

$ scp build-sparklab.sh ubuntu@ipaddress:./
$ scp sparklab-requirements.txt ubuntu@ipaddress:./

then log in and run the script:

$ ssh ubuntu@ipaddress
...
$ ./build-sparklab.sh

Notes

It may be easier to scp a more complete .bashrc to the AWS instance before starting;
.gitconfig and .gitignore may also be useful (optional) [1].

One recurring issue (particularly when accessing Cassandra) is the hostname; in the
end, I found it best to use Route53 and make the hostname resolvable via DNS.

See the Cassandra documentation for how to make the server reachable from clients
other than localhost.
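
As a starting point for that investigation, the sketch below prints the address
settings in cassandra.yaml that decide which interfaces the node binds to, and
checks whether a name resolves. It is not part of the gist: the function names
are hypothetical, and the default path /etc/cassandra/cassandra.yaml is an
assumption that holds for the Debian/Ubuntu package but may differ for other
installs.

```shell
# show_cassandra_addresses [FILE]: print the uncommented address settings.
# Assumption: the config lives where the Debian/Ubuntu package puts it.
show_cassandra_addresses() {
    local conf="${1:-/etc/cassandra/cassandra.yaml}"
    grep -E '^(listen_address|rpc_address|broadcast_address):' "$conf"
}

# hostname_resolves NAME: succeed if NAME resolves through the system resolver.
hostname_resolves() {
    getent hosts "$1" >/dev/null
}
```

If `show_cassandra_addresses` reports localhost for listen_address or
rpc_address, remote clients will not be able to connect; and
`hostname_resolves "$(hostname)"` failing is a hint that the DNS setup still
needs attention.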

Starting Spark master:

$SPARK_HOME/sbin/start-master.sh

Spark worker:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://ip-xx-xx-xxx-xxx:xxxx

$SPARK_HOME/sbin and $SPARK_HOME/bin should already be in PATH so you can execute
the above by just running the command.
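
To avoid retyping the master URL, the worker launch can be wrapped in a couple
of small functions. This is only a sketch with hypothetical names, not something
the build script installs. It assumes the master listens on 7077, which is
Spark's default for standalone mode; the authoritative URL is the one printed in
the master's log and web UI, so pass the port explicitly if yours differs.

```shell
# spark_master_url HOST [PORT]: print the spark:// URL a worker should join.
# Assumption: 7077 is used when no port is given (the standalone default).
spark_master_url() {
    local host="$1"
    local port="${2:-7077}"
    echo "spark://${host}:${port}"
}

# start_worker HOST [PORT]: run a worker in the foreground against that master.
# spark-class ships in $SPARK_HOME/bin; this aborts if SPARK_HOME is unset.
start_worker() {
    "${SPARK_HOME:?SPARK_HOME is not set}/bin/spark-class" \
        org.apache.spark.deploy.worker.Worker "$(spark_master_url "$@")"
}
```

For example, `start_worker ip-xx-xx-xxx-xxx` joins the master on the default
port, while `start_worker ip-xx-xx-xxx-xxx 7078` targets a non-default one.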

Spark logs:

$SPARK_HOME/logs/spark-ubuntu-org.apache.spark.deploy.master.Master-1-ip-xx-xx-xxx-xxx.out

(still TODO: change spark logs to somewhere like /var/log/spark)
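
Because the log file name embeds the user, an instance number and the host name,
it is handy to let the shell find the newest one. The helper below is a sketch
with a hypothetical name. It looks in the directory given as an argument, then
in SPARK_LOG_DIR (the variable Spark's daemon scripts honour, usually set in
conf/spark-env.sh, which is also the likely route for the TODO above), and
finally falls back to $SPARK_HOME/logs.

```shell
# latest_spark_log [DIR]: print the most recently modified master log in DIR.
# Falls back to $SPARK_LOG_DIR, then to $SPARK_HOME/logs.
latest_spark_log() {
    local dir="${1:-${SPARK_LOG_DIR:-${SPARK_HOME}/logs}}"
    # ls -t sorts newest first; errors are hidden when nothing matches.
    ls -t "$dir"/spark-*-org.apache.spark.deploy.master.Master-*.out 2>/dev/null | head -n 1
}
```

A typical use is `tail -f "$(latest_spark_log)"` right after starting the master.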

Starting Cassandra:

sudo service cassandra start

Cassandra logs:

/var/log/cassandra/system.log
/var/log/cassandra/output.log

IPython notebook

Start with:

workon scipy3
ipython notebook --pylab=inline --profile=nbserver

See the IPython documentation on securing the notebook server.

NOTE: the SSL connection currently uses a self-signed certificate saved in
~/.ssh/ipy-cert.pem; this will cause a warning in the browser, which is safe to ignore.
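
Should the certificate ever need to be regenerated, something along these lines
should do. It is a sketch rather than the exact command the build script runs:
the function name, the default common name sparklab.local and the 365-day
validity are all assumptions made for illustration. The openssl invocation
writes the private key and the certificate into the same PEM file.

```shell
# make_selfsigned_cert FILE [CN]: write a 2048-bit RSA key and a self-signed
# certificate, valid for 365 days, into a single PEM file.
# Assumption: sparklab.local is a placeholder common name.
make_selfsigned_cert() {
    local out="$1"
    local cn="${2:-sparklab.local}"
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -subj "/CN=${cn}" -keyout "$out" -out "$out" 2>/dev/null &&
        chmod 600 "$out"
}
```

For instance, `make_selfsigned_cert ~/.ssh/ipy-cert.pem "$(hostname)"` would
replace the existing file; the browser warning remains, since the certificate is
still not signed by a trusted authority.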

git

In order to clone projects from GitHub to any newly created SparkLab instance, you need
to generate a key pair and then add the public part to the account’s SSH keys on GitHub.

If you don’t know how to do this, maybe you should not be reading this guide after all.
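
For those who would like a reminder anyway, here is a minimal sketch. The
function name is hypothetical, and it deliberately creates a key without a
passphrase, which is a convenience trade-off for a lab instance and not a
recommendation; drop the -N "" option to be prompted for one.

```shell
# make_github_key FILE [COMMENT]: create an ed25519 key pair at FILE and
# print the public half, ready to paste into the account's SSH keys page.
# Assumption: a passphrase-less key is acceptable on this instance.
make_github_key() {
    local key="$1"
    local comment="${2:-sparklab}"
    ssh-keygen -q -t ed25519 -N "" -C "$comment" -f "$key" && cat "${key}.pub"
}
```

After `make_github_key ~/.ssh/id_ed25519` and adding the printed key to the
account, `ssh -T git@github.com` confirms whether the key is accepted.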

[1]	The copies in the gist carry an `edit.` prefix; rename them to their `.` equivalent
