Using Vagrant to test Apache Spark applications

Apache Spark is fast becoming the established platform for developing
big data applications, both for batch processing and, more recently,
for processing real-time data with Spark Streaming.

For me, Apache Spark really shines in that it allows you to write
applications to run on a Hadoop YARN cluster, and there is little to
no paradigm shift for developers coming from a functional background.

Conveniently, Spark does have a standalone mode in which it can be run
locally. This can be great for local validation of your code, but I
felt that having a YARN cluster running HDFS would let me keep my code
consistent between development and production environments.

There is a project, Apache Bigtop, which provides a means to deploy
Apache Hadoop/Spark to virtual machines or Docker containers. This is
definitely the avenue I would like to go down in the future, but I
wanted to get an idea of the components of a YARN cluster as well as
come up with a lighter-weight solution.

I therefore set about developing a very simple Vagrantfile with a
number of bash scripts to set up two machines:

  1. hadoop

    An Ubuntu virtual machine that acts as a single-node
    (pseudo-distributed) YARN cluster running HDFS and Hadoop. The
    scripts and configuration for this drew much influence from the
    official Apache Hadoop documentation as well as an indispensable
    post by Sain Technology Solutions. Thanks for making it easy.

  2. spark

    Another Ubuntu virtual machine that simply has Apache Spark
    installed on it.

To get started, git clone the repository, cd into it and type
vagrant up.
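Once vagrant up returns, it can help to confirm that both machines
actually came up before going further. The following is a minimal
sketch and not part of the repository: the function name
machines_running is hypothetical, and it assumes the machines are
named hadoop and spark as described above. It relies on Vagrant's
--machine-readable output, in which each machine's state is reported
as a comma-separated line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both VMs report the "running" state.
# Assumes the Vagrantfile names its machines "hadoop" and "spark".
machines_running() {
  local name
  for name in hadoop spark; do
    # Machine-readable lines look like: <timestamp>,<machine>,state,<value>
    vagrant status --machine-readable "$name" \
      | grep -q ",${name},state,running" || return 1
  done
}
```

Run from the repository directory, something like
machines_running && echo "both machines are up" prints the message
only when both machines report the running state.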

If all is well, you should be able to ssh into the spark instance and
run the Spark interactive REPL.
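A minimal sketch of that step, assuming the machine is named spark and
that spark-shell is on the PATH of the default vagrant user. Neither
assumption is confirmed here, so adjust both to match the provisioning
scripts. The function name open_spark_repl is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start the Spark REPL on the "spark" VM.
# Assumes spark-shell is on the PATH inside the VM; if it is not,
# substitute the full path used by the provisioning scripts.
open_spark_repl() {
  # "vagrant ssh <machine> -c <command>" runs the command over SSH.
  vagrant ssh spark -c "spark-shell"
}
```

Calling open_spark_repl from the repository directory should drop you
at the scala> prompt, where the predefined sc value is the
SparkContext that spark-shell creates for you.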

You should also be able to upload files to the hadoop machine and
subsequently to HDFS for use in your Spark application.
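One way to do this, sketched under a few assumptions: the machine is
named hadoop, the hdfs command is on the PATH of the default vagrant
user, and your Vagrant version provides the vagrant upload command
(scp or the synced /vagrant folder would work as alternatives). The
function name upload_to_hdfs and the /tmp staging path are
hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a local file to the "hadoop" VM, then into HDFS.
# Usage: upload_to_hdfs <local-file> <hdfs-directory>
upload_to_hdfs() {
  local src="$1" dest="$2"
  local name
  name="$(basename "$src")"

  # Step 1: stage the file on the VM's local filesystem (assumed: /tmp).
  vagrant upload "$src" "/tmp/$name" hadoop &&
    # Step 2: create the target directory in HDFS and copy the file in.
    vagrant ssh hadoop -c \
      "hdfs dfs -mkdir -p '$dest' && hdfs dfs -put -f '/tmp/$name' '$dest/'"
}
```

For example, upload_to_hdfs data/input.csv /data would leave the file
at /data/input.csv in HDFS, ready to be read from a Spark application
using whatever namenode address the hadoop machine is configured with.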

While this is by no means a best-practice set of deployment scripts,
it proves useful for basic smoke tests before attempting to interact
with a real cluster, which may be inaccessible during development. For
me, the real utility will come when I integrate a web application with
a Spark application that may need to read data from HDFS independently
of Apache Spark.

I would love to hear of any other solutions people have for testing
Spark applications locally that depend on a YARN cluster.
