Apache Spark is fast becoming the established platform for developing
big data applications, both for batch processing and, more recently,
for processing real-time data with Spark Streaming.
For me, Apache Spark really shines in that it allows you to write
applications that run on a Hadoop YARN cluster with little to no
paradigm shift for developers coming from a functional background.
Conveniently, Spark does have a standalone mode in which it can be run
locally. This can be great for local validation of your code, but I
felt that having a YARN cluster running HDFS would let me keep my code
consistent between development and production environments.
There is a project, Apache Bigtop, which
provides a means to deploy Apache Hadoop/Spark to virtual machines or
Docker containers. This is definitely the avenue I would like to go
down in the future, but I wanted to get an idea of the components of a
YARN cluster as well as come up with a lighter-weight solution.
I therefore set about developing a very simple
Vagrantfile with a
number of bash scripts to set up two machines:
- An Ubuntu virtual machine to act as a pseudo-single-node YARN
  cluster running HDFS and Hadoop. The scripts and configuration for
  this drew heavily on the official Apache Hadoop documentation as
  well as an indispensable post by Sain Technology Solutions. Thanks
  for making it easy for me.
- Another Ubuntu virtual machine that simply has Apache Spark
  installed on it.
To get started, cd into the repository and type:
vagrant up
If all is well, you should be able to ssh into the spark instance and
run the interactive Spark REPL:
vagrant ssh spark
./spark-2.0.2-bin-hadoop2.7/bin/spark-shell --master yarn
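For --master yarn to work, Spark has to find the cluster's client-side configuration: its YARN documentation asks for HADOOP_CONF_DIR or YARN_CONF_DIR to point at the directory holding those files. The sketch below is a guard I could imagine putting in front of the REPL. It assumes (this is not taken from the repository's scripts) that one of those variables is set on the spark VM, and the function names are made up for illustration. Checking for yarn-site.xml and core-site.xml is only a heuristic for "the config looks present".

```shell
#!/bin/sh
# Sketch: choose a Spark master depending on whether YARN client config exists.
# Assumption: run from /home/vagrant on the spark VM, as in the command above.
SPARK_SHELL="./spark-2.0.2-bin-hadoop2.7/bin/spark-shell"

# Succeeds only if directory $1 holds the Hadoop client config files.
# (Heuristic: yarn-site.xml and core-site.xml are the two files checked.)
has_yarn_client_conf() {
  [ -n "$1" ] && [ -f "$1/yarn-site.xml" ] && [ -f "$1/core-site.xml" ]
}

# Print "yarn" when the config directory looks usable, else "local[*]".
choose_master() {
  if has_yarn_client_conf "$1"; then
    echo 'yarn'
  else
    echo 'local[*]'
  fi
}

# Start the REPL against whichever master was chosen.
launch_spark_shell() {
  "$SPARK_SHELL" --master "$(choose_master "${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}")"
}
```

Calling launch_spark_shell then falls back to local mode instead of failing when the hadoop VM's configuration is not visible, which is handy when only the spark machine is up.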
You should also be able to upload files to the hadoop machine and
subsequently to HDFS for use in your Spark application:
# copy the file to the hadoop VM
# (HADOOP_VM_ADDRESS is a placeholder for the address of the hadoop VM)
scp FILE_TO_COPY vagrant@HADOOP_VM_ADDRESS:/home/vagrant
# now ssh in
vagrant ssh hadoop
# and upload the file to HDFS (copy to the root directory, /)
./hadoop-2.7.3/bin/hdfs dfs -copyFromLocal FILE_TO_COPY /
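The HDFS half of those steps can also be driven from the host without an interactive login, since vagrant ssh accepts a command through -c. What follows is a minimal sketch rather than part of the repository: the function names are hypothetical, it assumes the file has already been copied to /home/vagrant on the hadoop VM, and the simple single-quoting will not cope with file names that themselves contain single quotes.

```shell
#!/bin/sh
# Sketch: put a file that is already on the hadoop VM into HDFS, then list it.
HDFS="./hadoop-2.7.3/bin/hdfs"   # path from the post, relative to /home/vagrant

# Print the command that copies a file from the VM's home directory into HDFS.
# $1 = file name (any leading directories are dropped), $2 = HDFS dir (default /)
hdfs_put_cmd() {
  printf "%s dfs -copyFromLocal '%s' '%s'\n" "$HDFS" "$(basename "$1")" "${2:-/}"
}

# Print the command that lists an HDFS directory, to confirm the upload.
hdfs_ls_cmd() {
  printf "%s dfs -ls '%s'\n" "$HDFS" "${1:-/}"
}

# Run both commands on the hadoop VM in one non-interactive ssh call.
upload_and_list() {
  vagrant ssh hadoop -c "$(hdfs_put_cmd "$1" "${2:-/}") && $(hdfs_ls_cmd "${2:-/}")"
}
```

From the repository directory, upload_and_list FILE_TO_COPY would then run the copy and print the listing of / in a single step.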
While this is by no means a best-practice set of deployment scripts,
it proves to be useful for basic smoke tests before attempting to
interact with a real cluster, which may be inaccessible during
development. For me, the real utility will come when I integrate a web
application with a Spark application that may need to read data from
HDFS independently of Apache Spark.
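One way a web application could read from HDFS without going through Spark is WebHDFS, the REST interface that Hadoop exposes on the NameNode. This is only a sketch of that option, not something the scripts here set up: the host name is a placeholder you would have to supply, the function names are hypothetical, and 50070 is the default NameNode HTTP port in Hadoop 2.x, so it may differ if the configuration overrides it.

```shell
#!/bin/sh
# Sketch: read a file from HDFS over HTTP using the WebHDFS OPEN operation.

# Print the WebHDFS URL for opening a file.
# $1 = NameNode host (placeholder), $2 = absolute HDFS path, $3 = port (optional)
webhdfs_open_url() {
  printf 'http://%s:%s/webhdfs/v1%s?op=OPEN\n' "$1" "${3:-50070}" "$2"
}

# Fetch the file contents to stdout.
# -L follows the redirect that the NameNode issues towards a DataNode.
webhdfs_cat() {
  curl -sS -L "$(webhdfs_open_url "$1" "$2" "${3:-50070}")"
}
```

Because the NameNode redirects the client to a DataNode, the DataNode's host name has to be resolvable from wherever the request is made, which is worth checking first when testing from the host machine.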
I would love to hear of any other solutions people have for testing
Spark applications locally that depend on a YARN cluster.