Using Vagrant to test Apache Spark applications

Apache Spark is fast becoming the established platform for developing
big data applications, both for batch processing and, more recently,
for processing real-time data with Spark Streaming.

For me, Apache Spark really shines in that it allows you to write
applications to run on a Hadoop YARN cluster with little to no
paradigm shift for developers coming from a functional background.

Conveniently, Spark does have a standalone mode in which it can be run
locally. This can be great for local validation of your code but I
felt that having a YARN cluster running HDFS would enable me to make
my code consistent between development and production environments.

There is a project, Apache Bigtop, which
provides a means to deploy Apache Hadoop/Spark to virtual machines or
Docker containers. This is definitely the avenue I would like to go
down in the future, but I wanted to get an idea of the components of a
YARN cluster as well as come up with a lighter-weight solution.

I therefore set about developing a very simple Vagrantfile with a
number of bash scripts to set up two machines:

  1. hadoop

    An Ubuntu virtual machine to act as a single-node,
    pseudo-distributed YARN cluster running HDFS. The scripts and
    configuration for this drew much influence from the official
    Apache Hadoop documentation as well as an indispensable guide
    posted by Sain Technology Solutions. Thanks for making it easy for

  2. spark

    Another Ubuntu virtual machine that simply has Apache Spark
    installed on it.

To get started, git clone the repository, cd into it and type
vagrant up.

If all is well, you should be able to ssh into the spark instance
(vagrant ssh spark) and run the interactive Spark REPL (spark-shell).

You should also be able to upload files to the hadoop machine and
subsequently to HDFS, for instance with hdfs dfs -put, for use in
your Spark application.

While this is by no means a best-practice set of deployment scripts,
it is useful for basic smoke tests before attempting to interact with
a real cluster, which may be inaccessible during development. For me,
the real utility will come when I integrate a web application with a
Spark application that may need to read data from HDFS independently
of Apache Spark.

I would love to hear of any other solutions people have for testing
Spark applications locally that depend on a YARN cluster.

R XML Package

I’ve spent a number of years programming in Java so, during my MSc in
Bioinformatics, it took me a while to become acquainted with the nuances and
the idioms of writing code in R. The language’s quirks have been discussed
extensively elsewhere, nowhere better than in John Cook’s lecture R: The Good,
The Bad and The Ugly. While at first I was frustrated with the language, I am
starting to become fond of it, if only because of the increasingly rich tooling
(such as RStudio) and the packaging system. While unrelated to the field of
Bioinformatics, I have started to write some sample R code for pleasure and
because of the brevity of the code that I can write. I have been working
towards creating a Shiny web app that can visualise exercise data that is
stored in an XML format that is validated against an XML schema. The code is
available online. For this I have been using the XML package available from
CRAN (kindly authored and maintained by Duncan Temple Lang), which contains a
really useful function, xmlToDataFrame, which will take an XML document with a
fairly flat structure of repeated elements and create a data frame from them.
As an example, a document whose root contains three foobar elements, each with
foo, bar and baz children, would be rendered as a data.frame of the form

Foo  Bar  Baz
12   2.1  First
16   1.1  Not first
20   3.3  Last

By default, each of these columns will be interpreted as strings of characters.
The colClasses argument of the xmlToDataFrame function allows the classes to be
specified as a vector, for instance c("integer", "numeric", "character").
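
As a rough, language-neutral illustration of the idea (written in Python
rather than R, and in no way the XML package's own implementation), the sketch
below reads a flat document into columns of strings and then converts them by
position, much as a type vector would. The root name foobars, the lowercase
element names and the helper names flat_xml_to_columns and apply_col_classes
are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the root name "foobars" and the element names are
# assumptions chosen to reproduce the values in the table above.
DOC = """<foobars>
  <foobar><foo>12</foo><bar>2.1</bar><baz>First</baz></foobar>
  <foobar><foo>16</foo><bar>1.1</bar><baz>Not first</baz></foobar>
  <foobar><foo>20</foo><bar>3.3</bar><baz>Last</baz></foobar>
</foobars>"""


def flat_xml_to_columns(text):
    """Return {element name: list of strings}, one value per record."""
    records = list(ET.fromstring(text))  # each child of the root is one row
    names = []
    for record in records:
        for field in record:
            if field.tag not in names:
                names.append(field.tag)
    # findtext() returns None when a record lacks the element.
    return {name: [record.findtext(name) for record in records] for name in names}


def apply_col_classes(columns, col_classes):
    """Convert columns by position, one converter per column."""
    if len(col_classes) != len(columns):
        raise ValueError("expected one converter per column")
    return {
        name: [convert(value) for value in values]
        for (name, values), convert in zip(columns.items(), col_classes)
    }


columns = flat_xml_to_columns(DOC)
typed = apply_col_classes(columns, [int, float, str])

print(columns["foo"])  # ['12', '16', '20']
print(typed["foo"])    # [12, 16, 20]
print(typed["bar"])    # [2.1, 1.1, 3.3]
```

Everything starts out as text; the types only appear once a converter is
applied to each column.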

This is great! Unfortunately, each of the foo, bar and baz elements must be
present in at least one of the foobar elements. If we were to assume that this
XML document could optionally have a foobaz element of Boolean type (logical in
R) and we specified our colClasses vector as c("integer", "numeric",
"character", "logical"), then xmlToDataFrame would fail whenever foobaz was not
present in our document.

The only solution I have come up with to overcome this is to use xmlToDataFrame
without the colClasses argument and then replace each column that was read in
from the XML document with a copy converted to the specified type. I
currently do this in the
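
The workaround described above can be sketched as follows. This is a Python
analogue under stated assumptions, not the R code from the project: the
documents, the WANTED_TYPES mapping and the helpers read_columns, parse_bool
and coerce_present are hypothetical names invented for the illustration. The
key point is that types are keyed by column name and applied only to columns
that actually turned up in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the optional foobaz element only appears in the second.
WITHOUT_FOOBAZ = """<foobars>
  <foobar><foo>12</foo><bar>2.1</bar><baz>First</baz></foobar>
  <foobar><foo>16</foo><bar>1.1</bar><baz>Not first</baz></foobar>
</foobars>"""

WITH_FOOBAZ = """<foobars>
  <foobar><foo>12</foo><bar>2.1</bar><baz>First</baz><foobaz>true</foobaz></foobar>
  <foobar><foo>16</foo><bar>1.1</bar><baz>Not first</baz></foobar>
</foobars>"""


def parse_bool(text):
    """Treat the text 'true' (any case) as True, anything else as False."""
    return text.strip().lower() == "true"


# Desired type per column name; foobaz may or may not be in a given document.
WANTED_TYPES = {"foo": int, "bar": float, "baz": str, "foobaz": parse_bool}


def read_columns(text):
    """Read every column as strings, with no types specified up front."""
    records = list(ET.fromstring(text))
    names = []
    for record in records:
        for field in record:
            if field.tag not in names:
                names.append(field.tag)
    return {name: [record.findtext(name) for record in records] for name in names}


def coerce_present(columns, wanted):
    """Replace each column that was read in with a typed copy."""
    typed = dict(columns)
    for name, convert in wanted.items():
        if name not in typed:
            continue  # optional column absent from this document: skip it
        # Records that lack the element keep None instead of being converted.
        typed[name] = [None if v is None else convert(v) for v in typed[name]]
    return typed


plain = coerce_present(read_columns(WITHOUT_FOOBAZ), WANTED_TYPES)
extra = coerce_present(read_columns(WITH_FOOBAZ), WANTED_TYPES)

print(sorted(plain))    # ['bar', 'baz', 'foo']
print(extra["foobaz"])  # [True, None]
```

Converting column by column like this does a second pass over the data, which
is fine for a few hundred rows but would be worth revisiting for larger
documents.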

I am more than happy with the time savings the XML package has provided me in
converting my XML document into a data frame in R. My solution to providing
types to the columns of my data frame, while probably very inefficient, is
ample for the few hundred entries I will have (or not depending how well I keep
to my fitness regime).

In the future I will reimplement this application in the Gosu programming
language to show how we can use its type loader system to use an XSD to
statically generate objects directly from the XML.

Personal project success

With a glut of free time of late, I have chosen to take some time to
write some code as part of a personal project. Primarily, I wanted to
really dive deeply into the Spring Framework beyond the basics of web
application development. At the same time I didn’t want the effort to
go to waste and so I have decided that I really want the project to
have a use in the real world.

With this in mind, I have decided to develop an app that allows
learners and speakers of the Welsh language to find
restaurants, coffee shops and (possibly) other kinds of services that
cater to their language needs. It’s a simple idea but, surprisingly,
there is very little in the way of this kind of thing online.

But this is a bit pie in the sky. I’ve had a number of ideas in the
past that have never come to anything. At work I’ve found that
implementing software with another person’s criteria in mind is
usually fairly easy to accomplish but I’ve invariably failed
miserably when it comes to putting my own ideas into action. I’ve been
wondering why that is.

I have a fair few ideas as to why this is the case. Suffice to say,
here are some approaches I’ve tried in order to overcome these
difficulties and that, so far, seem to be proving successful.

  1. Breaking down the task

    When I settle on an idea, if I meditate for a little while on the
    totality of all the work to be done, I find the whole thing
    overwhelming.

    If I use a physical kanban board I find this helps a lot. For more
    details on Kanban, see this great Google Talk with Eric. Writing
    small tasks on little cards and placing them on a cork board, I
    can instantly see the subsystems of the software and the features
    I have considered.

    I’ve taken to placing a pen and a small pile of cards next to the
    board so that the instant I have an idea, I can write it on a card
    and place it in the list of pending tasks. As I’m both building the
    software and deciding the parameters of what it’s supposed to do, I
    can then easily discard any cards for ideas that turn out not to be
    so good.

    Also, without being tied down by anyone to a deadline, it’s
    quite easy to arrange the order of the cards for any reason that
    suits you. In my case, I try to work on tasks that present parts of
    Spring Framework that I haven’t used before and flip between these
    tasks and others that might be easier for me.

  2. Little and often

    When committing to read a book, I will never hold myself to a
    deadline, mostly because I find that I rarely have the time to
    commit to one. Some days I am motivated enough to read a number of
    chapters in one sitting, whereas on other days I have more
    important things to do.

    What keeps me going is making a commitment to do a little bit each
    day. Ideally I try to read for an hour, but there are many times
    when I will only read for twenty minutes. But, by consistently
    doing something each day, I eventually get through the book and I
    don’t lose track of the narrative.

    Similarly, with my personal project, I am in the habit of doing a
    little programming each day. If I have little or no motivation,
    writing a small unit test or refactoring a small piece of code
    helps me to continue to progress with the project as well as making
    me more familiar with the code in the code base.

    Another thing that has really helped in this regard for me is the
    contribution graph on GitHub. Trying to keep as much of it green
    as possible gives me a sense of achievement. For reading, I update
    my progress on Goodreads. Sorry to all who are on the receiving
    end of the updates!

  3. Stick to the stack

    This has been a huge impediment to me in the past. Sticking to one
    technology to implement a personal project is particularly
    difficult when there are so many languages, frameworks and

    In this sense, the best tool is the one that you are already
    using. Of course you can do a rewrite if absolutely necessary but
    context switching to another language or framework is probably more
    costly than the gains you might get from a more modern set of
    libraries. If I really want to learn the latest and greatest of the
    JavaScript MVC frameworks, I’ll do all that I can to make sure I do it
    on a new project instead of retrofitting it to this one!

    In this way, I can concentrate on becoming better at using this
    framework or that language in a way that I wouldn’t have had I
    switched technologies frequently. Also, having a deeper knowledge
    of one technology may help you to better understand the
    capabilities of another. For example, understanding web development
    using the Spring framework will help you to understand the
    capabilities and underlying mechanisms of Groovy on Grails.

    Using a particular set of technologies for an extended period of
    time will also allow you to write code in a more idiomatic way,
    which makes it both more enjoyable and easier to write. Your
    project will become less of a burden and you will find that you can
    achieve more in a shorter period of time.

Ultimately, to make a personal project succeed, I think it’s
important to have a clear idea set in your mind, ideally to build
something that is going to have a real purpose. While the idea needs
to be clear, the detail should be broken down. Any thought of a
potential feature should be written down before it’s forgotten; it can
then be discarded later if it turns out not to be valuable.

Making sure that the project is worked on often, ideally daily, helps
to keep thoughts flowing and ensures that things stay productive.

Most importantly, not getting sidetracked by other thoughts and ideas,
usually other technologies, curbs the tendency to slow progress by
switching the focus or underlying tools used to complete the project.