Apache Spark is fast becoming the established platform for developing
big data applications, both for batch processing and, more recently,
for processing real-time data with Spark Streaming.
For me, Apache Spark really shines in that it allows you to write
applications to run on a YARN Hadoop cluster with little to no
paradigm shift for developers coming from a functional background.
Conveniently, Spark does have a standalone mode in which it can be run
locally. This can be great for local validation of your code, but I
felt that having a YARN cluster running HDFS would let me keep my code
consistent between development and production environments.
There is a project, Apache Bigtop, which provides a means to deploy
Apache Hadoop/Spark to virtual machines or Docker containers. This is
definitely the avenue I would like to go down in the future, but I
wanted to get an idea of the components of a YARN cluster as well as
come up with a lighter-weight solution.
I therefore set about developing a very simple Vagrantfile with a
number of bash scripts to set up two machines:
An Ubuntu virtual machine to act as a single-node, pseudo-distributed
YARN cluster running HDFS and Hadoop. The scripts and configuration for
this drew much influence from the official Apache Hadoop
documentation as well as an indispensable tutorial
posted by Sain Technology Solutions. Thanks for making it easy for me!
Another Ubuntu virtual machine that simply has Apache Spark
installed on it.
To get started, git clone
https://github.com/hiraethus/vagrant-apache-spark, cd into the
repository and type vagrant up.
If all is well, you should be able to ssh into the Spark instance and
run the interactive Spark REPL, spark-shell.
While this is by no means a best-practice set of deployment scripts,
it proves useful for basic smoke tests before attempting to
interact with a real cluster, which may be inaccessible during
development. For me, the real utility will come when I integrate a web
application with a Spark application that may need to read data from
HDFS independently of Apache Spark.
I would love to hear of any other solutions people have to testing
Spark applications locally that depend on a YARN cluster.
I’ve spent a number of years programming in Java so, during my MSc in
Bioinformatics, it took me a while to become acquainted with the nuances and
idioms of writing code in R. This has been discussed extensively elsewhere,
nowhere better than in John Cook’s lecture R: The Good, The Bad and The Ugly.
While at first I was frustrated with the language, I am starting to become fond
of it, not least because of the increasingly rich tooling (such as RStudio) as
well as the packaging system. Though unrelated to the field of Bioinformatics,
I have started to write some sample R code for pleasure, partly because of the
brevity of the code that I can write. I have been working towards creating a
Shiny web app that can visualise exercise data stored in an XML format that is
validated against an XML schema. You can see the code at
http://github.com/hiraethus/workout.tracker. For this I have been using the
XML package available from CRAN (kindly authored and maintained by Duncan
Temple Lang), which contains a really useful function, xmlToDataFrame, for
converting an XML document into a data frame.
Each record's elements become columns of the resulting data frame, and each
of these columns will be interpreted as strings of characters. The colClasses
argument of the xmlToDataFrame function allows the classes to be specified as
a vector, for instance c("integer", "numeric", "character").
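As a minimal sketch of how this looks in practice (the foobar/foo/bar/baz
element names are taken from the example discussed here; the surrounding
document structure is my assumption, not the workout XML itself):

```r
# Hypothetical document using the foo/bar/baz element names from the text
library(XML)

xml.text <- "<records>
  <foobar><foo>1</foo><bar>2.5</bar><baz>first</baz></foobar>
  <foobar><foo>2</foo><bar>3.7</bar><baz>second</baz></foobar>
</records>"

doc <- xmlParse(xml.text)
df  <- xmlToDataFrame(doc, colClasses = c("integer", "numeric", "character"))
str(df)  # foo comes back as integer, bar as numeric, baz as character
```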
This is great! Unfortunately, each of the foo, bar and baz elements must be
present in at least one of the foobar elements. If we were to assume that this
XML document could optionally have a foobaz element of logical type and we
specified our colClasses vector as c("integer", "numeric", "character",
"logical"), then, if foobaz were not present in our document, xmlToDataFrame
would fail.
The only solution I have come up with to overcome this is to use xmlToDataFrame
without the colClasses argument and then replace each column with another
column coerced to the specified type after the data has been read in from the
XML document. I currently do this in workout.tracker:
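In outline, the workaround looks something like this (a sketch only: the
column names echo the foo/bar/baz example above and the file path is
hypothetical, rather than the actual workout.tracker code):

```r
library(XML)

doc <- xmlParse("workout.xml")  # hypothetical path to the XML document

# Read everything in as character first (stringsAsFactors = FALSE stops the
# columns coming back as factors on older versions of R)
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Map each expected column to a coercion, skipping any that are absent,
# so an optional element such as foobaz no longer causes a failure
coercions <- list(foo = as.integer, bar = as.numeric, foobaz = as.logical)
for (col in intersect(names(coercions), names(df))) {
  df[[col]] <- coercions[[col]](df[[col]])
}
```

Because the loop only touches columns that actually appear in the data frame,
optional elements are handled gracefully instead of tripping up colClasses.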
I am more than happy with the time savings the XML package has provided me in
converting my XML document into a data frame in R. My solution to providing
types to the columns of my data frame, while probably very inefficient, is
ample for the few hundred entries I will have (or not, depending on how well I
keep to my fitness regime).
In the future I will reimplement this application in the Gosu programming
language to show how we can use its type loader
system to statically generate objects directly from the XML using an XSD.
With a glut of free time of late, I have chosen to take some time to
write some code as part of a personal project. Primarily, I wanted to
really dive deeply into the Spring Framework beyond the basics of web
application development. At the same time, I didn’t want the effort to
go to waste and so I have decided that I really want the project to
have a use in the real world.
With this in mind, I have decided to develop an app that allows
learners and speakers of the Welsh language to find
restaurants, coffee shops and (possibly) other kinds of services that
cater to their language needs. It’s a simple idea but, surprisingly,
there is very little in the way of this kind of thing online.
But this is a bit pie in the sky. I’ve had a number of ideas in the
past that have never come to anything. At work I’ve found that
implementing software with another person’s criteria in mind is
usually fairly easy to accomplish but, invariably, I’ve failed
miserably when it comes to putting my own ideas into action. I’ve been
wondering why that is.
I have a fair few ideas about why this is the case. Suffice it to say, here
are some approaches I’ve tried in order to overcome these difficulties
that, so far, seem to be proving successful.
Breaking down the task
When I settle on an idea, if I meditate for a little while on the
totality of all the work to be done, I find the whole thing overwhelming.
Using a physical kanban board helps a lot with this. For more
details on Kanban, see the great Google Talk given by Eric
Brechner. By writing small tasks on little cards and placing
them on a cork board, I can instantly see the subsystems of the
software and the features I have considered.
I’ve taken to placing a pen and a small pile of cards next to the
board so that, the instant I have an idea, I can write it on a card
and place it in the list of pending tasks. As I’m both building the
software and deciding the parameters of what it’s supposed to do, I
can then easily discard any cards for ideas that turn out not to be
worthwhile.
Also, without being tied to a deadline by anyone, it’s
quite easy to arrange the order of the cards for any reason that suits
you. In my case, I try to work on tasks that present parts of the
Spring Framework that I haven’t used before and flip between these
tasks and others that might be easier for me.
Little and often
When committing to read a book, I will never hold myself to a
deadline. More often than not this is because I find that I rarely
have the time to commit to it. Some days I am far more motivated and
can read a number of chapters in one sitting, whereas
other days I have more important things to do.
What keeps me going is making the commitment to do a little bit each
day. Ideally I try to read for an hour, but there are many times
when I will only read for twenty minutes. Still, by consistently
doing something each day, I eventually get through the book and I
don’t lose track of the narrative.
Similarly, with my personal project, I am in the habit of doing a
little programming each day. If I have little or no motivation,
writing a small unit test or refactoring a small piece of code
helps me to continue to progress with the project as well as making
me more familiar with the code base.
Another thing that has really helped me in this regard is the
contribution graph on GitHub. Trying to keep as much of it green as
possible gives me a sense of achievement. For reading, I update my
progress on Goodreads. Sorry to all who are on the receiving end of
those updates!
Stick to the stack
This has been a huge impediment to me in the past. Sticking to one
technology to implement a personal project is particularly
difficult when there are so many languages, frameworks and
libraries to choose from.
In this sense, the best tool is the one that you are already
using. Of course you can do a rewrite if absolutely necessary, but
context switching to another language or framework is probably more
costly than the gains you might get from a more modern set of
libraries. If I really want to learn the latest and greatest
technology, I can always try it out on a new project instead of
retrofitting it to this one!
In this way, I can concentrate on becoming better at using a
framework or language in a way that I wouldn’t have had I
switched technologies frequently. Also, having a deeper knowledge
of one technology may help you to better understand the
capabilities of another. For example, understanding web development
using the Spring Framework will help you to understand the
capabilities and underlying mechanisms of Grails.
Using a particular set of technologies for an extended period of
time will also allow you to write code in a more idiomatic way,
which makes it more enjoyable and easier to write. Your
project will become less of a burden and you will find that you can
achieve more in a shorter period of time.
Ultimately, to make a personal project succeed, I think it’s
important to have a clear idea set in your mind, ideally for
something that is going to have a real purpose. While the idea needs
to be clear, the detail should be broken down. Any thought of a
potential feature should be written down before it’s forgotten; it can
then be discarded later if it turns out not to be valuable.
Making sure that the project is worked on often, ideally daily, helps
to keep thoughts flowing and ensures that things stay productive.
Most importantly, not getting sidetracked by other thoughts and ideas,
usually other technologies, curbs the tendency to slow progress by
switching the focus or the underlying tools used to complete the project.