From around the autumn of 2015, I went through some serious soul searching. Since around 2012, we had been using a well-established distributed data processing technology: MapReduce. Honestly, we were using it, but it took a lot of manpower to keep it running. I would describe MapReduce as a 1970s Porsche 911. It’s fast and it does the job, but by heaven the engine is in the wrong place, and if you get it wrong you go into the country hedge in a frightening tailspin.
The experienced technologists at my company weren’t too keen on looking at alternatives. I am being completely frank here. They knew how to make it work. Like driving that 1970s Porsche 911, they knew when to come off the throttle and onto the brake, slow into corners and fast out. I could go on with racing metaphors. Yes, I am a Porsche enthusiast.
The rookies weren’t keen on MapReduce at all. They would much prefer the latest 2015 Porsche 911. They wanted easy-to-use APIs and fast setup and maintenance. To keep the sports car metaphor going, these rookies wanted traction control, GPS navigation, leather, an iPhone dock, Bluetooth, the works!
I had been hearing about Apache Spark, an alternative to MapReduce that would offer greater ease of use, easier installation, better performance and more flexibility.
Honestly, it sounded too good to be true. It really did! MapReduce, which Google described publicly in 2004 and which was widely adopted from around 2008, had done the rounds. It had been fighting hard since then and has a strong following in many companies.
I began by asking around. I canvassed opinions from people working with larger data sets who needed a cluster programming solution. What I got was suspicion of new technology.
Finally we spoke with a few contacts working in Big Data in Silicon Valley. They said that Apache Spark was the new disruptive kid on the block and was packing quite a punch.
Apache Spark was developed at the University of California, Berkeley, as a response to shortcomings in cluster computing frameworks such as MapReduce and Dryad.
So we had a number of conflicting opinions. What did we do?
We went ahead and tried it.
I have to say, we were not disappointed.
It was easy to install and, to our surprise, we found it was compatible with our existing Hadoop Distributed File System (HDFS). To our joy, it also supported Amazon S3 and our beloved Cassandra NoSQL database. The subject of Cassandra is for another blog!
These compatibilities came to our attention as soon as we started to get our hands dirty. Another plus was support for Java and Maven.
However, the real surprise came when we started using Apache Spark...
The first thing we noticed was how easy it was to manage and process the data. Apache Spark’s principal programming abstraction is the Resilient Distributed Dataset (RDD). Imagine, if you will, an array! That’s essentially how an RDD appears to a programmer. What happens underneath is really interesting. The Apache Spark engine takes responsibility for processing the RDD across a cluster. The engine takes care of the distribution, so all you need to worry about is building and processing your RDD.
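To give a flavour of the "array" feel, here is a minimal sketch in Python using Spark's Python API. It assumes a SparkContext is already available as sc (the pyspark shell creates one for you). The names square and sum_of_squares and the choice of four partitions are purely illustrative and are not taken from our own code.

```python
def square(x):
    """Plain Python function; Spark applies it to each element of the RDD."""
    return x * x


def sum_of_squares(sc, numbers, num_slices=4):
    """Distribute `numbers` as an RDD and add up the squares.

    `sc` is assumed to be an existing pyspark.SparkContext.
    `num_slices` (illustrative default) is the number of partitions.
    """
    # parallelize() turns a local collection into an RDD split into partitions
    rdd = sc.parallelize(numbers, num_slices)
    # map() is lazy; reduce() is the action that triggers the distributed work
    return rdd.map(square).reduce(lambda a, b: a + b)
```

In the pyspark shell, sum_of_squares(sc, range(1, 5)) should return 30. The code reads much like working on an ordinary list, while the engine decides which node handles which partition.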
We began our work with text documents. What we do as a company is to perform natural language processing on textual content. Very often this means parsing the document and then processing each line.
So in keeping with that we decided to create a simple text processing application with Apache Spark.
We installed Apache Spark on an Amazon cloud instance. We began by creating an application to load an entire text document into an RDD and apply a search algorithm to recover specific lines of text. It was a very simple test, but it showed how easy the Apache Spark API was to use.
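A test along those lines might look something like the sketch below. This is not our actual application: it is written in Python for brevity, the script name, file name and search term are hypothetical, and a simple case-insensitive substring match stands in for the search algorithm.

```python
import sys


def line_matches(line, term):
    """Case-insensitive substring test, standing in for a real search algorithm."""
    return term.lower() in line.lower()


def search_lines(sc, path, term):
    """Load the text file at `path` into an RDD and return the matching lines.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    # textFile() yields an RDD with one element per line of the document
    lines = sc.textFile(path)
    # filter() is evaluated lazily; collect() brings the results back to the driver
    return lines.filter(lambda line: line_matches(line, term)).collect()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: spark-submit search_lines.py document.txt spark
    from pyspark import SparkContext

    sc = SparkContext(appName="LineSearch")
    for match in search_lines(sc, sys.argv[1], sys.argv[2]):
        print(match)
    sc.stop()
```

Because textFile() accepts local paths as well as HDFS and S3 URIs, the same function should work whichever of those stores holds the document. For a very large result set, writing the filtered RDD out with saveAsTextFile() would be safer than calling collect().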
We also noticed that, using the same RDD concept, we could work with real-time streams of data. There was support for machine learning libraries too.
From the start of our investigation to the present, we have become more and more convinced that Apache Spark was the right choice for us. Even our experienced MapReduce people have been converted, because it’s easy to use, it’s fast and it has a lot more useful features.
Beyond the added features offered by Apache Spark, what struck me was the ability to operate in real time.
In our view, this presents an opportunity to move away from large-scale batch processing of historical data towards a new paradigm in which we engage with our data like never before.
I for one am very keen to see what insights we can learn from this shift.