Perspectives – The Data Dive

From around the autumn of 2015, I went through some serious soul searching. Since around 2012, we had been using a well-established distributed data processing technology: MapReduce. Honestly, we were using it, but it took a lot of manpower to keep it running. I would describe MapReduce as a 1970s Porsche 911. It’s fast and it does the job, but by heaven the engine is in the wrong place, and if you get it wrong, into the country hedge you go in a frightening tailspin.

The experienced technologists at my company weren’t too keen on looking at alternatives. I am being completely frank here. They knew how to make it work. Like driving that 1970s Porsche 911, they knew when to come off the throttle and onto the brake, slow into corners and fast out. I could go on with racing metaphors. Yes, I am a Porsche enthusiast.

The rookies weren’t keen on MapReduce at all. They would much prefer the latest 2015 Porsche 911. They wanted easy-to-use APIs, fast setup and low maintenance. Keeping with the sports car metaphors, these rookies wanted traction control, GPS navigation, leather, an iPhone dock, Bluetooth, the works!

I had been hearing about Apache Spark, an alternative to MapReduce that promised greater ease of use, simpler installation, better performance and more flexibility.

Honestly, it sounded too good to be true. It really did! MapReduce, which Google described publicly in a 2004 paper and which was widely adopted through Hadoop from around 2008, had done the rounds. It had been fighting hard since then and has a strong following in many companies.

I began by asking around. I canvassed opinions from people who work with large data sets and need a cluster programming solution. What I got was suspicion of new technology.

Finally we spoke with a few contacts working in Big Data in Silicon Valley. They said that Apache Spark was the new disruptive kid on the block and was packing quite a punch.

Apache Spark was developed at the University of California, Berkeley, as a response to shortcomings in cluster computing frameworks such as MapReduce and Dryad.

So here we had a number of conflicting opinions. What did we do?

We went ahead and tried it.

I have to say, we were not disappointed.

It was easy to install and, to our surprise, compatible with our existing Hadoop Distributed File System (HDFS). To our joy, it also supported Amazon S3 and our beloved Cassandra NoSQL database. The subject of Cassandra is for another blog!

These compatibilities came to our attention as soon as we started to get our hands dirty. The other thing we noticed right away was the support for Java and Maven.

However, the real surprise came when we started using Apache Spark...

The first thing we noticed was how easy it was to manage and process the data. Apache Spark’s principal programming abstraction is the Resilient Distributed Dataset (RDD). Imagine, if you will, an array! That’s essentially how an RDD appears to a programmer. What happens underneath is really interesting. The Apache Spark engine takes responsibility for splitting the RDD into partitions and processing them across a cluster. The engine takes care of the distribution so that all you need to worry about is building and processing your RDD.
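To make the "it looks like an array, but the engine spreads the work" idea concrete, here is a toy sketch in plain Python. It is not Spark and not how Spark is implemented: it runs on one machine, uses threads in place of cluster nodes, and the names `split_into_partitions` and `parallel_map` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional item.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def parallel_map(func, items, num_partitions=4):
    """Apply func to every item, one worker per partition, keeping order."""
    partitions = split_into_partitions(items, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker processes a whole partition, not a single item.
        results = list(pool.map(lambda part: [func(x) for x in part], partitions))
    # Stitch the partition results back into one flat list.
    return [value for part in results for value in part]


if __name__ == "__main__":
    print(parallel_map(lambda x: x * x, [1, 2, 3, 4, 5], num_partitions=2))
```

The caller only ever sees a list going in and a list coming out. The splitting and the workers stay hidden, which is roughly the experience an RDD gives you, with a real cluster and fault tolerance underneath instead of a thread pool.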

We began our work with text documents. What we do as a company is to perform natural language processing on textual content. Very often this means parsing the document and then processing each line.

So in keeping with that we decided to create a simple text processing application with Apache Spark.

We installed Apache Spark on an Amazon Cloud Instance. We began with creating an application to load an entire text document into an RDD and apply a search algorithm to recover specific lines of text. It was a very simple test, but it was indicative of how easy the Apache Spark API was to use.
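A test like this can be sketched in a few lines of PySpark. This is not our original application. It assumes PySpark is installed, runs Spark in local mode, and uses a simple case-insensitive substring match as a stand-in for the search algorithm. The function names are illustrative.

```python
def line_matches(line, term):
    """Case-insensitive substring test applied to a single line of text."""
    return term.lower() in line.lower()


def search_lines_locally(lines, term):
    """Plain-Python version of the search, useful for checking the predicate."""
    return [line for line in lines if line_matches(line, term)]


def search_lines_with_spark(path, term):
    """Load a text file into an RDD (one element per line) and keep the matches.

    Assumption: PySpark is available. It is imported inside the function so
    the rest of this file still runs on a machine without Spark.
    """
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="line-search-sketch")
    try:
        lines = sc.textFile(path)  # RDD of strings, one per line
        matches = lines.filter(lambda line: line_matches(line, term))
        return matches.collect()   # bring the matching lines back to the driver
    finally:
        sc.stop()
```

The same path argument can be a local file, an HDFS location or an S3 location, provided the cluster is configured for it. Calling `collect()` is fine for a handful of matching lines, but for large results you would normally write the RDD back out to storage instead.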

We also noticed that, using the same RDD concept, we could work with real-time streams of data. There was support for machine learning libraries too.

From the start of our investigation to the present, we have become more and more convinced that Apache Spark is the right choice for us. Even our experienced MapReduce people have been converted, because it’s easy to use, fast and has a lot more useful features.

Beyond the added features offered by Apache Spark, what struck me was the ability to operate in real-time.

In our view, this presents an opportunity to move away from large-scale batch processing of historical data and towards a new paradigm where we engage with our data like never before.

I for one am very keen to see what insights we can learn from this shift.

By now the world has come to know of the passing of actor Leonard Nimoy, who inspired generations of souls to pursue a career in science and engineering, through his role in Star Trek.

Many of us at Apurba Technologies can cite Star Trek as the inspiration for first being interested in science as children. We carried this fire through school, then into studying science and engineering in college, and finally brought this energy into our professional lives.

I can tell you now. This fire of inspiration has been lit in a new generation as I write this blog.

For those of you who never watched the show: which planet are you living on? It’s certainly not in the Alpha, Beta, Gamma or Delta quadrants of this galaxy!

Amongst the show’s themes were exploration and the need for tolerance in the face of difference, along with the idea that mankind in the future would learn to appreciate one another and go out and explore the stars.

In Star Trek, the intellectual scientist was the hero, not the sporty jock. And that is precisely what resonated with all those young scientists and geeks at school. Star Trek showed them a future where they could make a difference, where they could take center stage.

And the principal scientist hero in Star Trek was Spock, played so very wonderfully by the late, great Leonard Nimoy!

I wasn’t even born when Star Trek came off the air. However, when it was shown on Thursday nights on BBC2, my father (a scientist) and I tuned in. I was absolutely mesmerized every week.

That was during my childhood, and then in my teens The Next Generation arrived. However, the topic of this blog is that first spark of inspiration given by the original series.

It was that first spark that fired the imagination of all those scientists and engineers to literally make life imitate art.

We see it every day in our mobile devices, voice recognition systems and user interfaces. The list is endless.

The title of this blog is “The Undiscovered Country”, the name of a Star Trek movie. This movie in turn makes reference to Shakespeare’s Hamlet.

“The undiscovered country from whose bourn no traveller returns.”

Shakespeare is referring to those metaphysical questions of what happens to us after death. In the Star Trek movie “The Undiscovered Country”, the reference is re-purposed to mean what the future may hold if only we have the courage and wisdom to shape it. For it is this future that is the undiscovered country.

And with this final word we at Apurba say a fond farewell to Leonard Nimoy and continue our journey to that undiscovered country!

To the future.

RIP Leonard Nimoy

LLAP everyone.
