
Apache Sparked!!

From around the autumn/fall of 2015, I went through some serious soul searching. Since around 2012 we had been using a well-established distributed data processing technology: MapReduce. Honestly, we were using it, but it took a lot of manpower to keep it running. I would describe MapReduce as being like a 1970s Porsche 911. It's fast and it does the job, but by heaven the engine is in the wrong place, and if you get it wrong, into the country hedge you go in a frightening tailspin.

The experienced technologists at my company weren't too keen on looking at alternatives. I am being completely frank here. They knew how to make it work. Like drivers of that 1970s Porsche 911, they knew when to come off the throttle and onto the brake, slow into corners and fast out. I could go on about racing metaphors. Yes, I am a Porsche enthusiast.

The rookies weren't keen at all. They would much prefer the latest 2015 Porsche 911. They wanted easy-to-use APIs, fast setup and easy maintenance. To keep the sports car metaphor going, these rookies wanted traction control, GPS navigation, leather, an iPhone dock, Bluetooth, the works!

I had been hearing about Apache Spark, an alternative to MapReduce that promised greater ease of use, easier installation, better performance and more flexibility.

Honestly, it sounded too good to be true. It really did! MapReduce, developed at Google in the early 2000s and widely adopted by 2008, had done the rounds. It had been fighting hard since then and had a strong following in many companies.

I began by asking around. I canvassed opinions from people working with large data sets who needed a cluster programming solution. What I got was suspicion of new technology.

Finally we spoke with a few contacts working in Big Data in Silicon Valley. They said that Apache Spark was the new disruptive kid on the block and was packing quite a punch.

Apache Spark was developed at the University of California, Berkeley, as a response to shortcomings in cluster computing frameworks such as MapReduce and Dryad.

So we had a number of conflicting opinions. What did we do?

We went ahead and tried it.

I have to say, we were not disappointed.

It was easy to install, and to our surprise we found it was compatible with our existing Hadoop Distributed File System (HDFS). To our joy it also supported Amazon S3 and our beloved Cassandra NoSQL database. Cassandra is a subject for another blog!

Anyway, the above list of compatibilities came to our attention as soon as we started to get our hands dirty. The other pleasant surprise was the support for Java and Maven.

However, the real surprise came when we started using Apache Spark…

The first thing we noticed was how easy it was to manage and process the data. Apache Spark's principal programming abstraction is the Resilient Distributed Dataset (RDD). Imagine, if you will, an array! That's essentially how an RDD appears to a programmer. What happens underneath is really interesting. The Apache Spark engine takes responsibility for processing the RDD across a cluster. The engine takes care of everything, so all you need to worry about is building and processing your RDD.
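
A minimal sketch of that idea in Java (the language we work in), assuming a hypothetical local setup rather than our actual cluster: you hand Spark a plain collection, work on it as if it were an array, and the engine takes care of distributing the processing.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        // "local[*]" runs on one machine for the sketch; on a cluster this would point at the master.
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // To the programmer this looks like an array...
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // ...transformations are declared lazily...
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        // ...and the engine only does the distributed work when an action is called.
        System.out.println(doubled.collect()); // [2, 4, 6, 8, 10]

        sc.stop();
    }
}
```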

We began our work with text documents. What we do as a company is to perform natural language processing on textual content. Very often this means parsing the document and then processing each line.

So in keeping with that we decided to create a simple text processing application with Apache Spark.

We installed Apache Spark on an Amazon cloud instance. We began by creating an application that loaded an entire text document into an RDD and applied a search algorithm to recover specific lines of text. It was a very simple test, but it was indicative of how easy the Apache Spark API is to use.
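
A stripped-down sketch of that first test might look like the following; the file location and search term are hypothetical placeholders, not our actual data.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineSearch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LineSearch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the whole document into an RDD, one element per line.
        // The path is a placeholder; it could equally be an HDFS or S3 location.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/sample-document.txt");

        // "Search algorithm" kept deliberately simple: keep lines containing a term.
        JavaRDD<String> matches = lines.filter(line -> line.contains("analytics"));

        System.out.println("Matching lines: " + matches.count());
        matches.take(10).forEach(System.out::println);

        sc.stop();
    }
}
```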

We also noticed that, using the same RDD concept, we could work with real-time streams of data. There was support for machine learning libraries too.
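
To give a feel for the streaming side, here is a rough sketch assuming a hypothetical text source on localhost port 9999: Spark Streaming slices the incoming lines into small batches, and each batch is handled with the same RDD-style operations as before.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamSketch").setMaster("local[2]");

        // Incoming data is grouped into five-second micro-batches, each exposed as an RDD of lines.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical source: lines of text arriving on a socket.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // The same filter we used on the static document, now applied to a live stream.
        JavaDStream<String> matches = lines.filter(line -> line.contains("analytics"));
        matches.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```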

From the start of our investigation to the present, we have become more and more convinced that Apache Spark was the right choice for us. Even our experienced MapReduce people have been converted, because it's easy to use, fast and has a lot more useful features.

Beyond the added features offered by Apache Spark, what struck me was the ability to operate in real-time.

In our view this presents an opportunity to move away from large-scale batch processing of historical data towards a new paradigm where we engage with our data like never before.

I for one am very keen to see what insights we can learn from this shift.

Finance, Health and the Watercooler

Have you ever found that some of the most confounding questions and off-the-wall answers come to you at the water cooler?

I had been reading an article in the paper on the rising cost of running the UK's primary health care provider, the National Health Service (NHS).

The NHS was founded in 1948, after World War II, by the then Labour Government.

It is an institution that has weathered many storms during its history and continues to be reported on. Quite frankly, the issues around Obamacare pale in comparison when considering how to keep this British national institution running well into the 21st century.

Now I must say, I am a technology entrepreneur, not a doctor. I am a lay person, who has been listening to the debate since he was a child.

The NHS is an incredibly large organisation. It serves as the primary health care provider for the majority of the population of the United Kingdom. And it serves the needs of the individual from birth to death.

The information generated by this institution is staggering. We talk about Big Data. Well, around the NHS we are talking about Big Data in supersize quantities in terms of volume, variety, velocity and volatility.

To give some context to healthcare big data:

When we talk about volume, we imagine health care records for every patient, kept updated throughout that patient's life. We can also mean all the documents and invoices generated by a hospital or other healthcare centers. Truly, the list is endless.

So we now have our second term, variety. We have a wide range of data within our system. We never know just what we may need, so for safety's sake we store as wide a variety as possible.

The third term is velocity. Information is coming to us from doctors, nurses, suppliers, medical instruments – once more it's endless and it's coming to us at varying speeds.

For example, a patient's biannual checkup means two updates a year to their healthcare record.

If we look at a hospital purchase ledger, that could be updated daily. If we look at a hospital patient, their notes could be updated hourly. If a patient is in intensive care, then we could be looking at data coming to us in real time, which would mean high velocity.

The last is volatility – this is where we track decisions that have been made. Maybe a different course of treatment was taken after a second opinion.

So that's the potential nature of health care big data, but what does it mean to the NHS?

The critical areas of concern for the NHS are the delivery, effectiveness and cost of health care. In my humble opinion, an effective big data initiative could help greatly in addressing these concerns.

Now I am an engineer not a doctor, but even I can see that it’s all about the patient.

In the data context, it's all centered on the patient's health care record. At the inception of the NHS this would have been on paper; by now, though, 66 years on, it should be electronic.

Even better, however, would be a health care record in a format that facilitates exchange between systems. Whether such a health care record or format exists is not in the scope of this blog.

Here we are in the art of the possible…

Just as we can exchange financial information between financial IT systems (we call it XBRL), we have a counterpart in health care called HL7-CDA (Health Level 7 Clinical Document Architecture).

HL7-CDA, like XBRL, is based on the eXtensible Markup Language (XML).

The goal of CDA is to specify the syntax of, and supply a framework for, the full semantics of a clinical document. It defines a clinical document as having the following six characteristics:

  • Persistence
  • Stewardship
  • Potential for authentication
  • Context
  • Wholeness
  • Human readability

Now let’s play a game of what if?

What if we could use HL7-CDA as a means to encompass the medical record of a patient?

What if we could map the records of medical care provided to the patient, outcome of that care and cost of care?

What if we could report on health care cost using XBRL and link this to the HL7-CDA document?

Well I suppose we could if we used XML databases, but what would be the point?
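
To make the "what if" a little more concrete, here is a small illustrative sketch in Java. The file name and element names are invented for the example (real CDA uses namespaces and a richer structure); the point is simply that a clinical document can be queried with standard XPath, and the extracted codes could then be joined to cost figures reported in XBRL.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CdaCostSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical CDA-style document for one patient.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // simplification for the sketch; real CDA documents are namespaced
        Document doc = dbf.newDocumentBuilder().parse("patient-record.xml");

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Illustrative XPath only; the real CDA element names and nesting differ.
        NodeList procedures = (NodeList) xpath.evaluate(
                "//procedure/code/@displayName", doc, XPathConstants.NODESET);

        for (int i = 0; i < procedures.getLength(); i++) {
            String procedureName = procedures.item(i).getNodeValue();
            // This is where a cost figure from an XBRL-based report, keyed on the same
            // procedure code, could be joined in -- the linkage this post is speculating about.
            System.out.println("Procedure recorded: " + procedureName);
        }
    }
}
```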

Maybe with a synthesis of different XML datasets, taxonomies, predictive analytics and visualization we could have a go at answering these questions:

  •  How did the hospital meet the care requirements for the patient?
  • What costs were incurred by the hospital?
  • What were the care outcomes?
  • Where is the cost efficiency for the hospital, if any?
  • What drugs/treatments were used?
  • What was the cost in developing those drugs/treatments?

Now, for countries without a state-funded health service, we could also be asking:

  • What is the variation in health care insurance premiums?
  • Was the patient covered adequately by their health insurance?

Clearly there are a lot of questions here, and they are ones to be answered by a powerful health analytics system with a serious architecture!

Watch this space…

Thank you for reading.

The Perfect System

I am continuing the Tron:Legacy metaphor.

In this movie the arch bad guy, an Artificial Intelligence (AI), horribly misinterprets the requirements of his maker.

His maker said, “Go forth and build me the perfect system. BTW, could you also manage all the other AIs while you are at it? I am going away to contemplate my navel.”

The bad guy did not start out as bad. He had the best of intentions. In his mind the perfect system was efficient and orderly. He manifested his interpretation of his maker’s will precisely but it was at the expense of creativity and openness.

In the end there was a revolt and the bad guy was overthrown.

I have been thinking, “What lessons can we learn for our world?”

Apart from being careful about our requirements, picking a decent project management team and, for goodness' sake, monitoring project progress, what else?

It’s an ever changing world.

The systems we build need to adapt. Now of course there is the factor of obsolescence in any technology we use. That’s life!

However we can allow for this in our Enterprise Architecture.

So how have we addressed this at Apurba?

At Apurba we work with data, lots of data, from the big to the small and almost anywhere in between. We work in financials and so we spend a lot of time with eXtensible Business Reporting Language (XBRL).

We work in construction, energy and transport (more XBRL).

We work in health care with HL7 and CDA.

We work with standard XML, RDBMS and other disparate sources of data.

It could make us quite desperate, but it doesn't, because we have a dynamic system architecture. This week we talked about this architecture at BigDataScience at Stanford.

It is realized by allowing new components to be added and old ones removed dynamically.

This principle is embodied in our flagship Lyticas family of analytics products.

Our clients can mix and match the services they need to meet their data analytics goals.

This can be an evolving process, and our clients don't need to build their analytics capability with Apurba products alone; our architecture allows products from other vendors to be added too.

So how do we do this specifically?

To get a bit more technical, we separate our data-driven and event-driven components. We then set up a mechanism for communication between the components and services in our architecture.
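
As a minimal illustration of that communication mechanism (a sketch only, not our actual Lyticas code), imagine a tiny in-process event bus: components publish and subscribe by topic name, so a new service can be plugged in, or an old one retired, without the others ever knowing.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// A toy in-process event bus: event-driven components never call each other directly,
// they only agree on topic names, which is what lets parts be added or removed dynamically.
public class EventBus {
    private final Map<String, List<Consumer<Object>>> subscribers = new ConcurrentHashMap<>();

    public void subscribe(String topic, Consumer<Object> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String topic, Object event) {
        subscribers.getOrDefault(topic, Collections.emptyList())
                   .forEach(handler -> handler.accept(event));
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // A hypothetical analytics component listens for newly parsed documents...
        bus.subscribe("document.parsed",
                doc -> System.out.println("Analytics service received: " + doc));

        // ...while a separate ingestion component only knows the topic name, not who is listening.
        bus.publish("document.parsed", "quarterly-report.xbrl");
    }
}
```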

The other principle we live by is to use the appropriate methodology for the scale of systems we have been entrusted to build.

There is no point going through an intense Enterprise Architecture methodology for developing a single web service.

We use Agile for the small; however, it does help us to know the Enterprise Architecture landscape of our client's environment to craft the optimal solution.

Now I am going to stop here before I get too heavy. From my blogs you may gather I like methodology. I like it a lot. I like to learn from the experiences of others as well as my own screw ups & successes. However it must all be used in the right context and at the right time.

So what’s the thinking?

To seek perfect systems where every requirement is met perfectly for all time is folly.

To ensure our solutions can cope with change is smart.

To achieve the latter, one needs to accept that things will change and build the architecture to cope.

Thanks for reading.
