The opportunities and challenges of using Natural Language Processing in enriching Electronic Health Records

The use of Electronic Health Records (EHR) is increasing in primary care practices, driven in part in the United States by the Health Information Technology for Economic and Clinical Health Act. In 2011, 55% of all physicians and 68% of family physicians were using an EHR system. In 2013, 78% of office-based physicians reported having adopted an EHR system. EHRs can, however, be a source of frustration for physicians. A 2012 survey of family physicians revealed that only 38% were highly satisfied with their EHR. Among the barriers to EHR adoption and satisfaction are issues with usability, readability, loss of efficiency and productivity, and divergent stakeholder information needs, all of which must be served by a single, constrained form factor.

Cost of EHR Systems

The American Recovery and Reinvestment Act incentivizes expanding the meaningful use of electronic health record systems, but this comes at a cost. A recent study reported the cost of implementing an electronic health record system in twenty-six primary care practices in a physician network in North Texas, taking into account hardware and software costs as well as the time and effort invested in implementation. For an average five-physician practice, the implementation cost is estimated to be around USD 162,000, with USD 85,500 in maintenance expenses during the first year. It is also estimated that the HealthTexas network implementation team and the practice implementation teams needed an average of 611 hours to prepare for and implement the electronic health record system, and that the end users (physicians, other clinical staff, and nonclinical staff) needed 134 hours per physician, on average, to prepare for use of the record system in clinical encounters.

The Opportunity

This clearly has opened up an opportunity to innovate. Despite slower-than-expected growth, the global market for EHR is estimated to have reached USD 22.3 billion by the end of 2015, with the North American market projected to account for USD 10.1 billion or 47%, according to research released by Accenture (NYSE:ACN).

Although the worldwide EHR market is projected to grow at 5.5% annually through 2015, Accenture’s previous research shows that would represent a slowdown from roughly 9% growth during 2010. Despite the slower pace of growth globally, the combined EHR market in North and South America is expected to have reached USD 11.1 billion by the end of 2015, compared to an estimated USD 4 billion in the Asia Pacific region and USD 7.1 billion in Europe, the Middle East and Africa.

The Challenge

EHRs have the potential to improve outcomes and quality of care, yield cost savings, and increase patients' engagement with their own healthcare. When successfully integrated into clinical practice, EHRs automate and streamline clinician workflows, narrowing the gap between information and action that can result in delayed or inadequate care. Although there is evolving evidence that EHRs can modestly improve clinical outcomes, one fundamental problem is that EHR systems were principally designed to support the transactional needs of administrators and billers, and less so to nurture the relationship between patients and their providers. Nowhere is this more apparent than in the ability of EHRs to handle unstructured, free-text data of the sort found in the history of present illness (HPI). Current EHR systems are not designed to capture the nature of the HPI, an open-ended interview eliciting patient input, and instead summarize the information as free text within the patient record. There is a huge untapped opportunity to innovate by exploiting the HPI to execute care plans and to document a foundational reference for subsequent encounters. In addition, an AI-driven automated system could transform the HPI into more structured data linked to payment and reimbursement, replacing the current manual model that relies on clinical coding specialists.

“Although the market is growing, the ability of healthcare leaders to achieve sustained outcomes and proven returns on their investments pose a significant challenge to the adoption of electronic health records,” said Kaveh Safavi, global managing director of Accenture Health. “However, as market needs continue to change, we’re beginning to see innovative solutions emerge that can better adapt and scale electronic health records to meet the needs of specific patient populations as well as the business needs of health systems.”

In summary, with the adoption of EHRs came the challenge of data: finding the right information quickly and efficiently. In a typical 5-day hospital stay, many doctors and nurses work on the same patient and create a huge amount of overlapping data, and by the 3rd day it becomes almost impossible to get a clear picture of what is happening with the patient. The traditional EHR model is not effective in this setting.

Ripe for Innovation

One solution to the problem is to use human-augmented machine learning to generate an insightful, patient-specific narrative, especially in the case of in-patient encounters, to simplify all of this data. Such a system would use Natural Language Processing (NLP) to process the free-format text ("unstructured data") stored within patient notes and aggregate it with the information located within the various tables and charts (the "structured data"). In a way, this is in line with the overall trend in work automation, the use of innovative technologies to facilitate the transition from paper-based records to electronic records, specifically for healthcare providers. NLP is one technology that can fundamentally change the way we interact with patient records and help improve clinical outcomes.

Let's look a little more closely at the data captured within an EHR system. Within the EHR, data is captured in one of four ways: entering data directly (including templates); scanning documents; transcribing text reports created with dictation or speech recognition; and interfacing data from other information systems such as laboratory systems, radiology systems, blood pressure monitors, or electrocardiographs. This captured data, in turn, can be represented in either structured or unstructured forms. Structured data is, by definition, created through constrained choices in the form of data entry devices such as drop-down menus, check boxes, and pre-filled templates. This type of data format has obvious advantages: it is easily searched, aggregated, analyzed, reported, and linked to other information resources. But it suffers from data compression and, more importantly, loss of context, making it unsuitable for individualization of the EHR and too fragmented for the kind of intelligent, holistic treatment that is possible with unstructured data.

Unstructured clinical data, on the other hand, exists in the form of free-text narratives. Provider and patient encounters are commonly recorded in free-form clinical notes. Free-text entries in the patient's health record give the provider flexibility to note observations and concepts that are not supported or anticipated by the constrained choices associated with structured data. It is important to note that some data are inherently suitable for a structured format, while others are not. NLP can be a powerful tool in achieving this balance: some portions of unstructured text narratives can be transformed into structured data, while other data remain free-format text enriched with derived annotations and semantic analytics, allowing the EHR data to model real-life situations more closely.
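
As a toy illustration of that balance, here is a minimal sketch of a rule-based pass that lifts a few structured fields (vital signs, expanded abbreviations) out of a free-text note while leaving the rest as annotated narrative. The note text, abbreviation list and field names are made-up assumptions for illustration, not any particular vendor's data model.

```python
import re

# Hypothetical fragment of a free-text HPI note.
note = "Pt c/o chest pain x 3 days. Denies fever. BP 142/90, HR 88. Hx of HTN."

# A small, illustrative abbreviation dictionary (not a real clinical vocabulary).
abbreviations = {"c/o": "complains of", "hx": "history", "htn": "hypertension", "pt": "patient"}

def extract_structured(note_text):
    """Pull a few structured fields out of free text; keep the rest as annotated narrative."""
    record = {}
    # Vital signs follow fairly regular patterns, so simple rules can structure them.
    bp = re.search(r"BP\s*(\d{2,3})/(\d{2,3})", note_text)
    if bp:
        record["systolic_bp"], record["diastolic_bp"] = int(bp.group(1)), int(bp.group(2))
    hr = re.search(r"HR\s*(\d{2,3})", note_text)
    if hr:
        record["heart_rate"] = int(hr.group(1))
    # Expand known abbreviations as lightweight annotations on the remaining narrative.
    annotated = note_text
    for abbr, meaning in abbreviations.items():
        annotated = re.sub(rf"\b{re.escape(abbr)}\b", f"{abbr} [{meaning}]",
                           annotated, flags=re.IGNORECASE)
    record["narrative"] = annotated
    return record

print(extract_structured(note))
```

The point is not the regexes themselves but the shape of the output: a few searchable fields plus a narrative that keeps its original context.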

Not the Silver Bullet

NLP is not a silver bullet, and clinical text poses significant challenges for it. This text is often ungrammatical, consists of bullet-point, telegraphic phrases with limited context, and lacks complete sentences. Clinical notes make heavy use of acronyms and abbreviations, making them highly ambiguous. Word sense disambiguation also poses a challenge in extracting meaningful data from unstructured text. Clinical notes often contain terms or phrases that have more than one meaning. For example, discharge can signify either bodily excretion or release from a hospital; cold can refer to a disease, a temperature sensation, or an environmental condition. Similarly, the abbreviation MD can be interpreted as the credential for "Doctor of Medicine" or as an abbreviation for "mental disorder." This underscores the need to understand and model the context more closely, and NLP practitioners are working towards practical solutions to these challenges.
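
To make the disambiguation problem concrete, here is a deliberately naive sketch that picks a sense of "discharge" by counting context cue words. The cue lists are invented for illustration; real clinical word sense disambiguation relies on much richer context and trained models.

```python
# Toy word-sense disambiguation: score each candidate sense of "discharge"
# by counting context words associated with it. Cue lists are illustrative only.
SENSE_CUES = {
    "bodily_excretion": {"wound", "purulent", "drainage", "fluid", "nasal"},
    "hospital_release": {"home", "instructions", "follow-up", "discharged", "admission"},
}

def disambiguate(sentence, target="discharge"):
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    if target not in tokens:
        return None
    # Pick the sense whose cue words overlap most with the sentence.
    scores = {sense: len(cues & tokens) for sense, cues in SENSE_CUES.items()}
    return max(scores, key=scores.get)

print(disambiguate("Purulent discharge noted at the wound site."))          # bodily_excretion
print(disambiguate("Patient given discharge instructions and sent home."))  # hospital_release
```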

One such solution is the standardization of medical language through the Unified Medical Language System (UMLS). The UMLS is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. We can use the UMLS to enhance or develop applications such as electronic health records, classification tools, dictionaries, and language translators. Specifically, the UMLS Metathesaurus, a repository of over 100 biomedical vocabularies including CPT®, ICD-10-CM, LOINC®, MeSH®, RxNorm, and SNOMED CT®, is an excellent tool for standardizing this variation. Within the Metathesaurus, terms across vocabularies are grouped together based on meaning, forming concepts, which allows us to capture and account for the huge variations in language and expression.
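
A minimal sketch of what Metathesaurus-style concept grouping buys us: different surface terms normalize to one concept identifier. The concept IDs and synonym lists below are hypothetical placeholders, not actual UMLS content.

```python
# Map surface terms from different vocabularies to a single concept identifier.
# The CUIs and synonyms are invented for illustration.
concept_index = {
    "myocardial infarction": "C0000001",
    "heart attack": "C0000001",
    "mi": "C0000001",
    "hypertension": "C0000002",
    "high blood pressure": "C0000002",
    "htn": "C0000002",
}

def normalize(term):
    """Return the concept id for a term, or None if it is not in the index."""
    return concept_index.get(term.strip().lower())

for phrase in ["Heart attack", "myocardial infarction", "HTN"]:
    print(phrase, "->", normalize(phrase))
```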

This obviously helps, but even such exhaustive approaches have their limitations. Given the nature of language itself, each individual concept is often assigned multiple semantic type categories from the UMLS Semantic Network, making the meaning context-sensitive. For example, within the UMLS, 33.1% of abbreviations have multiple meanings. The presence of abbreviation ambiguity is even higher in clinical notes, with a rate of 54.3%. This makes subjectivity a big factor in understanding clinical notes and makes it that much more difficult to derive actionable intelligence.

The Growing Market

Irrespective of these challenges, the NLP market is growing steadily and is forecasted to grow for some time, as shown in the Figure below.

According to a recent report, the NLP Market for Healthcare and Life Sciences Industry will be worth USD 2.67 Billion by 2020. This report, titled "Natural Language Processing Market for Health Care and Life Sciences Industry by Type (Rule-Based, Statistical, & Hybrid NLP Solutions), Region (North America, Europe, Asia-Pacific, Middle East and Africa, Latin America) – Global Forecast to 2020", defines and divides the NLP market into various segments with an in-depth analysis and forecasting of revenues.

The global NLP market for health care and life sciences industry is expected to grow from USD 1.10 Billion in 2015 to USD 2.67 Billion by 2020, at a CAGR of 19.2%. In the current scenario, North America is expected to be the largest market on the basis of spending and adoption of NLP solutions for the healthcare and life sciences industry.

What Next?

The EHR is here to stay. Now is the time to innovate by introducing better ways to capture clinical data, better ways to interact with the data, and better ways to use the data to improve clinical outcomes. NLP and machine learning are obvious candidates to make this happen. That is why investments are piling up in this area. Jorge Conde is Andreessen Horowitz's newest general partner and leads the firm's investments at the intersection of biology, computer science, and healthcare. He was recently asked: "… you were an undergrad in biology at Johns Hopkins, but you have an MBA from Harvard and also worked as an investment banker at Morgan Stanley! How does that all add up?" Jorge's answer was simple and to the point: "I went to finance to see if I could understand … what drives an industry, how does the operation actually work? But then I realized … that I wanted to build and do. And so … I did additional graduate work in the sciences at the medical school at Harvard and at MIT." The article I am quoting from is aptly titled The Century of Biology. Computational biology, and by extension healthcare, is going to be the most exciting field of the 21st century, and we will need to build the tools to support it. Well, we had better get to work!

Big Data, Machine Learning and Healthcare – An increasingly significant interplay

The field of healthcare is undergoing a revolution with the increasing adoption of technology: devices, sensors, software, insights and artificial intelligence. Unfortunately, buzzwords such as machine learning and Big Data have clouded this conversation. People throwing around these buzzwords do a major disservice to any real adoption of new technologies. And I am not the only one who is bothered by this; others are voicing their concerns too: Machine Learning – Can We Please Just Agree What This Means? Although machine learning and big data have become buzzwords, and buzzwords these days do carry a negative connotation, in this particular case they are making significant inroads in healthcare. So what is Big Data, what is machine learning, and how are they changing healthcare?

Defining Big Data is like defining what life is: it depends on who you ask. The simplest and closest definition is attributed to Gartner: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. But not everybody agrees. According to The Big Data Conundrum: How to Define It, published by MIT Technology Review, Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland surveyed what the term means to different organizations, got very different results, and bravely finished their survey with a definition of their own: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning. One significant evolution of the term seems to be the marriage of the concept with the enabling technology or algorithmic framework, specifically databases, optimizing algorithms and machine learning.

So let's now turn our focus to machine learning, which has similar problems. To me, machine learning is simply the process by which a computer can learn to do something. That something might be as simple as reading the alphabet, or as complex as driving a car on its own. Although explaining this to a non-technical audience is not easy, valiant efforts have been made by some lost souls, for example Pararth Shah, ex-Stanford student and currently at Google Research, in How do you explain Machine Learning and Data Mining to a layman?, and Daniel Tunkelang, a data scientist who led teams at LinkedIn and Google, in How do you explain machine learning to a child? How this learning can take place, however, is harder to explain. Before attempting that, let me clarify some other relevant technical jargon people may have thrown at you, such as AI, soft computing and computational intelligence. AI, which stands for Artificial Intelligence, is the generic study of how human intelligence can be incorporated into computers. Machine learning, a sub-area of AI, concentrates on the theoretical foundations and computational aspects of learning algorithms; many of these techniques are considered to belong to the fields of Computational Intelligence and Soft Computing, examples being neural networks, fuzzy systems and evolutionary algorithms. More simply, a machine-learning algorithm determines the relationship between a system's inputs and outputs using a learning data set that is representative of the behavior found in the system, applying various data modeling techniques. This learning can be either supervised or unsupervised.
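
For readers who like to see the distinction in code, here is a tiny, hedged contrast between the two on synthetic data: a supervised classifier learns from labelled examples, while a clustering algorithm finds structure without labels. It is illustrative only; real clinical models demand far more care with features and validation.

```python
# Supervised vs unsupervised learning on synthetic data (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])  # two synthetic groups
y = np.array([0] * 50 + [1] * 50)                                      # labels for the groups

# Supervised: learn the input-output relationship from labelled examples.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: discover structure without labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```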

The interesting reality is that whether we are aware or not, machine learning based solutions are already part of our daily lives, so much so that BBC thought it would be fun just to point it out in Eight ways intelligent machines are already in your life. Not surprisingly, one of the eight areas mentioned there is healthcare.

Now we come to the hard part of this discussion. There are numerous interplays between big data, machine learning and healthcare. Thousands of books are being written on it; my most recent search on Amazon for "machine learning" yielded 14,389 matches! Dedicated conferences on this topic are being organized. There are so many courses on it that David Venturi from Udacity was inspired to research and publish Every single Machine Learning course on the internet, ranked by your reviews. Virtually all healthcare startups are now expected to use some form of machine learning; VC funding to healthcare startups that use some form of AI increased 29% year-over-year to hit 88 deals in 2016, and is on track to reach a 6-year high in 2017. A good starting point, however, is "Top 4 Machine Learning Use Cases for Healthcare Providers" written by Jennifer Bresnick. She broadly identifies the following areas where significant inroads have already been made by machine learning: imaging analytics and pathology; natural language processing and free-text data; clinical decision support and predictive analytics; and finally, cybersecurity and ransomware. If you are looking to get more specific, check out 7 Applications of Machine Learning in Pharma and Medicine by Daniel Faggella, where he identifies the applications of machine learning with the most forecasted impact on healthcare: disease identification/diagnosis, personalized treatment/behavioral modification, drug discovery/manufacturing, clinical trial research, radiology and radiotherapy, smart electronic health records and finally, epidemic outbreak prediction. In reality, it is becoming harder and harder these days to find any area of healthcare that is untouched by machine learning in some way.

Why is this happening? It's because we are realizing that machine learning has enormous potential to make healthcare more efficient, smarter and more cost-effective. My prediction is that in the future we will not talk about machine learning as a separate tool; it will become so ubiquitous that we will automatically assume it is part of a solution, much in the same way we no longer think of internet search as a separate tool and simply assume it is available. The more important question is: now that the genie is out of the bottle, where will it end? Will we one day have completely autonomous artificial systems, such as the famous Emergency Medical Hologram Mark I, replacing human doctors as the creators of Star Trek imagined? Or will healthcare prove to be so complex that no machine can ever replace humans completely? Only time will tell.

When is a small sample size too small for statistical reporting?

It has been a fairly well-known assumption in statistics that a sample size of 30 is a so-called magic number for estimating distributions or statistical errors. The problem is that, firstly, according to Andrew Messing of the Center for Brain Science, Harvard University, like a lot of rule-of-thumb commonsense measures, this assumption does not have a solid theoretical basis to prove its veracity. Secondly, the number 30 is itself arbitrary, and some textbooks give alternative magic numbers of 50 or 20. Examples can be found in Is 30 the magic number? Issues in sample size estimation or Shamanism as Statistical Knowledge: Is a Sample Size of 30 All You Need?, for example. The funny thing is that there is no formal proof that any of these numbers is useful, because they all rely on assumptions that can fail to hold true in one or more ways, and as a result the adequate sample size cannot be derived using the methods typically taught (and used) in the medical, social, cognitive, and behavioral sciences.

Like so many others before me, I got to thinking about this. One of my domains is healthcare data analytics, a field that is perpetually inundated with data. Should I test this rule of thumb and see if there is any truth to it?

Let's set the background first. The data I was looking at centers around the treatment of Hepatitis C. The goal of Hepatitis C therapy is to clear the patient's blood of the Hepatitis C virus (HCV). During this treatment, doctors routinely monitor the level of virus in the patient's blood, a measurement known as viral load, typically reported in International Units per milliliter (IU/mL). When I was slicing and dicing the data using different criteria such as age, sex and genotype to report effectiveness of treatment, the sample sizes of some of these cohorts were becoming too small. So what should the minimum size of my sample set be before I can confidently report a result?

Let's look at a fairly simple mathematical model now. In this specific case, we assumed a t-distribution for our data. The t-distribution is almost engineered to give a better estimate of our confidence intervals, especially when we have a small sample size. It looks very similar to a normal distribution: it has a mean, which is the mean of our sampling distribution, but it also has fatter tails. Normally, we estimate the true standard deviation and then say that the standard deviation of the sampling distribution is equal to the true standard deviation of the population divided by the square root of n, the sample size.

This is especially useful because we seldom, if ever, know the true standard deviation. If we don't know it, the best thing we can put in its place is our sample standard deviation.

We do not call this estimate a probability interval; rather, it is a "confidence interval" because we are making some assumptions. This confidence measure is going to change from sample to sample, and in particular we expect it to be a particularly bad estimate when we have a really small sample size.

Accordingly, we calculated confidence intervals with the above procedure for the data. If the size of the sample is more than a cutoff, say 30, we used Z-scores; otherwise we used the t-table for the calculation. (I am assuming t-tables and Z-scores are outside the scope of this article.) This basically means that we first find the mean, then find the standard deviation, and finally find the standard error, which is equal to the standard deviation divided by the square root of the sample size. We then find the critical range either from the t-table or from the Z-score, as mentioned above. Finally, the adjusted range with a specific % confidence is equal to the mean +/- the range, as calculated above.
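
Here is a short sketch of that procedure in code, switching between the Z-score and the t-table at the stated cutoff. The viral-load values are invented for illustration and are not the study data.

```python
# Confidence interval using t for small samples and Z otherwise (sketch).
import math
from statistics import mean, stdev
from scipy import stats

def confidence_interval(samples, confidence=0.95, cutoff=30):
    n = len(samples)
    m = mean(samples)
    se = stdev(samples) / math.sqrt(n)          # standard error = s / sqrt(n)
    if n > cutoff:
        crit = stats.norm.ppf(0.5 + confidence / 2)         # Z-score
    else:
        crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)  # t-table value
    return m - crit * se, m + crit * se

viral_loads = [4.2e5, 3.1e5, 5.8e5, 2.9e5, 4.7e5, 3.3e5, 6.1e5, 2.2e5,
               3.9e5, 4.4e5, 5.2e5, 2.7e5, 3.6e5, 4.0e5, 4.9e5]  # IU/mL, hypothetical
print(confidence_interval(viral_loads))
```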

The graph below shows the results where the sample size was 1,720 patients.

The next graph shows the results where the sample size is 28.

In both of these cases, it appears that a sample size of around 30 gives us enough statistical confidence in the results we are presenting. In both cases, however, we are bound by the fact that comparing effectiveness across treatments will probably be best related to the size of the sample (cohort) itself, the closest metric being the utilization factor. For the calculated values within each category, however, we should be able to report the numbers with a prescribed confidence interval. In all the calculations presented above, that confidence interval was 95%.

So we picked a set of "arbitrary" healthcare data, and somehow a sample size of around 30 turned out to be an adequately large number to generate dependable statistics. But the question remains: why?

The best rationale I have come across for why this is such a popular number was given by Christopher C. Rout, of the University of KwaZulu-Natal, Department of Anesthetics and Critical Care, Durban, KwaZulu-Natal, South Africa. According to him, it is not that 30 is "enough", but rather that we need "at least" 30 samples before we can reasonably expect an analysis based upon the normal distribution (i.e. a Z test) to be valid. That is, it represents a threshold above which the sample size is no longer considered small. It may have to do with the difference between 1/(n-1) and 1/n, the divisors in the two common variance estimates. At about 30 (actually between 32 and 33) this difference becomes less than 0.001, so in a way the intuitive sense is that at or around that sample size, moving to larger samples does not contribute much more to the probability distribution calculation, and the estimated error drops to acceptable levels.
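
A quick numerical check of that rationale (my own sanity check, not Rout's): find the first sample size at which the gap between 1/(n-1) and 1/n drops below 0.001.

```python
# Find the first n where 1/(n-1) - 1/n < 0.001 (equivalently n*(n-1) > 1000).
n = 2
while (1 / (n - 1)) - (1 / n) >= 0.001:
    n += 1
print(n)  # prints 33, i.e. the gap falls below 0.001 between n = 32 and n = 33
```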

Well, sounds about right to me.

Big Data Analytics – Test Driven Platform Design – Early smoke testing

When we began our adventures in Spark we soon brought up the topic of smoke testing.

So what’s smoke testing?

In my mind, smoke testing means making sure that our system doesn't break as soon as it's turned on.
The topic came to the foreground because, at the time, this was our first foray into this new world.

We had installed three critical components:

Apache Spark – Cluster Processing Framework
Cassandra – NoSQL Database
Hadoop – Storage and Cluster Processing Execution Framework

Okay, these days that's hardly earth-shattering in the Big Data Analytics world. In this world, that trio is as common as fish and chips with mushy peas is in England. So I am not giving away any company secrets, which is good for me. However, if there are some Big Data newbies out there reading this post, this is a great combination.

Back to our question, though: how should we smoke test this? The main thrust of our Lyticas product is the handling of XBRL, which is a mixture of numerical and textual processing. It also deals with stock price information, which comes as a time series.

To that end, we focused on testing how well our stack would respond to these data format types:

1. Text – XBRL

2. Time Series Data

All of the above use Spark Core functionality. We analysed XBRL and retrieved statistical information from time series data.

One other important part of our strategy was to test the performance at the least optimum configuration possible. Kind of finding out if your starship can still maintain a stable warp field with a single nacelle.

Here is something else to ponder: what if your system gave an acceptable level of performance at the most basic configuration? For example, running on the minimum number of processing nodes, databases and servers, performance was still strong because of the quality of the code!

I will leave this blog with that bombshell.

Big Data Analytics – Illuminating dark data

In this blog I am going to describe a scenario where a Big Data stack is introduced to provide a business advantage to an established information ecosystem. The company in this scenario is in the biopharmaceutical industry, and the scenario involves dark data.

For years this company has invested in and maintained an enterprise content management system to store all research-related and operational documentation.

It's been working well for many years. You know a system is well received when the user community uses it as simply as a kitchen appliance.

The only time it’s noticed is when there is a failure.

This particular system has reached an enviable operational record. In the last year, unscheduled downtime has been 35 minutes spread over the whole year. That's pretty good going for a system with 800 concurrent users and a total global community of 5,000 users.

This system supports the development of new drugs, speculative research, marketing, finance, building operations; in fact, almost everything.

As it's a single large repository, the security mechanism is incredibly granular, ensuring information is dished out on a need-to-know basis.

All looks well but what lies beneath are some serious issues.

The escalating costs of running this system are becoming difficult to justify.

It's expensive to maintain. There are ongoing license, support, hardware and staff costs.

The knee jerk reaction is to switch to a system that’s cheaper to run.

Well let’s look at that for a moment.

To shift to an entirely new ecosystem is a massive cost in itself. It also carries a great deal of risk.

What if there is data loss? What if what is delivered has a poorer operational record?

IT ain’t stupid here. They know if they screw up, the scientists who create the value in this company will be out for their blood.

If upper management are prepared to deal with a couple of thousand scientists, that's fine. Like, who listens to geeks anyway?

However, when outages affect the pipeline of new drugs coming onto the market, that will affect the share price.

That will get senior management closer to the executioner's block! Which bit shall I chop off first?

So what are the alternatives to migrating to a new ecosystem?

Well, augment what you already have.

This company is at least lucky that their current stack is extensible.

You are able to bolt on other technologies that can leverage their existing repository.

So let’s ask the question, “what additional features would your users like that you aren’t offering?”

The quick answer is collaboration. They don’t have spaces where they can collaborate across continents.

I mean the ability to facilitate knowledge creation through a synthesis of joint document authoring, review, publishing and audio/video conferencing. Okay now we are going off track!

This isn’t the analytics problem we are looking for.

However this is exactly what this company is investing in. They are doing it because it’s going to bring back some added value and also it’s something they can understand.

What I am proposing is something akin to sorcery! And it sends a shiver down my spine. I am not talking about the feeling you get reading about You-Know-Who in the Harry Potter world created by the amazing JK Rowling.

I am talking about the creepy feeling you get when reading Lovecraft or Crowley.

The bump in the night that freaks you out when reading "The Tibetan Book of the Dead".

I am talking about going after dark data!

The information that exists in large repositories but is inaccessible due to non-existent metadata. I am talking about metrics on fluctuations in dark data, in as close to real time as we can get.

The term dark data is not new. Here is the Gartner definition.
Dark data is the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.

In the context of the biopharma, dark data is content whose value goes unrealised.

For example, that graduate student's promising research that goes unnoticed.

If only a few of these ideas are realised, it could make or break a new drug for a pharma company. It could literally be worth billions.

Dark data is often untagged or at best the metadata applied to it gives no clue of what the content relates to. So how do we get value?

We have to go in and retrieve the semantic meaning from the text. We need to extract the concepts and create a social graph.

Once we have that, we can bring the dark into the light and see what kinds of information assets we have, who created them and when, and the distribution of dark data across our information repository.

Now the question is how? How can we do this?

This is where the tools we have been applying to Big Data analytics can help. We can trawl through vast quantities of information using cluster processing to power semantic meaning and concept extraction, then visualise what we have found and assist the data scientist in uncovering new value.
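
As a hedged sketch of what such a job might look like, the snippet below uses Spark to trawl a document dump, tag each document against a small concept lexicon, and count concept mentions for later visualisation. The repository path and the lexicon are hypothetical placeholders, and a real system would use a curated vocabulary and proper NLP rather than substring matching.

```python
# Count concept mentions across a document repository with Spark (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dark-data-concepts").getOrCreate()
sc = spark.sparkContext

# Illustrative concept lexicon; a real system would use a curated vocabulary.
lexicon = {"monoclonal antibody": "CONCEPT_MAB",
           "clinical trial": "CONCEPT_TRIAL",
           "adverse event": "CONCEPT_AE"}

# wholeTextFiles yields (path, content) pairs for every document under the directory.
docs = sc.wholeTextFiles("hdfs:///repository/export/*.txt")

concept_counts = (docs
    .flatMap(lambda kv: [(cid, 1) for term, cid in lexicon.items()
                         if term in kv[1].lower()])
    .reduceByKey(lambda a, b: a + b))

for concept, count in concept_counts.collect():
    print(concept, count)
```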

That’s the dream and it’s not far off…..

Big Data Definitions, Misconceptions and Myths

After all these years of being involved in 'Big Data', I have finally got around to writing this blog, entitled "Big Data – Definitions, Myths & Misconceptions".

As I wrote that statement, I got the same feeling as if I had asked, "What is God? What is the meaning of life?" Such is the fervour and hype around this topic these days. There are countless books explaining how Big Data is already revolutionising our world. There are legions of companies saying that they are doing it.

But what does it mean?

Here is the trusted Gartner definition.

Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

I am sorry, but I still don't feel much the wiser! However, like the most profound Zen koans, the meaning is realised beneath the surface of the words.

So let's dive in!

HIGH-VOLUME, HIGH-VELOCITY and/or HIGH VARIETY

To me these terms describe the characteristics of the information we are dealing with. Years before I worked in big data I worked in enterprise content management.

My clients were large multinational institutions that needed to store terabytes upon terabytes of data for purposes ranging from regulatory compliance to supporting business-critical operations. I suppose this is what comes to my mind when I think of high volume.

The next is high velocity! To me that means the rate at which information systems are receiving and processing information. Consider an enterprise resource planning application, for example an airline reservation system or a large supermarket distribution centre, where information is being updated continuously by, in some cases, thousands of concurrent users.

The final term is high variety. Speaking from my enterprise content management background, this means the range of the types of documents and content produced by a large institution. In many of the companies I consulted for, these documents were often unstructured Microsoft Office documents, PDFs, audio and video.

In addition to these documents there was information in large databases (structured data, built against a schema) and information in XML format.

Then we had the semi-structured metadata. A wide variety of information types, data and formats.

Now that we have delved into the meaning beneath the surface of these words (high-volume, high-velocity and/or high-variety), I am getting the feeling that, although I have just explored these characteristics from my experience of enterprise content management, they have actually been with us for a long while.

Consider institutions like the Library of Congress, the British Library or the Bibliothèque nationale de France. These are, of course, libraries with thousands upon thousands of books. Here the volume term is obvious: shelves as far as the eye can see. Variety is the range of topics, and velocity the number of new publications coming in or books being tracked as they are borrowed.

If this were all Big Data was about, then it would feel like the same old, same old but with modern branding.

So here is a MISCONCEPTION: Big Data is where you just have lots of information, and it's a rebadging of technology and concepts we have been using forever. It's not, of course…

The second part of the Gartner definition takes that misconception apart, as it's about getting something useful from these collections of information.

From my experience in enterprise content management, that meant ensuring that the information can be retrieved after it has been stored.

Taking the library example, say I was looking for the book "War and Peace". Rather than spending the next hundred years trying to find it on one of the myriad shelves, it would be useful if all the books were tagged by title, subject, author and location. What we are doing, of course, is applying metadata to retrieve documents.

What if we were trying to ask another question? Find me the book where the two characters Pierre Bezukhov and Natasha Rostova marry, and if they do, bring up the precise sections of the book where they do!

We would need to do not only a full-text search but a natural language search too. This 'little' requirement brings with it a huge amount of work behind the scenes to deliver it. Now add to this a demand that we bring the information back in less than 4 seconds.

Let's take another example. I want to know the number of books borrowed at any given time, organised into fiction/non-fiction, topic and author.

I want to be able to learn something about the demographics of the borrowers. Now that's a question that uses an entirely different asset class of information.

Now what if I wanted to know this information to plan an advertising campaign or create the market for additional services?

Now we are talking Big Data!

Big Data is a place where we are no longer observing our systems but engaging with them to create value through gaining insight from the information being gathered every moment.

So what's the myth? That creating these systems is prohibitively expensive! Like, imagine digitising the entire contents of a library. Well, since the beginning of this new millennium, this Herculean effort has been going on, and by now most of humanity's greatest literature has probably been digitised. Meanwhile, every new publication is born digital in the first place! So it's been paid for already!

Here is another myth: the use of advanced math for analysing trends and patterns in data, such as complex machine learning algorithms, is only for university research labs or behind the closed doors of the likes of Google, and the use of advanced computing techniques such as cluster programming is beyond the reach of your coders.

All a big myth! Why??

Because in the last two years these techniques have been packaged and made accessible; they are now waiting to be leveraged. This leads to a new frontier.

What if Big Data could be about gaining insight from all the information we have locked away in our current information systems? In many large companies, information is dispersed across a variety of repositories. What if we could mine this information? What could we learn? Why not use the latest machine learning and cluster computing to do just that?

Apache Sparked!!

From around the autumn/fall of 2015, I went through some serious soul-searching. Since around 2012, we had been using a well-established distributed data processing technology: MapReduce. Honestly, we were using it, but with a lot of manpower to keep it running. I would describe MapReduce as being like a 1970s Porsche 911. It's fast and it does the job, but by heaven the engine is in the wrong place, and get it wrong and into the country hedge you go in a frightening tailspin.

The experienced technologists at my company weren't too keen on looking at alternatives. I am being completely frank here. They knew how to make it work. Like driving that 1970s Porsche 911, they knew when to come off the throttle and onto the brake, slow into corners and fast out. I could go on with racing metaphors; yes, I am a Porsche enthusiast.

The rookies, on the other hand, weren't keen on it at all. They would much prefer the latest 2015 Porsche 911. They wanted easy-to-use APIs, fast setup and low maintenance. Keeping with the sports car metaphor, these rookies wanted traction control, GPS navigation, leather, an iPhone dock, Bluetooth: the works!

I had been hearing about Apache Spark, an alternative to MapReduce, one that would offer greater ease of use, simpler installation, better performance and more flexibility.

Honestly, it sounded too good to be true. It really did! MapReduce, developed by Google in 2002 and widely adopted by 2008, had done the rounds. It had been fighting hard since then and has a strong following in many companies.

I began by asking around. I canvassed opinions from people working with large data sets who needed a cluster programming solution. What I got was suspicion of new technology.

Finally we spoke with a few contacts working in Big Data in Silicon Valley. They said that Apache Spark was the new disruptive kid on the block and was packing quite a punch.

Apache Spark was developed at the University of California, Berkeley, as a response to shortcomings in cluster computing frameworks such as MapReduce and Dryad.

So here we had a number of conflicting opinions. What did we do?

We went ahead and tried it.

I have to say, we were not disappointed.

It was easy to install, and we found, to our surprise, how compatible it was with our existing Hadoop Distributed File System (HDFS). To our joy, it supported Amazon S3 and our beloved Cassandra NoSQL database. The subject of Cassandra is for another blog!

Anyway, the above list of compatibilities came to our attention immediately once we started to get our hands dirty. The other thing we noticed was the support for Java and Maven.

However the real surprise came when we started using Apache Spark…..

The first thing we noticed was how easy it was to manage and process the data. Apache Spark's principal programming abstraction is the Resilient Distributed Dataset (RDD). Imagine, if you will, an array! That's essentially how an RDD appears to a programmer. What happens underneath is really interesting: the Apache Spark engine takes responsibility for processing the RDD across a cluster. The engine takes care of everything, so that all you need to worry about is building and processing your RDD.

We began our work with text documents. What we do as a company is to perform natural language processing on textual content. Very often this means parsing the document and then processing each line.

So in keeping with that we decided to create a simple text processing application with Apache Spark.

We installed Apache Spark on an Amazon cloud instance. We began by creating an application to load an entire text document into an RDD and apply a search algorithm to recover specific lines of text. It was a very simple test, but it was indicative of how easy the Apache Spark API was to use.
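
A minimal PySpark version of that first test might look like the sketch below (we actually worked through the Java API, so treat this purely as an illustration); the file path and search term are placeholders.

```python
# Load a text document into an RDD and pull back the lines that match a term.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-line-search").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("s3a://example-bucket/filings/sample.txt")  # one RDD element per line
matches = lines.filter(lambda line: "revenue" in line.lower())  # distributed search

print(matches.count())
for line in matches.take(5):
    print(line)
```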

We also noticed that using the same RDD concept we could work with real-time streams of data, and there was support for machine learning libraries too.

From the start of our investigation to the present, we have become more and more convinced that Apache Spark was the right choice for us. Even our experienced MapReduce people have been converted, because it's easy to use, fast, and has a lot more useful features.

Beyond the added features offered by Apache Spark, what struck me was the ability to operate in real-time.

In our view, this presents an opportunity to move away from large-scale batch processing of historical data to a new paradigm where we are engaging with our data like never before.

I for one am very keen to see what insights we can learn from this shift.

Very nice dear but what will it do for our business?

Hi Everyone!

How do I explain what we do? I know: with a story. Now imagine my voice like a 1950s detective novel, and read on.

I was having a conversation with a family friend. This friend was a senior executive at a leading financial institution and is now running a venture capital fund. She asked me what I was doing these days. I spoke about my company and our flagship product Lyticas. I went on to talk about financial analytics, Big Data, private cloud and so on. Without realising it, I began talking in buzzwords and soundbites.

It was then that she stopped me and said, "Yes dear, but what exactly do you do?" For a moment I was stumped. I then paused, collected my thoughts and began again.

I then asked her “auntie what do you do in venture capital and how did your experience at your previous company help you in your current capacity?”

She replied, “A critical part of VC, is in making accurate company valuations and benchmarking against other companies in the sector. We look closely at revenue and the prospects for growth.”

I heard this and asked, “do you do these things to decide whether a company is worth investing in and also if you have already invested, is this company on track?”

“Indeed!”, she replied.

"Cool! So how do you get your information? How long does it take to compile, process and maintain it? I say maintain, as it's always changing", I asked intently.

"We have a team of analysts and interns to do it. I don't get involved myself!", she said rather quickly.

"Well, Auntie, I can now describe to you what we do!", I exclaimed. "Our software helps you to make accurate company valuations by gathering data, processing it and providing the tools to help you benchmark and forecast."

"But honey, we do that already. How is your offering different?"

"We make it faster, greatly improve accuracy and make the information secure, yet accessible wherever you are. There are no infrastructure costs, as it's in the private cloud and supported 24/7."

“Really? Show me!”

So in ten minutes, from her iPhone 6 we signed up to our application, http://app.lyticas-technology.com and began looking at the working capitals, earnings and equity ratios of recently IPO’ed SEC listed companies.

"What about the big blue chips? I have friends who look at economic forecasting, basically pension people and Swiss bankers who look after private wealth management."

"Sure!", and then we did an instant comparison of Microsoft, Apple, General Motors, Tesla, Johnson & Johnson, Pfizer, Marriott, Halliburton and Macy's.

I kid you not, we looked at a mini portfolio, the kind a private wealth manager would put together for his favourite Russian oligarch.

“Okay”, she said, “what are you doing Monday?”
Yeah, I know that script sounds as cheesy as an infomercial on cable TV. But I stand by what we do as a company: enable timely, effective execution in the area of company valuation, benchmarking and forecasting. We never stand still; we are always learning and moving ahead.

Actually, you can sign up: just click on the link, http://app.lyticas-technology.com, and let me know when you have signed up at ari@apurbatech.com. I or a colleague will take you through it and, depending on whether it's right for you, arrange a full system trial.

The Undiscovered Country

By now the world has come to know of the passing of actor Leonard Nimoy, who inspired generations of souls to pursue a career in science and engineering, through his role in Star Trek.

Many of us at Apurba Technologies can cite Star Trek as the inspiration for first being interested in studying science as children, carrying this fire through school, then studying science and engineering in college, and finally bringing this energy into our professional lives.

I can tell you now. This fire of inspiration has been lit in a new generation as I write this blog.

For those of you who never watched the show; which planet are you living on? It’s certainly not in the Alpha, Beta, Gamma or Delta quadrants of this galaxy!

Amongst the show's themes were exploration and the need for tolerance in the face of difference, and the idea that mankind in the future would learn to appreciate one another and go out and explore the stars.

In Star Trek the intellectual scientist was the hero, not the sporty jock. And that is precisely what resonated with all those young scientists and geeks at school. Star Trek showed them a future where they could make a difference, where they could take center stage.

And the principal scientist hero in Star Trek was Spock, played so very wonderfully by the late, great Leonard Nimoy!

I wasn't even born when Star Trek came off the air. However, when it was shown on Thursday nights on BBC2, my father (a scientist) and I tuned in. I was absolutely mesmerized every week.

That was during my childhood, and then in my teens The Next Generation arrived. However, the topic of this blog is that first spark of inspiration given by the original series.

It was that first spark that fired the imagination of all those scientists and engineers to literally make life imitate art.

We see it every day in our mobile devices, voice recognition systems and user interfaces; the list is endless.

The title of this blog is "The Undiscovered Country", the name of a Star Trek movie. This movie in turn makes reference to Shakespeare's Hamlet.

“The undiscovered country from whose bourn, No traveller returns!”

Shakespeare is referring to those metaphysical questions of what happens to us after death. In the Star Trek movie "The Undiscovered Country", the reference is repurposed to what the future may hold if we only had the courage and wisdom to shape it. For it is this future that is the undiscovered country.

And with this final word we at Apurba say a fond farewell to Leonard Nimoy and continue our journey to that undiscovered country!

To the future.

RIP Leonard Nimoy

LLAP everyone.

Data in Seattle

A few weeks ago, actually between the 14th and 19th of September, I was in Seattle, attending the XBRL.US Data Forum with folks from Apurba Technologies and our partners at Fujitsu.

It was a great conference. We were around some wonderful people talking XBRL, and all the wonderful ways we can use XBRL as our data format.

And yes, it was interesting! I am a geek and I am proud. However, there was a reason why I was there, apart from the fact that Seattle is just beautiful!

I was there to present at the data forum, with my colleague from Fujitsu, a solution for enhanced validation of XBRL. This new solution, called XWandCloud, would provide all those CFOs receiving rather official reminders to improve their XBRL filings with a simple means to do just that. I am being intentionally polite; if I got such a letter from the SEC, I would be straight onto the Valium.

Then, when I had calmed down a bit, I would register on XWandCloud and get my XBRL filings validated. Finally, after visiting my friendly XBRL doctor from Fujitsu and having him wave his magic XWand on the celestial cloud, I could relax and take stock, looking into the past and present and hypothesizing about my future using Apurba's Lyticas Prism Cloud, built into XWandCloud. With the Lyticas Prism I could then analyse my pre-market statement using out-of-the-box Key Performance Indicators (the common accounting ones) and compare them to my previous quarters.

For a reasonable subscription I could also see how my company is doing compared to others. Now, all of a sudden, my XBRL filing appears to have a useful purpose other than getting me into a state every damn quarter! I can use it to help my company chart a safe course.

And this leads to the real pearl of wisdom, I gathered from this conference.

XBRL can make a difference to regular folks. It's not just a regulatory burden placed on companies. I have long been aware that XBRL has plenty of uses other than financial ones; my own company, Apurba Technologies Inc, has been involved in using XBRL in the construction, energy and transportation industries (CET-Taxonomy) with AGC Surety Wells Fargo.

What made the conference special for me was meeting other folks who felt the same and were doing something about it.

Another area that caught my eye was the use of XBRL as a means of reporting on corporate actions in the financial sector. Just the thought of using XBRL for these purposes made the light bulb turn on. Adoption of such a standard would facilitate data interchange.

This could be used by a financial institution within its own data ecosystem, or it could be the currency of a wider data ecosystem involving financial institutions (i.e. banks, insurance companies), financial exchanges, private/public sector companies and governments.

It sounds like I am alluding to big data again? Well it seems these days that all rivers lead to that data ocean.

I am still processing all that I learned at the conference, and I will be blogging about it at the data dive during the coming months.

It may be a while before I blog again so I shall leave with some words from the Tao Te Ching –

“The Sea is lord of Ten thousand Streams only because it lies beneath them”.
