After all these years of being involved in ‘Big Data’, I have finally got around to writing this post, entitled “Big Data – Definitions, Myths & Misconceptions”.

As I wrote that statement, I got the same feeling as if I had asked, “What is God? What is the meaning of life?”. Such is the fervour and hype around this topic these days. There are countless books explaining how Big Data is already revolutionising our world. There are legions of companies saying that they are doing it.

But what does it mean?

Here is the trusted Gartner definition:

“Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

I am sorry, but I still don’t feel much the wiser! However, like the most profound Zen koans, the meaning is realised beneath the surface of the words.

So let’s dive in!

HIGH-VOLUME, HIGH-VELOCITY and/or HIGH-VARIETY

To me, these terms describe the characteristics of the information we are dealing with. Years before I worked in Big Data, I worked in enterprise content management.

My clients were large multinational institutions that needed to store terabytes upon terabytes of data for purposes ranging from regulatory compliance to supporting business-critical operations. I suppose this is what comes to mind when I think of high volume.

The next is High Velocity! To me that means the rate at which information systems are receiving and processing information. Consider an enterprise resource planning application, an airline reservation system, or a large supermarket distribution centre. Information is being updated continuously by, in some cases, thousands of concurrent users.

The final term is High Variety. Speaking from my enterprise content management background, this means the range of types of documents and content produced by a large institution. In many of the companies I consulted for, these were often unstructured Microsoft Office documents, PDFs, audio and video.

In addition to these documents, there was information in large databases (structured data, built against a schema) and information in XML format.

Then we had the semi-structured metadata: a wide variety of information types, data and formats.
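To make the distinction concrete, here is a minimal Python sketch of these three shapes of data. The book record is, of course, hypothetical.

```python
# A hypothetical book record in the three shapes of data described above.

# Structured: built against a fixed schema, as in a relational database row
structured_row = {"title": "War and Peace", "author": "Leo Tolstoy", "year": 1869}

# Semi-structured: self-describing tags, but no schema enforced up front
semi_structured = """
<book>
  <title>War and Peace</title>
  <author>Leo Tolstoy</author>
  <year>1869</year>
</book>
"""

# Unstructured: free-form content with no inherent data model
unstructured = "The raw text of the novel itself, or the bytes of a scanned PDF..."
```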

Now that we have delved into the meaning beneath the surface of these words, high-volume, high-velocity and/or high-variety, I am getting the feeling that, although I have just explored these characteristics through my experience of enterprise content management, they have actually been with us for a long while.

Consider institutions like the Library of Congress, the British Library or the Bibliothèque nationale de France. These are, of course, libraries with thousands upon thousands of books. Here the volume term is obvious: shelves as far as the eye can see. Variety: the range of topics. And velocity: the number of new publications coming in, or books being tracked as they are borrowed.

If this were all Big Data was about, then it would feel like the same old, same old, but with modern branding.

So here is a MISCONCEPTION: Big Data is just having lots of information, a rebadging of technology and concepts we have been using forever. It’s not, of course…

The second part of the Gartner definition takes that misconception apart, as it is about getting something useful from these collections of information.

From my experience in enterprise content management, that meant ensuring that information can be retrieved after it has been stored.

Taking the library example, say I was looking for the book “War and Peace”. Rather than spending the next hundred years trying to find it on one of the myriad shelves, it would be useful if all the books were tagged with Title, Subject, Author and Location. What we are doing, of course, is applying metadata to retrieve documents.
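To make that concrete, here is a minimal Python sketch of a metadata catalogue and lookup. The records and field names are hypothetical.

```python
# A tiny, hypothetical library catalogue: each book carries metadata fields.
catalogue = [
    {"title": "War and Peace", "author": "Leo Tolstoy",
     "subject": "Historical fiction", "location": "Shelf 42-B"},
    {"title": "Anna Karenina", "author": "Leo Tolstoy",
     "subject": "Literary fiction", "location": "Shelf 17-A"},
]

def find_by(field, value):
    """Return every record whose metadata field matches the given value."""
    return [book for book in catalogue if book.get(field) == value]

print(find_by("title", "War and Peace"))   # tells us it lives on Shelf 42-B
print(find_by("author", "Leo Tolstoy"))    # both records come back
```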

What if we were trying to ask another question? Find me the book where the two characters Pierre Bezukhov and Natasha Rostova marry, and, if they do, bring up the precise sections of the book where it happens!

We would need not only a full-text search but a natural language search too. This ‘little’ requirement brings with it a huge amount of work behind the scenes. Now add to this a demand that we bring the information back in less than four seconds.
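Beneath the surface, a full-text search starts with something like an inverted index: every word mapped to the places it occurs. Here is a toy Python sketch of the idea; real engines then layer ranking, stemming and natural language query parsing on top. The section ids and text are hypothetical.

```python
from collections import defaultdict

def build_index(sections):
    """Map each normalised word to the set of sections it appears in."""
    index = defaultdict(set)
    for section_id, text in sections.items():
        for word in text.lower().split():
            index[word.strip(".,!?'\"")].add(section_id)
    return index

def search(index, *words):
    """Return the sections that contain every one of the given words."""
    hits = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*hits) if hits else set()

sections = {
    "epilogue-part-1": "Pierre Bezukhov and Natasha Rostova marry ...",
    "book-1-chapter-1": "The soiree at Anna Pavlovna's ...",
}
index = build_index(sections)
print(search(index, "Pierre", "Natasha", "marry"))  # {'epilogue-part-1'}
```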

Let’s take another example. I want to know the number of books borrowed at any given time, organised into fiction/non-fiction, topic and author.

What if I wanted to learn something about the demographics of the borrowers? Now that’s a question that draws on an entirely different class of information asset.
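As a sketch of the kind of aggregation I mean, here is a minimal example using pandas. The loan records, column names and age bands are all hypothetical.

```python
import pandas as pd

# Hypothetical loan records, joining each borrowed book to borrower demographics
loans = pd.DataFrame({
    "title":    ["War and Peace", "A Brief History of Time", "Dune"],
    "category": ["fiction", "non-fiction", "fiction"],
    "topic":    ["historical", "physics", "science fiction"],
    "borrower_age_band": ["35-44", "25-34", "18-24"],
})

# Books out on loan, organised into fiction/non-fiction and topic
print(loans.groupby(["category", "topic"]).size())

# The same records, cut by borrower demographics instead
print(loans.groupby(["category", "borrower_age_band"]).size())
```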

Now what if I wanted this information in order to plan an advertising campaign, or to create a market for additional services?

Now we are talking Big Data!

Big Data is a place where we are no longer merely observing our systems but engaging with them to create value, gaining insight from the information being gathered every moment.

So what’s the myth? That creating these systems is prohibitively expensive! Imagine digitising the entire contents of a library. Well, since the beginning of this new millennium, that Herculean effort has been going on, and by now most of humanity’s greatest literature has probably been digitised. Meanwhile, every new publication is born digital! So it has been paid for already!

Here is another myth! That advanced maths for analysing trends and patterns in data, such as complex machine learning algorithms, is reserved for university research labs or the closed doors of the likes of Google! That advanced computing techniques such as cluster programming are beyond the reach of your coders.

All a big myth! Why??

Because in the last two years these techniques have been packaged and made accessible; they are now waiting to be leveraged. This leads to a new frontier.
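Apache Spark is one example of that packaging: cluster programming, plus a machine learning library, behind an API that an ordinary coder can pick up. A minimal sketch, assuming Spark is installed and a hypothetical loans.csv file holds the library loan records:

```python
from pyspark.sql import SparkSession

# Start a session; on a real cluster this would point at the cluster manager
spark = SparkSession.builder.appName("library-loans").getOrCreate()

# Read the hypothetical loan records from CSV
loans = spark.read.csv("loans.csv", header=True, inferSchema=True)

# The aggregation from the library example, now distributed across however
# many machines the cluster happens to have
(loans.groupBy("category", "topic")
      .count()
      .orderBy("count", ascending=False)
      .show())

spark.stop()
```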

What if Big Data could be about gaining insight from all the information we have locked away in our current information systems? In many large companies, information is dispersed across a variety of repositories. What if we could mine this information? What could we learn? Why not use the latest machine learning and cluster computing to do just that?