The Data Dive


Big Data Analytics – Test Driven Platform Design – Early smoke testing

When we began our adventures in Spark we soon brought up the topic of smoke testing.

So what’s smoke testing?

In my mind, smoke testing means making sure that our system doesn’t break as soon as it’s turned on.
It came to the foreground because this was our first foray into this new world.

We had installed three critical components:

Apache Spark – Cluster Processing Framework
Cassandra – NoSQL Database
Hadoop – Storage and Cluster Processing Execution Framework

Okay, these days that’s hardly earth shattering in the Big Data Analytics world. In this world that trio is as common as fish and chips with mushy peas is in England. So I am not giving away any company secrets, which is good for me. However, if there are some Big Data newbies out there reading this post, this is a great combination.

Back to our question though: how should we smoke test this? The main thrust of our Lyticas product is the handling of XBRL, which is a mixture of numerical and textual processing. We also deal with stock price information, which comes as a time series.

To that end we focused on testing how well our stack would respond to these data formats:

1. Text – XBRL

2. Time Series Data

Both of the above use Spark Core functionality. We analysed XBRL and retrieved statistical information from time series data.
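A smoke test of this kind can be tiny. Here is a minimal sketch in plain Python rather than Spark, so it runs anywhere. The XBRL-like fragment, the price series and the function names are all made up for illustration; this is not our actual test suite.

```python
import statistics

# Hypothetical sample inputs: a made-up XBRL-like fragment and a made-up price series.
SAMPLE_XBRL = """<xbrl>
<Revenues contextRef="FY2015" unitRef="USD">1200</Revenues>
<Assets contextRef="FY2015" unitRef="USD">5000</Assets>
<Description>Revenues grew on strong demand.</Description>
</xbrl>"""

SAMPLE_PRICES = [10.0, 10.5, 9.5, 11.0, 11.0]


def smoke_test_text(document, keyword):
    """Return the lines of the document that mention the keyword."""
    return [line for line in document.splitlines() if keyword in line]


def smoke_test_time_series(prices):
    """Return basic statistics for a price series."""
    return {
        "count": len(prices),
        "min": min(prices),
        "max": max(prices),
        "mean": statistics.mean(prices),
    }
```

If both functions return sensible answers on the smallest possible setup, the stack has at least survived being switched on.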

One other important part of our strategy was to test performance at the least optimal configuration possible. It’s a bit like finding out whether your starship can still maintain a stable warp field with a single nacelle.

Here is something else to ponder. What if your system gave an acceptable level of performance at the most basic configuration? For example, what if, running on the minimum number of processing nodes, databases and servers, performance was still strong because of the quality of the code!!!

I will leave this blog with that bombshell.




Big Data Analytics – Illuminating dark data

In this blog I am going to describe a scenario where a Big Data stack is introduced to provide a business advantage to an established information ecosystem. The company in this scenario is in biopharmaceuticals, and the scenario involves dark data.

For years this company has invested in and maintained an enterprise content management system to store all research related and operational documentation.

It’s been working well for many years. You know when a system is well received. The user community use it as simply as a kitchen appliance.

The only time it’s noticed is when there is a failure.

This particular system has reached an enviable operational record. In the last year the unscheduled downtime has been 35 minutes in total. That’s pretty good going for a system with 800 concurrent users and a total global community of 5000 users.

This system supports the development of new drugs, speculative research, marketing, finance, buildings operations, in fact almost everything.

As it’s a single large repository, the security mechanism is incredibly granular, ensuring information is dished out on a need to know basis.

All looks well but what lies beneath are some serious issues.

The escalating costs of running this system are becoming difficult to justify.

It’s expensive to maintain. There are ongoing licence, support, hardware and staff costs.

The knee jerk reaction is to switch to a system that’s cheaper to run.

Well let’s look at that for a moment.

To shift to an entirely new ecosystem is a massive cost in itself. It also carries a great deal of risk.

What if there is data loss? What if what is delivered has a poorer operational record?

IT ain’t stupid here. They know if they screw up, the scientists who create the value in this company will be out for their blood.

If upper management are prepared to deal with a couple of thousand scientists, that’s fine. Like who listens to geeks anyway.

However when outages affect the pipeline of new drugs coming onto the market, that will affect the share price.

That will get senior management closer to the executioner’s block! Which bit shall I chop off first????

So what are the alternatives to migrating to a new ecosystem?

Well augment what you have already.

This company is at least lucky that their current stack is extensible.

You are able to bolt on other technologies that can leverage their existing repository.

So let’s ask the question, “what additional features would your users like that you aren’t offering?”

The quick answer is collaboration. They don’t have spaces where they can collaborate across continents.

I mean the ability to facilitate knowledge creation through a synthesis of joint document authoring, review, publishing and audio/video conferencing. Okay now we are going off track!

This isn’t the analytics problem we are looking for.

However this is exactly what this company is investing in. They are doing it because it’s going to bring back some added value and also it’s something they can understand.

What I am proposing is something akin to sorcery! And it sends a shiver down my spine. I am not talking about the feeling you get reading about You-Know-Who in the Harry Potter world created by the amazing JK Rowling.

I am talking about the creepy feeling you get when reading Lovecraft or Crowley.

The bump in the night that freaks you out when reading “The Tibetan Book of the Dead”.

I am talking about going after dark data!

The information that exists in large repositories but is inaccessible due to non-existent metadata. I am talking about metrics on fluctuations in dark data, as close to real-time as we can get.

The term dark data is not new. Here is the Gartner definition.
Dark data is the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.

In the context of biopharma, dark data is content whose value goes unrealised.

For example, that graduate student’s promising research that goes unnoticed.

If only a few of these ideas are realised for a pharma company it could be the make or break of a new drug. It could literally be worth billions.

Dark data is often untagged or at best the metadata applied to it gives no clue of what the content relates to. So how do we get value?

We have to go in and retrieve the semantic meaning from the text. We need to retrieve the concepts and create a social graph.

Once we have that we can bring the dark into the light and see the kind of information assets we have, who created them, when, and the distribution of dark data in our information repository.

Now the question is how? How can we do this?

This is where the tools we have been applying to Big Data analytics can help. We can trawl through vast quantities of information using cluster processing to power semantic meaning and concept extraction. Then we can visualise what we have found and assist the data scientist in uncovering new value.
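To make the idea a little more concrete, here is a very rough sketch in plain Python. It swaps real semantic concept extraction for simple word counting, which is a crude stand-in, and the stop-word list, function names and document format are all assumptions made for illustration. It pulls the most frequent terms out of untagged text and links each term to the authors who wrote about it, a toy version of the social graph.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a proper NLP library.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "with"}


def extract_terms(text, top_n=3):
    """Return the top_n most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]


def build_author_term_graph(documents):
    """Map each term to the set of authors whose untagged documents mention it."""
    graph = {}
    for author, text in documents:
        for term in extract_terms(text):
            graph.setdefault(term, set()).add(author)
    return graph
```

On a cluster the same per-document step would be spread across many nodes; the logic for each document stays this simple.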

That’s the dream and it’s not far off…..




Big Data Definitions, Misconceptions and Myths

After all these years of being involved in ‘Big Data’, I have finally got around to writing this blog, entitled “Big Data – Definitions, Myths & Misconceptions”.

As I wrote that statement, I got the same feeling as if I had asked the question, “What is God? What is the meaning of life?”. Such is the fervour and hype around this topic these days. There are countless books explaining how Big Data is already revolutionising our world. There are legions of companies saying that they are doing it.

But what does it mean?

Here is the trusted Gartner definition.

Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

I am sorry but I still don’t feel much the wiser! However, like the most profound Zen koans, the meaning is realised beneath the surface of the words.

So let’s dive in!


To me these terms describe the characteristics of the information we are dealing with. Years before I worked in big data I worked in enterprise content management.

My clients were large multinational institutions that needed to store terabytes upon terabytes of data for purposes ranging from regulatory compliance to supporting business critical operations. I suppose this is what comes to my mind when I think of high volume.

The next is High Velocity! To me that means the rate at which information systems are receiving and processing information. Consider an enterprise resource planning application, for example an airline reservation system or a large supermarket distribution centre. Information is being updated continuously by, in some cases, thousands of concurrent users.

The final term is High Variety. Speaking from my enterprise content management background, this means the range of the types of documents and content produced by a large institution. In many of the companies I consulted for, these documents were often unstructured Microsoft Office documents, PDFs, audio and video.

In addition to these documents there was information in large databases (structured data, built against a schema) and information in XML format.

Then we had the semi-structured metadata. A wide variety of information types, data and formats.

Now that we have delved into the meaning beneath the surface of these words, high-volume, high-velocity and/or high-variety, I am getting the feeling that, although I have only explored these characteristics from my experience of enterprise content management, they have actually been with us for a long while.

Consider institutions such as the Library of Congress, the British Library and the Bibliothèque nationale de France. These are of course libraries with thousands upon thousands of books. Here the volume term is obvious: shelves as far as the eye can see. Variety is the range of topics, and velocity is the number of new publications coming in or books being tracked as they are borrowed.

If this was all Big Data was about then it feels that it’s the same old, same old but with a modern branding.

So here is a MISCONCEPTION. Big data is where you just have lots of information, and it’s a rebadge of technology and concepts we have been using forever. It’s not of course…

The second part of the Gartner definition takes that misconception apart, as it’s about getting something useful from these collections of information.

From my experience in enterprise content management, that meant ensuring that the information can be retrieved after being stored.

Taking the library example, say I was looking for the book “War and Peace”. Rather than spending the next hundred years trying to find it on one of the myriad shelves, it would be useful if all the books were tagged with title, subject, author and location. What we are doing of course is applying metadata to retrieve documents.

What if we were trying to ask another question? Find me the book where two characters, Pierre Bezukhov and Natasha Rostova, marry. And if they do, bring up the precise sections of the book where it happens!

We would need not only to do a full text search but a natural language search too. This ‘little’ requirement brings with it a huge amount of work behind the scenes to deliver it. Now add to this a demand that we bring the information back in less than 4 seconds.
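The difference between the two kinds of question can be sketched in a few lines of plain Python. The catalogue below is hypothetical (the text snippets are invented) and so are the function names. Note that the second function is only a naive keyword match over sentences; genuine natural language search needs far more than this, which is exactly the point about the work behind the scenes.

```python
# Hypothetical mini catalogue; titles and authors are real, text snippets are invented.
CATALOGUE = [
    {"title": "War and Peace", "author": "Leo Tolstoy", "subject": "fiction",
     "location": "Shelf 12", "text": "Pierre and Natasha marry. Years pass."},
    {"title": "On the Origin of Species", "author": "Charles Darwin", "subject": "science",
     "location": "Shelf 40", "text": "Species change over time."},
]


def find_by_metadata(catalogue, **criteria):
    """Metadata lookup: match on tagged fields only, never reading the text."""
    return [b for b in catalogue if all(b.get(k) == v for k, v in criteria.items())]


def find_sentences(catalogue, *terms):
    """Naive full text search: return (title, sentence) pairs containing every term."""
    hits = []
    for book in catalogue:
        for sentence in book["text"].split("."):
            if sentence and all(t.lower() in sentence.lower() for t in terms):
                hits.append((book["title"], sentence.strip()))
    return hits
```

The metadata lookup touches a handful of small fields per book, while the text search has to read every sentence of every book, which is why the 4 second target is such a demanding one.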

Let’s take another example. I want to know the number of books borrowed at any given time, organised into fiction/non-fiction, topic and author.

I want to be able to learn something about the demographics of the borrowers. Now that’s a question that uses an entirely different asset class of information.

Now what if I wanted to know this information to plan an advertising campaign or create the market for additional services?

Now we are talking Big Data!

Big Data is a place where we are no longer observing our systems but engaging with them to create value through gaining insight from the information being gathered every moment.

So what’s the myth? That creating these systems is prohibitively expensive! Like imagine digitising the entire contents of a library. Well, since the beginning of this new millennium this Herculean effort has been going on, and by now most of humanity’s greatest literature has probably been digitised. Meanwhile every new publication is being born digital!!! So it’s been paid for already!

Here is another myth! The use of advanced math for analysing trends and patterns in data, such as complex machine learning algorithms, is for university research labs or the closed doors of the likes of Google! Use of advanced computing techniques such as cluster programming is beyond the reach of your coders.

All a big myth! Why??

Because in the last two years these techniques have been packaged and made accessible. They are now waiting to be leveraged. This leads to a new frontier.

What if Big Data could be about gaining insight from all the information we have locked away in our current information systems? In many large companies information is dispersed across a variety of repositories. What if we could mine this information? What could we learn? Why not use the latest machine learning and cluster computing to do just that?







Apache Sparked!!

From around the autumn/fall of 2015, I went through some serious soul searching. Since around 2012, we had been using a well established distributed data processing technology. I am referring to MapReduce. Honestly we were using it, but using it with a lot of effort in manpower to keep it running. I would describe MapReduce as being like a 1970s Porsche 911. It’s fast, it does the job, but by heaven the engine is in the wrong place, and get it wrong and into the country hedge you go in a frightening tailspin.

The experienced technologists at my company weren’t too keen on looking at alternatives. I am being completely frank here. They knew how to make it work. Like driving that 1970s Porsche 911, they knew when to go off the throttle and on to the brake, slow into corners and fast out. I can go on about racing metaphors. Yes I am a Porsche enthusiast.

The rookies weren’t keen on MapReduce at all. They would much prefer the latest 2015 Porsche 911. They wanted easy to use APIs, fast set up and easy maintenance. Keeping with sports car metaphors, these rookies wanted traction control, GPS navigation, leather, iPhone dock, Bluetooth, the works!

I had been hearing about Apache Spark, an alternative to MapReduce that would offer greater ease of use and installation, better performance and more flexibility.

Honestly it sounded too good to be true. It really did! MapReduce, which Google described publicly in a 2004 paper and which was widely adopted through Hadoop from around 2008, had done the rounds. It had been fighting hard since then and has a strong following in many companies.

I began by asking around. I canvassed opinions from people working with larger data sets and needing a cluster programming solution. What I got was suspicion of new technology.

Finally we spoke with a few contacts working in Big Data in Silicon Valley. They said that Apache Spark was the new disruptive kid on the block and was packing quite a punch.

Apache Spark was developed at the University of California, Berkeley, as a response to shortcomings in cluster computing frameworks such as MapReduce and Dryad.

So here we have a number of conflicting opinions, so what did we do?

We went ahead and tried it.

I have to say, we were not disappointed.

It was easy to install and, to our surprise, we found it compatible with our existing Hadoop Distributed File System (HDFS). To our joy it supported Amazon S3 and our beloved Cassandra NoSQL database. The subject of Cassandra is for another blog!

Anyway, the above list of compatibilities came to our attention as soon as we started to get our hands dirty. The other thing was support for Java and Maven.

However the real surprise came when we started using Apache Spark…..

The first thing we noticed was how easy it was to manage and process the data. Apache Spark’s principal programming abstraction is the Resilient Distributed Dataset (RDD). Imagine if you will an array! That’s essentially how an RDD appears to a programmer. What happens underneath is really interesting. The Apache Spark engine takes responsibility for processing the RDD across a cluster. The engine takes care of everything, so that all you need to worry about is building and processing your RDD.

We began our work with text documents. What we do as a company is to perform natural language processing on textual content. Very often this means parsing the document and then processing each line.

So in keeping with that we decided to create a simple text processing application with Apache Spark.

We installed Apache Spark on an Amazon Cloud Instance. We began with creating an application to load an entire text document into an RDD and apply a search algorithm to recover specific lines of text. It was a very simple test, but it was indicative of how easy the Apache Spark API was to use.
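The shape of that first test can be sketched in plain Python. This is not our application code; it is an illustrative stand-in in which a list of lines plays the role of the RDD, and the function names and sample text are assumptions. In PySpark the rough equivalent would be `sc.textFile(path).filter(...)` followed by `collect()`, with Spark distributing the work across the cluster.

```python
import os
import tempfile


def load_lines(path):
    """Load a text document as a list of lines (the stand-in for an RDD of lines)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def search_lines(lines, term):
    """Keep only the lines that contain the search term, ignoring case."""
    needle = term.lower()
    return [line for line in lines if needle in line.lower()]


def demo():
    """Write a small made-up document to a temp file, then search it."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
        tmp.write("Revenue rose in Q1.\nCosts were flat.\nRevenue fell in Q2.\n")
        path = tmp.name
    try:
        return search_lines(load_lines(path), "revenue")
    finally:
        os.remove(path)
```

Load, filter, collect: the whole test is three steps, which is why it gave us such a quick read on the API.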

We also noticed that using the same RDD concept we could work with real-time streams of data. Also there was support for machine learning libraries too.

From the start of our investigation to the present, we have become more and more convinced that Apache Spark was the right choice for us. Even our experienced MapReduce people have been converted, because it’s easy to use, fast and has a lot more useful features.

Beyond the added features offered by Apache Spark, what struck me was the ability to operate in real-time.

In our view this presents an opportunity to move away from large scale batch processing of historical data towards a new paradigm where we are engaging with our data like never before.

I for one am very keen to see what insights we can learn from this shift.



Very nice dear but what will it do for our business?

Hi Everyone!

How do I explain what we do? I know, with a story. Now imagine my voice like a 1950s detective novel, and read on.

I was having a conversation with a family friend. This friend was a senior executive at a leading financial institution and is now running a venture capital fund. She asked me what I was doing these days. I spoke about my company and our flagship product Lyticas. I went on to talk about Financial Analytics, Big Data, private cloud etc. Without realising it I had begun talking in buzzwords and soundbites.

It was then she stopped me and said, “Yes dear, but what exactly do you do?” For a moment I was stumped. I then paused, collected my thoughts and began again.

I then asked her, “Auntie, what do you do in venture capital, and how did your experience at your previous company help you in your current capacity?”

She replied, “A critical part of VC, is in making accurate company valuations and benchmarking against other companies in the sector. We look closely at revenue and the prospects for growth.”

I heard this and asked, “do you do these things to decide whether a company is worth investing in and also if you have already invested, is this company on track?”

“Indeed!”, she replied.

“Cool! So how do you get your information? How long does it take to compile, process and maintain it? I say maintain as it’s always changing,” I asked intently.

“We have a team of analysts and interns to do it. I don’t get involved myself!”, she said rather quickly.

“Well Auntie, I can now describe to you what we do!”, I exclaimed. “Our software helps you to make accurate company valuations, by gathering data, processing it and providing the tools to help benchmark and forecast.”

“But honey we do that already how is your offering different?”

“We make it faster, greatly improve accuracy and make the information secure, yet accessible wherever you are. There are no infrastructure costs as it’s in a private cloud and supported 24/7.”

“Really? Show me!”

So in ten minutes, from her iPhone 6, we signed up to our application and began looking at the working capital, earnings and equity ratios of recently IPO’ed SEC listed companies.

“What about the big blue chips? I have friends who look at economic forecasting, basically pension people and Swiss bankers who look after private wealth management.”

“Sure!”, I said, and then we did an instant comparison of Microsoft, Apple, General Motors, Tesla, Johnson & Johnson, Pfizer, Marriott, Halliburton and Macy’s.

I kid you not we looked at a mini portfolio the kind a private wealth manager would look at for his favourite Russian Oligarch.

“Okay”, she said, “what are you doing Monday?”

Yeah I know that script sounds as cheesy as an infomercial on cable TV. But I stand by what we do as a company: enabling timely, effective execution in the area of company valuation, benchmarking and forecasting. We never stand still, we are always learning and moving ahead.

Actually you can sign up, just click on the link, and let me know when you have signed up. A colleague or I will take you through it and, depending on whether it’s right for you, arrange a full system trial.



Data in Seattle

A few weeks ago, from 14th to 19th September to be exact, I was in Seattle. I was attending the XBRL.US Data forum with folk from Apurba Technologies and our partners at Fujitsu.

It was a great conference. We were around some wonderful people talking XBRL, and all the wonderful ways we can use XBRL as our data format.

And yes it was interesting!!! I am a geek and I am proud. However there was a reason why I was there, apart from the fact that Seattle is just beautiful!!!

I was there to present at the data forum, with my colleague from Fujitsu, a solution for enhanced validation of XBRL. This new solution, called XWandCloud, would give all those CFOs receiving rather official reminders to improve their XBRL filings a simple means to do just that. I am being intentionally polite; if I got such a letter from the SEC, I would be straight onto the Valium.

Then when I had calmed down a bit I would register on XWandCloud and get my XBRL filings validated. Finally, after visiting my friendly XBRL doctor from Fujitsu, and after he has waved his magic XWand on the celestial cloud, I could relax and take stock, looking into the past and present and hypothesizing about my future by using Apurba’s Lyticas Prism Cloud, built into XWandCloud. With the Lyticas Prism I could then analyse my pre-market statement using out of the box Key Performance Indicators (the common accounting ones) and compare them to my previous quarters.

For a reasonable subscription I could also see how my company is doing compared to others. Now all of a sudden my XBRL filing appears to have a useful purpose other than getting me into a state every damn quarter! I can use it to help my company chart a safe course.

And this leads to the real pearl of wisdom I gathered from this conference.

XBRL can make a difference to regular folks. It’s not just a regulatory burden placed on companies. I have been aware that XBRL has plenty of uses other than financial; my own company Apurba Technologies Inc has been involved in using XBRL in the construction, energy and transportation industries (CET-Taxonomy) with AGC Surety Wells Fargo.

What made the conference special for me was meeting other folks who felt the same and were doing something about it.

Another area that caught my eye was the use of XBRL as a means of reporting on corporate actions in the financial sector. Just the thought of using XBRL for these purposes made the light bulb turn on. Adoption of such a standard would facilitate data interchange.

This could be used by a financial institution within its own data ecosystem, or could be the currency of a wider data ecosystem involving financial institutions (e.g. banks, insurance companies), financial exchanges, private/public sector companies and governments.

It sounds like I am alluding to big data again? Well it seems these days that all rivers lead to that data ocean.

I am still processing all that I learned at the conference, and I will be blogging about it at the data dive during the coming months.

It may be a while before I blog again so I shall leave with some words from the Tao Te Ching -

“The Sea is lord of Ten thousand Streams only because it lies beneath them”.


Finance, Health and the Watercooler

Have you ever found some of the most confounding questions and off the wall answers come to you at the water cooler?

I had been reading an article in the paper on the rising cost of running the UK’s primary health care provider, the National Health Service (NHS).

The NHS was founded by the then Labour Government after World War II, in 1948.

It is an institution that has weathered many storms during its history and continues to be reported on. Quite frankly the issues around Obamacare pale in comparison when considering how to keep this British national institution running well into the 21st century.

Now I must say, I am a technology entrepreneur, not a doctor. I am a lay person, who has been listening to the debate since he was a child.

The NHS is an incredibly large organisation. It serves as the primary health care provider for the majority of the population of the United Kingdom. And it serves the needs of the individual from birth to death.

The information generated by this institution is staggering. We talk about Big Data. Well, around the NHS we are talking about Big Data in mega supersize quantities by the volume, variety, velocity and volatility.

To give some context to healthcare big data:

When we talk about volume, we imagine health care records for every patient, kept updated throughout that patient’s life. We can also mean all the documents and invoices generated by a hospital or other healthcare centers. Truly the list is endless.

So we now have our second term, variety. We have a wide range of data within our system. We never know just what we may need, so for safety’s sake we store as wide a variety as possible.

The third term is velocity. Information is coming to us from doctors, nurses, suppliers, medical instruments – once more it’s endless, and it’s coming to us at varying speeds.

For example a patient’s biannual checkup means two updates a year to their healthcare record.

If we look at a hospital purchase ledger, that could be updated daily. If we look at a hospital patient, their notes could be updated hourly. If a patient was in intensive care then we could be looking at data coming to us in real-time, which would mean high velocity.

The last is volatility – this is where we track decisions that have been made. Maybe a different course of treatment was taken after a second opinion.

So that’s the potential nature of health care big data, but what does it mean to the NHS?

The critical areas of concern for the NHS are the delivery, effectiveness and cost of health care. In my humble opinion an effective big data initiative could help greatly in addressing these areas of concern.

Now I am an engineer not a doctor, but even I can see that it’s all about the patient.

In the data context, it’s all centered on the patient’s health care record. At the inception of the NHS this would have been on paper; by now though, and it’s been 66 years, this should be electronic.

Even better would be a health care record in a format that would facilitate exchange between systems. Whether such a health care record or format exists is not in the scope of this blog.

Here we are in the art of the possible…

Just as we can have exchange of financial information between financial IT systems (we call it XBRL) we also have a counterpart in health care called HL7 – CDA (Health Level 7 – Clinical Document Architecture).

HL7 – CDA, like XBRL, is based on eXtensible Markup Language (XML).

The goal of CDA is to specify the syntax and supply a framework for the full semantics of a clinical document. It defines a clinical document as having the following six characteristics:

  • Persistence
  • Stewardship
  • Potential for authentication
  • Context
  • Wholeness
  • Human readability

Now let’s play a game of what if?

What if we could use HL7-CDA as a means to encompass the medical record of a patient?

What if we could map the records of medical care provided to the patient, outcome of that care and cost of care?

What if we could report on health care cost using XBRL and link this to the HL7-CDA document?

Well I suppose we could if we used XML databases, but what would be the point?

Maybe with a synthesis of different XML datasets, taxonomies, predictive analytics and visualization we could have a go at answering these questions:

  • How did the hospital meet the care requirements for the patient?
  • What costs were incurred by the hospital?
  • What were the care outcomes?
  • Where is the cost efficiency for the hospital, if any?
  • What drugs/treatments were used?
  • What was the cost in developing those drugs/treatments?
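As a toy illustration of how a couple of these questions might be approached, here is a minimal sketch in plain Python. Both XML fragments are invented for the purpose and are NOT valid HL7-CDA or XBRL; the element names, the encounter id used as the link and the function name are all assumptions. It joins a clinical record to a cost record and reports treatment, outcome and total cost together.

```python
import xml.etree.ElementTree as ET

# Both fragments are invented for illustration; they are NOT valid HL7-CDA or XBRL.
CLINICAL_XML = """<records>
  <encounter id="E1" patient="P1"><treatment>DrugA</treatment><outcome>recovered</outcome></encounter>
  <encounter id="E2" patient="P2"><treatment>DrugB</treatment><outcome>readmitted</outcome></encounter>
</records>"""

COST_XML = """<costs>
  <cost encounter="E1" unit="GBP">1200</cost>
  <cost encounter="E1" unit="GBP">300</cost>
  <cost encounter="E2" unit="GBP">950</cost>
</costs>"""


def link_costs_to_outcomes(clinical_xml, cost_xml):
    """Join the two documents on the encounter id and total the cost per encounter."""
    totals = {}
    for cost in ET.fromstring(cost_xml).iter("cost"):
        key = cost.get("encounter")
        totals[key] = totals.get(key, 0) + int(cost.text)
    report = []
    for enc in ET.fromstring(clinical_xml).iter("encounter"):
        report.append({
            "patient": enc.get("patient"),
            "treatment": enc.findtext("treatment"),
            "outcome": enc.findtext("outcome"),
            "total_cost": totals.get(enc.get("id"), 0),
        })
    return report
```

The hard part in real life is not the join itself but agreeing on the shared identifiers and taxonomies that make the join possible.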

Now for countries without a state funded health service we could also be asking -

  • What is the variation in health care insurance premiums?
  • Was the patient covered adequately by their health insurance?

Clearly a lot of questions, and ones to be answered by a powerful health analytics system with a serious architecture!

Watch this space…

Thank you for reading.

Data interoperability


Data is fragile. Entering it is labor intensive, checking for quality is hard, reusing it is even harder. Take the example of something as simple as tracking the progress of construction projects. These use fairly simple forms with information fields such as project start date, estimated end date, percentage completion, resource allocation, funding status etc. But in reality, the process is heavily manual. Forms are often hand written and take days to process, since the data needs to be re-entered, often manually again, in electronic format. The form then gets disseminated and the necessary action, such as release of funds for the next stage of the project, can finally take place. Some of these large projects, although primarily owned by a contractor, are eventually carried out by hundreds of sub-contractors. Once a sub-contractor fails to deliver on deadline, this delay ripples through an entire ecosystem and often puts the overall project in jeopardy.

Now imagine that the data were tagged during submission using a technology like XBRL (eXtensible Business Reporting Language). Everyone in the value chain, from the sub-contractors who are submitting the data and tracking their own progress, to the contractors who own the overall project, the federal agencies who often fund these projects and the banks who provide bonds, can read and consume this data within minutes. Decisions are taken in minutes too. This can result in millions of dollars of savings for all involved, not to mention efficiency and transparency about the status of the projects.
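Here is a minimal sketch, in plain Python, of what "tagged during submission" buys you. The tag names below are invented for illustration and are not taken from any real XBRL taxonomy, and the function names and the schedule rule are assumptions too. Because each field is tagged, a program can read the form directly and flag a project that is falling behind, with no manual re-entry.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Invented tag names for illustration; no real XBRL taxonomy is reproduced here.
SUBMISSION = """<progressReport>
  <ProjectStartDate>2014-03-01</ProjectStartDate>
  <EstimatedEndDate>2015-09-30</EstimatedEndDate>
  <PercentComplete>40</PercentComplete>
  <FundingStatus>StageOneReleased</FundingStatus>
</progressReport>"""


def read_progress(xml_text):
    """Parse the tagged form into typed values, with no manual re-entry."""
    root = ET.fromstring(xml_text)
    return {
        "start": date.fromisoformat(root.findtext("ProjectStartDate")),
        "end": date.fromisoformat(root.findtext("EstimatedEndDate")),
        "percent_complete": int(root.findtext("PercentComplete")),
        "funding_status": root.findtext("FundingStatus"),
    }


def is_behind_schedule(progress, today):
    """Flag a project whose completed share lags the elapsed share of its timeline."""
    total_days = (progress["end"] - progress["start"]).days
    elapsed_days = (today - progress["start"]).days
    expected_percent = 100 * elapsed_days / total_days
    return progress["percent_complete"] < expected_percent
```

Every party in the chain can run the same check on the same submission, which is where the minutes instead of days come from.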

The first step towards this was taken today by a consortium of companies, under the leadership of USC Chico and with technical backing from Apurba, by proposing an XBRL-CET (Construction-Energy-Transportation) taxonomy.

The Perfect System

I am continuing the Tron:Legacy metaphor.

In this movie the arch bad guy, an Artificial Intelligence (AI), horribly misinterprets the requirements of his maker.

His maker said, “Go forth and build me the perfect system. BTW could you also manage all the other AIs too while you are at it. I am going away to contemplate my navel”.

The bad guy did not start out as bad. He had the best of intentions. In his mind the perfect system was efficient and orderly. He manifested his interpretation of his maker’s will precisely but it was at the expense of creativity and openness.

In the end there was a revolt and the bad guy was overthrown.

I have been thinking, “What lessons can we learn for our world?”

Apart from the obvious, that we should be careful about our requirements, pick a decent project management team and, for goodness’ sake, monitor project progress, what else?

It’s an ever changing world.

The systems we build need to adapt. Now of course there is the factor of obsolescence in any technology we use. That’s life!

However we can allow for this in our Enterprise Architecture.

So how have we addressed this at Apurba?

At Apurba we work with data, lots of data, from the big to the small and almost anywhere in between. We work in financials and so we spend a lot of time with eXtensible Business Reporting Language (XBRL).

We work in construction, energy and transport (more XBRL).

We work in health care, with HL7 and CDA.

We work with standard XML, RDBMS and other disparate sources of data.

It could make us quite desperate, but it doesn’t, because we have a dynamic system architecture. This week we talked about this architecture in BigDataScience, Stanford.

The architecture is realized by allowing new components to be added and old ones removed dynamically.

This principle is built into our flagship Lyticas family of analytics products.

Our clients can mix and match the services they need to meet their data analytics goals.

This can be an evolving process, and our clients don’t need to build their analytics capability with Apurba products alone. Our architecture allows products from other vendors to be added too.

So how do we do this specifically?

To get a bit more technical, we separate our data driven and event driven components. We then set up a mechanism for communication between components / services in our architecture.
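As a rough illustration of the idea, and not Apurba's actual implementation, here is a minimal Python sketch of a message bus on which components can be registered and removed at run time. Every name in it (`ServiceBus`, the `filing.loaded` topic, the two sample components) is hypothetical.

```python
from collections import defaultdict


class ServiceBus:
    """Minimal in-process bus; components are added and removed at run time."""

    def __init__(self):
        # topic -> {component name: handler}
        self._handlers = defaultdict(dict)

    def register(self, topic, name, handler):
        self._handlers[topic][name] = handler

    def unregister(self, topic, name):
        self._handlers[topic].pop(name, None)

    def publish(self, topic, payload):
        # Deliver the payload to every component currently registered on the topic.
        handlers = list(self._handlers[topic].items())
        return {name: handler(payload) for name, handler in handlers}


# Data driven component: its result depends only on the data it is given.
def total_revenue(rows):
    return sum(row["revenue"] for row in rows)


# Event driven component: it reacts to the fact that something happened.
audit_log = []


def audit(rows):
    audit_log.append(f"received {len(rows)} rows")
    return len(audit_log)


bus = ServiceBus()
bus.register("filing.loaded", "revenue", total_revenue)
bus.register("filing.loaded", "audit", audit)

rows = [{"revenue": 120}, {"revenue": 80}]
first = bus.publish("filing.loaded", rows)

# Remove a component dynamically; the remaining one is unaffected.
bus.unregister("filing.loaded", "audit")
second = bus.publish("filing.loaded", rows)
```

Because the components only know about the bus and not about each other, a component from another vendor could be registered in the same way.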

The other principle we live by is to use the appropriate methodology for the scale of systems we have been entrusted to build.

There is no point going through an intense Enterprise Architecture methodology for developing a single web service.

We have Agile for the small. However, it does help us to know the Enterprise Architecture landscape of our client’s environment so that we can craft the best solution.

Now I am going to stop here before I get too heavy. From my blogs you may gather I like methodology. I like it a lot. I like to learn from the experiences of others as well as my own screw ups & successes. However it must all be used in the right context and at the right time.

So what’s the thinking?

To seek perfect systems where every requirement is met perfectly for all time is folly.

To ensure our solutions can cope with change is smart.

To reconcile the two, one needs to accept that things will change and build the architecture to cope.

Thanks for reading.

Compliance and analytics – Two sides of the same coin

Thanks to the US SEC, we now have tagged financial data freely available. What are we doing with it? Imagine this: someone has acquired the data (thanks to an accounting system), someone has prepared the data (courtesy of CFOs and their teams), someone has labeled the data (this time thanks go to the person who did the XBRL tagging), someone has made sure the labels are correct (the ever suffering auditors get the credit this time) and finally someone has made sure all this was done according to an established and monitored process (this time our thanks go to the SEC EDGAR validation system). Wow! All we now have to do is figure out what model to use and what colors to apply in the graphs, and we have analytics and visualization!

Data visualization – Pulling data from various data files from SEC filing

This all sounds too simplistic, right? That is because the picture I just painted is just that.

There are tons of issues hidden in this simple process. The devil is always in the details. What is the quality of this tagging? How much detail is in the financial tables versus hidden in the notes sections? How granular is this information? Is the information sufficiently functionally related so that a complete picture can be drawn? Is it possible to query the model built from this data? Does the model give us enough data points to forecast anything reliably? How much data is really tagged? How much of this tagging is consistent across multiple quarters? How consistent are different companies in tagging the same concept with the same element? How much of the data uses extended, customized tagging? All of these are valid questions and raise a lot of very legitimate issues.
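A couple of these questions can at least be measured. The Python sketch below works on a tiny made-up sample (the companies and the tagging choices are invented for illustration, not real filing data). It computes the share of facts that use a custom extension element, and which elements different companies used for the same concept. Treating `us-gaap` and `dei` as the standard prefixes is an assumption of this sketch.

```python
from collections import defaultdict

# Made-up sample: (company, concept the analyst means, element actually used).
FACTS = [
    ("CompanyA", "revenue", "us-gaap:Revenues"),
    ("CompanyB", "revenue", "us-gaap:Revenues"),
    ("CompanyC", "revenue", "cc:TotalNetSales"),  # hypothetical custom extension
    ("CompanyA", "net income", "us-gaap:NetIncomeLoss"),
    ("CompanyB", "net income", "us-gaap:NetIncomeLoss"),
    ("CompanyC", "net income", "us-gaap:NetIncomeLoss"),
]

# Assumption: anything outside these prefixes counts as a custom extension.
STANDARD_PREFIXES = {"us-gaap", "dei"}


def extension_share(facts):
    """Fraction of facts tagged with an element outside the standard prefixes."""
    custom = [
        element
        for _, _, element in facts
        if element.split(":")[0] not in STANDARD_PREFIXES
    ]
    return len(custom) / len(facts)


def elements_per_concept(facts):
    """Map each concept to the sorted list of distinct elements used for it."""
    used = defaultdict(set)
    for _, concept, element in facts:
        used[concept].add(element)
    return {concept: sorted(elements) for concept, elements in used.items()}


share = extension_share(FACTS)
by_concept = elements_per_concept(FACTS)
```

A concept that maps to more than one element, as "revenue" does in this sample, is exactly where cross-company comparison needs extra care.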

But what is really the primary question we should ask? To me, that question is:

“Does this tagged data help us build better analytics than we could before tagged data was available?”

A company snapshot

As someone who has been working on data analytics for quite a number of years, I can safely say that the answer is an emphatic YES! Yes, there are problems. Yes, the data is not always reliable, accurate or sometimes even usable, but it is better than what we had to work with previously. We now have tools that can build quick models, connect relevant data, compare performance and even make predictions. And this trend has not gone unnoticed either. Leena Roselli, Senior Research Manager at Financial Executives Research Foundation, Inc. (FERF), recently authored a report titled “Data Mining with XBRL and Other Sources” that explored some solutions just hitting the market, including I-Metrix (RR Donnelley), Ask 9W (9W Search) and Lyticas (Apurba). While we are still pioneering in the financial analytics and visualization space using XBRL as the primary source of data, the initial solutions are quite promising.

The bottom line is that this mandate has given us a golden opportunity to move from data mandate to data consumption, from the avoidance of punishment to generating deeper Business Intelligence. Join us in that voyage!



Copyright © 2017 The Data Dive
