When we began our adventures in Spark we soon brought up the topic of smoke testing.
So what’s smoke testing?
In my mind, smoke testing means making sure that our system doesn’t break as soon as it’s turned on.
It came to the foreground because, at the time, this was our first foray into this new world.
We had installed three critical components:
Apache Spark – Cluster Processing Framework
Cassandra – NoSQL Database
Hadoop – Storage and Cluster Processing Execution Framework
Okay, these days that’s hardly earth-shattering in the Big Data analytics world, where that trio is as common as fish and chips with mushy peas in England. So I am not giving away any company secrets, which is good for me. However, if there are any Big Data newbies out there reading this post, this is a great combination.
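For those newbies, the "doesn’t break as soon as it’s turned on" idea can be sketched in a few lines of Python. This is only an illustration, not what we actually ran: the host names are hypothetical, and the ports are the commonly used defaults (7077 for a Spark standalone master, 9042 for Cassandra’s CQL port, 8020 for the HDFS NameNode), which may well differ on your cluster.

```python
import socket

# Hypothetical hosts; ports are common defaults and are assumptions here.
SERVICES = {
    "spark-master": ("localhost", 7077),
    "cassandra": ("localhost", 9042),
    "hdfs-namenode": ("localhost", 8020),
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def smoke_check(services=SERVICES):
    """Map each service name to whether its port accepted a connection."""
    return {name: is_listening(host, port)
            for name, (host, port) in services.items()}
```

Calling smoke_check() returns a dictionary of service name to True or False. A port that accepts connections only tells you the process is up, not that it does useful work, so treat this as the very first rung of the ladder.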
Back to our question though, how should we smoke test this? The main thrust of our Lyticas product is the handling of XBRL, which involves a mixture of numerical and textual processing. It also deals with stock price information, which comes as a time series.
To that end, we focused on testing how well our stack would respond to these data formats:
1. Text – XBRL
2. Time Series Data
Both of the above use Spark Core functionality. We analysed XBRL and retrieved statistical information from time series data.
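To give a feel for the time series half, here is a rough sketch in plain Python rather than our actual Spark job. It mimics the shape of Spark’s RDD aggregate call (a zero value, a per-element function and a combine function) on ordinary lists standing in for partitions. The names ZERO, seq_op, comb_op and summarise, and the idea of passing prices as lists of floats, are my own assumptions for illustration.

```python
import math

# Accumulator: (count, sum, sum of squares, min, max)
ZERO = (0, 0.0, 0.0, math.inf, -math.inf)


def seq_op(acc, x):
    """Fold one price into a partition's accumulator."""
    count, total, total_sq, lo, hi = acc
    return (count + 1, total + x, total_sq + x * x, min(lo, x), max(hi, x))


def comb_op(a, b):
    """Merge the accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2],
            min(a[3], b[3]), max(a[4], b[4]))


def summarise(partitions):
    """Compute summary statistics over lists standing in for partitions."""
    total = ZERO
    for part in partitions:
        acc = ZERO
        for x in part:
            acc = seq_op(acc, x)
        total = comb_op(total, acc)
    count, s, sq, lo, hi = total
    if count == 0:
        raise ValueError("no data points")
    mean = s / count
    # Population variance; clamp tiny negatives caused by rounding.
    variance = max(sq / count - mean * mean, 0.0)
    return {"count": count, "mean": mean, "stdev": math.sqrt(variance),
            "min": lo, "max": hi}
```

For a smoke test, a tiny hand-checkable input is enough: summarise([[10.0, 12.0], [11.0, 13.0]]) should report four points with a mean of 11.5, a minimum of 10.0 and a maximum of 13.0. If the cluster version of the job disagrees with a toy calculation like this, something is broken before you have even looked at performance.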
One other important part of our strategy was to test the performance at the least optimal configuration possible. Kind of like finding out whether your starship can still maintain a stable warp field with a single nacelle.
Here is something else to ponder. What if your system gave an acceptable level of performance at the most basic configuration? For example, what if, running on the minimum number of processing nodes, databases and servers, performance was still strong because of the quality of the code!
I will leave this blog with that bombshell.