by Damon Samuel, January 13, 2017
Is Big Data mostly hype, or does it offer real hope for gaining unprecedented insights that can be leveraged into massive competitive advantages?
The answer is quite simple: it all depends upon what you do with that data.
Simply collecting and storing massive volumes of data won’t do you much good. But if you can combine analytics with the immense amounts of data now available, then you’ll have those treasured insights that can offer real-world contributions to the success of your business.
How can you apply analytics to your store of Big Data? Depending upon your situation, one of the following three options will be your best bet for combining Big Data with analytics.
Three Ways to Move Analytics into Big Data
Companies are investing in distributed file systems and capturing much larger volumes of data than ever before. The idea is that larger samples should lead to better analysis, and the holy grail of census-level information is tantalizingly close.
Fresh off a six-month project moving analytics onto a Hadoop stack and distributing it, we realized that there are really only three options for tackling this process.
As a thought experiment, consider that we have data measuring the height of every person in the USA. The data is partitioned by state, and we want to find the average height of all people.
One: You can continue to sample the data and apply traditional analytic best practices.
This option is the easiest to execute and develop, and (assuming you do an effective job with the sampling) it’s still going to provide relevant insights and effective models. Be aware that, because you are sampling, there is always the possibility of an edge case that your model won’t account for. This has been a risk for a long time, but it is frustrating that it still exists today.
Properly executed, this option would draw a random sample from the whole population and estimate the average from it. With a well-designed sample, that estimate will have a very small error relative to the true average.
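As a rough illustration of option one, the sketch below draws a simple random sample and reports the estimated average along with its standard error. The heights are synthetic (generated from an assumed normal distribution, not real data), and the function name and sample size are chosen only for this example.

```python
import math
import random
import statistics

def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, std_err

# Synthetic stand-in for "everyone's height" in cm (an assumption for the demo).
rng = random.Random(42)
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]

estimate, std_err = estimate_mean(population, sample_size=2_000)
true_mean = statistics.fmean(population)
print(f"estimate={estimate:.2f}  std_err={std_err:.2f}  true={true_mean:.2f}")
```

The estimate should land within a few standard errors of the true mean, but it is still an estimate: anything rare enough to be missed by the sample is invisible to the model.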
Two: You can break down the math of the analytic techniques that you want to apply into the MapReduce framework, and execute it so that the model can be trained on 100% of the data.
While this option is faithful to what the model is supposed to be, it requires that you know beforehand the structure of the model that you want to develop.
This option is also quite mathematically and programmatically intense. Even for something as simple as Ordinary Least Squares, you must split the matrix algebra into mappers and reducers to execute it.
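To give a feel for what that split looks like, here is a minimal sketch for the simplest case: one predictor plus an intercept. It runs in plain Python rather than on an actual Hadoop cluster, and the function names and data are hypothetical. Each mapper reduces its partition to five sums, which are the entries of the X'X and X'y matrices; the reducer adds them up; and the final step solves the normal equations.

```python
from functools import reduce

def map_partition(rows):
    """Mapper: collapse one partition of (x, y) pairs into five sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reducer: the sums simply add element-wise across partitions."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Solve the 2x2 normal equations for intercept and slope."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical partitions; every point lies on y = 2x + 1.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]
intercept, slope = solve(reduce(combine, map(map_partition, partitions)))
print(intercept, slope)
```

Note that the mapper already encodes the model structure (which sums to collect). Add a second predictor and the mapper, reducer, and solver all have to change.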
In our thought experiment, a mapper would emit the sum of heights and the count of people for each state. A reducer would then total those sums and counts across all states, leaving a simple division to get the true average.
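The thought experiment can be sketched in a few lines of plain Python that mimic the map and reduce steps. The state codes and heights below are made-up illustration values, and the function names are hypothetical.

```python
from functools import reduce

# Made-up heights in cm, partitioned by state (illustration only).
heights_by_state = {
    "TX": [170.0, 180.0, 175.0],
    "VT": [160.0],
    "CA": [165.0, 185.0],
}

def mapper(heights):
    """Emit (sum of heights, count of people) for one state's partition."""
    return (sum(heights), len(heights))

def reducer(a, b):
    """Add two (sum, count) pairs together."""
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(reducer, map(mapper, heights_by_state.values()))
true_average = total / count
print(true_average)  # 172.5
```

Carrying the counts along matters: averaging the three state averages here would give 170.0, because it weights a one-person state the same as a three-person state.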
Three: You can move to some of the nascent distributed analytic tools that are available, such as Spark and its SparkR and PySpark tie-ins. These tools are beginning to address the complexity in the second option, helping to make analysts’ jobs easier.
Even so, only a limited number of analytic functions are currently supported. More are coming online all the time, but the available tool set is still limited. The pool of people in the marketplace who are familiar with these new tools, and effective in using them, is limited as well.
But the advantage of this third option is that it dramatically reduces the programming complexity. It reduces the need to specify the model structure a priori, before coding your MapReduce job. And it allows you to build your models on entire data sets rather than samples.
Back to our thought experiment: this option also yields the true average, without the coding complexity seen in option two, by simply calling the built-in average function on the height data.
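In PySpark, that call might look like the sketch below. It assumes a working pyspark installation, runs against a local master rather than a real cluster, and uses a hypothetical column name (height_cm) and hypothetical input rows.

```python
def average_height(rows):
    """Return the mean of the height column using Spark's built-in aggregate.

    `rows` is a list of (state, height_cm) tuples (a hypothetical layout).
    """
    # Imported inside the function so the sketch can be read and loaded
    # on a machine that does not have Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: local mode, not a cluster
        .appName("average-height")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["state", "height_cm"])
        # Spark plans the distributed sum and count itself;
        # there is no mapper or reducer code to write.
        return df.agg(F.avg("height_cm")).first()[0]
    finally:
        spark.stop()
```

Calling average_height([("TX", 170.0), ("VT", 160.0)]) would hand the whole aggregation to Spark. The trade-off is the one described above: you are limited to the functions the tool already provides.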
The Best of the Three?
So, which of the above three options will work best for you? The answer — as it so often is — is that it depends.
The selection of the best option is very context dependent: the specific business case that you’re trying to solve; the specific data that you’re working with; the hardware that you have available to you; the analyst skillsets that are readily available to your project team. All of these factors combine to dictate which of the three methods will be the best choice for your situation.
For companies that are just starting down the Big Data path, the sampling option is probably best — option one.
For extreme research cases, where the methodology that you’re looking to employ is not already developed and distributed in an existing package, your best bet is to write the math into the MapReduce function — option two.
For more advanced companies that are well along the Big Data maturity curve, Spark (or one of its variants) is the way to go — option three.
There are also hybrid approaches where the initial model is designed on a smaller data set, then coded in a MapReduce framework or in Spark, etc., to get the true estimates on the entire population.
Business Cases Will Drive Your Decisions
A basic but important consideration is the simple question of how you will use Big Data. Business drivers should lead the decision.
At a recent Cloudera Hadoop roadshow I attended, a Cloudera representative stated that, in their experience, approximately 90% of the data that’s pulled into Hadoop results from offloading pre-existing structured and relational databases. Because of that, traditional analytic methodologies and processes are still highly relevant to using the data, and to using larger volumes of data to solve business problems.
This suggests that option one will remain viable for some time, as the use cases that require streaming or recursive learning are still new and not yet in demand by the business units.
What Does the Future Hold for Big Data Analytics?
Everyone wants to know where the future is heading — and of course, nobody really knows with certainty. But I’m happy to offer a couple of educated guesses.
Let’s start with this: I feel that the analytic tool of the future is still up in the air. Linda Burtch at Burtch Works has been doing a SAS vs. R survey for a few years now, and just included Python in 2016. The combination of Python and R finally eclipsed SAS in its dominance. Spark wasn’t even considered.
Neither SAS, R, nor Python was developed to take advantage of multiple cores on multiple nodes, whereas Spark was designed for just that purpose. As more companies leverage distributed file systems on premises or in the cloud, more pressure will be put on legacy tools to adapt. Meanwhile, Spark, which has offered data frames since its 1.3 release, unified them with its Dataset API in the 2.0 release, and is quickly ramping up its ability to work with R via sparklyr and with Python via PySpark.
While Python has its advantages as a complete programming language, I believe that the critical mass of R in data science, combined with Spark support, will leave R/SparkR as the dominant data science tool in the open source community. This is based on conjecture; the only reason I and some of my analytic colleagues started working in Python was because it played better with Spark and MapReduce.
Let’s also talk a bit about the future of SAS.
SAS was the de facto analytic tool up until about five or six years ago. Though there have been other tools available, SAS has absolutely dominated the marketplace for executing analytics. But they’ve been threatened and impacted tremendously by the adoption of open source tools such as R and Python in the past several years. And they continue to be under considerable competitive pressures.
But they’re fighting back with several Big Data tie-ins of their own, efforts that are allowing their code to also take advantage of distributed architecture.
It’s still a little too early to call whether they’re going to be successful in maintaining market share with this approach. But given the large pool of experienced SAS practitioners in the marketplace, it will be something to keep an eye on during the next few years.
We’re Happy to Help
At RCG, we have lots of experience in tackling each of the options for applying analytics to Big Data. We can help you decide which option is best for your unique situation. And we can guide you around the landmines that may be awaiting your journey into Big Data analytics.
Without analytics, the hoopla about Big Data is mostly just hype. But combine the unprecedented volume of data now available with insightful analytics, and you’ll have a very powerful tool that offers massive potential.
And even if you don’t tap into that potential, you can be certain that some of your competitors will.
To Learn More:
Contact Us to find out which of these tools makes the most sense for your environment.