Is Domain Knowledge Needed for Data Science?

by Ramesh Koovelimadhom

Much has been written about this question, with arguments both for and against. Those in favor say that knowing the domain helps us create the right hypotheses, which data science can then help prove or disprove. Without domain understanding, data science could become a long fishing expedition in the ever-increasing data lake, and the “Iceberg of Business” could start melting long before the needed changes are actually implemented.

Those questioning the need for domain knowledge cite data mining competitions such as Kaggle¹ and KDD², which have demonstrated how data science can be successfully outsourced to people without domain expertise. Many companies have run competitions on topics as diverse as optimizing flight routes, predicting ocean health and detecting diabetic retinopathy. Data scientists with little or no expertise in the domain have responded brilliantly with useful solutions. Some data scientists have even won across multiple domains, indicating that data science skills are transferable across domains.

Then there are those who counter the argument from Kaggle’s success: in these competitions, the domain experts have already generated the hypothesis by posing the right business question and preparing the data, so the competitors need only model and test.

With today’s massive data sets, along with the mathematical tools and computing power to crunch these numbers, the old-world paradigm of hypothesizing before modeling is likely to be challenged. Google, with its approach to language learning, has shown a whole new way of understanding the world without any a priori models or theories.

So are domain experts necessary? Is domain knowledge necessary? What is “domain knowledge”? How much is enough?

Every field of software engineering talks about the need for domain knowledge. Business Analysis requires domain knowledge. Testing requires domain knowledge. How much domain knowledge does Data Science need?

Let’s try and think through these questions via an example from investment banking.

Is it enough to know that a firm, acting as underwriter or agent, serves as an intermediary between an issuer of securities and investing institutions? And that normally an investment bank buys a new issue of securities for a negotiated price? That the investment bank then forms a syndicate and resells the securities to its customers and to the public?
Is it enough to know that they stand at the heart of financial markets in that they help make both the primary market and, through their trading desks and market makers, the secondary market too?
Is it enough to know that stock is a share in the ownership of a company? And that there are different types of stock?
Is it enough to know the definitions of EPS and PE Ratio and the formulas behind them?
Is it enough to know that futures are a type of derivative instrument or financial contract? Or do we need to know the nuances behind how the hedgers and speculators operate? And do we have to know the finer details behind going long or short that will help us formulate better hypotheses?
Will a cursory knowledge of spreads, namely that they involve taking advantage of the price difference between two different contracts of the same commodity, suffice? Or will we need to know the gory details behind calendar spreads and inter-market spreads? And do we have to fully understand what is being talked about when one of the traders tells us that one of his investor clients may take a Short June Wheat and Long June Pork Bellies?
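
To make the EPS and PE Ratio question concrete, here is a minimal sketch of the standard textbook formulas in Python: basic EPS is net income less preferred dividends, divided by the weighted average number of common shares, and the PE Ratio is the share price divided by EPS. The function names and the figures are assumptions made up purely for illustration; they are not taken from any real company or library.

```python
def earnings_per_share(net_income, preferred_dividends, weighted_avg_shares):
    """Basic EPS: earnings available to common shareholders, per share."""
    if weighted_avg_shares <= 0:
        raise ValueError("weighted_avg_shares must be positive")
    return (net_income - preferred_dividends) / weighted_avg_shares


def price_to_earnings(share_price, eps):
    """PE Ratio: share price divided by EPS.

    Returns None when EPS is zero or negative, where the ratio is
    generally not considered meaningful.
    """
    if eps <= 0:
        return None
    return share_price / eps


# Hypothetical figures, chosen only so the arithmetic is easy to follow.
eps = earnings_per_share(
    net_income=10_000_000,
    preferred_dividends=1_000_000,
    weighted_avg_shares=4_500_000,
)
pe = price_to_earnings(share_price=30.0, eps=eps)
print(f"EPS = {eps:.2f}, PE = {pe:.1f}")
```

Knowing the formulas takes a few lines. Knowing what to do when earnings are negative, or which share count to use, is where the domain expert comes in.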

I posit that Data Science will always remain a team sport. If we operate under the paradigm that a Data Scientist with technology and mathematical skills should also have deep domain knowledge, no amount of domain knowledge will ever be enough. The key to effective teams is communication. A Data Scientist needs just enough domain knowledge that communication with the business experts is not hindered; that much will be enough for small teams to create valuable insight using Data Science. The experts in the business will bridge the remaining knowledge gaps.

Gartner included Citizen Data Science as a new class in its 2015 Hype Cycle³, placing it in the innovation trigger region and expecting it to reach the plateau in 2-5 years. Gartner research director Alexander Linden suggests cultivating “citizen data scientists” (people on the business side who may have some data skills, possibly from a math or even social science degree) and putting them to work exploring and analyzing data. We certainly need to do that, noting that the team approach to creating “things of value” is not going to go away any time soon.

Works Cited

1. Kaggle. Retrieved from https://www.kaggle.com/

2. KDnuggets, "Datasets for Data Science, Machine Learning, AI & Analytics". Retrieved from https://www.kdnuggets.com/datasets/index.html

3. KDnuggets, "Gartner 2015 Hype Cycle: Big Data is Out, Machine Learning is in". Retrieved from https://www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html