Big Data Governance Challenges

September 9, 2016 | By Ramesh Koovelimadhom

Successful data governance solves business problems by identifying the root causes of data problems that impede business effectiveness, implementing management control over the activities that affect those root causes, and measuring improvement.

What has changed now that Big Data is in the picture? Big Data projects face all the traditional data management challenges, such as the lack of a governance strategy, and add a few of their own.

  • Relying too heavily on data scientists: Organizations tend to lean on their high-priced data scientists for everything needed to turn raw data into insight. Yet significant effort still goes into preparing the data, curating it, and governing it. Use appropriate skill sets for data preparation; more often than not, it is not the data scientist's forte. A good data governance strategy also has several components that are not typically in the data scientist's skill set, like setting up processes that dictate how data is stored and protected, setting standards and procedures for authorized use of data, and putting controls and procedures in place to ensure the rules are being followed. Data governance is best led by a collection of data stakeholders from IT, the lines of business, and, where regulatory oversight is needed, compliance.
  • Letting schemas run wild: HDFS is forgiving and will accept just about any kind of data you throw at it, but that does not mean you can dump whatever you want into the data lake and sort things out later. The "schema on read" approach may work for some types of data, especially data that changes often and can't be pigeonholed into a preconceived schema. But schema on read can only take you so far, and at some point schemas must be enforced. Left unchecked, it runs counter to core data governance principles, which require that you know what kind of data you are storing and processing; hence more vigilance is needed in curating what is in the lake (see the schema sketch after this list).
  • Data ROT – not cleaning up: A good data retirement strategy is always part of a good data governance strategy. With big data, data volumes grow exponentially. Veritas' 2016 Data Genomics Index survey1 found that 40 to 60 percent of the data an average organization stores these days is redundant, obsolete, or trivial (ROT). Not cleaning up is not just a failure of good business sense; it is a failure of data governance (see the retention sketch after this list).
  • Not supporting people and process with technology: Good data governance is about having the right sponsor, the right people in place, the right policies, and the right set of tools to support those people and processes. There will never be a single tool that solves all data governance challenges, so good data governance is also about finding the right mix of tools to support the endeavor and staying on the lookout for new tools that make governance easier. Tools such as Apache Atlas2 (incubating), which came out of the Hortonworks Data Governance Initiative, can help automate some governance processes. Atlas is a scalable and extensible set of core foundational governance services designed to help enterprises effectively and efficiently meet their compliance requirements within Hadoop and to integrate with the whole enterprise data ecosystem: it features data classification, centralized auditing, search and lineage, and security and policy management (see the Atlas sketch after this list).
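
To make the schema point concrete, here is a minimal sketch of enforcing a schema on read with PySpark instead of letting Spark infer one. The file path, field names, and types are hypothetical, not from the article; the idea is that non-conforming records get surfaced and quarantined rather than silently absorbed into the lake.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema-enforcement").getOrCreate()

# Declare the schema explicitly instead of relying on inference.
# The extra string field captures records that fail validation.
events_schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_ts", TimestampType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("_corrupt_record", StringType(), nullable=True),
])

# PERMISSIVE mode keeps malformed rows but routes them into the
# corrupt-record column, so bad data can be reviewed, not ingested blindly.
events = (spark.read
          .schema(events_schema)
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .json("/data/lake/raw/events/"))

events.cache()  # some Spark versions require caching before querying the corrupt column
bad_rows = events.filter(events["_corrupt_record"].isNotNull())
print(f"{bad_rows.count()} records failed schema validation")
```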
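
On the ROT point, the retention sketch below shows one way a retirement policy might be put into practice: walking an HDFS tree over WebHDFS and flagging files untouched for longer than a retention window. The hdfs client library, the namenode endpoint, and the 365-day window are all assumptions for illustration; a real cleanup job would route flagged files through review and archival rather than deleting them outright.

```python
import time

from hdfs import InsecureClient  # pip install hdfs; client choice is an assumption

RETENTION_DAYS = 365  # illustrative policy window, not prescribed by the article
cutoff_ms = (time.time() - RETENTION_DAYS * 86400) * 1000

# Endpoint and user are placeholders for a real WebHDFS namenode.
client = InsecureClient("http://namenode:50070", user="governance")

def find_stale(path):
    """Recursively yield (path, age_in_days) for files older than the cutoff."""
    for name, status in client.list(path, status=True):
        child = path.rstrip("/") + "/" + name
        if status["type"] == "DIRECTORY":
            yield from find_stale(child)
        elif status["modificationTime"] < cutoff_ms:
            age_days = int((time.time() * 1000 - status["modificationTime"]) / 86400000)
            yield child, age_days

# Flag candidates for review; actual retirement should go through approval.
for stale_path, age in find_stale("/data/lake/raw"):
    print(f"ROT candidate ({age} days old): {stale_path}")
```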
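
Finally, a sketch of driving Atlas programmatically. Atlas exposes a REST API; the call below uses the v2 basic-search endpoint, as found in later Atlas releases, to list Hive tables carrying a given classification. The host, credentials, and the "PII" classification name are assumptions for illustration.

```python
import requests

ATLAS = "http://atlas-host:21000"  # default Atlas port; host is a placeholder
AUTH = ("admin", "admin")          # placeholder credentials

# Basic search: find Hive tables tagged with a (hypothetical) PII classification.
search = {
    "typeName": "hive_table",
    "classification": "PII",
    "limit": 25,
}
resp = requests.post(ATLAS + "/api/atlas/v2/search/basic", json=search, auth=AUTH)
resp.raise_for_status()

for entity in resp.json().get("entities", []):
    attrs = entity.get("attributes", {})
    print(entity["guid"], attrs.get("qualifiedName"))
```

A query like this could feed the auditing and classification checks described above, for example by flagging tables that carry sensitive classifications but lack a recorded owner.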

Works Cited

1. Veritas. (2016, March 15). "Veritas Global Databerg Report Finds 85% of Stored Data Is Either Dark or Redundant, Obsolete, or Trivial (ROT)." Retrieved from https://www.veritas.com/news-releases/2016-03-15-veritas-global-databerg-report-finds-85-percent-of-stored-data

2. Apache Atlas. Retrieved from https://atlas.apache.org/#/