This is the first installment of a two-part blog series focused on governing Big Data and Hadoop.
So, you’re ready to embark on your data-driven journey, huh? The business case and project blueprint are well defined and you’ve already secured executive sponsorship for your digital transformation. You’re ready to run a modern data platform based on Hadoop, and your team is set on the starting blocks to deliver the promises of Big Data to the wider organization.
But then you feel some hesitation as you envision a whole new set of challenges. Are you ready to operate at the fast pace of Big Data? To control the risks that will inevitably arise from the proliferation of data in your data lake? To scale a data lab that is currently accessible to only a few data scientists into a broadly shared, self-service center of excellence that anyone can access and that seamlessly connects with your critical business processes?
Like it or not, you’re not equipped for success until you address the legacy enterprise challenges related to security, documentation, auditing and traceability. But the good news is that there is a modern way to harness the power of your Hadoop initiative with data governance in order to bring you significant business benefits.
Tackling the Six Most Pressing Issues in Governing Various Types of New Big Data
To get a full understanding of the potential benefits and best practices related to Data Governance on Hadoop, Talend commissioned a report by TDWI, which outlines six pillars to ensure the success of your Big Data project:
1. Deliver Big Data accessibility to a wide audience, without putting data at risk. Self-service approaches and tools allow IT leaders to empower data workers and analysts to do their own data provisioning in an autonomous way. But one cannot just throw data preparation tools into the hands of business users without first having a governance framework to deliver this service in a managed and scalable way.
2. Accelerate data ingestion with smart discovery and exploration. It takes weeks, sometimes months, to onboard new sets of data and publish them to the right audience(s) using traditional data platforms. Now, with new “schema-on-read” approaches, IT and data experts can onboard data as it comes. Once onboarded, data is available on tap to a whole community of data workers, who gain the flexibility to further discover, model, connect and refine data in an ad-hoc way, at any time.
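The schema-on-read idea can be illustrated with a minimal Python sketch (not a Hadoop API; all names here are illustrative): raw records land in the lake untouched, and a schema is applied only at read time, tolerating records that don’t yet carry every field.

```python
import json

# Raw events land in the lake as-is -- no schema is enforced at write time.
raw_records = [
    '{"user": "alice", "amount": "42.50", "ts": "2017-03-01"}',
    '{"user": "bob", "amount": "13.00"}',  # this record is missing a field
]

# A hypothetical read-time schema: field name -> conversion function.
schema = {"user": str, "amount": float, "ts": str}

def read_with_schema(raw, schema):
    """Apply the schema when the data is read, tolerating missing fields."""
    parsed = json.loads(raw)
    return {field: cast(parsed[field]) if field in parsed else None
            for field, cast in schema.items()}

rows = [read_with_schema(r, schema) for r in raw_records]
```

Because the schema lives with the reader rather than the writer, a new consumer can reinterpret the same raw data with a different schema at any time, which is what makes ad-hoc discovery possible.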
3. Capture metadata for the fullest use and governance. Metadata is the crown jewel of data-driven applications. It increases data accessibility by embedding documentation, brings context on top of raw data for better interpretation and draws the connection between disparate data points to turn data into meaning and insights. Last but not least, it brings control and traceability over the information supply chain. Modern data platforms provide new ways to capture, stitch, crowdsource and curate metadata.
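To make the metadata point concrete, here is a minimal sketch of what a captured metadata record might hold. This is a hypothetical structure for illustration, not the format of any particular catalog: it carries documentation (owner, source), context (tags) and traceability (upstream lineage) alongside the dataset name.

```python
from datetime import datetime, timezone

# A hypothetical minimal metadata record for a dataset in the lake.
def make_metadata(name, source, owner, upstream=()):
    return {
        "name": name,
        "source": source,            # where the raw data came from
        "owner": owner,              # accountable data steward
        "upstream": list(upstream),  # lineage: datasets this one derives from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tags": [],                  # curated or crowdsourced annotations
    }

raw = make_metadata("web_clicks_raw", "web server logs", "ops-team")
curated = make_metadata("web_clicks_clean", "web_clicks_raw",
                        "analytics-team", upstream=["web_clicks_raw"])
```

Even this tiny record supports the governance uses named above: documentation travels with the data, and the `upstream` field is the raw material for tracing the information supply chain.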
4. Unify the disciplines of data management into a common platform. Silos destroy the value of enterprise data and bring both quality and security risks. There’s a need to establish a single point of control and access to data across integration styles, while decentralizing responsibilities across data citizens.
5. Consider Hadoop for its flexibility, but beware of its governance challenge. Hadoop can process bigger and more diverse data faster, and deliver it to a wider audience in a more agile way. But now that you can operate at extreme scale, speed and reach, there’s a mandate to master data traceability and auditability, protection, documentation, policy enforcement, etc. Consider environments like Apache Atlas or Cloudera Navigator, together with metadata-driven platforms, to fully address these challenges.
6. Get ready for change, continuous innovation and diversity. IT systems are evolving from monolithic to multi-platform. SQL databases are no longer a one-size-fits-all environment where data is modeled, stored, linked, processed and accessed. Metadata-driven approaches help simplify data access across disparate data stores, provide data lineage and traceability, as well as accelerate data migration and movement.
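The lineage and traceability idea behind these metadata-driven approaches can be sketched in a few lines of Python. This is an illustrative toy, with made-up dataset names, not the lineage model of Atlas or Navigator: given a graph of which datasets derive from which, walking the upstream edges yields the full provenance of any result.

```python
# Hypothetical lineage graph: dataset -> datasets it is derived from.
lineage = {
    "quarterly_report": ["sales_clean", "fx_rates"],
    "sales_clean": ["sales_raw"],
    "sales_raw": [],
    "fx_rates": [],
}

def trace(dataset, graph):
    """Return every upstream dataset (the full provenance), depth-first."""
    seen = []
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen
```

Answering “where did this report’s numbers come from?” then becomes a graph traversal rather than an archaeology project, which is exactly the auditability that governance demands.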