4 Considerations for Delivering Data Quality on Hadoop
Organizations are increasingly trying to become data driven. In my last blog post, I outlined the steps organizations should take to become data driven using the Kotter Model. In this post, I want to highlight a key aspect of the digital transformation journey: Data Governance, specifically in the era of Big Data.
As companies adopt Big Data technologies, they will have to understand how the standard Data Governance practices, such as Data Quality and Stewardship, that they’ve used to build Enterprise Data Warehouses, Master Data Management (MDM) and Business Intelligence (BI) reporting will apply to the new source systems that will be processed through Hadoop.
But the source systems that fed Data Warehouses and MDM were limited in terms of the 3 V’s (Velocity, Variety, Volume), because there was no requirement to access more complex data structures for BI reporting. With the rise of the Hadoop ecosystem, a new term, Data Lake, was coined to accommodate all the diverse data that Hadoop is able to support. Gartner cautions that “Data Lakes carry substantial risks.” One of the concerns when it comes to Data Lakes is how to manage Data Governance. Until now, companies have had some sort of governance implemented in their existing Data Architectures, but the growing volumes of unstructured and streaming data in today's Data Lakes are forcing companies to revisit their processes for maintaining data governance and stewardship.
Here are some of the key elements to consider for governing data in Hadoop:
Analytics needs to be part of Data Organization: As the technological landscape around Data changes, companies need to rethink the IT organizational structure that supports and monitors data. Companies should consider having a center of excellence (COE) for Data (EDW, BI, Master Data), and analytics (Data Scientists et al.) needs to be part of that COE. This will allow the COE leadership to have some control over Data Governance. Data Scientists will have access to the raw and transformed data that EDWs and MDM use, which they can use in their analytics. Data Quality rules and Stewardship processes can then be applied, wherever applicable, to the data that Data Scientists use.
Prioritize what data needs to be cleansed: Though all data could be important, not all data is equal. It is important to define where the data came from, how the data will be used and how the data will be consumed. Data that is consumed by the customers or vendors in your business ecosystem will need to be cleansed, matched and survived. Stringent data quality rules might need to be applied to data that requires strict audit trails and carries regulatory compliance guidelines. On the other hand, we would not get much value from cleansing social media data or data coming from sensors. We should also consider having the data cleansed on the consumption side rather than on the acquisition side. Therefore, a single governance architecture might not apply to all types of data.
Balance governance vs results: Keep in mind that analytical results can be affected when the data is ‘cleansed’: by applying data quality rules, you could damage the analytical value of that data. Traditional data quality practices always insist on ‘correcting’ the data, but in analytics a value that looks wrong could be an outlier that signifies a change in the pattern. In these scenarios, it makes sense to use data quality methods that signify the quality of the dataset as a whole, as opposed to individual record content. Data quality checks such as ‘the Data Load is 10x smaller or larger than expected’ or ‘more than half of all the values are empty or null’ would be a better fit for Data Scientists who like to apply machine learning models on the data.
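To make the two dataset-level checks mentioned above concrete, here is a minimal sketch in plain Python. The function names (is_empty, null_fraction, check_load_size, dataset_quality_report) and the choice to treat whitespace-only strings as empty are assumptions made for illustration; they do not come from any particular data quality tool. Note that the checks only flag a dataset and never modify individual records, so outliers stay intact for analysis.

```python
from typing import Any, Dict, Iterable, List


def is_empty(value: Any) -> bool:
    # Assumption: None and blank/whitespace-only strings count as "empty or null".
    return value is None or (isinstance(value, str) and value.strip() == "")


def null_fraction(values: Iterable[Any]) -> float:
    # Share of values that are empty or null; an empty input yields 0.0.
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if is_empty(v)) / len(values)


def check_load_size(actual_rows: int, expected_rows: int, factor: int = 10) -> bool:
    # True when the load is strictly less than `factor` times smaller or larger
    # than expected; a load that is 10x off (or more) fails the check.
    if expected_rows <= 0 or factor <= 1:
        raise ValueError("expected_rows must be positive and factor greater than 1")
    return expected_rows < actual_rows * factor and actual_rows < expected_rows * factor


def dataset_quality_report(rows: List[Dict[str, Any]], expected_rows: int) -> Dict[str, Any]:
    # Summarize the quality of the dataset as a whole without touching any record.
    all_values = [v for row in rows for v in row.values()]
    frac = null_fraction(all_values)
    return {
        "size_ok": check_load_size(len(rows), expected_rows),
        "null_fraction": frac,
        "nulls_ok": frac <= 0.5,  # fails only when more than half the values are empty
    }
```

A report like this could be attached to each load as metadata, leaving it to the Data Scientist to decide whether a flagged dataset is still usable.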
Make use of big data processing engines: Data Quality tasks such as profiling and matching inherently involve processing record by record and/or doing aggregations on the source data and hence can be processed in Hadoop. Therefore, Data Quality can also take advantage of the distributed computing in Hadoop and cloud architectures. Additionally, we can also apply machine learning to Data Quality functions such as data matching.
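As a rough illustration of why profiling suits distributed processing, the sketch below profiles a numeric column partition by partition and then merges the partial results. It is plain Python, so it stays self-contained; on a cluster, an engine such as MapReduce or Spark would run the per-partition step in parallel. The names ColumnProfile, profile_partition, merge_profiles and profile_column are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass
class ColumnProfile:
    # Partial aggregates that can be combined across partitions.
    count: int = 0
    nulls: int = 0
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    total: float = 0.0


def profile_partition(values: Iterable[Optional[float]]) -> ColumnProfile:
    # "Map" step: scan one partition record by record.
    p = ColumnProfile()
    for v in values:
        p.count += 1
        if v is None:
            p.nulls += 1
            continue
        p.minimum = v if p.minimum is None else min(p.minimum, v)
        p.maximum = v if p.maximum is None else max(p.maximum, v)
        p.total += v
    return p


def _pick(fn, x, y):
    # Combine two optional values, ignoring a side that has no data.
    if x is None:
        return y
    if y is None:
        return x
    return fn(x, y)


def merge_profiles(a: ColumnProfile, b: ColumnProfile) -> ColumnProfile:
    # "Reduce" step: combine two partial profiles into one.
    return ColumnProfile(
        count=a.count + b.count,
        nulls=a.nulls + b.nulls,
        minimum=_pick(min, a.minimum, b.minimum),
        maximum=_pick(max, a.maximum, b.maximum),
        total=a.total + b.total,
    )


def profile_column(partitions: Iterable[Iterable[Optional[float]]]) -> ColumnProfile:
    # Profile every partition independently, then merge the results.
    return reduce(merge_profiles, map(profile_partition, partitions), ColumnProfile())
```

Because each partial profile depends only on its own partition and the merge step does not depend on how the records were grouped, the data can be split across nodes however the cluster sees fit. Matching is harder to distribute this way, since it compares records against each other rather than aggregating them.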
Data Governance is becoming a key area of focus for CIOs, and extending governance to Big Data Analytics will be a challenge for them. Making analytics a part of the Data COE team and analyzing how existing governance policies can be applied to Hadoop would be a good start.