Data governance with Snowflake: 3 things you need to know

Data is ruling the world. But can we trust the data?

With SaaS applications on the rise and data processing moving to the cloud, vast amounts of data arrive at an ever-increasing pace, and business decisions must be made in real time. Whether your organization has decided to migrate its data from legacy data silos or to load large volumes of raw data from disparate sources, chances are you have considered a cloud data warehouse, such as Snowflake, to address both of these common data integration use cases.

Yet data coming from so many disparate origins can become difficult to keep track of. It is a top priority for an organization to guarantee the accuracy and appropriateness of its data sources and, above all, to meet the expectations of self-service by all users. This is where data governance is most impactful.

Data governance is not only about data protection and control; it is also about enabling and encouraging people from everywhere in the organization to share and process meaningful information drawn from this data. It defends the integrity, quality and trustworthiness of the data being shared across the organization. Apply a well-designed data governance strategy to a cloud-based data warehouse, and the benefits will be amplified.

Snowflake as a modern data warehouse

Snowflake is a cloud data warehouse that offers the performance, concurrency, and simplicity needed to store and analyze all your organization’s data in one location. It provisions data storage repositories that can be used for ingesting structured data used for reporting and data analysis. Snowflake’s capability of accepting mountains of unrefined data from numerous sources in various formats also makes it an attractive Data Lake solution to many IT decision-makers. Because of Snowflake’s ability to separate its storage from its computing resources, you can grow your data lake’s storage capacity dynamically without regard to compute nodes and resize your computing clusters elastically to meet demands only when they are needed.

Beyond the warehouse and into the lake

As an alternative to storing varying and sometimes limited datasets in scattered, disparate data silos, a data lake should offer a single, integrated system for easily storing and accessing vast amounts of data while providing complete and direct access to raw (unfiltered) organizational data. It is the place where data should be accessible to business intelligence professionals and many other users throughout the organization.

A data lake built on a modern data warehouse should offer the following advantages:

  1. Immediately load, analyze and query raw data without prior parsing or transformation.
  2. Stream both structured and semi-structured data, without hand coding or any manual intervention.
  3. Run native SQL and schema-on-read queries against structured and semi-structured data.
  4. Store massive volumes of raw data cost-effectively, while deploying only the computing capacity needed.
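
To make the schema-on-read idea in the list above more concrete, here is a minimal sketch in plain Python. It is not how Snowflake implements it; the sample records, the `get_path` helper and the `query` function are all hypothetical and only illustrate the principle that raw, semi-structured rows are stored untouched and a structure is applied when they are read.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": {"id": 1, "country": "FR"}, "amount": 12.5}',
    '{"user": {"id": 2}, "amount": 7.0, "coupon": "SPRING"}',
    '{"user": {"id": 3, "country": "US"}}',
]


def get_path(record, path, default=None):
    """Walk a dotted path such as 'user.country' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def query(raw_rows, columns):
    """Parse each raw row at read time and project the requested paths."""
    for raw in raw_rows:
        record = json.loads(raw)
        yield {col: get_path(record, col) for col in columns}


# The "schema" is just the list of paths the reader asks for.
rows = list(query(RAW_EVENTS, ["user.id", "user.country", "amount"]))
```

Records that lack a requested field simply yield `None` for it, so new or irregular sources can be loaded without first agreeing on a fixed table layout.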

The importance of data governance

Data governance should be a top priority for any data-driven organization that is keen on making the most of its data for analytics and business intelligence purposes. Using a cloud data warehouse like Snowflake is the right approach, but it is not enough on its own. IT leaders eager to take on the digital transformation challenge without planning a proper data governance strategy risk diving head-first into the data lake they've built, only to find themselves reemerging in a data swamp.

The consequences of living without data governance and data quality

With vast amounts of data being poured into the data lake at an ever-increasing pace, and business decisions having to be made in real time, data quality is almost impossible to scale without implementing the appropriate measures. Ideally, the data sets entering your data lake should enrich it, but unfortunately sometimes they contaminate it. As a consequence, it could take weeks for IT teams to publish new data sources that took only seconds to ingest. To make matters worse, when data consumers don't realize that the new data has been made available, they end up creating their own versions of 'the truth' by adding their own rules on top of newly created data sources. Ultimately, too much time is spent, or wasted, on preparing and protecting data rather than on analyzing information and delivering valuable business insights.

Top-down vs bottom-up

Traditionally, when building an enterprise data warehouse, data governance has been applied through a top-down approach. First, a central data model must be defined. This involves the expertise of data professionals such as data stewards, data scientists, data curators, data protection officers or data engineers, who remodel the data several times for semantic purposes before it can be ingested for analytics. Once the data is ingested, the data catalog regulates lineage and access privileges. Although this method is effective in centrally managing data, this traditional approach to data governance can't scale to the digital era: too few people access too little data.

An alternative is to design data governance for the data lake through a bottom-up approach. This more agile model has multiple advantages over the centralized model. It scales across data sources, use cases and audiences, and no specific file structure is needed for data to be brought in. Using a cloud infrastructure and big data, this method can drastically accelerate the ingestion of raw data. Data lakes generally start with a data lab approach, where only the most data-savvy people can access raw data. It will then require other layers of governance to connect the data to business context before other users can take advantage of it. A data governance strategy like this ensures that the data lake will consistently provide a trusted, single source of truth to all users.

However, the bottom-up governance approach can also be difficult to manage when it is incorporated as an afterthought rather than alongside the model.

Balancing a collaborative data governance process

With more and more incoming data sources, introduced by more and more people from different parts of the organization, the ideal governed data lake will have the right data governance strategy: one that establishes a more collaborative approach to governance up front. This way, the most knowledgeable among your business users can become content providers and curators. Working with data as a team from the start is essential to this approach. Otherwise, you may become overwhelmed by the amount of work needed to validate the trustworthiness of the data pouring into your data lake.

Delivering data you can trust

So, we now understand why data governance is so important from the initial stages of a cloud data migration. We also understand that implementing a collaborative data governance strategy is the only way to go. Now, let’s explore the recommended steps for applying this to your data lake on Snowflake.

First step: Discovery and cleansing

Capture and identify what is needed to ensure the quality of your data sets using modern pattern recognition, data profiling and data quality tools. If you apply them as soon as your data enters the landscape, you can understand what’s in your data and you can make it more meaningful. Your discovery and cleanse phase should include the following tools and functions:

  • Automatic profiling through data cataloging. Make this process systematic by applying it automatically to each core data set. Data is automatically profiled, and metadata is created and classified to facilitate data discovery.
  • Self-service data preparation. Allow potentially anyone to access a data set and then cleanse, standardize, transform, or enrich the data.
  • Data quality from the source. Conduct data quality operations upfront and natively at the data sources, and throughout the data lifecycle, to ensure that any data operator, user or app can consume trusted data at the end.
  • Pervasiveness through self-service. Deliver the capability across all platforms and apps and put it into everyone's hands, from a developer to a business analyst.
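
As a rough illustration of what automatic profiling can produce, the sketch below computes a few per-column statistics with the Python standard library. It is an assumption-laden toy, not the output of any particular profiling or cataloging tool: the `profile` and `pattern_of` functions and the sample rows are hypothetical.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to a coarse shape: digits become 9, letters become A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile(rows):
    """Return null count, distinct count and most common pattern per column."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        # Treat missing keys, None and empty strings as nulls.
        present = [v for v in values if v not in (None, "")]
        patterns = Counter(pattern_of(v) for v in present)
        report[col] = {
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
        }
    return report


# Hypothetical sample data set with one empty email and one short zip code.
SAMPLE = [
    {"id": 1, "zip": "75001", "email": "a@x.io"},
    {"id": 2, "zip": "9410", "email": ""},
    {"id": 3, "zip": "10001", "email": "b@y.io"},
]
report = profile(SAMPLE)
```

Values that deviate from a column's dominant pattern, such as the four-digit zip code here, are natural candidates for the cleansing step.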

Second step: Organize and empower

The benefit of centralizing trusted data in a shareable environment is that, once operationalized, it will save your organization time and resources. This can be accomplished through the following:

  • Organize a data catalog and create a single source of trusted and protected data, which will provide control for documenting the data and its lineage. This information should include where the data is coming from, who touched that data, and what the relationships are between various data sets. Data lineage will give you the big picture to trace your data flows from their sources to their final destination, and it helps you comply with privacy regulations such as GDPR or CCPA.
  • Empower people to curate, remediate and protect data. Enable the back-office capability to designate data stewards who can maintain data and make it easy and attractive to find and consume. Leave the preparation to the people who can accurately qualify it and the sensitive data to those who should view it.
  • Engage peers to improve data. Using collaborative data management features such as data stewardship, you can create orchestrated workflows and stewardship campaigns that will get everyone involved in data quality.
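
The lineage question "where is the data coming from" can be pictured as a walk over a dependency graph. The sketch below is a minimal, hypothetical example: the data set names and the `upstream_sources` function are invented for illustration and do not reflect the API of any data catalog.

```python
# Hypothetical lineage: each data set maps to the data sets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_clean"],
    "sales_clean": ["crm_raw", "orders_raw"],
    "crm_raw": [],
    "orders_raw": [],
}


def upstream_sources(dataset, lineage):
    """Return every data set that feeds `dataset`, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(lineage.get(parent, []))
    return seen


sources = upstream_sources("sales_dashboard", LINEAGE)
```

The same traversal, run in the opposite direction, would answer the compliance question of which downstream reports are affected when a source containing personal data changes.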

Third step: Automate and enable

After all discovered and cleansed data is centrally organized and key stakeholders have been engaged to collaborate in the stewardship of the data to keep it trusted and compliant, it is time to implement the automation phase. Automating data processing is crucial not only for maintaining a scalable workflow, but also for eliminating tedious and counterproductive manual tasks as soon as they become repetitive.

  • Use machine learning to learn from remediation and deduplication to suggest the next best action to apply to the data pipeline or capture implied knowledge from the users and run it at scale through automation.
  • Automate protection with data masking or encryption. Selectively share data across your organization for development, analysis and more, without disclosing personally identifiable information (PII) to people who aren't authorized to see it.
  • Enable everyone. Establish one platform for all and leverage user-friendly apps for your community of stakeholders.
  • Use API services to pump valuable datasets from your data lake back into your line-of-business apps. Direct the flow of your data pipelines toward the applications that benefit from the trusted data created by your data governance efforts.
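
The masking idea from the list above can be sketched as follows. This is an illustration only, under several assumptions: the set of PII fields, the `mask_value` and `view_for` functions and the sample record are hypothetical, and an unsalted hash is used purely to keep the example short. A production setup would rely on the masking or encryption features of the platform in use.

```python
import hashlib

# Hypothetical classification of which fields count as PII.
PII_FIELDS = {"email", "phone"}


def mask_value(value):
    """Replace a value with a short, stable token derived from SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def view_for(record, authorized):
    """Return the record as an authorized or unauthorized user would see it."""
    if authorized:
        return dict(record)
    return {
        key: mask_value(val) if key in PII_FIELDS else val
        for key, val in record.items()
    }


customer = {"id": 42, "email": "jane@example.com", "country": "FR"}
analyst_view = view_for(customer, authorized=False)
steward_view = view_for(customer, authorized=True)
```

Because the token is stable, masked records can still be joined or counted by analysts without the underlying value ever being shown to them.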

Inevitably, as more organizations roll out their digital transformation strategy, there will be massive interest in data governance as they shift towards cloud data integration. Snowflake, as we've mentioned, offers a modern cloud data warehouse solution where a data lake can be built to accommodate anything from large data migrations to big data projects, regardless of their format or source. This is a tremendous advantage considering that you can load and access all your data from a single source of truth.

That said, there is no guarantee that the information provided in the data lake will be reliable unless a robust data governance strategy is undertaken. Only through proper discovery and cleansing, stewardship, quality and self-service, will data governance be truly realized.

Talend and Snowflake

Talend Data Fabric works hand in hand with cloud data warehouses like Snowflake to provide data management with real-time speed and unshakeable trust. It leverages smart technologies like pattern recognition, data cataloging, data lineage, and machine learning to organize data at scale and turn data governance into a team sport by enabling organization-wide collaboration on data ownership, curation, remediation, and reuse.

Without technology, methodology, and security best practices, a data lake can easily become a stagnant pool of stale data with little to no intelligible value to anyone. However, if the right measures are taken to ensure data governance is implemented, the promise of a properly designed, built, and deployed data lake could ultimately be delivered, bringing with it accessible insights and value to the entire organization. 

Learn more about how Talend and Snowflake work together.
