What is Data Lineage? (and How to Get Started)

You’re surrounded by data. Literally, every part of your business depends on it in one way or another. While you’re busy making decisions about how best to manage your data, it might feel like there’s no time to dive into the intricacies of precisely how well it’s working for your company.

Consider this. Data should be working for your company 24/7. To that end, knowing where it originated, how it got there, and how it travels through the business is paramount to its value. Enter data lineage, a tool that can dig into the origins of your data, make sense of it, and make sure it ends up in the hands that need it most.

Let’s explore what data lineage is and is not, why it’s even more important in the cloud, and how to find the best tool for your needs.

Data lineage explained

Data lineage is a map of the data’s journey. It includes the data’s origin, each stop along the way, and an explanation of how and why the data has moved over time. Lineage can be documented visually from source to eventual destination, noting stops, deviations, or changes along the way. This simplifies tracking for operational tasks such as day-to-day use and error resolution.
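
As a rough illustration of the "map" idea, lineage can be modelled as an ordered list of hops, each recording where the data came from, where it went, and how and why it moved. The sketch below is a minimal Python example under that assumption. The Hop class, the describe_journey helper, and all system names are hypothetical and are not taken from any particular lineage tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One stop on a dataset's journey (hypothetical structure)."""
    source: str  # where the data came from
    target: str  # where it landed
    how: str     # the process that moved or changed it
    why: str     # the business reason for the move


def describe_journey(hops):
    """Return the journey as readable lines, from origin to destination."""
    lines = []
    for step, hop in enumerate(hops, start=1):
        lines.append(
            f"{step}. {hop.source} -> {hop.target} via {hop.how} ({hop.why})"
        )
    return lines


# Illustrative data only: these system names are made up for the example.
journey = [
    Hop("web_orders_db", "staging.orders", "nightly extract", "collect raw sales"),
    Hop("staging.orders", "warehouse.sales", "currency normalization", "consistent reporting"),
    Hop("warehouse.sales", "revenue_dashboard", "aggregate by region", "management review"),
]

origin = journey[0].source        # first hop's source is the data's origin
destination = journey[-1].target  # last hop's target is where it ends up

for line in describe_journey(journey):
    print(line)
```

A real lineage tool would typically store a branching graph rather than a single straight path, but the same three questions (from where, to where, how and why) apply to every edge.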

Data lineage vs. data provenance

While data lineage describes in depth where data comes from and how it moves through its analytic life cycle, data provenance is the historical record keeper. It provides a list of origins, including the inputs, entities, systems, and processes related to specific data. Because provenance focuses on the origin of the data, it allows data scientists to judge its quality.

Information from data provenance supports error tracking, re-enactment of data flows for updates, and source identification. It also helps sort source data in a data warehouse and identify relevant audit trails for governance. A number of different provenance forms exist, including copy-provenance, how-provenance, and why-provenance. Data lineage is considered why-provenance, focusing on the flow of data.

Data provenance can be used to determine the quality of data, allowing:

  • Decision making around specific data by revealing how it was collected
  • Determination of the trust level behind the data
  • Verification of the process used to collect the data
  • Duplication of the process when it is valuable

Why data lineage is important

With continually increasing streams of data available via the cloud, business users need data that is accessible and simple to use for business intelligence. Information about the data lifecycle, including how data moves through ETL (extract, transform, load), files, reports, and databases, can help a business dig deeper to improve all aspects of product life. Data lineage provides that information and more.

Information provided by source tracking alone can facilitate error resolution, support process changes, and reduce the time and resources needed for the system migrations that updates inevitably require. Data quality is enhanced by knowing who made a change, how something was updated, and which processes were used, and by ensuring that data always flows through data protection techniques. A data lineage tool builds confidence in the data among business users.

Data lineage is especially valuable in these areas:

  • Business Viability: Quality data keeps a business in business. All departments, including marketing, manufacturing, management, and sales, rely on data. Information collected about demographics and customer behavior helps refine design and improve product availability. Changes over time can be reviewed regularly by team leaders, helping them make decisions about products and sales. Details provided through data lineage paint a picture that lets a business keep learning about its products.
  • Changing Data: Data changes over time. Data acquired and accumulated in new ways must be combined with existing data and analyzed before management can use it to generate revenue. Data lineage provides tracking that makes this difficult task possible.
  • IT Requirements: When your IT team creates a new software development process, they will need access to all data sources. The comprehensive list provided by a data lineage tool saves time and money by quickly locating data sources.

If a business wants to review, for example, where sales information entered the system in order to test an idea about a new product or process, data lineage can provide that information. An extraordinary amount of data enters a business system each day, and data lineage reduces risk by providing data origin and information about how it is traveling through the system.

When it comes to trusting data and ensuring governance, lineage information becomes especially important. For example, the healthcare and finance industries are subject to strict regulatory reporting, so they must rely on data provenance and be able to demonstrate lineage, especially when data passes through today’s large open source technologies. A real-time record of where data came from, how it was used, who viewed it, and whether it was sent, copied, transformed, or received ensures that full details about any person or system in contact with the data are available at any time.
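
To make the idea of such a record concrete, here is a minimal Python sketch of an access log that can answer "who touched this record, and what did they do?". It assumes a simple in-memory list of events. The AccessEvent class, the helper functions, and the record and actor names are all hypothetical, and a regulated environment would need durable, tamper-evident storage rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    """One contact between a person or system and a record (hypothetical)."""
    record_id: str
    actor: str
    action: str  # e.g. "viewed", "copied", "transformed", "sent", "received"
    at: datetime


def history_for(events, record_id):
    """All events touching one record, oldest first."""
    return sorted(
        (e for e in events if e.record_id == record_id),
        key=lambda e: e.at,
    )


def actors_for(events, record_id):
    """Every person or system that came into contact with a record."""
    return sorted({e.actor for e in events if e.record_id == record_id})


# Illustrative data only: identifiers and timestamps are made up.
log = [
    AccessEvent("patient-001", "dr_lee", "viewed",
                datetime(2019, 8, 1, 9, 30, tzinfo=timezone.utc)),
    AccessEvent("patient-001", "etl_job", "transformed",
                datetime(2019, 8, 1, 2, 0, tzinfo=timezone.utc)),
    AccessEvent("patient-002", "dr_lee", "viewed",
                datetime(2019, 8, 1, 10, 0, tzinfo=timezone.utc)),
    AccessEvent("patient-001", "billing_service", "copied",
                datetime(2019, 8, 2, 14, 15, tzinfo=timezone.utc)),
]

for event in history_for(log, "patient-001"):
    print(event.at.isoformat(), event.actor, event.action)
```

Keeping the actor, action, and timestamp on every event is what lets an auditor reconstruct the full path of a single record on request.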

Data lineage helps Save The Children clean CRM records

Data lineage is helping organizations in many different industries manage the large flow of information, especially in the cloud. Save the Children (SCUK), an organization that steps in to help during humanitarian emergencies, requires rigid management of data and fund expenditures to comply with strict regulations. The organization needed a solution that would help it avoid suspicion of fraud, keep donors from being overwhelmed by contacts from other charities, and offer detailed transparency, while also ensuring that data management practices were easily accessible.

Over 50 data streams, including direct debit, online, messaging, and TV advertising, kept SCUK’s customer relationship management (CRM) team on their toes. They faced challenges just keeping on top of the constant growth of the data load while cleaning historical data, which encompassed 800 tables and 800 million records.

A comprehensive data lineage tool easily changed the cleansing and loading landscape, reducing information import by 60 percent. Other benefits included:

  • A specific view of donors that resulted in more targeted marketing and increased donations
  • Reduced duplications before CRM loading
  • Compliance readiness through improved visibility around specific records or pieces of data at any point in their lifecycle. Data can easily be traced to its access point and anywhere along its path to validate all process changes.

The cloud and the future of data lineage

The abundance of data makes gathering information simpler in some ways and managing it harder in others. The internet, cloud computing, mobile devices, and the Internet of Things (IoT) have made massive amounts of data accessible to every business.

The cloud makes data governance (the collection of processes, roles, policies, standards, and metrics that ensure effective and efficient use of information) imperative for business success. Data lineage helps sort and organize all that data, giving businesses a clear window into their data for fact checking and rapid access.

As the cloud continues to grow and evolve, data lineage will become increasingly important for governance. While data governance efforts protect data, they can also slow down or limit access. Trustworthy data that isn’t delivered to the right resource at the right time can have a negative effect on time to market.

Is your organization ready to manage data input from the cloud so that you can make more informed decisions in the moment?

Data lineage plays an important role in this rapidly changing system. Tracking data’s origin and its path through your business, including transformations and targets, is the only way to tackle errors head on and, through transparency, make governance issues a thing of the past.

The sheer volume of data at any given moment becomes unmanageable without the proper software tools and solutions. Falling behind the times and losing track of the data streaming in is simply not an option. A cloud solution offers scalability and reduced cost, as well as de-duplication, data quality, simple data exchange, and multiple source collection and storage. The data governance afforded by a data lineage solution is the key to a smooth ride in the cloud.

How to get started with data lineage

The General Data Protection Regulation (GDPR), which took effect in May of 2018, requires organizations to know where personal data lives and how it is processed, which in practice means focusing on data lineage to understand the flow of data through their systems. Data lineage supports data governance by making future changes and transitions, whether of people or systems, trackable and manageable. But how do you get started?

Data lineage is the perfect place to start to ensure data quality. Though tedious and time consuming, it is a must-have for any business.

  • Identify Data Elements: Contact business users to identify critical points for business function.
  • Track Origin: Track listed elements back to their origin one by one.
  • Note Sources and Links: Create a spreadsheet to label sources and link elements that can be combined.
  • Create a Map: Build maps for each system and a master map of the whole picture.
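
The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming each system's map is kept as a dictionary from a data element to the elements it is directly derived from (the same information you might record in the spreadsheet). The build_master_map and trace_origins functions, and every element name, are hypothetical.

```python
def build_master_map(system_maps):
    """Merge per-system maps (element -> direct upstream elements) into one."""
    master = {}
    for system_map in system_maps.values():
        for element, upstreams in system_map.items():
            master.setdefault(element, set()).update(upstreams)
    return master


def trace_origins(master, element):
    """Follow upstream links until reaching elements with no recorded upstream."""
    origins, seen, stack = set(), set(), [element]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        upstreams = master.get(current, set())
        if not upstreams:
            origins.add(current)  # nothing upstream: treat as an origin
        else:
            stack.extend(upstreams)
    return origins


# Illustrative per-system maps: element names are made up for the example.
system_maps = {
    "crm": {"crm.customer_email": ["web_form.email"]},
    "warehouse": {
        "warehouse.customer": ["crm.customer_email", "billing.account_id"],
    },
    "reporting": {"report.churn_rate": ["warehouse.customer"]},
}

master = build_master_map(system_maps)
print(sorted(trace_origins(master, "report.churn_rate")))
```

Keeping one small map per system and merging them only at the end mirrors the last step in the list: each team can maintain its own map, while the master map answers cross-system questions.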

It takes a fair amount of in-house staff and training to effectively sort through a data system, not to mention the time and money involved. Today, there are comprehensive data quality solutions that include data lineage. These tools can easily sort and organize your data — saving time and money, and resulting in noticeable gains to your bottom line.

The right data lineage tool for your business

Now that you understand the importance of data lineage, it’s essential to find a data quality tool that meets your business needs. Consider finding a cloud-based solution that optimizes the data lineage process to provide the best tracking, monitoring, and governance.

Talend Data Fabric is a cloud-native suite of apps for data integration and data management. This comprehensive solution serves as a data lineage tool with end-to-end benefits like:

  • Data Collection
  • Data Governance
  • Data Transformation
  • Data Quality and Sharing

Begin mapping your data’s journey today. Try Talend Data Fabric to experience the benefits of organization-wide trusted data.

 

Last Updated: August 8th, 2019