How to Stitch Data Lineage [GDPR Step 11]

The General Data Protection Regulation (GDPR), introduced by the European Union (EU), took effect on May 25, 2018. Under the GDPR, organizations have to focus on the data lineage of their data subjects (such as customers, employees, and prospects) and understand and track the flow of personal data through their systems.

Talend recently hosted an on-demand webinar, Practical Steps to GDPR Compliance, that focuses on a comprehensive, 16-step plan to operationalize a data governance program that supports GDPR compliance.

Stitching data lineage is Step 11 in this plan. To learn more about the first ten steps, check out the links in the sidebar.

Watch Practical Steps to GDPR Compliance now.

The GDPR’s Perspective on Data Lineage

Data lineage is the process of understanding data flow: where data originated, through what systems it traveled, and where it ended. It’s often represented visually for clarity.
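One way to picture this is as a directed graph in which each edge means "data flows from one system to another." The sketch below is a minimal illustration under that assumption; the system names and the `EDGES`, `downstream`, and `origins` identifiers are hypothetical and do not come from any Talend product.

```python
from collections import deque

# Hypothetical lineage edges: (source system, target system).
EDGES = [
    ("web_form", "crm"),
    ("crm", "mdm"),
    ("mdm", "data_warehouse"),
    ("crm", "email_platform"),
]


def downstream(start, edges):
    """Return every system reachable from `start`, in breadth-first order."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, []).append(dst)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in targets.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def origins(edges):
    """Return systems that never appear as a target, i.e. where data originated."""
    all_targets = {dst for _, dst in edges}
    return sorted({src for src, _ in edges} - all_targets)
```

With these sample edges, `origins(EDGES)` identifies `web_form` as the point of origin, and `downstream("crm", EDGES)` lists every system the data reaches after the CRM.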

By understanding its lineage, organizations can unambiguously track details regarding data changes (such as who made the change, what was updated, when it happened, and which system was used), creating greater confidence in an organization’s data quality.

Armed with this knowledge, organizations can ensure that sensitive data only flows through systems that have data protection techniques, such as anonymization and pseudonymization. They can also be better prepared for regulatory reporting. Hence, data lineage is crucial for the GDPR.
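As a rough illustration of that check, the sketch below walks a known data flow and flags any system with no recorded protection technique. The inventory, the flow, and the function name are all hypothetical; a real implementation would read them from a metadata catalog.

```python
# Hypothetical inventory: system name -> protection technique applied (or None).
PROTECTION = {
    "crm": "pseudonymization",
    "mdm": "pseudonymization",
    "analytics_sandbox": None,
    "data_warehouse": "anonymization",
}

# Hypothetical ordered list of systems that a sensitive field passes through.
FLOW = ["crm", "mdm", "analytics_sandbox", "data_warehouse"]


def unprotected_systems(flow, protection):
    """Return systems in the flow with no recorded protection technique."""
    return [name for name in flow if not protection.get(name)]
```

Here `unprotected_systems(FLOW, PROTECTION)` would single out `analytics_sandbox`, and a system missing from the inventory altogether is treated as unprotected as well.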

Record Processing Activities

Article 30 of the GDPR requires organizations to maintain a record of processing activities. The recordkeeping requirements also extend to processors, who process data on behalf of an organization.

This record must include:

  • A description of the personal data categories.
  • A description of the categories of recipients of personal data, including those in third countries or international organizations.
  • Transfers of personal data sent to a third country or an international organization.
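The three items above could be captured in a simple record structure. This is only a sketch: the class and field names are hypothetical, it covers just the items listed here, and it is not a complete Article 30 template.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingRecord:
    """Hypothetical record of one processing activity."""

    activity: str
    data_categories: list = field(default_factory=list)
    recipient_categories: list = field(default_factory=list)
    # May legitimately stay empty when no data leaves for a third country.
    third_country_transfers: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of required descriptions that are still empty."""
        required = {
            "data_categories": self.data_categories,
            "recipient_categories": self.recipient_categories,
        }
        return [name for name, value in required.items() if not value]
```

For example, a record created with only `data_categories=["email address"]` would report `recipient_categories` as still missing.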

Right to be Forgotten

Article 17 of the GDPR gives data subjects the right to erasure, commonly known as the “right to be forgotten.” This means that organizations need to implement functionality that can completely erase a customer’s personal data from storage. To achieve this, they must first know every system in which that customer’s data resides.
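The sketch below shows the idea in miniature: once lineage tells you which systems hold a subject's data, erasure becomes a loop over those systems. The dictionaries are hypothetical in-memory stand-ins for real data stores, and `erase_subject` is an illustrative name, not an existing API.

```python
# Hypothetical in-memory stand-ins for real systems holding customer data.
SYSTEMS = {
    "crm": {
        "cust-42": {"email": "a@example.com"},
        "cust-7": {"email": "b@example.com"},
    },
    "mdm": {"cust-42": {"name": "A. Example"}},
    "backup": {"cust-7": {"email": "b@example.com"}},
}


def erase_subject(subject_id, systems):
    """Delete the subject's records everywhere and return the systems touched."""
    erased_from = []
    for name, store in systems.items():
        # pop() removes the record if present and returns None otherwise.
        if store.pop(subject_id, None) is not None:
            erased_from.append(name)
    return erased_from
```

Returning the list of systems touched gives a simple audit trail for the erasure request. Note that the sample data includes a `backup` store, since copies outside production also count.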

Right to Data Access and Portability

Article 20 of the GDPR provides the right to data portability for subjects (i.e., customers can request all their data in a machine-readable format). They can then use this data for informational purposes or to move to another platform. Again, data lineage is fundamental to providing this service to the customer.
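A portability export can be sketched the same way: gather the subject's records from every system that lineage identifies, then serialize them. JSON is used here as one example of a machine-readable format; the stores and the `export_subject_data` function are hypothetical.

```python
import json

# Hypothetical in-memory stand-ins for real systems holding customer data.
SYSTEMS = {
    "crm": {"cust-42": {"email": "a@example.com", "opt_in": True}},
    "mdm": {"cust-42": {"name": "A. Example"}},
    "billing": {"cust-7": {"plan": "basic"}},
}


def export_subject_data(subject_id, systems):
    """Collect the subject's records from each system and serialize them as JSON."""
    collected = {
        name: store[subject_id]
        for name, store in systems.items()
        if subject_id in store
    }
    payload = {"subject_id": subject_id, "data": collected}
    return json.dumps(payload, indent=2, sort_keys=True)
```

Calling `export_subject_data("cust-42", SYSTEMS)` yields a single JSON document grouped by source system, which keeps the origin of each value visible to the customer.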

How to Track Data Lineage

To meet the GDPR’s data lineage requirements, organizations need to establish the following prerequisites:

Once these foundational steps are in place, data governance teams need to strengthen metadata management and data lineage capabilities to comply with these GDPR articles.

Using Talend for Data Lineage

Talend Metadata Manager supports data lineage across multiple platforms, including Hadoop and NoSQL. As the complete data landscape is defined in the metadata manager, the data flows and dependencies are graphically presented to the user via an automated mechanism.

In the example shown in Figure 1, although the opt-in data is used by both the CRM and MDM systems, the metadata manager clearly shows that the opt-in was first captured in the CRM system.


Figure 1: Talend Metadata Manager draws an end-to-end view of critical data such as opt-ins to track and trace where data comes from and where it goes.

Talend Big Data Platform also integrates with Apache Atlas and Cloudera Navigator to provide lineage for data flows within a data lake. In a complex, big data environment with multiple information sources, this feature is useful to isolate potential issues.

Next Steps to Data Lineage

Tracking data lineage is relevant not just for production applications but also for other environments, such as test and backup. It applies not only within an organization but also when EU data flows to other countries or vendors. The GDPR also applies to non-EU companies if their data subjects are EU-based. Given this scope, data lineage needs to be addressed from a holistic viewpoint, and Talend tools help achieve that.

The next step of Talend’s comprehensive 16-step plan is governing analytical models.

To learn more about this, and see all 16 steps together, don’t miss the on-demand webinar, Practical Steps to GDPR Compliance. The video covers information on developing standards and controls, identifying data owners and critical data elements, conducting risk assessments, improving data quality, and more.

    

Last Updated: July 18th, 2019