[GDPR Step 11] How to Stitch Data Lineage
The General Data Protection Regulation (GDPR), introduced by the European Union (EU), took effect on May 25, 2018. Under the GDPR, organizations have to focus on the lineage of the personal data they hold about data subjects (such as customers, employees, and prospects) and understand and track how that data flows through their systems.
Talend recently hosted an on-demand webinar, Practical Steps to GDPR Compliance, that focuses on a comprehensive, 16-step plan to operationalize a data governance program that supports GDPR compliance.
Stitching data lineage is Step 11 in this plan. To learn more about the first ten steps, check out the links in the sidebar.
The GDPR’s Perspective on Data Lineage
Data lineage is the process of understanding data flow: where data originated, through what systems it traveled, and where it ended. It’s often represented visually for clarity.
By understanding data lineage, organizations can unambiguously track the details of each data change, such as who made the change, what was updated, when it happened, and which system was used. This creates greater confidence in an organization’s data quality.
Armed with this knowledge, organizations can ensure that sensitive data only flows through systems that apply data protection techniques, such as anonymization and pseudonymization. They can also be better prepared for regulatory reporting. Hence, data lineage is crucial for the GDPR.
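As a minimal sketch of the idea above, the following Python snippet models a lineage trail as a list of change events (who, what, when, which system) and reports any system in the trail that applies no protection technique. The system names, the `PROTECTION` mapping, and the sample trail are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One lineage record: who changed what, when, and in which system."""
    who: str
    what: str
    when: str      # ISO 8601 timestamp
    system: str


# Hypothetical systems and the protection technique each applies (None = none).
PROTECTION = {
    "crm": "pseudonymization",
    "mdm": "pseudonymization",
    "analytics": "anonymization",
    "legacy_export": None,
}


def unprotected_systems(events):
    """Return, in first-seen order, systems in the trail with no protection.

    Systems missing from PROTECTION are treated as unprotected.
    """
    seen = []
    for event in events:
        if PROTECTION.get(event.system) is None and event.system not in seen:
            seen.append(event.system)
    return seen


trail = [
    ChangeEvent("alice", "email updated", "2018-05-25T09:00:00Z", "crm"),
    ChangeEvent("etl_job", "record synced", "2018-05-25T09:05:00Z", "mdm"),
    ChangeEvent("bob", "CSV extract", "2018-05-26T14:30:00Z", "legacy_export"),
]

print(unprotected_systems(trail))
```

Treating unknown systems as unprotected is a deliberately conservative choice in this sketch: a system that nobody has classified is flagged rather than silently trusted.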
Record Processing Activities
Article 30 of the GDPR requires organizations to maintain a record of processing activities. The recordkeeping requirements also extend to processors, who process data on behalf of an organization.
Among other items, this record must include:
- A description of the personal data categories.
- A description of the categories of recipients of personal data, including those in third countries or international organizations.
- Transfers of personal data sent to a third country or an international organization.
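The three items listed above could be captured in a simple structure like the one below. This is a sketch only: the `ProcessingRecord` class, its field names, and the sample entry are assumptions made for illustration, not a format prescribed by the GDPR or by any Talend product.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingRecord:
    """Hypothetical structure for one entry in a record of processing activities."""
    activity: str
    data_categories: list = field(default_factory=list)
    recipient_categories: list = field(default_factory=list)
    third_country_transfers: list = field(default_factory=list)


def missing_descriptions(record):
    """List the descriptions that are still empty.

    Transfers may legitimately be empty when no data leaves the region,
    so an empty transfer list is not reported as a gap here.
    """
    gaps = []
    if not record.data_categories:
        gaps.append("data_categories")
    if not record.recipient_categories:
        gaps.append("recipient_categories")
    return gaps


newsletter = ProcessingRecord(
    activity="newsletter delivery",
    data_categories=["contact details", "opt-in status"],
    recipient_categories=[],
    third_country_transfers=["email vendor (US)"],
)

print(missing_descriptions(newsletter))
```

A check like `missing_descriptions` makes incomplete entries visible before a regulator or auditor asks for the record.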
Right to be Forgotten
Article 17 of the GDPR gives data subjects the right to erasure, commonly known as the “right to be forgotten.” This means that organizations need to implement functionality that can completely wipe a customer’s personal data from storage. To achieve this, an organization must first know every system in which customer data resides.
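One way to picture this is as a walk over the lineage graph: starting from the system where the data was captured, follow every downstream flow and collect each system reached. The sketch below does this with a breadth-first traversal. The `FLOWS` map and its system names are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each system maps to the systems it feeds.
FLOWS = {
    "web_form": ["crm"],
    "crm": ["mdm", "email_platform"],
    "mdm": ["data_lake"],
    "email_platform": [],
    "data_lake": ["backup"],
    "backup": [],
}


def systems_to_erase(source, flows):
    """Breadth-first walk returning the source and every system downstream of it."""
    visited = [source]
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for target in flows.get(current, []):
            if target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


print(systems_to_erase("crm", FLOWS))
```

The result is only as complete as the lineage map itself, which is why an accurate, current map matters for erasure requests.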
Right to Data Access and Portability
Article 15 of the GDPR gives data subjects the right to access their personal data, and Article 20 provides the right to data portability (i.e., customers can request the personal data they have provided in a structured, commonly used, machine-readable format). They can then use this data for informational purposes or to move to another platform. Again, data lineage is fundamental to providing this service to the customer.
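A portability request can be sketched as gathering one subject's records from every system that holds them and serializing the result as JSON, one common machine-readable format. The `SYSTEM_RECORDS` store, the subject IDs, and the output layout below are hypothetical.

```python
import json

# Hypothetical per-system stores keyed by subject ID.
SYSTEM_RECORDS = {
    "crm": {"c-1001": {"name": "Ada Example", "email": "ada@example.com"}},
    "mdm": {"c-1001": {"opt_in": True}},
    "billing": {"c-2002": {"plan": "basic"}},
}


def export_subject_data(subject_id, system_records):
    """Collect one subject's records from every system and return them as JSON."""
    bundle = {
        system: records[subject_id]
        for system, records in system_records.items()
        if subject_id in records
    }
    return json.dumps(
        {"subject_id": subject_id, "data": bundle}, indent=2, sort_keys=True
    )


print(export_subject_data("c-1001", SYSTEM_RECORDS))
```

Systems that hold nothing for the subject are omitted from the export instead of appearing as empty entries.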
How to Track Data Lineage
To meet the GDPR’s data lineage demands, organizations need to establish the following prerequisites:
- Create a data taxonomy
- Identify data owners
- Identify critical data elements
- Have a clear understanding of data subjects and their intentions (Steps 5 and 6)
- Identify processors/vendors
Once these foundational steps are in place, data governance teams need to strengthen metadata management and data lineage capabilities to comply with these GDPR articles.
Using Talend for Data Lineage
Talend Metadata Manager supports data lineage across multiple platforms, including Hadoop and NoSQL. Because the complete data landscape is defined in the metadata manager, data flows and dependencies are automatically presented to the user as a graph.
In the example shown in Figure 1, although the opt-in data is used by both the CRM and MDM systems, the metadata manager clearly shows that the opt-in was first captured in the CRM system.
Figure 1: Talend Metadata Manager draws an end-to-end view of critical data such as opt-ins to track and trace where data comes from and where it goes.
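The question the figure answers, where an element was first captured, can be illustrated independently of any tool. The sketch below is not Talend Metadata Manager's API. It is a plain Python illustration in which the flows for the opt-in element are given as (source, target) pairs, and the origin is any system that sends the element onward but never receives it. The `marketing_mart` system is a hypothetical addition.

```python
# Hypothetical flows of the opt-in element as (source, target) pairs.
OPT_IN_FLOWS = [
    ("crm", "mdm"),
    ("mdm", "marketing_mart"),
]


def origins(flows):
    """Return systems that send the element onward but never receive it."""
    sources = {src for src, _ in flows}
    targets = {dst for _, dst in flows}
    return sorted(sources - targets)


print(origins(OPT_IN_FLOWS))
```

If more than one system is returned, the element has several independent capture points, which is itself worth investigating.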
Talend Big Data Platform also integrates with Apache Atlas and Cloudera Navigator to provide lineage for data flows within a data lake. In a complex, big data environment with multiple information sources, this feature is useful to isolate potential issues.
Next Steps to Data Lineage
Tracking data lineage is relevant not just for production applications, but also for other environments such as test and backup. It matters not just within an organization, but also when EU data flows to other countries or vendors. The GDPR also applies to non-EU companies if their data subjects are EU-based. Given this scope, data lineage needs to be addressed from a holistic viewpoint, and Talend tools help achieve that.
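To make the holistic view concrete, the sketch below tags each system with an environment and a hosting country, then flags flows that leave a configured set of in-region countries. The system names, the country codes, and the `IN_REGION` set are hypothetical. In particular, `IN_REGION` is a small illustrative sample, not a complete list of EU or EEA member states.

```python
# Hypothetical systems tagged with environment and hosting country.
SYSTEMS = {
    "crm_prod": {"env": "production", "country": "DE"},
    "crm_test": {"env": "test", "country": "DE"},
    "backup_vault": {"env": "backup", "country": "US"},
    "vendor_mailer": {"env": "production", "country": "IN"},
}

FLOWS = [
    ("crm_prod", "crm_test"),
    ("crm_prod", "backup_vault"),
    ("crm_prod", "vendor_mailer"),
]

# Illustrative sample only; a real deployment would maintain the full list.
IN_REGION = {"DE", "FR", "IE"}


def third_country_flows(flows, systems, in_region):
    """Return flows that start inside the region and end outside it."""
    return [
        (src, dst)
        for src, dst in flows
        if systems[src]["country"] in in_region
        and systems[dst]["country"] not in in_region
    ]


print(third_country_flows(FLOWS, SYSTEMS, IN_REGION))
```

Including test and backup systems in the same map is the point of the exercise: the flow to `backup_vault` is flagged even though it is not a production application.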
The next step of Talend’s comprehensive 16-step plan is governing analytical models.