[GDPR Step 4] How to Identify Critical Datasets and Critical Data Elements

The General Data Protection Regulation (GDPR), which went into effect on May 25, 2018, aims to create better data protection policies and holds the organizations that handle personal data more accountable than before. This means organizations must now focus on data governance. To achieve this, a clear understanding of personal data and how it is stored, used, and protected is required.

Talend recently hosted an on-demand webinar, Practical Steps to GDPR Compliance, that focuses on a comprehensive 16-step plan to operationalize a data governance program that supports GDPR compliance.

Identifying critical datasets and critical data elements (CDEs) is Step 4 in this plan. The first three steps of the plan are establishing policies, standards, and controls; creating a data taxonomy; and assigning data ownership.

Why Identifying Critical Data Elements is Important for GDPR

The following articles of the GDPR put data elements in the spotlight:

Article 4 of GDPR defines personal data as “any information relating to an identified or identifiable natural person (‘data subject’) … such as a name, identification number, location data, online identifier, or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

Article 9 of GDPR restricts the processing of special categories of personal data such as race, ethnic origin, or political opinions.

To comply with these articles of the GDPR, it is necessary to identify such personal data as CDEs and then take relevant steps to protect them.

For example, suppose an organization collects sensor data from a motor vehicle or tracks its coordinates. If the vehicle remains stationary overnight, it is reasonable to infer that its location is the vehicle owner’s home address, which makes it easier to identify the person. Sensor data and coordinates can therefore become CDEs, because they can indirectly reveal the data subject’s personal data.
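The inference described above can be illustrated with a minimal Python sketch. It assumes location pings arrive as (timestamp, latitude, longitude) tuples; the function name, the night-time window, and the rounding precision are hypothetical choices made for illustration, not part of any Talend product.

```python
from collections import Counter
from datetime import datetime


def infer_overnight_location(pings, night_start=22, night_end=6, precision=3):
    """Return the most frequent rounded (lat, lon) seen at night, or None.

    Assumption: a vehicle parked at the same spot during night hours is
    probably parked at its owner's home.
    """
    counts = Counter()
    for timestamp, lat, lon in pings:
        # Keep only pings recorded inside the night window (22:00 to 06:00).
        if timestamp.hour >= night_start or timestamp.hour < night_end:
            # Rounding groups nearby readings into one coarse location cell.
            counts[(round(lat, precision), round(lon, precision))] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up sample data: three night-time pings near one spot, one daytime ping.
pings = [
    (datetime(2018, 5, 25, 23, 0), 48.85661, 2.35222),
    (datetime(2018, 5, 26, 2, 0), 48.85664, 2.35219),
    (datetime(2018, 5, 26, 12, 0), 48.86000, 2.34000),
    (datetime(2018, 5, 26, 23, 30), 48.85659, 2.35224),
]
likely_home = infer_overnight_location(pings)
print(likely_home)
```

Even this small amount of logic is enough to turn raw coordinates into a probable home address, which is why such fields deserve to be treated as CDEs.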

Identifying CDEs also demonstrates that an organization is committed to keeping personal data from being compromised.

How to Identify Critical Data Elements Using Talend

Data stewards have an important role in this process. They should prioritize their efforts by identifying critical datasets and CDEs within their respective data categories. For example, employee identity consists of a number of CDEs, including name, gender, date of birth, and national ID. Employee social media information consists of a number of critical datasets, such as Facebook, Twitter, and LinkedIn profile information.

The data governance team needs to determine whether standards for data collection and data use are best set at the level of critical datasets, rather than for individual CDEs. For example, acceptable use and security standards may be better managed for overall Facebook information (critical dataset) rather than for Facebook ID (CDE).

Talend Metadata Manager supports an ISO 11179 business glossary that contains personal data-related business terms. For example, it may contain an inventory of business terms for customer identity such as name, email address, and phone number. It also defines the semantics of the critical data elements using predefined semantic types (such as e-mail, first name, last name, and IBAN) so that footprints of those critical data elements can be captured automatically across datasets. As a result, Metadata Manager acts as more than a business glossary: it serves as the single point of entry for capturing personal data footprints across datasets.
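To make the idea of a footprint concrete, here is a minimal Python sketch of a glossary whose terms carry a semantic type, matched against a small catalog of datasets. All of the names here (GlossaryTerm, GLOSSARY, CATALOG, footprints) and the sample datasets are hypothetical; this is not the Metadata Manager data model or API, only an illustration of the concept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    name: str           # business term, e.g. "Customer email"
    category: str       # data category the term belongs to
    semantic_type: str  # predefined semantic type used for matching


# Hypothetical glossary entries for the "customer identity" category.
GLOSSARY = [
    GlossaryTerm("Customer email", "customer identity", "email"),
    GlossaryTerm("Customer first name", "customer identity", "first_name"),
    GlossaryTerm("Customer bank account", "customer identity", "iban"),
]

# Hypothetical catalog: dataset -> {physical column: detected semantic type}.
CATALOG = {
    "crm.contacts": {"email_addr": "email", "fname": "first_name", "created_at": "timestamp"},
    "billing.accounts": {"iban_code": "iban", "contact_email": "email"},
}


def footprints(glossary, catalog):
    """Map each glossary term to every physical field sharing its semantic type."""
    by_type = {term.semantic_type: term.name for term in glossary}
    result = {term.name: [] for term in glossary}
    for dataset, columns in sorted(catalog.items()):
        for column, semantic_type in sorted(columns.items()):
            if semantic_type in by_type:
                result[by_type[semantic_type]].append(f"{dataset}.{column}")
    return result


for term, fields in footprints(GLOSSARY, CATALOG).items():
    print(term, "->", fields)
```

Because the match is made on semantic type rather than on column name, a field called contact_email in one system and email_addr in another both end up under the same business term.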

Here are two approaches that Talend Metadata Manager supports for identifying CDEs for GDPR:

  1. Top-down approach: The enterprise data landscape is described as a whole, and the tool supports mapping high-level data definitions to the actual physical fields in source systems across the enterprise.
  2. Bottom-up approach: Physical data points are captured automatically and then linked to high-level GDPR data definitions as applicable. The physical fields are based on technical metadata harvested from source systems through the wide variety of connectors in Talend Metadata Manager (see Figure 1).

Figure 1: Defining or retro-engineering data models and data elements with Talend Metadata Manager.

The broad range of connectors provides an accurate view of the data landscape, similar to a GPS navigator that can alert a driver when traffic conditions change.

This second approach is more popular in the big data era: data comes from multiple sources, so it becomes essential to profile and discover the data automatically before confirming whether it contains personal data and taking the appropriate compliance actions.
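As a rough illustration of automatic profiling, the Python sketch below samples the values of a column and proposes a semantic type when enough of them look like e-mail addresses or IBANs. The patterns, the function names, and the 80% threshold are assumptions chosen for this example; they do not describe how Talend implements discovery, and the e-mail pattern is deliberately simplistic.

```python
import re
from collections import Counter

# Simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IBAN_RE = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")


def iban_checksum_ok(value):
    """Apply the standard IBAN mod-97 check to a compact, upper-case string."""
    rearranged = value[4:] + value[:4]
    # Letters map to numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def detect(value):
    """Return a candidate semantic type for one value, or None."""
    text = value.strip()
    if EMAIL_RE.match(text):
        return "email"
    compact = text.replace(" ", "").upper()
    if IBAN_RE.match(compact) and iban_checksum_ok(compact):
        return "iban"
    return None


def profile_column(values, threshold=0.8):
    """Propose a semantic type if at least `threshold` of non-empty values match it."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return None
    counts = Counter(detect(v) for v in non_empty)
    counts.pop(None, None)  # ignore values that matched nothing
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]
    return label if hits / len(non_empty) >= threshold else None


sample = ["a@example.com", "b@example.org", "c@example.net", "d@example.com", "n/a"]
print(profile_column(sample))
```

The output of a profiler like this is only a candidate. A data steward still needs to confirm that the flagged column really holds personal data before it is linked to a GDPR data definition.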

Next Steps in Identifying Critical Data

The legal and compliance teams need to sign off on the processing of personal data during the design phase of a project. So, irrespective of which approach is followed, the data governance team must work with them to define “personal data” for the GDPR.

Identifying CDEs related to personal data is a prerequisite for applying GDPR controls, and Talend Metadata Manager can help. The next step in the 16-step plan is establishing data collection standards.

