This document is for developers who want to start exploring the data profiles stored in the Talend Data Quality mart and building data quality dashboards using a business intelligence (BI) tool — like Looker.

The document covers:

  • Profiling a dataset
  • Creating helper database views to query the Data Quality mart
  • Using a Looker Block to jumpstart your dashboard development
  • Showing data quality metrics evolution over time

What you’ll need

To create a dashboard this way, you’ll need:

  • Talend Studio, with the Profiling perspective available
  • Write access to the repository where the Data Quality mart is stored (we use Snowflake for this example)
  • Access to a BI tool (we use Looker, but you could use Tableau, Power BI, or any other tool)
  • Access to datasets for profiling

High-level design

This solution is designed to run in a cloud environment that, conceptually, looks like this.

High level design

Building a business data quality dashboard

A business dashboard should provide a view across six data quality dimensions:

  1. Completeness
  2. Timeliness
  3. Validity (conformity)
  4. Accuracy
  5. Consistency
  6. Uniqueness

As an example, let’s build a data quality dashboard to support the International BCBS 239 standard, which defines the principles for effective risk data aggregation and risk reporting. The overall objective of the standard is to strengthen banks’ risk data aggregation capabilities and internal risk reporting practices, which in turn enhances risk management and decision-making processes at banks.

We would like to build a dashboard that provides transparency to our organization on the quality of data being used for risk reports, considering several relevant BCBS principles:

  • Principle 3 — Accuracy and integrity
  • Principle 4 — Completeness
  • Principle 5 — Timeliness
  • Principle 7 — Accuracy
  • Principle 10 — Frequency

Business rules to implement

The user would like to execute the following business rules on a dataset stored in a file called customer.csv:

Rule ID

Data quality dimension Attribute  Rule  Comment
Rule_01 Validity Email Where a customer has provided an email address, it should be valid

Email IDs should adhere to the pattern abc@company.com 
 
They should not have any special characters other than “.”, “@”, and “_”

Rule_02 Uniqueness Code

Every customer should have a unique identifier code

Duplicate IDs are not allowed
Rule_03 Completeness Phone_num

Every customer should have a phone number, as these are used for two-factor authentication

 
Rule_04 Validity DOB

Every customer should have a valid date of birth

 

 

Data quality metrics to capture 

For each business rule we would like to capture:

  • Total rows processed
  • Total rows failed
  • Pass %

Dashboard design

This data quality dashboard is divided into two levels:

Summary level — overall data quality

The summary level shows the overall score across all data that’s being profiled and monitored. It provides the ability to filter by:

  • framework (BCBS, Dodd-Frank, CCAR, etc.)
  • department (finance, risk, operations, etc.)
  • data quality dimension (the six dimensions cited above)

Data Quality Dashboard Report

 

Rule level

The rule level shows metrics at the business rule level. This includes data like:

  • Rule name
  • Data concept
  • Data element
  • Rule description
  • Data quality dimension
  • Total processed
  • Total failed
  • Pass %

 

data-quality dashboard report

Conceptual data model for the business data quality dashboard

The Talend Data Quality mart uses terminology like “analysis” and “reports,” but business users working in data management may use different terminology. They may be more familiar with terms such as

  • Data concept — describes a business concept or entity, such as a customer, product, or trade.
  • Data element — describes the attributes of a business concept or entity, such as customer name or age.
  • Business rule — defines what good data looks like for a data element. For example, a customer that’s a corporation should have an incorporation date.
  • Department — usually a line of business that might like to see only those data concepts that are relevant to them, so it’s useful to group concepts by department.
  • Rule execution — captures the metrics of a business rule execution at a point in time.

DQ-Mart Conceptual Model

Implementation

Now that we’ve specced out our dashboard and understand the conceptual data model, we can start implementing it.

Set up the data mart using Snowflake

You must configure your data warehouse before you build your dashboard. Here are the steps we took to configure Snowflake before we built our sample dashboard.

  1. Create the schemas TALEND_DQ and LOOKER_SCRATCH.
  2. Deploy the Talend Data Quality mart tables and views in the schema TALEND_DQ.
    Talend Studio will do this automatically — just point it to Snowflake when creating a report.

  3. Deploy the BI views manually in Snowflake (see Appendix 1).
    1. bi_column_analysis
    2. bi_overview_analysis
    3. bi_comparison_analysis
    4. bi_dq_metrics

Notes on the view bi_dq_metrics

The view bi_dq_metrics needs to be customized depending on how you implement your business rules. The ideal way would be to have a mapping table to map the Rule name (or ID) with the data quality dimension metadata (i.e. Consistency, Validity, etc). For this example, we are not using a mapping table, but are using rule naming conventions.

Mapping data quality rules to data quality dimensions

To allocate a data quality rule to a data quality dimension, use a rule naming convention. In our example, Rule_01_Validity_Email looks for the term “Validity” in the rule name to determine whether this rule belongs to the validity data quality dimension.

Calculating failed rows

The indicator count provides the number of rows that matched the definition of an indicator. Whether this is a pass or a fail depends on your interpretation. For example, suppose you a have a data quality rule that checks a text value for a specific pattern (say special characters). You need to decide whether rows matching the pattern is a pass or a fail.

 

Set up the Talend environment

  1. Start Talend Studio.
  2. Create a Project — we’ll call it PROJECT_BCBS.
  3. Select the Profiling perspective (Window > Perspective > Profiling).

Create and execute the data quality rules

  1. Configure the dataset to be profiled:
    1. Under Metadata, right-click on <File Delimited> and select <Create File Delimited Connection>.
    2. Using the wizard import the file customer.csv.
  2. Create an analysis.
    1. Under Data Profiling, right-click on and select.
    2. Select <Column Analysis>.
    3. Name the rule Rule_01_Validity_Email_address.
    4. In <Connections>, select customers.
    5. In <Select Columns>, select Email.
    6. In <Select Indicators>, select Email Address under Patterns > Regex Patterns > internet.
    7. Then do the same for the other rules:
      1. Rule_02_Uniqueness_Code
      2. Rule_03_Completeness_Phone_Num
      3. Rule_04_Validity_DOB
      4. data profiling
  3. Create a report.
    1. Under Data Profiling, right click on <Reports> and select <New Report>.
    2. Give it the name dq_report_bcbs, then click <Finish>.
    3. Under the section Analysis List, click on <Select Analysis>, then add the four rules you created into this one report.
    4. Under the section Generated Report Settings, you can output the report to a file (pdf, xml, html, xls) or to the Data Quality data mart.
    5. To write the report to the data mart, fill in the details in Database Connection Settings section.
    6. Click on the <Run> button on the top of the screen to create the report.

                   

Set up Looker

Similarly, you may have to configure your reporting tool before you create a dashboard. Here’s what we did to configure Looker.

  1. Create a database connection to Snowflake.
    1. Go to Admin > Connections.
    2. Click <New Connection>.
  2. Get the Looker Block source code.
    1. Browse to https://github.com/llooker/talendpartner_dataquality.
    2. Download the code as a Zip file.
    3. Unzip the file onto your local machine.
  3. Create a new LookML project.
    1. Go to Develop > Manage LookML Projects.
    2. Click on <New LookML Project>.
    3. Enter a <Project Name> — we used talend_dq_project.
    4. Select <Blank Project>.
    5. Click on <Create Project>.
  4. Upload the Looker Block code into the new project.
    1. In the file browser, create three new folders: dashboards, models, views.
    2. Add the files from each folder into their specific folders by dragging and dropping them into the File Browser, then dragging and dropping them into the right folders. The final result should look like this:
    3. Ensure the manifest.lkml file is pointing to the right database connection.
  5. Open the dashboard.
    1. In the File Browser, open the folder dashboards.
    2. Select the file data_quality.dashboard.
    3. In the drop-down next to the filename, click on <View Dashboard>.
    4. Once the dashboard is opened, you might need to update the filters to ensure you can see the latest data:

                   

Ongoing data quality monitoring (orchestration)

Once the report has been built, you may want to create a Job in Talend Studio and schedule it, so you can execute the report on a regular basis to monitor your data quality.

The steps you need to take to create a Job are documented on help.talend.com.

What to do next

  • We’ve provided Looker Block to create the summary-level dashboard — you can now implement the rule-level dashboard.
  • Technical users such as developers and business analysts can build a profiles dashboard that allows users to drill down into data profile results.

 

Exploring data profiling results

Types of data profiles

Talend Studio supports 17 types of data profiles, which we call analyses.

The results of these analyses can be exported to the Data Quality mart, except for the correlation analyses (numerical, time, and nominal).

Each analysis can be categorized into one of five areas:

  • Column analysis — generally profiling single columns of data
  • Overview analysis — analysis of data models and structures
  • Column set analysis — analysis across a set of columns (full records)
  • Match analysis — compares a set of columns and creates groups of similar records
  • Comparison analysis — compares two datasets that have the same columns, similar to a reconciliation

 

Types of data analysis  
stored in the Data Quality mart 

 

Column

analysis

 
bi_column_

analysis.sql 

Overview analysis 
bi_overview_

analysis.sql 

Column set

analysis 

Match analysis

Comparison analysis 
bi_comparison_

analysis.sql 

Structural analysis 
         
1 Connection overview      X    
2 Catalog overview      X    
3 Schema overview      X    
Cross table analysis          
4 Redundancy          X
Table analysis          
5 Business rule  X        
6 Match        X  
7 Functional dependency          X
8 Column set      X    
Column analysis           
9 Semantic-aware X        
10 Basic column X        
11 Nominal values X        
12 Pattern frequency X        
13 Discrete data X        
14 Summary statistics X        

 

Dashboard to explore profiles

Here’s a mock dashboard that you could build to display the results from a query of these views.

 

You can build the above dashboard using the database views (see Appendix 1):

  • bi_column_analysis
  • bi_overview_analysis
  • bi_comparison_analysi

 

Appendix 1

Additional database views to be created in the data mart

The Data Quality mart is designed with a star schema and consists of fact tables and a number of associated dimension tables. The full documentation is available on help.talend.com.

You can create database views to make it easy to query the Data Quality mart. (Note — these views return only the last version of each analysis — so if you created Analysis v0.1 which profiled column First Name, and later you modified it to v0.2 which profiled column Last Name, then these views will only provide the results of v0.2.)

Column analysis (bi_column_analysis)

create view bi_column_analysis as
  SELECT a.rep_label AS report_name,
     ih.run_time,
     a.an_pk,
     a.an_label AS analysis_name,
     ih.elt_connection_name as connection_name,
     ih.elt_table_name AS table_name,
     ih.elt_column_name AS column_name,
     ih.ind_category as indicator_category,
     ih.ind_label AS indicator_name,
     ih.indv_row_count AS rows_processed,
     ih.indv_int_value AS indicator_value
  FROM tdq_v_analysis a,
     tdq_v_ind_histo ih
  WHERE a.rep_uuid = ih.rep_uuid AND a.an_uuid = ih.an_uuid
  ORDER BY a.rep_label, ih.run_time, a.an_pk;

This view covers all types of column analysis except semantic-aware.

Overview analysis (bi_overview_analysis)

create or replace view  bi_overview_analysis as
select tva.rep_label as report_name,
    oi.rep_runtime as report_run_time,
    oi.an_pk as analysis_pk,
    tva.an_label as analysis_name,
    tae.elt_connection_name database_connection_name,
    tae.elt_catalog_name catalog_name,
    tae.elt_schema_name schema_name,
    oi.tv_name table_or_view_name,
    oi.tv_rowcount number_of_rows,
    oi.tv_keycount number_of_keys,
    oi.tv_indexcount number_of_indexes,
    oi.tv_triggercount number_of_triggers
from TDQ_V_LAST_OVERVIEW_ANALY_EXEC lo, 
    TDQ_OVERVIEW_INDVALUE oi,
     tdq_analyzed_element tae,
     tdq_v_analysis tva
where lo.analysis_pk = oi.an_pk
     and oi.elt_pk = tae.elt_pk
     and oi.tv_name <> 'NULL TALEND'
     and lo.analysis_pk = tva.an_pk

Comparison analysis (bi_comparison_analysis)

create view bi_comparison_analysis as
select ma.analysis_pk,
    a.rep_label as report_name,
    mi.an_runtime as report_runtime,
    a.an_label as analysis_name,
    mi.indicator_pk,
    tid.ind_label as indicator_name,
    tid.ind_category as indicator_category,
    tid.ind_description as indicator_description,
    mi.table_a_pk,
    table_a.elt_table_name table_name_a,
    mi.table_b_pk,
    table_b.elt_table_name table_name_b,
    mi.indv_row_count as number_of_rows,
    mi.indv_match_count as number_rows_matched,
    (mi.indv_match_percent*100) as match_percentage
from  TDQ_V_LAST_MATCH_ANALYSIS_EXEC ma,
    tdq_v_analysis a,
    tdq_match_indvalue mi,
    tdq_analyzed_element table_a,
    tdq_analyzed_element table_b,
    tdq_indicator_definition tid
where ma.analysis_pk = a.an_pk
and ma.analysis_pk  = mi.analysis_pk
and mi.table_a_pk = table_a.elt_pk
and mi.table_b_pk = table_b.elt_pk
and mi.indicator_pk = tid.ind_pk

Data Quality metrics (bi_dq_metrics)

create view bi_dq_metrics AS
select *, rows_processed-rows_failed as rows_passed
from ( 
  select report_name as report_name,
             run_time ,
             connection_name as department_name,
             analysis_name as rule_name,
             table_name as data_concept,
             column_name as data_element,
             indicator_category,
             indicator_name,
    case
    when charindex ('Validity',analysis_name) > 0 then 'Validity'
    when charindex ('Timeliness',analysis_name) > 0 then 'Timeliness'
    when charindex ('Accuracy',analysis_name) > 0 then 'Accuracy'
    when charindex ('Uniqueness',analysis_name) > 0 then 'Uniqueness'
             when charindex ('Completeness',analysis_name) > 0 then 'Completeness'
             when charindex ('Consistency',analysis_name) > 0 then 'Consistency'
    else 'Other'
    end as dq_dimension,
            rows_processed,
    CASE
    WHEN indicator_category in ('Simple Statistics') = true THEN indicator_value
           when indicator_category in ('Pattern Matching') = true  then rows_processed - indicator_value
           ELSE indicator_value
    END AS rows_failed
  from bi_column_analysis
) as dq_metrics_only
where dq_dimension <> 'Other';

 

Appendix 2

Queries for use with any BI Tool

Here are some example queries that retrieve values to be displayed for the business data quality dashboard (if you are not using the Looker Block).

Overall data quality score for the last data quality run

select (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )

Total rows processed

select sum(rows_processed ) as rows_processed
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )

Failed rows

select sum(rows_failed) as rows_failed
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )

Completeness score

select (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
and dq_dimension = 'Completeness'

Validity score

select (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
and dq_dimension = 'Validity'

Consistency score

select (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
and dq_dimension = 'Consistency'

Accuracy score

select (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
and dq_dimension = 'Accuracy'

Data quality score over time

select run_time, (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as last_run_dq_score
from bi_dq_metrics
group by run_time

Data quality score by concept (last run)

select data_concept , (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
group by data_concept

Data quality score by element (last run)

select data_element , (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as dq_score 
from bi_dq_metrics 
where run_time = (select max (run_time) from bi_dq_metrics dm ) 
group by data_element  

Data quality score by rule (last run)

select substring(rule_name,1,7) , (sum((rows_processed-rows_failed))  / sum (rows_processed )) * 100 as dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )
group by substring(rule_name,1,7)

List of data quality rules applied (last run)

select data_concept , data_element , rule_name, indicator_name , dq_dimension , rows_processed , rows_failed,
((rows_processed-rows_failed)  /  rows_processed)  as dq_score
from bi_dq_metrics
where run_time = (select max (run_time) from bi_dq_metrics dm )