This document is for developers who want to start exploring the data profiles stored in the Talend Data Quality mart and building data quality dashboards using a business intelligence (BI) tool — like Looker.
The document covers:
- Profiling a dataset
- Creating helper database views to query the Data Quality mart
- Using a Looker Block to jumpstart your dashboard development
- Showing data quality metrics evolution over time
What you’ll need
To create a dashboard this way, you’ll need:
- Talend Studio, with the Profiling perspective available
- Write access to the repository where the Data Quality mart is stored (we use Snowflake for this example)
- Access to a BI tool (we use Looker, but you could use Tableau, Power BI, or any other tool)
- Access to datasets for profiling
This solution is designed to run in a cloud environment: Talend Studio writes profiling results to the Data Quality mart hosted in a cloud data warehouse, and a BI tool reads from that mart to render dashboards.
Building a business data quality dashboard
A business dashboard should provide a view across six data quality dimensions:
- Validity (conformity)
- Completeness
- Accuracy
- Uniqueness
- Timeliness
- Consistency
As an example, let’s build a data quality dashboard to support the International BCBS 239 standard, which defines the principles for effective risk data aggregation and risk reporting. The overall objective of the standard is to strengthen banks’ risk data aggregation capabilities and internal risk reporting practices, which in turn enhances risk management and decision-making processes at banks.
We would like to build a dashboard that provides transparency to our organization on the quality of data being used for risk reports, considering several relevant BCBS principles:
- Principle 3 — Accuracy and integrity
- Principle 4 — Completeness
- Principle 5 — Timeliness
- Principle 7 — Accuracy
- Principle 10 — Frequency
Business rules to implement
The user would like to execute the following business rules on a dataset stored in a file called customer.csv:
|Rule ID|Data quality dimension|Attribute|Rule|Comment|
|---|---|---|---|---|
|Rule_01|Validity|Email|Where a customer has provided an email address, it should be valid|Email addresses should adhere to the pattern email@example.com|
|Rule_02|Uniqueness|Customer ID|Every customer should have a unique identifier code|Duplicate IDs are not allowed|
|Rule_03|Completeness|Phone number|Every customer should have a phone number, as these are used for two-factor authentication||
|Rule_04|Validity|Date of birth|Every customer should have a valid date of birth||
Data quality metrics to capture
For each business rule we would like to capture:
- Total rows processed
- Total rows failed
- Pass %
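These three metrics are related by one formula, since pass % is derived from the other two. A minimal sketch (the helper name is ours, not Talend's):

```python
def pass_percentage(rows_processed: int, rows_failed: int) -> float:
    """Pass % = (rows processed - rows failed) / rows processed * 100."""
    if rows_processed == 0:
        return 0.0  # avoid division by zero on empty runs
    return (rows_processed - rows_failed) / rows_processed * 100

print(pass_percentage(1000, 25))  # 97.5
```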
This data quality dashboard is divided into two levels:
Summary level — overall data quality
The summary level shows the overall score across all data that’s being profiled and monitored. It provides the ability to filter by:
- framework (BCBS, Dodd-Frank, CCAR, etc.)
- department (finance, risk, operations, etc.)
- data quality dimension (the six dimensions cited above)
Rule level — data quality by business rule
The rule level shows metrics at the business rule level. This includes data like:
- Rule name
- Data concept
- Data element
- Rule description
- Data quality dimension
- Total processed
- Total failed
- Pass %
Conceptual data model for the business data quality dashboard
The Talend Data Quality mart uses terminology like “analysis” and “reports,” but business users working in data management may use different terminology. They may be more familiar with terms such as:
- Data concept — describes a business concept or entity, such as a customer, product, or trade.
- Data element — describes the attributes of a business concept or entity, such as customer name or age.
- Business rule — defines what good data looks like for a data element. For example, a customer that’s a corporation should have an incorporation date.
- Department — usually a line of business that might like to see only those data concepts that are relevant to them, so it’s useful to group concepts by department.
- Rule execution — captures the metrics of a business rule execution at a point in time.
Now that we’ve specced out our dashboard and understand the conceptual data model, we can start implementing it.
Set up the data mart using Snowflake
You must configure your data warehouse before you build your dashboard. Here are the steps we took to configure Snowflake before we built our sample dashboard.
- Create the schemas TALEND_DQ and LOOKER_SCRATCH.
- Deploy the Talend Data Quality mart tables and views in the schema TALEND_DQ.
Talend Studio will do this automatically — just point it to Snowflake when creating a report.
- Deploy the BI views manually in Snowflake (see Appendix 1).
Notes on the view bi_dq_metrics
The view bi_dq_metrics needs to be customized depending on how you implement your business rules. The ideal way would be to have a mapping table to map the Rule name (or ID) with the data quality dimension metadata (i.e. Consistency, Validity, etc). For this example, we are not using a mapping table, but are using rule naming conventions.
Mapping data quality rules to data quality dimensions
To allocate a data quality rule to a data quality dimension, use a rule naming convention. In our example, Rule_01_Validity_Email looks for the term “Validity” in the rule name to determine whether this rule belongs to the validity data quality dimension.
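The convention can be sketched outside SQL as well; this mirrors the CASE expression used in bi_dq_metrics, checking the dimension terms in the same order (the function name is ours):

```python
# The six dimension terms that bi_dq_metrics searches for in the rule name.
DIMENSIONS = ["Validity", "Timeliness", "Accuracy",
              "Uniqueness", "Completeness", "Consistency"]

def dimension_for(rule_name: str) -> str:
    """Return the first dimension term found in the rule name, else 'Other'."""
    for dim in DIMENSIONS:
        if dim in rule_name:
            return dim
    return "Other"

print(dimension_for("Rule_01_Validity_Email"))  # Validity
print(dimension_for("Rule_99_RowCount"))        # Other
```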
Calculating failed rows
The indicator count provides the number of rows that matched the definition of an indicator. Whether this is a pass or a fail depends on your interpretation. For example, suppose you have a data quality rule that checks a text value for a specific pattern (say, special characters). You need to decide whether rows matching the pattern count as a pass or a fail.
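The bi_dq_metrics view shown later encodes one such interpretation; a minimal sketch of the same decision (indicator category names as they appear in the mart, helper name ours):

```python
def rows_failed(indicator_category: str, rows_processed: int,
                indicator_value: int) -> int:
    """Interpret an indicator count as a failure count.

    For 'Pattern Matching' indicators the value counts rows that MATCHED
    the pattern (a pass), so failures are the remainder. For 'Simple
    Statistics' (e.g. a duplicate or null count) the value itself is the
    failure count.
    """
    if indicator_category == "Pattern Matching":
        return rows_processed - indicator_value
    return indicator_value

print(rows_failed("Pattern Matching", 1000, 950))   # 50 invalid emails
print(rows_failed("Simple Statistics", 1000, 12))   # 12 duplicate IDs
```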
Set up the Talend environment
- Start Talend Studio.
- Create a Project — we’ll call it PROJECT_BCBS.
- Select the Profiling perspective (Window > Perspective > Profiling).
Create and execute the data quality rules
- Configure the dataset to be profiled:
- Under Metadata, right-click on <File Delimited> and select <Create File Delimited Connection>.
- Using the wizard, import the file customer.csv.
- Create an analysis.
- Under Data Profiling, right-click on <Analyses> and select <New Analysis>.
- Select <Column Analysis>.
- Name the rule Rule_01_Validity_Email_address.
- In <Connections>, select customers.
- In <Select Columns>, select Email.
- In <Select Indicators>, select Email Address under Patterns > Regex Patterns > internet.
- Then repeat these steps for each of the remaining rules.
- Create a report.
- Under Data Profiling, right-click on <Reports> and select <New Report>.
- Give it the name dq_report_bcbs, then click <Finish>.
- Under the section Analysis List, click on <Select Analysis>, then add the four rules you created into this one report.
- Under the section Generated Report Settings, you can output the report to a file (pdf, xml, html, xls) or to the Data Quality data mart.
- To write the report to the data mart, fill in the details in Database Connection Settings section.
- Click on the <Run> button on the top of the screen to create the report.
Set up Looker
Similarly, you may have to configure your reporting tool before you create a dashboard. Here’s what we did to configure Looker.
- Create a database connection to Snowflake.
- Go to Admin > Connections.
- Click <New Connection>.
- Get the Looker Block source code.
- Browse to https://github.com/llooker/talendpartner_dataquality.
- Download the code as a Zip file.
- Unzip the file onto your local machine.
- Create a new LookML project.
- Go to Develop > Manage LookML Projects.
- Click on <New LookML Project>.
- Enter a <Project Name> — we used talend_dq_project.
- Select <Blank Project>.
- Click on <Create Project>.
- Upload the Looker Block code into the new project.
- In the file browser, create three new folders: dashboards, models, views.
- Drag and drop the downloaded files into the File Browser, then move each file into the folder that matches its type, so every dashboard, model, and view file ends up in its corresponding folder.
- Ensure the manifest.lkml file is pointing to the right database connection.
- Open the dashboard.
- In the File Browser, open the folder dashboards.
- Select the file data_quality.dashboard.
- In the drop-down next to the filename, click on <View Dashboard>.
- Once the dashboard is open, you might need to update the filters to ensure you can see the latest data.
Ongoing data quality monitoring (orchestration)
Once the report has been built, you may want to create a Job in Talend Studio and schedule it, so you can execute the report on a regular basis to monitor your data quality.
The steps you need to take to create a Job are documented on help.talend.com.
What to do next
- We’ve provided a Looker Block to create the summary-level dashboard — you can now implement the rule-level dashboard.
- Technical users such as developers and business analysts can build a profiles dashboard that allows users to drill down into data profile results.
Exploring data profiling results
Types of data profiles
Talend Studio supports 17 types of data profiles, which we call analyses.
The results of these analyses can be exported to the Data Quality mart, except for the correlation analyses (numerical, time, and nominal).
Each analysis can be categorized into one of five areas:
- Column analysis — generally profiling single columns of data
- Overview analysis — analysis of data models and structures
- Column set analysis — analysis across a set of columns (full records)
- Match analysis — compares a set of columns and creates groups of similar records
- Comparison analysis — compares two datasets that have the same columns, similar to a reconciliation
Dashboard to explore profiles
You could build a profile-exploration dashboard that displays the results of queries against the database views described below (see Appendix 1):
Additional database views to be created in the data mart
The Data Quality mart is designed with a star schema and consists of fact tables and a number of associated dimension tables. The full documentation is available on help.talend.com.
You can create database views to make it easy to query the Data Quality mart. (Note — these views return only the last version of each analysis — so if you created Analysis v0.1 which profiled column First Name, and later you modified it to v0.2 which profiled column Last Name, then these views will only provide the results of v0.2.)
Column analysis (bi_column_analysis)
create view bi_column_analysis as
select
    a.rep_label as report_name,
    ih.run_time,
    a.an_pk,
    a.an_label as analysis_name,
    ih.elt_connection_name as connection_name,
    ih.elt_table_name as table_name,
    ih.elt_column_name as column_name,
    ih.ind_category as indicator_category,
    ih.ind_label as indicator_name,
    ih.indv_row_count as rows_processed,
    ih.indv_int_value as indicator_value
from tdq_v_analysis a, tdq_v_ind_histo ih
where a.rep_uuid = ih.rep_uuid
  and a.an_uuid = ih.an_uuid
order by a.rep_label, ih.run_time, a.an_pk;
This view covers all types of column analysis except semantic-aware.
Overview analysis (bi_overview_analysis)
create or replace view bi_overview_analysis as
select
    tva.rep_label as report_name,
    oi.rep_runtime as report_run_time,
    oi.an_pk as analysis_pk,
    tva.an_label as analysis_name,
    tae.elt_connection_name as database_connection_name,
    tae.elt_catalog_name as catalog_name,
    tae.elt_schema_name as schema_name,
    oi.tv_name as table_or_view_name,
    oi.tv_rowcount as number_of_rows,
    oi.tv_keycount as number_of_keys,
    oi.tv_indexcount as number_of_indexes,
    oi.tv_triggercount as number_of_triggers
from tdq_v_last_overview_analy_exec lo,
     tdq_overview_indvalue oi,
     tdq_analyzed_element tae,
     tdq_v_analysis tva
where lo.analysis_pk = oi.an_pk
  and oi.elt_pk = tae.elt_pk
  and oi.tv_name <> 'NULL TALEND'
  and lo.analysis_pk = tva.an_pk;
Comparison analysis (bi_comparison_analysis)
create view bi_comparison_analysis as
select
    ma.analysis_pk,
    a.rep_label as report_name,
    mi.an_runtime as report_runtime,
    a.an_label as analysis_name,
    mi.indicator_pk,
    tid.ind_label as indicator_name,
    tid.ind_category as indicator_category,
    tid.ind_description as indicator_description,
    mi.table_a_pk,
    table_a.elt_table_name as table_name_a,
    mi.table_b_pk,
    table_b.elt_table_name as table_name_b,
    mi.indv_row_count as number_of_rows,
    mi.indv_match_count as number_rows_matched,
    (mi.indv_match_percent * 100) as match_percentage
from tdq_v_last_match_analysis_exec ma,
     tdq_v_analysis a,
     tdq_match_indvalue mi,
     tdq_analyzed_element table_a,
     tdq_analyzed_element table_b,
     tdq_indicator_definition tid
where ma.analysis_pk = a.an_pk
  and ma.analysis_pk = mi.analysis_pk
  and mi.table_a_pk = table_a.elt_pk
  and mi.table_b_pk = table_b.elt_pk
  and mi.indicator_pk = tid.ind_pk;
Data Quality metrics (bi_dq_metrics)
create view bi_dq_metrics as
select
    *,
    rows_processed - rows_failed as rows_passed
from (
    select
        report_name,
        run_time,
        connection_name as department_name,
        analysis_name as rule_name,
        table_name as data_concept,
        column_name as data_element,
        indicator_category,
        indicator_name,
        case
            when charindex('Validity', analysis_name) > 0 then 'Validity'
            when charindex('Timeliness', analysis_name) > 0 then 'Timeliness'
            when charindex('Accuracy', analysis_name) > 0 then 'Accuracy'
            when charindex('Uniqueness', analysis_name) > 0 then 'Uniqueness'
            when charindex('Completeness', analysis_name) > 0 then 'Completeness'
            when charindex('Consistency', analysis_name) > 0 then 'Consistency'
            else 'Other'
        end as dq_dimension,
        rows_processed,
        case
            when indicator_category in ('Simple Statistics') then indicator_value
            when indicator_category in ('Pattern Matching') then rows_processed - indicator_value
            else indicator_value
        end as rows_failed
    from bi_column_analysis
) as dq_metrics_only
where dq_dimension <> 'Other';
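To see the view's logic end to end, here is a self-contained sketch against an in-memory SQLite database. The sample rows are invented, only three of the six dimension branches are shown, and SQLite's instr() stands in for Snowflake's charindex():

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A tiny, made-up sample shaped like the bi_column_analysis view.
cur.execute("""CREATE TABLE bi_column_analysis (
    report_name TEXT, run_time TEXT, connection_name TEXT,
    analysis_name TEXT, table_name TEXT, column_name TEXT,
    indicator_category TEXT, indicator_name TEXT,
    rows_processed INTEGER, indicator_value INTEGER)""")

cur.executemany(
    "INSERT INTO bi_column_analysis VALUES (?,?,?,?,?,?,?,?,?,?)",
    [("dq_report_bcbs", "2023-01-01", "finance",
      "Rule_01_Validity_Email_address", "customer", "Email",
      "Pattern Matching", "Email Address", 1000, 950),
     ("dq_report_bcbs", "2023-01-01", "finance",
      "Rule_02_Uniqueness_Customer_id", "customer", "ID",
      "Simple Statistics", "Duplicate Count", 1000, 12)])

# bi_dq_metrics ported to SQLite: instr() replaces charindex().
cur.execute("""CREATE VIEW bi_dq_metrics AS
SELECT *, rows_processed - rows_failed AS rows_passed FROM (
    SELECT report_name, run_time,
           connection_name AS department_name,
           analysis_name AS rule_name,
           table_name AS data_concept,
           column_name AS data_element,
           indicator_category, indicator_name,
           CASE WHEN instr(analysis_name, 'Validity') > 0 THEN 'Validity'
                WHEN instr(analysis_name, 'Uniqueness') > 0 THEN 'Uniqueness'
                WHEN instr(analysis_name, 'Completeness') > 0 THEN 'Completeness'
                ELSE 'Other' END AS dq_dimension,
           rows_processed,
           CASE WHEN indicator_category = 'Pattern Matching'
                THEN rows_processed - indicator_value
                ELSE indicator_value END AS rows_failed
    FROM bi_column_analysis) AS m
WHERE dq_dimension <> 'Other'""")

for row in cur.execute(
        "SELECT rule_name, dq_dimension, rows_failed, rows_passed "
        "FROM bi_dq_metrics ORDER BY rule_name"):
    print(row)
```

The pattern-matching rule reports 50 failures (1000 rows minus 950 matches), while the simple-statistics rule reports its 12 duplicates directly.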
Queries for use with any BI Tool
Here are some example queries that retrieve values to be displayed for the business data quality dashboard (if you are not using the Looker Block).
Overall data quality score for the last data quality run
select (sum((rows_processed-rows_failed)) / sum (rows_processed )) * 100 as last_run_dq_score from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm )
Total rows processed
select sum(rows_processed ) as rows_processed from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm )
Total rows failed
select sum(rows_failed) as rows_failed from bi_dq_metrics where run_time = (select max(run_time) from bi_dq_metrics dm)
Data quality score by dimension (last run)
select (sum(rows_processed - rows_failed) / sum(rows_processed)) * 100 as last_run_dq_score from bi_dq_metrics where run_time = (select max(run_time) from bi_dq_metrics dm) and dq_dimension = 'Completeness'
select (sum(rows_processed - rows_failed) / sum(rows_processed)) * 100 as last_run_dq_score from bi_dq_metrics where run_time = (select max(run_time) from bi_dq_metrics dm) and dq_dimension = 'Validity'
select (sum(rows_processed - rows_failed) / sum(rows_processed)) * 100 as last_run_dq_score from bi_dq_metrics where run_time = (select max(run_time) from bi_dq_metrics dm) and dq_dimension = 'Consistency'
select (sum(rows_processed - rows_failed) / sum(rows_processed)) * 100 as last_run_dq_score from bi_dq_metrics where run_time = (select max(run_time) from bi_dq_metrics dm) and dq_dimension = 'Accuracy'
Data quality score over time
select run_time, (sum((rows_processed-rows_failed)) / sum (rows_processed )) * 100 as last_run_dq_score from bi_dq_metrics group by run_time
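The same group-by-run-time aggregation can be checked in plain Python; the metric rows here are invented samples of (run_time, rows_processed, rows_failed):

```python
from collections import defaultdict

# Hypothetical samples of (run_time, rows_processed, rows_failed)
# as they might come back from bi_dq_metrics.
metrics = [
    ("2023-01-01", 1000, 50),
    ("2023-01-01", 1000, 12),
    ("2023-02-01", 1000, 30),
]

totals = defaultdict(lambda: [0, 0])  # run_time -> [processed, failed]
for run_time, processed, failed in metrics:
    totals[run_time][0] += processed
    totals[run_time][1] += failed

for run_time in sorted(totals):
    processed, failed = totals[run_time]
    score = (processed - failed) / processed * 100
    print(run_time, round(score, 2))
# 2023-01-01 96.9
# 2023-02-01 97.0
```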
Data quality score by concept (last run)
select data_concept , (sum((rows_processed-rows_failed)) / sum (rows_processed )) * 100 as dq_score from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm ) group by data_concept
Data quality score by element (last run)
select data_element , (sum((rows_processed-rows_failed)) / sum (rows_processed )) * 100 as dq_score from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm ) group by data_element
Data quality score by rule (last run)
select substring(rule_name,1,7) as rule_id, (sum((rows_processed-rows_failed)) / sum (rows_processed )) * 100 as dq_score from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm ) group by substring(rule_name,1,7)
List of data quality rules applied (last run)
select data_concept , data_element , rule_name, indicator_name , dq_dimension , rows_processed , rows_failed, ((rows_processed-rows_failed) / rows_processed) as dq_score from bi_dq_metrics where run_time = (select max (run_time) from bi_dq_metrics dm )