Step-by-Step: How to Check Data Quality with Talend Using Your Own RegEx Pattern Library
In this example, I have been asked to check whether the corporate standard for naming employee documents has been adhered to and, if not, to report how many deviations are in the dataset. The database contains a table of employees, a table of employee documents (contracts of employment, performance reviews, etc.), and a table of document types. Let's begin by utilizing those data sets and see what we come up with.
The "DOCUMENT_TYPE" table is as below:
After that I assign documents to employees using the "EMPLOYEE_DOCUMENTS" table:
What I need to do is check that the "DOCUMENT_NAME" values being added to the database match a corporate standard for each document type. For this example, I will focus on checking the document type "Contract of Employment", but I'll build a solution that works for all document types.
The naming convention for "contracts of employment" is:
- YYYYMMDD-Employment-Contract-cc-(et)-fn ln
- YYYYMMDD-Contract-Employment-cc-(et)-fn ln
- cc is a country code of either "UK" or "US"
- et is an "employment type" of either "Permanent" or "Contract"
- fn is an employee first name
- ln is an employee last name
So there are 2 acceptable naming conventions for this particular document type, but both are quite strict in their requirements. YYYYMMDD is obviously a date format.
Valid examples of a document name are:
- 20161201-Employment-Contract-UK-(Permanent)-Bill Smith
- 20161201-Employment-Contract-US-(Contract)-Bill Smith
- 20161201-Contract-Employment-UK-(Permanent)-Bill Smith
- 20161201-Contract-Employment-US-(Contract)-Bill Smith
Rather than hardcode the patterns into Talend, I'll create a library of valid regular expression patterns in the database: a table holding the valid regular expressions for each document type. I'll also add some examples of valid document names to the database:
RegEx is hard to remember, so the excellent http://regexr.com/ allows us to develop and test our document naming patterns before we commit them to the database.
So, now that I've built a library of valid patterns for our document type, I can build a Talend data integration job to check for any data quality issues in our document names.
The valid RegEx patterns are below.
It should be noted that I am not a magician, and I've decided not to check for fully valid dates at the start of the name. I am, however, checking that the name includes a year from 2010 to 2016, a month from 01 to 12, and a day from 01 to 31.
And I obviously can't differentiate, using RegEx, "first names" from "last names", but I can check that there are 2 words separated by a space at the end, with each word capitalized.
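As a sketch, the rules above could be expressed as two Java regular expressions, one per naming convention. These are my own illustrative patterns; the exact ones stored in the "EMPLOYEE_DOCUMENT_PATTERNS" table may differ in detail.

```java
import java.util.List;

public class ContractNamePatterns {
    // Year 2010-2016, month 01-12, day 01-31 (not a full calendar-date validation).
    static final String DATE = "201[0-6](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])";
    // Two capitalized words separated by a space (stand-in for "fn ln").
    static final String NAME = "[A-Z][a-z]+ [A-Z][a-z]+";

    // One pattern per acceptable naming convention for this document type.
    static final List<String> PATTERNS = List.of(
        DATE + "-Employment-Contract-(UK|US)-\\((Permanent|Contract)\\)-" + NAME,
        DATE + "-Contract-Employment-(UK|US)-\\((Permanent|Contract)\\)-" + NAME
    );

    /** True if the document name matches at least one allowed convention. */
    public static boolean isValid(String documentName) {
        return PATTERNS.stream().anyMatch(documentName::matches);
    }

    public static void main(String[] args) {
        System.out.println(isValid("20161201-Employment-Contract-UK-(Permanent)-Bill Smith")); // true
        System.out.println(isValid("20161201-Employment-Contract-DE-(Permanent)-Bill Smith")); // false: bad country code
    }
}
```

Note that `String.matches()` must match the entire string, so a valid prefix followed by trailing junk still fails, which is exactly what we want here.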
Creating the Talend Job
I'll highlight the main points of the job design only. First, I pull in the records from the "EMPLOYEE_DOCUMENTS" table for the "Contract of Employment" document type (DOCUMENT_TYPE_ID = 1). For this I use a "tOracleInput" component, as below:
Next, I'll add the output to a "tMap", and then create a lookup table for the "tMap" which pulls in the regex library of valid document names I created earlier in the database. i.e. the "EMPLOYEE_DOCUMENT_PATTERNS" table:
In the "tMap" I join the data on the "DOCUMENT_TYPE_ID" field, using an inner join. I also want to ensure "All Matches" is selected, as I'll want to check each document name against all of the allowed naming conventions.
In "tMap", I can map across all of the fields from the "EMPLOYEE_DOCUMENTS" table, and also output the valid REGEX patterns from the lookup table:
Our output now will be all of the "contract of employment" documents, joined with all possible valid REGEX patterns for this specific document type.
Next, add a "tFilterRow" component and switch to "Advanced Mode". Here I'm simply using the Java "matches" function against the "DOCUMENT_NAME", passing in the "REGEX_PATTERN" value from the input row, which comes from our database table of RegEx patterns:
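Conceptually, the whole advanced-mode condition boils down to one Java expression evaluated per joined row. The plain-Java equivalent below is just an illustration; in the "tFilterRow" itself the condition is written against `input_row` column names from your own schema.

```java
public class FilterCondition {
    // Plain-Java equivalent of the tFilterRow "Advanced Mode" condition,
    // i.e. something along the lines of:
    //   input_row.DOCUMENT_NAME.matches(input_row.REGEX_PATTERN)
    public static boolean passes(String documentName, String regexPattern) {
        // matches() anchors to the ENTIRE string, so a partial hit
        // (valid prefix plus trailing junk) still counts as a failure.
        return documentName.matches(regexPattern);
    }
}
```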
For a quick test, you can capture the filter outputs into a set of "tLogRow" components to check which documents pass the test, and which fail. That job looks something like this:
For reference, these are the sample document names in the database:
I then run the job and test the output, and in this instance the output is 1 pass and 7 fails:
Examining the output, I can identify the "pass" as a record for Fred Mackenzie. The document name matches the naming convention of PATTERN_ID = 1 in the database:
So the document name "20140301-Employment-Contract-UK-(Permanent)-Fred Mackenzie" matches the pattern:
When I explore the 7 failures, everything looks good, except that I notice the passing document above is also being output as a failure, this time against PATTERN_ID = 2:
This is to be expected: the document matches one of the RegEx patterns for "contract of employment", but not both! I somehow need to create a definitive list of "real" failures, i.e. those documents which don't match ANY of the acceptable RegEx patterns.
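In plain Java terms, the "real" failures are the documents for which no pattern matched at all, which is a `noneMatch` over the pattern list. A minimal sketch, with my own helper name rather than anything Talend-specific:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RealFailures {
    /** Documents that match NONE of the allowed patterns: the definitive failure list. */
    public static List<String> realFailures(List<String> documentNames, List<String> patterns) {
        return documentNames.stream()
                .filter(name -> patterns.stream().noneMatch(name::matches))
                .collect(Collectors.toList());
    }
}
```

The job achieves the same effect with components rather than code, as described next.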
For this I've decided to use some "tHashOutputs" to capture the Valid and Invalid rows, so we can use the output later in the job:
I can now create a sub job which reads the data back in from the "tHashOutput" components above and joins it in a "tMap". I create a couple of "tHashInput" components, each paired to one of the "tHashOutput" components above.
Tip: I had to steal the schema from the "tHashOutput" components. The schema "copy and paste" option is a good choice if you don't want to create a repository schema.
The main data into the "tMap" is the list of failed documents, and the lookup table is the list of valid documents:
In "tMap", I can create an inner join on the "EMPLOYEE_DOCUMENT_ID", but configure the output to catch the rejects of the inner join:
In other words, I feed in the failed documents and output only those which don't have a match in the valid documents. This way, if a document passes at least one rule, it will not be output as a failure. There may well be better ways of doing this...
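Catching the rejects of the inner join is effectively an anti-join on "EMPLOYEE_DOCUMENT_ID": a failed row survives only if its ID never appears among the passes. A sketch of that logic, using plain integer IDs for illustration:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AntiJoin {
    /**
     * Failed document IDs with no matching row in the valid set,
     * i.e. the tMap inner-join rejects.
     */
    public static List<Integer> rejects(List<Integer> failedIds, Set<Integer> validIds) {
        return failedIds.stream()
                .filter(id -> !validIds.contains(id))
                .collect(Collectors.toList());
    }
}
```

Note that a document can still appear more than once on the reject side, once per non-matching pattern, which is dealt with below.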
I can run the job and check. Indeed we now see 6 failures not 7, and Fred's document has been removed from the "naughty list".
In actual fact, I don't have 6 failures: there are only 4 documents in this sample set. The duplication is a consequence of the join to the RegEx pattern table, where multiple patterns are associated with each document type.
I can reduce the data to the individual document failures using a "tUniqRow" component:
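"tUniqRow" keeps the first row seen for each key. In Java terms that is a distinct-by-key pass; a sketch, again using bare IDs in place of full document rows:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqRow {
    /** Keep the first occurrence of each key, like tUniqRow on EMPLOYEE_DOCUMENT_ID. */
    public static List<Integer> firstPerKey(List<Integer> ids) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> out = new ArrayList<>();
        for (Integer id : ids) {
            if (seen.add(id)) {   // add() returns false when the key was already seen
                out.add(id);
            }
        }
        return out;
    }
}
```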
And now I can see that the output is the 3 true document failures:
And that is, at a very high level, how Talend enables you to check data quality using your own RegEx pattern library. Questions? Comments? Please feel free to tell me below. Until next time!
Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer.