Data Quality (DQ) is an art form. If I was to define two simple rules for a project involving some element of DQ, they would be:
- Don’t underestimate the time and effort required to get it right
- Talk to the people who own and use the data and get continuous feedback!
In many cases the DQ role on a project is a full-time role. It is a continuous process of rule refinement, results analysis and discussion with the users and owners of the data. Typically the DI (Data Integration) developers will build a framework into which DQ rules can be inserted, adjusted and tested constantly without the need to rewrite the rest of the DI processes each time.
This blog focuses on the process of matching data, but many of the principles can also be used in other DQ tasks.
First of all, lets understand what we are trying to achieve – why might we want to match?
- To find the same ‘entity’ (be it product, customer, etc.) in different source systems
- To de-duplicate data within a single system – or at least identify duplicates within a system for a Data Steward to be able to take some sort of action
- As part of a process of building a single version of the truth, possibly as part of an MDM (Master Data Management) initiative
- As part of a data-entry process to avoid the creation of duplicate data
To help us with these tasks I propose a simple DQ methodology:
As you can see, this is an iterative process and, as I said earlier, you are unlikely to find that just one ‘round’ of matching achieves the desired results.
Let’s look at each step on the diagram:
Before we can attempt to match, we must first understand our data. We employ two different strategies to achieve this:
- Reading relevant documentation
- Talking to stakeholders, end users, system administrators, data stewards, etc.
- Source system demonstrations
- Discussing the change in data over time
- Technical Profiling
- Using Talend tools to test assumptions and explore the data
- Analyse the actual change in data over time!
Both strategies must be employed to ensure success. One thing I have found is that end users of systems are constantly finding ways to ‘bend’ systems to do things that business teams need to do, but perhaps that the system wasn’t designed to do. A simple example of this would be a system that doesn’t include a tick box that let’s call centre operators know that a customer does not want to be contacted by phone. You may find that another field which allows free text has been co-opted for this purpose e.g.
Adam Pemble ****DO NOT PHONE****
This is why we cannot rely on the system documentation alone. A combination of consulting the users and technical profiling would help us identify this unexpected use of a field.
Typically for this step I would start by listing every assumption and query about the data – you should have at least one thing for every field of every table and file, plus row and table level rules. Next, design Talend Profiler Analyses and /or Data Integration Jobs to test those assumptions. These results will then be discussed with the business users and owners of the data. The reports produced by the DQ profiler can be a great way to share information with these business users, especially if they are not very technical. DI can also produce results in formats familiar to business users e.g. spreadsheets.
Specific to the task of matching, some examples of assumptions we may wish to test:
- “In source system A, every Customer has at least one address and every address has at least one associated customer”
- “Every product should have a colour, perhaps we can use that in our matching rules? The colour field is free text in the source system.”
- “Source systems A and B both have an email address field – can we match on that?”
- “Source system X contains a lot of duplicates in the Customer table”
It is also important to analyse the lineage of each piece of data. For example, say we had an email address field. We may profile it and discover that it contains 100% correctly formatted email addresses. Is this because the front-end system is enforcing this, or is it just by chance? If it is the latter, our DI jobs may need to be written to cope with the possibility of an incorrect or missing email, even though none currently exist.
Note: I may write a future blog going into more detail about the importance of analysis before beginning to write any process logic.
Whilst performing the Analysis stage, it is likely that we will notice things about our data that will have an impact on our ability to match records. For example, we might profile a ‘colour’ column for a product and find results similar to those shown below:
What do we notice?
- Blk is an abbreviation of Black
- Chr is an abbreviation of Chrome
- Aqua MAY be a synonym of Blue
- Blu is a typo of Blue
If we were to do an exact match based on colour, some matches could be missed. Fuzzy matching could introduce false positives (more on this later).
To improve our matching accuracy, we need to standardise our data BEFORE attempting to match. Talend allows you to apply a number of different standardisation techniques including:
- Synonym indexes
- Reference data lookups
- Phone Number Standardisation
- Address Validation tools
- Other external validation / enrichment tools
- Grammar-based parsers
Let’s look at each of these in turn:
Our scenario with colours would be a classic use case for a synonym index. Simply a synonym index is a list of ‘words’ (i.e. our master values) and synonyms (related terms that we would like to standardise or convert to our master value). For example:
The above is an excerpt from one of the synonym indexes that Talend provides ‘out of the box’ (https://help.talend.com/display/TalendPlatformUniversalStudioUserGuide55EN/E.2++Description+of+available+indexes), in this case one that deals with names and nicknames. Talend also provides components to build your own indexes (the index itself is a Lucene index, stored on the file system) and to standardise data against these indexes. The advantages of using Lucene is that it is fast, it is an open standard and that we can leverage Lucene’s fuzzy search capabilities, so we can in some cases cater for synonyms that we can’t predict at design time (e.g. typos).
These jobs are in the Talend demo DQ project if you want to play with them. The indexes can also be utilised in the tStandardizeRow component, which we will discuss shortly.
Reference data lookups
A classic DI lookup to a database table or other source of reference data e.g. an MDM system. Typically used as a join in a tMap.
Phone Number Standardisation
There is a handy component in the Talend DQ pallet that you should know about: tStandardizePhoneNumber.
It uses a google library to try to standardise a phone number into one of a number of available formats, based on the country of origin. If it can’t do this, it lets you know that the data supplied is not a valid phone number. Take the example of the following French phone numbers:
They both standardise to:
01 47 04 56 70
Using the ‘National’ format option. In their original form, we would not have been able to match based on these records – after standardisation, we can make an exact match.
Address Validation tools
Talend provides components with our DQ offerings that allow you to interface to an external address validation tool such as Experian QAS, Loqate or MelissaData. Why would you want to do this? Well, if you are pulling address data from different systems, the likelihood is that the address data will be held in differing formats e.g. Address1, Address2 etc. vs Building, Street, Locality, etc. These formats may have different rules for governing the input of addresses – from no rules (free text) to validating against an external source. Even if two addresses are held in the same structure, there is no guarantee that the individual ‘tokens’ of the address will be in the same place or have the same level of completeness. This is where address validation tools come in. They take in an address in its raw form from a source and then using sophisticated algorithms, standardise and match the address against an external reference source like a PAF file (Post Office Address file) – a file from the postal organisation of a country that contains all addresses, updated on a regular basis. The result is returned in a well-structured and most importantly consistent format, with additional information such as geospatial information and match scores. Take the example below:
Two addresses that are quite different from each other to a computer; however, to a human, we can see that they are the same address. Running the addresses through an address validation tool (in this case Loqate) we get the same, standardised address as an output. Now our matching problem is much simpler.
You might ask – can I not build something like this with Talend rather than buy another tool? I was once part of a project where this was attempted (not with Talend, it was a different tool), which had quite poor-quality addresses. The issue is that addresses were designed to allow a human to deliver a letter to a given place, and there is a great deal of variation in how addresses can be represented. Six months of consultancy later, we had something that worked in most cases, but of course it was then realised that it would have been cheaper to buy a tool…. Why is address validation not built into Talend you might ask? There are a number of reasons:
- Not all customers require Address Validation – it would make the product more expensive
- Those customers that do may already be using one of the major vendors, they don’t want a different Talend proprietary system
- Different tools on the market suit different needs – e.g. MelissaData is centred on the US
- Why should Talend re-invent the wheel, when we could just allow you to utilise existing ‘best of breed’ solutions?
Other external validation / enrichment tools
There are many tools available on the market, most of which are easy to interface with using Talend (typically via a webservice or api). For example Dun and Bradstreet is a popular source of company data and Experian provides credit information on individuals. All of this data could be useful to an organisation in general and also potentially useful in matching processes.
Sometimes we will come across a data field that has been entered as free text, but could contain multiple bits of useful information that we could use to match, if only we could extract it consistently. Take for example a field that holds a product description:
34-9923 Monolithic Membrane 4' x 8' 26 lbs
4' x 8' Monolithic Membrane 26lbs Green
Now again, as a human, we can see that there is a high likelihood that these two descriptions are referring to the same product. What we need to be able to do is identify all of the different ‘tokens’: Product code, Name, Dimensions, Weight, and Colour - and create a set of rules to be able to ‘parse’ out these tokens, no matter the order or variations in representation (e.g. the spaces in 26 lbs but not in 26lbs). Essentially, what we are defining is some simple rules for a language or ‘grammar’. Talend includes a variety of parsing components which can help you, from simple regular expressions through to tStandardiseRow, which lets you construct an ANTLR grammar:
A warning though: this is a hard task for even experienced professionals to get right. We are looking to include some additional intelligence in the tool to help you with building these sorts of rules in the future.
Next time: This blog continues with part 2: matching and survivorship / stewardship