Data Matching 101: What Tools Does Talend Have?
This blog is the second part of a three-part series looking at Data Matching. In the first part, we looked at the theory behind data matching. In this second part, we will look at the tools Talend provides in its suite to enable you to do Data Matching, and how the theory is put into practice.
If you remember, we discussed how you match by first blocking data into similar groups, things that are unlikely to change, and then match within those groups using various matching parameters. This is basically what the Data Matching Components in the Talend suite do.
To start, we will look at the tools available in the Talend Data Quality (TDQ) toolset. TDQ provides some components grouped into the various sections, the one of interest to us is the Matching section. We will look at the major components individually. In using the components, we are again doing just what we described in the first blog in this series. We are choosing features that are unlikely to change (Blocking) and then matching within those features (Blocks).
tFirstnameMatch is a component that checks first names against an index file embedded in the component itself. This component searches through first names in the index file according to the input gender and input country you specify in the component settings. The index file has reference first names for about 162 countries. Some of the countries listed in the index have a huge number of reference first names. It also contains a ‘Fuzzy’ search option which can do including approximate matches.
tFuzzyJoin joins two tables by doing a ‘fuzzy’ match on several columns. It compares columns from the main flow with reference columns from the lookup flow and outputs the main flow data and/or the rejected data. This component allows you to define a matching type column and select from a list the method to be used to check the incoming data against the reference data. The types of Matches available are Exact Match, Metaphone, Double Metaphone and Levenshtein. Exact Match is self-explanatory, Metaphone and Double Metaphone are based on a phonetic algorithm for indexing entries by their pronunciation. Levenshtein is more complicated; it basically measures the distance between two words as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. You can set minimum and maximum distances. In this method, the distance is the number of character changes (insertion, deletion or substitution) that need to be carried out for the entry to match the reference fully. For example, if you want the minimum distance to be 0 and the maximum distance to be 2. This will output all entries that exactly match or that have maximum two character changes. There are many other methods available.
tMatchGroup is a component that compares columns in both standard input data flows and in Map Reduce input data flows by using matching methods and groups similar encountered duplicates together. Several tMatchGroup components can be used sequentially to match data against different blocking keys. This will refine the groups received by each of the tMatchGroup components through creating different data partitions that overlap previous data blocks and so on. In defining a group, the first processed record of each group is the master record of the group. The other records are computed as to their distances from the master records and then are distributed to the due master record accordingly. The Matching algorithms used by this component are Exact Match, Soundex, Metaphone, Double Metaphone, Levenshtein, Jaro (which matches processed entries according to spelling deviations. It counts the number of matched characters between two strings. The higher the distance is, the more similar the strings are), Jaro-Winkler (a variant of Jaro, but it gives more importance to the beginning of the string), Fingerprint key (matches entries after doing things such as removing white and control spaces), q-grams (which matches processed entries by dividing strings into letter blocks of length q in order to create a number of q length grams. The matching result is given as the number of q-gram matches over possible q-grams), and finally custom. Custom Matching enables you to load an external matching algorithm from a Java library using the custom Matcher column.
tFuzzyMatch is a component which compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data displaying the ‘distance’ between them. In this component, the Match Types are Metaphone, Double Metaphone, and Levenshtein.
tFuzzyUniqRow is a component which compares columns in the input flow by using a defined matching method and collects the encountered duplicates. In this component the Match Types are Exact, Metaphone, Double Metaphone, and Levenshtein.
tGenKey is a Big Data component that generates a functional key from the input columns, by applying different types of algorithms on each column and grouping the computed results in one key. It outputs this key together with the input columns. This component helps narrow down your data filter/ matching results using the generated functional key. It can be used as a Standard component or in MapReduce jobs. The algorithms used can do things such as format data as well as do matching such as Exact, Soundex (a simple phonetic matching based on the removal of vowels), Metaphone, Double Metaphone and ColognPhonetic; a soundex base phonetic algorithm optimized for the German language. It encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key.
These are the major components that can do Matching, there are others, but they follow the same methods and use the same Matching Types and algorithms as above.
In the final blog in this series, we will look at how to tune your matching to obtain the best possible matching with your data.