Data Matching and Combinations Using Large Data Sets
When matching large sets of data, consideration should be given to the number of combinations that can be generated and its associated effect on performance. This matters when using Talend's Data Integration matching and Data Quality components, because matching routines do not scale in a linear fashion.
For those who have read my various blogs and articles on data matching, you already know that the basic premise behind data matching is the mathematics of probabilities. For the objects being matched (people, widgets, etc.), the basic idea is to identify attributes that are unlikely to change, and then block and match on those attributes. You then use mathematics to work out the probability that each attribute in turn matches, and then sum all the weighted probabilities to get a final match value. That value indicates either a match, a non-match, or a possible match.
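The idea of summing weighted attribute probabilities can be sketched in a few lines. This is an illustrative example only, not Talend's implementation; the attribute names, weights, and thresholds are hypothetical values you would tune for your own data.

```python
# Hypothetical weights reflecting how discriminating each attribute is.
WEIGHTS = {"last_name": 0.4, "first_name": 0.3, "birth_date": 0.3}

def match_score(rec_a, rec_b):
    """Sum the weights of the attributes that agree between two records."""
    return sum(w for attr, w in WEIGHTS.items()
               if rec_a.get(attr) == rec_b.get(attr))

def classify(score, match_at=0.8, possible_at=0.5):
    # Thresholds are illustrative; in practice they are tuned per dataset.
    if score >= match_at:
        return "match"
    if score >= possible_at:
        return "possible match"
    return "non-match"

a = {"first_name": "Ann", "last_name": "Lee", "birth_date": "1980-01-01"}
b = {"first_name": "Anne", "last_name": "Lee", "birth_date": "1980-01-01"}
# last_name and birth_date agree (0.4 + 0.3 = 0.7), first_name does not.
print(classify(match_score(a, b)))  # "possible match"
```

Real matching engines use per-attribute agreement/disagreement probabilities rather than a simple sum of fixed weights, but the principle of combining weighted evidence into one score is the same.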
Now, these simple combinations cut both ways. Blocking reduces the number of combinations the matching routines must check, because records are only compared within a block, but the larger the data gets, the more combinations still need to be checked overall.
Here is a simple example to illustrate what I mean. Imagine matching two datasets, “A” and “B”, each containing 10 records. Ignoring blocking keys, each record in “A” must be checked against all 10 records in “B”, resulting in 100 combinations, or 100 checks in total.
Now, this can be reduced by making a good choice of blocking key, but the maximum is 100. The result could be that record number 1 in dataset “A” matches, say, 2 records in dataset “B”. Records 2 and 3 may have no matches, record 4 may have one possible match in “B” and one definite match, and so it goes on. Ten records in each set could therefore require 100 checks and still produce more than 10 results: you could start with 10 records in each dataset and get something like 3 matches, 4 possible matches, and 6 non-matches. In other words, you end up with more results than the number of input records you started with. Simply put, one record may match more than once, so it is usual to get more results out than you would first expect. As discussed, blocking will reduce the overall number of combinations to be checked, but there will almost always be more combinations, and therefore more results, than the initial number of records, unless there are only a few distinct duplicates or nothing matches at all.
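The counting above is easy to verify. The sketch below, with a made-up blocking key, counts candidate pairs for two 10-record datasets with and without blocking; the specific key (`i % 4`) is purely illustrative.

```python
from collections import defaultdict
import itertools

# (record id, hypothetical blocking key) for two 10-record datasets.
a = [("A%02d" % i, i % 4) for i in range(10)]
b = [("B%02d" % i, i % 4) for i in range(10)]

# Without blocking: the full cross-product must be checked.
unblocked = len(list(itertools.product(a, b)))
print(unblocked)  # 100

# With blocking: group one side by key, compare only within matching blocks.
blocks_b = defaultdict(list)
for rec in b:
    blocks_b[rec[1]].append(rec)
blocked = sum(len(blocks_b[rec[1]]) for rec in a)
print(blocked)  # 26 with this key: far fewer than 100
```

Even this toy key cuts the number of checks to roughly a quarter, which is why blocking key choice dominates performance as the data grows.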
An example of this is shown below. Here, I have built a simple matching job using Talend components. In this case, I am matching a dataset of one hundred thousand demographic records against a reference set of ten thousand records. The data is read in, and a Talend tRecordMatch component matches the records on just two fields, ‘First Name’ and ‘Last Name’, blocking on ‘County of residence’, i.e. the area where people live. Now, this is not ideal, but I am using it simply to demonstrate the scenario above. The match threshold is set at 80%.
When we run the job, the results show over 32,000 potential matches, over 74,000 non-matching records, and around 1,200 matches. Adding these up, the total comes to more than 100,000. As described above, this is because some records match more than once, producing many more results than the records we started with. More importantly, it means a very large number of combinations have been checked, so the job requires sufficient resources to process all of them. In this example we can also immediately see that we have far too many potential matches.
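For readers without Talend to hand, the same shape of job can be sketched with nothing but the standard library: block on county, fuzzy-compare first and last name, and keep pairs scoring at or above 0.8. This is not the Talend job itself; the records, the averaging of the two name scores, and the use of `difflib` similarity are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from collections import defaultdict

def similarity(x, y):
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def match_pairs(dataset, reference, threshold=0.8):
    # Block the reference set by county so we only compare within a block.
    blocks = defaultdict(list)
    for ref in reference:
        blocks[ref["county"]].append(ref)
    results = []
    for rec in dataset:
        for ref in blocks[rec["county"]]:
            score = (similarity(rec["first"], ref["first"]) +
                     similarity(rec["last"], ref["last"])) / 2
            if score >= threshold:
                results.append((rec, ref, round(score, 2)))
    return results

data = [{"first": "Jon", "last": "Smith", "county": "Kent"}]
ref = [{"first": "John", "last": "Smith", "county": "Kent"},
       {"first": "Joan", "last": "Smyth", "county": "Kent"}]
print(len(match_pairs(data, ref)))  # 2: one input record matched twice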
Dealing With Data Matching In Large Datasets
Now, for small numbers of records this is not a problem, but for large datasets the combinations can become overwhelming, which obviously affects performance as well. The main performance drivers here are the blocking keys and the matching rules. In general, for records with, say, 10 attributes, you would want around 2-3 blocking keys. Within those blocks, you would want to match around 3-4 attributes, but remember that the most important thing is the accuracy of the matches.
The higher the match threshold, the fewer matches you get. However, you run the risk of missing real matches, and you increase the number of possible matches that need to be checked, usually manually. The lower the threshold, the more matches you get and the more combinations are possible, but this time you run the risk of false positive matches. It’s a trade-off, and this is where the size of the dataset becomes important. For datasets in the thousands and hundreds of thousands it’s not usually an issue, but for datasets in the millions and tens of millions, it can be.
The Importance of Tuning Your Data Matching
This is why tuning your data matching, blocking routines, and algorithms is crucially important, but unfortunately there is no simple ‘one stop’ recipe for setting it up. You need to start with a first pass and check the results, make changes, and then run it again, repeating until you reach the point of diminishing returns. This is also why a careful choice of overall strategy, with consideration of the possible combinations involved, is key too.
So, the overall message is straightforward. When matching against large datasets, you want a good choice of blocking keys, matching algorithms, and thresholds in order to do two things:
- Reduce the number of false positives and false negatives you get, and therefore improve the quality of your results.
- Reduce the overall number of combinations that have to be checked, and therefore the resources needed to do your matching.
As discussed, this can be a trade-off, but it is crucially important when matching large datasets. Finally, it’s important to note that this does not scale linearly: a dataset twice as large as the last one takes more than twice as long to process, because the work grows more like a power relationship than a straight line. It is possible to estimate this in practice, but you would first need to understand how many combinations are possible. The practical point, therefore, is to carefully consider those possible combinations and take them into account when matching data from large datasets.
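A back-of-envelope version of that scaling claim: without blocking, matching an n-record dataset against an m-record one needs n × m checks, so doubling both sides quadruples the work. The function below is a deliberately naive estimate that ignores blocking, which would lower the constant but not change the power-law shape.

```python
def cross_checks(n, m):
    """Worst-case pairwise checks with no blocking at all."""
    return n * m

# Doubling the data multiplies the unblocked checks by four each time.
for n in (10_000, 20_000, 40_000):
    print(n, cross_checks(n, n))
```

With blocking, the same estimate becomes the sum of the block-size products, which is why evenly sized, well-chosen blocks matter more and more as the datasets grow.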
How are your data matching efforts going? Is there anything you’d like me to cover in my next article? Let me know in the comments, and until then, happy data matching!