Data Matching with Different Regional Data Sets
When it comes to Data Matching, there is no ‘one size fits all menu’. Different matching routines, different algorithms and different tuning parameters will all apply to different datasets. You generally can’t take one matching setup used to match data from one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.
Data Matching for Attributes that are Unlikely to Change
Data Matching is all about identifying unique attributes that a person, or object, has; and then using those attributes to match individual members within that set. These attributes should be things that are ‘unlikely to change’ over time. For a person, these would be things like "Name" and "Date of Birth". Attributes like "Address" are much more likely to change and therefore of less importance, although this does not mean you should not use them. It’s just that they are less unique and therefore of less value, or lend less weight, to the matching process. In the case of objects, they would be attributes that uniquely identify that object, so in the case of say, a cup (if you manufactured cups) those attributes would be things like "Size", "Volume", "Shape", "Color", etc. The attributes themselves are not too important, it’s the fact that they should be ‘things’ that are unlikely to change over time.
So, back to data relating to people. This is generally the main use case for data matching. So here comes the challenge. Can’t we use one set of data matching routines for a ‘person database’ and just use the same routines etc. for another dataset? Well, the answer is no, unfortunately. There are always going to be differences in the data that will manifest itself during the matching, and none more so than using datasets from different geographical regions such as different countries. Data matching routines are always tuned for a specific dataset, and whilst there are always going to be differences from dataset to dataset. The difference becomes much more distinct when you chose data from different geographical regions. Let us explore this some more.
Data Matching for Regional Data Sets
First, I must mention a caveat. I am going to assume that matching is done in western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or western names, far from it, it just means the matching routines are those which we can use for names that we can write using western, or Romanized characters. I will not consider matching using non-western characters here.
Now, let us consider the matching of names. To do this for the name itself, we use matching routines that do things like phoneticize the names and then look for differences between the result. But first, the methodology involves blocking on names, sorting out the data in different piles that have similar attributes. It’s the age-old ‘matching the socks’ problem. You wouldn’t match socks in a great pile of fresh laundry by picking one sock at a time from the whole pile and then trying to find its duplicate. That would be very inefficient and take ages to complete. You instinctively know what to do, you sort them out first into similar piles, or ‘blocks’, of similar socks. Say, a pile of black socks, a pile of white socks, a pile of colored socks etc. and then you sort through those smaller piles looking for matches. It’s the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.
Different geographic regions will produce different distributions of block sizes and types that result in differences to the matching that will need to be done in those blocks, and this can manifest itself in terms of performance, efficiency, accuracy and overall quality of the matching. Regional variations in the distribution of names within different geographical regions, and therefore groups of data, can vary widely and therefore cause big differences in the results obtained.
Let’s look specifically at surnames for a moment. In the UK, according to the National Office of Statistics, there are around 270,000 surnames that cover around 95% of the population. Now obviously, some surnames are much more common than others. Surnames such as Jones, Brown, Patel example are amongst the most common, but the important thing is there is a distribution of these names that follow a specific graphical shape if we chose to plot them. There will be a big cluster of common names at one end, followed by a specific tailing-off of names to the other, whilst the shape of the curve would be specific to the UK and to the UK alone. Different countries or regions would have different shapes to their distributions. This is an important point. Some regions would have a much narrower distribution, names could be much more similar or common, whilst some regions would be broader, names would be much less common. The overall distribution of distinct names could be much more or much less and this would, therefore, affect the results of any matching we did within datasets emanating from within those regions. A smaller distribution of names would result in bigger block sizes and therefore more data to match on within those blocks. This could take longer, be less efficient and could even affect the accuracy of those matches. A larger distribution of names would result in many more blocks of a smaller size, each of which would need to be processed.
Data Matching Variances Across the Globe
Let’s take a look at how this varies across the globe. A good example of regional differences comes from Taiwan. Roughly forty percent of the population share just six different surnames (when using the Romanised form). Matching within datasets using names from Taiwanese data will, therefore, result in some very large blocks. Thailand, on the other hand, presents a completely different challenge. In Thailand, there are no common surnames. There is actually a law called the ‘Surname Act’ that states surnames cannot be duplicated and families should have unique surnames. In Thailand, it is incredibly rare for any two people to share the same name. In our blocking exercise, this would result in a huge number of very small blocks.
The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region and therefore the blocking and matching strategy can vary widely from place to place. You cannot simply use the same routines and algorithms for different datasets, each dataset is unique and must be treated so. Different matching strategies must be adopted for each set, each matching exercise must be ‘tuned’ for that specific dataset in order to find the most effective strategy and the results will vary. It doesn’t matter what toolset you choose to use; the same principle applies to all as it’s an issue that is in the data and cannot be changed or ignored.
To summarize, the general point is that regional, geographic, cultural and language variations can make big differences to how you go about matching personal data within different datasets. Each dataset must be treated differently. You must have a good understanding of the data contained within those datasets and you must tune and optimize your matching routines and strategy for each dataset. Care must be taken to understand the data and select the best strategy for each separate dataset. Blocking and matching strategies will vary, you cannot just simply reuse the exact same approach and routines from dataset to dataset, this can vary widely from region to region. Until next time!