Getting the most out of your data with record matching

Here at Transform, we’ve been exploring innovative ways to harness data, and we’ve found record matching to be a simple way to get more from it. Say you have one database of sales data and a separate database of customer preferences gathered from surveys. By identifying the same people in both sources and connecting their records, we can make the combined data more useful than the sum of its parts. This gives a single customer view from which we can derive deeper insights. The same idea applies to other types of data. For example, in the UK, the Clinical Practice Research Datalink (CPRD), which anonymously links people’s prescription data, GP and hospital records, and laboratory data, has become an invaluable resource for healthcare researchers.

In this article, we’ll explore the challenge of record matching and potential solutions to see you on your way to a single customer view.

The challenge

Clean, high-quality data is a fundamental prerequisite in analytics. Accurate record matching is an important part of this, but we can’t always rely on unique identifiers in the data. Sometimes we need to fall back on comparing other information to identify the same individual in different datasets. We’ll look at two methods of record matching now:

Deterministic matching

Intuition tells us that if two records share the same name, date of birth and postcode, they almost certainly belong to the same person. We can code this as a rule that marks the two records as a match, an approach known as deterministic matching. We can also build a set of such rules, each still producing a binary decision. For example, a second, more lenient rule might allow the postcode to differ slightly but require the recorded email address to match exactly. Together, these rules can cover a range of scenarios that result in a match. It is a simple but effective approach that we have used to consolidate databases.
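As a rough illustration, the two rules described above could be coded along these lines. This is a minimal sketch rather than our production logic: the Record fields, the normalise and outward_code helpers, and the reading of a "slightly different" postcode as "same outward code" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, chosen for illustration only.
    name: str
    date_of_birth: str  # ISO format, e.g. "1985-03-14"
    postcode: str
    email: str


def normalise(value: str) -> str:
    """Lower-case and remove whitespace so formatting differences are ignored."""
    return "".join(value.split()).lower()


def outward_code(postcode: str) -> str:
    """First part of a UK postcode, e.g. 'sw1a' from 'SW1A 1AA'.

    Assumes a well-formed postcode, where the inward code is the last
    three characters.
    """
    return normalise(postcode)[:-3]


def is_match(a: Record, b: Record) -> bool:
    same_name = normalise(a.name) == normalise(b.name)
    same_dob = a.date_of_birth == b.date_of_birth
    same_postcode = normalise(a.postcode) == normalise(b.postcode)
    same_area = outward_code(a.postcode) == outward_code(b.postcode)
    # A missing email never counts as agreement.
    same_email = bool(a.email) and normalise(a.email) == normalise(b.email)

    # Rule 1: name, date of birth and postcode all agree exactly.
    if same_name and same_dob and same_postcode:
        return True
    # Rule 2 (assumed leniency): the postcode only has to share its outward
    # code, but the email address must then match exactly.
    if same_name and same_dob and same_area and same_email:
        return True
    return False


sales = Record("Jane Doe", "1985-03-14", "SW1A 1AA", "jane@example.com")
survey = Record("jane doe", "1985-03-14", "SW1A 2AB", "Jane@Example.com")
print(is_match(sales, survey))  # matched by rule 2
```

Each rule returns a plain yes or no, which keeps the logic easy to audit but gives no sense of how strong a match is.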

Probabilistic matching

Matching becomes tricky in the edge cases, where data quality issues arise from inconsistently recorded information, such as different spellings of a name. The binary outcomes of rules make it hard to distinguish “stronger” matches from “weaker” ones. There are also coincidences to account for, where the records of two different people look very much like those of the same person.

Probabilistic matching is a different way of looking at this problem. Instead of making a binary decision, a set of comparisons between two records leads us to a probability that the two records match. We treat each comparison as an independent piece of evidence and tally the results to come to a conclusion. For example, we start by comparing the first name. Say we have ‘Jack’ and ‘Jak’: with this evidence alone, we have only 1% confidence that the records are a match. Next, we see that the middle name and surname are exactly the same, boosting it up to 60%. However, the postcodes are from completely different parts of the country, knocking it back down to 20%, and so on until we reach a final probability.
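One way to picture this tally is as a running probability that each piece of evidence scales up or down. The sketch below does this by multiplying the odds of a match by a factor for each comparison. The two factors are not estimated from any data: they are back-solved so that the numbers reproduce the illustrative 1%, 60% and 20% figures above, and the function name update is our own.

```python
def update(probability: float, evidence_factor: float) -> float:
    """Scale the odds of a match by one piece of evidence.

    A factor above 1 supports a match; a factor below 1 counts against it.
    """
    odds = probability / (1.0 - probability)
    new_odds = odds * evidence_factor
    return new_odds / (1.0 + new_odds)


# Starting point after comparing 'Jack' with 'Jak' (figure from the text).
probability = 0.01

# Illustrative factors, chosen only to reproduce the example figures.
evidence = [
    ("middle name and surname agree exactly", 148.5),
    ("postcodes in different parts of the country", 1 / 6),
]

for description, factor in evidence:
    probability = update(probability, factor)
    print(f"{description}: {probability:.0%}")
```

Working in odds rather than probabilities is what lets each comparison be applied as a simple multiplication, in any order.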

The method typically used to model and estimate these probabilities is the Fellegi-Sunter method, which is closely related to the broader family of naïve Bayes algorithms. A well-known limitation is its built-in assumption that each comparison is independent of the others. Even so, for its modest computational requirements, the Fellegi-Sunter method is very efficient and accurate.
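For readers who like to see the formula, the core of the model can be written as follows. This is the standard textbook form under the independence assumption, with notation we have chosen for this article: M means the two records are a true match, U means they are not, and gamma_i is the outcome of the i-th comparison.

```latex
% Posterior odds of a match after n comparisons, assuming the comparisons
% are independent given the true match status.
\[
\frac{P(M \mid \gamma_1, \dots, \gamma_n)}{P(U \mid \gamma_1, \dots, \gamma_n)}
= \frac{P(M)}{P(U)} \times \prod_{i=1}^{n} \frac{m_i}{u_i},
\qquad
m_i = P(\gamma_i \mid M), \quad u_i = P(\gamma_i \mid U)
\]

% Taking logs turns the product into a sum of per-comparison match weights.
\[
w_i = \log_2 \frac{m_i}{u_i}
\]
```

A comparison that is far more likely among true matches than among non-matches (m_i much larger than u_i) gets a large positive weight, and one that is more likely among non-matches gets a negative weight.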

We can gain accuracy by accounting for nuances in the data. Feature engineering allows more complex comparisons, such as splitting a postcode into its component parts and comparing them at different regional levels. Term frequencies adjust for how common specific values are: for example, a match on a common name such as ‘John Smith’ is weaker evidence than a match on a rarer name.
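Both ideas can be sketched in a few lines. The example below is illustrative only: the function names, the name counts and the value m = 0.95 are assumptions, the postcode parsing assumes well-formed UK postcodes, and the term-frequency adjustment uses the common approximation that two unrelated records agree on a given name about as often as that name appears in the data.

```python
import re


def postcode_parts(postcode: str) -> dict:
    """Split a UK postcode into progressively broader regional levels."""
    compact = "".join(postcode.split()).upper()
    outward, inward = compact[:-3], compact[-3:]
    area = re.match(r"[A-Z]*", outward).group()  # leading letters, e.g. 'SW'
    return {
        "unit": compact,                # full postcode, e.g. 'SW1A1AA'
        "sector": outward + inward[:1], # e.g. 'SW1A1'
        "district": outward,            # e.g. 'SW1A'
        "area": area,                   # e.g. 'SW'
    }


def postcode_agreement(a: str, b: str) -> str:
    """Return the finest level at which two postcodes agree, or 'none'."""
    parts_a, parts_b = postcode_parts(a), postcode_parts(b)
    for level in ("unit", "sector", "district", "area"):
        if parts_a[level] == parts_b[level]:
            return level
    return "none"


def name_evidence_factor(name_count: int, total_records: int, m: float = 0.95) -> float:
    """Evidence factor for an exact name match, adjusted for term frequency.

    m is the assumed chance that true matches agree on the name; the chance
    that non-matches agree is approximated by the name's relative frequency.
    """
    u = name_count / total_records
    return m / u


print(postcode_agreement("SW1A 1AA", "SW1A 2AA"))  # agree at district level

# Hypothetical counts in a dataset of 1,000 records.
common = name_evidence_factor(name_count=40, total_records=1000)
rare = name_evidence_factor(name_count=1, total_records=1000)
print(common < rare)  # the rarer name is the stronger evidence
```

In a probabilistic model, each postcode agreement level would carry its own weight, so agreeing on the full postcode counts for more than agreeing only on the area.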

Conclusions

The Fellegi-Sunter method was formalised back in 1969, the year of Led Zeppelin’s debut album. It’s not a particularly complex statistical model, but, as is often the case, greater computational resource has allowed models like this to come into their own. There’s a lot of value in choosing the right type of artificial intelligence for the job, and it’s important to identify cheap wins like this, where a lot of value can be added without resorting to beefier algorithms.

If you want to learn more about how record matching could empower your organisation’s data, get in touch with our experts at tranformation@transformuk.com and keep up with our thinking on LinkedIn.
