Busting myths about probabilistic linkage
By James Doidge, ADRC-England Senior Research Associate at University College London
Record linkage is one of the most widely applicable but poorly understood practices in administrative data research. While linked data have surfed the wave of mounting interest in ‘big data’ and evidence-based policy, record linkage has been going on behind the scenes for decades. Few researchers appreciate its relevance to the longitudinal analysis of almost every administrative dataset. Administrative data are nearly always recorded in event-based units, such as services provided, appointments booked, or registered diagnoses and life events. To analyse these data in terms of people requires that the records be linked, just as does any analysis that combines two or more datasets. A unique and, importantly, error-free identifier, such as a well-recorded NHS or National Insurance number, might allow this linkage to be implemented using relatively straightforward ‘deterministic’ rules (e.g. if records agree on NHS number then records are linked). When unique identifiers are not available, or are recorded with less than perfect quality, the problem becomes more complex. It may still be possible to use ‘fuzzy’ rules that allow for some level of disagreement, or combinations of different rules, but as the number of available matching variables increases and the quality of matching variables decreases, specifying deterministic rules becomes increasingly complicated. This amplifies the influence of design choices made by the data linker, and thus the “dark art of data linkage” is born. Probabilistic linkage is one way to make this process of designing linkage algorithms, if not simpler, then at least more driven by the data and by explicitly stated statistical theory.
As the name might suggest, the aim of probabilistic linkage is to rank record pairs according to the probability that the pair are a match (truly relate to the same person or entity). Somewhat misleadingly, this is not quite how probabilistic linkage operates. The probability that a pair is a match is never known and often not even estimated. The probabilities that lie at the heart of probabilistic linkage are rather the probabilities of observing agreement or disagreement on each matching variable given that a pair is a match (the ‘m-value’) or a non-match (the ‘u-value’). These probabilities are combined into scores that, under certain statistical assumptions, go up when the probability of being a match is high and go down when the probability is low. What this means is that, in practice, probabilistic linkage is merely a way of assigning a score to every possible pattern of agreement on the available matching variables or, thought of another way, to every possible decision rule that could be set. If, as is generally done, a threshold score is set, above which pairs are classified as links and below which pairs are classified as non-links, then for every probabilistic threshold that could be chosen, there exists an equivalent set of deterministic rules that could have been specified instead. So, depending on how they are implemented, probabilistic linkage and deterministic linkage can produce exactly equivalent results or very different ones; the devil lies in the detail of how each is designed.
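To make the idea of scoring agreement patterns concrete, here is a minimal sketch in Python. It assumes the common log-likelihood-ratio form of the score, with m = P(agree | match) and u = P(agree | non-match). The three matching variables, their m and u figures and the threshold of 7 are invented for illustration and are not taken from any real linkage.

```python
import math
from itertools import product

# Hypothetical (m, u) pairs for three matching variables.
# m = P(agree | match), u = P(agree | non-match). Illustrative values only.
M_U = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode": (0.80, 0.05),
}


def pattern_score(pattern):
    """Sum the log2 likelihood ratios for one agreement pattern.

    `pattern` maps each variable name to True (agree) or False (disagree).
    """
    score = 0.0
    for name, agrees in pattern.items():
        m, u = M_U[name]
        if agrees:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def all_pattern_scores():
    """Score every possible agree/disagree pattern across the variables."""
    names = list(M_U)
    scores = {}
    for flags in product([True, False], repeat=len(names)):
        scores[flags] = pattern_score(dict(zip(names, flags)))
    return names, scores


def equivalent_rules(threshold):
    """List the agreement patterns that a given threshold would accept."""
    _, scores = all_pattern_scores()
    return sorted(flags for flags, s in scores.items() if s >= threshold)


if __name__ == "__main__":
    names, scores = all_pattern_scores()
    for flags, s in sorted(scores.items(), key=lambda item: -item[1]):
        print(dict(zip(names, flags)), round(s, 2))
    print(equivalent_rules(7.0))
```

With these made-up numbers, a threshold of 7 accepts exactly the patterns in which date of birth agrees and at least one of surname or postcode agrees, which is a rule that could just as well have been written down deterministically.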
There are, however, three important ways in which the available choices in probabilistic linkage exceed those that can be practically implemented in deterministic linkage. These three differences can give probabilistic linkage an edge over deterministic procedures. Firstly, handling large numbers of matching variables is relatively straightforward in probabilistic linkage but prohibitively complex in deterministic linkage. Take, for example, the probabilistic linkage of mothers and babies in Hospital Episode Statistics reported by Harron et al. (2016). This linkage involved 23 matching variables. Considering only binary indicators of agreement (agree/disagree), this would have meant 2^23 = 8,388,608 potential patterns of agreement, or potential decision rules, to choose between; far beyond the capacity of any human analyst. The consequence of this is that deterministic linkage procedures generally only consider a small set of matching variables, even when there are many that could be used to improve the classification of links.
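The scale of the problem is easy to check. The short sketch below counts agreement patterns for a given number of matching variables; the function name and the `n_levels` parameter are hypothetical conveniences, not part of any linkage package.

```python
from itertools import product


def count_agreement_patterns(n_variables, n_levels=2):
    """Number of distinct agreement patterns across the matching variables.

    n_levels is the number of comparison outcomes per variable:
    2 for plain agree/disagree, more if extra outcomes are distinguished.
    """
    return n_levels ** n_variables


# Enumerating patterns is feasible for a handful of variables...
small = list(product(("agree", "disagree"), repeat=3))
print(len(small))  # 8

# ...but not for an analyst writing rules by hand across 23 variables.
print(count_agreement_patterns(23))  # 8388608
```

Allowing a third outcome per variable (for instance, treating a missing value separately, which is an assumption beyond the binary case described above) would raise the count to `count_agreement_patterns(23, n_levels=3)`.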
A second way that probabilistic linkage is more flexible than deterministic linkage is in its capacity to handle distance measures of partial agreement, which reflect similarity on ordinal or continuous scales. ‘Jon’, for example, might be considered to have a 75% similarity to ‘John’ (or an ‘edit distance’ of one, etc.), while a deterministic rule would have to treat such a comparison as either meeting some sufficient level of agreement (using, for example, name dictionaries) or not. The finer level of information that is retained using distance measures in probabilistic linkage provides a finer level of discrimination between records and so can lead to lower levels of linkage error.
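As a rough illustration of where a figure like 75% can come from, the sketch below computes the Levenshtein edit distance and converts it to a similarity by dividing by the length of the longer string. This is only one of several possible similarity measures (others, such as Jaro-Winkler, are also widely used), and the band boundaries in `agreement_level` are arbitrary assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity between 0 and 1, scaled by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


def agreement_level(a, b):
    """Hypothetical banding of similarity into levels that could each be
    given their own weight in a probabilistic score."""
    s = similarity(a, b)
    if s == 1.0:
        return "exact"
    if s >= 0.7:
        return "close"
    return "different"


print(edit_distance("Jon", "John"))    # 1
print(similarity("Jon", "John"))       # 0.75
print(agreement_level("Jon", "John"))  # close
```

A deterministic rule would have to collapse the "close" band into either agreement or disagreement; keeping it as a separate level is what preserves the extra information.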
The final way that probabilistic linkage is more flexible than deterministic linkage is in its capacity to incorporate scores that reflect the different frequencies of specific values of matching variables. Agreement on a rare name like ‘Doidge’, for example, is much more likely to indicate a match than agreement on a common name like ‘Smith’. To imagine how this might play out in deterministic linkage, take the possible decision rules indicated earlier and now add variations that reflect the frequencies of each value of each matching variable. This is another way that probabilistic linkage can allow for a much finer level of discrimination between records.
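One simple way to picture frequency-based scoring is to let the chance of coincidental agreement on a value depend on how common that value is. The sketch below approximates this by the value's relative frequency in the file and holds the probability of agreement among true matches fixed at an assumed 0.95. Both choices, and the toy list of surnames, are illustrative assumptions rather than a description of any particular linkage.

```python
import math
from collections import Counter

# Assumed probability that true matches agree on surname (illustrative).
ASSUMED_M = 0.95


def frequency_weights(observed_values, m=ASSUMED_M):
    """Agreement weight per value, using its relative frequency as a rough
    stand-in for the chance that a non-matching pair agrees on that value."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    return {
        value: math.log2(m / (count / total))
        for value, count in counts.items()
    }


# Toy data: a made-up file of 100 surnames.
surnames = ["Smith"] * 60 + ["Jones"] * 39 + ["Doidge"]
weights = frequency_weights(surnames)

for name in sorted(weights, key=weights.get, reverse=True):
    print(f"{name}: {weights[name]:.2f}")
```

In this toy file, agreement on ‘Doidge’ earns a far larger weight than agreement on ‘Smith’, which mirrors the intuition described above.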
In summary, there are at least three important ways that probabilistic linkage is more flexible than deterministic linkage and can allow for better discrimination between matches and non-matches and lower rates of linkage error. Most of the other distinctions that are often drawn between probabilistic linkage and deterministic linkage (such as that probabilistic linkage produces more false matches and deterministic linkage more missed matches) are, however, more a reflection of differences in how each is implemented than of any fundamental differences between the procedures. For further explanation of these, check out the article ‘Demystifying probabilistic linkage: Common myths and misconceptions’, where we examine the truths underlying eight common myths and misconceptions about probabilistic record linkage.
- Jamie C. Doidge and Katie Harron, Demystifying probabilistic linkage: Common myths and misconceptions in The International Journal of Population Data Science, January 2018, Volume 3, Issue 1
- Katie Harron, Ruth Gilbert, David Cromwell and Jan van der Meulen, Linking Data for Mothers and Babies in De-Identified Electronic Health Data in PLOS ONE, October 2016, Volume 11, Issue 10
Written by Dr James Doidge. Published on the ADRN blog under Creative Commons license CC BY-NC-SA 4.0.
Published on 17 January 2018