Bipartite record linkage aims to identify observations referring to the same individual, called coreferent observations, across two distinct non-duplicated datasets. The two main approaches to this task are the Fellegi–Sunter model, which relies on pairwise comparisons of observations, and the graphical record linkage model, which directly models the data and groups together coreferent observations. In this paper, we investigate the similarities between these two methods. We show that both models can be expressed in terms of a latent binary matrix indicating coreferent record pairs, that they can be framed as particular latent class analysis models, and that they admit a direct relationship between their parameters under a common data model. Moreover, we propose a unified estimation framework based on a classification expectation–maximization algorithm. The proposed estimation method properly incorporates the problem constraints while still allowing for a computationally efficient implementation. It also allows the same distributional assumptions on the linkage distribution to be used interchangeably between the two models. Empirical results using the proposed estimation method demonstrate satisfactory and mostly equivalent performance for the two models, both on simulations and on a real dataset commonly used as a benchmark for record linkage.
Redivo, E. (2026). Linking the Comparison and Graphical Approaches to Bipartite Matching. INTERNATIONAL STATISTICAL REVIEW, NA, 1-26 [10.1111/insr.70038].
Linking the Comparison and Graphical Approaches to Bipartite Matching
Redivo, Edoardo
2026
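As a hedged illustration of the latent binary matrix mentioned in the abstract, the bipartite constraint is commonly written as follows. The symbols $\Delta$, $n_1$ and $n_2$ are notation assumed here for illustration and may differ from the paper's own.

```latex
% Assumed notation: n_1 and n_2 are the sizes of the two datasets.
\[
\Delta \in \{0,1\}^{n_1 \times n_2}, \qquad
\Delta_{ij} = 1 \iff \text{record } i \text{ of the first dataset and record } j \text{ of the second are coreferent},
\]
\[
\sum_{j=1}^{n_2} \Delta_{ij} \le 1 \quad \text{for all } i, \qquad
\sum_{i=1}^{n_1} \Delta_{ij} \le 1 \quad \text{for all } j.
\]
```

The row and column constraints express that neither dataset contains duplicates, so each record can be linked to at most one record in the other dataset.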



