In the context of Global Health, massive administrative datasets have become indispensable tools for health surveillance. However, the sheer scale of Big Data can mask systemic selection biases that standard mathematical adjustments may not fully mitigate. In this study, I propose a methodological audit of a recent large-scale cohort (N = 2,975,035) concerning COVID-19 vaccination and oncological outcomes. Benchmarking the cohort's architecture against national demographic and epidemiological gold standards through single-proportion Z-tests revealed notable structural divergences. The first inferential test yielded a Z-score of -260.39 (p < 10^-50), suggesting a structural under-sampling of the elderly population (32.2% deficit) relative to the reference population. The second test identified a cancer incidence deficit in the non-vaccinated control group that is statistically inconsistent with the epidemiological benchmark (Z = -15.23, p < 10^-50). These findings indicate that the reported statistical signals may emerge as a computational consequence of structural selection bias, where an artificially deflated baseline in the control group potentially inflates Hazard Ratios. Within a One Health approach, ensuring the structural integrity of data is crucial for effective prevention and control measures. I conclude that large-scale surveillance studies could be inferentially validated against demographic benchmarks to ensure that public health conclusions are grounded in baseline equivalence, thereby safeguarding the reliability of global health monitoring.
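The single-proportion Z-test named in the abstract compares an observed cohort proportion with a fixed reference proportion. The following is a minimal Python sketch of that test, not the author's code: the function name `one_proportion_z_test` and the example counts are hypothetical and chosen only for illustration, since the abstract does not report the underlying reference proportions.

```python
import math


def one_proportion_z_test(successes: int, n: int, p0: float) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: the true proportion equals p0.

    Uses the normal approximation, with the standard error computed
    under the null hypothesis (i.e. from the reference proportion p0).
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("n must be positive and p0 strictly between 0 and 1")
    p_hat = successes / n                      # observed proportion in the cohort
    se = math.sqrt(p0 * (1.0 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value


# Hypothetical illustration (NOT the study's data): a cohort of 1,000 people
# with 180 in the age group of interest, against a reference share of 25%.
z, p = one_proportion_z_test(successes=180, n=1000, p0=0.25)
print(f"z = {z:.2f}, two-sided p = {p:.2e}")
```

A negative Z-score means the group is under-represented relative to the benchmark. With samples in the millions, even small gaps produce very large |Z|; for a Z-score near -260 the double-precision p-value underflows to zero, which is one reason such results are usually reported as a bound such as p < 10^-50.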
Roccetti, M. (2026). Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts. AIMS Public Health, 13(2), 589-597. doi:10.3934/publichealth.2026031
Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts
Roccetti, Marco
2026



