Bulla, L., Mongiovì, M., Gangemi, A. (2025). Underperformance or Pluralism: A Machine Learning Perspective on Inter-Annotator Agreement. IOS Press BV [10.3233/FAIA250647].
Underperformance or Pluralism: A Machine Learning Perspective on Inter-Annotator Agreement
Luana Bulla; Aldo Gangemi
2025
Abstract
Standard metrics for measuring the performance of machine learning models work well when the data are objective, meaning there is only one correct output for a given input. However, for tasks where the outcome is subjective, such as opinions or morality, these metrics can only reference an average gold standard, potentially underestimating the model’s performance. Inter-annotator agreement metrics, such as Fleiss’ Kappa and Cohen’s Kappa, can help estimate the subjectivity of a task. However, they do not provide a way to quantify what level of performance is satisfactory. Moreover, they are not always applicable in settings such as multi-label data, where annotators can assign multiple labels, or when the annotator set varies across samples. To address this challenge, we propose a novel inter-annotator agreement metric based on the F1-score, designed to support binary, multi-class, and multi-label data while accounting for incomplete annotations. Unlike traditional metrics, our approach enables direct comparison between human and machine annotations within a unified framework and accommodates varying levels of annotator coverage. Empirical validation on synthetic datasets demonstrates the proposed metric’s robustness against annotation imbalances and incomplete sample coverage. Additionally, a case study on multi-label data shows that the proposed metric effectively highlights how modern LLMs match human performance in moral value classification.
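The abstract does not state the metric's formula, so the following is only a rough illustration of the general idea of an F1-based agreement score, not the authors' definition. It assumes one plausible construction: a micro-averaged F1 between two annotators over the samples both of them labeled, averaged over all annotator pairs. The function names `pair_f1` and `mean_pairwise_f1` and the data layout (annotator name mapped to a dict of sample id to label set) are hypothetical.

```python
from itertools import combinations

def pair_f1(a, b):
    """Micro-averaged F1 between two annotators (illustrative assumption).

    a, b: dict mapping sample id -> set of labels assigned by that annotator.
    Only samples labeled by both annotators are compared, so incomplete
    coverage is tolerated. Returns None if the two share no samples.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    # F1 = 2*TP / (2*TP + FP + FN) = 2*|A ∩ B| / (|A| + |B|), which is symmetric.
    tp = sum(len(a[s] & b[s]) for s in shared)
    total = sum(len(a[s]) + len(b[s]) for s in shared)
    # Two annotators who both assign no labels at all are treated as agreeing.
    return 2 * tp / total if total else 1.0

def mean_pairwise_f1(annotations):
    """Average pair_f1 over every annotator pair that shares at least one sample."""
    scores = [
        pair_f1(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    ]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None

# Toy multi-label data with uneven coverage (made-up labels, for illustration only).
humans = {
    "ann1": {"s1": {"care"}, "s2": {"care", "fairness"}},
    "ann2": {"s1": {"care"}, "s2": {"fairness"}},
}
print(mean_pairwise_f1(humans))
```

Because a model's output can be passed in as just another entry in the dictionary, the same function can score human-human and human-machine agreement on the same scale, which is the kind of comparison the abstract describes. Binary and multi-class data fit the same layout by using label sets of size at most one.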



