Standard metrics for measuring the performance of machine learning models work well when the data are objective, meaning there is only one correct output for a given input. However, for tasks where the outcome is subjective, such as opinions or morality, these metrics can only reference an average gold standard, potentially underestimating the model’s performance. Inter-annotator agreement metrics, such as Fleiss’ Kappa and Cohen’s Kappa, can help estimate the subjectivity of a task, but they do not quantify what level of performance is satisfactory. Moreover, they are not always applicable in settings such as multi-label data, where annotators can assign multiple labels, or when the annotator set varies across samples. To address this challenge, we propose a novel inter-annotator agreement metric based on the F1-score, designed to support binary, multi-class, and multi-label data while accounting for incomplete annotations. Unlike traditional metrics, our approach enables direct comparison between human and machine annotations within a unified framework and accommodates varying levels of annotator coverage. Empirical validation on synthetic datasets demonstrates the proposed metric’s robustness against annotation imbalances and incomplete sample coverage. Additionally, a case study on multi-label data shows that the proposed metric effectively highlights how modern LLMs match human performance in moral value classification.
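
The abstract does not give the metric's definition, so the following is only a minimal sketch of the general idea of F1-based agreement, not the authors' formulation. It assumes agreement is the mean micro-averaged F1 over annotator pairs, computed only on the samples each pair has both labelled, which is one simple way to handle multi-label data and a varying annotator set. The function names `micro_f1` and `pairwise_f1_agreement`, the data layout, and the toy labels are all hypothetical.

```python
from itertools import combinations


def micro_f1(a, b):
    """Micro-averaged F1 between two annotators on the samples they share.

    a, b: dicts mapping sample id -> set of labels. A sample may be missing
    from either dict (incomplete coverage). Returns None if nothing is shared.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    tp = sum(len(a[s] & b[s]) for s in shared)
    total = sum(len(a[s]) + len(b[s]) for s in shared)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (|A| + |B|), symmetric in a and b.
    # Assumption: two annotators who both assign no labels at all agree fully.
    return 2 * tp / total if total else 1.0


def pairwise_f1_agreement(annotations):
    """Mean micro-F1 over all annotator pairs that overlap on some sample."""
    scores = []
    for x, y in combinations(annotations.values(), 2):
        score = micro_f1(x, y)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


# Toy multi-label data (hypothetical): not every annotator sees every sample.
annotations = {
    "ann1": {"s1": {"care"}, "s2": {"fairness", "loyalty"}},
    "ann2": {"s1": {"care"}, "s2": {"fairness"}, "s3": {"authority"}},
    "ann3": {"s2": {"loyalty"}, "s3": {"authority"}},
}

print(pairwise_f1_agreement(annotations))
```

Under this sketch, a model's predictions can be scored with the same `micro_f1` function against each human annotator, so human-human and human-machine agreement sit on the same scale.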

Bulla, L., Mongiovì, M., Gangemi, A. (2025). Underperformance or Pluralism: A Machine Learning Perspective on Inter-Annotator Agreement. IOS Press BV [10.3233/FAIA250647].

Underperformance or Pluralism: A Machine Learning Perspective on Inter-Annotator Agreement

Luana Bulla; Misael Mongiovì; Aldo Gangemi
2025

Frontiers in Artificial Intelligence and Applications, pp. 303-315

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1064615