Donati, N., Torroni, P., Savino, G. (2025). Do Large Language Models understand how to be judges? UP - Universidade do Porto.

Do Large Language Models understand how to be judges?

Donati Nicolò (first author; contribution: Software)
Torroni Paolo (last author; contribution: Supervision)

Abstract

This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs' ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman's ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely due to training data biases. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
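To make the reported alignment figure concrete, the following is a minimal sketch, not the authors' code, of how Spearman's ρ between LLM-assigned and human rubric scores can be computed per criterion; all scores below are invented for illustration.

from scipy.stats import spearmanr

# Hypothetical 1-5 rubric scores assigned to the same set of summaries,
# keyed by evaluation criterion (invented data, for illustration only).
criteria_scores = {
    "relevance": {
        "human": [5, 3, 4, 2, 5, 1, 4, 3],
        "llm":   [4, 3, 5, 2, 5, 2, 4, 2],
    },
    "fluency": {
        "human": [3, 2, 4, 3, 5, 2, 3, 4],
        "llm":   [5, 4, 5, 4, 5, 4, 5, 5],  # mimics the overestimation bias noted above
    },
}

for criterion, scores in criteria_scores.items():
    # spearmanr returns the rank correlation and a two-sided p-value.
    rho, p_value = spearmanr(scores["human"], scores["llm"])
    print(f"{criterion}: Spearman's rho = {rho:.2f} (p = {p_value:.3f})")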
Proceedings of the 2nd LUHME Workshop, 2025, pp. 85–102
Donati, Nicolò; Torroni, Paolo; Savino, Giuseppe
Files in this record:

2025.luhme-1.9.pdf
Access: open access
Description: file downloaded from the ACL proceedings
Type: Publisher's version (PDF) / Version of Record
License: Creative Commons
Size: 761.56 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1036791