Donati, N., Torroni, P., Savino, G. (2025). Do Large Language Models understand how to be judges? UP - Universidade do Porto.
Do Large Language Models understand how to be judges?
Donati, Nicolò (first author; role: Software); Torroni, Paolo (last author; role: Supervision)
2025
Abstract
This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs' ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman's ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely due to training data biases. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2025.luhme-1.9.pdf (open access) | File downloaded from the ACL proceedings | Editorial (PDF) version / Version of Record | Creative Commons | 761.56 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


