Donati, N., Torroni, P., Savino, G. (2025). Do Large Language Models understand how to be judges? UP - Universidade do Porto.
Do Large Language Models understand how to be judges?
Donati, Nicolò (first author; role: Software); Torroni, Paolo (last author; role: Supervision)
2025
Abstract
This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs' ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman's ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely due to training data biases. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2025.luhme-1.9.pdf (open access) | File downloaded from the ACL proceedings | Editorial (PDF) version / Version of Record | Creative Commons | 761.56 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


