Gultekin, S., Rossi Reich, M., Galloni, F., Lagioia, F., Consiglio, E., Sartor, G., et al. (2025). A Two-Dimensional Evaluation Framework for Factual and Reasoning Assessment of LLMs in Legal Question Answering. IOS Press BV. https://doi.org/10.3233/faia251579
A Two-Dimensional Evaluation Framework for Factual and Reasoning Assessment of LLMs in Legal Question Answering
Gultekin, Sinan; Rossi Reich, Matteo; Galloni, Francesca; Lagioia, Francesca; Sartor, Giovanni
2025
Abstract
Deploying Large Language Models (LLMs) for legal question-answering requires ensuring factual accuracy and logical coherence. Current evaluation metrics inadequately capture the complexity of legal reasoning, while expert assessments lack scalability. We propose a two-dimensional framework that independently measures Truthfulness and Reasoning Soundness in model outputs, applied to Italian asylum proceedings requiring evidence-based analysis. This dual-axis approach reveals critical issues, such as legally correct answers derived through unsound or hallucinatory reasoning, that standard metrics fail to detect. To enable large-scale application, we implement an automated LLM-as-a-Judge system with bias-mitigation techniques. Experimental results demonstrate strong correspondence between automated judgments and expert evaluations, confirming the framework's reliability. This work advances diagnostic methodology for assessing LLMs in legal domains, offering both theoretical insight and practical applicability toward more trustworthy and accountable legal AI systems.
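For readers approaching the framework from an implementation angle, the Python sketch below illustrates one way a dual-axis LLM-as-a-Judge could be organized: each model output is scored independently for Truthfulness and Reasoning Soundness. The rubric wording, the 0-1 scale, the trial averaging, and the stub_judge placeholder are all illustrative assumptions; the paper's actual prompts and bias-mitigation techniques are not reproduced here.

```python
# Minimal sketch of a two-axis LLM-as-a-Judge scorer (hypothetical: the paper's
# exact rubrics, scales, and bias-mitigation methods are not specified here).
import random
import statistics
from typing import Callable

# A judge is any callable mapping a prompt to a numeric score, e.g. a thin
# wrapper around a chat-completion API. Stubbed below so the sketch runs offline.
Judge = Callable[[str], float]

# Assumed rubric text for the two independent evaluation axes.
AXES = {
    "truthfulness": "Is every factual and legal claim in the answer correct?",
    "reasoning_soundness": ("Does the conclusion follow from the cited evidence "
                            "without logical gaps or hallucinated steps?"),
}

def score_answer(question: str, answer: str, judge: Judge, trials: int = 3) -> dict:
    """Score one answer on each axis independently, averaging repeated trials."""
    scores = {}
    for axis, rubric in AXES.items():
        prompt = (
            f"Rubric ({axis}): {rubric}\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Reply with a single score from 0.0 (worst) to 1.0 (best)."
        )
        # Averaging several independent judgments is one simple bias-mitigation
        # step (self-consistency); the paper's actual techniques may differ.
        scores[axis] = statistics.mean(judge(prompt) for _ in range(trials))
    return scores

if __name__ == "__main__":
    # Offline stub standing in for a real LLM call, so the sketch runs as-is.
    def stub_judge(prompt: str) -> float:
        return random.uniform(0.6, 0.9)

    print(score_answer(
        "Does the applicant qualify for international protection?",
        "Yes, because the evidence shows a well-founded fear of persecution.",
        stub_judge,
    ))
```

Keeping the two axes as separate judge calls, rather than asking for a single combined score, is what lets this kind of design surface the failure mode the abstract highlights: an answer that scores high on truthfulness but low on reasoning soundness.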


