ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?

Alessio Cocchieri (co-first author); Luca Ragazzi (co-first author); Giuseppe Tagliavini; Gianluca Moro (co-first author)

2026

Abstract

Medical multiple-choice question answering (MCQA) benchmarks show that models achieve near-human accuracy, with some benchmarks approaching saturation, leading to claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor input perturbations cannot be considered reliable. We argue that reliability underpins accuracy: only consistent predictions make correctness meaningful. We release ReMedQA, a new benchmark that augments three standard medical MCQA datasets with open-ended answers and systematically perturbed options. Building on this, we introduce ReAcc and ReCon, two reliability metrics: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA underestimates smaller models while inflating large ones that exploit structural cues, with some exceeding 50% accuracy even when the original questions are hidden. This shows that, despite near-saturated accuracy, we are not yet done with medical MCQA benchmarks.
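As an illustration of the two metrics described in the abstract, the following is a minimal Python sketch based only on that one-sentence description; the exact formulation in the paper may differ. The function names `re_acc`, `re_con` and `accuracy`, the data layout (question id mapped to the model's answer on each variation), and the toy data are all assumptions made for this example. It also assumes answers have already been normalized so that the same choice carries the same label in every variation.

```python
from typing import Dict, List

# Hypothetical layout: question id -> the model's answer on each variation.
Predictions = Dict[str, List[str]]


def accuracy(predictions: Predictions, gold: Dict[str, str], variation: int = 0) -> float:
    """Plain accuracy on one variation (index 0 = original format here)."""
    hits = sum(answers[variation] == gold[qid] for qid, answers in predictions.items())
    return hits / len(predictions)


def re_acc(predictions: Predictions, gold: Dict[str, str]) -> float:
    """Share of questions answered correctly on every variation."""
    hits = sum(
        all(answer == gold[qid] for answer in answers)
        for qid, answers in predictions.items()
    )
    return hits / len(predictions)


def re_con(predictions: Predictions) -> float:
    """Share of questions given the same answer on every variation,
    whether or not that answer is correct."""
    stable = sum(len(set(answers)) == 1 for answers in predictions.values())
    return stable / len(predictions)


# Toy data (invented for illustration): 4 questions, 3 variations each.
gold = {"q1": "A", "q2": "C", "q3": "D", "q4": "B"}
preds = {
    "q1": ["A", "A", "A"],  # always correct
    "q2": ["B", "B", "B"],  # consistent but wrong
    "q3": ["D", "A", "D"],  # correct originally, flips under perturbation
    "q4": ["B", "B", "C"],  # correct originally, flips under perturbation
}

print(accuracy(preds, gold))  # 0.75
print(re_acc(preds, gold))    # 0.25
print(re_con(preds))          # 0.5
```

In this toy case, accuracy on the original format is 0.75, yet only one question in four is answered correctly under every variation, which mirrors the abstract's point that a single accuracy score can mask low reliability.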

Year: 2026
Volume title: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
First page: 2706
Last page: 2738
DOI: https://dx.doi.org/10.18653/v1/2026.eacl-long.124
Citation: Cocchieri, A., Ragazzi, L., Tagliavini, G., Moro, G. (2026). ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks? [10.18653/v1/2026.eacl-long.124].
All authors: Cocchieri, Alessio; Ragazzi, Luca; Tagliavini, Giuseppe; Moro, Gianluca
Publication type: 4.01 Contribution in conference proceedings

Files in this record:

File	Size	Format
2026.eacl-long.124.pdf (open access; type: publisher's version (PDF) / Version of Record; license: open access, Creative Commons Attribution (CC BY))	4.99 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1061019
