
There are significant differences among artificial intelligence large language models when answering scientific questions

Álvarez-Martínez F. J.; Esteban L.; Frungillo L.; Butassi E.; Zambon A.; Herranz-López M.; Aranda M.; Pollastro F.; Tixier A. S.; Garcia-Perez J. V.; Arráez-Román D.; Ross A.; Mena P.; Edrada-Ebel R. A.; Lyng J.; Micol V.; Borrás-Rocher F.; Barrajón-Catalán E.

2025

Abstract

Introduction: This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.

Methods: Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.

Results: Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers' perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.

Discussion: The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.

Year: 2025

Journal: FRONTIERS IN ARTIFICIAL INTELLIGENCE

DOI: https://dx.doi.org/10.3389/frai.2025.1664303

Citation: Álvarez-Martínez, F.J., Esteban, L., Frungillo, L., Butassi, E., Zambon, A., Herranz-López, M., et al. (2025). There are significant differences among artificial intelligence large language models when answering scientific questions. FRONTIERS IN ARTIFICIAL INTELLIGENCE, 8, 1-11 [10.3389/frai.2025.1664303].

Appears in types: 1.01 Journal article

Files in this record:

File	Access	Type	License	Size	Format
frai-8-1664303 (1).pdf	Open access	Publisher's version (PDF) / Version Of Record	Open access license. Creative Commons Attribution (CC BY)	1.96 MB	Adobe PDF
Data Sheet 1(4).docx	Open access	Supplementary file	Open access license. Creative Commons Attribution (CC BY)	696.34 kB	Microsoft Word XML

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1026971
