CRIS Current Research Information System

Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study

Rossettini, G; Bargeri, S; Cook, C; Guida, S; Palese, A; Rodeghiero, L; Pillastrini, P; Turolla, A; Castellini, G; Gianola, S

2025

Abstract

Introduction: Artificial Intelligence (AI) chatbots, which generate human-like responses based on extensive data, are becoming important tools in healthcare. Acting as virtual assistants, they provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) is still unclear. We aim to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.

Methods: We performed a cross-sectional study to assess AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' Kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1.

Results: We found high variability in the text consistency of AI chatbot responses (median range 26%-68%). Intra-rater reliability ranged from "almost perfect" to "substantial," while inter-rater reliability varied from "almost perfect" to "moderate." Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.

Conclusions: Although the text consistency of responses varied and intra- and inter-rater reliability were good, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since one-third to two-thirds of the recommendations provided may be inappropriate or misleading, depending on the specific chatbot.
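As a rough illustration of the reliability measure named in the Methods, the sketch below computes Fleiss' kappa from a table of rating counts and maps the value to a verbal label. It is not the authors' analysis, which was run in STATA/MP 16.1. The function names `fleiss_kappa` and `agreement_label` are hypothetical, the toy counts are invented, and the label thresholds assume the Landis and Koch scale, which the abstract's wording ("almost perfect", "substantial", "moderate") suggests but does not name.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    ratings that placed item i in category j.

    Assumes every item received the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_ratings = sum(counts[0])
    if n_ratings < 2 or any(sum(row) != n_ratings for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the share of agreeing rating pairs.
    per_item = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_ratings)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_label(kappa: float) -> str:
    """Verbal label for kappa, assuming the Landis and Koch thresholds."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Invented toy data: 3 items, 3 ratings each, 2 categories
# (for example "matches CPG" vs "does not match CPG").
toy_counts = [[3, 0], [0, 3], [3, 0]]
toy_kappa = fleiss_kappa(toy_counts)
print(agreement_label(toy_kappa))
```

The count-table layout keeps the function independent of who produced the ratings, so the same sketch could serve for repeated ratings by one rater (intra-rater) or for ratings by several raters (inter-rater).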

Year: 2025
Journal: FRONTIERS IN DIGITAL HEALTH
DOI: https://dx.doi.org/10.3389/fdgth.2025.1574287
Citation: Rossettini, G., Bargeri, S., Cook, C., Guida, S., Palese, A., Rodeghiero, L., et al. (2025). Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. FRONTIERS IN DIGITAL HEALTH, 7, 1-9 [10.3389/fdgth.2025.1574287].
All authors: Rossettini, G; Bargeri, S; Cook, C; Guida, S; Palese, A; Rodeghiero, L; Pillastrini, P; Turolla, A; Castellini, G; Gianola, S
Appears in types: 1.01 Journal article

Files in this item:

File	Access	Type	License	Size	Format
Rodeghiero_2025.pdf	Open access	Publisher's version (PDF) / Version of Record	Open access license: Creative Commons Attribution (CC BY)	835.19 kB	Adobe PDF
Table 1.docx	Open access	Supplementary file	Open access license: Creative Commons Attribution (CC BY)	199.06 kB	Microsoft Word XML

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1019919
