Background/Objectives: Trichoscopy is an important diagnostic tool for hair and scalp disorders, but it requires significant expertise. Publicly available large language models (LLMs) are becoming increasingly popular among both physicians and patients, yet their usefulness in trichology is unknown. We aimed to evaluate the diagnostic accuracy of four publicly available LLMs in interpreting trichoscopic images and to compare their performance with that of dermatology residents, board-certified dermatologists, and trichology experts. Methods: In this prospective comparative study, a preprocessed set of trichoscopic images was assessed in an online image-based survey. To reduce recognition bias from public image repositories, all images were structurally transformed while preserving diagnostic features. Fifteen dermatologists (five residents, four board-certified dermatologists, six trichology experts) provided a suspected diagnosis (SD) and up to three differential diagnoses (DD). Four LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 2.5 Flash, and Grok-3) evaluated the images under the same conditions. Results: The overall diagnostic accuracy among the 15 dermatologists was 58.1% (95% CI, 53.0-63.0) for SD and 68.3% (95% CI, 63.4-72.8) for SD + DD. Experts significantly outperformed residents and board-certified dermatologists. AI models achieved an accuracy of 18.2% (95% CI, 11.8-26.9) for SD and 44.4% (95% CI, 35.0-54.3) for SD + DD. Gemini 2.5 Flash performed best, with an accuracy of 62.5% for SD + DD. Agreement among dermatologists increased with experience (AC1 up to 0.65 for experts), while agreement among AI models was moderate to good (AC1 up to 0.70). Agreement between AI models and dermatologists was only slight to fair (AC1 = 0.06 for SD and 0.21 for SD + DD). All human-AI differences were statistically significant (p < 0.001).
Conclusions: In trichology, publicly available LLMs currently underperform compared to human experts, especially in providing a single correct diagnosis. These models require further development and specialized training before they can reliably assist with trichological diagnoses in routine care.
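The agreement figures above are Gwet's first agreement coefficient (AC1), a chance-corrected agreement statistic that is more stable than Cohen's kappa when category prevalences are skewed. As a rough illustration of how such a value is computed, here is a minimal two-rater sketch (the study compared many raters and models; the diagnosis labels below are invented, not the study's data):

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 chance-corrected agreement for two raters
    assigning categorical labels to the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters label identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Average proportion of each category across both raters' labels.
    counts = Counter(ratings_a) + Counter(ratings_b)
    q = len(counts)  # number of distinct categories used
    pi = [c / (2 * n) for c in counts.values()]
    # Chance agreement under Gwet's model.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (pa - pe) / (1 - pe)

# Toy example: two raters label four trichoscopic images.
a = ["AGA", "AA", "TE", "AGA"]
b = ["AGA", "AA", "AGA", "AGA"]
print(round(gwet_ac1(a, b), 3))  # agreement on 3 of 4 items
```

AC1 ranges up to 1.0 for perfect agreement; values in the 0.6-0.8 band are conventionally read as good, which is why the expert (0.65) and inter-model (0.70) figures above are described as moderate to good.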

Signer, B., Mokhtari, A., Cazzaniga, S., Brand, F., Caro, G., De Viragh, P.A., et al. (2026). Publicly Available Large Language Models for Trichoscopy: A Head-to-Head Comparison with Dermatologists. Diagnostics, 16(1), 1-9 [10.3390/diagnostics16010169].

Publicly Available Large Language Models for Trichoscopy: A Head-to-Head Comparison with Dermatologists

Iorizzo, M.; Piraccini, B.M.; Starace, M.
2026

Signer, B.; Mokhtari, A.; Cazzaniga, S.; Brand, F.; Caro, G.; De Viragh, P.A.; Heidemeyer, K.; Hosseini, A.; Iorizzo, M.; Junge, A.; Martignoni, Z.; Reygagne, P.E.; …
Files in this record:

diagnostics-16-00169-v2.pdf
  Type: Publisher's version (PDF) / Version of Record
  License: Open Access license, Creative Commons Attribution (CC BY)
  Size: 413.2 kB
  Format: Adobe PDF

diagnostics-16-00169-s001.zip
  Type: Supplementary file
  License: Open Access license, Creative Commons Attribution (CC BY)
  Size: 108.84 kB
  Format: Zip file

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1042933
Citations
  • PMC: 1
  • Scopus: 0
  • Web of Science (ISI): 0