Comparative analysis of supervised and self-supervised learning with small and imbalanced medical imaging datasets

Espis, A.; Marzi, C.; Diciotti, S.

doi:10.1038/s41598-025-99000-0

Self-supervised learning (SSL) in computer vision has shown potential to reduce reliance on labeled data. However, most studies have focused on large, balanced, broad-domain datasets such as ImageNet, whereas in real-world medical applications dataset size is typically limited. This study compares the performance of SSL and supervised learning (SL) on small, imbalanced medical imaging datasets. We experimented with four binary classification tasks: age prediction and diagnosis of Alzheimer’s disease from brain magnetic resonance imaging scans, diagnosis of pneumonia from chest radiograms, and diagnosis of retinal diseases associated with choroidal neovascularization from optical coherence tomography. The mean training set sizes were 843, 771, 1,214, and 33,484 images, respectively. We tested various combinations of label availability and class frequency distribution, repeating the training with different random seeds to assess result uncertainty. In most experiments involving small training sets, SL outperformed the selected SSL paradigms, even when only a limited portion of labeled data was available. Our findings highlight the importance of carefully selecting the learning paradigm based on specific application requirements, which are influenced by factors such as training set size, label availability, and class frequency distribution.
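The abstract describes varying label availability and class frequency distribution and repeating runs with different random seeds. The sketch below illustrates one way such training subsets could be drawn. It is not the authors' protocol: the function name `make_subset`, its parameters, and the numbers in the demo (pool size, subset size, fractions, seeds) are all hypothetical and chosen only for illustration.

```python
import random


def make_subset(labels, n_total, positive_fraction, labeled_fraction, seed):
    """Draw a binary subset with a chosen class ratio and label availability.

    labels: list of 0/1 class labels for the full pool (hypothetical input).
    Returns (indices, labeled_mask), where labeled_mask[k] is True if the
    sample at indices[k] keeps its label and False if it is treated as
    unlabeled.
    """
    rng = random.Random(seed)  # a seeded generator makes each draw repeatable

    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    n_pos = round(n_total * positive_fraction)
    n_neg = n_total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("pool is too small for the requested class counts")

    # Sample each class separately to hit the requested class frequencies.
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)

    # Keep labels for only a fraction of the chosen samples.
    n_labeled = round(len(chosen) * labeled_fraction)
    labeled = set(rng.sample(chosen, n_labeled))
    labeled_mask = [i in labeled for i in chosen]
    return chosen, labeled_mask


if __name__ == "__main__":
    # Hypothetical pool: 500 positive and 500 negative samples.
    pool = [1] * 500 + [0] * 500
    for seed in range(3):  # repeat with different seeds, as in the abstract
        idx, mask = make_subset(pool, 200, 0.1, 0.25, seed)
        n_pos = sum(pool[i] for i in idx)
        print(seed, len(idx), n_pos, sum(mask))
```

Each seed yields a different subset with the same class ratio and labeled fraction, so the spread of a metric across seeds gives a rough picture of result uncertainty.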

Espis, A., Marzi, C., Diciotti, S. (2025). Comparative analysis of supervised and self-supervised learning with small and imbalanced medical imaging datasets. SCIENTIFIC REPORTS, 15(1), 1-21 [10.1038/s41598-025-99000-0].



Year: 2025
Journal: SCIENTIFIC REPORTS
DOI: https://dx.doi.org/10.1038/s41598-025-99000-0


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1050696
