Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval / Gianluca Moro; Stefano Salvatori. - Electronic. - (2022), pp. 40-53. (Paper presented at the International Conference on Similarity Search and Applications, held in Bologna, October 2022) [10.1007/978-3-031-17849-8_4].
Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval
Gianluca Moro; Stefano Salvatori
2022
Abstract
Fashion multi-modal retrieval has recently been addressed with vision-and-language transformers. However, these models do not scale in training time and memory requirements because of their quadratic attention mechanism. Moreover, they cast retrieval as a classification task, assigning a similarity score to each input text-image pair. Each query is thus resolved inefficiently by pairing it, at runtime, with every text or image in the entire dataset, precluding scalability to large datasets. We propose a novel approach for efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space where spatial proximity among instances translates into a semantic similarity score. Unlike existing contributions, our approach embeds text and images separately, decoupling them so that, after training, even new images with missing text (and vice versa) can be placed and searched in the space. Experiments show that, with a single 12 GB GPU, our solution outperforms existing state-of-the-art contributions on the FashionGen dataset in both efficacy and efficiency. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time up to millions, and potentially billions, of texts and images.
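As a rough illustration of the decoupled retrieval flow described in the abstract, the sketch below embeds catalog images and a text query with two independent encoders and answers the query through an approximate nearest-neighbour index rather than pairwise classification. It is a minimal sketch, not the paper's implementation: the placeholder encoders, the embedding dimensionality, and the use of a FAISS HNSW index are assumptions for illustration only.

```python
# Minimal sketch (assumption, not the paper's code): dual-encoder retrieval
# over a multidimensional index. `encode_text` / `encode_image` stand in for
# the trained metric-learning encoders and here return random unit vectors.
import numpy as np
import faiss  # approximate nearest-neighbour index library (assumed available)

d = 256                                   # embedding dimensionality (assumption)
rng = np.random.default_rng(0)

def encode_text(texts):
    # Placeholder for the trained text encoder: d-dim, L2-normalised vectors.
    v = rng.standard_normal((len(texts), d)).astype("float32")
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def encode_image(images):
    # Placeholder for the trained image encoder.
    v = rng.standard_normal((len(images), d)).astype("float32")
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Index the image embeddings once, offline.
catalog = [f"img_{i}.jpg" for i in range(10_000)]
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(encode_image(catalog))

# At query time, embed only the text and search the index:
# no pairwise text-image scoring over the whole dataset is needed.
query = encode_text(["red wool coat with hood"])
scores, ids = index.search(query, 5)
print([catalog[i] for i in ids[0]], scores[0])
```

On normalised embeddings, inner-product search ranks items by cosine similarity, so spatial proximity in the learned space directly yields the semantic similarity score, and the HNSW index keeps query time roughly logarithmic in the catalog size.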