LocalViT: Analyzing Locality in Vision Transformers
Kai Zhang; Michele Magno; Luca Benini
2023
Abstract
The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between token embeddings can be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated through carefully designed controlled experiments. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generality of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T [1], DeiT-T [2] and PVT-T [3] by 1.0%, 2.6% and 3.1%, respectively, with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
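To make the mechanism concrete, the following is a minimal PyTorch sketch of a locality-enhanced feed-forward network: a depth-wise 3x3 convolution inserted between the two point-wise layers of the transformer FFN, mirroring the structure of an inverted residual block. The class name, the Hardswish activation choice, and the omission of class-token handling are illustrative assumptions here, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    """Sketch of a transformer FFN with a depth-wise convolution in the middle.

    Structure mirrors an inverted residual block:
    1x1 expansion -> depth-wise 3x3 (locality) -> 1x1 projection.
    """

    def __init__(self, dim: int, expansion_ratio: int = 4):
        super().__init__()
        hidden = dim * expansion_ratio
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)  # depth-wise: per-channel local mixing
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.Hardswish()  # activation is one of the studied design choices

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N == h * w; reshape to a 2-D feature
        # map so the depth-wise convolution can exchange information locally.
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.act(self.expand(x))
        x = self.act(self.dwconv(x))
        x = self.project(x)
        return x.flatten(2).transpose(1, 2)  # back to (B, N, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 192)  # e.g. a 14x14 grid of 192-d tokens
    ffn = LocalityFeedForward(dim=192)
    out = ffn(tokens, h=14, w=14)
    print(out.shape)  # torch.Size([2, 196, 192])
```

Because only a depth-wise convolution is added, the extra cost scales with the hidden dimension rather than its square, which is consistent with the paper's claim of a negligible increase in parameters and computation.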