Semantic Image Synthesis via Class-Adaptive Cross-Attention

Fontanini, T.; Ferrari, C.; Lisanti, G.; Bertozzi, M.; Prati, A.
2025

Abstract

In semantic image synthesis, the state of the art is dominated by methods that use customized variants of SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Moreover, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE to learn shape-style correlations and condition the image generation process. Our model inherits the versatility of SPADE while achieving state-of-the-art generation quality, improving the FID score by 5.6%, 1.4%, and 3.4% on the CelebMask-HQ, ADE20K, and DeepFashion datasets, respectively, as well as improved global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.
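For readers who want a concrete picture of the conditioning mechanism described in the abstract, the PyTorch sketch below shows, in minimal form, how cross-attention can inject per-class style codes into generator activations instead of SPADE-style pixel-wise de-normalization. This is an illustrative sketch only, not the released CA2SIS implementation: the module name, tensor shapes, single-head attention, residual projection, and the example parameters (256 feature channels, 128-dimensional styles, 19 classes) are all assumptions.

import torch
import torch.nn as nn

class CrossAttentionConditioning(nn.Module):
    """Minimal sketch: condition generator features on per-class style codes
    with single-head cross-attention (illustrative, not the authors' code)."""
    def __init__(self, feat_channels: int, style_dim: int):
        super().__init__()
        self.to_q = nn.Conv2d(feat_channels, style_dim, kernel_size=1)  # queries from generator activations
        self.to_k = nn.Linear(style_dim, style_dim)                     # keys from class style codes
        self.to_v = nn.Linear(style_dim, style_dim)                     # values from class style codes
        self.proj = nn.Conv2d(style_dim, feat_channels, kernel_size=1)  # project back to feature space
        self.scale = style_dim ** -0.5

    def forward(self, feats, class_styles):
        # feats: (B, C, H, W) generator activations
        # class_styles: (B, num_classes, style_dim), one style code per semantic class
        B, _, H, W = feats.shape
        q = self.to_q(feats).flatten(2).transpose(1, 2)          # (B, H*W, style_dim)
        k = self.to_k(class_styles)                              # (B, num_classes, style_dim)
        v = self.to_v(class_styles)                              # (B, num_classes, style_dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # each pixel attends over classes
        out = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)    # (B, style_dim, H, W)
        return feats + self.proj(out)                            # residual style injection

# Example: 19 semantic classes (e.g. CelebAMask-HQ) and a 256-channel feature map
block = CrossAttentionConditioning(feat_channels=256, style_dim=128)
y = block(torch.randn(2, 256, 32, 32), torch.randn(2, 19, 128))  # -> (2, 256, 32, 32)

Because each spatial query attends over the full set of class style codes, the attention weights can capture the shape-style correlations that the abstract attributes to the cross-attention layers, without requiring a SPADE-style per-pixel de-normalization driven by the segmentation mask.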
Fontanini, T., Ferrari, C., Lisanti, G., Bertozzi, M., Prati, A. (2025). Semantic Image Synthesis via Class-Adaptive Cross-Attention. IEEE Access, 13, 10326-10339 [10.1109/ACCESS.2025.3529216].
Files in this record:

Semantic_Image_Synthesis_via_Class-Adaptive_Cross-Attention.pdf
Access: open access
Type: Publisher's version (PDF) / Version of Record
License: Open Access license, Creative Commons Attribution (CC BY)
Size: 4.26 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1007849
Citations
  • PubMed Central: N/A
  • Scopus: 2
  • Web of Science (ISI): 1