Semantic and Sequential Alignment for Referring Video Object Segmentation

Pan, Feiyu; Fang, Hao; Li, Fangkai; Xu, Yanyu; Li, Yawei; Benini, Luca; Lu, Xiankai

doi:10.1109/cvpr52734.2025.01776

Referring video object segmentation (RVOS) seeks to segment the objects in a video that are referred to by linguistic expressions. Existing RVOS solutions follow a "fuse then select" paradigm: they establish semantic correlation between visual and linguistic features, then perform frame-level query interaction to select the instance mask in each frame with an instance segmentation module. This paradigm overlooks the semantic gap between the linguistic descriptor and the video object, as well as the underlying clutter in the video. This paper proposes a novel Semantic and Sequential Alignment (SSA) paradigm to handle these challenges. We first insert a lightweight adapter after the vision language model (VLM) to perform the semantic alignment. Then, prior to selecting the mask for each frame, we exploit trajectory-to-instance enhancement for each frame via sequential alignment. This paradigm leverages the visual-language alignment inherent in the VLM during adaptation and tries to capture global information by ensembling trajectories. This helps in understanding videos and the corresponding descriptors by mitigating the discrepancy with intricate activity semantics, particularly under occlusion or similar interference. SSA demonstrates competitive performance while maintaining fewer learnable parameters.

Pan, F., Fang, H., Li, F., Xu, Y., Li, Y., Benini, L., et al. (2025). Semantic and Sequential Alignment for Referring Video Object Segmentation. 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA : IEEE Computer Society [10.1109/cvpr52734.2025.01776].

Year: 2025
Volume title: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
First page: 19067
Last page: 19076
DOI: https://dx.doi.org/10.1109/cvpr52734.2025.01776
All authors: Pan, Feiyu; Fang, Hao; Li, Fangkai; Xu, Yanyu; Li, Yawei; Benini, Luca; Lu, Xiankai

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1040026