Distilled semantics for comprehensive scene understanding from videos

F. Tosi; F. Aleotti; P. Zama Ramirez; M. Poggi; S. Salti; L. Di Stefano; S. Mattoccia

2020

Abstract

A complete understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, which are particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside semantics, with supervision for the latter provided by a pre-trained network distilling proxy ground truth images. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and self-supervision and b) a compact network architecture which enables efficient scene understanding on both power-hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.

Year: 2020
Volume title: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
First page: 4653
Last page: 4664
DOI: https://dx.doi.org/10.1109/CVPR42600.2020.00471
Citation: F. Tosi, F.A. (2020). Distilled semantics for comprehensive scene understanding from videos. IEEE/CVF [10.1109/CVPR42600.2020.00471].
All authors: F. Tosi, F. Aleotti, P. Zama Ramirez, M. Poggi, S. Salti, L. Di Stefano, S. Mattoccia
Appears in types: 4.01 Contribution in conference proceedings

Files in this item:

File	Access	Type	License	Size	Format
Tosi_Distilled_Semantics_for_CVPR_2020_supplemental (1).pdf	Open access	Supplementary file	Free open-access license	1.97 MB	Adobe PDF
Tosi_Distilled_Semantics_for_Comprehensive_Scene_Understanding_from_Videos_CVPR_2020_paper.pdf	Open access	Postprint / Author's Accepted Manuscript (AAM), version accepted for publication after peer review	Free open-access license	1.59 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/764271
