Whole understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside with semantics, with supervision for the latter provided by a pre-trained network distilling proxy ground truth images. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and selfsupervision and b) a compact network architecture which enables efficient scene understanding on both power hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
F. Tosi, F.A. (2020). Distilled semantics for comprehensive scene understanding from videos. IEEE/CVF [10.1109/CVPR42600.2020.00471].
Distilled semantics for comprehensive scene understanding from videos
F. Tosi;F. Aleotti;P. Zama Ramirez;M. Poggi;S. Salti;L. Di Stefano;S. Mattoccia
2020
Abstract
Whole understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside with semantics, with supervision for the latter provided by a pre-trained network distilling proxy ground truth images. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and selfsupervision and b) a compact network architecture which enables efficient scene understanding on both power hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.File | Dimensione | Formato | |
---|---|---|---|
Tosi_Distilled_Semantics_for_CVPR_2020_supplemental (1).pdf
accesso aperto
Tipo:
File Supplementare
Licenza:
Licenza per accesso libero gratuito
Dimensione
1.97 MB
Formato
Adobe PDF
|
1.97 MB | Adobe PDF | Visualizza/Apri |
Tosi_Distilled_Semantics_for_Comprehensive_Scene_Understanding_from_Videos_CVPR_2020_paper.pdf
accesso aperto
Tipo:
Postprint
Licenza:
Licenza per accesso libero gratuito
Dimensione
1.59 MB
Formato
Adobe PDF
|
1.59 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.