Brander, C., Cioflan, C., Niculescu, V., Müller, H., Polonelli, T., Magno, M., et al. (2023). Improving Data-Scarce Image Classification Through Multimodal Synthetic Data Pretraining. IEEE. doi: 10.1109/sas58821.2023.10254154
Improving Data-Scarce Image Classification Through Multimodal Synthetic Data Pretraining
Polonelli, Tommaso; Magno, Michele; Benini, Luca
2023
Abstract
Deep Learning algorithms and models greatly benefit from the release of large-scale datasets, including synthetically generated data, when real-life data is scarce. Multimodal datasets feature more descriptive environmental information than single-sensor ones, but they are generally small and not widely accessible. In this paper, we construct a synthetically generated image classification dataset consisting of grayscale camera images and depth information acquired from an 8x8-pixel Time-of-Flight sensor. We propose and evaluate six Convolutional Neural Network-based feature-level fusion models to integrate the multimodal data, outperforming the accuracy of the camera-only model by up to 17% in real-world settings. By pretraining the model on synthetically generated sample pairs and then fine-tuning it with only 16 real-domain samples, we outperform a non-pretrained counterpart by 35% while keeping the storage requirements on the order of hundreds of kB. Our proposed convolutional model, pretrained on both synthetic and real-world sensor data, achieves a top-1 accuracy of 86.48%, proving the benefits of using multimodal datasets to train feature-level data fusion neural networks. Emerging low-power embedded microcontrollers, such as multi-core RISC-V systems-on-chip, are ideal candidates for running our model thanks to their reduced power consumption and parallel computing capabilities that speed up inference.
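
To illustrate the feature-level fusion idea summarized in the abstract, below is a minimal PyTorch sketch of a dual-branch convolutional classifier that fuses a grayscale camera image with an 8x8 Time-of-Flight depth map by concatenating per-branch feature vectors before a linear classifier. This is not the authors' architecture: the layer widths, input resolution, class count, and the name FeatureLevelFusionNet are illustrative assumptions.

# Hedged sketch of feature-level fusion for camera + 8x8 ToF depth data.
# Layer sizes, image resolution, and class count are assumptions, not the paper's values.
import torch
import torch.nn as nn

class FeatureLevelFusionNet(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Camera branch: extracts a feature vector from a 1-channel grayscale image
        self.camera_branch = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> 16 features
        )
        # Depth branch: extracts a feature vector from the 8x8 ToF depth map
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> 8 features
        )
        # Feature-level fusion: concatenate branch embeddings, then classify
        self.classifier = nn.Linear(16 + 8, num_classes)

    def forward(self, image: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.camera_branch(image), self.depth_branch(depth)], dim=1)
        return self.classifier(fused)

# Usage: the same weights would be pretrained on synthetic image/depth pairs,
# then fine-tuned on a handful of real samples (the paper reports only 16).
model = FeatureLevelFusionNet()
img = torch.randn(4, 1, 64, 64)   # assumed grayscale resolution
tof = torch.randn(4, 1, 8, 8)     # 8x8 ToF depth map
logits = model(img, tof)          # shape: (4, num_classes)

Keeping both branches small, as in this sketch, is what makes a hundreds-of-kB storage budget plausible on microcontroller-class hardware.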