Roccetti, M. (2019). A paradox in ML design: Less data for a smarter water metering cognification experience. New York: ACM [10.1145/3342428.3342685].
A paradox in ML design: Less data for a smarter water metering cognification experience
Roccetti M.; Delnevo G.; Casini L.; Zagni N.; Cappiello G.
2019
Abstract
Many data scientists currently point out that how much Machine Learning (ML) research crosses into practice will depend not just on the ability of the specialized algorithms used to scrutinize positive/negative examples, but also on the quality of the data used to train those algorithms. Our experience in training a neural network on a huge dataset comprising over fifteen million water meter readings confirms this conjecture. In this paper, we report on the actions we took to extract from that database just those data that could correctly represent the complex statistical phenomenon in play. With an adequate re-organization of those data, we obtained an interesting, yet controversial, result. On the one hand, we improved the accuracy of predicting when a water meter fails or needs disassembly, based on a history of water consumption measurements, thus making the meter maintenance process smarter. On the other hand, all this came with the paradox of a (statistical) transformation of the initial dataset: while we alleviate a problem with a restructured and more interpretable data model, we simultaneously change the replicated form of those data.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.