The metric search paradigm has been to this day successfully applied to several real-world problems, ranging from multimedia to data mining, from decision support to pattern recognition, to statistical and medical applications. Indeed, its simplicity makes it a perfect candidate for solving a variety of similarity problems arising in applications. The casual reader may wonder what prevents the metric space paradigm to become ubiquitously applicable to the ever-increasing range of applications that can benefit from it. The answer to this question is so dreadful that researchers have given it the hideous name of "curse of dimensionality" (an entry in this issue of the bulletin is devoted to this concept). In its essence, the curse of dimensionality says that, whenever the (intrinsic) dimensionality D of the metric space is high, an efficient solution to NN (nearest neighbor) queries is impossible, and only a sequential scan of the whole dataset could guarantee that the correct result is found. This behavior is basically due to the fact that the variance of the distances to the query object q vanishes with increasing values of D, so that all data objects have almost the same distance to q. In such scenarios, one may however argue that NN queries lose of significance, since any data object would have a distance to the query object comparable to the minimal one. On the other hand, in several real-world cases searching for the exact NN is still difficult, yet the distribution of distances exhibits a sufficiently high variance to make the problem worth solving. In such cases, it is also observed that locating the NN of a query point is, in itself, a relatively easy task, whose complexity indeed decreases with space dimensionality. As a matter of fact, the hard problem in high-D exact NN search is to determine how to stop, i.e., how to guarantee that the current result is the correct one. From this it follows that most of the time spent in an (exact) NN search is wasted time, during which little (or no) improvement is obtained.

Approximate and Probabilistic Methods

CIACCIA, PAOLO;PATELLA, MARCO
2010

Abstract

The metric search paradigm has been to this day successfully applied to several real-world problems, ranging from multimedia to data mining, from decision support to pattern recognition, to statistical and medical applications. Indeed, its simplicity makes it a perfect candidate for solving a variety of similarity problems arising in applications. The casual reader may wonder what prevents the metric space paradigm to become ubiquitously applicable to the ever-increasing range of applications that can benefit from it. The answer to this question is so dreadful that researchers have given it the hideous name of "curse of dimensionality" (an entry in this issue of the bulletin is devoted to this concept). In its essence, the curse of dimensionality says that, whenever the (intrinsic) dimensionality D of the metric space is high, an efficient solution to NN (nearest neighbor) queries is impossible, and only a sequential scan of the whole dataset could guarantee that the correct result is found. This behavior is basically due to the fact that the variance of the distances to the query object q vanishes with increasing values of D, so that all data objects have almost the same distance to q. In such scenarios, one may however argue that NN queries lose of significance, since any data object would have a distance to the query object comparable to the minimal one. On the other hand, in several real-world cases searching for the exact NN is still difficult, yet the distribution of distances exhibits a sufficiently high variance to make the problem worth solving. In such cases, it is also observed that locating the NN of a query point is, in itself, a relatively easy task, whose complexity indeed decreases with space dimensionality. As a matter of fact, the hard problem in high-D exact NN search is to determine how to stop, i.e., how to guarantee that the current result is the correct one. From this it follows that most of the time spent in an (exact) NN search is wasted time, during which little (or no) improvement is obtained.
SIGSPATIAL SPECIAL
Paolo Ciaccia; Marco Patella
File in questo prodotto:
Eventuali allegati, non sono esposti

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11585/91721
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact