Approximate and Probabilistic Methods

Ciaccia, Paolo; Patella, Marco

The metric search paradigm has been to this day successfully applied to several real-world problems, ranging from multimedia to data mining, from decision support to pattern recognition, to statistical and medical applications. Indeed, its simplicity makes it a perfect candidate for solving a variety of similarity problems arising in applications. The casual reader may wonder what prevents the metric space paradigm to become ubiquitously applicable to the ever-increasing range of applications that can benefit from it. The answer to this question is so dreadful that researchers have given it the hideous name of "curse of dimensionality" (an entry in this issue of the bulletin is devoted to this concept). In its essence, the curse of dimensionality says that, whenever the (intrinsic) dimensionality D of the metric space is high, an efficient solution to NN (nearest neighbor) queries is impossible, and only a sequential scan of the whole dataset could guarantee that the correct result is found. This behavior is basically due to the fact that the variance of the distances to the query object q vanishes with increasing values of D, so that all data objects have almost the same distance to q. In such scenarios, one may however argue that NN queries lose of significance, since any data object would have a distance to the query object comparable to the minimal one. On the other hand, in several real-world cases searching for the exact NN is still difficult, yet the distribution of distances exhibits a sufficiently high variance to make the problem worth solving. In such cases, it is also observed that locating the NN of a query point is, in itself, a relatively easy task, whose complexity indeed decreases with space dimensionality. As a matter of fact, the hard problem in high-D exact NN search is to determine how to stop, i.e., how to guarantee that the current result is the correct one. From this it follows that most of the time spent in an (exact) NN search is wasted time, during which little (or no) improvement is obtained.

Paolo Ciaccia, Marco Patella (2010). Approximate and Probabilistic Methods. SIGSPATIAL SPECIAL, 2, 16-19.