In high-dimensional and complex metric spaces, determining the nearest neighbor (NN) of a query object q can be a very expensive task, because of the poor partitioning operated by index structures - the so-called `curse of dimensionality'. This also affects approximately correct (AC) algorithms, which return as result a point whose distance from q is less than (1+ε) times the distance between q and its true NN. In this paper we introduce a new approach to approximate similarity search, called PAC-NN queries, where the error bound ε can be exceeded with probability δ and both ε and δ parameters can be tuned at query time to trade the quality of the result for the cost of the search. We describe sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition that respects the error bound. Analysis and experimental evaluation of the sequential algorithm confirm that, for moderately large data sets and suitable ε and δ values, PAC-NN queries can be efficiently solved and the error controlled. Then, we provide experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up the accuracy of the result.
Ciaccia Paolo, Patella Marco (2000). PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. Los Alamitos, CA, United States : IEEE [10.1109/ICDE.2000.839417].
PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces
Ciaccia Paolo;Patella Marco
2000
Abstract
In high-dimensional and complex metric spaces, determining the nearest neighbor (NN) of a query object q can be a very expensive task, because of the poor partitioning operated by index structures - the so-called `curse of dimensionality'. This also affects approximately correct (AC) algorithms, which return as result a point whose distance from q is less than (1+ε) times the distance between q and its true NN. In this paper we introduce a new approach to approximate similarity search, called PAC-NN queries, where the error bound ε can be exceeded with probability δ and both ε and δ parameters can be tuned at query time to trade the quality of the result for the cost of the search. We describe sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition that respects the error bound. Analysis and experimental evaluation of the sequential algorithm confirm that, for moderately large data sets and suitable ε and δ values, PAC-NN queries can be efficiently solved and the error controlled. Then, we provide experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up the accuracy of the result.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.