Domination in the Probabilistic World: Computing Skylines for Arbitrary Correlations and Ranking Semantics

Bartolini, Ilaria; Ciaccia, Paolo; Patella, Marco

doi:10.1145/2602135

In a probabilistic database, deciding if a tuple u is better than another tuple v has not a univocal solution, rather it depends on the specific Probabilistic Ranking Semantics (PRS) one wants to adopt so as to combine together tuples' scores and probabilities. In deterministic databases it is known that skyline queries are a remarkable alternative to (top-k) ranking queries, because they remove from the user the burden of specifying a scoring function that combines values of different attributes into a single score. The skyline of a deterministic relation R is the set of undominated tuples in R -- tuple u dominates tuple v iff on all the attributes of interest u is better than or equal to v and strictly better on at least one attribute. Domination is equivalent to having s(u) ≥ s(v) for all monotone scoring functions s(). The skyline of a probabilistic relation Rp can be similarly defined as the set of P-undominated tuples in Rp, where now u P-dominates v iff, whatever monotone scoring function one would use to combine the skyline attributes, u is reputed better than v by the PRS at hand. This definition, which is applicable to arbitrary ranking semantics and probabilistic correlation models, is parametric in the adopted PRS, thus it ensures that ranking and skyline queries will always return consistent results. In this article we provide an overall view of the problem of computing the skyline of a probabilistic relation. We show how, under mild conditions that indeed hold for all known PRSs, checking P-domination can be cast into an optimization problem, whose complexity we characterize for a variety of combinations of ranking semantics and correlation models. For each analyzed case we also provide specific P-domination rules, which are exploited by the algorithm we detail for the case where the probabilistic model is known to the query processor. We also consider the case in which the probability of tuple events can only be obtained through an oracle, and describe another skyline algorithm for this loosely integrated scenario. Our experimental evaluation of P-domination rules and skyline algorithms confirms the theoretical analysis.

CRIS Current Research Information System