S. Bellavia, S. Lodi, B. Morini (2006). Inferences on Kernel Density Estimates by Solving Nonlinear Systems. Los Alamitos, California: IEEE Computer Society.
Inferences on Kernel Density Estimates by Solving Nonlinear Systems
Lodi, Stefano
2006
Abstract
Kernel density estimators are a popular family of nonparametric estimators with applications to exploratory statistics and data mining. Since kernel estimators must be constructed from the data, if the data are sensitive, only indirect representations of the estimate, such as graphs or tabulations, can be stored or transmitted. However, even such representations might contain enough information to allow for data reconstruction, yielding an inference problem for kernel estimates. The inference problem for kernel estimators can be described by a system of nonlinear equations that arises naturally from the kernel estimate of a multivariate dataset. The solution to the system is the set of data from which the kernel estimate was computed and, in practice, a good approximation to the solution is not available. A serious threat to data privacy is posed by publicly available solvers for nonlinear systems. This paper investigates the numerical solution of the nonlinear systems arising from the kernel estimate of a multivariate dataset and shows that this task is challenging. In fact, the Jacobian matrix of the system is numerically singular, and a large number of solvers for nonlinear equations fail because they have to solve linear systems whose coefficient matrix is the Jacobian. Further, up-to-date solvers for optimization problems that do not suffer from this drawback may still fail to solve the nonlinear system. To show this, we tested a subspace trust-region method, a BFGS method and a gradient projection method on both a synthetic and a real dataset. These methods are able to find a solution to the optimization problem even when started far from it. However, the experimental results on both the synthetic and the real dataset show that, if the initial guess is not very close to the solution, all three methods fail to converge to a solution of the system of equations. Therefore, unless a very good approximation of the solution is known, the dataset cannot be reconstructed by using publicly available solvers.
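The inference problem described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything specific here is an assumption made for illustration: a product Gaussian kernel, a single bandwidth `h`, a regular grid of evaluation points standing in for the published tabulation, and a least-squares objective built from the residuals. The names `kde`, `residuals`, and `objective` are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel evaluated at a d-dimensional point u."""
    d = len(u)
    squared_norm = sum(c * c for c in u)
    return math.exp(-0.5 * squared_norm) / (2.0 * math.pi) ** (d / 2.0)


def kde(data, t, h):
    """Kernel estimate at evaluation point t for dataset `data`, bandwidth h."""
    n, d = len(data), len(t)
    total = 0.0
    for x in data:
        u = [(tc - xc) / h for tc, xc in zip(t, x)]
        total += gaussian_kernel(u)
    return total / (n * h ** d)


def residuals(candidate, grid, published, h):
    """One nonlinear equation per tabulated value: estimate minus published value."""
    return [kde(candidate, t, h) - f for t, f in zip(grid, published)]


def objective(candidate, grid, published, h):
    """Least-squares merit function; it is zero when the candidate solves the system."""
    return 0.5 * sum(r * r for r in residuals(candidate, grid, published, h))


# Hypothetical sensitive dataset (assumption: three points in two dimensions).
true_data = [(0.0, 0.0), (1.0, 0.5), (-0.5, 1.5)]
h = 0.8  # assumed bandwidth

# Assumed tabulation: the estimate evaluated on a small regular grid.
grid = [(a * 0.5, b * 0.5) for a in range(-2, 5) for b in range(-2, 5)]
published = [kde(true_data, t, h) for t in grid]

# A candidate reconstruction that is shifted away from the true data.
perturbed = [(x + 0.3, y - 0.2) for x, y in true_data]

print("objective at true data:     ", objective(true_data, grid, published, h))
print("objective at perturbed data:", objective(perturbed, grid, published, h))
```

In this sketch the objective vanishes at the true data by construction and is positive at the shifted candidate. An attacker would pass `objective` (or `residuals`) to a nonlinear solver, which is the step the abstract reports as failing unless the initial guess is very close to the solution. Because the estimate is a sum over data points, any reordering of the dataset yields the same tabulation, so a solution can only be recovered up to permutation.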