Dreassi, E., Pratelli, L., Rigo, P. (In press/Work in progress). Knockoffs for partially exchangeable categorical covariates. Statistical Methods & Applications, 1, 1-25.
Knockoffs for partially exchangeable categorical covariates
Rigo Pietro
In press
Abstract
Let $X=(X_1,\ldots, X_p)$ be the vector of covariates in a regression problem and let $\widetilde{X}$ be a knockoff copy of $X$ (in the sense of Candès et al. 2018). In a number of applications, mainly in genetics, there is a finite set $F$ such that $X_i\in F$ for each $i=1,\ldots,p$. Nevertheless, when the knockoff procedure is used for variable selection, $X$ is usually modeled as an absolutely continuous random vector. While understandable from an applied point of view, this approximation is not theoretically justified, since $X$ is supported by the finite set $F^p$. In this paper, explicit formulae for the joint distribution of $(X,\widetilde{X})$ are provided when $P(X\in F^p)=1$ and $X$ is partially exchangeable. Indeed, when $X_i\in F$ for all $i$, assuming that $X$ is partially exchangeable is often a good strategy. In a few situations, admittedly extreme, it may also be reasonable to assume that $X$ is exchangeable. Hence, some attention is paid to the exchangeable special case. The robustness of $\widetilde{X}$ with respect to the de Finetti measure $\pi$ of $X$ is investigated as well. Let $\mathcal{L}_\pi(\widetilde{X}\mid X=x)$ be the conditional distribution of $\widetilde{X}$ given $X=x$, when $X$ is exchangeable and the de Finetti measure of $X$ is $\pi$. It is shown that $\lVert\mathcal{L}_{\pi_1}(\widetilde{X}\mid X=x)-\mathcal{L}_{\pi_2}(\widetilde{X}\mid X=x)\rVert\le c(x)\,\lVert\pi_1-\pi_2\rVert$, where $\lVert\cdot\rVert$ is the total variation distance and $c(x)$ is a suitable constant. Finally, a numerical experiment is performed. Overall, the knockoffs of this paper outperform the alternatives (i.e., the knockoffs obtained by giving $X$ an absolutely continuous distribution) in terms of the false discovery rate, but are slightly weaker in terms of power.
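As an illustration of the exchangeable special case only, and not a reproduction of the paper's formulae, the following sketch assumes that the coordinates of $X$ are conditionally i.i.d. given a vector $\theta$ of category probabilities on $F$, and that the mixing measure $\pi$ is a Dirichlet distribution (a convenience assumption made here because it is conjugate). Under these assumptions, drawing $\theta$ from its conditional distribution given $X=x$ and then sampling the coordinates of $\widetilde{X}$ i.i.d. from $\theta$ makes all $2p$ coordinates of $(X,\widetilde{X})$ exchangeable, which implies the swap invariance required of a knockoff copy. The function name `sample_exchangeable_knockoff` and the parameter `alpha` are hypothetical.

```python
import random


def sample_exchangeable_knockoff(x, levels, alpha, rng):
    """Draw one knockoff copy of x (illustrative sketch).

    Assumptions (not taken from the paper): the coordinates of x are
    conditionally i.i.d. given category probabilities theta, and theta
    has a Dirichlet(alpha) mixing measure over the finite set `levels`.
    """
    # Count how often each level of F occurs in the observed vector x.
    counts = [sum(1 for xi in x if xi == f) for f in levels]
    # Given X = x, theta is Dirichlet(alpha + counts); draw it by
    # normalizing independent Gamma variables.
    gammas = [rng.gammavariate(a + c, 1.0) for a, c in zip(alpha, counts)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # Given theta, the knockoff coordinates are i.i.d. draws from theta.
    return rng.choices(levels, weights=theta, k=len(x))


rng = random.Random(0)
levels = [0, 1, 2]          # hypothetical coding of F, e.g. genotype codes
x = [0, 1, 1, 2, 0, 0, 1]   # hypothetical observed covariate vector
x_tilde = sample_exchangeable_knockoff(x, levels, [1.0, 1.0, 1.0], rng)
```

Since `x_tilde` is generated from `x` alone, it is conditionally independent of any response given $X$. The partially exchangeable case would require one such mixing step per block of coordinates with a joint mixing measure, which this sketch does not attempt.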


