Scalable geometric density estimation

Ye Wang
Dept. of Statistics, Duke University

Antonio Canale
Dept. of Economics & Statistics and Collegio Carlo Alberto, University of Turin

David Dunson
Dept. of Statistics, Duke University

Abstract

It is standard to assume a low-dimensional structure in estimating a high-dimensional density. However, popular methods, such as probabilistic principal component analysis, scale poorly computationally. We introduce a novel empirical Bayes method that we term geometric density estimation (GEODE) and show that, with mild conditions and among all d-dimensional linear subspaces, the span of the d leading principal axes of the data maximizes the model posterior. With these axes pre-computed using fast singular value decomposition, GEODE easily scales to high dimensional problems while providing uncertainty characterization. The model is also capable of imputing missing data and dynamically deleting redundant dimensions. Finally, we generalize GEODE by mixing it across a dyadic clustering tree. Both simulation studies and real world data applications show superior performance of GEODE in terms of robustness and computational efficiency.

1 Introduction

Let $y_i \in \mathbb{R}^D$, for $i = 1, \ldots, N$, be a sample from an unknown distribution having support in a subset of $\mathbb{R}^D$. We are interested in estimating its density when $D$ is large and the data have a low-dimensional structure with intrinsic dimension $p$ such that $p \ll D$. Kernel methods work well in low dimensions, but face challenges in scaling up to large $D$ settings. Moreover, careful tuning of the bandwidth is needed, since its choice fundamentally impacts performance (Liu et al., 2007).

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Bayesian nonparametric models (Escobar and West, 1995; Rasmussen, 1999) provide an alternative approach for density estimation, specifying priors for the bandwidth parameters that allow adaptive estimation without cross-validation (Shen et al., 2013). However, inference is prohibitively costly.

To combat the curse of dimensionality, it is popular to assume that the data concentrate near a low-dimensional linear subspace. Principal component analysis (PCA) is a ubiquitous technique building upon such an assumption. Tipping and Bishop (1999b) generalized PCA within a density estimation framework and introduced probabilistic PCA (PPCA). PPCA can be viewed as a special case of the factor analyzer model (FA), which does not assume isotropic error. Carvalho et al. (2008) and Bhattacharya and Dunson (2011) (among many others) have successfully applied FA under the Bayesian paradigm while additionally assuming sparsity. However, FA involves complex computation that does not scale well. PPCA, on the other hand, can be fitted via the expectation maximization (EM) algorithm, which is computationally cheaper, especially when $D$ is large (Roweis, 1998). EM also offers a straightforward way to accommodate missing data through imputation at each iteration. However, PPCA is not able to scale to massive dimensional problems, and is sensitive to the choice of $p$. Moreover, the computational cost explodes even in the presence of a tiny proportion of missing data. These computational bottlenecks of PPCA are illustrated in § 4.

Randomized singular value decompositions (SVD) are able to estimate the geometric structure of the data vectors with tiny error at a very small computational cost (Rokhlin et al., 2009). Unfortunately, traditional PPCA cannot utilize this cheaply obtained geometric information. We propose a novel Bayesian model that we term geometric density estimation (GEODE), which leverages fast SVD algorithms and hence easily scales to high dimensional problems ($D = 10^6$ in our experiments).


We further propose a mixture of GEODE (mGEODE) to account for non-linear cases via a dyadic clustering tree, and illustrate its performance via real world image data.

The remainder of the paper is organized as follows. We start with a brief review of PPCA and discuss its computational bottlenecks. We then propose GEODE in § 3, and demonstrate its performance via simulation in § 4. mGEODE is proposed in § 5. A detailed discussion of the computational cost is reported in § 6. An image inpainting application is presented in § 7, and a discussion is reported in § 8.

2 PPCA Revisited

Letting $C = \Gamma\Gamma^\top + \sigma^2 I$, where $\Gamma \in \mathbb{R}^{D \times d}$ and $d < D$, the PPCA model can be written as

$$y \sim N(\mu, C). \qquad (1)$$

Letting $\lambda_1, \ldots, \lambda_D$ denote the eigenvalues of the sample covariance matrix ordered descendingly, it can be shown (Tipping and Bishop, 1999b) that the MLEs of $\Gamma$ and $\sigma^2$ are given by

$$\Gamma_{ML} = U_d(\Lambda_d - \sigma^2 I)^{1/2}R, \qquad \sigma^2_{ML} = \frac{1}{D - d}\sum_{j=d+1}^{D}\lambda_j,$$

where the column vectors of the $D \times d$ matrix $U_d$ are the principal eigenvectors of the sample covariance matrix, with the corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$ in the $d \times d$ diagonal matrix $\Lambda_d$, and $R$ is an arbitrary $d \times d$ orthogonal rotation matrix.
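To make the closed-form solution concrete, the following minimal numpy sketch computes $\Gamma_{ML}$ and $\sigma^2_{ML}$ from the eigendecomposition of the sample covariance, taking $R = I$. The function name and structure are our own illustrative choices, not the authors' code; the sketch also makes explicit why this route is expensive, since the full spectrum is required for $\sigma^2_{ML}$.

```python
import numpy as np

def ppca_mle(Y, d):
    """Closed-form PPCA MLE (Tipping and Bishop, 1999b) -- illustrative sketch.

    Y is the N x D demeaned data matrix. Note that sigma^2_ML needs the
    trailing eigenvalues, so the full spectrum must be computed.
    """
    N, D = Y.shape
    S = Y.T @ Y / N                              # D x D sample covariance
    lam, U = np.linalg.eigh(S)                   # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]               # reorder descendingly
    sigma2 = lam[d:].mean()                      # sigma^2_ML = average trailing eigenvalue
    Gamma = U[:, :d] * np.sqrt(np.maximum(lam[:d] - sigma2, 0.0))
    return Gamma, sigma2                         # rotation R taken as the identity
```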

We use $Y$ to denote the demeaned observation matrix in which each row is a demeaned data vector. According to the above result, one can solve PPCA by applying an SVD to $Y$. Note that, depending on whether the sample size $N$ is smaller than $D$, either all $D$ or all $N$ singular values are needed, with a computational cost of at least $O(\min\{ND^2, N^2D\})$ even using iterative methods. Iterative methods are likely to be non-robust if more than a small number of singular values are needed.

Roweis (1998) pointed out that using EM could be computationally cheaper, especially in cases where $D$ is large. Although the reported computational cost is $O(NDd)$, a cost comparable to that of our proposed model, the simulation results show that the constant in the cost of EM is much larger. Moreover, it is not clear how the convergence rate of EM varies with respect to $D$ and $d$, and performance is very sensitive to the choice of $d$. In practice, $d$ is typically picked by cross validation, whose computational cost is prohibitive when $D$ or $N$ is large.

A fast rank-$d$ SVD (Rokhlin et al., 2009) approximates an exact SVD in its first $d$ leading singular values within a certain error bound with high probability. The computational cost is $O(NDd)$. The algorithm is proved to have better performance when the singular values after the $d$th singular value decay very slowly, which is typically the case in practice. One might ask whether fast SVD can be applied to PPCA, since the closed form MLE comes from an SVD of $Y$. Unfortunately, instead of just the first $d$ singular values, all of them are needed to compute $\sigma^2_{ML}$. Moreover, since the fast rank-$d$ SVD relies on randomization, there is a corresponding approximation error, and it is not clear how PPCA performs in the presence of such error.

3 Geometric Density Estimation

In this section, we develop GEODE piece by piece. The correctness of the method is first justified via a theorem, and a shrinkage prior is then specified to allow GEODE to automatically identify and delete redundant dimensions. Finally, an efficient Gibbs sampler is designed for posterior computation.

3.1 Model Formulation

Let $W$ be a $D \times d$ matrix with orthonormal column vectors, $\Sigma = \mathrm{diag}(\alpha_1^2, \ldots, \alpha_d^2)$, $C = W\Sigma W^\top + \sigma^2 I$, and $S = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu)(y_i - \mu)^\top$. The model is as follows:

$$y \sim N(\mu, C), \qquad \sigma^2 \sim \mathrm{IG}(a_\sigma, b_\sigma), \qquad (2)$$

where $\mathrm{IG}(a_\sigma, b_\sigma)$ denotes an inverse Gamma distribution with shape parameter $a_\sigma$ and rate parameter $b_\sigma$. The corresponding log-posterior (up to an additive constant) is

$$L = -\frac{N}{2}\Big[D\ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}S) + \frac{2a_\sigma + 2}{N}\ln(\sigma^{2}) + \frac{2b_\sigma}{N}\sigma^{-2}\Big].$$

3.2 Empirical Bayes Solution for $\mu$ and $W$

$\mu$ and $W$ carry the geometric information of the data vectors and uniquely define a linear subspace in $\mathbb{R}^D$, with $\mu$ being the origin. Motivated by computational considerations and from an empirical Bayes perspective, we will first estimate $\mu$ and $W$ relying on a single pass through the data, and then fix them afterwards at these estimated values. In particular, we solve the following optimization problem:

$$(\hat\mu, \hat W) = \arg\max_{\mu, W}\Big\{\max_{\sigma^2} L(\mu, W, \sigma^2, \Sigma)\Big\}. \qquad (3)$$


The following theorem shows that a closed form solution to (3) exists and can be obtained via a single SVD through the data. The proof is reported in the supplementary material.

Theorem 1. Let $\lambda_1, \ldots, \lambda_D$ be the eigenvalues of $S$ ordered descendingly, define $e_k = (D - d)\lambda_k - \sum_{j=k}^{k+D-d-1}\lambda_j$ for all $k \le d + 1$, and define $q = \sum_{j=d+1}^{D}\lambda_j - (D - d)\lambda_D$. Suppose

Condition 1: $d < \mathrm{rank}(S)$;

Condition 2: $(a_\sigma + 1)\lambda_D \le \frac{N}{2}q$;

Condition 3: for all $e_k > 0$, $b_\sigma < \frac{N}{2}e_k$.

Then

$$\hat\mu = \bar y, \qquad \hat W = U_d$$

solves (3), where the column vectors of the $D \times d$ matrix $U_d$ are the $d$ leading right singular vectors of $Y$.

Condition 1 is trivial since it has to be satisfied in the first place in order for the model to make sense. In practice, conditions 2 and 3 are easily met when $N$ is large. Theorem 1 shows that the span of the $d$ leading right singular vectors of $Y$ maximizes the posterior of model (2) among all $d$-dimensional linear subspaces in $\mathbb{R}^D$, regardless of the choice of prior for $\sigma^2$.

We term the column vectors of $\hat W$ the principal axes of the data and denote them by $w_j$, for $j = 1, \ldots, d$. The theorem yields a practical method for obtaining $\hat\mu$ and $\hat W$, which is summarized as follows.

• $\hat\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$.

• Obtain $w_j$, for $j = 1, \ldots, d$, by applying the fast rank-$d$ SVD to $Y$.

This will be the first step in our method. As we do not know the true dimension $p$ of the subspace, we obtain $d$ principal axes, with $d$ chosen to correspond to a conservative upper bound, so that we are confident a priori that $d \ge p$. In the sequel, we will define a shrinkage prior on $\Sigma$ that will effectively delete redundant principal axes and favor the model around the true $p$. Exploiting this behavior, an adaptive Gibbs sampler will be proposed.
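A minimal Python sketch of this preprocessing step is given below. The function name, the oversampling parameter, and the plain Halko-style randomized range finder (used here in place of the Rokhlin et al. (2009) algorithm) are our own illustrative choices, not the authors' implementation; for convenience the sketch also computes the sufficient statistics $A_i$ and $Z_i$ introduced in § 3.3.

```python
import numpy as np

def geode_preprocess(Y_raw, d, n_oversample=10, rng=None):
    """Empirical Bayes preprocessing for GEODE (illustrative sketch).

    Estimates mu_hat as the sample mean and W_hat as the d leading right
    singular vectors of the demeaned data via a plain randomized range
    finder, then computes the sufficient statistics A_i and Z_i of Sec. 3.3.
    """
    rng = np.random.default_rng(rng)
    mu_hat = Y_raw.mean(axis=0)                    # hat{mu} = sample mean
    Y = Y_raw - mu_hat                             # demeaned N x D matrix

    # Randomized rank-d SVD of Y: sketch the range of Y, orthonormalize,
    # and recover the leading right singular vectors from the small matrix.
    Omega = rng.standard_normal((Y.shape[1], d + n_oversample))
    Q, _ = np.linalg.qr(Y @ Omega)                 # orthonormal basis for range(Y)
    _, _, Vt = np.linalg.svd(Q.T @ Y, full_matrices=False)
    W_hat = Vt[:d].T                               # D x d, columns = principal axes

    # Sufficient statistics: A_i = ||y_i - mu_hat||^2, Z_i = W_hat^T (y_i - mu_hat).
    A = np.einsum('ij,ij->i', Y, Y)                # length-N vector
    Z = Y @ W_hat                                  # N x d matrix
    return mu_hat, W_hat, A, Z
```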

3.3 Prior for Σ and Learning of p

For the diagonal elements of $\Sigma$, we could simply fix them at the values that maximize (3); from the proof of Theorem 1 we know that such values are $\alpha_j^2 = \lambda_j - \sigma^2$, for all $j = 1, \ldots, d$. In this way our model would contain only a single unknown, $\sigma^2$. However, this is problematic since our inference would rely heavily on the accuracy of the SVD of $Y$. Moreover, with all except $\sigma^2$ fixed a priori, the model uncertainty tends to be severely underestimated. Hence, instead of fixing the $\alpha_j^2$'s, we learn them by giving them a carefully designed prior distribution. This prior distribution also enables GEODE to automatically delete redundant principal axes.

Equipped with $\hat\mu$ and $\hat W$, and with another pass through the data, we can obtain for all $i = 1, \ldots, N$ the sufficient statistics $A_i = (y_i - \hat\mu)^\top(y_i - \hat\mu)$ and $Z_i = \hat W^\top(y_i - \hat\mu)$, with $Z_i^{(j)}$ denoting the $j$th element of $Z_i$. We then apply the variable transformation $u_j = (1 + \sigma^{-2}\alpha_j^2)^{-1}$, for $j = 1, \ldots, d$. With basic algebra, the likelihood of GEODE is then

$$f(y_i) \propto (\sigma^2)^{-D/2} \prod_{j=1}^{d} u_j^{1/2} \exp\Big\{-\frac{1}{2}\sigma^{-2}\Big[A_i - \sum_{j=1}^{d}(1 - u_j)(Z_i^{(j)})^2\Big]\Big\}.$$

The derivation is reported in the supplementary material.

Delete redundant principal axes. We rely on the following geometric intuition. It is easy to check that $w_j^\top y \sim N(w_j^\top\mu, \alpha_j^2 + \sigma^2)$. Hence $\alpha_j^2$ is the signal variance along the direction $w_j$, which should be decreasing for $j = 1, \ldots, p$ and zero for $j = p+1, \ldots, d$. This motivates us to penalize $\alpha_j^2$ by increasingly shrinking it towards zero as $j$ increases, which is equivalent to shrinking $u_j$ increasingly towards one for larger $j$.

To accomplish this adaptive shrinkage, we propose a multiplicative exponential process prior that adapts the prior of Bhattacharya and Dunson (2011). Letting $\delta_j = \prod_{k=1}^{j}\tau_k$, the prior is given for $j = 1, \ldots, d$ as follows:

$$u_j \sim \mathrm{Ga}_{(0,1)}(\delta_j + 1, 1), \qquad \tau_j \sim \mathrm{Exp}_{[1,\infty)}(a_\tau),$$

where $\mathrm{Ga}_{(0,1)}(\delta_j + 1, 1)$ denotes a Gamma distribution with shape parameter $\delta_j + 1$ and rate parameter 1 truncated within $(0,1)$, and $\mathrm{Exp}_{[1,\infty)}(a_\tau)$ denotes an Exponential distribution with parameter $a_\tau$ truncated within $[1,\infty)$. Here $\delta_j$ and $\tau_j$ are the global and the local shrinkage parameters for $\alpha_j^2$, respectively. Since $\tau_j \ge 1$ for $j = 1, \ldots, d$, $\delta_j = \prod_{k=1}^{j}\tau_k$ is increasing with respect to $j$. As a result, $u_j$ stochastically approaches one, since the truncated Gamma density concentrates around one as $\delta_j$ increases. In practice, we fix $a_\sigma = 2$.


3.4 Posterior Computation

To avoid paying a heavy computational price for choosing a conservative upper bound $d \ge p$, we automatically delete redundant principal axes as computation proceeds. To this end, we adopt an adaptive Gibbs sampler related to that developed by Bhattacharya and Dunson (2011). To be specific, we let $\{1, \ldots, d\} = \mathcal{A} \cup \mathcal{R}$, with $\mathcal{A} \cap \mathcal{R} = \emptyset$. The sets $\mathcal{A}$ and $\mathcal{R}$ index the active and removed axes, respectively. At iteration $t$ of the sampler, with probability $p(t) = \exp(c_0 + c_1 t)$, we refine the sets $\mathcal{A}$ and $\mathcal{R}$, where $c_0 = -1$ and $c_1 = -0.005$ are chosen to favor frequent adaptation early in the chain with exponentially fast decay in frequency. In the refinement step, we remove all axes in $\mathcal{A}$ having less than $\mathrm{tol} = 10^{-2}$ impact and, if no such axis exists, move the smallest element of $\mathcal{R}$ back to $\mathcal{A}$. We stop the adaptation after a pre-specified stopping time so that the Gibbs sampler converges to the posterior distribution associated with one selected set of principal axes. Before the stopping time, the Gibbs sampler jumps between different target distributions, and all such samples are discarded as burn-in.

The algorithm implementing GEODE can be summarized as follows:

Step 1 (preprocessing): Obtain $\hat\mu$ and $\hat W$ as described in § 3.2 and compute the sufficient statistics $A_i$ and $Z_i$ for $i = 1, \ldots, N$.

Step 2 (Gibbs sampler): Set $\mathcal{A} = \{1, \ldots, d\}$ and $\mathcal{R} = \emptyset$. Iterate until obtaining $T$ posterior samples:

1. Update $u_j$ for all $j \in \mathcal{A}$ according to $\mathrm{Ga}_{(0,1)}(\hat a_j, \hat b_j)$, where $\hat a_j = \prod_{k \le j, k \in \mathcal{A}}\tau_k + N/2$ and $\hat b_j = 1 + \frac{1}{2}\sigma^{-2}\sum_{i=1}^{N}(Z_i^{(j)})^2$.

2. Update $\tau_j$ for all $j \in \mathcal{A}$ according to $\mathrm{Exp}_{[1,\infty)}(\hat\lambda_j)$, where $\hat\lambda_j = a_\tau - \ln\big(\prod_{k \ge j, k \in \mathcal{A}} u_k\big)$.

3. Update $\sigma^{-2}$ according to $\mathrm{Ga}(\hat c, \hat d)$, where $\hat c = a_\sigma + DN/2$ and $\hat d = \frac{1}{2}\sum_{i=1}^{N}\big[A_i - \sum_{j \in \mathcal{A}}(1 - u_j)(Z_i^{(j)})^2\big] + b_\sigma$.

4. If after the stopping time, go directly to the next iteration. Otherwise,

• if before the stopping time, then with probability $1 - p(t)$ go directly to the next iteration, and otherwise go to step 5;

• if at the stopping time, move all $j \in \mathcal{A}$ such that $r_j^t = (\alpha_j^t)^2 / \max_{j \in \mathcal{A}}(\alpha_j^t)^2 < \mathrm{tol}$ from $\mathcal{A}$ to $\mathcal{R}$ and go to the next iteration.

5. Move all $j \in \mathcal{A}$ such that $r_j^t = (\alpha_j^t)^2 / \max_{j \in \mathcal{A}}(\alpha_j^t)^2 < \mathrm{tol}$ from $\mathcal{A}$ to $\mathcal{R}$. If no such $j$ exists, then move the smallest $j$ from $\mathcal{R}$ to $\mathcal{A}$.

The derivation of the conditional posteriors is in the supplementary material. The preprocessing part only involves two passes through the data, with a computational cost linear in $D$, while the cost of the Gibbs sampler is independent of $D$. This makes it easy to scale to high dimensional problems. The superior computational performance of GEODE is illustrated in the next section via simulations, and a detailed discussion of the computational cost is reported in § 6.
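For concreteness, here is a sketch of one sweep of steps 1–3 in Python/scipy; the adaptation steps 4–5 are omitted. The function signature, the inverse-CDF sampler for the truncated Gamma, and all hyperparameter defaults other than $a_\sigma = 2$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def gibbs_step(u, tau, sigma2_inv, Z, A, active, D,
               a_sigma=2.0, b_sigma=1.0, a_tau=1.0, rng=None):
    """One Gibbs sweep over (u_j, tau_j, sigma^{-2}) for GEODE (steps 1-3).

    u, tau     : length-d arrays of shrinkage variables (only active entries used)
    sigma2_inv : current value of sigma^{-2}
    Z          : N x d matrix of statistics Z_i = W_hat^T (y_i - mu_hat)
    A          : length-N vector of statistics A_i = ||y_i - mu_hat||^2
    active     : list of currently active axis indices (the set A in the paper)
    D          : ambient dimension
    a_sigma = 2 follows the paper; the other defaults are illustrative.
    """
    rng = np.random.default_rng(rng)
    N = Z.shape[0]

    # Step 1: u_j | ... ~ Ga_(0,1)(a_hat_j, b_hat_j), sampled by inverse CDF
    # restricted to the interval (0, 1).
    for j in active:
        a_hat = np.prod(tau[[k for k in active if k <= j]]) + N / 2.0
        b_hat = 1.0 + 0.5 * sigma2_inv * np.sum(Z[:, j] ** 2)
        cap = stats.gamma.cdf(1.0, a=a_hat, scale=1.0 / b_hat)
        u[j] = stats.gamma.ppf(rng.uniform(0.0, cap), a=a_hat, scale=1.0 / b_hat)

    # Step 2: tau_j | ... ~ Exp_[1,inf)(lambda_hat_j); by memorylessness this is
    # 1 plus an ordinary exponential with the same rate.
    for j in active:
        lam_hat = a_tau - np.sum(np.log(u[[k for k in active if k >= j]]))
        tau[j] = 1.0 + rng.exponential(scale=1.0 / lam_hat)

    # Step 3: sigma^{-2} | ... ~ Ga(c_hat, d_hat).
    resid = A - (Z[:, active] ** 2) @ (1.0 - u[active])
    c_hat = a_sigma + D * N / 2.0
    d_hat = b_sigma + 0.5 * resid.sum()
    sigma2_inv = rng.gamma(shape=c_hat, scale=1.0 / d_hat)

    return u, tau, sigma2_inv
```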

3.5 Missing Data Imputation

Bayesian models can easily utilize partially observed data by probabilistically imputing the missing features based on their conditional posterior distribution. Moreover, prediction can also be viewed as a missing data imputation problem. We propose several scalable missing data strategies for GEODE, and discuss the appropriateness of these strategies in different missing data scenarios.

The notations $y_M$ and $y_O$ are introduced for the missing part and the observed part of $y$, respectively. Let $\mu_M$ and $W_M$ denote the parts of $\mu$ and $W$ corresponding to $y_M$, and let $\mu_O$ and $W_O$ denote the parts corresponding to $y_O$. The following proposition enables efficient sampling from the conditional posterior distribution $p(y_M \mid y_O, \Theta)$, where $\Theta$ denotes all the unknown parameters in the model.

Proposition 1. Introduce augmented data $\eta \in \mathbb{R}^d$ such that $(y \mid \eta, \Theta) \sim N(\mu + W\eta, \sigma^2 I)$ and $(\eta \mid \Theta) \sim N(0, \Sigma)$. Then the conditional distribution with $\eta$ marginalized out equals $(y \mid \Theta) \sim N(\mu, W\Sigma W^\top + \sigma^2 I)$. Furthermore, we have

$$\eta \mid y_O, \Theta \sim N(\hat\mu_\eta, \hat C_\eta), \qquad y_M \mid \eta, y_O, \Theta \sim N(\mu_M + W_M\eta, \sigma^2 I),$$

where $\hat C_\eta = \big(\Sigma W_O^\top W_O/\sigma^2 + I\big)^{-1}\Sigma$ and $\hat\mu_\eta = \hat C_\eta W_O^\top (y_O - \mu_O)/\sigma^2$.

Corollary 1. For any $\Theta$, the following are true:

$$y_O \mid \Theta, \mu_O, W_O \sim N(\mu_O, W_O\Sigma W_O^\top + \sigma^2 I), \qquad y_M \mid y_O, \Theta, \mu_O, W_O \sim N(\hat\mu_M, \hat C_M), \qquad (4)$$

where $\hat\mu_M = \mu_M + W_M\hat\mu_\eta$ and $\hat C_M = W_M\hat C_\eta W_M^\top + \sigma^2 I$.

Figure 1: Comparing the MSE of estimating $\sigma^2$ under different $D$ and $p$. Results for PPCA are color coded, with black denoting $d = p$, blue denoting $d = p + 5$, and purple denoting $d = p + 10$.

Proofs are reported in the supplementary material. Let $D_M$ denote the number of missing features in $y$. Equipped with Proposition 1 and Corollary 1, we propose the following three imputation strategies (a code sketch of the second strategy follows the list):

• Small $D_M$ ($\le d$): sampling using (4) is preferred to sampling using Proposition 1 due to numerical concerns. The computational complexity is $O(d^3)$.

• Moderately large $D_M$ ($> d$): sample via the data augmentation technique provided in Proposition 1. The computational complexity is $O(d^3 + D_M d)$.

• Large $D_M$: sample via data augmentation in the first few steps of the Gibbs sampler, and later on fix the value of $y_M$ at its last update. When $D_M$ is large, we cannot afford to run a Gibbs sampler in which each step has a complexity linear in $D_M$.
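The sketch below implements the data augmentation of Proposition 1 for a moderately large number of missing features (the second strategy above); the function name and argument layout are ours, and $\hat C_\eta$ is formed through the equivalent precision form $(W_O^\top W_O/\sigma^2 + \Sigma^{-1})^{-1}$.

```python
import numpy as np

def impute_missing(y_obs, obs_idx, mis_idx, mu, W, Sigma_diag, sigma2, rng=None):
    """Draw y_M | y_O, Theta via the data augmentation of Proposition 1 (sketch).

    obs_idx / mis_idx : index arrays of observed and missing coordinates of y
    Sigma_diag        : length-d vector of alpha_j^2
    """
    rng = np.random.default_rng(rng)
    W_O, W_M = W[obs_idx], W[mis_idx]
    mu_O, mu_M = mu[obs_idx], mu[mis_idx]

    # eta | y_O, Theta ~ N(mu_eta, C_eta), with
    # C_eta = (W_O^T W_O / sigma^2 + Sigma^{-1})^{-1}.
    prec = W_O.T @ W_O / sigma2 + np.diag(1.0 / Sigma_diag)
    C_eta = np.linalg.inv(prec)
    mu_eta = C_eta @ W_O.T @ (y_obs - mu_O) / sigma2
    eta = rng.multivariate_normal(mu_eta, C_eta)

    # y_M | eta, Theta ~ N(mu_M + W_M eta, sigma^2 I)
    return mu_M + W_M @ eta + np.sqrt(sigma2) * rng.standard_normal(len(mis_idx))
```

The cost is dominated by the $d \times d$ solve plus the $D_M \times d$ products, matching the stated $O(d^3 + D_M d)$ complexity.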

4 Simulation Studies

In this section, we compare GEODE to its counterpart PPCA in terms of accuracy, robustness and computational efficiency via simulations. All experiments are conducted in Matlab version 2015a on an OS X laptop with a dual-core 3.1 GHz Intel(R) Core(TM) i7 processor. PPCA is fitted using the Matlab function ppca from the Statistics and Machine Learning Toolbox. This function implements an EM algorithm for PPCA which handles missing data (Roweis, 1998; Ilin and Raiko, 2010). All results reported are obtained by averaging over 10 replicated experiments.

Moderately large D without missing data: In this first simulation study, we let $D$ vary from 100 to 500 and fix $N = 500$. We test both methods on three different intrinsic dimensions, $p \in \{5, 10, 20\}$. To test the robustness of PPCA to the choice of $d$, we let $d$ take values in $\{p, p+5, p+10\}$, while fixing $d = 30$ for GEODE. We evaluate the performance of both methods in terms of mean squared error (MSE) in estimating $\sigma^2$. The comparison of the two models is presented in Figure 1. The thin black lines denote the MSE of PPCA with a correct guess of $p$, i.e., $d = p$. Though they seem to be better than GEODE, the difference decreases as $p$ increases. However, when $d$ is chosen incorrectly, which is common in practice, the performance of PPCA (denoted by the thin colored lines) drops dramatically and is much worse than that of GEODE. The corresponding CPU time for fitting both models is reported in Figure 2. The computational cost of PPCA grows fast in $D$, while the cost of GEODE grows much more slowly and is dominated by the Gibbs sampler part; this explains why in Figure 2 the cost of GEODE seems not to grow in $D$. To check how well GEODE dynamically deletes the redundant principal axes, for $j = 1, \ldots, d$, we calculate the average proportion of $j$'s being inside $\mathcal{R}$ within all iterations where adaptation takes place. These proportions are visualized in Figure 3. It can easily be seen that GEODE is able to quickly identify the redundant axes. Hence, we can conclude from the simulations that GEODE performs almost as well as, but slightly worse than, the best PPCA can achieve, at a much smaller computational cost. Moreover, GEODE automatically selects $p$ starting from any crude guess $d$, while the performance of PPCA is highly sensitive to the choice of $d$.

Figure 2: Comparing CPU times fitting the model under different $D$ and $p$. Results for PPCA are color coded, with black denoting $d = p$, blue denoting $d = p + 5$, and purple denoting $d = p + 10$.

Moderately large D with missing data: Though the EM algorithm of PPCA offers a straightforward way to impute missing data, even a very small proportion of missingness explodes its computational cost. In this simulation study, we fix $N = 500$, and randomly select 25 observations with 5 features missing. We fit PPCA with $d = p$ and run simulations for $D = 100$ and $D = 200$. In estimating $\sigma^2$, GEODE generates almost as good results as PPCA, with a CPU time of less than 4 seconds in both cases. However, the CPU time of fitting PPCA is 177 seconds for $D = 100$ and 578 seconds for $D = 200$.

Figure 3: Proportion of inclusion for $j$ within all adaptations, averaged across all replicates.

Figure 4: Top: MSE in estimating $\sigma^2$; bottom: the CPU time fitting GEODE.

Massive D: To evaluate the scalability of GEODE to massive dimensions, we redo the previous two experiments on GEODE with $D$ varying from $10^5$ to $10^6$. Note that $D = 10^6$ is the largest that we can test on our computer due to storage limits (with $N = 500$ and $D = 10^6$, a single $Y$ takes more than 3 GB of storage). For illustration purposes, we fix $p = 5$. The MSE and the CPU time are reported in Figure 4. It is clear that GEODE remains computationally feasible even when $D = 10^6$, while providing very good performance.

5 Mixture of GEODE

Mixture of PPCA (Tipping and Bishop, 1999a) extends PPCA to characterize non-Gaussian data. However, it inherits the computational drawbacks of PPCA. The mixture of factor analyzers (MFA) avoids the isotropic error constraint of mixture of PPCA, but the corresponding EM algorithm (Ghahramani and Hinton, 1996) suffers a similar computational bottleneck. Bayesian MFA is a straightforward Bayesian implementation in small dimensional problems (Diebolt and Robert, 1994; Richardson and Green, 1997), but faces problems in scaling beyond a few hundred dimensions. Inspired by the empirical Bayes idea of GEODE in linear cases, we propose to learn and fix a multiscale set of potential principal component axes in a first stage. In the second stage, we mix across these principal component axes according to their likelihood in a Bayesian paradigm. The ability to learn the intrinsic dimensions $p$ (we allow different subspaces to have different dimensions) and the ability to characterize uncertainty are both inherited from the linear case.

5.1 Multiscale Principal Axes

To acquire a multiscale set of principal directions, we first adopt METIS (Karypis and Kumar, 1998) to obtain a dyadic clustering tree of the dataset. This is partly motivated by the compressive sensing technique developed by Allard et al. (2012), which efficiently compresses the data by locally finding the best linear subspaces to approximate these dyadic clusters. Moreover, a tree structure allows the model to adapt to different local smoothness by mixing across fine scales and coarse scales. A dyadic clustering tree of the data $\{y_i\}_{i=1}^N$ is defined as follows.

Definition 1. With $s = 0, \ldots, L$ denoting the scale index and $h = 1, \ldots, 2^s$ denoting the node index within scale $s$, a level-$L$ dyadic clustering tree of $\{y_i\}_{i=1}^N$ is a family of index sets $\mathcal{D}_{sh} \subseteq \{1, \ldots, N\}$ such that

• for every $s$, $\bigcup_{h=1}^{2^s}\mathcal{D}_{sh} = \{1, \ldots, N\}$;

• for $s \le s'$ and $1 \le h' \le 2^{s'}$, either $\mathcal{D}_{s'h'} \subseteq \mathcal{D}_{sh}$ or $\mathcal{D}_{s'h'} \cap \mathcal{D}_{sh} = \emptyset$;

• for $s < s'$ and $1 \le h' \le 2^{s'}$, there exists a unique $h = 1, \ldots, 2^s$ such that $\mathcal{D}_{s'h'} \subseteq \mathcal{D}_{sh}$.

The tree is denoted by $\{\mathcal{D}_{sh}\}_L$.

METIS generates the tree-structured clustering by partitioning a weighted graph constructed from the data. Following the suggestion of Allard et al. (2012), we add an edge between each data point and its $k$ nearest neighbors and set the weight between any $y_i$ and $y_j$ to $e^{-\|y_i - y_j\|_2^2/\delta}$, where $\delta$ is chosen adaptively at each point $y_i$ as the distance between $y_i$ and its $\lfloor k/2 \rfloor$-th nearest neighbor. In practice, we fix $k = 30$ and constrain the leaf size $|\mathcal{D}_{Lh}|$ to be greater than 10, for $h = 1, \ldots, 2^L$. The depth of the tree $L$ depends on the sample size $N$ and is automatically decided by METIS. The performance of METIS is illustrated in the supplementary material through multiple simulations.
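As an illustration of the graph construction, the sketch below builds the weighted $k$-nearest-neighbour graph with the adaptive bandwidth described above, using scikit-learn's NearestNeighbors. The dense output, the max-symmetrization of the weights, and the function name are our own simplifications; the resulting graph would then be handed to METIS for partitioning.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_similarity_graph(Y, k=30):
    """Weighted k-NN graph for the dyadic clustering tree (illustrative sketch).

    Edge weight between y_i and y_j is exp(-||y_i - y_j||^2 / delta_i), with
    delta_i the distance from y_i to its floor(k/2)-th nearest neighbour.
    Returns a dense symmetric weight matrix; a sparse representation would be
    used in practice before passing the graph to METIS.
    """
    N = Y.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Y)   # +1: first neighbour is the point itself
    dist, idx = nn.kneighbors(Y)
    delta = np.maximum(dist[:, k // 2], 1e-12)        # adaptive local bandwidth
    W = np.zeros((N, N))
    for i in range(N):
        for pos in range(1, k + 1):                   # skip column 0 (self)
            j = idx[i, pos]
            w = np.exp(-dist[i, pos] ** 2 / delta[i])
            W[i, j] = max(W[i, j], w)                 # symmetrize by taking the max
            W[j, i] = W[i, j]
    return W
```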

Equipped with a level-$L$ dyadic clustering tree $\{\mathcal{D}_{sh}\}_L$, the corresponding multiscale principal axes are defined as follows.

Definition 2. The multiscale principal axes of $\{y_i\}_{i=1}^N$ with respect to a level-$L$ dyadic clustering tree $\{\mathcal{D}_{sh}\}_L$ are defined as a family of centroids $\hat\mu_{sh} \in \mathbb{R}^D$ and a family of orthogonal matrices $\hat W_{sh} \in \mathbb{R}^{D \times d}$ such that for all $s$ and $h$

• $\hat\mu_{sh} = \frac{1}{|\mathcal{D}_{sh}|}\sum_{i \in \mathcal{D}_{sh}} y_i$;

• $\hat W_{sh} = U_{sh}$, where the $d$ column vectors of $U_{sh}$ are the $d$ leading right singular vectors of $Y_{sh}$, and $Y_{sh}$ is a $|\mathcal{D}_{sh}| \times D$ matrix with each row representing a demeaned data vector from $\mathcal{D}_{sh}$.

The multiscale principal axes are denoted by $\{\hat\mu_{sh}, \hat W_{sh}\}_L$.

In practice, we use the fast rank-$d$ SVD to compute $\hat W_{sh}$.
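Computing the multiscale principal axes then just repeats the empirical Bayes preprocessing within each node of the tree. The sketch below assumes the hypothetical geode_preprocess helper from the § 3.2 sketch and a dict representation of the tree; both are illustrative assumptions rather than the authors' implementation.

```python
def multiscale_axes(Y_raw, tree_nodes, d):
    """Compute {mu_sh, W_sh} for every node of a dyadic clustering tree (sketch).

    tree_nodes : dict mapping (s, h) -> array of row indices in D_sh
    Reuses the geode_preprocess sketch from Section 3.2 on each cluster.
    """
    axes = {}
    for (s, h), idx in tree_nodes.items():
        mu_sh, W_sh, _, _ = geode_preprocess(Y_raw[idx], d)
        axes[(s, h)] = (mu_sh, W_sh)
    return axes
```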

5.2 Model Formulation

Equipped with the multiscale principal axes $\{\hat\mu_{sh}, \hat W_{sh}\}_L$, the mixture of GEODE model (mGEODE) is given by

$$y \sim \sum_{s,h} \pi_{sh}\, N(\hat\mu_{sh}, C_{sh}), \qquad (5)$$

where $C_{sh} = \hat W_{sh}\Sigma_{sh}\hat W_{sh}^\top + \sigma_s^2 I$ and $\Sigma_{sh}$ is a $d \times d$ positive diagonal matrix, for $s = 0, \ldots, L$ and $h = 1, \ldots, 2^s$. For all $s$ and $h$, $\Sigma_{sh}$ and $\sigma_s^2$ are given the same prior distributions as in GEODE. We assume an isotropic error variance $\sigma_s^2$ for each scale $s$ to enable clusters from the same scale to share information.

clusters from the same scale to share information. We then finish the formulation of mGEODE by choos-ing a prior for the multiscale mixchoos-ing weights πsh. This

prior should be structured to allow adaptive learning of the appropriate tradeoff between coarse and fine scales. Heavily favoring coarse scales may lead to re-duced variance but also high bias if the coarse scale approximation is not accurate. High weights on fine scales may lead to low bias but high variance due to limited sample size in each fine resolution component. With this motivation, Canale and Dunson (2016) pro-posed a multiresolution stick-breaking process gener-alizing usual “flat” stick-breaking (Sethuraman, 1994).

In particular, let

$$S_{sh} \sim \mathrm{Be}(1, a_S), \qquad R_{sh} \sim \mathrm{Be}(b_R, b_R), \qquad (6)$$

with $S_{sh}$ denoting the probability that an observation stops at node $(s, h)$ of the binary tree and $R_{sh}$ denoting the probability that the observation moves down to the right from node $(s, h)$, conditionally on not stopping at node $(s, h)$. Hence

$$\pi_{sh} = S_{sh} \prod_{r < s} (1 - S_{r\, g_{shr}})\, T_{shr}, \qquad (7)$$

where $g_{shr} = \lceil h/2^{s-r} \rceil$ denotes the ancestor of node $(s, h)$ at scale $r$, $T_{shr} = R_{r\, g_{shr}}$ if node $(r+1, g_{sh(r+1)})$ is the right daughter of node $(r, g_{shr})$, and otherwise $T_{shr} = 1 - R_{r\, g_{shr}}$. Canale and Dunson (2016) showed that $\sum_{s=0}^{\infty}\sum_{h=1}^{2^s}\pi_{sh} = 1$ almost surely for any $a_S, b_R > 0$. This result makes the defined weights a proper set of multiscale mixing weights. As $a_S$ increases, finer scales are favored, resulting in a highly non-Gaussian density.

In practice, we only consider a truncated finite-depth multiscale mixture with depth $L$. Let $\{\tilde\pi_{sh}\}_{s \le L}$ denote the truncated weights, which are identical to $\{\pi_{sh}\}$ except that the stopping probabilities at scale $L$ are set equal to one, ensuring $\sum_{s=0}^{L}\sum_{h=1}^{2^s}\tilde\pi_{sh} = 1$.
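A small sketch of how the truncated weights $\tilde\pi_{sh}$ can be computed from the stick-breaking variables is given below, with the stopping probabilities at scale $L$ forced to one. The dict-based representation of $S_{sh}$ and $R_{sh}$ and the function name are assumptions made for illustration.

```python
import numpy as np

def multiscale_weights(S, R, L):
    """Truncated multiscale stick-breaking weights pi_tilde_sh (sketch).

    S[(s, h)] : stopping probability at node (s, h); forced to 1 at scale L.
    R[(s, h)] : probability of moving to the right child given no stop.
    """
    pi = {}
    for s in range(L + 1):
        for h in range(1, 2 ** s + 1):
            w = 1.0
            for r in range(s):                         # walk down from the root
                g = int(np.ceil(h / 2 ** (s - r)))     # ancestor of (s, h) at scale r
                g_next = int(np.ceil(h / 2 ** (s - r - 1)))
                turn_right = (g_next == 2 * g)         # right child has the even index
                T = R[(r, g)] if turn_right else 1.0 - R[(r, g)]
                w *= (1.0 - S[(r, g)]) * T             # did not stop, took branch T
            stop = 1.0 if s == L else S[(s, h)]        # truncation at scale L
            pi[(s, h)] = w * stop
    return pi
```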

5.3 Posterior Computation

The posterior sampling for mGEODE is almost identical to that for GEODE, except for the newly introduced variables $s_i$, $h_i$, $S_{sh}$ and $R_{sh}$. The conditional posterior of $(s_i, h_i)$ is given by

$$p(s_i = s, h_i = h \mid \cdot) \propto \pi_{sh}\, N\big(y_i;\ \hat\mu_{sh},\ \hat W_{sh}\Sigma_{sh}\hat W_{sh}^\top + \sigma_s^2 I\big).$$
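Because $C_{sh}$ has a low-rank-plus-isotropic form, the Gaussian density in this assignment step can be evaluated in $O(Dd)$ per node via the Woodbury identity, without ever forming a $D \times D$ matrix. The following sketch, with a dict-keyed parameterization of our own choosing, illustrates this.

```python
import numpy as np

def log_assign_prob(y, pi, mu, W, alpha2, sigma2):
    """Unnormalized log p(s_i = s, h_i = h | ...) for one observation (sketch).

    pi, mu, W, alpha2, sigma2 : dicts keyed by node (s, h) holding the weight,
    centroid, principal axes, diagonal of Sigma_sh and the scale variance.
    Uses the low-rank structure of C_sh = W Sigma W^T + sigma^2 I.
    """
    out = {}
    for key in pi:
        r = y - mu[key]
        z = W[key].T @ r                             # d-dimensional projection
        a2, s2 = alpha2[key], sigma2[key]
        D, d = W[key].shape
        # log|C| = (D - d) log sigma^2 + sum_j log(alpha_j^2 + sigma^2)
        logdet = (D - d) * np.log(s2) + np.sum(np.log(a2 + s2))
        # r^T C^{-1} r = (||r||^2 - sum_j (alpha_j^2/(alpha_j^2 + sigma^2)) z_j^2) / sigma^2
        quad = (r @ r - np.sum(a2 / (a2 + s2) * z ** 2)) / s2
        out[key] = np.log(pi[key]) - 0.5 * (logdet + quad)
    return out
```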

The conditional posteriors of $S_{sh}$ and $R_{sh}$ are given by

$$S_{sh} \sim \mathrm{Beta}(1 + n_{sh},\ a_S + v_{sh} - n_{sh}), \qquad R_{sh} \sim \mathrm{Beta}(b_R + r_{sh},\ b_R + v_{sh} - n_{sh} - r_{sh}),$$

where $v_{sh}$ is the number of observations passing through node $(s, h)$, $n_{sh}$ is the number of observations stopping at node $(s, h)$, and $r_{sh}$ is the number of observations that continue to the right after passing through node $(s, h)$.

The remaining Gibbs steps are very similar to GEODE and are provided in the supplementary material.

6 Computational Aspects

GEODE Letting T denote the total number of Gibbs sampler iterations, the computational cost can be split as follows.


Figure 5: The first row shows the original images, second row shows the images with pixels missing, and the third row shows the reconstructed images.

Construction of principal axes: The complexity of the fast rank-$d$ SVD is $O(NDd)$.

Construction of sufficient statistics: The complexity of computing $A_i = \tilde y_i^\top \tilde y_i$ for all $i$ is $O(ND)$, and the complexity of computing $Z_i = W^\top \tilde y_i$ for all $i$ is $O(NDd)$. Hence, the overall complexity is $O(NDd)$.

Gibbs sampler: The cost is dominated by updating $\sigma^2$ and $u$, whose complexities are both $O(NTd)$.

Hence the overall complexity of GEODE is $O(NDd + NTd)$.

mGEODE Letting $K$ denote the number of nearest neighbours used in constructing the weighted graph, the computational cost can be split as follows.

Construction of the weighted graph: The complexity of ANN in finding the $K$ nearest neighbours is $O(DN\log N)$ (Arya et al., 1998). The complexity of computing the weights of the graph is $O(KND)$.

Graph partition: The cost of METIS is $O(KN\log N)$.

Construction of multiscale principal axes: For each node $(s, h)$, the cost of applying the fast rank-$d$ SVD is $O(|\mathcal{D}_{sh}|Dd)$. We have $|\mathcal{D}_{sh}| = O(2^{-s}N)$ and, summing over all nodes of the tree with $L < \log_2 N$, we obtain a total cost of $O(N\log N\, Dd)$.

Construction of sufficient statistics: As in the linear case, for each node $(s, h)$ the complexity is $O(|\mathcal{D}_{sh}|Dd)$. By the same argument as for the multiscale principal axes, the complexity of constructing the sufficient statistics is $O(N\log N\, Dd)$.

Gibbs sampler: The complexity of the sampler is dominated by updating $(s_i, h_i)$ for all $i$ and updating $u_{sh}$ for all nodes, whose complexities are both $O(NT2^Ld)$.

Hence, the overall complexity of mGEODE is $O(N\log N\, Dd + NT2^Ld)$.

Moreover, the Gibbs sampler converges fast with superb mixing. MCMC diagnostics based on the potential scale reduction factor and the effective sample size can be found in the supplementary materials. In practice we fix $T = 3000$, with the number of burn-in iterations fixed at 1000 and the stopping time fixed at 800. No thinning is needed.

7 Application

Given that GEODE has already been carefully studied in § 4, we devote this section to illustrating the performance of mGEODE through an image inpainting application.

The Frey faces data (Roweis et al., 2002) contain 1965 video frames of size 20 × 28 of a single face with different expressions. mGEODE is trained on 1000 images in less than 2 minutes and then reconstructs (predicts) the remaining 965 damaged images. The reconstruction is done in less than 10 minutes. The mean absolute reconstruction error of mGEODE is 7.04, which outperforms the error of 7.40 reported by Titsias and Lawrence (2010). Fourteen reconstructions are shown in Figure 5. In this application, we set $d = 20$. Increasing $d$ moderately had essentially no impact on the results.

8 Discussion

In high dimensional applications, PPCA is ubiquitously used since it provides not only a low-dimensional embedding but also a density estimate. Unfortunately, the dominant algorithm for fitting PPCA is not scalable to high dimensions. We tackle this problem by proposing an empirical Bayes model built upon a fast SVD technique. The proposed GEODE shows excellent computational scaling, while providing a valid characterization of uncertainty in predictions. It also shows excellent performance in inferring the subspace dimension and in handling missing data. We further propose a mixture of GEODE model which mixes local GEODE models across a dyadic clustering tree. This mGEODE model showed excellent performance in a real world data application.


References

W. K. Allard, G. Chen, and M. Maggioni. Multiscale geometric methods for data sets II: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 32(3):435–462, 2012.

S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

A. Bhattacharya and D. B. Dunson. Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306, 2011.

A. Canale and D. B. Dunson. Multiscale Bernstein polynomials for densities. Statistica Sinica, 2016. In press, doi: 10.5705/ss.202015.0163.

C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 103(484), 2008.

J. Diebolt and C. P. Robert. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 56(2):363–375, 1994.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.

A. Ilin and T. Raiko. Practical approaches to principal component analysis in the presence of missing values. The Journal of Machine Learning Research, 11:1957–2000, 2010.

G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.

H. Liu, J. D. Lafferty, and L. A. Wasserman. Sparse nonparametric density estimation in high dimensions using the rodeo. In International Conference on Artificial Intelligence and Statistics, pages 283–290, 2007.

C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, volume 12, pages 554–560, 1999.

S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792, 1997.

V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.

S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, pages 626–632, 1998.

S. T. Roweis, L. K. Saul, and G. E. Hinton. Global coordination of local linear models. In NIPS, volume 2, pages 889–896, 2002.

J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

W. Shen, S. T. Tokdar, and S. Ghosal. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100(4):623–640, 2013.

M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999a.

M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999b.

M. Titsias and N. Lawrence. Bayesian Gaussian process latent variable model. In the International Conference on Artificial Intelligence and Statistics, 2010.
