Melissa: Bayesian clustering and imputation of single-cell methylomes

(1)

Additional information for the paper

Melissa: Bayesian clustering and imputation of single

cell methylomes

Chantriolnt-Andreas Kapourani

1,2,∗

Guido Sanguinetti

1,3,∗

[email protected]

1_{School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK}

2_{MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK} 3_{Synthetic and Systems Biology, University of Edinburgh, Edinburgh EH9 3BF, UK}

∗_{To whom correspondence should be addressed}

1 Melissa: mean-field variational inference derivations

In mean-field variational inference the intractable posterior distribution of the latent variables

p(θ | X) is approximated by a factorized distribution q(θ) =

Q

i

q

i

(θ

i

), where θ denotes the latent

variables and X the observed variables. Then we search over the space of approximating

distribu-tions to find the distribution with the minimum Kullback-Leibler (KL) divergence with the actual

posterior

KL(q(θ) || p(θ | X)) = −

Z

q(θ) ln

p(θ | X)

q(θ)

dθ.

(1)

The KL divergence can then be minimised by performing a free form minimisation over the

q

i

(θ

i

) leading to the following update equation

q

i

(θ

i

) =

exp hln p(X, θ)i

_q j6=i

R exp hln p(X, θ)i

_q j6=i

dθ

i

,

(2)

where h·i

qj6=i

denotes an expectation with respect to the distributions q

j

(θ

j

) for all j 6= i. The joint

distribution over the observed and latent variables for the Melissa model is

p(Y, Z, C, W, π, τ |X) =p(Y|Z) p(Z|C, W, X) p(C|π)p(π) p(W|τ ) p(τ ),

(3)

where the factorisation corresponds to the probabilistic graphical model shown in Figure 7 in the

main text. We assume that the variational approximation to our posterior distribution factorises

over the latent variables (mean-field variational inference)

q(Z, C, W, π, τ ) = q(Z) q(C) q(W) q(π) q(τ ).

(4)

1.1 Deriving optimised factors

(2)

Factor q(C)

ln q(C) = hln p(Y, Z, C, π

π

π, W, τττ |X)i

_q(Z,π_π_π,W,τ_τ_{τ )}

+ const

=

*

ln

n

p(Y|Z)

|

{z

}

const

p(Z|C, W, X) p(C|π

π

π) p(π

π

π) p(W|τττ ) p(τττ )

|

{z

}

const

o

+

q(Z,πππ,W,τττ )

+ const

=

N

X

n=1 K

X

k=1

c

nk M

X

m=1

hln N (z

nm

|X

nm

w

mk

, I

nm

)i

q(znm,wmk)

+

N

X

n=1 K

X

k=1

c

nk

hln π

k

i

q(πk)

+ const

=

N

X

n=1 K

X

k=1

c

nk

ln ρ

nk

+ const,

(5)

where

ln ρ

nk

=

M

X

m=1

−

1

2 z

nm

− X

nm

w

mk

T

z

nm

− X

nm

w

mk

q(znm,wmk)

+ hln π

k

i

q(πk)

.

Taking the exponential on both sides and requiring that this distribution be normalised we obtain

q(C) =

N

Y

n=1 K

Y

k=1

r

cnk nk

where

r

nk

=

ρ

nk

P

K j=1

ρ

nj

.

Factor q(τττ )

ln q(τττ ) =

*

ln

n

p(Y|Z) p(Z|C, W, X) p(C|π

π

π) p(π

π

π)

|

{z

}

const

p(W|τττ ) p(τττ )

o

+

q(Z,C,πππ,W)

+ const

= hln p(W|τττ )i

_q(W)

+ ln p(τττ ) + const

=

K

X

k=1 M

X

m=1

hln p(w

_mk

|τ

_k

)i

_q(w mk)

+

K

X

k=1

ln p(τ

k

) + const.

(6)

Here we observe that the right hand side comprises a sum over k, i.e. each τ

k

is independent of

each other, hence

ln q(τ

k

) =

M

X

m=1

hln p(w

mk

|τ

k

)i

q(wmk)

+ ln p(τ

k

) + const

=

M D

2 ln τ

k

−

τ

k

2

M

X

m=1

w

T mk

w

mk

q(wmk)

|

{z

}

Gaussian PDF

+ (α

0

− 1) ln τ

k

− β

0

τ

k

|

{z

}

Gamma PDF

= (α

0

+

M D

2 − 1)

|

{z

}

αkparameter

ln τ

k

−

β

0

+

1

2

M

X

m=1

w

T mk

w

mk

q(wmk)

|

{z

}

βkparameter

τ

k

,

(7)

which is the logarithm of the (un-normalised) Gamma distribution, leading to

q(τ

k

) = Gamma(τ

k

|α

k

, β

k

)

α

k

= α

0

+

M D

2 β

k

= β

0

+

1

2

M

X

m=1

w

T mk

w

mk

q(wmk)

.

(8)

(3)

Note that the update for the α hyperparameter depends only on the total number of genomic

regions and the number of basis functions used to estimate the underlying methylation profiles.

On the other hand the β hyperparameter is updated at each CAVI iteration, since it depends on

the expected value of the regression coefficients. The expected value of the Gamma distribution

is E = α/β, and the inverse of this quantity is the variance parameter for the prior Gaussian

distribution of the coefficients w. Large values of E result in small variance Gaussian priors, hence

the model is substantially penalised when weights are moving away from prior mean µ

0

= 0; as a

consequence the model will tend to prune away clusters, that is, set all weights w

mk

= 0. This may

strongly affect the model in the initial iterations of CAVI, which will affect the β parameter but

not the α parameter of the Gamma distribution, potentially leading to convergence to a suboptimal

local maximum. Hence, one should be cautious when setting the initial values for these parameters;

in the current implementation of Melissa we set α

0

= 0.5 and β

0

=

√

a

k

.

Factor q(π

π

π)

ln q(π

π

π) =

*

ln

n

p(Y|Z) p(Z|C, W, X)

|

{z

}

const

p(C|π

π

π) p(π

π

π) p(W|τττ ) p(τττ )

|

{z

}

const

o

+

q(Z,W,C,τττ )

+ const

= ln p(π

π

π) + hln p(C|π

π

π)i

_q(C)

+ const

= ln C(δδδ

0

)

|

{z

}

const

+

K

X

k=1

ln π

δ_k0k−1

+

K

X

k=1 N

X

n=1

hc

nk

i

q(cnk)

|

{z

}

rnk

ln π

k

+ const

=

K

X

k=1

ln π

_kδ0k−1

+

K

X

k=1 N

X

n=1

r

nk

ln π

k

+ const.

(9)

Taking the exponential on both sides we observe that q(π

π

π) is a Dirichlet distribution

q(π

π

π) =

K

Y

k=1

π

(δ0k+ PN n=1rnk−1) k

= Dir(π

π

π|δδδ).

(10)

where δδδ has components δ

k

given by δ

k

= δ

0k

+

P

N n=1

r

nk

.

Factor q(W)

ln q(wmk) = * lnnp(Y|Z) | {z } const p(Z|C, W, X) (C|πππ) p(πππ) | {z } const p(W|τττ ) p(τττ ) |{z} const o + q(Z,C,πππ,τττ ) + const = N X n=1 hcnkiq(cnk) hln N (znm|Xnmwmk, Inm)iq(znm)+ hln p(wmk|τk)iq(τk)+ const = N X n=1 rnk −1 2 znm− Xnmwmk T znm− Xnmwmk q(znm) −1 2hτkiq(τk)w T mkwmk+ const = N X n=1 rnk wT_mkXT_nm hznmiq(znm)− 1 2w T mkX T nmXnmwmk −1 2hτkiq(τk)w T mkwmk+ const = wT_mk N X n=1 rnkXTnm hznmi_q(z_nm₎− 1 2w T mk ( hτki_q(τ_k₎I + N X n=1 rnkXTnmXnm ) wmk+ const. (11)

(4)

Because this is a quadratic form, the distribution q(w

mk

) is a Gaussian distribution and we can

complete the square to identify the mean and the covariance matrix

q(w

mk

) = N (w

mk

|λ

λ

mk

, S

mk

)

λ

mk

= S

mk N

X

n=1

r

nk

X

Tnm

hz

nm

i

q(znm)

S

mk

=

hτ

k

i

q(τk)

I +

N

X

n=1

r

nk

X

Tnm

X

nm

!

−1

.

(12)

Factor q(Z)

The log of the optimised factor assuming that the corresponding y

nmi

= 1 is

ln q(z

nmi

) =

*

ln

n

p(Y|Z) p(Z|C, W, X) p(C|π

π

π) p(π

π

π) p(W|τττ ) p(τττ )

|

{z

}

const

o

+

q(C,πππ,W,τττ )

+ const

= ln p(y

nmi

|z

nmi

) +

*

ln

K

Y

k=1

N (z

nmi

|w

Tmk

x

nmi

, 1)

cnk

+

q(cn,wm)

+ const

= y

nmi

ln 1(z

nmi

> 0) + (1 − y

nmi

) ln 1(z

nmi

≤ 0)

|

{z

}

0, since ynmi=1

+

K

X

k=1

r

nk

ln N (z

nmi

|w

mkT

x

nmi

, 1)

q(wmk)

+ const

= ln 1(z

nmi

> 0) −

1

2 z

2 nmi K

X

k=1

r

nk

|

{z

}

1

+z

nmi K

X

k=1

r

nk

w

Tmk

q(wmk)

x

nmi

+ const.

(13)

Exponentiating this quantity and setting µ

nmi

=

P

Kk=1

r

nk

w

Tmk

q(wmk)

x

nmi

we obtain

q(z

nmi

) ∝ 1(z

nmi

> 0) exp

−

1

2 z

2

nmi

+ z

nmi

µ

nmi

.

(14)

We observe that the optimized factor q(z

nmi

) is an un-normalised Truncated Normal distribution

q(z

nmi

) =

(

T N

+

(z

nmi

|µ

nmi

, 1)

if y

nmi

= 1

T N

−

(z

nmi

|µ

nmi

, 1)

if y

nmi

= 0.

(15)

1.2 Variational lower bound

The variational lower bound L(q) (i.e. evidence lower bound (ELBO)) is given by

L(q) =

X

C

Z

q(Z, C, π

π

π, W, τττ ) ln

p(Y, Z, C, π

π

π, W, τττ |X)

q(Z, C, π

π

π, W, τττ )

dZ dπ

π

π dW dτττ

= hln p(Y, Z, C, π

π

π, W, τττ |X)i

_q(Z,C,π_π_π,W,τ_τ_{τ )}

− hln q(Z, C, π

π

π, W, τττ )i

_q(Z,C,π_π_π,W,τ_τ_{τ )}

= hln p(Y|Z)i

_q(Z)

+ hln p(Z|C, W, X)i

_q(Z,C,W)

+ hln p(C|π

π

π)i

_q(C,π_π_π)

+ hln p(π

π

π)i

_q(π_π_π)

+ hln p(W|τττ )i

_q(W,τ_τ_{τ )}

+ hln p(τττ )i

_q(τ_τ_{τ )}

− hln q(Z)i

_q(Z)

− hln q(C)i

_q(C)

− hln q(π

π

π)i

_q(π_π_π)

− hln q(W)i

_q(W)

− hln q(τττ )i

_q(τ_τ_{τ )}

.

(16)

We can derive the expectations in a similar fashion to Section

1.1 . The ELBO L(q) is used to assess

(5)

1.3 Predictive density

The predictive density of a new observation y

∗

which will be associated with a latent variable c

∗

,

latent observation z

∗

and covariates X

∗

is given by

p(y

∗

|X

∗

, Y, X) =

X

c

Z

p(y

∗

, c

∗

, z

∗

, π

π

π, W, τττ |X

∗

, Y, X) dπ

π

π dτττ dW dz

∗

=

X

c

Z

p(y

∗

|z

∗

)p(z

∗

|c

∗

, W, X

∗

)p(c

∗

|π

π

π)p(π

π

π, W, τττ |Y, X) dπ

π

π dτττ dW dz

∗

'

K

X

k=1

Z

p(y

∗

|z

∗

)p(z

∗

|W

k

, X

∗

)π

k

q(π

π

π)q(W

k

)q(τ

k

) dπ

π

π dτ

k

dW

k

dz

∗

=

K

X

k=1

δ

k

ˆ

δ

Z

p(y

∗

|z

∗

)N (z

∗

|X

∗

W

k

, I

n

)N (W

k

|λ

λ

k

, S

k

) dW

k

dz

∗

=

K

X

k=1

δ

k

ˆ

δ

Z

p(y

∗

|z

∗

)N z

∗

|X

∗

λ

k

, I

n

+ diag X

∗

S

k

X

T∗

dz

∗

=

K

X

k=1

δ

k

ˆ

δ

(R

∞ 0

N z

∗

|X

∗

λ

k

, I

n

+ diag X

∗

S

k

X

T ∗

dz

where y

∗

= 1

R

0 −∞

N z

∗

|X

∗

λ

k

, I

n

+ diag X

∗

S

k

X

T∗

dz

where y

∗

= 0

=

K

X

k=1

δ

k

ˆ

δ

Φ (ρ)

y∗

_{(1 − Φ (ρ))}

(1−y∗)

=

K

X

k=1

δ

k

ˆ

δ

Bern

y

∗

Φ (ρ)

.

(17)

where ˆ

δ =

P

k

δ

k

, Φ(·) denotes the cumulative distribution function (cdf) of the standard normal

distribution and

ρ =

X

∗

λ

k

I

n

+ diag X

∗

S

k

X

T∗

(6)

2 Additional figures

0.3 0.4 0.5 0.6 0.7 0.8 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 CpG coverage F−measure a 0.3 0.4 0.5 0.6 0.7 0.8 25 50 75 100 125 150 175 200 225 250 Cells assayed F−measure Model Melissa BPRMeth RF Melissa Rate Rate GMM b

Figure S1:

Melissa robustly imputes CpG methylation states. (a) Imputation performance in terms of

F-measure as we vary the proportion of covered CpGs used for training. Higher values correspond to better imputation performance. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each coloured circle corresponds to a different simulation. The plot shows also the LOESS curve for each method as we increase CpG coverage. (b) Imputation performance measured by F-measure for varying number of cells assayed. In (a) N = 200 cells were simulated and cluster dissimilarity was set to 0.5, and in (b) CpG coverage was set to 0.4 and cluster dissimilarity to 0.5.

0.5 0.6 0.7 0.8 0.9 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Cluster dissimilarity A UC a 0.3 0.4 0.5 0.6 0.7 0.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Cluster dissimilarity F−measure Model Melissa BPRMeth RF Melissa Rate Rate GMM b

Figure S2:

Melissa robustly imputes CpG methylation states for different levels of dissimilarity across

clusters; values closer to zero correspond to highly similar cell sub-populations, whereas values closer to one correspond to well separated cell sub-populations. Imputation performance is measured in terms of (a) AUC and (b) F-measure. Higher values correspond to better imputation performance. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each coloured circle corresponds to a different simulation. The plot shows also the LOESS curve for each method as we increase the cluster dissimilarity across cell sub-populations. The CpG coverage was set to 0.4 and a total of N = 200 cells were simulated per experiment.

(7)

0

5

10

15

20

50

100

200

500 1000

2000

Cells

Hours

Model

VB

Gibbs

Model Cost

Figure S3:

Melissa efficiently imputes and clusters cell sub-populations. Running times for varying number

of cells for the variational Bayes (VB) and Gibbs sampling implementations for the Melissa model, where each cell consists of M = 200 genomic regions.

0.80 0.85 0.90 0.95 20% 50% 80% CpG coverage A UC

a

0.6 0.7 0.8 20% 50% 80% CpG Coverage F−measure Model Melissa DeepCpG BPRMeth RF Melissa Rate Indep Rate

b

Figure S4:

Melissa robustly imputes CpG methylation states on the subsampled ENCODE RRBS

methy-lation data. Imputation performance in terms of (a) AUC and (b) F-measure for varying levels of CpG coverage for pre-defined ±2.5 kb regions around TSS. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each dot corresponds to a different simulation.

(8)

0.70 0.75 0.80 0.85 0.90 0.95 20% 50% 80% CpG coverage A UC

a

b

Figure S5:

Melissa robustly imputes CpG methylation states on the subsampled ENCODE WGBS

methy-lation data. Imputation performance in terms of (a) AUC and (b) F-measure for varying levels of CpG coverage for pre-defined ±2.5 kb regions around TSS. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each dot corresponds to a different simulation.

20% coverage 50% coverage 80% coverage

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 Recall Precision Model Melissa DeepCpG BPRMeth RF Melissa Rate Indep Rate

a

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

T

rue positiv

e r

ate

b

Figure S6:

(a) Precision recall curves and (b) receiver operating characteristic curves on varying CpG

coverage levels for imputing CpG methylation states for the subsampled ENCODE RRBS methylation data for pre-defined ±2.5 kb regions around TSS.

(9)

20% coverage 50% coverage 80% coverage 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 Recall Precision Model Melissa DeepCpG BPRMeth RF Melissa Rate Indep Rate

a

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

T

rue positiv

e r

ate

b

Figure S7:

(a) Precision recall curves and (b) receiver operating characteristic curves on varying CpG

coverage levels for imputing CpG methylation states for the subsampled ENCODE WGBS methylation data for pre-defined ±2.5 kb regions around TSS.

0.80 0.85 0.90 0.95 20% 50% 80% CpG coverage A UC

a

b

Figure S8:

Melissa robustly imputes CpG methylation states on the subsampled ENCODE RRBS

methy-lation data. Imputation performance in terms of (a) AUC and (b) F-measure for varying levels of CpG coverage for pre-defined ±5 kb regions around TSS. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each dot corresponds to a different simulation.

(10)

0.65 0.70 0.75 0.80 0.85 0.90 0.95 20% 50% 80% CpG coverage A UC

a

0.6 0.7 0.8 0.9 20% 50% 80% CpG Coverage F−measure Model Melissa DeepCpG BPRMeth RF Melissa Rate Indep Rate

b

Figure S9:

Melissa robustly imputes CpG methylation states on the subsampled ENCODE WGBS

methy-lation data. Imputation performance in terms of (a) AUC and (b) F-measure for varying levels of CpG coverage for pre-defined ±5 kb regions around TSS. For each CpG coverage setting a total of 10 random splits of the data to training and test sets was performed. Each dot corresponds to a different simulation.

0.5 0.6 0.7 0.8 0.9

Promoter 10kb Promoter 5kb Promoter 3kb Nanog Super enhancers Active enhancers

F−measure Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Indep Rate

Figure S10:

Prediction performance using the F-measure metric for imputing CpG methylation states of

theAngermueller et al. (2016) dataset. Higher values correspond to better imputation performance. Each

coloured boxplot indicates the performance using 10 random splits of the data in training and test sets; due to high computational costs, DeepCpG was trained only once and the boxplots denote the variability across ten random subsamplings of the test set. Shown is the prediction performance for alternative genomic contexts: promoters (±1.5 kb, ±2.5 kb and ±5 kb regions), active enhancers, super enhancers and Nanog regulatory regions.

(11)

Nanog Super enhancers Active enhancers

Promoter 10kb Promoter 5kb Promoter 3kb

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

T rue positiv e r ate Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Indep Rate

Figure S11:

Receiver operating characteristic curves for imputing CpG methylation states of the

Anger-mueller et al.(2016) dataset.

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.6 0.7 0.8 0.9 1.0 0.6 0.7 0.8 0.9 1.0 Recall Precision Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Indep Rate

Figure S12:

Precision recall curves for imputing CpG methylation states of theAngermueller et al.(2016)

(12)

0.00 0.25 0.50 0.75 1.00 −5Kb TSS +5Kb Gene Nanog met le v el Cluster 1 2 3 −5Kb TSS +5Kb Gene Raf1 −5Kb TSS +5Kb Gene Lefty2 0.00 0.25 0.50 0.75 1.00 −5Kb TSS +5Kb Gene Dnmt3b met le v el Cluster 1 2 3 −5Kb TSS +5Kb Gene Fgf5 −5Kb TSS +5Kb Gene Wnt10a

Figure S13:

Example profiles for different promoter regions of developmental genes with window length

±5kb for theAngermueller et al. (2016) dataset. Melissa identified three cell sub-populations.

0.65 0.70 0.75

Promoter 10kb Promoter 5kb Promoter 3kb Nanog Super enhancers Active enhancers

F−measure Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Rate

Figure S14:

Prediction performance using the F-measure metric for imputing CpG methylation states of

the Smallwood et al. (2014) dataset. Higher values correspond to better imputation performance. Each

coloured boxplot indicates the performance using 10 random splits of the data in training and test sets; due to high computational costs, DeepCpG was trained only once and the boxplots denote the variability across ten random subsamplings of the test set. Shown is the prediction performance for alternative genomic contexts: promoters (±1.5 kb, ±2.5 kb and ±5 kb regions), active enhancers, super enhancers and Nanog regulatory regions.

(13)

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

T rue positiv e r ate Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Rate

Figure S15:

Receiver operating characteristic curves for imputing CpG methylation states of theSmallwood

et al.(2014) dataset.

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.6 0.7 0.8 0.9 1.0 0.6 0.7 0.8 0.9 1.0 Recall Precision Model Melissa DeepCpG DeepCpG Sub BPRMeth RF Melissa Rate Rate

Figure S16:

Precision recall curves for imputing CpG methylation states of the Smallwood et al.(2014)

(14)

0.00 0.25 0.50 0.75 1.00 −5Kb TSS +5Kb Gene Notch1 met le v el Cluster 1 2 3 −5Kb TSS +5Kb Gene Gata4 −5Kb TSS +5Kb Gene Mapk12 0.00 0.25 0.50 0.75 1.00 −5Kb TSS +5Kb Gene Apc met le v el Cluster 1 2 3 −5Kb TSS +5Kb Gene Nr5a2 −5Kb TSS +5Kb Gene Pdgfrb

Figure S17:

Example profiles for different promoter regions with window length ±5 kb for theSmallwood

et al.(2014) dataset. Melissa identified three cell sub-populations.

0.00 0.25 0.50 0.75 1.00 Up Centre Down Enhancer 1 met le v el Cluster 1 2 3 Up Centre Down Enhancer 2 Up Centre Down Enhancer 3

Figure S18:

Example profiles for different enhancer regions for theSmallwood et al.(2014) dataset. Melissa identified three cell sub-populations.

0e+00 5e+06 1e+07 0 25 50 75 100

% of methylation

Number of CpGs

Digital output of single cell DNA methylation

Figure S19:

Digital output of single cell DNA methylation. Histogram of the distribution of CpG

methy-lation values for 10 randomly sampled single cells from theAngermueller et al.(2016) study. As expected, the proportion of binary CpGs is very high (around 98.8%) and only around 0.5% of CpG sites are hemi-methylated.

(15)

0 100 200 300 400 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 10kb regions 0 100 200 300 400 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 5kb regions 0 50 100 150 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 3kb regions 0 25 50 75 100 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Nanog regions 0 100 200 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions

Active enhancers regions

0 50 100 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions

Super enhancers regions

Figure S20:

Boxplots of CpG coverage distribution across different genomic contexts for theAngermueller

et al.(2016) dataset after the filtering process. The x-axis shows CpG coverage bins and the y-axis shows

the distribution of the number of genomic regions with N CpGs covered across cells, that is, each dot in the boxplot represents a different cell.

0 300 600 900 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 10kb regions 0 200 400 600 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 5kb regions 0 250 500 750 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Promoter 3kb regions 0 100 200 300 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions Nanog regions 0 250 500 750 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions

Active enhancers regions

0 50 100 150 200 N<20 20<N<30 30<N<40 40<N<50 50<N<60 60<N<70 70<N<80 80<N<90 90<N<100N>100 CpG coverage Number of regions

Super enhancers regions

Figure S21:

Boxplots of CpG coverage distribution across different genomic contexts for the Smallwood

et al.(2014) dataset after the filtering process. The x-axis shows CpG coverage bins and the y-axis shows

the distribution of the number of genomic regions with N CpGs covered across cells, that is, each dot in the boxplot represents a different cell.

(16)

3 Additional tables

Genomic context

CpGs (in millions)

Time (in hours)

Promoter 10kb

6

5.6 Promoter 5kb

2.1

2.9 Promoter 3kb

0.62

1.31 Nanog

0.18

0.61 Super enhancers

0.5

1 Active enhancers

0.7

1.76 Table S1:

Melissa training time for the Angermueller et al. (2016) mouse ESC dataset. Across different

genomic contexts are shown the total number of CpGs used for training set and the time required (in hours) for running Melissa to impute and cluster single cells. Note that the numbers in millions refer to the total number of CpGs not the genomic regions which are generally up to thousands. As a comparison, the DeepCpG model took about three to four days to train on around four million CpGs.

Genomic context

CpGs (in millions)

Time (in hours)

Promoter 10kb

4.13

4 Promoter 5kb

1.54

2.21 Promoter 3kb

0.98

1.83 Nanog

0.26

0.9 Super enhancers

0.29

0.9 Active enhancers

0.85

2 Table S2:

Melissa training time for the Smallwood et al. (2014) mouse ESC dataset. Across different

genomic contexts are shown the total number of CpGs used for training set and the time required (in hours) for running Melissa to impute and cluster single cells. Note that the numbers in millions refer to the total number of CpGs not the genomic regions which are generally up to thousands. As a comparison, the DeepCpG model took about three to four days to train on around four million CpGs.

Genomic context

Smallwood study

Angermueller study

Promoter 10kb

21%

17%

Promoter 5kb

23%

20%

Promoter 3kb

24%

Nanog

19%

17%

Super enhancers

19%

12%

Active enhancers

25%

17%

(17)