Meta-statistics for biomarker selection in the omics sciences

(1)

As Different as Chalk and Cheese:

Meta-statistics for Biomarker Detection

Ron Wehrens, Pietro Franceschi

Biostatistics and Data Management

http://cri.fmach.eu/BDM

StatSeq Workshop, Verona, April 18–19, 2012

(3)

Biomarker identification (2-class situation)

[Figure: two-class illustration with groups G1 and G2]

(5)

Common approaches

Context        Tools                     Challenges

Univariate     statistical tests         multiple testing
                                         correlations

Multivariate   PCLDA, PLSDA, VIP, ...    model validation
               variable selection        optimization criterion
                                         local optima

(6)

Spike-in apple data

10 control apples

10 treated apples, spiked with 9 chemical compounds

3 sets of concentrations

ESI+ and ESI−

In each: 22 “true” biomarkers

Total: six comparisons between treated and control

(7)

One apple sample – the result

(8)

(9)

Higher criticism

“God loves 0.05 just as much as 0.06” – selecting optimal thresholds

Second-level significance testing (Tukey, 1976)

“Real differences are rare and weak”

Here: extension to multivariate methods

Compare to “standard” thresholds: α = 0.05, VIP = 1

Donoho and Jin, PNAS (2008)

Wehrens and Franceschi, submitted

(10)

Higher Criticism: theory

\[
HC = \max_i \frac{\sqrt{N}\,\left(i/N - p_i\right)}{\sqrt{(i/N)\,\left(1 - i/N\right)}}
\]

[Figure: HC thresholding, alpha = 0.1. Axes: ordered scores (0.00 to 0.10) and HC_i (1.0 to 4.0).]
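The statistic on this slide can be tried out with a few lines of Python. This is a minimal sketch, not the BioMark implementation: it assumes that p_i denotes the i-th smallest of N p-values (as in Donoho and Jin) and that only the smallest fraction alpha of the p-values is searched, which is one reading of "alpha = 0.1" in the plot title. The function name `higher_criticism` and its return values are hypothetical.

```python
import math


def higher_criticism(pvalues, alpha=0.1):
    """Return (HC, i, p_i) for the index i that maximises HC_i.

    Assumption: only the smallest `alpha` fraction of the sorted
    p-values is searched for the maximum.
    """
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    p_sorted = sorted(pvalues)
    # Never evaluate i == n, where the denominator would be zero.
    i_max = min(max(1, math.floor(alpha * n)), n - 1)
    best_hc, best_i = -math.inf, 1
    for i in range(1, i_max + 1):
        frac = i / n
        hc_i = math.sqrt(n) * (frac - p_sorted[i - 1]) / math.sqrt(frac * (1 - frac))
        if hc_i > best_hc:
            best_hc, best_i = hc_i, i
    return best_hc, best_i, p_sorted[best_i - 1]
```

Under these assumptions, variables whose p-value is at or below the returned p_i would be the candidate biomarkers, so the threshold is taken from the data rather than fixed at 0.05.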

(11)

Results for apple data: selection efficiency

[Figure: selection efficiency (in %, axis 0.1 to 0.5) for negative and positive ionization, Groups 1 to 3. Methods compared: PCLDA, PLSDA and VIP (each with 2, 4 and 10 LVs) and the t test.]

(12)

Stability-based biomarker selection

Central idea: good biomarkers are stable

Perturb data by subsampling (samples as well as variables)

Fit many models on many perturbed data sets

Which variables come up consistently at the top?

Parameters: defining subsampling, consistency

Wehrens et al., Anal. Chim. Acta (2011)
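The idea on this slide (perturb, refit, count what stays on top) can be sketched in plain Python. This is an illustration under stated assumptions, not the published algorithm or the BioMark API: variables are scored with a Welch t statistic, "top" means the `top_n` best-scoring variables in a subsample, and "consistent" means appearing in the top in at least `min_freq` of the subsamples in which the variable was drawn. All function and parameter names are hypothetical.

```python
import random
import statistics


def welch_t(x, y):
    """Welch t statistic for two lists of numbers (0.0 if there is no spread)."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se if se > 0 else 0.0


def stable_variables(control, treated, n_iter=200, sample_frac=0.7,
                     var_frac=0.5, top_n=2, min_freq=0.9, seed=1):
    """Return (selected variable indices, per-variable top frequency).

    `control` and `treated` are lists of samples; each sample is a list
    with one value per variable.
    """
    rng = random.Random(seed)
    n_vars = len(control[0])
    n_c = max(2, round(sample_frac * len(control)))
    n_t = max(2, round(sample_frac * len(treated)))
    n_v = max(1, round(var_frac * n_vars))
    drawn = [0] * n_vars   # how often each variable was in a subsample
    in_top = [0] * n_vars  # how often it ranked among the top_n
    for _ in range(n_iter):
        # Perturb the data: subsample samples as well as variables.
        c_rows = rng.sample(control, n_c)
        t_rows = rng.sample(treated, n_t)
        cols = rng.sample(range(n_vars), n_v)
        score = {j: abs(welch_t([r[j] for r in c_rows],
                                [r[j] for r in t_rows])) for j in cols}
        for j in cols:
            drawn[j] += 1
        for j in sorted(cols, key=score.get, reverse=True)[:top_n]:
            in_top[j] += 1
    freq = [in_top[j] / drawn[j] if drawn[j] else 0.0 for j in range(n_vars)]
    selected = [j for j in range(n_vars) if freq[j] >= min_freq]
    return selected, freq
```

The subsampling fractions and the consistency cut-off correspond to the "Parameters" line above; the t statistic could be swapped for PCLDA or PLSDA coefficients without changing the surrounding loop.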

(13)

SBBS results

Simulations based on real apple data

[Figure: two panels for Scenario E, TPR (0.5 to 1.0) against FPR (0.0 to 0.5). Panel 1: PCLDA (4 PCs) and t-test, each with CBBS and SBBS. Panel 2: PLSDA (4 LVs) and t-test, each with CBBS and SBBS.]

(14)

Biomarker selection – conclusions

Both SBBS and HC form viable alternatives

Much better than “standard” thresholds

HC: very few parameters

... but time-consuming

SBBS: faster

... but more settings

... and you need > 10 replicates

Both implemented in R package BioMark, available from http://cran.r-project.org

(15)

Acknowledgements

BDM: Pietro Franceschi, Matthias Scholz, Nir Shahaf, Yonghui Dong, Nikola Dordevic

FQAN: Fulvio Mattivi, Urska Vrhovsek, Domenico Masuero