• Non ci sono risultati.

RNA sequencing data: biases and normalization

N/A
N/A
Protected

Academic year: 2021

Condividi "RNA sequencing data: biases and normalization"

Copied!
1
0
0

Testo completo

(1)

EMBnet.

journal

18.A

P

OSTERS

99

RNA sequencing data: biases and normalization

F. Finotello1 , E. Lavezzo2, L. Barzon2, P. Fontana3, A. Si-Ammour3, S. Toppo2, B. Di Camillo1

1Department of Information Engineering, University of Padova, Italy 2Department of Molecular Medicine, University of Padova, Italy 3Edmund Mach Foundation, San Michele all’Adige, Trento, Italy

Motivations

In recent years, RNA sequencing (RNA-seq) has rapidly become the method of choice for meas-uring and comparing gene transcription levels. Despite its wide application, it is now clear that this methodology is not free from biases and that a careful normalization procedure is the basis for a correct data interpretation. The most common normalization techniques account for: library size, gene or transcript length and sequence-specific biases such as GC-content effects. The aim of the present work is to investigate biases affecting RNA seq data and their effect on differential ex-pression analysis. In order to reduce biases due to over-simplification of gene transcription mod-els, we consider exon-based counts.

Methods

We two used publicly available RNA-seq data sets from two-group comparison studies which are characterized by multiple technical rep-licates. We summarized read counts at exon level and investigated their dependence on sequence-specific covariates: GC-content and exon length. In addition, we considered the ef-fect of library size correction on between-groups comparison and the impact of the above men-tioned biases on the detection of differentially expressed exons. The assessment is performed on raw data, as well as on data normalized with different approaches: RPKM [1], library size scal-ing, based on Trimmed Mean of M-values (TMM) [2] and on Poisson goodness-of-fit statistic ap-plied to non differentially expressed genes [3], and within-lane normalization based on loess re-gression of log-counts on GC-content and exon length [4]. We selected differentially expressed exons using the GLM-based version of edgeR [5] as it can consider an “offset” matrix which

codi-fies counts normalization, that can be computed with the desired approach, and library size scal-ing factors specified by the user.

Results

In our study, read counts show a significant pendence on exon length and a moderate de-pendence on GC-content. Exon length bias also affects differential expression analysis: longer exons tend to have lower P-values and to be se-lected as differentially expressed more frequent-ly than shorter exons. The tested normalization techniques do not completely remove biases and, in particular, RPKM approach over-corrects for exon length bias. Moreover, the choice of the strategy for library size adjustment has a great im-pact on the direction of the detected differential expression. The results obtained on these data sets demonstrate that RNA-seq data normaliza-tion is still an open issue. Further efforts should be directed towards the clarification of the rela-tionship between read counts and sequence-specific biases, which are, in turn, correlated to each other, and the definition of new models for their correction.

References

1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature methods. 2008;5(7):621-8.

2. Robinson MD, Oshlack A. A scaling normalization meth-od for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

3. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2011.

4. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-seq data. BMC Bioinformatics. 2011;12(1):480.

5. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 2012.

N

e

xt

G

e

n

e

ra

ti

o

n

S

e

q

u

e

n

c

in

g

Riferimenti

Documenti correlati

The two opposite pathways (fragmentation vs. oligomerization) are closely reflected in the trends of the relevant EEM spectra: a blue-shift has been observed upon

In order to distinguish individuals belonging to the two original populations, three strategies were carried out: EvaBx third instar nymphs caged with EvaTo adults (Experiment 1);

The results of our experimental and numerical research suggest that in the fallow plot (a) infiltration is heterogeneous and could be locally influenced by plant growth, (b)

Simultaneous analysis of DNA content, adult hemoglobin HbA ( α2β2) and fetal hemo- globin HbF (α2γ2) resulted in a robust and sensitive assay, capable of detecting changes at the

2 Also at State Key Laboratory of Nuclear Physics and Technology, Peking University, Beijing, China. 3 Also at Institut Pluridisciplinaire Hubert Curien, Université de

Per quanto riguarda l’istruzione si osserva una costante crescita dei livelli di capitale umano, con un andamento migliore rispetto alla media regionale: se si considera per

to the bimaximal (LC) or tri-bimaximal value, the number of parameters reduces to three, and the Dirac phase δ in the PMNS matrix can be predicted in terms of the PMNS