An overview of RNA-sequencing
Anna Esteve Codina
Functional Bioinformatics Group CNAG, Barcelona
• Situated in the Parc Científic de Barcelona (PCB)
• Funds from the Spanish and Catalan Governments
• Started sequencing operations in March 2010
• Currently 50 staff, >50% informatics
• Directed by Ivo Gut
• www.cnag.eu
Mission
Our mission is to carry out large-scale projects in genome analysis that will
lead to significant improvements in people's health and quality of life, in
collaboration with the Spanish, European and International Research
Community.
Sequencing Platforms
9 Illumina HiSeq2000
2 Illumina HiSeq2500
2 Illumina Genome Analyzer IIx
1 MiSeq
>800 Gbases per day
(equivalent to 8 human genomes at >30x)
Resources at CNAG
Informatics
2.2 petabytes of hard disk storage
950-core cluster supercomputer
10 x 10 Gb/s connection to the Barcelona
Supercomputing Centre (BSC)
CHRONIC LYMPHOCYTIC LEUKEMIA
International Cancer Genome Consortium
Elias Campo and Carlos López-Otín
LYNX
Sequencing the Genome of the Iberian Lynx
José Antonio Godoy
PRIMATES
Elucidating structural variation in primate genomes
Tomàs Marquès
CITRUS
Sequencing the genome of orange varieties
Manuel Talón
BLUEPRINT
Generation of 100 reference epigenomes
Henk Stunnenberg
Projects at CNAG
OUTLINE
Introduction
The technology
Comparison with microarrays
Applications
Challenges
Future perspectives
A real case study
What is responsible for these cell transformations?
INTRODUCTION
The transcriptome is the set of all RNA molecules, including mRNA, rRNA,
tRNA, and other non-coding RNA produced in one or a population of cells
Unlike the genome, which is roughly fixed for a given cell line (excluding
mutations), the transcriptome can vary with external environmental
conditions
The transcriptome reflects the genes that are being actively expressed at
any given time
The study of transcriptomics, also referred to as expression profiling,
examines the expression level of mRNAs in a given cell population, often
using high-throughput techniques based on DNA microarray technology.
The use of next-generation sequencing technology to study the
transcriptome at the nucleotide level is known as RNA-Seq
THE TECHNOLOGY
Platforms
High-throughput sequencers
• Emulsion PCR / pyrosequencing: 1 M reads/run (1 day), 700 bp read length, 700 Mb/run
• Emulsion PCR / sequencing by synthesis / electronic detection: 10 M reads/run (2 h), 100 bp read length, 1 Gb/run
• Bridge amplification / sequencing by synthesis: 3000 M reads/run (8 days), 2x100 bp read length, 600 Gb/run
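The per-run outputs quoted for the three platforms follow from reads per run multiplied by read length. A minimal Python sketch of that arithmetic, using only the figures from the slide; the function name `run_yield_bases` is hypothetical, and the sketch assumes that the 3000 M figure counts read pairs, which is the reading under which the quoted 600 Gb is reproduced.

```python
def run_yield_bases(reads_per_run, read_length_bp, paired=False):
    """Approximate bases per run: reads x read length (x2 for paired-end)."""
    mates = 2 if paired else 1
    return reads_per_run * read_length_bp * mates

# Figures taken from the platform comparison on the slide
pyro = run_yield_bases(1_000_000, 700)
electronic = run_yield_bases(10_000_000, 100)
# Assumption: 3000 M counts read pairs, each pair giving 2 x 100 bp
bridge = run_yield_bases(3_000_000_000, 100, paired=True)

for name, bases in [
    ("emulsion PCR / pyrosequencing", pyro),
    ("emulsion PCR / electronic detection", electronic),
    ("bridge amplification", bridge),
]:
    print(f"{name}: {bases / 1e9:g} Gb per run")
```

Dividing by the run time gives a rough throughput per day, which is where the platforms differ most.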
rRNA is the most abundant RNA species and needs to be removed prior to library
preparation
Split reads
ALIGNMENT & ASSEMBLY
What real data looks like
COMPARISON WITH MICROARRAYS
Unlike hybridization-based approaches, RNA-Seq is not limited to detecting transcripts that correspond to existing genomic sequence
Attractive for non-model organisms with genomic sequences that are yet to be determined
Can reveal the precise location of transcription boundaries, to a single-base resolution
RNA-Seq can also reveal sequence variations (for example, SNPs) in the transcribed regions
It has a large dynamic range of expression levels over which transcripts can be detected (>10,000 fold)
DNA microarrays lack sensitivity for genes expressed either at low or very high levels
Requires less RNA as starting material
RNA-Seq does not have an upper limit for quantification
Detection of alternative splicing, ability to detect different isoforms
Ability to detect allele-specific expression
APPLICATIONS
Quantification
Expression analysis: compare between conditions
Allele-specific expression (imprinting)
Annotation
To catalogue all species of transcripts (mRNA, lncRNAs, microRNAs, snoRNAs…)
Improve annotation (transcription start sites, 5’ and 3’ ends, alternative polyA sites)
Novel gene detection
Alternative splicing, novel isoforms
Fusion transcripts
Characterization of non-model organisms (de novo transcriptome)
Metatranscriptomics
RNA-SEQ IMPROVES ANNOTATION
DETECTION OF ALTERNATIVE SPLICING
More than 90% of human multi-exonic genes are alternatively spliced
It is crucial to catalogue the complete repertoire of splicing events and to understand how altered splicing patterns contribute to development, cell differentiation and disease
Counting the number of reads mapping to each exon and spanning each splice junction allows the splice efficiency of each junction to be determined and the levels of distinct isoforms to be quantified
Examining splicing patterns and transcript connectivity in an unbiased and genome-wide manner requires full-length transcript sequences, which may be enabled in the future by emerging technologies
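As an illustration of how junction-spanning read counts can be turned into a splicing measure, the sketch below computes a simplified "percent spliced in" (PSI) value for one cassette exon. This is a common textbook estimator and not necessarily the method behind the slides; the function name and the read counts are hypothetical.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI = inclusion / (inclusion + exclusion); None if no junction reads."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total

# Hypothetical junction read counts for one cassette exon
upstream = 30    # reads spanning the junction into the exon
downstream = 34  # reads spanning the junction out of the exon
skipping = 16    # reads spanning the junction that skips the exon

# Inclusion is supported by two junctions, so average them
inclusion = (upstream + downstream) / 2
psi = percent_spliced_in(inclusion, skipping)
print(f"PSI = {psi:.2f}")
```

A PSI close to 1 means the exon is included in most transcripts, while a value close to 0 means it is mostly skipped.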
Gautier Koscielny - Applied Bioinformatics to Plant Sciences Thursday, December 16, 2010
. Nat Genet. 42(12):1060-7 (2010)
DETECTION OF ALTERNATIVE SPLICING
Example from a mosquito RNA-Seq study
5’ extension of an
existing gene model
Novel gene alternatively spliced
between male and female
G. Koscielny (ECCB 2010)
NOVEL TRANSCRIPTION
The novel transcribed regions combined with many undiscovered novel splicing variants suggest that there is considerably more transcript complexity than previously appreciated
ANTISENSE TRANSCRIPTION
Transcriptomic studies have revealed a pervasive presence of antisense transcription events
It is now clear that they are functional; previously they were considered to reflect biological or technical noise
There is increasing interest in profiling transcriptomes at greater depth to fully characterize sense and
antisense transcription
Strand-specific RNA libraries yield information about transcript orientation, which is especially valuable
for regions with overlapping genes
GENE FUSION DETECTION
The Philadelphia chromosome is the product of an aberrant translocation associated with chronic
myelogenous leukemia (CML)
Gene fusions are of particular importance for cancer research (BCR-ABL, TEL-AML1, AML1-ETO, TMPRSS2-ERG)
GENE FUSION DETECTION
False positives can arise from template switching during reverse transcription and amplification
This is alleviated with longer reads and sufficient throughput
SNP DETECTION
DETECTION OF ALLELE-SPECIFIC EXPRESSION
DETECTION OF RNA-EDITING
help to understand the molecular basis of phenotypic variation
DIFFERENTIAL EXPRESSION ANALYSIS
The goal of differential expression analysis is to identify genes that change in abundance between conditions
In certain situations, gene-level count-based methods may not recover true
differential expression (DE) when some isoforms of a gene are up-regulated and others are down-regulated
Different summarization strategies will result in the inclusion or exclusion of
different sets of reads in the table of counts
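The isoform caveat can be made concrete with a toy example. The sketch below uses invented read counts for a gene with two isoforms that switch between conditions; the names, counts and pseudocount are all hypothetical, and no real differential expression tool is being reproduced.

```python
import math

# Hypothetical read counts per isoform of one gene in two conditions
counts = {
    "isoform_A": {"control": 400, "treated": 100},
    "isoform_B": {"control": 100, "treated": 400},
}

def log2_fold_change(treated, control, pseudocount=1):
    """log2 ratio with a pseudocount to avoid division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Gene-level summarization: sum the reads of all isoforms
gene_control = sum(c["control"] for c in counts.values())
gene_treated = sum(c["treated"] for c in counts.values())
gene_lfc = log2_fold_change(gene_treated, gene_control)

# Isoform-level view of the same data
isoform_lfc = {
    name: log2_fold_change(c["treated"], c["control"])
    for name, c in counts.items()
}

print(f"gene-level log2FC: {gene_lfc:+.2f}")
for name, lfc in isoform_lfc.items():
    print(f"{name} log2FC: {lfc:+.2f}")
```

At the gene level the two conditions look identical, while each isoform changes roughly four-fold in opposite directions, so a gene-level table of counts would miss the change.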
CHALLENGES
Template switching during cDNA synthesis, leading to artefactual chimeric RNA
Reverse transcriptases lack proofreading mechanisms
Non-uniformity of coverage
Read mapping uncertainty (sequencing errors, repetitive elements, gene families)
The polyA mRNA enrichment step may also enrich for other RNA degradation products
UNEQUAL DISTRIBUTION OF COVERAGE
Larger RNA molecules must be fragmented into smaller pieces (200–500 bp) to be compatible with most deep-sequencing technologies
Common fragmentation methods include RNA fragmentation (RNA hydrolysis or
nebulization) and cDNA fragmentation (DNase I treatment or sonication)
Each of these methods creates a different bias in the outcome
For example, RNA fragmentation has little bias over the transcript body, but is depleted for transcript ends compared with other methods
Conversely, cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3′ ends of transcripts, and thereby provides valuable information about the precise identity of these ends.
FUTURE PERSPECTIVES
Direct RNA sequencing
No PCR biases
Giant DNA molecules (>5 kb)
Genome assembly
Structural variants (CNV)
Full-length transcripts
Even “homogeneous” groups of dendritic cells can respond very differently from one another, as evidenced by the variability in expression of the Cxcl1 gene seen here. Cells are outlined in grey, and Cxcl1 expression appears in magenta.
Single Cell RNA-seq (CNAG)
Fluidigm
STUDYING THE PIG TRANSCRIPTOME
➤ At that time, there were no RNAseq studies in livestock transcriptomes despite their socio-economic interest; we provided the first in pigs.
➤ We aimed to study the relation between extreme phenotypic differences and their transcriptome patterns, as well as to improve pig genome annotation (novel ...)
MATERIAL & METHODS
Large White: international commercial line
Iberian: traditional unimproved breed
➤ Both pigs housed under the same conditions and prepubescent at slaughter time

Large White          Iberian
Very lean            Very fat
High prolificacy     Low prolificacy
Rapid growth         Slow growth
Production farms     Poor housing

MATERIAL & METHODS
RNA-Seq of polyA RNA from pig male gonads
1 lane Illumina GA IIx
50 bp paired-end reads
Tagged samples
Mapping / aligning to the assembly 9 reference genome (TopHat)
Assembly of transcripts (Cufflinks)
Quantification of transcriptome (Cufflinks vs DEGseq)
Comparison with microarrays
Differential expression analysis
Improve gene annotation
MAIN RESULTS: GENOME ANNOTATION

Reads                                     Percentage
Annotated exons                           44.1%
Annotated introns                         18.7%
5' upstream / 3' downstream               26.6%

Assembled transcripts                     Percentage
Exact match with annotated exons          1.2%
Intergenic transcripts                    36.1%
Intron retention events                   35.6%
Contained in known isoforms               12.5%
Pre-mRNA molecules                        6.2%
Polymerase run-on fragments               3.6%
Putative novel isoforms of known genes    2.9%
➤ Orthologs & paralogs annotation
714 putative novel coding genes
Homo sapiens: 382 (344 known, 38 novel predicted)
Bos taurus: 393 (378 known, 15 novel predicted)
Sus scrofa: 89 known, 653 novel predicted
362
➤ Transposable elements
➤ Only 3% of the total amount of TE in the pig genome are expressed in male pig gonads
➤ However, they constitute approximately 20% of the pig gonad transcriptome
➤ 16% of protein-coding genes contain TE in their sequence
➤ DNA transposons are more active, but LINEs are more expressed in Iberian than in Large White
➤ lncRNA annotation
➤ 2047 putative lncRNAs were detected
➤ According to a conservation study, they can be classified into 3 categories
➤ 469 lncRNAs conserved across all mammals
➤ 322 conserved among Artiodactyla
➤ The rest are pig-specific (check biological relevance)
Figure: query lncRNAs compared against 18 mammalian genomes (no homolog, low similarity, high similarity)
MAIN RESULTS: QUANTIFICATION OF EXPRESSION
➤ Correlation of gene expression between both breeds is rather high (r=0.85)
➤ Highly expressed genes: heat shock proteins, ribosomal proteins and apoproteins
➤ 13,000 annotated genes expressed in gonads; the majority of the genes (90%) are mildly expressed
➤ Two methods for gene expression quantification: DEGseq (uniquely mapped reads) and Cufflinks (ambiguously mapped reads); the former underestimates the expression of gene paralogs
➤ Correlation with microarrays is quite high (r=0.71)
Figure: percentage of genes per expression class (>10000, 1000-10000, 100-1000, 10-100 and 1-10 FPKM); bar labels: 1%, 5%, 50%, 40%, 3%
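The FPKM units used for the expression classes above stand for fragments per kilobase of transcript per million mapped fragments. The sketch below applies the plain formula to invented numbers; Cufflinks itself estimates abundances with a statistical model that also handles multi-mapped reads, so this is only an illustration of the unit, and the function name and inputs are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # fragments / (length in kb) / (library size in millions)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Hypothetical example: 500 fragments on a 2 kb transcript,
# in a library with 20 million mapped fragments
value = fpkm(500, 2_000, 20_000_000)
print(f"{value:.1f} FPKM")
```

Normalizing by both transcript length and library size is what makes values comparable across genes and across samples of different sequencing depth.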
➤ Intersection of differentially expressed genes
➤ Over-representation of gene ontologies
➤ In agreement with the extreme phenotypic characteristics of the Iberian & Large White breeds, in terms of prolificacy, growth and fat deposition
Reproduction
Developmental process
Fatty acid metabolic process
Figure: Venn diagram of differentially expressed genes detected by RNAseq and by microarrays (region labels: 2651, 256, 219)
SUMMARY
➤ A high proportion of the pig male gonad transcriptome is made of transposable elements, in agreement with mouse germline and human brain studies.
➤ We confirm the incomplete annotation of the pig reference genome: the majority of the reads mapped outside the exon boundaries. We found several novel expressed transcripts in intergenic regions, some of them being protein-coding genes (PCG) with human and cow orthologs, others being putative long non-coding RNAs.
➤ Both the Iberian and Large White transcriptomes showed a high correlation of gene expression (r=0.85), showing that the transcriptome is rather conserved between breeds.
➤ The correlation between RNAseq and microarrays is quite high (r=0.71)
➤ Differentially expressed genes between both breeds are over-represented in spermatogenesis and lipid metabolism