An overview to RNA-sequencing (Seminaris Tecnològics 2014)

(1)

An overview to RNA-sequencing

Anna Esteve Codina

Functional Bioinformatics Group CNAG, Barcelona

(2)

• Situated in the Parc Científic de Barcelona (PCB)

• Funds from the Spanish and Catalan Governments

• Started sequencing operations in March 2010

• Currently 50 staff, >50% informatics

• Directed by Ivo Gut

• www.cnag.eu

Mission

Our mission is to carry out large-scale projects in genome analysis that will lead to significant improvements in people's health and quality of life, in collaboration with the Spanish, European and international research community.

(3)

Sequencing Platforms

9 Illumina HiSeq2000

2 Illumina HiSeq2500

2 Illumina Genome Analyzer IIx

1 MiSeq

>800 Gbases per day (equivalent to 8 human genomes at >30x)
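The equivalence quoted above can be checked with a rough calculation. This is only a sketch: it assumes a haploid human genome size of about 3.1 Gbases, a figure that is not on the slide, and the function name is illustrative.

```python
# Rough check: how many human genomes at a given depth fit in one day's yield?
HUMAN_GENOME_GBASES = 3.1  # assumption: approximate haploid human genome size


def genomes_per_day(daily_yield_gbases, depth, genome_gbases=HUMAN_GENOME_GBASES):
    """Return how many genomes can be covered at `depth`x with the daily yield."""
    return daily_yield_gbases / (depth * genome_gbases)


if __name__ == "__main__":
    # 800 Gbases per day at 30x coverage
    print(round(genomes_per_day(800, 30), 1))
```

Under that assumption the result falls between 8 and 9 genomes per day, consistent with the figure on the slide.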

Resources at CNAG

Informatics

2.2 petabytes of hard disk storage

950-core cluster supercomputer

10 x 10 Gb/s connection to the Barcelona Supercomputing Centre (BSC)

(4)

CHRONIC LYMPHOCYTIC LEUKEMIA

International Cancer Genome Consortium

Elias Campo and Carlos López-Otín

LYNX

Sequencing the Genome of the Iberian Lynx

José Antonio Godoy

PRIMATES

Elucidating structural variation in primate genomes

Tomàs Marquès

CITRUS

Sequencing the genome of orange varieties

Manuel Talón

BLUEPRINT

Generation of 100 reference epigenomes

Henk Stunnenberg

Projects at CNAG

(5)

OUTLINE



• Introduction
• The technology
• Comparison with microarrays
• Applications
• Challenges
• Future perspectives
• A real case study

(6)

What is responsible for these cell transformations?

INTRODUCTION

(7)



• The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and other non-coding RNA, produced in one cell or a population of cells
• Unlike the genome, which is roughly fixed for a given cell line (excluding mutations), the transcriptome can vary with external environmental conditions
• The transcriptome reflects the genes that are being actively expressed at any given time
• The study of transcriptomics, also referred to as expression profiling, examines the expression level of mRNAs in a given cell population, often using high-throughput techniques based on DNA microarray technology
• The use of next-generation sequencing technology to study the transcriptome at the nucleotide level is known as RNA-Seq


(8)


(9)

THE TECHNOLOGY

(10)

Platforms


• Emulsion PCR / pyrosequencing: 1 M reads/run (1 day), 700 bp read length, 700 Mb/run
• Emulsion PCR / sequencing by synthesis / electronic detection: 10 M reads/run (2 h), 100 bp read length, 1 Gb/run
• Bridge amplification / sequencing by synthesis: 3000 M reads/run (8 days), 2x100 bp read length, 600 Gb/run
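The yield per run quoted for each platform follows from the number of reads multiplied by the read length. The sketch below reproduces that arithmetic; the function and variable names are illustrative, and it assumes that the 3000 M figure for the paired-end platform counts read pairs, since that is what matches the 600 Gb quoted.

```python
def yield_in_bases(reads_millions, read_length_bp, paired=False):
    """Approximate bases per run: reads x read length (x2 for paired-end)."""
    reads = reads_millions * 1_000_000
    ends = 2 if paired else 1
    return reads * read_length_bp * ends


# (chemistry, reads per run in millions, read length in bp, paired-end?)
platforms = [
    ("Emulsion PCR / pyrosequencing", 1, 700, False),
    ("Emulsion PCR / sequencing by synthesis / electronic detection", 10, 100, False),
    ("Bridge amplification / sequencing by synthesis", 3000, 100, True),
]

if __name__ == "__main__":
    for name, reads_m, length, paired in platforms:
        gigabases = yield_in_bases(reads_m, length, paired) / 1e9
        print(f"{name}: {gigabases:g} Gb per run")
```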

(11)

High-throughput sequencer

rRNA is the most abundant RNA species and needs to be removed prior to library preparation


(12)

Split reads


(13)

ALIGNMENT & ASSEMBLY


(14)

What real data looks like


(15)

• Unlike hybridization-based approaches, RNA-Seq is not limited to detecting transcripts that correspond to existing genomic sequence
• Attractive for non-model organisms with genomic sequences that are yet to be determined
• Can reveal the precise location of transcription boundaries, to single-base resolution
• RNA-Seq can also reveal sequence variations (for example, SNPs) in the transcribed regions
• It has a large dynamic range of expression levels over which transcripts can be detected (>10,000-fold)
• DNA microarrays lack sensitivity for genes expressed either at low or very high levels
• Requires less RNA as starting material
• RNA-Seq does not have an upper limit for quantification
• Detection of alternative splicing, ability to detect different isoforms
• Ability to detect allele-specific expression

COMPARISON WITH MICROARRAYS

(16)


(17)

Quantification
• Expression analysis: compare between conditions
• Allele-specific expression (imprinting)

Annotation
• To catalogue all species of transcripts (mRNA, lncRNAs, microRNAs, snoRNAs…)
• Improve annotation (transcription start sites, 5’ and 3’ ends, alternative polyA sites)
• Novel gene detection
• Alternative splicing, novel isoforms
• Fusion transcripts
• Non-model organisms characterization (de novo transcriptome)
• Metatranscriptomics

APPLICATIONS

(18)

RNA-SEQ IMPROVES ANNOTATION


(19)

IMPROVING ANNOTATION

(20)


(21)

DETECTION OF ALTERNATIVE SPLICING

• More than 90% of human multi-exonic genes are alternatively spliced
• It is crucial to catalogue the complete repertoire of splicing events and to understand how altered splicing patterns contribute to development, cell differentiation and disease
• Counting the number of reads mapping to each exon and spanning each splice junction allows the splice efficiency of each junction to be determined and the levels of distinct isoforms to be quantified
• Examination of splicing patterns and transcript connectivity in an unbiased and genome-wide manner requires full-length transcript sequences to be obtained, which may be enabled in the future with emerging technologies
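As an illustration of how junction-spanning read counts can be turned into a splicing estimate, the sketch below computes a "percent spliced in" style ratio for a single cassette exon. This particular metric is a common convention chosen here for illustration; the slides do not name one, and the function name and read counts are hypothetical.

```python
def percent_spliced_in(upstream_junction, downstream_junction, skipping_junction):
    """Fraction of transcripts that include a cassette exon.

    upstream_junction / downstream_junction: reads spanning the two junctions
    that join the cassette exon to its flanking exons (inclusion evidence).
    skipping_junction: reads spanning the junction that joins the flanking
    exons directly (exclusion evidence).
    """
    # Inclusion is supported by two junctions, so average them to keep the
    # comparison with the single skipping junction fair.
    inclusion = (upstream_junction + downstream_junction) / 2
    total = inclusion + skipping_junction
    if total == 0:
        return None  # no junction reads: the ratio is undefined
    return inclusion / total


if __name__ == "__main__":
    # Hypothetical counts: 30 and 30 inclusion reads, 10 skipping reads
    print(percent_spliced_in(30, 30, 10))
```

With short reads this only describes individual splicing events; it does not recover which events co-occur on the same transcript, which is the limitation the last bullet points to.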


(22)


(23)

Gautier Koscielny - Applied Bioinformatics to Plant Sciences Thursday, December 16, 2010

Nat Genet. 42(12):1060-7 (2010)

DETECTION OF ALTERNATIVE SPLICING


(24)

Example from a mosquito RNA-Seq study


5’ extension of an existing gene model

Novel gene alternatively spliced between male and female

G. Koscielny (ECCB 2010)


(25)

NOVEL TRANSCRIPTION

The novel transcribed regions combined with many undiscovered novel splicing variants suggest that there is considerably more transcript complexity than previously appreciated


(26)

ANTISENSE TRANSCRIPTION

• Transcriptomic studies revealed a pervasive presence of antisense transcription events
• It is now clear that they are functional; previously they were considered to reflect biological or technical noise
• Increasing interest in profiling transcriptomes at greater depth to fully characterize sense and antisense transcription
• Strand-specific RNA libraries yield information about transcript orientation, which is especially valuable for regions with overlapping genes


(27)

ANTISENSE TRANSCRIPTION


(28)

GENE FUSION DETECTION

• The Philadelphia chromosome results from an aberrant translocation associated with chronic myelogenous leukemia (CML)
• Of particular importance for cancer research (BCR-ABL, TEL-AML1, AML1-ETO, TMPRSS2-ERG)

(29)

GENE FUSION DETECTION



• False positives arise from template switching during reverse transcription and amplification
• Alleviated with longer reads and sufficient throughput


(30)

SNP DETECTION


(31)

DETECTION OF ALLELE-SPECIFIC EXPRESSION


(32)

DETECTION OF ALLELE-SPECIFIC EXPRESSION


(33)

DETECTION OF RNA-EDITING


(34)


(35)


(36)

The goal of differential expression analysis is to identify genes that change in abundance between conditions, which helps to understand the molecular basis of phenotypic variation

(37)

DIFFERENTIAL EXPRESSION ANALYSIS

• In certain situations, gene-level count-based methods may not recover true differential expression (DE) when some isoforms of a gene are up-regulated and others are down-regulated
• Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts
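A toy example may make the isoform problem concrete. The counts below are invented for illustration only: two isoforms of one gene swap abundance between conditions, so the gene-level total is unchanged even though each isoform changes nine-fold.

```python
# Hypothetical read counts per isoform of a single gene in two conditions.
isoform_counts = {
    "condition_A": {"isoform_1": 900, "isoform_2": 100},
    "condition_B": {"isoform_1": 100, "isoform_2": 900},
}


def gene_level_count(isoforms):
    """Summarize a gene by adding up the counts of all its isoforms."""
    return sum(isoforms.values())


def fold_change(count_a, count_b):
    """Ratio of condition B over condition A."""
    return count_b / count_a


gene_fc = fold_change(
    gene_level_count(isoform_counts["condition_A"]),
    gene_level_count(isoform_counts["condition_B"]),
)
isoform_fc = {
    name: fold_change(
        isoform_counts["condition_A"][name],
        isoform_counts["condition_B"][name],
    )
    for name in isoform_counts["condition_A"]
}

if __name__ == "__main__":
    print("gene-level fold change:", gene_fc)
    print("isoform-level fold changes:", isoform_fc)
```

A gene-level test sees a fold change of 1 here and reports nothing, while an isoform-level summarization would flag both isoforms.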


(38)



• Template switching during cDNA synthesis leading to artefactual chimeric RNA
• Reverse transcriptases lack proofreading mechanisms
• Non-uniformity of coverage
• Read mapping uncertainty (sequencing errors, repetitive elements, gene families)
• PolyA mRNA enrichment step may also enrich for other RNA degradation products

CHALLENGES

(39)

UNEQUAL DISTRIBUTION OF COVERAGE

• Larger RNA molecules must be fragmented into smaller pieces (200–500 bp) to be compatible with most deep-sequencing technologies
• Common fragmentation methods include RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication)
• Each of these methods creates a different bias in the outcome
• For example, RNA fragmentation has little bias over the transcript body, but is depleted for transcript ends compared with other methods
• Conversely, cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3′ ends of transcripts, and thereby provides valuable information about the precise identity of these ends
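One simple way to put a number on this kind of positional bias is to compare mean coverage near the two ends of a transcript. The sketch below is only an illustration: the function name, the 20% window and the coverage values are assumptions, not part of the original material.

```python
def end_bias(coverage, fraction=0.2):
    """Ratio of mean 3' coverage to mean 5' coverage for one transcript.

    `coverage` is a list of per-position read depths ordered 5' -> 3'.
    A value near 1 suggests even coverage; a value well above 1 suggests
    a 3' bias.
    """
    window = max(1, int(len(coverage) * fraction))
    five_prime = sum(coverage[:window]) / window
    three_prime = sum(coverage[-window:]) / window
    if five_prime == 0:
        return float("inf")  # no 5' coverage at all
    return three_prime / five_prime


if __name__ == "__main__":
    uniform = [3] * 10                          # hypothetical even coverage
    skewed = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]     # hypothetical 3'-biased coverage
    print(end_bias(uniform), end_bias(skewed))
```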


(40)



• Direct RNA sequencing
• No PCR biases
• Giant DNA molecules (>5 kb)
• Genome assembly
• Structural variants (CNV)
• Full-length transcripts

FUTURE PERSPECTIVES

(41)

Even “homogeneous” groups of dendritic cells can respond very differently from one another, as evidenced by the variability in expression of the Cxcl1 gene seen here. Cells are outlined in grey, and Cxcl1 expression appears in magenta.


(42)

Single Cell RNA-seq (CNAG)

Fluidigm

(43)

(44)

(45)

STUDYING THE PIG TRANSCRIPTOME

➤ At that time, there were no RNA-seq studies of livestock transcriptomes despite their socio-economic interest; we provided the first in pigs.

➤ We aimed to study the relation between extreme phenotypic differences and their transcriptome patterns, as well as to improve pig genome annotation (novel

(46)

MATERIAL & METHODS

Large White (international commercial line): very lean, high prolificacy, rapid growth, production farms

Iberian (traditional unimproved breed): very fat, low prolificacy, slow growth, poor housing

➤ Both pigs were housed under the same conditions and were prepubescent at slaughter time

(47)

MATERIAL & METHODS

➤ RNA-seq of polyA RNA from pig male gonads
➤ 1 lane Illumina GA IIx, 50 bp paired-end reads, tagged samples
➤ Mapping / aligning: TopHat, assembly 9 reference genome
➤ Assembly of transcripts: Cufflinks
➤ Quantification of transcriptome: Cufflinks vs DEGseq
➤ Comparison with microarrays, differential expression analysis, improve gene annotation

(48)

Reads                                     Percentage
Annotated exons                           44.1%
Annotated introns                         18.7%
5' upstream / 3' downstream               26.6%

Assembled transcripts                     Percentage
Exact match with annotated exons          1.2%
Intergenic transcripts                    36.1%
Intron retention events                   35.6%
Contained in known isoforms               12.5%
Pre-mRNA molecules                        6.2%
Polymerase run-on fragments               3.6%
Putative novel isoforms of known genes    2.9%

MAIN RESULTS: GENOME ANNOTATION

(49)

➤ Orthologs & paralogs annotation

714 putative novel coding genes

Species         Total   Known   Novel predicted
Homo sapiens    382     344     38
Bos taurus      393     378     15
Sus scrofa              89      653

MAIN RESULTS: GENOME ANNOTATION

362

(50)

➤ Transposable elements
  ➤ Only 3% of the total amount of TE in the pig genome are expressed in male pig gonads
  ➤ However, they constitute approximately 20% of the pig gonad transcriptome
  ➤ 16% of protein-coding genes contain TE in their sequence
  ➤ DNA transposons more active, but LINEs more expressed in Iberian than Large White
➤ lncRNA annotation
  ➤ 2047 putative lncRNAs were detected
  ➤ In a conservation study they can be classified into 3 categories
    ➤ 469 lncRNAs conserved across all mammals
    ➤ 322 conserved among Artiodactyla
    ➤ The rest are pig-specific (check biological relevance)

MAIN RESULTS: GENOME ANNOTATION

[Figure: query lncRNAs compared against 18 mammalian genomes; categories: no homolog, low similarity, high similarity]

(51)

➤ Correlation of gene expression between both breeds is rather high (r=0.85)
➤ Highly expressed genes: heat shock proteins, ribosomal proteins and apoproteins
➤ 13,000 annotated genes are expressed in gonads; the majority of the genes (90%) are mildly expressed
➤ Two methods for gene expression quantification: DEGseq (uniquely mapped reads) and Cufflinks (ambiguously mapped reads); the former underestimates the expression of gene paralogs
➤ Correlation with microarrays is quite high (r=0.71)

[Figure: percentage of expressed genes per expression bin (>10000, 1000-10000, 100-1000, 10-100, 1-10 FPKM); bar values as extracted: 1%, 5%, 50%, 40%, 3%]

MAIN RESULTS: QUANTIFICATION OF EXPRESSION
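The expression bins on this slide are given in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows the standard FPKM calculation and assigns a value to one of the slide's bins. The function names, the example numbers and the handling of bin boundaries are assumptions made for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def expression_bin(value):
    """Assign an FPKM value to one of the bins shown on the slide.

    Boundary handling (which bin gets exactly 10, 100, ...) is an assumption.
    """
    if value > 10000:
        return ">10000 FPKM"
    if value >= 1000:
        return "1000-10000 FPKM"
    if value >= 100:
        return "100-1000 FPKM"
    if value >= 10:
        return "10-100 FPKM"
    if value >= 1:
        return "1-10 FPKM"
    return "<1 FPKM"


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments, 2 kb transcript, 10 M mapped fragments
    value = fpkm(500, 2000, 10_000_000)
    print(value, expression_bin(value))
```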

(52)

➤ Intersection of differentially expressed genes

[Venn diagram: RNAseq vs Microarrays; region counts as extracted: 2651, 256, 219]

➤ Over-representation of gene ontologies: reproduction, developmental process, fatty acid metabolic process

➤ In agreement with the extreme phenotypic characteristics of the Iberian & Large White breeds, in terms of prolificacy, growth and fat deposition

(53)

SUMMARY

➤ A high proportion of the pig male gonad transcriptome is made of transposable elements, in agreement with mouse germline and human brain studies.

➤ We confirm the incomplete annotation of the pig reference genome: the majority of the reads mapped outside exon boundaries. We found several novel expressed transcripts in intergenic regions, some of them being protein-coding genes (PCG) with human and cow orthologs, others being putative long non-coding RNAs.

➤ Both the Iberian and Large White transcriptomes showed a high correlation of gene expression (r=0.85), showing that the transcriptome is rather conserved between breeds.

➤ The correlation between RNA-seq and microarrays is quite high (r=0.71).

➤ Differentially expressed genes between both breeds are over-represented in spermatogenesis and lipid metabolism GO terms, in agreement with their extreme phenotypes in terms of prolificacy.

(54)