• Non ci sono risultati.

Long non-coding RNAs, a novel class of regulatory RNAs: identification and characterization in the model species Brachypodium distachyon

N/A
N/A
Protected

Academic year: 2021

Condividi "Long non-coding RNAs, a novel class of regulatory RNAs: identification and characterization in the model species Brachypodium distachyon"

Copied!
72
0
0

Testo completo

(1)

 

 

 

 

 

 

 

 

 

 

Department of Agriculture, Food and Environment

Master of Science

in

Plant and Microbe Biotechnology

Candidate

Alice Pieri

Long non-coding RNAs, a novel class of

regulatory RNAs:

identification and characterization in the

model species Brachypodium distachyon

Supervisors

Prof. Mario Enrico Pè

Dr. Rodolfo Bernardi

Co-Supervisor

Prof. Andrea Cavallini

(2)
(3)

Table of contents

Abstract ... 2

1 Introduction ... 4

1.1 RNA world ... 4

1.2 Non-coding RNA ... 5

1.3 Long non-coding RNAs ... 9

1.3.1 LncRNAs identification: genomics approaches ... 18

2 Research objectives ... 21

3 Materials and methods ... 22

3.1 RNA-Seq libraries ... 22

3.2 Bioinformatic analysis ... 25

3.3 LncRNAs identification ... 26

3.4 Expression profiles and differential expression analysis ... 26

3.5 Interaction with microRNAs (target mimicry) ... 27

3.6 RNA extraction and reverse transcription ... 28

3.7 LncRNA validation using Real-Time PCR ... 28

4 Results ... 31

4.1 Experimental data set ... 31

4.2 Quality check ... 31

4.3 Alignment to the reference genome ... 36

4.4 Identified lncRNAs ... 39

4.5 LncRNAs expression analysis ... 42

4.6 Predicted endogenous lncRNAs: target mimicry ... 51

5 Discussion ... 53

5.1 Bona fide lncRNAs ... 53

5.2 Expression patterns and differential expression analysis ... 54

5.3 lncRNA-miRNA interaction ... 56

6 Conclusions ... 58

(4)

Abstract

Ninety percent of the eukaryotic genome is transcribed although only a small part corresponds to protein coding mRNAs, suggesting that a large proportion of transcribed RNAs do not code for proteins, hence classified as non-coding RNAs (ncRNAs). High-throughput sequencing technology has allowed the identification and characterization of several classes of ncRNAs with key roles in various biological processes. Among ncRNAs, long ncRNAs (lncRNAs) are transcripts typically longer than 200 nucleotides that tend to be expressed at low levels and exhibit tissue-specific/cell-specific or stress responsive expression profiles. LncRNAs have been identified in animals and in plants as well, where they are involved in different regulatory pathways both in development and stress responses, even if the understanding of molecular basis of these mechanisms remains largely unexplored.

My thesis project aims at identifying lncRNAs in Brachypodium distachyon (Bd), a wild grass belonging to the Pooideae and a model species for temperate cereals, such as wheat and barley.

A whole-genome annotation and a detailed analysis of lncRNAs expression patterns have been performed for the first time in Brachypodium. Moreover the potential lncRNA targets were investigated to highlight new regulatory networks and cross-talk between different RNA molecules.

Public and proprietary RNA-Seq data sets from 15 different experiments conducted in the reference inbred line Bd21 were analysed in this study. Public RNA-Seq data from different experiments, including several plant organs, were downloaded from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). Proprietary RNA-Seq libraries were previously produced by the lab from three developmental leaf areas: proliferation, expansion and mature, grown in control and drought stress conditions. For each proprietary RNA-Seq sample, three biological replicates were produced.

This dataset is characterized by a total of 705 millions reads, which were subjected to a quality analysis. Each experiment was aligned independently to the Bd21 reference genome (v.2.1) using the spliced read aligner TopHat2 and, successively, for each experiment the transcriptome was de novo assembled using Cufflinks.

In order to identify Bd lncRNAs an in house bioinformatic pipeline was used. Briefly, this pipeline applies five filters based on the main lncRNA features: size selection,

(5)

Open Reading Frame filter, known protein domain filter, Coding Potential Calculator, filter of housekeeping lncRNAs and precursors of small RNAs. Starting from the whole set of loci/isoforms (99141) de novo reconstructed, 2507 bona fide lncRNAs were identified.

Bona fide lncRNAs differential expression analysis was taken into account for datasets

with replicates, i.e. proprietary libraries from different developing areas of the third leaf. This analysis revealed that several lncRNAs are differentially expressed during leaf cell differentiation and during drought treatment. Some lncRNAs resulted more abundant in specific plant stages, tissues or organs.

Moreover, a computational method developed to identify endogenous microRNA target mimic (eTM) allowed to investigate the link between lncRNAs and microRNAs through target mimicry, a regulatory mechanism for miRNA functions in plants in which the decoy RNAs bind to miRNAs via complementary sequences and therefore could interfere with the interaction between miRNAs and their authentic targets.

(6)

1 Introduction

1.1 RNA world

RNA plays a central role in the pathway from DNA to proteins, known as the “central dogma” of molecular biology. The classic view of the central dogma of biology states that “the coded genetic information hard-wired into DNA is transcribed into individual transportable cassettes, composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or small number of proteins)” (Lodish et al., 2000).

As a general rule, the classic view of central dogma of biology reflects how the sequence information is organized and transferred between information-carrying biopolymers (DNA, mRNA and proteins) in living organisms. However, many exceptions to this dogma are now known as a result of genomic studies performed during the last decade and supported by new sequencing technologies.

Nowadays we know that up to 90% of eukaryotic genome is transcribed into both protein-coding and non protein-coding RNAs (ncRNAs). Until recently, discrimination between these two categories was relatively straightforward. Most transcripts were clearly identifiable as protein-coding messenger RNAs, which convey the genetic information from DNA to ribosome, and were readily distinguished from the small number of well characterized ncRNAs, with a completely different structure and, anyway, involved in protein biosynthesis, such as transfer, ribosomal and spliceosomal RNAs.

Recently, genome-wide studies have revealed the existence of thousands of non-coding transcripts (Dinger et al., 2008). First of all, small regulatory RNAs (microRNAs and small interfering RNAs) were discovered and classified in different categories on the basis of length, function, biogenesis, structural features and protein binding partners (Farazi et al., 2008). Small RNAs were found to perform diverse biological functions by guiding sequence-specific gene silencing at transcriptional and/or post-transcriptional level, known as RNA interference (RNAi) (Farazi et al., 2008). Actually, a phenomenon of RNAi was reported for the first time in 1990, by Napoli et al. trying to deepen the colour of petunias. Overexpressing a key enzyme in flavonoid biosynthesis, they obtained white petunias, as result of the turning off of the gene (Napoli et al.,

(7)

1990). Nowadays, RNAi is widely used for systematic analysis of gene function and is under investigation for its potential therapeutic applications (Haussecker, 2014).

Attention is now shifting toward a novel class of ncRNAs, long non-coding RNAs (lncRNAs), increasingly recognized as functional regulatory component in eukaryotic gene regulation. They were discovered in 90s in humans (Brown et al., 1992) and only in 2007 in plants (Franco-Zorrilla et al., 2007).

The FANTOM consortium pioneered the genome-wide discovery of lncRNAs in mouse in 2000s, based on cDNA sequencing (Maeda et al., 2006). ENCODE (Derrien et al., 2012) and NONCODE (Xie et al., 2014) projects followed, especially thanks to RNA Sequencing (RNA-Seq) technology advent, allowing the annotation of novel human and mouse lncRNAs. In plants, the study of lncRNAs is still in its infancy, with a fine annotation only for rice (Zhou et al., 2009), maize (Li et al., 2014), cotton (Wang et al., 2015), Populus (Chen et al., 2015; Shuai et al., 2014) and tomato (Zhu et al., 2015).

1.2 Non-coding RNA

The so called “dark matter” of the genome, i.e. non-coding genome, comprises a diverse group of transcripts:

• “Housekeeping” ncRNAs (ribosomal RNAs, transfer RNAs, small nuclear RNAs and small nucleolar RNAs).

• “Regulatory” ncRNAs:

o Small regulatory RNAs, such as micro RNAs (miRNAs) and small interfering RNAs (siRNAs).

o Long non-coding RNAs (lncRNAs) (Kim and Sung, 2012).

Small regulatory RNAs (sRNAs) are tiny molecules of approximately 20-24 nucleotides in length. They act to fine-tuning gene expression through sequence complementary-dependent mechanisms at both transcriptional and post-transcriptional level, binding complementary target mRNAs, inhibiting their translation and interacting with epigenetic DNA-methylation for RNA-directed DNA methylation (RdDM).

sRNAs are generated via processing of longer double-stranded RNA (dsRNA) precursors by a key RNaseIII-like enzyme termed Dicer, evolutionary conserved in

(8)

Depending on their biogenesis and functions, sRNAs are classified as micro RNAs and small interfering RNA (Finnegan and Matzke, 2003).

siRNAs were first detected in plants in 1999 (Hamilton and Baulcombe, 1999). In general, siRNAs can be derived from all regions of perfect duplex RNAs and, at least in plants, they accumulate in both sense and antisense polarities. Perfect duplex RNAs can be synthetic RNAs, replicating viruses or even the result of the transcription of nuclear genes.

In plants, siRNAs have a variety of functions that can be grouped in at least two broad categories: those that trigger changes in the chromatin state of elements from which they derive and those that derive from and defend against exogenous RNA sequences such as viruses or sense transgene transcripts (Bonnet et al., 2006).

miRNAs originate from specific endogenous genes called MIR genes, preferentially localized within intergenic regions. They derive via Dicer cleavage of imperfect duplex stem-loop RNAs, ~70-200 nt in length.

The biogenesis of plant miRNAs happens within specialized regions of the nucleus, called D-bodies, where RNA polymerase II mediate the production of a primary miRNA (pri-miRNA) transcript. The pri-miRNA is then processed into a shorter stem-loop precursor-miRNA (pre-miRNA) formed by base-paring between self-complementarity regions. Pre-miRNA is processed again, releasing a miRNA/miRNA* duplex, later exported to the cytoplasm, possibly by HASTY (HST), where the miRNA* is usually degraded and the mature miRNA is recruited by RNA-induced silencing complex (RISC) (Voinnet, 2009) (Figure 1.2).

Both siRNAs and miRNAs are loaded into a RISC, and associated with a member of the Argonaute protein family, which has RNA-binding ability.

Through this complex, siRNAs will then bind to the same messenger RNA from which they originate, and cleave the mRNA, silencing its expression. Whereas miRNAs will bind specifically to a target messenger RNA, and guide its cleavage (in most of the cases) or will repress its translation (Bonnet et al., 2006) (Figure 1.1, 1.2).

(9)

Plant miRNAs were discovered for the first time in Arabidopsis thaliana in 2002. Some plant miRNAs were found to be conserved in many plant genomes such as those of

Oryza sativa, Zea mays and those of more ancient vascular plant genera such as ferns or

even nonvascular plants such as mosses (Bonnet et al., 2006). Surprisingly, a miRNA family, 854, has been shown to be expressed in Arabidopsis thaliana, Caenorhabditis

elegans, Mus musculus and Homo sapiens. Interestingly, across these diverse species,

miRNA-854 targets the same mRNA, uridylate binding protein 1b (UBP1b), which normally encodes a member of a heterogeneous nuclear RNA binding protein-1 (hnRBP-1) gene family. This indicates an evolutionary common origin of miRNA-854 as a regulator of the basal eukaryotic transcription mechanism in both plants and animals for many hundreds of millions of years (Pogue et al., 2014). Nevertheless, also non-conserved miRNAs exist. It seems that plant genomes encode more non-conserved miRNA families than conserved miRNA families, such as those of Arabidopsis or those that control cotton fiber differentiation and elongation (Fahlgren et al., 2007; Rajagopalan et al., 2006; Zhang et al., 2006). The non-conserved plant miRNAs presumably emerged and dissipated in short evolutionary time scales (Sunkar and Jagadeeswaran, 2008).

Figure 1.1 Small interfering RNAs (siRNAs).

Long double-stranded RNAs (dsRNAs) from diverse origins (viruses, transpososons, transgenes, etc.) are converted into 21 nt long siRNAs by DICER enzymes. These small RNAs are then loaded into RISC and associated with AGO4 or another Argonaute protein. The complex will then bind to the same messenger RNA from which they originate, and cleave the mRNA, silencing its expression. Small interfering RNAs can also bind to the mRNA and initiate the transformation of single-stranded RNA (ssRNA) into dsRNA, thus amplifying siRNA production. RDR, RNA-dependent RNA polymerase. (Bonnet et al., 2006)

(10)

Figure 1.2 Plant microRNA (miRNA) biogenesis. MicroRNA genes are

transcribed from their own locus by pol-II. The hairpin-like secondary structure is further processed by DICER in several steps to produce miRNA:miRNA* duplexes. The duplexes are then methylated by HEN1, before being exported to the cytoplasm, possibly by HASTY. Here the duplex is unwound and the miRNA is associated with AGO1. This complex, known as RISC, will bind specifically to a target messenger RNA, and guide its cleavage (in most of the cases) or will repress its translation. DCL, Dicer-like; HYL, HYPONASTIC LEAVES. (Bonnet et al., 2006).

(11)

1.3 Long non-coding RNAs

Small RNAs have been largely studied and are well known for their important roles in transcriptional and post-transcriptional regulation. Whereas, only recently, lncRNAs have been identified and characterized in several animal and plant species. In plants, first studies were conducted in Arabidopsis thaliana and published around 2011, whereas the fine annotation of lncRNAs in other plant spices, such as maize and rice, was reported only in 2014 and 2015 (Ding, et al., 2012a; Liu et al., 2012; Lin Li et al., 2014; Zhou et al., 2009; H. Wang et al., 2014a; M. Wang et al., 2015; B. Zhu et al., 2015; Derrien et al., 2012; Iyer et al., 2015). The long time employed in lncRNA study in plant kingdom is understandable if we think about the limited availability of sequenced and annotated genomes (Phytozome v.10.3 http://phytozome.jgi.doe.gov/). LncRNAs are non-coding RNAs longer than 200 nucleotides that, unlike sRNAs, are able to act as regulatory RNA without being processed.

Initial studies about lncRNAs were conducted in animals and, in 90s, Xist (X-inactive specific transcript) was the first identified lncRNA (Brown et al., 1992).

Xist, a lncRNA of 17kb in mouse or 19kb in human, is one of the earliest examples that lncRNAs regulate gene transcription by modifying the chromatin status. In female mammals, one of the two copies of the X chromosome (Xi) is inactivated to maintain the same dosage of gene products as males. Xist, is a major effector in X chromosome silencing (Zhang et al., 2013b). It is specifically transcribed from and coats the Xi in somatic cells. Xist was recently shown to directly interact with Ezh2, the catalytic subunit of the Polycomb repressive complex 2 (PRC2), through a Repeat A motif, resulting in recruitment of PRC2 to the chromosome. PRC2 spreads across and silences genes along the Xi through catalysing the repressive trimethylation of lysine 27 on histone H3 (H3K27). During this process, Xist seems to alter the three-dimensional architecture of Xi to facilitate repositioning of active genes into the repressive compartment (Bergmann and Spector, 2014).

Another important lncRNA characterized in animals is MALAT1, originally identified amongst several genes up-regulated in metastatic non-small cell lung cancer (Ji et al., 2003). Recent studies identified mis-expression and mutations of MALAT1 in several

(12)

other cancers, rendering MALAT1 a tumor marker with potential as a prognostic or therapeutic target (Gutschner et al., 2013).

The advent of RNA-Seq technology has largely stimulated lncRNAs annotation, in fact, the GENCODE consortium within the framework of the ENCODE project recently reported 9277 manually annotated genes producing 14880 transcripts (Derrien et al., 2012). A even more recent study about lncRNAs in human transcriptome revealed 58648 lncRNAs if which 79% were previously unannotated (Iyer et al., 2015).

Despite the large number of lncRNAs identified, there is a strong imbalance between the number of lncRNAs identified and those that have been functionally annotated, such as Xist, MALAT1, Air, KCNQ1ot1, HOTAIR, Frigidair, HOTTIP, PANDA, COOLAIR, COLDAIR, TERRA, DHFR-Minor and few others (Ma et al., 2012).

Compared with human and animals, the study of lncRNAs in plants is still in its infancy.

In 2007 the first lncRNA in plants was discovered, Induced by Phosphate Starvation 1 (IPS1), involved in phosphate uptake through target mimicry. IPS1 is a noncleavable lncRNA that forms a nonproductive interaction with the partially complementary miR-399, preventing it from cleaving its target, PHO2 RNA, which negatively affects shoot Pi content and Pi remobilization (Franco-Zorrilla et al., 2007) (Figure 1.3C).

After that work, there are only few other examples of functionally characterized lncRNAs, such as COOLAIR, COLDAIR and LDMAR (Swiezewski et al., 2009; Heo and Sung, 2011; Ding et al., 2012a).

COOLAIR (cold induced antisense intragenic RNA) is a group of capped, polyadenylated and alternatively spliced lncRNAs. This group comprises cold induced antisense transcripts covering the entire Flowering Locus C (FLC), a master repressor of flowering in Arabidopsis. They react to cold earlier than Vernalization insensitive 3 (VIN3), the earliest factor in the polycomb silencing mechanism. COOLAIR is believed to negatively regulate FLC sense transcription in a polycomb-independent manner (Figure 1.3 A).

However, COOLAIR only transiently suppresses the FLC, and polycomb machinery is indispensable for the construction of epigenetic memory of FLC inactivation. It has

(13)

been demonstrated that COOLAIR suppresses FLC through transcription interference, in particular promoter interference.

COLDAIR is another class of lncRNA derived from FLC locus. Unlike COOLAIR, COLDAIR is a sense transcript approximately 1100 nucleotides long with 5’ cap but no polyA tail, these features could lead one to think that these lncRNAs are transcribed by RNA polymerase V and IV. Nevertheless, COLDAIR is transcribed by RNA polymerase II like many other lncRNAs in mammals. The COLDAIR expression is induced by cold exposure. COLDAIR specifically interact with CLF (Curly Leaf), a key component of PRC2 complex. Hence, it is very likely that COLDAIR negatively modulates FLC via a polycomb-dependent model (Figure 1.3 B). COLDAIR may play a role in the recruitment of PRC2 to FLC chromatin to trigger the epigenetic memory establishment of FLC silencing by vernalization (Zhang et al., 2013b).

LDMAR is a lncRNA that controls Photo-Sensitive Male Sterility (PSMS) in rice. Originated from an elite japonica rice variety Nongken 58N (NK58N), Nongken 58S (NK58S) was a spontaneous mutant exhibiting PSMS, i.e. its pollen becomes completely sterile when grown under long-day conditions, whereas the pollens are viable under short-day growth conditions. The PSMS in NK58N is caused by a C-to-G mutation in the LDMAR gene (Ding et al., 2012a). Probably, the C-to-G mutation altered the secondary structure of LDMAR in NK58S, and the structural alteration brought DNA methylation in the promoter region, which suppressed the LDMAR expression, and the insufficient LDMAR eventually led to the sterility of NK 58S under long-days (Figure 1.3 D). Later, Psi-LDMAR, a siRNA derived from the sense strand of LDMAR promoter region was also found to be responsible for the regulation of DNA methylation in this region (Ding et al., 2012b).

(14)

In addition to the lncRNAs described above, a few other lncRNAs from plant were also characterized, as reported in Table 1.1.

Figure 1.3 Schematic representation of four types of lncRNA regulation mechanism in plants. (A) COOLAIR

regulates FLC in a transcription interference model. Top: FLC is transcribed by RNA Polymerase II before prolonged cold exposure; Middle: early cold treatment induced the expression of COOLAIR, which interferes with the RNA polymerase II binding to the FLC promoter, thus transiently suppresses the FLC expression; Bottom: after longer cold exposure, VIN3 recruits PRC2 complex to deposit H3K27me3 modification on FLC loci; (B) COLDAIR regulates FLC in a PRC2 associated histone modification model. Top: COLDAIR is induced by cold treatment. Middle: COLDAIR recruit PRC2 complex to the FLC loci; Bottom: PRC2 complex deposit H3K27me3 on the FLC loci; (C) ISP1 regulates PHO2 in a target mimicry model. Top: under normal growth condition, miR399 specifically binds to PHO2 and degrades PHO2 mRNA; Bottom: ISP1 competitively binds with miR399 to arrest its degradation function on PHO2; (D) LDMAR regulates the transcription of itself by a DNA methylation model. In NK58N, LDMAR is normally expressed. In NK58S, the C-to-G mutation altered the secondary structure of LDMAR and leads to the promoter DNA methylation, which reduced the LDMAR expression responsible for PSMS in NK58S. (Zhang et al., 2013b)

(15)

Except this few examples of functionally characterized lncRNAs, other studies focused only on annotation of these molecules, in Arabidopsis (6480 lincRNAs and, more recently, 37238 lncNATs) (Liu, Wang, and Chua, 2015; H. Wang et al., 2014a), in rice (7142 lncNATs originating small RNAs) (Zhou et al., 2009), in maize (20163 putative lncRNAs, of which 1704 high-confidence lncRNAs) (Li et al., 2014), in cotton (50566 lincRNA and 5826 lncNAT transcripts) (Wang et al., 2015), in Populus (P. trichocarpa 2542 lincRNAs; P. tomentosa 1377 lncRNAs) (Chen et al., 2015; Shuai et al., 2014) and tomato (3679 lncRNAs) (Zhu et al., 2015). Recently, in rice, new lncRNAs have been identified, as competing endogenous RNAs (ceRNAs), which sequester miR160 or miR164 in a type of target mimicry, and one lncRNA, XLOC_057324, demonstrated to play a role in panicle development and fertility (Zhang et al., 2014).

LncRNAs are generally transcribed by RNA polymerase II, so they are always capped, polyadenylated and frequently spliced (Ulitsky and Bartel, 2013).

However many novel lncRNAs have been found to be transcribed by RNA polymerase III, which was previously thought to only transcribe housekeeping RNAs like tRNA and 5S RNA (Zhang et al., 2013b).

Some lncRNAs are generated by plant-specific polymerase V, capped at the 5’ end and lacking apparent poly(A) tails. These lncRNAs function as a scaffold for the RdDM pathway (Kim and Sung, 2012).

These features are important to choose the approach for lncRNAs identification, for Table 1.1 Summary of the reported lncRNA genes in plants. (Zhang et al., 2013b)

(16)

On the basis of their genomic origins, lncRNAs can be classified into: • Long intergenic ncRNAs (lincRNAs)

• Intronic ncRNAs (incRNAs)

• Natural antisense transcripts (NATs), referred to the antisense transcripts of protein-coding transcripts. NATs are broadly grouped into two categories based on whether they act in cis or in trans. The so-called cis-NATs are transcribed from the same loci as sense transcripts and therefore have perfect match with the sense transcripts. On the contrary, trans-NATs are transcribed from different genomic loci and usually display only partial complementarity with the sense transcript (Zhang et al., 2013b; Rinn and Chang, 2012).

Figure 1.4 Anatomy of lncRNA loci. lncRNAs are often defined by their

location relative to nearby protein coding genes. Antisense lncRNAs are lncRNAs that initiate inside of a protein coding gene and transcribe in the opposite direction that overlaps coding exons. Intronic lncRNAs are lncRNAs that initiate inside of an intron of a protein coding gene in either direction and terminates without overlapping exons. Intergenic lncRNAs are lncRNAs with separate transcriptional units from protein coding genes. (Rinn and Chang, 2012)

(17)

Compared with protein-coding genes and even small noncoding RNAs, most lncRNAs lack strong sequence conservation between species and they do not contain evolutionarily conserved long ORF. They are usually expressed at low levels, developmentally regulated and often exhibit tissue-specific and cell type-specific patterns. A significant proportion of lncRNAs are located exclusively in the nucleus with a few exceptions that are localized in cytosolic fractions (Wang et al., 2015). From the recent literature it’s emerging that lncRNAs are potent regulators involved in several biological processes in eukaryotic cells. The greatest part of information about lncRNA function derives from animal world, in which the study of lncRNAs began and developed rapidly, especially thanks to the availability of sequenced genomes and the growing importance that lncRNAs have been found to have in many diseases (Brown et

al., 1992; Dong et al., 2014; Engreitz et al., 2013; Gao et al., 2015; Li et al., 2015; Shi et al., 2015; Yao et al., 2015).

In particular, it has been shown that lncRNAs can regulate gene expression at different levels: transcriptional, post-transcriptional and post-translational (Liu et al., 2015). At transcriptional level, they can regulate the polymerase II transcription machinery in many ways (Figure 1.5), e.g. they can regulate the DNA-binding activity of transcription factor (TF) or can regulate mediator complex formation (Lai et al., 2013). There is also a class of animal lncRNAs called “enhancer” RNAs, that are transcribed from the enhancer domains and/or transcription factor binding sites of genes, and they may regulate transcription activities of their flanking genes by recruiting transcription activators/repressors and/or controlling chromatin topology. One mode of action, typical of plant lncRNAs instead, is to trigger the formation of a stable RNA–DNA triplex so as to control TF binding specificity on promoter regions (Liu et al., 2015) (Figure 1.5).

Furthermore, many lncRNAs have been shown to play a role in modifying chromatin marks and some can directly interact with chromatin modifiers, promoting or repressing transcription activity (Wang et al., 2011).

In plants, lncRNAs activity is also correlated to small RNAs. Double-stranded RNAs could be processed into 21- to 24-nt small RNAs, which may initiate

(18)

post-double-stranded RNA duplexes with natural antisense transcripts to carry out their functions (Figure 1.6 A).

Concerning post-transcriptional regulation, a considerable number of repeat-containing lncRNAs (RC-lncRNA), originated from genes overlapping with transposable elements or repeat, can generate small RNAs which, in turn, can participate in post transcriptional gene silencing (PTGS) and/or RdDM pathways (Liu et al., 2012). LncRNAs can also interact with another class of regulatory RNAs, microRNAs, through the so called target mimicry (Figure 1.6 B). It is a regulatory mechanism for miRNA functions in plants in which the decoy RNAs bind to miRNAs via partially complementary sequences and therefore block the interaction between miRNAs and their authentic targets. In this way, lncRNAs can act as sponges of miRNAs (Franco-Zorrilla et al., 2007; Wu et al., 2013). In post-translational regulation, lncRNAs are able to regulate protein-protein interaction, interacting with RNA-binding proteins (RBPs) (Wang et al., 2014b) (Figure 1.6 C). For instance, Arabidopsis genome encodes many genes encoding RBPs, which can potentially associate with lncRNAs to execute their functions, such as the regulation of ABA signalling pathway (Liu, Wang, and Chua, 2015). Also protein subcellular location can be regulated by lncRNAs (Figure 1.6 F). A case of a plant lncRNA altering the subcellular localization of a protein was reported some years ago, in M. truncatula, where the endo40 RNA encodes only a short open reading frame and it colocalizes with RNA-Binding Protein 1 (MtRBP1). During nodule development, MtRBP1 together with

enod40 translocates from nuclear speckles to the cytosol in root cells, whereas in the

cells without expressed enod40, MtRBP1 is retained in the nucleus (Campalans et al., 2004). Recent studies found that several lncRNAs act as scaffolds (Figure 1.6 E), which have the capacity to bind some protein partners, serving as adaptors to form the functional protein complexes, e.g. TERRA and HOTAIR (Ma et al., 2012).

(19)

Figure 1.5 Schematic diagram of lncRNAs in transcriptional regulation.

LncRNAs (red) can regulate transcription machinery and chromatin modification on promoter regions, and transcription-factor-binding affinity on enhancer regions in the nucleus. LncRNAs can also regulate RNA-DNA hybridization and chromatin topological structures. TF, transcription factors. (Liu, Wang, and Chua, 2015)

Figure 1.6 Schematic diagram of lncRNAs in post-transcriptional and post-translational

regulation. (a) LncRNA (red) can form double-stranded RNA duplexes by complementary sequence to its targeted RNA. (b) lncRNA as a target mimic to down-regulate miRNA activity. RISC, RNA-induced silencing complex. (c) lncRNA regulates protein-protein interaction. (d) lncRNA changes protein structure to expose/protect specific amino acid for modification. (e) lncRNA as a scaffold to regulate assembly of protein complex subunits. (f) lncRNA guides RNA-binding protein relocation. (Liu, Wang, and Chua, 2015)

(20)

1.3.1 LncRNAs identification: genomics approaches

Over the years, different approaches in lncRNA study have been used, from microarrays onward to RNA-Seq. The identification of lncRNAs relies on the detection of transcription from genomic regions that are not annotated as protein coding (Fatica and Bozzoni, 2014).

Although traditional microarrays are incapable of identifying novel lncRNAs and are not able to distinguish different splicing variants, in past years, they were the first choice in many applications. In particular, since the identification of several lncRNA loci as part of the ENCODE project, the completeness of microarrays for human lncRNAs has been drastically improved. However, microarray is not sensitive enough to detect RNA transcripts with low-expression level. Thus the use of microarray to identify lncRNAs is limited due to the low expression level of many lncRNAs (Lee and Kikyo, 2012; Ma et al., 2012).

As variation of traditional microarrays, DNA tiling arrays contain overlapping oligonucleotides that cover an entire length of a defined DNA region. A major advantage of using tiling arrays is their capacity to identify novel lncRNAs in a selected DNA region without prior knowledge of their precise locations within the region. This methodology allows the analysis of global transcription from specific genomic regions and was initially used for both identification and expression analysis of lncRNAs (Fatica and Bozzoni, 2014). Interesting discoveries have been achieved by employing tilling arrays. For instance, Rinn et al. (2007) focused on lncRNAs expressed in the region of the human HOX genes and compared skin fibroblasts isolated from different anatomical regions of the body. They printed 400000 probes of 50 bases in length with each probe overlapping the next one by 45 bases to cover all four human HOX gene clusters. Polyadenylated RNAs prepared from fibroblasts were then hybridized to the tiling arrays, resulting in the discovery of the lncRNA HOTAIR transcribed from an intergenic region within the HOXC cluster (Rinn et al., 2007). A similar HOX tiling array was used to identify lncRNAs specifically expressed in metastatic breast carcinoma (Gupta et al., 2010).

SAGE (serial analysis of gene expression) technology is based on the generation of short sequence tags of unbiased cDNA sequence by restriction enzymes. SAGE tags are

(21)

concatenated before cloning and sequencing. This methodology allows both the quantification of transcripts throughout the transcriptome and the identification of new transcripts and it has been used and proved to be an efficient approach in studying lncRNAs. For example, Gibb et al. compiled 272 human SAGE libraries. By passing over 24 million tags they were able to generate lncRNA expression profiles in human normal and cancer tissues (Gibb et al., 2011). Lee et al. also used SAGE to identify potential lncRNA candidates in male germ cell (Lee et al., 2012). However, SAGE is much more expensive than microarray, therefore is not widely employed in large-scale studies (Fatica and Bozzoni, 2014; Ma et al., 2012).

CAGE (cap analysis of gene expression) is a technique similar to SAGE, however, unlike it, CAGE sequence tags are not unbiased but originate from the 5’ end of a transcript. Therefore CAGE can be used to locate an exact transcription start site in the genome, in addition to quantifying expression level (Kawaji et al., 2006).

For instance, CAGE tags were used for experimental validation of the annotated start sites and expression level of lncRNAs in GENCODE project (Derrien et al., 2012). To date, with the development of next generation sequencing (NGS) technology, the most widely used methodology to study lncRNAs is certainly RNA-Seq. Sequencing of transcriptomes by RNA-Seq is one of the most powerful methodologies for de novo discovery and expression analyses of lncRNAs. In this method, total RNA is converted to a cDNA library that is directly sequenced by high-throughput sequencing instruments. There are several types of sequencing technologies but Illumina platforms are currently the most commonly used for RNA-Seq experiments. A single sequencing run produces billions of reads that are subsequently aligned to a reference genome. The basic workflow for lncRNA identification using RNA-Seq is shown in Figure 1.7 (Fatica and Bozzoni, 2014; Ma et al., 2012).

(22)

Compared to traditional microarray technology, RNA-Seq has many advantages in studying gene expression. It is more sensitive in detecting less-abundant transcripts, and identifying novel alternative splicing isoforms and novel ncRNA transcripts.

RNA-Seq is able to detect transcripts that are missing or incomplete in the reference genome and allows for accurate quantification of expression levels, making it an ideal approach for lncRNA discovery. With an ultra sequencing depth RNA-Seq can be used to discover rare transcripts that are expressed in just a few cells within a tissue (Ma et

al., 2012; Zhu and Wang, 2012).

There is also a method, chromatin immunoprecipitation (ChIP), that uses chromatin signatures, to study actively transcribed genes including lncRNAs. When combined with DNA sequencing (that is ChIP-Seq), this method can infer the genomic distribution of either proteins or histone modifications (Fatica and Bozzoni, 2014; Ma et

al., 2012). Analysis of loci with specific histone modifications that characterize active

transcription such as H3K4me3 (the marker of active promoters) and H3K36me3 (the marker of transcribed region), allowed an indirect identification of many unknown lncRNAs, For example, Guttman et al. identified 1600 large multiexonic lncRNAs that are regulated by key transcription factors such as p53 and NFkB (Guttman et al., 2009). Figure 1.7 Workflow of lncRNA identification from RNA-Seq. (Ma et al., 2012)

(23)

In plant genomes, initial analyses for lncRNA identification were based on the bioinformatics search for RNAs with poor coding capacity in cDNA databases. Through this approach, some lncRNAs were identified in Arabidopsis and Medicago truncatula (Ben Amor et al., 2009; Hirsch et al., 2006; Wen et al., 2007).

Also the analysis of expressed sequence tag (EST) databases helped in some lncRNAs identification, such as in wheat affected by Puccinia striiformis, suggesting their participation in pathogen-defense responses (Zhang et al., 2013a).

Just as in animal world, so in plants, tilling array provided a reach source for lncRNA discovery, for example in rice and Arabidopsis (Li et al., 2006; Matsui et al., 2010; Rehrauer et al., 2010).

Nevertheless, the greatest contribution in plant lncRNA study derives from RNA-Seq, especially for the annotation of these molecules (Liu, Wang, and Chua, 2015; H. Wang

et al., 2014a; Zhou et al., 2009; Li et al., 2014; Wang et al., 2015; Zhu et al., 2015;

Zhang et al., 2014).

2 Research objectives

The present project aims at performing, for the first time, a whole-genome annotation and a detailed analysis of lncRNAs in Brachypodium distachyon (Bd), a wild grass belonging to the Pooideae and an important model species for temperate cereals, such as wheat and barley.

The project takes advantages of public and proprietary RNA-Seq data sets from 15 different experiments conducted in the reference inbred line Bd21. Public RNA-Seq libraries consist of different plant tissues (see Table 3.1). Proprietary RNA-Seq libraries were previously produced in our lab from three developmental leaf areas: proliferation, expansion and mature, grown in control and drought stress conditions, considering three biological replicates for each sample.

The application of a modified in silico pipeline, described in Li et al. (2014), to the dataset provided a comprehensive systematic annotation of Bd lncRNAs.

(24)

For those libraries with biological replicates, a differential expression analysis was performed, with the aim of finding developmental, tissue-specific and cell specific lncRNAs.

Moreover the potential lncRNA target are investigated to highlight new regulatory networks and cross-talk between different RNA molecules. For instance, the algorithm developed by Wu et al. (2013) allow to investigate the link between lncRNAs and microRNAs through target mimicry, which is, as described above, a regulatory mechanism for miRNA functions in plants in which the decoy RNAs bind to miRNAs via complementary sequences and therefore block the interaction between miRNAs and their authentic targets.

3 Materials and methods

3.1 RNA-Seq libraries

A total of 15 RNA-Seq libraries generated from the reference inbred line Bd21, both public and proprietary data, were selected. Public RNA-Seq data were downloaded from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). These samples were produced for the “Conserved Poaceae Specific Genes project” by Davidson et al. (2012) and comprise nine different plant tissue/organs grown in the greenhouse under long-day conditions (14-16 h light). As reported in Table 3.1, these libraries are poly(A+) selected, single end (SE), generated with Illumina technology.

Proprietary libraries originated from a drought stress experiment conducted in January 2011 in Milano (Italy), in which Brachypodium distachyon inbred line 21 was grown under control and severe drought stress conditions, according to the protocol described in Verelst et al. (2013). In particular, plants were grown into a growth chamber, with controlled conditions of light, temperature and humidity. Control plants were maintained at 1.82 g of water per g dry soil, while stressed plants were dried down to 0.45 g water/g dry soil (severe drought stress).

All the samples were collected from the third leaf, about 24 hours after the emergence and each leaf was divided in 3 developmental zones: proliferation, expansion and mature zone.

(25)

RNA-Seq libraries were generated from each developmental zone for a total of 18 samples: three types of leaf cells (proliferation cells, expansion cells and mature cells) grown in control and drought condition, considering three biological replicates. For the analyses of the present thesis work, the biological replicates were merged in a single file, making a total of 6 libraries (proliferation, expansion, and mature cells in control and stress conditions).

As reported in Table 3.1, these libraries are poly(A+) selected, single end (SE), generated with Illumina technology.

(26)

Samples accession SRA reads Raw Reads length Experiment Project References Leaves 20 DAS SRR349785 SRR352143 56673394 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Emerging inflorescence SRR349787 23137165 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Early inflorescence SRR349786 17601221 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012

Anther SRR352140 26059840 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Pistil SRR352137 17712829 35 SE Poly (A)

RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Seed 5 DAP SRR352139 25922555 35 SE Poly (A)

RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Seed 10 DAP SRR352141 25766517 35 SE Poly (A)

RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Embryo 25 DAP SRR352138 SRR352144 48681058 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 Endosperm 25 DAP SRR352142 27365511 35 SE Poly (A) RNA-Seq Conserved Poaceae Specific Genes Project Davidson et al., 2012 #3 leaf Proliferation ctrl (Pc)

- 76248037 50 SE RNA-Seq Poly (A) -

Bertolini et al. Unpublished #3 leaf Proliferation stress (Ps) - 82616294 50 SE Poly (A) RNA-Seq - Bertolini et al. Unpublished #3 leaf Expansion ctrl (Ec) - 77830217 50 SE Poly (A) RNA-Seq - Bertolini et al. Unpublished #3 leaf Expansion stress (Es) - 72602295 50 SE Poly (A) RNA-Seq - Bertolini et al. Unpublished #3 leaf Mature ctrl - 58589296 50 SE Poly (A) RNA-Seq - Bertolini et al. Unpublished #3 leaf Mature stress - 68671998 50 SE Poly (A) RNA-Seq - Bertolini et al. Unpublished

Table 3.1 Summary of the main characteristics of the public (first 9) and the proprietary (last 6) libraries.

(27)

3.2 Bioinformatic analysis

First, a quality check of all the samples was assessed using FastQC tool (Andrews, 2015), than the mRNA reads were trimmed using ERNE-FILTER (v.1.3) (Del Fabbro et

al., 2013). For libraries containing adapter contaminants, Cutadapt (v.1.2.1) (Martin,

2011) was applied. ERNE-FILTER was used to remove low quality bases from the ends of the reads. Due to the different characteristics of public and proprietary libraries, different parameters of ERNE-FILTER were applied for the two data sets: a minimum PHRED score of 30 and a minimum read length of 35 bp.

After the quality and trimming filters, each experiment was aligned independently to the reference Bd21 genome (v.2.1), using a method of two mapping iterations, with the spliced read aligner TopHat2 version 2.0.9 (Trapnell et al., 2012). This method was first proposed by Cabili et al. (2011) and it was used also in the lncRNA identification in maize (Li et al., 2014). This approach allows maximizing the mapping efficiency by using the spliced site information derived from the first mapping iteration in all the samples.

Hence, after the first alignment, each experiment was re-aligned for a second time using the pooled splice sites file, considering the following parameters:

• Maximum number of mismatches = 0 • Minimum intron length = 10

• Maximum intron length = 500000 • Library type = fr-firststrand

Successively, for each experiment the transcriptome was de novo assembled using Cufflinks version 2.0.9 (Trapnell et al., 2010).

All the transcripts, assembled with Cufflinks, were then merged with Cuffmerge to remove all the redundant transcripts and obtain only unique sequences. Finally, Gffread was used to generate a multi FASTA file. This FASTA file, containing both novel transcripts and previously annotated transcripts, was used to identify bona fide lncRNAs.

(28)

3.3 LncRNAs identification

To identify lncRNAs a modified bioinformatic pipeline “LncRNA_Finder”, described in Li et al. (2014), was used.

Particularly this pipeline applies five filters on the basis of the main lncRNA characteristics:

• Size selection: minimum lncRNA length was set at 200 nucleotides.

• Open Reading Frame (ORF) filter: maximum potential ORF length was kept as default at 100.

• Known protein domain filter: transcripts were aligned to a protein database of all Angiosperms downloaded from Plaza v.2.5 (http://plaza.psb.ugent.be/), the e-value parameter was set at 0.001.

• Coding Potential Calculator (CPC): this parameter was used to evaluate the quality, completeness and homology of ORFs.

• Filter of housekeeping RNAs and precursors of small RNAs: since putative lncRNAs may contain precursors of housekeeping RNAs and small RNAs, filtered transcripts were compared to housekeeping RNAs, from Pfam databases (http://pfam.xfam.org/), and to public and proprietary small RNAs databases, 19 libraries from http://mpss.udel.edu/ and 8 libraries from Bertolini et al. (2013) from different tissue and cell types. The number of mismatches in the alignment with small RNA was kept as default at 0.

After each filter an output file, that can ben checked, is obtained.

At the end, putative lncRNAs with sequence similarity with small RNAs were classified as pre-lncRNAs, whereas transcripts that do not have similarity with any class of non-coding RNAs were defined as High Confidence lncRNAs (HC-lncRNAs).

3.4 Expression profiles and differential expression analysis

Considering the wide variety of libraries present in this study, it is interesting to see how lncRNAs expression changes across different stages, tissues, organs or conditions. For this purpose, for all the libraries, even those with no replicates (that cannot be the subject of differential expression analysis), lncRNAs expression was expressed in

(29)

Reads Per Kilobase per Million mapped reads (RPKM), calculated using the CPM function of edgeR, a Bioconductor package (Robinson et al., 2010).

Proprietary libraries, from different developing areas of the third growing leaf, were used to conduct a differential expression analysis. The R package DESeq2, based on the negative binomial distribution, which makes possible the evaluation of the raw variance of data from experimental design with small numbers of biological replicates, was applied (Love et al., 2014).

To identify differentially expressed loci between drought and control conditions in the same developing zone and during leaf development in the proliferating and expansion conditions, a series of pairwise comparisons were set up, as reported in Table 3.2.

3.5 Interaction with microRNAs (target mimicry)

One of the most interesting biological function of lncRNAs, recently reported in plants (Franco-Zorrilla et al., 2007; Wu et al., 2013), is the action as endogenous target mimics (eTMs) of miRNAs, or target mimicry, via partially complementary sequences, blocking the interaction between miRNAs and their authentic targets (Wu et al., 2013). To identify HC-lncRNAs with potential target mimicry properties, the computational method described in Wu et al. (2013) was used. Briefly, a base-pairing interaction between Bd miRNAs and the annotated lncRNAs was predicted, based on: (1) bulges were only permitted at the 5’ end 9th to 12th positions of miRNA sequence; (2) the bulge in eTMs should be composed of only three nucleotides; (3) perfect nucleotide pairing was required at the 5’ end from second to eighth positions of miRNA sequence; (4) except for the central bulge, the total mismatches and G/U pairs allowed was set to 3. Comparison Drought stress Ps vs Pc Es vs Ec Leaf development Pc vs Ec Ps vs Es

Table 3.2 List of pairwise comparison performed during

the differential expression analysis. (P) proliferation, (E) expansion, (C) control and (S) drought stress conditions.  

(30)

3.6 RNA extraction and reverse transcription

To validate the identified HC-lncRNAs, it is important to conduct a Real-Time PCR of some putative lncRNAs. Therefore, cDNA of some of the datasets is needed.

So, RNA from proliferation and expansion zone, both in control and stress conditions, was extracted with the Plant/Fungi Total RNA Purification Kit (Norgen Biotek Corporation). The samples were flash-frozen in liquid nitrogen and conserved in a -80°C freezer. They were then grinded rapidly with mortar and pestle, using liquid nitrogen to ensure that the integrity of the RNA was not compromised. For the extraction, a maximum of 50 mg of starting material was needed. Briefly, the process involves a column with a resin that can bind RNA specifically, then washed with a provided wash solution and eluted with 80 µl of DEPC water.

The purified RNA samples were then retrotranscribed, starting from 1 µg of material, using iScript™ cDNA Synthesis Kit (Bio-Rad), following the manufactures' protocol.

3.7 LncRNA validation using Real-Time PCR

To validate the expression patterns of lncRNAs, Real-Time PCR experiments will be conducted in the near future. 10 lncRNAs have been selected among the up-regulated and down-regulated lncRNAs in control and stress condition between Pc/Ec and Ps/Es, as result of the differential expression analysis (Table 3.3).

(31)

LncRNA Length log2FoldChange Expression TCONS_00090190 358 -2.391265344 CTRL_down TCONS_00089035 486 -2.255784059 CTRL_down TCONS_00088496 2371 -3.057039273 CTRL_down TCONS_00081033 1199 -2.894632964 CTRL_down TCONS_00020495 358 -3.217483298 CTRL_down TCONS_00089348 301 3.139134476 CTRL_up TCONS_00086597 533 3.323773011 CTRL_up TCONS_00030330 1749 2.971583098 CTRL_up TCONS_00009992 1457 3.656548937 CTRL_up TCONS_00019564 1512 -4.318542051 STRESS_down

Table 3.3 List of lncRNAs selected for Real-Time PCR validation. log2FoldChange indicates

(32)

Primers Sequence (5’-3’) Length (nt) Tm (°C) LncRNA Forward GGAACCAGAGAAGAAGAGGAGG 22 59.50 TCONS_00090190 Reverse ATAGACAGGGAAGAGCTTGGAC 22 59.23 Forward AGTGGTAGAGTGGAGTGGAGT 21 59.57 TCONS_00089035 Reverse GCGTCGTATATGTTGTCAGCG 21 59.81 Forward TGGATCGGTGTAGAATCGACC 21 59.32 TCONS_00088496 Reverse TGATCTCCTACCAATCTGCGG 21 59.31 Forward CAAGCACTGAGGAATCTGACG 21 59.00 TCONS_00081033 Reverse TGAGGAGTGTATGCCAACTCG 21 59.80 Forward ATCCAGTGACCTACAGCTGC 20 59.46 TCONS_00020495 Reverse GGACCACGCATGTCACTAGT 20 59.75 Forward CCTACGCAACCCAAGCTATCT 21 59.86 TCONS_00089348 Reverse CAACCGACCCAGTATAAACGC 21 59.34 Forward CATTGACACTGTCCTGGGTTTC 22 59.45 TCONS_00086597 Reverse TACCCTAAAAGATACTCCGGCC 22 59.03 Forward GGGGTTCTTAGTCTTCGGGTT 21 59.37 TCONS_00030330 Reverse TCCTCGGTTATTCACAGCTCC 21 59.52 Forward GCCATCCCATACTTCCTAGCC 21 60.00 TCONS_00009992 Reverse CAAACAATCCACCCGCTTCTC 21 59.80 Forward CCGTTGTTTTGGTAGCTGCAA 21 59.93 TCONS_00019564 Reverse CAAGGATTTGTCGGTGCGTAC 21 59.87

(33)

4 Results

4.1 Experimental data set

Initial dataset for lncRNAs identification in Brachypodium distachyon was created selecting a total of 15 RNA-Seq libraries generated from the reference inbred line Bd21. Nine libraries were public and produced for the “Conserved Poaceae Specific Genes project” by Davidson et al. (2012), originating from nine different plant tissues/organs: leaves, early and emerging inflorescence, anther, pistil, seeds 5 and 10 days after pollination, embryo and endosperm. These libraries were downloaded from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra).

Six proprietary libraries originated from a drought stress experiment conducted on

Brachypodium distachyon in January 2011 in Milano (Italy). RNA-Seq libraries were

generated from three developmental zones of the third leaf collected about 24 hours after the emergence: proliferation, expansion and mature zone. A total of 18 samples were available: three types of leaf cells (proliferation cells, expansion cells and mature cells) grown in control and drought condition, considering three biological replicates. For the analyses of the present thesis work, the biological replicates were merged in a single file, making a total of 6 libraries (proliferation, expansion, and mature cells in control and stress conditions).

The libraries are poly(A+) selected, single end (SE), generated with Illumina technology, making a total of 705478227 reads.

4.2 Quality check

Each library was subject to quality control that allows measuring the quality based on ten metrics: (1) Per base sequence quality; (2) Per sequence quality scores; (3) Per base sequence content; (4) Per base GC content; (5) Per sequence GC content; (6) Per base N content; (7) Sequence length distribution; (8) Sequence duplication levels; (9) Overrapresented sequences; (10) Kmer content. Some examples are reported in Figure 4.1.

Public and proprietary libraries had 35 and 50 bp long reads, respectively. Both had an average PHRED score of 30. The GC content was around 50% for all the libraries. Only

(34)

SRR352143 and Mature ctrl) overrepresented sequences were found, referred to adapter contaminants.

Despite the high quality of the raw libraries, a quality trimming step was performed to remove the few Ns and the low quality nucleotide in the reads, as well as the sequencing adapters present in the sequence reads, with the aim to obtain a better alignment to the reference genome. Quality of trimmed reads was assessed, showing an improved quality in terms of PHRED (Q) scores: public and proprietary libraries had Q score greater than 30 and 33, respectively (Figure 4.2). Statistic of reads trimming is shown in Table 4.1.

(35)

A B

D C

Figure 4.1 Visual output from FastQC of the library SRR352137. A Per base sequence quality plot. X-axis

shows the base positions in the reads, Y-axis shows quality score (Q score). B Per sequence quality scores. Mean sequence quality are shown on x-axis. C Per sequence GC content. D Per base N content.  

(36)

Samples SRA accession Raw reads Reads trimmed % reads trimmed Leaves 20 DAS SRR349785 SRR352143 56673394 53118138 93.73 Emerging inflorescence SRR349787 23137165 21880389 95.29 Early inflorescence SRR349786 17601221 16771555 94.57 Anther SRR352140 26059840 23226658 89.13 Pistil SRR352137 17712829 16739912 94.51 Seed 5 DAP SRR352139 25922555 22737023 87.71 Seed 10 DAP SRR352141 25766517 24540576 95.24 Embryo 25 DAP SRR352138 SRR352144 48681058 44986617 92.41 Endosperm 25 DAP SRR352142 27365511 24533458 89.65 #3 leaf Proliferation ctrl (Pc) - 76248037 70970362 93.08 #3 leaf Proliferation stress (Ps) - 82616294 77234409 93.49 #3 leaf Expansion ctrl (Ec) - 77830217 71996832 92.50

#3 leaf Expansion stress

(Es) - 72602295 67857125 93.46

#3 leaf Mature ctrl - 58589296 54766323 93.47 #3 leaf Mature stress - 68671998 63660599 92.70

Total 705478227 650613914 92.22

(37)

A B

C D

Figure 4.2 Visual output from FastQC of the trimmed library Ps. A Per base sequence quality plot. X-axis shows

the base positions in the reads, Y-axis shows quality score (Q score). B Per sequence quality scores. Mean sequence quality are shown on x-axis.  C Per sequence GC content. D Per base N content.  

(38)

4.3 Alignment to the reference genome

Each library was independently mapped against Bd21 reference genome version 2.1, downloaded from Phytozome v.10.3 (http://phytozome.jgi.doe.gov/), using TopHat2 (Trapnell et al., 2012). This aligner program was designed to map reads from RNA-Seq experiments to a reference genome, allowing to identify exon-exon splice junctions. It is built on the ultrafast unspliced short read mapping program Bowtie, based on the exon-first approach (Langmead et al., 2009). Briefly, TopHat exon-first maps reads to the genome, identifying potential exons using Bowtie. All reads that do not map to the genome are set aside as “Initially Unmapped Reads” (IUM reads). In this way TopHat builds a database of possible splice junctions and then maps the IUM reads against the junctions with a seed-and-extend strategy (Figure 4.3). This approach allows revealing new alternative spliced transcripts and isoforms and increase the overall percentage of mapped reads (Trapnell et al., 2009).

To optimize the alignment, the Bd21 annotation file of genes and exons (GFF3), version 2.1, available from Phytozome v.10.3 (http://phytozome.jgi.doe.gov/) was used. This new annotation file contains an improved annotation with a greater number of annotated genes and exons.

To maximize the use of splice site information derived from all samples, two iterations method first proposed by Cabili et al. (2011) was used. Therefore, after the first alignment, each experiment was re-aligned to the reference genome providing the sorted and non-redundant pooled splice sites file to the program.

Statistics of two iterations reads alignment are shown in Table 4.2. On average, the 96% of the reads was mapped to the reference genome.

(39)

Figure 4.3 The TopHat pipeline. RNA-Seq reads are mapped against

the whole reference genome, and those reads that do not map are set aside. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences. (Trapnell et al., 2009)

(40)

Samples accession SRA trimmed Reads Mapped reads % mapped reads Unmapped reads % unmapped reads Leaves 20 DAS SRR349785 SRR352143 53118138 51605492 97.2 1512646 2.8 Emerging inflorescence SRR349787 21880389 21315167 97.4 565222 2.6 Early inflorescence SRR349786 16771555 16191851 96.5 579704 3.5 Anther SRR352140 23226658 22380996 96.4 845662 3.6 Pistil SRR352137 16739912 15759963 94.1 979949 5.9 Seed 5 DAP SRR352139 22737023 22094542 97.2 642481 2.8 Seed 10 DAP SRR352141 24540576 21938177 89.4 2602399 10.6 Embryo 25 DAP SRR352138 SRR352144 44986617 42581740 94.7 2404877 5.3 Endosperm 25 DAP SRR352142 24533458 23314366 95.0 1219092 5.0 #3 leaf Proliferation ctrl (Pc) - 70970362 68887254 97.1 2083108 2.9 #3 leaf Proliferation stress (Ps) - 77234409 75050280 97.2 2184129 2.8 #3 leaf Expansion ctrl (Ec) - 71996832 70274662 97.6 1722170 2.4 #3 leaf Expansion stress (Es) - 67857125 65991118 97.3 1866007 2.7 #3 leaf Mature ctrl - 54766323 53460628 97.6 1305695 2.4 #3 leaf Mature stress - 63660599 62116512 97.6 1544087 2.4 Total 650613914 632962748 97.3 22057228 3.4

Table 4.2 Statistic of reads alignment to the reference genome. The number of aligned reads is referred to

(41)

Successively, for each experiment the transcriptome was de novo assembled. The redundant transcripts were removed to select only unique sequences, obtaining a whole set of 99141 loci/isoforms.

4.4 Identified lncRNAs

From the 15 libraries analysed, a whole set of 99141 loci/isoforms was obtained. Among these loci/isoforms, known transcripts were discarded, obtaining 6590 new transcripts (Figure 4.5).

To distinguish lncRNA candidates, five sequential stringent filters to the 6590 transcripts were employed, according to the modified bioinformatic pipeline described by Li et al. (2014). First, transcripts shorter than 200 nucleotides and with ORFs longer than 100 amino acids were discarded and 3874 transcripts were retained (Figure 4.5). Transcripts passing this selection were aligned to protein database of all Angiosperms, downloaded from Plaza v.2.5 (http://plaza.psb.ugent.be/) to eliminate transcripts encoding protein. Next, the CPC was used to assess the protein-coding potential in order to eliminate other possible coding transcripts, and 2516 transcripts were obtained (Figure 4.5). After employing these criteria, the 2516 transcripts were considered as putative lncRNAs. Since putative lncRNAs may contain precursors of housekeeping RNAs and small RNAs, filtered transcripts were compared to a database of housekeeping RNAs and to a small RNAs databases (for details see the Materials and methods). Thus, putative lncRNAs with sequence similarity with small RNAs were classified as pre-lncRNAs, whereas the 2507 transcripts that do not have similarity with any class of non-coding RNAs were defined as High Confidence Bd lncRNAs (HC-lncRNAs) (Figure 4.5).

(42)

Then the main features of identified HC-lncRNAs were examined, considering average length, %GC, number of exons and genome distribution (Figure 4.6).

Similarly to lncRNAs of other plants, such as tomato and maize (Li et al., 2014; Zhu et

al., 2015), the greatest part of HC-lncRNAs (~70%) are shorter than 1000 nucleotides,

with a median length of 300 nucleotides (Figure 4.6 A). Whereas, Bd lncRNAs are Figure 4.5 Detailed schematic diagram of the bioinformatic pipeline, described in Li et al. (2014), for

identification of Bd lncRNAs. New transcripts were filtered with five criteria for identification of lncRNAs. (1) length ≥200 nucleotides and ORF ≤100 amino acids; (2) not encoding known protein domains; (3) little coding potential; (4) not housekeeping ncRNAs; and (5) not small RNA precursors.

(43)

2014). In accordance with previous studies, which have shown that both plant and animal lncRNAs are shorter and harbour fewer exons than protein-coding genes (Ding

et al., 2012a; Li et al., 2014; Zhu et al., 2015), most of the genes encoding Bd lncRNAs

only contained one exon (Figure 4.6 B).

A Circular plot clearly showed that Bd lncRNAs were pervasively transcribed across all the genome (Figure 4.6 D), similarly to the distribution observed in S. lycopersicum (Zhu et al., 2015).

A B

C D

Figure 4.6 The main features of identified lncRNAs. A Bd lncRNAs length distribution. B Number of exons in

Bd lncRNAs. C %GC in Bd lncRNAs. D Bd lncRNAs genome distribution; lncRNAs (magenta), miRNAs (yellow), coding genes (light blue).  

(44)

4.5 LncRNAs expression analysis

Considering all different samples within the dataset, we were able to study the lncRNAs expression levels across developing stages, conditions, tissues and organs. An heat map was created, representing the HC-lncRNAs expression in Reads Per Kilobase per Million mapped reads (RPKM) (Figure 4.7). We observed that Bd lncRNAs are expressed in each sample and some of them have a uniform expression. Some lncRNAs appear to be specifically expressed in different stages of leaf development or reproductive organs, particularly in anthers.

Figure 4.7 HC-lncRNAs abundance in 15 different Bd stages/conditions/tissues/organs. Expression

in Reads Per Kilobase per Million mapped reads (RPKM). AV means average from 2 or 3 biological replicates.  

(45)

To assess whether lncRNA expression changes during leaf development in control and drought stress conditions, between proliferation and expansion zone, a differential expression analysis was performed. Two classes of comparisons were performed:

• During drought stress, evaluating the expression profile of the same cell type in different growing conditions (Ps vs Pc, Es vs Ec).

• During leaf development, comparing the expression profiles of different cell types in the same growing conditions (Pc vs Ec, Ps vs Es).

Based on p-value < 0.05, in the first comparison, we retrieved a low number of differentially expressed lncRNAs in drought and control conditions, both in proliferation and expansion cells, with low fold change and no differentially expressed lncRNA in common between the two developing areas (Figure 4.8 A). On the other hand, the second comparison addressed many differentially expressed lncRNAs during leaf development in the same growing conditions, especially in drought stress (Figure 4.8 B), even if the values of fold change were still quite low, between -3 and 3. In particular, as we can see from the Venn diagram, 49 lncRNAs were differentially expressed from proliferation to expansion state in control conditions (31 up-regulated and 18 down-regulated), whereas 101 lncRNAs were differentially expressed from proliferation to expansion in drought stress conditions (72 up-regulated and 29 down-regulated). 89 were the lncRNAs differentially regulated during leaf development, independently from growing conditions (64 up-regulated and 25 down-regulated). It seems clear that lncRNA expression profile is drastically influenced by leaf development, both in control and stress conditions. Whereas, drought treatment does not induce a drastic change in lncRNA expression profile.

Comparison

Drought stress Ps vs Pc

Es vs Ec

Leaf development Pc vs Ec

Ps vs Es

Table 3.2 List of pairwise comparison performed during

the differential expression analysis. (P) proliferation, (E) expansion, (C) control and (S) drought stress conditions.  

(46)

A

 

B

 

Figure 4.8 Differentially Expressed (DE) HC lncRNAs during the drought treatment (A) and cell differentiation

Riferimenti

Documenti correlati

Tutte queste edizioni hanno subito gli influssi del lavoro di Galland, infatti i racconti da lui aggiunti sono entrati a far parte a pieno titolo anche di queste edizione in

In this study, a genetic characterization of Appenninica sheep breed was carried out with thirty microsatellite markers; the genetic relationships between Appenninica and

Instead, the increase in tree size and related physiological processes (expressed as product between diameter at breast height and tree height) explained the reduction in

La rivoluzione dei figli dei fiori, lo spiritualismo intelligente, il corpo nella lotta sono i tre ar- gomenti che si ritrovano in “Poeta delle Ceneri” e “Il PCI ai giovani!!”

This paper analyses the accommodation of the Latin names into the Greek orthography, trying to illustrate the complex overlap of different criteria of transliteration: the

In addition to the nominal and the in situ model, we decided to vary the migration rate of the giant planets to check how the number of Trojans and their asymmetry ratio change due to

To overcome this shortcoming, a novel directional decomposition strategy of the wind speed is herein formulated that opens the doors to a robust comparison between