Computational characterization of apple. Protein function, disorder and variability.


PhD PROGRAM IN AGRICULTURAL SCIENCES AND TECHNOLOGIES

in agreement with Fondazione E. Mach

31st CYCLE

COMPUTATIONAL CHARACTERIZATION OF APPLE. PROTEIN FUNCTIONS, INTRINSIC DISORDER AND VARIABILITY

Year of defense: 2019

Supervisor: Prof. Alessandro Cestaro


Domesticated apple is the most important temperate fruit crop and has been cultivated in Asia and Europe since antiquity. As a consequence of its self-incompatibility, wide variability in phenotypic characters is observed within a single population of domesticated apple. Furthermore, its lack of interspecific reproductive barriers is thought to be the main driver of its domestication process, which was heavily influenced by natural or artificial hybridization with wild apples. The main progenitor of the domestic apple is considered to be Malus sieversii, which grows wild in the Heavenly Mountains (Tien Shan). Domesticated apple was then brought by merchants along the Silk Route, where it hybridized with other wild apple species. This combination of evolutionary events contributed to domesticated apple retaining high genetic variability throughout the domestication process and to this day, even though domestication is known to reduce genetic variability. This retained variability is thought to be the basis of the great diversity in yield among domesticated apple cultivars. However, while the genome of domesticated apple has been available since 2012, annotation for domesticated apple is lacking in the standard resources gathering genome and protein data, despite its economic importance. Ensembl, the repository for nucleic acid sequences, does not include the genome of domesticated apple or of any other species from the genus Malus, and does not even feature a way to encode information on variability among cultivars/accessions. Universal Protein Resource (UniProt), which gathers annotation at the protein level, only annotates around 10% of domesticated apple genes. I argue that the lack of annotation originates from a recent phenomenon known as Big Data, which in this case is produced by Next Generation Sequencing (NGS) technologies. Due to the ever-improving sequencing technologies producing an ever-increasing number of sequenced genomes, computer science


final users can easily access and retrieve information of interest. However, plant-specific data is less integrated than human data, which has left plant-related information fragmented across many different plant-specific DBs. This is why domesticated apple is absent or mostly absent from Ensembl and UniProt. To fill the gap left by existing resources, we developed PhytoTypeDB (http://phytotypedb.bio.unipd.it), a database containing the inter-cultivar variability of functionally annotated plant proteins. PhytoTypeDB is a user-friendly resource developed to help plant scientists retrieve updated information about gene function and variability. Users can browse information from either a gene-centric or a species-centric perspective, navigating among the cultivar variations. The most distinctive feature of PhytoTypeDB is the possibility of exploring gene function and intra-specific variability at the same time. To generate PhytoTypeDB annotation, the concepts of gene family and domain were extensively exploited. These concepts revolve around the premise that protein sequence conservation drives function conservation. While this idea holds for globular proteins, it is not enough to cover the whole protein function space. Furthermore, limiting the annotation to conserved domains automatically biases annotation coverage towards highly conserved regions. It was suggested almost 20 years ago, however, that many proteins or regions of proteins in various proteomes lack such a stable three-dimensional structure and are instead intrinsically disordered under native conditions (IDPs). These proteins play critical roles in the cell and host the vast majority of variability since they evolve quickly. To investigate this class of proteins we transferred from the U.S.A. to Italy the central resource of high-quality, manually curated Intrinsic Disorder (ID) annotation, the Database of Protein Disorder (DisProt). After re-annotating all legacy entries and adding a further two hundred annotations as a community effort of


the foundations for a future periodic assessment of ID prediction similar to Critical Assessment of protein Structure Prediction (CASP), called Critical Assessment of Intrinsic protein Disorder (CAID), that we are currently running (http://disprotcentral.org/caid). In the field of ID prediction we published a novel method, MobiDB-lite, that was included as the first ID predictor in the widely used domain annotation resource InterPro. MobiDB-lite was used to predict ID on the largest scale possible: all protein sequences in the UniProt sequence space. These annotations were collected in the newest version of MobiDB, along with annotation from many different sources. Finally, I delved into the analysis of the very nuanced array of phenomena that fall under the name of ID, trying to further classify them and extrapolate patterns on a large-scale dataset.


Tabaro, Ivan Mičetić, Lisanna Paladin, Marco Carraro and all the members of the BiocomputingUP laboratory at the University of Padua for their invaluable help throughout my PhD both as colleagues and as friends. I would like to thank Alessandro Cestaro and Silvio Tosatto for giving me the opportunity to grow in this field. I would like to thank my parents and brother and Giorgia Tormena and all my relatives for supporting me. I would particularly like to thank Eleonora Thomaseth for standing beside me the whole time. Finally, I would like to acknowledge the thousands of individuals who have coded for the LaTeX project for free. It is due to their efforts that we can generate professionally typeset PDFs now.


List of Figures vii

List of Tables xi

Glossary xiii

Acronyms xix

1 Introduction 1

1.1 Malus x domestica . . . 1

1.2 Apple genetics . . . 2

1.2.1 Next generation sequencing . . . 4

1.2.2 Apple Genome . . . 5

1.2.3 Whole genome duplications . . . 7

1.3 Apple cultivars . . . 9

1.3.1 Definition of cultivar . . . 9

1.3.2 Malus x domestica cultivars . . . 9

1.3.3 Genetics of cultivars . . . 10

1.4 Annotation of Apple and plant species . . . 11

1.4.1 Genome annotation . . . 11

1.4.2 Gene prediction . . . 15

1.4.3 Gene function prediction . . . 16

1.4.4 Gene families . . . 17

1.4.4.1 MADS-box transcription factors . . . 18


1.4.5.1 Sequence-structure-function paradigm . . . 20

1.4.5.2 Breaking the protein rules . . . 20

1.4.5.3 Biological function of intrinsic disorder . . . 21

1.5 Apple and plant species in biological databases . . . 22

1.5.1 Biological databases . . . 23

1.5.2 Plants and Apple in Ensembl . . . 24

1.5.3 Plants and apple in UniProt . . . 24

1.5.4 Plants Databases . . . 26

1.6 Why annotation is lacking: Big Data . . . 33

1.6.1 Big data applications . . . 34

1.6.2 Big data in biology . . . 34

2 Personal contributions and thesis outline 37

2.1 High throughput gene function prediction . . . 38

2.2 Protein intrinsic disorder . . . 39

2.3 List of Publications . . . 40

3 PhytoTypeDB 43

3.1 Introduction . . . 45

3.2 Implementation . . . 46

3.3 Usage . . . 47

3.4 Conclusions . . . 52

4 DisProt 55

4.1 Introduction . . . 57

4.2 Detection and characterization of IDPs . . . 59

4.3 Database structure and implementation . . . 61

4.3.1 Database records . . . 61

4.3.2 Annotation pipeline . . . 62

4.3.3 Entry page . . . 64

4.3.4 Browsing and searching data . . . 64

4.3.5 Feedback page . . . 65

4.3.6 Web technology . . . 65


4.6 Conclusions and future work . . . 69

5 Where differences resemble 71

5.1 Introduction . . . 73

5.2 Materials and Methods . . . 74

5.3 Results . . . 76

5.3.1 Dataset composition and overlap . . . 76

5.3.2 Source organisms . . . 77

5.3.3 Length distribution . . . 81

5.3.4 Amino acid frequencies . . . 81

5.3.5 Conformational propensity . . . 87

5.3.6 Low complexity content . . . 92

5.3.7 Functional characterization . . . 94

5.3.8 Defining ID flavors . . . 98

5.4 Conclusions . . . 99

6 A benchmark for disorder predictors 101

6.1 Introduction . . . 103

6.2 Materials and methods . . . 105

6.2.1 Datasets and classifications . . . 105

6.2.2 Predictors . . . 106

6.2.3 Performance assessment . . . 108

6.3 Results . . . 110

6.3.1 Predictors performance . . . 110

6.3.2 Performance on different subsets . . . 112

6.3.3 Consensus of disorder predictions . . . 114

6.4 Discussion . . . 122

7 MobiDB-lite 125

7.1 Introduction . . . 127

7.2 Implementation . . . 127


7.5 Conclusion . . . 131

7.6 MobiDB-Lite in InterPro . . . 131

8 MobiDB 135

8.1 Introduction . . . 137

8.2 Database description . . . 138

8.3 New curated data . . . 140

8.4 New indirect annotations . . . 141

8.5 New predictors . . . 142

8.6 Usage and annotated data . . . 144

9 ID flavors and functions in the protein universe 147

9.1 Introduction . . . 149

9.2 Results . . . 150

9.2.1 Disorder regions . . . 150

9.2.2 Disorder and taxonomy . . . 154

9.2.3 Classification and functional characterization . . . 155

9.4 Conclusions . . . 162

9.5 Materials and Methods . . . 162

9.5.1 Datasets . . . 162

9.5.2 Statistics . . . 163

9.5.3 Flavors of disorder . . . 163

9.5.4 Functional enrichment . . . 164

10 Conclusions 165

References 171


1.1 Evolutionary history of the cultivated apple . . . 3

1.2 Cost of sequencing . . . 6

1.3 A model explaining the evolution from a 9-chromosome ancestor to the 17-chromosome karyotype of extant Maleae, including the genus Malus . . . 8

1.4 The neighbor-joining tree of the 63 accessions . . . 12

1.5 Dendrograms of SDH genes . . . 19

1.6 TrEMBL entries per taxonomic group . . . 25

1.7 TrEMBL entries in eukaryota . . . 25

1.8 Annotation scores for UniProt subsets . . . 27

1.9 Cost of sequencing and informatics . . . 35

1.10 Growth of entries in UniProt/TrEMBL . . . 36

1.11 Growth of entries in UniProt/SwissProt . . . 36

3.1 Entry overview . . . 48

3.2 Region Description . . . 49

3.3 Variants description . . . 51

3.4 Cultivar Selection Page . . . 53

4.1 DisProt entry page . . . 63

4.2 Distribution of DisProt region length . . . 66

5.1 Merge strategy for overlapping annotation regions . . . 76

5.2 Overlap of IDPs annotated by multiple databases . . . 78

5.3 Number of overlapping residues . . . 79


5.6 Taxonomic diversity significance . . . 83

5.7 ID region length distributions . . . 84

5.8 Heat map of statistical significance of differences in region lengths . . . 85

5.9 Hierarchically-clustered heat map of amino acid enrichment . . . 86

5.10 Hierarchically-clustered heat map of amino acid absolute frequencies . . 87

5.11 Correlation matrix of absolute amino acid frequencies . . . 88

5.12 Correlation matrix of fold increase amino acid frequencies . . . 89

5.13 Das and Pappu classification of ID regions . . . 90

5.14 Heat map of statistical significance of differences in conformational propensity . . . 91

5.15 Quantification of low complexity content of ID regions . . . 92

5.16 Heat map of statistical significance of differences in region low complexity content . . . 93

5.17 Top five enriched Molecular Function GO-terms . . . 95

5.18 Top five enriched Biological Process GO-terms . . . 96

5.19 Top five enriched Cellular Component GO-terms . . . 97

6.1 Amino acid state definitions . . . 106

6.2 DisProt complement ID content . . . 107

6.3 DisProt pairwise identity distribution . . . 109

6.4 DisProt pairwise identity distribution . . . 113

6.5 Difference in performance between secondary and primary methods on the full DisProt 7.0 . . . 115

6.6 Difference in performance between viral and non-viral proteins on the full DisProt 7.0 dataset . . . 116

6.7 Difference in performance between plant and non-plant proteins on the full DisProt 7.0 dataset . . . 117

6.8 Difference in performance between plant and non-plant proteins on the full DisProt 7.0 dataset with strict order definition . . . 118

6.9 Proportion of disordered and structured residues in DisProt 7.0 annotation as a function of ten methods predicting disorder . . . 120

6.10 Distribution of the twenty amino acids in different fractions of the DisProt complement dataset . . . 121


7.2 MobiDB-lite annotation in InterPro . . . 133

8.1 MobiDB 3.0 data overview . . . 139

9.1 Comparison between short (A, left) and long (B, right) IDRs . . . 151

9.2 Length distribution for fully disordered proteins . . . 152

9.3 Box plot of low complexity content distribution in different datasets . . 153

9.4 Disorder region position in the sequence . . . 154

9.5 Disorder distribution for the domains of life . . . 155

9.6 Classification of disorder flavors . . . 156

9.7 GO term enrichment for long IDRs . . . 158


1.1 Plant databases focus . . . 32

4.1 Distribution of DisProt annotation . . . 62

4.2 Controlled vocabulary for disorder functions . . . 67

5.1 Datasets composition . . . 78

6.1 DisProt Dataset composition . . . 108

6.2 DisProt complement performance sorted by descending MCC . . . 111

7.1 Performance of ID predictors . . . 132

7.2 Performance of ID predictors . . . 132

8.1 MobiDB 3.0 databases . . . 140

8.2 MobiDB-lite predictors . . . 143


Glossary

p value In statistical hypothesis testing, the p value or probability value or asymptotic significance is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be greater than or equal to the actual observed results. 75, 76, 83, 85, 88, 89, 91, 93, 112

allopolyploidization The process of allopolyploid formation. 7

AngularJS A structural framework for dynamic web apps that lets you use Hyper Text Markup Language (HTML) as your template language and lets you extend HTML’s syntax to express your application’s components clearly and succinctly. 65

autopolyploidization The process or the result of becoming autopolyploid. 7

Big Data A term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. ii, 24, 33, 34, 37, 168

BioJS A library of over one hundred JavaScript components enabling you to visualize and process data using current web technologies. 65

cal Process, and Cellular Component. 46, 166

CATH / Gene3D The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank (PDB). Protein structures are classified using a combination of automated and manual procedures. There are four major levels in this hierarchy: Class, Architecture, Topology (fold family) and Homologous superfamily. 22, 49, 50

CD-HIT A very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT was originally developed by Dr. Weizhong Li at Dr. Adam Godzik’s Lab at the Burnham Institute. 76

ClinVar A freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. 22, 23

COMPARTMENTS A weekly updated web resource that integrates evidence on protein subcellular localization from manually curated literature, high-throughput screens, automatic text mining, and sequence-based prediction methods. 22

crabapple From (3): “Wild apple species, usually producing profuse blossom and small acidic fruits. The word crab comes from the Old English ‘crabbe’ meaning bitter or sharp tasting. Many crabapples are cultivated as ornamental trees and their apples are sometimes used for preserves. In Western Europe the term crabapple is often used to refer to Malus sylvestris (the European crabapple), in the Caucasus this term refers to Malus orientalis (the Caucasian crabapple) and, in Siberia, to Malus baccata (the Siberian crabapple). The native North American crabapples are Malus fusca, Malus coronaria, Malus angustifolia, and Malus ioensis. Malus sieversii, the main


cultivar A cultivar is an assemblage of plants that (a) has been selected for a particular character or combination of characters, (b) is distinct, uniform and stable in those characters, and (c) when propagated by appropriate means, retains those characters. ii,iii,9–11,24,33, 38,44,46,48,50–53,165,166,168

D2P2 A community resource for pre-computed disorder predictions on a large library of proteins from completely-sequenced genomes. 58

DELL An American multinational computer technology company that develops, sells, repairs, and supports computers and related products and services. Named after its founder, Michael Dell, the company is one of the largest technological corporations in the world, employing more than 103 300 people in the U.S. and around the world. 34

DIBS A repository for protein complexes that are formed between IDPs and globular/ordered partner proteins (http://dibs.enzim.ttk.mta.hu/). 72, 140

DisProt A community resource annotating protein sequences for Intrinsically Disordered Regions (IDRs) from the literature. It classifies intrinsic disorder based on experimental methods and three ontologies for molecular function, transition and binding partner. iii, 39, 56, 167

EMC Named for the initials of its 3 founding principals (Egan, Marino, Curly), it’s now part of DELL. It focuses on developing and selling data storage and data management hardware and software. 34

Ensembl A genome annotation system, developed jointly by the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute, which has been used for the annotation, analysis and display of genomes since 2000. ii, iii, 14, 24, 26, 33, 38, 44, 45, 165

Expected Heterozygosity (He) Heterozygosity expected in a population. 10

Fondazione Edmund Mach An Italian institute for agricultural research located in northern Italy. 2, 5

FuzDB A repository of fuzzy protein complexes based on experimental evidence (either structural or biochemical) (http://protdyn-database.org/). 72–74, 76, 77, 81, 86, 87, 92, 94, 98, 99, 138, 141

Gene Ontology Gene Ontology is a framework for the model of biology. It defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects: molecular function, biological process and cellular component. 17, 22, 45, 46, 94, 155, 165

genotype The genotype is the part of the genetic makeup of a cell, and therefore of any individual, which determines one of its characteristics (phenotype). The term was coined by the Danish botanist, plant physiologist and geneticist Wilhelm Johannsen in 1903. 5

HP An American multinational information technology company developing and providing a wide variety of hardware components as well as software and related services to consumers, small- and medium-sized businesses and large enterprises. 34

IBM An American multinational information technology company headquartered in Armonk, New York, United States, with operations in over 170 countries. The company began in 1911 as the Computing-Tabulating-Recording Company (CTR) and was renamed in 1924. 34

IDEAL A collection of knowledge on experimentally verified intrinsically disordered proteins or intrinsically disordered regions. IDEAL contains manually curated annotations on IDPs in locations, structures, and functional sites such as protein binding regions and Post-Translational Modification (PTM) sites together with references and structural domain assignments. 58, 72–74, 76, 77, 81, 86, 87, 92, 94, 98, 138, 140, 149

INGA A web server for the prediction of protein function. The method has been evaluated by the Critical Assessment of Functional Annotation (CAFA) assessors (2014) among the best predictors. 46, 166


predicting domains and important sites. iv, 14,38,39,47,49,50,126,131,133,134,168, 169

InterProScan The modular computational tool that produces annotation for InterPro starting from sequence alone. 46, 131, 134, 166, 168

JSON A lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. 64

LOD In genetics, a statistical estimate of whether two genes are likely to be located near each other on a chromosome and are therefore likely to be inherited together. 15

MFIB A repository for protein complexes that are formed exclusively by IDPs. As these proteins have no stable tertiary structure in their monomeric form, their folding is induced by the assembly of the complex (http://mfib.enzim.ttk.mta.hu/). 72, 141

Microsoft An American multinational technology company with headquarters in Redmond, Washington. It develops, manufactures, licenses, supports and sells computer software, consumer electronics, personal computers, and related services. Its best known software products are the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. 34

Mobi A method to derive mobility and disorder from biochemical structures. 144

MobiDB A database of protein ID and mobility annotations from curated data, indirect experimental evidence and predicted annotations. iv, 14, 23, 38, 39, 58, 62–65, 72, 74–76, 105, 126, 127, 131, 133, 134, 136–145, 148–150, 157, 162, 163, 168, 169

MobiDB-lite An optimized method for highly specific predictions of long IDR. The method uses

168,169

NGP Non-globular proteins encompass different molecular phenomena that defy the traditional sequence-structure-function paradigm. NGPs include intrinsically disordered regions, tandem repeats, aggregating domains, low-complexity sequences and transmembrane domains. Although growing evidence suggests that NGPs are central to many human diseases, functional annotation is very limited. 7

Node.js® A JavaScript runtime built on Chrome’s V8 JavaScript engine designed to build scalable network applications. 65

NoSQL A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. 24, 166

Oracle Corporation An American multinational computer technology corporation that specializes primarily in developing and marketing database software and technology, cloud engineered systems, and enterprise software products, particularly its own brands of database management systems. In 2014, Oracle was the second-largest software maker by revenue, after Microsoft. 34

ORF The part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that contains a start codon (usually AUG) and a stop codon (usually UAA, UAG or UGA). 7, 45, 165

Petri dish Named after the German bacteriologist Julius Richard Petri, is a shallow cylindrical glass or plastic lidded dish that biologists use to culture cells, such as bacteria, or small mosses. 16

Pfam A large collection of protein domain families. Each family is represented by


phenotype A phenotype (from Greek phainein, meaning “to show”, and typos, meaning “type”) is the composite of an organism’s observable characteristics or traits, such as its morphology, development, biochemical or physiological properties, behavior, and products of behavior (such as a bird’s nest). A phenotype results from the expression of an organism’s genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. 2

PhytoTypeDB PhytoTypeDB is a database of plant proteomes, the function of their components and especially the variability between (cultivated) species. It complements existing resources such as Ensembl and Plaza because it focuses on the protein complement, its structure and function. iii, 33, 37–39, 44, 46–53, 56, 166, 169

polyploidization The process of polyploid formation. 7, 8

Protein Data Bank The Protein Data Bank was established as the historically 1st open access digital data resource in all of biology and medicine. The PDB provides access to 3D structure data for large biological molecules (proteins, DNA, and RNA). 20, 22, 57, 103, 137, 157, 168

PubMed A repository of more than 28 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. 63, 65

REACTOME An open-source, open access, manually curated and peer-reviewed pathway database. 23

REST An architectural style that defines a set of constraints to be used for creating web services. Web services that conform to the REST architectural style, or RESTful web services, provide interoperability between computer systems on the Internet. REST-compliant web services allow the requesting systems to access and manipulate textual representations of web resources by using a uniform and predefined set of stateless operations. 44, 65

SAP A German-based European multinational software corporation that makes enterprise software to manage business operations and customer relations. 34

SEG A predictor that uses the method of Wootton & Federhen (?) to divide a sequence into regions of high and low complexity. 75, 153

self-incompatible A plant incapable of self-fertilization because its own pollen is prevented from germinating on the stigma or because the pollen tube is blocked before it reaches the ovule. 1, 2

Short Linear Motif Short Linear Motifs (also known as SLiMs, Eukaryotic Linear Motifs, Molecular Recognition Elements/Features or minimotifs) are short stretches of protein sequence that mediate protein-protein interaction. 21, 58, 167

SIFTS A project in the PDBe-KB resource for residue-level mapping between UniProt and PDB entries. SIFTS also provides annotation from the IntEnz, GO, InterPro, Pfam, CATH / Gene3D, SCOP, PubMed and Homologene resources. 23

Silk Route A historic trade route that dated from the second century BC until the 14th century AD and stretched from China to the Mediterranean. The Silk Route was so named because of the heavy silk trading during that period. ii, 2, 3

SINTEF One of Europe’s largest independent research organizations. Based in Norway. 33

Software AG An enterprise software company founded in 1969 with over 10 000 enterprise customers in over 70 countries. The company is the second largest software vendor in Germany, and the seventh largest in Europe. 34

Swiss-Prot UniProt KnowledgeBase (UniProtKB)/Swiss-Prot is the manually annotated and reviewed section of the UniProtKB. It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. It was later included in the UniProt consortium becoming a part of UniProtKB. 24, 26, 65


cause sequence data was being generated at a pace that exceeded Swiss-Prot’s ability to keep up. It was later included in the UniProt consortium becoming a part of UniProtKB. 24

UniParc A comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. 136

UniProt UniProt is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProtKB, the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). ii, 14, 24, 87, 168

UniRef UniRef provides clustered sets of sequences from the UniProtKB (including isoforms) and selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view. 24

Whole Genome Duplication Whole genome duplication, or polyploidy, is a product of nondisjunction during meiosis which results in additional copies of the entire genome. 2

XML XML is a markup language designed to store and transport data and to be both human- and machine-readable. 23


Acronyms

3D Three-dimensional. 21, 103

A6PR aldose 6-P reductase. 18

API Application Programming Interface. Glossary: API

AUC Area Under the Curve. 108, 111, 112, 119

BLAST Basic Local Alignment Search Tool. 14, 17, 46–50

BMRB Biological Magnetic Resonance Data Bank. 141, 142

CAFA Critical Assessment of Functional Annotation. 46, 166, Glossary: CAFA

CAID Critical Assessment of Intrinsic protein Disorder. iv, 39, 167

CASP Critical Assessment of protein Structure Prediction. iv, 39, 105, 122, 167

CD Circular Dichroism. 59, 60, 103

Cryo-EM Cryo Electron Microscopy. 138, 141

DB database. iii, 26, 28, 32, 37, 38, 165, 166

DIBS Disordered Binding Site. 72–74, 77, 79–81, 86, 87, 92, 94, 98, 138, 140, Glossary: DIBS

DisProt Database of Protein Disorder. iii, 39, 56, 58–69, 72–74, 77, 78, 81, 86, 87, 92, 94, 98, 99, 102–110, 112, 114–119, 122, 123, 126, 131, 137, 141, 144, 145, 149, 162, 166–169, Glossary: DisProt

Short Linear Motif

EST Expressed Sequence Tag. 28, 45, 165

EVA European Variation Archive. 45

FEM Fondazione Edmund Mach. 2, 4, 5, Glossary: Fondazione Edmund Mach

FESS Fast Estimator of Secondary Structure. 46, 50

FPreg False Positive region. 132

GO Gene Ontology. 17, 22, 28, 45–48, 50, 72, 74, 75, 94, 98, 144, 155, 158, 160, 164–166, Glossary: Gene Ontology

GWAS Genome Wide Association Study. 10, 11

He Expected Heterozygosity. 10, Glossary: Expected Heterozygosity (He)

HSQC Heteronuclear Single Quantum Coherence. 59

IBM International Business Machines. 34, Glossary: IBM

ICNCP International Code of Nomenclature for Cultivated Plants. 9

ID Intrinsic Disorder. iii, iv, 20, 37–39, 72–81, 87, 92, 94, 98, 99, 102–108, 110, 112, 114, 119, 122, 123, 126, 131, 133, 134, 136, 148–151, 153–155, 157, 159–163, 166–169

IDD Intrinsically Disordered Domain. 58

IDP Intrinsically Disordered Protein. iii, 20–22, 39, 56–62, 67, 68, 137, 138, 144, 145, 159, 160

IDR Intrinsically Disordered Region. 20–22, 56–60, 62, 63, 65, 67–69, 126, 137, 138, 141, 144, 149–151, 153–164, 167, 168


JSON JavaScript Object Notation. 64, Glossary: JSON

kbp kilo base pairs. 11

LC Low Complexity. 73–75,92,94 LD Linkage Disequilibrium. 11

LIP Linear Interacting Peptide.138,140–143 LOD Logarithm of the ODDs.15, Glossary: LOD LRR Leucine Rich Repeat. 7

MAF Minor Allele Frequency.11

MCC Matthew’s Correlation Coefficient. 108,111, 114–119,132

MFIB Mutual Folding Induced by Binding. 72–74, 77,81,86,87,92,94,98,138,141, Glossary: MFIB

MoRE Molecular Recognition Element. 58, Glos-sary: Short Linear Motif

MoRF Molecular Recognition Feature. 58, Glos-sary: Short Linear Motif

NGP Non Globular Protein. 7, Glossary: NGP NGS Next Generation Sequencing. ii,4,5,15,34,

126

NMR Nuclear Magnetic Resonance. 59, 66, 103, 104,106,108,112,138,140–142,145,162 NoSQL Non Standardized Query Language.24,37,

166, Glossary: NoSQL

ORF Open Reading Frame.7,15,26,45,165, Glos-sary: ORF

PBA Pedigree-Based Analysis. 10,11

PCC Pearson’s Correlation Coefficient. 108, 111, 112

PDB Protein Data Bank. 20,22,23,49,50,57,64, 65,102–106,110,112,114,119,122,136–138, 141, 144, 157, 168, Glossary: Protein Data Bank

PED Protein Ensemble Database. 57,58,138 PTM Post-Translational Modification. 13,21,22,

58

QTL quantitative trait loci. 2,10,28,45,165 RCI Random Coil Index. 142

REST Representational State Transfer. 44,47,65, 144, Glossary: REST

Rg Radii of Gyration. 60

RH Hydrodynamic Radii. 60

RMSE Root Mean Square Error. 108, 111, 115–118

RNA Ribonucleic Acid. 13, 15, 21, 22, 28, 34, 44, 45, 57, 69, 94, 133, 165

SDH sorbitol-dehydrogenase. 18, 19

SEC Size Exclusion Chromatography. 59, 60

SIFTS Structure Integration with Function, Taxonomy and Sequence. 23, Glossary: SIFTS

SINTEF Selskapet for INdustriell og TEknisk Forskning. 33, Glossary: MFIB

SLiM Short Linear Motif.21,58,138,140,141,167, Glossary: Short Linear Motif

SNP Single Nucleotide Polymorphism. 2,7,10,11, 44,46,50,52,166

TPreg True Positive region. 132

TrEMBL Translated EMBL Nucleotide Sequence Data Library. 24, 75, 86, 89, Glossary: TrEMBL

UniParc UniProt Archive. 136, Glossary: UniParc

UniProt Universal Protein Resource. ii–iv, 14, 17, 23, 24, 26, 27, 38, 39, 45–47, 50, 58, 61, 62, 65, 72–74, 77, 81, 86, 87, 92, 94, 98, 106, 122, 136, 137, 145, 148–150, 154, 155, 157, 159, 163, 168, Glossary: UniProt

UniProtKB UniProt KnowledgeBase. 24,26,87, 141,144, Glossary: UniProt

UniRef UniProt Reference clusters. 24,46–48,50, Glossary: UniRef

UV Ultra-Violet. 59,60

WGD Whole Genome Duplication. 2, 8, 18, Glossary: Whole Genome Duplication

XML eXtensible Markup Language. 23, Glossary: XML


Introduction

1.1 Malus x domestica

Domesticated apple is the most important temperate fruit crop and has been cultivated in Asia and Europe since antiquity (1). According to (2), the genus Malus comprises 30-55 species and several subspecies of so-called crabapples, but its taxonomy is unclear (3) and will probably change in the coming years (4). The denomination Malus x domestica has been generally accepted as the appropriate scientific name (5), even though Malus pumila would be more appropriate according to the rules of botanical nomenclature (3). As the name suggests, the domesticated apple species is the result of both deliberate and unintentional artificial selection by ancient farmers (6,7,8). Some may argue that there is no such thing as a “domesticated” species. The debate is similar to that surrounding the concept of species itself (9) and stems from the confusion between the act of defining a species and the criteria used to define one (10). As Cornille et al. state in their work (3):

domesticated species can be defined as segments of evolutionary lineages diverging from their wild progenitors in response to artificial selection pressures and human control over reproduction.

Apples are incapable of self-fertilization (i.e. they are self-incompatible), resulting in offspring with features different from those of their parents (3). As a consequence, a wide variability within the same population is observed for phenotypic characters of domesticated apple. This makes artificial selection harder and slower than in other species, which is thought to be the reason why grafting became the main cultivation method for apple (3). Since apple is self-incompatible, its lack of interspecific reproductive barriers is thought to be the main driver of the domestication of apple species into domesticated apple (7, 8). The main progenitor of the domestic apple is considered to be Malus sieversii, which grows wild in the Heavenly Mountains (Tien Shan), at the boundary between western China and the former Soviet Union, to the edge of the Caspian Sea (8,11). As summarized in Figure 1 of (3) (Figure 1.1 in this work), domesticated apple was then brought by merchants along the Silk Route, where it hybridized with other wild apples such as Malus baccata in Siberia, Malus orientalis in the Caucasus and Malus sylvestris in Europe (3).

Studying the evolutionary history of Malus x domestica, better understanding its biology and improving breeding efficiency all require the sequence of the apple genome. With the advent of high-throughput sequencing methods, also known as next generation sequencing, many new genomes were sequenced at a rate that was unthinkable a few years earlier (see Section 1.2.1). In 2010, a consortium led by the Italian Fondazione Edmund Mach (FEM) announced that they had sequenced the complete genome of the apple using “Golden Delicious” as a reference (11) (see Section 1.2.2). This work identified 55,000 genes scattered across 17 chromosomes. A model for the formation of the 17-chromosome system was also proposed, involving two Whole Genome Duplication (WGD) events.

1.2 Apple genetics

A primary focus of apple genetics is the study of genes regulating diverse horticultural traits (i.e., phenotypes) of economic importance. Most of these phenotypes are genetically complex; i.e., traits are controlled through a polygenic inheritance mechanism, in which multiple genes, occupying chromosomal positions referred to as quantitative trait loci (QTL), contribute to the observed phenotype.

Figure 1.1: Evolutionary history of the cultivated apple - (A) This history was revealed by recent population studies using different types of molecular markers for evolutionary inferences. (1) Origin in the Tian Shan Mountains from Malus sieversii, followed by (2) dispersal from Asia to Europe along the Silk Route, facilitating hybridization and introgression from the Caucasian and European crabapples. Arrow thickness is proportional to the genetic contribution of various wild species to the genetic makeup of Malus domestica. (B) Genealogical relationships between wild and cultivated apples. Approximate dates of the domestication and hybridization events between wild and cultivated species are detailed in the legend. Abbreviations: BACC, Malus baccata; DOM, M. domestica; OR, Malus orientalis; SIEV, M. sieversii; SYL, Malus sylvestris; ya, years ago. The source of this figure is (3).

Single Nucleotide Polymorphisms (SNPs) are major contributors to genetic variation, comprising approximately 80% of all known polymorphisms. SNPs can occur in both coding and noncoding regions of the genome. Those SNPs found within a coding sequence are of particular interest, since they may alter the molecular function of a protein, altering its contribution to the biological process it is involved in. Their density in plants varies depending on the species (12).
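To illustrate how a SNP within a coding sequence can change the encoded protein, the following minimal Python sketch classifies a single-base substitution as synonymous, missense or nonsense using the standard genetic code. The function name snp_effect and the toy coding sequence are hypothetical and are not taken from any tool used in this work.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def snp_effect(cds, pos, alt):
    """Classify a substitution at 0-based position `pos` of the coding sequence `cds`.

    `cds` is assumed to start at the first base of a codon and `alt` is the
    alternate base (hypothetical interface, for illustration only).
    """
    start = pos - pos % 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"                    # protein sequence unchanged
    return "nonsense" if alt_aa == "*" else "missense"
```

Only the missense and nonsense classes change the protein sequence, which is why coding SNPs of these classes are the ones most likely to affect molecular function.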

With the advent of Next Generation Sequencing (NGS), many new genomes were sequenced at a rate that was unthinkable a few years earlier (see Section 1.2.1). In 2010, a consortium led by the Italian FEM announced that they had sequenced the complete genome of Malus domestica using a “454 Life Sciences” next generation sequencer (11).

1.2.1 Next generation sequencing

Knowledge of Deoxyribonucleic Acid (DNA) sequences has become increasingly important in all fields of biological research and in many applied fields such as medical diagnosis, biotechnology and forensic biology. The first sequencing of a whole genome dates back to 1977, when Sanger et al. obtained the 5,375-nucleotide sequence of the bacteriophage ϕX174 (13). That first experiment opened the way to many whole genome sequencing experiments with methods that were later given the name of “first generation sequencing”. In the mid to late 1990s, many new methods for DNA sequencing were developed, and they were implemented in commercial DNA sequencers by the year 2000. After 2006, genome analysis moved away from automated Sanger sequencing, which had dominated the industry for around twenty years and had led to scientific achievements such as the only finished-grade human genome sequence (14). Despite many technical improvements during this era, the limitations of automated Sanger sequencing highlighted the need for new and improved technologies for sequencing large numbers of genomes (15). Next Generation Sequencing (NGS) technologies are based on a combination of template preparation, sequencing/imaging, and genome alignment and assembly methods. Not only were NGS methods a technological breakthrough that enabled scientists to work at a much faster rate, they also produced a fundamental shift in the approach to basic, applied and clinical research (15). The main advantage of NGS is the production of extremely large amounts of data at a much lower cost. This technological revolution was reflected in an extremely quick drop in the cost of sequencing experiments, which widely exceeded the drop expected from applying Moore’s law to the technology of the time (Figure 1.2).

The availability of huge amounts of data widens the use of these technologies beyond just determining the order of nucleotides in a sequence. For example, microarrays have been


fact while it is relatively more expensive, NGS can identify and quantify rare transcripts without requiring any markers and can provide information on alternative splicing products and sequence variations (16, 17). Sequencing many related organisms has enabled geneticists to perform large-scale comparative and evolutionary studies that were simply unfeasible just a few years earlier. One of the broadest applications of NGS is the “1,000 genomes project” (18), which involved resequencing 1,091 human genomes to build a comprehensive reference of our genome. At the same time, NGS technologies were applied to agricultural settings, as testified by the resequencing of more than one thousand genomes from as many different accessions of Arabidopsis thaliana (19) and of more than three thousand genomes from Oryza sativa accessions (20).

The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences concurrently (21). These technologies differ in the length of the DNA fragments produced, the accuracy of base assignment, the amount of data produced in a single run, the time required for a single run, and the cost per billion bases. For example, while most high-throughput sequencing methods produce reads between 50 and 3,000 base pairs, single-molecule real-time sequencing from Pacific Biosciences produces reads with an N50 of 30,000 base pairs and can reportedly produce reads longer than 90,000 base pairs (22). This above-average read length comes at the cost of accuracy, which is around 87% (23), significantly below the average of NGS sequencers.
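The N50 statistic quoted above is the length L such that reads of length at least L contain at least half of all sequenced bases. A minimal sketch in Python follows; the function name and the read lengths in the usage note are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    Reads are visited from longest to shortest; the N50 is the length of the
    read at which the running total first reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one length")
```

For the hypothetical lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest reads already cover 11 bases, so the N50 is 5. Unlike the mean, the N50 is weighted by bases rather than by reads, which makes it less sensitive to a large number of very short reads.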

1.2.2 Apple Genome

In 2010, a consortium led by the Italian Fondazione Edmund Mach (FEM) announced that they had sequenced the complete genome of domesticated apple using a “454 Life Sciences” next generation sequencer (11). The genome assembly was later refined in (24), where gene predictions (see Section 1.4.2) were significantly improved. Domesticated apple genotypes are all highly heterozygous, which poses technical challenges in genome sequencing and assembly. At the cost of a high coverage of high-quality reads, a heterozygous genome can be assembled and many SNPs identified (25,26).

Figure 1.2: Cost of sequencing - Cost of sequencing a human-sized genome since September 2001 as estimated by the National Human Genome Research Institute. Image from Wikimedia Commons (Historic cost of sequencing a human genome.svg). For full details, see the source data at http://www.genome.gov/sequencingcosts/

The putative gene content of apple was initially reported to be very high, with 57,386 putative genes plus 31,678 transposable element-related Open Reading Frames (ORFs) (11), which is considerably higher than in Arabidopsis thaliana (27,228), poplar (45,654), papaya (28,027), Brachypodium distachyon (25,532), grape (33,514), rice (40,577), sorghum (34,496), cucumber (26,682), soybean (46,430) and maize (32,540) (11). The number of genes in apple was later reduced to 42,140 in (24), similar to what has been reported for pear (42,812 genes) (27). Furthermore, (11) identified 11,444 apple-specific genes, along with a relatively high number of repeated sequences and the lowest number of DNA transposons among the plant genomes analyzed in that study. On the other hand, the apple genome seems to be among the most enriched in transcription factors and contains a high number of Leucine Rich Repeat (LRR) kinases (11). LRR kinases are a class of Non Globular Protein (NGP) (28) that is widely used by plants in signalling and has a significant role in plant defenses (29).

1.2.3 Whole genome duplications

Rosaceae belong to the rosids, which in turn comprise one-third of all flowering plants (12). Most Rosaceae have a haploid chromosome number (x) of 7, 8 or 9, while Maleae, a tribe of Rosaceae within the Amygdaloideae, have a haploid configuration of 17 chromosomes (x = 17), which sets them apart from the other Amygdaloideae. The dominant theory explaining the number of chromosomes in Maleae has long been an allopolyploidization between a species related to modern Spiraeoideae (x = 9) and one related to modern Amygdaloideae (x = 8). A within-lineage polyploidization was also hypothesized (30) but remained a less dominant explanation.

Indeed, many models have been proposed to explain the uniquely high number of chromosomes in Maleae, the long-standing and most popular being the “wide-hybridization” hypothesis, based on an allopolyploidization event between spiraeoid and amygdaloid ancestors (31, 32). Later hypotheses suggest that Maleae originated by autopolyploidization or by hybridization between two sister taxa with x = 9 (similar to Gillenia), followed by diploidization and aneuploidization (33) to x = 17. This last hypothesis is confirmed by Velasco et al. in (11) where they argue their support to the


First, their genomic data agree with observations based on synteny and collinearity of molecular markers (34, 35) and with archeobotanical dates (36) that place the WGD around 48-50 Mya.

Second, the molecular phylogeny of Wx genes in the apple genome confirms the close relationship of Gillenia (x = 9) with the Pyreae (x = 17) lineage, as the Wx gene sequences of Prunus, Spiraea and other Rosaceae genera belong to a different phylogenetic cluster (11).

Lastly, a parsimonious pattern of chromosome breakage and fusion explains the derivation of the current x = 17 Maleae karyotype from a polyploidization event of two x = 9 genomes (11), as shown in Figure 3 of (11) (Figure 1.3 in this work).

Figure 1.3: A model explaining the evolution from a 9-chromosome ancestor to the 17-chromosome karyotype of extant Maleae, including the genus Malus. - A WGD followed by a parsimony model of chromosome rearrangements is postulated. Shared colors indicate homology between extant chromosomes. White fragments of chromosomes indicate lack of a duplicated counterpart. The white-hatched portions of chromosomes 5 and 10 indicate partial homology. The source of this figure is (11).


1.3.1 Definition of cultivar

The term cultivar commonly refers to an assemblage of plants selected for desirable characters that are maintained during propagation. The cultivar is the most basic classification category of cultivated plants in the International Code of Nomenclature for Cultivated Plants (ICNCP) (37). The ICNCP uses the word cultivar in two different senses. Article 2 of the ICNCP defines it as a classification category: “The basic category of cultivated plants whose nomenclature is governed by this Code is the cultivar” (37). The ICNCP also defines a cultivar as a “taxonomic unit within the classification category of cultivar” (37). The latter is the most generally understood meaning of cultivar and serves as a general definition, which can be formally expressed as:

A cultivar is an assemblage of plants that (a) has been selected for a particular character or combination of characters, (b) is distinct, uniform and stable in those characters, and (c) when propagated by appropriate means, retains those characters. (37)

1.3.2 Malus x domestica cultivars

There are more than 7,500 known cultivars of apple (38), which vary in their yield and in the ultimate size of the tree. While propagation by grafting, in use since early domestication (see Section 1.1), would be expected to reduce yield variability, the latter is evident even when cultivars are grown on the same rootstock. From an economic point of view, the vast majority of cultivars are not suitable for mass production, while only a handful of them (around 80) are extremely important and extensively grown around the world. Many different civilizations have apparently fostered the range of apple cultivars, which adapted to environmental conditions and to either deliberate or unintentional artificial selection. Some of them are “McIntosh”, “Jonathan”, “Rome Beauty”, “Red Delicious” and “Golden Delicious” from the USA, “Cox Orange” from the UK and “Granny Smith” from Australia (39).


Delicious” being predominant (40), making “Golden Delicious” among the best candidates to represent the reference genome (see Section 1.2.2).

Cultivars have been artificially selected for desired features, dictated either by the taste of the people cultivating them or by environmental necessities. Some of the qualities desired in modern commercial apple breeding are colorful skin, absence of russeting, ease of shipping, lengthy storage ability, high yields, disease resistance, common apple shape, and developed flavor (41). Commercially popular apple cultivars are soft but crisp (41), and domestication has driven modern apples to be generally sweeter than older cultivars, following changes in popular taste over time. Most North Americans and Europeans in fact favor sweet, subacid apples, but tart apples have a strong minority following (42). Extremely sweet apples with barely any acid flavor are popular in Asia, especially in the Indian Subcontinent (42).

1.3.3 Genetics of cultivars

Despite being a domesticated species, Malus x domestica has retained high genetic diversity throughout the domestication process (11,43,44).

Over the last eight centuries, Malus x domestica has not undergone any remarkable reduction in genetic diversity (44). Across apples from all centuries, Expected Heterozygosity (He) is higher than 0.7 on average, while, remarkably, the diversity of the apples in current commercial production is lower (44). In general, the improvement bottleneck in Malus x domestica is very mild, in contrast to the improvement bottlenecks of many annual and perennial fruit crops, such as barley, rice, soybean, chestnut or peach (44).
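As an illustration of the statistic, expected heterozygosity at a single locus can be computed as He = 1 - Σ p_i^2, where p_i is the frequency of the i-th allele. The sketch below assumes this standard definition without any sample-size correction; the estimator actually used in (44) is not described here, and the function name and allele labels are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity He = 1 - sum(p_i^2) at one locus.

    `alleles` is a flat sequence of observed allele labels, two per diploid
    individual. No small-sample correction is applied (assumption).
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no alleles observed")
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

Under this definition a locus with k alleles cannot exceed He = 1 - 1/k, so an average above 0.7 implies that at least some loci carry four or more alleles.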

The genetic resources available for domesticated apple include a reference genome (11), genetic variants published in open access journals (45, 46, 47, 48), around 22,500 SNPs deposited in the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/) and two publicly available Infinium® SNP arrays, the IRSC 8K and the 20K (45, 47).

These latter have been used to generate genetic linkage maps and to detect quantitative trait loci (QTL) (49,50,51) in families of individuals with both parents in common, for Pedigree-Based Analysis (PBA) (52,53) or for family-based Genome Wide Association Study (GWAS) (54). Despite having been used for GWAS, 8K SNP array have


Linkage Disequilibrium (LD) decays quickly in Malus x domestica and an LD below 0.2 within 100 kilo base pairs (kbp) was observed in (55) and in (54). Even the 20K apple Infinium® array was based on a restricted number of cultivars (i.e., 13) and, although this tool is suited for PBA investigations (53), the genetic background upon which it is based remains too limited for GWAS (44). While the 8K Infinium® array was based upon 27 accessions, covering more diversity, its SNP density is too low for GWAS.
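LD between two biallelic loci is commonly summarized by the squared correlation r² = D² / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA pB. The source does not state which LD measure the threshold of 0.2 refers to, so the use of r² in the sketch below is an assumption, as are the function name and the requirement of phased haplotypes.

```python
def ld_r2(haplotypes):
    """Squared correlation r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, one per phased haplotype, with
    alleles coded 0 or 1 at the first and second locus (assumed encoding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom
```

Two loci that always co-occur give r² = 1, while independent loci give r² = 0. When LD falls to low values over short distances, as reported above, a marker only tags variants in its close vicinity, which is why array density matters for GWAS.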

In 2016, Bianco et al. resequenced 63 apple cultivars (56) covering most of the genetic diversity in cultivated apple. The cultivar genomes were aligned to the “Golden Delicious” reference published in (11), and repetitive regions (simple sequence repeats) were used to compute the phylogenetic distances shown in Figure 1 of (56) (Figure 1.4 in this work). At the same time, in order to overcome the limitations of both previously developed arrays, a much denser SNP array was developed (56). To allow studies of uncharacterized apple collections, a small percentage (7.5%) of rare SNPs with a Minor Allele Frequency (MAF) < 0.05 was incorporated in this array.
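The MAF threshold mentioned above can be illustrated with a short sketch. It assumes biallelic SNPs in diploid individuals, with each genotype coded as the number of alternate alleles (0, 1 or 2); this encoding and the function names are assumptions made for illustration, not the procedure used in (56).

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency of a biallelic SNP in diploid individuals.

    `genotypes` holds the alternate-allele count (0, 1 or 2) of each individual.
    """
    genotypes = list(genotypes)
    if not genotypes:
        raise ValueError("no genotypes given")
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)               # the rarer of the two alleles


def is_rare(genotypes, threshold=0.05):
    """True when the SNP falls below the MAF threshold quoted in the text."""
    return minor_allele_frequency(genotypes) < threshold
```

With 20 individuals, a single heterozygous carrier gives a MAF of 1/40 = 0.025, so the SNP would count as rare under the 0.05 threshold.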

1.4 Annotation of Apple and plant species

1.4.1 Genome annotation

As Lincoln Stein states in his work (57):

For thousands of years, rabbis have laboured over the text of the Torah, seeking to make this cryptic, uneven and internally contradictory text into a coherent system of law, and storing this commentary into an annotated version of the text, known as the Talmud. Over time, the amount of annotation in the Talmud has greatly exceeded the original text. Each line of the Torah is now surrounded by layers of commentary in an onionskin fashion. So it is with the genome.

What Stein means is that, similarly to the Torah, the genome represents a mixture of different (evolutionary) directions, which in the case of the genome were provided by both natural selection and historical accident. As an example of what is an accident, eukaryotic

Figure 1.4: The neighbor-joining tree of the 63 accessions - Genotypic data from sixteen simple sequence repeats were used to compute the dissimilarities among accessions before generating the neighbor-joining tree. Source: (56)


repetitive elements, the latter probably originating by chance from errors of the molecular machinery responsible for DNA replication. Remarkably, some of the most important elements of the basic regulation of the genome are still unknown, for example the regulation of alternative splicing, the control exerted over transcription, the role of non-translated portions of the genome and the function of many non-coding RNAs. In recent years, the attention of the bioinformatics and sequencing communities has shifted towards the problem of annotation, where annotation means producing knowledge from raw data. Annotation can be produced at multiple levels, such as genome, transcriptome, proteome and Post-Translational Modification (PTM), to name a few. At the genome level, annotation is the process of taking the raw DNA sequence produced by ever-improving sequencing technologies (see Section 1.2.1) and adding layers of analysis and interpretation to extract biological meaning, with the final aim of understanding biological processes. Genome annotation itself is a multi-step process that can be broadly divided into three categories: nucleotide-level, protein-level and process-level annotation (57). These categories roughly correspond to three main steps in genome annotation: identifying portions of the genome that do not code for proteins; identifying coding elements on the genome; and attaching biological information to these elements (57). The first two steps can be grouped under the definition of gene prediction (see Section 1.4.2), and the third step can be defined as gene function prediction (see Section 1.4.3).

Early on, genome annotation relied exclusively on experiments on living cells and model organisms. Long before the era of DNA sequencing, physical maps of chromosomes were constructed by taking advantage of gene linkage. The first chromosome map dates back to 1913 (58), and this kind of experiment can be considered the first iteration of gene prediction (i.e., gene finding). Function inference is traditionally performed through gene knockout experiments, first carried out by Martin J. Evans in 1981 (59) after around ten years of work on gene targeting and site-directed mutagenesis.

In recent years, genome annotation has been performed almost entirely by computational means, exploiting methods and techniques from a new branch of biology called bioinformatics. Bioinformatics is both an umbrella term for biological


that have become common practice, in biological data analysis, particularly in the field of genomics. Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and Ben Hesper coined it in 1970 to refer to the study of information processes in biotic systems (60). This definition placed bioinformatics as a field parallel to biochemistry (the study of chemical processes in biological systems) (60). Computers became essential in molecular biology when protein sequences became available, after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff (61). She compiled one of the first protein sequence databases, initially published as books (62), and pioneered methods of sequence alignment and molecular evolution (63).

The simplest tools for gene annotation rely on sequence-alignment-based technologies, like the Basic Local Alignment Search Tool (BLAST). These methods search for homologous genes in specific biological databases like the Universal Protein Resource (UniProt) (see Section 1.5), in order to transfer the homologs’ annotation to the genes at hand (64). These methods are based on the well-established sequence-structure-function paradigm (see Section 1.4.5.1), which implies that similar sequences perform similar functions. As I explain in Section 1.4.5, the explanatory power of this paradigm has been shown to be limited and has now been extensively re-examined. Over the years, however, more elaborate systems have arisen to infer function from sequence and evolutionary data, borrowing methods from information theory and computer science and deploying increasingly accurate information to annotation platforms and databases (see Section 1.5). Due to the fragmentation of annotation methods, some databases started using a modular approach, integrating different resources to increase their coverage (e.g. MobiDB (65), InterPro (66)). Other databases (e.g. Ensembl (67)) rely on both curated data sources and a range of different software tools in their automated genome annotation pipeline (67). Due to ever-improving sequencing technologies (see Section 1.2.1), which produce an ever-increasing number of sequenced genomes (see Section 1.6.2), computer science has become increasingly important over time and has become predominant in genome annotation in recent years.
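The homology-based annotation transfer described above can be sketched as follows. The example assumes BLAST hits in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); the function name, the annotation dictionary and the E-value cutoff of 1e-5 are hypothetical choices made for illustration.

```python
def transfer_annotations(blast_lines, annotations, max_evalue=1e-5):
    """Assign to each query the annotation of its best annotated hit.

    `blast_lines` are tab-separated lines in 12-column BLAST tabular format;
    `annotations` maps subject identifiers to a functional description.
    Hits above `max_evalue` or without annotation are ignored, and the hit
    with the highest bit score wins.
    """
    best = {}                                        # query -> (subject, bit score)
    for line in blast_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue or subject not in annotations:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: annotations[subject] for query, (subject, _) in best.items()}
```

This is deliberately the simplest form of transfer: it takes the paradigm that similar sequences perform similar functions at face value, and therefore inherits the limitations of that paradigm discussed above.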

1.4.2 Gene prediction

Gene prediction, or gene finding, is the process of identifying the positions in a raw assembled DNA sequence that encode functional elements. Functional elements range from protein-coding genes to RNA genes, but may also include other elements such as regulatory regions. Nowadays, gene prediction represents one of the first steps in interpreting a new genome (57) and is carried out entirely by computational means. However, finding the position of genes along chromosomes is a practice that dates back to the early 1910s. Originally, “gene finding” was based on long and delicate experiments on living cells and organisms, such as those performed by Sturtevant et al. to produce the first linkage map of a chromosome (58). Subsequent statistical analyses of the rates of homologous recombination between genes, for example the Logarithm of the ODDs (LOD) score developed by Newton E. Morton in 1955 (68), make it possible to decipher the order of genes along a chromosome. Results from a series of recombination experiments are used to generate a genetic map that roughly indicates the position of known genes relative to each other. Today, with NGS technologies and computational resources, gene finding has been redefined as a largely computational problem.
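For phase-known meioses, the two-point LOD score mentioned above compares the likelihood of the observed recombinants at a recombination fraction θ with the likelihood under free recombination (θ = 0.5): LOD(θ) = log10[(1 - θ)^(N - R) θ^R / 0.5^N], where R of the N meioses are recombinant. The sketch below implements this textbook form; the function name and the counts in the usage note are hypothetical.

```python
import math


def lod_score(recombinants, non_recombinants, theta):
    """Two-point LOD score for phase-known meioses.

    Compares linkage at recombination fraction `theta` (0 < theta <= 0.5)
    against free recombination (theta = 0.5), on a log10 scale.
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    n = recombinants + non_recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked
```

With 10 meioses and no recombinants, evaluating at θ = 0.1 gives 10 · log10(1.8), roughly 2.55. A LOD score of 3 or more is conventionally taken as evidence of linkage, and evaluating the score over a grid of θ values gives an estimate of the genetic distance between two loci.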

Determining that a sequence performs a function in the cell and determining what that function is are different problems. Discovering the function of a gene product, and especially confirming that a gene prediction is accurate, still largely requires in vivo or in vitro experimentation (69) through gene knockout or other assays, although recent bioinformatics research (70, 71) is steadily improving the prediction of a gene’s function from its sequence alone.

Alongside the first sequencing of the Malus x domestica genome (11), a combination of computational methods (i.e. FgenesH (72), Twinscan (73), GlimmerHMM (74) and GeneWise (75)) was employed to produce a collection of predicted genes. This study reported a very high gene content, with 57,386 putative genes plus 31,678 transposable element-related ORFs (11), which is considerably higher than in Arabidopsis thaliana (27,228), poplar (45,654), papaya (28,027), Brachypodium distachyon (25,532), grape (33,514), rice (40,577), sorghum (34,496), cucumber (26,682), soybean (46,430) and maize (32,540) (11). The number of genes in apple was later reduced to 42,140 in (24).

1.4.3 Gene function prediction

Although the label “prediction” might suggest elaborate computational models, experiments to infer gene function, either in silico or in vivo, have a long history. Function inference is traditionally derived from the results of gene knockout experiments. This method dates back to the late twentieth century and earned its three main contributors, Mario R. Capecchi, Martin J. Evans and Oliver Smithies, a shared Nobel Prize in Physiology or Medicine in 2007 (76). Martin J. Evans generated the first knockout mouse, showing that embryonic stem cells (ES cells) could be taken directly from the blastocyst stage of an embryo and cultured in Petri dishes (59). Oliver Smithies and Mario Capecchi independently developed methods using homologous recombination to precisely target any gene in the genome (77,78), after around ten years of work on gene targeting and site-directed mutagenesis. Over the years, gene knockout techniques have improved continuously, from conditional knockouts (79) to the recent development of the CRISPR/Cas technology, which effectively and specifically changes genes within organisms (80,81,82,83,84,85,86,87). Other experimental techniques, such as microarray analysis, RNA interference and yeast two-hybrid assays, are used to demonstrate the function of a protein experimentally.

While discovering new functions of a gene product and confirming that a gene prediction is accurate require in vivo or in vitro experimentation (69), recent bioinformatics research (70,71) is increasing the accuracy of function prediction methods. Furthermore, advances in sequencing technologies (see Section 1.2.1) have made the rate at which proteins can be experimentally characterized much slower than the rate at which new sequences become available (88) (see Section 1.6.2).

Bioinformatics methods for gene function prediction assign biological or biochemical roles to proteins. These proteins are often poorly studied or are putative proteins identified through gene prediction techniques applied to genomic sequence data (see Section 1.4.2). Since automatic annotation can often be performed quickly and for many genes at once, newly sequenced genomes are mostly annotated by prediction through computational methods. Early techniques transferred function from homologous proteins with known functions, a process called homology-based function prediction. The development of context-based and structure-based methods has improved the sensitivity and accuracy of predictions (88). The importance and prevalence of computational


the Gene Ontology (GO) database: as of 2010, 98% of annotations were listed under the code Inferred from Electronic Annotation (IEA), while only 0.6% were based on experimental evidence (89). The information fed to modern predictors comes in the form of sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, protein-protein interactions or a combination of the above. The application of computer science methods to gene function inference also raised an important question: what does function mean for a gene or its protein product? Protein function is a broad term that includes all the roles a protein is involved in, from catalysis of biochemical reactions to transport, signal transduction and structural roles within the cell (90). Generally, function can be thought of as “anything that happens to or through a protein” (90). The Gene Ontology (GO) Consortium provides a useful classification of functions, based on a dictionary of well-defined terms divided into three main categories: molecular function, biological process and cellular component (91).

Alongside the first sequencing of the Malus x domestica genome (11), a combination of computational methods was employed to produce gene predictions (see Section 1.4.2). Predicted protein sequences were searched with BLAST against UniProt, protein domain data banks and plant protein databases annotated with GO terms. GO terms were subsequently extracted by other computational means (11). In my work I expand on gene function prediction by employing a novel method, INGA (92) version 2, which outperforms other methods in the cellular component GO category for plants and excels in the other GO categories. I then create a platform to easily visualize functional annotation and retrieve gene entries by function of interest (see Chapter 3).
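The homology-based transfer of GO terms described above can be sketched in a few lines. This is a simplified illustration, not the pipeline used in (11) nor the method implemented in INGA: it assumes BLAST hits in the default 12-column tabular output (query, subject, ..., e-value, bit score) and a precomputed mapping from subject proteins to GO terms. The function name, the e-value threshold and all identifiers in the toy data are hypothetical.

```python
def transfer_go_terms(blast_rows, subject_go, max_evalue=1e-5):
    """Give each query the GO terms of its best-scoring annotated hit.

    blast_rows: iterable of tab-separated lines in the default BLAST tabular
                layout (e-value in column 11, bit score in column 12).
    subject_go: dict mapping subject identifiers to lists of GO term IDs.
    max_evalue: hits with a larger e-value are ignored (assumed threshold).
    """
    best = {}  # query -> (bit score, subject) of the best accepted hit
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Skip weak hits and hits to proteins without any annotation.
        if evalue > max_evalue or subject not in subject_go:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: set(subject_go[subject]) for query, (_, subject) in best.items()}


# Made-up hits: MD_hypo1 has two acceptable hits, MD_hypo2 only a weak one.
toy_hits = [
    "MD_hypo1\tP_hypoA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-80\t300",
    "MD_hypo1\tP_hypoB\t40.0\t180\t100\t2\t5\t185\t3\t183\t1e-10\t60",
    "MD_hypo2\tP_hypoB\t30.0\t90\t60\t3\t1\t90\t1\t90\t0.5\t25",
]
toy_annotations = {"P_hypoA": ["GO:0003824"], "P_hypoB": ["GO:0005634"]}

predicted = transfer_go_terms(toy_hits, toy_annotations)
print(predicted)
```

Taking only the single best hit is the simplest possible design choice. It leaves queries without a confident hit unannotated, which is one reason why context-based and structure-based predictors were developed.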

1.4.4 Gene families

A gene family is a set of genes deriving from a common ancestral gene through various kinds of duplication processes. Natural selection and genetic drift shape gene families differently, so that they differ vastly in size and internal similarity. There are several ways to form a gene family in a genome, including genome duplication, tandem duplication and

Alongside the sequencing of the Malus x domestica genome (11), gene families were identified and compared to those of cucumber, soybean, poplar, Arabidopsis thaliana, grape, rice, Brachypodium distachyon and sorghum. This comparison revealed apple-specific subclades of genes encoding MADS-box transcription factors (see Section 1.4.4.1) and highlighted an enrichment in sorbitol-related genes (see Section 1.4.4.2).

1.4.4.1 MADS-box transcription factors

An interesting feature of apple physiology involves its fruit, the pome, found only in the Maleae tribe (94). This suggests that the pome evolved after a relatively recent Maleae-specific WGD that is hypothesized to have contributed to apple’s specific development and metabolism. Apple fruit forms through expansion of the receptacle, the region below the whorl of sepals of the apple flower. Since in plants MADS-box genes determine the fate of floral tissues (95), they may play a role in fruit development. Various studies suggest that MADS-box transcription factors are critical to a number of developmental processes, including fruit development and the control of flowering time and gametophyte cell division (96,97,98).

1.4.4.2 Sorbitol-related genes

Carbohydrate metabolism is important in fruit development. Since sorbitol, a sugar alcohol with a sweet taste, is the main chemical form used by Rosaceae to transport photosynthesis-derived carbohydrates (99,100), it is likely to play an important role in pome-specific features. Malus x domestica has significantly more genes related to sorbitol metabolism than other plant species (11). These are aldose 6-P reductase (A6PR), which is rate-limiting for sorbitol biosynthesis; sorbitol dehydrogenase (SDH), which catalyzes the conversion of sorbitol to fructose in the fruit (101); and the Rosaceae-specific sorbitol transporter PcSOT2 (102).

In their study (11), Velasco et al. report 71 sorbitol metabolism genes in apple, while in other species the number ranges between 9 and 43. They also highlight a Rosaceae-specific evolutionary trend towards fruit organ specialization, which may be partially based on gene duplication (11). WGD events produced large families of paralogous genes, which seems particularly evident for SDH (Figure 1.5) (11).

Figure 1.5: Dendrogram of SDH genes - Detail of Supplementary Figure 14 of (11), which shows dendrograms of genes putatively involved in pome development. The phylogenetic tree of the SDH gene family is shown.

1.4.5 Beyond gene families and domains: intrinsic disorder

The concepts of gene family and domain, which are widely exploited to generate sequence-based annotation, rest on the premise that protein sequence conservation drives function conservation. While this idea undoubtedly holds for globular proteins, it is not enough to cover the whole protein function space. Almost 20 years ago, it was suggested that many proteins and protein regions lack a stable three-dimensional structure and are rather “intrinsically disordered” under native, physiological-like conditions (thus named Intrinsically Disordered Protein (IDP) or Intrinsically Disordered Region (IDR), respectively) (103,104,105). While the idea that some regions lack a fixed structure gained ground in the scientific community, resistance to the idea that these regions are functional lasted much longer. In later years, it was finally acknowledged that structural disorder plays fundamental roles in the cell, primarily in cellular signaling and regulation (106). It was then discovered that, because of their role, IDPs/IDRs are often implicated in diseases (107) and represent important drug targets (108).

1.4.5.1 Sequence-structure-function paradigm

At the center of classical biochemistry lies the idea that protein function depends on a fixed three-dimensional structure, as the unique spatial pattern of properly placed amino acid residues creates a special physico-chemical microenvironment tailored for tight and extremely specific interaction with the environment. The detailed description of this structure therefore holds the key to understanding the biological process a protein is involved in. In turn, the structure is perfectly encoded in the protein sequence, as a specific pattern of amino acids is driven through the folding funnel to acquire a specific folded state. These concepts are crystallized in the name sequence-structure-function paradigm. A solid foundation for this view is provided by the 130,000 structures of proteins and complexes in the Protein Data Bank (PDB) (109). However, it is increasingly recognized that many proteins do not obey this rule. IDPs and IDRs are devoid of order in their native unbound state (105,110,111).

Considering IDPs as a separate and legitimate class of proteins is a recent achievement of biochemistry. The Intrinsic Disorder (ID) state of a polypeptide was long considered an experimental error, and later on it was acknowledged only as an intermediate state that the protein passes through to reach its native and functional conformation. Now we know that many proteins lack a stable three-dimensional (3D) structure and are rather intrinsically disordered under physiological conditions (103,104,105). The recognition of this structural phenomenon brought a radical change in the sequence-structure-function paradigm, and critically extended the general appreciation of the role of dynamics in protein function.

1.4.5.3 Biological function of intrinsic disorder

The early emphasis in the field of protein disorder was on proteins that are mostly or completely disordered, such as MAP2 (112), tau (113), myelin basic protein or α-synuclein (114). These proteins escaped characterization by the dominant experimental methods such as X-ray crystallography, and the experimental challenges related to them drew attention to the phenomenon. However, the protein universe is far more complex than the simplistic binary separation between “order” and “disorder”. All proteins have some movements, and no protein is completely chaotic (115). The term “disorder” may come into play when a protein lacks tertiary structure, but it is often used to describe regions devoid of secondary structure or with conditionally present secondary structure elements. Furthermore, the definition of a protein as intrinsically disordered is largely supported by the way it performs its function. IDPs have functional advantages provided by their high entropy, their accessibility and their plasticity (115). High entropy refers to the inherent dynamic movement of IDRs, which creates a less restricted space. Site accessibility is essential for the binding of other molecules and for the post-translational modification (PTM) of the protein. Plasticity summarizes the ability of IDRs to interact with other molecules by changing shape, even becoming more ordered, and triggering other reactions.