Algorithms and data structures for big labeled graphs


Dipartimento di Informatica

Dottorato di Ricerca in Informatica

Algorithms and data structures for

big labeled graphs

Ph.D. Thesis

October 2016

Francesco Piccinno

piccinno@di.unipi.it

Supervisors

Paolo Ferragina

Università di Pisa

Rossano Venturini

Università di Pisa

Referees

Dafna Sheinwald

IBM Research

Steven Skiena


This thesis focuses on the design of efficient (in terms of time and space) and efficacious (in terms of precision and recall) algorithms and data structures that leverage the information encoded in big labeled graphs, with the ultimate goal of deploying them in the solution of problems emerging in the realm of Search Engines and Social Networks.

In the first part of the dissertation, we focus on Entity Linking and its applications. The procedure can be thought of as a semantic augmentation of documents with entities drawn from a large knowledge base, such as Wikipedia, that are linked to meaningful sequences of terms found in those documents. Our contributions are the following: first, we introduce a new and easy-to-use benchmarking platform that can be deployed to compare entity linkers with minimum effort; second, we propose a scalable disambiguation algorithm that hinges on the theory of word embeddings and language models in order to provide state-of-the-art or near state-of-the-art entity-linking performance at Web scale; third, we show the application of entity linking in a noisy and difficult domain such as Twitter messages, and derive a graph-based representation of hashtags' meaning, which, in turn, can be employed to tackle the challenging problems of hashtag relatedness and hashtag classification.

In the second part of the dissertation, we deal with the problem of compressed representation of knowledge graphs and, in general, of large labeled graphs. We propose a novel and efficient encoding of labeled graphs that need to support sophisticated search operations which involve both the linked structure of the graph and the string content of its nodes.


Contents

1 Introduction
  1.1 Thesis organization

2 Background and tools
  2.1 Web of Data
  2.2 Knowledge Bases
  2.3 Entity Linking
  2.4 Applications of Entity Linking
  2.5 Graph compression
  2.6 Tools
    2.6.1 Spectral centrality measures for directed graphs
    2.6.2 Succinct representation of integer sequences
    2.6.3 Word embeddings
    2.6.4 Linear Support Vector Machines
    2.6.5 Learning-to-rank with LambdaMART

3 Entity Linking
  3.1 Terminology
  3.2 The structure of an entity annotator
  3.3 Notation
    3.3.1 Tasks
    3.3.2 Metrics
    3.3.3 Matching relations
  3.4 GERBIL: A benchmarking framework
    3.4.1 Supported experiments
    3.4.2 Datasets
    3.4.3 Entity annotators
    3.4.4 A running example
  3.5 Summary

4 How to design an entity annotator
  4.1 The first version of WAT (2014)
    4.1.1 Spotter
    4.1.2 Disambiguator
    4.1.3 Pruner
    4.1.4 Experimental results
  4.2 The new WAT (2016)
    4.2.1 Surface Form database
    4.2.2 Entity embeddings
    4.2.3 The algorithmic structure of the disambiguation task
  4.3 Experimental results
    4.3.1 The Prior module
    4.3.2 The Filter module
    4.3.3 The Coreference module
    4.3.4 The Centroid module
    4.3.5 The Voting module
    4.3.6 The Ranking module
    4.3.7 The final configuration for WAT 2.0
  4.4 Summary

5 Hashtag analysis via Entity Linking
  5.1 Related work
  5.2 The HE-graph
  5.3 Hashtag Relatedness
  5.4 Hashtag Classification
  5.5 Experimental results
  5.6 Summary

6 String search in labeled graphs
  6.1 Related work
  6.2 Notation and background
  6.3 Problems and algorithms
    6.3.1 Prefix-search over friends (or FoF)
    6.3.2 Top-k prefix-search over friends (or FoF)
  6.4 Experimental results
  6.5 Summary

1 Introduction

In mathematics, and more specifically in graph theory, a graph is a structure consisting of a set of nodes that are interconnected by edges. Nodes are used to represent objects whereas edges describe relationships among those objects. It is very common to represent structures of any sort as graphs, so that many problems can be easily formulated in terms of graph structures and solved by deploying (sometimes known) graph algorithms. For instance, the most famous example of a modern graph is the World Wide Web: web pages are its nodes and an edge exists from page a to page b iff a contains a link pointing to b. Due to their flexibility and applicability to virtually any context (from Biology to Sociology, from Humanities to Physics, just to cite a few), graphs are of major interest in Computer Science, and so is the design of efficient graph algorithms.

Until a few years ago, however, the vast majority of scientific works focused their attention on relatively simple graphs: nodes were instances of a single type and edges were used to represent a unique form of relationship between two nodes, mostly carrying numerical information, called a weight. Recently, an interest in more complex and heterogeneous graph structures started to emerge in the fields of Information Retrieval, Data Mining and (Social) Network Analysis. The Web of Data played a key role in this paradigm shift. Until recently, the Web was considered the most interesting and the largest publicly available resource in graph form about human knowledge. The key principles that made the Web so popular and widely used were, and still are, mainly two: (i) the use of hyperlinks to model connections among pages (i.e., the graph structure), and (ii) the use of the Uniform Resource Identifier to uniquely identify and access any web document and resource. Building upon the same key principles, the Web of Data is nowadays the de-facto standard for organizing, describing and linking in a machine-understandable way heterogeneous data coming from different data sources. This organization, in turn, allows the empowerment or the design of new algorithms to crawl, navigate, consume and mine new information from this free and huge labeled graph of semi-structured data made up of billions of connections.

The building blocks of the Web of Data itself are the well-known Wikipedia [96], Freebase [21] and DBpedia [5] (a comprehensive list of available datasets can be found in the work by Weikum [203]). The information contained in these datasets can be mapped to a "knowledge graph" in which nodes are entities (i.e., single objects or concepts that exist in the world) and edges describe several kinds of relationships between two entities (e.g., is-a, is-related-to, contains-a, is-part-of, etc.). The number of entities is very large (of the order of millions) and the nature of the relations among them is rich and diverse (they can be structured or unstructured, up to free text, and very large too, of the order of billions). All this information can be used to empower the algorithmic solutions of classical tasks in Information Retrieval by exploiting the semantic layer they provide. Notable examples are web search, tweet classification and understanding, network analysis, social search, etc.: once the mapping between an object (be it a web page, a search query, a tweet, etc.) and its semantics is done correctly, the labeled graph provided by these knowledge bases can be exploited to perform high-quality retrieval at fast speed.

In this thesis we will be concerned mainly with the so-called Entity Linking problem, which constitutes one of the key steps to automate this semantic mapping. It asks for detecting meaningful sequences of terms in a natural language text, called spots or mentions, that might refer to an entity listed in a given knowledge base such as Wikipedia. This is a nontrivial task to solve algorithmically, due to the inner ambiguity of natural language mentions (e.g., the mention "apple" might be linked to the entity representing Apple Inc. or to the fruit) and because it is heavily dependent on the length of the input text (which might range from a web search query to a tweet or even a long document). The power of this mapping resides in the fact that the knowledge base from which the entities are picked is structured in the form of a complex labeled graph that explicitly and implicitly encodes interesting relations among those entities, whose proper mining could extract useful information for the post-processing steps of other IR tools. For example, the subgraph formed by the discovered entities could be used to understand what the text is talking about, which other entities are related to the annotated ones, what the main topics of the text are, etc. The literature provides several examples of tools targeting the entity linking problem (e.g., see [102, 137, 75, 171, 145]).


Despite their early stage of development, these systems achieve today surprisingly good results and they have been used to bring significant improvements in applications such as document clustering [182], classification [198], query suggestion [22] and micro-blogging augmentation [137], just to cite a few. Despite the satisfactory performance, very few annotators provide robust performance across diverse and variegated data sources (e.g., web pages, tweets, news, etc.), and none of such tools, to our knowledge, is optimized to handle Web-scale annotations, thus limiting their applicability to small and well-formed datasets.

Another issue that none of the known systems has yet taken into account is the fact that, in order to be more and more effective, these systems need larger and larger graphs (it is said that the Google knowledge base sums up to 1 billion entities, about two orders of magnitude larger than Wikipedia), which, in turn, urges the design of novel and efficient (in access time and storage space) representations of those graphs, supporting basic as well as more complex (e.g., mining) access operations, in the same way it has been done for compressed texts (e.g., [70, 150]), compressed sequences (e.g., [71, 77]), compressed dictionaries (e.g., [76, 67, 103]), etc. All these algorithmic issues go under the umbrella of Graph Compression, which recently became a very active area of research spurring a few interesting results [18, 26, 3, 46]. However, the most well-known tools, namely WebGraph [18, 19] and others, allow one to compress only the graph structure efficiently, but they fail to support labeled-graph compression and more sophisticated access patterns other than the retrieval of the adjacency list of a given node. In fact, compressing the structure of social networks is even more difficult [42, 16], and no significant results have been obtained in the realm of labeled graphs.

This thesis will address both issues above, using as a fil rouge the design of algorithms and data structures for the efficient (in time and space) and efficacious (in terms of precision and recall) exploitation of the large amount of information encoded in big labeled graphs, with the ultimate goal of deploying them in the solution of newly emerging problems in the realm of Search Engines and Social Networks. More specifically, we will attack the following four problems, dedicating one chapter of the thesis to each:

(i) Formalize the entity linking problem and the various sub-tasks a given entity annotator has to implement, by introducing a proper evaluation framework that can be used by the entity-linking community in order to simplify research in the field (Chapter 3).

(ii) Improve the efficiency and efficacy of existing entity annotators, which currently offer unsatisfactory performance and are not yet ready to be used in production environments (Chapter 4).

(iii) Use those entity annotators to tackle problems such as hashtag relatedness and hashtag classification in Twitter (Chapter 5).

(iv) Design new algorithms and data structures for storing and efficiently retrieving information contained in big labeled graphs as they occur in modern Web and Social Network applications (Chapter 6).

1.1 Thesis organization

The structure of the thesis is as follows. After a brief overview of the basic concepts and literature, we present our contributions, which are then concluded by Chapter 7, containing some directions for future research.

Benchmarking Entity Linking Systems (Chapter 3). Linked Data is now becoming the de-facto technique to link resources on the Web and to enrich them with a semantic meaning. This is mostly useful to ease the job of computers in achieving a real understanding of a generic resource (URI), be it a Web page, a tweet, a query, etc. Unfortunately, most of these annotations have to be manually curated, and they are therefore not scalable. Entity linking promises a scalable and accurate approach to this problem for textual documents.

In this chapter we deepen our understanding of the entity linking problem, by introducing a proper terminology that will be used throughout this thesis and by providing a formal description of tasks and matching relations that can be used to assess and compare in an unambiguous and fair manner the performance of entity annotators.

We then introduce GERBIL, the main contribution of this chapter, which is a generic framework for the evaluation of entity linking systems against a collection of datasets that are widely used in the entity linking literature. The need for such a framework comes from the fact that the entity linking problem, being treated by different people with different backgrounds, has led to different definitions with possibly noisy and often incomparable results. GERBIL aims to be the de-facto standard for evaluating entity-linking tools, with the ultimate goal of simplifying research in the field. We conclude the chapter with a full review of several entity annotator systems, most of which are supported by the GERBIL platform.

This chapter is based on the paper "GERBIL: General entity annotator benchmarking framework" published in the Proceedings of the 24th International Conference on World Wide Web, 2015 [196].

Designing an Entity Annotator for Web-Scale Annotation (Chapter 4). We detail the algorithmic principles and design of Wat, our entity annotator system, from its first version of 2014 to its second version of 2016 (not yet published). More precisely:

1. Our group designed in 2010 one of the first academic entity annotators, i.e., Tagme [75]. It has been queried over 500 million times, with an average of 250 thousand queries per day. The new annotator Wat is designed over Tagme's algorithmic principles but improves the original F1 on average by 12% and the speed of recent proposals [86, 210] by up to a factor of 34. These figures foresee the deployment of this entity linking technology at Web scale.

2. We propose a new disambiguation algorithm that hinges upon two novel sets of features specifically designed for high-speed computations. The former set is computed via a linear-time algorithm that jointly orchestrates entity and textual embeddings. The latter set of features is computed via a voting procedure that uses a new entity relatedness function, which hinges on the theory of language models in order to avoid the data-sparsity problem of other known entity relatedness functions (e.g., see [144, 38]).

3. Wat offers competitive disambiguation performance across all tested datasets, providing state-of-the-art results on the AIDA/CoNLL datasets (88-90% F1) with an F1 score increase ranging from 3% up to 4% with respect to [86]. Results on the other datasets are near state-of-the-art, with our proposal ranked in the vast majority of cases as the second-most accurate entity annotator while guaranteeing higher speed.

This chapter is based on the paper "From TagME to WAT: a new entity annotator" published in the Proceedings of the 1st International Workshop on Entity Recognition & Disambiguation, co-located with SIGIR 2014 [165].

Analyzing Hashtags via Entity Linking (Chapter 5). We explore the use of Entity Linking techniques to derive a new synthetic Knowledge Graph in a noisy and difficult domain such as Twitter. We specifically focus on hashtags, which are now becoming the most used way to tag short messages in social networks, and address two challenging problems regarding the "meaning of hashtags", namely hashtag relatedness and hashtag classification, thus providing two main contributions:

1. We construct a novel graph upon hashtags and entities drawn from tweets by means of entity linking annotators. This graph allows modeling in an efficacious way not only classic co-occurrences but also semantic relatedness among hashtags and entities, or between entities themselves. Based on this graph, and through the use of the Wikipedia Category graph as a linked data resource, we design algorithms that significantly improve state-of-the-art results on known publicly available datasets, bringing down the error rate for hashtag relatedness to 1% and improving hashtag classification by up to 9% F1 (absolute).

2. We release to the research community two new datasets: the former is a new dataset for hashtag relatedness, the latter is a dataset for hashtag classification that is up to two orders of magnitude larger than the existing ones. These datasets will be used to show the robustness and efficacy of our approaches, showing improvements in F1 of up to two digits in percentage (absolute).

This chapter is based on the paper “On analyzing hashtags in Twitter” published in the Proceedings of the 9th International AAAI Conference on Web and Social Media, 2015 [72].

Searching Strings in Labeled Graphs (Chapter 6). Storing and searching large labeled graphs is indeed becoming a key issue in the design of space/time-efficient online platforms indexing modern social networks or knowledge graphs. But, as far as we know, all existing results are limited to the design of compressed graph indexes that support basic access operations onto the link structure of the input graph, such as: given a node u, return the adjacency list of u.


Starting from the more difficult and challenging problem of finding the most similar entities that respect a given predicate from a seed node in a compressed Knowledge Graph (e.g., the entities that are most related to Barack Obama, that are of type person and that are in a "lives in" relation with New York), we study a simpler problem that finds a practical justification in the realm of online social networks, paving the way for future work along those lines.

Specifically, our contributions can be summarized as follows:

1. We formalize and propose efficient solutions to a set of string-matching problems with some neighbor constraints in labeled and general graphs, such as: prefix search over friends, prefix search over friends of friends, top-k prefix-search over friends, and top-k prefix search over friends of friends.

2. We provide a thorough experimental analysis of the proposed solutions. On real-world datasets our solutions occupy nearly the same space as other state-of-the-art graph encodings, but with an improvement in query time of a factor of up to 33, 38 and 20 over the best IR-based approach, respectively for prefix search over friends, prefix search over friends of friends and top-k prefix search over friends of friends.

The techniques proposed in this chapter are general enough to be applied in the context of social networks, Knowledge Graphs and other synthetic graphs that need to support sophisticated search operations that involve both the linked structure of the graph and the string content of its nodes.

This chapter is based on the paper "Compressed Indexes for String Searching in Labeled Graphs" published in the Proceedings of the 24th International Conference on World Wide Web, 2015 [73].

2 Background and tools

In this chapter, we introduce the concept of the Web of Data, including a brief overview of the most widely used resources, mainly in the context of entity linking. We then give an informal description of how entity linking works and why this technique is useful, in combination with the information derived from the Web of Data, to tackle various difficult IR/NLP problems. We later overview a series of graph compression techniques, which may be useful in the context of the Web of Data and entity linking in general to tackle the problem of efficiently accessing large knowledge bases, possibly stored in a compressed representation. We conclude this chapter by reviewing a set of algorithmic tools that will be used throughout this thesis.

2.1 Web of Data

The World Wide Web is a hyper-linked space of documents and resources, each uniquely identified by a sequence of characters called a Uniform Resource Identifier (URI, in short). Similarly, the Web of Data can be informally described as the process of linking data resources together using Web technologies, hence the name.

The idea of the Web of Data emerged from the Semantic Web initiative, born to ease the process of understanding the contents of a Web resource by a machine. Initially the effort went into the annotation of Web pages with metadata (e.g., micro-formats), to enable algorithms to interpret text and resources without any ambiguity. However, this idea remained largely unrealized [184], as it required a technical background the average user lacked. The Web of Data is the realization of that vision in the data domain, made possible by the use of URIs to identify resources and the deployment of machine-understandable annotations, in order to solve the problem of organizing data across the Web.

The necessity of organizing heterogeneous data mainly spurred from the existence of countless structured and semi-structured datasets, coming either from public community efforts (e.g., Wikipedia, MusicBrainz), public sources (e.g., GeoNames) or from private companies (e.g., Freebase).

Probably the best realization of the concept of the Web of Data is the Linking Open Data (LOD) project, a community effort supported by the W3C. The initial aim of the project was to identify existing open-licensed datasets and make them freely available in an appropriate and interoperable format on the Web. Now anyone can participate in the project by simply publishing their datasets and interlinking them with existing ones following the Linked Data principles. A graphical indication of the diversity and scale of the datasets included in the project is shown in Figure 2.1.

Fig. 2.1 (figure omitted) Linking Open Data cloud diagram 2014 ("Linked Datasets as of August 2014"), by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/. The figure shows the datasets that have been published in Linked Data format by contributors to the Linking Open Data community project.

The cloud of data contains datasets that are very different in nature, comprising structured information about geographical locations (e.g., GeoNames), books (e.g., Project Gutenberg), publications (e.g., DBLP), music (e.g., MusicBrainz), medical domains (e.g., PubChem, KEGG) and so on. Whereas the diversity and heterogeneity of data sources might appear to be a problem, the use of open standards for organizing and linking the data allows processing and querying over the aggregated data similarly to how a local database is queried today.

2.2 Knowledge Bases

A knowledge base is a fundamental and indispensable component for tackling several IR/NLP tasks such as entity linking, question answering and natural language understanding, just to cite a few. Knowledge bases provide a central location where complex and structured information about the world's entities is stored. This usually includes the semantic type an entity belongs to (e.g., Barack Obama is of type person, whereas Hawaii is of type location) and the relationships between entities (e.g., Barack Obama has a born-in relation with the entity Hawaii). In the following we provide a brief overview of the most popular and widely used knowledge bases, mainly exploited in the realm of entity linking.

Wikipedia is one of the largest and most important collaborative projects ever created. Supported by the non-profit Wikimedia Foundation, the project has as its final goal the creation of a free online multilingual encyclopedia, thanks to the collective efforts of thousands of volunteers around the world. It currently comes in 294 languages and is released under a permissive license that lets anyone freely download, modify and distribute its contents. The English version, the largest language edition, includes information about over 5 million articles (entities). It is also a very dynamic resource, with an average of 10 edits per second and 800 new articles per day. In addition to its textual contents, a series of auxiliary information that is especially useful for entity linking is provided. These include article categories, redirect pages, disambiguation pages and hyperlinks to other Wikipedia articles.

YAGO [62, 101] is a high-coverage and high-quality knowledge base containing around 10 million entities and 120 million facts about those entities. The knowledge base is automatically derived from Wikipedia and unified with the WordNet [143] lexical dataset using a set of heuristics. The resource includes both taxonomic and non-taxonomic relations between entities. The former type induces an Is-A hierarchy between entities (e.g., type, subclass-of relations), whereas the latter is used to describe generic relations (e.g., the born-in relation). Additionally, means relations are used between strings and entities (e.g., the string "Obama" means Barack Obama). In [102] the means relation is used to retrieve candidate entities.

DBpedia [5] is a community effort project that has as its objective the creation of a multilingual knowledge base by extracting structured information from Wikipedia in an automatic way. Most of the information is derived from infobox templates, categorization information, geo-coordinates, hyperlinks between Wikipedia articles and links to external web pages. The project automatically evolves and grows as Wikipedia changes. The latest stable English version contains information about more than 4 million entities, most of which (more than 3 million) are classified in a consistent ontology. It also contains 50 million links to other RDF datasets and about 41 million YAGO2 categories. It is available under a permissive license.

Freebase [21] was probably the largest collaborative knowledge base at the time of its introduction. It consisted of data harvested from many sources, including Wikipedia, and curated mainly by its community members. It was developed by the company Metaweb, which was later acquired by Google in 2010. The latest snapshot of Freebase [93] is still available under a permissive license and contains information about over 40 million entities and more than 2 billion facts about them. In December 2014, Google announced the shutdown of the project and offered its content to WikiData.

WikiData [199] recently emerged as a user-curated source of structured information for Wikipedia. It is operated by the Wikimedia Foundation. The goal of the project is to provide multilingual, easy-to-access, clean and structured data which is curated by the community. The data is available under a permissive license and exposed in several formats, including JSON and RDF. Every WikiData entity is identified by a unique URI, making WikiData part of the Semantic Web while supporting the integration of other Semantic Web data sources with WikiData.

2.3 Entity Linking

For many years, the vast majority of works in the field of Information Retrieval focused their attention on the bag-of-words paradigm to tackle different problems such as document indexing, clustering, and retrieval. The limitations of this paradigm became more apparent with the advent of the so-called entity annotators. While each document could once have been described as a set of points in a multi-dimensional space made of (independent) terms, annotating such documents with these tools now provides a completely new set of structured information. The paradigm shift is clear.

We move from a merely syntactical and statistical representation of documents to a new representation that takes into account not only the syntactical sequence of terms but also their "semantics", as extracted by identifying the topics referred to in the text via sequences of terms (which are no longer independent of each other). Topics are drawn from proper knowledge bases, and synonyms and polysemous words are handled correctly by linking them to unique and pertinent entities in the knowledge base that precisely identify their meaning, thus overcoming the well-known limitations of the bag-of-words paradigm.

To better understand the power behind semantic annotation, consider two relatively short and simple sentences:

1. “Apple acquired Palm.”

2. “The apple on the palm of my hand.”

In this case, the first sentence clearly refers to a business takeover¹ of the target company Palm, Inc.² by the acquirer Apple Inc.³, while the second describes a completely different situation (a fruit that sits on the palm of someone's hand). A simple bag-of-words approach applied to these two sentences would make virtually no distinction between them, treating apple and palm in the same way even though they clearly refer to different things in two different contexts. The problem can be solved by exploiting entity annotators that enrich the input sentences with useful annotations. The key idea is to identify short and meaningful sequences of terms (i.e., mentions) in the input text and link them to unambiguous and uniquely identified concepts (i.e., entities) drawn from a knowledge base (e.g., Wikipedia). This process, usually referred to as entity annotation, given an input text t produces a set of annotations A_t. Each annotation a ∈ A_t is a binding between a mention m, occurring in t, and an entity e, and can be represented by the pair (m, e). Considering the sentences introduced above, the result of this process would ideally be:

1. "Apple (Apple_Inc.) acquired (Takeover) Palm (Palm,_Inc.)."
2. "The apple (Apple) on the palm (Hand) of my hand (Hand)."

¹ http://en.wikipedia.org/wiki/Takeover
² http://en.wikipedia.org/wiki/Palm,_Inc.
³ http://en.wikipedia.org/wiki/Apple_Inc.

2.4 Applications of Entity Linking

The annotation process not only helps in solving the ambiguity problem possibly present in sentences, but also unleashes a set of auxiliary information that is of crucial importance in the construction of more "intelligent" systems: from clustering algorithms that are aware of the semantic dimension of such entities (i.e., apple and Apple are completely different concepts that cannot cluster together), to more complex software infrastructures like a semantic-aware financial information service [4]. Indeed, the knowledge base from which the entities are picked not only offers a way to disambiguate a possibly ambiguous spot, but can also be used to mine new information by exploiting relations between entities.

To better understand this, consider Figure 2.2, showing the entity subgraph derived from the annotation of the sentence "Apple acquired Palm". Such a graphical representation (hereafter, semantic layer) can unleash additional information useful for discovering patterns that are not apparent from a simple parsing of the text. In the financial monitoring field, the semantic layer might be used to infer the presence of an interest of Apple in acquiring know-how and technologies in the Mobile market segment. Such information might be of great help in assisting human experts in the fields of data warehousing and decision support.

Here we briefly comment on possible applications that might benefit from the introduction of this semantic layer to texts:

Document Semantic Similarity. Measuring how similar and/or related two words, entities, sentences or documents are is a basic task that underlies different applications [140] in Information Retrieval, NLP and even Biomedical Informatics [161]. By providing a semantic layer as an additional feature to the pure syntactical information of the BoW paradigm, and by leveraging existing works in the field of ontologies [181, 81], the definition of more accurate and clever similarity functions becomes possible (e.g., by taking into consideration relations between entities).

Fig. 2.2 (figure omitted) Similarities between the concepts Apple Inc and Palm Inc. The graph is derived from a simple exploration of the RDF representation of the two concepts using DBpedia. A sophisticated system might infer that the acquisition of Palm by Apple, as extracted from the hypothetical sentence "Apple acquired Palm", betrays the interest of Apple in acquiring know-how and technologies in the Mobile market segment.

Machine Translation. As pointed out in [121], this is another field that can benefit from having access to such large and multilingual knowledge bases (such as Freebase or Wikipedia). As proven by several works, the addition of named entity recognition [6, 35] or entity disambiguation [34, 35] has been shown to improve the quality of Statistical Machine Translation. This additional information can be provided by language-agnostic entity annotators such as Tagme, Wat, DBpedia Spotlight, etc.

Text Summarization. Understanding what a text talks about (e.g., identifying the entities in such texts) and leveraging graph-mining algorithms to estimate which are the key entities in an annotated text might certainly help in the definition of new algorithms in the context of extractive summarizers. In such applications, a key issue is indeed the estimation of the most important sentences [92] from a cluster of documents (about the same topic). By assigning an importance or aboutness score to each entity, the problem of identifying such sentences might be simplified. Moreover, exploiting newly defined semantic-aware similarity and clustering algorithms would certainly help in the creation of more sophisticated solutions to cope with the problem of information overload.

Document Clustering. In [104, 105] the authors showed how the use of background knowledge coming from ontologies (i.e., WordNet [143]) brings important improvements in the context of document clustering. The authors argue that the BoW representation, though extensively used for document clustering, is often unsatisfactory as it ignores relationships between important terms that do not co-occur in the input corpora. We believe that further improvements are possible with the use of entity annotators that are not tied to specific and limited ontologies (as pointed out in [107]) but have access to a larger and more diverse set of information.

Text Classification. Several works addressing the problem of text classification [84, 85, 112, 200] have proven the importance of semantic features to cope with the limited information provided by the BoW model, which only accounts for term frequency in the documents and ignores important semantic relationships between key terms. The idea behind these works (specifically [200]) is to augment the BoW representation with semantic relations (e.g., synonymy, hyponymy, and associative relations). The same can be done by exploiting entity annotators, which provide a more sophisticated enrichment of the input texts and, in turn, offer knowledge-base relations between entities that can be used for further feature generation.

Event Extraction. Techniques for event detection from streams of information [123, 207] (such as news, social media sites, social networks, etc.) are based on understanding what a given document is talking about (e.g., which entities are involved), and leverage various mechanisms such as clustering and document similarity (where entity linking has proven to be of major help).

Query Understanding. The goal of search engines is to understand the intent of a user expressed in the form of a sequence of keywords [177]. This is especially true when the keywords submitted by the user are intrinsically ambiguous [108] and/or can be inferred by looking at past queries of the user within a search session [32]. By designing an entity annotator system that is able to detect entities in such short texts [51, 15], a search engine could provide high-quality search results (i.e., more pertinent results, faceted navigation, related-concept exploration, etc.), as preliminarily demonstrated in [56], or even tackle more difficult tasks such as knowledge-base question answering (e.g., see [44] and references therein).

2.5 Graph compression

Data compression has the final goal of encoding a source of information using fewer bits than its original representation. We usually distinguish between lossy and lossless compression, with the former referring to a scheme where (unnecessary) parts of the original information are lost in the encoding process (e.g., typically in image, audio and video compression), whereas in the latter scheme no information is lost. Compression has proven to be a useful technique because it helps reduce resource usage, such as data storage space and memory requirements, and can even increase performance for certain applications (i.e., when the encoding/decoding speeds significantly exceed the I/O bandwidth [209, 41]).

Recently, a new line of research started to investigate the possibility of overcoming the classic dichotomy between compression and indexing. While the former aims at removing data redundancies, the latter introduces extra data in the index to support faster operations. The result has been the design of compressed data structures for indexing texts (e.g., [70, 150]), sequences (e.g., [71, 77]), dictionaries (e.g., [76, 67, 103]), labeled trees (e.g., [68, 69, 74]), graphs (e.g., [46, 65, 47]), binary relations and permutations (e.g., [10, 9]), etc.

In the context of graphs, (lossless) compression has been successfully used to squeeze the size of large graphs into compressed formats that still allow the fast implementation of some basic graph-access primitives, such as "give me the adjacency list of a given node". The choice of a good trade-off between the space and the time required for compressing/decompressing such large graph structures is of course of crucial importance [64]. Most of the work in this field focuses on Web graph compression, where two important properties can be exploited:

Locality: most hyperlinks point to other resources that are served by the same host.

Similarity: pages that are close to each other in lexicographic order tend to share many links in common.

Bharat et al. [14] were the first to propose sorting by lexicographic URL order in order to exploit the locality property: links are represented by node identifiers that reflect the URL lexicographic order, and the adjacency list of each node is sorted in ascending order.

Randall et al. [169] further improved this scheme by introducing delta compression for the adjacency lists of each node. The authors additionally propose an inter-list compression that takes into account the similarity between the current adjacency list and the k previous ones. The list is then encoded as a modification of the representative list of the most similar node (among the previous k). The techniques proposed by [169] are implemented in WebGraph by Boldi et al. [18, 19]. The authors introduce ζ codes to efficiently encode gaps following a power-law distribution with low exponent α (in the range [1.1, 1.3]) and provide a working implementation with lazy iterators that delay decompression until it is really necessary.
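As a minimal illustration of the gap idea exploited by these schemes (this is not WebGraph's actual codec, which additionally uses reference lists and ζ codes), a sorted adjacency list can be stored as its first identifier followed by the differences between consecutive identifiers:

```python
def gaps_encode(adj):
    """Turn a strictly increasing adjacency list into (first value, gaps)."""
    if not adj:
        return []
    out = [adj[0]]
    out += [b - a for a, b in zip(adj, adj[1:])]  # small numbers if neighbors are close
    return out

def gaps_decode(enc):
    """Invert gaps_encode by prefix-summing the gaps."""
    adj, cur = [], 0
    for i, g in enumerate(enc):
        cur = g if i == 0 else cur + g
        adj.append(cur)
    return adj

# With locality (URL-ordered IDs) the gaps are small, so they compress well
# under any universal code for small integers.
adj = [1002, 1005, 1006, 1093]
print(gaps_encode(adj))            # [1002, 3, 1, 87]
assert gaps_decode(gaps_encode(adj)) == adj
```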

A different approach is the one followed by Apostolico et al. [3], which introduces a new ordering based on a BFS traversal of the graph rather than URL lexicographic order. Compression is obtained by leveraging a new family of universal codes for integers (π codes) and by exploiting redundancies in the adjacency lists either in one dimension (e.g., runs of identical elements) or in two dimensions (e.g., a box of identical rows). The authors report better compression rates and lower decompression times with respect to [18, 19].

The authors of [46] introduce a compact representation of graphs that requires the same amount of space but provides much faster navigation than [18] by leveraging Re-Pair compression. This compression scheme consists in repeatedly finding the most frequent pair of symbols in a sequence of integers and replacing it with a new symbol, until no more replacements are convenient.
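A toy sketch of the Re-Pair loop just described (real implementations track pair frequencies with priority queues and stop when a replacement no longer pays for its dictionary entry; here we simply stop when no pair occurs twice):

```python
from collections import Counter

def repair(seq, next_symbol):
    """Repeatedly replace the most frequent adjacent pair with a fresh symbol.

    Returns the rewritten sequence and the dictionary new_symbol -> pair.
    `next_symbol` must be larger than any symbol occurring in `seq`.
    """
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:          # a pair occurring once does not pay off
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2        # consume both symbols of the replaced pair
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

compressed, rules = repair([1, 2, 3, 1, 2, 3, 1, 2], next_symbol=100)
print(compressed, rules)   # [101, 101, 100] with 100 -> (1, 2), 101 -> (100, 3)
```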

A recursive approach based on a clever encoding of the adjacency matrix of Web graphs is introduced in [26]. The initial n × n matrix is partitioned into k^2 submatrices of size (n/k)^2 each and, at each further level, each submatrix that is not all 0s is again partitioned into k^2 equally sized submatrices. Submatrices full of 0s are not encoded, whereas the non-zero ones are represented using succinct bitvectors (supporting the rank operation) in order to support fast navigation.
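A toy sketch of this recursive partitioning for k = 2 (the actual structure of [26] stores these bits as level-order bitmaps with rank support so that navigation never materializes the matrix; the function below only produces the bits):

```python
def k2_bits(matrix, k=2):
    """Level-order bits of the recursive k^2 partition of a square 0/1 matrix.

    Returns (T, L): one bit per submatrix of the internal levels (1 = contains
    some non-zero cell, so it is subdivided) and the raw bits of the last level.
    The matrix side is assumed to be a power of k and at least k.
    """
    n = len(matrix)
    T, L = [], []
    level = [(0, 0, n)]                      # (row, col, side) of current-level submatrices
    while level:
        next_level = []
        for r, c, side in level:
            sub = side // k
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr * sub, c + dc * sub
                    if sub == 1:             # leaf level: emit the cell itself
                        L.append(matrix[rr][cc])
                    else:
                        bit = int(any(matrix[i][j]
                                      for i in range(rr, rr + sub)
                                      for j in range(cc, cc + sub)))
                        T.append(bit)
                        if bit:              # only non-empty submatrices are expanded
                            next_level.append((rr, cc, sub))
        level = next_level
    return T, L

adj = [[0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
print(k2_bits(adj))   # ([1, 0, 0, 1], [0, 1, 0, 0, 0, 1, 1, 0])
```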

Other works have focused on the compression of social network graphs, where locality and similarity cannot be exploited: for two nodes in such graphs, having close labels (i.e., nicknames) does not imply sharing the same out-links. This requires new forms of preprocessing in order to exploit the community structure and achieve better compression rates.


To address this problem, the authors of [42] propose the so-called shingle ordering heuristic to find a compression-friendly permutation of the nodes of the social graph. In this case, a min-hash locality-sensitive hashing scheme is used to generate signatures for each node in the graph, in order to estimate the Jaccard coefficient between nodes: nodes with a higher coefficient will cluster together with high probability. Boldi et al. [16] improved the result of [42] by 25% via the introduction of a new technique called Layered Label Propagation, which is based on the Absolute Pott Model (APM) label propagation algorithm. Such a reordering places nodes with the same label close to one another in the graph and keeps nodes within the same cluster in the same relative order they had before. Unfortunately, in all these works the aim is the compression of the structure of the graph and not of its content. The algorithms we briefly introduced only take into consideration a rather simple model in which accesses to the graph are governed by node IDs and not by their labels or by their spatial relationship with other nodes (i.e., a mere compression of the adjacency lists of each node is done), thus limiting the possibility of mining such compressed graph representations in an efficient way. Conscious of such limitations, we will try to replicate on labeled graphs what has been done on strings and trees. This problem is much more complicated because it is not obvious how a graph can be turned into a sequence of symbols, as has been done for strings [150] and trees [68] by previous papers, which have then used known compressed indexing techniques (such as [70] and derivatives).
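A minimal sketch of the shingle idea (a single shared random hash per run; the scheme of [42] uses several hash functions and a more careful treatment of ties):

```python
import random

def shingle_order(adjacency, seed=0):
    """Order nodes by the min-hash (shingle) of their out-neighbor sets.

    Nodes whose out-neighbor sets have a high Jaccard coefficient are likely
    to receive the same shingle, hence end up close in the final permutation.
    """
    rnd = random.Random(seed)
    salt = rnd.getrandbits(64)
    h = lambda x: hash((salt, x))                 # one shared random hash function
    def shingle(node):
        succ = adjacency[node]
        return min((h(v) for v in succ), default=h(node))
    return sorted(adjacency, key=lambda u: (shingle(u), u))

graph = {0: [2, 3], 1: [2, 3, 4], 2: [5], 3: [], 4: [5, 6], 5: [], 6: []}
print(shingle_order(graph))   # 0 and 1 (similar out-neighbors) tend to be adjacent
```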

2.6 Tools

In the following subsections, we provide a brief description of the key tools that we will use throughout this thesis and refer the reader to the literature for more details.

2.6.1 Spectral centrality measures for directed graphs

In this section, we briefly describe three of the most popular spectral centrality measures that can be associated with a given node in a directed graph, namely PageRank, HITS and SALSA. These algorithms are specifically used in the disambiguation procedure of Wat 1.0, whereas PageRank and its Personalized variant are used for the derivation of entity embeddings of Wat 2.0 in Chapter 4 and for the construction of a graph-aware hashtag relatedness function in Chapter 5.


All three algorithms find a dominant eigenvector of a non-negative matrix and use the entries of this eigenvector as the weights associated with each node. We will use the same formulation as [20] for this brief review. We note, however, that actual implementations usually employ an iterative formulation, which is omitted in this section for the sake of clarity. We point the interested reader to [20] and references therein for a more detailed description.

PageRank [159] tries to model user behavior, assuming the presence of a "random surfer" who is given a web page at random and keeps clicking on links, but eventually restarts from another random page. It is mathematically defined to be the unique vector p satisfying the following equation:

    p = α p A + (1 − α) v    (2.1)

where A is the ℓ1 row-normalized adjacency matrix of the graph, α ∈ [0, 1] is the damping factor and v is the personalization vector, which must be a distribution. The classic PageRank algorithm uses a uniform probability for v, while its Personalized variant allows specifying a non-uniform distribution in order to model topic-specific random walks (e.g., teleporting to a random node chosen non-uniformly).
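A minimal power-iteration sketch of Eq. (2.1); the handling of dangling nodes and the convergence tolerance are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def pagerank(A, alpha=0.85, v=None, iters=100, tol=1e-10):
    """Iterate p <- alpha * p @ A_norm + (1 - alpha) * v on a dense adjacency matrix.

    A is the raw 0/1 adjacency matrix; rows are l1-normalized here, and rows with
    no out-links are (as one common choice) redirected uniformly to all nodes.
    Passing a non-uniform distribution v gives the Personalized variant.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    A_norm = np.where(out > 0, A / np.maximum(out, 1), 1.0 / n)  # handle dangling rows
    v = np.full(n, 1.0 / n) if v is None else np.asarray(v, dtype=float)
    p = v.copy()
    for _ in range(iters):
        p_next = alpha * (p @ A_norm) + (1 - alpha) * v
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
print(pagerank(A))
```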

HITS [119] associates an authority and a hub score to each node. A node is considered authoritative if it is pointed to by many hub nodes, whereas a node is considered a hub if it points to many authoritative nodes. This iterative relation can be expressed in the following terms:

$$h_{i+1} = a_i A^T \qquad (2.2)$$
$$a_{i+1} = h_{i+1} A \qquad (2.3)$$

where A is the adjacency matrix of the graph, $a_i$ and $h_i$ denote respectively the authority and hub score vectors at iteration i, and $a_0 = \mathbf{1}$.
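A minimal sketch of this iteration is given below; the $\ell_2$ normalization applied at each step is a common implementation choice and an assumption of ours, not spelled out in the equations above.

```python
import numpy as np

def hits(A, iterations=50):
    """Row-vector iteration h = a A^T, a = h A, normalizing at each step.

    Assumes A is the adjacency matrix of a non-degenerate directed graph.
    """
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iterations):
        h = a @ A.T
        h = h / np.linalg.norm(h)
        a = h @ A
        a = a / np.linalg.norm(a)
    return a, h
```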

SALSA [127] combines the random-walk theory of PageRank with the hub/authority dichotomy of HITS. It is similar to HITS but uses the $\ell_1$ row-normalized variants of A and $A^T$. Formally:

$$h_{i+1} = a_i A^T \qquad (2.4)$$
$$a_{i+1} = h_{i+1} A \qquad (2.5)$$


The iterative formulation is not needed in practice. Indeed, consider the graph induced by $A^T A$ (i.e., a co-citation graph in which nodes x and y are adjacent if they are both pointed to by a third node w). The SALSA score of a node can then be computed as the ratio between its in-degree and the sum of the in-degrees of the nodes in its connected component, multiplied by the ratio between the number of nodes in that component and the number of nodes in the original network.
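As an illustration, here is a possible sketch of this closed-form computation in Python over an unweighted directed edge list; the function name, the edge-list input format and the use of a union-find structure are our own assumptions, not part of the original formulation.

```python
from collections import defaultdict

def salsa_scores(edges, n):
    """Closed-form SALSA scores for a directed graph with nodes 0..n-1 given as an edge list."""
    out_nbrs = defaultdict(list)
    in_deg = defaultdict(int)
    for u, v in edges:
        out_nbrs[u].append(v)
        in_deg[v] += 1

    # Union-find: two nodes co-cited by the same node w belong to the same component of A^T A.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    for w, targets in out_nbrs.items():
        for t in targets[1:]:
            union(targets[0], t)

    # Group the cited nodes by connected component of the co-citation graph.
    components = defaultdict(list)
    for v in range(n):
        if in_deg[v] > 0:
            components[find(v)].append(v)

    scores = {}
    for comp in components.values():
        comp_in_deg = sum(in_deg[v] for v in comp)
        for v in comp:
            # in-degree ratio within the component, weighted by the component's relative size
            scores[v] = (in_deg[v] / comp_in_deg) * (len(comp) / n)
    return scores
```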

2.6.2 Succinct representation of integer sequences

In this section, we describe the Elias-Fano representation of monotone sequences [63, 60]. This powerful representation will be used in Chapter 4 to speed up the access to entity embeddings, and in Chapter 6 as a building block, in conjunction with other state-of-the-art algorithms and data structures (which we will briefly review in Section 6.2, for the sake of clarity of the exposition), to derive a compressed scheme that solves string-matching problems with neighboring constraints in labeled graphs.

Given a monotonically increasing sequence of n integers drawn from a universe $[m] = \{0, 1, \ldots, m-1\}$, Elias-Fano represents the sequence by using at most $n \lceil \log \frac{m}{n} \rceil + 2n + o(n)$ bits, while supporting constant-time access to the i-th integer.

The encoding proceeds as follows. Each integer $x_i$ is first encoded in binary using $\lceil \log m \rceil$ bits. This binary encoding is then split into two parts: the first part contains the higher $\lceil \log m \rceil - \ell$ bits, while the second contains the lower $\ell$ bits, where $\ell = \lceil \log \frac{m}{n} \rceil$. The sequences of lower bits are concatenated into a bitvector L of $n\ell$ bits. The higher bits are represented as a bitvector H of $n + \lceil m/2^{\ell} \rceil$ bits, encoded as follows: let $h_i$ be the value of the higher part of $x_i$; for each $x_i$ we set to 1 the position $h_i + i$ of H (i.e., H is 0 elsewhere).

This simple encoding allows the efficient implementation of two powerful operations:

• Access(i) which returns the i-th element of the sequence in constant time.

• NextGEQ(x) which returns the smallest integer in the sequence that is greater than or equal to a given value x, in logarithmic time.

To support these operations, H needs to be augmented with an auxiliary data structure that efficiently answers Select0 and Select1 queries (see [197] and references therein for implementation details), defined as follows: Select0(i) (resp. Select1(i)) returns the position of the i-th 0 (resp. 1) in H.

To implement Access(i) we simply need to concatenate the lower and higher bits of $x_i$. The lower part can be easily retrieved from L, whereas the higher part $h_i$ can be obtained with the formula $h_i = \text{Select}_1(i) - i$.

The operation NextGEQ(x) is implemented by scanning the elements starting from position $p = \text{Select}_0(h_x) - h_x$, where p is exactly the number of elements in the original sequence whose higher bits are smaller than $h_x$.
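The following is a small, didactic Python sketch of the encoding and of Access(i); it stores H as a plain list of bits and the lower parts as a list of integers, and it uses a linear scan in place of the constant-time Select1 structure, so it illustrates the layout rather than an engineered implementation.

```python
import math

class EliasFano:
    def __init__(self, seq, m):
        """seq: monotonically increasing list of integers drawn from [0, m)."""
        n = len(seq)
        self.l = max(1, math.ceil(math.log2(m / n)))   # number of lower bits
        self.L = []                                     # lower parts, one per element
        self.H = [0] * (n + (m >> self.l) + 1)          # upper-part bitvector
        for i, x in enumerate(seq):
            low, high = x & ((1 << self.l) - 1), x >> self.l
            self.L.append(low)
            self.H[high + i] = 1                        # mark position h_i + i

    def access(self, i):
        """Return the i-th element: higher part via Select1(i) - i, lower part from L."""
        ones = -1
        for pos, bit in enumerate(self.H):              # linear-scan stand-in for Select1(i)
            ones += bit
            if ones == i:
                high = pos - i
                return (high << self.l) | self.L[i]
        raise IndexError(i)

ef = EliasFano([2, 3, 5, 7, 11, 13, 24], m=32)
print([ef.access(i) for i in range(7)])                 # [2, 3, 5, 7, 11, 13, 24]
```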

2.6.3 Word embeddings

Traditionally, NLP and IR applications have dealt with words as atomic units (i.e., the bag-of-words, or BoW, approach), representing each word with a one-hot encoding scheme. In this scheme, each word is represented as an integer in $\{1, \ldots, |V|\}$, where |V| is the vocabulary size, that is, the number of distinct words in the corpus. Despite the simplicity and widespread popularity of this approach, this kind of representation has several limitations. The integer encoding is arbitrary and completely ignores the semantic relations between words, thus providing no additional information to the system. For example, the word Rome, represented say by 33, and Italy, represented by 4325, have no implicit relation defined between them. Another problem of this encoding is the sparsity of the resulting representation, which means that, especially for tail words (i.e., words that appear rarely), we need more data in order to train statistical models.

A popular technique to overcome the limitations of this representation is the use of Vector Space Models. These models embed words in a continuous vector space ($W^k$) of arbitrary dimensionality k, where semantically similar words are mapped to nearby points. The idea behind this embedding process is that words appearing in the same contexts also share the same meaning, i.e., it relies on co-occurrence statistics. Thanks to this embedding, it is possible to compute a semantic distance between two words by simply evaluating the cosine similarity between their word vectors.

The simplest methods use a count-based approach in which the co-occurrence statistics of a word with respect to its neighboring ones are mapped down to a small dense vector for each word. Co-occurrence statistics might be re-weighted using non-linear methods; the most popular ones include taking the square root of the co-occurrence counts [175], or their logarithm, or the related Pointwise Mutual Information (PMI) [45]. The mapping is based on the application of Singular Value Decomposition (SVD) to the term-document matrix in order to extract latent semantic associations between terms and concepts [58]. Unfortunately, these approaches tend not to scale well when large amounts of data are considered, due to the computational cost of the decomposition process.

Recently, another approach based on the theory of neural network language models ([12, 48]) has gained traction in the IR and NLP communities thanks to its performance and its scalability to large amounts of data. The idea behind this embedding method is to use the neural network’s internal representation as the word embeddings. The approach was popularized by the authors of word2vec [141, 142], a family of energy-based models, followed by a matrix factorization approach called GloVe [163]. word2vec algorithms allow training models over datasets containing billions of words and vocabularies of millions of words. Similar words are encoded into similar vectors, and many semantic relationships can be represented as linear translations, for example:

vector(“King”) − vector(“Man”) + vector(“Woman”) ≈ vector(“Queen”)

which means that vector(“King”) − vector(“Man”) + vector(“Woman”) gives the vector closest to vector(“Queen”).

In the original paper, the authors proposed two different neural network models to derive such embeddings: the Continuous Bag-Of-Words (CBOW, in short) and the SkipGram model. Both models use a feed-forward neural network composed of two layers. In the former, the goal is to predict the missing word given a sliding window of surrounding words (i.e., the context), whereas in the latter the objective is reversed, with the goal being to predict the surrounding context given the current word. The two models are graphically depicted in Figure 2.3. Both are trained efficiently thanks to two different algorithms: Hierarchical Softmax and Negative Sampling [142].
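As a purely illustrative example of how such models are typically trained and queried, the sketch below uses the gensim implementation of word2vec; the parameter names refer to gensim 4.x and the toy corpus is our own assumption. It is meant to show the workflow, not the configuration actually used in this thesis.

```python
from gensim.models import Word2Vec

# Toy corpus; a real setting would use billions of tokenized sentences.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the SkipGram model (sg=0 would be CBOW); negative=5 enables Negative Sampling.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)

# Cosine similarity between two word vectors.
print(model.wv.similarity("king", "queen"))

# The classic analogy query (only meaningful when trained on a large corpus).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```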

This toolset will be specifically used for the derivation of Entity and Entity-Context embeddings in Chapter 4.

Fig. 2.3 CBOW and SkipGram model architectures. w(i), for i = t−2, . . . , t+2, denotes a word represented by its one-hot encoding. Credits to [141].

2.6.4 Linear Support Vector Machines

In this section we briefly explain how Support Vector Machines (SVMs, in short) work by providing the reader with the main intuition and some of the mathematical formulation behind them, limiting the explanation to the linear case, as it is the only SVM model used in this thesis (e.g., see hashtag classification in Chapter 5).

A linear SVM model for binary classification consists of a decision hyperplane defined in $\mathbb{R}^p$ by an intercept term b and by a vector w that is perpendicular to the hyperplane. The classification task of such a model can be expressed as follows:

$$f(x) = \operatorname{sign}(w^T x + b) \qquad (2.6)$$

where x is the p-dimensional point that we wish to classify. It is classified as positive (resp. negative) if the sign in Equation 2.6 is +1 (resp. −1).

Given a training dataset consisting of n tuples $(x_1, y_1), \ldots, (x_n, y_n)$, where the $y_i$ indicate the class (either −1 or +1) to which the point $x_i \in \mathbb{R}^p$ belongs, the training objective of a linear SVM is to find the maximum-margin decision hyperplane that divides the training points into their two respective classes. Assuming linear separability of the training data, this corresponds to finding two parallel hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ that separate the two classes of data and whose distance, equal to $2/\|w\|$, is maximized. This is equivalent to minimizing $\|w\|$ subject to $y_i(w^T x_i + b) \geq 1$ for all the tuples in the training data.

Fig. 2.4 Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

In the case of non-linearly separable training data, the so-called hard-margin formulation presented above is extended by allowing points of the training set to lie on the wrong side of the margin. This is the so-called soft-margin formulation, which can be expressed by the following unconstrained optimization problem:

$$\operatorname{argmin}_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\left(0,\, 1 - y_i(w^T x_i + b)\right) \qquad (2.7)$$

A large value of C makes the margin constraints hard to ignore, thus producing a narrow margin. Conversely, low values of C produce large-margin classifiers, since margin constraints are more easily ignored.

For multi-class classification problems (in which labels are not binary but are still mutually exclusive, cf. hashtag classification), the most common technique is to build |N| distinct one-versus-rest classifiers and to choose the class that classifies the test point with the greatest margin. We refer the interested reader to [52, 133] for a more detailed and thoughtful description and derivation.
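To make Equations 2.6 and 2.7 concrete, the following is a minimal numpy sketch of the decision function and of the soft-margin objective; the weight vector, intercept and sample data are illustrative assumptions, not a trained model.

```python
import numpy as np

def decision(w, b, x):
    """Eq. 2.6: classify x as +1 or -1 according to the side of the hyperplane."""
    return np.sign(w @ x + b)

def soft_margin_objective(w, b, X, y, C=1.0):
    """Eq. 2.7: 0.5 * ||w||^2 + C * sum of hinge losses over the training set."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Illustrative 2-dimensional data and a candidate hyperplane.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print([decision(w, b, x) for x in X])            # [1.0, 1.0, -1.0, -1.0]
print(soft_margin_objective(w, b, X, y, C=1.0))  # 1.0 (all points outside the margin)
```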


2.6.5 Learning-to-rank with LambdaMART

In this section we describe the general ideas behind LambdaMART [30], an algorithm used to construct ranking models for IR systems. In the context of this thesis, the algorithm will be used to train an ensemble of decision trees employed in the final disambiguation module of Wat 2.0, detailed in Chapter 4. Specifically, the model will be trained to optimize the ranking of the candidate entities $E_m = \{e_1, \ldots, e_n\}$ that are associated with a given mention m, by maximizing the Normalized Discounted Cumulative Gain [113] (NDCG). The DCG measure is defined as:

$$DCG@k = \sum_{i=1}^{k} \frac{2^{l_i} - 1}{\log_2(1 + i)} \qquad (2.8)$$

where k is the truncation level (in our case 5), and $l_i$ is the relevance label associated with the i-th entity (typically $l_i \in \{0, 1, 2, 3, 4\}$, but in our context binary relevance scores were used since a notion of not-so-correct entity is missing, although this might be an interesting future research direction). NDCG is the normalized version of DCG:

$$NDCG@k = \frac{DCG@k}{\max DCG@k} \qquad (2.9)$$

where the denominator is the maximum DCG@k attainable for that mention; hence NDCG@k ranges in [0, 1].
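For concreteness, a direct Python transcription of Equations 2.8 and 2.9 might look as follows; the example relevance labels are illustrative.

```python
import math

def dcg_at_k(labels, k):
    """Eq. 2.8: labels[i] is the relevance of the item ranked at position i+1."""
    return sum((2 ** l - 1) / math.log2(1 + i)
               for i, l in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    """Eq. 2.9: DCG normalized by the best attainable DCG for the same labels."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Binary relevance (as used for entity ranking): the correct entity is at position 3.
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))   # 0.5
```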

LambdaMART [30] combines LambdaRank [43] with the GBRT [83] framework (Gradient Boosted Regression Trees). In order to give a brief description of how it works, we first review GBRT, a boosted tree model in which the output of the model is a linear combination of an ensemble of “weak learners” (i.e., regression trees). The output of GBRT can be written as:

$$F(x) = \sum_{m=1}^{N} h_m(x) \qquad (2.10)$$

where $h_m(x)$ is a function modeled by a regression tree. Evaluating a single $h_m(x)$ maps the corresponding feature vector $x \in \mathbb{R}^d$ to a real value $\lambda_l$.

The mapping is done by passing x down the tree until a leaf node l is reached. The path (left or right) taken at a given node of the decision tree is determined by the value of a particular feature $x_j$, where $j = 1, \ldots, d$. GBRT builds its model incrementally, fitting one weak learner at a time:

$$F_m(x) = F_{m-1}(x) + h_m(x) \qquad (2.11)$$


At each stage, GBRT chooses the regression tree h that minimizes a given loss function L (usually least squares):

$$\operatorname{argmin}_{h} \sum_{i=1}^{N} L\left(y_i, F_{m-1}(x_i) + h(x_i)\right) \qquad (2.12)$$

The loss function is evaluated at each point of the training set. The minimization is solved numerically via steepest descent, i.e., by following the direction of the negative gradient of the loss function evaluated at the current model. The framework we just described allows for finding an ensemble of trees that minimizes any differentiable loss function.
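As a hedged illustration of Equations 2.10-2.12 with a least-squares loss (for which the negative gradient is simply the residual), the following sketch fits each stage with scikit-learn's DecisionTreeRegressor; it is a didactic example under our own assumptions, not the LambdaMART training procedure used in this thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, n_stages=50, max_depth=3):
    """Fit F(x) = sum_m h_m(x) one regression tree at a time (least-squares loss)."""
    trees, F = [], np.zeros_like(y, dtype=float)
    for _ in range(n_stages):
        residuals = y - F                      # negative gradient of the squared loss at F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += h.predict(X)                      # F_m = F_{m-1} + h_m  (Eq. 2.11)
        trees.append(h)
    return trees

def gbrt_predict(trees, X):
    return sum(h.predict(X) for h in trees)    # Eq. 2.10

# Illustrative data: a noisy quadratic function.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=200)
trees = gbrt_fit(X, y)
print(np.mean((gbrt_predict(trees, X) - y) ** 2))   # training MSE
```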

Unfortunately, in Information Retrieval most of the loss functions we care about optimizing (e.g., NDCG, MAP or MRR [113]) are either flat or discontinuous everywhere, so they cannot be used directly inside GBRT. In order to overcome this problem, LambdaMART borrows a key idea from LambdaRank: for each pair of training points $(x_i, x_j)$ for which $x_i$ should be ranked higher than $x_j$, we compute a value $\lambda_{i,j}$ that acts as the gradient we need. The λ’s can be interpreted as forces: $x_i$ (resp. $x_j$) gets a push upwards (resp. downwards) of size $|\lambda_{i,j}|$. The authors of [43] report that this technique has been successfully used to optimize NDCG, MAP and MRR.


3 Entity Linking

The economy of automatically extracting structured information from large amounts of cheap unstructured data has led to the development of a considerable number of entity annotation tools, both in the academic and in the industrial world. However, the comparison between different solutions is still a hard task, for different reasons: (i) published results are computed on diverse datasets and/or (ii) evaluated against different sets of metrics.

In this chapter, we try to clarify the landscape by introducing a proper terminology that is used throughout this thesis and outline the structure of a typical entity annotator system. In the first part, we give a hierarchical description of the different tasks that a given entity annotator might be capable of solving. We then move to the formal description of evaluation metrics that are used to assess the performance of a given entity annotator. We later provide a detailed description of GERBIL, an evaluation framework created to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. At the end of the chapter, we briefly review the most promising entity annotator tools, most of which are freely queryable via the GERBIL framework.

3.1 Terminology

The literature around entity annotators and entity linking in general exhibits a wide variety of terminologies. In order to avoid confusion and remove ambiguity, we hereby introduce a series of definitions often used in the description of techniques, algorithms, and metrics related to entity linking.

• A Knowledge Base (K) is a rich collection or catalog of real-world entities, their semantic classes and their mutual relationships. Notable examples of such catalogs include DBPedia, YAGO, Freebase and Wikipedia.
