Giordano’s thesis
Giordano Galanti
September 18, 2014
Contents

Introduction

1 The KDD'05 Cup
  1.1 Challenge overview
  1.2 Dataset analysis
  1.3 Gold standard extraction
  1.4 Evaluation criteria

2 Related works
  2.1 Some approaches in detail
    2.1.1 Some performant methods
  2.2 State of the art

3 Our System
  3.1 Tools
    3.1.1 Wikipedia and its knowledge graphs
    3.1.2 Topic annotators
    3.1.3 Support Vector Machine
  3.2 Data preparation
  3.3 Bridge categories
  3.4 Feature generation
    3.4.1 Topic-based approaches
    3.4.2 Bag-of-words approaches
    3.4.3 Hybrid approaches
  3.5 Classification task

4 Other approaches
  4.1 Other features generation approaches
    4.1.1 Counting algorithms
      4.1.1.1 A different way of counting
      4.1.1.2 Counting neighbours
      4.1.1.3 Expand feature space to all Wikipedia categories
    4.1.2 Shortest path algorithms
      4.1.2.1 Basic shortest path approach
      4.1.2.2 Shortest path with neighbourhood
    4.1.3 Other attempts
  4.2 Adjust the bridge Wikipedia category set
  4.3 Two stage classification

5 Experimental Results
  5.1 System predictions
  5.2 Performed experiments

6 Future Developments
  6.1 A fast and effective on-line tool
  6.2 Resizing representative Wikipedia page sets in BOW approaches
  6.3 Automatic bridge Wikipedia categories detection
  6.4 Other considerations

Appendices
A
  A.1 The 122 Wikipedia bridge categories
Introduction
Web query classification aims at classifying Web users' queries with one or more predefined categories, according to their topics. This kind of classification task plays an important role in many IR applications [1], because it is a key tool for improving the effectiveness and the efficiency of general-purpose web search engines. As an example, consider the following uses:
• Metasearch engines: the user's query is sent to multiple search engines and the top results from each are blended into one overall list; the metasearch engine can organize the large number of Web pages in the results according to the potential categories to which the issued query belongs.
• Vertical search: compared to general search, this tool focuses the search on the subset of Web pages that corresponds to the intent behind each user query. Once the search engine has predicted the category of information a Web user is looking for, it can select the appropriate vertical search engine automatically, without forcing the user to access it explicitly.
• Online advertising: this application aims at providing interesting advertisements to Web users during their search activities. This is the main revenue stream for the (free) search engines available on the Web. The classification of user queries into predefined categories is useful to improve the selection of the most pertinent advertisements.
While text classification and categorization is a well-known topic in the Information Retrieval and Text Mining fields, the query classification problem has not yet been completely addressed; in fact, there are several difficulties underlying this task, since queries are short and ambiguous and query terms can be noisy. Furthermore, as queries contain fewer than 3 terms on average, classic text classification techniques that use the occurrence of words as features run into trouble because of feature-space sparseness [17].
To face these problems, most research groups built their query classification systems by extracting extra information from the Web, more specifically from Web ontologies such as WordNet [23][24], Wikipedia [25], BabelNet [28][29], DBpedia [26][27] and Yago 2 [30][31].
The effectiveness of using this kind of resource resides in their structure as large graphs of concepts, properly interconnected to denote (semantic) relations between pairs of them. These labeled graphs are becoming more and more important in many IR applications and have recently led to the design of some powerful tools which are nowadays known as topic annotators [2]. The key idea [3] is to identify, in the input text, short and meaningful sequences of terms (also called mentions) and annotate them with unambiguous identifiers (also called entities) drawn from a catalog.
Most recent work adopts anchor texts occurring in Wikipedia as entity mentions and the respective Wikipedia pages as the mentioned entities, because Wikipedia today offers the best trade-off between catalogs with a rigorous structure but low coverage (such as WordNet, CYC, TAP) and a large text collection with wide coverage but unstructured and noisy content (like the whole Web).
Apart from some preliminary approaches to using topic annotators in classification problems, like [4] or the one of Piccinno/Santoro (to appear), as far as we know no one has investigated their use in the query classification problem.
This is exactly the goal of this thesis, in which we will study, design and test a multi-label classification system which deploys three main ingredients:
• the best topic annotator for texts to date, namely TagMe (version WAT [33]), to annotate user queries with pertinent topics drawn from Wikipedia;
• the best topic annotator for queries to date, namely SMAPH [41], which was the winning system of the ERD Challenge hosted by SIGIR 2014;
• several novel algorithms and data structures for deploying the structural knowledge produced by the previous annotators in order to efficiently and effectively classify user queries into a set of 67 categories drawn from the KDDCUP'05 competition.
All of our proposals will be tested on a dataset which is a portion of the one adopted in that competition; hence, we cannot directly compare our proposals with the state-of-the-art solutions.
For this reason the main future development of this work, as we will see in Chapter 6, will be the creation of a dataset that allows us to benchmark our classification system directly on the KDDCUP'05 test set.
Apart from this, we can state that our solution has achieved excellent results in terms of classification, with an F1 of 56.35%.
The thesis is organized as follows. Chapter 1 presents the KDDCup’05 challenge from which this work takes inspiration. This chapter contains:
1. a brief introduction to the challenge;
2. an analysis of the challenge datasets;
3. the generation of the gold standard we used to train the classifiers and to evaluate them;
4. the evaluation criteria used in the challenge, extended in order to benchmark our system.
In Chapter 2, we describe several approaches to query classification proposed in the literature, highlighting their main features. Among all approaches, particular attention is given to the state of the art.
In Chapter 3, our proposed multi-label query classification system is presented. The first section of this chapter introduces the reader to the tools we employed to develop the classification system, that is, Wikipedia, topic annotators and the Support Vector Machine model. The next section shows the datasets we used to both train and evaluate the classification system. Another section describes the approach we designed in order to find a Wikipedia counterpart of the KDDCUP'05 categories. In the remainder of this chapter we present, one by one, each feature generation approach we employed to train the classifiers and, finally, the classification task.
In Chapter 4, all the other variants considered and tested during the development of the system are presented.
Chapter 5 shows the results obtained by benchmarking the query classification system on the test set.
In Chapter 6, several possible future developments are proposed.
In Appendix A, the list of all bridge Wikipedia categories employed in our system is reported.
Chapter 1
The KDD’05 Cup
1.1 Challenge overview
The work we are going to discuss is inspired by the KDDCup'05 [18], a challenge about the classification of internet users' search queries, organized in 2005 by the ACM SIGKDD group on Knowledge Discovery and Data Mining1. The task of the KDDCup 2005 competition [21] was to classify 800,000 internet users' search queries into the 67 predefined categories listed below.
Computers\Hardware
Computers\Internet & Intranet
Computers\Mobile Computing
Computers\Multimedia
Computers\Networks & Telecommunication
Computers\Security
Computers\Software
Computers\Other
Entertainment\Celebrities
Entertainment\Games & Toys
Entertainment\Humor & Fun
Entertainment\Movies
Entertainment\Music
Entertainment\Pictures & Photos
Entertainment\Radio
Entertainment\TV
Entertainment\Other
Information\Arts & Humanities
Online Community\Chat & Instant Messaging
Online Community\Forums & Groups
Online Community\Homepages
Online Community\People Search
Online Community\Personal Services
Online Community\Other
Shopping\Auctions & Bids
Shopping\Stores & Products
Shopping\Buying Guides & Researching
Shopping\Lease & Rent
Shopping\Bargains & Discounts
Shopping\Other
Sports\American Football
Sports\Auto Racing
Sports\Baseball
Sports\Basketball
Sports\Hockey
Sports\News & Scores
Sports\Schedules & Tickets
Sports\Soccer
Sports\Tennis
Sports\Olympic Games
Sports\Outdoor Recreations
Sports\Other
The above categories form a two-level taxonomy, with 7 first-level categories and 67 second-level categories. Ying Li et al. in [18] pointed out that completing this challenge requires facing the following issues:
• No straightforward training data: unlike previous years' KDD Cups, where for most tasks standard learning algorithms could be directly applied, it is not possible to directly apply standard learning algorithms to this task since no training data is provided.
• Open resources: although the information provided with the task is limited, there are no constraints on using extra information retrieved from other resources in order to solve the problem. So it is possible to acquire extra information from various media, such as search engines, dictionaries, thesauri, and document repositories.
• Subjective search query intents: the meanings of search queries, or the intents of search users, are subjective. For example, the query Saturn might mean the Saturn car to some users and the planet Saturn to others. The meaning of a query may also differ for the same person at different times.
• Implicit category semantics: since there is no detailed description of each category, it is necessary to rely on common sense to acquire the implicit category semantics.
• Noise in data: the provided query data comes from real search query logs. It contains a lot of misspellings and incorrect word breaking.
• Scalability and automation: it is very time consuming and costly to manually categorize the queries. A scalable automatic solution is required to label the 800,000 queries.
Furthermore, in [18] the authors grouped the competitors' approaches into three strands:
1. those in which the queries are preprocessed in order to remove noise and improve accuracy;
2. those in which extra information is gathered to augment the queries;
3. those in which a mapping function between a pre-defined/existing directory structure (or words) and the KDDCUP'05 categories is used.
The challenge's winning solution was developed by D. Shen et al. [5][10]. Their approach, together with its improved version, will be treated extensively in the section dedicated to the query classification state of the art. However, we can anticipate that it consists in enriching queries with a set of intermediate objects extracted through conventional Web search engines and through Web directory search engines.
The best result of the winning solution on the KDDCUP'05 test sets is 42.4% for F1 and 44.4% for precision.
1.2 Dataset analysis
The datasets provided for the KDDCUP 2005 competition (available on the Web2) are the following two:
• a dataset consisting of 800,000 real-user (unlabeled) queries extracted from a query log of the MSN search engine;
• a sample of the above dataset containing 111 queries together with category information.
The KDDCUP'05 task was to classify the 800,000 queries of the first dataset into the 67 categories listed before. There is no training data; the second dataset was provided only for demonstration purposes and it cannot be used alone as a training set due to its small size.
The queries of the dataset vary a lot; they can be numbers, names of persons or locations, URLs, program segments and so on [5]. Some of them contain typos, others are agglomerates of meaningful words joined into a single meaningless one.
Furthermore, there are a number of statistics to be taken into account with regard to the KDDCUP’05 dataset, extracted from [5]:
• most of the queries of the second dataset have more than three possible meanings (Tab. 2);
• 79% of the queries have fewer than four words;
• the queries containing three words are the most frequent (Tab. 3).
Tab. 2: distribution of the number of labels per query (percentage of queries by number of labels, 1 to 5).
Tab. 3: distribution of the number of words per query (frequency of queries by number of words, 1 to 10).
In order to evaluate the quality of the proposed solutions, the challenge organizers provided a test set containing 800 queries randomly extracted from the whole dataset.
Specifically, the test set includes three different data sets, each one manually labelled by a human expert.
The records of these test sets are pairs (query, label list); the label list is ranked starting from the label denoting the category with the highest affinity to the query down to the label denoting the category with the lowest affinity.
1.3 Gold standard extraction
Since the final evaluation criteria are computed as the average of the results obtained on each test set, and since the labels chosen by one human labeler for a query often differ from the labels chosen by another, it is impossible to achieve 100% in the F1 and precision measures.
Taking these issues into account, we investigated which labeling approach could guarantee the best performance attainable over the test set, in order to consider the label assignment produced by this method as our gold standard. To do so we considered three algorithms:
1. the first one, for each query, assigns only the categories which are voted by all labelers;
2. the second one, for each query, assigns the categories which are voted by at least two labelers, giving priority to those voted by all labelers;
3. the last one, for each query, assigns the categories which are voted by at least one labeler, with the same priority rule as the second method.
Denoting the three labeled test sets as L1, L2 and L3 respectively, we have the following results:
Table 1.2: Results on L1.
Methods First Second Third
Precision 1 0.927 0.687
Recall 0.329 0.669 0.890
F1 0.495 0.778 0.776
Table 1.3: Results on L2.
Methods First Second Third
Precision 1 0.645 0.445
Recall 0.504 0.714 0.883
F1 0.670 0.678 0.592
Table 1.4: Results on L3.
Methods First Second Third
Precision 1 0.883 0.679
Recall 0.314 0.609 0.839
The previous tables show the values measured on each test set; in Table 1.5 we can see approximately what the optimal solution can achieve, by considering the average of the three results obtained singularly on the three sets:
Table 1.5: Average results.
Methods First Second Third
Average Precision 1 0.819 0.604
Average Recall 0.382 0.664 0.871
Average F1 0.548 0.725 0.706
Hence, the gold standard we considered for the development of the classification system is the result of merging the 111 labeled queries (provided for demonstration purposes by the KDDCUP'05 organizers) with the 800 queries (used as the test set in the KDDCUP'05 challenge) labeled with the second label-assignment algorithm.
1.4 Evaluation criteria
The criteria adopted to evaluate a classification system are the measures commonly used in the information retrieval and data mining fields, that is, Precision and F1 (which thus includes Recall). The definitions given by the authors of KDDCUP'05 for the specific query classification context are the following [18]:
\[ \mathrm{TP} = \sum_{i=1}^{67} \#\{\text{queries correctly tagged as } c_i\} \]
\[ \mathrm{TP} + \mathrm{FP} = \sum_{i=1}^{67} \#\{\text{queries tagged as } c_i\} \]
\[ \mathrm{TP} + \mathrm{FN} = \sum_{i=1}^{67} \#\{\text{queries belonging to } c_i\} \]
\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
As there are three sets for the evaluation, the overall scores for precision and F1 are computed by averaging the results obtained on each set:
\[ \mathrm{Overall\ Precision} = \frac{1}{3} \sum_{i=1}^{3} (\text{Precision against the set } i) \]
\[ \mathrm{Overall\ F1} = \frac{1}{3} \sum_{i=1}^{3} (\text{F1 against the set } i) \]
To evaluate our system we considered the same F1 used in the challenge, which we will also refer to as Micro F1, so:
\[ \mathrm{Micro\ F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
Moreover, we used the Macro F1, which is computed as the mean of the F1 measures obtained by classifying the queries with one KDD category at a time.
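For concreteness, here is a minimal sketch of these measures in Python; the function names and the representation of the label sets (dicts mapping a query id to a set of category labels) are our own illustrative choices, not part of the original system.

def micro_scores(gold, predicted):
    """gold, predicted: dict mapping query id -> set of category labels."""
    tp = sum(len(gold[q] & predicted.get(q, set())) for q in gold)   # correctly tagged
    tp_fp = sum(len(predicted.get(q, set())) for q in gold)          # all tags issued
    tp_fn = sum(len(gold[q]) for q in gold)                          # all gold labels
    precision = tp / tp_fp if tp_fp else 0.0
    recall = tp / tp_fn if tp_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, predicted, categories):
    """Mean of the per-category F1 scores (one binary problem per category)."""
    f1s = []
    for c in categories:
        g = {q: ({c} if c in gold[q] else set()) for q in gold}
        p = {q: ({c} if c in predicted.get(q, set()) else set()) for q in gold}
        f1s.append(micro_scores(g, p)[2])
    return sum(f1s) / len(f1s)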
Chapter 2
Related works
Understanding the search intent of a query issued by a user has become an important problem in the research context. The KDDCUP competition was the first international event (2005) to underline the importance of this problem and its algorithmic difficulties.
Since that year, a variety of approaches have been proposed in the literature; we can classify them into four categories.
The first category includes those methods which assign the labels using a self-training-like approach, according to some click-through data (without considering the context of queries) [6].
The second category uses semi-supervised techniques, exploiting the large amount of unlabeled data, in order to improve the accuracy of the supervised learning trained on the labeled data [7].
The third category aims to classify query sessions issued by users, considering the context of queries, that is, by extracting some kind of knowledge from the click-through data provided by some commercial search engine [8][9].
The last category tries to leverage Web knowledge by enriching queries with intermediate objects, like categories (drawn from proper taxonomies) or web pages, in order to derive the correct target categories [5][10][11].
2.1 Some approaches in detail
The first work we mention is [13], in which X. Wang et al. faced the problem of the growth of the non-English Web, performing the cross-language query classification introduced by J.S. Olsson et al. in [12].
In order to provide this kind of service, they proposed the following steps:
1. given a query, they dispatch it to a Web search engine and retrieve a number of top-scoring pages;
2. they translate the content (plain text) of these pages into English using a machine translation system;
3. the results of the previous step are subsequently classified using an existing classifier trained on English texts;
4. they proceed with a voting phase in order to choose the proper category for the given query.
Another kind of classification is the so-called query type classification, where the aim is to classify the queries into the categories of a special taxonomy (given for the first time by Broder in [14]).
This taxonomy consists of three categories:
1. Navigational queries: those queries aiming to locate content within a specific Web site or domain.
2. Informational queries: those queries which reflect the user's intent to search for information of various genres across the Web.
3. Transactional queries: those queries which reflect the user's intent to perform a transaction, like purchasing something.
The work in [15] started from Broder's classification and addressed transactional queries. Specifically, they used toolbar log data to generate a bipartite graph in which the nodes are the issued queries and the Web forms clicked during the sessions started from those queries; once the graph is created, they analyze it so that they can extract a group of highly transactional queries to be used as a training set.
2.1.1 Some performant methods
The literature is full of well-performing approaches to the query classification problem, which we briefly recall in this section.
Cao et al.'s work [8] adopted CRF models [16] to perform the so-called context-aware query classification, which they define in the following way: given a target taxonomy, a user-specified parameter K and a user query, context-aware query classification incorporates the search context of the given query to classify it into a ranked list of K categories of the given taxonomy (chosen among the leaves of the taxonomic tree).
In their paper, they use supervised learning to train a classifier that can predict, given a query session, the target category, using the last query of the session as the test query and the others as the search context. By query session they mean a series of observations where each observation consists of a query and the set of URLs clicked by a user.
They used the KDDCUP'05 taxonomy as their target taxonomy and, as training set, a number of sessions manually labeled by three humans.
A further work on context-aware query classification is [9]. Also in this work the authors adopted CRF models to embed context information in query classification. The main differences with respect to [8] are the following: they used different training methods for the CRF models and they adopted semi-supervised learning, combining the small set of labeled queries with the large amount of unlabeled data, to improve accuracy and to reduce the human effort needed to label query sessions.
Another interesting work is [17], where the authors faced the problem of the sparseness of the feature space, which is a common problem in query classification.
In fact, queries are composed on average of three words, while the vocabulary considered is very rich; therefore, in order to produce accurate classifications, very large amounts of training data are required.
The authors claim that by considering a new, smaller feature space made of tags, the final feature space is reduced in terms of size and sparsity.
They define tags as the entities able to express properties of a page, such as its topic or which types of entities it contains.
For a given query issued to a Web search engine, the incidence of each tag t in the feature space within the search results is evaluated by considering the tag ratio, that is, the ratio between the result pages tagged with t and all result pages. Specifically, they pre-compute the tag ratios for suitably chosen sets of query keywords so that they can store them in main memory.
To improve the system's accuracy, instead of using exactly the tag ratios for query q, they consider a combination of the tag ratios from queries q' ⊂ q. To select the proper subset of tag ratios, they restrict the set to the tag ratios which share some properties; for instance, they precomputed tag ratios only for queries that, when issued to a search engine, produce a number of results exceeding a certain threshold.
2.2 State of the art
The most interesting works with regard to the purpose of this thesis are the winning solution of KDDCUP'05 ([5][10]) and its subsequent improvement [11]. In the above approaches the main idea is to enrich the queries with intermediate objects like Web pages or the categories of intermediate taxonomies, such as that of the Open Directory Project1 (ODP).
The first work consists of two phases, the first one corresponding to the training phase and the second one to the testing phase.
The former phase aims to train the classifiers; specifically, they consider two kinds of classifiers: a synonym-based one and a statistical one. The synonym-based classifier is very important, as it is the basis for constructing the second one.
The mapping function generated from this classifier has to be able to map all the categories of the intermediate taxonomy considered onto the categories of the target taxonomy. In order to do that, they perform an exact (syntactic) matching between the words that make up the strings identifying the ODP categories and the target categories.
Since many intermediate categories do not match the target ones, they expand the keywords which form the target categories through WordNet. To address the low-recall drawback of the synonym-based classifier, they proposed a statistical classifier based on Support Vector Machine [40] learning models. As there was no training data available, they constructed it according to the mapping function trained by the other classifier; in order to do that they followed these steps:
1. first, they sampled a number of Web pages from a Web page directory;
2. then, by using the aforementioned mapping function, they assigned a target category to the sampled pages;
3. eventually, they used the above labeled page set as the training data.
The features considered to represent the pages were the titles of the pages together with either the full plain texts of the pages or the snippets generated by some search engine (the latter combination produced better results than the former). The overall framework is formed by a total of six classifiers: three synonym-based, one statistical and two ensemble classifiers. To construct the ensemble classifiers, they used two strategies, with one aiming at higher F1 (giving equal weights to all classifiers) and the other aiming at higher precision (using the development data to tune the weights of each classifier).
In the testing phase they classify a query with the classifiers trained in the first phase, using the page list and the category list retrieved by issuing the query to the Web directory search engines and to the standard search engines.
Their best result on the KDDCUP’05 test sets for F1 and precision is 42.4% and 44.4% respectively.
The last work we are going to discuss is [11], which proposes an approach that outperforms the solution proposed in the above article, i.e. the winning solution of the KDDCUP'05 competition. Besides improving the winning solution, the authors' intent is to address two problems which that solution suffers from. The problems at issue are the following:
1. first, the classifier for the mapping function of the categories needs to be trained whenever the target taxonomy structure changes;
2. second, a Web directory contains a large number of categories, so it is costly to handle all the mapping functions; a selection of the most relevant ones is necessary.
To solve the second problem they proposed a category selection method through which they can properly select the top n categories; the sorting is done according to some measures, like the Total Probability, which gives each intermediate category a score that reflects the probability of generating the categories in the target taxonomy. To address the first problem, they built a bridging classifier on an intermediate taxonomy in an offline fashion, so that they can use it in online mode to map the users' queries to the target categories, without re-building the classifiers whenever the taxonomy structure changes.
The idea of the bridging classifier is to connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge. In order to do that, they built a probabilistic framework able to estimate p(C_i^T | q) (i.e., the probability that the query q belongs to the target category C_i^T) using other quantities such as p(C_i^T | C_j^I), p(q | C_j^I) and p(C_j^I), denoting respectively the probability of C_i^T given C_j^I, the probability of q given C_j^I, and the probability of C_j^I (estimable from the Web pages in C_j^I). The final solution is the result of the combination of the synonym-based classifiers and the SVM classifier of [10] with the bridging classifier, obtained using the aforementioned strategies to combine the intermediate solutions (ensemble of classifiers). With this approach the authors claimed an improvement of 9.7% and 3.8% in terms of precision and F1 respectively compared with the best results of KDDCUP'05, i.e. 44.01% for F1 and 48.71% for precision.
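For intuition, a plausible form of the bridging estimate, reconstructed from the quantities listed above under the assumption that the intermediate categories make the query and the target category conditionally independent (the exact formulation in [11] may differ), is:
\[ p(C_i^T \mid q) \;\approx\; \sum_{j} p(C_i^T \mid C_j^I)\, p(C_j^I \mid q) \;\propto\; \sum_{j} p(C_i^T \mid C_j^I)\, p(q \mid C_j^I)\, p(C_j^I). \]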
Chapter 3
Our System
The goal of this thesis has been to develop a query classification system able to perform multi-label query classification over the 67 KDDCUP’05 categories as the target ones.
In this chapter we pay attention to the main algorithmic aspects of our system and to its implementation issues. The key structural knowledge deployed by our system consists of the Wikipedia entities retrieved by a topic annotator (namely TagMe [34]) for the user query to be analyzed, and of the Wikipedia category graph. The next sections will therefore recall some basic notions related to these concepts.
The assignment of categories to queries deploys, as commonly done in other papers, a machine learning approach based on Support Vector Machines [40], which will be the topic of a subsequent section. The rest of the chapter is then dedicated to digging into the datasets and the gold standard adopted to build and test our system, as well as into the several steps involved in the classification task (from feature generation to the training and evaluation phases).
As we will see in this chapter, the system we developed is a fast and accurate tool, able to properly discriminate whether a given query belongs to each of the 67 predefined target categories (the same as the KDDCUP'05 challenge). The query label assignment is performed by employing 67 binary SVM classifiers trained using a well-tuned specific feature set as training set.
A key task for this work was to manually extract a set of bridge Wikipedia categories able to best represent the 67 target categories; through the semantic connection between bridge and target categories, this set made it possible to exploit Wikipedia and its knowledge graphs to extract valuable information.
The feature set employed to train the classifiers has been generated by merging the feature vectors produced by four different algorithms. We can classify these algorithms into two classes:
1. the topic-based approaches;
2. the bag-of-words approaches.
The algorithms belonging to the first class are those which employ the entities retrieved by querying some topic annotator, and they are the real strength of this system.
The other algorithms, which are standard approaches in the IR field, have been employed to further enrich the query classification process.
Finally, each classifier has been tuned according to the development set, starting from the features generated by the four algorithms, which have been evaluated and tuned individually.
Our system has achieved, on the test set, an F1 of 56.35%.
3.1 Tools
This section reports the tools we used to develop the query classification system, together with some basic useful notions.
3.1.1 Wikipedia and its knowledge graphs
Wikipedia is a multi-language free-content encyclopedia, written collaboratively by Internet volunteers who can contribute to enrich the encyclopedia under a pseudonym or using their real identity.
It’s a project created in 2001 by the non-profit Wikimedia Foundation and it’s based on the following five principles:
1. Wikipedia is an encyclopedia;
2. Wikipedia is written from a neutral point of view;
3. Wikipedia is free content that anyone can use, edit, and distribute;
4. Editors should treat each other with respect and civility;
5. Wikipedia has no firm rules.
The English Wikipedia includes 4,581,047 articles, and it grows every day by over 800 new articles. It includes, moreover, about 1 million categories which are used to group pages on similar subjects together; every category has one or more parent categories and one or more subcategories (to get an idea of the category tree see Figure 3.1 and Figure 3.2; we also suggest the tools at1 and at2).
Figure 3.1: A snapshot of the Fundamental categories subcategory graph obtained with the vCat tool.
Furthermore, there exist two top-level category groups, respectively Main topic classifications and Fundamental categories, from which it is possible to derive every other category.
As reported by Wikipedia, a fundamental category in a "tree of all knowledge" denotes a top-level place to start in a system of categorization (Figure 3.1), while the main topic classifications is a list of Wikipedia's major topic classifications (Figure 3.2).
1 http://en.wikipedia.org/w/index.php?title=Special%3ACategoryTree
2 http://meta.wikimedia.org/wiki/User:Dapete/vCat
Figure 3.2: A snapshot of the Main topic classifications subcategory graph obtained with the vCat tool.
Most Wikipedia pages are linked to a number of other pages, expressing in some way a semantic relation between them; the degree of correlation can be investigated, according to one's needs, for example by analyzing the topological structure of the graph.
The strength of the semantic correlation between two pages is usually inversely proportional to the length of the path which interconnects them; the longer the path between two pages, the lower the semantic similarity between the corresponding entities.
Another valuable resource is the Wikipedia category graph: every Wikipedia page belongs to one or more categories, and every category is connected through hierarchical links to others.
The category graph constitutes a rich taxonomy which can be used to mine knowledge.
All these features make Wikipedia the perfect ontology to enrich low-informative texts or queries with extra information.
3.1.2 Topic annotators
These kinds of systems were born to go beyond the bag-of-words model [32], where a text is represented as a multiset of its words.
These systems are able to perform so-called "text annotation", identifying on-the-fly meaningful mentions in the input text and linking them to the respective topics of a proper ontology.
An example of topic annotation is reported in Figure 3.3, where the underlined parts of the text in the middle are the mentions and the fragments surrounding the text are the titles of the Wikipedia pages attached to the mentions.
Figure 3.3: An example of topic extraction from unstructured text.
The state-of-the-art topic retrieval systems are WAT [33], TagMe [34], Illinois Wikifier [36], CMNS [37], Wikipedia Miner [38] and AIDA [39]. These tools extract entities from text.
Each system differs from the others in the ontology used and in the algorithms employed to extract the topics, that is, in the way it performs the several phases, like preprocessing, indexing and disambiguation.
The state of the art among topic annotators is TagMe [34], a system that is able to efficiently and judiciously augment a plain text with pertinent hyperlinks to Wikipedia pages. The specialty of TagMe with respect to known systems [35][38] is that it may annotate texts which are short and poorly composed, such as snippets of search-engine results, tweets, news, etc.
Very recently, the authors of [41] introduced the SMAPH system, which is the state of the art in annotating search-engine queries with Wikipedia entities; SMAPH achieved the best result in the ERD Challenge hosted by SIGIR 2014.
As the paper reports, the approach implemented by the SMAPH system consists of two main phases: candidate entity generation and entity pruning. The goal of the first phase is to generate a broad set of entities that may be linked to the query. Candidate entities are generated from three different sources. In this phase, the focus is on recall rather than precision, as the ideal outcome of this phase is a set containing all correct entities, at the cost of some amount of false positives. The second phase is meant to prune away all wrong entities, refining the set of candidate entities to the final output. In particular, SMAPH implements a four stage pipeline:
1. Fetching stage: it fetches the search results retrieved by a search engine after issuing to it the query to be annotated.
2. Spotting: it identifies candidate mentions for the entities to be annotated, after parsing the result snippets.
3. Candidate generation: to generate candidate entities it uses either the mentions produced by the previous stage as input to the WAT annotator, or, directly, the Wikipedia URLs in the search results.
4. Pruning: a pruning phase is performed in order to decide which entities to keep and which to discard, using a binary SVM classifier.
We used TagMe (WAT version) and SMAPH to annotate texts (specifically snippets and bolds) and queries, respectively.
3.1.3 Support Vector Machine
Support Vector Machines (SVMs) [40] are a set of supervised learning methods for data analysis and pattern recognition, used in classification and regression tasks.
The prediction model is generated starting from a specific example set called the training set; each example is marked as belonging to one of two categories.
The general idea of SVM is to find the hyperplane which best divides the examples of the two classes, i.e. the one which maximizes the so-called functional margin, that is, the largest distance to the nearest training data point of either class.
Mathematically speaking, in the linear SVM problem there are the following classification constraints:
\[ \forall i,\; y_i(w \cdot x_i + b) \ge 1 \qquad\text{i.e.}\qquad \begin{cases} w \cdot x_i + b \ge +1 & \text{if } y_i = +1 \\ w \cdot x_i + b \le -1 & \text{if } y_i = -1 \end{cases} \]
and the aim is to maximize the margin \(M = 2/\|w\|\), which is the same as minimizing \(\frac{1}{2} w^T w\).
Summarizing, we have to:
• Minimize: \(\frac{1}{2} w^T w\)
• Subject to: \(y_i(w \cdot x_i + b) \ge 1, \forall i\)
Since the objective function is quadratic and the constraints are linear in w and b, this is known to be a convex optimization problem. SVM tries to find a separating hyperplane in a higher-dimensional space (called the feature space) into which the training vectors are mapped; it is possible to use several kernel functions, depending on the kind of separation required; the following kernels are the most common ones:
• Linear.
• Polynomial.
• Gaussian (radial-basis function).
Support Vector Machines have proved to be a very effective machine learning method for text categorization and document classification [5][42]; for this reason, and for the remarkable properties listed below [43], we adopted them in this work:
• only support vectors are used to specify the separating hyperplane;
• ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature space;
• ability to identify outliers;
• overfitting and noise can be reduced using the soft margin approach;
• it is a convex optimization problem, so it is guaranteed to converge to a unique global solution;
• it has intrinsic feature selection.
The software library we used for support vector classification is LIBSVM [46].
3.2 Data preparation
In this section we broaden the discussion of Chapter 1 about the datasets and the gold standard used for the classification and evaluation phases.
The gold standard we considered to develop and to evaluate the query classification system is the one described in gold standard extraction section of Chapter 1.
We recall that this dataset consists of 911 queries, specifically:
1. the 800 queries, labelled by applying the labeling algorithm (proposed in the gold standard extraction section) to the three labeled test sets provided by the KDDCUP'05 challenge;
2. the 111 uniquely labeled queries, provided for demonstration purposes (see Chapter 1).
Subsequently, the gold standard queries have been shuffled and divided into two parts:
1. training set, consisting of 711 queries;
2. test set, consisting of 200 queries.
It has not been possible to further divide the data to generate a development set (as is usual), since for some target categories there are no more than two associated examples (in the gold standard), and so it would not be possible to have at least one example for each dataset.
For this reason we were forced to use the training set also as the development set.
3.3 Bridge categories
As anticipated in the Introduction, the first basic step of our work was to choose a representative set of Wikipedia categories which could characterize as well as possible the 67 KDDCUP'05 ones.
The reason for having a representative Wikipedia category set is that, through it, it is possible to create a semantic interconnection between the Wikipedia ontology and the target categories.
To build the aforementioned set, we manually mapped each target category (except the "Other" ones) to two or three Wikipedia categories.
In order to choose the proper Wikipedia categories to associate with a given target category, we performed the following steps:
1. first, we looked for a Wikipedia counterpart (according to some syntactic matching);
2. then, for the target categories left out by the first step, we issued them to the built-in Wikipedia search engine or to Google (restricting the search to the Wikipedia Web site by specifying the "site" option in the query). Once we got the retrieved pages, we manually selected some Wikipedia categories (among those used by Wikipedia to categorize these pages).
Overall, we found 122 representative Wikipedia categories (all listed in Appendix A) that we will refer to as bridge categories; below we report a few target-bridge category mappings for illustrative purposes:
Target category                               Bridge Wikipedia categories
Computers\Hardware                             Digital electronics; Computer hardware; Computer hardware companies
Entertainment\Games & Toys                     Play; Toys; Games
Online Community\Chat & Instant Messaging      Online chat; Instant messaging; Social networking services

Table 3.1: Three mapping examples between target categories and bridge categories.
3.4 Feature generation
In this section we detail how our system generates the features for each query. In particular, as anticipated in the Introduction, we developed four algorithms, each of which aims at generating its part of the feature vector for a given query.
We classified these four algorithms into two classes:
1. Topic-based approaches.
2. Bag-of-words approaches.
3.4.1 Topic-based approaches
The algorithms in this class of approaches employ the entities retrieved through a topic annotation phase in order to gather information from the Wikipedia category graph and generate the feature vector.
The two topic-based algorithms differ in the way they find the entities (the remaining part of the algorithm is the same for both). In particular, with the first algorithm the entities are retrieved simply by querying SMAPH. In the second algorithm, the following steps are performed (a minimal sketch of this pipeline follows the list):
1. the queries are sent to the Bing search engine [45];
2. the snippets of the search-engine results page are gathered;
3. the concatenation of those snippets is sent to the WAT annotator, which returns the entities.
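A minimal sketch of this second pipeline; fetch_bing_snippets and annotate_with_wat are hypothetical wrappers standing in for the calls to the Bing API and to the WAT service, whose actual endpoints and parameters are not shown here:

def fetch_bing_snippets(query):
    """Return the list of result snippets obtained by issuing `query` to Bing."""
    raise NotImplementedError  # would call the Bing Search API

def annotate_with_wat(text):
    """Return the set of Wikipedia entities that WAT annotates in `text`."""
    raise NotImplementedError  # would call the WAT annotation service

def entities_for_query(query):
    # 1. issue the query to the search engine, 2. gather the snippets,
    # 3. annotate their concatenation and return the entities.
    snippets = fetch_bing_snippets(query)
    return annotate_with_wat(" ".join(snippets))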
As regards the feature space, both topic-based algorithms represent each query with 122+122 features (since there are 122 bridge categories). The reason for having 244 features rather than 122 will become clearer as the reading goes on.
Once the entities are retrieved, the two algorithms perform the same two phases:
1. the information gathering phase, which, in general, consists in visiting the Wikipedia category graph starting from each entity;
2. the results combination phase, in order to generate the feature vector.
In the first phase of the algorithm, for each entity (Wikipedia page), we first get the Wikipedia categories to which the current entity belongs and then we perform a breadth-first search over the Wikipedia category graph (starting from each such category), keeping track of the bridge categories encountered.
Figure 3.4: The simulation of Wikipedia category graph traversal, starting from a certain entity.
In Figure 3.4 we tried to graphically summarize the approach described so far. There are two stopping conditions for the graph traversal performed starting from an entity:
• the specified horizon has been exceeded (see Figure 3.4);
• a category node belonging to Main topic classifications has been visited.
The choice of introducing these conditions is motivated by the fact that the semantic correlation between the starting categories and the visited one becomes lower the further we move away from the starting categories. In particular, if we keep on traversing by expanding a category node belonging to Main topic classifications, it is very likely that we introduce noise, visiting nodes with completely different semantics (e.g. going beyond the History category we can easily reach the Science category and so on). We empirically found that the horizon which guarantees the best performance is 8.
Algorithm 1 Upwards traversal for feature generation.
1: begin
2: results ← {}
3: for entity in Entities do
4:     results.add(recursiveUpwardsTraversal(getCategories(entity), {}, 1, horizon))
5: generateFeatureVector(results)
6: where
7: function recursiveUpwardsTraversal(nodesToVisit, results, currentHeight, horizon):
8:     if currentHeight ≥ horizon then return results
9:     categoriesToVisit ← {}
10:    for category in nodesToVisit do
11:        if category is not in MainTopicClassifications then
12:            categoriesToVisit.enqueue(category.getParents())
13:        if category is in BridgeCategories then
14:            results.add(category)
15:    return recursiveUpwardsTraversal(categoriesToVisit, results, currentHeight+1, horizon)
The pseudo-code in Algorithm 1 describes all the aspects of the first phase.
Once we have gathered, for each entity, the set of bridge categories visited, we perform the second phase of the algorithm, merging these results in order to generate the first 122 features.
Specifically, denoting with C_I the first part of the feature vector, consisting of 122 components, the entry C_I[j] is computed as the number of entities from whose traversal the bridge category node C_I_j has been visited, so C_I[j] ∈ [0, |E|], where E is the set of entities retrieved for a given query.
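A minimal sketch of the traversal-and-counting step just described, assuming the Wikipedia category graph is available in memory as two maps (page_categories and category_parents); all names are illustrative:

from collections import deque

def bridge_counts(entities, page_categories, category_parents,
                  bridge_categories, main_topic_classifications, horizon=8):
    """First 122 features: for each bridge category, the number of entities
    whose upward BFS (bounded by `horizon`) reached it."""
    bridge_index = {c: i for i, c in enumerate(sorted(bridge_categories))}
    features = [0] * len(bridge_index)
    for entity in entities:
        visited_bridges = set()
        frontier = deque((c, 1) for c in page_categories.get(entity, []))
        seen = set(c for c, _ in frontier)
        while frontier:
            category, height = frontier.popleft()
            if category in bridge_categories:
                visited_bridges.add(category)
            # stop expanding at the horizon or at a Main topic classifications node
            if height >= horizon or category in main_topic_classifications:
                continue
            for parent in category_parents.get(category, []):
                if parent not in seen:
                    seen.add(parent)
                    frontier.append((parent, height + 1))
        for c in visited_bridges:
            features[bridge_index[c]] += 1   # each entity contributes at most 1
    return features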
During the development of this system we tried different solutions regarding the kind of information to gather during the traversals and the function to use to merge that information; these approaches are discussed in the next chapter.
In order to compute the remaining part of the feature vector (the other 122 features), for each query with associated entities we build a ranked list of the top n entities most strongly correlated with the starting entities retrieved by the topic annotator (SMAPH for the first algorithm and TagMe for the second one).
In order to choose the top ones, we compute, for each entity, the list of the n Wikipedia pages most similar to the entity considered; every list is sorted in non-ascending order of the Jaccard index (the applied formula is reported below), computed between the entity and the pages over their sets of in-links:
\[ \mathrm{Jaccard}(Page_1, Page_2) = \frac{|\mathrm{inlink}(Page_1) \cap \mathrm{inlink}(Page_2)|}{|\mathrm{inlink}(Page_1) \cup \mathrm{inlink}(Page_2)|} \]
Once we have generated those lists, each one individually sorted according to the value identifying the similarity degree with the respective entity, we perform a multi-way merge extraction (Figure 3.5) of the overall most similar n Wikipedia pages (with complexity T(n)).
As a last step, retracing the first and the second phase described before, we visit the Wikipedia category graph for each of the n new entities, producing the remaining 122 features, each one in the range [0, n].
The idea of augmenting, for each query, the entity set with the top n similar entities aims at enriching the query with extra knowledge able to improve the system's accuracy and effectiveness.
Summarizing, for each query we construct a feature vector of 244 entries, which is the concatenation of the 122 features produced with the starting entities and the other 122 features produced with the top n similar entities.
Figure 3.5: An example of merging of three ranked lists.
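A minimal sketch of the Jaccard computation and of the multi-way merge extraction, assuming the in-link sets and the per-entity ranked lists have already been computed; heapq.merge is used here as a stand-in for the merge step:

import heapq

def jaccard(inlinks_a, inlinks_b):
    """Jaccard index between two pages, computed over their sets of in-links."""
    union = inlinks_a | inlinks_b
    return len(inlinks_a & inlinks_b) / len(union) if union else 0.0

def top_n_similar(per_entity_lists, n):
    """per_entity_lists: one list per starting entity, each already sorted by
    decreasing similarity, with items (similarity, page).
    Returns the n pages with the highest similarity over all lists."""
    merged = heapq.merge(*per_entity_lists, key=lambda item: -item[0])
    top, seen = [], set()
    for similarity, page in merged:
        if page not in seen:          # keep each Wikipedia page only once
            seen.add(page)
            top.append(page)
        if len(top) == n:
            break
    return top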
3.4.2 Bag-of-words approaches
These kinds of approaches are typical in the IR field; they consist in representing a text as a multiset of its terms. The general idea is to compare the query representation with the bridge category representations using a similarity function, and to employ this information to train the SVM classifiers.
The first step was to choose the proper representation both for queries and for bridge categories, and a suitable similarity function. We decided to represent every bridge category with a set of representative Wikipedia pages; specifically, given a bridge category C_i, we chose the set of Wikipedia pages belonging to C_i and to the sibling categories of C_i (Figure 3.6). The second step was to index, using the Lucene framework [44], the abstracts (retrieved from DBpedia [26]) of those Wikipedia pages. The model used to represent queries and abstracts is the vector space model, where a document is identified by a vector over its distinct terms; every vector entry identifies a different term and stores a value called tf-idf, which is the product of two factors:
1. tf, the term frequency, which indicates the number of occurrences of the term within the document;
2. idf, the inverse document frequency, which is inversely proportional to the number of documents of the overall index D containing the term t. In formula:
\[ \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)} \]
where N is the total number of documents indexed and df(t) is the number of indexed documents containing the term t.

Figure 3.6: An abstract example of representative Wikipedia page set extraction for a bridge category.
The similarity function used to compare the vectors in this model is the cosine similarity, expressed by the following formula:
\[ \mathrm{cosineSimilarity}(v_1, v_2) \equiv \cos(\theta) = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|} \]
In other words, each indexed abstract is represented by a tf-idf vector of its terms, while the tf-idf vector used to represent the query is generated according to the algorithm employed (also in this class of approaches there are two). Specifically, we create the query tf-idf vector as follows:
1. with the first algorithm, we issue the queries to Bing and collect the snippets of the top pages retrieved by the search engine; then we use the same Lucene analyzer employed to index the abstracts (thus proceeding with stemming, stop-word removal, tokenization and so on) to generate the tf-idf vector, starting from the concatenation of the aforementioned snippets;
2. the second method differs from the first by collecting bolds rather than snippets.
As regards the feature generation, for each query we create a 122-entry feature vector (denoted as B), computing B[i] as the average cosine similarity between the query tf-idf vector (generated with the snippets or bolds variant) and the tf-idf vectors of the abstracts belonging to the representative set of the i-th bridge category.
Figure 3.7, on the next page, graphically summarizes all the aforementioned aspects, from the indexing phase to the feature generation phase.
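A minimal sketch of this bag-of-words feature block, assuming the tf-idf vectors are plain dicts (term → weight) and that the representative abstracts of each bridge category have already been indexed and vectorized with Lucene; all names are illustrative:

import math

def cosine(v1, v2):
    """Cosine similarity between two sparse tf-idf vectors (dict: term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def bow_features(query_vector, bridge_abstract_vectors):
    """bridge_abstract_vectors: list (one slot per bridge category, in a fixed
    order) of lists of tf-idf vectors of the representative abstracts.
    B[i] is the average cosine similarity against the i-th representative set."""
    features = []
    for abstracts in bridge_abstract_vectors:
        sims = [cosine(query_vector, a) for a in abstracts]
        features.append(sum(sims) / len(sims) if sims else 0.0)
    return features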
3.4.3 Hybrid approaches
In order to fully evaluate the proposed feature-generation methods, we also investigated a hybrid approach which consists of juxtaposing all the generated feature vectors into a single one of 732 components:
1. 122+122 features for each topic-based approach;
2. 122 features for each BOW-based approach.
This hybrid approach is currently employed by the system for the features generation task.
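A minimal sketch of the juxtaposition, assuming the four per-query blocks computed by the algorithms above; the ordering of the blocks is an illustrative choice:

def hybrid_features(smaph_block, wat_block, bow_bolds_block, bow_snippets_block):
    """Juxtapose the four per-query feature blocks into a single vector:
    (122+122) + (122+122) + 122 + 122 = 732 components."""
    vector = smaph_block + wat_block + bow_bolds_block + bow_snippets_block
    assert len(vector) == 732
    return vector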
3.5 Classification task
After generating all the features, we trained 67 binary SVM classifiers, one for each target category, via the LibSVM [46] library.
All features are scaled to the [0,1] range before the classifiers are trained, and all SVM classifiers are set to use a linear kernel.
Each classifier has been tuned separately using the development set; specifically, we tune the class-imbalance weights for positive and negative instances, the γ parameter controlling the shape of the separating hyperplane, and the C parameter setting the confidence in the training set. As the objective function we considered the maximization of the F1 score for each target category.
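A minimal sketch of the per-category training and tuning step, using scikit-learn's libsvm-backed SVC as a stand-in for the LIBSVM bindings actually used by the system; the parameter grid and helper names are illustrative:

from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def train_category_classifier(X_train, y_train, X_dev, y_dev):
    """Train one binary classifier (one target category) with a linear kernel,
    tuning C and the positive-class weight so as to maximize F1 on the dev set."""
    scaler = MinMaxScaler()                      # scale features to [0, 1]
    X_train = scaler.fit_transform(X_train)
    X_dev = scaler.transform(X_dev)
    best = (None, -1.0)
    for C in (0.01, 0.1, 1, 10, 100):            # illustrative grid
        for w_pos in (1, 2, 5, 10):
            clf = SVC(kernel="linear", C=C, class_weight={0: 1, 1: w_pos})
            clf.fit(X_train, y_train)
            f1 = f1_score(y_dev, clf.predict(X_dev))
            if f1 > best[1]:
                best = ((scaler, clf), f1)
    return best[0]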
In order to figure out the best parameter configuration of each algorithm, tuning has been performed focusing on one feature block at a time.
Once each feature block has been optimized separately, we evaluate all the possible combinations of them with respect to the development set.
The validation phase applied to each combination of feature blocks (see Chapter 5) established that the solution which achieves the best results is the one combining all four feature blocks:
• the one generated with the first topic-based algorithm, that is the Wikipedia category graph traversal algorithm starting from the SMAPH entities;
• the one generated with the second topic-based algorithm, so using WAT entities;
• the one generated with the first bag-of-words algorithm, using the tf-idf created from the bolds to represent the query;
• the one generated with the second bag-of-words algorithm, using the tf-idf created from the snippets to represent the query.
Chapter 4
Other approaches
As we will see in this chapter, many variants have been considered and tested during the development of the multi-label query classification system.
Unfortunately, those approaches did not produce any improvement in the accuracy of the classification system. The first part of this chapter reports the other approaches concerning the feature generation task. In particular, we designed different strategies, with respect to those currently used by the system, both for gathering information during the Wikipedia category graph traversal and for combining that information.
For example, rather than considering binary counters which indicate whether a given bridge category has been encountered during the traversal, we considered generic counters which take into account the number of times a given bridge category has been encountered during the traversal.
We also tried to augment the graph coverage of the bridge categories by introducing the notion of neighbourhood, where the neighbourhood is the set formed by the sibling/parent categories of the bridge categories.
Finally, we attempted to employ another algorithm for gathering information during the graph traversal; specifically, putting aside for a moment the counting algorithm, we employed a shortest-path-based algorithm that, instead of considering whether a bridge category has been encountered, considers the length of the shortest path necessary to reach that category. Also for this variant we tried to exploit the neighbourhood in the computation but, even here, we had no improvements.
Another section of this chapter describes the attempt to adjust the bridge category set, trying to add/remove bridge categories in order to find the most representative bridge category set.
The final section reports the last attempt we made, that is, the attempt to refine the classification process by introducing a further stage of computation.
4.1 Other features generation approaches
This section reports all the algorithmic variants for the feature generation task that we tested during the system development.
In particular we designed two kinds of algorithm:
1. the counting algorithms;
2. the shortest path algorithms.
4.1.1 Counting algorithms
This section reports the other methodologies we developed with respect to the counting algorithm currently used by the system.
To refresh the reader's memory, we recall that the current solution employs a set of binary counters, one for each bridge category, where every counter indicates whether or not the relative bridge category has been encountered during the graph visit (for each entity). In order to merge the information which resides in each set of counters, the contents of the i-th counters (one per set) are summed up.
4.1.1.1 A different way of counting
Instead of considering a set of binary counters, one for each bridge category, we tried to accumulate in each counter the number of times a bridge category has been encountered during the graph visit starting from an entity.
In order to merge the results, we considered the following two methods (a sketch of both follows the list):
1. with the first, the contents of the i-th counters are summed up, for all 122 counters (hence the same method used for the current solution);
2. the second considers as the i-th final value the maximum among the values of the i-th counters.
We think that these solutions did not produce better results because they introduce redundancy in the representation.
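A minimal sketch of the two merge functions, assuming per_entity_counts holds one 122-component count vector per entity:

def merge_by_sum(per_entity_counts):
    """i-th feature = sum over entities of the i-th counter."""
    return [sum(col) for col in zip(*per_entity_counts)]

def merge_by_max(per_entity_counts):
    """i-th feature = maximum over entities of the i-th counter."""
    return [max(col) for col in zip(*per_entity_counts)]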
4.1.1.2 Counting neighbours
A promising solution, which turned out to be a bit disappointing in terms of performance, is the one which introduces the notion of bridge category neighborhood. By neighborhood we mean the set of all Wikipedia categories which are sibling or parent categories of the bridge ones.
We decided to introduce this extra resource in order to have a higher coverage of the Wikipedia category graph, and consequently a way to acquire a larger amount of information.
The difference with respect to the other methods resides in the counting performed to compute the values identifying each bridge category.
Specifically, when a category node is encountered during the graph visit, the routine described by Algorithm 2 is executed.
In practice, if a visited node constitutes a parent or sibling node of some bridge category, then we assign to that category a weight less than 1, unless that category has already been assigned a greater weight.
In this approach, we tried several configurations for the weights to be assigned according to the kind of node visited. In these tests, we considered the following constraint:
siblingWeight < parentWeight < intermCatWeight
The reason for this is that the semantic correlation between categories decreases the further we move upward in the taxonomy, and even more when we move laterally to other taxonomy branches.
As regards the function used to merge the results, we considered the same two methods employed in the previous section.
Algorithm 2 Weighting algorithm [Counting approach].
Data:
• Results - a list of the results accumulated until the procedure call; every result is a pair (category name, weight),
• Neighborhood - a map, where each entry contains a bridge category, its sibling and its parent categories.
1: procedure computeWeight(categoryName):
2:   intermCatWeight ← 1
3:   parentWeight ← 0.6
4:   siblingWeight ← 0.3
5:   for entry in Neighborhood do
6:     if categoryName = entry.categoryName then
7:       updateResults(entry.categoryName, intermCatWeight)
8:     else if categoryName is in entry.parents then
9:       updateResults(entry.categoryName, parentWeight)
10:    else if categoryName is in entry.siblings then
11:      updateResults(entry.categoryName, siblingWeight)
12: procedure updateResults(categoryName, weight):
13:   Results[categoryName] ← max(Results[categoryName], weight)
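For concreteness, a rough Python rendering of Algorithm 2 is sketched below; the Entry record and the way the neighbourhood map is represented are illustrative assumptions, not the actual implementation.

from collections import namedtuple

# Hypothetical neighbourhood entry: a bridge category together with the
# sets of its parent and sibling categories.
Entry = namedtuple("Entry", ["category_name", "parents", "siblings"])

INTERM_CAT_WEIGHT = 1.0
PARENT_WEIGHT = 0.6
SIBLING_WEIGHT = 0.3

def compute_weight(category_name, neighborhood, results):
    # neighborhood: iterable of Entry records; results: dict mapping each
    # bridge category to the best weight assigned so far.
    for entry in neighborhood:
        if category_name == entry.category_name:
            update_results(results, entry.category_name, INTERM_CAT_WEIGHT)
        elif category_name in entry.parents:
            update_results(results, entry.category_name, PARENT_WEIGHT)
        elif category_name in entry.siblings:
            update_results(results, entry.category_name, SIBLING_WEIGHT)

def update_results(results, category_name, weight):
    # Keep the greatest weight assigned to the bridge category so far.
    results[category_name] = max(results.get(category_name, 0.0), weight)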
4.1.1.3 Expand feature space to all Wikipedia categories
Another important approach, which we developed in parallel with the solution currently implemented, is the one which considers all the possible Wikipedia categories instead of restricting to the representative set of bridge categories.
Even though this solution involves some issues (e.g. it produces a sparse feature space), it has produced good results, thanks also to the properties of the SVM model already listed above; specifically:
• The ability to handle large feature spaces, like in this case, where we have a number of features equal to the number of Wikipedia categories (≈ 1M );
• The ability to handle the sparsity, due, in this case, to the nature of the used approach;
• The ability to perform feature selection, that, in this case, is necessary since we considered all Wikipedia categories.
The sparsity of the feature space is due to the algorithm employed to generate features, which reflects the standard counting algorithm discussed in Chapter 3. The difference between the standard counting algorithm and the one used in this solution is that in the latter there are as many counters as Wikipedia categories; in other words, in the latter we keep track of all the categories encountered during the graph visit instead of considering only a subset of them.
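A minimal sketch of this full-category variant, again assuming a visit_categories helper that yields every category encountered, could be the following; only the non-zero entries of the (≈ 1M wide) feature vector are materialised.

from collections import Counter

def all_category_features(entities, visit_categories):
    # Accumulate, over all the entities of a query, how many times each
    # Wikipedia category is encountered during the graph visits.
    features = Counter()
    for entity in entities:
        features.update(visit_categories(entity))
    # Sparse representation: only non-zero features are kept, e.g. to be
    # written out as (category id, count) pairs for the SVM.
    return dict(features)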
Even if it proved to be an effective and easy-to-implement method, this solution did not produce improvements in terms of system accuracy.
4.1.2 Shortest path algorithms
Another road we investigated is the one which employs different information-gathering methods during the Wikipedia category graph traversal.
These kinds of methods exploit the Wikipedia category graph topology to extract the proper knowledge in order to generate features.
In Chapter 5, we empirically show that such algorithms, even though theoretically promising, do not outperform the counting-based solutions, at least in our specific context.
4.1.2.1 Basic shortest path approach
The approach at issue differs from the counting-based one both in the nature of the information acquired during the graph traversal and in the function adopted to merge the results.
As can be seen in Figure 4.1, the graph topology determines the values to be gathered during the graph traversal, starting from an entity.
Specifically, for each entity, we traverse the Wikipedia category graph computing the weight of each bridge category encountered as the length of the shortest path between the bridge category and the entity node.
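The computation can be sketched as a breadth-first visit that records the depth at which each bridge category is first reached; the parents_of accessor, the starting entity_categories set and the depth bound are assumptions of this sketch, not part of the actual system.

from collections import deque

def shortest_path_weights(entity_categories, bridge_categories, parents_of, max_depth=5):
    # Breadth-first visit of the category graph starting from the categories
    # of an entity; the first time a bridge category is reached, the current
    # depth is recorded as its shortest path length.
    weights = {}
    queue = deque((category, 0) for category in entity_categories)
    seen = set(entity_categories)
    while queue:
        category, depth = queue.popleft()
        if category in bridge_categories and category not in weights:
            weights[category] = depth
        if depth < max_depth:
            for parent in parents_of(category):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return weights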
We remind the reader that the semantic correlation between two categories is inversely proportional to the length of the shortest path which separates them.
Hence, with this approach, we can use the shortest path length between a bridge category and an entity as the degree of similarity between them.
4.1.2.2 Shortest path with neighbourhood
With this experiment, we tried to employ the neighbourhood together with the shortest path approach.
For an explanation of the neighbourhood notion and of the reasons for introducing this extra resource, we refer the reader to Section 4.1.1.2.
This method basically recalls Algorithm 2 seen in the previous section, but in this context it acquires another meaning.
Since it is employed in the shortest path scenario, we redesigned the weight assignment for parent and sibling categories.
Specifically, when a node is encountered at a certain height during the graph traversal, we can have the following three cases:
• the node encountered identifies a bridge category: its weight is set to the current height (if the height is less than the value currently stored);
• the node encountered is a parent node of a bridge category node: the weight considered for the bridge category is the sum of the current height and a local weight (specific to parent nodes) multiplied by a spreading factor; this factor is used to modulate the confidence attributed to neighbourhood nodes in the weighting;
• the node encountered is a sibling node of a bridge category node: the weight considered for the bridge category is the same as in the previous case, except for the local weight, which is specific to sibling nodes.
All these cases were taken into account in the design of Algorithm 3. The agglomerating function used to combine all the local shortest path information gathered for each entity is the one taking, for each bridge category, the minimum among the gathered values.
Algorithm 3 Weighting algorithm [Shortest Path approach].
Data:
• Results - a list of the results accumulated until the procedure call; every result is a pair (category name, weight),
• Neighborhood - a map, where each entry contains a bridge category, its sibling and its parent categories.
1: procedure computeWeight(categoryName, currentHeight):
2:   spreadingFactor ← 2
3:   parentWeight ← currentHeight + 1 ∗ spreadingFactor
4:   siblingWeight ← currentHeight + 2 ∗ spreadingFactor
5:   for entry in Neighborhood do
6:     if categoryName = entry.categoryName then
7:       updateResults(entry.categoryName, currentHeight)
8:     else if categoryName is in entry.parents then
9:       updateResults(entry.categoryName, parentWeight)
10:    else if categoryName is in entry.siblings then
11:      updateResults(entry.categoryName, siblingWeight)
12: procedure updateResults(categoryName, weight):
13:   Results[categoryName] ← min(Results[categoryName], weight)
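As before, a rough Python rendering of Algorithm 3 is given below, under the same assumptions made for the previous sketch (hypothetical Entry records with category_name, parents and siblings fields); the final merge by minimum over all entities is also sketched.

SPREADING_FACTOR = 2

def compute_weight(category_name, current_height, neighborhood, results):
    # Weights grow with the distance: parents and siblings of a bridge
    # category are considered farther than the bridge category itself.
    parent_weight = current_height + 1 * SPREADING_FACTOR
    sibling_weight = current_height + 2 * SPREADING_FACTOR
    for entry in neighborhood:
        if category_name == entry.category_name:
            update_results(results, entry.category_name, current_height)
        elif category_name in entry.parents:
            update_results(results, entry.category_name, parent_weight)
        elif category_name in entry.siblings:
            update_results(results, entry.category_name, sibling_weight)

def update_results(results, category_name, weight):
    # Shortest path scenario: keep the minimum (i.e. closest) weight seen so far.
    results[category_name] = min(results.get(category_name, float("inf")), weight)

def merge_by_min(weight_maps, bridge_categories):
    # Agglomerating function: the final value of each bridge category is the
    # minimum over the weights gathered for all the entities of the query.
    return [min(w.get(cat, float("inf")) for w in weight_maps) for cat in bridge_categories]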
4.1.3 Other attempts
In this section, we report some methods employed by the initial system prototype.
The first method concerned the choice of the entities with the greatest relatedness, used to augment the starting set of SMAPH entities.
Trivially, the first algorithm that implemented the aforementioned task performed the following phases (graphically depicted in Figure 4.2):
1. extracting, for each SMAPH entity, a list of Wikipedia pages sorted according to the Jaccard relatedness degree with that entity;
2. generating the final list as the concatenation of all the lists obtained in the previous phase, considering the first n Wikipedia pages of each one.
Figure 4.2: An example showing the behaviour of the top n most similar pages extraction algorithm.
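A rough Python sketch of these two phases follows; candidate_pages and jaccard_relatedness are hypothetical helpers standing for the page retrieval and the relatedness computation, respectively.

def top_n_related_pages(smaph_entities, candidate_pages, jaccard_relatedness, n):
    # Phase 1: for every SMAPH entity, sort its candidate Wikipedia pages
    # by decreasing Jaccard relatedness with the entity.
    # Phase 2: concatenate the first n pages of every sorted list.
    augmented = []
    for entity in smaph_entities:
        ranked = sorted(candidate_pages(entity),
                        key=lambda page: jaccard_relatedness(entity, page),
                        reverse=True)
        augmented.extend(ranked[:n])
    return augmented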
This approach provided (with high values of n) performance comparable to that of the approach actually employed.
The drawback of this approach is that, with high values of n, the number of iterations (Wikipedia category graph traversals performed for each entity) grows: from T(n) using the multi-way merge approach to O(n · #SMAPH entities) using this one.
For the above reasons, and since the multi-way merge approach is able to obtain the same system accuracy employing a smaller number of entities, we decided to discard this first approach.
It is worth mentioning two other approaches, in which we normalized the feature values of a given query using, as additional information, the number of entities retrieved by issuing the query to the SMAPH system.
Specifically, we attempted the following variants:
• we divided up each feature by the number of entities retrieved;
• we added a new feature whose value is the number of entities retrieved.
The above choices are justified by the observation that the more entities are retrieved by the annotator, the higher the values computed by each counter are as well.
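Both variants can be sketched on top of a plain feature vector; the names are illustrative and num_entities is assumed to be positive.

def normalize_by_entity_count(features, num_entities):
    # Variant 1: divide every feature by the number of entities
    # retrieved by SMAPH for the query (num_entities > 0 assumed).
    return [value / num_entities for value in features]

def append_entity_count(features, num_entities):
    # Variant 2: add the number of retrieved entities as an extra feature.
    return features + [num_entities]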
By performing that kind of normalization, hence penalizing information-rich queries and rewarding information-poor queries, we hoped that the classifiers could benefit from it.
Unfortunately, that did not happen; we think the reason is that most of the queries we considered information-rich, due to the nature of the results retrieved by the annotator, contain noisy entities which already penalize those queries.
4.2 Adjust the bridge Wikipedia category set
A crucial task was to find the Wikipedia bridge category set which could represent the target category set as well as possible, by adjusting the manually extracted one.
By the term adjust we mean, here, the insertion/removal of Wikipedia categories from the starting bridge category set, in order to improve the system performance.
Since the approach which considers all possible combinations of categories to add to or remove from the bridge category set has exponential complexity, we were forced to employ a heuristic whose cost grows linearly with the number of Wikipedia categories considered.
More precisely, with the optimal approach we would have to perform $\sum_{i=1}^{N} \binom{N}{i} = 2^N - 1$ different evaluations, where N is the number of categories we want to add/remove.
The approach we followed is a naive method which adds/removes one category at a time, accumulating the results produced by the system with each new bridge category set and sorting them from the category whose insertion/removal produced the best F1 to the category which produced the worst result.
Once we obtain the above list, we keep the categories which achieved an improvement in terms of F1 with respect to the result obtained using the starting bridge category set, and we discard the other ones.
Finally, we add/remove one category after the other, starting from the first category of the results list, and we go on as long as the modification produces better results.
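The heuristic can be summarised by the following sketch, where evaluate_f1 is an assumed wrapper that runs the whole system with a given bridge category set and returns its F1; candidate_changes and the set representation are likewise illustrative.

def greedy_adjust(bridge_set, candidate_changes, evaluate_f1):
    # candidate_changes: list of (category, action) pairs, action in {"add", "remove"};
    # bridge_set: the starting bridge category set (a Python set).
    baseline = evaluate_f1(bridge_set)

    # Step 1: evaluate every single-category change and keep only the ones
    # that improve the baseline F1, sorted from best to worst.
    scored = []
    for category, action in candidate_changes:
        modified = bridge_set | {category} if action == "add" else bridge_set - {category}
        score = evaluate_f1(modified)
        if score > baseline:
            scored.append((score, category, action))
    scored.sort(reverse=True)

    # Step 2: apply the changes one after the other, stopping as soon as
    # a change no longer improves the current result.
    current_set, current_f1 = set(bridge_set), baseline
    for _, category, action in scored:
        modified = current_set | {category} if action == "add" else current_set - {category}
        score = evaluate_f1(modified)
        if score <= current_f1:
            break
        current_set, current_f1 = modified, score
    return current_set, current_f1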
The logical structure of the program which implements the results list generation task is represented in Figure 4.3.
Figure 4.3: Skeleton of the program which implements the results list generation task.