
This search strategy seems to us quite advanced and powerful.

In the table of Figure 3.8 we report the classification accuracy of the two classifiers on the 42 data-sets. We can notice that the accuracies are comparable (even though with our method it is slightly lower). However, the heuristic search adopted by the Ward method is clearly less precise than the one implemented in the Weka wrapper: the Ward agglomerative hierarchical method is greedy and can be trapped in local minima. We believe that the slightly inferior performance reached by our technique can be explained by the simpler, greedy search method we have adopted. For the future, we plan to enrich the Ward hierarchical method with some backtracking facility, such as the one presented in [23].

3.5 Discussion

The presented approach to feature selection is a heuristic approach, guided by the τ association measure. The selection of the candidate features follows a greedy, best-first, forward method: Ward hierarchical clustering. The feature selection itself, however, is a hybrid approach. Clusters help to identify redundant features. From the clusters, the most relevant features are chosen as representatives, a task that constitutes a filter approach. The actual selection of the most appropriate combination of representatives (which corresponds to the selection of the dendrogram level) follows a wrapper approach, as sketched below. One of the advantages of this solution is that it does not require any special or complex parameter tuning process, as other approaches like ReliefF do. Furthermore, the dendrogram produced by hierarchical clustering constitutes a useful, immediate and semantically rich conceptual organization of the feature space, which helps experts to explore and understand the feature space of a new problem domain.
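The following minimal sketch illustrates this filter-plus-wrapper combination under several assumptions of ours: features are numeric columns, `feature_distance` is a callable derived from the τ measure, and SciPy's Ward linkage is applied to a precomputed distance matrix (a simplification, since Ward formally assumes Euclidean distances). The representative chosen here (the medoid of each cluster) also stands in for the relevance-based choice described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def hybrid_feature_selection(X, y, feature_distance, estimator=None):
    """Hybrid scheme: cluster the features, keep one representative per
    cluster (filter step), then pick the dendrogram level whose
    representatives yield the best cross-validated accuracy (wrapper step)."""
    n = X.shape[1]
    # Pairwise tau-derived distances between features.
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = feature_distance(X[:, i], X[:, j])
    # Agglomerative (Ward-like) clustering on the precomputed distances.
    Z = linkage(squareform(D, checks=False), method='ward')
    est = estimator if estimator is not None else GaussianNB()
    best_score, best_feats = -np.inf, list(range(n))
    for k in range(1, n + 1):          # one dendrogram cut per level
        labels = fcluster(Z, t=k, criterion='maxclust')
        reps = []
        for c in np.unique(labels):    # medoid of each cluster as representative
            members = np.where(labels == c)[0]
            sub = D[np.ix_(members, members)]
            reps.append(members[np.argmin(sub.mean(axis=1))])
        score = cross_val_score(est, X[:, reps], y, cv=10).mean()
        if score > best_score:         # wrapper: keep the best-scoring level
            best_score, best_feats = score, reps
    return best_feats, best_score
```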

From the extensive experiments performed, on several data-sets and learners, we have seen that our hybrid method of feature selection outperforms the ranking methods, at least from the viewpoint of classification accuracy. In particular, we have shown that the classification accuracy achieved by different learners on many data-sets after the application of our feature selection always increases (or at least equals) the original accuracy of the same learners without feature selection. We claim that the classification accuracy achievable by the learners after the feature selection step should not be degraded as a consequence of the feature selection. This is an important issue in the evaluation of any feature selection method and should always be considered carefully.


Data-set        Orig.     HCL NB    FCBF      ReliefF
anneal          86.30%    90.31%    86.97%    86.19%<
audiology       73.45%    73.89%    73.89%    73.89%
autos           56.10%    63.42%    53.17%<   56.10%
badges2         99.66%   100.00%   100.00%   100.00%
balance-scale   90.40%    90.40%    90.40%    90.40%
breast-cancer   71.68%    74.48%    70.28%<   73.08%
breast-w        95.99%    95.99%    95.85%<   95.99%
colic           77.99%    84.51%    83.15%    79.89%
contact-lenses  70.83%    87.50%    70.83%    83.33%
credit-a        77.68%    86.09%    75.22%<   86.38%
credit-g        75.40%    75.60%    74.40%<   75.00%<
diabetes        76.30%    77.47%    77.47%    76.43%
glass           48.60%    53.27%    42.52%<   48.60%
heart-c         83.50%    83.83%    84.49%    83.83%
hepatitis       84.52%    86.45%    84.52%    84.52%
HO              77.99%    84.51%    83.15%    79.89%
hypothyroid     95.28%    95.28%    94.62%<   95.10%
ionosphere      82.62%    86.61%    88.89%    82.62%
iris            96.00%    96.00%    96.00%    96.00%
kropt           36.01%    36.01%    33.26%<   36.01%
kr-vs-kp        87.89%    91.52%    91.99%    88.77%
labor           89.47%    94.74%    89.47%    91.23%
letter          64.12%    65.09%    65.52%    64.12%
lymph           83.11%    83.78%    79.73%<   81.08%<
monk2           56.81%    62.13%    59.17%    61.54%
mushroom        95.83%    98.62%    98.52%    95.83%
primary-tumor   50.15%    50.15%    47.20%<   49.85%<
segment         80.22%    81.91%    86.67%    83.25%
sick            92.60%    96.55%    96.53%    94.43%
sonar           67.79%    73.56%    70.67%    71.15%
soybean         92.97%    92.97%    89.60%<   92.97%
spambase        79.29%    79.79%    76.87%<   67.83%<
splice          95.30%    95.58%    96.14%    95.36%
ticdata-categ   85.52%    94.02%    93.90%    80.83%<
titanic         77.87%    77.87%    77.60%<   67.92%<
vehicle         44.80%    53.19%    43.26%<   44.80%
vote            90.12%    95.63%    96.09%    90.11%<
vowel           63.74%    67.78%    63.43%<   64.65%
waveform-5000   80.00%    80.84%    77.90%<   79.90%<
weather         64.29%    64.29%    57.14%<   57.14%<
wine            96.63%    97.75%    97.75%    96.63%
zoo             95.05%    97.03%    93.07%<   95.05%
Mean            78.33%    81.34%    78.75%    78.52%
Mean Incr. Acc            3.01%     0.42%     0.19%

Figure 3.6: Comparing the accuracy of Naive Bayes coupled with our method (HCL FS) vs. FCBF and ReliefF (the symbol < marks an accuracy lower than the original one)

             Avg. Orig.  Avg. HCL NB  Avg. HCL J48  Avg. FCBF  Avg. ReliefF
Num. Feat.   23.12       15.07        14.45         6.52       16.33
Mean % Red.              34.81%       37.49%        71.78%     29.35%

Figure 3.7: Comparison of Feature Reduction

HCL NB    HCL J48   Wrapper NB  Wrapper J48
81.34%    84.54%    83.19%      85.46%

Figure 3.8: Average accuracy: Comparison between our approach and Wrapper methods


Data-set        Orig.     HCL J48   FCBF      ReliefF
anneal          98.44%    98.44%    96.88%    98.44%
audiology       77.88%    78.32%    77.88%    77.88%
autos           81.95%    81.95%    69.76%<   81.95%
badges2        100.00%   100.00%   100.00%   100.00%
balance-scale   76.64%    76.64%    76.64%    76.64%
breast-cancer   75.52%    75.87%    69.93%<   75.52%
breast-w        94.56%    95.85%    95.85%    94.56%
colic           85.33%    85.87%    81.79%<   85.33%
contact-lenses  83.33%    87.50%    83.33%    83.33%
credit-a        86.09%    87.54%    85.07%<   86.23%
credit-g        70.50%    70.50%    70.50%    72.20%
diabetes        73.83%    75.26%    74.87%    74.48%
glass           66.82%    67.76%    69.16%    66.82%
heart-c         77.56%    80.53%    77.23%<   76.57%<
hepatitis       83.87%    83.87%    82.58%<   81.29%<
HO              85.33%    85.87%    81.79%<   85.33%
hypothyroid     99.58%    99.58%    97.32%<   99.34%<
ionosphere      91.45%    91.45%    90.03%<   91.17%<
iris            96.00%    96.00%    96.00%    96.00%
kropt           56.58%    56.58%    52.22%<   56.58%
kr-vs-kp        99.44%    99.47%    94.06%<   97.78%<
labor           73.68%    85.97%    77.19%    75.44%
letter          87.98%    88.22%    88.47%    88.14%
lymph           77.03%    79.73%    75.68%<   77.03%
monk2           56.21%    62.13%    62.13%    57.99%
mushroom       100.00%   100.00%    99.02%<  100.00%
primary-tumor   39.82%    44.25%    41.89%    41.00%
segment         96.93%    97.06%    97.01%    96.97%
sick            98.81%    98.83%    97.45%<   97.61%<
sonar           71.15%    76.44%    75.48%    72.60%
soybean         91.50%    92.24%    91.07%<   91.51%
spambase        92.98%    92.98%    93.31%    76.16%<
splice          94.08%    94.11%    94.51%    94.39%
ticdata-categ   93.97%    94.02%    94.02%    94.02%
titanic         78.92%    78.92%    77.60%<   67.51%<
vehicle         72.46%    74.70%    58.27%<   72.70%
vote            96.32%    96.32%    96.09%<   96.32%
vowel           81.52%    81.52%    81.11%<   80.10%<
waveform-5000   75.08%    75.46%    76.88%    76.66%
weather         64.29%    71.43%    42.86%<   71.43%
wine            93.82%    95.51%    93.82%    93.82%
zoo             92.08%    96.04%    92.08%    92.08%
Mean            83.08%    84.54%    81.64%    82.64%
Mean Var. Acc             1.46%    -1.44%    -0.44%

Figure 3.9: Comparing the accuracy of the J48 decision tree coupled with our method (HCL FS) vs. FCBF and ReliefF (the symbol < marks an accuracy lower than the original one)

"One should always generalize."
Carl Gustav Jacobi (1804-1851)

4 Towards the Automatic Construction of Conceptual Taxonomies

In this chapter we investigate the possibility of an automatic construction of conceptual taxonomies and evaluate the achievable results. In our setting, a concept is represented by a keyword extracted from a text corpus. A conceptual taxonomy is then a hierarchical organization of the keywords (Keyword Hierarchy, KH) such that keywords at the higher levels of the hierarchy represent a larger number of other keywords and, as a consequence, are contained in a larger number of documents. The hierarchy construction, on the space of the keywords, is performed by the Ward hierarchical clustering algorithm, guided by a keyword proximity measure that is well known in statistics for capturing the association between two categories: Goodman-Kruskal τ. Then, we choose a keyword representative from each cluster of keywords in the hierarchy, in a way similar to how PageRank determines the authority of Web pages. The obtained hierarchy has the same advantages, both descriptive and operative, as keyword indices that partition a large document collection with respect to the search space of its contents. In addition, each cluster in the hierarchy is described by a list of frequently occurring terms ordered by the authority score determined by PageRank. The obtained keyword hierarchy provides interesting insights into the content of the documents and constitutes a useful browsing tool [55].
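As a rough illustration of the data this pipeline works on, the sketch below (Python; the helper name and document representation are hypothetical choices of ours) estimates the keyword co-occurrence weights used by the PageRank step of Section 4.2.1, the weight of an edge being the fraction of documents that contain both keywords.

```python
import numpy as np

def cooccurrence_weights(docs_keywords, vocabulary):
    """docs_keywords: one set of (stemmed) keywords per document.
    Returns W with W[a, b] = fraction of documents containing both
    keyword a and keyword b (zero on the diagonal)."""
    index = {kw: i for i, kw in enumerate(vocabulary)}
    W = np.zeros((len(vocabulary), len(vocabulary)))
    for kws in docs_keywords:
        present = [index[k] for k in kws if k in index]
        for a in present:
            for b in present:
                if a != b:          # no self-loops
                    W[a, b] += 1.0
    return W / max(len(docs_keywords), 1)
```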

4.1 Introduction

We propose a method based on data mining techniques to detect the significant topics in a document, i.e., a method for determining which concepts in the document are relatively important for a given task. The need to determine the importance of a particular concept within a document is motivated by a range of applications, including information retrieval, automatic determination of authorship, similarity metrics for cross-document clustering, automatic indexing and input to summarization. The objective of this work is to automatically build a hierarchical organization of terms from a set of documents that provides at the same time: 1) a concise description of the area to which the content of the documents belongs and 2) an index facility to browse and partition the collection of documents. In contrast with most previous work, we focus on automated indexing into a set of categories which are not predefined but are determined automatically by the content of the documents (their recurrent terms). We took short texts, the abstracts of 126 papers submitted in the last 8 years to ACM Transactions on Database Systems, and produced the term hierarchy by applying the Ward hierarchical clustering algorithm to the contained keywords. The Ward algorithm is described in Chapter 2. In order to represent a concept, we extract from the documents the recurrent keywords (after Porter stemming preprocessing) and cluster them with the Ward algorithm, obtaining a hierarchical keyword tree (KH). The Ward algorithm is guided by an objective function that evaluates the best elements (in our case the feature sets) to be agglomerated in a cluster. We coupled the Ward algorithm with a measure of distance between features derived from Goodman-Kruskal τ, an association measure often used in statistics to determine the ability of one category to predict another.
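For concreteness, here is a minimal sketch of the Goodman-Kruskal τ computation between two categorical variables (Python; the function name is ours): τ measures the proportional reduction in the error of predicting the categories of one variable once the categories of the other are known.

```python
import numpy as np

def goodman_kruskal_tau(x, y):
    """Proportional reduction in the error of predicting the categories
    of y once the categories of x are known (assumes y is not constant)."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()                      # joint distribution p(x, y)
    px = joint.sum(axis=1)                    # marginal p(x)
    py = joint.sum(axis=0)                    # marginal p(y)
    err_y = 1.0 - np.sum(py ** 2)             # error predicting y alone
    err_y_given_x = 1.0 - np.sum(joint ** 2 / px[:, None])
    return (err_y - err_y_given_x) / err_y
```

A τ-derived distance between two features can then be obtained, for example, as 1 − τ (one plausible choice on our part; the actual distance used is the one defined in Chapter 2).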

We then determine, from the group of terms in each hierarchy node, the authoritative term that represents the group. In this step we apply PageRank [73], which has been brought to success by its adoption in the popular Google search engine and has already been applied to many domains to detect the authoritative elements in a large collection. We briefly review the PageRank technique in Section 4.2.1. Section 4.3 discusses the proposed algorithm.

In Section 4.4 we report the experiments for the validation of the obtained hierarchy. We used the ACM Computing Taxonomy (CT) to validate the automatically produced term hierarchy. ACM CT was created by computer science experts to organize the collection of papers, and it concisely represents computer science knowledge by abstract keywords. We were able to compare ACM CT and KH because the papers whose abstracts we used had been manually classified by their authors into the ACM CT categories. We applied three different measures to perform the comparison: the Jaccard measure, entropy, and the classification accuracy into the taxonomy categories achieved by a set of classifiers. All the classifiers obtained considerably superior results on our KH taxonomy, which suggests that the discovered taxonomy is much more consistent with the textual content of the documents than ACM CT.

4.2 Related Work

Much of the literature in this field is related to text categorization, often referred to as machine-aided indexing, automated indexing or authority control: the assignment of texts to one or more of a pre-existing set of categories [62, 16, 1].

A rich set of works exists based on document clustering or on a hierarchical organization of the topics of documents. They very often make use of an already existing set of categories or of NLP domain knowledge, either to learn a classifier or to generate topic taxonomies. [92] clusters the documents in the results of web search engines to improve user browsing. It intersects the sets of words contained in the documents to obtain the document clusters, whose cohesion is defined as the number of words common to the cluster documents. In [74] salient words and phrases are extracted from documents and organized hierarchically using a type of co-occurrence derived from subsumption. [45] investigated the adoption of hierarchical clustering methods in the context of document management, adopting a statistical model based on log-linear regression for the evaluation of the similarity between pieces of information. In [49, 75] the EM clustering algorithm is adopted instead, once the distribution of words conditioned on the classes is given. [26] studies a spherical k-means algorithm for clustering very sparse document vectors. [19] identifies recurrent topics in a text corpus by a combination of NLP (to detect named entities) and data mining techniques, detecting clusters of named entities frequently occurring together in documents; there, clustering partitions the hypergraph resulting from the relationships between frequent itemsets.

[33] presents a system for the construction of taxonomies which yield high accuracies with automated categorization systems on the web. In contrast, our technique is domain- and language-independent: it is based on clustering, guided by the τ predictive measure, and on PageRank, and exploits little NLP knowledge.

[83] compares three shallow processing methods to extract index terms: technical terms (noun phrases repeated more than twice in a document), keywords or stemmed words, and head phrases (simplex noun phrases with the same head, ranked in decreasing order of frequency). [44] proposes instead a new composite similarity metric that combines information from multiple linguistic indicators to measure the semantic distance between pairs of small textual units (as usually applied in IR).

[84] produces a spatial representation of a text corpus based on the document content, for information visualization. It relies on NLP processing to detect the semantic relationships between words and on neural networks for the association between words and their context.

Similarly, [16] organizes large text databases hierarchically by topic to aid better searching, browsing and filtering. It starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents, classifying them into the topics.

[48] presents a system intended to aid human document analysts in the assignment of indexes to physical chemistry journal articles.

[93] presents a computer-aided system that retrieves training samples for building enterprise taxonomies which adhere to a taxonomy initially defined by some keywords.

[1] proposes a solution to the problem of automatic text categorization based on supervised hierarchical clustering (relying on a set of predefined categories and an initial training set of labelled documents).

[18] investigates the possibility of using highly ranked search result snippets to enrich the representation of text segments.

4.2.1 PageRank

In our approach the PageRank algorithm is used for ranking the keywords inside a specific cluster. PageRank [73] is a graph-based ranking algorithm used in the Google search engine and in a great number of unsupervised applications; a good definition of PageRank and of some of its applications is given in [67]. The basic idea supporting the adoption of a graph-based ranking algorithm is that of voting or recommendation: when a first vertex is connected to a second vertex by a weighted edge, the first vertex votes for the second one proportionally to the weight of the connecting edge. The higher the sum of the weights that a vertex obtains from the other vertices, the higher the importance of that vertex in the graph. Furthermore, the importance of a vertex determines the importance of its votes.

A random walk on a graph describes the probability of moving between the graph vertices. In our case, the graph vertices being the keywords, it describes the probability of finding a document containing the keywords. A random walk converges to the stationary state of the corresponding Markov chain; the stationary state assigns to each state of the chain the probability of being in that state after an infinite walk on the graph guided by the transition probabilities. Through random walks on the graph, PageRank determines the authority of a graph vertex by an iterative algorithm, i.e., collectively, by aggregation of the transition probabilities between all the graph vertices. PageRank produces a score for each graph vertex (according to formula 4.1, discussed below) and orders the vertices by score value. In conclusion, it finds a ranking of the keywords.

In our problem, we have an undirected and weighted graph $G = (V, E)$, where $V$ is the set of vertices corresponding to the keywords in a cluster and $E \subseteq V \times V$ is the set of edges between pairs of vertices, whose weights correspond to the probability of having a document with both keywords. The edge connecting vertices $V_a, V_b \in V$ carries a weight denoted by $w_{ab}$. PageRank assigns a score to every keyword corresponding to a vertex of the graph: the score at vertex $V_a$ is greater the more important the vertex is, and the importance is determined by the vertices to which $V_a$ is connected. PageRank iteratively determines the score of each vertex $V_a$ in the graph as a weighted contribution of all the scores assigned to the vertices $V_b$ connected to $V_a$, as follows:

$$WP(V_a) = (1 - d) + d \cdot \sum_{V_b,\, V_b \neq V_a} \frac{w_{ba}}{\sum_{V_c,\, V_c \neq V_b} w_{bc}} \, WP(V_b) \qquad (4.1)$$

where $d$ is a damping parameter between 0 and 1 (set to 0.85, the usual value). $WP$ is the resulting score vector, whose $i$-th component is the score associated with vertex $V_i$. The greater the score, the greater the importance of the vertex according to its similarity to the other vertices to which it is connected. This algorithm is used in many applications, particularly in NLP tasks such as Word Sense Disambiguation [67].
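A minimal sketch of this iteration follows (Python; the function name, initialization and convergence details are our own assumptions). The weight matrix is assumed symmetric with a zero diagonal, matching the undirected, self-loop-free graph defined above.

```python
import numpy as np

def weighted_pagerank(W, d=0.85, tol=1e-9, max_iter=200):
    """Iterates formula (4.1): W[b, a] is the weight w_ba of the edge
    between keywords b and a (symmetric matrix, zero diagonal assumed)."""
    n = W.shape[0]
    out = W.sum(axis=1)                 # sum_c w_bc for every vertex b
    out[out == 0] = 1.0                 # isolated vertices: avoid division by zero
    T = W / out[:, None]                # T[b, a] = w_ba / sum_c w_bc
    wp = np.ones(n)                     # initial scores
    for _ in range(max_iter):
        new = (1 - d) + d * (T.T @ wp)  # formula (4.1) for all vertices at once
        if np.abs(new - wp).max() < tol:
            break                       # scores have converged
        wp = new
    return wp

# Keywords are then ordered by decreasing authority score, e.g.:
# ranking = np.argsort(-weighted_pagerank(W))
```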