• Non ci sono risultati.

A new era: Topic-based Annotators

N/A
N/A
Protected

Academic year: 2021

Condividi "A new era: Topic-based Annotators"

Copied!
52
0
0

Testo completo

(1)

A new era:

Topic-based Annotators

(2)

“Diego Maradona won against Mexico”

Dictionary of terms

against Diego Maradona Mexico won

2.2 5.1 9.1 1.0 0.1

Term Vector

Similarity(v,w) ≈ cos(a) t1 v

w t3

t2

a

Vector Space model

Classical approach

Mainly term-based:

polysemy and synonymy issues

(3)

He is using Microsoft’s browser He is a fan of Internet Explorer

Another issue: synonymy

(4)

Latent Semantic Analysis

• Related terms get projected to the same concept dimension

• Polysemy is not yet solved, concepts are latent

courtesy of Susan Dumais

Sec. 18.4

(5)

the paparazzi photographed the star the astronomer photographed the star

A typical issue: polysemy

(6)

6

A new approach:

Massive graphs of entities and relations

May 2012

(7)

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Web of Data: RDF, Tables, Microdata

Cyc

Thanks to G. Weikum, SIGMOD 20137

(8)

Wikipedia is a rich source of instances

(9)

Wikipedia's categories contain classes

Categories form a taxonomic DAG (not really…)

(10)

10

DAG of categories:

http://toolserver.org/~dapete/catgraph/

(11)

“Diego Maradona won against Mexico”

Find mentions and annotate them with topics drawn from a catalog

Ex-Argentina’s player

Mexico’s football team

A new approach:

Topic-based annotation

Wikipedia!

anchors articles

(12)

Polysemy

the paparazzi photographed the star the astronomer photographed the star

Star is a massive, luminous ball of

plasma …

Celebrity is a person who is famou- sly recognized …

(13)

He is using Microsoft’s browser She plays with Internet Explorer

Synonymy

Internet Explorer (IE o MSIE), oggi noto anche con il nome Windows

Internet Explorer (WIE), è un browser web grafico proprietario sviluppato da Microsoft

(14)

“Diego Maradona won against Mexico”

•Ex-Argentina’s coach

•His nephew

•Maradona Stadium

•Maradona Movie

•…

•Mexico nation

•Mexico state

•Mexico football team

•Mexico baseball team

•…

Don’t annotate!

Why is it a difficult problem?

(15)

Features used to link a  p

Commonness of a page p wrt an anchor a

Link probability of an anchor a

Context of a around the mention T = …w1 w2 w3 a w4 w5 w6

and the content of a page/entity

Page p = z1 z2 z3 z4 z5 z6

Graph-based features a is a mention-node

p is an entity-node Links a p and paths

between pages

Pr(p| a) = # a linked to p

# a as anchor

text in the

of freq

anchor as

of ) freq

( a

a a lp =

(16)

Relatedness between pages

(17)

The literature

17

Many commercial software: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.

(18)

• Designed for short texts

– news, blogs, search-results snippets, tweets, ads, etc – competitive on long texts too

• Achieves high accuracy

– Massive experimental test on millions of short texts

• Fast

– More than 10x faster than others – 100% Java

[Ferragina-Scaiella, CIKM 2010]

tagme.di.unipi.it

(19)

• …

“Diego Maradona won against Mexico”

•Diego A. Maradona

•Diego Maradona jr.

•Maradona Stadium

•Maradona Film

•…

•Mexico nation

•Mexico state

•Mexico football team

•Mexico baseball team

•…

No Annotation PARSING

PARSING PRUNING

2 simple features

• …

DISAMBIGUATION

by a voting scheme

How TAGME works

(20)

Disambiguation: The voting scheme

Disambiguation: The voting scheme

Collective agreement among topics via voting

“Diego Maradona won against Mexico”

Wiki-articles Votes

Mexico 0.09

State of Mexico 0

Mexico National

Football Team 0.29

Mexico National

Baseball Team 0

p

a

p

b •Diego A. Maradona

•Diego Maradona jr.

•Maradona Stadium

•Maradona Film

•…

b a

(21)

Disambiguation: all steps Disambiguation: all steps

Commonness of a page p wrt an anchor a

Voting scheme Select top-ε pages

Select the best in commonness

Pruning by commonness < τ

τ = trade-of speed vs. recall

Pr(p| a) = # a linked to p

# a as anchor

(22)
(23)

Less links

More links

(24)

NLP ?

Useful for relating topics

(25)

Dense Subgraph

• Compute dense subgraph such that:

each mention is connected to at most one entity 90

30

5 100 100 50

20 50

90

80 30 90

10 10

20 30

30

Thanks to G. Weikum, SIGMOD 201326

Entities = Wikipedia Pages

Mentions = Anchor Texts

(26)

Random Walks

for each mention run personalized PageRank with restart

• rank entity e by stationary visiting probability, starting from m

50 90

80 30 90

10

20 10

0.83 0.7

0.4

0.75 0.15

0.17

0.2 0.1

90 30

5 100 100 50 30 20 30

0.75 0.25

0.04 0.96 0.77 0.5 0.23 0.2 0.3

Thanks to G. Weikum, SIGMOD 201327

(27)

On comparing annotators

(28)

“Diego Maradona won against Mexico”

Find mentions and annotate them with topics drawn from a catalog

Ex-Argentina’s player

Mexico’s football team

A new approach:

Topic-based annotation

Wikipedia!

anchors articles

(29)

A word of caution !

A2W

Sa2W

Rc2W

(30)
(31)
(32)

Is it just a matter of “annotation” ?

(33)

#1: Classic BOW-representation

t1 T2

T1 T5 t3

t2

a

cos(a) = v w / ||v|| * ||w||

• T is an input text

• D is the dictionary of the terms in the collection

• Represent T via a vector consisting of |D| components

(34)

#2: LSI-representation

• You still use cos-similarity for doc comparison

• Docs live in a reduced-space of latent concepts

• LSI projects m-dim space  k-dim space

• Compare docs/queries in latent space

k

UTk x m

d =

m

k

Reduced doc

(35)

#3: LSI expansion

• T is an input text

• D is the dictionary of m terms in the collection

• k are the latent concepts computed via LSI

• Represent T via a vector consisting of m + k components (syntactic and latent)

• Use cos-similarity among expanded vectors

(36)

#4: Topic expansion

• T is an input text

• D is the dictionary of m terms in the collection

• Represent T via a vector consisting of (m + #Wikipedia-pages) as components

• Terms are weighted via e.g. tf-idf

• How do you weight the Wikipedia pages ?

You could use the rho-score assigned by TagME

(37)

• Given T be an input text, we form a vector:

– One component per term ti weighted wrt T as

Repr[T,t

i

] = Tf-Idf

i,T

– One component per Wiki-page Pj weighted as

Repr[T,Pj] = <Repr[T], Repr[Pj]>

Observe that Pj can be considered as a text

#5: Explicit Semantic Analysis

[Gabrilovitch et al, IJCAI 07]

(38)

#5: Explicit Semantic Analysis

[Gabrilovitch et al, IJCAI 07]

To compute semantic relatedness of a pair of texts, compute the cosine similarity of their ESA

vectors, consisting of n + |Wikipedia|

components

Yet vector space model, thus it does not take into account for the similarity of Pj versus Ph

(39)

obama says gaddafi may way out military assault

us president issues libya ultimatum

Barack Obama Muammar al-Gaddaf

Offensive (military)

President of the United States

Libya Ultimatum

#6: graph of concepts

(40)

Text as a sub-graph of topics

Mahmoud Ahmadineja d

Ultimatum

RQ-170 drone

Any relatedness measure over a graph, e.g. [Milne &

Witten, 2008]

President of the United States

Barack Obama

Iran

(41)

• Given two texts, we map them to two sets of nodes/entities

– We stick onto the graph representation

Random walks for Text Sim

Text 1

Text 2

Volumes

of random paths in the concept

graph (e.g. Wikipedia)

Graph-KB of concepts

Latest approaches combine: category hierarchy, Wikipedia graph, random walks, Word2Vect representation of terms & entities,….

(42)

Results

• word relatedness

• text relatedness

(43)

Some applications of this

novel text representation

(44)

Text Categorization

It is the problem of labeling natural language texts with one or more thematic categories drawn from a predefned set.

• Many solutions for long texts (hence many terms)

– Naive Bayes, SVM, k-NN, MaxEnt, Decision Tree…

few for short texts which are poorly composed

(45)

Topical classifcation of news

Training (given a set of classes)

1. Annotate the training-set using TAGME

2. For each class z, create a subset of Wikipedia pages (topics) Tz by importance and distinctiveness

Test

3. Annotate the text to be classified using TAGME 4. Classify the text by computing the r-weighted

relatedness between any text’s topic with the subset of topics Tz assigned to each category z (r of topic in text).

(46)

Example

Relatedness measure SPORT

SPORT SCI-TECHSCI-TECH BUSINESSBUSINESS HEALTHHEALTH ....

Rafael Nadal Coach (sport)

Roger Federer

NBA Playofs

World Wide Web

Federal Reserve

NASA

web search engine Planet

Eurozone Stock

Francesca Schiavone won Roland Garros

Wall Street

Cholesterol Patient Infants

HIV

French Open Francesca Schiavone

SPORT SPORT

(47)

Some experiments

32K items from 3 RSS news-feeds

TAGME BAYES C4.5 AdaBoost.MH SVM

0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76

F1 measure 0.78

80%

75%

70%

65%

60%

Categories Sport Business U.S.

Health Sci&Tech World

Entertainment

(48)

User profling

(49)

User profiling in Twitter

Parse all tweets of a users via TAGME and via the Stanford NLP Parser, then assign a weight to each detected entity according to its grammatical role in the tweet.

User profle = weighted set of entities

Few interesting advantages:

Small footprint of the user profile, better space and time

Robustness against polysemy and synonymy

Same profile structure for users and tweets

(50)

Best user per tweet

[M. Pennacchiotti et al., CIKM 2012]

Problem: Given a tweet predict the “most interested” user

Data = 250 users over 182'000 tweets

o Tested: Classic tf-idf approach versus TagMe-based approach

events/content status/sentiment

Generic News

% correct

(51)

https://services.d4science.org/web/tagme

Riferimenti

Documenti correlati

Ancora una volta tentare di dare un quadro esaustivo delle caratteristiche che definiscono il libro per bambini e per ragazzi si è dimostrata un’impresa titanica che,

masse due forze opposte, he le farebbero deviare dalle geodeti he. Esistono per o an he geodeti he non \ omoventi,&#34; in ui la distanza tra

Currently, he is writing an article on ‘Modernist translation in Muslim Countries: Pioneers, Means, Ends’ in the hopes of highlighting the dynamics of Western epistemic

Qui, si vuole semplicemente osservare che l’aggettivo o il nome di nazionalità è un po’ come un’etichetta, che può esser percepita come una barriera, un confine, un

With this expression it was intended that control of law on constitutional documents (art. c.c.) with which the judicial authority, verified the occurrence of

• Trova (Ctrl + F): effettua la ricerca di un testo all’interno della pagina visualizzata; si possono specificare la direzione di ricerca (verso il basso o verso l’alto

• il riquadro Pagina iniziale, per impostare l’indirizzo della pagina web che il browser deve caricare quando viene avviato; è possibile scrivere l’indirizzo per esteso nella

In  more  than  twenty  years  of  academic  activity  he  conjugated  research  activity  and  industrial  advice  to  make  these  worlds  better  interact  and