A new era:
Topic-based Annotators
“Diego Maradona won against Mexico”
Dictionary of terms
Term vector: against = 2.2, Diego = 5.1, Maradona = 9.1, Mexico = 1.0, won = 0.1
Similarity(v,w) ≈ cos(a)
[Figure: vectors v and w in the term space (axes t1, t2, t3), separated by angle a]
Vector Space model
Classical approach
Mainly term-based:
polysemy and synonymy issues
“He is using Microsoft’s browser” — “He is a fan of Internet Explorer”
Another issue: synonymy
Latent Semantic Analysis
• Related terms get projected to the same concept dimension
• Polysemy is not yet solved, concepts are latent
courtesy of Susan Dumais
“the paparazzi photographed the star” — “the astronomer photographed the star”
A typical issue: polysemy
A new approach:
Massive graphs of entities and relations
May 2012
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Web of Data: RDF, Tables, Microdata
Cyc
Thanks to G. Weikum, SIGMOD 2013
Wikipedia is a rich source of instances
Wikipedia's categories contain classes
Categories form a taxonomic DAG (not really…)
DAG of categories:
http://toolserver.org/~dapete/catgraph/
“Diego Maradona won against Mexico”
Find mentions and annotate them with topics drawn from a catalog
Former Argentina player
Mexico’s football team
A new approach:
Topic-based annotation
Wikipedia!
anchors articles
Polysemy
the paparazzi photographed the star the astronomer photographed the star
Star: a massive, luminous ball of plasma …
Celebrity: a person who is famously recognized …
“He is using Microsoft’s browser” — “She plays with Internet Explorer”
Synonymy
Internet Explorer (IE or MSIE), today also known as Windows Internet Explorer (WIE), is a proprietary graphical web browser developed by Microsoft …
“Diego Maradona won against Mexico”
•Former Argentina coach
•His nephew
•Maradona Stadium
•Maradona Movie
•…
•Mexico nation
•Mexico state
•Mexico football team
•Mexico baseball team
•…
Don’t annotate!
Why is it a difficult problem?
Features used to link an anchor a to a page p
Commonness of a page p wrt an anchor a
Link probability of an anchor a
Context of a around the mention T = …w1 w2 w3 a w4 w5 w6 …
and the content of a page/entity
Page p = z1 z2 z3 z4 z5 z6 …
Graph-based features: a is a mention-node, p is an entity-node; links a→p and paths between pages
Pr(p|a) = (# times a links to p) / (# times a occurs as an anchor)
lp(a) = freq(a as anchor) / freq(a in the text)
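These two statistics can be sketched in a few lines; the anchor counts below are made-up toy numbers, not real Wikipedia statistics:

```python
from collections import Counter

# Toy anchor statistics (hypothetical counts, for illustration only).
# links[a][p] = number of times anchor text a links to page p in Wikipedia
links = {"maradona": Counter({"Diego_Maradona": 90, "Maradona_Stadium": 10})}
freq_as_anchor = {"maradona": 100}  # occurrences of a as a link anchor
freq_in_text = {"maradona": 400}    # occurrences of a anywhere in the text

def commonness(p, a):
    """Pr(p|a): fraction of a's anchor occurrences that point to page p."""
    return links[a][p] / sum(links[a].values())

def link_probability(a):
    """lp(a): freq(a as anchor) / freq(a in the text)."""
    return freq_as_anchor[a] / freq_in_text[a]

print(commonness("Diego_Maradona", "maradona"))  # 0.9
print(link_probability("maradona"))              # 0.25
```

Anchors with very low link probability (e.g. stopwords) are typically not even considered as mentions.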
Relatedness between pages
The literature
Many commercial software: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.
• Designed for short texts
– news, blogs, search-result snippets, tweets, ads, etc.
– competitive on long texts too
• Achieves high accuracy
– Massive experimental test on millions of short texts
• Fast
– More than 10x faster than others
– 100% Java
[Ferragina-Scaiella, CIKM 2010]
tagme.di.unipi.it
• …
“Diego Maradona won against Mexico”
•Diego A. Maradona
•Diego Maradona jr.
•Maradona Stadium
•Maradona Film
•…
•Mexico nation
•Mexico state
•Mexico football team
•Mexico baseball team
•…
How TAGME works
• PARSING
• DISAMBIGUATION by a voting scheme
• PRUNING via 2 simple features (possible outcome: No Annotation)
• …
Disambiguation: The voting scheme
Collective agreement among topics via voting
“Diego Maradona won against Mexico”
Wiki-articles and their votes:
  Mexico — 0.09
  State of Mexico — 0
  Mexico National Football Team — 0.29
  Mexico National Baseball Team — 0
[Figure: the candidate pages of anchor b (Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, …) vote for each candidate page p of anchor a]
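A minimal sketch of the voting idea, with hypothetical candidate sets, commonness values, and relatedness scores: each other anchor b votes for a candidate page p in proportion to how related p is to b's own candidates, weighted by their commonness.

```python
# Hypothetical candidate sets with commonness Pr(p|a), and toy relatedness.
candidates = {
    "Maradona": {"Diego_Maradona": 0.8, "Maradona_Stadium": 0.2},
    "Mexico": {"Mexico": 0.5, "Mexico_national_football_team": 0.3,
               "State_of_Mexico": 0.2},
}
rel = {("Diego_Maradona", "Mexico_national_football_team"): 0.9,
       ("Diego_Maradona", "Mexico"): 0.3}

def relatedness(p, q):
    """Symmetric lookup; unknown pairs count as unrelated."""
    return rel.get((p, q)) or rel.get((q, p)) or 0.0

def vote(b, p):
    """Vote of anchor b for page p: relatedness of p to b's candidates,
    weighted by their commonness and averaged over the candidate set."""
    cands = candidates[b]
    return sum(relatedness(p, q) * pr for q, pr in cands.items()) / len(cands)

def disambiguate(a):
    """Pick the candidate of anchor a collecting the most votes."""
    others = [b for b in candidates if b != a]
    return max(candidates[a], key=lambda p: sum(vote(b, p) for b in others))

print(disambiguate("Mexico"))  # Mexico_national_football_team
```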
Disambiguation: all steps
Commonness of a page p wrt an anchor a
• Voting scheme: select top-ε pages
• Select the best in commonness
• Pruning by commonness < τ, where τ = trade-off between speed and recall
Pr(p|a) = (# times a links to p) / (# times a occurs as an anchor)
NLP ?
Useful for relating topics
Dense Subgraph
• Compute a dense subgraph such that each mention is connected to at most one entity
[Figure: weighted mention-entity graph; each mention-entity and entity-entity edge carries a weight]
Thanks to G. Weikum, SIGMOD 2013
Entities = Wikipedia Pages
Mentions = Anchor Texts
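The dense-subgraph idea can be sketched as a greedy heuristic (one common variant: repeatedly drop the weakest candidate entity while every mention keeps at least one candidate); all names and weights below are toy values:

```python
# Toy mention-candidate sets, mention-entity weights and entity-entity
# coherence weights (all made-up numbers).
candidates = {"star": ["Star_(astronomy)", "Celebrity"],
              "photographed": ["Photography"]}
mention_weight = {("star", "Star_(astronomy)"): 30,
                  ("star", "Celebrity"): 90,
                  ("photographed", "Photography"): 100}
coherence = {frozenset({"Celebrity", "Photography"}): 80,
             frozenset({"Star_(astronomy)", "Photography"}): 10}

def degree(mention, e, alive):
    """Weighted degree of entity e in the current subgraph."""
    d = mention_weight[(mention, e)]
    d += sum(coherence.get(frozenset({e, o}), 0) for o in alive if o != e)
    return d

def greedy_dense_subgraph():
    """Repeatedly drop the lowest-degree entity whose mention still has
    an alternative candidate, until each mention keeps exactly one."""
    alive = {m: list(cs) for m, cs in candidates.items()}
    while True:
        flat = {e for cs in alive.values() for e in cs}
        removable = [(degree(m, e, flat), m, e)
                     for m, cs in alive.items() if len(cs) > 1 for e in cs]
        if not removable:
            break
        _, m, e = min(removable)
        alive[m].remove(e)
    return {m: cs[0] for m, cs in alive.items()}

linking = greedy_dense_subgraph()
print(linking)  # each mention linked to exactly one entity
```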
Random Walks
• for each mention run personalized PageRank with restart
• rank entity e by stationary visiting probability, starting from m
[Figure: mention-entity graph with weighted edges and the resulting stationary visiting probabilities of each candidate entity]
Thanks to G. Weikum, SIGMOD 2013
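A minimal power-iteration sketch of personalized PageRank with restart at a mention node m; the graph and its edge weights are toy values:

```python
# Toy mention-entity graph: node "m" is the mention, others are candidates.
graph = {
    "m": [("Diego_Maradona", 90), ("Maradona_Stadium", 10)],
    "Diego_Maradona": [("m", 90), ("Maradona_Stadium", 5)],
    "Maradona_Stadium": [("m", 10), ("Diego_Maradona", 5)],
}

def personalized_pagerank(restart, alpha=0.15, iters=100):
    """Power iteration with probability alpha of jumping back to restart."""
    p = {n: float(n == restart) for n in graph}
    for _ in range(iters):
        nxt = {n: alpha * (n == restart) for n in graph}
        for u, edges in graph.items():
            total = sum(w for _, w in edges)
            for v, w in edges:
                nxt[v] += (1 - alpha) * p[u] * w / total
        p = nxt
    return p

scores = personalized_pagerank("m")
# rank candidate entities by stationary visiting probability from m
best = max((n for n in scores if n != "m"), key=scores.get)
print(best)  # Diego_Maradona
```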
On comparing annotators
A word of caution !
A2W (annotate a text with Wikipedia topics)
Sa2W (scored annotations to Wikipedia)
Rc2W (ranked concepts to Wikipedia)
Is it just a matter of “annotation” ?
#1: Classic BOW-representation
[Figure: docs T1, T2, T5 as vectors in the term space (axes t1, t2, t3), separated by angle a]
cos(a) = (v · w) / (||v|| · ||w||)
• T is an input text
• D is the dictionary of the terms in the collection
• Represent T via a vector consisting of |D| components
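A minimal sketch of the BOW representation with cosine similarity (raw term frequencies instead of tf-idf, for brevity):

```python
import math
from collections import Counter

def bow_vector(text):
    """Term-frequency vector over the text's terms (tf-idf omitted)."""
    return Counter(text.lower().split())

def cosine(v, w):
    dot = sum(v[t] * w[t] for t in v)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0

v = bow_vector("Diego Maradona won against Mexico")
w = bow_vector("Maradona played against England")
print(round(cosine(v, w), 3))  # 0.447: two shared terms out of 5 and 4
```

Note the term-based weakness: "Roland Garros" and "French Open" would score 0 here, which is exactly the synonymy problem above.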
#2: LSI-representation
• You still use cos-similarity for doc comparison
• Docs live in a reduced-space of latent concepts
• LSI projects the m-dim term space into a k-dim concept space (k ≪ m)
• Compare docs/queries in latent space
Reduced doc: d′ = Uk^T · d   (Uk^T is k × m, d is m-dim, d′ is k-dim)
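This projection is easy to sketch with a truncated SVD; the term-document counts below are toy values:

```python
import numpy as np

# Toy term-document matrix A (m=4 terms x n=3 docs).
A = np.array([[2., 0., 1.],   # "maradona"
              [1., 0., 0.],   # "mexico"
              [0., 3., 1.],   # "browser"
              [0., 1., 2.]])  # "explorer"
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]  # m x k: top-k left singular vectors (term-to-concept map)

def reduce_doc(d):
    """d' = Uk^T d: project an m-dim doc into the k-dim latent space."""
    return Uk.T @ d

d1, d2 = reduce_doc(A[:, 1]), reduce_doc(A[:, 2])
# docs are still compared with cosine, but now in the latent space
sim = float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))
print(sim)
```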
#3: LSI expansion
• T is an input text
• D is the dictionary of m terms in the collection
• k are the latent concepts computed via LSI
• Represent T via a vector consisting of m + k components (syntactic and latent)
• Use cos-similarity among expanded vectors
#4: Topic expansion
• T is an input text
• D is the dictionary of m terms in the collection
• Represent T via a vector consisting of (m + #Wikipedia-pages) as components
• Terms are weighted via e.g. tf-idf
• How do you weight the Wikipedia pages ?
You could use the rho-score assigned by TagME
• Given an input text T, we form a vector:
– One component per term ti, weighted wrt T as Repr[T, ti] = TfIdf(i,T)
– One component per Wiki-page Pj, weighted as Repr[T, Pj] = <Repr[T], Repr[Pj]>
Observe that Pj can be considered as a text
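A sketch of the resulting expanded vector, with hypothetical tf-idf weights and TagMe-like rho-scores standing in for the page components:

```python
# Hypothetical tf-idf weights and TagMe-like rho-scores for an input text.
terms = {"won": 0.4, "against": 0.1}
topics = {"Diego_Maradona": 0.9, "Mexico_national_football_team": 0.7}

def expanded_vector(terms, topics):
    """One component per term (tf-idf) plus one per Wikipedia page (rho)."""
    vec = {("term", t): w for t, w in terms.items()}
    vec.update({("page", p): rho for p, rho in topics.items()})
    return vec

v = expanded_vector(terms, topics)
print(len(v))  # 4 components: 2 syntactic + 2 semantic
```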
#5: Explicit Semantic Analysis
[Gabrilovitch et al, IJCAI 07]
To compute the semantic relatedness of a pair of texts, compute the cosine similarity of their ESA vectors, which consist of n + |Wikipedia| components.
Still a vector-space model, so it does not take into account the similarity of Pj versus Ph.
“obama says gaddafi may way out military assault”
“us president issues libya ultimatum”
Detected topics: Barack Obama, Muammar al-Gaddafi, Offensive (military), President of the United States, Libya, Ultimatum
#6: graph of concepts
Text as a sub-graph of topics
[Figure: sub-graph of topics — Mahmoud Ahmadinejad, Ultimatum, RQ-170 drone, President of the United States, Barack Obama, Iran — connected in the Wikipedia graph]
Any relatedness measure over a graph can be used, e.g. [Milne & Witten, 2008]
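The Milne & Witten measure computes relatedness from the overlap of the two pages' in-link sets; a sketch with toy in-link sets:

```python
import math

def milne_witten(in_a, in_b, wiki_size):
    """rel(a,b) = 1 - (log max(|A|,|B|) - log |A∩B|) /
                      (log |W| - log min(|A|,|B|)),
    where A, B are the in-link sets and W the set of all pages."""
    a, b = set(in_a), set(in_b)
    common = a & b
    if not common:
        return 0.0  # no shared in-links: unrelated
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(wiki_size) - math.log(min(len(a), len(b)))
    return max(0.0, 1 - num / den)

# toy in-link sets (hypothetical page ids) in a 1000-page "Wikipedia"
print(round(milne_witten({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}, 1000), 3))  # 0.801
```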
• Given two texts, we map them to two sets of nodes/entities
– We stick onto the graph representation
Random walks for Text Sim
Text 1 and Text 2 are compared via the volumes of random paths between their concepts in the concept graph (e.g. Wikipedia)
Graph-KB of concepts
Latest approaches combine: category hierarchy, Wikipedia graph, random walks, Word2Vec representations of terms & entities, …
Results
• word relatedness
• text relatedness
Some applications of this
novel text representation
Text Categorization
It is the problem of labeling natural language texts with one or more thematic categories drawn from a predefined set.
• Many solutions for long texts (hence many terms)
– Naive Bayes, SVM, k-NN, MaxEnt, Decision Tree…
– Few for short texts, which are poorly composed
Topical classification of news
Training (given a set of classes)
1. Annotate the training-set using TAGME
2. For each class z, create a subset of Wikipedia pages (topics) Tz by importance and distinctiveness
Test
3. Annotate the text to be classified using TAGME
4. Classify the text by computing the ρ-weighted relatedness between each of the text’s topics and the subset of topics Tz assigned to each category z (ρ = score of the topic in the text).
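Steps 3–4 can be sketched as follows; the class topic sets, rho-scores, and relatedness values are toy numbers:

```python
# Toy class topic sets Tz and a toy relatedness table.
class_topics = {"SPORT": ["French_Open", "Rafael_Nadal"],
                "BUSINESS": ["Wall_Street", "Eurozone"]}
rel = {("Francesca_Schiavone", "French_Open"): 0.9,
       ("Francesca_Schiavone", "Rafael_Nadal"): 0.6}

def relatedness(p, q):
    return rel.get((p, q)) or rel.get((q, p)) or 0.0

def classify(text_topics):
    """text_topics: dict topic -> rho score from the annotator.
    Score each class by rho-weighted relatedness to its topic set Tz."""
    def score(z):
        return sum(rho * relatedness(t, c)
                   for t, rho in text_topics.items()
                   for c in class_topics[z])
    return max(class_topics, key=score)

print(classify({"Francesca_Schiavone": 0.8}))  # SPORT
```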
Example
[Figure: relatedness measure between the text’s topics and each class’s topic set]
SPORT: Rafael Nadal, Coach (sport), Roger Federer, NBA Playoffs, …
SCI-TECH: World Wide Web, web search engine, NASA, Planet, …
BUSINESS: Federal Reserve, Eurozone, Stock, Wall Street, …
HEALTH: Cholesterol, Patient, Infants, HIV, …
Input: “Francesca Schiavone won Roland Garros” → topics French Open, Francesca Schiavone → class SPORT
Some experiments
32K items from 3 RSS news-feeds
[Chart: F1 measure (≈0.58–0.78) per category — Sport, Business, U.S., Health, Sci&Tech, World, Entertainment — comparing TAGME, BAYES, C4.5, AdaBoost.MH, SVM]
User profiling
User profiling in Twitter
Parse all tweets of a user via TAGME and via the Stanford NLP Parser, then assign a weight to each detected entity according to its grammatical role in the tweet.
User profile = weighted set of entities
Few interesting advantages:
• Small footprint of the user profile: better space and time efficiency
• Robustness against polysemy and synonymy
• Same profile structure for users and tweets
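A sketch of such a profile; the grammatical-role weights are hypothetical, and the entity annotations stand in for TAGME output:

```python
from collections import defaultdict

# Hypothetical weights per grammatical role (assumed values).
ROLE_WEIGHT = {"subject": 1.0, "object": 0.7, "other": 0.3}

def build_profile(annotated_tweets):
    """Each tweet is a list of (entity, grammatical_role) pairs; the
    profile accumulates a weight per entity across all tweets."""
    profile = defaultdict(float)
    for tweet in annotated_tweets:
        for entity, role in tweet:
            profile[entity] += ROLE_WEIGHT.get(role, ROLE_WEIGHT["other"])
    return dict(profile)

tweets = [[("Diego_Maradona", "subject"), ("Mexico", "object")],
          [("Diego_Maradona", "other")]]
profile = build_profile(tweets)
print(profile)
```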
Best user per tweet
[M. Pennacchiotti et al., CIKM 2012]
Problem: Given a tweet predict the “most interested” user
Data = 250 users over 182,000 tweets
o Tested: Classic tf-idf approach versus TagMe-based approach
[Chart: % correct of the classic tf-idf approach vs. the TagMe-based approach, by tweet type: events/content, status/sentiment, generic news]
https://services.d4science.org/web/tagme