Text Classification


(1)

Text Classification

The Naïve Bayes algorithm

IP notice: most slides from Chris Manning, plus some from William Cohen, Chien Chin Chen, Jason Eisner, David Yarowsky, Dan Jurafsky, P. Nakov, Marti Hearst, Barbara Rosario

(2)

Outline

Introduction to Text Classification
- Also called "text categorization"

Naïve Bayes text classification

(3)

Is this spam?

(4)

More Applications of Text Classification

Authorship identification
Age/gender identification
Language identification
Assigning topics such as Yahoo-categories
- e.g., "finance," "sports," "news>world>asia>business"
Genre detection
- e.g., "editorials," "movie-reviews," "news"
Opinion/sentiment analysis on a person/product
- e.g., "like," "hate," "neutral"
Labels may be domain-specific
- e.g., "contains adult language" vs. "doesn't"

(5)

Text Classification: Definition

The classifier:
- Input: a document d
- Output: a predicted class c from some fixed set of labels c1, ..., cK

The learner:
- Input: a set of m hand-labeled documents (d1, c1), ..., (dm, cm)
- Output: a learned classifier f: d → c

Slide from William Cohen

(6)

Document Classification

(Figure: a small class hierarchy with example documents.)

Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, grouped under (AI), (Programming) and (HCI).

Training data (example words per class):
- ML: learning, intelligence, algorithm, reinforcement, network, ...
- Planning: planning, temporal, reasoning, plan, language, ...
- Semantics: programming, semantics, language, proof, ...
- Garb.Coll.: garbage, collection, memory, optimization, region, ...

Test data: "planning language proof intelligence" (which class should it be assigned to?)

Slide from Chris Manning

(7)

Classification Methods: Hand-Coded Rules

Some spam/email filters, etc.
- E.g., assign a category if the document contains a given Boolean combination of words
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive

Slide from Chris Manning

(8)

Classification Methods: Machine Learning

Supervised machine learning: learn a function from documents (or sentences) to labels
- Naive Bayes (simple, common method)
- Others:
  - k-Nearest Neighbors (simple, powerful)
  - Support vector machines (newer, more powerful)
  - ... plus many other methods
- No free lunch: requires hand-classified training data
  - But data can be built up (and refined) by amateurs

Slide from Chris Manning

(9)

Naïve Bayes Intuition

(10)

Representing Text for Classification

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26

Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:

Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
Maize Mar 48.0, total 48.0 (nil).
Sorghum nil (nil)
Oilseed export registrations were:
Sunflowerseed total 15.0 (7.9)
Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....

f(document) = c ?

What is the best (simplest useful) representation for the document d being classified?

Slide from William Cohen

(11)

Bag of Words Representation

(Same Reuters article as on the previous slide.)

Categories: grain, wheat

Slide from William Cohen

(12)

Bag of Words Representation

xxxxxxxxxxxxxxxxxxx GRAIN/OILSEED xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxx grain xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx grains, oilseeds xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx tonnes, xxxxxxxxxxxxxxxxx shipments

xxxxxxxxxxxx total xxxxxxxxx total xxxxxxxx xxxxxxxxxxxxxxxxxxxx:

Xxxxx wheat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, total xxxxxxxxxxxxxxxx

Maize xxxxxxxxxxxxxxxxx

Sorghum xxxxxxxxxx

Oilseed xxxxxxxxxxxxxxxxxxxxx

Sunflowerseed xxxxxxxxxxxxxx

Soybean xxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....

Categories: grain, wheat

Slide from William Cohen

(13)

Bag of Words Representation

(Same masked article as above, now reduced to word counts:)

word        freq
grain(s)    3
oilseed(s)  2
total       3
wheat       1
maize       1
soybean     1
tonnes      1
...         ...

Categories: grain, wheat

Slide from William Cohen
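As an illustrative Python sketch (not part of the slides; the simple regex tokenizer is an assumption), a bag-of-words representation is just a count of each token in the document:

    from collections import Counter
    import re

    def bag_of_words(text):
        """Lowercase, split into alphabetic tokens, and count occurrences."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return Counter(tokens)

    doc = "Oilseed export registrations were: Sunflowerseed total 15.0 (7.9) Soybean May 20.0, total 20.0 (nil)"
    print(bag_of_words(doc))  # Counter({'total': 2, 'oilseed': 1, 'export': 1, ...})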

(14)

Formalizing Naïve Bayes

(15)

Bayes' Rule

P(A | B) = P(B | A) P(A) / P(B)

• Allows us to swap the conditioning
• Sometimes it is easier to estimate one kind of dependence than the other

(16)

Conditional Probability

Let A and B be events.
P(B | A) = the probability of event B occurring, given that event A occurs.
Definition: P(B | A) = P(A ∩ B) / P(A)

(17)

Deriving Bayes' Rule

P(B | A) = P(A ∩ B) / P(A)          P(A | B) = P(A ∩ B) / P(B)

P(B | A) P(A) = P(A ∩ B)            P(A | B) P(B) = P(A ∩ B)

P(A | B) P(B) = P(B | A) P(A)

P(A | B) = P(B | A) P(A) / P(B)

(18)

Bayes' Rule Applied to Documents and Classes

P(C, D) = P(C | D) P(D) = P(D | C) P(C)

P(C | D) = P(D | C) P(C) / P(D)

Slide from Chris Manning

(19)

The Text Classification Problem

Using a supervised learning method, we want to learn a classifier (or classification function) γ:

γ : X → C

We denote the supervised learning method by Γ:

Γ(T) = γ

- The learning method Γ takes the training set T as input and returns the learned classifier γ.

Once we have learned γ, we can apply it to the test set (or test data).

Slide from Chien Chin Chen

(20)

Naïve Bayes Text Classification

The Multinomial Naïve Bayes model (NB) is a probabilistic learning method.

In text classification, our goal is to find the "best" class for the document:

c_map = argmax_{c ∈ C} P(c | d)                 (the probability of document d being in class c)
      = argmax_{c ∈ C} P(d | c) P(c) / P(d)     (Bayes' Rule)
      = argmax_{c ∈ C} P(d | c) P(c)            (we can ignore the denominator)

Slide from Chien Chin Chen
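A tiny illustrative sketch (not from the slides; the probability values are placeholders) of why the denominator can be dropped: dividing every class score by the same P(d) does not change which class wins the argmax.

    # Hypothetical priors P(c) and document likelihoods P(d | c); values are placeholders.
    prior = {"spam": 0.4, "ham": 0.6}
    likelihood = {"spam": 1e-6, "ham": 4e-7}

    scores = {c: likelihood[c] * prior[c] for c in prior}   # P(d | c) P(c), with P(d) omitted
    print(max(scores, key=scores.get))                      # spam  (4.0e-7 > 2.4e-7)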

(21)

Naive Bayes Classifiers

We represent an instance D based on some attributes:

D = ⟨x_1, x_2, ..., x_n⟩

Task: Classify a new instance D, described by a tuple of attribute values, into one of the classes c_j ∈ C.

c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, ..., x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j) / P(x_1, x_2, ..., x_n)     (Bayes' Rule)
      = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j)                             (we can ignore the denominator)

Slide from Chris Manning

(22)

Naïve Bayes Classifier: Naïve Bayes Assumption

P(c_j)
- Can be estimated from the frequency of classes in the training examples.

P(x_1, x_2, ..., x_n | c_j)
- O(|X|^n · |C|) parameters
- Could only be estimated if a very, very large number of training examples was available.

Naïve Bayes Conditional Independence Assumption:
Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c_j).

Slide from Chris Manning

(23)

The Naïve Bayes Classifier

Example: class Flu, with features X1 = fever, X2 = sinus, X3 = cough, X4 = runny nose, X5 = muscle ache.

Conditional Independence Assumption:
- features are independent of each other given the class:

P(X_1, ..., X_5 | C) = P(X_1 | C) · P(X_2 | C) · ... · P(X_5 | C)

Slide from Chris Manning
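A minimal sketch of this assumption in code (the feature probabilities are made-up placeholders, not values from the slides): the joint likelihood becomes a product of per-feature conditionals.

    # Hypothetical conditionals P(x_i | flu); illustrative numbers only.
    p_given_flu = {"fever": 0.9, "sinus": 0.6, "cough": 0.7, "runnynose": 0.5, "muscle-ache": 0.8}

    def joint_likelihood(observed_features, cond_probs):
        """P(x_1, ..., x_n | c) under the conditional independence assumption."""
        result = 1.0
        for feature in observed_features:
            result *= cond_probs[feature]
        return result

    print(joint_likelihood(["fever", "cough"], p_given_flu))  # 0.9 * 0.7 = 0.63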

(24)

Using Multinomial Naive Bayes Classifiers to Classify Text

Attributes are text positions, values are words.

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i | c_j)
     = argmax_{c_j ∈ C} P(c_j) P(x_1 = "our" | c_j) · ... · P(x_n = "text" | c_j)

Still too many possibilities
Assume that classification is independent of the positions of the words
- Use the same parameters for each position
- Result is the bag of words model (over tokens, not types)

Slide from Chris Manning

(25)

Learning the Model

(Model structure: class node C with feature nodes X1, ..., X6.)

Simplest: maximum likelihood estimate
- simply use the frequencies in the data

P̂(c_j) = N(C = c_j) / N

P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)

Slide from Chris Manning
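A minimal sketch of these maximum likelihood estimates (assuming the training data is given as (feature value, class) pairs for a single feature; no smoothing yet):

    from collections import Counter, defaultdict

    def mle_estimates(examples):
        """examples: list of (x, c) pairs. Returns P(c) and P(x | c) as relative frequencies."""
        class_counts = Counter(c for _, c in examples)
        joint_counts = defaultdict(Counter)
        for x, c in examples:
            joint_counts[c][x] += 1
        n = len(examples)
        prior = {c: class_counts[c] / n for c in class_counts}
        cond = {c: {x: joint_counts[c][x] / class_counts[c] for x in joint_counts[c]}
                for c in class_counts}
        return prior, cond

    prior, cond = mle_estimates([("fever", "flu"), ("cough", "flu"), ("cough", "no-flu")])
    print(prior["flu"], cond["flu"]["fever"])  # 0.666... 0.5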

(26)

Problem with Maximum Likelihood

P(X_1, ..., X_5 | C) = P(X_1 | C) · P(X_2 | C) · ... · P(X_5 | C)

What if we have seen no training cases where the patient had no flu and muscle aches?

P̂(X_5 = t | C = nf) = N(X_5 = t, C = nf) / N(C = nf) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

argmax_c P̂(c) ∏_i P̂(x_i | c)

Slide from Chris Manning

(27)

Smoothing to Avoid Overfitting

Laplace:

P̂(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + k )

where k is the number of values of X_i.

Bayesian unigram prior:

P̂(x_i,k | c_j) = ( N(X_i = x_i,k, C = c_j) + m · p_i,k ) / ( N(C = c_j) + m )

where p_i,k is the overall fraction of the data in which X_i = x_i,k, and m controls the extent of smoothing.

Slide from Chris Manning
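A minimal sketch of the add-one (Laplace) estimate, assuming the raw counts have already been collected; the names are illustrative:

    def laplace_estimate(count_x_and_c, count_c, k):
        """P(x | c) with add-one smoothing; k = number of possible values of the feature."""
        return (count_x_and_c + 1) / (count_c + k)

    # An unseen feature value no longer receives probability zero:
    print(laplace_estimate(0, 10, 6))   # 1/16 = 0.0625
    print(laplace_estimate(4, 10, 6))   # 5/16 = 0.3125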

(28)

Naïve Bayes: Learning

From the training corpus, extract Vocabulary
Calculate the required P(c_j) and P(w_k | c_j) terms
- For each c_j in C do
    docs_j ← subset of documents for which the target class is c_j
    P(c_j) ← |docs_j| / (total # of documents)
    Text_j ← single document containing all docs_j
    n_j ← total number of word occurrences in Text_j
    for each word w_k in Vocabulary
      n_kj ← number of occurrences of w_k in Text_j
      P(w_k | c_j) ← (n_kj + 1) / (n_j + |Vocabulary|)

Slide from Chris Manning

(29)

Naïve Bayes: Classifying

positions ← all word positions in the current document that contain tokens found in Vocabulary

Return c_NB, where

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(w_i | c_j)

Slide from Chris Manning

(30)

Underflow Prevention: Log Space

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

The class with the highest final un-normalized log probability score is still the most probable:

c_NB = argmax_{c_j ∈ C} [ log P(c_j) + ∑_{i ∈ positions} log P(x_i | c_j) ]

Note that the model is now just a max of a sum of weights.

Slide from Chris Manning
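A minimal log-space scoring sketch (assuming prior and condprob dictionaries hold already-estimated probabilities, as in the pseudocode later in the deck):

    import math

    def log_score(tokens, c, prior, condprob):
        """Un-normalized log P(c) plus the sum of log P(token | c) over the document."""
        score = math.log(prior[c])
        for t in tokens:
            score += math.log(condprob[c][t])
        return score

    # Classification is then the max over classes of the log-space score, e.g.
    # best = max(prior, key=lambda c: log_score(tokens, c, prior, condprob))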

(31)

Naïve Bayes Generative Model for Text

(Figure: one bag of words per class. The spam bag holds words such as nude, deal, Nigeria, hot, $, Viagra, lottery, !, !!, win; the ham bag holds words such as Friday, exam, computer, May, PM, test, March, science, homework, score. A sample generated spam message: "Viagra hot deal!!")

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

Choose a class c according to P(c).
Then choose a word from that class with probability P(x | c).
Essentially, the probability of each class is modeled as a class-specific unigram language model.

Slide from Ray Mooney

(32)

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of features
- URL, email address, dictionary

But if:
- We use only word features
- We use all of the words in the text (not a subset)

Then
- Naïve Bayes bears similarity to language modeling

(33)

Each Class = a Unigram Language Model

Assign to each word: P(word | c)
Assign to each sentence: P(c | s) ∝ P(c) ∏_i P(w_i | c)

 w      P(w | c)
 I      0.1
 love   0.1
 this   0.01
 fun    0.05
 film   0.1

 I     love   this   fun    film
 0.1   0.1    0.01   0.05   0.1

P(s | c) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

(34)

Naïve Bayes Language Model

Two classes: in-language, out-of-language

 w      P(w | in)   P(w | out)
 I      0.1         0.2
 love   0.1         0.001
 this   0.01        0.01
 fun    0.05        0.005
 film   0.1         0.1

 I love this fun film
 in:   0.1    0.1     0.01   0.05    0.1
 out:  0.2    0.001   0.01   0.005   0.1

P(s | in) = 5 × 10⁻⁷ and P(s | out) = 1 × 10⁻⁹, so P(s | in) > P(s | out) and the sentence is classified as in-language.
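The comparison can be reproduced with a few lines of Python using the per-word probabilities from the two tables above (an illustrative check, not part of the slides):

    probs = {
        "in":  {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1},
        "out": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
    }
    sentence = ["I", "love", "this", "fun", "film"]

    def sentence_likelihood(words, table):
        """P(s | c) as the product of per-word unigram probabilities."""
        p = 1.0
        for w in words:
            p *= table[w]
        return p

    for c in ("in", "out"):
        print(c, sentence_likelihood(sentence, probs[c]))
    # in: ~5e-07, out: ~1e-09, so the sentence is assigned to the "in" class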

(35)

Naïve Bayes Classification

(Figure: the same spam and ham word bags as before. Test message: "Win lotttery $ !". Which class, spam or ham, is more likely to have generated it?)

Slide from Ray Mooney

(36)

Naïve Bayes Text Classification Example

 Set    Doc   Words                                   Class
 Train  1     Chinese Beijing Chinese                 c
 Train  2     Chinese Chinese Shanghai                c
 Train  3     Chinese Macao                           c
 Train  4     Tokyo Japan Chinese                     ~c
 Test   5     Chinese Chinese Chinese Tokyo Japan     ?

Estimates:
 P̂(c) = N_c / N
 P̂(w | c) = ( N(w, c) + 1 ) / ( N(c) + |V| )

Training:
- Vocabulary V = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan} and |V| = 6.
- P(c) = 3/4 and P(~c) = 1/4.
- P(Chinese | c) = (5+1) / (8+6) = 6/14 = 3/7
- P(Tokyo | c) = P(Japan | c) = (0+1) / (8+6) = 1/14
- P(Chinese | ~c) = (1+1) / (3+6) = 2/9
- P(Tokyo | ~c) = P(Japan | ~c) = (1+1) / (3+6) = 2/9

Testing:
- P(c | d) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
- P(~c | d) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
- So the test document is assigned to class c.

Slide from Chien Chin Chen
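The two unnormalized scores can be checked with a couple of lines of Python (an illustrative check, not part of the slide):

    p_c    = 3/4 * (3/7)**3 * (1/14) * (1/14)   # ~0.0003
    p_notc = 1/4 * (2/9)**3 * (2/9)  * (2/9)    # ~0.0001
    print(round(p_c, 4), round(p_notc, 4))      # 0.0003 0.0001 -> choose class c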

(37)

Naïve Bayes Text Classification

Naïve Bayes algorithm, training phase:

Slide from Chien Chin Chen

TrainMultinomialNB(C, D)
  V ← ExtractVocabulary(D)
  N ← CountDocs(D)
  for each c in C
    N_c ← CountDocsInClass(D, c)
    prior[c] ← N_c / N
    text_c ← TextOfAllDocsInClass(D, c)
    for each t in V
      F_tc ← CountOccurrencesOfTerm(t, text_c)
    for each t in V
      condprob[t][c] ← (F_tc + 1) / ∑_t' (F_t'c + 1)
  return V, prior, condprob
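A runnable Python sketch of this training phase (an illustration under the assumption that the training set is given as a list of (token list, class) pairs; it is not the slides' reference implementation, and condprob is indexed by class first):

    from collections import Counter

    def train_multinomial_nb(docs):
        """docs: list of (tokens, class_label) pairs. Returns (vocabulary, prior, condprob)."""
        vocab = {t for tokens, _ in docs for t in tokens}
        classes = {c for _, c in docs}
        n = len(docs)
        prior, condprob = {}, {}
        for c in classes:
            class_docs = [tokens for tokens, label in docs if label == c]
            prior[c] = len(class_docs) / n
            counts = Counter(t for tokens in class_docs for t in tokens)
            total = sum(counts.values())
            # add-one smoothing over the vocabulary, as in (F_tc + 1) / sum(F_t'c + 1)
            condprob[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
        return vocab, prior, condprob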

(38)

Naïve Bayes Text Classification

Naïve Bayes algorithm, testing phase:

Slide from Chien Chin Chen

ApplyMultinomialNB(C, V, prior, condprob, d)
  W ← ExtractTokensFromDoc(V, d)
  for each c in C
    score[c] ← log prior[c]
    for each t in W
      score[c] += log condprob[t][c]
  return argmax_c score[c]
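A matching sketch of the testing phase, reusing the train_multinomial_nb sketch above and applied to the documents from the worked example a few slides back:

    import math

    def apply_multinomial_nb(vocab, prior, condprob, tokens):
        """Score each class in log space and return the highest-scoring one."""
        tokens = [t for t in tokens if t in vocab]   # drop out-of-vocabulary tokens
        scores = {c: math.log(prior[c]) + sum(math.log(condprob[c][t]) for t in tokens)
                  for c in prior}
        return max(scores, key=scores.get)

    train = [
        ("Chinese Beijing Chinese".split(), "c"),
        ("Chinese Chinese Shanghai".split(), "c"),
        ("Chinese Macao".split(), "c"),
        ("Tokyo Japan Chinese".split(), "~c"),
    ]
    vocab, prior, condprob = train_multinomial_nb(train)
    print(apply_multinomial_nb(vocab, prior, condprob,
                               "Chinese Chinese Chinese Tokyo Japan".split()))   # c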

(39)

Evaluating Categorization

Evaluation must be done on test data that are independent of the training data
- usually a disjoint set of instances

Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Adequate if there is one class per document

Results can vary based on sampling error due to different training and test sets.
- Average results over multiple training and test sets (splits of the overall data) for the best results.

Slide from Chris Manning

(40)

Measuring Performance

Precision = (good messages kept) / (all messages kept)
Recall = (good messages kept) / (all good messages)

Trade off precision vs. recall by setting a threshold.
Measure the curve on annotated dev data (or test data).
Choose a threshold where the user is comfortable.

(Figure: precision vs. recall curve for good (non-spam) email, both axes from 0% to 100%.)

Slide from Jason Eisner

(41)

Measuring Performance

(Figure: the same precision vs. recall curve for good (non-spam) email, annotated as follows.)

Low threshold: keep all the good stuff, but a lot of the bad too. OK for spam filtering and legal search.
High threshold: all we keep is good, but we don't keep much. OK for search engines (maybe).
The point where precision = recall is often reported; ideally we would prefer to be in the upper-right corner (high precision and high recall).

Slide from Jason Eisner

(42)

The 2-by-2 Contingency Table

               Correct           Incorrect
 Selected      True Positive     False Positive
 Not selected  False Negative    True Negative

(43)

Precision and Recall

Precision: % of selected items that are correct
Recall: % of correct items that are selected

(44)

A Combined Measure: F

The F measure assesses the precision/recall tradeoff through the weighted harmonic mean:

F = 1 / ( α (1/P) + (1 − α) (1/R) ) = (β² + 1) P R / (β² P + R)

With β = 1 (balanced weighting), this reduces to the familiar F1 = 2 P R / (P + R).
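A minimal sketch computing precision, recall and the balanced F1 from the contingency counts of the surrounding slides:

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall and balanced F1 from true positive, false positive and false negative counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    print(precision_recall_f1(tp=40, fp=10, fn=20))   # (0.8, 0.666..., 0.727...)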

(45)

Multiclass Classification

Dealing with > 2 classes

For each class c
- Build a binary classifier Y_c to distinguish c from ~c

Given a test document d
- Evaluate membership in each class using Y_c
- Assign d to each class c for which Y_c returns true

(46)

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity?

Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
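A minimal sketch of the difference, using illustrative per-class counts (not from the slides):

    def precision(tp, fp):
        return tp / (tp + fp)

    # Hypothetical per-class (true positive, false positive) counts.
    per_class = {"sports": (90, 10), "news": (10, 10)}

    # Macroaveraging: compute precision per class, then average.
    macro = sum(precision(tp, fp) for tp, fp in per_class.values()) / len(per_class)

    # Microaveraging: pool the counts into one contingency table, then compute precision once.
    tp_total = sum(tp for tp, _ in per_class.values())
    fp_total = sum(fp for _, fp in per_class.values())
    micro = precision(tp_total, fp_total)

    print(macro, micro)   # 0.7 vs ~0.83: microaveraging is dominated by the large class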

(47)

More Complicated Cases of Measuring Performance

For multiclass classifiers:
- Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
- Better, estimate the cost of different kinds of errors
  e.g., how bad is each of the following?
  – putting Sports articles in the News section
  – putting Fashion articles in the News section
  – putting News articles in the Fashion section
- Then tune the system to minimize total cost

For ranking systems (which articles are most Sports-like? which articles / webpages are most relevant?):
- Correlate with human rankings?
- Get active feedback from the user?
- Measure the user's wasted time by tracking clicks?

Slide from Jason Eisner

(48)

Evaluation Benchmark: Reuters-21578 Data Set

Most (over)used data set: 21,578 docs (each about 90 types, 200 tokens)
9,603 training and 3,299 test articles (ModApte/Lewis split)
118 categories
- An article can be in more than one category
- Learn 118 binary category distinctions
The average document (with at least one category) has 1.24 classes
Only about 10 of the 118 categories are large

(49)

Training Size

The more the better! (usually)

Results for text classification*
(Figure: test error vs. training size on the newsgroups rec.sport.baseball and rec.sport.hockey.)

*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang. Slide from Nakov/Hearst/Rosario

(50)

Training Size

(Figure: test error vs. training size on the newsgroups alt.atheism and talk.religion.misc.)

*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang. Slide from Nakov/Hearst/Rosario

(51)

Training Size

*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang. Slide from Nakov/Hearst/Rosario

(52)

Training Size

Author identification

From: Authorship Attribution: a Comparison of Three Methods, Matthew Care. Slide from Nakov/Hearst/Rosario

(53)

Violation of NB Assumptions

Conditional independence
"Positional independence"
Examples?

Slide from Chris Manning

(54)

Naïve Bayes is Not So Naïve

Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
- Goal: a financial-services direct mail response prediction model: predict whether the recipient of the mail will actually respond to the advertisement (750,000 records)

Robust to irrelevant features
- Irrelevant features cancel each other out without affecting results; decision trees, by contrast, can suffer heavily from this

Very good in domains with many equally important features
- Decision trees suffer from fragmentation in such cases, especially with little data

A good, dependable baseline for text classification (but not the best)!

Slide from Chris Manning

(55)

Naïve Bayes is Not So Naïve

Optimal if the independence assumptions hold:
- If the assumed independence is correct, then it is the Bayes-optimal classifier for the problem

Very fast:
- Learning requires one pass of counting over the data; testing is linear in the number of attributes and in the document collection size

Low storage requirements

Online learning algorithm:
- Can be trained incrementally, on new examples

(56)

SpamAssassin

Naïve Bayes is widely used in spam filtering
- Paul Graham's A Plan for Spam
  - A mutant with more mutant offspring...
- Naive Bayes-like classifier with weird parameter estimation
- But also many other things: black-hole lists, etc.

Many email topic filters also use NB classifiers

Slide from Chris Manning

(57)

SpamAssassin Tests

Mentions Generic Viagra
Online Pharmacy
No prescription needed
Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
Talks about Oprah with an exclamation!
Phrase: impress ... girl
From: starts with many numbers
Subject contains "Your Family"
Subject is all capitals
HTML has a low ratio of text to image area
One hundred percent guaranteed
Claims you can be removed from the list
'Prestigious Non-Accredited Universities'

http://spamassassin.apache.org/tests_3_3_x.html

(58)

Naïve Bayes: Word Sense Disambiguation

w                an ambiguous word
s_1, ..., s_K    senses for word w
v_1, ..., v_J    words in the context of w
P(s_j)           prior probability of sense s_j
P(v_j | s_k)     probability that word v_j occurs in the context of sense s_k

P(v_j | s_k) = C(v_j, s_k) / C(s_k)

P(s_k) = C(s_k) / C(w)

s = argmax_{s_k} P(s_k) ∏_{v_j} P(v_j | s_k)
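A minimal sketch of this disambiguation rule (the counts are made-up placeholders for an ambiguous word, not data from the slides):

    from math import prod

    # Hypothetical sense-tagged counts for the ambiguous word "bank":
    # C(s) per sense, and C(v, s) for context words.
    sense_counts = {"bank/finance": 80, "bank/river": 20}
    context_counts = {
        "bank/finance": {"money": 30, "water": 1},
        "bank/river":   {"money": 1,  "water": 10},
    }
    total = sum(sense_counts.values())

    def disambiguate(context_words):
        """s = argmax_k P(s_k) * prod_j P(v_j | s_k), with relative-frequency estimates."""
        def score(s):
            prior = sense_counts[s] / total          # P(s_k) = C(s_k) / C(w)
            likelihood = prod(context_counts[s].get(v, 0) / sense_counts[s]   # P(v_j | s_k)
                              for v in context_words)
            return prior * likelihood
        return max(sense_counts, key=score)

    print(disambiguate(["money"]))   # bank/finance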
