Text Classification
The Naïve Bayes algorithm
IP notice: most slides from: Chris Manning, plus some from
William Cohen, Chien Chin Chen, Jason Eisner, David Yarowsky, Dan Jurafsky, P. Nakov, Marti Hearst, Barbara Rosario
Outline
Introduction to Text Classification
Also called “text categorization”
Naïve Bayes text classification
Is this spam?
More Applications of Text Classification
Authorship identification
Age/gender identification
Language identification
Assigning topics such as Yahoo-categories
e.g., "finance," "sports," "news>world>asia>business"
Genre detection
e.g., "editorials," "movie-reviews," "news"
Opinion/sentiment analysis on a person/product
e.g., “like”, “hate”, “neutral”
Labels may be domain-specific
e.g., “contains adult language” vs. “doesn’t”
Text Classification: definition
The classifier:
Input: a document d
Output: a predicted class c from some fixed set of labels c1,...,cK
The learner:
Input: a set of m hand-labeled documents (d1,c1),...,(dm,cm)
Output: a learned classifier f : d → c
Slide from William Cohen
Document Classification
[Figure: a class hierarchy with top-level classes (AI), (Programming), (HCI) and subclasses such as ML, Planning, Semantics, Garb.Coll., Multimedia, GUI. Training data: documents for each class containing words like "learning intelligence algorithm reinforcement network...", "planning temporal reasoning plan language...", "programming semantics language proof...", "garbage collection memory optimization region...". Test data: "planning language proof intelligence"; which class does it belong to?]
Slide from Chris Manning
Classification Methods: Hand-coded rules
Some spam/email filters, etc.
E.g., assign category if document contains a given boolean combination of words
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive
Slide from Chris Manning
Classification Methods: Machine Learning
Supervised Machine Learning
To learn a function from documents (or sentences) to labels
Naive Bayes (simple, common method)
Others
• k-Nearest Neighbors (simple, powerful)
• Support-vector machines (new, more powerful)
• … plus many other methods
No free lunch: requires hand-classified training data
• But data can be built up (and refined) by amateurs
Slide from Chris Manning
Naïve Bayes Intuition
Representing text for classification
Slide from William Cohen
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
• Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
• Maize Mar 48.0, total 48.0 (nil).
• Sorghum nil (nil)
• Oilseed export registrations were:
• Sunflowerseed total 15.0 (7.9)
• Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....
f(d) = c ?
What is the best representation for the document d being classified?
The simplest useful one:
Bag of words representation
Slide from William Cohen
[Same Reuters article as above]
Categories: grain, wheat
Bag of words representation
xxxxxxxxxxxxxxxxxxx GRAIN/OILSEED xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxx grain xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx grains, oilseeds xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx tonnes, xxxxxxxxxxxxxxxxx shipments
xxxxxxxxxxxx total xxxxxxxxx total xxxxxxxx xxxxxxxxxxxxxxxxxxxx:
• Xxxxx wheat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, total xxxxxxxxxxxxxxxx
• Maize xxxxxxxxxxxxxxxxx
• Sorghum xxxxxxxxxx
• Oilseed xxxxxxxxxxxxxxxxxxxxx
• Sunflowerseed xxxxxxxxxxxxxx
• Soybean xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....
Categories: grain, wheat
Slide from William Cohen
Bag of words representation
[Same masked article as above]
Categories: grain, wheat
word        freq
grain(s)    3
oilseed(s)  2
total       3
wheat       1
maize       1
soybean     1
tonnes      1
...         ...
Slide from William Cohen
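To make the representation concrete (an illustration added here, not from the slides): a minimal Python sketch that reduces a document to a bag of words, assuming simple lowercase alphabetic tokenization.

from collections import Counter
import re

def bag_of_words(text):
    # Lowercase, split into alphabetic tokens, and count occurrences
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4"
print(bag_of_words(doc))
# Counter({'bread': 1, 'wheat': 1, 'prev': 1, 'feb': 1, 'march': 1, 'total': 1})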
Formalizing Naïve Bayes
Bayes’ Rule

P(A | B) = P(B | A) P(A) / P(B)

• Allows us to swap the conditioning
• Sometimes easier to estimate one kind of dependence than the other
Conditional Probability
Let A and B be events.
P(B|A) = the probability of event B occurring given event A occurs
Definition: P(B|A) = P(A ∩ B) / P(A)
Deriving Bayes’ Rule

P(B | A) = P(A ∩ B) / P(A)        P(A | B) = P(A ∩ B) / P(B)

P(B | A) P(A) = P(A ∩ B)          P(A | B) P(B) = P(A ∩ B)

P(A | B) P(B) = P(B | A) P(A)

P(A | B) = P(B | A) P(A) / P(B)
Bayes’ Rule Applied to Documents and Classes
Slide from Chris Manning
P(C, D) = P(C | D) P(D) = P(D | C) P(C)

P(C | D) = P(D | C) P(C) / P(D)
The Text Classification Problem
Using a supervised learning method, we want to learn a classifier (or classification function) γ : X → C.
We denote the supervised learning method by Γ: Γ(T) = γ.
The learning method Γ takes the training set T as input and returns the learned classifier γ.
Once we have learned γ, we can apply it to the test set (or test data).
Slide from Chien Chin Chen
Naïve Bayes Text Classification
The Multinomial Naïve Bayes model (NB) is a probabilistic learning method.
In text classification, our goal is to find the “best” class for the document:

c_map = argmax_{c ∈ C} P(c | d)                (the probability of a document d being in class c)
      = argmax_{c ∈ C} P(d | c) P(c) / P(d)    (Bayes’ Rule)
      = argmax_{c ∈ C} P(d | c) P(c)           (we can ignore the denominator)

Slide from Chien Chin Chen
Naive Bayes Classifiers
We represent an instance D based on some attributes: D = ⟨x1, x2, …, xn⟩.
Task: Classify a new instance D, based on its tuple of attribute values, into one of the classes cj ∈ C.

c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)   (Bayes’ Rule)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)                      (we can ignore the denominator)

Slide from Chris Manning
Naïve Bayes Classifier: Naïve Bayes Assumption
P(cj)
Can be estimated from the frequency of classes in the training examples.
P(x1, x2, …, xn | cj)
O(|X|^n · |C|) parameters
Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
Slide from Chris Manning
The Naïve Bayes Classifier
[Figure: class node Flu with feature nodes X1 = fever, X2 = sinus, X3 = cough, X4 = runny nose, X5 = muscle-ache]
Conditional Independence Assumption: features are independent of each other given the class:

P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
Slide from Chris Manning
Using Multinomial Naive Bayes Classifiers to Classify Text
Attributes are text positions, values are words.

c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
     = argmax_{cj ∈ C} P(cj) P(x1 = "our" | cj) ··· P(xn = "text" | cj)

Still too many possibilities
Assume that classification is independent of the positions of the words
Use same parameters for each position
Result is bag of words model (over tokens not types)
Slide from Chris Manning
Learning the Model
Simplest: maximum likelihood estimate
simply use the frequencies in the data

P̂(cj) = N(C = cj) / N

P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

[Figure: Bayes net with class node C and feature nodes X1 … X6]
Slide from Chris Manning
Problem with Max Likelihood
[Figure: the Flu network again, with X1 = fever, X2 = sinus, X3 = cough, X4 = runny nose, X5 = muscle-ache, and
P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)]

What if we have seen no training cases where the patient had no flu and muscle aches?

P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

ĉ = argmax_c P̂(c) ∏_i P̂(xi | c)
Slide from Chris Manning
Smoothing to Avoid Overfitting
Laplace smoothing:

P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k)        k = number of values of Xi

Bayesian Unigram Prior:

P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m·pi,k) / (N(C = cj) + m)

where pi,k is the overall fraction of the data where Xi = xi,k, and m is the extent of “smoothing”.

Slide from Chris Manning
Naïve Bayes: Learning
From training corpus, extract Vocabulary
Calculate required P(cj) and P(wk | cj) terms
For each cj in C do
• docsj ← subset of documents for which the target class is cj
• P(cj) ← |docsj| / |# total documents|
• Textj ← single document containing all docsj
• for each word wk in Vocabulary
    nkj ← number of occurrences of wk in Textj
    nj ← total number of word occurrences in Textj
    P(wk | cj) ← (nkj + 1) / (nj + |Vocabulary|)      (Laplace smoothing)
Slide from Chris Manning
Naïve Bayes: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return cNB, where

cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(wi | cj)
Slide from Chris Manning
Underflow Prevention: log space
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
The class with the highest final un-normalized log probability score is still the most probable.
Note that the model is now just a max of a sum of weights…

cNB = argmax_{cj ∈ C} [ log P(cj) + ∑_{i ∈ positions} log P(xi | cj) ]
Slide from Chris Manning
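A tiny numeric illustration of the underflow problem and the log-space fix (my own example, not from the slide):

import math

probs = [1e-5] * 80          # 80 token probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p
print(product)               # 0.0: (1e-5)**80 underflows double precision (min ~1e-308)

log_score = sum(math.log(p) for p in probs)
print(log_score)             # about -921.0; finite, and the argmax over classes is unchanged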
Naïve Bayes Generative Model for Text
[Figure: two bags of words. The spam bag holds words like nude, deal, Nigeria, hot, $, Viagra, lottery, !, !!, win; the ham bag holds words like Friday, exam, computer, May, PM, test, March, science, homework, score. A category (spam/ham) is drawn, then words are drawn from that category's bag, e.g. "Viagra hot deal!!"]
Slide from Ray Mooney
cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)

Choose a class c according to P(c)
Then choose a word from that class with probability P(x | c)
Essentially models the probability of each class as a class-specific unigram language model
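A short sketch of this generative story in Python; the class priors and word probabilities below are made-up numbers for illustration only.

import random

# Class priors and class-specific unigram language models (hypothetical numbers)
priors = {"spam": 0.5, "ham": 0.5}
word_probs = {
    "spam": {"viagra": 0.3, "hot": 0.2, "deal": 0.2, "!!": 0.3},
    "ham":  {"exam": 0.3, "friday": 0.3, "homework": 0.2, "score": 0.2},
}

def generate(length=4):
    # 1. Choose a class c according to P(c)
    c = random.choices(list(priors), weights=priors.values())[0]
    # 2. Choose each word from that class with probability P(x | c)
    words = random.choices(list(word_probs[c]), weights=word_probs[c].values(), k=length)
    return c, " ".join(words)

print(generate())  # e.g. ('spam', 'viagra hot deal !!')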
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of features
URL, email address, dictionary
But, if:
We use only word features
We use all of the words in the text (not a subset)
Then
Naïve Bayes bears similarity to language modeling
Each class = Unigram language model
Assign to each word: P(word | c)
Assign to each sentence: P(c | s) = P(c) ∏ P(wi | c)

w       P(w | c)
I       0.1
love    0.1
this    0.01
fun     0.05
film    0.1

I     love   this   fun    film
0.1   0.1    0.01   0.05   0.1

P(s | c) = 0.0000005
Naïve Bayes Language Model
Two classes: in language, out language

In Language        Out Language
I      0.1         I      0.2
love   0.1         love   0.001
this   0.01        this   0.01
fun    0.05        fun    0.005
film   0.1         film   0.1

I     love    this   fun     film
0.1   0.1     0.01   0.05    0.1     (in)
0.2   0.001   0.01   0.005   0.1     (out)

P(s | in) > P(s | out)
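Plugging the two tables into a few lines of Python confirms the comparison:

# Per-word probabilities from the two unigram models above
in_lang  = {"i": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
out_lang = {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "i love this fun film".split()

p_in, p_out = 1.0, 1.0
for w in sentence:
    p_in *= in_lang[w]
    p_out *= out_lang[w]

print(p_in, p_out)   # 5e-07 vs 1e-09
print(p_in > p_out)  # True: the sentence is "in language"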
Naïve Bayes Classification
[Figure: the same spam/ham bags of words; a new message "Win lotttery $ !" arrives, and we must infer which category (??) generated it]
Slide from Ray Mooney
Naïve Bayes Text Classification Example

Set    Doc   Words                                  Class
Train  1     Chinese Beijing Chinese                c
       2     Chinese Chinese Shanghai               c
       3     Chinese Macao                          c
       4     Tokyo Japan Chinese                    ~c
Test   5     Chinese Chinese Chinese Tokyo Japan    ?

Estimates:
P̂(c) = Nc / N
P̂(w | c) = (N(w, c) + 1) / (N(c) + |V|)

Training:
Vocabulary V = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan} and |V| = 6.
P(c) = 3/4 and P(~c) = 1/4.
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|~c) = (1+1) / (3+6) = 2/9
P(Tokyo|~c) = P(Japan|~c) = (1+1) / (3+6) = 2/9

Testing:
P(c|d) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(~c|d) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
So we classify d as c.

Slide from Chien Chin Chen
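A quick check of the arithmetic (a verification script added here, not part of the slide):

# Laplace-smoothed estimates from the training table above
p_c, p_not_c = 3/4, 1/4
p_chinese_c = (5 + 1) / (8 + 6)             # 3/7
p_tokyo_c = p_japan_c = (0 + 1) / (8 + 6)   # 1/14
p_chinese_nc = (1 + 1) / (3 + 6)            # 2/9
p_tokyo_nc = p_japan_nc = (1 + 1) / (3 + 6) # 2/9

# Test document: Chinese Chinese Chinese Tokyo Japan
score_c = p_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_nc = p_not_c * p_chinese_nc**3 * p_tokyo_nc * p_japan_nc
print(score_c, score_nc)  # ≈ 0.0003 vs ≈ 0.0001, so choose class c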
Naïve Bayes Text Classification
Naïve Bayes algorithm – training phase.
Slide from Chien Chin Chen

TrainMultinomialNB(C, D)
  V ← ExtractVocabulary(D)
  N ← CountDocs(D)
  for each c in C
    Nc ← CountDocsInClass(D, c)
    prior[c] ← Nc / N
    textc ← TextOfAllDocsInClass(D, c)
    for each t in V
      Ftc ← CountOccurrencesOfTerm(t, textc)
    for each t in V
      condprob[t][c] ← (Ftc + 1) / ∑t′ (Ft′c + 1)
  return V, prior, condprob
Naïve Bayes Text Classification
Naïve Bayes algorithm – testing phase.
Slide from Chien Chin Chen

ApplyMultinomialNB(C, V, prior, condprob, d)
  W ← ExtractTokensFromDoc(V, d)
  for each c in C
    score[c] ← log prior[c]
    for each t in W
      score[c] += log condprob[t][c]
  return argmax_c score[c]
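For readers who want to run this, here is a compact Python rendering of the two procedures above (a sketch: the function names mirror the pseudocode, and whitespace tokenization is an assumption I've added):

from collections import Counter, defaultdict
from math import log

def train_multinomial_nb(docs):
    # docs: list of (text, class) pairs; whitespace tokenization is assumed
    vocab = {t for text, _ in docs for t in text.split()}
    n = len(docs)
    prior, condprob = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_texts = [text for text, label in docs if label == c]
        prior[c] = len(class_texts) / n
        counts = Counter(t for text in class_texts for t in text.split())
        total = sum(counts.values())
        for t in vocab:
            # Laplace smoothing: add one to every count
            condprob[t][c] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, doc):
    # Sum logs instead of multiplying probabilities (underflow prevention)
    tokens = [t for t in doc.split() if t in vocab]
    scores = {c: log(p) + sum(log(condprob[t][c]) for t in tokens)
              for c, p in prior.items()}
    return max(scores, key=scores.get)

docs = [("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
        ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "not-c")]
V, prior, condprob = train_multinomial_nb(docs)
print(apply_multinomial_nb(V, prior, condprob, "Chinese Chinese Chinese Tokyo Japan"))  # -> c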
Evaluating Categorization
Evaluation must be done on test data that are independent of the training data
usually a disjoint set of instances
Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
Adequate if one class per document
Results can vary based on sampling error due to different training and test sets.
Average results over multiple training and test sets (splits of the overall data) for the best results.
Slide from Chris Manning
Measuring Performance
Precision = good messages kept / all messages kept
Recall = good messages kept / all good messages
Trade off precision vs. recall by setting a threshold
Measure the curve on annotated dev data (or test data)
Choose a threshold where the user is comfortable

[Figure: precision vs. recall of good (non-spam) email; precision on the y-axis, recall on the x-axis, each from 0% to 100%]
Slide from Jason Eisner
Measuring Performance
Slide from Jason Eisner

[Figure: the same precision-recall curve, annotated. Low threshold: keep all the good stuff, but a lot of the bad too; OK for search engines (maybe). High threshold: all we keep is good, but we don't keep much; OK for spam filtering and legal search. We would prefer the top-right corner. The point where precision = recall is often reported.]
The 2-by-2 contingency table

               Correct           Incorrect
Selected       True Positive     False Positive
Not selected   False Negative    True Negative
Precision and Recall
Precision: % of selected items that are correct
Recall: % of correct items that are selected

A Combined Measure: F
The F measure assesses the P/R tradeoff through the weighted harmonic mean:

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),  with β² = (1 − α)/α

The balanced F1 measure (β = 1): F1 = 2PR / (P + R)
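As a minimal sketch in Python over the 2-by-2 contingency table (β = 1 gives the common balanced F1; the counts in the usage line are invented):

def precision_recall_f(tp, fp, fn, beta=1.0):
    # Compute precision, recall, and F_beta from contingency-table counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=40, fp=10, fn=20))  # (0.8, ~0.667, F1 ~0.727)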
Multiclass Classification
Dealing with > 2 classes
For each class c
Build a binary classifier Yc to distinguish c from ~c
Given a test document d
Evaluate membership in each class using Yc
Assign d to each class c for which Yc returns true
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average
Microaveraging: collect decisions for all classes, compute one contingency table, evaluate
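The difference is easiest to see in code; the per-class (tp, fp) counts below are toy numbers chosen to pull the two averages apart.

def micro_macro_precision(per_class):
    # per_class: list of (tp, fp) pairs, one per class
    # Macroaveraging: average the per-class precisions (all classes weigh equally)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    # Microaveraging: pool the counts into one contingency table, then compute
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro

# A large easy class and a small hard class: macro 0.5, micro ~0.827
print(micro_macro_precision([(90, 10), (1, 9)]))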
More Complicated Cases of Measuring Performance
For multiclass classifiers:
Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc. (which articles are most Sports-like?)
Better, estimate the cost of different kinds of errors
• e.g., how bad is each of the following?
– putting Sports articles in the News section
– putting Fashion articles in the News section
– putting News articles in the Fashion section
• Now tune the system to minimize total cost
For ranking systems (which articles / webpages are most relevant?):
Correlate with human rankings?
Get active feedback from the user?
Measure the user's wasted time by tracking clicks?
Slide from Jason Eisner
Evaluation Benchmark: Reuters-21578 Data Set
The most (over)used data set: 21,578 docs (each 90 types, 200 tokens)
9,603 training and 3,299 test articles (ModApte/Lewis split)
118 categories
An article can be in more than one category
Learn 118 binary category distinctions
Average document (with at least one category) has 1.24 classes
Only about 10 of the 118 categories are large
Training size
The more the better! (usually)
[Figure: test error vs. training size on the newsgroups rec.sport.baseball and rec.sport.hockey, and on alt.atheism and talk.religion.misc. From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang.]
Slide from Nakov/Hearst/Rosario

Training size: author identification
[Figure: results from Authorship Attribution: a Comparison of Three Methods, Matthew Care.]
Slide from Nakov/Hearst/Rosario
Violation of NB Assumptions
Conditional independence
“Positional independence”
Examples?
Slide from Chris Manning
Naïve Bayes is Not So Naïve
Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
Goal: financial services industry direct mail response prediction model: predict if the recipient of mail will actually respond to the advertisement. 750,000 records.
Robust to irrelevant features
Irrelevant features cancel each other without affecting results; decision trees, by contrast, can suffer heavily from this.
Very good in domains with many equally important features
Decision trees suffer from fragmentation in such cases, especially if there is little data
A good dependable baseline for text classification (but not the best)!
Slide from Chris Manning
Naïve Bayes is Not So Naïve
Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
Very fast: learning requires one pass of counting over the data; testing is linear in the number of attributes and the document collection size
Low storage requirements
Online learning algorithm: can be trained incrementally, on new examples
SpamAssassin
Naïve Bayes is widely used in spam filtering
Paul Graham’s A Plan for Spam
• A mutant with more mutant offspring...
A Naive Bayes-like classifier with weird parameter estimation
But also many other things: black hole lists, etc.
Many email topic filters also use NB classifiers
Slide from Chris Manning
SpamAssassin Tests
Mentions Generic Viagra
Online Pharmacy
No prescription needed
Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
Talks about Oprah with an exclamation!
Phrase: impress ... girl
From: starts with many numbers
Subject contains "Your Family"
Subject is all capitals
HTML has a low ratio of text to image area
One hundred percent guaranteed
Claims you can be removed from the list
'Prestigious Non-Accredited Universities'
http://spamassassin.apache.org/tests_3_3_x.html
Naïve Bayes: Word Sense Disambiguation
w: an ambiguous word
s1, …, sK: senses for word w
v1, …, vJ: words in the context of w
P(sj): prior probability of sense sj
P(vj | sk): probability that word vj occurs in the context of sense sk

P(vj | sk) = C(vj, sk) / C(sk)
P(sk) = C(sk) / C(w)
s = argmax_{sk} P(sk) ∏_{vj} P(vj | sk)
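To close, the same machinery as a Python sketch, with toy co-occurrence counts I've invented for the ambiguous word "bass"; the add-one smoothing and log-space scoring follow the earlier slides.

from math import log

# Hypothetical counts: C(sense) and C(context word, sense) for "bass"
sense_counts = {"fish": 60, "music": 40}
cooc = {"fish":  {"river": 30, "caught": 20, "guitar": 1},
        "music": {"river": 1,  "caught": 1,  "guitar": 25}}

def disambiguate(context):
    total = sum(sense_counts.values())         # C(w): total occurrences of the word
    vocab = {v for d in cooc.values() for v in d}
    best, best_score = None, float("-inf")
    for s, c_s in sense_counts.items():
        # score = log P(sk) + sum_j log P(vj | sk), with add-one smoothing
        score = log(c_s / total)
        for v in context:
            score += log((cooc[s].get(v, 0) + 1) / (c_s + len(vocab)))
        if score > best_score:
            best, best_score = s, score
    return best

print(disambiguate(["caught", "river"]))  # -> fish
print(disambiguate(["guitar"]))           # -> music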