CSA3180: Natural Language Processing
Statistics 2 – Probability and Classification II
• Experiments/Outcomes/Events
• Independence/Dependence
• Bayes’ Rule
• Conditional Probabilities/Chain Rule
• Classification II
November 2005
Introduction
• Slides based on Lectures by Mike Rosner
(2003) and material by Mary Dalrymple,
Kings College, London
Experiments, Basic Outcomes, Sample Space
• Probability theory is founded upon the notion of
an experiment.
• An experiment is a situation which can have
one or more different basic outcomes.
• Example: if we throw a die, there are six
possible basic outcomes.
• The sample space Ω is the set of all possible basic
outcomes. For example,
– If we toss a coin, Ω = {H,T}
– If we toss a coin twice, Ω = {HT,TH,TT,HH}
– if we throw a die, Ω = {1,2,3,4,5,6}
Events
• An event A ⊆ Ω is a set of basic outcomes, e.g.
– tossing two heads {HH}
– throwing a 6, {6}
– getting either a 2 or a 4, {2,4}.
• Ω itself is the certain event, whilst { } is
the impossible event.
• Event Space ≠ Sample Space: the event space is the set of all possible events, i.e. all subsets of Ω
Probability Distribution
• A probability distribution of an experiment is a
function that assigns a number (or probability)
between 0 and 1 to each basic outcome such that
the sum of all the probabilities = 1.
• Probability distribution functions (PDFs)
• The probability p(E) of an event E is the sum of
the probabilities of all the basic outcomes in E.
• Uniform distribution is when each basic outcome
is equally likely.
Probability of an Event
• Sample space for a die throw = set of
basic outcomes = {1,2,3,4,5,6}
• If the die is not loaded, distribution is
uniform.
• Thus each basic outcome, e.g. {6} (throwing a
six), is assigned the same probability = 1/6.
• So p({3,6}) = p({3}) + p({6}) = 2/6 = 1/3
Probability Estimates
• Repeat experiment T times and count
frequency of E.
• Estimated p(E) = count(E)/T
• This can be done over m runs, yielding
estimates p1(E),...pm(E).
• Best estimate is (possibly weighted)
average of individual pi(E)
3 Times Coin Toss
• Ω = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
• Cases with exactly 2 tails = {HTT, THT,TTH}
• Each experiment (run) i = 1000 cases (3000 tosses).
– c1(E)= 386, p1(E) = .386
– c2(E)= 375, p2(E) = .375
– pmean(E)= (.386+.375)/2 = .381
• Uniform distribution is when each basic
outcome is equally likely.
• Assuming uniform distribution, p(E) = 3/8 = .375
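These frequency estimates are easy to reproduce. Below is a minimal Python sketch (not part of the original slides) that repeats the three-coin-toss experiment in runs of 1000 cases and compares the estimates with the uniform-distribution value 3/8:

```python
import random

def estimate_two_tails(n_cases=1000, seed=0):
    """Estimate p(exactly two tails) from n_cases three-coin experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_cases):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:      # event E = exactly two tails
            hits += 1
    return hits / n_cases

# Two runs of 1000 cases each, then their mean, as on the slide.
p1 = estimate_two_tails(seed=1)
p2 = estimate_two_tails(seed=2)
print(p1, p2, (p1 + p2) / 2)            # each close to 3/8 = .375
```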
Word Probability
• General Problem:
What is the probability of the next
word/character/phoneme in a sequence, given
the first N words/characters/phonemes.
• To approach this problem we study an
experiment whose sample space is the set of
possible words.
• The same approach could be used to study the
probability of the next character or phoneme.
Word Probability
• I would like to make a phone _____.
• Look it up in the phone ________,
quick!
• The phone ________ you requested
is…
• Context can have decisive effect on word
probability
Word Probability
• Approximation 1: all words are equally
probable
• Then probability of each word = 1/N where
N is the number of word types.
• But all words are not equally probable
• Approximation 2: probability of each
word is the same as its frequency of
occurrence in a corpus.
Word Probability
• Estimate p(w) - the probability of word w:
• Given corpus C
p(w) ≈ count(w)/size(C)
• Example
– Brown corpus: 1,000,000 tokens
– the: 69,971 tokens
– Probability of the: 69,971/1,000,000 ≈ .07
– rabbit: 11 tokens
– Probability of rabbit: 11/1,000,000 ≈ .00001
– Conclusion: next word is most likely to be the
• Is this correct?
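Before answering that, here is a minimal sketch of the relative-frequency estimate itself; a toy corpus stands in for the Brown corpus, so the counts are purely illustrative:

```python
from collections import Counter

def unigram_probabilities(tokens):
    """Estimate p(w) ≈ count(w) / size(corpus) for every word type."""
    counts, total = Counter(tokens), len(tokens)
    return {w: c / total for w, c in counts.items()}

# Toy corpus standing in for the Brown corpus.
corpus = "look at the cute rabbit near the phone booth".split()
p = unigram_probabilities(corpus)
print(p["the"])      # 2/9 ≈ .22, the most frequent type here
print(p["rabbit"])   # 1/9 ≈ .11
```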
Word Probability
• Given the context: Look at the cute ...
• is "the" more likely than "rabbit"?
• Context matters in determining what
word comes next.
• What is the probability of the next word in
a sequence, given the first N words?
Independent Events
[Figure: sample space containing two events, A = eggs and B = monday]
Sample Space
(eggs,mon)    (cereal,mon)    (nothing,mon)
(eggs,tue)    (cereal,tue)    (nothing,tue)
(eggs,wed)    (cereal,wed)    (nothing,wed)
(eggs,thu)    (cereal,thu)    (nothing,thu)
(eggs,fri)    (cereal,fri)    (nothing,fri)
(eggs,sat)    (cereal,sat)    (nothing,sat)
(eggs,sun)    (cereal,sun)    (nothing,sun)
Independent Events
• Two events, A and B, are independent if
the fact that A occurs does not affect the
probability of B occurring.
• When two events, A and B, are
independent, the probability of both
occurring p(A,B) is the product of the prior
probabilities of each, i.e.
p(A,B) = p(A) · p(B)
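For illustration, assuming a uniform distribution over the 21 (breakfast, day) outcomes listed above, a short sketch confirms the product rule for A = eggs and B = monday:

```python
from itertools import product

breakfasts = ["eggs", "cereal", "nothing"]
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
omega = set(product(breakfasts, days))        # the 21 basic outcomes

def p(event):
    """Probability of an event under a uniform distribution over omega."""
    return len(event & omega) / len(omega)

A = {o for o in omega if o[0] == "eggs"}      # A: eggs
B = {o for o in omega if o[1] == "mon"}       # B: monday
print(p(A & B), p(A) * p(B))                  # both 1/21 ≈ 0.0476
```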
Dependent Events
• Two events, A and B, are dependent if the
occurrence of one affects the probability of
the occurrence of the other.
Dependent Events
[Figure: overlapping events A and B within the sample space; the overlap is A ∩ B]
Conditional Probability
• The conditional probability of an event A
given that event B has already occurred is
written p(A|B)
• In general p(A|B) ≠ p(B|A)
Dependent Events: p(A|B) ≠ p(B|A)
[Figure: overlapping events A and B within the sample space]
Example Dependencies
• Consider fair die example with
– A = outcome divisible by 2
– B = outcome divisible by 3
– C = outcome divisible by 4
• p(A|B) = p(A ∩ B)/p(B) = (1/6)/(1/3) = ½
• p(A|C) = p(A ∩ C)/p(C) = (1/6)/(1/6) = 1
Conditional Probability
• Intuitively, after B has occurred, event A is replaced by A ∩ B, the
sample space Ω is replaced by B, and probabilities are renormalised
accordingly
• The conditional probability of an event A given that B has occurred
(p(B) > 0) is thus given by p(A|B) = p(A ∩ B)/p(B).
• If A and B are independent,
p(A ∩ B) = p(A) · p(B) so
p(A|B) = p(A) · p(B)/p(B) = p(A)
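A quick numerical check of this definition on the fair-die events A, B, C from the previous slide (a sketch, assuming a uniform distribution over {1, ..., 6}):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}              # fair die, uniform distribution

def p(event):
    return Fraction(len(event & omega), len(omega))

def p_given(a, b):
    """p(A|B) = p(A ∩ B) / p(B), for p(B) > 0."""
    return p(a & b) / p(b)

A = {2, 4, 6}                           # outcome divisible by 2
B = {3, 6}                              # outcome divisible by 3
C = {4}                                 # outcome divisible by 4
print(p_given(A, B))                    # 1/2, as on the earlier slide
print(p_given(A, C))                    # 1
```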
Bayesian Inversion
• For both A and B to occur, either A must occur first, then B, or vice
versa. We get the following possibilities:
p(A|B) = p(A ∩ B)/p(B)
p(B|A) = p(A ∩ B)/p(A)
• Hence p(A|B) p(B) = p(B|A) p(A)
• We can thus express p(A|B) in terms of p(B|A)
• p(A|B) = p(B|A) p(A)/p(B)
• This equivalence, known as Bayes’ Theorem, is
useful when one or other quantity is difficult to
determine
Bayes’ Theorem
• p(B|A) = p(B ∩ A)/p(A) = p(A|B) p(B)/p(A)
• The denominator p(A) can be ignored if we
are only interested in which event out of
some set is most likely.
• Typically we are interested in the value of B that maximises the
probability of an observation A, i.e.
• arg max_B p(A|B) p(B)/p(A) =
arg max_B p(A|B) p(B)
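A small sketch of this argmax; the priors and likelihoods below are made-up numbers, intended only to show that dropping the constant denominator p(A) does not change which B wins:

```python
# Hypothetical prior p(B) and likelihood p(A|B) values for one observation A.
prior = {"B1": 0.7, "B2": 0.3}
likelihood = {"B1": 0.01, "B2": 0.05}

p_A = sum(prior[b] * likelihood[b] for b in prior)        # p(A), constant in B

with_denom = max(prior, key=lambda b: likelihood[b] * prior[b] / p_A)
without_denom = max(prior, key=lambda b: likelihood[b] * prior[b])
print(with_denom, without_denom)        # same winner either way: B2
```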
Chain Rule
• We can extend the definition of conditional probability to more than two
events
• p(A1 ∩ ... ∩ An) =
p(A1) · p(A2|A1) · p(A3|A1 ∩ A2) · ... · p(An|A1 ∩ ... ∩ An-1)
• The chain rule allows us to talk about the
probability of sequences of events
p(A1,...,An)
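A sketch of the chain rule applied to a word sequence; the conditional probabilities below are invented for illustration (the phrase echoes the earlier "make a phone ..." example):

```python
# Invented conditional probabilities p(w_i | w_1 ... w_{i-1}) for the
# sequence "make a phone call".
conditionals = [
    ("make", 0.001),      # p(make)
    ("a", 0.3),           # p(a | make)
    ("phone", 0.05),      # p(phone | make a)
    ("call", 0.6),        # p(call | make a phone)
]

p_sequence = 1.0
for word, p_cond in conditionals:
    p_sequence *= p_cond  # chain rule: multiply the conditional probabilities
print(p_sequence)         # p(make, a, phone, call) = 9e-06
```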
Classification II
• Linear algorithms in Classification I
• Non-linear algorithms
• Kernel methods
• Multi-class classification
• Decision trees
• Naïve Bayes
Non-Linear Problems
[Figures: examples of non-linear classification problems]
Non-Linear Problems
• Kernel methods
• A family of non-linear algorithms
• Transform the non-linear problem into a linear
one (in a different feature space)
• Use linear algorithms to solve the linear
problem in the new space
Kernel Methods
• Linear separability: more likely in high
dimensions
• Mapping: Φ maps the input into a high-dimensional feature space
• Classifier: construct linear classifier in
high-dimensional feature space
• Motivation: an appropriate choice of Φ leads
to linear separability
• We can do this efficiently!
Kernel Methods
Φ : R^d → R^D   (D >> d)
Separating hyperplane in the feature space: w^T Φ(X) + b = 0
Example: X = [x z], Φ(X) = [x² z² xz]
f(X) = sign(w1x² + w2z² + w3xz + b)
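A minimal sketch of this explicit map Φ(X) = [x² z² xz], with weights chosen by hand for illustration: a boundary that is a circle in the original two dimensions becomes a plane w·Φ(X) + b = 0 in the three-dimensional feature space.

```python
def phi(x, z):
    """Explicit feature map R^2 -> R^3: (x, z) -> (x^2, z^2, xz)."""
    return (x * x, z * z, x * z)

def linear_sign(features, w, b):
    """sign(w . features + b) in the feature space."""
    return 1 if sum(wi * fi for wi, fi in zip(w, features)) + b > 0 else -1

# Hand-picked weights: w . phi(X) + b = x^2 + z^2 - 1, a plane in the new
# space but a circle of radius 1 back in the original (x, z) space.
w, b = (1.0, 1.0, 0.0), -1.0

for x, z in [(0.1, 0.2), (0.5, -0.5), (2.0, 0.0), (-1.5, 1.5)]:
    print((x, z), linear_sign(phi(x, z), w, b))   # -1 inside, +1 outside
```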
Kernel Methods
• We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the higher
dimensional space
• Kernel methods basically transform any algorithm that depends solely on
dot products between two vectors by replacing the dot product with a
kernel function
• The non-linear kernel algorithm is the linear algorithm operating in the
range space of Φ
• The mapping Φ is never explicitly computed (kernels are used instead)
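The "replace the dot product with a kernel function" idea can be checked directly. A sketch using the standard quadratic map with a √2 factor (a slight variant of the map on the previous slide, chosen so the algebra works out exactly): the kernel k(u, v) = (u·v)² equals the dot product of the mapped vectors, without ever computing Φ.

```python
import math

def phi(v):
    """Quadratic feature map R^2 -> R^3 (with the usual sqrt(2) factor)."""
    x, z = v
    return (x * x, z * z, math.sqrt(2) * x * z)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quadratic_kernel(u, v):
    """k(u, v) = (u . v)^2 -- computed without ever building phi(u), phi(v)."""
    return dot(u, v) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(u), phi(v)))       # 1.0: dot product in the 3-d feature space
print(quadratic_kernel(u, v))    # 1.0: same value, computed in the 2-d space
```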
Multi-class Classification
• Given: some data items that belong to one
of M possible classes
• Task: Train the classifier and predict the
class for a new data item
• Geometrically: a harder problem, the geometry is no longer simple
Multi-class Classification
• Author identification
• Language identification
• Text categorization (topics)
Multi-class Classification
• Linear
– Parallel class separators: Decision Trees
– Non parallel class separators: Naïve Bayes
• Non Linear
– K-nearest neighbors
Linear, parallel class separators (e.g. Decision Trees)
[Figure: classes separated by parallel linear boundaries]
Linear, non-parallel class separators (e.g. Naïve Bayes)
[Figure: classes separated by non-parallel linear boundaries]
Non-Linear separators (e.g. k Nearest Neighbors)
[Figure: classes separated by non-linear boundaries]
Decision Trees
• Decision tree is a classifier in the form of a tree
structure, where each node is either:
– Leaf node - indicates the value of the target attribute
(class) of examples, or
– Decision node - specifies some test to be carried out on a single attribute
value, with one branch and sub-tree for each possible outcome of the test.
• A decision tree can be used to classify an example by starting at the root
of the tree and moving through it until a leaf node is reached; the leaf
provides the classification of the instance.
Goal: learn when we can play Tennis and when we cannot

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cold   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Decision Trees

Outlook?
  Sunny    → Humidity?
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind?
               Strong → No
               Weak   → Yes
Decision Trees
(The same tree as above, annotated:)
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Classifying a new example with the tree above:

Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

Following the tree: Outlook = Sunny → Humidity = High → No
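A sketch of this root-to-leaf walk, with the PlayTennis tree written as nested Python dictionaries (one convenient representation, not the one used by any particular library):

```python
# Internal nodes: {"attribute": ..., "branches": {value: subtree}}; leaves: str.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(node, example):
    """Start at the root and follow the example's attribute values to a leaf."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

example = {"Outlook": "Sunny", "Temperature": "Hot",
           "Humidity": "High", "Wind": "Weak"}
print(classify(tree, example))   # "No", as on the slide
```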
Decision Tree for Reuters
[Figures: decision trees learned for Reuters text categorization]
Building Decision Trees
• Given training data, how do we construct them?
• The central focus of the decision tree growing
algorithm is selecting which attribute to test at
each node in the tree. The goal is to select the
attribute that is most useful for classifying
examples.
• Top-down, greedy search through the space of
possible decision trees.
– That is, it picks the best attribute and never looks
back to reconsider earlier choices.
Building Decision Trees
• Splitting criterion
– Finding the features and the values to split on
• for example, why test first “cts” and not “vs”?
• Why test on “cts < 2” and not “cts < 5” ?
– Split that gives us the maximum information gain (or the maximum
reduction of uncertainty); see the sketch after this slide
• Stopping criterion
– When all the elements at one node have the same class, no need to
split further
• In practice, one first builds a large tree and then one prunes it back
(to avoid overfitting)
• See Foundations of Statistical Natural Language Processing,
Manning and Schuetze for a good introduction
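As referenced above, a sketch of the splitting criterion (standard ID3-style entropy and information gain, which is one way to make "maximum reduction of uncertainty" concrete), shown on a few rows of the PlayTennis table:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i p_i log2 p_i over the class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target="PlayTennis"):
    """Entropy of the labels minus the weighted entropy after the split."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# A few rows of the PlayTennis table (D1-D4); in practice use the full table.
rows = [
    {"Outlook": "Sunny",    "Humidity": "High", "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Humidity": "High", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Humidity": "High", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Humidity": "High", "Wind": "Weak",   "PlayTennis": "Yes"},
]
print(information_gain(rows, "Outlook"))   # higher gain: test Outlook first
print(information_gain(rows, "Wind"))
```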
Decision Trees: Strengths
• Decision trees are able to generate
understandable rules.
• Decision trees perform classification
without requiring much computation.
• Decision trees are able to handle both
continuous and categorical variables.
• Decision trees provide a clear indication of
which features are most important for
prediction or classification.
Decision Trees: Weaknesses
• Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
• Decision trees can be computationally expensive to train.
– Need to compare all possible splits
– Pruning is also expensive
• Most decision-tree algorithms only examine a
single field at a time. This leads to rectangular
classification boxes that may not correspond
well with the actual distribution of records in the
decision space.
Naïve Bayes
• More powerful than Decision Trees
[Figure: decision boundaries, Decision Trees vs. Naïve Bayes]
Naïve Bayes
• Graphical Models: graph theory plus probability theory
• Nodes are variables
• Edges are conditional probabilities
[Figure: node A with edges to B and C; P(A), P(B|A), P(C|A)]
Naïve Bayes
• Graphical Models: graph theory plus probability theory
• Nodes are variables
• Edges are conditional probabilities
• Absence of an edge between nodes implies independence between the
variables of the nodes
[Figure: node A with edges to B and C; P(A), P(B|A), P(C|A) = P(C|A,B),
since there is no edge between B and C]
Naïve Bayes
[Figure: topic node "earn" with word nodes "Shr", "34", "cts", "vs", "per", "shr"]
Naïve Bayes
[Figure: topic node with word nodes w1, w2, w3, w4, ..., wn-1, wn]
• The words depend on the topic: P(wi | Topic)
– P(cts | earn) > P(tennis | earn)
• Naïve Bayes assumption: all words are independent given the topic
• From the training set we learn the probabilities P(wi | Topic) for each
word and for each topic in the training set
Naïve Bayes
[Figure: topic node with word nodes w1, w2, w3, w4, ..., wn-1, wn]
• To classify a new example:
• Calculate P(Topic | w1, w2, … wn) for each topic
• Bayes decision rule:
– Choose the topic T′ for which
P(T′ | w1, w2, … wn) > P(T | w1, w2, … wn) for each T ≠ T′
Naïve Bayes: Math
• Naïve Bayes defines a joint probability distribution:
• P(Topic, w1, w2, … wn) = P(Topic) ∏i P(wi | Topic)
• We learn P(Topic) and P(wi | Topic) in training
• Test: we need P(Topic | w1, w2, … wn)
• P(Topic | w1, w2, … wn) = P(Topic, w1, w2, … wn) /
P(w1, w2, … wn)
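A sketch of this computation at test time, done in log space; the topics, vocabulary, and probability values below are invented for illustration. Because the denominator P(w1, w2, … wn) is the same for every topic, it can be dropped when taking the argmax:

```python
import math

# Invented parameters "learned in training": P(Topic) and P(w | Topic).
prior = {"earn": 0.4, "sport": 0.6}
likelihood = {
    "earn":  {"shr": 0.05,  "cts": 0.04,  "tennis": 0.001},
    "sport": {"shr": 0.001, "cts": 0.002, "tennis": 0.05},
}

def classify(words):
    """argmax over topics of log P(Topic) + sum_i log P(w_i | Topic)."""
    scores = {}
    for topic in prior:
        score = math.log(prior[topic])
        for w in words:
            score += math.log(likelihood[topic][w])
        scores[topic] = score
    return max(scores, key=scores.get)

print(classify(["shr", "cts"]))      # "earn"
print(classify(["tennis", "cts"]))   # "sport"
```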
Naïve Bayes: Strengths
• Very simple model
– Easy to understand
– Very easy to implement
• Very efficient, fast training and classification
• Modest storage requirements
• Widely used because it works really well for text categorization
• Linear, but non-parallel decision boundaries
Naïve Bayes: Weaknesses
• Naïve Bayes independence assumption has two
consequences:
– The linear ordering of words is ignored (bag of words model)
– The words are independent of each other given the class: False
• President is more likely to occur in a context that contains election
than in a context that contains poet
• Naïve Bayes assumption is inappropriate if there are
strong conditional dependencies between the variables
• (But even if the model is not “right”, Naïve Bayes models
do well in a surprisingly large number of cases because
often we are interested in classification accuracy and not
in accurate probability estimations)