Transcript Slides

CMSC 471
Machine Learning: Naïve Bayes, Neural Networks, Clustering
Skim 20.5
The Naïve Bayes Classifier
Some material adapted from slides by Tom Mitchell, CMU.
The Naïve Bayes Classifier

• Recall Bayes rule:

    P(Y_i \mid X_j) = \frac{P(Y_i)\, P(X_j \mid Y_i)}{P(X_j)}

• Which is short for:

    P(Y = y_i \mid X = x_j) = \frac{P(Y = y_i)\, P(X = x_j \mid Y = y_i)}{P(X = x_j)}

• We can re-write this as:

    P(Y = y_i \mid X = x_j) = \frac{P(Y = y_i)\, P(X = x_j \mid Y = y_i)}{\sum_k P(X = x_j \mid Y = y_k)\, P(Y = y_k)}
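As a quick worked instance of the last form, with made-up numbers for a two-class spam example (hypothetical values, not taken from the slides): suppose P(Y = \text{spam}) = 0.3, P(Y = \text{ham}) = 0.7, P(X = x_j \mid \text{spam}) = 0.6, and P(X = x_j \mid \text{ham}) = 0.1. Then

    P(\text{spam} \mid x_j) = \frac{0.3 \times 0.6}{0.3 \times 0.6 + 0.7 \times 0.1} = \frac{0.18}{0.25} = 0.72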
Deriving Naïve Bayes

• Idea: use the training data to directly estimate P(X | Y) and P(Y).
• Then, we can use these values to estimate P(Y | X_new) using Bayes rule.
• Recall that representing the full joint probability P(X_1, X_2, …, X_n | Y) is not practical.
Deriving Naïve Bayes

• However, if we make the assumption that the attributes are independent, estimation is easy!

    P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)

• In other words, we assume all attributes are conditionally independent given Y.
• Often this assumption is violated in practice, but more on that later…
Deriving Naïve Bayes

• Let X = ⟨X_1, …, X_n⟩ and let the label Y be discrete.
• Then, we can estimate P(X_i | Y_i) and P(Y_i) directly from the training data by counting!

    Sky    Temp   Humid    Wind     Water   Forecast   Play?
    sunny  warm   normal   strong   warm    same       yes
    sunny  warm   high     strong   warm    same       yes
    rainy  cold   high     strong   warm    change     no
    sunny  warm   high     strong   cool    change     yes

P(Sky = sunny | Play = yes) = ?
P(Humid = high | Play = yes) = ?
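A minimal counting sketch in Python for the two questions above, using the four training rows from the table (illustrative code, not from the slides):

    # The four training examples from the table.
    data = [
        {"Sky": "sunny", "Temp": "warm", "Humid": "normal", "Wind": "strong",
         "Water": "warm", "Forecast": "same", "Play": "yes"},
        {"Sky": "sunny", "Temp": "warm", "Humid": "high", "Wind": "strong",
         "Water": "warm", "Forecast": "same", "Play": "yes"},
        {"Sky": "rainy", "Temp": "cold", "Humid": "high", "Wind": "strong",
         "Water": "warm", "Forecast": "change", "Play": "no"},
        {"Sky": "sunny", "Temp": "warm", "Humid": "high", "Wind": "strong",
         "Water": "cool", "Forecast": "change", "Play": "yes"},
    ]

    def cond_prob(attr, value, label):
        """Estimate P(attr = value | Play = label) by counting."""
        rows = [d for d in data if d["Play"] == label]
        return sum(d[attr] == value for d in rows) / len(rows)

    print(cond_prob("Sky", "sunny", "yes"))   # 3/3 = 1.0
    print(cond_prob("Humid", "high", "yes"))  # 2/3 ≈ 0.67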
The Naïve Bayes Classifier

• Now we have:

    P(Y = y_j \mid X_1, \ldots, X_n) = \frac{P(Y = y_j)\, \prod_i P(X_i \mid Y = y_j)}{\sum_k P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)}

  which is just a one-level Bayesian network.

[Figure: one-level Bayesian network. A single label node Y (the labels / hypotheses) with prior P(Y_j) points to attribute nodes X_1, …, X_i, …, X_n (the attributes / evidence), each with CPT P(X_i | Y_j).]

• To classify a new point X_new:

    Y_new = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)
The Naïve Bayes Algorithm

• For each value y_k:
  • Estimate P(Y = y_k) from the data.
  • For each value x_ij of each attribute X_i, estimate P(X_i = x_ij | Y = y_k).
• Classify a new point via (see the sketch below):

    Y_new = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)

• In practice, the independence assumption often doesn't hold true, but Naïve Bayes performs very well despite it.
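A minimal sketch of this algorithm in Python, assuming discrete attributes stored as dicts like the table rows above; probabilities are raw counts with no smoothing (all names are illustrative, not from the slides):

    from collections import Counter, defaultdict

    def train_nb(examples, label_key):
        """Estimate P(Y = y_k) and P(X_i = x_ij | Y = y_k) by counting."""
        label_counts = Counter(e[label_key] for e in examples)
        prior = {y: n / len(examples) for y, n in label_counts.items()}
        counts = defaultdict(lambda: defaultdict(Counter))   # counts[y][attr][value]
        for e in examples:
            y = e[label_key]
            for attr, value in e.items():
                if attr != label_key:
                    counts[y][attr][value] += 1
        likelihood = {y: {attr: {v: c / label_counts[y] for v, c in vals.items()}
                          for attr, vals in attrs.items()}
                      for y, attrs in counts.items()}
        return prior, likelihood

    def classify_nb(x, prior, likelihood):
        """Return argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y)."""
        def score(y):
            p = prior[y]
            for attr, value in x.items():
                p *= likelihood[y].get(attr, {}).get(value, 0.0)
            return p
        return max(prior, key=score)

With zero counts a product collapses to 0, so in practice one would add Laplace smoothing to the estimates.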
Naïve Bayes Applications

• Text classification
  • Which e-mails are spam?
  • Which e-mails are meeting notices?
  • Which author wrote a document?
• Classifying mental states
  • Learning P(BrainActivity | WordCategory)
  • Pairwise classification (people words vs. animal words): accuracy 85%
Neural Networks
Adapted from slides by Tim Finin and Marie desJardins. Some material adapted from lecture notes by Lise Getoor and Ron Parr.
Neural function

• Brain function (thought) occurs as the result of the firing of neurons
• Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters
• Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds
• Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength
• There are about 10^11 neurons and about 10^14 synapses in the human brain(!)
Biology of a neuron
Brain structure

• Different areas of the brain have different functions
• We don't know how different functions are "assigned" or acquired
  • Some areas seem to have the same function in all humans (e.g., Broca's region for motor speech); the overall layout is generally consistent
  • Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
  • Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  • Partly the result of experience (learning)
• We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought"
• Artificial neural networks are not nearly as complex or intricate as the actual brain structure
Comparison of computing power

INFORMATION CIRCA 1995

                        Computer                          Human Brain
    Computation units   1 CPU, 10^5 gates                 10^11 neurons
    Storage units       10^4 bits RAM, 10^10 bits disk    10^11 neurons, 10^14 synapses
    Cycle time          10^-8 sec                         10^-3 sec
    Bandwidth           10^4 bits/sec                     10^14 bits/sec
    Updates / sec       10^5                              10^14

• Computers are way faster than neurons…
• But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel
• Neural networks are designed to be massively parallel
• The brain is effectively a billion times faster
Neural networks

[Figure: layered feed-forward network with input units at the bottom, hidden units in the middle, and output units at the top]

• Neural networks are made up of nodes or units, connected by links
• Each link has an associated weight and activation level
• Each node has an input function (typically summing over weighted inputs), an activation function, and an output
Model of a neuron

• Neuron modeled as a unit i
• Weight on the input from unit j to unit i is w_ji
• Net input to unit i is:

    in_i = \sum_j w_{ji}\, o_j

• Activation function g() determines the neuron's output
  • g() is typically a sigmoid
  • Output is either 0 or 1 (no partial activation)
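A minimal sketch of a single unit in Python (illustrative, not from the slides); the hard 0/1 output matches the "no partial activation" note, while a purely sigmoid unit would return the continuous value g instead:

    import math

    def unit_output(weights, inputs):
        """One unit i: net input in_i = sum_j w_ji * o_j, then activation g()."""
        net = sum(w_ji * o_j for w_ji, o_j in zip(weights, inputs))
        g = 1.0 / (1.0 + math.exp(-net))   # sigmoid activation
        return 1 if g >= 0.5 else 0        # hard 0/1 output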
“Executing” neural networks

• Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
• Working forward through the network, the input function of each unit is applied to compute the input value
  • Usually this is just the weighted sum of the activation on the links feeding into this node
• The activation function transforms this input function into a final value
  • Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node
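A sketch of this forward pass for one fully connected layer in Python, assuming a weight matrix indexed as weights[i][j] for the link from unit j into unit i and a sigmoid activation (illustrative names, not from the slides):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward_layer(weights, activations):
        """Weighted-sum input function for each unit, then the activation function."""
        return [sigmoid(sum(w * a for w, a in zip(row, activations)))
                for row in weights]

    # Example: 2 inputs -> 3 hidden units -> 1 output unit
    hidden = forward_layer([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [1.0, 0.0])
    output = forward_layer([[0.7, -0.6, 0.2]], hidden)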
Learning rules

• Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
  • Assumes binary-valued inputs/outputs
  • Assumes a single linear threshold unit
Perceptron learning rule

• If the target output for unit i is t_i:

    w_{ji} \leftarrow w_{ji} + \eta\, (t_i - o_i)\, o_j

• Equivalent to the intuitive rules:
  • If the output is correct, don't change the weights
  • If the output is low (o_i = 0, t_i = 1), increment weights for all the inputs which are 1
  • If the output is high (o_i = 1, t_i = 0), decrement weights for all inputs which are 1
  • Must also adjust the threshold. Or, equivalently, assume there is a weight w_{0i} for an extra input unit that has an output of 1.
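A minimal sketch of this update rule in Python, treating the threshold as a bias weight w_0i on a constant extra input of 1 as the slide suggests (eta is the learning rate; names are illustrative, not from the slides):

    def perceptron_update(weights, inputs, target, eta=0.1):
        """Apply w_ji <- w_ji + eta * (t_i - o_i) * o_j for one example.

        weights[0] is the bias weight w_0i; its input is the constant 1.
        """
        x = [1.0] + list(inputs)                        # prepend the extra input
        output = 1 if sum(w * xj for w, xj in zip(weights, x)) > 0 else 0
        return [w + eta * (target - output) * xj for w, xj in zip(weights, x)]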
Perceptron learning algorithm

• Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct (see the sketch below)
  • Initialize the weights to all zero (or random)
  • Until outputs for all training examples are correct:
    • For each training example e, compute the current output o_j, compare it to the target t_j, and update the weights
• Each execution of the outer loop is called an epoch
• For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
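A sketch of the full training loop in Python, reusing the hypothetical perceptron_update above and learning the AND function as a toy example:

    # Toy dataset: logical AND (linearly separable).
    examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

    def predict(weights, inputs):
        x = [1.0] + list(inputs)
        return 1 if sum(w * xj for w, xj in zip(weights, x)) > 0 else 0

    weights = [0.0, 0.0, 0.0]                  # bias + one weight per input
    for epoch in range(100):                   # each outer pass is one epoch
        if all(predict(weights, x) == t for x, t in examples):
            break                              # all outputs correct: converged
        for x, t in examples:
            weights = perceptron_update(weights, x, t)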
Representation limitations of a perceptron

• Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data.
  • i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space

[Figure: separating hyperplane \langle W, X \rangle - \theta = 0; points with \langle W, X \rangle - \theta > 0 lie on one side and those with < 0 on the other]
Perceptron learnability

• Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)
• Unfortunately, many functions (like parity) cannot be represented by a linear threshold unit (LTU)
Learning: Backpropagation

• Similar to the perceptron learning algorithm, we cycle through our examples:
  • If the output of the network is correct, no changes are made
  • If there is an error, the weights are adjusted to reduce the error
• The trick is to assess the blame for the error and divide it among the contributing weights
Output layer

• As in the perceptron learning algorithm, we want to minimize the difference between the target output and the output actually computed:

    W_{ji} \leftarrow W_{ji} + \alpha\, a_j\, Err_i\, g'(in_i)

  where a_j is the activation of hidden unit j, Err_i = (T_i - O_i), and g'(in_i) is the derivative of the activation function.

• Writing \Delta_i = Err_i\, g'(in_i), the update becomes:

    W_{ji} \leftarrow W_{ji} + \alpha\, a_j\, \Delta_i
Hidden layers

• Need to define the error; we do error backpropagation.
• Intuition: Each hidden node j is "responsible" for some fraction of the error \Delta_i in each of the output nodes to which it connects.
• \Delta_i is divided according to the strength of the connection between the hidden node and the output node, and propagated back to provide the \Delta_j values for the hidden layer:

    \Delta_j = g'(in_j) \sum_i W_{ji}\, \Delta_i

• Update rule (where I_k is the activation of input unit k):

    W_{kj} \leftarrow W_{kj} + \alpha\, I_k\, \Delta_j
Backpropagation algorithm

• Compute the \Delta values for the output units using the observed error
• Starting with the output layer, repeat the following for each layer in the network, until the earliest hidden layer is reached:
  • Propagate the \Delta values back to the previous layer
  • Update the weights between the two layers
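A minimal sketch of these updates in Python for a network with one hidden layer, using a sigmoid g (so g'(in) = g(in)(1 - g(in))); all names are illustrative, not from the slides:

    import math

    def g(x):                                  # sigmoid activation
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_step(x, target, W_hid, W_out, alpha=0.5):
        """One example: forward pass, output/hidden deltas, then weight updates.

        W_hid[j][k]: weight from input k to hidden unit j
        W_out[i][j]: weight from hidden unit j to output unit i
        """
        # Forward pass
        a_hid = [g(sum(w * xk for w, xk in zip(row, x))) for row in W_hid]
        a_out = [g(sum(w * aj for w, aj in zip(row, a_hid))) for row in W_out]

        # Output deltas: Delta_i = (T_i - O_i) * g'(in_i)
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, a_out)]
        # Hidden deltas: Delta_j = g'(in_j) * sum_i W_ji * Delta_i
        d_hid = [a_hid[j] * (1 - a_hid[j]) *
                 sum(W_out[i][j] * d_out[i] for i in range(len(d_out)))
                 for j in range(len(a_hid))]

        # Updates: W_ji += alpha * a_j * Delta_i ;  W_kj += alpha * I_k * Delta_j
        for i in range(len(W_out)):
            for j in range(len(a_hid)):
                W_out[i][j] += alpha * a_hid[j] * d_out[i]
        for j in range(len(W_hid)):
            for k in range(len(x)):
                W_hid[j][k] += alpha * x[k] * d_hid[j]
        return a_out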
Backprop issues

• "Backprop is the cockroach of machine learning. It's ugly, and annoying, but you just can't get rid of it." (Geoff Hinton)
• Problems:
  • Black box
  • Local minima
Unsupervised Learning: Clustering

Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
Unsupervised Learning

• Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y.
• But, what if we don't have labels?
• No labels = unsupervised learning
• Only some points are labeled = semi-supervised learning
  • Labels may be expensive to obtain, so we only get a few.
• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
Clustering Data
K-Means Clustering

K-Means(k, data):
• Randomly choose k cluster center locations (centroids).
• Loop until convergence:
  • Assign each point to the cluster of the closest centroid.
  • Re-estimate the cluster centroids based on the data assigned to each.
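A minimal sketch of this loop in Python with NumPy (illustrative, assuming Euclidean distance and data given as an (n, d) array):

    import numpy as np

    def k_means(data, k, seed=0):
        """Lloyd's algorithm: assign points to the nearest centroid, then re-estimate."""
        rng = np.random.default_rng(seed)
        centroids = data[rng.choice(len(data), size=k, replace=False)]
        while True:
            # Assign each point to the cluster of the closest centroid.
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Re-estimate each centroid from the points assigned to it.
            new_centroids = np.array([
                data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):   # converged
                return centroids, labels
            centroids = new_centroids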
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
Problems with K-Means

• Very sensitive to the initial points.
  • Do many runs of k-Means, each with different initial centroids (see the sketch below).
  • Seed the centroids using a better method than random (e.g., farthest-first sampling).
• Must manually choose k.
  • Learn the optimal k for the clustering. (Note that this requires a performance measure.)
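A sketch of the multi-restart fix in Python, reusing the hypothetical k_means above and scoring each run by its within-cluster sum of squared distances (lower is better):

    import numpy as np

    def k_means_restarts(data, k, n_runs=10):
        """Run k-means several times and keep the run with the lowest distortion."""
        best = None
        for seed in range(n_runs):
            centroids, labels = k_means(data, k, seed=seed)
            distortion = sum(np.sum((data[labels == j] - centroids[j]) ** 2)
                             for j in range(k))
            if best is None or distortion < best[0]:
                best = (distortion, centroids, labels)
        return best[1], best[2]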
Problems with K-Means

• How do you tell it which clustering you want?
  • Constrained clustering techniques:
    • Same-cluster constraint (must-link)
    • Different-cluster constraint (cannot-link)