Probabilistic Learning and AI: A Review and Update


Transcript: Probabilistic Learning and AI: A Review and Update

Principles and Applications of
Probabilistic Learning
Padhraic Smyth
Department of Computer Science
University of California, Irvine
www.ics.uci.edu/~smyth
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
New Slides
• Original slides created in mid-July for ACM
– Some new slides have been added
• “new” logo in upper left
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
UPDATED
New Slides
• Original slides created in mid-July for ACM
– Some new slides have been added
• “new” logo in upper left
– A few slides have been updated
• “updated” logo in upper left
• Current slides (including new and updated) at:
www.ics.uci.edu/~smyth/talks
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
From the tutorial Web page:
“The intent of this tutorial is to provide a starting
point for students and researchers……”
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Modeling vs. Function
Approximation
• Two major themes in machine learning:
1. Function approximation/”black box” methods
• e.g., for classification and regression
• Learn a flexible function y = f(x)
• e.g., SVMs, decision trees, boosting, etc
2. Probabilistic learning
• e.g., for regression, model p(y|x) or p(y,x)
• e.g., graphical models, mixture models, hidden Markov models, etc.
• Both approaches are useful in general
– In this tutorial we will focus only on the 2nd approach,
probabilistic modeling
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Motivations for Probabilistic Modeling
• leverage prior knowledge
• generalize beyond data analysis in vector-spaces
• handle missing data
• combine multiple types of information into an analysis
• generate calibrated probability outputs
• quantify uncertainty about parameters, models, and
predictions in a statistical manner
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Learning object models in vision
Weber, Welling, Perona, 2000
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Learning object models in vision
Weber, Welling, Perona, 2000
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Learning to Extract Information from
Documents
e.g., Seymore, McCallum, Rosenfeld, 1999
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Segal, Friedman, Koller, et al,
Nature Genetics, 2005
[Diagram: Real World Data and a Probabilistic Model, linked by P(Data | Parameters)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Diagram: Real World Data and a Probabilistic Model, linked by P(Data | Parameters) and P(Parameters | Data)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Diagram: Real World Data and a Probabilistic Model, linked by P(Data | Parameters) (Generative Model) and P(Parameters | Data) (Inference)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Outline
1. Review of probability
2. Graphical models
3. Connecting probability models to data
4. Models with hidden variables
5. Case studies
(i) Simulating and forecasting rainfall data
(ii) Curve clustering with cyclone trajectories
(iii) Topic modeling from text documents
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Part 1: Review of Probability
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Notation and Definitions
• X is a random variable
– Lower-case x is some possible value for X
– “X = x” is a logical proposition: that X takes value x
– There is uncertainty about the value of X
• e.g., X is the Dow Jones index at 5pm tomorrow
• p(X = x) is the probability that proposition X=x is true
– often shortened to p(x)
• If the set of possible x’s is finite, we have a probability
distribution and Σx p(x) = 1
• If the set of possible x’s is infinite, p(x) is a density
function, and p(x) integrates to 1 over the range of X
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
• Let X be the Dow Jones Index (DJI) at 5pm Monday
August 22nd (tomorrow)
• X can take real values from 0 to some large number
• p(x) is a density representing our uncertainty about X
– This density could be constructed from historical data,
e.g.,
– After 5pm p(x) becomes infinitely narrow around the true
known x (no uncertainty)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probability as Degree of Belief
• Different agents can have different p(x)’s
– Your p(x) and the p(x) of a Wall Street expert might be
quite different
– OR: if we were on vacation we might not have access to
stock market information
• we would still be uncertain about p(x) after 5pm
• So we should really think of p(x) as p(x | BI)
– Where BI is background information available to agent I
– (will drop explicit conditioning on BI in notation)
• Thus, p(x) represents the degree of belief that agent I
has in proposition x, conditioned on available
background information
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Comments on Degree of Belief
• Different agents can have different probability models
– There is not necessarily a “correct” p(x)
– Why? Because p(x) is a model built on whatever assumptions or
background information we use
– Naturally leads to the notion of updating
• p(x | BI) -> p(x | BI, CI)
• This is the subjective Bayesian interpretation of probability
– Generalizes other interpretations (such as frequentist)
– Can be used in cases where frequentist reasoning is not applicable
– We will use “degree of belief” as our interpretation of p(x) in this
tutorial
• Note!
– Degree of belief is just our semantic interpretation of p(x)
– The mathematics of probability (e.g., Bayes rule) remain the same
regardless of our semantic interpretation
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Multiple Variables
• p(x, y, z)
– Probability that X=x AND Y=y AND Z =z
– Possible values: cross-product of X Y Z
– e.g., X, Y, Z each take 10 possible values
• x,y,z can take 10^3 possible values
• p(x,y,z) is a 3-dimensional array/table
– Defines 10^3 probabilities
• Note the exponential increase as we add more
variables
– e.g., X, Y, Z are all real-valued
• x,y,z live in a 3-dimensional vector space
• p(x,y,z) is a positive function defined over this space,
integrates to 1
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Conditional Probability
• p(x | y, z)
– Probability of x given that Y=y and Z = z
– Could be
• hypothetical, e.g., “if Y=y and if Z = z”
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc
– “all probabilities are conditional probabilities”
• Computing conditional probabilities is the basis of many
prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index last week]
– most likely value of parameter a given observed data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Computing Conditional Probabilities
• Variables A, B, C, D
– All distributions of interest related to A,B,C,D can be computed
from the full joint distribution p(a,b,c,d)
• Examples, using the Law of Total Probability
– p(a) = Σ{b,c,d} p(a, b, c, d)
– p(c,d) = Σ{a,b} p(a, b, c, d)
– p(a,c | d) = Σ{b} p(a, b, c | d)
where p(a, b, c | d) = p(a,b,c,d)/p(d)
• These are standard probability manipulations: however, we
will see how to use these to make inferences about
parameters and unobserved variables, given data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
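To make these manipulations concrete, here is a small numpy sketch (not from the tutorial; the joint table is a random stand-in) computing p(a), p(c,d), and p(a,c | d) from a full joint distribution stored as a 4-dimensional array:

```python
import numpy as np

# Hypothetical joint distribution p(a,b,c,d) over four ternary variables,
# stored as a 4-dimensional table that sums to 1.
rng = np.random.default_rng(0)
joint = rng.random((3, 3, 3, 3))
joint /= joint.sum()

# p(a) = sum over b,c,d of p(a,b,c,d)
p_a = joint.sum(axis=(1, 2, 3))

# p(c,d) = sum over a,b of p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))

# p(a,c | d) = [sum over b of p(a,b,c,d)] / p(d)
p_acd = joint.sum(axis=1)            # p(a,c,d)
p_d = joint.sum(axis=(0, 1, 2))      # p(d)
p_ac_given_d = p_acd / p_d           # broadcasts over the d axis

assert np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0)
```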
Conditional Independence
• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)
(also implies that B is conditionally independent of A given C)
• In words, B provides no information about A, if value of
C is known
• Example:
– a = “patient has upset stomach”
– b = “patient has headache”
– c = “patient has flu”
• Note that conditional independence does not imply
marginal independence
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Two Practical Problems
(Assume for simplicity each variable takes K values)
• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over
• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Two Key Ideas
• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference
• Problem 2: Model Specification
– Idea: Probabilistic learning
• General principles for learning from data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Part 2: Graphical Models
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
“…probability theory is more fundamentally
concerned with the structure of reasoning and
causation than with numbers.”
Glenn Shafer and Judea Pearl
Introduction to Readings in Uncertain Reasoning,
Morgan Kaufmann, 1990
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Models
• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of edge -> conditional independence
– Directed and undirected versions
• Why is this useful?
– A language for communication
– A language for computation
• Origins:
– Wright 1920’s
– Independently developed by Spiegelhalter and Lauritzen in
statistics and Pearl in computer science in the late 1980’s
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Examples of 3-way Graphical Models
A
B
C
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Marginal Independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Graphical Models
A
Conditionally independent effects:
p(A,B,C) = p(B|A)p(C|A)p(A)
B
C
B and C are conditionally independent
Given A
e.g., A is a disease, and we model
B and C as conditionally independent
symptoms given A
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Examples of 3-way Graphical Models
A
B
C
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Independent Causes:
p(A,B,C) = p(C|A,B)p(A)p(B)
Examples of 3-way Graphical Models
A
B
C
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Markov dependence:
p(A,B,C) = p(C|B) p(B|A)p(A)
Real-World Example
Monitoring Intensive-Care Patients
• 37 variables
• 509 parameters
…instead of 2^37
[Figure: directed graphical model over the 37 monitoring variables, e.g. PULMEMBOLUS, INTUBATION, KINKEDTUBE, MINVOLSET, VENTLUNG, SAO2, ARTCO2, HYPOVOLEMIA, LVFAILURE, STROKEVOLUME, CATECHOL, HR, BP, …]
(figure courtesy of Kevin Murphy/Nir Friedman)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Directed Graphical Models
B
A
p(A,B,C) = p(C|A,B)p(A)p(B)
C
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Directed Graphical Models
B
A
p(A,B,C) = p(C|A,B)p(A)p(B)
C
In general,
p(X1, X2, …, XN) = Π p(Xi | parents(Xi))
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Directed Graphical Models
B
A
p(A,B,C) = p(C|A,B)p(A)p(B)
C
In general,
p(X1, X2, …, XN) = Π p(Xi | parents(Xi))
• Probability model has simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, Bayesian networks, causal networks
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
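As a concrete illustration of the factored form (not from the slides; the conditional probability tables are made up), the following sketch builds the joint p(A,B,C) = p(C|A,B) p(A) p(B) for binary variables:

```python
import numpy as np

# Made-up CPTs for binary A, B and C with parents {A, B}.
p_A = np.array([0.6, 0.4])                 # p(A=a)
p_B = np.array([0.7, 0.3])                 # p(B=b)
p_C_given_AB = np.array([[[0.9, 0.1],      # p(C=c | A=0, B=0)
                          [0.5, 0.5]],     # p(C=c | A=0, B=1)
                         [[0.2, 0.8],      # p(C=c | A=1, B=0)
                          [0.3, 0.7]]])    # p(C=c | A=1, B=1)

# Joint distribution from the factored form p(A,B,C) = p(C|A,B) p(A) p(B).
joint = p_A[:, None, None] * p_B[None, :, None] * p_C_given_AB

assert np.isclose(joint.sum(), 1.0)
print(joint[1, 0, 1])   # p(A=1, B=0, C=1) = 0.4 * 0.7 * 0.8
```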
Example
D
A
B
E
C
F
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
G
Example
[Graph: nodes A, B, C, D, E, F, G, as on the previous slide]
p(A, B, C, D, E, F, G) = Π p(variable | parents)
= p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
D
A
B
E
c
F
Say we want to compute p(a | c, g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
g
Example
D
A
B
E
c
F
g
Direct calculation: p(a|c,g) = Σ{b,d,e,f} p(a,b,d,e,f | c,g)
Complexity of the sum is O(K^4)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
D
A
B
E
c
F
Reordering (using factorization):
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f|g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
g
Example
D
A
B
E
c
F
g
Reordering:
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f|g)
where Σf p(e,f|g) = p(e|g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
D
A
B
E
c
F
Reordering:
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) p(e|g)
where Σe p(d|e) p(e|g) = p(d|g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
g
Example
D
A
B
E
c
F
Reordering:
Σb p(a|b) Σd p(b|d,c) p(d|g)
where Σd p(b|d,c) p(d|g) = p(b|c,g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
g
Example
D
B
E
c
F
A
g
Reordering:
Σb p(a|b) p(b|c,g)
= p(a|c,g)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Complexity is O(K), compared to O(K^4)
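The same sum-reordering idea can be sketched in code. The example below is an illustration under its own assumptions, not the slide's exact graph: it uses a simple chain A -> B -> C -> D with random CPTs and checks that pushing the sums inward gives the same marginal as brute-force enumeration.

```python
import numpy as np

# Variable elimination on a chain A -> B -> C -> D (K states each).
# The CPTs below are random stand-ins, not values from the tutorial.
K = 4
rng = np.random.default_rng(1)

def random_cpt(shape):
    """Random conditional table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_a = random_cpt((K,))          # p(a)
p_b_a = random_cpt((K, K))      # p(b | a)
p_c_b = random_cpt((K, K))      # p(c | b)
p_d_c = random_cpt((K, K))      # p(d | c)

# Brute force: build the full joint (K^4 entries) and marginalize.
joint = (p_a[:, None, None, None] * p_b_a[:, :, None, None]
         * p_c_b[None, :, :, None] * p_d_c[None, None, :, :])
p_d_brute = joint.sum(axis=(0, 1, 2))

# Eliminate a, then b, then c: each step is a K x K matrix-vector product,
# so the total cost grows like N*K^2 instead of K^N.
msg = p_a @ p_b_a          # sum_a p(a) p(b|a)  -> p(b)
msg = msg @ p_c_b          # sum_b p(b) p(c|b)  -> p(c)
p_d_elim = msg @ p_d_c     # sum_c p(c) p(d|c)  -> p(d)

assert np.allclose(p_d_brute, p_d_elim)
```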
A More General Algorithm
• Message Passing (MP) Algorithm
– Pearl, 1988; Lauritzen and Spiegelhalter, 1988
– Declare 1 node (any node) to be a root
– Schedule two phases of message-passing
• nodes pass messages up to the root
• messages are distributed back to the leaves
– In time O(N), we can compute P(….)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Sketch of the MP algorithm in action
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Sketch of the MP algorithm in action
1
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Sketch of the MP algorithm in action
1
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
2
Sketch of the MP algorithm in action
1
3
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
2
Sketch of the MP algorithm in action
2
1
3
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
4
Complexity of the MP Algorithm
• Efficient
– Complexity scales as O(N K^m)
• N = number of variables
• K = arity of variables
• m = maximum number of parents for any node
– Compare to O(K^N) for brute-force method
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphs with “loops”
D
A
B
E
C
F
G
Message passing algorithm does not work when
there are multiple paths between 2 nodes
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphs with “loops”
D
A
B
E
C
F
General approach: “cluster” variables
together to convert graph to a tree
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
G
Reduce to a Tree
D
B, E
A
C
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
F
G
Reduce to a Tree
D
B, E
A
C
F
G
Good news: can perform MP algorithm on this tree
Bad news: complexity is now O(K^2)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probability Calculations on Graphs
• Structure of the graph reveals
– Computational strategy
– Dependency relations
• Complexity is typically O(K^max(number of parents))
– If single parents (e.g., tree), -> O(K)
– The sparser the graph the lower the complexity
• Technique can be “automated”
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables:
• replace sum with integral
– For identification of most likely values
• Replace sum with max operator
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Hidden Markov Model (HMM)
Y1   Y2   Y3   …   Yn      Observed
----------------------------------------------------
S1   S2   S3   …   Sn      Hidden
Two key assumptions:
1. hidden state sequence is Markov
2. observation Yt is CI of all other variables given St
Widely used in speech recognition, protein sequence models
Motivation: switching dynamics, low-d representation of Y’s, etc
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
HMMs as graphical models…
• Computations of interest
• p(Y) = Σs p(Y, S = s)   -> “forward-backward” algorithm
• arg max_s p(S = s | Y)   -> Viterbi algorithm
• Both algorithms….
– computation time linear in T
– special cases of MP algorithm
• Many generalizations and extensions….
– Make state S continuous -> Kalman filters
– Add inputs -> convolutional decoding
– Add additional dependencies in the model
• Generalized HMMs
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
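As an illustration of how p(Y) can be computed in time linear in the sequence length, here is a minimal (scaled) forward-recursion sketch; the 2-state, 3-symbol HMM parameters are made up:

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """log p(Y) for a discrete-output HMM via the scaled forward recursion.
    pi: initial state probs (K,), A: transition matrix (K,K),
    B: emission probs (K, V), obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()                  # rescale to avoid underflow
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]     # predict, then weight by emission
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

# Toy 2-state, 3-symbol HMM (made-up numbers).
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(hmm_log_likelihood([0, 0, 2, 1, 2], pi, A, B))
```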
Part 3: Connecting Probability
Models to Data
Recommended References for this Section:
• All of Statistics, L. Wasserman, Chapman and Hall, 2004 (Chapters 6,9,11)
• Pattern Classification and Scene Analysis, 1st ed, R. Duda and P. Hart,
Wiley, 1973, Chapter 3.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
(Generative Model)
P(Data | Parameters)
Real World
Data
Probabilistic
Model
P(Parameters | Data)
(Inference)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Conditionally Independent Observations
Model parameters
q
y1
y2
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
yn-1
yn
Data
“Plate” Notation
q
Model parameters
yi
Data = {y1,…yn}
i=1:n
Plate = rectangle in graphical model
variables within a plate are replicated
in a conditionally independent manner
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: Gaussian Model
[Plate diagram: parameters μ and σ, with yi inside the plate i=1:n]
Generative model:
p(y1, …, yn | μ, σ) = Π p(yi | μ, σ)
= p(data | parameters)
= p(D | θ)
where θ = {μ, σ}
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The Likelihood Function
• Likelihood = p(data | parameters)
= p(D | θ)
= L(θ)
• Likelihood tells us how likely the observed data are
conditioned on a particular setting of the parameters
• Details
– Constants that do not involve θ can be dropped in defining
L(θ)
– Often easier to work with log L(θ)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Comments on the Likelihood Function
• Constructing a likelihood function L(θ) is the first step
in probabilistic modeling
• The likelihood function implicitly assumes an underlying
probabilistic model M with parameters θ
• L(θ) connects the model to the observed data
• Graphical models provide a useful language for
constructing likelihoods
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Binomial Likelihood
• Binomial model
– n memoryless trials
– probability θ of success at each trial
• Observed data
– r successes in n trials
– Defines a likelihood:
[Plate diagram: θ, with yi inside the plate i=1:n]
L(θ) = p(D | θ)
= p(successes) p(non-successes)
= θ^r (1-θ)^(n-r)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
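A quick numerical sketch of this likelihood (the counts are made up, not from the slides): evaluate log L(θ) = r log θ + (n-r) log(1-θ) on a grid and compare the maximizer with the closed-form ML estimate r/n.

```python
import numpy as np

n, r = 20, 7                      # made-up data: 7 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)
log_lik = r * np.log(theta) + (n - r) * np.log(1 - theta)

theta_ml = theta[np.argmax(log_lik)]
print(theta_ml, r / n)            # grid maximizer vs. the closed form r/n
```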
NEW
Binomial Likelihood Examples
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Gaussian Model and Likelihood
Model assumptions:
1. y’s are conditionally independent given model
2. each y comes from a Gaussian (Normal) density
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Conditional Independence (CI)
• CI in a likelihood model means that we are assuming
data points provide no information about each other, if
the model parameters are assumed known.
p(D | θ) = p(y1, …, yN | θ) = Π p(yi | θ)
CI assumption
• Works well for (e.g.)
– Patients randomly arriving at a clinic
– Web surfers randomly arriving at a Web site
• Does not work well for
– Time-dependent data (e.g., stock market)
– Spatial data (e.g., pixel correlations)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: Markov Likelihood
• Motivation: wish to model data in a sequence where
there is sequential dependence,
– e.g., a first-order Markov chain for a DNA sequence
– Markov modeling assumption:
p(yt | yt-1, yt-2, …, y1) = p(yt | yt-1)
– θ = K x K matrix of transition probabilities
L(θ) = p(D | θ) = p(y1, …, yN | θ) = Π p(yt | yt-1, θ)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
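A small sketch of evaluating this Markov likelihood for a toy 2-state chain (the transition matrix and sequence are made up; the initial-state term is included explicitly):

```python
import numpy as np

def markov_log_likelihood(seq, pi, T):
    """log L(theta) for a first-order Markov chain.
    pi: initial distribution (K,), T: K x K transition matrix,
    seq: list of state indices."""
    ll = np.log(pi[seq[0]])
    for prev, curr in zip(seq[:-1], seq[1:]):
        ll += np.log(T[prev, curr])
    return ll

# Toy 2-state chain with made-up parameters.
pi = np.array([0.5, 0.5])
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(markov_log_likelihood([0, 0, 1, 1, 1, 0], pi, T))
```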
Maximum Likelihood (ML) Principle
(R. Fisher ~ 1922)
Model parameters
q
Data = {y1,…yn}
yi
i=1:n
L(θ) = p(Data | θ) = Π p(yi | θ)
Maximum Likelihood:
θML = arg max{ Likelihood(θ) }
Select the parameters that make the observed data most likely
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: ML for Gaussian Model
Maximum Likelihood
Estimate θML
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Maximizing the Likelihood
• More generally, we solve for the θ value that
maximizes the function L(θ)
– With p parameters, L(θ) is a scalar function defined over a
p-dimensional space
– 2 situations:
• We can analytically solve for the maxima of L(θ)
– This is rare
• We have to resort to iterative techniques to find θML
– More common
• General approach
– Define a generative probabilistic model
– Define an associated likelihood (connect model to data)
– Solve an optimization problem to find θML
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Analytical Solution for Gaussian
Likelihood
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
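The standard closed-form result (the slide's derivation figure is not reproduced here) is that μML is the sample mean and σ²ML is the divide-by-n sample variance; a quick numerical check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic data

mu_ml = y.mean()                        # maximum-likelihood mean
sigma2_ml = ((y - mu_ml) ** 2).mean()   # ML variance (divides by n, not n-1)
print(mu_ml, np.sqrt(sigma2_ml))
```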
Graphical Model for Regression
q
s
xi
yi
i=1:n
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
y
x
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
y
f(x ; q )
this is unknown
x
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: ML for Linear Regression
• Generative model:
y = ax + b + Gaussian noise
p(y | x) = N(ax + b, σ^2)
• Conditional Likelihood
L(θ) = p(y1, …, yN | x1, …, xN, θ)
= Π p(yi | xi, θ),    θ = {a, b}
• Can show (homework problem!) that
log L(θ) = - Σ [yi - (a xi + b)]^2   (up to constants)
i.e., finding a, b to maximize the log-likelihood is the
same as finding a, b that minimize least squares
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
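A short sketch of this equivalence on synthetic data (the true parameters below are made up): for Gaussian noise, the least-squares fit is also the maximizer of the conditional log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=200)   # y = ax + b + noise

# Least-squares fit of a and b, which (for Gaussian noise) is also the
# maximum-likelihood estimate.
X = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_hat, b_hat)
```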
ML and Regression
• Multivariate case
– multiple x’s, multiple regression coefficients
– with Gaussian noise, the ML solution is again equivalent to least
squares (solutions to a set of linear equations)
• Non-linear multivariate model
– With Gaussian noise we get
log L(θ) = - Σ [yi - f(xi ; θ)]^2   (up to constants)
– Conditions for the θ that maximizes L(θ) lead to a set of p non-linear
equations in p variables
– e.g., f (xi ; q) = a multilayer neural network with 1000 weights
• Optimization = finding the maximum of a non-convex function in
1000 dimensional space!
• Typically use iterative local search based on gradient (many
possible variations)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Probabilistic Learning and Classification
• 2 main approaches:
1. p(c | x) = p(x|c) p(c) / p(x) ∝ p(x|c) p(c)
-> learn a model for p(x|c) for each class, use Bayes rule to classify
- example: naïve Bayes
- advantage: theoretically optimal if p(x|c) is “correct”
- disadvantage: not directly optimizing predictive accuracy
2. Learn p(c|x) directly, e.g.,
– logistic regression (see tutorial notes from D. Lewis)
– other regression methods such as neural networks, etc.
– Often quite effective in practice: very useful for ranking,
scoring, etc
– Contrast with purely discriminative methods such as SVMs, trees
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The Bayesian Approach to Learning
[Plate diagram: hyperparameter α -> parameter θ -> yi, i=1:n]
Prior(θ) = p(θ | α)
Maximum A Posteriori:
θMAP = arg max{ Likelihood(θ) x Prior(θ) }
Fully Bayesian:
p(θ | Data) = p(Data | θ) p(θ) / p(Data)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The Bayesian Approach
Fully Bayesian:
p(θ | Data) = p(Data | θ) p(θ) / p(Data)
= Likelihood x Prior / Normalization term
Estimating p(θ | Data) can be viewed as inference in a
graphical model
ML is a special case = MAP with a “flat” prior
[Plate diagram: α -> θ -> yi, i=1:n]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
More Comments on Bayesian Learning
• “fully” Bayesian: report full posterior density p(θ | D)
– For simple models, we can calculate p(θ | D) analytically
– Otherwise we empirically estimate p(θ | D)
• Monte Carlo sampling methods are very useful
• Bayesian prediction (e.g., for regression):
p(y | x, D) = ∫ p(y, θ | x, D) dθ
= ∫ p(y | θ, x) p(θ | D) dθ
-> prediction at each θ is weighted by p(θ | D)
[theoretically preferable to picking a single θ (as in ML)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
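A rough Monte Carlo sketch of the prediction integral above, assuming purely for illustration that the posterior p(θ | D) over a scalar mean is itself Gaussian and that p(y | θ) = N(y; θ, 1):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo version of the Bayesian prediction integral:
# p(y | D) ~= (1/S) * sum_s p(y | theta_s), with theta_s ~ p(theta | D).
rng = np.random.default_rng(4)
theta_samples = rng.normal(loc=3.2, scale=0.1, size=5000)   # draws from an assumed p(theta | D)

y_grid = np.linspace(0, 6, 601)
obs_sigma = 1.0
# Average the per-sample predictive densities N(y; theta_s, obs_sigma^2).
pred = norm.pdf(y_grid[:, None], loc=theta_samples[None, :], scale=obs_sigma).mean(axis=1)
print(y_grid[np.argmax(pred)])    # mode of the posterior predictive, close to 3.2
```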
More Comments on Bayesian Learning
• In practice…
– Fully Bayesian is theoretically optimal but not always the
most practical approach
• E.g., computational limitations with large numbers of
parameters
• assessing priors can be tricky
• Bayesian approach particularly useful for small data sets
• For large data sets, Bayesian, MAP, ML tend to agree
– ML/MAP are much simpler => often used in practice
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example of Bayesian Estimation
• Definition of Beta prior
• Definition of Binomial likelihood
• Form of Beta posterior
• Examples of plots with prior+likelihood -> posterior
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Beta Density as a Prior
• Let θ be a proportion,
– e.g., fraction of customers that respond to an email ad
– p(θ) is a prior for θ
– e.g. p(θ | α, β) = Beta density with parameters α and β
p(θ | α, β) ∝ θ^(α-1) (1-θ)^(β-1)
α / (α + β) influences the location
α + β controls the width
[Plot: a Beta density over θ, with α and β indicated]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Examples of Beta Density Priors
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Binomial Likelihood
• Binomial model
– n memoryless trials
– probability θ of success at each trial
• Observed data
– r successes in n trials
– Defines a likelihood:
p(D | θ) = p(successes) p(non-successes)
= θ^r (1-θ)^(n-r)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Beta + Binomial -> Beta
p(θ | D) = Posterior ∝ Likelihood x Prior
= Binomial x Beta
∝ θ^r (1-θ)^(n-r) x θ^(α-1) (1-θ)^(β-1)
= Beta(α + r, β + n - r)
Prior is “updated” using data:
Parameters: α -> α + r,  β -> β + n - r
Sample size: α + β -> α + β + n
Mean: α / (α + β) -> (α + r) / (α + β + n)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
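The conjugate update is a one-liner in code; a small sketch with made-up prior parameters and data:

```python
# Conjugate Beta-Binomial updating, matching the slide:
# prior Beta(alpha, beta), data r successes in n trials -> Beta(alpha + r, beta + n - r).
alpha, beta = 2.0, 2.0        # made-up prior
n, r = 20, 7                  # made-up data

alpha_post = alpha + r
beta_post = beta + (n - r)

prior_mean = alpha / (alpha + beta)
posterior_mean = alpha_post / (alpha_post + beta_post)   # (alpha + r) / (alpha + beta + n)
print(prior_mean, posterior_mean, r / n)   # posterior mean sits between prior mean and r/n
```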
NEW
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Extensions
• K categories with K probabilities that sum to 1
– Dirichlet prior + Multinomial likelihood -> Dirichlet posterior
– Used in text modeling, protein alignment algorithms, etc
• E.g. Biological Sequence Analysis, R. Durbin et al.,
Cambridge University Press, 1998.
• Hierarchical modeling
– Multiple trials for different individuals
– Each individual has their own q
– The q’s ~ common population distribution
– For applications in marketing see
• Market Segmentation: Conceptual and Methodological
Foundations, M. Wedel and W. A. Kamakura, Kluwer,
1998
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: Bayesian Gaussian Model
a
s
m
b
yi
i=1:n
Note: priors and parameters are assumed independent here
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example: Bayesian Regression
a
q
s
xi
yi
i=1:n
Model:
yi = f[xi; θ] + e,   e ~ N(0, σ^2)
p(yi | xi) ~ N(f[xi; θ], σ^2)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
b
UPDATED
Other Examples
• Bayesian examples
– Bayesian neural networks
• Richer probabilistic models
– Random effects models
– E.g., Learning to align curves
• Learning model structure
– Chow-Liu trees
– General graphical model structures
• e.g. gene regulation networks
Comprehensive reference:
Bayesian Data Analysis, A. Gelman, J. B. Carlin. H. S. Stern, and D.
B. Rubin, Chapman and Hall, 2nd edition, 2003.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Learning Shapes and Shifts
Data = smoothed growth acceleration data from teenagers
EM used to learn a spline model + time-shift for each curve
Original data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data after Learning
Learning to Track People
Sidenbladh, Black , Fleet, 2000
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Model Uncertainty
• How do we know what model M to select for our
likelihood function?
– In general, we don’t!
– However, we can use the data to help us infer which model
from a set of possible models is best
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Method 1: Bayesian Approach
• Can evaluate the evidence for each model,
p(M |D) = p(D|M) p(M)/ p(D)
– Can get p(D|M) by integrating p(D, q| M) over parameter
space (this is the “marginal likelihood”)
– in theory p(M |D) is how much evidence exists in the data
for model M
• More complex models are automatically penalized
because of the integration over higher-dimensional
parameter spaces
– in practice p(M|D) can rarely be computed directly
• Monte Carlo schemes are popular
• Also: approximations such as BIC, Laplace, etc
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Comments on Bayesian Approach
• Bayesian Model Averaging (BMA):
– Instead of selecting the single best model, for prediction
average over all available models (theoretically the correct
thing to do)
– Weights used for averaging are p(M|D)
• Empirical alternatives
– e.g., Stacking, Bagging
– Idea is to learn a set of unconstrained combining weights
from the data, weights that optimize predictive accuracy
• “emulate” BMA approach
• may be more effective in practice
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Method 2: Predictive Validation
• Instead of the Bayesian approach, we could use the
probability of new unseen test data as our metric for
selecting models
• E.g., 2 models
– If p(D | M1) > p(D | M2) then M1 is assigning higher
probability to new data than M2
– This will (with enough data) select the model that predicts
the best, in a probabilistic sense
– Useful for problems where we have very large amounts of
data and it is easy to create a large validation data set D
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
The Prediction Game
Observed Data: data points x on the interval [0, 10]
What is a good guess at p(x)?
[Plots of two candidate densities over the same interval: Model A for p(x) and Model B for p(x)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Which of Model A or B is better?
Test data generated from the true underlying q(x)
Model A
Model B
We can score each model in terms of p(new data | model)
Asymptotically, this is a fair unbiased score
(irrespective of the complexities of the models)
Note: empirical average of log p(data) scores ~ negative entropy
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Predictive Entropy Out-of-Sample
[Plot: out-of-sample negative log-likelihood (bits/token) vs. number of mixture components K, from 20 to 200, comparing Mixtures of Multinomials with Mixtures of SFSMs]
Model-based clustering and visualization of navigation patterns on a Web site
Cadez et al, Journal of Data Mining and Knowledge Discovery, 2003
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Simple Model Class
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data-generating
process (“truth”)
Simple Model Class
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data-generating
process (“truth”)
“Closest” model in terms
of KL distance
Simple Model Class
Best model is relatively far from Truth
=> High Bias
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data-generating
process (“truth”)
Simple Model Class
Complex Model Class
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data-generating
process (“truth”)
Simple Model Class
Complex Model Class
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Best model is closer to Truth
=> Low Bias
However,…. this could be the model that best fits the observed data
=> High Variance
Data-generating
process (“truth”)
Simple Model Class
Complex Model Class
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Part 4: Models with Hidden Variables
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Hidden or Latent Variables
• In many applications there are 2 sets of variables:
– Variables whose values we can directly measure
– Variables that are “hidden”, cannot be measured
• Examples:
– Speech recognition:
• Observed: acoustic voice signal
• Hidden: label of the word spoken
– Face tracking in images
• Observed: pixel intensities
• Hidden: position of the face in the image
– Text modeling
• Observed: counts of words in a document
• Hidden: topics that the document is about
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixture Models
Pearson, 1894, Phil. Trans. Roy. Soc. A.
p(Y) = Σk p(Y | S=k) p(S=k)
S
Hidden discrete variable
Y
Observed variable(s)
Motivation:
1. models a true process (e.g., fish example)
2. approximation for a complex process
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Plots: two component densities (Component 1, Component 2) and the resulting Mixture Model, p(x) vs. x from -5 to 10]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Plots: two component densities (Component 1, Component 2) and the resulting Mixture Model]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Plots: a different set of Component Models and the resulting Mixture Model]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
A Graphical Model for Clustering
S
Y1
Hidden discrete (cluster) variable
Yj
Yd
Observed variable(s)
(assumed conditionally independent given S)
Clusters = p(Y1,…Yd | S = s)
Probabilistic Clustering = learning these probability
distributions from data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Hidden Markov Model (HMM)
Y1   Y2   Y3   …   Yn      Observed
----------------------------------------------------
S1   S2   S3   …   Sn      Hidden
Two key assumptions:
1. hidden state sequence is Markov
2. observation Yt is CI of all other variables given St
Widely used in speech recognition, protein sequence models
Motivation?
- S can provide non-linear switching
- S can encode low-dim time-dependence for high-dim Y
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Generalizing HMMs
T1
Y1
Y2
Y3
Yn
S1
S2
S3
Sn
T2
T3
Tn
Two independent state variables,
e.g., two processes evolving at different time-scales
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Generalizing HMMs
I1
Y1
Y2
Y3
Yn
S1
S2
S3
Sn
I2
I3
Inputs I provide context to influence switching,
e.g., external forcing variables
Model is still a tree -> inference is still linear
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
In
Generalizing HMMs
I1
Y1
Y2
Y3
Yn
S1
S2
S3
Sn
I2
I3
In
Add direct dependence between Y’s to better model persistence
Can merge each St and Yt to construct a tree-structured model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixture Model
q
Si
yi
i=1:n
Likelihood(θ) = p(Data | θ)
= Πi p(yi | θ)
= Πi [ Σk p(yi | si = k, θ) p(si = k) ]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Learning with Missing Data
• Guess at some initial parameters θ0
• E-step (Inference)
– For each case, and each unknown variable compute
p(S | known data, θ0)
• M-step (Optimization)
– Maximize L(θ) using p(S | …)
– This yields new parameter estimates θ1
• This is the EM algorithm:
– Guaranteed to converge to a (local) maximum of L(θ)
– Dempster, Laird, Rubin, 1977
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
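Here is a compact EM sketch for the concrete case of a 1-D Gaussian mixture, a minimal illustration of the E and M steps above (not production code; the data and initialization are made up):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(y, K=2, n_iter=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                    # mixing weights p(s = k)
    mu = rng.choice(y, size=K, replace=False)  # initial means
    sigma = np.full(K, y.std())                # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibilities p(s_i = k | y_i, current parameters)
        dens = w * norm.pdf(y[:, None], mu, sigma)     # shape (n, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted data
        nk = resp.sum(axis=0)
        w = nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

# Synthetic two-component data (made-up parameters).
rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.7, 200)])
print(em_gmm_1d(y))
```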
E-Step
q
Si
yi
i=1:n
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
M-Step
q
Si
yi
i=1:n
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
E-Step
q
Si
yi
i=1:n
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The E (Expectation) Step
n objects
Current K components
and parameters
E step: Compute p(object i is in group k)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The M (Maximization) Step
n objects
New parameters for
the K components
M step: Compute q, given n objects and memberships
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Complexity of EM for mixtures
n objects
K models
Complexity per iteration scales as O( n K f(d) )
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
ANEMIA PATIENTS AND CONTROLS
[Scatter plot: Red Blood Cell Hemoglobin Concentration (3.7-4.4) vs. Red Blood Cell Volume (3.3-4.0)]
Data from Prof. Christine McLaren, Dept of Epidemiology, UC Irvine
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 1
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 3
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 5
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 10
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 15
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
EM ITERATION 25
[Scatter plot of the anemia data with the fitted mixture components at this iteration]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
ANEMIA DATA WITH LABELS
[Scatter plot: Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume, with the Control Group and the Anemia Group labeled]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS
[Plot: log-likelihood (about 400 to 490) vs. EM iteration (0 to 25)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example of a Log-Likelihood Surface
[Contour plot of the log-likelihood as a function of mean 2 and a log scale for sigma]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Log-Likelihood Cross-Section
[Plot: log-likelihood vs. log(sigma)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
HMMs
Y1
Y2
Y3
YN
S1
S2
S3
SN
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
q1
Y1
Y2
Y3
YN
S1
S2
S3
SN
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
q1
Y1
Y2
Y3
YN
S1
S2
S3
SN
q2
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
E-Step
(linear inference)
q1
Y1
Y2
Y3
YN
S1
S2
S3
SN
q2
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
M-Step
(closed form)
q1
Y1
Y2
Y3
YN
S1
S2
S3
SN
q2
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Alternatives to EM
• Method of Moments
– EM is more efficient
• Direct optimization
– e.g., gradient descent, Newton methods
– EM is usually simpler to implement
• Sampling (e.g., MCMC)
• Minimum distance, e.g.,
IMSE(θ) = E[ ∫ ( p(x | θ) - q(x) )^2 dx ]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixtures as “Data Simulators”
For i = 1 to N
classk ~ p(class1, class2, …., class K)
xi ~ p(x | classk)
end
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
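A direct translation of this simulator for the special case of a 1-D Gaussian mixture (the weights, means, and standard deviations are made-up values):

```python
import numpy as np

# Sampling from a mixture: pick a class, then sample x from that class's model.
rng = np.random.default_rng(6)
weights = np.array([0.5, 0.3, 0.2])     # p(class 1), p(class 2), p(class 3)
means = np.array([0.0, 4.0, 9.0])
sigmas = np.array([1.0, 0.5, 2.0])

N = 1000
classes = rng.choice(len(weights), size=N, p=weights)   # class_k ~ p(class)
x = rng.normal(means[classes], sigmas[classes])         # x_i ~ p(x | class_k)
```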
Mixtures with Markov Dependence
For i = 1 to N
classk ~ p(class1, class2, …., class K | class[xi-1] )
xi ~ p(x | classk)
end
Current class depends on
previous class (Markov dependence)
This is a hidden Markov model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixtures of Sequences
For i = 1 to N
classk ~ p(class1, class2, …., class K)
while non-end state
xij ~ p(xj | xj-1, classk)
end
Produces a variable
length sequence
end
Markov sequence model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixtures of Curves
For i = 1 to N
classk ~ p(class1, class2, …., class K)
Li ~ p(Li | classk)
Length of curve
for j = 1 to Li
yij ~ f(y | xj, classk) + ek
end
end
Independent variable x
Class-dependent curve model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Mixtures of Image Models
For i = 1 to N
classk ~ p(class1, class2, …., class K)
Global scale
sizei ~ p(size|classk)
for i = 1 to Vi-1
Number of vertices
intensityi ~ p(intensity | classk)
end
end
Pixel generation model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
More generally…..
p(Di) = Σ{k=1..K} p(Di | ck) αk
Generative Model
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general
- e.g., sets of sequences, spatial patterns, etc
[Note: given p(Di | ck), we can define an EM algorithm]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
References
•
The EM Algorithm and Mixture Models
– The EM Algorithm and Extensions
G. McLachlan and T. Krishnan. John Wiley and Sons, New York,
1997.
•
Mixture models
– Statistical analysis of finite mixture distributions.
D. M. Titterington, A. F. M. Smith & U. E. Makov. Wiley & Sons,
Inc., New York, 1985.
– Finite Mixture Models
G.J. McLachlan and D. Peel, New York: Wiley (2000)
– Model-based clustering, discriminant analysis, and density
estimation, C. Fraley and A. E. Raftery, Journal of the American
Statistical Association 97:611-631 (2002).
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
References
•
Hidden Markov Models
– A tutorial on hidden Markov models and selected
applications in speech recognition,
L. R. Rabiner, Proceedings of the IEEE, vol. 77, no.2, 257-287,
1989.
– Probabilistic independence networks for hidden Markov
models
P. Smyth, D. Heckerman, and M. Jordan, Neural Computation ,
vol.9, no. 2, 227-269, 1997.
– Hidden Markov models,
A. Moore, online tutorial slides,
http://www.autonlab.org/tutorials/hmm12.pdf
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Part 5: Case Studies
(i) Simulating and forecasting rainfall data
(ii) Curve clustering with cyclones
(iii) Topic modeling from text documents
and if time permits…..
(iv) Sequence clustering for Web data
(v) Analysis of time-course gene expression data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Case Study 1:
Simulating and Predicting Rainfall Patterns
Joint work with:
Andy Robertson, International Research Institute for Climate Prediction
Sergey Kirshner, Department of Computer Science, UC Irvine
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Spatio-Temporal Rainfall Data
Northeast Brazil 1975-2002
90-day time series
24 years
10 stations
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
DATA FOR ONE RAIN-STATION
[Image: daily rainfall data for one station, DAY (1-90) vs. YEAR]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Modeling Goals
• “Downscaling”
– Modeling interannual variability
– coupling rainfall to large-scale effects like El Nino
• Prediction
– e.g., “hindcasting” of missing data
• Seasonal Forecasts
– E.g. on Dec 1 produce simulations of likely 90-day winters
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
HMMs for Rainfall Modeling
I1
Y1
Y2
Y3
YN
S1
S2
S3
SN
I2
I3
IN
S = unobserved weather state
Y = spatial rainfall pattern (“outputs”)
I = atmospheric variables (“inputs”)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Learned Weather States
States provide an interpretable “view” of spatio-temporal
relationships in the data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Weather
States
for Kenya
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Spatial Chow-Liu Trees
- Spatial distribution given a state is a tree structure (a graphical model)
- Useful intermediate between full pair-wise model and conditional independence
- Optimal topology learned from data using minimum spanning tree algorithm
- Can use priors based on distance, topography
- Tree-structure over time also
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Missing Data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Error rate v. fraction of missing data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
References
•
Trees and Hidden Markov Models
– Conditional Chow-Liu tree structures for modeling discrete-valued vector time series
S. Kirshner, P. Smyth, and A. Robertson
in Proceedings of the 20th International Conference on
Uncertainty in AI , 2004.
•
Applications to rainfall modeling
– Hidden Markov models for modeling daily rainfall
occurrence over Brazil
A. Robertson, S. Kirshner, and P. Smyth
Journal of Climate, November 2005.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Summary
• Simple “empirical” probabilistic models can be very helpful in
interpreting large scientific data sets
– e.g., HMM states provide scientists with a basic but useful
classification of historical spatial rainfall patterns
• Graphical models provide “glue” to link together different
information
– Spatial
– Temporal
– Hidden states, etc
• “Generative” aspect of probabilistic models can be quite
useful, e.g., for simulation
• Missing data is handled naturally in a probabilistic framework
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Case Study 2:
Clustering Cyclone Trajectories
Joint work with:
Suzana Camargo, Andy Robertson, International Research Institute for
Climate Prediction
Scott Gaffney, Department of Computer Science, UC Irvine
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Storm Trajectories
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Microarray Gene Expression Data
TIME-COURSE GENE EXPRESSION DATA
[Plot: normalized log-ratio of intensity vs. time (7-minute increments) for many genes]
Yeast Cell-Cycle Data, Spellman et al (1998)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Clustering “non-vector” data
• Challenges with the data….
– May be of different “lengths”, “sizes”, etc
– Not easily representable in vector spaces
– Distance is not naturally defined a priori
• Possible approaches
– “convert” into a fixed-dimensional vector space
• Apply standard vector clustering – but loses information
– use hierarchical clustering
• But O(N2) and requires a distance measure
– probabilistic clustering with mixtures
• Define a generative mixture model for the data
• Learn distance and clustering simultaneously
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Models for Curves
Data = { (y1, t1), …, (yT, tT) }
[Plate diagram: θ -> y, with t observed, plate of size T]
y = f(t; θ)
e.g., y = at^2 + bt + c,   θ = {a, b, c}
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Models for Curves
q
t
s
y
T
y ~ Gaussian density
with mean = f(t; θ), variance = σ^2
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
y
t
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Example
f(t ; q ) <- this is hidden
y
t
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Models for Sets of Curves
q
t
s
y
T
N curves
Each curve: P(yi | ti, q ) = product of Gaussians
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Curve-Specific Transformations
q
Note: we can learn
function parameters
and shifts
simultaneously with EM
t
a
s
y
T
N curves
e.g., yi = at^2 + bt + c + αi,   θ = {a, b, c, α1, …, αN}
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Learning Shapes and Shifts
Data = smoothed growth acceleration data from teenagers
EM used to learn a spline model + time-shift for each curve
Original data
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Data after Learning
Clustering: Mixtures of Curves
q
c
t
a
s
y
T
N curves
Each set of trajectory points comes from 1 of K models
Model for group k is a Gaussian curve model
Marginal probability for a trajectory = mixture model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
The Learning Problem
• K cluster models
– Each cluster is a shape model E[Y] = f(X;q) with its own
parameters
• N observed curves: for each curve we learn
– P(cluster k | curve data)
– distribution on alignments, shifts, scaling, etc, given data
• Requires simultaneous learning of
– Cluster models
– Curve transformation parameters
• Results in an EM algorithm where E and M step are tractable
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Simulated Curves (K=2 Clusters)
[Plot: simulated curves from the two clusters, y vs. time (0-20)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Simulated Data after Alignment
[Plot: the same curves after alignment, y vs. time (0-25)]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Results on Simulated Data
Method              Classification   LogP     Error in   Within-
                    Accuracy                  Mean       Cluster σ
True Model          1                2.01     0          0.050
EM with Alignment   0.99             1.34     0.019      0.048
Standard EM         0.89             -7.87    0.171      0.105
K-means             0.79             -        0.424      0.129
*Averaged over 50 train/test sets
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Clusters of Trajectories
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Cluster Shapes for Pacific Cyclones
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
TROPICAL CYCLONES Western North Pacific 1983-2002
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
References on Curve Clustering
•
Functional Data Analysis
J. O. Ramsay and B. W. Silverman, Springer, 1997.
•
Probabilistic curve-aligned clustering and prediction with
regression mixture models
S. J. Gaffney, Phd Thesis, Department of Computer Science,
University of California, Irvine, March 2004.
•
Joint probabilistic curve clustering and alignment
S. Gaffney and P. Smyth
Advances in Neural Information Processing 17 , in press, 2005.
•
Probabilistic clustering of extratropical cyclones using
regression mixture models
S. Gaffney, A. Robertson, P. Smyth, S. Camargo, M. Ghil
preprint, online at www.datalab.uci.edu.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Summary
• Graphical models provide a flexible representational
language for modeling complex scientific data
– can build complex models from simpler building blocks
• Systematic variability in the data can be handled in a
principled way
– Variable length time-series
– Misalignments in trajectories
• Generative probabilistic models are interpretable and
understandable by scientists
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Case Study 3:
Topic Modeling from Text Documents
Joint work with:
Mark Steyvers, Dave Newman, Chaitanya Chemudugunta,
UC Irvine
Michal Rosen-Zvi, Hebrew University, Jerusalem
Tom Griffiths, Brown University
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Enron email data
250,000 emails
5000 authors
1999-2002
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Questions of Interest
– What topics do these documents “span”?
– Which documents are about a particular topic?
– How have topics changed over time?
– What does author X write about?
– Who is likely to write about topic Y?
– Who wrote this specific document?
– and so on…..
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Model for Clustering
Cluster-Word
distributions
f
z
w
Cluster for
document
Word
n
D
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Graphical Model for Topics
q
Topic-Word
distributions
f
Document-Topic
distributions
z
Topic
w
Word
n
D
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
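The generative process encoded by this graphical model can be sketched in a few lines; the tiny vocabulary, topic-word distributions φ, and document-topic distribution θ below are made up for illustration:

```python
import numpy as np

# For each word: draw a topic z from the document's topic distribution theta,
# then draw the word w from that topic's word distribution phi[z].
rng = np.random.default_rng(7)
vocab = ["probabilistic", "bayesian", "inference", "retrieval", "text", "documents"]

phi = np.array([[0.35, 0.30, 0.25, 0.04, 0.03, 0.03],   # topic 0: probabilistic methods
                [0.03, 0.02, 0.05, 0.35, 0.30, 0.25]])  # topic 1: information retrieval
theta = np.array([0.7, 0.3])    # this document is mostly about topic 0

n_words = 10
z = rng.choice(2, size=n_words, p=theta)                # topic assignment per word
words = [vocab[rng.choice(len(vocab), p=phi[t])] for t in z]
print(words)
```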
Topic = probability distribution over words, P(w | z)

TOPIC 209                        TOPIC 289
WORD            PROB.            WORD           PROB.
PROBABILISTIC   0.0778           RETRIEVAL      0.1179
BAYESIAN        0.0671           TEXT           0.0853
PROBABILITY     0.0532           DOCUMENTS      0.0527
CARLO           0.0309           INFORMATION    0.0504
MONTE           0.0308           DOCUMENT       0.0441
DISTRIBUTION    0.0257           CONTENT        0.0242
INFERENCE       0.0253           INDEXING       0.0205
PROBABILITIES   0.0253           RELEVANCE      0.0159
CONDITIONAL     0.0229           COLLECTION     0.0146
PRIOR           0.0219           RELEVANT       0.0136
...                              ...
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Key Features of Topic Models
• Generative model for documents in form of bags of words
• Allows a document to be composed of multiple topics
– Much more powerful than 1 doc -> 1 cluster
• Completely unsupervised
– Topics learned directly from data
– Leverages strong dependencies at word level AND large data sets
• Learning algorithm
– Gibbs sampling is the method of choice
• Scalable
– Linear in number of word tokens
– Can be run on millions of documents
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Topics vs. Other Approaches
• Clustering documents
– Computationally simpler…
– But a less accurate and less flexible model
• LSI/LSA
– Projects words into a K-dimensional hidden space
– Less interpretable
– Not generalizable
• E.g., authors or other side-information
– Not as accurate
• E.g., precision-recall: Hoffman, Blei et al, Buntine, etc
• Topic Models (aka LDA model)
– “next-generation” text modeling, after LSI
– More flexible and more accurate (in prediction)
– Linear time complexity in fitting the model
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Examples of Topics learned from
Proceedings of the National Academy of Sciences
Griffiths and Steyvers, 2004
[Six example topics, each shown as its 20 most probable words:
HIV, VIRUS, INFECTED, IMMUNODEFICIENCY, CD4, INFECTION, HUMAN, VIRAL, TAT, GP120, REPLICATION, TYPE, ENVELOPE, AIDS, REV, BLOOD, CCR5, INDIVIDUALS, ENV, PERIPHERAL
FORCE, SURFACE, MOLECULES, SOLUTION, SURFACES, MICROSCOPY, WATER, FORCES, PARTICLES, STRENGTH, POLYMER, IONIC, ATOMIC, AQUEOUS, MOLECULAR, PROPERTIES, LIQUID, SOLUTIONS, BEADS, MECHANICAL
MUSCLE, CARDIAC, HEART, SKELETAL, MYOCYTES, VENTRICULAR, MUSCLES, SMOOTH, HYPERTROPHY, DYSTROPHIN, HEARTS, CONTRACTION, FIBERS, FUNCTION, TISSUE, RAT, MYOCARDIAL, ISOLATED, MYOD, FAILURE
STRUCTURE, ANGSTROM, CRYSTAL, RESIDUES, STRUCTURES, STRUCTURAL, RESOLUTION, HELIX, THREE, HELICES, DETERMINED, RAY, CONFORMATION, HELICAL, HYDROPHOBIC, SIDE, DIMENSIONAL, INTERACTIONS, MOLECULE, SURFACE
NEURONS, BRAIN, CORTEX, CORTICAL, OLFACTORY, NUCLEUS, NEURONAL, LAYER, RAT, NUCLEI, CEREBELLUM, CEREBELLAR, LATERAL, CEREBRAL, LAYERS, GRANULE, LABELED, HIPPOCAMPUS, AREAS, THALAMIC
TUMOR, CANCER, TUMORS, HUMAN, CELLS, BREAST, MELANOMA, GROWTH, CARCINOMA, PROSTATE, NORMAL, CELL, METASTATIC, MALIGNANT, LUNG, CANCERS, MICE, NUDE, PRIMARY, OVARIAN]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
What can Topic Models be used for?
– Queries
• Who writes on this topic?
– e.g., finding experts or reviewers in a particular area
• What topics does this person do research on?
– Comparing groups of authors or documents
– Discovering trends over time
– Detecting unusual papers and authors
– Interactive browsing of a digital library via topics
– Parsing documents (and parts of documents) by topic
– and more…..
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
What is this paper about?
Empirical Bayes screening for multi-item associations
Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001
Most likely topics according to the model are…
1. data, mining, discovery, association, attribute, …
2. set, subset, maximal, minimal, complete, …
3. measurements, correlation, statistical, variation, …
4. Bayesian, model, prior, data, mixture, …
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
CHANGING TRENDS IN COMPUTER SCIENCE
[Plot: topic probability vs. year (1990-2002) for the topics OPERATING SYSTEMS, PROGRAMMING LANGUAGES, WWW, and INFORMATION RETRIEVAL]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Pennsylvania
Gazette
(courtesy of David Newman & Sharon Block, UC Irvine)
1728-1800
80,000 articles
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
Historical Trends in Pennsylvania Gazette
(courtesy of David Newman & Sharon Block, UC Irvine)
[Plot: topic proportion (%) vs. year (1730-1800) for two topics:
STATE, GOVERNMENT, CONSTITUTION, LAW, UNITED, POWER, CITIZEN, PEOPLE, PUBLIC, CONGRESS
and SILK, COTTON, DITTO, WHITE, BLACK, LINEN, CLOTH, WOMEN, BLUE, WORSTED]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Enron email data
250,000 emails
5000 authors
1999-2002
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Enron email topics
[Table condensed; the original slide lists the top 10 words (with probabilities) and the top senders for four work-related topics:]
• TOPIC 36 (performance review): FEEDBACK, PERFORMANCE, PROCESS, PEP, MANAGEMENT, COMPLETE, QUESTIONS, SELECTED, COMPLETED, SYSTEM; top senders include perfmgmt, perf eval process, enron announcements
• TOPIC 72 (plant/project): PROJECT, PLANT, COST, CONSTRUCTION, UNIT, FACILITY, SITE, PROJECTS, CONTRACT, UNITS
• TOPIC 23 (regulatory/market): FERC, MARKET, ISO, COMMISSION, ORDER, FILING, PENDING, PRICE, CALIFORNIA, FILED
• TOPIC 54 (environmental): ENVIRONMENTAL, AIR, MTBE, EMISSIONS, CLEAN, EPA, COMMENTS, SAFETY, WATER, GASOLINE
(senders for the last three topics are anonymized as ***)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Non-work Topics…
[Table condensed; the original slide lists the top 10 words (with probabilities) and the top senders for four non-work topics:]
• TOPIC 66 (holidays/parties): HOLIDAY, PARTY, YEAR, SEASON, COMPANY, CELEBRATION, ENRON, TIME, RECOGNIZE, MONTH; senders: chairman & ceo, general announcement, and anonymized (***) accounts
• TOPIC 182 (fantasy football): TEXANS, WIN, FOOTBALL, FANTASY, SPORTSLINE, PLAY, TEAM, GAME, SPORTS, GAMES; senders: cbs sportsline com, houston texans, houstontexans, sportsline rewards, pro football
• TOPIC 113 (religion): GOD, LIFE, MAN, PEOPLE, CHRIST, FAITH, LORD, JESUS, SPIRITUAL, VISIT; senders: crosswalk com, wordsmith, doctor dictionary
• TOPIC 109 (online shopping): AMAZON, GIFT, CLICK, SAVE, SHOPPING, OFFER, HOLIDAY, RECEIVE, SHIPPING, FLOWERS; senders: amazon com, jos a bank, sharperimageoffers, travelocity com, barnes & noble com
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Topical Topics
[Table condensed; the original slide lists the top 10 words (with probabilities) and the top senders, all anonymized as ***, for four work-related topics:]
• TOPIC 18 (California electricity market): POWER, CALIFORNIA, ELECTRICITY, UTILITIES, PRICES, MARKET, PRICE, UTILITY, CUSTOMERS, ELECTRIC
• TOPIC 22 (state/utility finances): STATE, PLAN, CALIFORNIA, POLITICIAN Y, RATE, SOCAL, POWER, BONDS, MOU, SOCALGAS
• TOPIC 114 (federal government): COMMITTEE, BILL, HOUSE, WASHINGTON, SENATE, POLITICIAN X, CONGRESS, PRESIDENT, LEGISLATION, DC
• TOPIC 194 (legal): LAW, TESTIMONY, ATTORNEY, SETTLEMENT, BANKRUPTCY, LEGAL, EXHIBIT, CLE, METALS, PERSON Z
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
UPDATED
Using Topic Models for Information Retrieval
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Author-Topic Models
• The author-topic model
– a probabilistic model linking authors and topics
• authors -> topics -> words
– Topic = distribution over words
– Author = distribution over topics
– Document = generated from a mixture of its authors' topic distributions (see the generative sketch below)
– Learns about entities based on associated text
• Can be generalized
– Replace author with any categorical doc information
– e.g., publication type, source, year, country of origin, etc
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
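A minimal sketch of the authors -> topics -> words process just described; the corpus sizes and Dirichlet hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, A = 500, 10, 20                               # vocab size, topics, authors (assumed)
phi = rng.dirichlet(0.01 * np.ones(V), size=K)      # topic -> word distributions
theta = rng.dirichlet(0.1 * np.ones(K), size=A)     # author -> topic distributions

def generate_doc(author_ids, doc_len=100):
    """authors -> topics -> words: each word picks an author, then a topic, then a word."""
    words = []
    for _ in range(doc_len):
        x = rng.choice(author_ids)        # author responsible for this word
        z = rng.choice(K, p=theta[x])     # topic from that author's distribution
        words.append(rng.choice(V, p=phi[z]))
    return words

doc = generate_doc(author_ids=[3, 7])     # a document with two (hypothetical) authors
```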
Author-Topic Graphical Model
[Plate diagram: for each of the D documents (outer plate) with author set a, and for each of its n words (inner plate): x = Author, chosen from a; z = Topic, drawn from the Author-Topic distribution for author x; w = Word, drawn from the Topic-Word distribution for topic z]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Learning Author-Topic Models from
Text
• Full probabilistic model
– Power of statistical learning can be leveraged
– Learning algorithm is linear in the number of word occurrences (a sampling sketch follows below)
• Scalable to very large data sets
• Completely automated (no tweaking required)
– completely unsupervised, no labels
• Query answering
– A wide variety of queries can be answered:
• Which authors write on topic X?
• What are the spatial patterns in usage of topic Y?
• How have authors A, B and C changed over time?
– Queries answered using probabilistic inference
• Query time is real-time (learning is offline)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
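The slide above does not spell out the learning algorithm; as referenced there, the following is a sketch of one standard choice for the plain LDA topic model, collapsed Gibbs sampling, whose inner loop visits each word token once per sweep (hence cost linear in the number of word occurrences). A full author-topic sampler would additionally resample an author assignment per token. Hyperparameters and iteration counts are illustrative.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each word token's topic in turn."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # doc-topic counts
    n_kw = np.zeros((K, V))                # topic-word counts
    n_k = np.zeros(K)                      # topic totals
    z = []                                 # topic assignment per word token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # conditional p(z_i = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw                      # count matrices summarize the fitted model
```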
Author-Topic Models for CiteSeer
[Table condensed; the original slide lists the top words and the most probable authors (with probabilities) for four CiteSeer topics:]
• TOPIC 205 (data mining): DATA, MINING, ATTRIBUTES, DISCOVERY, ASSOCIATION, LARGE, KNOWLEDGE, DATABASES, ATTRIBUTE, DATASETS; authors: Han_J, Rastogi_R, Zaki_M, Shim_K, Ng_R, Liu_B, Mannila_H, Brin_S, Liu_H, Holder_L
• TOPIC 209 (Bayesian methods): BAYESIAN, PROBABILISTIC, PROBABILITY, CARLO, MONTE, DISTRIBUTION, INFERENCE, PROBABILITIES, CONDITIONAL, PRIOR; authors: Friedman_N, Heckerman_D, Ghahramani_Z, Koller_D, Jordan_M, Neal_R, Raftery_A, Lukasiewicz_T, Halpern_J, Muller_P
• TOPIC 289 (information retrieval): RETRIEVAL, TEXT, DOCUMENTS, INFORMATION, DOCUMENT, CONTENT, INDEXING, RELEVANCE, COLLECTION, RELEVANT; authors: Oard_D, Croft_W, Jones_K, Schauble_P, Voorhees_E, Singhal_A, Hawking_D, Merkl_D, Allan_J, Doermann_D
• TOPIC 10 (query processing): QUERY, QUERIES, INDEX, DATA, JOIN, INDEXING, PROCESSING, AGGREGATE, ACCESS, PRESENT; authors: Suciu_D, Naughton_J, Levy_A, DeWitt_D, Wong_L, Ross_K, Hellerstein_J, Lenzerini_M, Moerkotte_G, Chakrabarti_K
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Author-Profiles
• Author = Andrew McCallum, U Mass:
– Topic 1: classification, training, generalization, decision, data,…
– Topic 2: learning, machine, examples, reinforcement, inductive,…..
– Topic 3: retrieval, text, document, information, content,…
• Author = Hector Garcia-Molina, Stanford:
- Topic 1: query, index, data, join, processing, aggregate….
- Topic 2: transaction, concurrency, copy, permission, distributed….
- Topic 3: source, separation, paper, heterogeneous, merging…..
• Author = Jerry Friedman, Stanford:
– Topic 1: regression, estimate, variance, data, series,…
– Topic 2: classification, training, accuracy, decision, data,…
– Topic 3: distance, metric, similarity, measure, nearest,…
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
PubMed-Query Topics
[Table condensed; the original slide lists the top words and the most probable authors (with probabilities) for four topics from a PubMed query set:]
• TOPIC 188 (bioterrorism threat): BIOLOGICAL, AGENTS, THREAT, BIOTERRORISM, WEAPONS, POTENTIAL, ATTACK, CHEMICAL, WARFARE, ANTHRAX; authors include Atlas_RM, Tegnell_A, Aas_P, Greenfield_RA, Bricaire_F
• TOPIC 63 (history of epidemics): PLAGUE, MEDICAL, CENTURY, MEDICINE, HISTORY, EPIDEMIC, EPIDEMICS, GREAT, CHINESE, FRENCH; authors include Károly_L, Jian-ping_Z, Sabbatani_S, Theodorides_J, Bowers_JZ
• TOPIC 32 (botulism): BOTULISM, BOTULINUM, TOXIN, CLOSTRIDIUM, TYPE, NEUROTOXIN, INFANT, BONT, FOOD, PARALYSIS; authors include Hatheway_CL, Schiavo_G, Sugiyama_H, Arnon_SS, Simpson_LL
• TOPIC 85 (HIV protease inhibitors): HIV, PROTEASE, INHIBITORS, AMPRENAVIR, APV, INHIBITOR, PLASMA, DRUG, RITONAVIR, IMMUNODEFICIENCY; authors include Sadler_BM, Tisdale_M, Lou_Y, Stein_DS, Haubrich_R
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
PubMed-Query Topics
[Table condensed; four more topics from the same PubMed query set:]
• TOPIC 40 (anthrax): ANTHRACIS, ANTHRAX, BACILLUS, SPORES, CEREUS, SPORE, THURINGIENSIS, SUBTILIS, STERNE, INHALATIONAL; authors include Mock_M, Phillips_AP, Welkos_SL, Turnbull_PC, Fouet_A
• TOPIC 89 (chemical/nerve agents): CHEMICAL, SARIN, AGENT, GAS, NERVE, VX, AGENTS, TOXIC, PRODUCTS; authors include Minami_M, Hoskin_FC, Benschop_HP, Raushel_FM, Wild_JR
• TOPIC 104 (sulfur mustard exposure): HD, MUSTARD, EXPOSURE, SM, SULFUR, SKIN, EXPOSED, AGENT, EPIDERMAL, DAMAGE; authors include Monteiro-Riviere_NA, Smith_WJ, Lindsay_CD, Sawyer_TW, Meier_HL
• TOPIC 178 (enzyme catalysis): ENZYME, ACTIVE, SITE, REACTION, SUBSTRATE, SUBSTRATES, ENZYMES, FOLD, CATALYTIC, RATE; authors include Masson_P, Kovach_IM, Schramm_VL, Barak_D, Broomfield_CA
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
PubMed: Topics by Country
[Condensed from two overlapping slides; each country is summarized by its highest-probability topics and their top 10 words:]
ISRAEL, n=196 authors
• TOPIC 188 (p=0.049): BIOLOGICAL, AGENTS, THREAT, BIOTERRORISM, WEAPONS, POTENTIAL, ATTACK, CHEMICAL, WARFARE, ANTHRAX
• TOPIC 6 (p=0.045): INJURY, INJURIES, WAR, TERRORIST, MILITARY, MEDICAL, VICTIMS, TRAUMA, BLAST, VETERANS
• TOPIC 133 (p=0.043): HEALTH, PUBLIC, CARE, SERVICES, EDUCATION, NATIONAL, COMMUNITY, INFORMATION, PREVENTION, LOCAL
• TOPIC 104 (p=0.027): HD, MUSTARD, EXPOSURE, SM, SULFUR, SKIN, EXPOSED, AGENT, EPIDERMAL, DAMAGE
• TOPIC 159 (p=0.025): EMERGENCY, RESPONSE, MEDICAL, PREPAREDNESS, DISASTER, MANAGEMENT, TRAINING, EVENTS, BIOTERRORISM, LOCAL
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
PubMed-Query: Topics by Country
CHINA, n=1775 authors
• TOPIC 177 (p=0.045): SARS, RESPIRATORY, SEVERE, COV, SYNDROME, ACUTE, CORONAVIRUS, CHINA, KONG, PROBABLE
• TOPIC 7 (p=0.026): RENAL, HFRS, VIRUS, SYNDROME, FEVER, HEMORRHAGIC, HANTAVIRUS, HANTAAN, PUUMALA, HANTAVIRUSES
• TOPIC 79 (p=0.024): FINDINGS, CHEST, CT, LUNG, CLINICAL, PULMONARY, ABNORMAL, INVOLVEMENT, COMMON, RADIOGRAPHIC
• TOPIC 49 (p=0.024): STUDY, METHODS, RESULTS, CONCLUSION, OBJECTIVE, CONCLUSIONS, BACKGROUND, OBJECTIVES, INVESTIGATE, DESIGN
• TOPIC 197 (p=0.023): PATIENTS, HOSPITAL, PATIENT, ADMITTED, TWENTY, HOSPITALIZED, CONSECUTIVE, PROSPECTIVELY, DIAGNOSED, PROGNOSIS
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Extended Models
• Conditioning on non-authors
– “side-information” other than authors
– e.g., date, publication venue, country, etc
– can use citations as authors
• Fictitious authors and common author (see the sketch below)
– Allow 1 unique fictitious author per document
• Captures document specific effects
– Assign 1 common fictitious author to each document
• Captures broad topics that are used in many documents
• Semantics and syntax model
– Semantic topics = topics that are specific to certain documents
– Syntactic topics = broad, across many documents
– Probabilistic model that learns each type automatically
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
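A minimal sketch of the fictitious-author idea referenced above: before fitting the author-topic model, each document's author list is augmented with one document-specific fictitious author and one author shared by every document. The (doc_id, authors, words) record format is a hypothetical stand-in for whatever corpus representation is actually used.

```python
COMMON_AUTHOR = "__common__"   # shared by every document: absorbs broad, corpus-wide topics

def add_fictitious_authors(docs):
    """docs: list of dicts with 'doc_id' and 'authors'. Returns copies with augmented author lists."""
    augmented = []
    for doc in docs:
        fictitious = f"__doc_{doc['doc_id']}__"   # unique to this document: absorbs doc-specific effects
        augmented.append({**doc, "authors": list(doc["authors"]) + [fictitious, COMMON_AUTHOR]})
    return augmented

docs = [{"doc_id": 1, "authors": ["Smith_P"], "words": ["topic", "model"]},
        {"doc_id": 2, "authors": ["Steyvers_M", "Griffiths_T"], "words": ["gibbs", "sampling"]}]
print(add_fictitious_authors(docs))
```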
Scientific syntax and semantics
(Griffiths et al., NIPS 2004 – slides courtesy of Mark Steyvers and Tom Griffiths,
PNAS Symposium presentation, 2003)
Factorization of language based on statistical dependency patterns:
– long-range, document-specific dependencies: semantics (probabilistic topics)
– short-range dependencies, constant across all documents: syntax (probabilistic regular grammar)
[Diagram: for each word position, a document-level topic distribution generates a topic z, and z generates the word w; a hidden syntactic class x, following a probabilistic regular grammar (an HMM over word positions), determines whether the word is emitted by the topic model or by one of the syntactic word classes]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
[Worked example (built up word by word across several slides in the original):
Document topic mixture: z = 1 with probability 0.4, z = 2 with probability 0.6
Topic z = 1: HEART, LOVE, SOUL, TEARS, JOY (0.2 each)
Topic z = 2: SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS (0.2 each)
Syntactic class x = 2: OF 0.6, FOR 0.3, BETWEEN 0.1
Syntactic class x = 3: THE 0.6, A 0.3, MANY 0.1
Class x = 1 emits from the topic model; the classes are linked by HMM transitions (labeled 0.8, 0.7, 0.2, 0.9, 0.1, 0.3 in the original diagram).
Generated sequence: THE LOVE OF RESEARCH ……
A toy simulation of this example follows below.]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
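A runnable toy version of the example above. The topic mixture and the emission tables are taken from the slide, but the HMM transition matrix over classes is an assumption: the 0.8/0.7/0.2/0.9/0.1/0.3 labels in the original diagram cannot be mapped to specific edges from the transcript alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Topic mixture and topic-word tables from the slide
theta = {1: 0.4, 2: 0.6}
topics = {1: (["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          2: (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5)}

# Syntactic classes: x=1 emits from the topic model, x=2 and x=3 are function-word classes
classes = {2: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),
           3: (["THE", "A", "MANY"], [0.6, 0.3, 0.1])}

# Assumed HMM transition matrix over classes (rows/cols ordered x=1,2,3); not from the slide
T = np.array([[0.1, 0.3, 0.6],
              [0.8, 0.1, 0.1],
              [0.2, 0.1, 0.7]])

def generate(n_words=6, start_class=3):
    x, out = start_class, []
    for _ in range(n_words):
        if x == 1:  # semantic class: pick a topic, then a word from that topic
            z = rng.choice(list(theta), p=list(theta.values()))
            words, probs = topics[z]
        else:       # syntactic class: pick a function word
            words, probs = classes[x]
        out.append(rng.choice(words, p=probs))
        x = rng.choice([1, 2, 3], p=T[x - 1])   # move to the next class
    return " ".join(out)

print(generate())   # e.g. "THE LOVE OF RESEARCH ..."
```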
Semantic topics
Topic 29: AGE, LIFE, AGING, OLD, YOUNG, CRE, AGED, SENESCENCE, MORTALITY, AGES, CR, INFANTS, SPAN, MEN, WOMEN, SENESCENT, LOXP, INDIVIDUALS, CHILDREN, NORMAL
Topic 46: SELECTION, POPULATION, SPECIES, POPULATIONS, GENETIC, EVOLUTION, SIZE, NATURAL, VARIATION, FITNESS, MUTATION, PER, NUCLEOTIDE, RATES, RATE, HYBRID, DIVERSITY, SUBSTITUTION, SPECIATION, EVOLUTIONARY
Topic 51: LOCI, LOCUS, ALLELES, ALLELE, GENETIC, LINKAGE, POLYMORPHISM, CHROMOSOME, MARKERS, SUSCEPTIBILITY, ALLELIC, POLYMORPHIC, POLYMORPHISMS, RESTRICTION, FRAGMENT, HAPLOTYPE, GENE, LENGTH, DISEASE, MICROSATELLITE
Topic 71: TUMOR, CANCER, TUMORS, BREAST, HUMAN, CARCINOMA, PROSTATE, MELANOMA, CANCERS, NORMAL, COLON, LUNG, APC, MAMMARY, CARCINOMAS, MALIGNANT, CELL, GROWTH, METASTATIC, EPITHELIAL
Topic 115: MALE, FEMALE, MALES, FEMALES, SPERM, SEX, SEXUAL, MATING, REPRODUCTIVE, OFFSPRING, PHEROMONE, SOCIAL, EGG, BEHAVIOR, EGGS, FERTILIZATION, MATERNAL, PATERNAL, FERTILITY, GERM
Topic 125: MEMORY, LEARNING, BRAIN, TASK, CORTEX, SUBJECTS, LEFT, RIGHT, SONG, TASKS, HIPPOCAMPAL, PERFORMANCE, SPATIAL, PREFRONTAL, COGNITIVE, TRAINING, TOMOGRAPHY, FRONTAL, MOTOR, EMISSION
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Syntactic classes
Class 5: IN, FOR, ON, BETWEEN, DURING, AMONG, FROM, UNDER, WITHIN, THROUGHOUT, THROUGH, TOWARD, INTO, AT, INVOLVING, AFTER, ACROSS, AGAINST, WHEN, ALONG
Class 8: ARE, WERE, WAS, IS, WHEN, REMAIN, REMAINS, REMAINED, PREVIOUSLY, BECOME, BECAME, BEING, BUT, GIVE, MERE, APPEARED, APPEAR, ALLOWED, NORMALLY, EACH
Class 14: THE, THIS, ITS, THEIR, AN, EACH, ONE, ANY, INCREASED, EXOGENOUS, OUR, RECOMBINANT, ENDOGENOUS, TOTAL, PURIFIED, TILE, FULL, CHRONIC, ANOTHER, EXCESS
Class 25: SUGGEST, INDICATE, SUGGESTING, SUGGESTS, SHOWED, REVEALED, SHOW, DEMONSTRATE, INDICATING, PROVIDE, SUPPORT, INDICATES, PROVIDES, INDICATED, DEMONSTRATED, SHOWS, SO, REVEAL, DEMONSTRATES, SUGGESTED
Class 26: LEVELS, NUMBER, LEVEL, RATE, TIME, CONCENTRATIONS, VARIETY, RANGE, CONCENTRATION, DOSE, FAMILY, SET, FREQUENCY, SERIES, AMOUNTS, RATES, CLASS, VALUES, AMOUNT, SITES
Class 30: RESULTS, ANALYSIS, DATA, STUDIES, STUDY, FINDINGS, EXPERIMENTS, OBSERVATIONS, HYPOTHESIS, ANALYSES, ASSAYS, POSSIBILITY, MICROSCOPY, PAPER, WORK, EVIDENCE, FINDING, MUTAGENESIS, OBSERVATION, MEASUREMENTS
Class 33: BEEN, MAY, CAN, COULD, WELL, DID, DOES, DO, MIGHT, SHOULD, WILL, WOULD, MUST, CANNOT, REMAINED, ALSO, THEY, BECOME, MAG, LIKELY
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
(PNAS, 1991, vol. 88, 4874-4876)
A23 generalized49 fundamental11 theorem20 of4 natural46 selection46 is32
derived17 for5 populations46 incorporating22 both39 genetic46 and37 cultural46
transmission46. The14 phenotype15 is32 determined17 by42 an23 arbitrary49
number26 of4 multiallelic52 loci40 with22 two39-factor148 epistasis46 and37 an23
arbitrary49 linkage11 map20, as43 well33 as43 by42 cultural46 transmission46
from22 the14 parents46. Generations46 are8 discrete49 but37 partially19
overlapping24, and37 mating46 may33 be44 nonrandom17 at9 either39 the14
genotypic46 or37 the14 phenotypic46 level46 (or37 both39). I12 show34 that47
cultural46 transmission46 has18 several39 important49 implications6 for5 the14
evolution46 of4 population46 fitness46, most36 notably4 that47 there41 is32 a23
time26 lag7 in22 the14 response28 to31 selection46 such9 that47 the14 future137
evolution46 depends29 on21 the14 past24 selection46 history46 of4 the14
population46.
(graylevel = “semanticity”, the probability of using LDA over HMM)
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
(PNAS, 1996, vol. 93, 14628-14631)
The14 ''shape7'' of4 a23 female115 mating115 preference125 is32 the14
relationship7 between4 a23 male115 trait15 and37 the14 probability7 of4
acceptance21 as43 a23 mating115 partner20, The14 shape7 of4 preferences115
is32 important49 in5 many39 models6 of4 sexual115 selection46, mate115
recognition125, communication9, and37 speciation46, yet50 it41 has18
rarely19 been33 measured17 precisely19, Here12 I9 examine34 preference7
shape7 for5 male115 calling115 song125 in22 a23 bushcricket*13 (katydid*48).
Preferences115 change46 dramatically19 between22 races46 of4 a23 species15,
from22 strongly19 directional11 to31 broadly19 stabilizing45 (but50 with21 a23
net49 directional46 effect46), Preference115 shape46 generally19 matches10
the14 distribution16 of4 the14 male115 trait15, This41 is32 compatible29 with21
a23 coevolutionary46 model20 of4 signal9-preference115 evolution46,
although50 it41 does33 nor37 rule20 out17 an23 alternative11 model20,
sensory125 exploitation150. Preference46 shapes40 are8 shown35 to31 be44
genetic11 in5 origin7.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
References on Topic Models
•
Latent Dirichlet allocation
David Blei, Andrew Y. Ng and Michael Jordan. Journal of Machine
Learning Research, 3:993-1022, 2003.
•
Finding scientific topics
Griffiths, T., & Steyvers, M. (2004). Proceedings of the National
Academy of Sciences, 101 (suppl. 1), 5228-5235
•
Probabilistic author-topic models for information discovery
M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths, in Proceedings
of the ACM SIGKDD Conference on Data Mining and Knowledge
Discovery, August 2004.
•
Integrating topics and syntax.
Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press,
2005). In: Advances in Neural Information Processing Systems, 17.
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Summary
• State-of-the-art probabilistic text models can be
constructed from large text data sets
– Can yield better performance than other approaches like
clustering, LSI, etc
– Advantage of probabilistic approach is that a wide range of
queries can be supported by a single model
– See also recent work by Buntine and colleagues
• Learning algorithms are slow but scalable
– Linear in the number of word tokens
– Applying this type of Monte Carlo statistical learning to
millions of words was unheard of a few years ago
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Conclusion
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
NEW
“All models are wrong, but some
are useful” (G.E.P. Box)
[Diagram: Real World Data and Probabilistic Model connected by two arrows, one labeled Modeling and one labeled Learning]
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Concluding Comments
• The probabilistic approach is worthy of inclusion in a data miner’s
toolbox
– Systematic handling of missing information and uncertainty
– Ability to incorporate prior knowledge
– Integration of different sources of information
– However, not always best choice for “black-box” predictive modeling
• Graphical models in particular provide:
– A flexible and modular representational language for modeling
– Efficient and general computational inference and learning algorithms
• Many recent advances in theory, algorithms, and applications
– Likely to continue to see advances in new powerful models, more
efficient scalable learning algorithms, etc
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
Examples of New Research Directions
• Modeling and Learning
– Probabilistic Relational Models
• Work by Koller et al, Russell et al, etc.
– Conditional Markov Random Fields
• information extraction (McCallum et al)
– Dirichlet processes
• Flexible non-parametric models (Jordan et al)
– Combining discriminative and generative models
• e.g., Haussler and Jaakkola
• Applications
– Computer vision: particle filters
– Robotics: map learning
– Statistical machine translation
– Biology: learning gene regulation networks
– and many more….
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005
UPDATED
General References
•
All of Statistics: A Concise Course in Statistical Inference
L. Wasserman, Chapman and Hall, 2004
•
Bayesian Data Analysis
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Chapman and Hall, 2nd
edition, 2003.
•
Learning in Graphical Models
M. I. Jordan (ed), MIT Press, 1998
•
Graphical models
M. I. Jordan. Statistical Science (Special Issue on Bayesian Statistics), 19, 140-155, 2004.
•
The Elements of Statistical Learning : Data Mining, Inference,
and Prediction
T. Hastie, R. Tibshirani, J. H. Friedman, Springer, 2001
•
Recent Research:
– Proceedings of NIPS and UAI conferences, Journal of Machine Learning
Research
Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005