AAAI 2014 Tutorial
Latent Tree Models
Part III: Learning Algorithms
Nevin L. Zhang
Dept. of Computer Science & Engineering
The Hong Kong Univ. of Sci. & Tech.
http://www.cse.ust.hk/~lzhang
Learning Latent Tree Models
To Determine
1. Number of latent variables
2. Cardinality of each latent variable
3. Model structure
4. Probability distributions
Model selection: 1, 2, 3
Parameter estimation: 4
Light Bulb Illustration
Run interactive program “LightBulbIllustration.jar”
It illustrates the possibility of inferring latent variables and latent structures from observed co-occurrence patterns.
Part III: Learning Algorithms
Introduction
Search-based algorithms
Algorithms based on variable clustering
Distance-based algorithms
Empirical comparisons
Spectral methods for parameter estimation
Search Algorithms
A search algorithm explores the space of regular models guided by a
scoring function:
Start with an initial model
Iterate until model score ceases to increase
Modify the current model in various ways to generate a list of
candidate models.
Evaluate the candidate models using the scoring function.
Pick the best candidate model
What scoring function to use? How do we evaluate candidate models?
This is the model selection problem.
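As a concrete illustration, here is a minimal Python sketch of the generic score-guided search loop described above. The model representation, the candidate generator, and the scoring function are placeholders supplied by the caller; the toy demo at the bottom is purely hypothetical and is not part of the tutorial.

```python
# Minimal sketch of score-guided greedy search (hill climbing).
# `candidates(model)` returns the models obtained by applying search
# operators to `model`; `score(model)` is a scoring function such as BIC.
def hill_climb(initial_model, candidates, score):
    current, current_score = initial_model, score(initial_model)
    while True:
        cands = candidates(current)
        if not cands:
            return current
        best = max(cands, key=score)
        best_score = score(best)
        if best_score <= current_score:   # score ceases to increase
            return current
        current, current_score = best, best_score

# Hypothetical toy demo: "models" are integers, the score peaks at 7.
if __name__ == "__main__":
    result = hill_climb(0,
                        candidates=lambda m: [m - 1, m + 1],
                        score=lambda m: -(m - 7) ** 2)
    print(result)  # 7
```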
Model Selection Criteria
Bayesian score: posterior probability P(m|D)
P(m|D) = P(m)P(D|m) / P(D)
= P(m)∫ P(D|m, θ) P(θ |m) dθ / P(D)
BIC Score: Large sample approximation of Bayesian score
BIC(m|D) = log P(D|m, θ*) – (d/2) log N
d: number of free parameters; N: sample size.
θ*: MLE of θ, estimated using the EM algorithm.
Likelihood term of BIC: measures how well the model fits the data.
Second term: penalty for model complexity.
The use of the BIC score indicates that we are looking for a model that fits the data well and, at the same time, is not overly complex.
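A one-line helper makes the score concrete. This is a minimal sketch assuming the maximized log-likelihood, the number of free parameters, and the sample size are already available (e.g., from an EM run); the function name is mine.

```python
import math

def bic_score(loglik, num_free_params, sample_size):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) * log N."""
    return loglik - 0.5 * num_free_params * math.log(sample_size)

# Example: log-likelihood -1234.5 with d = 20 free parameters, N = 500.
print(bic_score(-1234.5, 20, 500))
```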
Model Selection Criteria
AIC (Akaike, 1974):
AIC(m|D) = log P(D|m, θ*) – d
Holdout likelihood
Data => Training set, validation set.
Model parameters estimated based on the training set.
Quality of model is measured using likelihood on the validation set.
Cross validation: too expensive
Search Algorithms
Double hill climbing (DHC) (Zhang 2002, 2004)
7 manifest variables
Single hill climbing (SHC) (Zhang and Kocka 2004)
12 manifest variables
Heuristic SHC (HSHC) (Zhang and Kocka 2004)
50 manifest variables
EAST (Chen et al. 2011)
100+ manifest variables
Double Hill Climbing (DHC)
Two search procedures
One for model structure
One for cardinalities of latent variables.
Very inefficient. Tested only on data sets with 7 or fewer variables.
(Zhang 2004)
DHC was tested on synthetic and real-world data sets with BIC, AIC, and holdout likelihood as the scoring function.
The best models were found when BIC was used.
Subsequent work is therefore based on BIC.
Single Hill Climbing (SHC)
Determines both model structure and cardinalities of latent variables
using a single search procedure.
Uses five search operators
Node Introduction (NI)
Node Deletion (ND)
Node Relocation (NR)
State Introduction (SI)
State Deletion (SD)
Node Introduction (NI)
NI involves a latent variable Y and some of its neighbors
It introduces a new node Y’ to mediate Y and the neighbors.
The cardinality of Y’ is set at |Y|
Example:
Y2 introduced to mediate Y1 and its neighbors X1 and X2
The cardinality of Y2 is set at |Y1|
Node Relocation (NR)
NR involves a latent variable Y, a neighbor Z of Y, and another
neighbor Y’ of Y that is also a latent variable.
It relocates Z from Y to Y’.
Example:
X3 is relocated from Y1 to Y2
Node Deletion
ND involves a latent variable Y, a neighbor Y’ of Y that is a latent
variables.
It remove Y, and reconnects the other neighbors of Y to Y’.
Example:
Y2 is removed w.r.t to Y1.
State Introduction/Deletion
State introduction (SI)
Increase the number of states of a latent variable by 1
State deletion (SD)
Reduce the number of states of a latent variable by 1.
Single Hill Climbing (SHC)
Start with an initial model (LCM)
At each step:
Construct all possible candidate models using NI, ND, NR, SI and SD
Evaluate them one by one
Pick the best one
Still inefficient
Tested on data with no more than 12 variables.
Reason
Too many candidate models
Too expensive to run EM on all of them
The EAST Algorithm
Scale up SHC
Idea 1: Restrict NI to involve only two neighbors of the latent variable it operates on
Reachability
How do we go from the model on the left to the model on the right under this restriction?
First apply NI, and then NR.
Each NI operation is followed by NR operations to compensate for the restriction on NI.
Idea 2: Reducing the Number of Candidate Models
Do not use ALL the operators at once.
How?
BIC: BIC(m|D) = log P(D|m, θ*) – (d/2) log N
Improve the two terms alternately:
NI and SI improve the likelihood term:
Let m' be obtained from m using NI or SI.
Then m' includes m, and hence has a higher maximized likelihood:
log P(D|m', θ'*) >= log P(D|m, θ*)
SD and ND reduce the penalty term.
The EAST Algorithm
(Chen et al. AIJ 2011)
1. Start with a simple initial model
2. Repeat until model score ceases to improve
Expansion:
Search with node introduction (NI) and state introduction (SI)
Each NI operation is followed by NR operations to compensate for the restriction on NI (see the Reachability slide above).
Adjustment:
Search with NR
Simplification:
Search with node deletion (ND), and state deletion (SD)
EAST: Expansion, Adjustment, Simplification until Termination
Idea 3: Parameter Value Inheritance
m : current model;
m’ : candidate model generated by applying a search operator on m.
The two models share many parameters
m: ( θ1, θ2);
m’: ( θ1, λ2);
When evaluating m', inherit the values of the shared parameters θ1 from m, and estimate only the new parameters λ2:
λ2* = arg max_{λ2} log P(D|m', θ1, λ2)
Avoid Local Optimum at the Expansion Phase
NI: Increases structure complexity.
SI: Increases variable complexity.
Key Issue at the expansion phase:
Tradeoff between structure complexity and variable complexity
Operation Granularity
NI and SI are of different granularities
p = 100
SI: 101 more parameters
NI: 2 more parameters
Huge disparity in granularity
The penalty term in BIC is insufficient to handle it:
SI is always preferred initially,
leading to a quick increase in variable complexity
and hence to a local optimum in model score.
Dealing with Operation Granularity
EAST does not use BIC when choosing between candidate models
produced by NI and SI.
Instead, it uses the cost-effectiveness principle:
select the candidate model with the highest improvement ratio,
i.e., the increase in model score per unit increase in model complexity.
The denominator is larger for operations that increase the number of model parameters more.
Can be justified using Likelihood Ratio Test (LRT)
It picks the candidate model that gives the strongest evidence to
reject the null model (the current model) according to LRT.
Likelihood Ratio Test (LRT)
(Wikipedia)
The alternative model includes the null model, and hence fits the data better than the null model.
Whether it fits significantly better is determined by the p-value of the statistic D (twice the difference in maximized log-likelihoods), which approximately follows a chi-squared distribution with d2 – d1 degrees of freedom.
Likelihood Ratio Test (LRT)
The required value of D for a given p-value increases roughly linearly with d2 – d1.
The ratio D/(d2 – d1) is therefore closely related to the p-value:
it is a measure of the strength of evidence in favor of the alternative model.
Likelihood Ratio Test & Improvement Ratio
The improvement ratio decomposes into two terms: the second term, –(1/2) log N, is constant across candidate models, while the first term is exactly (1/2) · D / (d2 – d1).
Loosely speaking, the cost-effectiveness principle picks the
candidate model that gives the strongest evidence to reject the null
model (the current model) according to LRT.
Search Process on Danish Beer Data
EAST used on medical
survey data
A few dozen variables
Hundreds to thousands of observations
Model quality important
(Xu et al 2013)
Part III: Learning Algorithms
Introduction
Search-based algorithms
Algorithms based on variable clustering
Distance-based algorithms
Empirical comparisons
Spectral methods for parameter estimation
Algorithm based on Variable Clustering
Key Idea
Group variables into clusters
Introduce a latent variable for each cluster
For discrete variables, mutual information is used as similarity
measure
Algorithms
BIN-G: Harmeling & Williams, PAMI 2011
Bridged-islands (BI) algorithm: Liu et al. MLJ, 2013
The BIN-G Algorithm
Learns binary tree models
L ← all observed variables
Loop:
Remove from L the pair of variables with the highest mutual information
Introduce a new latent variable for the pair
Add the new latent variable to L
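A rough Python sketch of this agglomeration loop, under strong simplifying assumptions: the data are discrete numpy columns, and the EM/BIC step that BIN-G uses to learn each new latent variable is replaced here by a crude placeholder that imputes the joint configuration of the merged pair. Only the greedy pairing structure follows the description above.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical MI (in nats) between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def bin_g_skeleton(columns):
    """columns: dict name -> 1-D int array. Returns the list of merge steps."""
    active, merges = dict(columns), []
    while len(active) > 1:
        # pick the pair with highest mutual information
        i, j = max(combinations(active, 2),
                   key=lambda p: mutual_information(active[p[0]], active[p[1]]))
        # PLACEHOLDER for BIN-G's latent variable: in the real algorithm an LCM
        # is learned by EM, its cardinality chosen by BIC, and latent values are
        # imputed; here we simply use the joint configuration of the pair.
        xi, xj = active.pop(i), active.pop(j)
        new_name = f"H({i},{j})"
        active[new_name] = xi * (xj.max() + 1) + xj
        merges.append((i, j, new_name))
    return merges

# Tiny synthetic demo with four binary variables.
rng = np.random.default_rng(0)
h = rng.integers(0, 2, 1000)
data = {"X1": h ^ (rng.random(1000) < 0.1),
        "X2": h ^ (rng.random(1000) < 0.1),
        "X3": rng.integers(0, 2, 1000),
        "X4": rng.integers(0, 2, 1000)}
print(bin_g_skeleton(data))
```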
Two Issues
Learn LCM: Cardinality of new latent variable and parameters
Let |H1|=1 and increase it gradually until termination
At each step, run EM to optimize model parameters and
calculate the BIC score
Return LCM with highest BIC score
Determine MI between new latent variable and others
Convert the new latent variable into an observed variable via imputation (hard assignment)
Then calculate MI(H1; X3), MI(H1; X4)
NOTE: If some latent variables have cardinality 1, they can be removed from the model, resulting in a forest instead of a tree.
Result of BIN-G on subset of 20 Newsgroups Dataset
The BI Algorithm
Learns non-binary trees.
Partitions all observed variables into clusters, with some clusters
having >2 variables
Introduces a latent variable for each variable cluster
Links up the latent variables to get a global tree model
The result is a flat latent tree model in the sense that each latent variable is directly connected to at least one observed variable.
BI Step 1: Partition the Observed Variables
Identify a cluster of variables such that,
Variables in the cluster are closely correlated, and
The correlations can be properly modeled using a latent variable.
Uni-Dimensional (UD) cluster
Remove the cluster and repeat the process.
Eventually obtain a partition of the observed variables.
Obtaining First Variable Cluster
Sketch of algorithm for identifying first variable cluster
L ← all observed variables
S ← pair of variables with highest mutual information
Loop:
X ← variable in L with highest MI with S
S ← S ∪ {X}, L ← L \ {X}
Perform the uni-dimensionality test on S;
if the test fails, stop the loop and pick the first cluster of variables.
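A rough Python sketch of this growing loop. Two pieces are placeholders not specified here: the MI between a variable and the set S is approximated by the maximum pairwise MI (the BI paper may define it differently), and the UD-test is passed in as a callback, since the real test requires learning latent tree models (e.g., with EAST).

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical MI (in nats) between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def grow_first_cluster(columns, ud_test):
    """columns: dict name -> 1-D int array.
    ud_test(S) -> (passed, cluster): placeholder for the real UD-test,
    which compares the best 1-latent and 2-latent LTMs."""
    names = list(columns)
    i, j = max(combinations(names, 2),              # seed: closest pair
               key=lambda p: mutual_information(columns[p[0]], columns[p[1]]))
    S, L = [i, j], [n for n in names if n not in (i, j)]
    while L:
        # variable in L closest to S; here max pairwise MI, a simplification
        x = max(L, key=lambda n: max(mutual_information(columns[n], columns[s])
                                     for s in S))
        S.append(x)
        L.remove(x)
        passed, cluster = ud_test(S)
        if not passed:
            return cluster       # first uni-dimensional variable cluster
    return S

# Hypothetical stand-in UD-test: "fails" once S has 4 variables and keeps the
# first three; the real test learns latent tree models and compares scores.
demo_ud_test = lambda S: (len(S) < 4, S[:3])

rng = np.random.default_rng(0)
h = rng.integers(0, 2, 500)
cols = {f"X{i}": h ^ (rng.random(500) < 0.2) for i in range(1, 6)}
print(grow_first_cluster(cols, demo_ud_test))
```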
Uni-Dimensionality (UD) Test
Test whether the correlations among variables in a set S can be properly
modeled using a single latent variable
Example: S={X1, X2, X3, X4, X5}
Learn two models
m1: Best LCM, i.e., LTM with one latent variable
m2: Best LTM with two latent variables
Can be done using EAST
The UD-test passes if and only if m2 is not significantly better than m1:
if the use of two latent variables does not give a significantly better model, then the use of one latent variable is appropriate.
Bayes Factor
(Wikipedia)
The Bayes factor is K = P(D|M2) / P(D|M1).
Unlike a likelihood-ratio test, the models do not need to be nested.
The strength of evidence in favor of M2 depends on the value of K.
Bayes Factor and UD-Test
(Wikipedia)
The statistic U = 2 [BIC(m2|D) – BIC(m1|D)] is a large sample approximation of 2 ln K.
The strength of evidence in favor of two latent variables depends on U.
In the UD-test, a threshold on U is set in advance:
conclude a single latent variable if there is no strong evidence for more than one latent variable.
UD-Test and Variable Cluster
Initially, S={X1, X2}
X3, X4 added to S, and UD-test passes
Next add X5
S = {X1, X2, X3, X4, X5},
m2 is significantly better than m1
UD-test fails
m2 gives two uni-dimensional (UD) clusters
The first variable cluster is: {X1, X2, X4}
Picked because it contains the initial variables X1 and X2
BI Step 2: Latent Variable Introduction
Introduce a latent variable for each UD variable cluster.
Optimize the cardinalities of the latent variables and the parameters.
BI Step 3: Link up Latent Variables
Bridging the "islands" using the Chow-Liu algorithm (1968):
Estimate the joint distribution of each pair of latent variables Y and Y', where m and m' are the LCMs that contain Y and Y' respectively.
Calculate MI(Y; Y').
Find the maximum spanning tree with respect to the MI values.
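A minimal Python sketch of the bridging step, assuming the pairwise MI values between latent variables have already been computed; it builds the maximum spanning tree with a simple Kruskal-style pass (any MST routine with negated weights would do equally well). The variable names are hypothetical.

```python
def maximum_spanning_tree(nodes, mi):
    """nodes: list of latent variable names; mi: dict {(u, v): MI value}.
    Returns the edges of the maximum spanning tree (Kruskal with union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for (u, v), _ in sorted(mi.items(), key=lambda e: e[1], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Hypothetical MI values between four latent "islands".
mi = {("Y1", "Y2"): 0.30, ("Y1", "Y3"): 0.05, ("Y2", "Y3"): 0.20,
      ("Y2", "Y4"): 0.02, ("Y3", "Y4"): 0.15}
print(maximum_spanning_tree(["Y1", "Y2", "Y3", "Y4"], mi))
```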
BI Step 4: Global Adjustment
Improvement based on global considerations:
Run EM to optimize the parameters of the whole model.
For each latent variable Y and each observed variable X, calculate the joint distribution of Y and X under the model.
Re-estimate MI(Y; X) based on the above distribution.
Let Y* be the latent variable with the highest MI(Y*; X).
If Y* is not currently the neighbor of X, make it so.
Result of BI on subset of 20 Newsgroups Dataset
Part III: Learning Algorithms
Introduction
Search-based algorithms
Algorithms based on variable clustering
Distance-based algorithms
Empirical comparisons
Spectral methods for parameter estimation
Distance-Based Algorithms
Define distances between variables that are additive over trees.
Estimate the distances between observed variables from data.
Infer model structure from those distance estimates.
Assumptions:
Latent variables have equal cardinality, and it is known.
In some cases, it equals the cardinality of the observed variables.
Or, all variables are continuous.
Focus on two algorithms
Recursive grouping (Choi et al., JMLR 2011)
Neighbor Joining (Saitou & Nei, 1987, Studier and Keppler, 1988)
Slides based on Choi et al, 2010:
www.ece.nus.edu.sg/stfpage/vtan/latentTree_slides.pdf
Information Distance
Information distance between two discrete variables Xi and Xj (Lake 1994):
d_ij = – log ( |det P_XiXj| / sqrt( det diag(P_Xi) · det diag(P_Xj) ) )
where P_XiXj is the joint probability matrix and diag(P_Xi), diag(P_Xj) are diagonal matrices of the marginals.
When both variables are binary, d_ij = – log |ρ_ij|, where ρ_ij is the correlation coefficient.
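A small numpy sketch of this distance, following the determinant-based definition reconstructed above (a sketch, not the authors' code); for a pair of binary variables it reduces to – log of the absolute correlation.

```python
import numpy as np

def information_distance(joint):
    """joint: square 2-D array with joint[a, b] = P(Xi = a, Xj = b)."""
    pi = joint.sum(axis=1)            # marginal of Xi
    pj = joint.sum(axis=0)            # marginal of Xj
    num = np.abs(np.linalg.det(joint))
    den = np.sqrt(np.prod(pi) * np.prod(pj))   # det of diagonal marginal matrices
    return -np.log(num / den)

# Example: two correlated binary variables (correlation 0.6).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(information_distance(joint))   # equals -log(0.6) for this binary pair
```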
Additivity of Information Distance on Trees (Erdos, Szekely, Steel, & Warnow, 1999):
For any two nodes k and l in a latent tree, d_kl = Σ_{(i,j) in Path(k, l)} d_ij, the sum of the distances along the edges on the path between k and l.
Testing Node Relationships
If j is a leaf node and i is its parent, then for every other node k the difference
d_jk – d_ik = d_ij.
It does not change with k: it is a constant.
The equality does not hold if j is not a leaf or i is not the parent of j.
Testing Node Relationships
If i and j are both leaf nodes and siblings (they share the same parent), then the difference d_jk – d_ik also does not change with k:
it is a constant lying strictly between –d_ij and d_ij.
This property allows us to determine leaf nodes that are siblings.
The Recursive Grouping (RG) Algorithm
RG is an algorithm that determines model structure using the two
properties mentioned earlier.
Explain RG with an example
Assume data generated by the following model
Data contain no information about latent variables
Task: Recover model structure
Recursive Grouping
Step 1: Estimate from data the information distance between each
pair of observed variables.
Step 2: Identify (leaf, parent-of-leaf) pairs and (leaf-sibling) pairs.
For each pair i, j, check whether d_jk – d_ik = c, a constant, for all k ≠ i, j.
If c = d_ij, then j is a leaf and i is its parent.
If –d_ij < c < d_ij, then i and j are leaves and siblings.
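A small numpy sketch of this relationship test, using the conventions above; it assumes exact, noise-free additive distances, whereas real data would need tolerance thresholds estimated from the sample.

```python
import numpy as np

def classify_pair(d, i, j, tol=1e-9):
    """d: symmetric matrix of pairwise information distances between observed
    nodes. Returns 'parent(i->j)', 'siblings', or 'neither' for the pair (i, j)."""
    others = [k for k in range(len(d)) if k not in (i, j)]
    diffs = np.array([d[j, k] - d[i, k] for k in others])
    if np.ptp(diffs) > tol:                 # not constant across k
        return "neither"
    c = diffs[0]
    if abs(c - d[i, j]) <= tol:
        return "parent(i->j)"               # j is a leaf, i is its parent
    if -d[i, j] < c < d[i, j]:
        return "siblings"                   # i and j are leaves with the same parent
    return "neither"

# Toy star tree: a hidden node connected to leaves 0, 1, 2 by unit-length edges,
# so all pairwise distances are 2 and the distances are additive by construction.
d = np.array([[0.0, 2.0, 2.0],
              [2.0, 0.0, 2.0],
              [2.0, 2.0, 0.0]])
print(classify_pair(d, 0, 1))   # 'siblings'
```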
Recursive Grouping
Step 3: Introduce a hidden parent node for each sibling group without
a parent.
NOTE: No need to determine model parameters here.
Recursive Grouping
Step 4. Compute the information distance for new hidden nodes.
Recursive Grouping
Step 5. Remove the identified child nodes and repeat Steps 2-4.
Parameters of the final model can be determined using EM if needed.
CL Recursive Grouping (CLRG)
Making Recursive Grouping more efficient
Step 1: Construct Chow-Liu tree over observed variables only based
on information distance, i.e. maximum spanning tree
CL Recursive Grouping (CLRG)
Step 2: Select an internal node and its neighbors, and apply the
recursive-grouping (RG) algorithm. (Much cheaper)
Step 3: Replace the sub-tree spanning the neighborhood with the output of RG.
CL Recursive Grouping (CLRG)
Repeat Steps 2-3 until all internal nodes are operated on.
Theorem: Both RG and CLRG are consistent
Result of CLRG on subset of 20 Newsgroups Dataset
An Extension
Choi et al 2011 assume all observed and latent variables have equal cardinality
So that information distance can be defined.
Assumption relaxed by (Song et al, 2011; Wang et al 2013):
Latent variables can have fewer states than observed,
But they still need to have equal cardinality among themselves.
New definition of information distance based on the pseudo-determinant, where σ_s(A) denotes the s-th largest singular value of the matrix A and k is the rank of the joint probability matrix P_XY.
Information Distance for Continuous Variables
Gaussian distributions: d_ij = – log |ρ_ij|, where ρ_ij is the correlation coefficient between Xi and Xj.
Non-Gaussian distributions (Song et al., NIPS 2011): distances obtained via kernel embeddings.
Neighbor Joining
Another method to infer model structure using tree-additive distances
Originally developed for phylogenetic trees (Saitou & Nei, 1987)
Key observation:
The closest pair might not be siblings:
d_AB may be smallest even though leaves A and B are not neighbors.
Neighbor Joining
Let L be the set of leaf nodes.
Define r_i = 1/(|L| – 2) Σ_{k in L} d_ik
and D_ij = d_ij – (r_i + r_j).
Theorem: the pair of leaves with minimal D_ij are siblings.
Neighbor Joining
Initialization:
Define T to be the set of leaf nodes.
Make the list L of active nodes = T.
Loop until |L| = 2:
Find the two nodes i and j for which D_ij is minimal.
Combine them to form a new node k, and set
d_km = 1/2 (d_im + d_jm – d_ij) for all m in L,
d_ik = 1/2 (d_ij + r_i – r_j),
d_jk = d_ij – d_ik.
Remove i and j from L and add node k.
The result is a binary tree.
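A compact Python sketch of this procedure (a sketch only: it follows the update rules on this slide, keeps distances in a dict-of-dicts, omits branch lengths, and simply joins the last two active nodes at the end).

```python
def neighbor_joining(dist):
    """dist: dict of dicts with dist[i][j] = additive distance between leaves.
    Returns a list of edges (node, node) of the reconstructed binary tree."""
    d = {i: dict(row) for i, row in dist.items()}
    active = list(d)
    edges, counter = [], 0
    while len(active) > 2:
        n = len(active)
        r = {i: sum(d[i][m] for m in active if m != i) / (n - 2) for i in active}
        # pick the pair minimizing D_ij = d_ij - (r_i + r_j)
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"U{counter}"
        counter += 1
        d[k] = {}
        for m in active:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        edges.append((k, i))   # branch lengths 0.5*(d_ij + r_i - r_j) etc. not stored
        edges.append((k, j))
        active = [m for m in active if m not in (i, j)] + [k]
    edges.append(tuple(active))          # join the last two active nodes
    return edges

# Toy additive distances over four leaves: (A,B) and (C,D) are sibling pairs.
dist = {"A": {"B": 2, "C": 4, "D": 4},
        "B": {"A": 2, "C": 4, "D": 4},
        "C": {"A": 4, "B": 4, "D": 2},
        "D": {"A": 4, "B": 4, "C": 2}}
print(neighbor_joining(dist))
```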
Example
CL Neighbor Joining (CLNJ)
Making Neighbor Joining more efficient
Step 1: Construct Chow-Liu tree over observed variables only based
on information distance, i.e. maximum spanning tree
Step 2: Select an internal node and its neighbors, and apply the NJ
algorithm. (Much cheaper)
Step 3: Replace the sub-tree spanning the neighborhood with the output of NJ.
Repeat Steps 2-3 until all internal nodes are operated on.
Result of CLNJ on subset of 20 Newsgroups Dataset
Quartet-Based Algorithms
Originally developed for phylogenetic tree reconstruction (John et al, Journal of
Algorithms, 2003)
Idea:
First determine the structures among quartets (groups of 4) of observed variables
Then combine the results to obtain a global model of all the observed variables.
If two nodes are not siblings in quartet model, they cannot be siblings in the global
model.
For general LTMs: Chen et al, PGM 2006, Anandkumar et al, NIPS 2011, Mossel et al, 2011.
Part III: Learning Algorithms
Introduction
Search-based algorithms
Algorithms based on variable clustering
Distance-based algorithms
Empirical comparisons
Spectral methods for parameter estimation
Empirical Comparisons
Algorithms compared
Search-Based Algorithm:
EAST (Chen et al, AIJ 2011)
Variable Clustering-Based Algorithms
BIN (Harmeling & Williams, PAMI 2011)
BI (Liu et al. MLJ, 2013)
Distance-Based Algorithms
CLRG (Choi et al, JMLR 2011)
CLNJ (Saitou & Nei, 1987)
Data
Synthetic data
Real-world data
Measurements
Running time
Model quality
Generative Models
4-complete model (M4C):
Every latent node has 4 neighbors
All variables are binary
Parameter values randomly generated such that the normalized MI between each pair of neighbors is between 0.05 and 0.75.
Generative Models
M4CF: Obtained from M4C
More variables added such that each latent node has 3 observed neighbors.
It is a flat model.
Other models and the total number of observed variables:
Synthetic Data and Evaluation Criteria
Synthetic Data
Training: 5,000; Testing: 5,000
No information on latent variables
Evaluation Criteria: Distribution
m0 : generative model; m : learned model
Empirical KL divergence between m0 and m on the testing data: the smaller the better.
Its second term is the hold-out likelihood of m: the larger the better.
Evaluation Criteria: Structure
Example: For the two models on the right
m0
m
Y2-Y1: X1X2X3 | X4X5X6X7
Y1-Y3: X1X2X3X4 | X5X6X7
X1-Y2: X1 | X2X3X4X5X6X7
..
Y2-Y1: X1X2 | X3X4X5X6X7
Y1-Y3: X1X2X3X4 | X5X6X7
X1-Y2: X1 | X2X3X4X5X6X7
..
dRF = (1 + 1)/2 = 1
Not defined for forests
Running Times (Seconds)
EAST was too slow on data sets with more than 100 attributes.
CLRG was the fastest, followed by CLNJ, BIN and BI.
Model Quality
Flat generative models
Non-flat generative models
EAST found best models on data with dozens of attributes
BI found best models on data with hundreds of attributes.
BIN is the worst. (No RF values because it produces forests, not trees.)
When latent variables have different cardinalities …
Make latent variables have different cardinalities in generative models
3 for those at levels 1 and 3
2 for those at level 2.
All algorithms perform worse than before.
EAST and BI still found best models.
CLRG and CLNJ especially bad on M7CF1.
They assume all latent variables have equal cardinality.
Real-World Data Sets
Data
Evaluation criteria:
BIC score on the training set: a measure of model fit
Log-likelihood on the testing set (hold-out likelihood): measures how well the learned model predicts unseen data.
Running Times (seconds)
CLNJ and CLRG not applicable to Coil-42 and Alarm because
different attributes have different cardinalities.
EAST did not finish on News-100 and WebKB within 60 hours
CLRG was the fastest, followed by CLNJ, BIN and BI.
Model Quality
EAST and BI found best models. BIN found the worst.
Structures of models obtained on News-100 by BIN, BI, CLRG, and
CLNJ are shown earlier.
BI introduced fewer latent variables. The model is more “compact”.
Summary
EAST found best models on data sets it could manage
With < 100 observed variables, hundreds to thousands of observations
BI found best models on data sets with hundreds of observed
variables, and was slower than BIN, CLRG, and CLNJ.
BIN found the worst models.
CLRG and CLNJ are not applicable when the observed variables do NOT have equal cardinality.
Part III: Learning Algorithms
Introduction
Search-based algorithms
Algorithms based on variable clustering
Distance-based algorithms
Empirical comparisons
Spectral methods for parameter estimation
Parameter Estimation
The Expectation-Maximization (EM) algorithm (Dempster et al 1977)
Start with initial guess
Iterate until termination
Improve the current parameter values by maximizing the expected (complete-data) log-likelihood
Can be trapped at local maxima
Spectral methods (Anandkumar et al., 2012)
Get empirical distributions of 2 or 3 observed variables
Relate them to model parameters
Determine model parameters accordingly
Need large sample size for robust estimates.
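To make the EM recipe above concrete, here is a minimal numpy sketch of EM for the simplest latent tree model, a latent class model with one latent variable and several discrete observed variables. It is a toy sketch under my own simplifications: fixed latent cardinality, a single random start, and a fixed iteration cap instead of a convergence test.

```python
import numpy as np

def em_lcm(data, card_latent, card_obs, n_iter=100, seed=0):
    """EM for a latent class model: one latent H, M observed discrete variables.
    data: (N, M) int array. Returns (pi, theta) with pi[k] = P(H=k) and
    theta[m, k, c] = P(X_m = c | H = k)."""
    rng = np.random.default_rng(seed)
    N, M = data.shape
    pi = np.full(card_latent, 1.0 / card_latent)
    theta = rng.dirichlet(np.ones(card_obs), size=(M, card_latent))

    def log_joint(pi, theta):
        # log P(H=k, x_n) for every case n and latent state k, shape (N, K)
        lp = np.tile(np.log(pi), (N, 1))
        for m in range(M):
            lp += np.log(theta[m][:, data[:, m]]).T
        return lp

    for _ in range(n_iter):
        # E-step: posterior over the latent variable for each data case
        lp = log_joint(pi, theta)
        q = np.exp(lp - lp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        pi = q.mean(axis=0)
        for m in range(M):
            counts = np.stack([q[data[:, m] == c].sum(axis=0)
                               for c in range(card_obs)], axis=1)  # (K, C)
            counts = counts + 1e-6          # small smoothing to avoid zeros
            theta[m] = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

# Toy demo: two latent classes generating three noisy binary observations.
rng = np.random.default_rng(1)
h = rng.integers(0, 2, 2000)
data = np.stack([h ^ (rng.random(2000) < 0.15) for _ in range(3)], axis=1)
pi, theta = em_lcm(data, card_latent=2, card_obs=2)
print(np.round(pi, 2))        # roughly [0.5, 0.5]
print(np.round(theta[0], 2))  # emission probabilities near 0.85/0.15, up to label swap
```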
Markov Random Field over Graphs
An MRF:
An undirected graph
Potentials: non-negative functions associated with edges and hyper-edges
Eliminating a variable X in an MRF:
Multiply all potentials that involve X
Obtain a new potential by eliminating (summing out) X from the product
Example: elimination of B and E.
Matrix Representation of Potentials and Elimination
Lowercase letters denote values of variables; e.g., a and b are values of A and B.
Use φ(a, b) to denote the value of a potential φ(A, B) at A = a, B = b.
Use Φ_AB to denote the matrix [φ(a, b)] whose entry at the a-th row and b-th column is φ(a, b).
Elimination can then be rewritten as matrix multiplication.
Matrix Representation of Potentials and Elimination
Use φ(a, b) to denote a potential with one argument fixed, and use Φ_aB to denote the column vector whose b-th entry is φ(a, b).
Similar notations are defined for the other potentials.
Then the elimination equations can be rewritten in matrix-vector form.
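A tiny numpy illustration of the idea that eliminating a variable shared by two pairwise potentials is exactly a matrix product (the notation and the numbers here are mine, not the slide's):

```python
import numpy as np

# Pairwise potentials phi(A, B) and psi(B, C) as matrices indexed by (a, b) and (b, c).
phi_AB = np.array([[1.0, 2.0],
                   [0.5, 1.5]])
psi_BC = np.array([[0.2, 0.8],
                   [0.9, 0.1]])

# Eliminating B: sum_b phi(a, b) * psi(b, c) is a matrix multiplication.
tau_AC = phi_AB @ psi_BC
print(tau_AC)

# Sanity check against the explicit sum.
explicit = np.array([[sum(phi_AB[a, b] * psi_BC[b, c] for b in range(2))
                      for c in range(2)] for a in range(2)])
assert np.allclose(tau_AC, explicit)
```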
Spectral Method for Parameter Estimation
Equality (Anandkumar et al., 2012)
H: latent; A, B, C: Observed
All variables have equal cardinality
For a given value b of B,
Elements of diagonal matrix:
P(B=b|H=1), P(B=b|H=2),…
Notes
The matrices on the right hand side can be estimated from data
The eigenvalues of the matrix on the left are model parameters:
P(B=b|H=1), P(B=b|H=2),…
Spectral Method for Parameter Estimation
Parameter estimation:
Get empirical distributions P(A, B, C) and P(A, C) from data
For each value b of B, form matrix on the right hand side
Find the eigenvalues of the matrix. (Spectral method)
They are elements of Pb|H, or P(B=b|H=1), P(B=b|H=2),…
Notes
Similarly, we can estimate the other parameters.
The technique can be used on LTMs with more than one latent variable and where the observed variables have higher cardinality than the latent variables.
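A numpy sketch of this eigenvalue computation on a small synthetic three-variable model H → A, B, C with known parameters (all cardinalities 2), so the recovered eigenvalues can be checked against the true P(B=b|H=h). The specific matrix form used here, P(A, B=b, C) P(A, C)^(-1), is my reading of the equality referenced above.

```python
import numpy as np

# True parameters of a toy model: H -> A, B, C, all binary.
pi = np.array([0.4, 0.6])                       # P(H)
A = np.array([[0.9, 0.2], [0.1, 0.8]])          # A[a, h] = P(A=a | H=h)
B = np.array([[0.7, 0.3], [0.3, 0.7]])          # B[b, h] = P(B=b | H=h)
C = np.array([[0.8, 0.1], [0.2, 0.9]])          # C[c, h] = P(C=c | H=h)

# Joint distributions that, in practice, would be estimated from data.
P_ABC = np.einsum('h,ah,bh,ch->abc', pi, A, B, C)   # P(A=a, B=b, C=c)
P_AC = P_ABC.sum(axis=1)                            # P(A=a, C=c)

# For each value b of B, the eigenvalues of P(A, B=b, C) P(A, C)^(-1)
# are the parameters P(B=b | H=h), up to the ordering of the latent states.
for b in range(2):
    M_b = P_ABC[:, b, :]
    eigvals = np.linalg.eigvals(M_b @ np.linalg.inv(P_AC))
    print(b, np.sort(eigvals.real), np.sort(B[b]))   # the two lists should match
```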
Derivation of Equation (1)
Start with generalized MRF
Some potentials might have negative values
Eliminate C, and then H’
Further eliminate H, we get P(A, B, A’).
For a value b of B,
Derivation of Equation (1)
Start with MRF
Eliminate H and H’
Let B=b
Next, eliminate C.
Putting together, we get
Another Equation of Similar Flavor
In general, we cannot determine
P(A, B, C, D) from P(A, B, D), P(B, D), P(B, C, D).
Possible in LTMs
(Parikh et al., 2011)
Can be used to estimate joint probabilities without explicit parameter estimation:
Get empirical distributions of 2 or 3 observed variables from data.
Calculate the joint probability of a particular value assignment of ALL the observed variables from them using the relationship.
Matrix Representation of Potentials
Observed: A, B, C, D
Latent: H, G
All variables have equal cardinality.
Transformations
P(A, B, C, D) not changed during transformations
Transformations
Eliminate: H’, G’, and {H, G}
From the last MRF, we obtain the result:
the joint of the 4 observed variables is determined from joints of 3 and 2 observed variables.
Notes
The technique can be used
on trees with more than 4 observed variables, and
when the observed variables have higher cardinality than the latent variables.
Summary
Equation that relates model parameters to joint distributions of 2 or 3
observed variables
Equation that relates joint distributions of 4 or more observed
variables to joint distributions of 2 or 3 observed variables.
Need a large sample size for robust results