Discriminative Probabilistic Models for Relational Data
Statistical Learning from Relational Data
Daphne Koller
Stanford University
Joint work with many many people
Relational Data is Everywhere
The web: webpages (& the entities they represent), hyperlinks
Social networks: people, institutions, friendship links
Biological data: genes, proteins, interactions, regulation
Bibliometrics: papers, authors, journals, citations
Corporate databases: customers, products, transactions
Relational Data is Different
Data instances are not independent: topics of linked webpages are correlated
Data instances are not identically distributed: heterogeneous instances (papers, authors)
No IID assumption; this is a good thing
New Learning Tasks
Collective classification of related instances (e.g., labeling an entire website of related webpages)
Relational clustering (e.g., finding coherent clusters in the genome)
Link prediction & classification (e.g., predicting when two people are likely to be friends)
Pattern detection in networks of related objects (e.g., finding research groups or terrorist groups)
Probabilistic Models
Uncertainty model: a space of "possible worlds" and a probability distribution over this space
Worlds: often defined via a set of state variables
medical diagnosis: diseases, symptoms, findings, …
each world: an assignment of values to variables
Number of worlds is exponential in # of vars: 2^n if we have n binary variables
Outline
Relational Bayesian networks*
Relational Markov networks
Collective Classification
Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor
Bayesian Networks
nodes = variables
edges = direct influence
[Figure: student network with nodes Difficulty, Intelligence, Grade, SAT, Job; the CPD P(G|D,I) gives a distribution over grades A/B/C for each of (easy,low), (easy,high), (hard,low), (hard,high).]
Graph structure encodes independence assumptions:
Letter conditionally independent of Intelligence given Grade
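To make the factorization concrete, here is a minimal sketch of the Difficulty/Intelligence/Grade fragment of this student network in Python (SAT and Job omitted); all CPD numbers are invented for illustration, not taken from the talk.

```python
# Student-network fragment: P(D, I, G) = P(D) * P(I) * P(G | D, I).
# All probability values below are made up for illustration.
P_D = {"easy": 0.6, "hard": 0.4}                      # P(Difficulty)
P_I = {"low": 0.7, "high": 0.3}                       # P(Intelligence)
P_G = {                                               # P(Grade | Difficulty, Intelligence)
    ("easy", "low"):  {"A": 0.30, "B": 0.40, "C": 0.30},
    ("easy", "high"): {"A": 0.90, "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.70},
    ("hard", "high"): {"A": 0.50, "B": 0.30, "C": 0.20},
}

def joint(d, i, g):
    """Chain rule: the BN factorizes the joint over all 2 * 2 * 3 = 12 worlds."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g]

# Summing over every possible world recovers total probability 1.
total = sum(joint(d, i, g) for d in P_D for i in P_I for g in "ABC")
```

This also illustrates the "exponentially many worlds" point from the earlier slide: the joint is never stored as a table over all worlds, only as a product of small CPDs.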
Bayesian Networks: Problem
Bayesian nets use a propositional representation
The real world has objects, related to each other
[Figure: ground variables Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101, Grade_George_CS101, Grade_George_Geo101; these "instances" are not independent.]
Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
Classes & attributes:
- Professor: Teaching-Ability
- Course: Difficulty
- Student: Intelligence
- Registration: Grade, Satisfaction
Relations: Teach, Take, In
St. Nordaf University
[Figure: example world: Prof. Smith and Prof. Jones (Teaching-Ability) teach courses Geo101 and CS101 (Difficulty); students George and Jane (Intelligence) are registered in courses, with a Grade and Satisfaction for each registration.]
Relational Bayesian Networks
Universals: Probabilistic patterns hold for all objects in class
Locality: Represent direct probabilistic dependencies
Links define potential interactions
Links define potential interactions
[Figure: template model linking Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, Reg.Grade, and Reg.Satisfaction; the grade CPD gives a distribution over A/B/C for each (easy/hard, low/high) combination.]
[K. & Pfeffer; Poole; Ngo & Haddawy]
RBN Semantics
Ground model:
- variables: attributes of all objects
- dependencies: determined by relational links & template model
[Figure: ground network over Prof. Smith, Prof. Jones, courses Geo101 and CS101, and students George and Jane, with the template CPDs instantiated for every registration.]
The Web of Influence
[Figure: observing grades in CS101 and Geo101 shifts the posteriors over course difficulty (easy/hard) and student intelligence (low/high) away from 50/50; evidence flows through shared students and courses.]
Outline
Relational Bayesian networks*
Relational Markov networks†
Collective Classification
Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor
† with Ben Taskar, Pieter Abbeel
Why Undirected Models?
Symmetric, non-causal interactions
Patterns involving multiple entities
E.g., web: categories of linked pages are correlated
Cannot introduce directed edges because of cycles
E.g., web: “triangle” patterns
Directed edges not appropriate
“Solution”: Impose arbitrary direction
Not clear how to parameterize CPDs for variables involved in multiple interactions
Very difficult within a class-based parameterization
[Taskar, Abbeel, K. 2001]
Markov Networks
[Figure: Markov network over five people (James, Kyle, Laura, Mary, Noah); every edge carries the same template potential over label pairs AA … CC, with values ranging from 0 to 2.]
P(J,K,L,M,N) = (1/Z) φ(J,K) φ(J,L) φ(K,L) φ(J,M) φ(M,N) φ(L,N)
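A tiny executable version of this distribution: the edge set matches the figure, but the values of the shared template potential are invented ("like labels attract"), since the figure's actual numbers are not recoverable.

```python
from itertools import product

# Markov network over five people with one shared ("template") edge potential.
LABELS = "ABC"
VARS = ["J", "K", "L", "M", "N"]
EDGES = [("J", "K"), ("J", "L"), ("K", "L"), ("J", "M"), ("M", "N"), ("L", "N")]

def phi(a, b):
    """Shared template potential, applied to every edge (invented values)."""
    return 2.0 if a == b else 0.5

def unnorm(assign):
    """Unnormalized measure: product of edge potentials."""
    p = 1.0
    for u, v in EDGES:
        p *= phi(assign[u], assign[v])
    return p

# Partition function Z by brute-force enumeration of all 3^5 = 243 worlds.
Z = sum(unnorm(dict(zip(VARS, ls))) for ls in product(LABELS, repeat=len(VARS)))

def prob(assign):
    """P(J,K,L,M,N) = (1/Z) * product of edge potentials."""
    return unnorm(assign) / Z
```

Brute-force Z is only feasible because the example is tiny; that is exactly why inference machinery matters at scale.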
Relational Markov Networks
Universals: Probabilistic patterns hold for all groups of objects
Locality: Represent local probabilistic dependencies
Sets of links give us possible interactions
[Figure: template over Student1.Intelligence, Student2.Intelligence, Reg1.Grade, Reg2.Grade, Course.Difficulty, and a Study-Group link, with a template potential over grade pairs AA … CC.]
RMN Semantics
[Figure: ground network over George, Jane, and Jill, courses Geo101 and CS101, and the Geo and CS study groups; the template potential is instantiated for every pair of grades in the same study group.]
Outline
Relational Bayesian Networks
Relational Markov Networks
Collective Classification*
Discriminative training
Web page classification
Link prediction
Relational clustering
* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel
Collective Classification
[Diagram: Training Data (features x, labels y*) feeds Learning, which produces a probabilistic relational model over Student, Course, and Reg (and optionally the model structure); on New Data, Inference uses features x' to conclude labels y'.]
Example:
Train on one year of student intelligence, course difficulty, and grades
Given only grades in the following year, predict all students' intelligence
Learning RMN Parameters
Parameterize potentials as a log-linear model
[Figure: template potential over grade pairs AA … CC for two registrations in the same study group.]
φ(R1.G, R2.G) = exp(w_AA f_AA + … + w_CC f_CC)
P_w(x) = (1/Z_w) exp(wᵀ f(x))
Max Likelihood Estimation
We don't care about the joint distribution P(x, y)
Estimation: given training data (x, y*), maximize_w log P_w(y* | x)
Classification: argmax_y log P_w(y | x')
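For a single page with no links, this discriminative criterion reduces to logistic regression (as a later slide notes). A minimal sketch with made-up features, labels, and step size:

```python
import math

# Maximize the conditional log-likelihood log P_w(y | x) directly,
# never modeling P(x).  All data and hyperparameters are invented.
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.1, 1.0], 0), ([0.0, 0.8], 0)]

def p1(w, x):
    """P_w(y=1 | x) under a log-linear model: 1 / (1 + exp(-w.x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def cond_log_lik(w):
    """Sum of log P_w(y* | x) over the training set."""
    return sum(math.log(p1(w, x) if y == 1 else 1.0 - p1(w, x)) for x, y in data)

w = [0.0, 0.0]
for _ in range(200):                    # plain gradient ascent
    grad = [0.0, 0.0]
    for x, y in data:
        err = y - p1(w, x)              # gradient of log P_w(y | x) is (y - p) * x
        for j in range(2):
            grad[j] += err * x[j]
    w = [wj + 0.5 * gj for wj, gj in zip(w, grad)]
```

In the relational case the same gradient form holds, but computing the expectations requires inference over the whole linked network rather than a per-instance sigmoid.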
Web KB
[Figure: Tom Mitchell (Professor) is a member of the WebKB Project, which also has Sean Slattery (Student) as a member; an Advisor-of link connects Mitchell to Slattery.]
[Craven et al.]
Web Classification Experiments
WebKB dataset
Four CS department websites
Bag of words on each page
Links between pages
Anchor text for links
Experimental setup
Trained on three universities
Tested on fourth
Repeated for all four combinations
Standard Classification
[Figure: page model: a Category node with word features Word1 … WordN; e.g., a Professor page containing "department", "extract", "information", "computer", "science", "machine", "learning", …]
Categories: faculty, course, project, student, other
Standard Classification
[Figure: page model extended with link words ("working with Tom Mitchell …").]
Discriminatively trained naïve Markov = logistic regression
4-fold CV: trained on 3 universities, tested on 4th
[Chart: test-set error of the logistic baseline.]
Power of Context
Professor?
Student?
Post-doc?
Collective Classification
[Figure: linked-page model: From-Page.Category and To-Page.Category, each with word features Word1 … WordN, connected by a Link with its own words; a compatibility potential φ(From, To) ranges over all category pairs.]
Collective Classification
Classify all pages collectively, maximizing the joint label probability
[Chart: test-set error, Logistic vs. Links.]
[Taskar, Abbeel, K., 2002]
More Complex Structure
[Figure: adding page sections and the site's relational structure: faculty, students, and courses pages with word features W1 … Wn.]
Collective Classification: Results
[Chart: test-set error for Logistic, Links, Section, and Link+Section models.]
35.4% error reduction over logistic
[Taskar, Abbeel, K., 2002]
Max Conditional Likelihood
We don't care about the conditional distribution P(y | x)
Estimation: maximize_w log P_w(y* | x)
Classification: argmax_y log P_w(y | x')
P_w(y | x) = (1/Z_w(x)) exp(wᵀ f(x, y))
log P_w(y | x) = wᵀ f(x, y) - log Z_w(x)
Max Margin Estimation
What we really want: correct class labels
Estimation: maximize the margin γ subject to ||w|| = 1:
wᵀ f(x, y*) - wᵀ f(x, y) ≥ γ Δ(y*, y)  for all y ≠ y*
where Δ(y*, y) = # labeling mistakes in y
Classification: argmax_y wᵀ f(x', y)
Quadratic program, but with exponentially many constraints
[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])
Max Margin Markov Networks
We use the structure of the Markov network to provide an equivalent formulation of the QP
Can solve approximately in networks where induced width is too large
Exponential only in the tree width of the network
Complexity = max-likelihood classification
Analogous to loopy belief propagation
Can use kernel-based features!
SVMs meet graphical models
[Taskar, Guestrin, K., 2003]
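The max-margin criterion can be read as a structured hinge loss. Here is a brute-force sketch on a tiny three-position chain (node scores, the pairwise feature, and the true labeling are all invented); the point of M³N is precisely to avoid this exponential enumeration by exploiting the network structure.

```python
from itertools import product

# Structured hinge loss for one instance, by brute-force enumeration:
#   loss(w) = max_y [ w.f(x, y) + Delta(y*, y) ] - w.f(x, y*)
LABELS = (0, 1)
NODE = [(0.2, 0.9), (0.1, 0.7), (0.6, 0.3)]   # score of label 0 / label 1 per position
Y_STAR = (1, 1, 0)                            # "true" labeling

def score(y):
    """w.f(x, y) for a chain: node scores plus an agreement bonus per edge."""
    s = sum(NODE[i][yi] for i, yi in enumerate(y))
    s += sum(0.4 for a, b in zip(y, y[1:]) if a == b)
    return s

def hamming(y1, y2):
    """Delta(y1, y2): number of labeling mistakes."""
    return sum(a != b for a, b in zip(y1, y2))

def hinge_loss(y_true):
    worst = max(score(y) + hamming(y, y_true)
                for y in product(LABELS, repeat=len(y_true)))
    return worst - score(y_true)
```

A loss of zero would mean the true labeling beats every alternative by at least its Hamming distance, i.e. all margin constraints are satisfied.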
WebKB Revisited
16.1% relative reduction in error relative to cond. likelihood RMNs
[Chart: test error for logistic, max likelihood, and max margin models.]
Predicting Relationships
Even more interesting: relationships between objects
[Figure: Tom Mitchell (Professor) and Sean Slattery (Student) are both members of the WebKB Project; is there an Advisor-of relationship between them?]
Predicting Relations
Introduce an exists/type attribute for each potential link
Learn a discriminative model for this attribute
Collectively predict its value in a new world
[Figure: From-Page and To-Page categories and words, plus a Relation Exists/Type variable with link words LinkWord1 … LinkWordN.]
72.9% error reduction over flat
[Chart: error, Flat vs. Collective.]
[Taskar, Wong, Abbeel, K., 2003]
Outline
Relational Bayesian Networks
Relational Markov Networks
Collective Classification
Relational clustering
Movie data*
Biological data†
* with Ben Taskar, Eran Segal
† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er,
Haidong Wang, Micha Shapira, David Botstein
Relational Clustering
[Diagram: unlabeled relational data over Student, Course, and Reg feeds Learning, which produces a probabilistic relational model (and optionally its structure), yielding a clustering of instances.]
Example:
Given only students' grades, cluster similar students
Learning w. Missing Data: EM
EM algorithm applies essentially unchanged:
- E-step computes expected sufficient statistics, aggregated over all objects in a class
- M-step uses ML (or MAP) parameter estimation
Key difference:
- In general, the hidden variables are not independent
- Computation of expected sufficient statistics requires inference over the entire network
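A minimal EM sketch for the student-clustering example, with made-up grade data. One deliberate simplification: each student's grades are treated as independent given the cluster, so the E-step factors per student; in the relational setting the hidden variables are coupled and the E-step would need inference over the whole network, as noted above.

```python
import math

# Cluster students given only their grades (invented data, 2 clusters).
GRADES = "ABC"
students = ["AAAB", "AABA", "CCBC", "BCCC", "AAAA", "CCCB"]
K = 2

# Slightly asymmetric initialization so EM can break symmetry.
pi = [0.5, 0.5]
theta = [{"A": 0.5, "B": 0.3, "C": 0.2},
         {"A": 0.2, "B": 0.3, "C": 0.5}]

for _ in range(30):
    # E-step: posterior P(cluster | grades) for each student.
    post = []
    for s in students:
        lik = [pi[k] * math.prod(theta[k][g] for g in s) for k in range(K)]
        z = sum(lik)
        post.append([v / z for v in lik])
    # M-step: expected sufficient statistics, then ML re-estimation.
    pi = [sum(p[k] for p in post) / len(students) for k in range(K)]
    for k in range(K):
        counts = {g: 1e-6 for g in GRADES}        # tiny smoothing
        for s, p in zip(students, post):
            for g in s:
                counts[g] += p[k]
        total = sum(counts.values())
        theta[k] = {g: c / total for g, c in counts.items()}

assign = [max(range(K), key=lambda k: p[k]) for p in post]
```

On this toy data the A-heavy and C-heavy students end up in separate clusters.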
Learning w. Missing Data: EM
[Dempster et al. 77]
[Figure: EM iterates between inferring hidden student and course types and re-estimating P(Registration.Grade | Course.Difficulty, Student.Intelligence), the distribution over grades A/B/C for each (easy/hard, low/high) combination.]
Movie Data
Internet Movie Database
http://www.imdb.com
Discovering Hidden Types
Learn model using EM
[Figure: Actor.Type, Director.Type, and Movie.Type are hidden; Movie also has observed Genres, Year, MPAA Rating, Rating, and #Votes.]
[Taskar, Segal, K., 2001]
Discovering Hidden Types
Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson; Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October; …
Actors: Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger; Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman; …
Directors: Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola; Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher; …
[Taskar, Segal, K., 2001]
Biology 101: Gene Expression
[Figure: two genes on DNA, each with a control region and a coding region; DNA is transcribed to RNA and translated to Protein; the Swi5 transcription factor binds a control region.]
Cells express different subsets of their genes in different tissues and under different conditions
Gene Expression Microarrays
Measure mRNA level for all genes in one condition
Hundreds of experiments
Highly noisy
[Figure: genes × experiments matrix; entry (i, j) is the expression of gene i in experiment j, shown as induced or repressed.]
Standard Analysis
Cluster genes by similarity of expression profiles
Manually examine clusters to understand what's common to the genes in each cluster
General Approach
Expression level is a function of gene properties and experiment properties
- Observed properties: gene sequence, array condition, …
- Hidden properties: gene cluster (e.g., module assignment)
Learn the model that best explains the data:
- Assignment to hidden variables
- Expression level as a function of properties
[Figure: attributes of Gene i and attributes of Experiment j jointly determine the expression level of gene i in experiment j.]
Clustering as a PRM
[Figure: naïve Bayes clustering model: each gene has a hidden Cluster ID, and its expression levels g.E1, g.E2, …, g.Ek in the k experiments are independent given the cluster, with one CPD P(Ei.L | g.C) per experiment.]
Modular Regulation
Learn functional modules: clusters of genes that are similarly controlled
Learn a control program for each module: expression as a function of control genes
[Figure: regression tree over control genes, splitting on e.g. HAP4 true/false and CMK1 true/false.]
Module Network PRM
[Figure: module network PRM: each gene has a hidden Cluster; control genes Control1, Control2, …, Controlk have an activity level per experiment; each cluster's expression level is predicted by a regression tree over regulators such as HAP4, CMK1, Yer184c, FAR1, APG1, BMH1, GIC2, and USV1.]
[Segal, Regev, Pe'er, Koller, Friedman, 2003]
Experimental Results
Yeast Stress Data (Gasch et al.)
- 2355 genes that showed activity
- 173 experiments (microarrays): diverse environmental stress conditions (e.g., heat shock)
Learned module network with 50 modules:
- Cluster assignments are hidden variables
- Structure of dependency trees unknown
- Learned model using the structural EM algorithm
[Segal et al., Nature Genetics, 2003]
Biological Evaluation
Find sets of co-regulated genes (regulatory modules): 46/50
Find the regulators of each module: 30/50
[Segal et al., Nature Genetics, 2003]
Experimental Results
Hypothesis: Regulator 'X' regulates process 'Y'
Experiment: Knock out 'X' and rerun the experiment
[Figure: regression tree (e.g., over HAP4 and CMK1) with regulator X knocked out; does the predicted behavior change?]
[Segal et al., Nature Genetics, 2003]
Differentially Expressed Genes
[Figure: time-course expression of the knockout strains Ypl230w, Ppt1, and Kin82 vs. wild type; 341, 281, and 602 genes were differentially expressed (at the >16x and >4x thresholds shown).]
[Segal et al., Nature Genetics, 2003]
Biological Experiments Validation
Ypl230w: were the differentially expressed genes predicted as targets?
  #39 Protein folding: 7/23, 1e-4
  #29 Cell differentiation: 6/41, 2e-2
  #5 Glycolysis and folding: 5/37, 4e-2
  #34 Mitochondrial and protein fate: 5/37, 4e-2
Ppt1: rank modules by enrichment for differentially expressed genes
  #14 Ribosomal and phosphate metabolism: 8/32, 9e-3
  #11 Amino acid and purine metabolism: 11/53, 1e-2
  #15 mRNA, rRNA and tRNA processing: 9/43, 2e-2
  #39 Protein folding: 6/23, 2e-2
  #30 Cell cycle: 7/30, 2e-2
Kin82: all regulators regulate predicted modules
  #3 Energy and osmotic stress I: 8/31, 1e-4
  #2 Energy, osmolarity & cAMP signaling: 9/64, 6e-3
  #15 mRNA, rRNA and tRNA processing: 6/43, 2e-2
[Segal et al., Nature Genetics, 2003]
Biology 102: Pathways
Pathways are sets of genes that act together to
achieve a common function
Finding Pathways: Attempt I
Use protein-protein interaction data
Problems:
- Data is very noisy
- Structure is lost: one large connected component in the interaction graph (3527/3589 genes)
Finding Pathways: Attempt II
Use expression microarray clusters
Problems:
- Expression is only a 'weak' indicator of interaction
- Interacting pathways are not separable
[Figure: overlapping clusters for Pathway I and Pathway II.]
Finding Pathways: Our Approach
Use both types of data to find pathways
- Find "active" interactions using gene expression
- Find pathway-related co-expression using interactions
[Figure: pathways I-IV cleanly separated when both signals are used.]
[Segal, Wang, K., 2003]
Probabilistic Model
[Figure: each of Gene1 and Gene2 has a hidden Pathway assignment and expression levels Exp1 … ExpN in N arrays; an Interacts relation between genes whose protein products interact adds a compatibility potential φ(g1.C, g2.C) between their pathway assignments.]
Cluster all genes collectively, maximizing the joint model likelihood
[Segal, Wang, K., 2003]
Capturing Protein Complexes
Independent data set of interacting proteins
[Chart: number of complexes vs. complex coverage (%), our method vs. standard expression clustering.]
- 124 complexes covered at 50% for our method
- 46 complexes covered at 50% for clustering
[Segal, Wang, K., 2003]
RNAse Complex Pathway
[Figure: recovered RNase complex pathway: YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4.]
Includes all 10 known pathway genes
Only 5 genes found by clustering
[Segal, Wang, K., 2003]
Interaction Clustering
RNAse complex found by interaction clustering as
part of cluster with 138 genes
[Segal, Wang, K., 2003]
Truth in Advertising
Huge graphical models:
- 3000-50,000 hidden variables
- Hundreds of thousands of observed nodes
- Very densely connected
Learning:
- Multiple iterations of model updates
- Each requires running inference on the model
Inference:
- Exact inference is intractable
- Use belief propagation
- Single inference iteration: 1-6 hours
Algorithmic ideas key to scaling
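For intuition on the inference step, here is sum-product belief propagation on a toy three-node chain with binary states; on a tree BP is exact, so the beliefs can be checked against brute-force enumeration. All potential values are invented; the networks in the talk are vastly larger and loopy, where BP is only approximate.

```python
from itertools import product

# Sum-product BP on a 3-node chain: messages, then beliefs, then a
# brute-force check of the exact marginals.
STATES = range(2)
unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]       # psi_i(x_i)

def pair(a, b):
    """Shared pairwise potential: neighboring nodes prefer to agree."""
    return 2.0 if a == b else 1.0

def msg(node_pot, incoming):
    """Message along an edge: sum out the sender given its other incoming message."""
    out = [sum(node_pot[a] * incoming[a] * pair(a, b) for a in STATES)
           for b in STATES]
    s = sum(out)
    return [v / s for v in out]            # normalize for numerical hygiene

m01 = msg(unary[0], [1.0, 1.0])            # node 0 -> node 1
m12 = msg(unary[1], m01)                   # node 1 -> node 2
m21 = msg(unary[2], [1.0, 1.0])            # node 2 -> node 1
m10 = msg(unary[1], m21)                   # node 1 -> node 0

def normalize(b):
    s = sum(b)
    return [v / s for v in b]

beliefs = [
    normalize([unary[0][x] * m10[x] for x in STATES]),
    normalize([unary[1][x] * m01[x] * m21[x] for x in STATES]),
    normalize([unary[2][x] * m12[x] for x in STATES]),
]

# Brute-force marginals for comparison.
def joint(x):
    p = unary[0][x[0]] * unary[1][x[1]] * unary[2][x[2]]
    return p * pair(x[0], x[1]) * pair(x[1], x[2])

Z = sum(joint(x) for x in product(STATES, repeat=3))
exact = [[sum(joint(x) for x in product(STATES, repeat=3) if x[i] == v) / Z
          for v in STATES] for i in range(3)]
```

In loopy graphs the same message updates are simply iterated until (hopefully) convergence, which is why a single inference pass over the dense biological networks takes hours.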
Relational Data: A New Opportunity
Challenge
Data consists of different types of instances
Instances are related in complex networks
Instances are not independent
New tasks for machine learning
Collective classification
Relational clustering
Link prediction
Group detection
http://robotics.stanford.edu/~koller/