Discriminative Probabilistic Models for Relational Data


Statistical Learning
from Relational Data
Daphne Koller
Stanford University
Joint work with many many people
Relational Data is Everywhere

- The web: webpages (& the entities they represent), hyperlinks
- Social networks: people, institutions, friendship links
- Bibliometrics: papers, authors, journals, citations
- Biological data: genes, proteins, interactions, regulation
- Corporate databases: customers, products, transactions
Relational Data is Different

- Data instances are not independent
  - Topics of linked webpages are correlated
- Data instances are not identically distributed
  - Heterogeneous instances (papers, authors)
- No IID assumption
- This is a good thing!
New Learning Tasks

- Collective classification of related instances
  - Labeling an entire website of related webpages
- Relational clustering
  - Finding coherent clusters in the genome
- Link prediction & classification
  - Predicting when two people are likely to be friends
- Pattern detection in networks of related objects
  - Finding groups (research groups, terrorist groups)
Probabilistic Models

- Uncertainty model:
  - a space of “possible worlds”
  - a probability distribution over this space
- Worlds: often defined via a set of state variables
  - medical diagnosis: diseases, symptoms, findings, …
  - each world: an assignment of values to the variables
- Number of worlds is exponential in the number of variables
  - 2^n if we have n binary variables
Outline

- Relational Bayesian networks*
- Relational Markov networks
- Collective classification
- Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor
Bayesian Networks

- nodes = variables
- edges = direct influence

[Figure: the student network over Difficulty, Intelligence, Grade, SAT, and Job, with a bar chart of the CPD P(G | D, I) over grades A/B/C for the parent configurations (easy,low), (easy,high), (hard,low), (hard,high).]

Graph structure encodes independence assumptions:
e.g., Letter is conditionally independent of Intelligence given Grade.
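To make the factorization concrete, here is a minimal Python sketch of a student network like this one. The probabilities are illustrative placeholders, not the values from the slide's chart.

```python
# A minimal sketch of the student BN; CPD values are illustrative.
P_D = {"easy": 0.6, "hard": 0.4}                  # P(Difficulty)
P_I = {"low": 0.7, "high": 0.3}                   # P(Intelligence)
P_G = {                                           # CPD P(Grade | D, I)
    ("easy", "low"):  {"A": 0.3,  "B": 0.4,  "C": 0.3},
    ("easy", "high"): {"A": 0.9,  "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5,  "B": 0.3,  "C": 0.2},
}

def joint(d, i, g):
    """Chain rule for this BN: P(d, i, g) = P(d) P(i) P(g | d, i)."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g]

# Marginal P(Grade = "A"): sum out the parents.
p_a = sum(joint(d, i, "A") for d in P_D for i in P_I)
print(f"P(Grade=A) = {p_a:.3f}")
```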
Bayesian Networks: Problem

- Bayesian nets use a propositional representation
- The real world has objects, related to each other

[Figure: ground network with variables Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101, Grade_George_CS101, Grade_George_Geo101. These “instances” are not independent: grades in the same course share the course's difficulty, and one student's grades share that student's intelligence.]
Relational Schema

- Specifies the types of objects in the domain, the attributes of each type of object, & the types of relations between objects

[Figure: schema with classes Professor (attribute: Teaching-Ability), Student (attribute: Intelligence), Course (attribute: Difficulty), and Registration (attributes: Grade, Satisfaction), connected by the relations Teach, Take, and In.]
St. Nordaf University

[Figure: one possible world. Prof. Smith and Prof. Jones each have a Teaching-ability and teach courses; Geo101 and CS101 each have a Difficulty; George and Jane each have an Intelligence; each Registered record links a student to a course (In-course) and carries a Grade and a Satisfac(tion). George is registered in Geo101 and CS101; Jane is registered in CS101.]
Relational Bayesian Networks

- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
  - Links define potential interactions

[Figure: template model over Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, and Reg.Grade and Reg.Satisfaction, with a template CPD P(Grade | Difficulty, Intelligence) shown as a bar chart over the parent configurations (easy,low), (easy,high), (hard,low), (hard,high).]

[K. & Pfeffer; Poole; Ngo & Haddawy]
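A rough Python sketch of the “universals” idea: every ground Registration variable shares one template CPD. The schema follows the slides; the CPD values and the particular registrations are illustrative.

```python
# Grounding an RBN template: parameters are shared, not copied per object.
template_cpd = {  # P(Grade | Difficulty, Intelligence), shared by all Regs
    ("easy", "low"):  {"A": 0.3,  "B": 0.4,  "C": 0.3},
    ("easy", "high"): {"A": 0.9,  "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5,  "B": 0.3,  "C": 0.2},
}
registrations = [("George", "Geo101"), ("George", "CS101"), ("Jane", "CS101")]

ground_model = {}
for student, course in registrations:
    # Each ground variable Grade_{student,course} gets its parents from
    # the relational links and its parameters from the shared template.
    ground_model[("Grade", student, course)] = {
        "parents": [("Intelligence", student), ("Difficulty", course)],
        "cpd": template_cpd,
    }
print(len(ground_model), "ground Grade variables, one shared CPD")
```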
RBN Semantics

Ground model:
- variables: attributes of all objects
- dependencies: determined by relational links & the template model

[Figure: the template unrolled over St. Nordaf: Prof. Smith's and Prof. Jones's Teaching-ability, George's and Jane's Intelligence, Geo101's and CS101's Difficulty, and a Grade and Satisfac variable for each registration, wired according to the links.]
The Web of Influence

[Figure: evidence flows through the ground network. Observing grades in CS101 shifts the posteriors over the course's difficulty (easy / hard) and over each student's intelligence (low / high), and those updated beliefs propagate through shared students to Geo101.]
Outline

- Relational Bayesian networks*
- Relational Markov networks†
- Collective classification
- Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor
† with Ben Taskar, Pieter Abbeel
Why Undirected Models?

- Symmetric, non-causal interactions
  - E.g., web: categories of linked pages are correlated
  - Cannot introduce directed edges because of cycles
- Patterns involving multiple entities
  - E.g., web: “triangle” patterns
  - Directed edges not appropriate
- “Solution”: impose an arbitrary direction
  - Not clear how to parameterize the CPD for variables involved in multiple interactions
  - Very difficult within a class-based parameterization

[Taskar, Abbeel, K., 2002]
Markov Networks

[Figure: a pairwise Markov network over five people: James (J), Kyle (K), Laura (L), Mary (M), and Noah (N), with a shared template potential over the label pairs AA, AB, …, CC (bar chart of values roughly between 0 and 2).]

P(J, K, L, M, N) = (1/Z) φ(J,K) φ(J,L) φ(K,L) φ(J,M) φ(M,N) φ(L,N)
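For concreteness, a brute-force Python sketch of this distribution: the probability of a world is the product of the edge potentials, normalized by Z. The potential values are illustrative.

```python
import itertools

# Pairwise Markov network from the slide; potential values are made up.
labels = ["A", "B", "C"]
phi = {(a, b): 2.0 if a == b else 0.5 for a in labels for b in labels}
edges = [("J","K"), ("J","L"), ("K","L"), ("J","M"), ("M","N"), ("L","N")]
people = ["J", "K", "L", "M", "N"]

def unnormalized(assign):
    """Product of edge potentials for one assignment (one 'world')."""
    p = 1.0
    for u, v in edges:
        p *= phi[(assign[u], assign[v])]
    return p

# Z sums the unnormalized measure over all 3^5 worlds; P = unnormalized / Z.
Z = sum(unnormalized(dict(zip(people, vals)))
        for vals in itertools.product(labels, repeat=len(people)))
world = {"J": "A", "K": "A", "L": "A", "M": "B", "N": "B"}
print(f"P(world) = {unnormalized(world) / Z:.5f}")
```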
Relational Markov Networks

- Universals: probabilistic patterns hold for all groups of objects
- Locality: represent local probabilistic dependencies
  - Sets of links give us possible interactions

[Figure: template over Student1.Intelligence, Reg1.Grade, Course.Difficulty, Student2.Intelligence, Reg2.Grade, with a Study-Group link between the two students and a template potential over the grade pairs AA, AB, …, CC.]
RMN Semantics

[Figure: the template unrolled over a world with students George, Jane, and Jill (each with an Intelligence), courses Geo101 and CS101 (each with a Difficulty), a Grade per registration, a Geo study group, and a CS study group; every pair of grades within a study group is connected by an instance of the shared template potential.]
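A rough sketch of the unrolling step, assuming a made-up world: every pair of registrations that matches the study-group pattern gets an instance of the same template potential.

```python
# RMN unrolling: one ground potential per matching tuple of objects.
study_groups = {"geo": ["George", "Jane"], "cs": ["Jane", "Jill"]}
registrations = {("George", "Geo101"), ("Jane", "Geo101"),
                 ("Jane", "CS101"), ("Jill", "CS101")}
courses = ["Geo101", "CS101"]

ground_potentials = []
for members in study_groups.values():
    for i, s1 in enumerate(members):
        for s2 in members[i + 1:]:
            for c in courses:
                if (s1, c) in registrations and (s2, c) in registrations:
                    # Every instantiation reuses the SAME template potential.
                    ground_potentials.append((("Grade", s1, c),
                                              ("Grade", s2, c)))
print(ground_potentials)
```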
Outline

- Relational Bayesian networks
- Relational Markov networks
- Collective classification*
  - Discriminative training
  - Web page classification
  - Link prediction
- Relational clustering

* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel
Collective Classification

[Figure: pipeline. Training data (features x, labels y*) plus model structure feed learning, producing a probabilistic relational model over Student, Course, Reg; given new data (features x′), inference produces conclusions (labels y′).]

Example:
- Train on one year of student intelligence, course difficulty, and grades
- Given only grades in the following year, predict all students’ intelligence
Learning RMN Parameters

Parameterize the potentials as a log-linear model:

  φ(R1.Grade, R2.Grade) = exp(w_AA f_AA + … + w_CC f_CC)

  P_w(x) = (1/Z_w) exp(w · f(x))

[Figure: the study-group template from before, now with one weight w and one indicator feature f per pair of grades AA, AB, …, CC.]
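A minimal sketch of this log-linear parameterization, with one indicator feature per grade pair; the weights are illustrative.

```python
import math

# phi(g1, g2) = exp(sum_k w_k f_k(g1, g2)), one indicator feature per pair.
grades = ["A", "B", "C"]
w = {(g1, g2): (1.0 if g1 == g2 else -0.5) for g1 in grades for g2 in grades}

def phi(g1, g2):
    # Exactly one indicator feature fires per pair, so the sum over
    # features collapses to the single weight w[(g1, g2)].
    return math.exp(w[(g1, g2)])

print(phi("A", "A"), phi("A", "C"))   # same grades scored higher
```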
Max Likelihood Estimation

We don’t care about the joint distribution P(x, y), so maximize the conditional likelihood instead:

  Estimation: given training data (x, y*), maximize_w log P_w(y* | x)

  Classification: argmax_y log P_w(y | x′)
Web → KB

[Figure: from web pages to a knowledge base. A page about Tom Mitchell maps to a Professor entity, a WebKB Project page maps to a Project entity, and a page about Sean Slattery maps to a Student entity, linked by Project-of, Member, and Advisor-of relations.]

[Craven et al.]
Web Classification Experiments

- WebKB dataset
  - Four CS department websites
  - Bag of words on each page
  - Links between pages
  - Anchor text for links
- Experimental setup
  - Trained on three universities
  - Tested on the fourth
  - Repeated for all four combinations
Standard Classification

[Figure: a flat model. Page.Category with word features Word1 … WordN, e.g., a Professor page containing “department”, “extract”, “information”, “computer”, “science”, “machine”, “learning”, ….]

Categories: faculty, course, project, student, other
Standard Classification

Adding link words: each page’s features also include the anchor text of its incoming links (e.g., “working with Tom Mitchell …”), giving features Word1 … WordN, LinkWord1 … LinkWordN.

A discriminatively trained naïve Markov model = logistic regression.

[Figure: test-set error bar chart for logistic regression; 4-fold CV, trained on 3 universities, tested on the 4th.]
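Since a discriminatively trained naive Markov model is exactly multiclass logistic regression, a minimal sketch is easy to write down; the data here is a random toy stand-in for the word indicators, trained by plain gradient ascent on the conditional log-likelihood.

```python
import numpy as np

# Multiclass logistic regression on toy "bag of words" indicators.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20)).astype(float)  # word indicators
y = rng.integers(0, 5, size=100)                      # 5 page categories
W = np.zeros((5, 20))                                 # one weight row per class

for _ in range(200):
    logits = X @ W.T                                  # w_c . f(x) per class
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)         # softmax = P_w(y | x)
    grad = (np.eye(5)[y] - probs).T @ X / len(y)      # gradient of log-lik
    W += 0.5 * grad
```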
Power of Context

Professor? Student? Post-doc?
Collective Classification

[Figure: relational template. From-Page.Category and To-Page.Category, each with word features Word1 … WordN, connected by a Link with a compatibility potential φ(From, To) over all pairs of categories.]
Collective Classification

Classify all pages collectively, maximizing the joint label probability.

[Figure: test-set error; the collective model (Links) is well below logistic regression (Logistic).]

[Taskar, Abbeel, K., 2002]
More Complex Structure

[Figure: richer relational patterns: faculty, student, and course pages with word features W1 … Wn, linked through the section structure of the site.]
Collective Classification: Results

[Figure: test-set error for the Logistic, Links, Section, and Link+Section models.]

35.4% error reduction over logistic

[Taskar, Abbeel, K., 2002]
Max Conditional Likelihood

  P_w(y | x) = (1/Z_w(x)) exp(w · f(x, y))

  log P_w(y | x) = w · f(x, y) - log Z_w(x)

Estimation: maximize_w log P_w(y* | x)
Classification: argmax_y w · f(x′, y)

But we don’t care about the conditional distribution P(y | x) either.
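A sketch of the conditional likelihood computed by brute force on a two-node toy model; the feature map is illustrative, and a real model would replace the enumeration of labelings with inference.

```python
import itertools
import numpy as np

def f(x, y):
    # Joint feature vector: one word/label feature per node plus an
    # agreement indicator for the single "link" between the two nodes.
    return np.array([x[0] * y[0], x[1] * y[1], float(y[0] == y[1])])

def cond_log_lik(w, x, y_star, labels=(0, 1)):
    scores = {y: w @ f(x, np.array(y))
              for y in itertools.product(labels, repeat=2)}
    log_Z = np.log(sum(np.exp(s) for s in scores.values()))
    # log P_w(y* | x) = w.f(x, y*) - log Z_w(x)
    return scores[tuple(y_star)] - log_Z

w = np.array([1.0, 1.0, 0.5])
print(cond_log_lik(w, x=np.array([1.0, 0.0]), y_star=(1, 1)))
```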
Max Margin Estimation

What we really want: correct class labels.

Estimation (a quadratic program): given (x, y*), maximize the margin γ subject to ‖w‖ = 1 and

  w · f(x, y*) ≥ w · f(x, y) + γ Δ(y*, y)   for all y ≠ y*

where Δ(y*, y) = # labeling mistakes in y.

Classification: argmax_y w · f(x′, y)

Exponentially many constraints!

[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hofmann, 2003])
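To illustrate the constraint set, here is a brute-force check of the margin condition on the same toy model, using a fixed margin of 1 per labeling mistake; the feature map is again illustrative.

```python
import itertools
import numpy as np

def f(x, y):
    # Same illustrative feature map as above: node features + agreement.
    return np.array([x[0] * y[0], x[1] * y[1], float(y[0] == y[1])])

def margin_violations(w, x, y_star, labels=(0, 1)):
    s_star = w @ f(x, np.array(y_star))
    violated = []
    for y in itertools.product(labels, repeat=len(y_star)):
        if y == tuple(y_star):
            continue
        delta = sum(a != b for a, b in zip(y, y_star))  # mistakes in y
        # One QP constraint per competing labeling y:
        #   w.f(x, y*) >= w.f(x, y) + gamma * Delta(y*, y), here gamma = 1.
        if s_star < w @ f(x, np.array(y)) + delta:
            violated.append(y)
    return violated

print(margin_violations(np.array([1.0, 1.0, 0.5]),
                        x=np.array([1.0, 1.0]), y_star=(1, 1)))
# -> [] : all margin constraints hold for this w
```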
Max Margin Markov Networks

- We use the structure of the Markov network to provide an equivalent formulation of the QP
  - Exponential only in the tree width of the network
  - Complexity = max-likelihood classification
- Can solve approximately in networks where the induced width is too large
  - Analogous to loopy belief propagation
- Can use kernel-based features!
  - SVMs meet graphical models

[Taskar, Guestrin, K., 2003]
WebKB Revisited

[Figure: test error for logistic regression, the max-likelihood RMN, and the max-margin RMN.]

16.1% relative reduction in error relative to conditional-likelihood RMNs
Predicting Relationships

Even more interesting: relationships between objects.

[Figure: Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), with the Member and Advisor-of links to be predicted.]
Predicting Relations

- Introduce an exists/type attribute for each potential link
- Learn a discriminative model for this attribute
- Collectively predict its value in the new world

[Figure: template with From-Page.Category and To-Page.Category (word features Word1 … WordN each), a Relation Exists/Type variable, and link-word features LinkWord1 … LinkWordN. Bar chart: error of the flat vs. collective models.]

72.9% error reduction over flat

[Taskar, Wong, Abbeel, K., 2003]
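A rough sketch of the setup, with hypothetical pages and words: every potential link becomes a candidate object with its own features, and its Exists/Type attribute is what the model predicts.

```python
# Link prediction as attribute prediction over candidate links.
pages = ["p1", "p2", "p3"]
page_words = {"p1": ["faculty"], "p2": ["student"], "p3": ["course"]}
link_words = {("p1", "p2"): ["advisor", "of"]}   # anchor text, when present

candidate_links = [(u, v) for u in pages for v in pages if u != v]

def link_features(u, v):
    # Candidate-link features: source/target page words plus anchor text
    # (empty when no hyperlink actually appears on the page).
    return {"from": page_words[u], "to": page_words[v],
            "anchor": link_words.get((u, v), [])}

examples = {(u, v): link_features(u, v) for u, v in candidate_links}
# A discriminative relational model then predicts Exists/Type for all
# candidates jointly rather than one link at a time.
print(len(examples), "candidate links")
```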
Outline

- Relational Bayesian networks
- Relational Markov networks
- Collective classification
- Relational clustering
  - Movie data*
  - Biological data†

* with Ben Taskar, Eran Segal
† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein
Relational Clustering

[Figure: pipeline. Unlabeled relational data (Student, Course, Reg) plus model structure feed learning, producing a probabilistic relational model and a clustering of the instances.]

Example:
- Given only students’ grades, cluster similar students
Learning w. Missing Data: EM

- The EM algorithm applies essentially unchanged
  - E-step computes expected sufficient statistics, aggregated over all objects in a class
  - M-step uses ML (or MAP) parameter estimation
- Key difference:
  - In general, the hidden variables are not independent
  - Computing the expected sufficient statistics requires inference over the entire network
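For intuition, a minimal EM sketch for the flat version of the task: clustering students from grades alone with a plain mixture model. In the relational setting the hidden variables are coupled, so the E-step would instead require inference over the whole ground network.

```python
import numpy as np

# EM for a toy mixture over grades (0/1/2 = A/B/C); data is synthetic.
rng = np.random.default_rng(1)
grades = rng.integers(0, 3, size=(60, 5))     # 60 students x 5 courses
K, V = 2, 3
pi = np.full(K, 1.0 / K)                      # mixing weights
theta = rng.dirichlet(np.ones(V), size=K)     # per-cluster grade distribution

for _ in range(50):
    # E-step: responsibilities = expected sufficient statistics.
    log_p = np.log(pi) + np.log(theta)[:, grades].sum(axis=2).T  # (60, K)
    r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: ML parameter estimates from the expected counts.
    pi = r.mean(axis=0)
    counts = np.einsum('nk,ncv->kv', r, np.eye(V)[grades])
    theta = counts / counts.sum(axis=1, keepdims=True)
```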
Learning w. Missing Data: EM

[Dempster et al. 77]

[Figure: EM alternates between soft assignments of students (low / high) and courses (easy / hard) and re-estimating the CPD P(Registration.Grade | Course.Difficulty, Student.Intelligence), shown as a bar chart over grades A/B/C for each parent configuration.]
Movie Data
Internet Movie Database
http://www.imdb.com
Discovering Hidden Types

Learn the model using EM.

[Figure: model with hidden Type attributes on Actor, Director, and Movie; Movie also has observed attributes Genres, Year, MPAA Rating, Rating, #Votes.]

[Taskar, Segal, K., 2001]
Discovering Hidden Types

Examples of the discovered types:

Movies (one type): Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, …
Movies (another type): Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October, …

Actors (one type): Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, …
Actors (another type): Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman, …

Directors (one type): Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola, …
Directors (another type): Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher, …

[Taskar, Segal, K., 2001]
Biology 101: Gene Expression

[Figure: genes on DNA, each with a control region and a coding region; DNA is transcribed to RNA and translated to protein; the Swi5 transcription factor binds a gene's control region.]

Cells express different subsets of their genes in different tissues and under different conditions.
Gene Expression Microarrays

- Measure the mRNA level for all genes in one condition
- Hundreds of experiments
- Highly noisy

[Figure: genes-by-experiments expression matrix; entry (i, j) is the expression of gene i in experiment j, ranging from induced to repressed.]
Standard Analysis

- Cluster genes by similarity of their expression profiles
- Manually examine the clusters to understand what’s common to the genes in each cluster
General Approach

- Expression level is a function of gene properties and experiment properties
  - Observed properties: gene sequence, array condition, …
  - Hidden properties: gene cluster (e.g., module assignment)
- Learn the model that best explains the data
  - Assignment to the hidden variables
  - Expression level as a function of the properties

[Figure: the expression level of gene i in experiment j as a function of the attributes of gene i and the attributes of experiment j.]
Clustering as a PRM

[Figure: naïve Bayes PRM. Each Gene has a hidden Cluster g.C; each Experiment has an ID; the expression Level is governed by P(Ei.L | g.C), with one CPD per cluster (CPD 1, CPD 2, …, CPD k) over the gene's expression variables g.E1, g.E2, …, g.Ek.]
Modular Regulation

- Learn functional modules:
  - Clusters of genes that are similarly controlled
- Learn a control program for each module
  - Expression as a function of control genes

[Figure: a decision tree over control genes, e.g., splitting on HAP4 (true/false) and then on CMK1 (true/false).]
Module Network PRM

[Figure: Gene.Cluster together with per-experiment attributes Control1, Control2, …, Controlk (the activity level of each control gene in the experiment) determines the expression Level. Each cluster has its own control program: a decision tree over control-gene activity (genes appearing include HAP4, CMK1, USV1, GIC2, BMH1, FAR1, APG1, Yer184c) predicting the expression level; Cluster 1 and Cluster 2 are shown.]

[Segal, Regev, Pe’er, Koller, Friedman, 2003]
Experimental Results

- Yeast stress data (Gasch et al.)
  - 2355 genes that showed activity
  - 173 experiments (microarrays): diverse environmental stress conditions (e.g., heat shock)
- Learned a module network with 50 modules:
  - Cluster assignments are hidden variables
  - Structure of the dependency trees is unknown
  - Learned the model using the structural EM algorithm

[Segal et al., Nature Genetics, 2003]
Biological Evaluation

- Find sets of co-regulated genes (regulatory modules): 46/50
- Find the regulators of each module: 30/50

[Segal et al., Nature Genetics, 2003]
Experimental Results

- Hypothesis: regulator ‘X’ regulates process ‘Y’
- Experiment: knock out ‘X’ and rerun the experiment

[Figure: in the learned control program (e.g., the HAP4/CMK1 tree), knocking out a regulator tests whether its predicted targets respond.]

[Segal et al., Nature Genetics, 2003]
Differentially Expressed Genes

[Figure: time-course comparisons of three knockout strains (ΔYpl230w, ΔPpt1, ΔKin82) against wild type (wt). The knockouts yielded 341, 281, and 602 differentially expressed genes, at >16x and >4x thresholds.]

[Segal et al., Nature Genetics, 2003]
Biological Experiments Validation

Were the differentially expressed genes predicted as targets? Rank modules by enrichment for differentially expressed genes.

ΔYpl230w:
  #  Module                                Significance
  39 Protein folding                       7/23, 1e-4
  29 Cell differentiation                  6/41, 2e-2
  5  Glycolysis and folding                5/37, 4e-2
  34 Mitochondrial and protein fate        5/37, 4e-2

ΔPpt1:
  #  Module                                Significance
  14 Ribosomal and phosphate metabolism    8/32, 9e-3
  11 Amino acid and purine metabolism      11/53, 1e-2
  15 mRNA, rRNA and tRNA processing        9/43, 2e-2
  39 Protein folding                       6/23, 2e-2
  30 Cell cycle                            7/30, 2e-2

ΔKin82:
  #  Module                                Significance
  3  Energy and osmotic stress I           8/31, 1e-4
  2  Energy, osmolarity & cAMP signaling   9/64, 6e-3
  15 mRNA, rRNA and tRNA processing        6/43, 2e-2

All regulators regulate their predicted modules.

[Segal et al., Nature Genetics, 2003]
Biology 102: Pathways

- Pathways are sets of genes that act together to achieve a common function
Finding Pathways: Attempt I

- Use protein-protein interaction data
- Problems:
  - Data is very noisy
  - Structure is lost: one large connected component in the interaction graph (3527/3589 genes)
Finding Pathways: Attempt II

- Use expression microarray clusters
- Problems:
  - Expression is only a ‘weak’ indicator of interaction
  - Interacting pathways are not separable

[Figure: Pathway I and Pathway II overlap in expression space.]
Finding Pathways: Our Approach

- Use both types of data to find pathways
  - Find “active” interactions using gene expression
  - Find pathway-related co-expression using interactions

[Figure: Pathways I-IV separated by combining the two signals.]

[Segal, Wang, K., 2003]
Probabilistic Model

[Figure: each Gene has a hidden Pathway assignment g.C and expression levels Exp1 … ExpN in N arrays; an Interacts relation between gene pairs (protein-product interaction) carries a compatibility potential φ(g1.C, g2.C) that rewards assigning interacting genes to the same pathway.]

Cluster all genes collectively, maximizing the joint model likelihood.

[Segal, Wang, K., 2003]
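An illustrative toy version of the joint objective: each gene's expression likelihood under its assigned pathway plus a compatibility bonus for interacting genes assigned together. All genes, levels, and numbers here are made up.

```python
import itertools

# Joint score = per-gene Gaussian expression likelihood + compatibility.
genes = ["g1", "g2", "g3"]
expr = {"g1": -0.8, "g2": -1.1, "g3": 0.9}        # one array, for brevity
interactions = [("g1", "g2"), ("g2", "g3")]
mu = {0: -1.0, 1: 1.0}                             # pathway-specific mean

def log_score(assign):
    lik = sum(-0.5 * (expr[g] - mu[assign[g]]) ** 2 for g in genes)
    compat = sum(0.7 for u, v in interactions if assign[u] == assign[v])
    return lik + compat

best = max((dict(zip(genes, cs))
            for cs in itertools.product([0, 1], repeat=3)),
           key=log_score)
print(best)  # g1, g2 land in the low-expression pathway; g3 in the other
```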
Capturing Protein Complexes

- Independent data set of interacting proteins
- 124 complexes covered at 50% by our method
- 46 complexes covered at 50% by standard expression clustering

[Figure: number of complexes (0-400) vs. complex coverage (0-100%) for our method and standard expression clustering.]

[Segal, Wang, K., 2003]
RNAse Complex Pathway

- Includes all 10 known pathway genes
- Only 5 genes found by clustering

[Figure: the recovered pathway with its interaction edges: RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4, YHR081W.]

[Segal, Wang, K., 2003]
Interaction Clustering

- The RNAse complex is found by interaction clustering only as part of a cluster with 138 genes

[Segal, Wang, K., 2003]
Truth in Advertising

- Huge graphical models:
  - 3,000-50,000 hidden variables
  - Hundreds of thousands of observed nodes
  - Very densely connected
- Learning:
  - Multiple iterations of model updates
  - Each requires running inference on the model
- Inference:
  - Exact inference is intractable
  - Use belief propagation
  - Single inference iteration: 1-6 hours
  - Algorithmic ideas key to scaling
Relational Data: A New Opportunity

- Data consists of different types of instances
- Instances are related in complex networks
- Instances are not independent
- New tasks for machine learning:
  - Collective classification
  - Relational clustering
  - Link prediction
  - Group detection

http://robotics.stanford.edu/~koller/