Transcript KDD_final
Gene family classification using
a semi-supervised learning method
Nan Song
Advisors: John Lafferty, Dannie Durand
1
Outline
• Introduction
– A motivating application: genome annotation
•
•
•
•
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
Conclusion
2
The Genome
The complete
genetic material
of an organism or
species
Key genomic component: genes
A gene is a DNA subsequence
ACCCTTAGCTAGACCTTTAGGAGG...
Key genomic component: genes
A gene is a DNA subsequence
ACCCTTAGCTAGACCTTTAGGAGG...
A protein is an amino acid sequence
V H L T P E...
Genes encode proteins,
the building blocks of the cell
Whole Genome Sequencing
413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria
In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes
6
www.genomesonline.org
• atgcaccttg
Gene prediction and annotation
International Human Genome Consortium, Nature 2001
14,882
16,896
Known genes
Predicted genes
31,778
Total
8
Gene annotation
• We are given a new genome sequence with
predicted genes.
• A few genes are well studied.
• Identify other genes in the same family to predict
function.
• Verify predictions experimentally
Two contexts:
– Individual scientist
– High throughput
Outline
• Introduction
– Molecular biology
– A motivating application: genome
annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
10
Evolutionarily related genes have
related functions
Ancestral gene
atgccaggactcccagtga…
Duplication
Duplication
atgcgccgtctggcatgt…
β-globin
Adult
atgcaaggagtcccagagc…
γ-globin
Fetal
atgcgaggtctcccatgt…
ε-globin
11
Embryonic
Evolutionarily related genes have
related functions
Ancestral gene
atgccaggactcccagtga…
Duplication
Duplication
atgcgccgtctggcatgt…
β-globin
atgcaaggagtcccagagc…
γ-globin
atgcgaggtctcccatgt…
ε-globin
Gene family classification is a powerful source
of information for inferring evolutionary,
functional and structural properties of genes
Outline
•
•
•
•
•
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
Conclusion
13
A graphical model of sequence relatedness
G = (V,E)
V: represent sequences
E: weight of the edge is
proportional to the similarity
between sequences.
xi
…atgcaaggagtcccagagcc…
…atgcgaggtctcccagtgtc…
xj
14
A graphical model of sequence relatedness
G = (V,E)
V: represent sequences
E: weight of the edge is
proportional to the similarity
between sequences.
xi
xj
15
Gene family classification
Biological scenario:
• small number of known genes
• large number of unknown genes
Goal:
Given known genes, identify
genes in the same family.
xi
xj
16
Outline
•
•
•
•
•
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
Conclusion
17
Framework: binary classification
Machine learning scenario:
• small number of labeled data
• genes known to be in family
• genes clearly not in family
• large number of unlabeled data
Determine which unlabeled
genes belong to the family.
18
Several challenging problems of
gene family classification
Ancestral gene
Duplication
Duplication
atgcgccgtctggcatgt…
atgcgaggtctcccatgt…
atgcaaggagtcccagagc…
Mutations
DNA shuffling
atgcgccccccggcatgt…
atgcgccgtctggcatgt…ggctcgta
Traditionally, similarity is represented by
sequence comparison
19
Several challenging problems of
gene family classification
Ancestral gene
Duplication
Duplication
atgcgccgtctggcatgt…
atgcgaggtctcccatgt…
atgcaaggagtcccagagc…
Mutations
DNA shuffling
atgcgccccccggcatgt…
atgcgccgtctggcatgt…ggctcgta
Traditionally, similarity is represented by
sequence comparison
20
Several challenging problems of
gene family classification
Families
– do not form a clique
– do not form a connected
component
– have edges to sequences
outside the family.
21
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
– Semi-supervised learning algorithm
– Supervised learning algorithm
• Empirical evaluation
• Conclusion
22
Gene family classification
Machine learning scenario:
• large number of unlabeled data
• small number of labeled data
Goal:
Binary classification
Semi supervised learning:
• Exploit information from both labeled and unlabeled data
• Performed well in many applications
23
Graphical semi-supervised learning
(Binary classification)
Notation:
(xk,f(k))
(xj,yj = 0)
• V: The whole data set
• L: Labeled data set
• U: unlabeled data set
• Each vertex: (xi,yi) or (xk, f(k))
(xi,yi = 1)
24
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty
The Twentieth International Conference on Machine Learning (ICML-2003)
Graphical semi-supervised learning
(Binary classification)
• Input:
(xk,f(k))
(xj,yj = 0)
– family members (xi, yi = 1)
– nonfamily members: (xj, yj = 0)
• Output:
– Assign a real value to every
vertex in the graph
– Find a cutoff to separate the
two classes
(xi,yi = 1)
25
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty
The Twentieth International Conference on Machine Learning (ICML-2003)
Graphical semi-supervised learning
(Binary classification)
Assign real values to all vertices
in the graph, to minimize E(f):
(xn,yp = 1)
(xk,f(k))
E ( f ) iV jV (Wij ( f (i ) f ( j )) 2 )
s.t. f (i ) yi
xi L
where Wij exp (
Sij
)
Sij
(xi,yi = 0)
G = (V,E)
L: Labeled data set
U: unlabeled data set
26
Graph-based semi-supervised learning
f(xk)
Works well
27
http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
Graph-based semi-supervised learning
f(xk)
Works well
Works well ?
28
http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
– Semi-supervised learning
– Supervised learning
• Empirical evaluation
• Conclusion
29
Semi-supervised vs kernel-based
supervised learning
• Semi-supervised learning:
E ( f ) iV jV (Wij ( f (i) f ( j )) 2 )
s.t. f (i) yi
xi L
• Supervised learning:
E( f ) iL jU (Wij ( f (i) f ( j )) 2 )
where L is the labeled data set and U is the unlabeled data set
Outline
•
•
•
•
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
– Methodology
– Results
• Conclusion
31
Graph construction
G = (V,E)
V: All mouse sequences from
SwissProt (n = 7439)
E: based on newly designed
sequence similarity
measurement.
0 < S(i, j) < 1
32
Methodology
•
•
•
•
Graph construction
Test set construction
Experiments performed
Basis for evaluation
33
Test set construction
18 well studied protein families
– Receptors, enzymes, transcription factors,
motor proteins, structural proteins, and
extracellular matrix proteins.
ACSL
FOX
Laminin
SEMA
USP
ADAM
GATA
Myosin
T-box
WNT
DVL
Kinase
Notch
TNFR
FGF
Kinesin
PDE
TRAF
Test set construction
• Retrieved all complete mouse
sequences from SwissProt
database (7,439)
• Identified sequences for each
test family based on
– Nomenclature committee
reports
– Structural properties
– Literature surveys
Family size
ACSL
ADAM
DVL
FGF
FOX
GATA
Kinase
Kinesin
Laminin
Myosin
Notch
PDE
SEMA
TNFR
TRAF
Tbox
USP
WNT
5
26
3
20
30
6
293
18
11
12
4
15
16
24
9
6
18
19
35
Methodology
•
•
•
•
Graph construction
Test set construction
Experiments performed
Basis for evaluation
36
Experiments performed
• Compare semi-supervised with supervised
learning algorithm
• Tested parameters:
– Scaling parameter,σ, in the kernel function
– Number of Labeled Family members (LF)
– Number of Labeled Nonfamily members(LN)
Tested parameters
σ
For each set of parameters,
20 tests were performed
number of Labeled
Family members
number of Non-labeled
Family members
Tested parameters (1)
Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100
σ=100
1
σ=10
W
0.8
0.8
0.6
0.6
σ=1
0.4
0.4
σ=0.5
0.2
0.2
σ=0.2
σ=0.1
0
00
0.2
0.2
0.4
0.4
0.6
0.6
0.08
0.05
0.02
0.8
0.8
11
S
E ( f ) iV jV (Wij ( f (i) f ( j )) ), where Wij exp(
2
Sij
)
Tested parameters (2)
• Labeled Family members (LF):
10-70% of family size
• Labeled Nonfamily members (LN) :
100, 500, 1000
about 1 - 10% of nonfamily size
Database size: 7439
Family
size
LF
ACSL
ADAM
DVL
FGF
FOX
GATA
Kinase
5
26
3
20
30
6
293
Kinesin
Laminin
Myosin
Notch
PDE
SEMA
TNFR
TRAF
Tbox
USP
WNT
18
11
12
4
15
16
24
9
6
18
19
1, 3
3, 5, 7, 9,15
1
3, 5, 7, 11, 15
3, 9, 15
1,3
3, 7, 11, 15,
20, 50, 150
2, 6, 9
1, 3, 5, 7
2, 4, 6, 9
1, 2
2, 5, 7, 10
2, 5, 8
2, 4, 8, 12, 18
1, 3
2, 5
2, 4, 6, 9,13
2, 9
Methodology
•
•
•
•
Graph construction
Test set construction
Experiments performed
Basis for evaluation
41
Semi-supervised learning
Goal:
f(i) > f(j) when xi is a family member and xj is not.
Evaluation criteria:
• Visualization
• AUC score
• False negatives
42
Visualization
Sort all unlabeled data by f(x)
f(x)
Family members
Nonfamily members
Rank
Rank plot
f(x)
Family members
AUC
(Area Under ROC Curve)
Rank
sensitivity
Nonfamily
members
na
AUC
nb
1
i 1 j 1
ai b j
na nb
1 - specificity
Advantages of rank plot
AUC = 0.9382
AUC scores do not reflect all
information we need
• False negatives
after the first false
positive
• The number of
missed data after
the first false
positive
Outline
•
•
•
•
Introduction
A graphical model of sequence relatedness
Gene classification using machine learning
Empirical evaluation
– Methodology
– Results
• Conclusion
47
Several challenging problems of
gene family classification
Families
– do not form a clique
– do not form a connected
component
– have edges to sequences
outside the family.
Edges to sequences outside the family are mainly a
problem if they have strong edge weights
48
Test families have different graph properties
Family
size
ACSL
5
DVL
3
FOX
30
GATA
6
SEMA
16
TRAF
9
Tbox
6
WNT
19
Kinesin
18
Laminin
12
Myosin
12
Notch
4
ADAM
26
FGF
20
Kinase
293
PDE
15
TNFR
24
USP
18
Clique
Connected
NOT
Connected
W
W
W
W
W
W
W
W
S
S
S
S
X
X
X
X
X
X
X
X
X
W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights
49
Results
• Compare semi-supervised with supervised
learning algorithm
• Tested parameters:
– Scaling parameter,σ, in the kernel function
– Number of Labeled Family members (LF)
– Number of Labeled Nonfamily members(LN)
Tested parameters
σ
number of Labeled
Family members
number of Non-labeled
Family members
AUC (ave)
Notch, Lf = 1, Ln =1000
1
0.9998
0.9996
0.1
0.2
0.5
1
10
The effect of σ
σ=100
1
σ=10
W
0.8
0.8
0.6
0.6
σ=1
0.4
0.4
σ=0.5
0.2
0.2
σ=0.2
σ=0.1
0
00
0.2
0.2
0.4
0.4
0.6
0.6
0.08
0.05
0.02
0.8
0.8
11
Raw similarity score (s)
E ( f ) iV jV (Wij ( f (i) f ( j )) ), where Wij exp(
2
Sij
)
Test families have different graph properties
Family
size
ACSL
5
DVL
3
FOX
30
GATA
6
SEMA
16
TRAF
9
Tbox
6
WNT
19
Kinesin
18
Laminin
12
Myosin
12
Notch
4
ADAM
26
FGF
20
Kinase
293
PDE
15
TNFR
24
USP
18
Clique
Connected
NOT
Connected
W
W
W
W
W
W
W
W
S
S
S
S
X
X
X
X
X
X
X
X
X
W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights
53
Edges to sequences outside the family are mainly
a problem if they have strong edge weights
Number of edges
Edges to sequences outside the family are mainly
a problem if they have strong edge weights
FOX
Raw edge weight
Notch
Raw edge weight
Case study: Rank plots for semi-supervised
learning in FOX
LF = 3, LN = 100, family size: 30
σ = 0.1
σ =1
σ = 10
σ=100
Case study: rank plots for semi-supervised
learning in Notch
σ= 10
σ = 0.1
labeled family seqs: 1 (out of 4)
labeled nonfamily seqs: 100(out of 7435)
σ =1
σ = 10
1
0.9998
0.9996
0.1
0.2
0.5
1
Notch, Lf = 1, Ln =1000
10
σ
AUC (ave)
AUC (ave)
FOX, Lf = 3, Ln =1000
1
0.9998
0.9996
0.1
0.2
0.5
1
10
Summary on σ
• For most families, the performance is not
very sensitive to σ
• For almost all families that form a clique,
there is at least one value of sigma (usually
many)
– such that both semi-supervised and supervised
learning algorithms have perfect classfication
performance
Results
• Compare semi-supervised with supervised
learning algorithm
• Tested parameters:
– Scaling parameter,σ, in the kernel function
– Number of Labeled Family members (LF)
– Number of Labeled Nonfamily members(LN)
Test families have different graph properties
Family
size
ACSL
5
DVL
3
FOX
30
GATA
6
SEMA
16
TRAF
9
Tbox
6
WNT
19
Kinesin
18
Laminin
12
Myosin
12
Notch
4
ADAM
26
FGF
20
Kinase
293
PDE
15
TNFR
24
USP
18
Clique
Connected
NOT
Connected
W
W
W
W
W
W
W
W
S
S
S
S
X
X
X
X
X
X
X
X
X
W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights
61
The connection among sequences
in ADAM family
16
14
12
10
8
6
4
2
0
9
24 25 26
# of connected ADAM sequences
The connection among sequences
in ADAM family
Tested parameters
σ
achieve the best average AUC score
number of Labeled
Family members
number of Non-labeled
Family members
By taking the maximum
number of Labeled
Family members
number of Non-labeled
Family members
The impact of number of labeled family and
nonfamily members on the performance
ADAM
AUC
1
Supervised, LN =100
0.996
0.992
3
5
7
9
# labeled family seqs, LF
15
The impact of number of labeled family and
nonfamily members on the performance
ADAM
AUC
1
0.996
Supervised, LN =100
Semi-supervised, LN = 100
0.992
3
5
7
9
15
# labeled family seqs, LF
Performed paired t-test to detect the difference between
semi-supervised and supervised method for a set of
parameters
The impact of number of labeled family and
nonfamily members on the performance
ADAM
AUC
1
Supervised, ln =100
Semi-supervised, ln = 100
0.996
Supervised, ln =1000
0.992
3
5
7
# labeled family seqs
9
15
The impact of number of labeled family and
nonfamily members on the performance
ADAM
AUC
1
Supervised, ln =100
Semi-supervised, ln = 100
0.996
Supervised, ln =1000
Semi-supervised, ln = 1000
0.992
3
5
7
# labeled family seqs
9
15
Graph structure of ADAM
• Troublemaker: ADAMTS10 matches with only 8
out of 26 sequences in ADAM family.
• ADAMTS10 is often misclassified
• ADAMTS10 is implicated in a genetic disease
that causes impaired vision and heat defects.
Semi-supervised method
Supervised method
70
Several challenging problems of
gene family classification
• Sequences in the same family
– do not form a clique
– do not exist in the same connected component
• Sequences in different families
– have significant matches
71
Test families have different graph properties
Family
size
ACSL
5
DVL
3
FOX
30
GATA
6
SEMA
16
TRAF
9
Tbox
6
WNT
19
Kinesin
18
Laminin
12
Myosin
12
Notch
4
ADAM
26
FGF
20
Kinase
293
PDE
15
TNFR
24
USP
18
Clique
Connected
NOT
Connected
W
W
W
W
W
W
W
W
S
S
S
S
X
X
X
X
X
X
X
X
X
W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights
72
The connection among sequences in
TNFR family
The connection among sequences in TNFR family
20 TNFR in this connected component
7
66
5
44
3
22
1
0
10
11
12
13
14
15
16
17
18
19
# of connected TNFR sequences
20
The impact of number of labeled family and
nonfamily members on the performance
TNFR (family size 24)
AUC
1
Semi, , ln = 100
Supervised, ln =100
Semi, ln = 1000
Supervised, ln=1000
0.96
0.92
2
4
8
12
18
Summary for Number of labeled
family members
• The performance of both semi-supervised
and supervised learning improves as LF
increases for all families.
• In non-clique families, semi-supervised
learning performs better than supervised
when LF is small.
Rank plots for semi-supervised
learning in TNFR
σ= 0.1
Lf = 2, ln = 100
AUC values do not reflect all
information that we need
The impact of number of labeled family and
nonfamily members on the performance
TNFR (family size 24)
Number of missed TNFR
6
Semi, ln = 1000
5
Semi, , ln = 100
4
Supervised, ln =100
Supervised, ln=1000
3
2
1
0
2
4
8
12
18
Summary for Number of labeled
family members
• The performance of both semi-supervised
and supervised learning improves as LF
increases for all families.
• In non-clique families, semi-supervised
learning performs better than supervised
when LF is small.
Summary for Number of labeled
non-family members (LN)
• The performance supervised learning
improves as LN increases for all families.
• For semi-supervised learning, sometimes
LN is sometimes helpful and sometimes
not.
Summary of results
100
Super
ACSL
DVL
FOX
GATA
Kinesin
Myosin
Laminin
Notch
SEMA
TRAF
Tbox
WNT
ADAM
FGF
Kinase
PDE
TNFR
USP
Small LF
1000
100
Semi
Super
1000
semi
100
Super
Large LF
1000
100
Semi
Super
1000
semi
Clique
0.9999
0.9999
0.9999
0.9999
1
1
1
1
0.9951
1
0.9989
1
1
1
1
1
0.9549
0.9181
0.9297
0.9792
0.9644
0.9364
0.9420
0.9798
0.9745
0.9644
0.9537
0.9907
0.9738
0.9589
0.9526
0.9895
0.9656
0.9612
0.9628
0.9827
0.9666
0.9603
0.9671
0.9875
0.9804
0.9850
0.9845
0.9900
0.9771
0.9769
0.9866
0.9895
Connected
81
Insights - 1
•
SSL is most effective for families that are
not cliques but are connected.
•
In test set, 12/18 cliques, 3/18 not
connected.
•
What fraction of protein families are
cliques? Is the large number of cliques
in the test set due to sample bias?
Insights - 2
• Performance evaluation measures should
match the needs of the user.
– AUC scores penalize all FNs and
FPs.
– For experimental biologists, top
ranked predictions are of interest
– The number of FNs after the first
false positive can reveal some
information
Insights - 3
• Semi-supervised learning algorithm
provides an appealing visualization tool for
identifying family members especially
when the number of known family
members are small
Acknowledgements
• John Lafferty
• Dannie Durand
Durand Lab
• Robbie Sedgewick
• Rose Hoberman
• Ben Vernot
• Narayanan Raghupathy
• Jerry Zhu
• Aiton Goldman
• Jacob Joseph
• Annette McLeod
• Maureen Stolzer
Thank You !