A graph-theoretic modeling on GO space for biological interpretation

Download Report

Transcript A graph-theoretic modeling on GO space for biological interpretation

A graph-theoretic modeling on GO
space for biological interpretation of
gene clusters
Bioinformatics Unit, ISTECH Inc.
Cancer Metastasis Research Center, Yonsei University College of
Medicine
Sung Geun Lee,
Jung Uk Hur and
Yang Seok Kim
報告人:張家榮
Introduction

Gene Ontology (GO)
–

Clusters of microarray data
–

Each cluster has some genes
Extracts GO terms for a gene cluster
–

Controlled vocabulary of various genomic databases about
diverse species
Each gene has several corresponding GO terms
Purpose
–
To discover the meaning of each cluster
Metric structure of GO tree
Lowest common ancestor

Given a non-empty subset U, v is a common
ancestor of U if every node in U is on a
subtree having v as the root and v0 is an
LCA of U if v0 is greater than or equal to the
level of w for any common ancestor w of U.
Lowest common ancestor
Principal distance

Each level has its own weight
–
–
–
W: IH -> R+
W(i) > W(i+1)
For example: W(k)=150-10(k-1)
Where w0 is LCA of v1 and v2
Principal distance
40
30
20
10
0
Multiset



Mathematically, the following three sets {1},
{1, 1}, {1, 1, 1} are equal in the set notation.
Yet, we want to take the number of
occurrences of elements into account.
Such set is called as a multiset.
MaxPd and AverPd
given a multiset G ={v1, v2, . . . , vn}
MaxPd and AverPd

MaxPd
–
–

give the comprehensive biological meanings of a
gene cluster
Not flexible but informs us of the existence of
some functional outliers
AverPd
–
–
Signifies the most frequent GO codes
More than one
Algorithmic approach (1)




c[i,j] is j’st GO code of i’st gene
We consider ordered GO codes g[m] where
1≤m≤α
α is a constant related to the input data
α≤Ω , Ω is the total number of GO code
Algorithmic approach (2)


MaxPd is used to find LCA of C
Complexity : 3αn
min
MaxPd
40
40
30
40
30
20
20
10
0
40
Algorithmic approach (3)


AverPd is used to find an optimal GO code g[m0]
such that the average distance between g[m0] and
each gene in C is smaller than that of any g[m]
Complexity : 3αn
AverPd
Discussion


Other algorithms consider GO term
frequencies or compare specific GO termrelated gene groups
In our modeling, the topological property of
GO hierarchy is used.
Utility



Biological assessment of the clustering
results of DNA microarray data
Coupled with any clustering technique to
predict the functional category of the
unknown genes
Not only DNA microarray data, but also any
kinds of group analysis with any ontology
having an identical structure with GO
Another approach
The length of GO code is about logα
 Take one number of GO code each time
pseudo-code:
for 1≤k≤ logα
cluster C by the kth number
break if no cluster above n
delete clusters under n
end

MaxPd
Complexity



Worst case : O( nlogα)
Best case : O (n)
Alternate
–
Change the break time to cluster it in detail
disadvantage



Cannot obtain information not contained in
GO such as disease-related genes
GO terms on the same level have different
level information
GO hierarchy is dynamic and flexible