No Slide Title - NDSU Computer Science
Download
Report
Transcript No Slide Title - NDSU Computer Science
*
Ptree -based Approach
Fei
to Mining Gene Expression Data
1
Pan ,
Xin
2
Hu ,
William
1
Perrizo
1. Dept. Computer Science, 2. Dept. Pharmaceutical Science, North Dakota State University, Fargo, ND 58105
Peano Count Tree (Ptree)
INTRODUCTION
Gene expression data provides valuable information to analyze human
diseases. The process of drug discovery involves the screening of
compounds against a drug target to identify those compounds that interact
with the target to produce the desired effect. Cluster analysis is currently
the most widely used technique to analyze gene expression profiles.
However, this technique is limited as to the amount of useful information
that it can provide. On the other hand, to mine large microarray data sets
and to analyze proteomic data sets is becoming a huge challenge. New
computational tools and interpretation methods in the analysis of results. To
address the problem, we introduce a new data structure, called Peano Count
Tree (Ptree) and a new dissimilarity metric, called HOBbit metric. We
apply a new hierarchical clustering algorithm to analyze the NCI 60 cell
lines data set. The data set comprises 1376 gene expression profiles and the
growth inhibition of 1400 chemical compounds on the same cell lines,
providing an excellent test case case in the context of linking genomic data
mining with high-throughput drug design.
RESULTS
Peano order, also called Z-ordering, is a recursive raster ordering. An
accumulative quadrant is the run-length quadrant-wise compression of a bit
sequence in Peano order. A quadrant of a bit sequence is pure if it is entirely 1
or entirely 0.
Peano Count Trees (Ptrees) are defined as the recursive 1-bit counts of the
accumulative quadrants in peano order.
The clustering results in Figure 3 show that various cancer cell lines tend
to cluster according to the tissue of origin whereas, the gene-drug
interaction suggest that drug metabolism rather than the mechanism of
drug action is an important feature of the drug activity-gene expression
correlation.
A mask P-tree, also called template P-tree, is the representation of clustering
groups. It contains 1 bit at the same index position of the data objects if this
data objects belong to this groups, otherwise contains 0 bit.
56
____________/ / \ \___________
/
_____/ \ _
\
___12 __ 16
16
__ 12_
/ / |
\
/ | \ \
4 2 4 2
2 4 2 4
//|\
//|\
//|\
//|\
1100 1100
0011 0011
depth=0
level=3
depth=1
level=2
depth=2
level=1
depth=3
level=0
56
____________/ / \ \___________
/
_____/ \ _
\
___12 __ 16
16
__ 12_
/ / |
\
/ | \ \
4 2 4 2
2 4 2 4
//|\
//|\
//|\
//|\
1100 1100
0011 0011
depth=0
level=3
depth=1
level=2
depth=2
level=1
depth=3
level=0
Figure 3. (left) Cluster results of gene expression profiles. (right)
cluster results represented using mask Ptree
Figure 2. (left) Accumulative bit quadrant of gene expression
data, (right) its corresponding Peano Count Tree
HOBbit dissimilarity measurement
We compared our approach to the widely used average linkage
method on the whole gene expression dataset. Results showed that our
method is much faster than the average linkage method with a better
tightness around the centroid.
Let ai and bi be the ith bits of integer A and integer B respectively, and let
m (m 1) be the number of bits in binary representations of the values, the
HOBbit similarity between two integers A and B is defined as:
s : i 1 i s ai bi
HOB(A, B) = max
s 0
m
Figure 1. (left) Scan of cDNA microarray containing whole
yeast genome. (right) microarray spotting device at The
Institute for Genome Research
Let m (m 1) be the number of bits in binary representations of the integer
values or mantissa of floating point, the HOBbit dissimilarity between two
data points A and B is defined as:
dv(A, B) = m – HOB(A, B).
Hierarchical Ptree Clustering Algorithm (HPC)
METHODS
The hierarchical Ptree clustering algorithm was developed based on Peano
Count Trees (Ptree) and the HOBbit metric. We defined the Peano Count
Tree and HOBbit metric and then use HPC algorithm to evaluate the cellcell correlations on the basis of NCI gene expression profiles.
*Patents
are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308, NSF Grant OSR-9553368.
•HPC features one database scan and high accuracy by using Ptree
techniques and an optimal approximation LP algorithm.
•Step1: build accumulative bit quadrants and Ptrees while reading gene
expression data.
•Step2: find the 2-median of the whole Ptree as well as its sub-trees using the
LP primal dual optimization algorithm recursively.
•Step3: assign gene expression data to its nearest median and build the mask
P-tree hierarchically.
CONCLUSIONS
We introduce a new data structure as well as a dissimilarity metric and apply the
Ptree based hierarchical clustering to the analysis of large gene expression data
sets. Empirical analysis of the NCI-60 cancer cell lines data set show that the
new approach present a better tightness around the centroid compared to the
average linkage method, thus providing high accuracy. Also the new clustering
approach makes it possible to handle large data set and achieved speed
improvements.
ACKNOWLEDGEMENTS
We thank Pratap Kotala for his support and assistance. A special thanks to
GSA Grant ACT# K96130308 for financial support.