A New Biclustering Algorithm for Analyzing Biological Data

Download Report

Transcript A New Biclustering Algorithm for Analyzing Biological Data

A New Biclustering Algorithm for
Analyzing Biological Data
Prashant Paymal
Advisor: Dr. Hesham Ali
Introduction
• Microarray technology use to study the
expression of many genes at once
• Large amount of data is produced in the
microarray technology
• Proper analysis of the data is important to get
meaningful information from it
• There is a need for new analysis techniques
Data Analysis
• From data to knowledge
• We need to process data by grouping and
synthesizing information into a “big picture”
based upon characteristics and relationships
• One of the most used analysis technique is
traditional clustering
Traditional Clustering
Genes
Genes
Conditions
Conditions
• Applied to either rows or columns of the data matrix
separately
• Each gene is defined using all the conditions
• Each condition is characterized by the activity of all
the genes that belong to it
Motivation
• The large amount of data provide us great
challenges of analysis
• Clustering algorithms consider all the conditions to
group genes and all the genes to group conditions
• Biologically data may not show similar behavior in
all conditions but in a subset of them
• Traditional clustering algorithms will very likely
miss some important information
Biclustering
• The term “Biclustering” was first used by Cheng and
Church in gene expression data analysis [Year
2000]
• Clusters do not need to include all parameters
(genes in Bioinformatics) for all conditions
• Data Matrix
▫ Each gene – One row
▫ Each condition – One column
▫ Each element – expression level of a gene under
specific condition
Biclustering (Cont.)
Genes
Conditions
• Performs clustering in these two dimensions
simultaneously
• Each gene is selected using only a subset of the
conditions
• Each condition is selected using only a subset of the
genes
Goal of Biclustering
• To identify subgroups of genes and subgroups of
conditions by performing simultaneous
clustering of both rows and columns of the gene
expression matrix, instead of clustering these
two dimensions separately
• To find biclusters is NP-hard problem: It is
actually a generalized version of traditional
clustering
Previous Work
• A systematic comparison and evaluation of
biclustering methods for gene expression data Amela Prelic (2006)
• Algorithms:
▫ Statistical Algorithmic Method for Biclustering Analysis
Algorithm (SAMBA)
▫ Order Preserving Submatrix Algorithm (OPSM)
▫ Iterative Signature Algorithm (ISA)
▫ Cheng and Church algorithm
▫ xMotif
▫ Bimax
Previous Work (Cont.)
• Comparative Analysis of Biclustering Algorithms
– Doruk Bozdag … (2010)
• Algorithms
▫
▫
▫
▫
Correlated Pattern Bicluster Algorithm (CPB)
Cheng and Church Algorithm
Order Preserving Submatrix Algorithm (OPSM)
HARP Algorithm
▫ Minimum Sum-Squared Residue-based CoClustering Algorithm
(MSSRCC)
▫ Statistical Algorithmic Method for Biclustering Analysis
Algorithm (SAMBA)
The Importance of Assessment
• Different algorithms give different solutions for
same data
• There is no agreed upon guideline for choosing
among them
• Validation Techniques
▫ External Validation Measures
 Evaluate a result based on the knowledge of the correct
class labels
▫ Internal Validation Measures
 Evaluate a result based on the information intrinsic to the
data alone
Validation
• In most biclustering papers external validation
measures used to assess the methods,
▫ It is not clear how to extend notions such as
homogeneity and separation to the biclustering
context (Gat-Viks et al 2003)
▫ Internal measures don’t work well in case of
biclustering due to which Gat-Viks et al 2003 and
Handl et al 2005 recommend external measures
Objectives of the Project
• Comprehensive Assessment Technique
▫ Internal measures as well as external measures
• Customized Biclustering Method
▫ Input domain
Validation using Synthetic Data
• Testing using Manufactured data
▫ The portion of the implanted bicluster the
algorithm was able to return
▫ The portion external or irrelevant to the implanted
bicluster which algorithm returns
▫ Two metrics to evaluate cluster quality
 U: Uncovered portion of the implanted bicluster
 E: Portion of the output cluster external to the
implanted bicluster
Validation using Synthetic Data
• Testing using real (domain specific) data – for
example using Gene match score
▫ M1, M2 be two sets of Biclusters
▫ Average of the maximum match scores for all
biclusters in M1 with respect to the bicluster in M2
• Potential improvements
▫ Don’t consider samples / conditions
▫ Specificity and Sensitivity
Proposed Assessment
• Calculate sensitivity and specificity scores
▫ Specificity: proportion of negatives which are correctly
identified
▫ Sensitivity: proportion of actual positives which are
correctly identified
• Improve existing measures:
▫ Average of the maximum match scores for all bi-clusters in
M1 with respect to bi-clusters in M2 (considering both
genes and samples)
• Assessment based on knowledge of domain data
▫ The resulting biclusters were evaluated based on the
enrichment of Gene Ontology (GO) terms
Experiments
• Given two biclustering results
▫ M1: Result of a biclustering algorithm
▫ M2: True Result
▫ (G1, C1)  M1 and (G2, C2)  M2
• Calculate similarity score (Jaccard Coefficient)
▫
G1  G 2
G1  G 2
and
C1  C 2
C1  C 2
• Calculate the two scores,
▫ Score 1: % of result of an algorithm is included in the
true result
▫ Score 2: % of true result an algorithm can find
Results
• Synthetic Data: 100 genes and 100 samples
• 10 implanted biclusters of each size 10 X 10 (10 genes and 10
samples)
• Used publically available different biclustering algorithm
implementations
Algorithm
No of
biclusters
Score 1
Score 2
Cheng and Church Algorithm (CC)
8
0.475
0.38
Iterative Search Algorithm (ISA)
9
1
0.9
Order Preserving Sub Matrix (OPSM) Algorithm
32
0.273139
0.874044
Statistical Algorithm Method for Bicluster Analysis (SAMBA)
9
0.5
0.45
xMotif Algorithm
87
0.100023
0.870204
• Score 1: % of result of an algorithm is included in the true result
• Score 2: % of true result an algorithm can find
Conclusion
• Traditional Clustering is too restrictive technique for
analyzing datasets in various application domains
• We need new flexible analysis technique like biclustering
to deal with possible imperfections in the input datasets
• Assessment of data analysis is critical and must be
considered while selecting the right tool for each
application domains
• Biclustering represents a powerful tool for analysis of
data in a variety of domains and can be applicable to
datasets other than biology
References
• Madeira, S.C., Oliveira, A.L.: Biclustering algorithms
for biological data analysis: A survey
• Amela Prelic et al: A systematic comparison and
evaluation of biclustering methods for gene
expression data
• http://cheng.ececs.uc.edu/biclustering
• http://www.tik.ethz.ch/~sop/bicat/
• http://acgt.cs.tau.ac.il/expander/
Thank you…