BioInformatics (3)

Download Report

Transcript BioInformatics (3)

BioInformatics (3)
Computational Issues
• Data Warehousing:
– Organising Biological Information into a Structured Entity
(World’s Largest Distributed DB)
• Function Analysis (Numerical Analysis) :
– Gene Expression Analysis : Applying sophisticated data
mining/Visualisation to understand gene activities within an
environment (Clustering )
– Integrated Genomic Study : Relating structural analysis with
functional analysis
• Structure Analysis (Symbolic Analysis) :
– Sequence Alignment: Analysing a sequence using
comparative methods against existing databases to develop
hypothesis concerning relatives (genetics) and functions
(Dynamic Programming and HMM)
– Structure prediction : from a sequence of a protein to predict
its 3D structure (Inductive LP)
Data Warehousing : Mapping
Biologic into Data Logic
Structure Analysis :
Alignments & Scores
Global (e.g. haplotype)
ACCACACA
::xx::x:
ACACCATA
Score= 5(+1) + 3(-1) = 2
Local (motif)
ACCACACA
::::
ACACCATA
Score= 4(+1) = 4
Suffix (shotgun assembly)
ACCACACA
:::
ACACCATA
Score= 3(+1) =3
A comparison of the homology search and the motif search
for functional interpretation of sequence information.
Homology Search
New sequence
Retrieval
Sequence database
(Primary data)
Motif Search
Knowledge
acquisition
Similar
sequence
Expert
knowledge
New sequence
Motif library
(Empirical rules)
Inference
Expert
knowledge
Sequence interpretation
Sequence interpretation
Search and learning problems in sequence analysis
Similarity search
Proble ms in Biological Science
Pairwise sequence alignment
Database search for similar
sequences
Mult iple sequence alignment
Phylogenetic tree
reconstruction
Prot ein 3D structure
alignment
Structure/func tion ab initio prediction
prediction
Knowledge based
prediction
Mole cular classifi cation
RNA seconda ry struc ture
prediction
RNA 3D structure prediction
Protein 3D structure prediction
Motif extraction
Func tiona l sit e prediction
Cellular locali zation p rediction
Coding region p rediction
Transmembrane domain
prediction
Protein seconda ry structure
prediction
Protein 3D structure prediction
Supe rfamil y classification
Ortholog/p aralog grouping of
gene s
3D fold classification
Math/Stat/CompSci method
Optimi zation algorithms
 Dynamic progra mmi ng
(DP)
 Simulated annealing (SA)
 Genetic algorithms (GA)
 Markov Chain Monte
Carlo (MCMC:
Metropolis and Gibbs
sampl ers)
 Hopfield neural networ k
Pattern recogn iti on and
learning algo rit hms
 Discrimi nan t ana lysis
 Neural networks
 Suppor t vec tor machin es
 Hidden Markov models
(HMM)
 Forma l gramm ar
 CART
Clustering algorithms
 Hierarchical, k-means , etc
 PCA, MDS, etc
 Self -organ izing maps, etc
(Whole genome) Gene Expression Analysis
• Quantitative Analysis of Gene Activities
(Transcription Profiles)
Gene
Expression
Biotinylated RNA
from experiment
GeneChip expression
analysis probe array
Image of hybridized probe array
Each probe cell contains
millions of copies of a specific
oligonucleotide probe
Streptavidinphycoerythrin
conjugate
(Sub)cellular inhomogeneity
Cell-cycle
differences in
expression.
XIST RNA localized
on inactive
X-chromosome
( see figure)
Cluster Analysis
General Purpose: To divide samples into
homogeneous groups based on a set of features.
Gene Expression Analysis: To find co-regulated
genes.
Protein/protein complex
Genes
DNA regulatory elements
Functional Analysis via Gene
Expression
Gene Expression Data
Pairwise Measures
Distance/Similarity Matrix
Clustering
Gene Clusters
Motif Searching/...
Regulatory Elements / Gene Functions
Clustering Algorithms
A clustering algorithm attempts to find natural groups of components
(or data) based on some similarity. Also, the clustering algorithm finds
the centroid of a group of data sets.To determine cluster membership,
most algorithms evaluate the distance between a point and the cluster
centroids. The output from a clustering algorithm is basically a
statistical description of the cluster centroids with the number of
components in each cluster.
Clusters of Two-Dimensional Data
Key Terms in Cluster Analysis
•
•
•
•
Distance & Similarity measures
Hierarchical & non-hierarchical
Single/complete/average linkage
Dendrograms & ordering
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features :
x  ( x1 x 2  xp)
y  ( y1 y 2  yp)
The Minkowski metric is defined by
d ( x, y )  r
ref
p
|
i 1
xi  yi |r
Most Common Minkowski Metrics
1, r  2 (Euclidean distance )
d ( x, y)  2
p
|
i 1
xi  yi |2
2, r  1 (Manhattan distance)
p
d ( x , y )   | xi  yi |
i 1
3, r   (" sup" distance )
d ( x , y )  max | xi  yi |
1 i  p
An Example
x
3
y
4
1, Euclidean distance : 2 4 2  32  5.
2, Manhattan distance : 4  3  7.
3, " sup" distance : max{4,3}  4.
Manhattan distance is called Hamming
distance when all features are binary.
Gene Expression Levels Under 17 Conditions (1-High,0-Low)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
GeneB 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
Hamming Distance : #( 01 )  #( 10 )  4  1  5.
Similarity Measures: Correlation Coefficient
p
s ( x, y ) 
 ( x  x)( y
i
i 1
i
p
p
i 1
i 1
 y)
2
2
(
x

x
)

(
y

y
)
 i
 i
averages : x 
p
1
p
 xi and y 
i 1
s( x, y )  1
p
1
p
y.
i 1
i
Similarity Measures: Correlation Coefficient
Expression Level
Expression Level
Gene A
Gene B
Gene B
Time
Time
Expression Level
Gene B
Gene A
Time
Gene A
Distance-based Clustering
• Assign a distance measure between data
• Find a partition such that:
– Distance between objects within partition (i.e. same cluster) is
minimized
– Distance between objects from different clusters is maximised
• Issues :
– Requires defining a distance (similarity) measure in situation
where it is unclear how to assign it
– What relative weighting to give to one attribute vs another?
– Number of possible partition is super-exponential
hierarchical & nonNormalized Expression Data
ab c d
Hierarchical Clustering Techniques
At the beginning, each object (gene) is
a cluster. In each of the subsequent
steps, two closest clusters will merge
into one cluster until there is only one
cluster left.
Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or
similarity) matrix, the basic process hierarchical clustering is this:
1.Start by assigning each item to its own cluster, so that if you have N items,
you now have N clusters, each containing just one item. Let the distances
(similarities) between the clusters equal the distances (similarities) between the
items they contain.
2.Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that now you have one less cluster.
3.Compute distances (similarities) between the new cluster and each of the old
clusters.
4.Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
The distance between two clusters is
defined as the distance between
•
•
•
•
Single-Link Method / Nearest Neighbor
Complete-Link / Furthest Neighbor
Their Centroids.
Average of all cross-cluster pairs.
Computing Distances
• single-link clustering (also called the connectedness or minimum
method) : we consider the distance between one cluster and another cluster to be
equal to the shortest distance from any member of one cluster to any member of
the other cluster. If the data consist of similarities, we consider the similarity
between one cluster and another cluster to be equal to the greatest similarity from
any member of one cluster to any member of the other cluster.
• complete-link clustering (also called the diameter or maximum
method): we consider the distance between one cluster and another cluster to be
equal to the longest distance from any member of one cluster to any member of the
other cluster.
• average-link clustering : we consider the distance between one cluster and
another cluster to be equal to the average distance from any member of one cluster
to any member of the other cluster.
Single-Link Method
Euclidean Distance
a
b
c
b c
a 2 5
b
3
c
a,b
d
d
6
5
4
a,b,c
c
d
(1)
b c
a 2 5
b
3
c
d
6
5
4
a,b,c,d
d
(2)
(3)
c d
a, b 3 5
c
4
d
a, b, c 4
Distance Matrix
Complete-Link Method
Euclidean Distance
a
b
c
b c
a 2 5
b
3
c
a,b
d
d
6
5
4
a,b
c,d
a,b,c,d
(2)
(3)
c d
a, b 5 6
c
4
c, d
a, b 6
c
d
(1)
b c
a 2 5
b
3
c
d
6
5
4
Distance Matrix
Compare Dendrograms
Single-Link
Complete-Link
ab c d
ab c d
0
2
4
6
Ordered dendrograms
2 n-1 linear orderings of n elements
(n= # genes or conditions)
Maximizing adjacent similarity is
impractical. So order by:
•Average expression level,
•Time of max induction, or
•Chromosome positioning
Eisen98
Which clustering methods do you suggest for
the following two-dimensional data?
Nadler and Smith, Pattern Recognition Engineering, 1993
Problems of Hierarchical
Clustering
• It concerns more about complete tree
structure than the optimal number of
clusters.
• There is no possibility of correcting for a
poor initial partition.
• Similarity and distance measures rarely
have strict numerical significance.
Non-hierarchical clustering
Normalized Expression Data
Tavazoie et al. 1999 (http://arep.med.harvard.edu)
Clustering by K-means
•Given a set S of N p-dimension vectors without any prior knowledge about the
set, the K-means clustering algorithm forms K disjoint nonempty subsets such
that each subset minimizes some measure of dissimilarity locally. The algorithm
will globally yield an optimal dissimilarity of all subsets.
•K-means algorithm has time complexity O(RKN) where K is the number of
desired clusters and R is the number of iterations to converges.
•Euclidean distance metric between the coordinates of any two genes in the space
reflects ignorance of a more biologically relevant measure of distance. K-means
is an unsupervised, iterative algorithm that minimizes the within-cluster sum of
squared distances from the cluster mean.
•The first cluster center is chosen as the centroid of the entire data set and
subsequent centers are chosen by finding the data point farthest from the centers
already chosen. 200-400 iterations.
K-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest
center:
3) Compute the new centers of the clusters:
n 



C ( S )   X i / n, X 1 ,..., X n  S
i 1
4) Repeat step 2 and 3 until no object changes cluster
Representation of expression data
T1 T2 T3
Gene 1
Time-point 3
Time-point 1
dij
Gene N
Normalized
Expression Data
from microarrays
.
Gene 1
Gene 2
Identifying prevalent expression patterns
(gene clusters)
1.2
0.7
0.2
-0.3
1
2
-0.8
-1.3
-1.8
Time -point
3
Normalized
Expression
Normalized
Expression
Time-point 3
Normalized
Expression
Time-point 1
1.5
1
0.5
0
-0.5
1
2
3
-1
-1.5
Time -point
1.5
1
0.5
0
-0.5
1
2
-1
-1.5
-2
Time -point
3
Evaluate Cluster contents
Genes
gpm1
HTB1
RPL11A
RPL12B
RPL13A
RPL14A
RPL15A
RPL17A
RPL23A
TEF2
YDL228c
YDR133C
YDR134C
YDR327W
YDR417C
YKL153W
YPL142C
MIPS functional category
Glycolysis
Nuclear Organization
Ribosome
Translation
Unknown