BIOINFORMATIC ALGORITHMS


Ryan Tinsley
Brandon Lile
BIOINFORMATIC ALGORITHMS
May 9th, 2014
Bioinformatics
• What?
• Why?
• Remains a large frontier
• Goals:
– Organize and serve data
– Develop tools to analyze
– Interpret results
Clustering
• Clustering's goal is to take large amounts of data and group it into similarity classes.
• Viewing and analyzing vast amounts of biological data as a whole can be a challenging job. The data are easier to interpret when partitioned into clusters of similar data points.
• Clustering arose from researchers wanting to know the functions of newly sequenced genes. Comparing a new gene's sequence against already known DNA sequences often does not give away the gene's function; in fact, the function of about 40% of sequenced genes cannot be determined by comparing their sequences to those of known genes alone.
Microarrays
• Biologists infer gene functions by using microarrays. A microarray is a grid of DNA segments of known sequence that is used to test and map DNA fragments. Microarrays measure the expression levels of genes under varying conditions.
• A gene's expression level is estimated by measuring the amount of mRNA produced for that gene. Microarray data are transformed into an intensity matrix, which allows biologists to make correlations between genes. This is where clustering comes in handy.
• Each gene's expression profile is plotted as a point in an N-dimensional space. Clustering builds a distance matrix holding the distances between every pair of gene points. Genes that are close together share the same characteristics, and the outcome is groups of clusters that have related functionality.
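As a minimal sketch of that intensity-matrix-to-distance-matrix step (the expression values below are made up for illustration, not real microarray data):

import numpy as np

# Hypothetical intensity matrix: rows are genes, columns are conditions.
expression = np.array([
    [10.0, 8.0, 10.0],   # gene g1
    [10.0, 0.0,  9.0],   # gene g2
    [ 4.0, 8.5,  3.0],   # gene g3
])

# Each gene is a point in N-dimensional space (N = number of conditions);
# the distance matrix holds the Euclidean distance between every pair.
diff = expression[:, None, :] - expression[None, :, :]
distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))
print(distance_matrix)  # genes with small distances behave similarly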
[Microarray figure] Green: control; Red: experimental cell; Yellow: expressed in both samples; Black: expressed in neither sample.
Microarrays can also be expressed by tables, where each box represents one gene's expression over time.
Hierarchical Clustering
• Hierarchical clustering is a method which seeks to build a hierarchy of clusters. This is done in one of two ways: agglomerative or divisive.
• Agglomerative clustering starts with every element in its own cluster and iteratively joins clusters together. The time complexity of agglomerative clustering is O(n^3).
• Divisive clustering starts with one cluster and iteratively divides it into smaller clusters. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods of complexity O(n^2) are known: SLINK[1] for single-linkage and CLINK[2] for complete-linkage clustering.
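As an illustrative sketch (not part of the slides), SciPy exposes both linkage criteria; the points below are hypothetical expression profiles, and SciPy's linkage function does not necessarily use the SLINK/CLINK algorithms internally:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical expression profiles for five genes under three conditions.
points = np.array([
    [10.0, 8.0, 10.0],
    [10.0, 0.0,  9.0],
    [ 4.0, 8.5,  3.0],
    [ 9.5, 0.5,  8.5],
    [ 4.5, 8.5,  2.5],
])

# Agglomerative clustering: "single" is single-linkage (the d_min rule),
# "complete" is complete-linkage.
tree = linkage(points, method="single", metric="euclidean")

# Cut the hierarchy into flat clusters at an assumed distance threshold.
print(fcluster(tree, t=3.0, criterion="distance"))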
• Hierarchical clustering is often used to reveal evolutionary history. The distance between two clusters is the smallest distance between any pair of their elements; the average distance is taken between all pairs of elements.
This is a good example of clustering: it satisfies both the homogeneity and separation principles.
H.C. Algorithm
Hierarchical Clustering (d, n)
1.  Form n clusters, each with one element
2.  Construct a graph T by assigning one vertex to each cluster
3.  while there is more than one cluster
4.      Find the two closest clusters C1 and C2
5.      Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
6.      Compute the distance from C to all other clusters
7.      Add a new vertex C to T and connect it to vertices C1 and C2
8.      Remove the rows and columns of d corresponding to C1 and C2
9.      Add a row and column to d corresponding to the new cluster C
10. return T
Distance:
d_min(C, C*) = min d(x, y) over all x in C, y in C*
d_avg(C, C*) = (1 / (|C| |C*|)) ∑ d(x, y) over all x in C, y in C*
Squared Error Distortion:
d(V, X) = (1/n) ∑ d(v_i, X)² for 1 ≤ i ≤ n
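A direct, unoptimized Python rendering of the pseudocode above (the d_min single-linkage rule is assumed when recomputing distances, and the function name and matrix handling are our own choices):

import numpy as np

def hierarchical_clustering(d, n):
    # d: n x n symmetric distance matrix (numpy array).
    # Returns the merge tree T as a list of (C1, C2, merged) tuples.
    d = d.astype(float).copy()
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    clusters = [(i,) for i in range(n)]  # start with n singleton clusters
    tree = []
    while len(clusters) > 1:
        # Find the two closest clusters C1 and C2.
        i, j = divmod(int(d.argmin()), d.shape[1])
        c1, c2 = clusters[i], clusters[j]
        merged = c1 + c2
        # Distance from the merged cluster C to all others (d_min rule).
        new_row = np.minimum(d[i], d[j])
        # Remove rows/columns for C1 and C2; add a row/column for C.
        keep = [k for k in range(len(clusters)) if k not in (i, j)]
        new_row = new_row[keep]
        d = d[np.ix_(keep, keep)]
        d = np.vstack([d, new_row[None, :]])
        d = np.hstack([d, np.append(new_row, np.inf)[:, None]])
        clusters = [clusters[k] for k in keep] + [merged]
        tree.append((c1, c2, merged))
    return tree

# Hypothetical example: four points forming two obvious pairs.
points = np.array([[0.0, 1.0], [0.5, 1.0], [5.0, 5.0], [5.5, 5.0]])
dm = np.sqrt(((points[:, None] - points[None]) ** 2).sum(-1))
print(hierarchical_clustering(dm, 4))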
K-Means Clustering and Lloyd's Clustering
• K-Means clustering is a technique to partition a set of N points into K clusters.
• The 1-Means Clustering problem is easy. However, it becomes very difficult (NP-complete) for more than one center.
• An efficient heuristic method for K-Means clustering is the Lloyd algorithm. It differs from K-Means in that its inputs are a continuous geometric region rather than a discrete set of points. Lloyd's algorithm, while fast, moves many data points in each iteration, which does not necessarily lead to better convergence.
• A better overall clustering cost can be achieved with Progressive Greedy K-Means clustering. This method moves one point at a time.
• A comparison between the two concludes that Lloyd's method is more efficient both in run time and in the best found squared error distortion (SED).
K-Means Clustering: Lloyd Algorithm
1. Arbitrarily assign the k cluster centers
2. while the cluster centers keep changing
3.    Assign each data point to the cluster Ci corresponding to the closest cluster representative (center), 1 ≤ i ≤ k
4.    After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new representative of cluster C is (∑ v) / |C| over all v in C
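A minimal runnable Python sketch of the pseudocode above (the random initialization and iteration cap are assumptions, not from the slides):

import numpy as np

def lloyd_kmeans(points, k, rng=None, max_iter=100):
    # Step 1: arbitrarily pick k of the data points as initial centers.
    rng = np.random.default_rng(rng)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 3: assign each point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the center of gravity of its cluster.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 2: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels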
Progressive “Greedy” K-Means
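The original slide's figure is not preserved in this transcript; based on the description on the previous slide (one point moved at a time to improve the clustering cost), a hedged sketch of the greedy single-point-move idea might look like this:

import numpy as np

def progressive_greedy_kmeans(points, labels, k, max_moves=1000):
    # Squared error distortion of a labeling, per the SED formula earlier.
    def sed(lab):
        centers = np.array([points[lab == i].mean(axis=0) for i in range(k)])
        return sum(np.linalg.norm(p - centers[l]) ** 2 for p, l in zip(points, lab))

    current = sed(labels)
    for _ in range(max_moves):
        best = None
        for p in range(len(points)):
            for target in range(k):
                # Skip no-op moves and moves that would empty a cluster.
                if target == labels[p] or np.sum(labels == labels[p]) == 1:
                    continue
                trial = labels.copy()
                trial[p] = target
                cost = sed(trial)
                if cost < current and (best is None or cost < best[0]):
                    best = (cost, trial)
        if best is None:
            return labels  # no single-point move improves the cost
        current, labels = best
    return labels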
K-Means Execution Example
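The execution figure from this slide is not preserved; as a stand-in, a hypothetical run of the lloyd_kmeans sketch above (reusing numpy as np from that example):

pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])
centers, labels = lloyd_kmeans(pts, k=2, rng=0)
print(centers)  # final cluster representatives (centers of gravity)
print(labels)   # cluster index assigned to each point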
Distance Graph
• A distance matrix can
be turned into a
distance graph.
Genes are then
represented as
vertices in the graph,
otherwise known as
Clique graphs. A
problem with clique
graphs is corruption
which can be solved
with the CAST
algorithm.
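A tiny sketch of that matrix-to-graph step; theta is an assumed distance threshold, not a value from the slides:

def distance_graph(d, theta):
    # Edge (i, j) whenever genes i and j are closer than theta.
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] < theta}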
Other Clustering Uses
• Clustering has a few more applications.
Clustering can be used on highly homologous
sequences to reduce the size of large protein
databases. It reduces the complexity and
speeds up comparisons.
• Clustering can be used to evaluate proteinprotein interaction networks. The structure of
proteins interactions can be represented by a
graph. Clustering then looks for clusters in the
graph.
• Also it can be used to improve the protein
structure prediction by merging the predictions
made by a large number of alternative
conformation models.
BLAST
• Need a way to compare newly sequenced genes with already characterized genes
• Quadratic local alignment is too slow
• A heuristic approach instead allows for O(nm) runtime
BLAST cont…
• Step 1: Removal of low-complexity/repetitive regions
– DUST
• Step 2: Splitting the query (amino acids or nucleotides) into words/sequences
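A minimal sketch of the word-splitting step; the word sizes mentioned in the comment are BLAST's usual defaults:

def query_words(seq, w):
    # Overlapping words of length w; BLAST typically uses w = 3 for
    # proteins and w = 11 for nucleotides.
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

print(query_words("PQGEFG", 3))  # ['PQG', 'QGE', 'GEF', 'EFG']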
BLAST cont…
• Step 3: All words above threshold
incorporated into query list and
placed in efficient search tree
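As a sketch of how the above-threshold word list can be generated, with a toy scoring function standing in for a real substitution matrix (a plain set stands in for the efficient search tree):

from itertools import product

def score(a, b):
    # Toy scoring: +5 per identity, -1 per mismatch. A real BLAST run
    # would use a substitution matrix such as BLOSUM62 instead.
    return sum(5 if x == y else -1 for x, y in zip(a, b))

def neighborhood(word, alphabet, threshold):
    # Every word over the alphabet scoring at least `threshold` against
    # the query word; these are the words added to the query list.
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if score(word, cand) >= threshold
    }

# With this toy scoring, threshold 9 admits at most one mismatch.
print(sorted(neighborhood("PQG", "ACDEFGHIKLMNPQRSTVWY", 9)))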
BLAST cont…
• Step 4: DB queried for matches
• Step 5: Verification/Extension of
matches
– Ungapped BLAST
– Gapped BLAST
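A hedged sketch of the ungapped extension step: a seed word hit is extended in both directions while the running score stays close to the best score seen. The match/mismatch scores and dropoff value here are assumptions for illustration, not BLAST's actual parameters:

def extend_hit(query, subject, q_pos, s_pos, w, dropoff=5):
    # Ungapped extension of a word hit with an X-dropoff stopping rule.
    def match(a, b):
        return 2 if a == b else -3

    # Score of the seed word itself.
    best = run = sum(match(a, b) for a, b in zip(query[q_pos:q_pos + w],
                                                 subject[s_pos:s_pos + w]))
    left = right = 0

    # Extend to the right while the score stays within dropoff of the best.
    i = 0
    while q_pos + w + i < len(query) and s_pos + w + i < len(subject):
        run += match(query[q_pos + w + i], subject[s_pos + w + i])
        i += 1
        if run > best:
            best, right = run, i
        elif best - run > dropoff:
            break

    # Extend to the left in the same fashion.
    i, run = 1, best
    while q_pos - i >= 0 and s_pos - i >= 0:
        run += match(query[q_pos - i], subject[s_pos - i])
        if run > best:
            best, left = run, i
        elif best - run > dropoff:
            break
        i += 1

    return best, query[q_pos - left:q_pos + w + right]

# Hypothetical seed: the word "ACGT" found at query position 0 and
# subject position 2.
print(extend_hit("ACGTACGTTT", "TTACGTACGA", 0, 2, 4))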
BLAST Verification
[Figure: ungapped vs. gapped extension of a seed match]
• Step 6: Verifies threshold and returns ranked results
BLAST Implementations
• NCBI: http://blast.ncbi.nlm.nih.gov/
– Blastn
– Blastp
– Blastx
– Etc.
• mpiBLAST
• ScalaBLAST
• BLAT
• Future of BLAST
Questions?