Transcript CISD794

CISD 794 – Knowledge Discovery in Databases
Prof. Junping Sun
DENCLUE - Clustering DNA Sequences
using FFT (Fast Fourier Transform )
Gilberto R dos Santos
Clustering
• Cluster analysis is an unsupervised learning
method that constitutes a cornerstone of an
intelligent data analysis process.
• Some DNA sequences may reach several millions
of base pairs. Any search method tried to have a
direct hit comparing two DNA sequences may
have an algorithm that grows O[n*n], being n the
number of bases in the DNA sequence. Several
methods are being combined to optimize sequence
search and matching. One initial approach is to
build clusters of DNA sequences.
Functions in Time or Frequency
• A physical process can be described either in the time
domain, by the values of some quantity h as a function of
time t, e.g., h(t), or else in the frequency domain, where
the process is specified by giving its amplitude H
(generally a complex number indicating phase also) as
function of frequency f, that is H(f), with - ∞ < f < + ∞
• h(t) and H(f) are two different representations of the
same function.
• H(f) = h(t) e ** ( 2* π * i * f * t ) dt
•
h(t) = H(f) e ** ( -2 * π * i * f * t ) dt
FFT –Fast Fourier Transform
Fourier Equation
Where An and Bn are
DNA –Sequence – PDB: 1b05
DNA –Sequence – PDB: 1b25
DNA and FFT (DFT)
•
DFT /FFT (Discrete Fourier Transformation) can be used to define a wave
that is the closest representation of a DNA sequence function h(t). DFT
transformation technique can be used to reduce the search time of range
queries. This method can be used in conjunction with any other existing
method, or can be used as a filtration technique. The challenge of this research
is to define how DFT (FFT) can represent a DNA sequence. The proximity
search seeks the sequences close enough to a given query sequence either
through direct alignment or using other heuristics. The alignment of biological
sequences ( pairwise or multiple alignment) is the operation to place
nucleotide or amino acid residues in columns inferring the closest common
ancestral relationships. The best alignment usually refers to the one
demonstrating the most likely evolutionary scenario.
Wave Composition
Data Structures Used
• Data matrix
– (two modes)
• Dissimilarity matrix
– (one mode)
 x11

 ...
x
 i1
 ...
x
 n1
...
x1f
...
...
...
...
xif
...
...
...
...
... xnf
...
...
 0
 d(2,1)
0

 d(3,1) d ( 3,2) 0

:
:
 :
d ( n,1) d ( n,2) ...
x1p 

... 
xip 

... 
xnp 







... 0
Euclidean Distance
• The Euclidean Distance for the matrix d(i,j) is
•
• Dx (i,j) = sqrt ( (Xi1 – Xj1 ) ** 2 + (Xi2 – Xj2 ) ** 2 + .. + ( Xip –
Xjp )** 2 )
•
• Dy (i,j) = sqrt ( (Yi1 – Yj1 ) ** 2 + (Yi2 – Yj2 ) ** 2 + .. + ( Yip – Yjp
)** 2 )
•
• Dz (i,j) = sqrt ( (Zi1 – Zj1 ) ** 2 + (Zi2 – Zj2 ) ** 2 + .. + ( Zip – Zjp
)** 2 )
Euclidean Distance
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;
d_v_dist_a number:=0;
d_v_dist_b number:=0;
v_dist_a number:=0;
v_dist_b number:=0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
for r1 in c1 loop
select * into r2
from dna_variance_coef_xy where frequency = r1.frequency ;
begin
select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * (
power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * (
power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )
into v_dist_a,v_dist_b
from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id;
update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF =
round(v_dist_b)
where current of c1;
exception when others then
null;
end;
end loop;
commit;
Euclidean Distance - sine coefficient – Function [X]  Y
Euclidean Distance - cosine coefficient – Function (X)  Y
DENCLUE – Clustering Based on Density Distribution
Functions
The influence of each data point can be formally modeled using a mathematical
function (influence function), that describes the impact of a data point within its
neighborhood.
The overall density of the data space can be modeled analytically as the sum of the
influence functions of all data points
Clusters can then be determined mathematically by identifying density atractors,
where density atractors are local maxima of the overall density functions.
d(X,Y)  should be reflexive and symetric  Euclidean distance Function.
Density Attractor/Density-Attracted Points
-local maximum of the density function
-density-attracted points are determined by a gradient-based hill-climbing method
Center-Defined Cluster
A center-defined cluster with density-attractor x* ( ) is the subset of the database which is densityattracted by x*.
DNA Sequence – 1b5s
ALA Residue
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure.
• Influence function: describes the impact of a data point within its
neighborhood.
• Overall density of the data space can be calculated as the sum of the
influence function of all data points.
• Clusters can be determined mathematically by identifying density
attractors.
• Density attractors are local maximal of the overall density function.
Gradient: The steepness of a slope
• Example
f Gaussian ( x , y )  e
f
D
Gaussian
f
( x) 

d ( x , y )2

2 2

N
i 1
e
d ( x , xi ) 2
2
2
( x, xi )  i 1 ( xi  x)  e
D
Gaussian
N

d ( x , xi ) 2
2 2
Density Attractor Function
A point x* 
is called a density-attractor for a given
influence function, iff x* a local maximum of the densityfunction
A point x 
is density-attracted to a density attractor x*,
iff $k N : d (
, x*) <= x with
Center Defined Clusters
A center-defined cluster (wrt to , x) for a
density attractor x* is a subset C D, with x
 C being density-attracted by x*
and
(x*) >= x. Points x D are called
outliers if they are density-attracted by a local
maximum
with
( )<x
DENCLUE Algorithm
Local Density Function
Two cubes c1, c2  Cp are connected if
d(mean(c1); mean(c2)) <= 4
near(x) = { x1 c1 | d(mean(c1),x) < k
(note: k4)
Density Attractors
DENCLUE Algorithm (cont.)
After determining the density-attractor x* for a point x and
the point x is classified and attached to the cluster belonging to x*.
Gaussian Distance
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;
d_v_dist_a number:=0;
d_v_dist_b number:=0;
v_dist_a number:=0;
v_dist_b number:=0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
for r1 in c1 loop
select * into r2
from dna_variance_coef_xy where frequency = r1.frequency ;
begin
select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
sum(exp ( -1 * (
power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
sum( exp ( -1 * (
power ( (r1.BCOEF - BCOEF) ,2) / r2.var_bcoef )) )
into v_dist_a,v_dist_b
from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id;
update DNA_SEQUENCE_FREQ_XY set GAUSSIAN_ACOEF = round(v_dist_a), GAUSSIAN_BCOEF = round(v_dist_b)
where current of c1;
exception when others then
null;
end;
end loop;
commit;
end;
/
Gaussian Distance - sine coefficient  xy  ALA Residue
Gaussian Distance - cosine coefficient  xy  ALA Residue
Gradient Distance
cursor c1 is select * from DNA_SEQUENCE_FREQ_XY where frequency > 0 for update order by 1;
cursor c2 is select * from DNA_SEQUENCE_FREQ_YZ where frequency > 0 for update order by 1;
cursor c3 is select * from DNA_SEQUENCE_FREQ_ZX where frequency > 0 for update order by 1;
d_v_dist_a number:=0;
d_v_dist_b number:=0;
v_dist_a number:=0;
v_dist_b number:=0;
r2 dna_variance_coef_xy%rowtype;
rec3 dna_variance_coef_yz%rowtype;
rec4 dna_variance_coef_zx%rowtype;
begin
for r1 in c1 loop
select * into r2
from dna_variance_coef_xy where frequency = r1.frequency ;
begin
select /*+ INDEX ( DNA_SEQUENCE_FREQ_XY, DNA_SEQUENCE_FREQ_XY_IDX1 ) */
sum( ( r1.ACOEF - ACOEF ) * exp ( -1 * ( power ( (r1.ACOEF - ACOEF) ,2) / r2.var_acoef )) ),
sum( ( r1.BCOEF - BCOEF ) * exp ( -1 * ( power ( (r1.BCOEF - BCOEF) ,2)/ r2.var_bcoef )) )
into v_dist_a,v_dist_b
from DNA_SEQUENCE_FREQ_XY where frequency = r1.frequency and pdb_id != r1.pdb_id;
update DNA_SEQUENCE_FREQ_XY set gradient_ACOEF = round(v_dist_a), gradient_BCOEF =
round(v_dist_b)
where current of c1;
exception when others then
null;
end;
end loop;
commit;
Gradient Gaussian Distance
Cos  xy  ALA Residue
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Flexible Pattern Matching in Strings: Practical On-line
Search Algorithms for Texts and Biological Sequences
Gonzalo Navarro and Mathieu Raffinot
Numerical Recipes in C –
William H. Press / Saul A. Teukolsky / William T. Vetterling /
Brian P. Flannery - http://library.lanl.gov/numerical
Introduction to Algorithms –
Thomas H. Cormen / Charles E. Leiserson / Ronald L. Rivest
Introduction to Bioinformatics - Arthur M. Lesk
Handbook of Algorithms and Data Structures,
G.H. Gonnet & R. Baeza-Yates.
Data Mining: Opportunities and Challenges - John Wang
Who is Fourier? A Mathematical Adventure –Alan Gleason
Mathematical Handbook for Scientists and Engineers –
Granino A. Korn and Theresa M. Korn
An Efficient Approach to Clustering in Large Multimedia Databases with Noise
Alexander Hinneburg, Daniel A. Keim
Data Mining: Concepts and Techniques
Jiawei Han and Micheline Kamber