T - Florida Tech Department of Computer Sciences

Download Report

Transcript T - Florida Tech Department of Computer Sciences

Correlogram Method
for comparing Bio-Sequences
- Gandhali Samant , M.S. Computer Science
Committee
Dr. Debasis Mitra, PhD
Dr. William Shoaff, PhD
Dr. Alan Leonard, PhD
What is Sequence Comparison
Sequence Comparison – One of the most important
primitive operations in computational biology.
Finding resemblance or similarity between sequences
Basis for many other more complex manipulations.
Used for database search, phylogeny development,
clustering etc.
What is Sequence Comparison …Contd.
Two important notions are Similarity – How similar are the two sequences?
This gives a numeric score of similarity between two
sequences
A G T C T C
A T T G T C
-------------------------1 -1 1 -1 1 1 = 2
Alignment – Way of placing one sequence above other to
make clear the correspondence between them.
AG T C GTC
A_ T C _ TC
-------------------------1 -2 1 1 -2 1 1 = 1
What is Sequence Comparison …Contd.
Many methods have been proposed for sequence
comparison.
Some Important ones include –
Dynamic programming algorithms for sequence
alignment - Global, Local or Semi-Global Alignment
Heuristic and Database Search Algorithms - BLAST,
FASTA.
What is Sequence Comparison …Contd.
Multiple sequence alignment Algorithms
Multiple sequence alignment methods are mainly used
when there is a need to extract information from a
group of sequences.
Examples of situations in which these techniques are
used include the determination of secondary or tertiary
structures, characterization of protein families,
identification of similar regions etc.
What is Sequence Comparison …Contd.
Also many miscellaneous techniques have been
proposed for sequence comparison
Contact based sequence alignment
Using Correlation Images

Some methods have been proposed without using the
fundamental tool of Sequence Alignment
Shortest unique substring
Background Study
Basic Concepts of Molecular Biology
BLAST
Clustering
Phylogeny Trees / Phylip
Basic Concepts of Molecular Biology
Proteins –
Most substances in our body are proteins
Some of these are structural proteins and some are
enzymes.
Proteins are responsible for what an organism is and
what it does in physical sense.
Amino Acids –
A protein is a chain of simple molecules called Amino
Acids. There are total 20 amino acids
Basic Concepts of Molecular Biology
Nucleic Acids –
Nucleic Acids encode information necessary to produce
proteins
They are responsible for passing recipe to subsequent
generations.
2 types of nucleic acids present in living organisms,
RNA (ribonucleic acid)
DNA (deoxyribonucleic acid).
BLAST
BLAST (Basic Local Alignment Search Tool)
BLAST algorithms are heuristic search methods
This method seeks words of length W (default=3 in
blastP) that score at least T when aligned with the
query and scored with the substitution matrix (e.g.
PAM)
Clustering
Clustering
It can be defined as “The process of organizing objects
into groups whose members are similar in some way”
Phylogeny Trees / Phylip
Phylogeny -The context of evolutionary biology
Phylogeny Trees
Relationships between different species and their common
ancestors shown by constructing a tree.
PHYLIP, the Phylogeny Inference Package, is a package of
programs for inferring phylogenies (evolutionary trees)
from University of Washington .
What Phylip can do??
Data used by phylip.
Phylip…Contd.
Following are the programs used from Phylip package
in this research.
FITCH - Estimates phylogenies from distance matrix data.
KITCH - Estimates phylogenies from distance matrix data.
NEIGHBOR - Produces an un-rooted tree
DRAWGRAM - Plots rooted phylogenies, cladograms,
circular trees and phenograms in a wide variety of usercontrollable formats. The program is interactive.
DRAWTREE - Similar to DRAWGRAM but plots unrooted
phylogenies.
Our Approach …Correlogram
What is a Correlogram??
D
Representation of sequence in
mathematical space.
3-D matrix of which 2
dimensions are the set of
entities (e.g.. Amino Acids,
Nucleic Acids) and third
dimension is distance.
3
2
1
A0
T
G
C
A
T
G
C
Correlogram for Image Comparison
Correlogram method has already been used for Image
comparison.
“Image indexing using color correlograms” By Jing
Huang,S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu,
Ramin Zabih
A color correlogram expresses how the spatial
correlation of pairs of colors changes with the distance
Color correlogram has also been used recently for
object tracking
Correlogram Usage in the field of Bioinformatics
Correlograms were used to analyze autocorrelation
characteristics of active polypeptides.
MF Macchiato, V Cuomo and A Tramontano (1985),
“Determination of the autocorrelation orders of proteins”

For analyzing spatial patterns in various experiments.
–

Giorgio Bertorelle and Guido Barbujanit (1995), “Analysis
of DNA Diversity by Spatial Auto Correlation”
In studies regarding patterns of transitional mutation
biases within and among mammalian genomes
–
Michael S. Rosenberg, Sankar Subramanian, and Sudhir
Kumar (2003), “Patterns of Transitional Mutation Biases
Within and Among Mammalian Genomes”
Constructing a Correlogram plane
Example
Sequence ….. agcttactgt
If we calculate the
appearance of every pair of
characters at distance 1 ..
d=1
A
Correlogram can be
constructed as a set of
frequencies for different
distances.
G
A
T
The Correlogram Plane for
distance 1 will be ->
T
C
1
1
1
G
1
C
2
1
1
1
Constructing a Correlogram plane…Contd.
Example
Sequence ….. agcttactgt
Correlogram plane for d=0
d=0
A
A
T
G
C
T
G
C
2
4
2
2
Constructing a Correlogram plane…Contd.
Example
Sequence ….. agcttactgt
Correlogram plane for d=2
d=2
A
T
A
T
1
G
C
1
1
1
1
G
1
C
1
1
Graphical Representation of Correlogram
Correlogram plane
shown here is of a
protein sequence for
distance 0.
Y
W
V
T
S
R
Q
P
N
M
L
K
I
0.05
H
G
F
E
D
C
A
At distance 0 each
character is
compared with itself
so we can see all the
peaks at diagonal.
0.1
AC DE FG H I KLM N PQ RS TV WY
This is a Histogram.
0
Graphical Representation …Contd.
Similarly Correlogram frequencies for distance 1 and
distance 2 can be represented as…
Correlogram frequencies for Distance 1
Correlogram frequencies for Distance 2
W
0.02
S
M
0.01
0.02
P
H
G
A
A C D E F GH I KLM N PQ R S TV W Y
0
A
0
ACDEFGHIKLMNPQRSTVWY
Normalization of Correlogram
Need for normalization – Finding similarity between
sequences of different length.
For every correlogram plane, each value is divided by the
total volume of that plane.
Extension - Gapped Correlogram
1
Gapped Correlogram Consideration the gapped
alignment of sequences
0.5
0.5
0.25
The reason is if a pair of
character is at distance d,
there is probability that in
other sequence it might
appear at distance d-1 or
d+1.
Adding a ‘delta’ to
Correlogram.
d -> 2
0.25
3
4
5
6
For every pair at distance n,
frequency f and with delta =
d, a fraction of frequency
f/(2|n-distance|) is added at
distances n-1,n-2… n-d and
distances n+1,n+2… n+d.
Extension - Gapped Correlogram…Contd.
D=2
Delta = 1
+
A
T
G
A
1
C
D=3
0.5
A
T
G
C
0.5
0.5
0.5
0.5
0.5
1
Adding values to
previous plane
T
G
A
T
2
1
C
D=4
1
1
1
G
1
1
C
2
Adding values to
next plane
+
A
T
G
A
T
1
0.5
C
0.5
0.5
0.5
G
0.5
0.5
C
1
Correlogram for Sequence Comparison
We are using these Correlograms for comparison of 2
sequences.
Correlograms were constructed using same set of distances
for both the sequences being compared.
Then distance between each cell of two Correlograms (i.e.
Two 3-D Matrices) is calculated as
dijk = (Sijk – S’ijk )2 / (1+ Sijk + S’ijk )
where i, j and k are 3 dimensions.
These distances were then added to get a final distance
between two sequences.
d = √ ∑ dijk
One major difference !!
Synthetic Data Experiments using Correlogram
Purpose
To discriminate and compare the capability of correlogrammethod with one of the "traditional" comparison techniques
i.e. Smith-Waterman Dynamic Programming algorithms.
The reason for using DP algorithms for comparison was
that they are the most standard method for sequence
comparison.
The sequences used in these experiments were amino
acid sequences
Synthetic Data Experiments…Contd.
In all the experiments, the pair of sequences was
compared using both Correlogram method and DP
Method.
Synthetic Data Experiments…Contd.
The experiments were designed as follows
Comparing a base sequence with its reverse sequence
Wrap around the target sequence at different character
length and measure the difference with respect to the
reference sequence each time
Delete an amino acid from target sequence and measure
the difference with respect to the reference sequence each
time
Replace an amino acid at different location and measure
the difference with respect to the reference sequence each
time
Add an amino acid from target sequence and measure the
difference with respect to the reference sequence each
time
Synthetic Data Experiments…Contd.
Comparing a base sequence with its reverse
sequence.
Correlogram Score
DP Score
5
4
Scores
3
2
1
0
-1
-2
1
2
3
Iterations
4
Synthetic Data Experiments…Contd.
Wrap around the target sequence at different character
length and measure the difference with respect to the
reference sequence each time.
Correlogram Score
DP Score
5
4
Score
3
2
1
0
-1
0
2
4
6
Iterations
8
10
12
Synthetic Data Experiments…Contd.
Delete an amino acid from target sequence and
measure the difference with respect to the reference
sequence each time.
Correlogram Score
DP Score
5
4
Scores
3
2
1
0
-1
1
2
3
4
5
6
Iterations
7
8
9
10
Synthetic Data Experiments…Contd.
Replace an amino acid at different location and
measure the difference with respect to the reference
sequence each time.
Correlogram Score
DP Score
5
4
Scores
3
2
1
0
-1
1
2
3
4
5
6
7
8
9
Iterations
10
11
12
13
14
Synthetic Data Experiments…Contd.
Add an amino acid at different location and measure
the difference with respect to the reference
sequence each time.
Correlogram Score
DP Score
5
4
Scores
3
2
1
0
-1
1
2
3
4
5
6
7
8
9
Iterations
10
11
12
13
14
Finding Test data..
“Alternate circulation of recent equine-2 influenza viruses
(H3N8) from two distinct lineages in the United States” By
Alexander C.K. Lai, Kristin M. Rogers, Amy Glaser, Lynn
Tudor, Thomas Chambers
hemagglutinin (HA) gene from Different strains of equine-2
influenza viruses.
GeneTool version 1.1. – Compilation and analysis
Phylogenetic analysis was performed by using the deduced
HA1 amino acid sequence and the PHYLIP software package
Distance matrix was calculated by using the PROTDIST
program, and an unrooted tree generated by using the FITCH
program.
Test Data
Phylogeny Tree
Experiment 1 : Using same Test data
We have done an experiment with the same test data.
All the protein sequences were searched.
http://www.ebi.ac.uk/cgi-bin/expasyfetch
A distance matrix was created using correlogram distances
for every pair among these sequences.
From this distance matrix, a tree is created using PHYLIP
software package.
The program ‘FITCH’ is used for creating tree whereas the
program ‘DRAWTREE’ is used for visualizing the tree.
Graphical Representation of Correlogram for SA90
Distance = 1
0.02
0.05
0.01
K
0
A
A CDE F G H I K LM NPQ RS T VWY
A
A C D E F G H I K LM N PQ RS T VWY
F
F
K
P
T
0.1
P
T
Distance = 0
Distance = 2
0.02
F
K
P
T
0.01
A
ACDE F G H I KLM NPQ RS T VWY
0
0
Graphical Representation of Correlogram for SA90
Distance = 3
Distance = 4
0.01
0.01
0
K
ACDE F G H I KLM NPQ RS T VWY
A
A
ACDE F G H I KLM NPQ RS T VWY
0.005
F
F
K
0.005
P
P
T
0.015
T
0.015
0
Distance Matrix
SA90
SU89
LM92
HK92
KY92
KY91
KY94
SA90
0
0.084172
0.035637
0.087184
0.085504
0.085942
0.086551
SU90
0.084172
0
0.082183
0.014866
0.020469
0.021679
0.020881
LM92
0.035637
0.082183
0
0.081841
0.085076
0.085575
0.085744
HK92
0.087184
0.014866
0.081841
0
0.024637
0.026439
0.025417
KY92
0.085504
0.020469
0.085076
0.024637
0
0.018841
0.017493
KY91
0.085942
0.021679
0.085575
0.026439
0.018841
0
0.016823
KY94
0.086551
0.020881
0.085744
0.025417
0.017493
0.016823
0
Phylogeny Tree found with Correlogram Distances
Comparison of two trees.
Experiment 2 : Finding Test Data
Parvovirus causes stomach diseases in children.
Coat protein – Some coat proteins are important as they are
responsible for the resistance.
Different strains of parvoviri were studied for their VP1
Protein.
Reference for the test data – Dr. Mavis McKenna and Dr. Rob
McKenna from University of Florida, Gainesville.
From these distance matrices, trees were created using
PHYLIP software package.
The programs ‘NEIGHBOR’ and ‘DRAWTREE’ were used.
Comparison of two trees.
Experiment 3 -Correlogram for Sequence Scanning
The next experiment was to use correlogram for scanning
Sequences i.e. Pattern Finding.
The algorithm Scan Correlogram was developed for finding
the occurrences of a given pattern over a long sequence.
Pattern
A T C G T
A
T
C
G
A
T
C
G
T
T
A
G
C
T
C
C
Target
1st
Comparison
2nd Comparison
Last Comparison
Experiment 3 -Correlogram for Sequence
Scanning…Contd.
Following Viruses were used in this experiment
Porcine-parvovirus
Bovine Parvovirus
CPV Packaged Strand
H1 Complementary
MVM Packaged Strand
PhiX-Genome
AAV NC001401
AAV Complementary
ADV Complementary
Astell and Tattersall MVMi Packaged Sequence
Experiment 3 -Correlogram for Sequence
Scanning…Contd.
The patterns searched were as follows
ACACCAAAA
ATACCTCTTGC
ATCCTCTATCAC
Results for Bovine Parvovirus
Following are the results shown for Bovine Parvovirus.
The length of sequence was 5517 and cut-off score used
was 2.48 for all three patterns.
Pattern 1 - ACACCAAAA
Difference Score
3
2.5
2
1.5
1
0.5
0
0
1000
2000
3000
4000
Location of Substring
5000
6000
Results for Bovine Parvovirus
Following are the results shown for Bovine Parvovirus for
pattern ACACCAAAA.
Location
Score Distance
129
2167
Substring
2.28 ACAACTAAA
1.99 ACCCAAATA
4149
2.39 AACTCCAAA
1.83 TACCACCAA
4150
1.83 ACCACCAAA
4151
4152
2.09 CCACCAAAT
1.81 CACCAAATC
4798
2.48 ACCCCCAAT
3543
Conclusions??
This research developed the correlogram comparison
method for comparing sequences. Experiments were
performed on real sequences and on synthetic sequences to
answer the research questions of whether the correlogram
biological sequences.
It was observed that the Dynamic Programming method
was more sensitive to the positioning of characters (i.e.
amino acids or nucleic acids) in the sequence (sequence
alignment), whereas the Correlogram method was found to
be more sensitive to the character itself (contents of the
sequence)
Conclusions??
The real data experiment was conducted on different
strains of the horse influenza virus and the parvovirus. It
was observed that the phylogeny was retained in most
cases, however there were certain remarkable differences
between the two.
The scan correlogram algorithm was developed and used in
this research to find motifs or patterns. The results of this
experiment showed that the sub-sequences obtained were
very similar to the given pattern.
Future Work
The further study can be done to see how the array
of distances used for correlogram computations can
impact the results.
It will be interesting to study various delta values for
Gapped correlograms and how they affect the scores.
This gapped correlogram method can be further
researched to see if the delta values are useful in
determining global versus local alignments.
Future Work…Contd.
Enhancements can be made to the scan correlogram
method to use the gapped correlogram method for
finding patterns and also to find the sub-sequences
of more or less length than that of the pattern
sequence.
Acknowledgement
Dr. Kuntal Sengupta suggested that correlogram method
can be used for comparison of bio-sequences.
Dr. Mavis McKenna and Dr. Rob McKenna, University of
Florida, Gainesville.
Mridula Anand, Florida Institute of Technology.
References
http://www.ncbi.nlm.nih.gov/BLAST/
http://highwire.stanford.edu/
http://au.expasy.org/
http://evolution.gs.washington.edu/phylip.html
“Alternate circulation of recent equine-2 influenza viruses
(H3N8) from two distinct lineages in the United States” By
Alexander C.K. Lai, Kristin M. Rogers, Amy Glaser, Lynn
Tudor, Thomas Chambers.
“Image indexing using color correlograms” By Jing Huang,S
Ravi
Kumar, Mandar Mitra, Wei-Jing Zhu, Ramin Zabih
“Phylogeny of the genus Haemophilus as determined by
comparison of partial infB sequences” By Jakob Hedegaard,
Henrik Okkels, Brita Bruun, Mogens Kilian, Kim K.
Mortensen1 and Niels Nørskov-Lauritsen
Thanks!!