SEQUENCE ANALYSIS - Indian Agricultural Statistics

Download Report

Transcript SEQUENCE ANALYSIS - Indian Agricultural Statistics

SEQUENCE ANALYSIS
By
Jyotika Bhati
Bioinformatics
The design, construction and use of software tools to
generate, store, annotate, access and analyse data and
information relating to Molecular Biology
OR
Biologists doing “stuff” with computers?
What is Sequence ?
• A sequence is an ordered list of objects (or events).
• Biological sequence is a single, continuous molecule
of nucleic acid or protein.
• Sequence analysis in bioinformatics is an automated,
computer-based examination of characteristic
fragments, e.g. of a DNA strand.
• The term "sequence analysis" in biology implies
subjecting a DNA or peptide sequence to sequence
alignment, sequence databases, repeated sequence
searches, or other bioinformatics methods on a
computer.
Nucleotide Sequence Databases
 NCBI (National Center for Biotechnology
Information)
 EMBL (European Molecular Biology Laboratory)
 DDBJ (DNA DataBank of Japan)
Protein Sequence Database
 SWISS-PROT
 TrEMBL
Sequence Alignment
• The identification of residue-residue correspondences
• The basic tool in bioinformatics
WHY Sequence Alignment ?
• For discovering functional, structural and
evolutionary information in biological sequences
• Eases further tasks like:
‾ Annotation of new sequences
‾ Modeling of protein structures
‾ Design and analysis of gene expression experiments
Basic Steps in Sequence Alignment
• Comparison of sequences to find similarity and
dissimilarity in compared sequences
• Identification of gene-structures, reading frames,
distributions of introns and exons and regulatory
elements
• Finding and comparing point mutations to get the
genetic marker
• Revealing the evolutionary and genetic diversity
• Function annotation of genes.
The Concept
• An alignment is a mutual arrangement of two
sequences
• Exhibits where two sequences are similar, and where
they differ
• An ‘optimal’ alignment – most correspondences and
the least differences
• Sequences that are similar probably have the same
function
Sequence alignment involves the identification of the
correct location of deletions and insertions that have
occurred in either of the two lineages since the
divergence from a common ancestor.
Terms of sequence comparison
Sequence identity
• Exactly same Nucleotide/AminoAcid in same position
Sequence similarity
• Substitutions with similar chemical properties
Sequence homology
• General term that indicates evolutionary relatedness
among sequences
• Sequences are homologous if they are derived from a
common ancestral sequence.
Homology
• Homology designates a qualitative relationship of
common descent between entities
• Two genes are either homologs or not !
‾ It doesn’t make sense to say “two genes are 43%
homologous”
‾ It doesn’t make sense to say “John is 43% diabetic”
Two genes are orthologs if they originated from single
ancestral gene in the most recent common ancestor of
their respective genomes
Two genes are paralogs if they are related by duplication
Things to consider
• To find the best alignment one needs to examine all
possible alignment
• To reflect the quality of the possible alignments one
needs to score them
• There can be different alignments with the same
highest score
• Variations in the scoring scheme may change the
ranking of alignments
Manual alignment
• When there are few gaps and the two sequences are
not too different from each other, a reasonable
alignment can be obtained by visual inspection.
• Advantages:
(1) use of a powerful and trainable tool (the brain,
well… some brains).
(2) ability to integrate additional data
Disadvantage : The method is subjective and unscalable.
Types of Alignment
- Pairwise Alignment
•Dot Matrix Method
•Dynamic Programming
•Word Method
- Multiple Alignment
•Dynamic Programming
•Progressive Methods
•Iterative Methods
•Motif Finding
Pairwise Sequence Alignment
• One pair of elements at a time
• Challenge – Find optimum alignment of 2
seqs with some degree of similarity
• Optimality is based on SCORE
• Score reflects the no. of paired characters
in the 2 seqs and the no. and length of gaps
introduced to adjust the seqs so that max
no. of characters are in alignment
A pairwise alignment consists of a series of paired bases,
one base from each sequence. There are three types of
pairs:
(1) matches = the same nucleotide appears in both
sequences.
(2) mismatches = different nucleotides are found in the
two sequences.
(3) gaps = a base in one sequence and a null base in the
other.
Match
Gap
Mismatch
GCGGCCCATCAGGTACTTGGTG -G
GCGT TCCATC - - CTGGTTGGTGTG
Dot Matrix Method
• Established in 1970 by A.J. Gibbs and G.A.McIntyre
• Method for comparing two nucleotide/aa sequences
• each sequence builds one axis of the
grid
• one puts a dot, at the intersection of
same letters appearing in both sequences
• scan the graph for a series of dots
reveals similarity or a string of same
characters
• longer sequences can also be compared
on a single page, by using smaller dots
Dot Matrix Method
• the dot matrix method reveals the presence of
insertions or deletions
• comparing a single sequence to itself can reveal the
presence of a repeat of a subsequence
• self comparison can reveal several features:
– similarity between chromosomes
– tandem genes
– repeated domains in a protein sequence
– regions of low sequence complexity (same
characters are often repeated)
Tools generating Dot Matrices
• Dotlet (Java based web-application)
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Compare & dotplot programmes in GCG Wisconsin
Package (Genetics Computer Group [commercial])
• GeneAssist package of ABI/Perkin Elmer
• DOTTER (available on dapsas, UNIX X-Windows)
• DNA Strider (Macintosh only)
Dot Matrix Methods
• When to use :
– unless the sequences are known to be very
much alike
• Demerits
– doesn’t readily resolve similarity that is
interrupted by insertion or deletions
– Difficult to find the best possible alignment
(optimal alignment)
– most computer programs don’t show an actual
alignment
Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with the
length of the sequences:
Two protein sequences of length 100 amino acids can be aligned in
approximately 1060 different ways
Time needed to test all possibilities is same order of magnitude as the entire
lifetime of the universe.
Global versus local alignments
Global alignment: align full length of both sequences.
“Needleman-Wunsch” algorithm).
(The
Global alignment
Local alignment: find best partial alignment of two sequences (the “SmithWaterman” algorithm).
Seq 1
Local alignment
Seq 2
Global Sequence Alignment
• The Needleman–Wunsch algorithm performs a
global alignment
• An example of dynamic programming
• First application of dynamic programming to
biological sequence comparison
• Suitable when the two sequences are of similar
length, with a significant degree of similarity
throughout
• Aim: The best alignment over the entire length of
two sequences
Steps in NW Algorithm
• Initialization
• Scoring
• Trace back (Alignment)
Consider the two DNA sequences to be globally
aligned are:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Why Gap Penalties?
• The
optimal alignment of two similar sequences is
usually that which
• maximizes the number of matches and
• minimizes the number of gaps.
•There is a tradeoff between these two
- adding gaps reduces mismatches
•Permitting the insertion of arbitrarily many gaps can lead
to high scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few
gaps.
Initialization Step
• Create a matrix with X +1 Rows and Y +1
Columns
• The 1st row and the 1st column of the score
matrix are filled as multiple of gap penalty
0
A
-1
T
-2
C
-3
G
-4
T
C
G
-1
-2
-3
Scoring
• The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
where S(i, j) is the substitution score for letters i
and j, and g is the gap penalty
Scoring ….
• Example:
The calculation for the cell C(2, 2):
scorediag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = -1 + -1 = -2
scoreleft = C(i, j-1) + g = -1 + -1 = -2
T
C
G
0
-1
-2
-3
A
-1
-1
T
-2
C
-3
G
-4
Scoring ….
• Final Scoring Matrix
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
Note: Always the last cell has the maximum
alignment score: 2
Trace back
• The trace back step determines the actual
alignment(s) that result in the maximum score
• There are likely to be multiple maximal
alignments
• Trace back starts from the last cell, i.e. position
X, Y in the matrix
• Gives alignment in reverse order
Trace back ….
• There are three possible moves: diagonally (toward
the top-left corner of the matrix), up, or left
• Trace back takes the current cell and looks to the
neighbor cells that could be direct predecessors. This
means it looks to the neighbor to the left (gap in
sequence
#2),
the
diagonal
neighbor
(match/mismatch), and the neighbor above it (gap in
sequence #1). The algorithm for trace back chooses
as the next cell in the sequence one of the possible
predecessors
Trace back ….
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
• The only possible predecessor is the diagonal
match/mismatch neighbor. If more than one possible
predecessor exists, any can be chosen. This gives us a
current alignment of
Seq 1: G
|
Seq 2: G
Trace back ….
• Final Trace back
T
C
G
0
-1
-2
-3
A
-1
-1
-2
-3
T
-2
0
-1
-2
C
-3
-1
1
0
G
-4
-2
0
2
Best Alignment:
AT C G
| | | |
_TCG
Local Sequence Alignment
• The Smith-Waterman algorithm performs a
alignment on two sequences
local
• It is an example of dynamic programming
• Useful for dissimilar sequences that are suspected to
contain regions of similarity or similar sequence motifs
within their larger sequence context
• Aim: The best alignment over the conserved domain
of two sequences
Differences in Needleman-Wunsch
and Smith-Waterman Algorithms
• In the initialization stage, the first row and first
column are all filled in with 0s
• While filling the matrix, if a score becomes
negative, put in 0 instead
• In the traceback, start with the cell that has the
highest score and work back until a cell with a score
of 0 is reached.
Three steps in Smith-Waterman
Algorithm
• Initialization
• Scoring
• Trace back (Alignment)
Consider the two DNA sequences to be globally
aligned are:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Initialization Step
• Create a matrix with X +1 Rows and Y +1
Columns
• The 1st row and the 1st column of the score
matrix are filled with 0s
0
A
0
T
0
C
0
G
0
T
C
G
0
0
0
Scoring
• The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(I, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
And
0
(here S(i, j) is the substitution score for letters i
and j, and g is the gap penalty)
Scoring ….
• Example:
The calculation for the cell C(2, 2):
scorediag = C(i-1, j-1) + S(I, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = 0 + -1 = -1
scoreleft = C(i, j-1) + g = 0 + -1 = -1
T
C
G
0
0
0
0
A
0
0
T
0
C
0
G
0
Scoring ….
• Final Scoring Matrix
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
Note: It is not mandatory that the last cell has the
maximum alignment score!
Trace back
• The trace back step determines the actual
alignment(s) that result in the maximum score
• There are likely to be multiple maximal
alignments
• Trace back starts from the cell with maximum
value in the matrix
• Gives alignment in reverse order
Trace back ….
• There are three possible moves: diagonally (toward
the top-left corner of the matrix), up, or left
• Trace back takes the current cell and looks to the
neighbor cells that could be direct predecessors. This
means it looks to the neighbor to the left (gap in
sequence
#2),
the
diagonal
neighbor
(match/mismatch), and the neighbor above it (gap in
sequence #1). The algorithm for trace back chooses
as the next cell in the sequence one of the possible
predecessors. This continues till cell with value 0 is
reached.
Trace back ….
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
• The only possible predecessor is the diagonal
match/mismatch neighbor. If more than one possible
predecessor exists, any can be chosen. This gives us a
current alignment of
Seq 1: G
|
Seq 2: G
Trace back ….
• Final Trace back
T
C
G
0
0
0
0
A
0
0
0
0
T
0
1
0
0
C
0
0
2
1
G
0
0
1
3
Best Alignment:
TCG
| | |
TCG
• The true alignment between two sequences is the
one that reflects accurately the evolutionary
relationships between the sequences.
• Since the true alignment is unknown, in practice we
look for the optimal alignment, which is the one in
which the numbers of mismatches and gaps are
minimized according to certain criteria.
• Unfortunately, reducing the number of mismatches
results in an increase in the number of gaps, and vice
versa.
FASTA
1) Derived from logic of the dot plot
– compute best diagonals from all frames of alignment
2) Word method looks for exact matches between words in query and test
sequence
– hash tables (fast computer technique)
– DNA words are usually 6 bases
– protein words are 1 or 2 amino acids
– only searches for diagonals in region of word
matches = faster searching
FastA searches can be done on the WWW FastA server at EBI:
http://www2.ebi.ac.uk/fasta3/
FASTA Algorithm
Makes Longest Diagonal
3) after all diagonals found, tries to join
diagonals by adding gaps
4) computes alignments in regions of
best diagonals
FASTA Alignments
FASTA Format
• simple format used by almost all programs
• >header line with a [return] at end
• Sequence (no specific requirements for line
length, characters, etc)
>URO1 uro1.seq
Length: 2018
November 9, 2000 11:50
Type: N
Check: 3854
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
..
BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare
your query sequence to various sections of
GenBank:
– nr = non-redundant (main sections)
– month = new sequences from the past few weeks
– ESTs
– human, drososphila, yeast, or E.coli genomes
– proteins (by automatic translation)
• This is a VERY fast and powerful computer.
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 aa’s, 11 bases)
– does not require identical words.
• If no words are similar, then no alignment
– won’t find matches for very short sequences
• Does not handle gaps well
BLAST Algorithm
BLAST Word Matching
MEAAVKEEISVEDEAVDKNI
MEA
EAA
Break query
AAV
AVK
into words:
VKE
KEE
EEI
EIS
ISV
Break database
...
sequences
into words:
Compare Word Lists
Database Sequence Word Lists
Query Word List:
MEA
EAA
AAV
AVK
VKL
KEE
EEI
EIS
ISV
?
Compare word lists
by Hashing
(allow near matches)
SDG
SRW
QEL
VKI
GKG
DKI
NIS
LFC
AAV
PFR
…
RTT
KSS
LLN
RWY
WDV
KVR
DEI
…
AAQ
Find locations of matching words
in database sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
MEA
EAA
AAV
AVK
KLV
KEE
EEI
EIS
ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
Extend hits one base at a time
HSPs are Aligned Regions
• The results of the word matching and
attempts to extend the alignment are
segments
- called HSPs (High-scoring Segment Pairs)
• BLAST often produces several short HSPs
rather than a single aligned region
Searching on the web: BLAST at
NCBI
Very fast computer dedicated to
running BLAST searches
Many databases that are always up
to date
Nice simple web interface
But you still need knowledge about
BLAST to use it properly
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN
Length = 756
DNA mismatch repair protein Mlh1 1)
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
Query: 61
GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
GSNSSRMYFTQTLLPGLAGPSGEMVK
DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395
Query: 121 FLQPLSKPLSS 131
FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406
low complexity sequence filtered
DNA vs. Protein searches
• DNA is composed of 4 characters: A,G,C,T It is
anticipated that on the average, at least 25% of the
residues of any 2 unrelated aligned sequences, would
be identical.
• Protein sequence is composed of 20 characters (aa).
The sensitivity of the comparison is improved. It is
accepted that convergence of Proteins is rare,
meaning that high similarity between 2 proteins
always means homology.
DNA vs. Protein searches
• What should we use to search for similarity, the
nucleotide or the protein sequences?
• If we have a nucleotide sequence, should we search
the DNA databases only? Or should we translate it to
protein and search protein databases?
Note, that by translating into aa sequence, we’ll
presumably lose information, since the genetic code
is degenerate, meaning that two or more codons can
be translated to the same amino acid.
Nucleotide, amino-acid sequences
• 3 different DNA positions but
-GGAGCCATATTAGATAGA-only one different amino acid
position:
-GGAGCAATTTTTGATAGA2 of the nucleotide substitutions
Gly Ala Ile Phe asp Arg are therefore synonymous and
one is non-synonymous.
Gly Ala Ile Leu asp Arg
DNA yields more phylogenetic information than proteins. The
nucleotide sequences of a pair of homologous genes have a
higher information content than the amino acid sequences of
the corresponding proteins, because mutations that result in
synonymous changes alter the DNA sequence but do not affect the
amino acid sequence. (Amino-acid sequences are more efficiently
aligned).
DNA vs. Protein searches
• What about very different DNA sequences that
code for similar protein sequences? We certainly
do not want to miss those.
• Conclusion: We should use proteins for database
similarity searches when possible.
DNA vs. Protein searches
• The reasons for this conclusion are:
– When comparing DNA sequences, we get significantly
more random matches than we get with proteins.
– The DNA databases are much larger, and grow faster
than Protein databases. Bigger database means more
random hits!
– For DNA we usually use identity matrices, for protein
more sensitive matrices like PAM and BLOSUM, which
allow for better search results.
– The conservation in evolution, protein are rarely
mutated.
Input Query
Amino Acid Sequence
DNA Sequence
Blastp
tblastn
blastn
blastx
tblastx
Compares
Against
Protein
Sequence
Database
Compares
Against
translated
Nucleotide
Sequence
Database
Compares
Against
Nucleotide
Sequence
Database
Compares
Against
Protein
Sequence
Database
Compares
Against
translated
nucleotide
Sequence
Database
An Overview of BLAST
Why use BLAST?
• To discover functional, structural and evolutionary
similarities
• Because “similarity” may be an indicator of
“homology” and thus provide some insight into
function or gene identification.
• Applications include
– identifying orthologs and paralogs
– discovering new genes or proteins
– exploring protein structure and function
Meaningfulness
•
•
•
•
Is the alignment correct ?
Can I make it better ?
Which programs are best ?
How do you know if its correct ?
Is the Alignment Correct ?
• What do mean by correct ?
– Mathematically rigorous
– Biologically meaningful
– Operationally useful
Can you make it better ?
•
•
•
•
Only if you know what you doing !
Define better ?
What’s the goal ?
What’s the biology ?
Which programs are best ?
• No simple answer
• Depends on the particular problem
• Recent objective studies help answer this
problem
• Some tools to help compare alignments
How do you know it is correct ?
•
•
•
•
Methods to evaluate the alignment
Methods to evaluate the program/algorithm
Structural information
Biology
Next time we talk about
MULTIPLE
ALIGNMENT
THANK
YOU