Database Searching and Pairwise Alignment

Download Report

Transcript Database Searching and Pairwise Alignment

An Introduction to Bioinformatics
Database Searching - Pairwise Alignments
AIMS
To explain the principles underlying local and global
alignment programs
To explain what substitution matrices are and how
they are used
To introduce the commonly used pairwise alignment
programs
To explore the significance of alignment results
OBJECTIVES
Carry out FastA and Blast searches
To select appropriate substitution matrices
To evaluate the significance of alignment/search results
INTRODUCTION
• Sequence comparisons
• Protein v Protein
• DNA v DNA
• Protein v DNA
• DNA v Protein
• Pair-wise comparison
• Methodology
Similarity v Homology……….
“If two genes shared a common
ancestor then they are
homologous”
They did or they didn’t, they are or they arn’t
% Homology
Definitions
Similarity v Homology…….
But :• Comparison of two sequences complex
• Differences need to be quantified
• infer homology from degree of similarity
Information theory……….?
Protein sequence = message
bits = log2M
4.19 bits per residue
bit: The amount of information required to distinguish
between two equally likely choices
Ref: Molecular Information theory - http://www-lmmb.ncifcrf.gov/~toms/
http://www.lecb.ncifcrf.gov/~toms/paper/nano2/latex/index.html
Are two proteins related ?
• Average protein size of 150 residues
• Information content of 630 bits.
• Probability that two random sequences specify the same message
is 2-630 or about 10-190.
• Convergent evolution giving rise to two similar sequences would
be very rare
• If two sequences exhibit significant similarity arose from a
common ancestor and are homologous.
Basic concept
• The English alphabet contains 26
letters, that of DNA 4, and that of protein
20
• Measure similarity or dissimilarity
Basic concept……….
• Hamming Distance
AGATCTAG ACGA
AGGCATCATGCAGT
• Measure No of differences between two sequences
• The answer to the above is………….. 10
• The proportional or p-distance. Hamming
distance divided by the total sequence length, so
ranges from 0 to 1. In the above example the pdistance is 10/14
Basic concept……….
The log-odds ratio.
- measure of how unlikely two sequences should
be so similar.
- based on the observed frequencies of each of
the characters (bases or amino acids) in the sequences,
and the probability of observing each homologous pair in
the two sequences.
- positive score, measuring similarity, calculated
by adding the scores from pre-calculated matrices (PAM
and BLOSUM for protein, unitary for DNA).
Two problems to consider:
• GAPS
• genes evolve
AGATCTAG-ACGA-TGCAGT
AGGCATCATGCAGT
•deletions, insertions, recombination
• give penalties for gap creations and extensions
• Global or Local Alignments
• Will sequences be similar over their whole length?
• Use different algorithms
Global and Local Alignments
• A global approach will attempt to align two sequences along
their entire length
• A local alignment will look for local regions of similarity or
subsequences.
T
H
E
R
A
T
S
A
T
O
N
T
H
E
C
A
T
T H E C A T S A T O N T H E M A T
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
Dotplots are the simplest form of alignment
l
l
l
l
Identical sequences, or subsequences are
l
l
l
l
l
identified by diaganol lines
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
DOTTUP website does
this analysis
Example of Rabbit v
Emperor Penguin
Haemoglobin
Matrices - PAM and BLOSUM
• Certain groups of amino acids have similar physicochemical properties e.g Lysine and Arginine
– conservative substitution
• Genetic code is degenerate - silent mutations
• Dayhoff - Point Accepted Mutation (PAM)
Matrices - PAM and BLOSUM
PAM
1 PAM unit is the extent of evolutionary divergence in
which 1% of amino acid residues are altered
• Alignment of 15 very closely related proteins
• Calculate a matrix of probability of a mutation altering one
amino acid residue to any other amino acid on the basis of 1
PAM.
• Extrapolate to PAM250
– more useful for proteins not well conserved
PAM250 matrix
Problems: derived from proteins of
only slight divergence
BLOSUM
• Henikov and Henikov (1992) derived matrices based on
sequences more divergent.
• The BLOSUM (BLOcks SUbstition Matrix) matrices cover
sequences with 80% or more similarity (BLOSUM 80), 62%
or greater similarity (BLOSUM 62) etc
• Based on local not global alignments
Alignments - local
Basic principle
• Choose one sequence to be searched against the other
• Query sequence (q) and target sequence (t)
• Divide the query sequence into small subsequences, called
words
• For each word of q, look along t to find other words in t which
are similar
• Matching words "anchors" build up a better alignment between
q and t
• Assess how good this alignment is.
FastA and BLAST
FastA
• Pearson and Lipman Method (late 80s)
• Query sequence compared to each sequence in a database
•matching words (up to 6 nucleotides, or two amino acids in a row)
• Rescore best regions with matrices
• Algorithm checks concatenation
• Best sequences displayed
FastA and BLAST
BLAST
• Basic Local Alignment Search Tool
•Compares query to database
– For each pair - finds maximal segment pair (using BLOSUM)
– The algorithm calculates probability of random occurrence
– Faster than FastA, less accurate, method of choice since
introduction of GAP-BLAST
Significance?
Only Local Alignments - without gaps
HSPs/MSPs - alignment occurring by chance (p value)
is derived from the observed score (S) to the expected
distribution of scores
– larger databases - larger probability of a
sequence match by chance
– the closer the p-value to zero the more
confidence can be given to the alignment
Types of BLAST
• Nucleotide BLAST
Standard nucleotide-nucleotide BLAST [blastn]
MEGABLAST
Search for short nearly exact matches
• Protein BLAST
Standard protein-protein BLAST [blastp]
PSI- and PHI-BLAST
Search for short nearly exact matches
• Translated BLAST Searches
Nucleotide query - Protein db [blastx]
Protein query - Translated db [tblastn]
Nucleotide query - Translated db [tblastx]
Example.
I have a new mRNA sequence:
TGGCGGCGGCGGCGGCGGTTGTCCCGGCTGTGCCGGTTGGTGTGGCCCGTCAGCCCGCGTACCACAGCGCCCGGGCCGCG
TCGAGCCCAGTACAGCCAAGCCGCTGCGGCCGGGTCCGGCGCGGGCGGCGCGCGCAGACGGAGGGCGGCGGCCGCGGCCA
GGGCGGCCCGTGGGACCGCGGGCCCCCGGCGCAGCGCTGCCCGGCTCCCGGCCCTGCCGGCCTCCTCCCTTGGCGCCGCG
GCCATGGCGGCCAGCGCGAAGCGGAAGCAGGAGGAGAAGCACCTGAAGATGCTGCGGGACATGACCGGCCTCCCGCGCAA
CCGAAAGTGCTTCGACTGCGACCAGCGCGGCCCCACCTACGTTAACATGACGGTCGGCTCCTTCGTGTGTACCTCCTGCT
CCGGCAGCCTGCGAGGATTAAATCCACCACACAGGGTGAAATCTATCTCCATGACAACATTCACACAACAGGAAATTGAA
TTCTTACAAAAACATGGAAATGAAGTCTGTAAACAGATTTGGCTAGGATTATTTGATGATAGATCTTCAGCAATTCCAGA
CTTCAGGGATCCACAAAAAGTGAAAGAGTTTCTACAAGAAAAGTATGAAAAGAAAAGATGGTATGTCCCGCCAGAACAAG
CCAAAGTCGTGGCATCAGTTCATGCATCTATTTCAGGGTCCTCTGCCAGTAGCACAAGCAGCACACCTGAGGTCAAACCA
CTGAAATCTCTTTTAGGGGATTCTGCACCAACACTGCACTTAAATAAGGGCACACCTAGTCAGTCCCCAGTTGTAGGTCG
TTCTCAAGGGCAGCAGCAGGAGAAGAAGCAATTTGACCTTTTAAGTGATCTCGGCTCAGACATCTTTGCTGCTCCAGCTC
CTCAGTCAACAGCTACAGCCAATTTTGCTAACTTTGCACATTTCAACAGTCATGCAGCTCAGAATTCTGCAAATGCAGAT
TTTGCAAACTTTGATGCATTTGGACAGTCTAGTGGTTCGAGTAATTTTGGAGGTTTCCCCACAGCAAGTCACTCTCCTTT
TCAGCCCCAAACTACAGGTGGAAGTGCTGCATCAGTAAATGCTAATTTTGCTCATTTTGATAACTTCCCCAAATCCTCCA
GTGCTGATTTTGGAACCTTCAATACTTCCCAGAGTCATCAAACAGCATCAGCTGTTAGTAAAGTTTCAACGAACAAAGCT
GGTTTACAGACTGCAGACAAATATGCAGCACTTGCTAATTTAGACAATATCTTCAGTGCCGGGCAAGGTGGTGATCAGGG
AAGTGGCTTTGGGACCACAGGTAAAGCTCCTGTTGGTTCTGTGGTTTCAGTTCCCAGTCAGTCAAGTGCATCTTCAGACA
AGTATGCAGCTCTGGCAGAACTAGACAGCGTTTTCAGTTCTGCAGCCACCTCCAGTAATGCGTATACTTCCACAAGTAAT
GCTAGCAGCAATGTTTTTGGAACAGTGCCAGTGGTTGCTTCTGCACAGACACAGCCTGCTTCATCAAGTGTGCCTGCTCC
ATTTGGACGTACGCCTTCCACAAATCCATTTGTTGCTGCTGCTGGTCCTTCTGTGGCATCTTCTACAAACCCATTTCAGA
CCAATGCCAGAGGAGCAACAGCGGCAACCTTTGGCACTGCATCCATGAGCATGCCCACGGGATTCGGCACTCCTGCTCCC
TACAGTCTTCCCACCAGCTTTAGTGGCAGCTTTCAGCAGCCTGCCTTTCCAGCCCAAGCAGCTTTCCCTCAACAGACAGC
TTTTTCTCAACAGCCCAATGGTGCAGGTTTTGCAGCATTTGGACAAACAAAGCCAGTAGTAACCCCTTTTGGTCAAGTTG
CAGCTGCTGGAGTATCTAGTAATCCTTTTATGACTGGTGCACCAACAGGACAATTTCCAACAGGAAGCTCATCAACCAAT
CCTTTCTTATAGCCTTATATAGACAATTTACTGGAACGAACTTTTATGTGGTCACATTACATCTCTCCACCTCTTGCACT
GTTGTCTTGTTTCACTGATCTTAGCTTTAAACACAAGAGAAGTCTTTAAAAAGCCTGCATTGTGTATTAAACACCAGGTA
ATATGTGCAAAACCGAGGGCTCCAGTAACACCTTCTAACCTGTGAATTGGCAGAAAAGGGTAGCGGTATCATGTATATTA
AAATTGGCTAATATTAAGTTATTGCAGATACCACATTCATTATGCTGCAGTACTGTACATATTTTTCTTAGAAATTAGCT
ATTTGTGCATATCAGTATTTGTAACTTTAACACATTGTTATGTGAGAAATGTTACTGGGGAAATAGATCAGCCACTTTTA
AGGTGCTGTCATATATCTTGGAATGAATGACCTAAAATCATTTTAACCATTGCTACTGGAAAGTAACAGAGTCAAAATTG
GAAGGTTTTATTCATTCTTGAATTTTTCCTTTCTAAAGAGCTCTTCTATTTATACATGCCTAAATTCTTTTAAAATGTAG
AGGGATACCTGTCTGCATAATAAAGCTGATCATGTTTTGCTACAGTTTGCAGGTGAAAAAAAATAAATATTATAAAATAA
AAAAAAAAAAAAAGAAAAAAAAAA
I’ve pasted my sequence
I’ve selected the database
I hit BLAST!
Record this number
Press Format!
Setting up a BLAST search
Step 1. Plan the search
Step 2. Enter the query sequence
Step 3. Choose the appropriate search parameters
Step 4. Submit the query
Deciphering the BLAST output
Step 1. Examine the alignment scores and statistics
Step 2. Examine the alignments
Step 3. Review search details to plan the next step
Post-BLAST analysis
Perform a PSI-BLAST analysis
Create a multiple alignment
Try motif searching with PHI-BLAST