Sequence Similarity & Sequence Searching

Sequencing & Sequence Alignment
Lecture 2.4
[Figure: matrix of scores comparing the words GENESIS and GENETICS]
1
Objectives
• Understand how DNA sequence data is
collected and prepared
• Be aware of the importance of sequence
searching and sequence alignment in
biology and medicine
• Be familiar with the different algorithms
and scoring schemes used in sequence
searching and sequence alignment
2
High Throughput DNA Sequencing
3
30,000
4
Shotgun Sequencing
• Isolate Chromosome
• Shear DNA into Fragments
• Clone into Seq. Vectors
• Sequence
5
Principles of DNA Sequencing
[Figure: pBR322 vector (Amp, Tet, Ori) carrying the DNA fragment and the primer site]
• Denature with heat to produce ssDNA
• Klenow + ddNTP + dNTP + primers
6
The Secret to Sanger Sequencing
7
Principles of DNA Sequencing
[Figure: 3’ template annealed to a 5’ primer; sequence being read: G C A T G C]
Each reaction contains dATP, dCTP, dGTP, dTTP plus one ddNTP:
Reaction   Terminated products
ddCTP      GddC, GCATGddC
ddATP      GCddA
ddTTP      GCAddT
ddGTP      ddG, GCATddG
8
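The four chain-termination reactions can be mimicked in a few lines. This is only a sketch: it assumes an idealized reaction in which every possible terminated fragment is produced exactly once, and the function names are illustrative, not taken from any sequencing software.

```python
from collections import defaultdict


def terminated_fragments(read: str) -> dict[str, list[str]]:
    """Group every prefix of `read` by its last base.

    The last base of each prefix stands for the dideoxy nucleotide that
    stopped extension, so the fragment written GddC on the slide is "GC" here.
    """
    fragments = defaultdict(list)
    for end in range(1, len(read) + 1):
        prefix = read[:end]
        fragments[prefix[-1]].append(prefix)
    return dict(fragments)


def read_from_fragments(fragments: dict[str, list[str]]) -> str:
    """Recover the sequence by ordering all fragments by length, as a gel does."""
    ordered = sorted(
        (frag for frags in fragments.values() for frag in frags), key=len
    )
    return "".join(frag[-1] for frag in ordered)


if __name__ == "__main__":
    by_base = terminated_fragments("GCATGC")
    for base in "ACGT":
        print(f"dd{base}TP reaction:", by_base.get(base, []))
    print("read:", read_from_fragments(by_base))
```

Sorting by length plays the role of electrophoresis: the shortest fragment gives the first base and each longer fragment gives the next one.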
Principles of DNA Sequencing
[Figure: sequencing gel; the bands read in order give the sequence G C A T G C]
9
Capillary Electrophoresis
Separation by Electro-osmotic Flow
10
Multiplexed CE with Fluorescent detection
ABI 3700: 96 x 700 bases
11
Shotgun Sequencing
• Sequence Chromatogram
• Send to Computer
• Assembled Sequence
12
Shotgun Sequencing
• Very efficient process for small-scale (~10 kb)
sequencing (preferred method)
• First applied to whole genome sequencing in
1995 (H. influenzae)
• Now standard for all prokaryotic genome
sequencing projects
• Successfully applied to D. melanogaster
• Moderately successful for H. sapiens
13
The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
14
Sequencing Successes
T7 bacteriophage
completed in 1983
39,937 bp, 59 coded proteins
Escherichia coli
completed in 1998
4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae
completed in 1996
12,069,252 bp, 5800 genes
15
Sequencing Successes
Caenorhabditis elegans
completed in 1998
95,078,296 bp, 19,099 genes
Drosophila melanogaster
completed in 2000
116,117,226 bp, 13,601 genes
Homo sapiens
1st draft completed in 2001
3,160,079,000 bp, 31,780 genes
16
So what do we do with all this sequence data?
17
Sequence Alignment
     G   E   N   E   S   I   S
G   60  40  30  20  20  10   0
E   40  50  30  20  20  10   0
N   30  30  40  20  20  10   0
E   20  30  20  30  20  10   0
T   20  20  20  20  20  10   0
I    0   0   0  10   0  20   0
C   10  10  10  10  10  10   0
S    0   0   0   0  10   0  10
18
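A matrix like the one above can be related to a dynamic programming alignment of the two words. The slide does not state its scoring rule, so the sketch below assumes 10 per matching letter, 0 per mismatch and no gap penalty; these values, and the function name, are assumptions made for illustration only.

```python
def global_alignment_score(a: str, b: str, match: int = 10,
                           mismatch: int = 0, gap: int = 0) -> int:
    """Best global alignment score of `a` and `b` by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    # Each cell keeps the best of: align two letters, or gap in either word.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GENESIS", "GENETICS"))
```

Under these assumed values the best score is 60 (six matched letters: G, E, N, E, I, S), the same as the largest entry in the matrix. The sketch fills its table from the top-left corner, so its intermediate cells are not expected to reproduce the slide cell by cell.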
Alignments tell us about...
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
19
Factoid: Sequence comparisons lie at the heart of all bioinformatics
20
Similarity versus Homology
• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of bases or amino acids
• Similarity does not imply homology
• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity
21
Similarity versus Homology
• Similarity can be quantified
• It is correct to say that two sequences are
X% identical
• It is correct to say that two sequences have
a similarity score of Z
• It is generally incorrect to say that two
sequences are X% similar
22
Similarity versus Homology
• Homology cannot be quantified
• If two sequences have a high % identity it is OK to say they are homologous
• It is incorrect to say two sequences have a homology score of Z
• It is incorrect to say two sequences are X% homologous
23
Sequence Complexity
High Complexity: MCDEFGHIKLAN….
Mid Complexity: ACTGTCACTGAT….
Low Complexity: NNNNTTTTTNNN….
Translate those DNA sequences!!!
24
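One common way to put a number on sequence complexity is the Shannon entropy of the letter composition. The slide does not name a measure, so entropy is an assumption here, and the function name is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy in bits per residue of the letter composition of `seq`."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    examples = {
        "high": "MCDEFGHIKLAN",
        "mid": "ACTGTCACTGAT",
        "low": "NNNNTTTTTNNN",
    }
    for label, seq in examples.items():
        print(f"{label:>4}  {seq}  {shannon_entropy(seq):.2f} bits")
```

The three slide examples come out in the expected order, high above mid above low. Part of that gap is simply alphabet size (20 amino acids against 4 bases), which is one reason to translate DNA before comparing.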
Assessing Sequence Similarity
Two Character Strings:
THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison:
THESTORYOFGENESI-S
** * *  * **** * *
THISBOOKONGENETICS

Context Comparison:
THE STORY OF GENESIS
THIS BOOK ON GENETICS
25
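The character comparison can be reproduced directly from the two aligned strings. In this sketch percent identity is taken over all alignment columns, gap column included; that denominator is an assumption, since other conventions divide by the shorter sequence or by the non-gap columns.

```python
def match_line(a: str, b: str) -> str:
    """Return '*' under identical aligned characters and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if x == y and x != "-" else " " for x, y in zip(a, b)
    )


def percent_identity(a: str, b: str) -> float:
    """Identical columns as a percentage of all alignment columns."""
    line = match_line(a, b)
    return 100.0 * line.count("*") / len(line)


if __name__ == "__main__":
    top = "THESTORYOFGENESI-S"
    bottom = "THISBOOKONGENETICS"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
    print(f"{percent_identity(top, bottom):.1f}% identical")
```

For the slide example this gives 11 identical columns out of 18, or about 61.1% identity.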
Assessing Sequence Similarity
Rbn  KETAAAKFERQHMD
Lsz  KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT

Rbn  SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz  QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn  DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz  LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn  PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz  NRCKGTDVQA WIRGCRL

Is this alignment significant?
26
Is This Alignment Significant?
Gelsolin    89  L G N E L S Q D E S G A A A I F T V Q L  108
Annexin     82  L P S A L K S A L S G H L E T V I L G L  101
           154  L E K D I I S D T S G D F R K L M V A L  173
           240  L E - S I K K E V K G D L E N A F L N L  258
           314  L Y Y Y I Q Q D T K G D Y Q K A L L Y L  333
Consensus       L x P x x x P D x S G x h x x h x V L L
27
Some Simple Rules
• If two sequences are > 100 residues and > 25% identical, they are likely related
• If two sequences are 15-25% identical they may be related, but more tests are needed
• If two sequences are < 15% identical they are probably not related
• If you need more than 1 gap for every 20 residues the alignment is suspicious
28
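The rules above translate into a small checklist function. This is a sketch: the slide does not say how to treat short sequences (100 residues or fewer) with more than 25% identity, so the choice to file them under "may be related" is an assumption, as are the function name and the wording of the returned notes.

```python
def assess_alignment(length: int, percent_identity: float, gaps: int) -> list[str]:
    """Apply the slide's rules of thumb to one pairwise alignment."""
    notes = []
    if length > 100 and percent_identity > 25:
        notes.append("likely related")
    elif percent_identity >= 15:
        # Assumption: short alignments above 25% identity also land here.
        notes.append("may be related; more tests needed")
    else:
        notes.append("probably not related")
    # More than 1 gap per 20 residues is treated as a warning sign.
    if gaps * 20 > length:
        notes.append("suspicious: more than 1 gap per 20 residues")
    return notes


if __name__ == "__main__":
    print(assess_alignment(length=104, percent_identity=86.5, gaps=0))
    print(assess_alignment(length=200, percent_identity=20.0, gaps=5))
    print(assess_alignment(length=60, percent_identity=12.0, gaps=4))
```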
Doolittle’s Rules of Thumb
[Figure: Evolutionary Distance vs. Percent Sequence Identity. Sequence Identity (%) plotted against Number of Residues, with a region labelled "Twilight Zone".]
29
Sequence Alignment - Methods
• Dot Plots
• Dynamic Programming
• Heuristic (Fast) Local Alignment
• Multiple Sequence Alignment
• Contig Assembly
30
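The first method in the list, the dot plot, is simple enough to sketch: mark every cell where the letter of one sequence equals the letter of the other. The function name and the "*" and "." symbols are choices made for this example.

```python
def dot_plot(a: str, b: str) -> list[str]:
    """One row per letter of `a`; '*' where it equals the letter of `b`."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    rows_seq, cols_seq = "GENETICS", "GENESIS"
    print("  " + " ".join(cols_seq))
    for letter, row in zip(rows_seq, dot_plot(rows_seq, cols_seq)):
        print(letter + " " + " ".join(row))
```

Runs of "*" along a diagonal indicate a stretch that the two sequences share; practical dot plot programs usually add a sliding window and a threshold to suppress isolated chance matches.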
PAM Matrices
• Developed by M.O. Dayhoff (1978)
• PAM = Point Accepted Mutation
• Matrix assembled by looking at patterns of
substitutions in closely related proteins
• 1 PAM corresponds to 1 amino acid
change per 100 residues
• 1 PAM = 1% divergence or 1 million years
in evolutionary history
31
Fast Local Alignment Methods
• Developed by Lipman & Pearson (1985/88)
• Refined by Altschul et al. (1990/97)
• Ideal for large database comparisons
• Uses heuristics & statistical simplification
• Fast N-type algorithm (similar to Dot Plot)
• Cuts sequences into short words (k-tuples)
• Uses “Hash Tables” to speed comparison
32
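The k-tuple and hash table idea can be sketched as follows. This is not the FASTA or BLAST implementation: it only shows how a word index lets shared k-tuples be found without comparing every position against every other, and how hits group onto diagonals. The function names and the default k = 2 are assumptions for the example.

```python
from collections import Counter, defaultdict


def index_ktuples(seq: str, k: int) -> dict[str, list[int]]:
    """Hash table mapping each k-letter word to its start positions in `seq`."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query: str, subject: str, k: int = 2) -> Counter:
    """Count shared k-tuples per diagonal (query offset minus subject offset)."""
    table = index_ktuples(subject, k)
    hits = Counter()
    for q in range(len(query) - k + 1):
        # One dictionary lookup replaces a scan of the whole subject.
        for s in table.get(query[q:q + k], ()):
            hits[q - s] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("GENESIS", "GENETICS", k=2)
    print(hits.most_common(3))
```

A diagonal with many hits marks a region where identical words cluster, which is the kind of region a fast local alignment method then examines more closely.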
FASTA
• Developed in 1985 and 1988 (W. Pearson)
• Looks for clusters of nearby or locally
dense “identical” k-tuples
• init1 score = score for first set of k-tuples
• initn score = score for gapped k-tuples
• opt score = optimized alignment score
• Z-score = number of S.D. above random
• expect = expected # of random matches
33
FASTA
gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)
gi|135     2- 105: --------------------------------------------------------------------:

               10        20        30        40        50        60        70        80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
        :::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135  VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
                10        20        30        40        50        60        70

               90       100
thiore KKGQKVGEFSGANKEKLEATINELV
       ::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
      80        90       100
34
Multiple Sequence Alignment
Multiple alignment of Calcitonins
35
Multiple Alignment Algorithm
• Take all “n” sequences and perform all possible pairwise (n(n-1)/2) alignments
• Identify highest scoring pair, perform an alignment & create a consensus sequence
• Select next most similar sequence and align it to the initial consensus, regenerate a second consensus
• Repeat step 3 until finished
36
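The ordering part of this procedure (score all pairs, start with the best pair, then add the next most similar sequence) can be sketched without the alignment and consensus steps. Several things here are assumptions: `difflib`'s similarity ratio stands in for a real pairwise alignment score, similarity to the already-joined sequences stands in for similarity to the consensus, and the four short sequences are made up for the demonstration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in score (difflib ratio); a real tool would use an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pairwise_scores(seqs: dict[str, str]) -> dict[frozenset, float]:
    """Score all n(n-1)/2 unordered pairs of named sequences."""
    return {
        frozenset((x, y)): similarity(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }


def joining_order(seqs: dict[str, str]) -> list[str]:
    """Order in which sequences would be added; needs at least two sequences."""
    scores = pairwise_scores(seqs)
    best_pair = max(scores, key=scores.get)
    order = sorted(best_pair)
    remaining = [name for name in seqs if name not in best_pair]
    while remaining:
        # Next: the sequence most similar to anything already joined.
        nxt = max(
            remaining,
            key=lambda name: max(scores[frozenset((name, done))] for done in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    demo = {"a": "GATTACA", "b": "GATTACA", "c": "GATTTCA", "d": "CCCCCCC"}
    print(len(pairwise_scores(demo)), "pairwise comparisons")
    print(joining_order(demo))
```

With four sequences there are 4 x 3 / 2 = 6 pairwise comparisons, and the number grows quadratically with n.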
Multiple Sequence Alignment
• Developed and refined by many (Doolittle,
Barton, Corpet) through the 1980’s
• Used extensively for extracting hidden
phylogenetic relationships and identifying
sequence families
• Powerful tool for extracting new sequence
motifs and signature sequences
37
Multiple Alignment
• Most commercial vendors offer good
multiple alignment programs including:
• GCG (Accelrys)
• PepTool/GeneTool (BioTools Inc.)
• LaserGene (DNAStar)
• Popular web servers include T-COFFEE,
MULTALIN and CLUSTALW
• Popular freeware includes PHYLIP & PAUP
38
Multi-Align Websites
• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
• T-Coffee http://www.ch.embnet.org/software/TCoffee.html
• MULTALIN http://www.toulouse.inra.fr/multalin.html
• CLUSTALW http://www.ebi.ac.uk/clustalw/
39
Multi-alignment & Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..
40
Contig Assembly
• Read, edit & trim DNA chromatograms
• Remove overlaps & ambiguous calls
• Read in all sequence files (10-10,000)
• Reverse complement all sequences (doubles # of sequences to align)
• Remove vector sequences (vector trim)
• Remove regions of low complexity
• Perform multiple sequence alignment
41
Chromatogram Editing
42
Sequence Loading
43
Sequence Alignment
44
Contig Alignment - Process
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
45
Sequence Assembly Programs
• Phred - base calling program that does detailed
statistical analysis (UNIX)
http://www.phrap.org/
• Phrap - sequence assembly program (UNIX)
http://www.phrap.org/
• TIGR Assembler - microbial genomes (UNIX)
http://www.tigr.org/softlab/assembler/
• The Staden Package (UNIX)
http://www.mrc-lmb.cam.ac.uk/pubseq/
• GeneTool/ChromaTool/Sequencher (PC/Mac)
46
Conclusions
• Sequence alignments and database
searching are key to all of bioinformatics
• There are four different methods for doing
sequence comparisons 1) Dot Plots; 2)
Dynamic Programming; 3) Fast Alignment;
and 4) Multiple Alignment
• Understanding the significance of
alignments requires an understanding of
statistics and distributions
47