sequence - Université d'Ottawa


EVOLUTIONARY CHANGE IN DNA SEQUENCES
- usually too slow to monitor directly…
spontaneous mutation rates? p. 35-37
for mammalian nuclear DNA (regions not under functional constraint)
~ 4 x 10^-9 nt substitutions per site per year
... much higher for viruses
e.g. 10^-6 to 10^-3 nt substitutions per site per generation
… so use comparative analysis of 2 sequences which
share a common ancestor
- determine number and nature of nt substitutions that have
occurred (ie measure degree of divergence)
Potential pitfalls
1. Are all evolutionary changes being monitored?
- if closely-related, high probability only one change at any
given site…
but if distant, may have been multiple substitutions (“hits”)
at a site
- can use algorithms to correct for this
2. If indels between two sequences, can they be aligned
with confidence?
- algorithms with gap penalties
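For pitfall 1 (multiple hits), one widely used correction is the Jukes-Cantor one-parameter model. The slides do not name a specific algorithm, so this choice is an assumption, and the function names below are illustrative. The sketch converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def observed_difference(seq1, seq2):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns that contain a gap in either sequence.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    Assumes all four nucleotides are equally frequent and all
    substitutions equally likely; undefined once p reaches 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = observed_difference("ACGTACGTAC", "ACGTTCGTAA")
    print(p, jukes_cantor_distance(p))
```

For closely related sequences the corrected distance is nearly equal to p; the correction grows as the sequences diverge and multiple hits become more likely.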
Ancestral sequence
Present day sequences
Fig. 3.6
Homoplasy: same nt, but not directly inherited from ancestral sequence
(If comparing long stretches, highly unlikely they would have converged to
the same sequence)
Page & Holmes Fig. 5.9
Nucleotide substitutions within protein-coding sequences
1. Synonymous
vs. non-synonymous
Single step:
Multiple steps:
AAT
ACT
Is one pathway more likely than another?
p.82
2. Nomenclature related to “degeneracy”:
Non-degenerate
- all possible changes at site are non-synonymous
2-fold degenerate
- one of the 3 possible changes is synonymous
4-fold degenerate
- all possible changes at site are synonymous
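The degeneracy classes above can be checked mechanically. The following is a minimal sketch that assumes the standard nuclear genetic code (mitochondrial codes differ); the function names are illustrative. Sites where two of the three changes are synonymous (e.g. the third position of Ile codons) fall outside the three classes listed on the slide and are labelled "3-fold degenerate" here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_changes(codon, pos):
    """Count how many of the 3 possible changes at codon[pos] are synonymous."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutant = codon[:pos] + base + codon[pos + 1:]
        if CODON_TABLE[mutant] == original:
            count += 1
    return count


def degeneracy(codon, pos):
    """Classify a codon position (0, 1 or 2) by its degeneracy."""
    labels = {
        0: "non-degenerate",
        1: "2-fold degenerate",
        2: "3-fold degenerate",  # not covered by the slide's three classes
        3: "4-fold degenerate",
    }
    return labels[synonymous_changes(codon, pos)]


if __name__ == "__main__":
    for codon in ("AAT", "ACT"):
        print(codon, [degeneracy(codon, pos) for pos in range(3)])
```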
ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS
1. By manual inspection
- if sequences very similar and no (or few) gaps
2. By sequence distance methods
(often followed by “correction by visual inspection”)
- use algorithms which minimize mismatches and gaps
- gap penalty > mismatch penalty
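As an illustration of scoring with a gap penalty larger than the mismatch penalty, here is a minimal sketch of global alignment scoring by dynamic programming (Needleman-Wunsch style, linear gap penalty). The slides do not specify which algorithm or which penalty values are used; the defaults below are assumptions chosen only so that a gap costs more than a mismatch.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Best global alignment score for two sequences.

    score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    """
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score[n][m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

Setting gap=0 reproduces the "no gap penalty" situation, where the algorithm is free to insert as many gaps as it likes to maximize matches.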
Alignment of human and chicken pancreatic hormone proteins
no gap penalty
with gap penalty
alignment as in (b), with biochemically similar aa
Fig. 3.12
Multiple sequence alignments - CLUSTALW
www.ebi.ac.uk/clustalw (European Bioinformatics Institute)
CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG52143   798 aa
Sequence 2: ArabidopsisAAC26676   845 aa
Sequence 3: yeast                 664 aa
Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22
ArabAAG52143    FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676    FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586
yeast           -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
::**** **::**: *:*.* . * .:.
::******: .:*:::: ::: *:
Symbols used?
*
: .
Alignment of human α-globin and β-globin proteins
α
β
Human α globin = 141 aa
Human β globin = 146 aa
β globin
α globin
Was D-helix loss neutral or
adaptive mutation? (Nature 352:
349-51, 1991)
Avers Fig. 3.23
Reminder about definition of the word “homology”
In sequence comparisons, refer to nt (or aa) sequence
relatedness as
“… % identity” or “...% similarity”
BUT NOT “ … % homology”
because “homology” means “shares a common ancestor”
“Non-evolutionary biologists”
Petsko Genome Biol. 2:1002,2001
“Normalized alignment score”
NAS = (# identities x 10) + (# Cys identities x 20) – (# gaps x 25)
Doolittle, R. “URFs & ORFs” p.14
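The NAS formula above translates directly into code. This is a minimal sketch; the function name is illustrative, and the three counts are taken as given by the caller, since the slide does not say whether Cys identities are also included in the general identity count.

```python
def normalized_alignment_score(identities, cys_identities, gaps):
    """NAS as written on the slide:
    (# identities x 10) + (# Cys identities x 20) - (# gaps x 25)
    """
    return identities * 10 + cys_identities * 20 - gaps * 25


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(normalized_alignment_score(identities=50, cys_identities=2, gaps=3))
```

The weights show the reasoning behind the score: conserved cysteines count extra, and each gap costs more than two ordinary identities gain.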
BLAST searches
www.ncbi.nlm.nih.gov/BLAST/
- to detect similarity between “sequence of interest” & databank entries
Query = yeast mt ribosomal protein L8 gene (1275 nt)
Example of high score “hit” (red)
Score = 383 bits (193), Expect = 1e-102
Identities = 196/197 (99%), Gaps = 0/197 (0%)
Query  AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
       |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct  AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG
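E-values: statistical measure of the likelihood that sequences with this degree of
similarity occur randomly
i.e. the E-value reflects the number of hits with at least this score expected by chance
(the lower the E-value, the more significant the hit)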
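To make "number of hits expected by chance" concrete, the sketch below uses the Karlin-Altschul relation between a bit score S' and the E-value, E = m x n x 2^(-S'), where m is the query length and n the total database length. The database length in the example is a hypothetical value, not taken from the slide, and real BLAST uses "effective" lengths, so this should not be expected to reproduce the reported values exactly.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_length = 1275              # nt, from the slide
    database_length = 3_000_000_000  # hypothetical database size (assumption)
    for bits in (40.1, 383):
        print(bits, expected_chance_hits(bits, query_length, database_length))
```

Each additional bit of score halves the number of chance hits expected, which is why the 383-bit hit has a vanishingly small E-value while the 40.1-bit hit is expected to occur by chance.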
Example of low score “hit” (blue or black)
Score = 40.1 bits (20), Expect = 3.6
Identities = 23/24 (95%), Gaps = 0/24 (0%)
Query  GTTTTCTTAATATTTATTTAAAAA
       |||||||||||||||| |||||||
Sbjct  GTTTTCTTAATATTTAATTAAAAA
“low complexity sequence”
Why is “sequence complexity” important when judging whether
two sequences are homologous?
[Figure labels: Human DNA; Chimp DNA; Pu-rich region #1; region of unbiased base composition (G=C=A=T); Pu-rich region #2 (not homologous to #1); AAGAGGAG]
How frequently is AAGAGGAG (8-nt sequence) expected to occur by
chance in a DNA sequence?
If sequence A is of low complexity (or short length), high % identity
with sequence B may not reflect shared evolutionary origin
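The chance-occurrence question can be worked through numerically. In this minimal sketch, each position is treated as independent, so the probability of a given 8-nt sequence is the product of its base frequencies: (1/4)^8 = 1/65,536 for unbiased composition. The purine-rich frequencies in the example are hypothetical values chosen for illustration.

```python
def chance_probability(motif, base_freqs=None):
    """Probability that a random sequence matches motif at a given position."""
    if base_freqs is None:
        base_freqs = {base: 0.25 for base in "ACGT"}  # G=C=A=T
    probability = 1.0
    for base in motif:
        probability *= base_freqs[base]
    return probability


def expected_occurrences(motif, seq_length, base_freqs=None):
    """Expected number of chance matches on one strand of a sequence."""
    positions = max(seq_length - len(motif) + 1, 0)
    return positions * chance_probability(motif, base_freqs)


if __name__ == "__main__":
    motif = "AAGAGGAG"
    unbiased = chance_probability(motif)
    # Hypothetical purine-rich composition (assumption, for illustration).
    purine_rich = {"A": 0.45, "G": 0.45, "C": 0.05, "T": 0.05}
    biased = chance_probability(motif, purine_rich)
    print(unbiased, biased, biased / unbiased)
```

Under these assumed frequencies the motif is roughly a hundred times more likely to appear by chance in a purine-rich region than in a region of unbiased composition, which is why a match there is weak evidence of homology.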
Advantages of using aa (rather than nt) sequences for
identifying homologous genes among organisms?
- lower chance of “spurious” matches
- 20 amino acids vs. 4 nucleotides
- unrelated nt sequences (non-homologous) expected to show 25%
identity by random chance (if unbiased base composition)
- degeneracy of genetic code & different codon usage patterns
(and G+C% of genomes) among organisms
- for distantly related sequences – “saturation” of synonymous
sites within codons (multiple hits)
But… for certain phylogenetic analyses, number of
informative characters may be higher at DNA than protein level
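The 25% figure in the list above follows from a simple calculation: two unrelated residues match with probability equal to the sum of the squared residue frequencies. The sketch below assumes independent positions; the uniform amino acid frequencies and the biased nucleotide composition are simplifying assumptions for illustration (real proteins and genomes are not uniform).

```python
def expected_random_identity(freqs):
    """Expected fraction of identical positions between two unrelated
    sequences drawn from the same residue frequencies."""
    return sum(p * p for p in freqs.values())


if __name__ == "__main__":
    nucleotides = {base: 0.25 for base in "ACGT"}
    amino_acids = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    # Hypothetical A+T-rich composition (assumption, for illustration).
    at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}
    for name, freqs in (("nt", nucleotides), ("aa", amino_acids), ("A+T-rich nt", at_rich)):
        print(name, expected_random_identity(freqs))
```

With 20 equally frequent amino acids the chance identity drops to about 5%, and biased base composition pushes the nucleotide figure above 25%.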
What if BLAST search were done at protein (instead of nt) level?
Query = yeast mitochondrial ribosomal protein L8 (238 aa)
Fungal
Bacterial
Dot matrix method for aligning sequences
- 2 sequences to be compared along X and Y axis of matrix
- dots put in matrix when nts in the 2 sequences are identical
- mismatch = “gap” (or break) in line
- indel = shift in diagonal
Fig. 3.7
Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
1. window size – overlapping fixed-length windows
whereby sequence 1 compared with seq 2
2. stringency – minimum threshold value (% identity)
at each step to score as hit
- for coding regions, could use aa instead of nt sequences
to reduce “noise”
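The window and stringency parameters described above can be sketched as follows. This is a minimal illustration, not the implementation behind the figures; the function names are illustrative, and stringency is expressed as the minimum fraction of identical positions within a window.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1.0):
    """Return (i, j) positions where the windows starting at seq1[i] and
    seq2[j] reach at least the given fraction of identical residues."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches / window >= stringency:
                dots.append((i, j))
    return dots


def render(seq1, seq2, dots, window=1):
    """Plain-text plot: seq1 window starts across, seq2 window starts down."""
    hits = set(dots)
    rows = []
    for j in range(len(seq2) - window + 1):
        rows.append("".join(
            "*" if (i, j) in hits else "."
            for i in range(len(seq1) - window + 1)
        ))
    return "\n".join(rows)


if __name__ == "__main__":
    seq = "GATTACA"
    print(render(seq, seq, dot_matrix(seq, seq, window=1)))
    print()
    print(render(seq, seq, dot_matrix(seq, seq, window=3), window=3))
```

Comparing a sequence with itself at window=1 shows off-diagonal background dots wherever the same nucleotide recurs; with window=3 only the main diagonal remains.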
Comparison of human chromosome 7 “draft” sequence (2001) with
“near-complete” sequence (2004)
[Figure labels: 2001 sequence; 2004 sequence (fewer errors); blowup of 500 kb region]
How do you interpret the data in this figure?
Nature 431:935, 2004