Sequencing genomes
Last lecture summary
New generation sequencing (NGS)
• The completion of the human genome was just the start of the modern DNA sequencing era – “high-throughput next generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing – complete human genome below $1000.
• 1st generation – Sanger dideoxy method
• 2nd generation – sequencing by synthesis (pyrosequencing)
• 3rd generation – single molecule sequencing
Sequence alignment
• What is sequence alignment
• Three flavors of sequence alignment
• Point mutations, indels
Homology
• 'Central dogma of bioinformatics'
• Sequences diverge
• Conserved residues
• Sequences are homologous, orthologous, paralogous
• The variation between sequences – changes occurred
during evolution in the form of substitutions (mutations)
and/or indels.
New stuff
Scoring systems I
• DNA and protein sequences can be aligned so that the
number of identically matching pairs is maximized.
A T T G - - - T
A - - G A C A T
• Counting the number of matches gives us a score (3 in
this case). Higher score means better alignment.
• This procedure can be formalized using substitution
matrix.
Identity matrix
    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1
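The match counting above can be sketched in a few lines of Python. This is a minimal illustration, not a full aligner: the function names are made up for this example, and columns containing a gap are assumed to score 0, as in the slide.

```python
def identity_score(a, b):
    """Identity matrix lookup: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def alignment_score(row1, row2, gap="-"):
    """Sum identity scores over aligned columns; gap columns contribute 0."""
    assert len(row1) == len(row2), "aligned rows must have equal length"
    return sum(
        identity_score(a, b)
        for a, b in zip(row1, row2)
        if a != gap and b != gap
    )


# The two rows of the example alignment, written without spaces.
score = alignment_score("ATTG---T", "A--GACAT")
print(score)  # 3
```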
Scoring systems II
• Identity matrix: adequate for nucleic acids, not sufficient for proteins.
• Amino acids are not exchanged with equal probability, as one might assume in theory.
• For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.
[Figure: D, E, W]
Scoring systems II
• Why is that?
1. Triplet-based genetic code
GAT (D) → GAA (E) needs one nucleotide change; GAT (D) → TGG (W) needs three.
2. D and E have similar properties, but D and W differ considerably. D is hydrophilic and W is hydrophobic, so a D → W mutation can greatly alter 3D structure and consequently function.
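The codon argument can be checked by counting how many nucleotide positions differ between codons. The sketch below is only illustrative; the helper names are invented here, and the codon sets are the standard ones for Asp (GAT, GAC), Glu (GAA, GAG) and Trp (TGG).

```python
def nucleotide_changes(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def min_changes(codons_from, codons_to):
    """Fewest nucleotide changes needed to get from one amino acid to another."""
    return min(nucleotide_changes(a, b) for a in codons_from for b in codons_to)


ASP = ["GAT", "GAC"]  # D
GLU = ["GAA", "GAG"]  # E
TRP = ["TGG"]         # W

print(nucleotide_changes("GAT", "GAA"))  # 1
print(nucleotide_changes("GAT", "TGG"))  # 3
print(min_changes(ASP, GLU), min_changes(ASP, TRP))  # 1 3
```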
Genetic code
http://www.doctortee.com/dsu/tiftickjian/bio100/gene-expression.html
Gaps or no gaps
Scoring DNA sequence alignment (1)
• Match score: +1
• Mismatch score: +0
• Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
Score = +11
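The tally above can be reproduced with a short Python sketch. The function name and its return layout are assumptions made for this illustration; the parameter values are the ones from the slide.

```python
def score_linear(row1, row2, match=1, mismatch=0, gap=-1):
    """Score an alignment with a fixed penalty per gap position.

    Returns (matches, mismatches, gap_positions, total_score).
    """
    assert len(row1) == len(row2), "aligned rows must have equal length"
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


result = score_linear(
    "ACGTCTGATACGCCGTATAGTCTATCT",
    "----CTGATTCGC---ATCGTCTATCT",
)
print(result)  # (18, 2, 7, 11)
```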
Length penalties
• We want to find alignments that are evolutionarily likely.
• Which of the following alignments seems more likely to
you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
• The first alignment, with a single long gap, is more likely than the second, with many scattered gaps. We can favor it by penalizing a new gap more than the extension of an existing gap.
Scoring DNA sequence alignment (2)
• Match/mismatch score: +1/+0
• Origination/length penalty: –2/–1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Origination: 2 × (–2)
• Length: 7 × (–1)
Score = +7
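A possible way to compute this score is sketched below. The function name is an assumption; the penalties are the slide's values, with each gap position paying the length penalty and the first position of each gap run paying the origination penalty on top. Applied to the two alignments from the "Length penalties" slide, the same scheme separates them clearly even though both contain 20 matches and 7 gap positions.

```python
def score_affine(row1, row2, match=1, mismatch=0, origination=-2, length=-1):
    """Score an alignment with separate gap origination and length penalties."""
    assert len(row1) == len(row2), "aligned rows must have equal length"
    score = 0
    gap_in = None  # which row the current gap run is in (1, 2) or None
    for a, b in zip(row1, row2):
        current = 1 if a == "-" else 2 if b == "-" else None
        if current is None:
            score += match if a == b else mismatch
        else:
            if current != gap_in:
                score += origination  # a new gap run starts here
            score += length           # every gap position pays the length penalty
        gap_in = current
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
slide_score = score_affine(reference, "----CTGATTCGC---ATCGTCTATCT")
one_long_gap = score_affine(reference, "ACGTCTGAT-------ATAGTCTATCT")
many_short_gaps = score_affine(reference, "AC-T-TGA--CG-CGT-TA-TCTATCT")
print(slide_score, one_long_gap, many_short_gaps)  # 7 11 1
```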
Substitution matrices
• Substitution (score) matrices give a score for each amino acid substitution. A higher score means the substitution is more likely to be observed.
• Conservative substitutions – conserve the physical and chemical properties of the amino acids, limit structural/functional disruption
• Substitution matrices should reflect:
  • Physicochemical properties of amino acids.
  • Different frequencies of individual amino acids occurring in proteins.
  • Interchangeability of the genetic code.
PAM matrices I
• How to assign scores? Let’s get nature – evolution – involved!
• If you choose a set of proteins with very similar sequences, you can do the alignment manually.
• Also, if the sequences in your set are similar, then there is a high probability that each amino acid difference is due to a single mutation.
• From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived.
• This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (Accepted Point Mutation) matrices.
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.
PAM matrices II
• Alignments of 71 groups of very similar (at least 85% identity)
protein sequences. 1572 substitutions were found.
• These mutations do not significantly alter the protein function.
Hence they are called accepted mutations (accepted by
natural selection).
• Probabilities that any one amino acid would mutate into any
other were calculated.
• If I know the probabilities of individual amino acids, what is the probability of a given sequence?
  • The product of the individual probabilities.
• But to calculate a score, we would like to sum terms, not multiply them. How to achieve this?
  • Take the logarithm: the logarithm of a product is the sum of the logarithms.
Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990, 183:333-51. PMID: 2314281.
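The product-to-sum step can be illustrated numerically. The probabilities below are hypothetical values chosen for the example, not entries of a real PAM matrix, and `log_odds` is a helper name introduced here to show how the logarithm of an observed-to-chance ratio yields positive, zero or negative scores.

```python
import math

# Hypothetical per-position probabilities for a short alignment (not real PAM values).
position_probabilities = [0.9, 0.05, 0.8, 0.9]

# Probability of the whole alignment: multiply the independent positions.
product = math.prod(position_probabilities)

# The same information on a log scale: add the per-position logarithms.
log_sum = sum(math.log10(p) for p in position_probabilities)


def log_odds(observed, expected):
    """Log of observed over chance frequency: > 0 above chance, < 0 below chance."""
    return math.log10(observed / expected)


print(math.isclose(10 ** log_sum, product))  # True
print(log_odds(0.02, 0.01) > 0, log_odds(0.005, 0.01) < 0)  # True True
```

Working with sums of logarithms also avoids the very small numbers that arise when many probabilities are multiplied.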
PAM matrices III
• Dayhoff’s definition of accepted mutation was thus based on empirically observed amino acid substitutions.
• The unit used is the PAM. Two sequences are 1 PAM apart if they have 99% identical residues.
• The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.
• The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time.
  • in Drosophila 1 PAM corresponds to ~2.62 million years
  • in Human 1 PAM corresponds to ~4.58 million years
PAM1 matrix
numbers are multiplied by 10 000
Higher PAM matrices
• What to do if I want to get probabilities over a much longer evolutionary time?
• Dayhoff proposed a model of evolution that is a Markov process.
• Such a Markov process is a case of a linear dynamical system.
Linear dynamical system I
A new species of frog has been introduced into an area where it
has too few natural predators. In an attempt to restore the
ecological balance, a team of scientists is considering
introducing a species of bird which feeds on this frog.
Experimental data suggests that the population of frogs and
birds from one year to the next can be modeled by linear
relationships. Specifically, it has been found that if the quantities
$F_k$ and $B_k$ represent the populations of the frogs and birds in the
$k$-th year, then
\[ B_{k+1} = 0.6 B_k + 0.4 F_k \]
\[ F_{k+1} = -0.35 B_k + 1.4 F_k \]
The question is this: in the long run, will the introduction of the
birds reduce or eliminate the frog population growth?
Linear dynamical system II
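\[
\begin{pmatrix} B_{k+1} \\ F_{k+1} \end{pmatrix}
=
\begin{pmatrix} 0.6 & 0.4 \\ -0.35 & 1.4 \end{pmatrix}
\begin{pmatrix} B_k \\ F_k \end{pmatrix}
\]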
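The frog and bird recurrences can be iterated directly to see the long-run behavior. This is a small sketch under stated assumptions: the initial populations are hypothetical, and the state is ordered as (birds, frogs). The characteristic polynomial of the coefficient matrix is λ² − 2λ + 0.98, so the larger eigenvalue is 1 + √0.02 ≈ 1.14; since it exceeds 1, the frog population keeps growing in this model, by roughly 14% per year in the long run.

```python
def step(birds, frogs):
    """Apply one year of the recurrences for B_{k+1} and F_{k+1}."""
    new_birds = 0.6 * birds + 0.4 * frogs
    new_frogs = -0.35 * birds + 1.4 * frogs
    return new_birds, new_frogs


# Hypothetical starting populations, chosen only for illustration.
birds, frogs = 10.0, 100.0
history = [(birds, frogs)]
for _ in range(50):
    birds, frogs = step(birds, frogs)
    history.append((birds, frogs))

# Year-over-year frog growth factor at the end of the simulation.
growth = history[50][1] / history[49][1]
dominant_eigenvalue = 1 + 0.02 ** 0.5
print(round(growth, 3), round(dominant_eigenvalue, 3))
```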
• So this system evolves in time according to $x(k+1) = A\,x(k)$.
• Such a system is called a discrete linear dynamical system; the matrix $A$ is called the transition matrix.
• If we need to know the state of the system at time k = 50, we have to compute $x(50) = A^{50} x(0)$.
• And the same is true for Dayhoff’s model of evolution.
• If we need to obtain probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers.
• Let’s say we want PAM120 – 120 mutations fixed on average per 100 residues. We compute $\mathrm{PAM1}^{120}$.
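The extrapolation by matrix powers can be sketched on a toy example. The 3-letter matrix below is hypothetical (the real PAM1 matrix is 20 × 20 and its values are not reproduced here); `matmul` and `matpow` are plain-Python helpers written for this illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        power >>= 1
    return result


# Hypothetical "PAM1-like" matrix for a 3-letter alphabet; each row sums to 1.
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam120 = matpow(toy_pam1, 120)
for row in toy_pam120:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, and the diagonal entries (the probability of ending in the same residue) are lower than in the one-step matrix, which is what covering a longer evolutionary time should produce.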
Higher PAM matrices
• Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site.
• This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine.
• These are called silent substitutions.
Zvelebil, Baum, Understanding bioinformatics.
PAM 120
• Positive score – frequency of substitutions is greater than would have occurred by random chance.
• Zero score – frequency is equal to that expected by chance.
• Negative score – frequency is less than would have occurred by random chance.
[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic]
PAM matrices assumptions
• Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).
• Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on some model).
• Each amino acid position is equally mutable.
• Mutations are assumed to be independent of surrounding residues.
• Forces responsible for sequence evolution over short times are the same as those over longer times.
• PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins)
  • New generation of Dayhoff-type – e.g. PET91
Selzer, Applied bioinformatics.
How to calculate score?
[Figure: substitution matrix]
Protein vs. DNA sequences
• Given the choice of aligning DNA or protein, it is often
more informative to compare protein sequences.
• There are several reasons for this:
• Many changes in DNA do not change the amino acid that is
specified.
• Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems.
• When is it appropriate to compare nucleic acid sequences?
• confirming the identity of a DNA sequence in a database search, searching for polymorphisms, confirming the identity of cloned cDNA
Similarity vs. identity
• Similarity refers to the percentage of aligned residues
that can be more readily substituted for each other.
• have similar physicochemical characteristics and
• the selective pressure results in some mutations being accepted
and others being eliminated
S = [(Ls × 2)/(La + Lb)] × 100
where Ls is the number of aligned residues with similar characteristics, and La and Lb are the total lengths of each sequence.
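The similarity formula translates directly into a small function. The function and argument names are assumptions for this sketch, and the input numbers are made up to show the arithmetic.

```python
def percent_similarity(similar_pairs, len_a, len_b):
    """S = [(Ls * 2) / (La + Lb)] * 100.

    similar_pairs: number of aligned residue pairs with similar characteristics (Ls)
    len_a, len_b:  total lengths of the two sequences (La, Lb)
    """
    return 100 * 2 * similar_pairs / (len_a + len_b)


# Hypothetical case: 40 similar aligned pairs between two sequences of 100 residues.
print(percent_similarity(40, 100, 100))  # 40.0
```

The factor of 2 accounts for each aligned pair involving one residue from each sequence, so two identical-length sequences that are similar at every position give 100.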
Homology vs. similarity
• Two sequences are homologous when they descended
from a common ancestor sequence.
• Similarity can be quantified: “two sequences share 40%
similarity”.
• But NOT “two sequences share 40% homology”. Just “two
sequences are homologous”
• Qualitative statement
• And it is a conclusion about a common ancestral relationship drawn
from sequence similarity comparison
Gaps
• How will I score this alignment?
V D S - C Y
V E S L C Y
• The gaps can’t be inserted freely.
• Indels are relatively slow evolutionary processes.
• And alignments with large gaps do not make biological sense.
• Each gap is penalized – a gap penalty
• The gap penalty is an adjustable parameter.
• Let’s use a gap penalty of –11.
V   D   S   -   C   Y
V   E   S   L   C   Y
4   2   4  -11  9   7
S = 4 + 2 + 4 – 11 + 9 + 7 = 15
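The same calculation can be sketched in Python. The lookup table below contains only the five residue pairs used in this example, with the scores shown above; it is not a complete substitution matrix, and the function names are assumptions.

```python
# Only the residue pairs needed for this example, with the scores shown above.
PAIR_SCORES = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}


def pair_score(a, b, table=PAIR_SCORES):
    """Look up a symmetric substitution score; raises KeyError for unknown pairs."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def alignment_score(row1, row2, gap_penalty=-11):
    """Sum substitution scores, charging gap_penalty for each gap position."""
    assert len(row1) == len(row2), "aligned rows must have equal length"
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += pair_score(a, b)
    return total


score = alignment_score("VDS-CY", "VESLCY")
print(score)  # 15
```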
Gap penalty
• Affine gap penalty
• different for opening and extending
• constant for extending
• The gap penalty is high – fewer gaps will be inserted
• If you’re searching for sequences that are a strict match for your
query sequence, the gap penalty should be set high.
• This will retrieve regions with very closely related sequences.
• The gap penalty is low – more and larger gaps will be
inserted
• If you are searching for similarity between distantly related
sequences, the gap penalty should be set low.
(A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%
(B) Low gap penalty. More gaps. Percentage identity = 18%
Zvelebil, Baum, Understanding bioinformatics.