Sequencing genomes

Download Report

Transcript Sequencing genomes

Last lecture summary
• Window size?
• Stringency?
• Color mapping?
Frame shifts?
Limits of detection of alignment
• Homology, similarity
• Twilight zone
Statistical significance
• Key question – Constitutes a given alignment evidence for
homology? Or did it occur just by chance?
• The statistical significance of the alignment (i.e. its score)
can be tested by statistical hypotheses testing.
• What are H0 and Ha?
• Significance of local alignment
• gapless
• gapped
• Significance of global alignment
Gumbel distribution
𝜌 𝑥 =
1
𝑥−𝜇
𝑥−𝜇
exp −
− exp
𝜎
𝜎
𝜎
𝜇 … location parameter
𝜎, 𝛽, 𝜆 … scale parameter
𝑥−𝜇
𝑃 𝑋 > 𝑥 = 1 − exp −exp −
𝜎
wikipedia.org
New stuff
Database similarity searching
BLAST
• Basic Local Alignment Search Tool (BLAST) – Google of the
sequence world.
• Compare a protein or DNA sequence to other sequences in
various databases, main tool of NCBI.
• Why to search database
• Determine what orthologs and paralogs are known for a particular
•
•
•
•
•
sequence.
Determine what proteins or genes are present in a particular organism.
Determining the identity of a DNA or protein sequence.
Determining what variants have been described for a particular gene
or protein.
Investigating ESTs.
Exploring amino acid residues that are important in the function and/or
structure of a protein (multiple alignment of BLAST results, conserved
residues).
Database searching requirements I
• query sequence, perform pairwise alignments between
•
•
•
•
the query and the whole database (target)
Typically, this means that millions of alignments are
analyzed in a BLAST search, and only the most closely
related matches are returned.
We are usually more interested in identifying locally
matching regions such as protein domains. Global
alignment (Needlman-Wunsch) is not often used.
Smith-Watermann is too computationally intensive.
Instead, heuristic is utilized, significant speed up.
Database searching requirements II
• sensitivity – the ability to find as many correct hits (TP)
as possible
• selectivity (specificity) – ability to exclude incorrect hits
(FP)
• speed
• ideally: high sensitivity, high specificity, high speed
• reality: increase in sensitivity leads to decrease in
specificity, improvement in speed often comes at the cost
of lowered sensitivity and selectivity
Types of algorithms
• exhaustive
• uses a rigorous algorithm to find the exact solution for a particular
problem by examining all mathematical combinations
• example: DP
• heuristic
• computational strategy to find an empirical or near optimal solution
by using rules of thumb
• this type of algorithms take shortcuts by reducing the search space
according to some criteria
• the shortcut strategy is not guaranteed to find the best or most
accurate solution
Heuristic algorithms
• Perform faster searches because they examine only a
fraction of the possible alignments examined in regular
dynamic programming
• currently, there are two major algorithms:
• FASTA
• BLAST
• Not guaranteed to find the optimal alignment or true
homologs, but are 50–100 times faster than DP.
• The increased computational speed comes at a moderate
expense of sensitivity and specificity of the search, which
is easily tolerated by working molecular biologists.
BLAST
• Parts of algorithm
• list, scan, extend
• BLAST uses word method for pairwise alignment
• Find short stretches of identical (or nearly identical) letters
in two sequences – words (similar to window in dot plot)
• Basic assumption: two related sequences must have at
least one word in common
• By first identifying word matches, a longer alignment can
be obtained by extending similarity regions from the
words.
• Once regions of high sequence similarity are found,
adjacent high-scoring regions can be joined into a full
alignment.
BLAST - list
• Compile a list of “words” of a fixed length w that are
•
•
•
•
derived from the query sequence.
protein searches – word size = 3, NA searches = 11
A threshold value T is established for the score of aligned
words (true for proteins, for NAs exact matches are used).
Those words either at or above the threshold are
collected and used to identify database matches; those
words below threshold are not further pursued.
The threshold score T can be lowered to identify more
initial pairwise alignments. This will increase the time
required to perform the search and may increase the
sensitivity
BLAST - scan
• After compiling a list of word pairs at or above threshold T,
the BLAST algorithm scans a database for hits.
• This requires BLAST to search an index of the database
to find entries that correspond to words on the compiled
list.
BLAST - extend
• Extend hits to find alignments called high-scoring
segment pairs (HSPs).
• Extend in both directions (ungapped originally, gapped
BLAST is newer), count the alignment score.
• The extension process is terminated when a score falls
below a cutoff.
BLAST strategy
• Compare a protein or DNA query sequence to each
database entry and form pairwise alignments (HSPs).
• When the threshold parameter is raised, the speed of the
search is increased, but fewer hits are registered, and so
distantly related database matches may be missed.
• When the threshold parameter is lowered, the search
proceeds more slowly, but many more word hits are
evaluated, and thus sensitivity is increased.
• Recent improvement – gapped BLAST
• Variants
• BLASTN – nucleotide sequences
• BLASTP – protein sequences
• BLASTX – uses nucleotide sequences as queries and translates
them in all six reading frames to produce translated protein
sequences, which are used to query a protein sequence database
• TBLASTN – queries protein sequences to a nucleotide sequence
database with the sequences translated in all six reading frames
• TBLASTX – uses nucleotide sequences, which are translated in all
six frames, to search against a nucleotide sequence database that
has all the sequences translated in six frames.
Which sequence to search?
• The choice of the type of sequences also influences the
sensitivity of the search.
• Clear advantage of using protein sequences in detecting
homologs
• If the input sequence is a protein-encoding DNA sequence, use
BLASTX (six open reading frames before sequence comparisons)
• If you’re looking for protein homologs encoded in newly
sequenced genomes, you may use TBLASTN. This may
help to identify protein coding genes that have not yet
been annotated.
• If a DNA sequence is to be used as the query, a proteinlevel comparison can be done with TBLASTX.
• TBLASTN, TBLASTX are very computationally intensive
and the search process can be very slow.
BLAST Statistics
• quantitative measure of whether the alignments represent
significant matches or whether they would be expected to
occur by chance alone
• For two random sequences
m and n, the cumulative
distribution function of
Gumbel distribution
scores S is described
by the formula
𝑃 𝑆 < 𝑥 = exp(−𝑒 −𝜆(𝑥−𝑢) )
𝑢 – characteristic value
𝜆 – decay constant
• To use this equation, we need to estimate the values of 𝑢
and 𝜆.
𝑙𝑛𝐾𝑚𝑛
𝑢=
𝜆
• m, n … sequence lengths, K … constant
• Probability of observing a score S equal to or greater than
x by chance is then given by the formula
𝑃 𝑆 ≥ 𝑥 = 1 − exp(−𝐾𝑚𝑛𝑒 −𝜆𝑥 )
• In the context of a database search, m and n refer to the length
(in residues) of the query sequence and the length of the entire
database, respectively.
• The product m.n defines the size of the search space.
• The search space represents all the sites at which a query
sequence can be aligned to any sequence in the database.
• The expected number of HSPs having some score S (or better)
by chance alone is described using the equation
𝑬 = 𝑲𝒎𝒏𝒆−𝝀𝑺
• E refers to the expect value, which is the number of different
alignments with scores equivalent to or better than S that are
expected to occur by chance in a database search.
• This provides an estimate of the number of false positive
results from a BLAST search.
E-value
• E depends on score S and parameter 𝜆 which scale the
scoring system.
• It also depends on the length of the query sequence and
on the length of the database.
• The parameter K is a scaling factor for the search space.
• The parameters K and 𝜆 were described by Karlin and
Altschul and so are often called Karlin–Altschul statistics.
Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87, 2264–2268, 1990
Properties of E-value I
• Decreases exponentially with increasing S. Thus, a high score
corresponds to a low E-value. As E approaches zero, the
probability that the alignment occurred by chance approaches
zero.
• The expected score for aligning a random pair of amino acids
must be negative. Otherwise, very long alignments of two
sequences could accumulate large positive scores and appear
to be significantly related when they are not.
• The size of the database that is searched influences the
likelihood that particular alignments will occur by chance.
Consider a result with an E = 1. This value indicates that in a
database of this particular size one match with a similar score is
expected to occur by chance. If the database were twice as big,
there would be twice the likelihood of finding a score equal to or
greater than S by chance.
Properties of E-value II
• The theory was developed for ungapped alignments. For
these, BLAST calculates values for 𝜆, and 𝐾.
• The same equation can be used for gapped local
alignments (such as the results of a BLAST search).
However, for gapped alignments 𝜆, and 𝐾 cannot be
calculated analytically, but instead they are estimated by
simulation and looked up in a table of precomputed
values.
Bit score
• A typical BLAST output reports both E values and scores.
• There are two kinds of scores: raw and bit scores.
• Raw scores are calculated from the substitution matrix and the gap
penalty parameters that are chosen.
• The bit score S’ is calculated from the raw score by normalizing
with the statistical variables that define a given scoring system.
• Bit scores from different alignments, even those
employing different scoring matrices in separate BLAST
searches, can be compared.
𝑆′
𝜆𝑆 − ln(𝐾)
=
ln(2)
Relation between E and p values I
• What is p-value?
• Probability of a chance alignment occurring with the score in
question or better.
• The most highly significant p values are those close to
zero. The p and E values are different ways of
representing the significance of the alignment.
• The probability of finding an HSP with a given E value is
𝑝 = 1 − 𝑒 −𝐸
Relation between E and p values II
• While BLAST reports E values rather than p values, the
two measures are nearly identical, especially for very
small values associated with strong database matches.
• An advantage of using E values is that
it is easier to think about E values of 5
versus 10 rather than 0.99326205
versus 0.99995460.
• A p-value below 0.05 is usually used
to define statistical significance (what does it mean?)
• Thus, an E value of 0.05 or less may be considered
significant.
Multiple comparisons correction I
• The significance level 𝛼 = 0.05 … p-value of 0.05
suggests that some observation (e.g., the score of a
protein query to a match in a database) is likely to have
occurred by chance 1 times in 20.
• Ho … your query is not homologous to the database
match, Ha … it is homologous
• If p-value < 𝛼, which hypothesis is true?
• When you search a database that has one million
proteins, there are many opportunities for your query to
find matches. Five percent of 1 million proteins is 50,000
proteins, and we might expect to obtain that many
matches by chance.
Multiple comparisons correction II
• This situation involves multiple comparisons: you are not
hypothesizing that your query will match one particular
database entry, you are interested in knowing if it matches
any entries.
• A solution is to correct for multiple comparisons by
adjusting the 𝛼 level.
• A very conservative way to do this (Bonferroni
correction) is to divide 𝛼 by the size of the database.
• In the case of BLAST searches this is done automatically,
because the E value is multiplied by the effective search
space.
𝐸 = 𝐾𝑚𝑛𝑒 −𝜆𝑆
E-value interpretation
• E < 10-50 … extremely high confidence that the database match
is a result of homologous relationships
• E is from (10-50 , 0.01) … the match can be considered a result
of homology
• E is from (0.01, 10) … the match is considered not significant,
but may hint tentative remote homology
• E > 10 … the sequences under consideration are either
unrelated or related by extremely distant relationships that fall
below the limit of detection with the current method.
• E-value is proportional to the database size, as database
grows E-value for a given sequence match increases.
However, the evolutionary relationship between two sequences
remains constant. As the db grows, one may lose previously
detected homologs.