Quality Score
Basic terms:
Similarity - a measurable quantity. Applied to proteins using the concept of conservative substitutions.
Identity - a percentage.
Homology - a specific term indicating relationship by evolution.
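As a rough illustration of identity as a percentage, here is a minimal sketch. It assumes the two sequences are already aligned (same length, "-" for gaps) and that gapped columns are left out of the denominator; other conventions divide by the full alignment length. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


print(percent_identity("GATC", "GTTC"))  # 3 of 4 columns match -> 75.0
```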
Basic terms:
Orthologs: homologous sequences found in two or more species that have the same function (e.g. alpha-hemoglobin).
Paralogs: homologous sequences found in the same species that arose by gene duplication (e.g. alpha- and beta-hemoglobin).
Pairwise comparison
Dotplot
All-against-all comparison.
• Every position is compared with every other position.
• Nucleic acids and proteins have polarity.
• Typically only one direction makes biological sense: 5’ to 3’, or amino terminus to carboxyl terminus.
Simple plot
Window: size of the sequence block used for comparison. In the previous example: window = 1.
Stringency: number of matches required to score positive. In the previous example: stringency = 1 (exact match required).
DotPlot
WINDOW = 4; STRINGENCY = 2
GATCGTACCATGGAATCGTCCAGATCA
GATC  + (4/4)
GATC  - (0/4)
GATC  - (0/4)
GATC  + (2/4)
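The window example above can be reproduced with a short sketch. It slides the 4-letter block GATC along the sequence and counts identical positions in each register. The function names are hypothetical, and a window is assumed to score positive when its match count is at least the stringency.

```python
def window_matches(seq, probe, offset):
    """Count identical positions between probe and the window of seq at offset."""
    window = seq[offset:offset + len(probe)]
    return sum(1 for a, b in zip(window, probe) if a == b)


def score_windows(seq, probe, stringency):
    """Return (offset, matches, positive) for every full-length window of seq."""
    results = []
    for offset in range(len(seq) - len(probe) + 1):
        m = window_matches(seq, probe, offset)
        # Assumption: positive means matches >= stringency.
        results.append((offset, m, m >= stringency))
    return results


seq = "GATCGTACCATGGAATCGTCCAGATCA"
rows = score_windows(seq, "GATC", stringency=2)
for offset, m, positive in rows[:5]:
    print(offset, seq[offset:offset + 4], "+" if positive else "-", f"({m}/4)")
```

Offsets 0, 1 and 2 give 4/4, 0/4 and 0/4; offset 4 (window GTAC) gives 2/4, which is enough to score positive at stringency 2.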
Dot Plot
Compare two sequences in every register.
Vary the size of the window and the stringency depending upon the sequences being compared.
• Nucleotide sequences: typically start with window = 21; stringency = 14.
• Protein: start with a smaller window (3) and stringency 1 or 2.
Important to test different stringencies.
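Comparing two sequences in every register can be sketched as a nested loop over window start positions. This is a naive illustration with hypothetical names (`dotplot`, `render`), not how any particular program implements it; it compares in one direction only and does not handle reverse complements.

```python
def dotplot(a, b, window, stringency):
    """Return the set of (i, j) where a[i:i+window] and b[j:j+window]
    share at least `stringency` identical positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(a, b, window, stringency):
    """Draw the dot matrix as text: rows follow a, columns follow b."""
    dots = dotplot(a, b, window, stringency)
    rows = []
    for i in range(len(a) - window + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len(b) - window + 1)))
    return "\n".join(rows)


print(render("GATC", "GATC", window=1, stringency=1))
```

With window = 1 and stringency = 1 the self-comparison of GATC prints a clean diagonal. Raising the window and stringency is what suppresses scattered background dots in longer sequences.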
Intergenic comparison
Nucleotide sequence contains three domains:
• 50 - 350: strong conservation. An indel places the comparison out of register.
• 450 - 1300: slightly weaker conservation.
• 1300 - 2400: strong conservation.
Scoring Alignments
Quality Score:
Score x for a match, -y for a mismatch.
• Penalty for:
  creating a gap
  extending a gap
Scoring Alignments
Quality Score:
Quality = [10(match)] + [-1(mismatch)] - [(Gap Creation Penalty)(# of Gaps) + (Gap Ext. Pen.)(Total length of Gaps)]
Scoring scheme incorporates an evolutionary model:
• Matches are conserved.
• Mismatches are divergences.
• Gaps are more likely to disrupt function, hence a greater penalty than for a mismatch.
• Introduction of a gap (indel) is penalized more than extension of a gap.
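A minimal sketch of the quality score for an alignment that has already been made. It assumes the gap term is subtracted from the match and mismatch terms. The default gap creation and gap extension penalties (50 and 3) are illustrative values, not taken from the slides, and the function name `quality` is hypothetical.

```python
def quality(aligned_a, aligned_b, match=10, mismatch=-1,
            gap_creation=50, gap_extension=3):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = gap_length = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != prev_gap:
                gaps += 1        # a new gap is created
            gap_length += 1      # every gapped column adds to total length
            prev_gap = side
        else:
            prev_gap = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return (match * matches + mismatch * mismatches
            - (gap_creation * gaps + gap_extension * gap_length))


print(quality("GATCGTAC", "GATCCTAC"))  # 7 matches, 1 mismatch -> 69
print(quality("GATCGTAC", "GAT--TAC"))  # 6 matches, 1 gap of length 2 -> 4
```

With these illustrative penalties, one two-base gap costs far more than a single mismatch, which mirrors the point that gaps are more likely to disrupt function.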
Z Score (standardized score)
Z = (Score_alignment - Average Score_random) / Standard Deviation_random
Quality Score: Randomization
• Program takes a sequence and randomizes it X times (user-selected).
• Determines the average quality score and standard deviation for the randomized sequences.
• Compare the randomized scores with the Quality score to help determine if the alignment is potentially significant.
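The randomization procedure and the Z score can be sketched together. To stay short, this example scores sequences position by position without gaps (`ungapped_score`, a hypothetical stand-in for a real alignment score); any scoring function taking two sequences could be passed in instead. The number of shuffles and the seed are arbitrary choices.

```python
import random
import statistics


def ungapped_score(a, b, match=10, mismatch=-1):
    """Stand-in score: compare position by position, no gaps allowed."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def z_score(a, b, score_fn=ungapped_score, n_random=100, seed=0):
    """Z = (observed score - mean random score) / SD of random scores."""
    rng = random.Random(seed)
    observed = score_fn(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_random):
        rng.shuffle(letters)  # randomize the second sequence
        random_scores.append(score_fn(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("random scores have zero spread; Z is undefined")
    return (observed - mean) / sd


seq = "GATCGTACCATGGAATCGTCCAGATCA"
print(z_score(seq, seq, n_random=200))
```

A large positive Z means the real score sits well above what shuffled sequences of the same composition produce.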
Randomization
It has become clear that sequences appear to evolve in a “word”-like fashion.
• The 26 letters of the alphabet are combined to make words.
• Words actually communicate information.
Randomization should actually occur at the level of strings of nucleotides (2-4).
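One way to randomize at the level of short strings rather than single letters is to cut the sequence into fixed-size words and shuffle the words. This is only a sketch under that assumption; the function name `shuffle_words` and the fixed, non-overlapping word boundaries are choices made for illustration.

```python
import random


def shuffle_words(seq, word_size=3, seed=None):
    """Shuffle non-overlapping words of `word_size` letters, keeping each word intact."""
    rng = random.Random(seed)
    # If len(seq) is not a multiple of word_size, the last word is shorter.
    words = [seq[i:i + word_size] for i in range(0, len(seq), word_size)]
    rng.shuffle(words)
    return "".join(words)


original = "GATCGTACCATG"
print(shuffle_words(original, word_size=3, seed=1))
```

The shuffled sequence keeps the same letter composition and the same set of 3-letter words as the original; only their order changes.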
Global Alignment
Global - compares all possible alignments of two sequences and presents the one with the greatest number of matches and the fewest gaps.
• Alignment will “run” from one end of the longest sequence to the other end.
• Best for closely related sequences.
• Can miss short regions of strongly conserved sequence.
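The slides do not name an algorithm. A common way to find the best end-to-end alignment without enumerating every possibility is dynamic programming in the style of Needleman-Wunsch, sketched below. As a simplification it uses a single per-position gap penalty rather than separate creation and extension penalties, and the penalty values are illustrative.

```python
def global_score(a, b, match=10, mismatch=-1, gap=-5):
    """Best global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against nothing: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(global_score("GATC", "GAC"))  # GATC over GA-C: 3 matches, 1 gap -> 25
```

Because every position of both sequences must be accounted for, the score covers the full length of each sequence, which is why a short conserved block inside otherwise unrelated sequences can be swamped.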
Local Alignment
Identifies segments of alignment with the highest possible score.
• Aligns sequences, then extends aligned regions in both directions until the score falls to zero.
• Best for comparing sequences whose relationship is unknown.
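A local alignment score can be sketched with dynamic programming in the style of Smith-Waterman, which the slides do not name but which matches the idea of a score that is not allowed to fall below zero. The single per-position gap penalty and the penalty values are illustrative assumptions.

```python
def local_score(a, b, match=10, mismatch=-1, gap=-5):
    """Best local alignment score between any segment of a and any segment of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(local_score("AAGATCAA", "TTGATCTT"))  # shared GATC block -> 40
```

Here the flanking letters differ, so the best segment is just the shared GATC block; the unrelated ends do not drag the score down as they would in a global alignment.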
[Figures: example global alignment; example local alignment]
Blast 2
Basic Local Alignment Search Tool
E (expect) value: number of hits expected by random chance in a database of the same size.
Larger numerical value = lower significance.
HIV sequence
Both global and local alignment programs will (almost) always give a match.
It is important to determine if the match is biologically relevant.
Not necessarily relevant: low-complexity regions.
• Sequence repeats (glutamine runs)
• Transmembrane regions (high in hydrophobes)
If working with coding regions, you are typically better off comparing protein sequences: greater information content.
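As a crude illustration of spotting one kind of low-complexity region, the sketch below reports runs of a single repeated residue, such as a glutamine (Q) run. It is a toy check with a hypothetical name (`find_runs`) and an arbitrary minimum length; it does not detect compositionally biased regions such as hydrophobic transmembrane stretches.

```python
def find_runs(seq, min_length=5):
    """Return (start, end, residue) for runs of one repeated residue.
    `end` is exclusive, so seq[start:end] is the run itself."""
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # Close the current run at the end of seq or when the residue changes.
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_length:
                runs.append((start, i, seq[start]))
            start = i
    return runs


print(find_runs("MKQQQQQQQLA"))  # [(2, 9, 'Q')]
```

A match that falls mostly inside such a run is a candidate for closer inspection before it is treated as biologically relevant.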