Quality Score


Basic terms:
• Similarity: a measurable quantity.
  • Similarity: applied to proteins using the concept of conservative substitutions.
  • Identity: percentage.
• Homology: a specific term indicating relationship by evolution.
Basic terms:
• Orthologs: homologous sequences found in two or more species that have the same function (e.g. alpha-hemoglobin).
• Paralogs: homologous sequences found in the same species that arose by gene duplication (e.g. alpha- and beta-hemoglobin).
Pairwise comparison
• Dotplot: all-against-all comparison.
  • Every position is compared with every other position.
  • Nucleic acids and proteins have polarity.
  • Typically only one direction makes biological sense: 5' to 3', or amino terminus to carboxyl terminus.
Simple plot
• Window: size of the sequence block used for comparison. In the previous example: window = 1.
• Stringency: number of matches required to score positive. In the previous example: stringency = 1 (exact match required).
DotPlot
WINDOW = 4; STRINGENCY = 2
GATCGTACCATGGAATCGTCCAGATCA
GATC   + (4/4)
GATC   - (0/4)
GATC   - (0/4)
GATC   + (2/4)
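The window and stringency rule in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the slides: the function names `window_matches` and `dotplot` are hypothetical, and a window scores positive when its match count is at least the stringency. Running it on the slide's sequence gives 4/4, 0/4 and 0/4 at offsets 0, 1 and 2, and 2/4 at offset 4 (offset 3 scores 1/4).

```python
def window_matches(block1, block2):
    """Count positions at which two equal-length blocks agree."""
    return sum(1 for x, y in zip(block1, block2) if x == y)


def dotplot(seq1, seq2, window=4, stringency=2):
    """Return (i, j) pairs where the windows starting at seq1[i] and
    seq2[j] share at least `stringency` matching positions."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            n = window_matches(seq1[i:i + window], seq2[j:j + window])
            if n >= stringency:
                dots.append((i, j))
    return dots


seq = "GATCGTACCATGGAATCGTCCAGATCA"
probe = "GATC"
for offset in range(5):
    n = window_matches(probe, seq[offset:offset + 4])
    sign = "+" if n >= 2 else "-"
    print(offset, seq[offset:offset + 4], sign, f"({n}/4)")
```

Only one reading direction is compared here, in line with the polarity point made earlier.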
Dot Plot
• Compare two sequences in every register.
• Vary the size of the window and the stringency depending on the sequences being compared.
• For nucleotide sequences, typically start with window = 21, stringency = 14.
• For proteins, start with a smaller window: window = 3, stringency = 1 or 2.
• It is important to test different stringencies.
Intergenic comparison
Nucleotide sequence contains three domains:
• 50 - 350: strong conservation
  • An indel places the comparison out of register.
• 450 - 1300: slightly weaker conservation
• 1300 - 2400: strong conservation
Scoring Alignments
• Quality Score:
  • Score x for a match, -y for a mismatch.
  • Penalty for creating a gap and for extending a gap.
Scoring Alignments
Quality Score:
Quality = [10 x (matches)] + [-1 x (mismatches)] - [(Gap Creation Penalty)(# of gaps) + (Gap Extension Penalty)(total length of gaps)]
The scoring scheme incorporates an evolutionary model:
• Matches are conserved.
• Mismatches are divergences.
• Gaps are more likely to disrupt function, hence a greater penalty than for a mismatch.
• Introduction of a gap (indel) is penalized more than extension of a gap.
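As a worked illustration of the Quality formula, the sketch below scores an alignment that has already been written out as two gapped rows. The match (10) and mismatch (-1) values come from the slide. The gap creation and gap extension values (50 and 3) are assumptions chosen only for the example, and the function name `quality_score` is hypothetical. A gap is counted as one unbroken run of `-` characters in a single row.

```python
def quality_score(row1, row2, match=10, mismatch=-1,
                  gap_creation=50, gap_extension=3):
    """Quality of a finished alignment given as two equal-length rows.

    gap_creation and gap_extension are illustrative values,
    not taken from the slides.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = gap_length = 0
    previous = None  # which row (if any) held a gap in the previous column
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        current = "row1" if x == "-" else "row2" if y == "-" else None
        if current is None:
            if x == y:
                matches += 1
            else:
                mismatches += 1
        else:
            gap_length += 1
            if current != previous:  # a new gap starts in this column
                gaps += 1
        previous = current
    penalty = gap_creation * gaps + gap_extension * gap_length
    return match * matches + mismatch * mismatches - penalty


# 6 matches, 1 mismatch, 1 gap of length 1: 60 - 1 - (50 + 3) = 6
print(quality_score("GATC-GTA", "GATCAGTT"))
```

With these assumed penalties a single one-base gap costs as much as 53 mismatches, which reflects the slide's point that gaps are penalized more heavily than mismatches.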
Z Score (standardized score)
$$Z = \frac{\text{Score}_{\text{alignment}} - \text{Average Score}_{\text{random}}}{\text{Standard Deviation}_{\text{random}}}$$
Quality Score: Randomization
• The program takes a sequence and randomizes it X times (user-selected).
• It determines the average quality score and standard deviation for the randomized sequences.
• Compare the randomized scores with the Quality score to help determine whether the alignment is potentially significant.
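The randomization procedure can be sketched as follows. To keep the example short it makes two simplifying assumptions that are not in the slides: the score is an ungapped, position-by-position score (10 per match, -1 per mismatch) rather than a full gapped alignment, and the shuffle is a plain letter-by-letter shuffle of the second sequence. The names `ungapped_score` and `z_score` are hypothetical.

```python
import random
import statistics


def ungapped_score(a, b, match=10, mismatch=-1):
    """Score two sequences position by position, with no gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def z_score(seq1, seq2, shuffles=100, seed=0):
    """Z = (observed score - mean random score) / SD of random scores."""
    rng = random.Random(seed)
    observed = ungapped_score(seq1, seq2)
    letters = list(seq2)
    random_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # randomize the second sequence in place
        random_scores.append(ungapped_score(seq1, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("randomized scores have zero spread; Z is undefined")
    return (observed - mean) / sd


seq = "GATCGTACCATGGAATCGTCCAGATCA"
print(z_score(seq, seq))
```

A large positive Z means the real alignment scores well above what shuffled sequences of the same composition achieve.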
Randomization
• It has become clear that sequences appear to evolve in a "word"-like fashion.
  • The 26 letters of the alphabet are combined to make words.
  • Words actually communicate information.
• Randomization should therefore occur at the level of strings of nucleotides (2-4).
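One possible way to randomize at the level of short strings, rather than single letters, is to cut the sequence into fixed-size blocks and shuffle the blocks. This is only a sketch of the idea: the function name `shuffle_words` and the choice of non-overlapping blocks are assumptions, and other word-preserving shuffles exist.

```python
import random


def shuffle_words(seq, word_size=3, seed=None):
    """Shuffle a sequence in blocks of `word_size` letters.

    The order of the blocks is randomized, but the letters inside
    each block stay together. If the length is not a multiple of
    word_size, the final shorter block is shuffled along with the rest.
    """
    rng = random.Random(seed)
    words = [seq[i:i + word_size] for i in range(0, len(seq), word_size)]
    rng.shuffle(words)
    return "".join(words)


print(shuffle_words("GATCGTACCATG", word_size=3, seed=1))
```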
Global Alignment
• Global: compares all possible alignments of two sequences and presents the one with the greatest number of matches and the fewest gaps.
• The alignment will "run" from one end of the longest sequence to the other end.
• Best for closely related sequences.
• Can miss short regions of strongly conserved sequence.
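The slides do not name an algorithm for global alignment. A common choice is Needleman-Wunsch dynamic programming, sketched below under simplifying assumptions: a single linear gap penalty (-5 per gap position, an illustrative value) instead of separate creation and extension penalties, and only the best score is returned, not the alignment itself. The function name is hypothetical.

```python
def global_alignment_score(a, b, match=10, mismatch=-1, gap=-5):
    """Best end-to-end alignment score (Needleman-Wunsch, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# GATC aligned with GA-C: three matches and one gap position.
print(global_alignment_score("GATC", "GAC"))
```

Because every letter of both sequences must be placed, the score can go negative when the sequences have little in common.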
Local Alignment
• Identifies segments of alignment with the highest possible score.
• Aligns sequences, then extends aligned regions in both directions until the score falls to zero.
• Best for comparing sequences whose relationship is unknown.
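The "extend until the score falls to zero" idea corresponds to the Smith-Waterman recurrence, in which a cell is never allowed to drop below zero. The slides do not name the algorithm, so the sketch below is one standard realization, again with an assumed linear gap penalty of -5 and a hypothetical function name. It returns only the best local score.

```python
def local_alignment_score(a, b, match=10, mismatch=-1, gap=-5):
    """Best-scoring local segment pair (Smith-Waterman, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # restart: score fell to zero
                score[i - 1][j - 1] + pair,  # extend the aligned segment
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            best = max(best, score[i][j])
    return best


# Only the shared GATC core contributes; the flanks are ignored.
print(local_alignment_score("TTTGATCTTT", "CCCGATCCCC"))
```

Unlike the global case, unrelated flanking sequence does not lower the score, which is why local alignment suits sequences whose relationship is unknown.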
[Figure: global alignment example]
[Figure: local alignment example]
BLAST 2
Basic Local Alignment Search Tool
• E (expect) value: number of hits expected by random chance in a database of the same size.
• Larger numerical value = lower significance.
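The slide defines E only in words. One standard relation, which is not stated in the source, ties E to a normalized bit score S' through E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below applies that relation. The function name and the example sizes are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with S' the bit score, m the query
    length and n the total database length.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative sizes only: the same hit is less significant in a larger database.
for db_len in (1_000_000, 1_000_000_000):
    print(db_len, expect_value(40, 300, db_len))
```

E grows with database size and shrinks exponentially with score, which is why a larger E means lower significance.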
HIV sequence
• Both global and local alignment programs will (almost) always give a match.
• It is important to determine whether the match is biologically relevant.
• Not necessarily relevant: low-complexity regions.
  • Sequence repeats (glutamine runs)
  • Transmembrane regions (high in hydrophobes)
• If working with coding regions, you are typically better off comparing protein sequences: greater information content.
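As a rough way to spot the low-complexity regions mentioned above, one can measure the Shannon entropy of the letters in a sliding window and flag windows that fall below a cutoff. This is a sketch of one possible heuristic, not the filter used by any particular alignment program. The window size, the threshold, the function names and the example peptide are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the letter composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=1.0):
    """Start positions of windows whose entropy is below `threshold`."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


# Hypothetical peptide with a glutamine (Q) run in the middle.
peptide = "MKVLA" + "Q" * 14 + "TRE"
print(low_complexity_windows(peptide))
```

A match that falls entirely inside flagged windows deserves extra scrutiny before it is treated as biologically relevant.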