Transcript Document

Pairwise Sequence Analysis-I
•
•
•
•
Analogous Concepts
Domain Concepts
Dot-matrix Analysis
Pairwise Sequence Analysis
– Why is it useful?
– What are the underlying concepts?
– What algorithms are used?
– Limitations/Open questions
Lecture 2 CS566
1
Concepts
• Pairwise “everything” analysis
– Humans are experts at pairwise comparison
– “Peas of a pod”
– Basis for pecking order in society
– Basis for notion of fairness (“He got a better
deal than I did!”)
– Basis for grouping
– Basis for sorting/ranking algorithms
– Basis for search strategies
Lecture 2 CS566
2
Concepts
• Identity versus similarity
– Brand name versus generic products
• Coincidence versus plagiarism
– Co-discoveries versus “Microsoft Explorer”
• “What you see is not what you got”
– Process not evident from product
• “If it ain’t broke, don’t fix it”
– Why messy code that works is left to remain
messy
Lecture 2 CS566
3
Domain Concepts
• Sequences change over generations
– More changes in DNA than in proteins (Why?)
• Causes of change
– Error rates of replication ~ 1 in a billion
– Exchange
• Legitimate: Recombination
• Illegitimate: Lateral transfer (a la Napster)
– Mutation
• Exposure to chemicals
Lecture 2 CS566
4
Dot-matrix Analysis
A B L E W A S O N E E R
A \
B
\
L
\
E
\
\
W
\
A
\
S
\
I
E
\ \
R
\
E
\ \
I
S
\
A
\
W
\
E
\
\
L
\
B
\
A \
E O N E S A W E L B A
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
\
Lecture 2 CS566
\
\
\
\
5
Why (Motivation)
1. Given sequence A, what are its properties?
–
Find any sequence(s) B similar to A, where B 
{Sequences with known properties}
2. Given sequences A & B, are they
–
–
–
Identical?
Similar?
Evolutionary related?
3. Given sequence A, is this a new discovery?
4. Given a set of sequences, can they be
clustered into families of related sequences?
Lecture 2 CS566
6
Components
•
Compare the two sequences and assign
score
• Evaluate the score
1. Use of a scoring scheme to assign a value
to a candidate alignment
2. Finding the best alignment between two
sequences
3. Use of a probability model to assess the
significance of the similarity (coincidence or
plagiarism?)
Lecture 2 CS566
7
Concepts - Alignment
•
•
•
•
Sequences may be aligned ‘locally’ (look for
regions/subsequences that are similar) or
‘globally’ (align along entire length)
The result of a local alignment may differ from
that a global alignment (Why?)
Potentially large number of alignments to be
considered
Each alignment is a path through a fully
connected dot matrix graph, i.e., a subset of
the set of all diagonal edges, connected by
horizontal or vertical lines.
Lecture 2 CS566
8
Concepts – Scoring scheme
•
Scores are based on amino-acid
substitution matrices (log-odds ratio of
observed versus expected random
substitutions)
– PAM (Percent amino acid substitution)
matrices: Based on evolutionary model.
E.g., PAM 1%, PAM 250%.
– BLOSUM (Blocks substitution) matrices:
Based on percent identity. E.g., Blosum50,
Blosum62.
Lecture 2 CS566
9
Concepts – Scoring scheme
Identity based scoring (match XOR non-match)
Isawa --- dancer
Isawatapdancer
Isawa ---- danc-er
Isawapinkpanther
Similarity based scoring (partial matching)
Eyl
dkv
Lecture 2 CS566
10
Algorithms
• Optimal/Exact solutions:
– Take longer time
– Typically used for comparing a small number of
sequences
– (Needleman-Wunsch; Smith-Waterman)
• Heuristic:
– Frequently close to the optimal solution
– Rapid
– Typically used for 1:n searches involving large
number of sequences
– (BLAST, FASTA)
Lecture 2 CS566
11
Open Questions/Limitations
Highly similar sequences or highly dissimilar
sequences are easily found BUT
• Gray area lies in weak similarity; probability
distribution is continuous
• Other forms of similarity (e. g., structural or
species similarity) may be necessary reach a
decision
• Sequence similarity per se does not guarantee
functional equivalence; small difference can be
responsible for large difference in function
Lecture 2 CS566
12