Transcript praline

www.
.uni-rostock.de
Bioinformatics
Sequence Analysis III
Ulf Schmitz
[email protected]
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Sequence Analysis III
Outline
Multiple sequence alignment
• introduction to msa
• methods of msa
o progressive global alignment
o Iterative methods
o Alignments based on locally conserved patterns
Methods: pairwise sequence alignment (flowchart)
Decision points:
• are the sequences protein sequences?
• do sequences encode proteins (e.g. cDNA)?
• does sequence encode proteins and have introns?
• is alignment of high quality?
• did alignment improve?
• is alignment score significant?
Actions:
• choose two sequences
• translate sequences
• predict gene structure
• perform local alignment
• examine sequences for presence of repeats or low-complexity sequences
• alter parameters, e.g. scoring matrix, gap penalties, and repeat alignment
• perform statistical test of alignment score
Outcomes:
• sequences are significantly similar
• sequences are not detectably similar
Multiple Sequence Alignment
Motivation
• DNA sequences of different organisms are often
related
• Similar genes performing similar function
• Genes are represented in highly conserved
forms in organisms
• Through simultaneous alignment of the
sequences of the genes, sequence patterns may
be analyzed
Multiple Sequence Alignment
things to consider
2 protein sequences, length = 300, excluding gaps
number of comparisons by dynamic programming: $300^2 = 9 \times 10^4$
3 protein sequences, length = 300, excluding gaps
number of comparisons by dynamic programming: $300^3 = 2.7 \times 10^7$
number of steps and memory required for a 300-amino-acid sequence = $300^N$,
where N is the number of sequences
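The growth described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `dp_comparisons` is made up for illustration, and it assumes all sequences have the same length and that gaps are ignored, as on the slide.

```python
def dp_comparisons(seq_length, n_sequences):
    """Number of cells a full dynamic-programming alignment must fill.

    Assumes every sequence has the same length (gaps excluded), so the
    count is seq_length ** n_sequences.
    """
    return seq_length ** n_sequences


# Print the growth for 2 to 5 sequences of length 300.
for n in range(2, 6):
    print(n, "sequences:", f"{dp_comparisons(300, n):.1e}", "comparisons")
```

Each added sequence multiplies the work by the sequence length, which is why exact dynamic programming is normally limited to a handful of sequences.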
Relationship of MSA to Phylogenetic analysis
once the msa has been found, the number or types of changes in
the aligned sequences may be used for a phylogenetic analysis
seqA  N – F L S
seqB  N – F – S
seqC  N K Y L S
seqD  N – Y L S
Figure: hypothetical evolutionary tree that could have generated three sequence changes; sequences on the tree are N Y L S, N K Y L S, N F S and N F L S, with branch changes +K, -L and Y to F
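As a rough illustration of how the number of changes can be read off an alignment, the sketch below counts differing columns for every pair of aligned sequences. The aligned rows are taken from the figure on this slide; the assignment of rows to seqA..seqD and the helper names are assumptions made for the example, and a real phylogenetic analysis would use a proper distance or likelihood model rather than raw counts.

```python
from itertools import combinations

# Aligned rows as read from the slide's figure ("-" marks a gap).
# Which row belongs to which label is an assumption of this sketch.
ALIGNED = {
    "seqA": "N-FLS",
    "seqB": "N-F-S",
    "seqC": "NKYLS",
    "seqD": "N-YLS",
}


def count_changes(a, b):
    """Count alignment columns where two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_changes(aligned):
    """Return {(name1, name2): number of differing columns} for all pairs."""
    return {
        (p, q): count_changes(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    }


for pair, changes in pairwise_changes(ALIGNED).items():
    print(pair, changes)
```

Pairs with few differing columns would sit close together on a tree, pairs with many differences further apart.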
Phylogenetic analysis
MSA methods
• Approximate methods are used:
a) progressive global alignment
o starting with an alignment of the most alike sequences and then building an alignment by adding more sequences
b) Iterative methods
o make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result
c) Alignments based on locally conserved patterns
d) statistical methods
o probabilistic models of sequences
MSA Tools
Name | Source
Global alignments including progressive
CLUSTALW or CLUSTALX (latter has graphical interface) | ftp.ebi.ac.uk/pub/software/unix
MSA | ftp://fastlink.nih.gov/pub/msa
PRALINE | http://ibivu.cs.vu.nl/programs/pralinewww/
Iterative and other methods
DIALIGN segment alignment | http://bioweb.pasteur.fr/seqanal/interfaces/dialign2-simple.html
MultAlin | http://protein.toulouse.inra.fr/multalin.html
SAGA genetic algorithm | http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/saga_home_page.html
MSA Tools
Name | Source
Local alignments of proteins
BLOCKS Web site | http://blocks.fhcrc.org/blocks/
HMMER hidden Markov model software | http://hmmer.wustl.edu/
MEME Web site, expectation maximization method | http://meme.sdsc.edu/meme/website/
eMOTIF web server | http://dna.Stanford.EDU/emotif
GIBBS, the Gibbs sampler statistical method | ftp://ftp.ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
Aligned Segment Statistical Evaluation Tool (Asset) | ncbi.nlm.nih.gov/pub/neuwald/asset
SAM hidden Markov model web site | http://www.cse.ucsc.edu/research/compbio/sam.html
MSA scoring
• Another computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in the columns of an msa
• A further challenge is the placement and scoring of gaps in the various sequences of an msa
• One method optimizes the msa by
– maximizing the number of matched pairs summed over all columns in the msa
MSA scoring with the SP model
• the method assumes a model for evolutionary change in which any
of the sequences could be the ancestor of the others
Sequence | Column A | Column B | Column C
1 | N | N | N
2 | N | N | N
3 | N | N | N
4 | N | N | C
5 | N | C | C

 | Column A | Column B | Column C
No. of N-N matched pairs (each scores 6) | 10 | 6 | 4
No. of N-C matched pairs (each scores -3) | 0 | 4 | 6
BLOSUM62 score | 60 | 24 | 6
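The sum-of-pairs (SP) idea can be sketched in Python as follows. This is an illustrative sketch, not the lecture's own code: the function names are made up, and only the three BLOSUM62 entries needed here (N-N = 6, N-C = -3, C-C = 9) are included. Note that this sketch scores every pair in a column, including the single C-C pair in column C (three N, two C, so 3 N-N, 6 N-C and 1 C-C pair), which gives 9 for that column; the slide's table lists only N-N and N-C pairs for it.

```python
from itertools import combinations

# Subset of BLOSUM62 needed for the example columns; keys are sorted pairs.
BLOSUM62_SUBSET = {
    ("N", "N"): 6,
    ("C", "N"): -3,
    ("C", "C"): 9,
}


def pair_score(a, b):
    """Look up the substitution score for an unordered residue pair."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def sum_of_pairs(column):
    """Sum the scores of all residue pairs in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


for name, column in (("A", "NNNNN"), ("B", "NNNNC"), ("C", "NNNCC")):
    print("Column", name, "SP score:", sum_of_pairs(column))
```

With five sequences there are 10 pairs per column, so a single mismatching residue already affects four of them; this is why the SP score drops quickly from column A to column B.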
Progressive multiple sequence alignment
• pairwise alignment is performed on each pair of sequences
• next, a trial msa is produced by first predicting a phylogenetic tree for the sequences
• sequences are then multiply aligned in order of their relationship on the tree
– starting with the most related sequences
– then progressively adding less related sequences to the initial alignment
• used by PILEUP and CLUSTALW
• not guaranteed to be optimal
Progressive msa - general principles
Figure (labels): sequences 1 to 5; pairwise scores (Score 1-2, Score 1-3, ..., Score 4-5); 5×5 similarity matrix of scores; scores to distances; guide tree; multiple alignment; iteration possibilities
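The steps from pairwise scores to a guide tree can be sketched as below. Everything here is a simplified assumption for illustration: the toy sequences are invented and already of equal length, similarity is plain fractional identity instead of a real pairwise alignment score, and the tree is built with simple average-linkage (UPGMA-style) clustering. Real programs such as CLUSTALW use their own scoring and tree-building choices.

```python
from itertools import combinations

# Hypothetical, pre-aligned toy sequences (not from the lecture).
TOY_SEQS = {
    "1": "AAAAAAAA",
    "2": "CCCCAAAA",
    "3": "AAAAAAAC",
    "4": "GGGGGGGG",
    "5": "CCCCGGAA",
}


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Convert pairwise similarity scores to distances (1 - identity)."""
    return {
        frozenset((p, q)): 1.0 - identity(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)
    }


def guide_tree_order(seqs):
    """Average-linkage clustering; returns the merges in the order made."""
    dist = distance_matrix(seqs)
    clusters = [(name,) for name in seqs]

    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters, as a progressive aligner would
        # align the most similar sequences or groups first.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


for left, right in guide_tree_order(TOY_SEQS):
    print("align", left, "with", right)
```

The list of merges is the order in which a progressive method would align sequences and groups of sequences; the alignment step itself is left out of this sketch.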
General progressive msa technique
(follow generated tree)
Figure: guide tree with distance axis d, built up in stages as labeled: 1, 3; then 1, 3, 2, 5; then the root joining 1, 3, 2, 5 and 4
CLUSTALW / CLUSTALX
• ‘W’ stands for “weighting”
– ability to provide weights to sequence and program parameters
• CLUSTALX – with graphical interface
• provides global msa
• Not constructed to perform local alignments.
• Similarity in small regions is a problem.
• Problems with large insertions.
• Problems with repetitive elements, such as domains.
• ClustalW does not guarantee an optimal solution
PILEUP
• very similar to CLUSTALW
• part of the Genetics Computer Group (GCG) package
• does not guarantee optimal alignment
• plots a cluster dendrogram of similarities between sequences
This is not an evolutionary tree!
limits of progressive alignment
• initial pairwise alignment
• the very first sequences to be aligned are the
most closely related in the tree
– if they align well, there will be few errors
– the more distantly related the more errors
• choice of suitable scoring matrices and gap
penalties
when to use progressive alignment?
• for more closely related sequences
• large number of sequences
Iterative methods of msa
• repeatedly realign subgroups of sequences
• then align these subgroups into a global alignment of all the sequences
• aim is to improve the overall alignment score
• selection of groups is based on the phylogenetic tree
– separation of one or two sequences from the rest
– similar to that of progressive alignment
Localized alignments in Sequences
1st: profile analysis
2nd: blocks analysis
3rd: pattern-searching or statistical methods
Profile analysis
• is a sequence comparison method for finding and aligning distantly related sequences
• Finding new family members
• Profile = position-specific scoring table
• from a global MSA of a group of sequences, the more highly conserved regions are extracted into a smaller MSA
• a scoring matrix (called a profile) is then made
Profile analysis
• A profile is used to search a target sequence for
possible matches to the profile
• Scores in the table are used to evaluate the likelihood of a match at each position
• e.g. a profile that is 25 amino acids long will have
25 rows of 20 scores
– each score in a row for matching one of the amino acids
at the corresponding position in the profile
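A profile search of this kind can be sketched as a sliding window over the target sequence. The sketch below is only an illustration under simplifying assumptions: it uses the four rows (consensus I, T, L, S) from the profile example in this lecture as a 4-position profile, ignores gaps and gap penalties entirely, and the names `scan` and `best_hit` are made up for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Four profile rows (consensus I, T, L, S) copied from the lecture's
# profile example; one score per amino acid, in AMINO_ACIDS order.
PROFILE_ROWS = [
    [8, -2, 5, 4, 5, 5, -4, 24, 0, 15, 13, 1, 1, 1, -7, 2, 22, 21, -18, -6],
    [13, -5, 24, 18, -18, 19, 7, 1, 7, -7, -4, 14, 11, 10, -1, 9, 29, 3, -28, -14],
    [5, -5, 3, 4, 13, 4, 2, 8, -4, 14, 12, 8, -5, 0, -10, 0, 10, 10, -1, 5],
    [17, 17, 13, 10, -12, 29, -5, -5, 6, -14, -9, 12, 10, 0, -2, 34, 19, 1, -8, -15],
]
PROFILE = [dict(zip(AMINO_ACIDS, row)) for row in PROFILE_ROWS]


def scan(profile, target):
    """Score every ungapped window of the target against the profile.

    Assumes the target contains only the 20 standard amino acid letters.
    """
    width = len(profile)
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = sum(row[aa] for row, aa in zip(profile, window))
        hits.append((start, score))
    return hits


def best_hit(profile, target):
    """Return (start position, score) of the highest-scoring window."""
    return max(scan(profile, target), key=lambda hit: hit[1])


print(best_hit(PROFILE, "GAITLSK"))
```

For the hypothetical target `GAITLSK`, the window `ITLS` starting at position 2 scores highest because each residue matches the consensus of its profile row.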
Profile example
Con    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
I      8   -2    5    4    5    5   -4   24    0   15   13    1    1    1   -7    2   22   21  -18   -6
T     13   -5   24   18  -18   19    7    1    7   -7   -4   14   11   10   -1    9   29    3  -28  -14
L      5   -5    3    4   13    4    2    8   -4   14   12    8   -5    0  -10    0   10   10   -1    5
S     17   17   13   10  -12   29   -5   -5    6  -14   -9   12   10    0   -2   34   19    1   -8  -15
– Each column is independent
– Average Method: profile matrix values are weighted by the proportion of each amino acid
in each column of MSA
– Evolutionary Method: calculate the evolutionary distance (Dayhoff model) required to
generate the observed amino acid distribution
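One way to read the average method is that each profile value is a substitution score averaged over the residues observed in the column, weighted by their proportions. The sketch below follows that reading as an assumption. The `toy_score` function is a hypothetical stand-in for a real substitution matrix, sequence weighting is left out, and gaps are simply skipped.

```python
from collections import Counter


def toy_score(a, b):
    """Hypothetical substitution score: +2 for identity, -1 otherwise.

    A real profile would use a full substitution matrix here.
    """
    return 2 if a == b else -1


def average_profile(msa, alphabet, score=toy_score):
    """Build one profile row (dict residue -> value) per alignment column.

    Each value is the substitution score against every residue observed
    in the column, weighted by that residue's proportion in the column.
    """
    n_seqs = len(msa)
    profile = []
    for column in zip(*msa):
        freqs = {
            res: count / n_seqs
            for res, count in Counter(column).items()
            if res != "-"
        }
        row = {
            a: sum(f * score(a, b) for b, f in freqs.items())
            for a in alphabet
        }
        profile.append(row)
    return profile


# Tiny hypothetical alignment over a three-letter alphabet.
for row in average_profile(["AC", "AC", "AG", "AC"], "ACG"):
    print(row)
```

In the second column, C occurs in three of four sequences, so C gets the highest value, while G still gets a better value than A because it was observed once.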
Profile analysis
Disadvantages:
• A profile extracted from an msa is only as representative of the variation in the family of sequences as the msa itself.
– If several sequences are similar, the derived profile will be biased in favor of those sequences
– Solution: sequences are weighted by their evolutionary distance based on a phylogenetic tree
• Some amino acids may not be represented in a column because not enough sequences have been included
Block analysis
• like profiles, blocks represent a conserved region in an msa
• but they don’t consider deletions and insertions
• Instead, columns include only matches and mismatches
• Blocks are made by searching an alignment for sections that are highly conserved
• no scoring matrices are used
Blocks
Figure: gapless alignment blocks
Block analysis
Extraction of Blocks from a global or local msa
• A global msa of related sequences usually includes regions without gaps in any of the sequences
• These ungapped patterns are extracted and used to build
blocks
• These blocks are only as good as the msa from which they
are derived
• The BLOCKS server (http://blocks.fhcrc.org) extracts blocks
of width 10-55 from a protein MSA of up to 400 sequences.
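Finding the ungapped regions of an alignment can be sketched as a single pass over its columns. This is a minimal illustration, not the BLOCKS server's algorithm: the function name, the `min_width` parameter and the toy alignment are assumptions, and the sketch only checks that a region is gap-free, not how well conserved it is.

```python
def ungapped_blocks(msa, min_width=3, gap="-"):
    """Return (start, end) column ranges with no gap in any sequence.

    Ranges are half-open, so msa_row[start:end] gives the block segment.
    Only runs of at least min_width columns are reported.
    """
    blocks = []
    start = None
    n_cols = len(msa[0])
    # Iterate one column past the end so a trailing run is closed too.
    for col in range(n_cols + 1):
        gap_free = col < n_cols and all(seq[col] != gap for seq in msa)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks


# Hypothetical toy alignment (rows must have equal length).
TOY_MSA = [
    "MKT-LLVA-GG",
    "MKTALLV-AGG",
    "MRT-LLVAAGG",
]

for start, end in ungapped_blocks(TOY_MSA):
    print(start, end, [row[start:end] for row in TOY_MSA])
```

The final two columns are also gap-free, but at width 2 they fall below the chosen minimum and are not reported.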
Block analysis
• conserved patterns in protein or dna
sequences can be represented by sequence
logos
• the horizontal scale represents sequential
positions in the motif
• height of an amino acid is proportional to the frequency of the amino acid in the column
• Amino acids are shown in decreasing order of
abundance from the top
Extractable information:
• consensus may be read across the columns as
the top amino acid in each column
• Relative frequency of each amino acid
• height of a column provides measure of how
useful that column is for reducing the level of
uncertainty
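The column height in a sequence logo is commonly computed as the information content of the column in bits, that is, the maximum possible uncertainty minus the observed uncertainty (Shannon entropy). The sketch below follows that common definition as an assumption; it omits the small-sample correction that logo programs usually apply, and the function names are made up for the example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one alignment column, in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies (no small-sample correction).
    """
    total = len(column)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(column, alphabet_size=20):
    """Letter heights for one column, most frequent residue first.

    Each letter's height is its frequency times the column's information.
    """
    info = column_information(column, alphabet_size)
    total = len(column)
    return [
        (residue, (count / total) * info)
        for residue, count in Counter(column).most_common()
    ]


print(column_information("NNNN"))  # fully conserved column
print(column_information("NNCC"))  # two residues, equally frequent
print(logo_heights("NNNC"))
```

A fully conserved column reaches the maximum of log2(20) bits for proteins, while a column with two equally frequent residues loses exactly one bit.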
Methods: multiple sequence alignment (flowchart)
Decision points:
• are the sequences protein sequences?
• do sequences encode proteins (e.g. cDNA)?
• are the sequences genomic sequences that encode related proteins?
• do the sequences encode RNA molecules?
• is a convincing alignment produced?
• is there a large number of sequences?
Actions:
• choose three or more sequences
• translate sequences
• predict gene structure
• perform global alignment
• make a profile or PSSM representation of the alignment
• produce a hidden Markov model
• search for blocks
• analyze for patterns, repeats, etc.
• analyze promoter regions, intron-exon boundaries, etc.
• analyze for secondary structure
Outlook
Statistical methods and probabilistic models
1. Expectation Maximization Algorithm
2. the Gibbs Sampler
3. Hidden Markov Models
Sequence Alignment
Thanks for your attention!