CSC598BIL675-2016
Multiple alignment
• The linear comparison of more than two sequences
• Places residues in columns according to position-specific similarity scores
o reflects relationships of the sequences
o the scores are based on indels (gaps) and substitutions.
The alignment of residues implies that they have similar roles in the
proteins or DNA sequences being aligned
e.g. protein active sites or transcription factor binding sites
Strength in numbers: the structure/function message from a multiple
alignment is stronger than that of a pairwise alignment.
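One simple way to turn column-wise similarity into a number is a sum-of-pairs score. The sketch below is only an illustration: the MATCH, MISMATCH and GAP values are arbitrary assumptions, not the scores used by any particular program.

```python
from itertools import combinations

# Toy scores (assumed for illustration; real tools use PAM/BLOSUM-style matrices).
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(a, b):
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # gap aligned with gap is ignored in this sketch
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sum_of_pairs(alignment):
    """Sum the pair scores over every column of an equal-length alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total

alignment = ["ACG-T", "ACGAT", "AC-AT"]
print(sum_of_pairs(alignment))
```

Fully conserved columns raise the total and columns containing gaps lower it, which is the sense in which the score reflects the relationships between the sequences.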
Uses of multiple alignments
Identification of functionally or structurally conserved domain/motif
- biological meaning, domain groups, motif matrices
- ProSite, InterPro, etc
Classification of domains into families
- biological or structural meaning
- Pfam, SCOP
Evolutionary studies
- phylogenetic inference of gene or species evolution
Structure prediction
- homology modeling
Multiple alignment method
1. Find homologous sequences
Homologous sequences share a common ancestor and usually show
relatively high sequence similarity
Not all similar proteins are homologous:
Similarity may have come about due to convergent evolution
or by chance.
Homology is detectable
When there is consensus over a relatively long stretch of sequence
OR
When the conservation is high within functionally relevant regions
THUS
Statistical methods based on position-specific matrices help to
provide some evidence
BUT
You usually need to check your alignment by eye
to make sure it makes sense
AND
May need structural data to recognize homology
Finding how many sequences?
Use different BLAST algorithms.
The more seqs you have the stronger your alignment will be …
– Depends on your sequence type and your question
– Beware of redundant sequences
(choose a threshold relevant to your question)
– Beware of pulling in unrelated sequences
(take a good look at your dataset)
(More sequences means longer computational time, but this
is why we have Pegasus)
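A redundancy threshold can be sketched as a greedy filter on percent identity. This is a minimal illustration that assumes the sequences already have equal length; the names `percent_identity` and `drop_redundant` and the example sequences are hypothetical, and dedicated clustering tools do this far more efficiently.

```python
def percent_identity(a, b):
    """Percent identity over equal-length sequences (simplifying assumption)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def drop_redundant(seqs, threshold=90.0):
    """Keep a sequence only if it is below `threshold` % identity to every kept one."""
    kept = {}
    for name, seq in seqs.items():
        if all(percent_identity(seq, other) < threshold for other in kept.values()):
            kept[name] = seq
    return kept

seqs = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTAC",  # identical to seq1
    "seq3": "ACGTTCGTAC",  # one difference from seq1
    "seq4": "TTGTACCTAA",  # several differences
}
print(list(drop_redundant(seqs, threshold=90.0)))
print(list(drop_redundant(seqs, threshold=95.0)))
```

Raising or lowering `threshold` changes which near-duplicates survive, which is why the cut-off should be chosen with the biological question in mind.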
Global multiple alignment
Assumes conserved regions occur in same order
Begins by aligning them from the beginning of the sequence
Allows gaps
Builds a consensus sequence,
or a profile if based on statistical calculations
Most useful for defining protein families and evolutionary work
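The difference between a consensus sequence and a profile can be shown on a toy alignment. A minimal sketch, assuming gaps are written as "-" and ignored when counting; ties are broken alphabetically, which is an arbitrary choice made here.

```python
from collections import Counter

def column_profile(alignment):
    """Per-column residue frequencies, ignoring gap characters."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

def consensus(alignment):
    """Most frequent residue per column; '-' where a column is all gaps."""
    out = []
    for freqs in column_profile(alignment):
        out.append(max(sorted(freqs), key=freqs.get) if freqs else "-")
    return "".join(out)

alignment = ["ACGTA", "ACGTA", "TCG-A"]
print(consensus(alignment))
print(column_profile(alignment)[0])
```

The consensus keeps one residue per column, while the profile keeps the full frequency distribution and so retains information about variable positions.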
Local multiple alignment
Assumes conserved regions can be duplicated
and can occur in different order along the seqs
[Figure: conserved blocks A, B, C and D]
Most useful for finding motifs (shorter sequence lengths)
Gaps and substitutions
For protein MSA, PAM, BLOSUM, or other scoring matrices are
used to score substitutions, with separate penalties for gaps – but
with position specific weighting.
Clustal default is the BLOSUM series
MUSCLE uses 200PAM plus their own log-expectation matrix
PAM is based on the number of accepted point mutations per 100 residues –
the higher the number, the less stringent, e.g. PAM250 is casting a wide net
BLOSUM is based on the frequency of changes in conserved blocks of
motifs – the higher the number, the more stringent, e.g. BLOSUM80 is biased
towards finding motifs that are highly conserved (to 80%), BLOSUM62 less so, etc.
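Substitution matrices of this kind are log-odds scores: how often a residue pair is observed in trusted alignments compared with how often it would pair by chance. The sketch below shows the arithmetic only; the frequencies are invented for illustration and the function name `log_odds` is hypothetical.

```python
import math

def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score; scale=2 gives half-bit units.

    q_ij: observed frequency of the aligned pair
    p_i, p_j: background frequencies of the two residues
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# Invented frequencies, for illustration only.
common_pair = log_odds(q_ij=0.02, p_i=0.05, p_j=0.05)   # seen more often than chance
rare_pair = log_odds(q_ij=0.001, p_i=0.05, p_j=0.05)    # seen less often than chance
print(common_pair, rare_pair)
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative scores; which alignments the frequencies are counted from is what distinguishes one matrix from another.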
For protein sequences, indels are more likely
in the outer loops than in the inner core or catalytic domain
For non-coding DNA, repeats and transposons may occur
For structural RNAs, loop regions are more variable
than stem regions
Evolution of Algorithms
Profiles
position specific scoring matrix based on amino acid conservation
PSI-BLAST
position specific scoring matrix rebuilt iteratively from successive BLAST searches
Hidden Markov Models
position specific scoring matrix
plus position specific gap penalties
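A position specific scoring matrix can be sketched from a handful of aligned sites. This is a minimal illustration that assumes gap-free DNA sites, a uniform background and a pseudocount of 1; the example sites are made up, and position specific gap penalties (the HMM step) are not modelled.

```python
import math

def build_pssm(sites, alphabet="ACGT", pseudocount=1.0):
    """Log2-odds PSSM from gap-free aligned sites, uniform background assumed."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            res: math.log2(((column.count(res) + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum the per-position scores of `seq` against the PSSM."""
    return sum(col[res] for col, res in zip(pssm, seq))

sites = ["TATA", "TATA", "TACA", "TATT"]  # made-up example sites
pssm = build_pssm(sites)
print(score(pssm, "TATA"), score(pssm, "GCGC"))
```

A sequence that matches the conserved positions scores higher than one that does not, and the pseudocount keeps unseen residues from scoring minus infinity.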
Structural information? Not trivial…
Multiple alignment
1. Exhaustive approaches: mathematically very accurate,
alignments are optimal
BUT these are very complex and take a huge amount of time
2. Heuristic methods: slightly less accurate,
alignments are good but not optimal
AND are usually enough for biological questions
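The cost of the exhaustive approach can be made concrete with a little arithmetic. The sketch assumes full dynamic programming over N sequences of equal length L, where the lattice has (L + 1)^N cells and each cell has up to 2^N - 1 predecessor cells; the function names are hypothetical.

```python
def dp_cells(n_seqs, length):
    """Number of cells in a full dynamic-programming lattice."""
    return (length + 1) ** n_seqs

def predecessors_per_cell(n_seqs):
    """Each cell can be reached from up to 2^N - 1 neighbouring cells."""
    return 2 ** n_seqs - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), predecessors_per_cell(n))
```

The cell count grows exponentially with the number of sequences, which is why heuristic methods are used for anything beyond a handful of short sequences.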
Multiple alignment method
1. Find homologous sequences
2. Place the sequences in a relevant format (usually FASTA), and edit to
similar length.
Example
>ACTB
cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtag
acccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcag
gcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacggg
ccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca
>AGPAT1
tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactga
tgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctc
acattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagat
ctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct
You can use a list of accession numbers if you already know that the
sequences are of similar lengths.
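Before aligning, it helps to check that the FASTA records really are of similar length. A minimal sketch using only the standard library; the record names seqA and seqB and the function `read_fasta` are hypothetical, and the first word of each header line is taken as the name.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; sequence lines are concatenated."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

example = """>seqA
acgtacgt
acgt
>seqB
acgtac
"""
records = read_fasta(example)
lengths = {name: len(seq) for name, seq in records.items()}
print(lengths)
```

Large differences in the reported lengths are a sign that some sequences need trimming before a global alignment is attempted.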
3. Run a multiple alignment program
ClustalW - oldest, flexible, robust
ClustalΩ - latest version, scalable, more accurate with addition of HMM
MUSCLE - fast, good for finding short motifs in small datasets
PRALINE - includes secondary structure information
T-Coffee - good for small datasets of shorter sequences, has a module
for checking input seqs against the PDB
COBALT - uses domain conservation information (from BLAST page)
which by definition has some structural information
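Running one of these programs from a script might look like the sketch below, which uses the Clustal Omega command line as an example. It assumes a `clustalo` executable is installed and on the PATH; the file names are hypothetical, and the options shown (-i, -o, --outfmt, --force) should be checked against `clustalo --help` for the installed version.

```python
import shutil
import subprocess

def clustalo_command(in_fasta, out_aln, outfmt="clu"):
    """Build the argument list for a Clustal Omega run (not executed here)."""
    return ["clustalo", "-i", in_fasta, "-o", out_aln,
            f"--outfmt={outfmt}", "--force"]

def run_alignment(in_fasta, out_aln):
    """Run Clustal Omega if it is installed; raise a clear error otherwise."""
    if shutil.which("clustalo") is None:
        raise RuntimeError("clustalo not found on PATH")
    subprocess.run(clustalo_command(in_fasta, out_aln), check=True)

# Hypothetical file names, shown only to display the resulting command line.
print(" ".join(clustalo_command("seqs.fasta", "seqs.aln")))
```

Keeping the command construction separate from the execution makes it easy to inspect or log the exact command before a long run.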
Clustal family
• Clustal X
Uses progressive global alignment algorithm
Graphical user interface only
• Clustal W and W2
Command line tool, W2 also had a web interface
Has a parallelized version, to cope with larger datasets
• ClustalΩ
HMM searches added to algorithm
Command line and web interface
Scalable to very large datasets
Input of Data
3 or more sequences are needed (nucleic acid or amino acid); several
formats are accepted, e.g. FASTA text files
Remove any white space or empty lines
The analysis will fail if two sequences have the same name
Can copy/paste sequences into Clustal or upload a txt file
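The input rules above can be checked before submitting a file. A minimal sketch, assuming FASTA text and taking the first word of each header line as the sequence name; the function name `check_fasta_input` is hypothetical.

```python
def check_fasta_input(text, min_seqs=3):
    """Return (cleaned_text, problems) for FASTA input."""
    # Drop empty or whitespace-only lines and trim surrounding white space.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    names = [ln[1:].split()[0] for ln in lines
             if ln.startswith(">") and len(ln) > 1]
    problems = []
    if len(names) < min_seqs:
        problems.append(f"need at least {min_seqs} sequences, found {len(names)}")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate names: " + ", ".join(duplicates))
    return "\n".join(lines) + "\n", problems

raw = ">s1\nACGT\n\n>s2\nACGA\n   \n>s1\nACGG\n"
cleaned, problems = check_fasta_input(raw)
print(problems)
```

An empty `problems` list means the file has enough sequences and unique names; the cleaned text can then be pasted or uploaded.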
Groups
Group 1: Joey (BI;G), Alan (CS;U), Chingis (CS, U)
Group 2: Wei (BI;G), Toni (BI,CS;U), William (CS; U), Federico (CS; U),
Group 3: Robert (BI; U), Yifan (BI; G), Shiv (CS; U), Travis (CS; U)