Alignments
[Figure: GenBank Growth Chart. x-axis: Year, Dec-82 to Apr-98; y-axis: Bases, 0 to 1,600,000,000.]
Evolutionary basis of Alignment
• Alignments enable the researcher to determine if two
sequences display sufficient similarity to
justify the inference of homology.
• Similarity is an observable quantity that
may be expressed as, say, % identity or some
other measure.
• Homology is a conclusion drawn from these
data: that the two genes share a common
evolutionary history.
Conclusion
• Genes either are or are not homologous.
• There are no degrees of homology as there
are in similarity.
• While it is presumed that homologous
sequences have diverged from a common
ancestral sequence through iterative
molecular changes, we do not actually
know what the ancestral sequence was.
Conclusion
• An alignment thus just reflects the probable
evolutionary history of two genes/proteins.
• Residues that are aligned but not
identical represent substitutions.
• Regions in which the residues of one
sequence correspond to nothing in the other
are interpreted as either an insertion or a
deletion. These regions are represented in an
alignment as gaps.
Continued
• Certain regions are more conserved than others:
crucial residues (structure/function).
• Certain regions may be conserved without
being functionally related, for historical reasons.
• This applies especially to sequences from closely
related species, which have not had sufficient
time to diverge.
• MOTD: Experimental tests are a must for
validation; computational analysis just
provides basic insight.
An interesting example
• Appear to share a high degree of similarity.
• Should have similar biological function.
• Hypothetical statement.
• Crystallin: lens matrix of the vertebrate eye.
• E. coli metabolic enzyme: quinone
oxidoreductase.
• Function has changed during the course of
evolution.
• CAREFUL!!
Global Alignment:
• An alignment that spans the entire length of
the protein like the one in the previous
example.
• Best for proteins that have not diverged
substantially.
• Best for proteins that have a single globular domain.
Local alignment:
• Many proteins appear to be mosaics of modular
domains.
• Modular structure of two proteins involved in
Blood clotting.
[Figure: domain diagrams of the two proteins. Domain labels: F1, F2, E, K, Catalytic.]
Continued
• Besides the catalytic domain that provides
the serine protease activity, there are other
domains.
• Two types of fibronectin repeats.
• A domain with similarity to EGF.
• And a “Kringle” domain.
• Domains can be repeated and can appear in a
different order.
• For such cases, use local alignment.
Continued...
• Another case where local alignments might
be used is at the nucleotide level, when one
compares the nucleotide sequence of
a spliced RNA to its genomic DNA.
• Each exon is in a distinct local alignment.
Comparing two sequences
• Gap: Finds the alignment of two complete
sequences (global alignment) that maximises
the number of matches and minimises the
number of gaps.
• BestFit: Aligns the best segment of
similarity between two sequences (local
alignment).
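The difference between the two modes can be sketched with a small dynamic-programming scorer. This is only an illustration, not the Gap or BestFit implementation: the function names and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for the example.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    # prev[j] holds the best score for a[:i-1] against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of segments."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

For two sequences that share only one short segment, the local score rewards the shared segment alone, while the global score also pays for every unmatched flanking residue.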
Evaluation of Alignment Accuracy
• What is a good alignment?
• The amino acid sequence encodes the protein's three-dimensional
structure.
• When an alignment of two or more sequences is made, the implication
is that the equivalent residues are performing similar structural roles in
the native folded protein.
• The best judge of alignment accuracy is thus obtained by comparing
alignments resulting from sequence comparison with those derived
from protein three dimensional structures.
• Care must be taken when performing the comparison since within
protein families, some regions show greater similarity than others.
Check by
• Monte Carlo Simulation
– To check the accuracy of an alignment of, say,
sequences A and B.
– Randomise B and calculate the % identity.
– Iterate 1000 times, count how many times out of
the thousand the % identity exceeds that of the
actual sequence, and so estimate the probability of
getting the alignment score by chance.
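The shuffling check can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, % identity is computed position by position without realigning each shuffled sequence (a real test would realign), and a shuffle counts as a hit when its identity is at least as high as the observed value.

```python
import random


def percent_identity(a, b):
    """Percentage of identical residues, compared position by position (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / n


def shuffle_test(a, b, iterations=1000, seed=0):
    """Estimate how often a shuffled copy of b matches a as well as b itself does."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = percent_identity(a, b)
    residues = list(b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(residues)  # keeps composition, destroys order
        if percent_identity(a, "".join(residues)) >= observed:
            hits += 1
    return observed, hits / iterations
```

A small returned fraction suggests the observed identity is unlikely to arise by chance from composition alone.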
Scoring Schemes
Identity scoring
• This is the simplest scoring scheme.
• Amino acid pairs are classified into two types:
identical and non-identical
• Non-identical pairs are scored 0 and identical pairs
given a positive score (usually 1)
• The scoring scheme is generally considered less
effective than schemes that weight non-identical
pairs
Genetic code scoring
• Genetic code scoring was introduced by Fitch.
• Considers the minimum number of DNA/RNA base
changes (0,1,2 or 3) that would be required to inter-convert
the codons for the two amino acids.
• The scheme has been used both in the construction of
phylogenetic trees and in the determination of homology
between protein sequences having similar three
dimensional structures.
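A minimal sketch of the idea: for each pair of amino acids, take the smallest number of base differences over all pairs of their codons. The function name is hypothetical, and the standard genetic code is assumed (written here as the usual compact string in T, C, A, G base order).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

# Map each amino acid (one-letter code) to the list of its codons.
CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(bases))


def min_base_changes(aa1, aa2):
    """Minimum number of base changes (0 to 3) needed to turn a codon
    for aa1 into a codon for aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1]
               for c2 in CODONS[aa2])
```

For example, `min_base_changes("F", "L")` is 1 (TTT to TTA), while methionine and tyrosine need all three bases changed.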
Chemical similarity scoring
• Give greater weight to the alignment of
amino acids with similar physico-chemical
properties.
• Amino acids are classified on the basis of polar
or non-polar character, size, shape and
charge.
PAM
• Scoring scheme based on observed
substitutions.
• Derived by analysing the substitution
frequencies seen in alignments of sequences.
• This is something of a chicken and egg
problem, since in order to generate the
alignments, one really needs a scoring
scheme but in order to derive the scoring
scheme one needs the alignments!
BLAST
• Compile a list of high-scoring words
from the query sequence.
• These are all w-mers that score at least T
against a word in the query.
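The word list can be sketched like this. It is a toy illustration, not BLAST's code: the function names are hypothetical, and the scoring function (+5 for identity, -1 otherwise) is an assumed stand-in for a real substitution matrix.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def toy_score(x, y):
    """Assumed toy scoring: +5 for identical residues, -1 otherwise."""
    return 5 if x == y else -1


def high_scoring_words(query, w=3, threshold=9):
    """Map every w-mer scoring >= threshold against some query word
    to the query positions it was generated from."""
    words = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for cand in product(ALPHABET, repeat=w):
            score = sum(toy_score(q, c) for q, c in zip(qword, cand))
            if score >= threshold:
                words.setdefault("".join(cand), []).append(pos)
    return words
```

With these toy values and w = 3, a threshold of 9 admits the query word itself plus every word differing from it at one position; raising T to 15 leaves only exact words.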
Flavors of BLAST
• Blastp/blastn - Match a protein or nucleic acid
query against a protein or nucleic acid database, respectively.
• Blastx - Match a nucleic acid query against a protein
database; matching is done at the amino acid level.
• Tblastx - Nucleic acid query against a nucleic acid
database, but matching is done at the protein
level.
• Tblastn - Amino acid query against a nucleic acid
database, but matching is done at the amino acid level.
Flavors of BLAST
• Blast2/Advanced Blast/Wu -Blast
– Can perform gapped alignments.
• BLAST2.0
– Introduced a window factor A.
• Two hits must be located within a window of size A.
• Isolated random hits are ignored.
• Can perform gapped alignments.
Flavors of BLAST
• PSI Blast- Position specific iterative blast.
– Perform iterative database searches.
– The results from each search are incorporated into a
“Position specific scoring matrix”, which is used for
further searching.
• PHI Blast - Pattern Hit Initiated Blast
– Input a protein query sequence and a
pattern contained in that sequence.
– Search for other protein sequences that contain the
pattern and have significant similarity to the query
– The pattern becomes the most rigid part of the
search.
PSI BLAST
• A profile can be understood as a table that lists the frequencies of
finding each of the 20 amino acids at each position in a conserved
protein domain.
• Building a profile can be tedious.
• PSI-BLAST: A profile is constructed and iteratively refined.
• Take a query sequence.
• Make a profile with the initial search result.
• Use this profile in a second pass search of the database.
• Additional sequences found are used to refine the profile.
• An interesting case of HIT (Histidine Triad Protein DB Search)
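A profile in the sense described above, a per-position frequency table, can be sketched as follows. This is an assumption-laden simplification: the function name is hypothetical, the input is taken to be already aligned sequences of equal length with "-" for gaps, and the log-odds scores, pseudocounts and sequence weighting that PSI-BLAST applies are left out.

```python
from collections import Counter


def build_profile(aligned):
    """Return, for each alignment column, a dict of residue -> frequency.
    Gap characters ('-') are ignored when counting."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        residues = [s[col] for s in aligned if s[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile
```

For the made-up alignment `["HIT", "HVT", "HIS", "H-T"]`, the first column is all H, while the last column is three quarters T and one quarter S.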
FASTA
• Identify all exact matches of word size k or
greater between the query and the database
sequence.
• This word size is what is called k-tup and is
usually set as 2 for proteins and 1-6 for
nucleic acids.
• Higher word size:
– Faster, less sensitive and more selective.
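The word-matching step can be sketched with a lookup table of query words. This is an illustration only, not the FASTA program's code; the function name is hypothetical. Hits are grouped by diagonal (query position minus subject position), because matches on the same diagonal belong to the same ungapped region.

```python
from collections import defaultdict


def ktup_hits(query, subject, ktup=2):
    """Count exact word matches of length ktup on each diagonal."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject and record which diagonal each match falls on.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return dict(diagonals)
```

With a larger `ktup` there are fewer words and fewer chance matches to process, which is why the search gets faster but can miss weaker similarities.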
FASTA
• Penalty - gaps
• Penalty - creation of Gap. (3)
• Penalty -extension of Gap. (1). Also called
bias.
FASTA
• Rescan the 10 regions with highest similarity
score.
• Calculate the scores using the scoring matrix.
• Trim the ends of the region to include only
those residues which contribute to the
highest score.
• This results in 10 partial alignments without
gaps.
FASTA
• The score of the highest scoring initial region
is stored as the init1 score.
• Try to join regions that lie around the same
diagonal into the longest possible non-overlapping alignment with gaps.
• A penalty is applied for gaps. The score of the highest
scoring region at the end of joining is called
the initn score.
• The alignment is then optimised, and the resulting
score is called the opt score.
Low Complexity Regions
• LCRs have biased composition and can lead to confusing
results during database searches.
• They range from homopolymeric runs and short-period
repeats to more subtle cases where one or a few residues
are overrepresented.
• SEG: partitions sequences into low-complexity regions (LCR)
and high-complexity regions (HCR).
• More than 50% of the proteins in the database contain at
least one LCR.
• LCRs do not fit the residue-by-residue model of sequence
conservation, so you may see a lot of false positives.
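One simple way to flag compositionally biased stretches is to slide a window along the sequence and compute its Shannon entropy. This is a stand-in for illustration, not the SEG algorithm itself; the function names, the window length of 12 and the cutoff of 2.2 bits are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, cutoff=2.2):
    """Start positions of windows whose entropy is at or below the cutoff."""
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) <= cutoff]
```

A homopolymeric run has entropy 0, whereas a window of 12 different residues reaches about 3.58 bits, so only the former is flagged.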
Repetitive Elements
• False positives.
• More in DNA than protein searches. Mostly,
found in the untranslated regions of the
message.
• Represented in the results as warning
sequences.
• Do a preliminary search against an Alu repeats
database.
Multiple Alignments
• Multiple sequence alignment algorithms
allow you to compare and align more than
two related sequences.
• Very useful when analysing a family of
proteins.
• E.g: ClustalW and Pileup
Method
• Create a tree by comparing the sequences and
joining the most similar ones step by step.
• The two most similar sequences are aligned
first; then the next most similar sequences
are aligned.
• Re-adjust the gaps so that the number of matches
is maximised and the number of gaps is minimised.
GenBank - Clean and Up-to-date
• Examples:
– Jurassic Park - Michael Crichton.
Continued....
• John Mallata - Univ. of Washington.
– Evolutionary biologist...comparative
genomics...realised after several months of
work that the Xenopus sequence he was relying
on was incorrect...found the error by
accidentally coming across the correct sequence
in literature.
• Cases where Hamster sequence is called
Human by mistake.
Continued....
• Genes are placed on wrong chromosomes.
• Five years ago people sequenced only those genes for
which they knew the function. Now the reverse is
true.
• There are very few genes in the database that are
characterised; function is usually determined
by computer programs that can be tripped up when domains have different roles.
Continued......
• Database education is important.
• Peer Bork (EBI) estimates that about 15% of the
information in GenBank is unverified and not
up to date.
• Don’t assume that all information in the databases
is correct.