aligning any odd sequences
Transfer of information
The main topic of this course is transfer of information.
A month in the lab can easily save you an hour in front
of the computer.
Nothing is impossible for a man who doesn’t have to do
it himself.
To err is human, but to really screw things up, you need a computer.
©CMBI 2005
Transfer of information
The main topic of this course is transfer of information.
In the protein world that leads to the questions:
1) From which protein can I transfer information?
2) How do I transfer what information from where to where?
Today’s answer is BLAST…
Equivalent structural positions
To know if positions in two different proteins are
equivalent, we need to know both protein structures
and compare them with protein structure comparison
software.
But by the time you have solved one or two protein
structures the four years of your PhD period are over...
So, we need a short-cut, and that, ladies and gentlemen,
will be a sequence alignment (i.e. BLAST + ...).
Database Searching with BLAST
Database searching with BLAST involves a series of
topics we will deal with today:
•Database Searching
•Sequence Alignment
•Scoring Matrices
•Significance of an alignment
and:
•BLAST, algorithm
•BLAST, parameters
•BLAST, output
Database Searching
Identify similarities between:
your query sequence
likely with unknown structure and function
database subject sequences
with elucidated structures and function
Database searching concept
The query sequence is compared/aligned with every
subject sequence in the database.
High-scoring database sequences are assumed to be
evolutionarily related to the query sequence.
If sequences are related by divergence from a common
ancestor, they are said to be homologous.
We can only transfer information between homologs.
(And we will learn later that that is because structure is maintained longer during evolution than sequence).
Transfer of information
We want to be able to say things like “this serine is
phosphorylated in the database protein, so in my
homologous protein the corresponding serine is likely
to be phosphorylated too”.
That requires that the green serine and the purple serine
both come from a common ancestor that was
phosphorylated too.
And that, in turn, requires that both serines are located
at the same location in their respective structures.
Sequence alignment
TTSASDFRTRTTHISILLMRL
STSATSYRTRSTHLSLMLMRI
But this is the topic of another seminar. Today we
discuss finding sequences…
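As a small illustration of what such an aligned pair encodes, the sketch below counts the columns in which both rows carry the same residue. The function name percent_identity is hypothetical, and percent identity is only one crude way to summarise an alignment; it is not the score BLAST reports.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns with identical residues."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both rows carry the same residue (gaps never count).
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)


top = "TTSASDFRTRTTHISILLMRL"
bottom = "STSATSYRTRSTHLSLMLMRI"
print(f"{percent_identity(top, bottom):.1f}% identical")
```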
Which Matrix to use?
Close relationships: low PAM, high BLOSUM
Distant relationships: high PAM, low BLOSUM

                  BLOSUM       PAM
More conserved    BLOSUM 80    PAM 20
                  BLOSUM 62    PAM 120
More variable     BLOSUM 45    PAM 250

Often used defaults are: PAM250, BLOSUM62
Significance of alignment (1)
When is an alignment statistically significant?
In other words:
How much different is the alignment score found from scores
obtained by aligning any odd sequences to the query sequence?
Or:
What is the probability that an alignment with this score could have
arisen by chance?
Significance of alignment (2)
Database size = 20 x 10^6 amino acids

peptide     #hits
A           1 x 10^6
AP          50000
IAP         2500
LIAP        125
WLIAP       6
KWLIAP      0.3
KWLIAPY     0.015
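The numbers in this table can be reproduced with a few lines of arithmetic. This is a minimal sketch under a simplifying assumption that is not stated on the slide: all 20 amino acids are equally likely, so each extra residue divides the expected number of exact chance matches by 20. The function name expected_hits is hypothetical.

```python
DATABASE_SIZE = 20 * 10**6   # residues, as on the slide
ALPHABET_SIZE = 20           # assumption: all 20 amino acids equally likely


def expected_hits(peptide: str, db_size: int = DATABASE_SIZE) -> float:
    """Expected number of exact chance matches of `peptide` in the database."""
    # Each extra residue divides the expectation by 20; edge effects are ignored.
    return db_size / ALPHABET_SIZE ** len(peptide)


for pep in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(f"{pep:<8} {expected_hits(pep):g}")
```

Real amino acid frequencies are far from uniform, so the true counts differ per peptide; the point is only how quickly chance matches disappear as the match gets longer.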
BLAST
Question: What database sequences are most similar to
(or contain the most similar regions to) my own sequence?
•BLAST finds the highest scoring locally optimal
alignments between a query sequence and all database
sequences.
•Very fast algorithm
•Can be used to search extremely large databases
•Sufficiently sensitive and selective for most purposes
•Robust – the default parameters can usually be used
BLAST – Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand
likely candidates.
Step 3: Do a real alignment between the query sequence
and those likely candidates. ‘Real alignment’ is a main topic
of this course.
Step 4: Present output to user.
BLAST Algorithm, Step 2
The program first looks for a series of short, highly similar
fragments. It then extends these matching segments in both
directions by adding residues. Residues are added until the
incremental score drops below a threshold.
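The seed-and-extend idea described above can be sketched as follows. This is a toy version under loudly stated assumptions, not the BLAST implementation: seeds are exact 3-letter words looked up in a hash table (BLAST also accepts similar words scored with a substitution matrix), scoring is +2 for a match and -1 for a mismatch, no gaps are allowed, and extension in one direction stops once the running score has fallen more than `drop` below the best score seen. All function names are hypothetical.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Hash table mapping every k-letter word in seq to its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def score(a: str, b: str) -> int:
    # Toy scoring (assumption): +2 for identical residues, -1 otherwise.
    return 2 if a == b else -1


def extend(query, subject, qi, si, k, drop=3):
    """Extend a seed at (qi, si) in both directions without gaps.

    Returns (score, query_start, query_end) with an exclusive end.
    """
    best = sum(score(query[qi + j], subject[si + j]) for j in range(k))
    # Extend to the right of the seed.
    run, right, j = best, 0, 0
    while qi + k + j < len(query) and si + k + j < len(subject):
        run += score(query[qi + k + j], subject[si + k + j])
        j += 1
        if run > best:
            best, right = run, j
        elif best - run > drop:
            break
    # Extend to the left of the seed, starting from the best score so far.
    run, left, j = best, 0, 0
    while qi - 1 - j >= 0 and si - 1 - j >= 0:
        run += score(query[qi - 1 - j], subject[si - 1 - j])
        j += 1
        if run > best:
            best, left = run, j
        elif best - run > drop:
            break
    return best, qi - left, qi + k + right


def best_segment(query, subject, k=3):
    """Best ungapped segment found from any exact k-letter seed, or None."""
    table = index_words(subject, k)
    hits = [extend(query, subject, qi, si, k)
            for qi in range(len(query) - k + 1)
            for si in table.get(query[qi:qi + k], [])]
    return max(hits, default=None)


query = "GGGKWLIAPYGGG"
subject = "CCCKWLIAPYCCC"
print(best_segment(query, subject))
```

The hash table is what makes the search fast: only positions that share a word with the query are ever extended, so most of the database is never aligned at all.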
Basic BLAST Algorithms

Program     Query             Database
BLASTP      protein           protein
BLASTN      DNA               DNA
BLASTX      translated DNA    protein
TBLASTN     protein           translated DNA
TBLASTX     translated DNA    translated DNA
PSI-BLAST
Position-Specific Iterated BLAST
• Distant relationships are often best detected by motif
or profile searches rather than pair-wise comparisons
• PSI-BLAST first performs a BLAST search.
• PSI-BLAST uses the information from significant
BLAST alignments returned to construct a position-specific
scoring matrix, which replaces the query
sequence for the next round of database searching.
• PSI-BLAST may be iterated until no new significant
alignments are found.
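To make the idea of a position-specific scoring matrix concrete, here is a deliberately simplified sketch. It assumes ungapped hits of equal length, a uniform background frequency of 1/20, and a flat pseudocount of 1; PSI-BLAST itself weights sequences and estimates pseudocounts in a more careful way. The hit sequences and function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score (in bits) per residue for each alignment column.

    Assumptions: ungapped, equal-length sequences; uniform background
    frequency of 1/20; simple additive pseudocounts.
    """
    n = len(aligned)
    background = 1.0 / len(AMINO_ACIDS)
    total = n + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Sum of position-specific scores for a segment as long as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


hits = ["KWLIAPY", "KWLVAPY", "RWLIAPF"]   # hypothetical significant hits
pssm = build_pssm(hits)
print(score_with_pssm(pssm, "KWLIAPY"), score_with_pssm(pssm, "GGGGGGG"))
```

Unlike a single matrix such as BLOSUM62, the score for a residue now depends on the column: a W is rewarded where the family always has W and penalised elsewhere.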
BLAST Input
Steps in running BLAST:
•Enter your query sequence (cut-and-paste)
•Select the database(s) you want to search
And, optionally:
•Choose output parameters
•Choose alignment parameters (scoring matrix, filters, ...)
Example query=
>something
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN
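The same input steps can also be scripted instead of pasted into a web form. The sketch below only formats a FASTA record and builds an argument list for the `blastp` program of the NCBI BLAST+ command-line suite; it does not run anything. It assumes BLAST+ is installed and that a protein database named "swissprot" is available locally, both of which are assumptions and not part of the course material. The helper names to_fasta and blastp_command are hypothetical.

```python
def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as FASTA text, wrapped at `width` residues."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def blastp_command(query_file, database, evalue=10.0, matrix="BLOSUM62",
                   low_complexity_filter=True):
    """Argument list for the BLAST+ `blastp` program (not executed here)."""
    return [
        "blastp",
        "-query", str(query_file),            # FASTA file with the query
        "-db", database,                      # database name is an assumption
        "-evalue", str(evalue),               # E-value reporting threshold
        "-matrix", matrix,                    # scoring matrix
        "-seg", "yes" if low_complexity_filter else "no",
    ]


# First line of the example query above, used only to show the formatting.
fragment = "AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC"
print(to_fasta("something", fragment), end="")
print(" ".join(blastp_command("query.fasta", "swissprot")))
```

The resulting list could be handed to `subprocess.run` once the FASTA text has been saved as query.fasta.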
BLAST Output
A low probability indicates that a match is unlikely
to have arisen by chance
A high score indicates a likely relationship
BLAST Output
Low scores with high
probabilities suggest
that matches have
arisen by chance
Alignment Significance in BLAST
P-value (probability)
Relates the score for an alignment to the likelihood that it
arose by chance. The closer to zero, the greater the
confidence that the hit is real.
E-value (expect value)
The number of alignments with a score at least this high that
would be expected by chance in that database (e.g. if E=10, 10
matches with scores this high are expected to be found by chance).
A match will be reported if its E is below the threshold.
Lower E thresholds are more stringent, and report fewer
matches.
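The relation between score, E-value and P-value can be sketched with the standard Karlin-Altschul formulas: E = K·m·n·e^(−λS) for a raw score S, which simplifies to E = m·n·2^(−S') for a bit score S', and P = 1 − e^(−E). Here m is the query length and n the database length. The sketch ignores the edge-effect corrections that BLAST applies to m and n, and the query length of 260 is a hypothetical value chosen for illustration.

```python
import math


def e_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`."""
    # E = m * n * 2^(-S') for a bit score S' (edge corrections ignored).
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e)


QUERY_LEN = 260            # hypothetical query length
DB_LEN = 20 * 10**6        # database size used earlier in these slides

for bits in (30, 50, 100):
    e = e_value(bits, QUERY_LEN, DB_LEN)
    print(f"bit score {bits:>3}: E = {e:.3g}, P = {p_value(e):.3g}")
```

For small E the two values are almost identical, which is why a very low E-value can be read directly as a probability.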
BLAST result: easy
BLAST result: less easy
BLAST result: very difficult
Low complexity filter
Many sequences contain repeats or stretches that consist
predominantly of one type of amino acid.
E.g. Many nuclear proteins have a poly-asparagine tail,
membrane proteins often consist of mainly hydrophobic
amino acids, or many binding proteins have proline rich
stretches.
ASDFGTRGHPPPPPPPPPPP--------------NPPPPPPPPPLTSSDFRGT
Such matching stretches are NOT homologs, but analogs.
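A low complexity filter can be sketched by measuring how varied the residue composition is in a sliding window and masking windows that are too uniform. This is a simplified stand-in, not the SEG program that BLAST uses for protein queries; the window length of 12 and the entropy cutoff of 1.5 bits are assumptions picked for this example, and masking whole windows tends to over-mask a few flanking residues.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of `window`."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, cutoff: float = 1.5) -> str:
    """Replace every residue inside a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < cutoff:
            masked[i:i + window] = "X" * window
    return "".join(masked)


proline_rich = "ASDFGTRGH" + "P" * 11 + "LTSSDFRGT"
print(proline_rich)
print(mask_low_complexity(proline_rich))
```

Masked residues no longer produce seeds, so two unrelated proteins that merely share a proline run stop showing up as high-scoring hits.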
Demo
Ice, CNCZ, and the internet permitting, a demo follows now…