Database searching with BLAST
Outline of today’s lecture
• Transfer of information
• Database searching with sequences
• Sequence Alignment
• Scoring Matrices
• Significance of alignments
• BLAST
• method
• parameters
• output
Celia van Gelder
CMBI
UMC Radboud
September 2013
Transfer of information
The main topic of this course is the transfer of information from a
well-known system to a “new” system (sequence).
In the protein world that leads to the questions:
1) From which protein can I transfer information?
2) How do I transfer which information, from where to where?
Today’s answer is BLAST…
BLAST - Searching with sequences
LAST WEEK:
Searching with words (Google like)
Query = word(s)
Tool used: (MRS-Search, Entrez, SRS, …)
TODAY:
Searching with sequences
Query = sequence
Tool used: BLAST (MRS, NCBI, ..)
Database Searching with a query sequence
Purpose:
To identify similarities between
Your query sequence (with unknown structure and function)
and
Database sequences (with elucidated structures and function)
If we identify similarity we can transfer information!
Transfer of information to corresponding residues
Your sequence: DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN
is phosphorylated on one of the two serines.
Which one?
What is your approach?
Transfer of information to corresponding residues
BLAST finds two database hits that are annotated to have a
phosphorylated serine.
DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN
DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN
AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN
“this serine is phosphorylated in a known protein from the database,
so in my protein the corresponding serine is likely to be
phosphorylated too”.
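The reasoning above can be sketched in a few lines of Python. This is only an illustration: it takes the three aligned rows from the slide and reports which query serine sits in a column where every database hit also has a serine. The function name `conserved_serines` is made up for this example, and the sketch assumes that the annotated phosphoserine of each hit is the serine found in that shared column.

```python
# Aligned rows copied from the slide ("-" marks a gap).
QUERY_ROW = "DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN"
HIT_ROWS = [
    "DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN",
    "AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN",
]


def conserved_serines(query_row, hit_rows):
    """Return 1-based query residue numbers whose serine aligns with a serine in every hit."""
    positions = []
    residue_number = 0
    for column, residue in enumerate(query_row):
        if residue == "-":
            continue  # gaps do not count as query residues
        residue_number += 1
        if residue == "S" and all(row[column] == "S" for row in hit_rows):
            positions.append(residue_number)
    return positions


print(conserved_serines(QUERY_ROW, HIT_ROWS))  # prints [24]
```

The first serine of the query (residue 11) aligns with T and L in the two hits, so only the second serine (residue 24) is a candidate for transferring the annotation.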
Database searching concept
– The query sequence is compared (aligned) with every
sequence in the database.
– High-scoring database sequences are assumed to be
evolutionarily related to the query sequence.
– If sequences are related by divergence from a common
ancestor, they are said to be homologous.
Sequence Alignment
[Figure: two alignments of sequences A and B]
gap = insertion or deletion (indel)
Sequence alignment is easy:
You only need three things:
1) A computer program that produces all possible
alignments, and
2) A computer program that gives each alignment a score,
and, the simplest,
3) A computer program that selects the highest scoring
alignment from the very large number you tried.
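To get a feeling for "the very large number you tried", the sketch below counts how many alignments exist for two sequences of given lengths. It assumes that every alignment column is either a residue pair, a gap in the first sequence, or a gap in the second sequence, and that the order of adjacent gaps matters. The function name `count_alignments` is invented for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of gapped alignments of two sequences with m and n residues."""
    if m == 0 or n == 0:
        return 1  # only one way left: pad the remaining residues with gaps
    # The last column is a residue pair, a gap in sequence 2, or a gap in sequence 1.
    return (
        count_alignments(m - 1, n - 1)
        + count_alignments(m - 1, n)
        + count_alignments(m, n - 1)
    )


for length in (2, 3, 11, 50):
    print(length, count_alignments(length, length))
```

Two sequences of only 3 residues already allow 63 alignments, and the count grows exponentially with length. This is why real programs do not enumerate all alignments but use smarter search strategies.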
Scoring/Substitution Matrix
• Scoring scheme for quality of an alignment
• Contains scores for every possible amino acid substitution
in a sequence alignment
• For protein/protein comparisons we need a 20 x 20 matrix
with scores for pairs of residues. Every cell in the matrix
contains at position X, Y a score for the
substitution/mutation amino acid X -> amino acid Y
Scores
• Positive score if corresponding amino acid residues in the two
aligned sequences are identical or similar. This is a likely
change.
• Negative score if corresponding amino acid residues are not
similar. This is an unlikely change.
• The scores are numbers that you can add up.
Amino Acid substitutions, some thoughts
Not all 20x20 possible mutations occur equally often
• Residues mutate more easily to similar ones (e.g.
Leucine and Isoleucine)
• Residues at surface mutate more easily
• Aromatics mutate preferably into aromatics
• Core tends to be hydrophobic;
• Cysteines are dangerous at the surface
• Cysteines in sulfur bridges (S-S) seldom mutate
• Some amino acids have similar codons
(for example TTT & TTC for Phe, TTA & TTG for Leu)
• Etc etc
PAM250 Matrix (Dayhoff Matrix)
Scoring example
Score of an alignment is the sum of the scores of
all pairs of residues in the alignment
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
score:       1 12 12  6  2  5 -1  2  6  1  0
=> score = 46
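The same sum can be written as a short Python sketch. The dictionary holds only the ten pair scores that appear in the example on the slide, not the full 20 x 20 matrix, and the names `PAIR_SCORES`, `pair_score` and `alignment_score` are chosen just for this illustration.

```python
# Pair scores taken from the scoring example on the slide (a small excerpt, not the full matrix).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2, ("I", "I"): 5,
    ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6, ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores over all columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # prints 46
```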
Scoring matrix, continued
• When you use bioinformatics tools (BLAST, CLUSTAL, etc) the
scoring matrix often is a parameter that you can choose.
• Two widely used matrices (often default in the packages)
PAM250 (Dayhoff et al)
Based on closely similar proteins
BLOSUM62 (Henikoff et al)
Based on conserved regions
Considered best for distantly related proteins
Significance of alignment (1)
When is an alignment statistically significant?
In other words:
How different is the alignment score found from the scores
obtained by aligning random sequences to the query sequence?
Or:
What is the probability that an alignment with this score could have
arisen by chance?
Significance of alignment (2)
Database size = 200 x 10^6 amino acids
peptide     #hits
A           10 x 10^6
AP          500 x 10^3
IAP         25000
LIAP        1250
WLIAP       62.5
KWLIAP      3.1
KWLIAPY     0.16
KWLIAPYS    0.008
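The numbers in this table follow from a very simple model, sketched below: every extra residue in the peptide divides the expected number of chance hits by 20. The sketch assumes that all 20 amino acids are equally frequent and that positions are independent, which is what the slide's numbers imply but not what real proteins look like. The function name `expected_hits` is made up for this example.

```python
DATABASE_SIZE = 200 * 10**6  # amino acids, as on the slide


def expected_hits(peptide_length, database_size=DATABASE_SIZE):
    """Expected number of exact chance matches for a peptide of the given length."""
    return database_size / 20**peptide_length


for peptide in ("A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY", "KWLIAPYS"):
    print(f"{peptide:<9}{expected_hits(len(peptide)):g}")
```

With eight residues the expected number of chance hits drops below 0.01, so finding such a peptide in the database is unlikely to be a coincidence.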
Sequence similarity search
Question: What database sequences are most similar to (or
contain the most similar regions to) my own sequence?
Input: Query sequence
Output: List of sequences that are similar to the query sequence
BLAST
• BLAST – Basic Local Alignment Search Tool
• BLAST finds the highest scoring locally optimal alignments
between a query sequence and all database sequences.
• Very fast algorithm
• Can be used to search extremely large databases
• Sufficiently sensitive and selective for most purposes
• Robust – the default parameters can usually be used
Why use BLAST?
BLAST searching is fundamental to understanding the relatedness
of any favorite query sequence to other known proteins or DNA
sequences.
Applications include
• discovering new genes or proteins
• discovering variants of genes or proteins
• exploring protein structure and function
• Etc.
It is all about transfer of information!
BLAST – Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand
likely candidates.
Step 3: Do a real alignment between the query sequence and
those likely candidates.
N.B. ‘Real alignment’ is a main topic of this course.
Step 4: Present result to user: list of sequences that match
query sequence & their alignments
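Step 2 can be illustrated with a toy word index in Python. This is a strongly simplified sketch, not the BLAST implementation: it only looks up exact three-letter words, whereas BLAST also considers similar words above a score threshold and extends the matches it finds. The function names and the three-entry database are invented for this example; two entries are the ungapped hits from the phosphorylation slide.

```python
from collections import Counter, defaultdict


def build_word_index(database, word_size=3):
    """Map every word of length word_size to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].add(name)
    return index


def candidate_hits(query, index, word_size=3):
    """Rank database sequences by the number of query words they share."""
    shared = Counter()
    for start in range(len(query) - word_size + 1):
        for name in index.get(query[start:start + word_size], ()):
            shared[name] += 1
    return shared.most_common()


database = {
    "hit1": "DRRGTTINLMTTKRTYADELENASEDRTLLLNMN",
    "hit2": "AEPIYYHLLTKRETYHIHIENASEEKIIKIVVN",
    "unrelated": "GGGGGGGGGG",  # toy sequence that shares no word with the query
}
index = build_word_index(database)
print(candidate_hits("DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN", index))
```

Only the sequences that share words with the query come back as candidates. The expensive "real alignment" of Step 3 is then needed for those candidates only, which is what makes the search fast.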
Basic BLAST Algorithms
Program   Query            Database         Comparisons
BLASTP    protein          protein          1
BLASTN    DNA              DNA              1
BLASTX    translated DNA   protein          6
TBLASTN   protein          translated DNA   6
TBLASTX   translated DNA   translated DNA   36
DNA potentially encodes six proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Slide from Bioinformatics and Functional Genomics
by Jonathan Pevsner. Copyright © 2009
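The six reading frames can be reproduced with a small Python sketch: three frames on the given strand and three on its reverse complement. The function names are chosen for this illustration only, and the sketch splits the DNA into codons without translating them.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split one strand into complete codons, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Codon lists for the three forward and the three reverse reading frames."""
    strands = (dna, reverse_complement(dna))
    return [codons(strand, offset) for strand in strands for offset in range(3)]


DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(DNA):
    print(" ".join(frame[:2]), "...")
```

The first two codons of each frame match the slide (CAT CAA, ATC AAC, TCA ACT, GTG GGT, TGG GTA, GGG TAG). Comparing six translated query frames with six translated database frames gives the 6 x 6 = 36 comparisons of TBLASTX.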
Steps in running BLAST
• Enter your query sequence (cut-and-paste)
• Select the database(s) you want to search
And, optionally:
• Choose output parameters
• Choose alignment parameters (scoring matrix, filters, …)
BLAST Input - FASTA format
>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPW
QVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDE
DNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSV
DDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
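A minimal sketch of how such FASTA text can be read in Python is shown below. The function name `parse_fasta` is made up for this example; it uses the first word after ">" as the sequence name and joins all following lines into one sequence string.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict that maps sequence names to sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word is the name, rest is comment
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


EXAMPLE = """>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPW
QVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDE
DNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSV
DDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
"""

for name, sequence in parse_fasta(EXAMPLE).items():
    print(name, len(sequence))
```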
BLAST Output
Click here to go to the corresponding Swiss-Prot entry
Click here to study alignment in detail;
Look here first!!
A high score indicates a likely relationship
A low E-value indicates that a match is unlikely to have arisen by chance
BLAST Output
But remember:
Mathematical significance ≠ biological significance!
Low scores with high E-values suggest that matches have arisen by chance
Alignment Significance in BLAST
P value (probability)
• A p value is a way of representing the significance of an
alignment.
• The closer to zero, the greater the confidence that the hit is
significant.
• 0<p<1
Alignment Significance in BLAST
E value (expect value)
• The expect value E is the number of alignments with scores
greater than or equal to the current score S that are expected to
occur by chance in a database search.
• e.g. an E value of 5 assigned to a hit indicates that in a
database of the current size one might expect to see 5 matches
with a similar score simply by chance.
• Rule of thumb: An E value of 10^-6 or lower normally means
that the match is unlikely to be a chance hit.
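The P value and the E value are linked. The sketch below uses the relation P = 1 - exp(-E), which is not on the slides: it assumes, as in the usual BLAST statistics, that the number of chance hits follows a Poisson distribution with mean E. The function name `p_value_from_e` is invented for this illustration.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e_value in (5, 1, 0.01, 1e-6):
    print(f"E = {e_value:g}  ->  P = {p_value_from_e(e_value):.6g}")
```

For small values E and P are practically identical, while for an E value of 5 the P value is close to 1: at least one chance hit with such a score is almost certain.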
BLAST result: easy
BLAST result: less easy
BLAST result: very difficult
BLAST parameter: Low complexity filter
• Many sequences contain repeats or stretches that consist
predominantly of one type of amino acid
• We call these low-complexity regions.
• Examples:
• Many nuclear proteins have a poly-asparagine tail (polyN)
• Huntington's disease polyglutamine (polyQ) repeat
• Membrane proteins often consist of mainly hydrophobic
amino acids
• Many binding proteins have proline-rich stretches.
Example PPPPPPL/R
BLAST - Low complexity filter
Low complexity regions influence your BLAST output
NNNNNNNN
Use the low complexity filter to adapt your BLAST query sequence:
Filter OFF
NNNNNNNN
Filter ON
Choice depends on your research question!
Low complexity motifs visible
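What a filter does to the query can be illustrated with a deliberately crude Python sketch: it replaces long runs of one repeated residue by X, so that these positions no longer contribute to matches. This is an assumption-laden simplification; the filters used by BLAST servers detect low-complexity regions in a more general way than this run-length rule. The function name `mask_homopolymers` and the threshold of six residues are chosen for this example only.

```python
import re


def mask_homopolymers(sequence, min_run=6, mask_char="X"):
    """Replace every run of at least min_run identical residues by mask characters."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda match: mask_char * len(match.group(0)), sequence)


print(mask_homopolymers("MKTNNNNNNNNLLA"))  # prints MKTXXXXXXXXLLA
print(mask_homopolymers("APPPPPPLR"))       # prints AXXXXXXLR
```

With the filter off, the polyN stretch would match every other asparagine-rich protein in the database. With the filter on, only the remaining residues are used for the search.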
Things we discussed today
Why we want to do database searches –
Transfer of information!
Alignment & scoring methods
Significance of alignments
BLAST
• principle of method
• BLAST output, in particular E-value
• BLAST input parameters, in particular low complexity filter
Let's BLAST!!