Department of Health Information Management

Download Report

Transcript Department of Health Information Management

Genomics and Personalized
Care in Health Systems
Lecture 4. Blast Search
Leming Zhou, PhD, DSc
School of Health and Rehabilitation Sciences
Department of Health Information Management
Department of Health Information Management
Outline
• BLAST algorithm
• BLAST search
• Other programs
Department of Health Information Management
Pairwise Local Alignment
• Pairwise local sequence alignment: identify similar
segments in two sequences
• Smith-Waterman algorithm (a dynamic programming
algorithm) is guaranteed to find optimal alignments, but it
is computationally expensive [O(nm)].
• BLAST and FASTA are heuristic approximations to local
alignment and they run much faster than SmithWaterman algorithm but retain sensitivity of the search
Department of Health Information Management
BLAST
• BLAST [Basic Local Alignment Search Tool] is a sequence
comparison algorithm optimized for speed used to search
sequence databases for optimal local alignments to a query
• It is the most widely used and referenced computational
biology resource
• The central idea of the BLAST algorithm is to confine
attention to segment pairs that contain a word pair of
length W with a score of at least T when compared to the
query using a substitution matrix
• Word hits are then extended in both directions to generate
an alignment with score exceeding a given threshold S
Department of Health Information Management
Applications of BLAST
• BLAST searching is fundamental to understanding the
relatedness of any query sequence to other known proteins
or DNA sequences.
• Applications include
– Identifying homologs
– Discovering new genes
– Discovering variants of genes
– Exploring protein structure and function
– …
page 87
BLAST Algorithm
Department of Health Information Management
BLAST Algorithm
• Filter out low complexity regions
• Locate words with a fix size in the query sequence
• Scan the sequence database for entries that match the
words in the query sequence
• If there is a hit (i.e. a match between a word in the query
and a word in the database entry), extend the hit in both
directions. Keep track of the score and stop the extension
when the score drops below a threshold
Department of Health Information Management
Word Size
• The initial search is done for a word of length W
• Default values:
• Protein sequence search: W = 3
• Nucleotide sequence search: W = 11
• Highly similar nucleotide sequence: W=28
• Each word in the query sequence index is compared to the
database index and residue pairs are scored
Department of Health Information Management
Scoring Matrices
• Nucleotide:
A
G
C
T
A
+1
–2
–2
–2
G
–2
+1
–2
–2
C
–2
–2
+1
–2
T
-2
-2
-2
+1
• Protein:
– PAM matrices
– BLOSUM matrices
• BLOSUM 62 is default for major database searches
Department of Health Information Management
Initial Search
• The initial search is done for a word of length W that
scores at least T when compared to the query using a
scoring matrix
Department of Health Information Management
Hit Extension
• Word hits are then extended in both directions in an
attempt to generate an alignment with a score exceeding a
given threshold to derive the High-scoring Segment Pairs
• This procedure stops when the score becomes lower than
the threshold
…..SLAALLNKCKTPQGQRLVNQWIKQPLMDKNR IEERLNLVEA…
+LA++L+
TP G R++ +W+ P+ D
+ ER
+A
…..TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA….
Department of Health Information Management
Gap Penalty
• Gap penalties can be calculated linearly (constant penalty for each
gap) or using affine gap penalty:
– G = a + bx
– a: gap open penalty
– b: gap extension penalty
– x: number of gaps
• The choice of gap open and gap extension penalty is empirical. Usually
we choose a high value for gap open penalty and a low value for gap
extension penalty.
BLAST Search
Department of Health Information Management
Four Steps of a BLAST search
• Enter query sequence
• Select one BLAST program
• Choose the database to search
• Set optional parameters
Department of Health Information Management
Enter Query Sequence
• A sequence can be pasted into a text field in FASTA format or
as accession number
• A sequence or a sequence list can also be uploaded as a file
• Users may indicate a range of the query sequence instead of
using the whole query sequence
• You may enter a descriptive title for your BLAST search
Department of Health Information Management
Align Two or More Sequences
• You may provide two or more sequence and perform
pairwise BLAST search
Department of Health Information Management
Select a BLAST Program
• BLAST Programs:
– BLASTN: DNA query sequence against a DNA database
– BLASTP: protein query sequence against a protein database
– BLASTX: DNA query sequence, translated into all six reading
frames, against a protein database
– TBLASTN: protein query sequence against a DNA database,
translated into all six reading frames
– TBLASTX: DNA query sequence, translated into all six reading
frames, against a DNA database, translated into all six reading
frames
• Choose the right one according to the sequence you have
and your purpose of the search
Department of Health Information Management
DNA vs. Protein Searches
• Consider the two sequences:
AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• Ungapped DNA alignment:
AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
||||| | || ||
|| | || || || | |
AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• 21 identical resides (out of 36) 58% identity
• Translate each to protein first:
ELVISISALIVE
ELVISISALIVE
• 100% identical at amino acid level
Department of Health Information Management
DNA vs. Protein Searches
• If nucleotide region contains a gene, it is beneficial to
translate the sequence to protein sequence first
• Target and query translated into all six reading frames
– 3 in forward, 3 in reverse
• Number of comparisons needed grows
• More sensitive, but slower
Department of Health Information Management
Choose the Database to Search
• BLASTN
Nucleotide Databases for BLAST
Database
nr
refseq_mrna
refseq_genomic
est
est_others
gss
htgs
pat
pdb
alu_repeats
dbsts
wgs
env_nt
Content Description
All GenBank + EMBL + DDBJ + PDB sequences
mRNA sequences from NCBI Reference Sequence Project.
Genomic sequences from NCBI Reference Sequence Project.
Database of GenBank + EMBL + DDBJ sequences from EST
division.
Subset of est other than human or mouse.
Genome Survey Sequence, includes single-pass genomic data,
exon-trapped sequences, and Alu PCR sequences.
Unfinished High Throughput Genomic Sequences
Nucleotides from the Patent division of GenBank.
Sequences derived from the 3D structure records from PDB. They
are not the coding sequences for the corresponding proteins found
in the same PDB record.
Select Alu repeats from REPBASE, suitable for masking Alu repeats
from query sequences.
Database of Sequence Tag Site entries from the STS division
Assemblies of Whole Genome Shotgun sequences.
Sequences from environmental samples, such as uncultured
bacterial samples isolated from soil or marine samples
Department of Health Information Management
Choose the Database to Search
• BLASTP
Department of Health Information Management
Protein Sequence Databases for BLAST
Database
nr
refseq
swissprot
pat
pdb
env_nr
Content Description
Non-redundant GenBank CDS translations + PDB +
SwissProt + PIR + PRF, excluding those in env_nr.
Protein sequences from RefSeq
Last major release of the SWISS-PROT protein
sequence database
Proteins from the Patent division of GenBank.
Sequences derived from the 3-dimensional structure
records from the Protein Data Bank.
Non-redundant CDS translations from env_nt entries.
Department of Health Information Management
Optional Parameters
• Specify the organism to search or exclude
– Common name, taxonomy id, …
• Exclude certain sequences
– Exclude predicted sequences or sequences from metagenomics
• Use Entrez query to select a subset of the blast database
page 93
Algorithm Parameters
Optional Parameters
Department of Health Information Management
Algorithm Parameters
• Expect value
• Word size
• Filtering/masking
• Substitution matrix
Department of Health Information Management
BLASTN Algorithm Parameters
Department of Health Information Management
BLASTP Algorithm Parameters
Department of Health Information Management
Scoring Parameters
Match/Mismatch scores
Gap Penalty
Expect Value
Department of Health Information Management
Expect Value
Department of Health Information Management
Expect Value
• It is important to assess the statistical significance of
search results.
• For local alignments, the scores follow an extreme value
distribution
• Expected value (E value) is the number of matches
expected to occur randomly with a given score
• The lower the E value, more significant the match.
• E = Kmn e-lS
– K: A variable with a value dependent upon the substitution matrix
used and adjusted for search base size.
– m, n: length of the query and database sequences
– λ: A statistical parameter used as a natural scale for the scoring
system
– S: alignment score
Department of Health Information Management
More about E Value
• The value of E decreases exponentially with increasing
alignment score S (higher S values correspond to better
alignments). Very high scores correspond to very low E
values.
• For E=1, one match with a similar score is expected to
occur by chance.
– For a much larger or smaller database, you would expect E to vary
accordingly
Department of Health Information Management
Why Set Expect Threshold to 1000
• When you perform a search with a short query (e.g. 9
amino acids). There are not enough residues to
accumulate a big score (or a small E value).
• A match of 9 out of 9 residues could yield a small score
with an E value of 100 or 200. And yet, this result could be
real and of interest to you.
• By setting the E value cutoff to 1000 or a bigger value you
do not change the way the search was done, but you do
change which results are reported to you.
– All hits with E value less than 1000 are reported
Department of Health Information Management
E Values
• Orthologs from closely related species will have the
highest scores and lowest E values
– Often E = 10-30 to 10-100
• Closely related homologs with highly conserved function
and structure will have high scores
– Often E = 10-15 to 10-50
• Distantly related homologs may be hard to identify
– Less than E = 10-4
• These values may be served as general guideline but not a
strict range for those situations
Department of Health Information Management
Set the Expect Threshold
• The Expect Threshold can be any positive real number.
• The lower the number the more stringent the matches
displayed.
• The default value of 10 signifies that 10 matches can be
expected by chance in a search of the database using a random
query with similar length.
• No match with an E-value higher than the Expect Threshold
selected will be displayed
• Increase the Expect Threshold to 1000 or more when searching
with a short query
Default Special Cases
Expect
10
Threshold
Short query
Large sequence family Ungapped Blast
1000 or more
10
10
Department of Health Information Management
Raw Scores and Bit Scores
• There are two kinds of scores: raw scores (calculated from
a substitution matrix) and bit scores (normalized scores)
• Bit scores are comparable between different searches
because they are normalized to account for the use of
different scoring matrices and different database sizes
S’ = bit score = (lS - lnK) / ln2
• The E value corresponding to a given bit score is:
E = mn 2 -S’
• Bit scores allow you to compare results between different
database searches, even using different scoring matrices.
Repeats
Department of Health Information Management
Low Complexity Regions
• Low complexity regions: amino acid or DNA sequence
regions that offer very low information due to their highly
biased content
– poly-A tails in DNA sequences
– runs of purines or pyrimidines
– Tandem repeats, such as ACACACACACACAC…
– runs of a single amino acid, etc.
Department of Health Information Management
Short Repeats
• DNA or amino acids less than 10 bases that repeat
themselves
• Short, tandem repeats
• Such regions can be the cause of disease, but are common
in genomes
Department of Health Information Management
Interspersed Repeats
• Larger repeats are found interspersed throughout
genomes
• Humans: > 40% interspersed repeats
• Plants have large numbers of these as well
• Short Interspersed Repeats (SINES, 300 bp)
• Long Interspersed Repeats (LINES, 1k bp)
Department of Health Information Management
RepeatMasker
• RepeatMasker is a program that screens DNA sequences for
interspersed repeats and low complexity DNA sequences using
Smith-Waterman algorithm called cross-match.
• The output of the program is a detailed annotation of the
repeats that are present in the query sequence as well as a
modified version of the query sequence in which all the
annotated repeats have been masked (default: replaced by Ns).
• http://www.repeatmasker.org/
Department of Health Information Management
Soft vs. Hard Masking
• In BLAST, two options to mask repetitive elements and
low complexity regions:
– Hard masking: replace regions with X’s or N’s
– Soft masking: repetitive regions and low complexity regions are
shown in lower case
Department of Health Information Management
Masking in Lowercase or X/N
Filters and Masking
Department of Health Information Management
Filters and Masking
• Filter low-complexity region
– This function mask off segments of the query sequence that have low compositional
complexity.
– Filtering can eliminate statistically significant but biologically uninteresting reports
from the blast output.
– Filtering is only applied to the query sequence (or its translation products), not to
database sequences.
• Mask for lookup table only
– This option masks only for purposes of constructing the lookup table used by
BLAST so that no hits are found based upon low-complexity sequence or repeats.
– The BLAST extensions are performed without masking and so they can be extended
through low-complexity sequence.
• Mask lower case letters
– With this option selected you can cut and paste a FASTA sequence in upper case
characters and denote areas you would like filtered with lower case.
– This allows you to customize what is filtered from the sequence.
Department of Health Information Management
Filters and Masking
• If the number of hits returned is small when searching with a short
query, it may help to re-search with filtering turned off.
Default
Filter
On
Special cases
Short query Large sequence family
Ungapped Blast
off
on
on
BLAST Search Output
Department of Health Information Management
BLASTN Output (header)
Department of Health Information Management
BLASTN Output (Graphic Summary)
matches to itself
probable homologs
distantly related
homologs
distant homolog with
shared domain or motif
Department of Health Information Management
BLASTN Output (Descriptions)
Department of Health Information Management
BLASTN Output (Sequence Alignments)
Formatted BLAST Output
Department of Health Information Management
Alignment Formatting
Department of Health Information Management
Formatting
Department of Health Information Management
Formatting - Pairwise
• Pairwise - Default BLAST alignment in pairs of query
sequence and database match.
BLASTP
Department of Health Information Management
Algorithm Parameters for BLASTP
Department of Health Information Management
Scoring Matrix
Department of Health Information Management
Output (header) - BLASTP
Department of Health Information Management
Output (Graphic Summary)
Department of Health Information Management
Graphic Summary
Department of Health Information Management
Description
Other Programs
Department of Health Information Management
FASTA
• First rapid database search utility
• 50 times faster than Dynamic Programming
• Based on a heuristic – not guaranteed to locate optimal
solution
Department of Health Information Management
FASTA Algorithm
• Hashing:
– Construct a table showing each word of length k for query and target
– Relative positions calculated by subtracting positions
• Identify regions with highest density of hits
– Trim regions to include only residues contributing to high scores
– Associate initial score to each region. Each region is partial alignment
without gaps
• Join initial regions to form alignments with gaps
• Assign score
– Sum of initial scores for initial regions
– Subtract gap penalty
• Construct optimal alignment of the query and database
– Consider only a band 32 residues wide
– Centered on best initial region
Department of Health Information Management
FASTA Programs
• FASTA: protein to protein or DNA to DNA
• Compare a set of ordered peptide fragments against a
protein database
– FASTF or FASTS
• Compares a set of ordered peptide fragments against a
DNA database
– TFASTF or TFASTS
• Compares a query DNA sequence to a protein sequence
database
– FASTX or FASTY
• Compare protein sequence to a DNA sequence or DNA
database
– TFASTA or TFASTX or TFASTY
Department of Health Information Management
BLAST-related Tools
• The analysis of genomic DNA presents special challenges
– There are exons and introns.
– There may be sequencing errors or polymorphisms
– The comparison may between related species (e.g. human and
mouse)
• Recently developed tools include
– MegaBLAST at NCBI.
– BLAT. BLAT parses an entire genomic DNA database into words
(11mers), then searches them against a query
– SSAHA at Ensembl uses a similar strategy as BLAT.
Department of Health Information Management
MegaBLAST
• For megablast, the word size is 28 and can be adjusted to
64.
• Megablast is very fast for finding closely related DNA
sequences
Department of Health Information Management
BLAT
• BLAT on DNA is designed to quickly find sequences of
95% and greater similarity of length 25 bases or more. It
may miss more divergent or shorter sequence alignments.
It will find perfect sequence matches of 25 bases, and
sometimes find them down to 20 bases.
• BLAT on proteins finds sequences of 80% and greater
similarity of length 20 amino acids or more.
• In practice DNA BLAT works well on primates, and
protein BLAT on land vertebrates
• BLAT works by keeping an index of the entire genome in
memory. The index consists of all non-overlapping 11mers except for those heavily involved in repeats.
• http://genome.ucsc.edu
Department of Health Information Management
Human BLAT Results
Department of Health Information Management
Sequence Alignment Result
Department of Health Information Management
Results in Genome Browser
Department of Health Information Management
Homework 3
•
Having the gene sequences you obtained in the homework2, use BLAST to
perform pairwise sequence alignment between human and cow BRCA1 genes,
summarize the result (level of identity, gaps, strands, coverage, etc.)
•
For the human gene sequence, perform a BLAST search against the
Refseq_RNA database, summarize the result and create a similarity score
matrix (including at least five other species, such as Pan troglodytes, Canis
lupus familiaris, Bos taurus, Equus caballus, Felis catus, …) based on the bit
scores
•
Retrieve one of the dbSNP records in your homework 1, obtain the sequence
record (in fasta format), perform a BLAST search to determine the precise
location of the mutation in the human genome