dbSearching - IME-USP

Lecture 5:
Searching Sequence Databases
Eric C. Rouchka, D.Sc.
http://kbrin.a-bldg.louisville.edu/~rouchka/CECS694/
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2003 Dr. Eric Rouchka
Multiple Alignment Formats
• Formats for storing multiple alignments
are specified
• FASTA, GCG MSF, ALN, etc
FASTA Format
• Each sequence begins with a
description line ‘>’
• Sequence data follows, with gap
character ‘-’
Example Fasta sequence
>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCDI
AEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM
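The layout just described (a '>' description line, then sequence lines that may contain '-' gaps) can be read with a few lines of Python. This is only an illustrative sketch: the function name parse_fasta and the toy records are assumptions, and taking the first word of the description line as the sequence name is a simplification.

```python
def parse_fasta(text):
    """Map each sequence name to its sequence, keeping '-' gap characters."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Description line: take the first word after '>' as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence data may wrap over several lines; collect the pieces.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Toy two-sequence alignment (made-up data, not from the lecture).
toy = ">seqA\nACD-EF\nGH\n>seqB\nAC--EF\nGH\n"
alignment = parse_fasta(toy)
```

In an aligned FASTA file every parsed sequence should have the same length once gaps are included, which is a quick sanity check after parsing.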
Stockholm Format
• Features in a multiple alignment are annotated using a ‘magic’ label
– GF: Generic per-File annotation
– GC: Generic per-Column annotation
– GS: Generic per-Sequence annotation
– GR: Generic per-sequence and per-column markup
• Used by PFAM, HMMER, Belvu
Example Stockholm
Sequence
• http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html
GCG Multiple Sequence
Format (MSF)
• The beginning of the file is a header
describing the sequences
• Header may be formatted to contain
specific information
• Header ends with a “//”
!!AA_MULTIPLE_ALIGNMENT 1.0
msf MSF: 131 Type: P 22/01/02 CompCheck: 3003 ..
Name: IXI_234 Len: 131 Check: 6808 Weight: 1.00
Name: IXI_235 Len: 131 Check: 4032 Weight: 1.00
Name: IXI_236 Len: 131 Check: 2744 Weight: 1.00
Name: IXI_237 Len: 131 Check: 9419 Weight: 1.00
//
           1                                               50
IXI_234    TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235    TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236    TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237    TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT

           51                                             100
IXI_234    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235    GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG
IXI_236    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G
IXI_237    GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G

           101                         131
IXI_234    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236    SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237    SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
ClustalW ALN Format
• First non-blank line contains the word
“CLUSTAL”
• Each sequence starts with sequence
name
• Lines containing conservation symbols
(* or :) are ignored
ClustalW ALN Format
CLUSTAL W (1.82) multiple sequence alignment
JC2395          -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59
FASA_MOUSE      -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59
KPEL_DROME      MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59
                     : *       *.   :   :::*   ::  .::::*  :. :. : .: ::*  *

JC2395          YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91
FASA_MOUSE      YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91
KPEL_DROME      GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89
                  .:.::   * *:  ::* :       ::
ClustalW ALN Format
CLUSTAL W(1.4) multiple sequence alignment
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT

IXI_234   GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG
IXI_235   GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG
IXI_236   GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G
IXI_237   GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G

IXI_234   SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_235   SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_236   SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E
IXI_237   SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E
Phylip
• The first line is two numbers
– First indicates number of sequences
– Second indicates length of alignment
Example Phylip File
    3    92
JC2395     -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

           EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM
           EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM
           ASN-EFLNIW GGQYNHT--V QTLFALFKKL KLHNAMRLIK DY
PIR Format
>P1;JC2395
-NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW
YQHGKTGACQALIQGLRKANRCDIAEEIQAM
*
>P1;FASA_MOUSE
NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW
YQHGKSDAYQDLIKGLKKAECRRTLDKFQDM
*
>P1;KPEL_DROME
MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQGRSASN-EFLNIW
GGQYNH--VQTLFALFKKLKLHNAMRLIKDY
*
GDE
%JC2395
nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy
qshgktgacqaliqglrkanrcdiaeeiqam
%FASA_MOUSE
nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy
qshgksdayqdlikglkkaecrrtldkfqdm
%KPEL_DROME
--mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni
wggqynhtvqtlfalfkklklhnamrlikdy
NEXUS
#NEXUS
BEGIN DATA;
dimensions ntax=3 nchar=91;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=PROTEIN gap= -;
matrix
JC2395      NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE
FASA_MOUSE  NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE
KPEL_DROME  --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG

JC2395      QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM
FASA_MOUSE  QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
KPEL_DROME  RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY
;
end;
General Feature Format
(GFF)
• Developed for easy parsing of features
• Used to:
– Read annotations into ACE format
– Print out images of sequences and
annotations
– http://www.sanger.ac.uk/Software/formats/GFF/
Sequence Formats
• Numerous other formats
• Descriptions:
– BLOCKS Server
• http://www.blocks.fhcrc.org/blocks/help/blocks_format.html
– EMBL accepted formats
• http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html
Sequence Conversion
Programs
• SEQIO
– http://bioweb.pasteur.fr/docs/seqio/seqio.html
• READSEQ
– http://bimas.dcrt.nih.gov/molbio/readseq/
Searching Sequence
Databases
• Compare a query sequence against a
target database
• Return significant results
– Possible homologous sequences
– Yields insight into structure and function
DNA vs. Protein Searches
• Easier to determine similarity in protein
sequences
– DNA has only 4 bases, so more matches occur at random
• Consider an exact match of length 4
– DNA: 1/4^4 = 1/256 chance at random
– AA: 1/20^4 = 1/160,000 chance at random
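The arithmetic behind these two figures can be checked directly. The sketch below is illustrative only; the function name is an assumption, and it assumes every residue is equally likely, which real sequences do not satisfy.

```python
def chance_of_exact_match(alphabet_size, length):
    """Probability that a random word equals one given word of this length,
    assuming all residues are equally frequent (a simplification)."""
    return 1.0 / alphabet_size ** length


dna_chance = chance_of_exact_match(4, 4)       # 4 bases, word of length 4
protein_chance = chance_of_exact_match(20, 4)  # 20 amino acids, word of length 4
ratio = dna_chance / protein_chance            # how much likelier the DNA match is
```

Under this uniform model a chance 4-letter match is 625 times more likely in DNA than in protein, which is why protein-level similarity is easier to separate from noise.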
DNA vs. Protein Searches
• Redundancy in Genetic code
– Multiple codons code for same amino acid
• A.A. sequence could be identical
• DNA sequence could be different
DNA vs. Protein Searches
• Consider the two sequences:
ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• Ungapped DNA alignment:
ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
|||||  | || ||    ||  | || || || | |
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• 21 identical residues (out of 36): 58% identity
DNA vs. Protein Searches
• Translate each to protein first:
MELVISALIVE
MELVISALIVE
• 100% identical at amino acid level
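The comparison on these slides can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the helper names are invented here, the standard genetic code is assumed, the two sequences are written with T throughout (any U is converted to T), and stop codons are shown as '*'.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a nucleotide string; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_identities(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


seq1 = "ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA"
seq2 = "ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA"

dna_matches = count_identities(seq1, seq2)               # nucleotide level
protein1, protein2 = translate(seq1), translate(seq2)
protein_matches = count_identities(protein1, protein2)   # amino acid level
```

The nucleotide comparison finds 21 matching positions out of 36, while the two translations agree at every position, because the differences fall on synonymous codons.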
DNA vs. Protein Searches
• If nucleotide region contains a gene,
beneficial to translate first
• Target and query translated into all six
reading frames
– 3 in forward, 3 in reverse
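A six-frame translation can be sketched as follows. This is a minimal illustration, not the code used by any search program: the function names and the frame labels (+1 to +3, -1 to -3) are assumptions, the standard genetic code is assumed, and codons containing unexpected characters are written as 'X'.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA")
```

Each of the six translations is then searched as an ordinary protein sequence, which is where the extra sensitivity and the extra running time both come from.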
DNA vs. Protein Searches
• Number of comparisons needed grows
– Untranslated DNA: 4 comparisons (2 strands of the query × 2 strands of the target)
– Translated: 36 comparisons (6 reading frames of the query × 6 of the target)
• More sensitive, but slower
Scoring Matrices
• Defaults for major database searches
– PAM250 (original)
– BLOSUM62
FASTA
• First rapid database search utility
• 50 times faster than Dynamic
Programming
• Based on a heuristic – not guaranteed
to locate optimal solution
FASTA Algorithm
• Hashing approach:
– Construct a table showing each word of length k
(k-tuple) for query and target
• 1 or 2 for proteins
• 4 or 6 for DNA
– Relative positions calculated by subtracting
positions
– Matches in same phase are strung together
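The hashing step can be sketched in Python. This is an illustration of the idea only, not FASTA's implementation: the function names and the toy sequences are assumptions. Words that match at query position i and target position j lie on the same diagonal when i - j is equal, which is the "subtracting positions" step.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Hash table: each k-tuple maps to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count shared k-tuples per diagonal (query position minus target position)."""
    index = ktuple_index(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits


# Toy protein fragments (made up for illustration), k = 2 as for proteins.
hits = diagonal_hits("HARFYAAQIVL", "VDMAAQIA", 2)
best_diagonal, best_count = hits.most_common(1)[0]
```

Diagonals that collect many hits are the candidates for the high-density regions used in the next step.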
FASTA Algorithm
• Identify 10 regions with highest density
of hits
– Trim regions to include only residues
contributing to high scores
– Associate init1 score to each region
– Each region is a partial alignment without gaps
FASTA Algorithm
• Join initial regions to form approximate
alignments with gaps
• Assign score
– Sum of init1 scores for initial regions
– Subtract gap penalty
FASTA Algorithm
• Construct Needleman-Wunsch optimal
alignment of the query and database
– Consider only a band 32 residues wide
– Centered on best initial region
FASTA Algorithm
Step 1: Locate k-tuples
Step 2: Locate the 10 highest-density regions (init1)
Step 3: Join initial regions with gaps (initn)
Step 4: Align query and database sequence around the best region (opt)
FASTA Scores
• FASTA calculates the u and λ parameters for the extreme value distribution
• Results reported as normalized z-scores and E-values
Steps to calculate z-score
• Average score for database sequences in the same length range is determined
• Average score is plotted against the log of the average sequence length
Steps to calculate z-score
• Points fitted to a straight line
• A z score (number of standard deviations
from fitted line) calculated for each score
• High scoring and low scoring alignments
removed
• Steps repeated one or more times to refine
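The fitting step can be illustrated with a simplified Python sketch. It is not FASTA's code: the function names and numbers are assumptions, individual sequences are used instead of length-range averages, and the removal of high and low scores followed by refitting is left out.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def z_scores(lengths, scores):
    """Standard deviations of each score from the line fitted to log(length)."""
    xs = [math.log(n) for n in lengths]
    slope, intercept = fit_line(xs, scores)
    residuals = [s - (slope * x + intercept) for x, s in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sd for r in residuals]


# Made-up library sequence lengths and similarity scores.
z = z_scores([100, 200, 400, 800], [10.0, 20.0, 30.0, 45.0])
```

A fuller version would drop the sequences with extreme z values and fit again, as the steps above describe.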
Steps to calculate z-score
• Z score normalized
– Z’ = 50 + 10 * z
– An alignment score 5 standard deviations above the fitted line (z = 5) has a normalized z score of 100
• Significance can be refined by shuffling
sequences in database
Probability of z-score
• Pearson, 2000 (ISMB):
$P(Z > z) = 1 - \exp\left(-e^{-1.2825z - 0.5772}\right)$
Expected Value
• In a database of D sequences:
$E(Z > z) = D \times P(Z > z)$
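The normalization, the probability and the expected value can be evaluated with a few lines of Python. This sketch is illustrative; the function names are assumptions, and the probability uses the extreme value form with the constants 1.2825 and 0.5772 quoted for Pearson (2000).

```python
import math


def normalized_z(z):
    """Rescale a z score so the mean maps to 50 and each std dev adds 10."""
    return 50.0 + 10.0 * z


def p_value(z):
    """P(Z > z) = 1 - exp(-exp(-1.2825 z - 0.5772)).

    expm1 keeps precision when the probability is very small."""
    return -math.expm1(-math.exp(-1.2825 * z - 0.5772))


def expected_hits(z, database_size):
    """E(Z > z): number of sequences expected to score this high by chance."""
    return database_size * p_value(z)


z_prime = normalized_z(5.0)
p_at_mean = p_value(0.0)
e_at_mean = expected_hits(0.0, 100)
```

Because E scales with the number of sequences D, the same z score is less significant in a larger database.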
Example Fasta Output
• FASTA reports a histogram
– Indicates distribution of normalized scores
– Expected to follow the extreme value distribution
– Outliers are significant matches
Histogram Output
• normalized z’ score
• Number of optimized scores found
• Number of expected scores
• ‘=’: approximate curve for observed
• ‘*’: approximate curve for expected
• Z’ score > 120 considered high-scoring
       opt   E()
 < 20   188     0:==
   22     0     0:          one = represents 109 library sequences
   24     0     0:
   26     2     1:*
   28     7    15:*
   30    28    91:*
   32   200   353:== *
   34   841   958:========*
   36  2217  1968:==================*==
   38  3746  3253:=============================*=====
   40  5360  4538:=========================================*========
   42  6055  5547:==================================================*=====
   44  6496  6119:========================================================*===
   46  5820  6232:======================================================   *
   48  5469  5966:===================================================   *
   50  4820  5444:=============================================    *
   52  4202  4787:=======================================    *
   54  3815  4089:===================================  *
   56  3271  3415:===============================*
   58  2755  2804:=========================*
   60  2268  2271:====================*
   62  1813  1821:================*
   64  1500  1448:=============*
   66  1233  1145:==========*=
   68   951   900:========*
   70   746   706:======*
   72   699   551:=====*=
   74   460   430:===*=
   76   337   335:===*
   78   287   260:==*
   80   244   202:=*=
   82   185   154:=*
   84   115   122:=*
   86   114    95:*=
   88    75    73:*         inset = represents 1 library sequences
   90    70    57:*
   92    48    44:*         :=======================================*
   94    26    34:*         :==========================       *
   96    33    26:*         :=========================*=======
   98    14    20:*         :==============     *
  100    10    16:*         :==========     *
  102     7    12:*         :=======    *
  104     6     9:*         :======  *
  106     5     7:*         :===== *
  108     2     6:*         :==   *
  110     2     4:*         :== *
  112     1     3:*         := *
  114     0     3:*         :  *
  116     0     2:*         : *
  118     0     2:*         : *
 >120    27     1:*         :*==========================
Fasta Best scoring hits
• At most, one hit per sequence
• Description, z’ score, init1 score, initn
score, opt score, E value
Example FASTA output
The best scores are:                              initn init1   opt   z-sc E(66345)
MERR_PSEAE mercuric resistance operon regu ( 144)   928   928   928 1129.8        0
MERR_SHIFL mercuric resistance operon regu ( 144)   871   871   871 1061.3        0
MERR_SERMA mercuric resistance operon regu ( 144)   810   810   810  988.1        0
MERR_STAAU mercuric resistance operon regu ( 135)   292   172   298  373.6  3.5e-14
MERR_BACSR (strain rc607). mercuric resist ( 132)   241   198   289  363.0  1.4e-13
YHDM_ECOLI hypothetical transcriptional re ( 141)   175   175   276  347.0  1.1e-12
FASTA Alignment Output
• Smith-Waterman type alignment
– ‘:’ denotes conserved residue
– ‘.’ denotes conservative substitution (i.e., a substitution with a positive score)
FASTA Alignment Output
>>MERR_STAAU mercuric resistance operon regulatory protei (135 aa)
initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14
Smith-Waterman score: 298; 36.923% identity in 130 aa overlap
               10        20        30        40        50        60
MerR   MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA
              . :. .:::  :: ::.:.:.::::.  : .  .. : :.:  . ::::.:
MERR_S      MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM
                    10        20        30        40        50

               70          80        90       100       110
MerR   QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH
       ..: ::: ::  :. .  .:: .:..  ... .: :....:.  : :.. .: ::   :
MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP
          60        70        80        90       100       110

      120       130       140
MerR   ARRGNVSCPLIASLQGGASLAGSAMP
        ...  .::.: .:.::
MERR_S DEKAMYTCPIIETLMGGPDK
         120       130
FASTA Programs
• FASTA – protein to protein OR DNA to
DNA
• TFASTA – query protein to DNA database
– the DNA database is first translated in all
six reading frames
FASTA Programs
• FASTF – compares a set of ordered peptide
fragments, obtained from analysis of a protein
by cleavage and sequencing of protein bands
resolved by electrophoresis, against a protein
database
• TFASTF – compares a set of ordered peptide
fragments, against a DNA database
FASTA Programs
• FASTS – compares a set of ordered peptide fragments, obtained from mass spectrometry analysis of a protein, against a protein database
• TFASTS – compares a set of ordered peptide fragments against a DNA database
>mgstm1
MGCEN,MIDYP,MLLAY,MLLGY
FASTA Programs
• FASTX, FASTY – compares a query
DNA sequence to a protein sequence
database
– DNA sequence translated in all six reading
frames
– frameshifts allowed
FASTA Programs
• TFASTX, TFASTY – compares a protein sequence to a DNA sequence or DNA database
– DNA sequence translated in all six reading
frames
• Translated from beginning to end
• Termination codons translated into unknown
amino acids
FASTA Programs
• LALIGN – FASTA, reporting multiple
aligning regions
• PLALIGN – dot plot algorithm available
through the fasta suite
• FAST-pat, FAST-swap: compares a
sequence to a pattern database
BLAST
• Basic Local Alignment Search Tool
• Most widely used and referenced
computational biology/bioinformatics
resource
BLAST
• Improves search speed of FASTA
• Retains sensitivity of searches
BLAST Algorithm
• Filter out low complexity regions
• Locate k-tuples (words) in the query
sequence
– Word length 3 for amino acids
– Word length 11 for nucleotides
BLAST Options
• http://blast.wustl.edu/blast/README.html
BLAST Programs
• BLASTP: protein query sequence
against a protein database, allowing for
gaps
• BLASTN: DNA query sequence against
a DNA database, allowing for gaps
BLAST Programs
• BLASTX: DNA query sequence,
translated into all six reading frames,
against a protein database, allowing for
gaps
• TBLASTN: protein query sequence
against a DNA database, translated into
all six reading frames, allowing for gaps
BLAST Programs
• TBLASTX: DNA query sequence,
translated into all six reading frames,
against a DNA database, translated into
all six reading frames (No gaps allowed)
PSI-BLAST
• (Position-Specific Iterated BLAST)
• take in an initial query sequence and find
similar sequences to the query
• multiply align to create a scoring matrix
• search the database for more matches
PSI-BLAST
• more sequences are found that can then be
added onto the multiple alignment
• caution should be used with PSI-BLAST:
– a greedy algorithm is used
– most recently added sequences will influence the
next round of sequences
PHI-BLAST
• (Pattern-Hit Initiated BLAST)
• functions in the same manner as PSI-BLAST, except that the query sequence is first searched for a regular expression
• search for similar sequences is focused
on regions containing the pattern
PHI-BLAST
• One example of a regular expression:
• [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
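A pattern in this PROSITE-style notation can be converted to an ordinary regular expression. The Python sketch below is illustrative only: the function name and the probe sequence are assumptions, the pattern is written with a hyphen between every element, and only the subset of the syntax used here (single residues, x, [..] sets, {..} exclusions, and (n) or (n,m) repeats) is handled.

```python
import re

# One pattern element: x, a residue, [set] or {excluded set}, optional repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to Python regex text."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # anything except these
        if low is not None:
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)


pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
regex = prosite_to_regex(pattern)

# Made-up sequence built to satisfy the pattern, used only as a probe.
probe = "LGEAGL" + "A" * 5 + "RSAALAS"
found = re.search(regex, probe) is not None
```

The position of such a match in the query is what focuses the subsequent search on regions containing the pattern.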
Bayes Block Aligner
• find all possible blocks located within
two sequences
• consider possible alignments by
aligning combinations of blocks with
gaps between the blocks
Bayes Block Aligner
• Bayesian statistics provide posterior
probabilities of alignment
– various scoring models
– different number of blocks
• locates weak, yet real, similarities
between sequences
SSAHA
• Sequence Search and Alignment by
Hashing Algorithm
• aligns DNA sequences by converting
the sequence information into a ‘hash
table’ data structure
• word length 10 bases by default
SSAHA
• locating identical or near identical
matches
– SNP detection
– rapid sequence assembly
– detecting order and orientation of contigs
SSEARCH
• SSEARCH implements the Smith-Waterman
approach to sequence alignment
• SSEARCH is part of the FASTA suite
• compares a protein to another protein or protein database (or DNA to a DNA sequence or database) using enhanced Smith-Waterman local sequence alignments
BLAT
• (BLAST-Like Alignment Tool)
– Jim Kent at UCSC
• locate smaller regions of higher identity within
genomic assemblies
– nucleic acids: regions at least 95% similar
consisting of 40 bases or more
– amino acids: sequences at least 80% similar
consisting of at least 20 amino acids
BLAT
• Keeps index of entire genome in
memory
– Non-overlapping k-mers
– 1 GB for DNA (11 base k-mers)
– 2 GB for amino acids (4-mers)
– K-mers in repetitive regions not used
BLAT
• fast tool for localizing highly similar regions
• distant homologies are not detected
• typical use: localize a specific sequence on a
genome
– BLAT web interface directly ties to the UCSC
GoldenPath genomic browser
BLAT
• WEB SERVER: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human
Searching with PSSMs
• Similar to testing if a sequence belongs to a family
that the PSSM defines
• Each position in each database sequence is
evaluated by sliding the PSSM along one sequence
at a time
• Positions with high scores are the best matches, and
can be quickly identified
• EXAMPLES: BLOCKS Server; MAST Server (p323
for more)
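Sliding a PSSM along a sequence can be sketched in Python. This is an illustration with made-up numbers, not the scoring used by the BLOCKS or MAST servers: the function name, the three-column toy matrix and the rule that unlisted residues score 0 are all assumptions.

```python
def scan_pssm(pssm, sequence):
    """Score every window of the sequence against the PSSM.

    pssm is a list of columns; each column maps a residue to its score.
    Residues missing from a column score 0 (an assumption of this sketch)."""
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(col.get(res, 0) for col, res in zip(pssm, window)))
    return scores


# Toy 3-column matrix over a DNA-like alphabet (values invented).
toy_pssm = [{"A": 2, "C": -1}, {"C": 3, "G": -2}, {"G": 1, "T": 1}]
scores = scan_pssm(toy_pssm, "TTACGA")
best_start = max(range(len(scores)), key=scores.__getitem__)
```

Repeating the scan for each database sequence and keeping the highest-scoring positions gives the list of best matches.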
Searching with Regular
Expressions
• Certain databases (such as ProSite)
allow the databases to be searched
using a regular expression.
Low Complexity Regions
• amino acid or DNA sequence regions that
offer very low information due to their highly
biased content
– histidine-rich domains in amino acids
– poly-A tails in DNA sequences
– poly-G tails in nucleotides
– runs of purines
– runs of pyrimidines
– runs of a single amino acid, etc.
Calculating Complexity
• Complexity of a window of size L is:
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{\text{all } i} n_i!} \right)$
• N is the size of the alphabet; $n_i$ is the count of residue i in the window
Calculating Complexity
• Consider sequence AAAA
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{\text{all } i} n_i!} \right)$
• L is 4; L! = 4*3*2*1 = 24
• nA = 4; nC = nG = nT = 0
• product of the factorials is 4!*0!*0!*0! = 24
• K = ¼ log4(24/24) = 0
Calculating Complexity
• Consider sequence ACTG
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{\text{all } i} n_i!} \right)$
• L is 4; L! = 4*3*2*1 = 24
• nA = nC = nG = nT = 1
• product of the factorials is 1!*1!*1!*1! = 1
• so K = ¼ log4(24/1) = 0.573
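The two worked examples can be checked with a short Python function. This is a sketch of the formula as given on the slides; the function name and argument names are assumptions.

```python
import math
from collections import Counter


def complexity(window, alphabet_size):
    """K = (1/L) * log_N( L! / product of n_i! ) for one window."""
    length = len(window)
    arrangements = math.factorial(length)
    for count in Counter(window).values():
        # Dividing one factorial at a time stays exact in integer arithmetic.
        arrangements //= math.factorial(count)
    return math.log(arrangements, alphabet_size) / length


k_low = complexity("AAAA", 4)   # a single repeated base
k_high = complexity("ACTG", 4)  # every base used once
```

A masking program would compute K in a sliding window and flag windows whose value falls below a chosen cutoff.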
• IMAGE SOURCE: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html
Short, Periodic Repeats
• DNA or amino acids less than 10 bases
that repeat themselves
• Short, tandem repeats
• Such regions can be the cause of
disease, but are common in genomes
Programs to Detect
• SEG, PSEG (NCBI): mask low complexity regions in amino acid sequences
• DUST: masks low complexity DNA
• XNU: locates internal repeats with short
periodicity
Interspersed Repeats
• Larger repeats are found interspersed
throughout genomes
• Humans: > 40% interspersed repeats
• SINEs: Short Interspersed Nuclear Elements (300 bases)
• LINEs: Long Interspersed Nuclear Elements (1 kb)
Interspersed Repeats
• Transposable elements found in many
genomes
• Plants have large numbers of these, leading to large genome size
Masking Out Interspersed
Repetitive Elements
• Repetitive Elements: stored as a
database of sequences, RepBase
• RepeatMasker: locates repetitive elements using cross_match, an implementation of the Smith-Waterman algorithm
Masking Out Interspersed
Repetitive Elements
• MaskerAid: Built on top of
RepeatMasker, uses BLAST as the
underlying database search
Soft vs. Hard Masking
• Two options to mask repetitive elements
and low complexity regions:
• Hard masking: replace regions with X’s
or N’s
• Soft masking: repetitive regions and low complexity regions are converted to lower case
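The difference between the two options can be shown with a small Python sketch. It is illustrative only: the function name, the half-open (start, end) region format and the example coordinates are assumptions, and finding the regions is left to a separate program.

```python
def mask(sequence, regions, soft=False, mask_char="N"):
    """Mask the given 0-based, half-open (start, end) regions.

    Hard masking replaces residues with mask_char ('N' for DNA, 'X' for
    protein); soft masking keeps the residues but lower-cases them."""
    chars = list(sequence.upper())
    for start, end in regions:
        for i in range(start, end):
            chars[i] = chars[i].lower() if soft else mask_char
    return "".join(chars)


dna = "ACGTAAAAAAGT"
repeat_regions = [(4, 10)]  # hypothetical coordinates of a poly-A run
hard = mask(dna, repeat_regions)
soft = mask(dna, repeat_regions, soft=True)
```

Soft masking keeps the original residues available, so a search program can ignore them when seeding but still extend alignments through them.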
Removing Vector Sequence
HOMEWORK
• Project #1 Due TODAY
• Homework #2 Due 2/20/2003
• Journal Article:
– BLAST