PROTEIN SEQUENCE ANALYSIS: OBJECTS

DATABASE SEARCH & SEQUENCE COMPARISON
• “ NOTHING IN BIOLOGY MAKES SENSE
EXCEPT IN THE LIGHT OF EVOLUTION ”
Theodosius Dobzhansky, 1973
• NOTHING IN COMPUTATIONAL BIOLOGY MAKES SENSE
EXCEPT IN LIGHT OF SEQUENCE COMPARISON
- a more or less literal description of daily practice
© SIMR Bioinformatics
SEQUENCE COMPARISON
• “ IN BIOMOLECULAR SEQUENCES
HIGH SEQUENCE SIMILARITY USUALLY IMPLIES
SIGNIFICANT FUNCTIONAL OR STRUCTURAL SIMILARITY ”
- “ The 1ST fact of sequence analysis ”, D. Gusfield, 1997.
• IN BIOMOLECULAR SEQUENCES,
HIGH SEQUENCE SIMILARITY USUALLY IMPLIES
EVOLUTIONARY RELATIONSHIP.
• INFERENCE OF EVOLUTIONARY RELATIONSHIP
USUALLY IS REQUIRED
FOR INFERENCE OF COMMON STRUCTURE / FUNCTION
Pauling and Zuckerkandl, J. Theor. Biol., 1965
• PROTEINS AND NUCLEIC ACIDS CONTAIN
INFORMATION ABOUT EVOLUTION:
what the ancestral molecule was,
when it existed,
how it changed
• MANY OBSERVED CHANGES ARE NOT STRONGLY
SELECTED
“cryptic polymorphism” ( 1970+, Lewontin & Harris: 30% genes
in population are polymorphic; 1970s, Kimura: neutral theory )
“dormancy” of the whole genes and copies
SEQUENCE COMPARISON
• SEQUENCES OF BIOPOLYMERS CONTAIN INFORMATION
ABOUT THEIR STRUCTURE, FUNCTION,
AND EVOLUTIONARY HISTORY
• COMPARISON OF RELATED SEQUENCES
IS THE MAJOR WAY OF EXTRACTING THIS INFORMATION
• RELATED BUT DIFFERENT TASKS
• FIND SEQUENCES THAT NEED TO BE ALIGNED
• ALIGN THEM
• EVALUATE STAT. SIGNIFICANCE OF THE ALIGNMENTS
• also issues of algorithm efficiency
SOURCES OF MUTATION AND POLYMORPHISM
• POINT SUBSTITUTIONS AND SMALL INDELS
• ERRORS OF DNA REPLICATION
• ERRORS OF DNA REPAIR
• DNA REARRANGEMENTS AT LONGER RANGE
• ALSO ERRORS OF REPAIR
• ERRORS OF RECOMBINATION
• LEGITIMATE RECOMBINATION PROCESSES
• GENE DUPLICATIONS
• PIECES OF GENES / PROTEINS MAY BE SHUFFLED
HOMOLOGS AND THEIR SUBSETS
[Figure: diagram with groups labelled PARALOGS, ORTHOLOGS, and ORTHOLOGS AND PARALOGS]
HOMOLOGY ≡ COMMON ANCESTRY
• IT IS EITHER THERE OR IT IS NOT ( NO DEGREES )
• OBJECTION 1: WHAT IF ONLY HALF OF THE MOLECULE IS
HOMOLOGOUS ? - JUST SAY SO
• OBJECTION 2: WE MAY MEAN THE DEGREE OF CERTAINTY
THAT THEY ARE HOMOLOGOUS - 1. JUST SAY SO
2. SOME STATISTICIANS DO NOT LIKE IT EITHER
3. 60 % IDENTITY MAY CONFER 100 % BELIEF THAT
HOMOLOGY EXISTS
• ORTHOLOGY / PARALOGY IS ESTABLISHED AFTER HOMOLOGY
• “FUNCTIONAL HOMOLOGY” USUALLY DOES NOT MAKE SENSE
( CALL IT THE SAME FUNCTION )
Gibbs and McIntyre, Eur.J.Biochem, 1970
DIAGRAM OF IMMUNOGLOBULIN REPEATS
IS THERE CONNEXIN IN PLANTS ( 1992-1993) ?
Sequence                                                            Score (bits)   E value
gi|117687|sp|P27450|CX32_ARATH GAP JUNCTION CX32 PROTEIN (CON...        570        e-162
gi|15237062|ref|NP_195285.1| (NM_119725) protein kinase - lik...        187        1e-46
gi|18398350|ref|NP_565408.1| (NM_127276) putative protein kin...        107        1e-22
gi|15242204|ref|NP_197012.1| (NM_121512) serine/threonine spe...        102        3e-21
gi|15233058|ref|NP_189510.1| (NM_113789) protein kinase, puta...        102        4e-21
gi|15222437|ref|NP_172237.1| (NM_100631) protein kinase APK1A...         99        5e-20
gi|15239047|ref|NP_196702.1| (NM_121179) putative protein [Ar...         95        7e-19
gi|15241749|ref|NP_195849.1| (NM_120307) serine/threonine-spe...         91        9e-18
consensus      1 YELGEKLGSGAFGKVYKGKHKD-------TGEIVAIKILK----KRSLSEkk--krFLREIQILRRLS-HPNIVRLLGVFE--EDDHLYLVMEYMEGGDL  84
query          1 ------------------------------------------------------------------mlwHRNLVKLLGYCR--EDKALLLVYEFIPKEVL  32
1DAW_A        33 YEVVRKVGRGKYSEVFEGINVN-------NNEKCIIKILK----PVKKKK------IKREIKILQNLCgGPNIVKLLDIVRdqHSKTPSLIFEYVNNTDF  115
1FGI_A        23 LVLGKPLGEGAFGQVVLAEAIGldkdkpnRVTKVAVKMLKsdatEKDLSD------LISEMEMMKMIGkHKNIINLLGACTq--DGPLYVIVEYASKGNL  114
gi 6226547   311 IIMHNKLGGGQYGDVYEGYWKR-------HDCTIAVKALK----EDAMPLh----eFLAEAAIMKDLH-HKNLVRLLGVCT--HEAPFYIITEFMCNGNL  392
gi 125484   1078 VHFNEVIGRGHFGCVYHGTLLDnd----gKKIHCAVKSLN----RITDIGev--sqFLTEGIIMKDFS-HPNVLSLLGICLr-SEGSPLVVLPYMKHGDL  1165
gi 1730077  1289 LEFGQTIGKGFFGEVKRGYWR---------ETDVAIKIIY----RDQFKTksslvmFQNEVGILSKLR-HPNVVQFLGACTagGEDHHCIVTEWMGGGSL  1374
gi 125874    108 IQFIQKVGEGAFSEVWEGWWK---------GIHVAIKKLKiigdEEQFKEr-----FIREVQNLKKGN-HQNIVMFIGACY----KPACIITEYMAGGSL  188
gi 462606      3 LTLEEIIGIGGFGKVYRAFWI---------GDEVAVKAARhd-pDEDISQti--enVRQEAKLFAMLK-HPNIIALRGVCL--KEPNLCLVMEFARGGPL  87
gi 1346396   534 RKFKVELGRGESGTVYKGVLE--------DDRHVAVKKLEn---VRQGKEv-----FQAELSVIGRIN-HMNLVRIWGFCS--EGSHRLLVSEYVENGSL  614

consensus     85 FDYLRRNGLL---------------LSEKEAKKIALQILRG--LE-YLHSRG---IVHRDLKPENILLDEN-------------GTVKIADFG--LARK  147
query         33 RVMFLRRNDP---------------FPWDLRIKIVICAARGpcVStQLTKRE---CIYRDLQVFHILLDLS--------------------YGavLSRVs  94
1DAW_A       116 KVLYPTLT-------------------DYDIRYYIYELLKA--LD-YCHSQG---IMHRDVKPHNVMIDHEl------------RKLRLIDWG--LAEF  175
1FGI_A       115 REYLQARRppgleysynpshnpeeqlsSKDLVSCAYQVARG--ME-YLASKK---CIHRDLAARNVLVTED-------------NVMKIADFG--LARD  192
gi 6226547   393 LEYLRRTDksl--------------lpPIILVQMASQIASG--MS-YLEARH---FIHRDLAARNCLVSEH-------------NIVKIADFG--LARF  456
gi 125484   1166 RNFIRNEThn---------------ptVKDLIGFGLQVAKG--MK-YLASKK---FVHRDLAARNCMLDEK-------------FTVKVADFG--LARD  1228
gi 1730077  1375 RQFLTDHFnll-------------eqnPHIRLKLALDIAKG--MN-YLHGWTp-pILHRDLSSRNILLDHNidpknpvvssrqdIKCKISDFG--LSRL  1454
gi 125874    189 YNILHNPNsstpk----------vkysFPLVLKMATDMALG--LL-HLHSIT---IVHRDLTSQNILLDEL-------------GNIKISDFG--LSAE  256
gi 462606     88 NRVLSGKRi-----------------pPDILVNWAVQIARG--MN-YLHDEAivpIIHRDLKSSNILILQKveng-----dlsnKILKITDFG--LARE  159
gi 1346396   615 ANILFSEGgni-------------lldWEGRFNIALGVAKG--LA-YLHHEClewVIHCDVKPENILLDQA-------------FEPKITDFG--LVKL-  682

consensus    148 ---LESS--SYEKLTTFVGT----PEYM-APEVLE---G-RGYSSKVDVWSLGVILYELLTG----------------------KLPFPG------IDPL  205
query         95 gpwLVAM--EQQNREVHRGTakvhRRHI-KVMLLLeyiA-GHLYVKSVAFAFGVVLLEIMTGltahntkrprgqaenhlmrtyvmddkhtqtatpyythk  190
1DAW_A       176 ---YHP----GKEYNVRVAS----RYFK-GPELLV---DlQDYDYSLDMWSLGCMFAGMIFRkepffyghdnhdqlvkiakvlgTDGLNVylnkyrIELD  260
1FGI_A       193 ---IHHi--dYYKKTTNGRL----PVKWmAPEALF---D-RIYTHQSDVWSFGVLLWEIFTLg---------------------GSPYPG-------VPV  251
gi 6226547   457 ---MKEd--tYTAHAGAKFP----IKWT-APEGLA---F-NTFSSKSDVWAFGVLLWEIATYg---------------------MAPYPG-------VEL  514
gi 125484   1229 ---MYDkeyySVHNKTGAKL----PVKWmALESLQ---T-QKFTTKSDVWSFGVVLWELMTRg---------------------APPYPD------VNTF  1290
gi 1730077  1455 ---KKEq---ASQMTQSVGC----IPYM-APEVFK---G-DSNSEKSDVYSYGMVLFELLTS----------------------DEPQQD------MKPM  1511
gi 125874    257 ---KSReg-sMTMTNGGICN----PRWR-PPELTK---NlGHYSEKVDVYCFSLVVWEILTG----------------------EIPFSD------LDGS  316
gi 462606    160 ---WH-----RTTKMSAAGT----YAWM-APEVIR---A-SMFSKGSDVWSYGVLLWELLTG----------------------EVPFRG------IDGL  214
gi 1346396   683 ---LNRgg-sTQNVSHVRGT----LGYI-APEWVS---S-LPITAKVDVYSYGVVLLELLTGtrvse-------------lvggTDEVHSmlrklvRMLS  756

consensus    206 EELFRIKERP-------RLRLPLPPNCSEELKDLIKKCLNKDPEKRPTAKEILNHPWF  256
query        191 rteieeqnneikginkvnhnqrvagtrlqfalrhytlllviepdpknqtthegsrsks  248
1DAW_A       261 PQLEALVGRHsrkpwlkFMNADNQHLVSPEAIDFLDKLLRYDHQERLTALEAMTHPYF  318
1FGI_A       252 EELFKLLKEG--------HRMDKPSNCTNELYMMMRDCWHAVPSQRPTFKQLVEDLdr  301
gi 6226547   515 SNVYGLLENG--------FRMDGPQGCPPSVYRLMLQCWNWSPSDRPRFRDIHFNLen  564
gi 125484   1291 DITVYLLQG---------RRLLQPEYCPDPLYEVMLKCWHPKAEMRPSFSELVSRIsa  1339
gi 1730077  1512 KMAHLAAYES--------YRPPIPLTTSSKWKEILTQCWDSNPDSRPTFKQIIVHLke  1561
gi 125874    317 QRSAQVAYAG--------LRPPIPEYCDPELKLLLTQCWEADPNDRPPFTYIVNKLke  366
gi 462606    215 RVAYGVAMN--------KLALPIPSTCPEPFAKLMEDCWNPDPHSRPSFTNILDQLtt  264
gi 1346396   757 AKLEGEEQSWidgyldsKLNRPVNYVQARTLIKLAVSCLEEDRSKRPTMEHAVQTLls  814
CONNEXIN IN PLANTS : CONCLUSIONS
• ARBITRARY ALIGNMENTS ( esp. ARBITRARY GAPS )
FAIL TO RETRIEVE RIGHT SIGNALS
• DATABASE SEARCH IS MUCH LESS ARBITRARY
• COMPUTER ANALYSIS MAY BE VIEWED AS A FALSIFICATION
EXPERIMENT OF WET-LAB “RESULTS”
• CX32 IN PLANTS IS PROTEIN KINASE NOT CONNEXIN
… in 1992 , this could all be figured out , but in 1970 ?
BARNEY AND BRITNEY - A CONNECTION ?
• USEFUL FOR ANNOYING PARENTS
• HARD-WORKING
• SING A LOT
-BARNEY
BRITNEY
BA-RNEY
BRITNEY
BAR--NEY
B-RITNEY
BAR--NEY
-BRITNEY
DYNAMIC PROGRAMMING (aka DYN. PLANNING)
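[Figure: grids in which each cell holds the number of paths reaching it from the corner, e.g. rows 1 1 1 1 1 / 1 2 3 4 5 / 1 3 6 10 15 / 1 4 10 20 35 / 1 5 15 35 70, followed by variants in which some cells are blocked ( value 0 ), e.g. rows 0 3 9 19 34 / 0 3 12 31 65]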
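The path-count grids can be reproduced with a few lines of code. The following is a minimal sketch, assuming that the figure counts right/down paths from the top-left cell and that a cell marked 0 is blocked; the function name `count_paths` and the `blocked` parameter are illustrative, not taken from the slides.

```python
def count_paths(rows, cols, blocked=frozenset()):
    """Count right/down paths from the top-left cell to every cell of a grid.

    Each cell is filled once from its upper and left neighbours, which is
    the dynamic-programming idea: reuse sub-results instead of enumerating
    every path separately.
    """
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if (r, c) in blocked:
                continue  # a blocked cell passes no paths on
            if r == 0 and c == 0:
                grid[r][c] = 1  # exactly one way to stand on the start cell
            else:
                up = grid[r - 1][c] if r > 0 else 0
                left = grid[r][c - 1] if c > 0 else 0
                grid[r][c] = up + left
    return grid


if __name__ == "__main__":
    for row in count_paths(5, 5):
        print(row)
    print()
    # Assumed variant: the first cell of the fourth row is blocked.
    for row in count_paths(5, 5, blocked={(3, 0)}):
        print(row)
```

Under these assumptions the open grid ends in 70 paths, and blocking one border cell reduces the last two rows to 0 3 9 19 34 and 0 3 12 31 65, the values that appear in the figure.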
DYNAMIC PROGRAMMING - Needleman and Wunsch ( 1970 )
[Figure: partially filled dynamic-programming matrix for BARNEY against BRITNEY; cells hold value pairs such as 0/1, 1/2, 0/2, 1/3]
BAR--NEY
B-RITNEY
• CAN BE AUTOMATED
• ASKS EVERY AMINO ACID TO BE ALIGNED WITH SOMETHING
• DOES NOT TELL WHETHER SEQUENCES ARE RELATED
• OUTCOME IS DEPENDENT ON HOW MATCHES
AND MISMATCHES ARE SCORED
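A compact global alignment in the spirit of Needleman and Wunsch illustrates the points above. This is a sketch under stated assumptions: a linear gap penalty applied to every gap position (end gaps included), default scores of match 1 / mismatch 0 / gap -1, and an arbitrary choice among equally good tracebacks. The function name and parameters are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner; ties are broken arbitrarily.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    for gap in (0, -1):
        s, x, y = needleman_wunsch("BARNEY", "BRITNEY", gap=gap)
        print(gap, s)
        print(x)
        print(y)
```

Every residue ends up in some column, and the procedure returns an alignment for any pair of strings, related or not. Changing `gap` from 0 to -1 changes which alignment wins, which is the dependence on scoring noted in the last bullet.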
BARNEY AND BRITNEY - HOW TO QUANTIFY ?
M/R/ID/IDend            1/0/0/0    1/0/-1/0    1/0/-2/-1
-BARNEY / BRITNEY          3          3           3
BA-RNEY / BRITNEY          4          3           2
BAR--NEY / B-RITNEY        5          2           1
BAR--NEY / -BRITNEY        4          2           1
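The scores of fixed alignments can be recomputed mechanically. The sketch below assumes that M/R/ID/IDend stands for the scores of a match, a replacement (mismatch), an internal indel position, and an indel position at either end of the alignment; that reading, and the function name, are assumptions rather than definitions given in the slides.

```python
def score_alignment(top, bottom, match, mismatch, indel, end_indel):
    """Score a given gapped alignment column by column."""
    assert len(top) == len(bottom), "aligned strings must have equal length"
    columns = list(zip(top, bottom))
    # Columns where both rows hold a residue; gaps outside this range are end gaps.
    paired = [k for k, (x, y) in enumerate(columns) if x != "-" and y != "-"]
    first, last = paired[0], paired[-1]
    total = 0
    for k, (x, y) in enumerate(columns):
        if x == "-" or y == "-":
            total += end_indel if (k < first or k > last) else indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


ALIGNMENTS = [
    ("-BARNEY", "BRITNEY"),
    ("BA-RNEY", "BRITNEY"),
    ("BAR--NEY", "B-RITNEY"),
    ("BAR--NEY", "-BRITNEY"),
]
SCHEMES = [(1, 0, 0, 0), (1, 0, -1, 0), (1, 0, -2, -1)]

if __name__ == "__main__":
    for top, bottom in ALIGNMENTS:
        print(top, bottom, [score_alignment(top, bottom, *s) for s in SCHEMES])
```

Under this reading the first two schemes give 3/3, 4/3, 5/2 and 4/2, as tabulated. For the third scheme the sketch gives 2, 2, -1, -1 rather than the tabulated 3, 2, 1, 1, so either the assumed meaning of IDend differs from the original or the tabulated values lost something in transcription.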
Other ideas ?
Each type of match and each type of mismatch to be scored differently !
SUBSTITUTION MATRICES
• SET OF VALUES IN THE FORM OF 20*20 (AMINO ACIDS)
OR 4*4 (NUCLEOTIDES) MATRIX
• “ SCORE OF CHANGING i TO j ”
• OR, MORE COMMONLY, ONE HALF OF SUCH MATRIX
• “ SCORE OF ALIGNING i TO j ”
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
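Because only one half of the matrix is stored, a lookup has to treat the pair ( i , j ) and the pair ( j , i ) as the same entry. A minimal sketch, assuming a dictionary keyed by the alphabetically sorted pair and holding only the six entries needed for the example; the two hexapeptides are taken from the kinase alignment, and the names `HALF_MATRIX`, `pair_score` and `segment_score` are illustrative.

```python
# A few entries copied from the half matrix, keyed by the sorted residue pair.
HALF_MATRIX = {
    ("Y", "Y"): 7,
    ("L", "L"): 4,
    ("A", "H"): -2,
    ("S", "S"): 4,
    ("K", "R"): 2,
    ("G", "K"): -2,
}


def pair_score(x, y):
    """Score of aligning x to y; the order of the two residues does not matter."""
    return HALF_MATRIX[tuple(sorted((x, y)))]


def segment_score(s, t):
    """Sum of pair scores over an ungapped pair of equal-length segments."""
    assert len(s) == len(t)
    return sum(pair_score(x, y) for x, y in zip(s, t))


if __name__ == "__main__":
    print(segment_score("YLHSRG", "YLASKK"))
```

The conservative R/K pair adds to the total while the H/A and G/K pairs subtract from it, which a flat match/mismatch scheme could not express.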
HOW TO DERIVE SCORE VALUES
• “ FIRST PRINCIPLES ”
• HOW MANY MUTATIONS ARE REQUIRED TO CHANGE i TO j
• CAN CALCULATE FROM GENETIC CODE, BUT IMPLIES FUNNY
THINGS ABOUT EVOLUTION
• “CHEMICAL ISOFUNCTIONALITY”
[Figure: amino acids grouped by chemical similarity; letters as shown: S T , L V , D E , I M , A G , F W Y , and L I V M T , F , S A G , W Y D , E]
• MUCH, MUCH BETTER - BASED ON THE OBSERVED
FREQUENCIES OF SUBSTITUTIONS
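The "first principles" count of mutations needed to change i to j can be computed directly from the genetic code. The sketch below assumes the standard genetic code, written in the usual compact form (bases in the order T, C, A, G), and takes the minimum over all codon pairs; the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ; * = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def min_base_changes(aa1, aa2):
    """Smallest number of nucleotide substitutions turning a codon of aa1 into a codon of aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons1
        for c2 in codons2
    )


if __name__ == "__main__":
    for pair in (("D", "E"), ("M", "W"), ("M", "Y")):
        print(pair, min_base_changes(*pair))
```

Such a score treats every single-base change as equally likely to be accepted regardless of its chemical consequences, which is one way to see why the approach implies odd things about evolution.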
PAM, BLOSUM, ... - ALL SHOULD BE LOG-ODDS
• SCORE FOR ALIGNING i TO j : S_ij = log ( q_ij / ( p_i p_j ) )
p – background frequencies, q – target frequencies: how to get them?
• PAM (= point accepted mutations) - Dayhoff, 1968
• alignments of ≥ 85% identical proteins, 71 families, mostly animal
• used model of evolutionary change with many assumptions
• directly observed data are at short evolutionary distances
• for more distant relationships, multiply matrix by itself - e.g. PAM 120
• BLOSUM (= BLOcks SUbstitution Matrix) – the Henikoffs, 1992
• 500 families, more members, and they are more diverse
• but most importantly – conservation is of a different type!
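The extrapolation step for PAM ("multiply matrix by itself") acts on the matrix of mutation probabilities, from which log-odds scores are then derived. A toy sketch follows, assuming a hypothetical two-letter alphabet and made-up one-step probabilities; nothing here is Dayhoff's data.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, n):
    """Multiply a matrix by itself n times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step matrix: row i gives P(letter i becomes letter j).
TOY_PAM1 = [
    [0.99, 0.01],
    [0.02, 0.98],
]

if __name__ == "__main__":
    for row in mat_pow(TOY_PAM1, 120):
        print([round(p, 3) for p in row])
```

Each row still sums to 1 after 120 steps, but the diagonal has dropped well below its one-step value: at larger distances the model expects far more substitutions, all extrapolated from changes observed between close relatives.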
BLOCKS : THIS IS THE WAY PROTEINS LIVE
Two ungapped blocks from the kinase alignment ( alignment rows 85 - 147 in consensus numbering ):
              row         block     block
consensus     85-147      YLHSRG    IVHRDLKPENILLDEN
query         33-94       QLTKRE    CIYRDLQVFHILLDLS
1DAW_A        116-175     YCHSQG    IMHRDVKPHNVMIDHE
1FGI_A        115-192     YLASKK    CIHRDLAARNVLVTED
gi 6226547    393-456     YLEARH    FIHRDLAARNCLVSEH
gi 125484     1166-1228   YLASKK    FVHRDLAARNCMLDEK
gi 1730077    1375-1454   YLHGWT    ILHRDLSSRNILLDHN
gi 125874     189-256     HLHSIT    IVHRDLTSQNILLDEL
gi 462606     88-159      YLHDEA    IIHRDLKSSNILILQK
gi 1346396    615-682     YLHHEC    VIHCDVKPENILLDQA
• BLOCKS ARE REGIONS WITH
HIGH STRUCTURAL, FUNCTIONAL,
AND EVOLUTIONARY SIGNAL
( or signal-to-noise ratio )
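Target and background frequencies can be estimated from a block by counting residue pairs within each column. The sketch below is a deliberately simplified version of that idea, assuming one small block (the first block of the kinase alignment), no clustering of similar sequences, no pseudocounts, and ordered pair counts so that the formula S_ij = log ( q_ij / ( p_i p_j ) ) applies as written. It is not the Henikoffs' full procedure.

```python
from collections import Counter
from itertools import combinations
from math import log2

BLOCK = [
    "YLHSRG", "QLTKRE", "YCHSQG", "YLASKK", "YLEARH",
    "YLASKK", "YLHGWT", "HLHSIT", "YLHDEA", "YLHHEC",
]


def log_odds(block):
    """Log-odds scores (in bits) for every residue pair observed in the block."""
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[(x, y)] += 1  # count each pair in both orders
            pair_counts[(y, x)] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}  # target frequencies
    p = Counter()  # background frequencies: marginals of q
    for (x, _), freq in q.items():
        p[x] += freq
    return {pair: log2(freq / (p[pair[0]] * p[pair[1]])) for pair, freq in q.items()}


if __name__ == "__main__":
    scores = log_odds(BLOCK)
    for pair in (("L", "L"), ("Y", "H"), ("R", "K")):
        print(pair, round(scores[pair], 2))
```

Pairs that never share a column get no entry at all in this sketch; handling them, and down-weighting near-identical sequences, is where a real matrix construction needs many more families and more care.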
LOG-ODDS: RANDOM SCORES ARE NEGATIVE
[Figure: the 20 × 20 amino-acid substitution half-matrix from SUBSTITUTION MATRICES is shown again, with rows and columns labelled A R N D C Q E G H I L K M F P S T W Y V]
RANDOM SEQUENCES ( per S. Altschul ) :
ALIGN TWO SEQUENCES WITH SCORE S (this is called HSP)
- IS THIS SCORE SIGNIFICANTLY DIFFERENT FROM ALIGNING
TO A RANDOM SEQUENCE ? - where RANDOM may be
• COMPUTER – GENERATED ( perhaps with assumptions )
• THEMSELVES BUT SHUFFLED ( Z-scores in many programs )
• REAL BUT UNRELATED SEQUENCES ( e.g. all database )
- and SIGNIFICANT is …..
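The "themselves but shuffled" option can be sketched as follows: score the real pair, score the first sequence against many shuffled copies of the second, and express the real score in standard deviations above the shuffled mean. The scoring function here is a toy (best count of identical residues over all ungapped offsets), and the trial count and seed are arbitrary; all names are illustrative.

```python
import random
from statistics import mean, stdev


def best_ungapped_matches(a, b):
    """Toy score: largest number of identical residues over all ungapped offsets."""
    best = 0
    for shift in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, x in enumerate(a)
            if 0 <= i - shift < len(b) and x == b[i - shift]
        )
        best = max(best, matches)
    return best


def shuffle_z_score(a, b, trials=200, seed=0):
    """Z-score of the real score against scores for shuffled versions of b."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = best_ungapped_matches(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, order destroyed
        null_scores.append(best_ungapped_matches(a, "".join(letters)))
    spread = stdev(null_scores)
    if spread == 0:
        raise ValueError("shuffled scores do not vary; Z-score undefined")
    return (observed - mean(null_scores)) / spread


if __name__ == "__main__":
    print(shuffle_z_score("YELGEKLGSGAFGKVYKGKHKD", "YEVVRKVGRGKYSEVFEGINVN"))
```

Shuffling preserves length and composition, so the Z-score asks only whether the order of residues matters. Turning a Z-score into a probability still requires an assumption about the shape of the score distribution.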
BLAST STATISTICS
E = K m n e^( -λS )
E - the expected number of HSPs with score S or higher observed
by chance, given the size and complexity of database
m and n – effective lengths of database and query
λ – parameter from the substitution matrix ( precomputed )
K – parameter from the search space
(length+complexity)
Raw Score S : sum of scores in all aligned positions ( matches and
mismatches) minus gap penalties
Bit Score S’ : get rid of λ , reset the log base : S’ = log2 ( K / E ) + log2 ( mn )
WHAT IS IN THE BLAST SCORE ?
Bit Score : S’ = log2 ( K / E ) + log2 ( mn )
- usually dominated by log2 mn ,
i.e. score distinguishing chance from non-chance is
the number of binary choices to map the HSP ( 40 – 45 )
SCORE : DIRECT MEASURE
INDEPENDENT OF THE DB ( IF BITS )
STAYS THE SAME WHEN SEQUENCES ARE FLIPPED
E and P VALUES : CALCULATE KNOWING THE SCORE
DEPENDENT ON THE DB SIZE
NON-SYMMETRICAL EVEN WITH UNGAPPED HSPs
HOMOLOGY STILL HAS TO BE INFERRED
THE OPPOSITE IS NOT TRUE: LOW S ≠ NO HOMOLOGY
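The relation between raw score, bit score and E-value can be checked numerically. In this sketch the values of λ and K and the lengths m and n are illustrative placeholders, not parameters of any particular search. The bit score is computed as in Altschul's tutorial, S’ = ( λS − ln K ) / ln 2, which folds K into the score so that E = mn 2^( −S’ ); the expression on the slide keeps K as a separate term.

```python
from math import exp, log, log2


def expected_hits(raw_score, m, n, lam, K):
    """E = K m n e^(-lambda S): chance HSPs expected with score >= S."""
    return K * m * n * exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalized score in bits: S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - log(K)) / log(2)


def expected_hits_from_bits(bits, m, n):
    """E = m n 2^(-S'): the same E-value, computed from the bit score alone."""
    return m * n * 2.0 ** (-bits)


def search_space_bits(m, n):
    """log2(mn): bits needed just to specify where an HSP starts."""
    return log2(m * n)


if __name__ == "__main__":
    m, n = 4_000_000_000, 300   # illustrative database and query lengths
    lam, K = 0.267, 0.041       # illustrative statistical parameters
    raw = 100
    bits = bit_score(raw, lam, K)
    print(round(bits, 1), round(search_space_bits(m, n), 1))
    print(expected_hits(raw, m, n, lam, K), expected_hits_from_bits(bits, m, n))
```

With these placeholder sizes log2 ( mn ) comes out at about 40 bits, in line with the 40 – 45 range quoted above: a bit score has to exceed roughly that number before E drops below 1.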
YNIVAQARTGSGKTASFAIPL
[Figure annotations: “touches ATP γ-phosphate”, “touches Mg++”]
Amino-acid frequencies in four alignment columns ( amino acids not listed are 0 in all four ):
      col 1   col 2   col 3   col 4
A      0.1     0       0       0
G      0.1     1       0       0
K      0       0       1       0
S      0.3     0       0       0.3
T      0.1     0       0       0.7
T-G-[GSAT]-G-K-[ST]
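The pattern and the frequency columns are two ways to search for the same motif. The sketch below rewrites the pattern as a regular expression and scores four-residue windows with the frequency columns. That the four columns correspond to the last four pattern positions is an assumption, as are all names used here.

```python
import re

SEQUENCE = "YNIVAQARTGSGKTASFAIPL"

# T-G-[GSAT]-G-K-[ST] rewritten in regular-expression syntax.
PATTERN = re.compile(r"TG[GSAT]GK[ST]")

# Frequency columns; amino acids that are absent have frequency 0.
PROFILE = [
    {"A": 0.1, "G": 0.1, "S": 0.3, "T": 0.1},
    {"G": 1.0},
    {"K": 1.0},
    {"S": 0.3, "T": 0.7},
]


def window_probability(window):
    """Product of per-column frequencies for one window of the sequence."""
    prob = 1.0
    for column, residue in zip(PROFILE, window):
        prob *= column.get(residue, 0.0)
    return prob


def best_window(seq):
    """Start index of the window with the highest profile probability."""
    width = len(PROFILE)
    return max(
        range(len(seq) - width + 1),
        key=lambda i: window_probability(seq[i:i + width]),
    )


if __name__ == "__main__":
    hit = PATTERN.search(SEQUENCE)
    print(hit.start(), hit.group())
    start = best_window(SEQUENCE)
    print(start, SEQUENCE[start:start + 4], window_probability(SEQUENCE[start:start + 4]))
```

The regular expression only answers yes or no, whereas the profile grades the match: a window with S in the last column would still be accepted but with a lower value than one with T.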
ALL ABOVE ARE PROBABILISTIC MODELS
• IN EACH POSITION, EXPECTED OR OBSERVED FREQUENCY
OF EACH AMINO ACID
• ALIGNMENTS , PROFILES , REGULAR EXPRESSIONS , PSSMs ,
HMMs , etc. ARE ALL INCARNATIONS OF THE SAME IDEA
• ALL OF THE ABOVE CAN BE MATCHED TO EACH OTHER OR
TO A SINGLE SEQUENCE , USING VARIATIONS OF A SCORING
FUNCTION
• AFFORD BETTER SENSITIVITY AND SELECTIVITY THAN ONE
SEQUENCE
• HMM IS FOR HIDDEN MARKOV MODEL : WHAT IS HIDDEN ?
• “ OCCASIONALLY DISHONEST CASINO ”
142326546665562262143165
the state of the die is ‘hidden’, but can be revealed
• GIVEN: SEQUENCE; ALIGNMENT; PROBABILITIES OF CHANGES
DETERMINE: IS SEQUENCE PRODUCED BY EVOLUTION OF THE
FAMILY THAT MAKES UP THIS ALIGNMENT ??
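For the casino example, "revealing" the hidden state means finding the most probable sequence of fair / loaded states given the rolls. The following is a sketch of Viterbi decoding; the transition and emission probabilities are assumptions chosen for illustration (a loaded die that shows 6 half of the time, and rare switches between dice), since the slides do not give them.

```python
from math import log

STATES = "FL"  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # assumed
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},  # assumed loading
}


def viterbi(rolls):
    """Most probable state path for a string of rolls, computed in log space."""
    v = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # one dict of back-pointers per roll after the first
    for r in rolls[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda t: v[t] + log(TRANS[t][s]))
            new_v[s] = v[prev] + log(TRANS[prev][s]) + log(EMIT[s][r])
            ptr[s] = prev
        v = new_v
        back.append(ptr)
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    rolls = "142326546665562262143165"
    print(rolls)
    print(viterbi(rolls))
```

A profile HMM for a protein family works the same way, with match, insert and delete states in place of the two dice, and the question in the last bullet becomes how probable the sequence is under that model compared with a background model.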
SOURCES AND ACKNOWLEDGEMENTS
• King Jordan’s class: http://jhunix.hcf.jhu.edu/~kjordan6/
• Sean Eddy’s: http://bio5495.wustl.edu/
• Steve Altschul’s tutorial: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html