Transcript Slide 1
Sequence Based Analysis
Tutorial
NIH Proteomics Workshop
Lai-Su L. Yeh, Ph.D.
Protein Information Resource at
Georgetown University Medical Center
Retrieval, Sequence Search & Classification Methods
Retrieve protein info by text / UID
Sequence Similarity Search: BLAST, FASTA, Dynamic Programming
Family Classification: Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
Integrated Search and Classification System
2
Sequence Similarity Search (I)
Based on Pair-Wise Comparisons
Dynamic Programming Algorithms
Global Similarity: Needleman-Wunsch
Local Similarity: Smith-Waterman
Heuristic Algorithms
FASTA: Based on K-Tuples (2-Amino Acid)
BLAST: Triples of Conserved Amino Acids
Gapped-BLAST: Allow Gaps in Segment Pairs
PHI-BLAST: Pattern-Hit Initiated Search
PSI-BLAST: Position-Specific Iterated Search
3
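The dynamic programming approach listed above can be illustrated with a minimal Smith-Waterman sketch. The match, mismatch and gap values here are assumptions chosen for readability; real searches would normally use a substitution matrix and often affine gap penalties, so treat this as an illustration of the recurrence rather than a production implementation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the floor at 0 is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAAGVPKEITTT", "CCGVPKEIWW"))
```

A global (Needleman-Wunsch) variant would drop the floor at 0, initialise the first row and column with cumulative gap penalties, and trace back from the bottom-right cell.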
Sequence Similarity Search (II)
Similarity Search Parameters
Scoring Matrices – Based on Conserved Amino
Acid Substitution
• Dayhoff Mutation Matrix, e.g., PAM250 (~20%
Identity)
• Henikoff Matrix from Ungapped Alignments,
e.g., BLOSUM 62
Gap Penalty
Search Time Comparisons
Smith-Waterman: 10 Min
FASTA: 2 Min
BLAST: 20 Sec
4
Feature Representation
Features of Amino Acids: Physicochemical Properties,
Context (Local & Global) Features, Evolutionary Features
Alternative Amino Acids: Classification of Amino Acids To
Capture Different Features of Amino Acid Residues
Alphabet         Size  Features                Membership
AA Identity      20    Sequence Identity       A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
Exchange Group   6     Evolution Substitution  {HRK} {DENQ} {C} {STPAG} {MILV} {FYW}
Charge/Polarity  4     Charge and Polarity     {HRK} {DE} {CTSGNQY} {PMLIVFW}
Hydrophobicity   3     Hydrophobicity          {DENQRK} {CSTPGHY} {AMILVFW}
Structural       3     Surface Exposure        {DENQHRK} {CSTPAGWY} {MILVF}
2D Propensity    3     Secondary Structure     {AEQHKMLR} {CTIVFYW} {SGPDN}
5
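As a rough illustration of alternative alphabets, the sketch below recodes a protein sequence using the exchange-group and hydrophobicity groupings listed on this slide. The single-letter group labels (a, b, c, ...) and the function names are hypothetical conveniences, not part of any published encoding.

```python
# Groupings copied from the slide; the labels assigned to them are arbitrary.
EXCHANGE_GROUPS = ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"]
HYDROPHOBICITY_GROUPS = ["DENQRK", "CSTPGHY", "AMILVFW"]


def build_table(groups):
    """Map each residue to a one-letter label for its group (a, b, c, ...)."""
    table = {}
    for index, members in enumerate(groups):
        label = chr(ord("a") + index)
        for residue in members:
            table[residue] = label
    return table


def encode(sequence, groups):
    """Rewrite a sequence in the reduced alphabet; unknown residues become 'x'."""
    table = build_table(groups)
    return "".join(table.get(residue, "x") for residue in sequence.upper())


if __name__ == "__main__":
    print(encode("GVPKEI", EXCHANGE_GROUPS))
    print(encode("GVPKEI", HYDROPHOBICITY_GROUPS))
```

Recoding two sequences this way before comparing them makes conservative substitutions (for example Lys and Arg in the exchange-group alphabet) look identical, which is the point of a reduced alphabet.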
Substitution Matrix
Likelihood of One Amino Acid Mutated into Another Over
Evolutionary Time
Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
6
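A small sketch of how substitution scores are applied when scoring an ungapped alignment. Only the Gly/Trp (-7) and Lys/Arg (+3) values come from the slide; the identity scores below are assumed, PAM250-style values included for illustration and should be checked against a published matrix before any real use.

```python
# Symmetric scores keyed by the unordered residue pair.
SCORES = {
    frozenset("GW"): -7,  # from the slide: unlikely substitution
    frozenset("KR"): 3,   # from the slide: conservative substitution
    frozenset("W"): 17,   # assumed: identical match of a rare residue
    frozenset("C"): 12,   # assumed
    frozenset("R"): 6,    # assumed
    frozenset("K"): 5,    # assumed
    frozenset("G"): 5,    # assumed
}


def pair_score(x, y):
    """Look up the score for aligning residue x with residue y."""
    return SCORES[frozenset((x, y))]


def ungapped_score(a, b):
    """Sum pair scores over two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(ungapped_score("KGW", "RWW"))  # 3 + (-7) + 17
```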
BLAST
BLAST (Basic Local Alignment Search Tool)
Extremely fast
Robust
Most frequently used
It finds very short segment pairs (“seeds”) between the
query and the database sequence
These seeds are then extended in both directions until
the maximum possible score for extensions of this
particular seed is reached
7
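The seed-and-extend idea described above can be sketched as follows. This is a deliberately simplified illustration: it uses exact word matches as seeds and a fixed match/mismatch score, whereas the real program scores words and extensions with a substitution matrix. The word size, scores and drop-off threshold are all assumptions.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact word of length w shared by both."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, w=3, match=2, mismatch=-1, x_drop=3):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, subject_start, length) of the best extension.
    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far in that direction.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left + right
    return score, i - left_len, j - left_len, w + left_len + right_len


if __name__ == "__main__":
    q, s = "AAAGVPKEITTT", "CCGVPKEIWW"
    for seed in find_seeds(q, s):
        print(seed, extend(q, s, *seed))
```

In this toy example every seed extends to the same segment pair, which is why practical implementations discard seeds that fall inside an already extended region.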
BLAST Search
From BLAST Search Interface
Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Click to see SSEARCH alignment
Click to see alignment
8
Blast Result & Pairwise Alignment
BLAST
Alignment
9
Classification
What is classification?
Why do we need protein classification?
Different levels of classification
Basis for functional protein classification
How to classify a protein of unknown
function?
10
Classification Databases
Protein motif: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H (The 2 C's and the 2 H's are zinc ligands)
Protein domain: Group proteins according to the presence of a common domain
3-D structure: Group proteins according to common 3D structure
Whole-protein: Group proteins according to common domain architecture and length
11
Family Classification Methods
Based on Other Classification
Information
Multiple Sequence Alignment (ClustalW)
ProSite Pattern Search
Profile Search
Hidden Markov Models (HMMs)
Domain (Pfam); Whole protein (PIRSF)
Neural Networks
12
How do you build a tree?
Pick sequences to align
Align them
Verify the alignment
Keep the parts that are aligned correctly
Build and evaluate a phylogenetic tree
Integrated Analysis
13
Multiple Sequence Alignment
ClustalW
Progressive Pairwise Approach
Based on Exhaustive Pairwise Alignments
Neighbor Joining
Joining Order Corresponding to a Tree
Alignment Varies
Dependent on Joining Order
14
Multiple Alignment and Tree
From Text/Sequence Search Result or ClustalW Alignment Interface
15
16
Motif Patterns (Regular Expressions)
Signature Patterns for Functional Motifs
PCM_AC PCM00836
PCM_ID ALADH_PNT_1; MOTIF
PS_DE Alanine dehydrogenase & pyridine nucleotide transhydrogenase signature 1
PS_PA G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-x(2)-L-x-[KRH]-x-G.
PROSITE PS00836; PDOC00654
LENGTH Conserve = 16aa; Maximum = 29aa; Minimum = 27aa
COUNT PST = 5 (5); PSN = 2 (2); PCT = 2 (2); PCN = 3 (3);
NNTM_BOVIN+DEBOXM   PST  60  GVPKEIFQNEK--RVALSPAGVQALVKQG
+G02257             PCT  60  GVPKEIFQNEK--RVALSPAGVQNLVKQG
DHA_BACSH+A34261    PST   4  GIPKEIKNNEN--RVAMTPAGVVSLTHAG
DHA_BACST+B34261    PST   4  GIPKEIKNNEN--RVAITPAGVMTLVKAG
DHA_MYCTU+A43830    PST   4  GIPTETKNNEFQFRVAITPAGVAELTRRG
PNTA_ECOLI+DEECXA   PST   4  GIPRERLTNET--RVAATPKTVEQLLKLG

DHA_BACSU+A49337    PSN   4  GVPKEIKNNEN--RVALTPGGVSQLISNG
PNTA_HAEIN+E64119   PSN   4  GVPRELLENES--RVAATPKTVQQILKLG
+S74638             PCN   4  GVPKEIKDQEF--RVGLTPSSVRALLSQG
+S77433             PCN  23  GVPRESFDQEC--RVAMTPDTAQKLQKLG
+F64694             PCN   4  GLVKESMDLES--RVALVPDDVALIVQKG

            Predicted              Not Predicted
Member      True Positive (“T”)    False Negative (“N”)
Non-Member  False Positive (“F”)   True Negative
ProClass Motif Alignments
17
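Because PROSITE patterns are essentially regular expressions, a signature such as the PS_PA line above can be searched with an ordinary regex engine once its syntax is translated. The converter below is a minimal sketch covering the common PROSITE elements (x, x(n), x(n,m), [..], {..}, and the < and > anchors); the function name is hypothetical and rarer syntax is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("cannot parse element: %r" % element)
        core, low, high = m.groups()
        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ABC] classes and single letters are already valid regex.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


if __name__ == "__main__":
    ps_pa = ("G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-"
             "x(2)-L-x-[KRH]-x-G.")
    regex = re.compile(prosite_to_regex(ps_pa))
    # Aligned motifs copied from the slide; gap characters are removed first.
    for aligned in ("GVPKEIFQNEK--RVALSPAGVQALVKQG",
                    "GVPKEIKNNEN--RVALTPGGVSQLISNG"):
        print(aligned, bool(regex.search(aligned.replace("-", ""))))
```

The second sequence is one of the slide's false negatives: it fails the pattern only at the [KRH] position, which shows how a strict signature can miss a true family member.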
PIR Pattern Search
From Text/Sequence Search Result or Pattern Search Interface
One Query Sequence Against PROSITE Pattern Database
One Query Pattern (PROSITE or User-Defined) Against Sequence DB
18
Pattern Search Result (I)
One Query Sequence Against PROSITE Pattern Database
19
Pattern Search Result (II)
One Query Pattern Against Sequence Database
Display the query pattern
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
20
Profile Method
Profile: A Table of Scores to Express Family
Consensus Derived from Multiple Sequence
Alignments
Num of Rows = Num of Aligned Positions
Each row contains a score for the alignment with
each possible residue.
Profile Searching
Summation of Scores for Each Amino Acid Residue
along Query Sequence
Higher Match Values at Conserved Positions
21
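The profile idea can be sketched in a few lines: one row per aligned position, one score per possible residue, and a search that sums the row scores along the query. The scoring scheme used here (log-odds of pseudocount-smoothed column frequencies against a uniform background) is an assumption chosen for simplicity; published profile methods use more elaborate weighting. The input block is the gap-free RVA...V segment of the motif alignment shown earlier.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a list of rows (one per column), each mapping residue -> score.

    Assumes a gap-free alignment block of equal-length sequences.
    """
    profile = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            # Log-odds against a uniform background of 1/20 (an assumption).
            row[aa] = math.log2(freq * len(AMINO_ACIDS))
        profile.append(row)
    return profile


def best_window(profile, query):
    """Slide the profile along the query; return (score, start) of the best window."""
    width = len(profile)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(profile[k][query[start + k]] for k in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    block = ["RVALSPAGV", "RVALSPAGV", "RVAMTPAGV",
             "RVAITPAGV", "RVAITPAGV", "RVAATPKTV"]
    profile = build_profile(block)
    # Hypothetical query containing a motif-like segment at position 3.
    print(best_window(profile, "MKTRVALTPAGVQQ"))
```

Fully conserved columns (such as the leading R) receive the highest match scores, which is the behaviour the slide describes as higher match values at conserved positions.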
PIRSF scan
Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER
The matched regions and statistics will be displayed.
Shows PIRSF that the query belongs to
Statistical data for all domains
Statistical data per domain
Alignment with consensus sequence
22
Secondary Structure Features
α Helix Patterns of Hydrophobic Residue Conservation Showing I, I+3, I+4, I+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
23
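The two periodicity rules above can be checked mechanically. The sketch below scans a single sequence for hydrophobic residues at the stated offsets; the slide refers to conservation patterns across an alignment, so applying the rule to one sequence is a simplification. The hydrophobic set is taken from the hydrophobicity alphabet listed earlier ({AMILVFW}), and using that particular set is an assumption.

```python
HYDROPHOBIC = set("AMILVFW")   # assumed definition of "hydrophobic"
HELIX_OFFSETS = (0, 3, 4, 7)   # I, I+3, I+4, I+7
STRAND_OFFSETS = (0, 2, 4, 6)  # I, I+2, I+4, I+6


def pattern_starts(sequence, offsets):
    """Return every start index I where all offset positions are hydrophobic."""
    span = max(offsets)
    return [i for i in range(len(sequence) - span)
            if all(sequence[i + k] in HYDROPHOBIC for k in offsets)]


if __name__ == "__main__":
    # Hypothetical test peptides, not real protein segments.
    print(pattern_starts("LKKLLKKL", HELIX_OFFSETS))
    print(pattern_starts("VKVEVTVR", STRAND_OFFSETS))
```

A hit is only a hint: short hydrophobic runs satisfy both patterns at once, so in practice such signals are combined with other evidence.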
3D Structure
Proteins share the same fold suggesting homology
Gamma Crystallin C
Beta B1 Crystallin
24
Creation and Curation of PIRSFs
25
Integrated Bioinformatics System for
Function and Pathway Discovery
Data Integration
Associative Analysis
User Input (Local Data, Search Criteria, Report Format)
Input (Gene/Protein Expression Data)
Output (Analysis Results, Biological Interpretation)
Integrated Bioinformatics System
Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
Sequence Analysis Pipeline (Family Classification & Feature Identification)
Graphical User Interface (Browsing, Querying, Navigation)
Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)
26
Query Sequence
UniProt
Family Classification & Functional Analysis
BLAST Search
HMM Domain Search
Analytical Pipeline
Top-Matched Superfamilies/Domains
HMM Motif Search
Pattern Search
SignalP/TMHMM
Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
SSEARCH
CLUSTALW
Superfamily/Domain/Motif Alignments
Family Relationships & Functional Features
27
Integrated Bioinformatics System
Gene Expression Data
Proteomic Data
Global Bioinformatics Analysis of 1000’s of Genes and Proteins
Pathway Discovery, Target Identification
Integrated Protein Knowledge System
Clustering
Gene/Peptide-Protein Mapping
Expression Pattern
Protein List
Functional Analysis (Sequence Analysis & Information Retrieval)
Comprehensive Protein Information Matrix
Visualization & Statistical Analysis
Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)
Clustered Matrix
Clustered Graph
Pathway Map
Process Hierarchy
28
Lab Section
29
Text Search
30
Text Search Result (I)
Extend your search or start over
Choose columns to be displayed
Expand view
Pre-computed BLAST Results
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
31
Text Search Result (III)
Number of Related Seq. at 3 different E-value cut-offs
32
Text Search Result (II)
Extend your search or start over
Choose columns to be displayed
Link to PIRSF report
Curated domain architecture with links to Pfam database
Extent of family curation
33
Peptide Search
34
Peptide Search & Results
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Matching peptide highlighted in the sequence
35
Batch Retrieval Results (I)
Retrieve more sequences
Choose columns to be displayed
Links to iProClass and UniProtKB reports
36
Batch Retrieval Results (II)
Retrieve more families
Choose columns to be displayed
Links to PIRSF reports
Curated domain architecture (N- to C- termini) with links to Pfam database
37
Blast Similarity Search
38
Blast / Related Sequences Results
40
Blast Result & Pairwise Alignment
BLAST
Alignment
41
Pairwise Alignment
42
Multiple Alignment Interactive
Phylogenetic Tree and Alignment
43
Phylogenetic Tree and Alignment View
44
Pattern Search (I)
45
Pattern Search (II)
Display the query pattern
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
46
PIRSF scan
47
PIRSF Report
48
PIRSF Family Hierarchy
49
Taxonomic Distribution &
Phylogenetic Pattern
50
Rabbit Alpha Crystallin A Chain: An iProClass View of the Entry
Pre-computed BLAST results
See protein synonyms
See IDs from different databases
51
alpha-Crystallin and Related Proteins
52