Transcript Slide 1

Sequence Based Analysis
Tutorial
NIH Proteomics Workshop
Lai-Su L. Yeh, Ph.D.
Protein Information Resource at
Georgetown University Medical Center
Retrieval, Sequence Search & Classification Methods

Retrieve protein info by text / UID
• Sequence Similarity Search
  BLAST, FASTA, Dynamic Programming
• Family Classification
  Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System
2
Sequence Similarity Search (I)

Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
  Global Similarity: Needleman-Wunsch
  Local Similarity: Smith-Waterman
• Heuristic Algorithms
  FASTA: Based on K-Tuples (2-Amino Acid)
  BLAST: Triples of Conserved Amino Acids
  Gapped-BLAST: Allow Gaps in Segment Pairs
  PHI-BLAST: Pattern-Hit Initiated Search
  PSI-BLAST: Position-Specific Iterated Search
3
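The dynamic programming methods named on the slide above can be illustrated with a minimal sketch. The scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions, not the parameters of any particular tool; real searches use a substitution matrix and usually affine gap penalties.

```python
def alignment_score(a, b, local=True, match=2, mismatch=-1, gap=-2):
    """Score the best alignment of a and b by dynamic programming.

    local=True  : Smith-Waterman style (cell scores floor at 0, best cell anywhere)
    local=False : Needleman-Wunsch style (end-to-end alignment of both sequences)
    The match/mismatch/gap values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Local: only the shared segment "ACDE" contributes to the score.
print(alignment_score("WWWACDEKKK", "PPACDEGG"))
# Global: "ACDE" against "ACE" has to pay for one gap.
print(alignment_score("ACDE", "ACE", local=False))
```

The only differences between the two modes are the initialization of the first row and column, the floor at zero, and where the final score is read from.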
Sequence Similarity Search (II)

Similarity Search Parameters
  Scoring Matrices – Based on Conserved Amino Acid Substitution
  • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
  • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM 62
  Gap Penalty
Search Time Comparisons
  Smith-Waterman: 10 Min
  FASTA: 2 Min
  BLAST: 20 Sec
4
Feature Representation


Features of Amino Acids: Physicochemical Properties,
Context (Local & Global) Features, Evolutionary Features
Alternative Amino Acids: Classification of Amino Acids To
Capture Different Features of Amino Acid Residues
Alphabet         Size  Features                Membership
AA Identity      20    Sequence Identity       A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
Exchange Group   6     Evolution Substitution  {HRK} {DENQ} {C} {STPAG} {MILV} {FYW}
Charge/Polarity  4     Charge and Polarity     {HRK} {DE} {CTSGNQY} {PMLIVFW}
Hydrophobicity   3     Hydrophobicity          {DENQRK} {CSTPGHY} {AMILVFW}
Structural       3     Surface Exposure        {DENQHRK} {CSTPAGWY} {MILVF}
2D Propensity    3     Secondary Structure     {AEQHKMLR} {CTIVFYW} {SGPDN}
5
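As a rough illustration of the alternative alphabets on the slide above, the sketch below re-encodes a sequence by group membership. The group definitions are copied from the slide; the function name, the dictionary keys, and the choice of one lowercase letter per group are assumptions made for this example.

```python
# Group memberships as listed on the slide; the keys are hypothetical names.
ALPHABETS = {
    "exchange_group": ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"],
    "charge_polarity": ["HRK", "DE", "CTSGNQY", "PMLIVFW"],
    "hydrophobicity": ["DENQRK", "CSTPGHY", "AMILVFW"],
    "surface_exposure": ["DENQHRK", "CSTPAGWY", "MILVF"],
    "secondary_structure": ["AEQHKMLR", "CTIVFYW", "SGPDN"],
}


def encode(sequence, alphabet):
    """Replace each residue by the letter of its group (a, b, c, ...).

    Residues that belong to no group of the chosen alphabet become "?".
    """
    lookup = {}
    for index, group in enumerate(ALPHABETS[alphabet]):
        for residue in group:
            lookup[residue] = chr(ord("a") + index)
    return "".join(lookup.get(residue, "?") for residue in sequence.upper())


print(encode("HRKDENQC", "exchange_group"))
print(encode("GVPKE", "hydrophobicity"))
```

Encoding the same sequence under several alphabets gives several parallel feature strings, each capturing a different property of the residues.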
Substitution Matrix




Likelihood of One Amino Acid Mutated into Another Over
Evolutionary Time
Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
6
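Matrices of this kind are commonly expressed as log-odds scores: how often a pair of residues is observed aligned, relative to how often it would be expected by chance. The sketch below shows that calculation only; the frequencies are hypothetical numbers chosen for illustration and do not come from PAM or BLOSUM data, and the scale factor is an assumption.

```python
import math


def log_odds_score(pair_frequency, freq_a, freq_b, scale=2.0):
    """Rounded, scaled log2 of (observed pair frequency / chance expectation).

    Positive: the pair is aligned more often than chance (conservative substitution).
    Negative: the pair is aligned less often than chance (unlikely substitution).
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_frequency / expected))


# Hypothetical frequencies: a pair seen four times more often than chance.
print(log_odds_score(0.02, 0.1, 0.05))
# Hypothetical frequencies: a pair seen four times less often than chance.
print(log_odds_score(0.00125, 0.1, 0.05))
```

Under this form, identical matches of rare amino acids score high because their chance expectation (the denominator) is small.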
BLAST
BLAST (Basic Local Alignment Search Tool)
• Extremely fast
• Robust
• Most frequently used
It finds very short segment pairs (“seeds”) between the
query and the database sequence
These seeds are then extended in both directions until
the maximum possible score for extensions of this
particular seed is reached
7
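The seed-and-extend idea on the slide above can be sketched in a few lines. This is a deliberate simplification and not the BLAST implementation: seeds here are exact 3-letter words, scoring is plain match/mismatch rather than a substitution matrix, extension is ungapped, and the drop-off threshold `x_drop` is an assumed parameter.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact k-letter word shared by both."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in direction step (+1 or -1); return (best gain, residues kept)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        gain += match if query[i] == subject[j] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break                        # score fell too far below the best seen
        i += step
        j += step
    return best_gain, best_length


def best_segment_pair(query, subject, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend every seed both ways; return (score, query_start, subject_start, length)."""
    best = None
    for i, j in find_seeds(query, subject, k):
        right_gain, right_len = extend(query, subject, i + k, j + k, 1, match, mismatch, x_drop)
        left_gain, left_len = extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
        hit = (k * match + left_gain + right_gain, i - left_len, j - left_len,
               left_len + k + right_len)
        if best is None or hit > best:
            best = hit
    return best


# The shared segment "TAYIAK" is recovered from any of its 3-letter seeds.
print(best_segment_pair("MKTAYIAKQR", "GGTAYIAKPP"))
```

Most of the speed comes from the word index: only positions that share a word are ever extended, so the bulk of the database is never aligned.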
BLAST Search

From BLAST Search Interface

Table-Format Result with BLAST Output and SSEARCH
(Smith-Waterman) Pair-Wise Alignment
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Click to see SSearch alignment
Click to see alignment
8
Blast Result & Pairwise Alignment
BLAST Alignment
9
Classification
• What is classification?
• Why do we need protein classification?
• Different levels of classification
• Basis for functional protein classification
• How to classify a protein of unknown function?
10
Classification Databases

Protein motif
  C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
  The 2 C's and the 2 H's are zinc ligands
Protein domain
  Group proteins according to the presence of a common domain
3-D structure
  Group proteins according to common 3D structure
Whole-protein
  Group proteins according to common domain architecture and length
11
Family Classification Methods






Based on Other Classification
Information
Multiple Sequence Alignment (ClustalW)
ProSite Pattern Search
Profile Search
Hidden Markov Models (HMMs)
Domain (Pfam); Whole protein (PIRSF)
Neural Networks
12
How do you build a tree?
• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree
• Integrated Analysis
13
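One common way to approach the "keep the parts that are aligned correctly" step above is to drop alignment columns that are mostly gaps before building the tree. The sketch below assumes that gap-rich columns are the unreliable ones, which is a simplification; the function name and the 50% threshold are choices made for this example, and manual inspection of the alignment is still needed.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose fraction of '-' characters is at most the threshold.

    alignment: list of equal-length aligned sequences.
    """
    count = len(alignment)
    keep = [
        column
        for column in range(len(alignment[0]))
        if sum(seq[column] == "-" for seq in alignment) / count <= max_gap_fraction
    ]
    return ["".join(seq[column] for column in keep) for seq in alignment]


# Two of the three rows carry a gap in columns 3 and 4, so those columns are dropped.
print(trim_gappy_columns(["NEK--RVA", "NEN--RVA", "NEFQFRVA"]))
```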
Multiple Sequence Alignment

ClustalW
• Progressive Pairwise Approach
  Based on Exhaustive Pairwise Alignments
• Neighbor Joining
  Joining Order Corresponding to a Tree
• Alignment Varies
  Dependent on Joining Order
14
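To make the idea of a joining order concrete, the sketch below derives one from all pairwise distances. It is not ClustalW's procedure: it swaps in simple average-linkage clustering (UPGMA style) in place of neighbor joining, and it measures distance as the fraction of differing positions between equal-length sequences instead of scoring pairwise alignments. All names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def joining_order(sequences):
    """Return the order in which clusters are merged, closest pair first.

    sequences: dict mapping a name to a sequence.
    Uses average linkage, a simpler stand-in for neighbor joining.
    """
    clusters = {name: [name] for name in sequences}
    distance = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }

    def cluster_distance(first, second):
        pairs = [(x, y) for x in clusters[first] for y in clusters[second]]
        return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)

    order = []
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: cluster_distance(*pair))
        order.append((a, b))
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
    return order


toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTAAAA", "D": "TTTTAACC"}
print(joining_order(toy))
```

A progressive aligner would align sequences and sub-alignments in this order, which is why a different joining order can produce a different final alignment.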
Multiple Alignment and Tree

From Text/Sequence Search Result or ClustalW Alignment Interface
15
16
Motif Patterns (Regular Expressions)

Signature Patterns for Functional Motifs
PCM_AC   PCM00836
PCM_ID   ALADH_PNT_1; MOTIF
PS_DE    Alanine dehydrogenase & pyridine nucleotide transhydrogenase signature 1
PS_PA    G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-x(2)-L-x-[KRH]-x-G.
PROSITE  PS00836; PDOC00654
LENGTH   Conserve = 16aa; Maximum = 29aa; Minimum = 27aa
COUNT    PST = 5 (5); PSN = 2 (2); PCT = 2 (2); PCN = 3 (3);

Predicted
NNTM_BOVIN+DEBOXM  PST  60  GVPKEIFQNEK--RVALSPAGVQALVKQG
+G02257            PCT  60  GVPKEIFQNEK--RVALSPAGVQNLVKQG
DHA_BACSH+A34261   PST   4  GIPKEIKNNEN--RVAMTPAGVVSLTHAG
DHA_BACST+B34261   PST   4  GIPKEIKNNEN--RVAITPAGVMTLVKAG
DHA_MYCTU+A43830   PST   4  GIPTETKNNEFQFRVAITPAGVAELTRRG
PNTA_ECOLI+DEECXA  PST   4  GIPRERLTNET--RVAATPKTVEQLLKLG

Not Predicted
DHA_BACSU+A49337   PSN   4  GVPKEIKNNEN--RVALTPGGVSQLISNG
PNTA_HAEIN+E64119  PSN   4  GVPRELLENES--RVAATPKTVQQILKLG
+S74638            PCN   4  GVPKEIKDQEF--RVGLTPSSVRALLSQG
+S77433            PCN  23  GVPRESFDQEC--RVAMTPDTAQKLQKLG
+F64694            PCn   4  GLVKESMDLES--RVALVPDDVALIVQKG

Member: True Positive (“T”), False Negative (“N”)
Non-Member: False Positive (“F”), True Negative
ProClass Motif Alignments
17
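Because PROSITE patterns are regular expressions in a different notation, the signature on the slide above can be translated and tested directly. The sketch below handles the common pattern elements only (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors outside brackets); the function names are hypothetical, and the sequences are the aligned segments from the slide with gap characters removed.

```python
import re

# One pattern element: optional "<", a core, an optional (n) or (n,m) repeat, optional ">".
ELEMENT = re.compile(r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into Python regular expression syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        found = ELEMENT.fullmatch(element.strip())
        if found is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = found.groups()
        if core == "x":
            core = "."                           # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # any residue except these
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


def matches(pattern, aligned_segment):
    """True if the pattern occurs in the segment once alignment gaps are removed."""
    return re.search(prosite_to_regex(pattern), aligned_segment.replace("-", "")) is not None


ALADH = "G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-x(2)-L-x-[KRH]-x-G."
PREDICTED = [
    "GVPKEIFQNEK--RVALSPAGVQALVKQG", "GVPKEIFQNEK--RVALSPAGVQNLVKQG",
    "GIPKEIKNNEN--RVAMTPAGVVSLTHAG", "GIPKEIKNNEN--RVAITPAGVMTLVKAG",
    "GIPTETKNNEFQFRVAITPAGVAELTRRG", "GIPRERLTNET--RVAATPKTVEQLLKLG",
]
NOT_PREDICTED = [
    "GVPKEIKNNEN--RVALTPGGVSQLISNG", "GVPRELLENES--RVAATPKTVQQILKLG",
    "GVPKEIKDQEF--RVGLTPSSVRALLSQG", "GVPRESFDQEC--RVAMTPDTAQKLQKLG",
    "GLVKESMDLES--RVALVPDDVALIVQKG",
]

print(prosite_to_regex(ALADH))
print([matches(ALADH, s) for s in PREDICTED])
print([matches(ALADH, s) for s in NOT_PREDICTED])
```

The sequences in the "Not Predicted" group fail because of single deviations from the pattern, for example a residue other than K, R, or H at the third position from the end, which is how a strict pattern produces false negatives among true family members.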



PIR Pattern Search
From Text/Sequence Search Result or Pattern Search Interface
One Query Sequence Against PROSITE Pattern Database
One Query Pattern (PROSITE or User-Defined) Against Sequence DB
18
Pattern Search Result (I)

One Query Sequence Against PROSITE Pattern Database
19
Pattern Search Result (II)

One Query Pattern Against Sequence Database
Display the query pattern
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
20
Profile Method

Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  • Num of Rows = Num of Aligned Positions
  • Each row contains a score for the alignment with each possible residue.
Profile Searching
  • Summation of Scores for Each Amino Acid Residue along Query Sequence
  • Higher Match Values at Conserved Positions
21
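The profile idea above (one row of residue scores per aligned position, summed along the query) can be sketched as follows. The scoring scheme is an assumption for illustration: log-odds against a uniform background with a pseudocount, with no gap handling and no sequence weighting, which real profile tools add. The alignment is a toy example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of residue scores per aligned position (row of the profile)."""
    background = 1.0 / len(AMINO_ACIDS)          # uniform background, an assumption
    profile = []
    for position in range(len(alignment[0])):
        column = [seq[position] for seq in alignment if seq[position] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for residue in AMINO_ACIDS:
            frequency = (column.count(residue) + pseudocount) / total
            row[residue] = math.log2(frequency / background)
        profile.append(row)
    return profile


def score_window(profile, segment):
    """Sum the profile scores of the residues in one query window."""
    return sum(row.get(residue, 0.0) for row, residue in zip(profile, segment))


def best_match(profile, query):
    """Slide the profile along the query; return (best score, start position)."""
    width = len(profile)
    return max(
        (score_window(profile, query[start:start + width]), start)
        for start in range(len(query) - width + 1)
    )


profile = build_profile(["GVPKE", "GIPKE", "GVPRE", "GIPTE"])
print(best_match(profile, "MMGVPKEMM"))
# The fully conserved G column scores its residue higher than the variable K column.
print(profile[0]["G"] > profile[3]["K"])
```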
PIRSF scan
Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER
The matched regions and statistics will be displayed.
Shows PIRSF that the query belongs to
Statistical data for all domains
Statistical data per domain
Alignment with consensus sequence
22
Secondary Structure Features
• α Helix: Patterns of Hydrophobic Residue Conservation Showing an I, I+3, I+4, I+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
• β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
23
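The two spacing rules on the slide above lend themselves to a simple scan. The sketch below reports every start position where all residues at the given offsets are hydrophobic; it works on a single sequence, whereas the slide refers to conservation across an alignment. The hydrophobic set {AMILVFW} is taken from the hydrophobicity alphabet earlier in the deck, and the function name is hypothetical.

```python
HYDROPHOBIC = set("AMILVFW")

HELIX_OFFSETS = (0, 3, 4, 7)     # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)    # i, i+2, i+4, i+6


def periodicity_hits(sequence, offsets):
    """Return each start position i where every residue at i + offset is hydrophobic."""
    span = max(offsets)
    return [
        i
        for i in range(len(sequence) - span)
        if all(sequence[i + offset] in HYDROPHOBIC for offset in offsets)
    ]


# Hydrophobic residues at positions 0, 3, 4, 7 fit the helix spacing only.
print(periodicity_hits("LKKLLKKL", HELIX_OFFSETS), periodicity_hits("LKKLLKKL", STRAND_OFFSETS))
# Alternating hydrophobic residues fit the strand spacing.
print(periodicity_hits("VKVTVSV", STRAND_OFFSETS))
```

A hit is only suggestive; hydrophobic periodicity is one feature among several used for secondary structure prediction.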
3D Structure
Proteins share the same fold suggesting homology
Gamma Crystallin C
Beta B1 Crystallin
24
Creation and Curation of PIRSFs
25
Integrated Bioinformatics System for
Function and Pathway Discovery


Data Integration
Associative Analysis
User Input (Local Data, Search Criteria, Report Format)
Input (Gene/Protein Expression Data)
Output (Analysis Results, Biological Interpretation)
Integrated Bioinformatics System
  Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
  Sequence Analysis Pipeline (Family Classification & Feature Identification)
  Graphical User Interface (Browsing, Querying, Navigation)
  Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)
26
Query Sequence
UniProt
Family Classification & Functional Analysis
BLAST Search
HMM Domain Search
Analytical
Pipeline
Top-Matched Superfamilies/Domains
HMM Motif Search
Pattern Search
SignalP/TMHMM
Predicted Superfamilies/Domains/Motifs/Sites/SignalPeptides/TMHs
SSEARCH
CLUSTALW
Superfamily/Domain/Motif Alignments
Family Relationships & Functional Features
27
Integrated Bioinformatics System
Gene Expression Data
Proteomic Data

Global Bioinformatics
Analysis of 1000’s of
Genes and Proteins

Pathway Discovery,
Target Identification
Integrated Protein Knowledge System
Clustering
Gene/Peptide-Protein Mapping
Expression
Pattern
Protein
List
Functional Analysis
(Sequence Analysis & Information Retrieval)
Comprehensive
Protein
Information
Matrix
Visualization &
Statistical Analysis
Pathway Discovery
(Browsing, Sorting, Visualization & Statistical Analysis)
Clustered Matrix
Clustered Graph
Pathway Map
Process Hierarchy
28
Lab Section
29
Text Search
30
Text Search Result (I)
Extend your search or start over
Choose columns to be displayed
Expand view
Pre-computed BLAST Results
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
31
Text Search Result (III)
Number of Related Seq. at 3 different E-value cut-offs
32
Text Search Result (II)
Extend your search or start over
Choose columns to be displayed
Link to PIRSF report
Curated domain architecture with links to Pfam database
Extent of family curation
33
Peptide Search
34
Peptide Search & Results
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Matching peptide highlighted in the sequence
35
Batch Retrieval Results (I)
Retrieve more sequences
Choose columns to be displayed
Links to iProClass and UniProtKB reports
36
Batch Retrieval Results (II)
Retrieve more families
Choose columns to be displayed
Links to PIRSF reports
Curated domain architecture (N- to C- termini) with links to Pfam database
37
Blast Similarity Search
38
Blast / Related Sequences Results
40
Blast Result & Pairwise Alignment
BLAST Alignment
41
Pairwise Alignment
42
Multiple Alignment Interactive
Phylogenetic Tree and Alignment
43
Phylogenetic Tree and Alignment View
44
Pattern Search (I)
45
Pattern Search (II)
Display the query pattern
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
46
PIRSF scan
47
PIRSF Report
48
PIRSF Family Hierarchy
49
Taxonomic Distribution &
Phylogenetic Pattern
50
Rabbit Alpha Crystallin A Chain: An iProClass View of the Entry
Pre-computed BLAST results
See protein synonyms
See IDs from different databases
51
alpha-Crystallin and Related Proteins
52