Transcript Slide 1

Sequence Based Analysis Tutorial
NIH Proteomics Workshop
Cecilia Arighi, Ph.D.
Protein Information Resource at
Georgetown University Medical Center
Retrieval, Sequence Search &
Classification Methods




Retrieve protein info by text / UID
Sequence Similarity Search
 BLAST, FASTA, Dynamic Programming
Family Classification
 Patterns, Profiles, Hidden Markov Models,
Sequence Alignments, Neural Networks
Integrated Search and Classification System
2
Sequence Similarity Search (I)


Based on Pair-Wise Comparisons
Dynamic Programming Algorithms



Global Similarity: Needleman-Wunsch
Local Similarity: Smith-Waterman
Heuristic Algorithms





FASTA: Based on K-Tuples (2-Amino Acid)
BLAST: Triples of Conserved Amino Acids
Gapped-BLAST: Allow Gaps in Segment Pairs
PHI-BLAST: Pattern-Hit Initiated Search
PSI-BLAST: Position-Specific Iterated Search
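The local dynamic programming approach named above (Smith-Waterman) can be sketched as follows. This is a minimal illustration only: the match/mismatch/gap values and the linear gap penalty are assumptions chosen for readability, whereas real searches would use a substitution matrix such as BLOSUM 62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative scores).

    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

A global (Needleman-Wunsch) variant would drop the floor at 0, initialize the first row and column with cumulative gap penalties, and trace back from the bottom-right cell.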
3
Sequence Similarity Search (II)

Similarity Search Parameters



Scoring Matrices – Based on Conserved Amino
Acid Substitution
 Dayhoff Mutation Matrix, e.g., PAM250 (~20%
Identity)
 Henikoff Matrix from Ungapped Alignments,
e.g., BLOSUM 62
Gap Penalty
Search Time Comparisons



Smith-Waterman: 10 Min
FASTA: 2 Min
BLAST: 20 Sec
4
Feature Representation


Features of Amino Acids: Physicochemical Properties,
Context (Local & Global) Features, Evolutionary Features
Alternative Amino Acids: Classification of Amino Acids To
Capture Different Features of Amino Acid Residues
Alphabet | Size | Features | Membership
AA Identity | 20 | Sequence Identity | A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
Exchange Group | 6 | Evolutionary Substitution | {HRK}{DENQ}{C}{STPAG}{MILV}{FYW}
Charge/Polarity | 4 | Charge and Polarity | {HRK} {DE} {CTSGNQY} {PMLIVFW}
Hydrophobicity | 3 | Hydrophobicity | {DENQRK} {CSTPGHY} {AMILVFW}
Structural | 3 | Surface Exposure | {DENQHRK} {CSTPAGWY} {MILVF}
2D Propensity | 3 | Secondary Structure | {AEQHKMLR} {CTIVFYW} {SGPDN}
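As an illustration of an alternative alphabet, the sketch below re-encodes a sequence with the six exchange groups listed on this slide. The numeric group labels ("1" to "6") and the function names are hypothetical choices made for this example.

```python
# Exchange groups as listed on the slide; the labels 1..6 are arbitrary.
EXCHANGE_GROUPS = ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"]


def build_encoder(groups):
    """Map each amino acid letter to the label of the group containing it."""
    table = {}
    for index, members in enumerate(groups, start=1):
        for residue in members:
            table[residue] = str(index)
    return table


def encode(sequence, table, unknown="X"):
    """Rewrite a sequence in the reduced alphabet; unknown letters become 'X'."""
    return "".join(table.get(residue, unknown) for residue in sequence.upper())


EXCHANGE_TABLE = build_encoder(EXCHANGE_GROUPS)
```

For example, `encode("MDVTIQHPWFKR", EXCHANGE_TABLE)` maps M, V and I to the same symbol, so conservative substitutions within a group no longer count as differences.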
5
Substitution Matrix




Likelihood of One Amino Acid Mutated into Another Over
Evolutionary Time
Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
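Matrices of the PAM and BLOSUM type are log-odds tables: each entry compares how often a pair of residues is seen aligned with how often it would be expected by chance. The sketch below shows that calculation; the frequencies used in the usage note are made-up numbers, not values from any published matrix.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=1.0):
    """Log-odds score for an aligned residue pair.

    pair_freq: observed frequency of the pair in trusted alignments
    freq_a, freq_b: background frequencies of the two residues
    scale: optional scaling factor (published matrices scale and round)
    """
    return scale * math.log2(pair_freq / (freq_a * freq_b))
```

With hypothetical background frequencies of 0.1 each, a pair observed at frequency 0.02 (twice the chance level) scores about +1, and a pair observed at 0.005 (half the chance level) scores about -1. This is why rare residues such as Trp and Cys receive high identity scores: their chance expectation is small.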
6
Secondary Structure Features


α Helix: Patterns of Hydrophobic Residue Conservation Showing the i, i+3, i+4, i+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions i, i+2, i+4, i+6
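The two periodicities above can be checked with a simple scan. This is a rough sketch: the hydrophobic set is taken from the {AMILVFW} group of the hydrophobicity alphabet on the Feature Representation slide, and the function name and the all-positions-must-match rule are assumptions made for this example, not a secondary structure predictor.

```python
# Hydrophobic residues, taken from the {AMILVFW} group (an assumption
# about which residues to count as hydrophobic).
HYDROPHOBIC = set("AMILVFW")

HELIX_OFFSETS = (0, 3, 4, 7)   # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)  # i, i+2, i+4, i+6


def periodic_hits(sequence, offsets):
    """Return start positions i where every i+offset residue is hydrophobic."""
    span = max(offsets)
    hits = []
    for i in range(len(sequence) - span):
        if all(sequence[i + k] in HYDROPHOBIC for k in offsets):
            hits.append(i)
    return hits
```

For instance, `periodic_hits("LKKLLKKL", HELIX_OFFSETS)` reports position 0, while the same sequence gives no hit with `STRAND_OFFSETS`.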
7
BLAST
BLAST (Basic Local Alignment Search Tool)
 Extremely fast
 Robust
 Most frequently used
It finds very short segment pairs (“seeds”) between the query
and the database sequence
These seeds are then extended in both directions until the
maximum possible score for extensions of this particular
seed is reached
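The seed-and-extend idea can be sketched as below. This is a simplified illustration under stated assumptions: seeds are exact word matches (real BLAST also accepts similar "neighborhood" words scored with a substitution matrix), extension is ungapped, and the scores and the `x_drop` cutoff are hypothetical values.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Extension in one direction stops when the running score falls more
    than x_drop below the best score seen so far.
    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def walk(qi, sj, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += match if query[qi] == subject[sj] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, i + w + right_len
```

Usage: `find_seeds("TTACDEFHTT", "PPACDEFHPP")` yields four overlapping seeds, and extending the seed at (3, 3) recovers the full shared segment `ACDEFH`.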
8
BLAST Search


From BLAST Search Interface
Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Click to see SSEARCH alignment
Click to see alignment
9
Blast Result & Pairwise Alignment
BLAST Alignment
10
Classification





What is classification?
Why do we need protein classification?
Different levels of classification
Basis for functional protein classification
How to classify a protein of unknown function?
11
Classification Databases

Protein motif: Group proteins according to the presence of a common motif
 Example: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
 The 2 C's and the 2 H's are zinc ligands
Protein domain: Group proteins according to common domain
3-D structure: Group proteins according to common 3D structure
Whole-protein: Group proteins according to domain architecture and length
12
Family Classification Methods






Based on Other Classification
Information
Multiple Sequence Alignment (ClustalW)
ProSite Pattern Search
Profile Search
Hidden Markov Models (HMMs)
Domain (Pfam); Whole protein (PIRSF)
Neural Networks
13
How do you build a tree?






Pick sequences to align
Align them
Verify the alignment
Keep the parts that are aligned correctly
Build and evaluate a phylogenetic tree
Integrated Analysis
14
Multiple Sequence Alignment: CLUSTALW
Pairwise alignment: Calculate distance matrix (mean number of differences per residue)
Unrooted Neighbor-Joining Tree (branch length drawn to scale)
Rooted NJ Tree (guide tree): Root placed at a position where the means of the branch lengths on either side of the root are equal
Progressive Alignment guided by the tree: Alignment starts from the tips of the tree towards the root
Thompson et al., NAR 22, 4675 (1994).
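The first CLUSTALW step, a distance matrix of the mean number of differences per residue, can be sketched as follows. The inputs are assumed to be already pairwise-aligned strings of equal length, and skipping positions that contain a gap is an assumption of this example; tree building and progressive alignment are not shown.

```python
def p_distance(aligned_a, aligned_b):
    """Mean number of differences per compared residue (gap positions skipped)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differences = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: positions with a gap are not counted
        compared += 1
        differences += x != y
    return differences / compared if compared else 0.0


def distance_matrix(aligned):
    """Return {(name1, name2): distance} for a dict of equal-length aligned sequences."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}
```

A matrix like this is the input to neighbor joining, which produces the guide tree that orders the progressive alignment.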
15
PIR Multiple Alignment and Tree

From Text/Sequence Search Result or CLUSTAL W Alignment
Interface
16
17
PIR Pattern Search


Signature Patterns for Functional Motifs
From Text/Sequence Search Result or Pattern Search Interface
A. Alignment of a region involved in catalytic activity
Create Pattern and search in database:
P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N
B. Test sequence against PROSITE database
O05689
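A PROSITE-style pattern such as the one above maps closely onto a regular expression. The converter below is a minimal sketch that handles only the elements used on these slides (single residues, `[..]` sets, `{..}` exclusions, `x`, and `(n)` or `(n,m)` repeats); anchors such as `<` and `>` are not handled, and the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # Split "x(3,4)" into core "x" and repeat bounds 3 and 4
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)
```

Usage: `re.search(prosite_to_regex(pattern), sequence)` returns a match object whose `start()` and `end()` give the matched region, or `None` if the pattern is absent.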
18
Pattern Search Result (I)
A.
One Query Pattern Against UniProtKB or UniRef100 DBs
Display the query pattern
Indicate pattern sequence region(s)
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
19
Pattern Search Result (II)
B.
One Query Sequence Against PROSITE Pattern Database
20
Profile Method


Profile: A Table of Scores to Express Family Consensus Derived
from Multiple Sequence Alignments
 Num of Rows = Num of Aligned Positions
 Each row contains a score for the alignment with each
possible residue.
Profile Searching
 Summation of Scores for Each Amino Acid Residue along
Query Sequence
 Higher Match Values at Conserved Positions
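The profile idea can be sketched in a few lines: one row of scores per aligned position, and a query window scored by summing the row entries for its residues. This is an illustration under simplifying assumptions (ungapped alignment, uniform background frequencies, a fixed pseudocount); the function names are hypothetical and real profile tools use more elaborate weighting and gap handling.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One row per aligned position; each row maps residue -> log-odds score."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        profile.append(row)
    return profile


def score_window(profile, window):
    """Sum the per-position scores for the residues in the window."""
    return sum(row.get(aa, 0.0) for row, aa in zip(profile, window))


def best_match(profile, query):
    """Return (score, start) of the best-scoring window in the query.

    Assumes the query is at least as long as the profile.
    """
    width = len(profile)
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )
```

With the toy alignment `["CAH", "CGH", "CSH"]`, the fully conserved first column gives C a higher score than any residue receives in the variable second column.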
21
Prosite PS50157 profile for Zinc finger C2H2
22
PIRSF scan


Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER
The matched regions and statistics will be displayed.
Shows PIRSF that the query belongs to
Statistical data for all domains
Statistical data per domain
Alignment with consensus sequence
23
Creation and Curation of PIRSFs
24
Integrated Bioinformatics System for
Function and Pathway Discovery


Data Integration
Associative Analysis
User Input (Local Data, Search Criteria, Report Format)
Input (Gene/Protein Expression Data)
Output (Analysis Results, Biological Interpretation)
Integrated Bioinformatics System
 Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
 Sequence Analysis Pipeline (Family Classification & Feature Identification)
 Graphical User Interface (Browsing, Querying, Navigation)
 Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)
25
Query Sequence
UniProt
Family Classification & Functional Analysis
BLAST Search
HMM Domain Search
Analytical
Pipeline
Top-Matched Superfamilies/Domains
HMM Motif Search
Pattern Search
SignalP/TMHMM
Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
SSEARCH
CLUSTALW
Superfamily/Domain/Motif Alignments
Family Relationships & Functional Features
26
Integrated Bioinformatics System
Gene Expression Data
Proteomic Data

Integrated Protein Knowledge System
Clustering
Global Bioinformatics
Analysis of 1000’s of
Genes and Proteins
Gene/Peptide-Protein Mapping
Expression
Pattern
Protein
List
Functional Analysis

Pathway Discovery,
Target Identification
(Sequence Analysis & Information Retrieval)
Comprehensive
Protein
Information
Matrix
Visualization &
Statistical Analysis
Pathway Discovery
(Browsing, Sorting, Visualization & Statistical Analysis)
Clustered Matrix
Clustered Graph
Pathway Map
Process Hierarchy
27
Lab Section
28
Rat eye lens phosphoproteomics in normal and cataract
Kamei et al., Biol. Pharm. Bull., 2005.
[Figure: 2D gels of Normal and Cataract samples; axes labeled pI and Mw, with (+) and (-) markers]
More phosphorylated spots in cataract sample.
Digestion and MS from Spot 16 gave these peptides:
MDVTIQHPWFKR
ALGPFYPSR
CSLSADGMLTFSG
YRLPSNVDQSALS
We want to identify the protein(s) that contain these peptides
Use Peptide Search
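Conceptually, a peptide search is an exact substring lookup of each peptide against every database sequence. The sketch below shows that idea only; the function name is hypothetical, and the sequences in the usage note are synthetic toy strings built for illustration, not real UniProtKB entries.

```python
def peptide_search(peptides, proteins):
    """Find which proteins contain which peptides as exact substrings.

    peptides: list of peptide strings
    proteins: dict mapping an accession to its sequence
    Returns {accession: [(peptide, position_of_first_match), ...]}
    """
    hits = {}
    for accession, sequence in proteins.items():
        found = [(pep, sequence.find(pep)) for pep in peptides if pep in sequence]
        if found:
            hits[accession] = found
    return hits
```

Usage with toy data: given `{"TOY_A": "MDVTIQHPWFKRALGPFYPSRGGG", "TOY_C": "GGGGGG"}` and the four peptides from Spot 16, only `TOY_A` is returned, with the first two peptides located at positions 0 and 12. Proteins matched by several peptides are the strongest identification candidates.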
29
Peptide Search
Restrict search to
an organism
30
Peptide Search & Results
Species restricted search
Sorting arrows
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Search in UniProtKB, 23 proteins
Link to PIRSF report
Matching peptide highlighted in the sequence
31
Batch Retrieval Results (I)
• Retrieve multiple proteins from iProClass using a specific identifier or a combination of them
• Provides a means to easily retrieve and analyze proteins when the identifiers come from different databases
Retrieve more sequences
32
Blast Similarity Search
What proteins are related to rat CRYAA?
• Perform sequence similarity search
>P24623
http://pir.georgetown.edu/pirwww/search/blast.shtml
33
Pairwise Alignment
35
PIR Text Search
(http://pir.georgetown.edu/search/textsearch.shtml)
UniProtKB Database and unique UniParc sequences
Let’s search for human crystallins
PIR protein family classification database
36
Let’s look for crystallins which have 3D structure
Refine your
search or start
over
Display PDB ID
37
Domain Display allows simultaneous comparison of the Pfam domains present in multiple proteins
Share same domain
architecture
Let’s perform a multiple alignment on the sequences containing PF00030
38
Multiple Alignment
39
Interactive Phylogenetic Tree and Alignment
Beta B1 and gamma crystallins share the same domains and SCOP fold, and share significant sequence similarity, suggesting that they are related
40
Pattern Search (I)
Select P07320 and perform a pattern search
Search for proteins containing this
pattern (PS00225) in rat
41
Pattern Search Result
Beta and gamma
Crystallins have
multiple copies of
this pattern
42
PIRSF provides a single platform where all the previous analysis has been
done by curators
Pfam domains assigned with high
confidence
Validation tag
Represents
extent of manual
curation
Link to
PIRSF report
43
Taxonomic
Distribution
Alpha-crystallin is
exclusively found in
metazoans
Domain
Architecture
Multiple Alignment
44
PIRSF scan
45
PIRSF report (I): a single platform to study proteins
Subfamily level
46
PIRSF report (II)
Cross-links to other databases
http://www.geneontology.org/
47
alpha-Crystallin and Related Proteins
Alpha crystallin beta chain
HSPs
Alpha crystallin alpha chain
48