lecture 4 - Helsingin yliopisto

“Proteomics & Bioinformatics”
MBI, Master's Degree Program in Helsinki, Finland
Lecture 4
10 May, 2007
Sophia Kossida, BRF, Academy of Athens, Greece
Esa Pitkänen, University of Helsinki, Finland
Juho Rousu, University of Helsinki, Finland
Proteomics and biology / Applications
Proteome mining: identifying as many as possible of the proteins in your sample
Protein expression profiling: identification of proteins in a particular sample as a function of a particular state of the organism or cell
Post-translational modifications: identifying how and where the proteins are modified
Protein-protein interactions / protein network mapping: determining how the proteins interact with each other in living systems
Protein quantitation or differential analysis
Functional proteomics
Structural proteomics
Databases and tools
Melanie
General workflow of proteomics analysis
Proteins/peptides
Digestion and/or separation
2D gel image acquisition and storage
MALDI, MS/MS
Identification: PMF, MS/MS
Quantification: DIGE, LC-MS & tags
Store peak lists and all metadata
External data sources: taxonomy, ontologies, bibliography…
Applications: systems biology (pathways, interactions), biomarker discovery, drug targets
General workflow of proteomics analysis: databases and tools
2D PAGE databases (Make 2D): SWISS-2DPAGE, GelBank, Cornea 2D-PAGE, World-2DPAGE
Sequence databases: EMBL Nucleotide Sequence Database, GenBank, UniProtKB/Swiss-Prot & TrEMBL, Ensembl, EST database, PIR
External data sources: KEGG, PDB, DIP, OMIM, Reactome, PROSITE, Pfam, SPIN, BOND, STRING, AmiGO, DAVID, PubMed, MEDLINE
Identification and quantification (MALDI, MS/MS): Mascot, Sequest, Aldente, Popitam, Phenyx, FindMod, Profound, PepFrag, MS-Fit, OMSSA, SearchXLinks
Imaging tools: Melanie, PDQuest, Progenesis, Delta2D
Storing/organising: ProteinScape, MSight
General workflow of proteomics analysis: 2D gel databases and imaging software
2D gel databases (data integration on the web; image data and textual information):
•SWISS-2DPAGE
•GelBank
•Cornea 2D-PAGE
•World-2DPAGE
Imaging software (the ability to compare two gels (images) and then identify differently expressed spots):
•Melanie
•PDQuest
•Progenesis
•Delta2D
ProteinScape: platform for storing and organizing data
MSight: representation of mass spectra along with data from the separation
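As a rough illustration of what "comparing two gels" amounts to once spots have been detected and matched, the sketch below flags spots whose volume changes by at least a chosen fold. The spot identifiers, volumes and the two-fold threshold are invented for illustration; image alignment, spot detection and normalization, which the imaging packages perform, are not shown.

```python
import math

def differential_spots(gel_a, gel_b, min_fold=2.0):
    """Return {spot_id: log2 ratio} for matched spots changing by >= min_fold."""
    changed = {}
    # Only spots present (matched) in both gels can be compared.
    for spot_id in gel_a.keys() & gel_b.keys():
        vol_a, vol_b = gel_a[spot_id], gel_b[spot_id]
        if vol_a <= 0 or vol_b <= 0:
            continue  # skip spots without a usable volume
        log2_ratio = math.log2(vol_b / vol_a)
        if abs(log2_ratio) >= math.log2(min_fold):
            changed[spot_id] = round(log2_ratio, 2)
    return changed

# Hypothetical normalized spot volumes for two gels.
control = {"spot1": 1200.0, "spot2": 300.0, "spot3": 800.0}
treated = {"spot1": 1250.0, "spot2": 1200.0, "spot3": 200.0}
result = differential_spots(control, treated)
```

Working on log2 ratios treats up- and down-regulation symmetrically, which is why the threshold is applied to the absolute value.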
2D Gel Databases
SWISS-2DPAGE: www.expasy.ch
GelBank: http://www.gelscape.ualberta.ca:8080/htm/gdbIndex.html
Cornea 2D-PAGE: http://www.cornea-proteomics.com/
World-2DPAGE, index of 2D gel databases: http://ca.expasy.org/ch2d/2d-index.html
Make 2D database
A software package to create, convert, publish, interconnect and keep up to
date 2-DE databases. Provided by ExPASy.
The database is queryable via description, accession or spot clicking.
Cross-references are provided to other federated 2D PAGE database entries,
Medline and SWISS-PROT
Entries are linked to images showing the experimentally determined and
theoretical protein locations.
Search via clickable images or keywords.
It runs on most UNIX-based operating systems
(Linux, Solaris/SunOS, IRIX). Being continuously
developed, the tool is evolving in concert with the
current Proteomics Standards Initiative of the
Human Proteome Organization (HUPO).
Data can be marked to be public, as well as
fully or partially private.
A highly secured administration Web interface
makes external data integration,
data export, data privacy control, database
publication and version control easy
to perform.
Federated database
A collection of databases that are treated as one entity and viewed
through a single user interface (pc.mag.com)
Robustness
Consistency
Maintenance of the database
Data quality
Limitations of current databases:
Do not contain strict/detailed descriptions of the protocol (buffers, sample
volume, staining techniques), all of which are important for gel comparisons.
Designed as 2D (and not proteomics) databases and therefore not readily
expandable to incorporate other proteomics data e.g. MS, MDLC.
Designed for reference gels, not on-going projects.
Guidelines for building a federated 2-DE database
http://ca.expasy.org/ch2d/fed-rules.html
Individual entries in the database must be accessible by a keyword search. Other
methods are possible but not required.
The database must be linked to other databases by active hypertext cross-references,
linking together all related databases. Database entries must be at
least linked to the main index.
A main index has to be supplied that provides a means of querying all databases
through one unique query point.
Individual protein entries must be available through clickable images.
2DE analysis software designed for use with federated databases must be able to
access individual entries in any federated 2DE database.
For a complete reference, see Appel et al.,
Electrophoresis 17, 540-546 (1996).
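The "one unique query point" rule can be pictured with a minimal sketch: a main index forwards a keyword to every member database and collects the hits. The class names, database names, accessions and descriptions are all hypothetical; actual federated 2-DE databases are linked over the Web rather than held in memory.

```python
class GelDatabase:
    """A toy member database: accession -> free-text description."""

    def __init__(self, name, entries):
        self.name = name
        self.entries = entries

    def search(self, keyword):
        # Case-insensitive keyword search over the descriptions.
        kw = keyword.lower()
        return sorted(acc for acc, desc in self.entries.items()
                      if kw in desc.lower())


class MainIndex:
    """Single query point that fans a keyword out to all member databases."""

    def __init__(self, databases):
        self.databases = databases

    def query(self, keyword):
        results = {}
        for db in self.databases:
            hits = db.search(keyword)
            if hits:
                results[db.name] = hits
        return results


# Hypothetical databases and entries, for illustration only.
db_one = GelDatabase("DemoGelDB-1", {"D1_0001": "Serum albumin",
                                     "D1_0002": "Actin, cytoplasmic"})
db_two = GelDatabase("DemoGelDB-2", {"D2_0001": "Albumin precursor",
                                     "D2_0002": "Keratin"})
index = MainIndex([db_one, db_two])
hits = index.query("albumin")
```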
Image analysis software
ImageMaster2D/ Melanie
PDQuest (Bio-Rad, USA)
Progenesis (Nonlinear, UK)
Delta2D (Decodon, Germany)
Melanie
http://au.expasy.org/melanie/
Melanie
http://www.2d-gel-analysis.com/
PDQuest
http://www.bio-rad.com/
Progenesis
http://www.nonlinear.com/products/progenesis/
Delta 2D
http://www.decodon.com/Solutions/Delta2D/
ProteinScape
Platform for storing, organizing, analyzing data generated
during the proteomics workflow.
•
Hierarchy:
Project
Sample
Gel
Spots
MS Data
Search Events
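The hierarchy above maps naturally onto nested records. The following is only a sketch of that nesting using Python dataclasses; the field names (`engine`, `top_hit`, `peak_list` and so on) are assumptions made for illustration and do not describe ProteinScape's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchEvent:
    engine: str    # e.g. the search engine that was run
    top_hit: str   # accession of the best match (hypothetical field)

@dataclass
class MSData:
    peak_list: List[float]
    searches: List[SearchEvent] = field(default_factory=list)

@dataclass
class Spot:
    spot_id: str
    ms_data: List[MSData] = field(default_factory=list)

@dataclass
class Gel:
    name: str
    spots: List[Spot] = field(default_factory=list)

@dataclass
class Sample:
    name: str
    gels: List[Gel] = field(default_factory=list)

@dataclass
class Project:
    name: str
    samples: List[Sample] = field(default_factory=list)

def count_search_events(project):
    """Walk Project -> Sample -> Gel -> Spot -> MS data -> Search events."""
    return sum(len(ms.searches)
               for sample in project.samples
               for gel in sample.gels
               for spot in gel.spots
               for ms in spot.ms_data)

# Hypothetical example data.
ms = MSData(peak_list=[842.51, 1045.56],
            searches=[SearchEvent("Mascot", "DEMO_1"),
                      SearchEvent("Phenyx", "DEMO_1")])
project = Project("demo", [Sample("liver", [Gel("gel-01", [Spot("s1", [ms])])])])
n_events = count_search_events(project)
```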
MSight
Specifically developed for the representation of mass
spectra along with data from the separation
http://www.expasy.org/MSight
General workflow of proteomics analysis: sequence databases
EMBL Nucleotide Sequence Database
GenBank
UniProtKB/Swiss-Prot & TrEMBL
Ensembl
EST database
PIR
EMBL Nucleotide Sequence Database
Produced in collaboration between the EBI, GenBank (USA) and the
DNA Data Bank of Japan (DDBJ).
Newly collected sequence data are exchanged, and each
database is updated daily.
EBI
GenBank
GenBank is the NIH genetic
sequence database, an annotated
collection of all publicly available
DNA sequences.
GenBank is available for searching
at NCBI
Each entry includes a concise description of the sequence, the scientific
name and the taxonomy of the source organism, and a table of features
that identifies coding regions and other sites of biological significance,
such as transcription units, sites of mutations or modifications and
repeats.
Protein translations for coding regions are included in the feature table.
Bibliographic references are included along with a link to the Medline
unique identifier for all published sequences.
http://www.psc.edu/general/software/packages/genbank/genbank.html
Search GenBank
http://www.ncbi.nlm.nih.gov/Genbank/index.html
DDBJ
INSDC
UniProt
Universal Protein Resource
It combines the information contained in UniProtKB/Swiss-Prot,
UniProtKB/TrEMBL and PIR.
It comprises three components:
•UniProt Knowledgebase (curated protein information, including
function, classification, and cross-references)
•UniProt Reference Clusters (combines closely related sequences into a
single record to speed searches)
•UniProt Archive (a repository reflecting the history of all protein
sequences)
ExPASy Proteomics Server
Expert Protein Analysis System
Proteomics server of the Swiss Institute of Bioinformatics (SIB)
is dedicated to the analysis of protein sequences and
structures as well as 2D-PAGE.
http://www.isb-sib.ch/
http://ca.expasy.org/
UniProtKB/Swiss-Prot
The UniProtKB/Swiss-Prot Protein Knowledgebase is an annotated protein
sequence database established in 1986.
It is maintained collaboratively by the SIB (Swiss Institute of Bioinformatics) and the
European Bioinformatics Institute (EBI)
http://ca.expasy.org/sprot/
Swiss Prot
TrEMBL
•UniProtKB/TrEMBL is a computer-annotated protein sequence
database complementing the UniProtKB/Swiss-Prot Protein
Knowledgebase.
•It contains the translations of all coding sequences (CDS) present
in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and
also protein sequences extracted from the literature or submitted
to UniProtKB/Swiss-Prot.
•The database is enriched with automated classification and
annotation.
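To make "translations of all coding sequences" concrete, here is a minimal sketch that translates a CDS with the standard genetic code. It is an illustration only, not the pipeline used to build TrEMBL; the example sequence is made up, and alternative genetic codes, ambiguous bases and partial codons are ignored.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_cds(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)

# Made-up coding sequence: ATG GCC AAA TGG TAA
protein = translate_cds("ATGGCCAAATGGTAA")
```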
PIR
http://pir.georgetown.edu/pirwww/
ESTdb
An expressed sequence tag (EST) is a unique
DNA sequence within a coding region of a gene
that is useful for identifying full-length genes
and serves as a landmark for mapping.
dbEST is a division of GenBank that
contains sequence data and other information
on “single-pass” cDNA sequences from a
number of organisms.
http://www.ncbi.nlm.nih.gov/dbEST/
Ensembl
Ensembl is a joint project between the
EMBL-EBI and the Wellcome Trust Sanger
Institute that aims at developing a system
that maintains automatic annotation of
large eukaryotic genomes. Access to all
the software and data is free and without
constraints of any kind.
http://www.ebi.ac.uk/ensembl/
IPI- International Protein Index
General workflow of proteomics analysis: identification and characterization tools
Mascot
Sequest
Aldente
Popitam
Phenyx
FindMod
Profound
PepFrag
MS-Fit
OMSSA
SearchXLinks
TagIdent
Proteomics tools
http://restools.sdsc.edu/biotools/biotools19.html
http://ca.expasy.org/tools/
PROWL
Identification and Characterization Tools
PMF data: Mascot (Matrix Science), Aldente (ExPASy), Profound (Rockefeller University), MS-Fit (Prospector; UCSF)
MS/MS data: Sequest, Mascot, OMSSA, X!Hunter
Identification and Characterization Tools
Popitam (ExPASy, SIB)
Phenyx (GeneBio, Switzerland)
PepFrag (Rockefeller University, USA)
SearchXLinks (Caesar, Germany)
Popitam
Popitam is designed to characterize peptides with
unexpected modifications (e.g. post-translational
modifications or mutations) by tandem mass spectrometry
(ExPASy, SIB).
http://expasy.org/cgi-bin/popitam/help.pl
Popitam results
Phenyx
Phenyx is a software platform for the identification
and characterization of proteins and peptides from
mass spectrometry data.
Developed by GeneBio in collaboration with SIB
http://www.phenyx-ms.com/about/about_phenyx.html
PEPFRAG
Searches known protein
sequences with peptide
fragment mass information
http://prowl.rockefeller.edu/
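The general idea of searching known sequences with mass information can be sketched in its simplest form, peptide mass matching: digest a candidate sequence in silico, compute the peptide masses and compare them with the observed ones. This is a deliberately simplified illustration and not PepFrag's algorithm. The trypsin rule (cleave after K or R, not before P), the 0.2 Da tolerance and the example sequence are assumptions; masses are neutral monoisotopic values, and missed cleavages and modifications are ignored.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[residue] for residue in peptide) + WATER

def match_peaks(observed, sequence, tolerance=0.2):
    """Pair each observed mass with theoretical peptides within the tolerance."""
    theoretical = [(peptide_mass(p), p) for p in tryptic_digest(sequence)]
    return [(obs, pep) for obs in observed
            for mass, pep in theoretical if abs(obs - mass) <= tolerance]

# Made-up example sequence and observed masses.
peptides = tryptic_digest("AGKPLRMCK")
hits = match_peaks([peptide_mass("MCK") + 0.05, 500.0], "AGKPLRMCK")
```

A search engine would repeat this for every database sequence and rank the candidates by how many observed masses they explain.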
SearchXLinks
http://www.searchxlinks.de/
Analysis of mass spectra of
modified, cross-linked, and
digested proteins, the amino
acid sequence of which is known
Identification and Characterization Tools
FindMod predicts potential protein post-translational modifications (PTM) and
finds potential single amino acid substitutions in peptides.
FindPept identifies peptides that result from unspecific cleavage of proteins from
experimental masses, taking into account artefactual chemical modifications,
post-translational modifications (PTM) and protease autolytic cleavage.
GlycoMod predicts possible oligosaccharide structures that occur on proteins
from their experimentally determined masses.
AACompIdent identifies proteins from their amino acid composition.
TagIdent identifies proteins by isoelectric point (pI), molecular
weight (Mw) and sequence tag, generating a list of proteins close to a
given pI and Mw.
MultiIdent achieves cross-species identification with multiple
parameters (pI, Mw, sequence tag and peptide mass fingerprinting
data).
http://au.expasy.org/tools/findmod/
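The TagIdent idea (list the proteins close to a given pI and Mw, optionally requiring a short sequence tag) can be sketched as a simple filter. This is not the ExPASy implementation: the function name, the window sizes and the three protein records below are all hypothetical.

```python
def tagident_like(proteins, pi, mw, pi_range=0.5, mw_percent=20.0, tag=None):
    """Return ids of proteins within the pI and Mw windows that contain the tag."""
    hits = []
    for protein in proteins:
        if abs(protein["pI"] - pi) > pi_range:
            continue
        if abs(protein["Mw"] - mw) > mw * mw_percent / 100.0:
            continue
        if tag and tag not in protein["sequence"]:
            continue
        hits.append(protein["id"])
    return hits

# Hypothetical protein records (ids, values and sequences are invented).
proteins = [
    {"id": "PROT_A", "pI": 5.6, "Mw": 42000.0, "sequence": "MDDDIAALVV"},
    {"id": "PROT_B", "pI": 5.9, "Mw": 45000.0, "sequence": "MKWVTFISLL"},
    {"id": "PROT_C", "pI": 8.4, "Mw": 43000.0, "sequence": "MDDDIAGHKL"},
]
close_hits = tagident_like(proteins, pi=5.7, mw=43000.0)
tag_hits = tagident_like(proteins, pi=5.7, mw=43000.0, tag="DDDIA")
```

Adding the tag narrows the candidate list, which is the point of combining the three parameters.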
General workflow of proteomics analysis: external data sources
KEGG
PDB
DIP
OMIM
Reactome
PROSITE
Pfam
SPIN
BOND
STRING
AmiGO
DAVID
PubMed
MEDLINE
KEGG
KEGG: Kyoto Encyclopedia of Genes and Genomes
•Organism specific entry points:
-KEGG Organisms
•Subject specific entry points:
-DRUG, GLYCAN, REACTION, KAAS
http://www.genome.jp/kegg/kegg2.html
KEGG
KEGG is a “biological systems” database integrating both molecular building
block information and higher-level systematic information.
Manually drawn pathway maps representing our knowledge on
the molecular interaction and reaction networks for metabolism,
other cellular processes, and human diseases.
Functional hierarchies and binary relations of KEGG objects,
including genes and proteins, compounds and reactions, drugs
and diseases, and cells and organisms.
Gene catalogs of all complete genomes and some partial
genomes with ortholog annotation (KO assignment), enabling
KEGG PATHWAY mapping and BRITE mapping.
A composite database of chemical substances and reactions
representing our knowledge on the chemical repertoire of
biological systems and environments.
Search Pathway
Carbon fixation
Search “Pathway”
“Pathways” _motifs
Reactome
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
David
http://david.abcc.ncifcrf.gov/home.jsp
Protein Data Bank
Provides a variety of tools and
resources for studying the structures
of biological macromolecules and
their relationships to sequence,
function, and disease.
http://www.rcsb.org/pdb/home/home.do
OMIM
This database is a catalog of human genes and
genetic disorders.
The database contains textual information and
references. It also contains links to MEDLINE and
sequence records
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
Protein family classification
PROSITE (ExPASy)
Pfam (Sanger Institute)
SMART (EMBL)
PROSITE
Database of protein domains, families and functional sites
Proteins can be grouped on the basis of their sequences, into a limited
number of families.
Some regions have been better conserved than others during evolution.
These regions are generally important for the function of a protein and/or
the maintenance of the three- dimensional structure.
By analyzing the constant and variable properties of such groups of
similar sequences, it is possible to derive a signature for a protein family
or domain, which distinguishes its members from all other unrelated
proteins.
http://au.expasy.org/prosite/
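A signature of this kind is commonly written as a PROSITE-style pattern, which can be turned into a regular expression for scanning sequences. The converter below is a minimal sketch covering the basic syntax elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors); the pattern and the test sequence are illustrative assumptions, and profile-based PROSITE entries are not covered.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

# Illustrative pattern in PROSITE syntax (assumed for this example).
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(PATTERN)

# Made-up sequence constructed to contain one match.
sequence = "MKCAACAAALAAAAAAAAHAAAHGG"
match = re.search(regex, sequence)
```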
PROSITE
Pfam
Multiple sequence alignments and HMMs of protein domains and
families, at Sanger Institute.
http://www.sanger.ac.uk/Software/Pfam/help/index.shtml
Browse interactions
http://smart.embl-heidelberg.de/
Structure data bases/interactions
STRING (EMBL)
BOND (Unleashed Informatics)
Cytoscape
DIP (UCLA)
iHOP
SPIN-PP (protein-protein interfaces in the PDB)
MIPS (Mammalian Protein-Protein Interaction Database)
IntAct (protein interactions from literature curation)
STRING
http://string.embl.de
STRING search results
STRING graphical
STRING_ new node
BOND
The Biomolecular Object Network Databank
http://bond.unleashedinformatics.com
Cytoscape
Cytoscape is an open source bioinformatics software
platform for visualizing molecular interactions with gene
expression profiles and other state data.
Node label position can be controlled by the new
GUI in VizMapper.
Cytoscape_ plugins
Plugins available for network and molecular profile analysis.
for example:
•Filter the network
•Find active subnetworks/ pathway modules
•Find clusters
A tool to determine which Gene Ontology (GO) categories are
statistically overrepresented in a set of genes or a subgraph of a
biological network.
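Overrepresentation of a GO category is commonly assessed with a hypergeometric test; whether a particular plugin uses exactly this test, and how it corrects for multiple testing, is not stated here, so the following is only a sketch of the underlying calculation. The function name and the gene counts in the example are invented.

```python
from math import comb

def overrepresentation_pvalue(population, annotated, selected, observed):
    """P(X >= observed) for a hypergeometric draw.

    population: number of genes in the reference set
    annotated:  genes in the reference set carrying the GO category
    selected:   size of the gene set (or subgraph) under study
    observed:   genes in that set carrying the GO category
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, selected)

# Invented numbers: 3 of 10 genes are annotated, and all 3 selected genes
# carry the annotation.
p_value = overrepresentation_pvalue(10, 3, 3, 3)
```

In practice one such p-value is computed per GO category, so a multiple-testing correction would normally follow.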
Database of Interacting Proteins
The DIP database catalogs
experimentally determined interactions
between proteins. It combines
information from a variety of sources to
create a single, consistent set of protein-protein interactions.
http://dip.doe-mbi.ucla.edu/
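Combining interactions from several sources into one consistent set requires, at minimum, treating A-B and B-A as the same interaction and remembering which sources support it. The sketch below shows only that step; it is not DIP's curation procedure, and the source names and protein identifiers are made up.

```python
def merge_interactions(*sources):
    """Merge (source_name, pairs) inputs into {unordered pair: set of sources}."""
    merged = {}
    for source_name, pairs in sources:
        for protein_a, protein_b in pairs:
            if protein_a == protein_b:
                continue  # this sketch ignores self-interactions
            # frozenset makes the pair order-independent: A-B equals B-A.
            key = frozenset((protein_a, protein_b))
            merged.setdefault(key, set()).add(source_name)
    return merged

# Hypothetical sources and protein identifiers.
two_hybrid = ("two-hybrid screen", [("P1", "P2"), ("P2", "P3")])
literature = ("literature", [("P2", "P1"), ("P4", "P4")])
merged = merge_interactions(two_hybrid, literature)
```

Keeping the supporting sources per interaction makes it possible to filter later for interactions confirmed by more than one experiment type.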
iHOP
http://www.ihop-net.org/UniPub/iHOP/