Protein Family Classification for Functional Genomics
Tutorial:
Bioinformatics Resources
(http://pir.georgetown.edu/pirwww/workshop/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
October 3, 2008
Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of
Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
• NIH Biomedical Information Science and Technology
Initiative (BISTI) Working Definition (2000) - Research,
development, or application of computational tools
and approaches for expanding the use of biological,
medical, behavioral or health data, including those to
acquire, store, organize, archive, analyze, or visualize
such data.
Molecular Biology Database Collection
1078 key databases in 14 categories
(http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D2)
Database Collection in Nucleic Acids Res.
Online Access to Database Collection
http://pir.georgetown.edu/pirwww/workshop/2005_database_update.html
2008
http://www.oxfordjournals.org/nar/database/cap/
Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Databases of protein functions
V. Databases of protein structures
VI. Proteomics databases
Lab session
Entrez Text Searches
Integrated one-stop search
(http://www.ncbi.nlm.nih.gov/Entrez/)
Lab
PubMed Literature Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
Literature mining
PMID:14640721
Lab
iProLINK: Protein Literature Mining Resource
RLIMS-P: Text mining for protein phosphorylation
BioThesaurus: Gene/protein name thesaurus: synonyms, ambiguous names…
http://pir.georgetown.edu/iprolink/
Lab
BioThesaurus: Gene/protein name searches - synonyms, ambiguous names…
Synonyms: CRYAA; crystallin, alpha A; CRYA1; HSPB4…
http://pir.georgetown.edu/iprolink/biothesaurus
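The synonym lookup that a name thesaurus supports can be sketched in a few lines. This is only an illustration, not how BioThesaurus itself is implemented: the table holds just the four names shown on the slide, and the choice of "CRYAA" as the canonical key, along with the helper names `normalize` and `canonical_name`, are assumptions made for the example.

```python
# Toy synonym table built from the names listed on the slide.
# Using "CRYAA" as the canonical key is an illustrative choice.
SYNONYMS = {
    "CRYAA": ["CRYAA", "crystallin, alpha A", "CRYA1", "HSPB4"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore formatting."""
    return " ".join(name.lower().split())


# Reverse index: normalized synonym -> canonical name.
REVERSE = {
    normalize(syn): canon
    for canon, syns in SYNONYMS.items()
    for syn in syns
}


def canonical_name(name: str):
    """Return the canonical name for a known synonym, or None otherwise."""
    return REVERSE.get(normalize(name))


if __name__ == "__main__":
    for query in ["hspb4", "Crystallin,  alpha A", "not-a-gene"]:
        print(query, "->", canonical_name(query))
```

A real thesaurus also has to deal with ambiguous names, where one string maps to several genes; the reverse index would then map to a list rather than a single key.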
Lab
RLIMS-P: Text mining for protein phosphorylation
Lab
http://pir.georgetown.edu/iprolink/rlimsp/
PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
Google type search vs.
Boolean searches: AND, OR, NOT
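The three Boolean operators correspond to set intersection, union, and difference over the entries that match each term. The sketch below shows this on a made-up three-entry index; the entry IDs, keywords, and the helper `having` are hypothetical and do not reflect the PIR search engine.

```python
# Hypothetical mini-index: entry ID -> set of lowercase keywords.
RECORDS = {
    "entry1": {"alpha", "crystallin", "lens"},
    "entry2": {"delta", "crystallin", "lyase"},
    "entry3": {"argininosuccinate", "lyase"},
}


def having(term: str) -> set:
    """Return the IDs of all entries whose keywords include the term."""
    return {rid for rid, words in RECORDS.items() if term in words}


# AND = intersection, OR = union, NOT = set difference.
crystallin_and_lyase = having("crystallin") & having("lyase")
crystallin_or_lyase = having("crystallin") | having("lyase")
crystallin_not_lyase = having("crystallin") - having("lyase")

if __name__ == "__main__":
    print("AND:", sorted(crystallin_and_lyase))
    print("OR: ", sorted(crystallin_or_lyase))
    print("NOT:", sorted(crystallin_not_lyase))
```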
Lab
PIR Text Search (II)
Search: which alpha crystallin A chains are classified in protein families?
null = absent; not null = present
Search for
synonyms
Lab
PIR Text Search (III)
Argininosuccinate
lyase (EC 4.3.2.1)
Search: which crystallins are enzymes, and what families do they belong to?
Can you find which crystallins have a 3D structure determined?
Lab
UniProt Text Search
http://www.uniprot.org/
Find proteins related to diabetes that have a 3D structure determined.
Lab
Search continues…
Lab
II. Sequence & Genomics Databases
• NCBI Resources
– GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
– RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
– Entrez Gene: Gene-centered information at NCBI.
– UniGene: Unified clusters of ESTs and full-length mRNA sequences.
– OMIM: Online Mendelian Inheritance in Man: a catalog of human genetic and genomic disorders.
• UniProt Consortium Database: Universal protein resource, a central repository of protein sequence and function.
• Model Organism Genome Databases: MGD, RGD, SGD, Flybase…
• GeneCards: Integrated database of human genes, maps, proteins and diseases.
• SNP Consortium Database (dbSNP); International HapMap Project: Genes associated with human diseases
(http://www.oxfordjournals.org/nar/database/cap/)
UniProt Consortium Databases
Universal Protein Resource
(http://www.uniprot.org)
New!
UUW
Since October 2002
6.6 million
Since July 2008
Lab
UniProt Report (I)
Sections of the record
Entry View: Sequence & Annotation
http://www.uniprot.org/uniprot/P02493
UniProt Report (II) – sequence and features
Lab
UniProt Report (III) – UniRef90
http://www.uniprot.org/uniref/?query=member%3aP02493+identity:0.9
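The UniRef90 link above carries a URL-encoded query string. A minimal sketch using only the Python standard library shows how it decodes: `%3a` is a colon and `+` is a space, so the query asks for clusters that contain member P02493 at the 0.9 identity level. Splitting the decoded query into `field:value` pairs is an assumption about this particular query, not a general parser for UniProt query syntax.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the slide.
URL = "http://www.uniprot.org/uniref/?query=member%3aP02493+identity:0.9"

# parse_qs decodes percent-escapes and turns '+' into a space.
params = parse_qs(urlsplit(URL).query)
query = params["query"][0]

# Split the decoded query into field:value pairs (valid for this simple query).
fields = dict(part.split(":", 1) for part in query.split())

if __name__ == "__main__":
    print(query)
    print(fields)
```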
Entrez Gene – Gene centric information
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq
OMIM: Online Mendelian inheritance in man
Juvenile cataract
of Down syndrome
Autosomal recessive
congenital progressive
cataract
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)
III. Protein Family Databases
• Whole Proteins
– PIRSF: Nonoverlapping Classification of Full Length Proteins Based on Evolutionary Relationship
– COG (Clusters of Orthologous Groups) of Complete Genomes
– PANTHER: Proteins Classified into Families/Subfamilies of Shared Function
– ProtoNet: Automatic Hierarchical Classification of Proteins
• Protein Domains
– Pfam: Alignments and HMM Models of Protein Domains
– SMART: Protein Domain Identification and Annotation
– CDD: Conserved Domain Database
• Protein Motifs
– PROSITE: Protein Patterns and Profiles
– BLOCKS: Protein Sequence Motifs and Alignments
– PRINTS: Compendium of Protein Fingerprints (a group of conserved motifs)
• Integrated Family Databases
– InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily…
Protein Clustering
COGs (http://www.ncbi.nlm.nih.gov/COG/)
Initial version
New version: includes eukaryotic clusters (KOGs)
Lab
PIRSF:
Full Length
Classification
iProClass
Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)
Domain Classification – Pfam Domain
(http://www.sanger.ac.uk/cgibin/Pfam/swisspfamget.pl?name=CRYAA_RABIT)
(http://pir.georgetown.edu/cgibin/ipcEntry?id=P02493)
Pfam Domain
(http://www.sanger.ac.uk/cgibin/Pfam/getacc?PF00525)
Protein Motifs: PROSITE – A database of protein
families and domains. It consists of biologically significant sites,
patterns and profiles.
(http://us.expasy.org/prosite/)
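PROSITE patterns use a compact syntax that maps closely onto regular expressions: elements are separated by `-`, `x` is any residue, `[ST]` is "S or T", `{P}` is "anything but P", `(n)` or `(n,m)` is a repeat count, and `<` / `>` anchor the pattern to the N- or C-terminus. The converter below is a simplified sketch under those rules (it does not handle anchors inside brackets), and the function name `prosite_to_regex` is made up for the example. The test pattern is the well-known N-glycosylation site pattern, `N-{P}-[ST]-{P}`; the peptide strings are invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # '<' anchors at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # '>' anchors at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":               # 'x' stands for any residue
            core = "."
        elif core.startswith("{"):    # {ABC}: any residue except A, B, or C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation site pattern, written in PROSITE syntax.
GLYCO = prosite_to_regex("N-{P}-[ST]-{P}.")

if __name__ == "__main__":
    print(GLYCO)
    for seq in ["AANGSAA", "AANPSAA"]:  # invented peptides
        hit = re.search(GLYCO, seq)
        print(seq, "->", hit.start() if hit else None)
```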
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)
Mapping of families
IV. Databases of Protein Functions
• Metabolic Pathways, Enzymes, and Compounds
– Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
– KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
– LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
– EcoCyc: Encyclopedia of E. coli Genes and Metabolism
– MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
– BRENDA: Enzyme Database
– UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
• Inter-Molecular Interactions and Regulatory Pathways
– IntAct: Protein interaction data from literature and user submission
– BIND: Descriptions of interactions, molecular complexes and pathways
– DIP: Catalogs experimentally determined interactions between proteins
– Reactome: A curated knowledgebase of biological pathways
– BioCarta: Biological pathways of human and mouse
– GO: Gene Ontology Consortium Database
• Pathway Resources - Pathguide
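EC numbers such as 4.3.2.1 (argininosuccinate lyase, used in the lab examples) are hierarchical: each of the four fields narrows the classification, and the first field names the top-level enzyme class. The sketch below splits an EC number into its levels. The function name `parse_ec` and the returned dictionary layout are assumptions for illustration; the class table lists the six classes in use when these slides were written, plus the translocase class that IUBMB added later.

```python
# Top-level EC classes. Class 7 was added after the date of these slides.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(ec: str) -> dict:
    """Split an EC number into its four levels and name the top-level class."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    levels = text.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return {
        "class_name": EC_CLASSES.get(int(levels[0]), "unknown"),
        # Cumulative prefixes: class, subclass, sub-subclass, full number.
        "prefixes": [".".join(levels[:i]) for i in range(1, 5)],
    }


if __name__ == "__main__":
    print(parse_ec("EC 4.3.2.1"))
```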
Biological Pathway Resource Collection
http://www.pathguide.org/
• Protein-protein interactions
• Metabolic pathways
• Signaling pathways
• Pathway diagrams
• Transcription factors / gene regulatory networks
• Protein-compound interactions
• Genetic interaction networks
Pathway Commons
Search across multiple pathway databases; common format for global analysis
http://www.pathwaycommons.org/pc/home.do
Lab
KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software, integrating our current knowledge
on molecular interaction networks, the information of genes and proteins, and of chemical
compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbgetbin/show_pathway?hsa00220+4.3.2.1)
BioCyc: EcoCyc/MetaCyc
Metabolic Pathways
The BioCyc Knowledge Library is a collection of
Pathway/Genome Databases (http://biocyc.org/)
BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)
Reactome: http://www.reactome.org/
• Collaboration of CSHL, EBI and GO Consortium
• Curated resource of core pathways and reactions in human biology
• Authored by biological researchers who are experts in their fields
• Cross-referenced with NCBI, Ensembl, UniProt, HapMap, KEGG…
• Inferred orthologous events in 22 non-human species (mouse, rat…)
Transforming Growth Factor (TGF) beta signaling [Homo sapiens]
Reactome: events and objects (including modified forms and complexes)
(http://reactome.org/cgibin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=170834&)
Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm] ……
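The event/object listing above alternates between reactions (events) and the molecules or complexes they produce (objects). A small parser makes that structure explicit. This is only a sketch over the six lines shown on the slide: the regular expression and the tuple layout are assumptions, and reading the number after the dot as a version suffix of the identifier is an interpretation, not something the slide states.

```python
import re

# The six lines from the slide (trailing ellipsis omitted).
LISTING = """\
Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]
"""

# kind -> identifier . version : name
LINE = re.compile(r"^(Event|Object)\s*->\s*(REACT_\d+)\.(\d+):\s*(.+?)\s*$")


def parse_listing(listing: str) -> list:
    """Return (kind, identifier, version, name) tuples for each matching line."""
    rows = []
    for line in listing.splitlines():
        match = LINE.match(line)
        if match:
            kind, ident, version, name = match.groups()
            rows.append((kind, ident, int(version), name))
    return rows


ROWS = parse_listing(LISTING)

if __name__ == "__main__":
    for row in ROWS:
        print(row)
```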
Protein-Protein Interaction Database - IntAct
(http://www.ebi.ac.uk/intact/)
Gene Ontology (GO)
(http://www.geneontology.org/)
- Molecular Function
- Biological Process
- Cellular Component
V. Databases of Protein Structures
• Protein Structure
– PDB: Structure Determined by X-ray Crystallography and NMR
– PDBsum: Summaries and analyses of PDB structures
– MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
– SWISS-MODEL Repository: Database of annotated protein 3D
models
– ModBase: Annotated comparative protein structure models
• Structure Classification
– CATH: Hierarchical Classification of Protein Domain Structures
– SCOP: Familial and Structural Protein Relationships
– FSSP: Protein Fold Classification Based on Structure-Structure Alignment
PDB: Experimental 3D Structure Repository
Rat gamma-crystallin (chains A and B)
Can you do a text
search at PIR to find
this (CRGE_RAT)?
(http://www.rcsb.org/pdb/)
Lab
PDBsum: Pictorial database providing summaries and analyses of PDB entries
Search
3-D structure summary
2-D structure summary
(http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/)
Protein Structural Classification (1)
CATH: Hierarchical domain
classification of protein structures
(http://www.cathdb.info/)
Protein Structural Classification (2)
SCOP: comprehensive description of structural and evolutionary relationships
between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)
SWISS-MODEL Repository
http://swissmodel.expasy.org/
http://swissmodel.expasy.org/repository/
A database of annotated three-dimensional
comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRBA1_MOUSE&job=2)
VI. Proteomic Resources
• GELBANK (http://gelbank.anl.gov): 2D-gel patterns of species with
completed genomes.
• SWISS-2DPAGE (http://www.expasy.org/ch2d/): index of 2D-gels
• PEP (http://cubic.bioc.columbia.edu/pep/): Predictions for Entire Proteomes: summarized analyses of protein sequences
• Integr8 (http://www.ebi.ac.uk/integr8/): A browser for information
relating to completed genomes and proteomes, based on data
contained in Genome Reviews and the UniProt proteome sets
• PRIDE (http://www.ebi.ac.uk/pride/): PRoteomics IDEntifications
database Expression Profiling databases
• GPMdb (http://gpmdb.thegpm.org/): Mass spec proteomics
Databases
• PeptideAtlas (http://www.peptideatlas.org/): compendium of peptides
identified in a large set of tandem mass spectrometry proteomic
experiments
• HUPO (http://www.hupo.org/): Human Proteome Organization to foster international proteomics initiatives.
Lab
2D-Gel Image Databases
(http://us.expasy.org/ch2d/)
Part of WORLD-2DPAGE: index to
2-D PAGE databases and services
(http://us.expasy.org/swiss-2dpage/ac=P02489)
GPMdb: MS Data Search
(http://gpmdb.thegpm.org/)
Craig, et al., J Proteome Res. 2004, 3:1234-42.
PRIDE: centralized, standards-compliant, public data repository for proteomics data
http://www.ebi.ac.uk/pride/
HUPO Plasma Proteome Project
Lab:
I. Text search / Information retrieval
1. Literature search and text mining
– Finding synonyms (BioThesaurus)
– Information extraction (e.g., protein phosphorylation sites)
2. Find the sequence for the rabbit alpha crystallin A chain
3. Find all alpha crystallin A chains classified in protein families
4. Search for crystallins that have enzyme activity
5. Find crystallins that have determined 3D structures
II. Database contents (reports)
1. Sequence & genomics databases (UniProt)
2. Protein family databases (PIRSF)
3. Database of protein functions (KEGG)
4. Databases of protein structures (PDB)
5. Proteomics databases (Swiss-2D)
Protein Examples
• Rabbit alpha crystallin A (UniProtKB: CRYAA_RABIT/P02493)
• Delta crystallin II (Argininosuccinate lyase) (UniProtKB: ARLY2_ANAPL/P24058)
• Any additional proteins of interest to you for search and retrieval
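The example proteins are written as an entry name and an accession joined by a slash, e.g. CRYAA_RABIT/P02493. For the lab it can help to split these apart, since some search forms take the accession and others the entry name. The sketch below does that; the slash-joined form is the notation used on this slide rather than a UniProt format, and the function name and dictionary keys are made up for the example.

```python
def split_uniprot_ref(ref: str) -> dict:
    """Split 'ENTRY_SPECIES/ACCESSION' into its components."""
    entry_name, accession = ref.strip().split("/")
    # UniProtKB entry names have the form <protein mnemonic>_<species mnemonic>.
    protein, species = entry_name.split("_")
    return {
        "entry_name": entry_name,
        "accession": accession,
        "protein_mnemonic": protein,
        "species_mnemonic": species,
    }


# The two example proteins named in the lab.
EXAMPLES = [
    split_uniprot_ref("CRYAA_RABIT/P02493"),
    split_uniprot_ref("ARLY2_ANAPL/P24058"),
]

if __name__ == "__main__":
    for example in EXAMPLES:
        print(example)
```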