The STRING database
Michael Kuhn
EMBL Heidelberg
protein interactions
example
Tryptophan synthase beta chain
E. coli K12
many sources
genomic context
curated knowledge
experimental evidence
literature
373 genomes
(only completely sequenced genomes)
1.5 million genes
(not proteins)
Genome Reviews
RefSeq
Ensembl
model organism databases
data integration
genomic context methods
gene fusion
gene neighborhood
phylogenetic profiles
Cell
Cellulosomes
Cellulose
automatic inference
of interactions
correct interactions
wrong associations
gene fusion
score: sequence similarity
gene neighborhood
score: sum of intergenic distances
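One possible reading of the neighborhood raw score, as a minimal sketch: for every genome in which both genes occur, take the gap between the two genes and add the gaps up. The coordinates, genome names, and helper names below are hypothetical, and treating overlapping genes as distance 0 is an assumption, not something the slides state.

```python
def intergenic_distance(gene_a, gene_b):
    """Gap in base pairs between two genes given as (start, end) tuples."""
    (_, end_first), (start_second, _) = sorted([gene_a, gene_b])
    # Assumption: overlapping genes are treated as directly adjacent.
    return max(0, start_second - end_first)


def neighborhood_raw_score(positions_by_genome):
    """Sum the intergenic distances over all genomes containing both genes."""
    return sum(intergenic_distance(a, b) for a, b in positions_by_genome.values())


# Hypothetical coordinates of the same gene pair in three genomes.
positions = {
    "genome_1": ((100, 400), (450, 900)),
    "genome_2": ((2000, 2600), (1500, 1980)),
    "genome_3": ((10, 300), (290, 700)),
}
raw_score = neighborhood_raw_score(positions)
print(raw_score)
```

Under this reading a small sum means the genes stay close together across genomes. It is still a raw score and needs calibration before it can be compared with other evidence types.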
phylogenetic profiles
SVD
singular value decomposition
(removes redundancy)
score: Euclidean distance
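The phylogenetic profile idea can be sketched roughly as follows: each gene gets a presence/absence vector over genomes, SVD reduces the vectors to a few components, and the Euclidean distance between the reduced vectors is the raw score. The toy matrix, the gene names, and the choice of `k = 2` components are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],
    "geneC": [0, 0, 1, 1, 0, 1],
}
names = list(profiles)
matrix = np.array([profiles[n] for n in names], dtype=float)

# Keep only the k strongest singular components, which damps the
# redundancy introduced by closely related genomes.
k = 2
u, s, _ = np.linalg.svd(matrix, full_matrices=False)
reduced = u[:, :k] * s[:k]


def profile_distance(a, b):
    """Euclidean distance between two genes in the reduced space."""
    i, j = names.index(a), names.index(b)
    return float(np.linalg.norm(reduced[i] - reduced[j]))


dist_ab = profile_distance("geneA", "geneB")  # identical profiles
dist_ac = profile_distance("geneA", "geneC")  # complementary profiles
print(dist_ab, dist_ac)
```

Genes with the same profile end up at (numerically) zero distance, while genes that never co-occur are far apart.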
all scores are “raw scores”
not comparable
sequence similarity
sum of intergenic distances
Euclidean distance
benchmarking
calibrate against “gold standard”
(KEGG)
raw scores
probabilistic scores
e.g. “70% chance for an association”
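A minimal sketch of the calibration step, assuming a simple binning approach: group gene pairs by raw score and, per bin, report the fraction of pairs that are also linked in the gold standard. The bin edges, the toy data, and the function name are hypothetical; the slides do not say which fitting procedure is actually used.

```python
def calibrate(raw_scores, in_gold_standard, bin_edges):
    """Per bin, return the fraction of pairs confirmed by the gold standard."""
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for score, is_hit in zip(raw_scores, in_gold_standard):
        for b in range(n_bins):
            if bin_edges[b] <= score < bin_edges[b + 1]:
                totals[b] += 1
                hits[b] += int(is_hit)
                break
    # Empty bins have no estimate.
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical raw scores and whether each pair shares a gold-standard pathway.
raw = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95]
gold = [False, False, True, False, True, True, True, True]
calibrated = calibrate(raw, gold, [0.0, 0.5, 1.0])
print(calibrated)
```

The per-bin fractions play the role of the probabilistic scores: a raw score falling into a bin is read as "this fraction of such pairs is correct according to the gold standard".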
curated knowledge
KEGG
Kyoto Encyclopedia of Genes and Genomes
Reactome
GO
Gene Ontology
primary experimental data
many sources
many parsers
BIND
Biomolecular Interaction Network Database
GRID
General Repository for Interaction Datasets
HPRD
Human Protein Reference Database
co-expression
microarray data
GEO
Gene Expression Omnibus
correlation coefficient
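As an illustration of the co-expression score, here is a small sketch that computes a correlation coefficient between two expression profiles. The slides do not say which coefficient is used; Pearson correlation is assumed here, and the expression values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels of three genes across five arrays.
expr_a = [1.0, 2.0, 3.0, 4.0, 5.0]
expr_b = [2.1, 3.9, 6.2, 8.0, 9.8]   # rises together with gene A
expr_c = [5.0, 4.0, 3.0, 2.0, 1.0]   # falls as gene A rises

r_ab = pearson(expr_a, expr_b)
r_ac = pearson(expr_a, expr_c)
print(r_ab, r_ac)
```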
literature mining
different gene identifiers
synonyms list
Medline
SGD
Saccharomyces Genome Database
The Interactive Fly
OMIM
Online Mendelian Inheritance in Man
simple scheme
co-mentioning
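The co-mentioning scheme can be sketched as counting how often two genes appear in the same abstract after mapping names through a synonyms list. The synonym table and the second and third abstracts below are toy data invented for illustration; only the first sentence comes from these slides.

```python
import re
from collections import Counter
from itertools import combinations

# Toy synonyms list: lower-cased name -> canonical gene identifier.
SYNONYMS = {
    "cyc1": "CYC1",
    "cyc7": "CYC7",
    "hap1": "HAP1",
    "hap1p": "HAP1",  # hypothetical alias entry
}


def genes_in(text):
    """Return the set of canonical gene identifiers mentioned in a text."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {SYNONYMS[w] for w in words if w in SYNONYMS}


def co_mention_counts(abstracts):
    """Count, per gene pair, the abstracts that mention both genes."""
    counts = Counter()
    for text in abstracts:
        for pair in combinations(sorted(genes_in(text)), 2):
            counts[pair] += 1
    return counts


abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1.",
    "Hap1p and CYC1 were both measured.",  # made-up sentence
    "CYC7 was measured alone.",            # made-up sentence
]
counts = co_mention_counts(abstracts)
print(counts)
```

The counts are again raw scores; they say nothing about the kind of relation, which is where the NLP approach comes in.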
more advanced
NLP
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
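To make the example sentence concrete, here is a deliberately crude stand-in for relation extraction: a single regular expression that uses the cue word "genes" and the verb phrase "is controlled by" to pull out regulator and targets. A real NLP pipeline would use tagging and parsing; the pattern and function name are hypothetical and only cover sentences of exactly this shape.

```python
import re

# Cue word "gene(s)" + list of names + verb phrase + regulator name.
PATTERN = re.compile(
    r"genes? (?P<targets>[A-Z0-9]+(?: and [A-Z0-9]+)*) "
    r"is controlled by (?P<regulator>[A-Z0-9]+)"
)


def extract_regulation(sentence):
    """Return (regulator, target) pairs found in the sentence."""
    match = PATTERN.search(sentence)
    if match is None:
        return []
    targets = match.group("targets").split(" and ")
    return [(match.group("regulator"), target) for target in targets]


sentence = (
    "The expression of the cytochrome genes CYC1 and CYC7 "
    "is controlled by HAP1."
)
relations = extract_regulation(sentence)
print(relations)
```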
calibrate against gold standard
combine all evidence
Bayesian scoring scheme
e.g.: two scores of 0.7
combined probability: 1 - (1 - 0.7)^2 = 0.91
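The arithmetic on this slide generalizes to any number of evidence scores: the combined probability is one minus the probability that every piece of evidence is wrong. A minimal sketch, assuming the evidence types are independent (the function name is hypothetical):

```python
from math import prod


def combine(scores):
    """Combine independent probabilistic scores: 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in scores)


combined = combine([0.7, 0.7])
print(round(combined, 2))
```

Adding more evidence can only raise the combined score, and a single score is returned unchanged.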
evidence transfer
evidence spread
over many species
transfer by orthology
(or “fuzzy orthology”)
von Mering et al., Nucleic Acids Research, 2005
two modes
COG mode
higher coverage
lower specificity
includes all available evidence
some orthologous groups are too large to be meaningful
proteins mode
maximum specificity
lower coverage
information will be relevant for selected species
Demo
outlook
take home message
STRING integrates information and predicts interactions
You can always go to the sources
Proteins mode: specific species
COG mode: more coverage, especially for prokaryotic genes
Acknowledgements
The STRING team
Lars Jensen
Peer Bork
Christian von Mering & group in Zurich
Berend Snel
Martijn Huynen
Thank you for your attention
Exercises:
tinyurl.com/36twzq
(or via course wiki)
Alternative server:
xi.embl.de
Bork et al., Current Opinion in Structural Biology, 2004