STRING: Large-scale data and text mining

STRING
Large-scale data and text mining
Lars Juhl Jensen
association networks
guilt by association
biological systems
protein networks
STRING
1100+ genomes
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
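
The idea behind phylogenetic profiles can be sketched in a few lines: genes that are present and absent in the same set of genomes are candidates for a functional link. The slides do not name a similarity measure, so the Jaccard index used here is an assumption, and the gene names and profiles are made-up toy data.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles.

    Each profile is a list of 0/1 flags, one per genome, in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Toy data (hypothetical): one flag per genome.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [0, 1, 1, 0, 1, 0],
}

sim_ab = profile_similarity(profiles["geneA"], profiles["geneB"])
sim_ac = profile_similarity(profiles["geneA"], profiles["geneC"])
```

Here geneA and geneB share an identical profile and would be linked, while geneA and geneC overlap in only one genome.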
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
a real example
Cell
Cellulosomes
Cellulose
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
common identifiers
quality scores
von Mering et al., Nucleic Acids Research, 2005
score calibration
von Mering et al., Nucleic Acids Research, 2005
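
One way to picture score calibration: bin the raw scores from an evidence channel and measure, per bin, the fraction of pairs that a trusted benchmark confirms. This is a minimal sketch under that assumption, not the procedure from the cited paper; the function name, bin edges, and data are hypothetical.

```python
import bisect


def calibrate(scored_pairs, bin_edges):
    """Map raw-score bins to the fraction of benchmark-confirmed pairs.

    scored_pairs: list of (raw_score, confirmed_by_benchmark) tuples.
    bin_edges: sorted lower edges of the bins.
    Returns a list of (lower_edge, fraction_confirmed) for non-empty bins.
    """
    totals = [0] * len(bin_edges)
    hits = [0] * len(bin_edges)
    for raw, confirmed in scored_pairs:
        idx = bisect.bisect_right(bin_edges, raw) - 1
        if idx < 0:
            continue  # score lies below the lowest bin edge
        totals[idx] += 1
        hits[idx] += 1 if confirmed else 0
    return [
        (edge, hits[i] / totals[i])
        for i, edge in enumerate(bin_edges)
        if totals[i]
    ]


# Toy data (hypothetical): raw scores with benchmark labels.
pairs = [
    (0.1, False), (0.2, False), (0.3, True), (0.4, False),
    (0.6, True), (0.7, False), (0.9, True), (0.95, True),
]
calibration = calibrate(pairs, [0.0, 0.5, 0.8])
```

The resulting fractions put scores from different evidence types on a common, comparable scale.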
homology-based transfer
Franceschini et al., Nucleic Acids Research, 2013
missing most of the data
text mining
>10 km
too much to read
computer
comprehensive lexicon
CDC2
cyclin dependent kinase 1
expansion rules
hCdc2
CDC2
flexible matching
cyclin-dependent kinase 1
cyclin dependent kinase 1
“black list”
SDS
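
The four ingredients above (lexicon, expansion rules, flexible matching, black list) can be combined in a small dictionary-based tagger. This is a sketch only: the entity identifiers are placeholders, the single "h" prefix rule is an assumed example of an expansion rule, and the greedy longest-match strategy is one possible choice.

```python
import re

# Hypothetical lexicon: name -> placeholder entity identifier.
LEXICON = {
    "CDC2": "ENTITY_1",
    "cyclin dependent kinase 1": "ENTITY_1",
    "SDS": "ENTITY_2",
}
# Names that are too ambiguous to tag (SDS is also a common lab reagent).
BLACK_LIST = {"SDS"}


def normalize(name):
    """Flexible matching: ignore case, hyphens and spacing differences."""
    return " ".join(re.findall(r"[A-Za-z0-9]+", name.lower()))


def expand(name):
    """Expansion rule (assumed): allow an 'h' prefix on one-word names."""
    variants = [name]
    if " " not in name:
        variants.append("h" + name)
    return variants


def build_index(lexicon, black_list):
    blocked = {normalize(n) for n in black_list}
    index = {}
    for name, entity in lexicon.items():
        if normalize(name) in blocked:
            continue
        for variant in expand(name):
            index[normalize(variant)] = entity
    return index


def tag(text, index, max_len=4):
    """Greedy longest-match lookup of token n-grams in the index."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(t.lower() for t in tokens[i:i + n])
            if key in index:
                hits.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            i += 1
    return hits


index = build_index(LEXICON, BLACK_LIST)
hits = tag("hCdc2 (cyclin-dependent kinase 1) was run on an SDS gel.", index)
```

In this example both hCdc2 and the hyphenated long form resolve to the same entity, while SDS is left untagged.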
co-mentioning
counting
within documents
within paragraphs
within sentences
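
Counting co-mentions at the three levels can be sketched as follows. The paragraph and sentence splitting is deliberately naive, and the weights used to combine the levels are hypothetical values chosen for illustration, not the ones STRING uses.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def _count(text, names, counter):
    """Add one count for every pair of names mentioned in the text."""
    present = sorted(
        n for n in names if re.search(r"\b" + re.escape(n) + r"\b", text)
    )
    for pair in combinations(present, 2):
        counter[pair] += 1


def co_mentions(documents, names):
    counts = {level: Counter() for level in WEIGHTS}
    for doc in documents:
        _count(doc, names, counts["document"])
        for para in doc.split("\n\n"):
            _count(para, names, counts["paragraph"])
            for sent in re.split(r"(?<=[.!?])\s+", para):
                _count(sent, names, counts["sentence"])
    return counts


def weighted_score(counts, pair):
    return sum(WEIGHTS[level] * counts[level][pair] for level in WEIGHTS)


docs = ["CYC1 is controlled by HAP1. CYC7 is also studied.\n\nHAP1 binds DNA."]
counts = co_mentions(docs, ["CYC1", "CYC7", "HAP1"])
score_same_sentence = weighted_score(counts, ("CYC1", "HAP1"))
score_same_paragraph = weighted_score(counts, ("CYC1", "CYC7"))
```

CYC1 and HAP1 share a sentence and so score higher than CYC1 and CYC7, which only share a paragraph.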
natural language processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
text corpus
~2 million full-text articles
~22 million abstracts
Exercise 1
Go to http://string-db.org
Query for Mycobacterium tuberculosis H37Rv adhD (Rv3086)
Change between different views
Check evidence for adhD–lipR link
Extend network to 50 interactors
Exercise 2
Go to the paper PMC2995261
Extract the protein names in table 1
Create STRING network of them
Change to “advanced” mode
Analyze for clusters and
multi-page tables
related resources
general approach
curated knowledge
experimental data
text mining
computational predictions
common identifiers
quality scores
score calibration
visualization
protein networks
string-db.org
chemical networks
stitch-db.org
subcellular localization
compartments.jensenlab.org
tissue expression
tissues.jensenlab.org
disease associations
diseases.jensenlab.org
Work on own data
string-db.org
stitch-db.org
compartments.jensenlab.org
tissues.jensenlab.org
diseases.jensenlab.org