2008_01_22_NCBO_Seminar_Series_Ontrez_Shah

The Ontrez project at NCBO
Nigam Shah
Public data repositories
• Around 1100 databases in the NAR’s 2008
database issue.
• High throughput gene expression data in
repositories such as GEO, SMD, ArrayExpress
• Clinical Trial repositories such as caBIG, TrialBank,
clinicaltrials.gov
• Guideline repositories such as www.guideline.gov
• Image repositories such as BIRN
• Observational studies such as Framingham,
NHANES, AMCIS.
Database annotation
• Ontology based annotation is not as widespread as desired
• Most annotation is still free-text
• Possible reasons:
1. Lack of a one stop shop for bio-ontologies
2. Lack of tools to annotate experimental data
• Manual → Phenote
• Automatic → ?
3. Lack of a sustainable mechanism to create
ontology based annotations
Different kinds of annotations
[Figure: metadata and annotation for the GEO dataset "Chronic Bladder Overdistension"]
• Metadata: Expression profiling of cultured bladder smooth
muscle cells subjected to repetitive mechanical
stimulation for 4 hours. Chronic overdistension
results in bladder wall thickening, associated with
loss of muscle contractility. Results identify genes
whose expression is altered by mechanical stimuli.
• ELMO1 expression is altered by mechanical stimuli
• Other experiments
• Annotation: ELMO1 associated_with actin cytoskeleton organization and biogenesis
Annotations as assertions
• Annotation = An assertion declaring a relationship
between a biomedical entity and a type in an ontology.
• e.g. p53 <associated_with> cell death
• Annotations tell us what the biologists believe to
be true (in particular or in general)
• Most annotations are based on particular observations
and are generalized during interpretation by a
biologist/curator.
• Semantics of annotations are not always declared
a priori (e.g. associated_with, involves)
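The definition above can be pictured as a three-part record. The following is only a minimal sketch: the class name `Annotation`, its field names, and the helper `as_triple` are hypothetical and are not taken from the Ontrez code. Only the p53 example comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One assertion: entity --relation--> ontology term (hypothetical model)."""
    entity: str    # biomedical entity, e.g. a gene or protein
    relation: str  # relationship name; its semantics may not be declared
    term: str      # a type (term) in an ontology


def as_triple(a: Annotation) -> str:
    """Render the assertion in the slide's notation."""
    return f"{a.entity} <{a.relation}> {a.term}"


# The example from the slide.
p53 = Annotation(entity="p53", relation="associated_with", term="cell death")
print(as_triple(p53))
```

Keeping the relation as a plain string mirrors the point that relations such as associated_with are often used without declared semantics.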
Annotations as ‘Meta-data’
• Metadata: The text description accompanying a
dataset in a database.
• Metadata-annotations should be machine
processed (and indexed using ontologies) because
• The volume is orders of magnitude more than the
summary results
• These annotations are not stating any biological fact
• Hence they don’t need a curator to create them
• These annotations are to be used to LOCATE datasets
accurately as soon as they are available in a public
repository
• We cannot afford to have a curation bottleneck
High level goal
• Process the metadata annotations to
automatically tag the ‘elements’ in public
repositories with as many ontology terms as
possible.
• For example in case of the GEO dataset 906:
• Expression profiling of cultured bladder smooth muscle cells subjected to
repetitive mechanical stimulation for 4 hours. Chronic overdistension
results in bladder wall thickening, associated with loss of muscle
contractility. Results identify genes whose expression is altered by
mechanical stimuli.
• Gets tagged with:
• Expression, Expression of bladder, bladder, smooth, bladder muscle,
muscle, smooth muscle, cells, mechanical, mechanical stimulation,
stimulation, Chronic, results, bladder overdistension, associated,
associated with, with, loss, genes, altered
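The tagging step described above can be approximated by a dictionary lookup over word n-grams. This is a sketch under stated assumptions, not the Ontrez implementation: the function `tag_text`, the `max_words` limit, and the small `TERMS` list are all hypothetical, and the sketch only matches contiguous word runs, so it would not produce a tag such as "bladder muscle" from "bladder smooth muscle".

```python
import re


def tag_text(text, terms, max_words=3):
    """Return every dictionary term that occurs in the text as a contiguous
    run of words (case-insensitive), in order of first occurrence."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    lookup = {t.lower(): t for t in terms}
    found = []
    for start in range(len(words)):
        for n in range(1, max_words + 1):
            if start + n > len(words):
                break
            phrase = " ".join(words[start:start + n])
            term = lookup.get(phrase)
            if term is not None and term not in found:
                found.append(term)
    return found


# Hypothetical mini-dictionary; a real run would load terms from ontologies.
TERMS = ["bladder", "smooth muscle", "muscle",
         "mechanical stimulation", "bladder overdistension"]

description = ("Expression profiling of cultured bladder smooth muscle cells "
               "subjected to repetitive mechanical stimulation for 4 hours.")
tags = tag_text(description, TERMS)
print(tags)
```

Overlapping matches are kept on purpose (both "smooth muscle" and "muscle"), which matches the overlapping tags listed for the GEO dataset above.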
Tagging [annotating] with ontology terms
Querying the annotation index
WHAT NEW SCIENCE DO WE ENABLE?
New Science enabled
• Nature study on image features and gene
expression
• Correlation between protein and gene
expression for cancer classification
• Correlating gene expression and drug effect
information for predicting drug efficacy
• Training and testing image processing
algorithms
Decoding global gene expression programs in liver cancer by noninvasive imaging
Eran Segal, Claude B Sirlin, Clara Ooi, Adam S Adler, Jeremy Gollub, Xin Chen, Bryan K Chan, George R Matcuk, Christopher T Barry, Howard Y Chang & Michael D Kuo
Nature Biotechnology 25, 675-680 (2007). Published online: 21 May 2007
Correlation of protein and gene expression for the stratification
of breast cancer patients
There are 20 other diseases for which this is possible!
Disease
GEO samples
Acute myeloid leukemia
Malignant melanoma
B-cell lymphoma
Prostate cancer
Renal carcinoma
Carcinoma squamous
Multiple myeloma
Clear cell carcinoma
Renal cell carcinoma
Breast carcinoma
Hepatocellular carcinoma
Carcinoma lung
Cutaneous malignant melanoma
T-cell lymphoma
Lymphoblastic lymphoma
Uterine fibroid
Medulloblastoma
Clear cell sarcoma
Leiomyosarcoma
Mesothelioma
Kaposi's sarcoma
Cardiomyopathy
Dilated cardiomyopathy
366
47
133
47
34
105
225
34
34
3
80
91
38
TMAD samples
3
43
27
15
185
175
169
63
9
1277
163
66
41
29
29
10
46
35
24
54
4
14
14
31
30
19
9
8
5
5
3
2
2
TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer
domain. Image processing researchers can extract images and scores for training and
testing classification algorithms.
Current status of the prototype
Columns: Resource | Number of elements | Resource file size (Kb) | Number of direct annotations | Number of closure annotations | Total number of 'useful' annotations
PubMed
10164 | 13461 | 187686 | 681973 | 857459
2751 | 2880 | 143134 | 484758 | 619133
ClinicalTrials.gov
43918 | 8379 | 1206939 | 6792430 | 5217115
Gene Expression Omnibus
ARRS GoldMiner
546 | 163 | 16494 | 100984 | 116234
1155 | 494 | 53082 | 290935 | 340915
ArrayExpress
TOTAL | 58534 | 25377 | 1607335 | 8351080 | 7150856
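The status table distinguishes direct annotations from closure annotations, but the slide does not define the latter. The sketch below assumes that a closure annotation is one implied by following parent (is_a) links upward from a directly matched term. The function names, the toy hierarchy in `PARENTS`, and the element id `dataset-1` are hypothetical and are not taken from a real ontology release.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure_annotations(direct, parents):
    """Map each element to the ancestor terms implied by its direct terms,
    excluding terms it is already directly annotated with."""
    expanded = {}
    for element, terms in direct.items():
        implied = set()
        for term in terms:
            implied |= ancestors(term, parents)
        expanded[element] = implied - set(terms)
    return expanded


# Hypothetical toy hierarchy (child -> parents), for illustration only.
PARENTS = {
    "metastatic melanoma": ["melanoma"],
    "melanoma": ["skin neoplasm"],
    "skin neoplasm": ["neoplasm"],
}
DIRECT = {"dataset-1": ["metastatic melanoma"]}

closure = closure_annotations(DIRECT, PARENTS)
print(sorted(closure["dataset-1"]))
```

Under this assumption, a query for a general term would also retrieve elements tagged only with a more specific term, which would explain why closure counts exceed direct counts in the table.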
Ontrez: Target resources
Columns: Papers, Datasets, mRNA expression, Protein expression, Guidelines, GWAS, Clinical Trials, RCT reports, Trial description, Treatments, Drugs, Phenotype, text, images, Animal models, Alleles and Genotype, Genes, Variations
Metastatic Melanoma
3330, 7, 76, 237, 1, 1, 314, 1, 2, 47, 0, 0
Invasive Melanoma
Melanoma in situ
Spindle Cell Melanoma
Where can we go?
• Become a service for ‘annotating’ biomedical text.
– People send us text, we send back recognized concepts
(maybe even relationships)
– Given a set of concepts, we provide a similarity metric
between them
– Both these services can be plugged into a variety of
community and collaborative annotation tools
• Become ‘the one stop shop’ for finding items across
a wide variety of resources …
– Integrate on the ‘disease’ dimension. Gene cards exist,
disease cards don’t
– Focus on approx. 15 resources in the next year.
– PDB and PLoS are interested
Research questions - 1
Columns: Genes/Proteins | Diseases | Drugs | body parts | developmental stages | Pathways | processes | genetic markers
SNOMEDCT | .. | X | .. | .. | .. | .. | .. | ..
RxNORM, INOH, NCIT, Gene Ontology (BP), FMA, Cell type Ontology
.. .. .. .. .. X X .. .. .. .. .. .. .. .. .. X .. .. .. .. .. .. .. .. .. .. .. .. .. X .. .. .. .. X .. .. .. .. .. .. .. .. .. .. .. .. .. .. X .. .. .. .. ..
Mouse anatomy and development | .. | .. | .. | X | X | .. | .. | ..
Zebrafish anatomy and development | .. | .. | .. | X | X | .. | .. | ..
Mammalian Phenotype
Research questions - 2
Columns: Genes/Proteins | Diseases | Drugs | body parts | developmental stages | Pathways | processes | genetic markers
GATE | .. | .. | .. | .. | .. | .. | .. | ..
UMLS-Query | .. | .. | .. | .. | .. | .. | .. | ..
mgrep | .. | .. | .. | .. | .. | .. | .. | ..
MetaMap | .. | .. | .. | .. | .. | .. | .. | ..
UPenn (conditional random fields) | .. | .. | .. | .. | .. | .. | .. | ..
Language Modeling methods | .. | .. | .. | .. | .. | .. | .. | ..
Credits and collaborations
• Clement Jonquet
• Nipun Bhatia
• Manhong Dai
• Fan Meng
• Brian Athey
• Mark Musen