coling_04_biowordnet.. - Buffalo Ontology Site

The Unbearable Lightness of
Biomedical Informatics
Barry Smith
Saarbrücken/Buffalo
http://ontologist.com
1
if Medical WordNet* is the
solution
what is the problem?
*Coling Proceedings, Vol. 1, pp. 371-380
2
3
Cerebellar tumor
4
Organism
Organ
10^-1 m
Tissue
Cell
10^-5 m
Organelle
Protein
DNA
10^-9 m
5
The quantity-quality divide
30,000 genes in human
200,000 proteins
100s of cell types
100,000s of disease types
1,000,000s of biochemical pathways
(including disease pathways)
… legacy of Human Genome Project
… and of attempts to institute the
electronic health record
6
Organism
Organ
10^-1 m
Tissue
Cell
10^-5 m
Organelle
Protein
DNA
10^-9 m
7
FUNCTIONAL GENOMICS
proteomics,
reactomics,
metabonomics,
toxicopharmacogenomics
phenomics,
behaviouromics,
…
8
The method of annotations
Organism
Organ
10^-1 m
Tissue
Cell
10^-5 m
Organelle
Protein
DNA
10^-9 m
9
The method of indexing
Organism
Organ
10^-1 m
Tissue
Cell
10^-5 m
Organelle
Protein
DNA
10^-9 m
10
The Gene Ontology
menopause
sensitivity to blue light
heptolysis
11
12
How can we overcome incompatibilities
between different scientific index
terms?
immunology
genetics
cell biology
13
One answer (statistical)
computational linguistics
Pattern recognition
based on string
searches
14
String searches need constraints
we can’t leave it to luck to overcome
terminological incompatibilities
15
Remember: different disciplines use
different terminologies to refer to the same
objects, processes, and features in reality
immunology
genetics
cell biology
16
An alternative answer:
“Ontology”
17
Ontology, roughly:
Overcome terminological incompatibilities
by creating a standardized framework into
which diverse vocabularies can be mapped
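The idea of mapping diverse vocabularies into one standardized framework can be sketched roughly as follows. This is only an illustration: the identifiers (`CELL:0001`, `CELL:0002`), the assignment of terms to disciplines, and the helper names are all hypothetical and do not come from any real terminology system.

```python
# Hypothetical mapping tables: each discipline's own term points to a
# shared identifier in the standardized framework. All identifiers and
# the term-to-discipline assignments are invented for illustration.
MAPPINGS = {
    "immunology": {"T lymphocyte": "CELL:0001"},
    "cell biology": {"T cell": "CELL:0001", "erythrocyte": "CELL:0002"},
    "genetics": {"red blood cell": "CELL:0002"},
}


def to_standard(discipline, term):
    """Return the shared identifier for a discipline's term, or None if unmapped."""
    return MAPPINGS.get(discipline, {}).get(term)


def same_referent(discipline_1, term_1, discipline_2, term_2):
    """True when both terms are mapped and point to the same shared identifier."""
    id_1 = to_standard(discipline_1, term_1)
    id_2 = to_standard(discipline_2, term_2)
    return id_1 is not None and id_1 == id_2


print(same_referent("immunology", "T lymphocyte", "cell biology", "T cell"))
print(same_referent("immunology", "T lymphocyte", "genetics", "red blood cell"))
```

Two terms count as referring to the same thing only when both have been explicitly mapped; an unmapped term is never silently matched by its spelling.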
18
Kinds of Ontologies
[Figure: diagram of kinds of ontologies. Labels: Terms; ‘ordinary’ Glossaries; ad hoc Hierarchies (Yahoo!); Data Dictionaries (EDI); structured Glossaries; Thesauri; XML DTDs; Principled, informal hierarchies; DB Schema; XML Schema; Data Models (UML, STEP); formal Taxonomies; Frames (OKBC); Description Logics (DAML+OIL); General Logic. Group labels: Glossaries & Data Dictionaries; Thesauri, Taxonomies; MetaData, XML Schemas, & Data Models; Formal Ontologies & Inference.]
19
Michael Gruninger
Kinds of Ontologies
A shared vocabulary plus a specification of its intended meaning
Two extremes
meaning specified
explicitly in a logically
rigorous way
20
Kinds of Ontologies
[Figure: diagram of kinds of ontologies. Labels: Terms; ‘ordinary’ Glossaries; ad hoc Hierarchies (Yahoo!); Data Dictionaries (EDI); structured Glossaries; Thesauri; XML DTDs; Principled, informal hierarchies; DB Schema; XML Schema; Data Models (UML, STEP); formal Taxonomies; Frames (OKBC); Description Logics (DAML+OIL); General Logic. Group labels: Glossaries & Data Dictionaries; Thesauri, Taxonomies; MetaData, XML Schemas, & Data Models; Formal Ontologies & Inference.]
21
Kinds of Ontologies
A shared vocabulary plus a specification of its intended meaning
Too expensive
meaning specified
explicitly in a logically
rigorous way
22
Kinds of Ontologies
A shared vocabulary plus a specification of its intended meaning
Two extremes
Meaning specified
informally via natural
language
23
Work on biomedical ontologies grew out
of work on medical thesauri and
nomenclatures
24
Kinds of Ontologies
[Figure: diagram of kinds of ontologies. Labels: Terms; ‘ordinary’ Glossaries; ad hoc Hierarchies (Yahoo!); Data Dictionaries (EDI); structured Glossaries; Thesauri; XML DTDs; Principled, informal hierarchies; DB Schema; XML Schema; Data Models (UML, STEP); formal Taxonomies; Frames (OKBC); Description Logics (DAML+OIL); General Logic. Group labels: Glossaries & Data Dictionaries; Thesauri, Taxonomies; MetaData, XML Schemas, & Data Models; Formal Ontologies & Inference.]
25
Fruit similarTo Vegetable
Fruit NarrowerTerm Orange
Orange synonymWith Apfelsine
Graph with labelled edges (similarTo,
NarrowerTerm, synonymWith)
Fixed set of edge labels (a.k.a.
relations)
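A thesaurus of this kind can be sketched as a set of labelled edges whose labels are drawn from a fixed list. The `Thesaurus` class and its method names are hypothetical, and the direction chosen for the NarrowerTerm edge (from Fruit to Orange) is an assumption about the original diagram.

```python
# The fixed set of edge labels (relations) named on the slide.
ALLOWED_RELATIONS = frozenset({"similarTo", "NarrowerTerm", "synonymWith"})


class Thesaurus:
    """A graph of terms joined by edges carrying one of a fixed set of labels."""

    def __init__(self):
        self.edges = set()  # (source, relation, target) triples

    def add(self, source, relation, target):
        # Reject any label outside the fixed set.
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"unknown relation: {relation!r}")
        self.edges.add((source, relation, target))

    def related(self, source, relation):
        """Targets reachable from `source` over edges labelled `relation`."""
        return sorted(t for s, r, t in self.edges if s == source and r == relation)


thesaurus = Thesaurus()
thesaurus.add("Fruit", "similarTo", "Vegetable")
thesaurus.add("Fruit", "NarrowerTerm", "Orange")  # edge direction is assumed
thesaurus.add("Orange", "synonymWith", "Apfelsine")

print(thesaurus.related("Fruit", "NarrowerTerm"))
```

Nothing in this structure says what the labels mean; it only restricts which labels may appear, which is where a thesaurus stops short of a formally defined ontology.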
26
Goble & Shadbolt
Unified Medical Language System (UMLS)
UMLS Metathesaurus:
1 million biomedical concepts
2.8 million concept names
from more than 100 controlled vocabularies
and classifications
built by US National Library of Medicine
27
UMLS Source Vocabularies
MeSH – Medical Subject Headings
…
ICD – International Classification of Diseases
…
GO – Gene Ontology
…
FMA – Foundational Model of Anatomy
…
28
To reap the benefits of standardization
we need to make ONE SYSTEM out of
many different terminologies
=
UMLS “Semantic Network”
nearest thing to an “ontology” in the UMLS
29
UMLS SN
Alexa McCray, “An Upper Level Ontology for the Biomedical Domain”,
Comparative and Functional Genomics, 4 (2003), 80-84.
30
UMLS SN
134 Semantic Types
54 types of edges (relations)
yielding a graph containing more than
6,000 edges
31
Fragment of UMLS SN
32
33
34
UMLS SN Top Level
entity
physical object
event
conceptual entity
organism
35
conceptual entity
Organism Attribute
Finding
Idea or Concept
Occupation or Discipline
Organization
Group
Group Attribute
Intellectual Product
Language
36
conceptual entity
Organism Attribute
Finding
Idea or Concept
Occupation or Discipline
Organization
Group
Group Attribute
Intellectual Product
Language
37
Idea or Concept
Functional Concept
Qualitative Concept
Quantitative Concept
Spatial Concept
Body Location or Region
Body Space or Junction
Geographic Area
Molecular Sequence
Amino Acid Sequence
Carbohydrate Sequence
Nucleotide Sequence
38
Idea or Concept
Functional Concept
Qualitative Concept
Quantitative Concept
Spatial Concept
Body Location or Region
Body Space or Junction
Geographic Area
Molecular Sequence
Amino Acid Sequence
Carbohydrate Sequence
Nucleotide Sequence
39
Idea or Concept
Functional Concept
Qualitative Concept
Quantitative Concept
Spatial Concept
Body Location or Region
Body Space or Junction
Geographic Area
Molecular Sequence
Amino Acid Sequence
Carbohydrate Sequence
Nucleotide Sequence
40
Idea or Concept
Functional Concept
Qualitative Concept
Quantitative Concept
Spatial Concept
Body Location or Region
Body Space or Junction
Geographic Area
Molecular Sequence
Amino Acid Sequence
Carbohydrate Sequence
Nucleotide Sequence
41
Lake Geneva
is an Idea or Concept
42
Idea or Concept
Functional Concept
Qualitative Concept
Quantitative Concept
Spatial Concept
Body Location or Region
Body Space or Junction
Geographic Area
Molecular Sequence
Amino Acid Sequence
Carbohydrate Sequence
Nucleotide Sequence
43
UMLS
Fingers is_a Body Location or Region
Hand is_a Body Part, Organ, or Organ Component
hand part_of body
BUT NOT
fingers part_of hand
44
Problem: Running together of
concepts and entities in reality
bioinformatics à la UMLS SN
(like many “knowledge engineering” disciplines)
floats free from reality
in a conceptual world
of its own creation
45
Blood Pressure Ontology
The hydraulic equation:
BP = CO*PVR
arterial blood pressure (BP) is directly
proportional to the product of blood flow
(cardiac output, CO) and peripheral
vascular resistance (PVR).
46
UMLS SN
blood pressure is an Organism
Function
cardiac output is a Laboratory or
Test Result or Diagnostic
Procedure
47
BP = CO*PVR thus asserts that
blood pressure is proportional
either to a laboratory or test result
or to a diagnostic procedure
48
Problem: Confusion of reality with
our (ways of gaining) knowledge
about reality
49
UMLS Semantic Network
entity
physical object
conceptual entity
50
Physical Object
Substance
Food
Chemical
Body
51
Chemical
Chemical Viewed Structurally
Chemical Viewed Functionally
52
Problem: Confusion of objects
with our ways of referring to
objects
53
Chemical
Chemical Viewed Structurally
Inorganic Chemical
Organic Chemical
Chemical Viewed Functionally
Enzyme
Biomedical or Dental Material
54
This multiple inheritance leads to
errors in coding
Gene Ontology will eliminate multiple
inheritance
55
UMLS Semantic Network
entity
is_a
physical object
conceptual entity
organism
56
UMLS SN
is_a =def.
If one item ‘is_a’ another item
then the first item is more specific
in meaning than the second item.
(Italics added)
57
fish is_a vertebrate
copulation is_a biological process
both testes is_a testis
Nazi is_a Nazism
plant parts is_a plant
58
59
What are the nodes in this graph?
Almost all nodes are linked to other nodes
by a multiplicity of different types of edges
Compare: swimming is healthy
swimming has 8 letters
60
Semantic Network Definition:
Concept =def. An abstract concept, such as a
social, religious, or philosophical concept
UMLS Definition:
Concept =def. A class of synonymous terms
61
62
How can concepts figure as
relata of these relations?
part_of =def. Composes, with one or
more other physical units, some larger
whole
causes =def. Brings about a condition or
an effect.
contains =def. Holds or is the receptacle
for fluids or other substances.
63
How can a set of synonymous terms
serve as a receptacle for fluids or
other substances?
How can sets of synonymous terms
stand in relations such as affects or
causes?
64
connected_to =def.
Directly attached to another
physical unit as tendons are
connected to muscles.
How can a concept be directly attached to
another physical unit?
65
What are the relata which are
linked by the edges in the SN
graph?
66
To answer this question
we need to distinguish clearly between
concepts and classes:
concepts are creatures of cognition
classes are invariants (types, kinds,
universals) out there in reality
67
If ontologies are about
meanings / concepts
it becomes impossible to deal
coherently with those relations between
entities in reality which involve appeal to
both classes and their instances.
68
Illustration re: part_of
heart part_of human
human heart part_of human
testis part_of human
human testis part_of human
69
For instances:
part_of = instance-level parthood
(for example between Mary and her heart)
For classes
A part_of B =def. given any instance a of A
there is some instance b of B such that a
part_of b
This is an assertion about As.
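The class-level definition can be checked mechanically against a record of instances. The sketch below is illustrative only: the individuals and the helper names are hypothetical, and a finite toy record can merely approximate a claim that is meant to hold for all instances.

```python
# Toy instance data; the individuals are hypothetical.
INSTANCE_OF = {
    "mary": "human",
    "john": "human",
    "marys_heart": "heart",
    "johns_heart": "heart",
}

# Instance-level parthood, stored as (part, whole) pairs.
PART_OF = {("marys_heart", "mary"), ("johns_heart", "john")}


def instances(cls):
    """All recorded instances of the given class."""
    return [i for i, c in INSTANCE_OF.items() if c == cls]


def class_part_of(a_cls, b_cls):
    """A part_of B: every instance a of A is part_of some instance b of B."""
    return all(
        any((a, b) in PART_OF for b in instances(b_cls))
        for a in instances(a_cls)
    )


print(class_part_of("heart", "human"))  # True: each recorded heart has a human whole
print(class_part_of("human", "heart"))  # False: no human is part of a heart
```

The quantifier runs over the instances of A only, which is why the definition is an assertion about As and says nothing about whether every B has an A as part.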
70
a adjacent_to b
(instance-level adjacency, for example
between Mary’s head and Mary’s neck)
For classes:
A adjacent_to B =def. given any instance
a of A there is some instance b of B
which is such that a adjacent_to b
71
A adjacent_to B
as an assertion about classes
is never an assertion about As exclusively
72
A adjacent_to B =def.
given any instance a of A there is some
instance b of B which is such that a
adjacent_to b
and
given any instance b of B there is some
instance a of A which is such that a
adjacent_to b
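The two-clause definition can be sketched in the same style. Again the individuals and function names are hypothetical, and `unpaired_neck` is an invented instance included only to show the second clause failing.

```python
def instances(cls, instance_of):
    """All recorded instances of the given class."""
    return [i for i, c in instance_of.items() if c == cls]


def class_adjacent_to(a_cls, b_cls, instance_of, adjacent):
    """A adjacent_to B holds only if both clauses of the definition hold."""
    a_items = instances(a_cls, instance_of)
    b_items = instances(b_cls, instance_of)
    # Clause 1: every a of A is adjacent_to some b of B.
    forward = all(any((a, b) in adjacent for b in b_items) for a in a_items)
    # Clause 2: every b of B has some a of A such that a adjacent_to b.
    backward = all(any((a, b) in adjacent for a in a_items) for b in b_items)
    return forward and backward


instance_of = {"marys_head": "head", "marys_neck": "neck"}
adjacent = {("marys_head", "marys_neck")}  # instance-level adjacency
paired = class_adjacent_to("head", "neck", instance_of, adjacent)

# Add a hypothetical neck with no recorded head next to it.
instance_of_extra = dict(instance_of, unpaired_neck="neck")
unpaired = class_adjacent_to("head", "neck", instance_of_extra, adjacent)

print(paired, unpaired)  # True False
```

With the extra neck, the first clause still holds for every head, yet the class-level assertion fails, which illustrates why it cannot be read as an assertion about As exclusively.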
73
Almost all of the 54 types of
edges in SN are dealt with
incoherently
part_of HAS INVERSE has_part
nucleus part_of cell
cell has_part nucleus
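One reading of the difficulty is that inverting a relation at the level of instances does not yield a valid inverse at the level of classes. The sketch below assumes a class-level has_part defined by analogy with the class-level part_of given earlier in the talk; that analogous definition, the instance names, and the helper functions are assumptions for illustration. The instance `c2` stands in for a cell without a nucleus, such as a mature mammalian red blood cell.

```python
# Toy instance data (hypothetical): one nucleus, two cells.
INSTANCE_OF = {"n1": "nucleus", "c1": "cell", "c2": "cell"}

# Instance-level parthood as (part, whole) pairs; c2 has no nucleus.
PART_OF = {("n1", "c1")}


def instances(cls):
    """All recorded instances of the given class."""
    return [i for i, c in INSTANCE_OF.items() if c == cls]


def class_part_of(a_cls, b_cls):
    """Every instance of A is part of some instance of B."""
    return all(
        any((a, b) in PART_OF for b in instances(b_cls))
        for a in instances(a_cls)
    )


def class_has_part(a_cls, b_cls):
    """Assumed analogue: every instance of A has some instance of B as part."""
    return all(
        any((b, a) in PART_OF for b in instances(b_cls))
        for a in instances(a_cls)
    )


print(class_part_of("nucleus", "cell"))   # True: the nucleus is part of a cell
print(class_has_part("cell", "nucleus"))  # False: c2 has no nucleus
```

Under these definitions "nucleus part_of cell" can hold while "cell has_part nucleus" fails, so the second cannot be obtained from the first simply by declaring has_part the inverse of part_of.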
74
75
Acquired Abnormality affects Fish
Experimental Model of Disease affects
Fungus
Food causes Experimental Model of
Disease
Bacterium causes Experimental Model of
Disease
Biomedical or Dental Material causes
Mental or Behavioral Dysfunction
Manufactured Object causes Disease or
Syndrome
Vitamin causes Injury or Poisoning
76
How to do better?
77
How to do better?
How to create a network of biomedically
relevant terms/classes, with coherently
defined relations between them, to
which expert terms of the UMLS can be
assigned in a maximally intelligible
way?
78
What linguistic framework
is shared by immunologists,
geneticists and cell biologists,
by phenobehavioromists and by
toxicopharmacogenomists?
79
Answer:
the natural language they all use to
talk about biological (biomedical)
phenomena
80
BioWordNet
joint work with
Christiane Fellbaum
(see paper in Proceedings)
81
BioWordNet
use WordNet’s biomedical vocabulary
to create a better alternative to UMLS SN
82
Strengths of WordNet 2.0
Open source
Very broad coverage
Is-a / part-of architecture
Tool for automatic sense disambiguation
83
Weaknesses of WordNet 2.0
Problems with relations
Mixes up expert and non-expert vocabulary
Errors
Gaps
Noise
all of which prevent WordNet from being used in scientific
contexts as a substitute for UMLS SN
84
Fix WordNet’s relations by using
the methodology outlined above
already applied to:
Foundational Model of Anatomy
Gene Ontology
Open Biological Ontologies
85
Institute for Formal Ontology and
Medical Information Science
Saarbrücken
http://ifomis.org
86
WordNet mixes up expert
and non-expert vocabulary,
both current and medieval:
suppuration#2 {pus, purulence,
suppuration, ichor, sanies, festering}
87
WordNet contains biomedically
relevant errors
snore-sleep
WordNet: if someone snores, then he
necessarily also sleeps
snoring = the respiratory induced
vibration of glottal tissues
associated not only with sleep but also
with relaxation or obesity
88
WordNet has too much noise for
purposes of scientific applications
89
13 senses of ‘feel’ as a verb
experience – She felt resentful
find – I feel that he doesn't like me
feel – She felt small and insignificant
feel – We felt the effects of inflation
feel – The sheets feel soft
grope – He felt for his wallet
finger – Feel this soft cloth!
explore – He felt his way around the dark
room
feel – It feels nice to be home again
feel – He felt the girl in the movie theater
90
Medical senses of ‘feel’
palpate – examine a body part by palpation:
The runner felt her pulse.
sense – perceive by a physical sensation,
e.g. coming from the skin or muscles:
He felt his flesh crawl
feel – seem with respect to a given
sensation:
My cold is gone – I feel fine today
91
WordNet has gaps even in its
coverage of biomedical natural
language
92
WordNet senses of ‘regulation’
1. regulation (ordinance, rule)
2. rule, regulation -- (a principle that customarily
governs behavior; "short haircuts were the
regulation")
3. regulation -- (the state of being controlled or
governed)
4. regulation -- (the ability of an early embryo to
continue normal development after its structure has
been somehow damaged)
5. regulation, regularization, regularisation -- (the act
of bringing to uniformity)
6. regulation, regulating -- (the act of controlling
according to rule; "fiscal regulations are in the hands
of politicians")
93
Biological sense of ‘regulation’:
A process that modulates the frequency,
rate or extent of behavior
(Gene Ontology)
94
WordNet senses of ‘inhibition’
1. inhibition, suppression -- ((psychology)
the conscious exclusion of unacceptable
thoughts or desires)
2. inhibition -- (the quality of being inhibited)
3. inhibition -- (the process whereby nerves
can retard or prevent the functioning of an
organ or part; "the inhibition of the heart by
the vagus nerve")
4. prohibition, inhibition, forbiddance -- (the
action of prohibiting or forbidding)
95
Biological senses of ‘inhibition’
much broader
inhibition = negative regulation
enzymes can be inhibited
reactions can be inhibited
… and not only by nerves
96
WordNet senses of ‘binding’
1. binding -- (the capacity to attract and hold
something)
2. binding -- (a strip sewn over or along an
edge for reinforcement or decoration)
3. dressing, bandaging -- (the act of applying
a bandage)
4. binding, book binding ("the book had a
leather binding")
97
biological sense of ‘binding’
interacting selectively with
(Gene Ontology)
98
Remove errors, noise and gaps
in a two-stage process
1. select biomedically relevant natural-language
terms from WordNet 2.0,
extended by standard biomedical
information sources
2. validate these terms and the relations
between them
99
Validation
each arc in BWN is converted into a natural-language
sentence, e.g. ‘mumps is an inflammation’
via controlled human-subjects experiments,
these sentences are accredited
1. as intelligible by non-experts
2. as true by experts
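The validation step can be sketched as two small functions: one that turns an arc into a sentence, and one that decides whether the sentence is accredited. The sentence templates, the article rule, the vote format, and the 0.75 threshold are all assumptions made for illustration; the slides do not specify how the experiments are scored.

```python
# Hypothetical sentence templates, one per relation.
TEMPLATES = {
    "is_a": "{source} is {article} {target}",
    "part_of": "{source} is part of {article} {target}",
}


def article(noun):
    """Crude English article choice based on the first letter (an assumption)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def arc_to_sentence(source, relation, target):
    """Convert one arc into a natural-language sentence."""
    return TEMPLATES[relation].format(
        source=source, article=article(target), target=target
    )


def is_accredited(nonexpert_intelligible, expert_true, threshold=0.75):
    """Accredit a sentence when enough non-experts find it intelligible
    AND enough experts judge it true. Votes are lists of booleans; the
    threshold value is hypothetical."""
    def share(votes):
        return sum(votes) / len(votes) if votes else 0.0

    return (share(nonexpert_intelligible) >= threshold
            and share(expert_true) >= threshold)


sentence = arc_to_sentence("mumps", "is_a", "inflammation")
print(sentence)  # mumps is an inflammation
print(is_accredited([True, True, True, False], [True, True]))
```

Keeping the two judgments separate mirrors the slide: intelligibility is tested on non-experts and truth on experts, and a sentence must pass both.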
100
we use logical methods to ensure
a coherent treatment of BWN’s
upper-level classes and relations
and thereby also bring logical
rigor in a practical fashion to the
whole of the UMLS
Metathesaurus
101
Bring ontological rigour to BWN
[Figure: diagram of kinds of ontologies. Labels: Terms; ‘ordinary’ Glossaries; ad hoc Hierarchies (Yahoo!); Data Dictionaries (EDI); structured Glossaries; Thesauri; XML DTDs; Principled, informal hierarchies; DB Schema; XML Schema; Data Models (UML, STEP); formal Taxonomies; Frames (OKBC); Description Logics (DAML+OIL); General Logic. Group labels: Glossaries & Data Dictionaries; Thesauri, Taxonomies; MetaData, XML Schemas, & Data Models; Formal Ontologies & Inference.]
102
The long-term goal
BWN should serve as scaffolding/indexing
system for the much larger and denser net
of expert biomedical terminology which is
the UMLS Metathesaurus
103
The End
104