Building a Terminological Database from Heterogeneous Definitional Sources
Smaranda Muresan, Peter T. Davis
Samuel D. Popper, Judith L. Klavans
Columbia University
May 21, 2003
Why Is Terminology Important?
• Each agency and each department might have different ways to define the same concept
• Working with multiple databases requires understanding the data across multiple agencies and domains
What’s an Employee?
• US Department of Agriculture: An appointed officer or employee of USDA including special Government employees (collaborators, consultants and panel members). The term excludes independent contractors.
• Mine Safety and Health Administration: A person who works for wages or salary in the service of an employer.
• Federal Railroad Administration: An individual who is engaged or compensated by a railroad or by a contractor to a railroad, who is authorized by a railroad to use its wireless communications in connection with railroad operations.
• US SEC: The term "employee" does not include a director, trustee, or officer.
Desiderata for Terminological Resources
• Capture the ongoing evolution of language
• Provide consistency, ease of sharing and integration across agencies
Architecture
• Collection (GetGloss, Definder): dynamic sources
• Heterogeneous Definitional Corpus
• Building the terminological DB
– Semantic Analysis (ParseGloss): relations among concepts and their attributes
– Database Building: fast access, flexibility, sharing
• Terminological Database
• Use: database query
Collection
• Motivation
– Definitions are rich in terminological knowledge
– On-line dictionaries are static and generally incomplete
– Need to capture the evolution of language
• Solution
– Acquisition of a Heterogeneous Definitional Corpus
– GetGloss – identification and extraction of glossaries
– Definder – extraction of definitions from online free text
Building the Terminological DB
• Motivation
– Need to identify relationships among concepts, e.g. synonyms, hypernyms, cross-reference
– Need to store this conceptual information for easy and fast access and integration
• Solution
– Transform definitional text into conceptual data
– ParseGloss – partial semantic analysis of definitions to identify relations between concepts
– Store data into a relational database
[Diagram: Definitional Corpus → Semantic Analysis (ParseGloss) → Database Building → Terminological Database]
Glossary text:
Gasoline: See Motor Gasoline (Finished).
Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines. Motor gasoline, as defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline" includes conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes aviation gasoline.
Conceptual data:
Term: Motor Gasoline (Finished)
Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …))
Paren-modifier: Finished
Full Definition: A complex mixture... for use in ….
Core Definition: A complex mixture ... for use in spark-ignition engines
Genus Phrase: A complex mixture of relatively volatile hydrocarbons
Head Genus Word: hydrocarbons
Properties:
UsedIn: spark-ignition engines
Excludes-Includes:
includes conventional gasoline
includes gasohol
excludes aviation gasoline ...
Database Use
• Motivation
– Enable the user to access the richness of terminological knowledge
– Assure easy and fast access to data
– Enable data sharing and integration across agencies
– Enable dynamic update of data
• Solution
– Query module for the relational database (SQL)
SQL query for inflammation
1. Redness, swelling, heat and pain resulting from injury to tissue (parts of the body underneath the skin). Also known as swelling.
2. A characteristic reaction of tissues to disease or injury; it is marked by four signs: swelling, redness, heat, and pain.
3. The reaction of tissue to injury.
4. A response to irritation, infection, or injury, resulting in pain, redness, and swelling.
…
Putting It All Together
[Diagram: Collection (GetGloss, Definder; dynamic sources) → Heterogeneous Definitional Corpus → Building the terminological DB (Semantic Analysis with ParseGloss: relations among concepts and their attributes; Database Building: fast access, flexibility, sharing) → Terminological Database → User (database query)]
GetGloss – Automatic Glossary Extraction
• DGRC project
• Given a URL, find the glossary file
• Challenges:
– glossaries can constitute small parts of a web page, being embedded inside it
– there is no standard HTML tag formatting for marking <term,definition> pairs
– a web page can contain <term,information> pairs, where information is not a definition
[Figure: example pages labeled True Positive and False Positive]
Algorithm
Two-step algorithm
• Identification Component
– Find candidate glossaries
– Keyword + Rule-based algorithm (6 rules)
  • “glossary”, “dictionary” in HTML tags
  • Terms in alphabetical order …
• Classification Component
– Filter out false positives
– Rule-based method (9 rules)
  • e.g. filter if term is a Named Entity (e.g. California)
– Statistical method using SVM
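The two identification cues named above (a keyword in the HTML tags, terms in alphabetical order) can be illustrated with a minimal sketch. It is not the GetGloss implementation: the slides do not list the six rules, so the choice of title and heading tags, the assumption that terms sit in `<dt>`, `<b>` or `<strong>` elements, the `min_terms` threshold, and the decision to accept a page when either cue fires are all hypothetical.

```python
from html.parser import HTMLParser

KEYWORDS = ("glossary", "dictionary")
TITLE_TAGS = {"title", "h1", "h2", "h3"}   # assumed places to look for keywords
TERM_TAGS = {"dt", "b", "strong"}          # assumed markup for glossary terms


class CandidateScanner(HTMLParser):
    """Collects the two cues: keyword hits and the list of candidate terms."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.keyword_hit = False
        self.terms = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        # Keyword inside an attribute value, e.g. id="glossary".
        for _, value in attrs:
            if value and any(k in value.lower() for k in KEYWORDS):
                self.keyword_hit = True

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open_tags:
            return
        current = self.open_tags[-1]
        if current in TITLE_TAGS and any(k in text.lower() for k in KEYWORDS):
            self.keyword_hit = True
        if current in TERM_TAGS:
            self.terms.append(text.rstrip(":"))


def is_candidate_glossary(html, min_terms=3):
    """Return True if either cue fires (a hypothetical way to combine them)."""
    scanner = CandidateScanner()
    scanner.feed(html)
    lowered = [t.lower() for t in scanner.terms]
    alphabetical = len(lowered) >= min_terms and lowered == sorted(lowered)
    return scanner.keyword_hit or alphabetical
```

A page that passes this check is only a candidate. A page with bold place names in alphabetical order would also pass, which is the kind of false positive the classification component is meant to filter out.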
Evaluating the Identification Component
• 10,000+ pages from 5 different sites
– 1,000 page sample: no glossaries (n=13)
• 286,579 page sample from 268 domains
– P=53%
• Estimating recall is hard
• Precision and Recall both very sensitive to perturbations (p=0 vs. p=53%)
Klavans et al. (dg.o 2002)
Evaluating the Classification Components
• GetGloss Categorizer assigns a score to each candidate based on a linear combination of weighted features
• Corpus: 2400 glossary candidates
• Test: 300 randomly chosen, manually categorized glossary candidates
Classification Component Performance
[Figure: Precision, Recall and F1 (y-axis labeled %, ticks 0 to 1) plotted against Score Range (x-axis, 0 to 5)]
Score Range
0 if Score < -100
1 if -100 ≤ Score < -50
2 if -50 ≤ Score < 0
3 if 0 ≤ Score < 50
4 if 50 ≤ Score < 100
5 if 100 ≤ Score
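The scoring and binning can be sketched as follows. The slides state only that the score is a linear combination of weighted features and give the six score ranges, so the feature names and weights below are hypothetical placeholders.

```python
from bisect import bisect_right

# Lower bounds of ranges 1..5, taken from the Score Range legend.
RANGE_BOUNDS = [-100, -50, 0, 50, 100]


def linear_score(features, weights):
    """Weighted sum of feature values; features missing from `weights` count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def score_range(score):
    """Map a raw score to the range index 0..5 used on the chart's x-axis."""
    return bisect_right(RANGE_BOUNDS, score)


# Hypothetical features and weights, for illustration only.
weights = {"has_glossary_keyword": 80.0, "term_is_named_entity": -120.0}
candidate = {"has_glossary_keyword": 1, "term_is_named_entity": 1}
```

With these made-up weights, `linear_score(candidate, weights)` is -40, which `score_range` places in range 2 (-50 ≤ Score < 0).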
Definder – Automatic Extraction of Definitions from Text
Definder
• Part of NSF funded digital library project
– Medical domain
– Extract definitions from consumer oriented medical text
• Corpus
– Medical articles written by doctors for lay audience
– Different genre (articles, manual chapters)
Algorithm
• Shallow parsing
– Simple definitions (e.g. NP-NP pairs for synonyms)
– Candidate complex definitions
• Full parsing (Charniak’00 parser)
– appositions, relative clauses, complex definitional sentences
Definition Patterns and Examples
Simple definitions
NP ( NP )
  moving x-ray pictures ( angiograms )
  hypertension ( high blood pressure )
NP -- NP --
  tachycardia -- racing heartbeat -- ….
NP of NP ( NP )
  enlargement of the heart muscle ( hypertrophy )
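The parenthetical patterns above can be approximated with a regular expression, as a rough sketch. Definder identifies noun phrases by shallow parsing. This sketch swaps in a crude stand-in: it takes the words to the left of the parenthesis up to the nearest punctuation mark or boundary word. The `BOUNDARY` word list and the function name are hypothetical.

```python
import re

# Hypothetical boundary words standing in for a real noun-phrase chunker.
BOUNDARY = {"is", "are", "was", "were", "has", "have", "had", "called",
            "shows", "show", "and", "or", "with", "by", "for", "to", "in"}

PAREN = re.compile(r"\(\s*([^()]+?)\s*\)")


def np_paren_pairs(text):
    """Return (left phrase, parenthesized phrase) pairs for 'NP ( NP )' patterns."""
    pairs = []
    for match in PAREN.finditer(text):
        # Only look at the clause immediately before the parenthesis.
        clause = re.split(r"[.,;:!?()]", text[:match.start()])[-1]
        phrase = []
        for word in reversed(clause.split()):
            if word.lower() in BOUNDARY:
                break
            phrase.append(word)
        phrase.reverse()
        if phrase:
            pairs.append((" ".join(phrase), match.group(1)))
    return pairs
```

Because "of" and "the" are not boundary words, the same function also covers the "NP of NP ( NP )" example. The cost is over-long left phrases whenever the sentence uses a verb that is missing from the list.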
Definition Patterns and Examples
Complex definitions
[S [CNP NP , [CNP CNP CNP] CNP] [VP … VP] S]
  Angina, the pressing chest pain most people associate with heart problems, ….
[S … [CNP NP -- [CNP CNP CNP] CNP] S]
  … atherosclerosis – the progressive narrowing of the heart’s own arteries by cholesterol plaque buildups, which starves the heart itself for oxygen and nutrients.
Evaluation
• Quantitative
– 4 human subjects, 10 articles
– 53 definitions – gold standard
– DEFINDER – p=86.27%, r=84.60%
• Qualitative
– Usefulness and readability (non-specialists)
– Completeness and accuracy (medical specialists)
Klavans and Muresan (JCDL’01)
Muresan and Klavans (LREC’02)
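For readers less familiar with the measures, precision and recall against a gold standard can be computed as in the sketch below. The set-based matching is an assumption (the slides do not say how an extracted definition was judged to match a gold one), and F1 is not reported for DEFINDER on the slide; the helper only shows what the standard harmonic mean of the reported figures would be.

```python
def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall and F1 of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Applied to the reported values, `f1_from(0.8627, 0.8460)` comes out at roughly 0.854.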
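Coverage of On-line Dictionaries
[Figure: bar chart (y-axis 0 to 80) of terms that are defined, undefined, or absent in three on-line dictionaries: UMLS, OMD, GPTMT. Bar labels recovered from the chart: 78.5, 76, 60, 24, 24, 16, 21.5, 0, 0]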
Characteristics of Automatically Built Definitional Corpus
• Large amount of data
• Heterogeneous
– Structure
– Language
– Semantics
• Multiple definitions of the same term
Environment domain
Ozone: (O3) A colorless gas with a pungent odor, having the molecular form of O3, found in two layers of the atmosphere, the stratosphere (about 90% of the total atmospheric loading) and the troposphere (about 10%). Ozone is a form of oxygen found naturally in the stratosphere that provides a protective layer shielding the Earth from ultraviolet radiation's harmful health effects on humans and the environment. In the troposphere, ozone is a chemical oxidant and major component of photochemical smog. Ozone can seriously affect the human respiratory system. See atmosphere, ultraviolet radiation.
<http://www.epa.gov/globalwarming/glossary.html.xml>
Column Ozone: ozone between the Earth's surface and outer space. Ozone levels can be described in several ways. One of the most common measures is how much ozone is in a vertical column of air. The dobson unit is a measure of column ozone. Other measures include partial pressure, number density, and concentration of ozone, and can represent either column ozone or the amount of ozone at a particular altitude.
Medical domain
Addison Disease
- a rare disease that results from a deficiency in adrenocortical hormones
- an endocrine disorder that affects about 1 in 100,000 people
- a degenerative disease that is characterized by low blood pressure and dark brown pigmentation of the skin
Arrhythmia: A disturbance in the beating pattern of the heart.
Ozone: A form of oxygen in which atoms combine in groups of three.
<1006_Oxygen_therapy>
ParseGloss – Partial Semantic Analysis
• Challenges – heterogeneous collection of definitions
• Focus on identifying the main semantic relations among concepts
• Partial Semantic Analysis based on shallow parsing
– Genus phrase and genus term
– Synonyms, cross-reference
– Other common relations between the defined term and the concepts inside the definition
Example
Term: Motor Gasoline (Finished)
Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …))
Paren-modifier: Finished
Full Definition: A complex mixture... for use in ….
Core Definition: A complex mixture ... for use in spark-ignition engines
Genus Phrase: A complex mixture of relatively volatile hydrocarbons
Head Genus Word: hydrocarbons
Properties:
  UsedIn: spark-ignition engines
  Excludes-Includes:
    includes conventional gasoline
    includes gasohol
    excludes aviation gasoline
…
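A record like the one above could be held in a small data structure, sketched below under stated assumptions. The class, field and function names are hypothetical, and the two regular expressions are a deliberately naive stand-in for ParseGloss's shallow-parsing analysis: one splits off the parenthesized modifier of the term, the other captures only the phrase that directly follows the cue words "includes" and "excludes".

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedEntry:
    term: str
    base_term: str
    paren_modifier: Optional[str]
    includes: List[str] = field(default_factory=list)
    excludes: List[str] = field(default_factory=list)


TERM_WITH_PAREN = re.compile(r"^(.*?)\s*\(([^()]+)\)\s*$")
CUE = re.compile(r"\b(includes|excludes)\s+([^;,.]+)")


def parse_entry(term, definition):
    """Extract the paren-modifier and the includes/excludes phrases."""
    match = TERM_WITH_PAREN.match(term)
    base, modifier = (match.group(1), match.group(2)) if match else (term, None)
    entry = ParsedEntry(term=term, base_term=base, paren_modifier=modifier)
    for cue, phrase in CUE.findall(definition):
        # The cue word doubles as the attribute name ("includes" / "excludes").
        getattr(entry, cue).append(phrase.strip())
    return entry
```

On the Motor Gasoline definition this sketch recovers "conventional gasoline" and "aviation gasoline" but misses "gasohol", which the slide lists under includes. Recovering items from the rest of the enumeration needs more than a cue-word match.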
Evaluation
• Task – User based evaluation to
– build a gold standard for genus phrase
– ask people to identify the most important properties inside each definition
• Data
– 100 term-definition pairs
– 7 different glossaries
– 26 subjects
Results
• Complex evaluation
• Different notions of agreement and overlap:
– Head only – 64% precision
– Genus phrase – 59% precision
Building the Database
• Relational database
• XML – data transfer
• Statistics
– ~8000 terms
– ~12,500 definitions
– ~2000 different sources (web pages, articles, etc.)
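As an illustration of XML used for data transfer, the sketch below serializes one entry with Python's standard `xml.etree.ElementTree`. The slides do not show the XML schema, so the element and attribute names (`entry`, `term`, `definition`, `source`, `agency`, `resource`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(term, definition, agency, resource):
    """Serialize one term-definition pair, with its source, to an XML string."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "term").text = term
    ET.SubElement(entry, "definition").text = definition
    source = ET.SubElement(entry, "source", agency=agency)
    ET.SubElement(source, "resource").text = resource
    return ET.tostring(entry, encoding="unicode")


xml_text = entry_to_xml(
    "Motor Gasoline (Finished)",
    "A complex mixture of relatively volatile hydrocarbons",
    "EIA",
    "Gasoline Glossary",
)
```

The receiving side can read the record back with `ET.fromstring(xml_text)` and load it into its own database.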
Distribution of terms with multiple definitions
Query the Terminological Database
- Term: Audit
Definition: An examination of the financial statements, accounting records, and other supporting evidence of an institution…
Source: www.fdic.gov
- Term: Industrial radiography
Definition: means an examination of the structure of materials…
Source: www.nrc.gov
- Term: Medical Surveillance
Definition: is the systematic examination of medical monitoring data to determine…
Source: www.osha.gov
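A query of this kind can be sketched with Python's built-in `sqlite3` module. The two-table schema, the table and column names, and the helper functions are hypothetical, since the slides do not show the actual database design. All three sample results contain the word "examination", so the sketch searches definition text for that word; whether the original query matched raw text or an extracted field such as the head genus word is not stated.

```python
import sqlite3

SAMPLE_ROWS = [
    ("Audit", "An examination of the financial statements, accounting records, "
              "and other supporting evidence of an institution…", "www.fdic.gov"),
    ("Industrial radiography", "means an examination of the structure of materials…",
     "www.nrc.gov"),
    ("Medical Surveillance", "is the systematic examination of medical monitoring "
                             "data to determine…", "www.osha.gov"),
    ("Ozone", "A form of oxygen in which atoms combine in groups of three.",
     "1006_Oxygen_therapy"),
]


def build_demo_db(rows=SAMPLE_ROWS):
    """Create an in-memory database; one term may have several definitions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE definition (
            id INTEGER PRIMARY KEY,
            term_id INTEGER NOT NULL REFERENCES term(id),
            text TEXT NOT NULL,
            source TEXT NOT NULL
        );
    """)
    for name, text, source in rows:
        cur = conn.execute("INSERT INTO term (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO definition (term_id, text, source) VALUES (?, ?, ?)",
            (cur.lastrowid, text, source),
        )
    return conn


def find_by_definition_word(conn, word):
    """Return (term, source) pairs whose definition text contains `word`."""
    return conn.execute(
        "SELECT t.name, d.source FROM term t "
        "JOIN definition d ON d.term_id = t.id "
        "WHERE d.text LIKE ? ORDER BY t.name",
        (f"%{word}%",),
    ).fetchall()
```

Calling `find_by_definition_word(build_demo_db(), "examination")` returns the three terms shown on the slide and leaves out the Ozone entry.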
Conclusions
• Proposed a framework for solving the heterogeneous terminology problem
– Automatically building a heterogeneous collection of definitions from dynamic sources
– Partial semantic analysis of the definitions to identify main semantic relations between concepts
– Building a database for fast, easy access, dynamic update of data, sharing across agencies
Future Work
• Deep domain-specific semantic analysis
• But, how to automatically classify the glossary entries and definitions?
– Based on the classification of their source (sites, articles, etc.)
• When and how to merge different definitions for the same term?
• Integrate this acquired terminological knowledge into the DGRC system