Design and Creation of Ontologies for Environmental
(Multimedia) Information Retrieval*
Vipul Kashyap
National Library of Medicine
Workshop on Science and the Semantic Web
October 24, 2002
* Work done by the author when at MCC and LSDIS Lab, UGA
Outline
• Ontologies for Information Retrieval: The InfoSleuth System
• The Ontology Design Process:
– “Reverse Engineering” from a database schema
– Ontology refinement based on user queries
– Using a data dictionary and Thesaurus
• Ontology-based Multimedia Information Retrieval
– Information Extraction from Textual Data
– Information Extraction from Image Data
• Conclusions and Future Work
Science on the Semantic Web Workshop – 2
Ontologies for Information Retrieval:
The InfoSleuth System
[Architecture diagram: an ontology-based retrieval query is routed through KQML/OKBC agents to three kinds of sources: an Image Database (features, patterns, semantic objects), a Document Database (e.g., Verity), and a Structured Database (e.g., Oracle)]
A Multimedia GIS Query using an ontological model
Get me all regions (blocks, counties) having a population greater than 500 and
area greater than 50 acres having an urban land cover and such that all the
nearby fires have excellent containment
[Diagram: ontology fragment with the entities Region and Fire, the relationship isLocatedNear, and the attributes county, block, name, area, population, land_cover, containment, spatial_location]
select county, block, spatial_location
from region
where area > 50 and population > 500
  and land_cover = 'urban'
  and region.isLocatedNear.containment = 'excellent'
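A minimal sketch of how the query above could be evaluated over in-memory instances of the ontological model. The Region and Fire classes, the sample data, and the function name are assumptions made for illustration and are not part of InfoSleuth. The path expression region.isLocatedNear.containment is read here as "every nearby fire has excellent containment", following the English statement of the query.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fire:
    name: str
    containment: str


@dataclass
class Region:
    county: str
    block: str
    area: float            # acres
    population: int
    land_cover: str
    spatial_location: str
    is_located_near: List[Fire] = field(default_factory=list)


def select_regions(regions):
    """Return (county, block, spatial_location) for regions matching the query."""
    return [
        (r.county, r.block, r.spatial_location)
        for r in regions
        if r.area > 50
        and r.population > 500
        and r.land_cover == "urban"
        # "all the nearby fires have excellent containment"
        and all(f.containment == "excellent" for f in r.is_located_near)
    ]


# Hypothetical sample data, for illustration only.
regions = [
    Region("County A", "B1", 80.0, 900, "urban", "loc-1",
           [Fire("F1", "excellent")]),
    Region("County B", "B2", 120.0, 700, "urban", "loc-2",
           [Fire("F2", "excellent"), Fire("F3", "poor")]),
    Region("County C", "B3", 40.0, 2000, "urban", "loc-3", []),
]

matches = select_regions(regions)
print(matches)
```

Note that with this reading a region with no nearby fires satisfies the fire condition trivially; whether that is the intended semantics is not stated on the slide.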
Ontologies for Information Retrieval
• Provide a concise, uniform, declarative description of semantic information
• Independent of the syntactic representations and conceptual models of the underlying information bases
• Domain models provide wider access by supporting multiple world views on the same underlying data
• EDEN ontology defined in the context of the InfoSleuth system:
– crucial for capturing elements of environmental information
Sources for Ontology construction
• Pre-existing Database Schemas
– data directed component
• A collection of representative queries, possibly parameterized, based on the application user interface
– application directed component
• Thesauri and Vocabularies (e.g., EEA Thesaurus)
– knowledge directed component
• Ontology = knowledge-based middle ground between applications and data!
The Ontology Design Process
[Flowchart of the design process]
Ontology from Database Schema:
– Choose new Database Schema
– Abstract details from Database Schema
– Determine entities and attributes
– Group information, analyze foreign keys and dependencies
– Determine Relationships
Ontology from Queries (repeated until there are no more queries):
– Choose new query
– Add new entities and attributes
– Add new subclasses and superclasses
– Drop entities and attributes
Evaluate Ontology; Implement and Test
Environmental Databases
• CERCLIS 3
– http://www.epa.gov/enviro/html/cerclis/
• ITT
• HAZDAT
– http://www.atsdr.cdc.gov/hazdat.html
• ERPIMS
– http://ns1.ktc.com/personal/larnold/erpims.htm
• Basel Convention Database
– http://www.unep.ch/basel
Grouping Information in Multiple Tables
Database Schema:
  Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
  Site_Characteristic: site_id (PK, FK to Site), rsic_code (PK, FK to Ref_Sic), sc_date
  Ref_Sic: rsic_code (PK), rsic_code_desc
  Site_Alias: site_id (PK, FK to Site), site_alias_id (PK), sa_name
Ontology:
  Site: name, code, date, description, alias_name
Identifying Relationships
Database Schema:
  Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
  Ref_action_type: rat_code (PK), rat_name, rat_def
  Action: site_id (PK, FK to Site), rat_code (PK, FK to ref_action_type), act_code_id (PK)
  Waste_Src_Media_Contaminated: wsmrc_nmbr (PK), site_id (PK, FK to Action), rat_code (FK to Action), act_code_id (FK to Action)
  Remedial_Response: site_id, act_code_id, rat_code
Ontology:
  Entities: Site, RemedialResponse, Contaminant
  Attribute: actionName
  Relationship: PerformedAt
Ontology refinement based on user queries
• Addition of new attributes
– At NPL sites with a land use category of INDUSTRIAL, what is the cleanup level range for LEAD ….
– Add an attribute landUseCategory to the entity Site in the ontology
• Addition of new relationships
– What is the range of concentrations for ARSENIC as a contaminant of concern in the SURFACE SOIL at NPL sites?
– Add a relationship HasContaminant between the entities Site and Contaminant in the ontology
• Addition of class-subclass relationships and new entities
– How many Superfund sites are in Edison County, New Jersey?
– Add an entity SuperFundSite as a subclass of Site in the ontology
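The three refinement steps above can be sketched as operations on a small in-memory ontology. The Ontology class and its method names are hypothetical and chosen only for illustration; the entity, attribute, and relationship names come from the slide.

```python
class Ontology:
    """A toy ontology: entities with attributes, relationships, subclass links."""

    def __init__(self):
        self.attributes = {}        # entity name -> set of attribute names
        self.relationships = set()  # (relationship name, source, target)
        self.superclass = {}        # entity name -> parent entity name

    def add_entity(self, name, parent=None):
        self.attributes.setdefault(name, set())
        if parent is not None:
            self.add_entity(parent)
            self.superclass[name] = parent

    def add_attribute(self, entity, attribute):
        self.add_entity(entity)
        self.attributes[entity].add(attribute)

    def add_relationship(self, name, source, target):
        self.add_entity(source)
        self.add_entity(target)
        self.relationships.add((name, source, target))


onto = Ontology()
onto.add_entity("Site")
onto.add_entity("Contaminant")

# Query about land use category -> new attribute on Site.
onto.add_attribute("Site", "landUseCategory")
# Query about contaminants of concern at sites -> new relationship.
onto.add_relationship("HasContaminant", "Site", "Contaminant")
# Query about Superfund sites -> new entity as a subclass of Site.
onto.add_entity("SuperFundSite", parent="Site")

print(onto.attributes["Site"], onto.relationships, onto.superclass)
```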
Using a data dictionary (EDR) to enhance the ontology
[Diagram: the state attribute of Site takes values from several coding schemes, StateName (e.g., { "Texas", "California" }), StateCode, and StateAbbr (e.g., { "TX", "CA" }); a Map table with columns coding_scheme1, coding_scheme2, coding_scheme3 relates them]
• select * from Site where state = 'TX' or state = 'California'
• select coding_scheme1 from Map where coding_scheme3 = 'TX'
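A minimal sketch of how a Map table could be used to query Site regardless of which coding scheme a row uses, written with Python's built-in sqlite3 module. The table contents, the site_name column, and the assignment of state names, numeric codes, and abbreviations to coding_scheme1, coding_scheme2, and coding_scheme3 are assumptions for illustration; only the table and column names Site, Map, state, and coding_scheme1..3 come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Site (site_id INTEGER PRIMARY KEY, site_name TEXT, state TEXT);
    CREATE TABLE Map (coding_scheme1 TEXT, coding_scheme2 TEXT, coding_scheme3 TEXT);
    -- Hypothetical rows: the state column mixes names and abbreviations.
    INSERT INTO Site VALUES (1, 'Site A', 'TX'), (2, 'Site B', 'California'),
                            (3, 'Site C', 'Texas'), (4, 'Site D', 'CA');
    -- Assumed layout: name, numeric code, abbreviation.
    INSERT INTO Map VALUES ('Texas', '48', 'TX'), ('California', '06', 'CA');
""")


def sites_in_state(conn, abbr):
    """Return site names for a state, matching any of its coded values."""
    row = conn.execute(
        "SELECT coding_scheme1, coding_scheme2, coding_scheme3 "
        "FROM Map WHERE coding_scheme3 = ?",
        (abbr,),
    ).fetchone()
    if row is None:
        return []
    placeholders = ", ".join("?" for _ in row)
    query = (
        f"SELECT site_name FROM Site WHERE state IN ({placeholders}) "
        "ORDER BY site_id"
    )
    return [name for (name,) in conn.execute(query, row)]


texas_sites = sites_in_state(conn, "TX")
print(texas_sites)
```

The lookup in Map plays the role of the data dictionary: the user asks in one coding scheme and the query is expanded to all equivalent values.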
Enhancing the Ontology by using a Thesaurus
[Diagram: thesaurus entry "abandoned site" with THEME POLLUTION, BT (broader term) "land setup", and NT (narrower term) "disused military site"; the corresponding ontology classes are LandSetup, Site, SuperfundSite, AbandonedSite, DisusedMilitarySite]
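A minimal sketch of turning BT/NT links from a thesaurus entry into class-subclass pairs. Treating every BT/NT link as a subclass relationship is an assumption made here for illustration, as are the function names and the dictionary layout; how Site and SuperfundSite attach to these classes is not derived by this sketch.

```python
def class_name(term):
    """Turn a thesaurus term such as 'abandoned site' into 'AbandonedSite'."""
    return "".join(word.capitalize() for word in term.split())


def subclass_edges(thesaurus):
    """Derive (subclass, superclass) pairs from BT and NT links."""
    edges = set()
    for term, entry in thesaurus.items():
        for broader in entry.get("BT", []):
            edges.add((class_name(term), class_name(broader)))
        for narrower in entry.get("NT", []):
            edges.add((class_name(narrower), class_name(term)))
    return edges


# The single entry shown on the slide, in a hypothetical dictionary layout.
thesaurus = {
    "abandoned site": {
        "THEME": ["POLLUTION"],
        "BT": ["land setup"],
        "NT": ["disused military site"],
    }
}

edges = subclass_edges(thesaurus)
print(sorted(edges))
```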
Information Extraction from Text and
Multimedia Data
Get me all regions (blocks, counties) having a population greater than 500 and
area greater than 50 acres having an urban land cover and such that all the
nearby fires have excellent containment
[Diagram: ontology fragment with the entities Region and Fire, the relationship isLocatedNear, and the attributes county, block, name, area, population, land_cover, containment, spatial_location]
select county, block, spatial_location
from region
where area > 50 and population > 500
  and land_cover = 'urban'
  and region.isLocatedNear.containment = 'excellent'
Information Extraction from Textual Data
[Diagram: the ontology elements Region, Fire, isLocatedNear, county, block, state, fire.name, and containment, plus a table column labelled Column1, each linked to the topic expression used to extract it from text]

containment = "excellent":
<ACCRUE>(
  <SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
  <PHRASE>(full, containment, <STEM>(was), expected),
  <PHRASE>(the, fire, <STEM>(is), contained))

region.county:
<ACCRUE>(
  <SENTENCE>(
    <PHRASE>(<OR>(New, Las, San), [region.county]),
    <OR>(county, block, state)))

isLocatedNear:
<PARAGRAPH>(FIRE, REGION)
Mapping “domain specific” model elements to media
specific metadata
• county(x,y) gets mapped to:
– word(x), phrase(x), accrue(<list-of-subtrees>)
• containment(x, "excellent") gets mapped to:
– sentence(<set-of-words>), stem(x), accrue(<list-of-subtrees>)
• isLocatedNear(x, y) gets mapped to:
– paragraph(x,y)
Mapping SQL queries to Topic Expressions
select county from region
where isLocatedNear.containment = 'excellent'

<PARAGRAPH>(
  <ACCRUE>(
    <SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
    <PHRASE>(full, containment, <STEM>(was), expected),
    <PHRASE>(the, fire, <STEM>(is), contained)),
  <ACCRUE>(
    <SENTENCE>(
      <PHRASE>(<OR>(New, Las, San), [region.county]),
      county))
)
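A minimal sketch of the translation step, done as plain string composition. The lookup table, the keys used to index it, and the function names are hypothetical; the fragments are the topic expressions shown on the slides, and nothing here calls an actual search-engine API.

```python
# Topic-expression fragments copied from the slides, keyed by a
# hypothetical (ontology attribute, value) pair.
CONTAINMENT_EXCELLENT = (
    "<ACCRUE>("
    "<SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)), "
    "<PHRASE>(full, containment, <STEM>(was), expected), "
    "<PHRASE>(the, fire, <STEM>(is), contained))"
)
REGION_COUNTY = (
    "<ACCRUE>(<SENTENCE>("
    "<PHRASE>(<OR>(New, Las, San), [region.county]), county))"
)

ATTRIBUTE_FRAGMENTS = {
    ("fire.containment", "excellent"): CONTAINMENT_EXCELLENT,
    ("region.county", None): REGION_COUNTY,
}


def paragraph(*fragments):
    """isLocatedNear(x, y) is mapped to co-occurrence within a paragraph."""
    return "<PARAGRAPH>(" + ", ".join(fragments) + ")"


def translate(select_attr, where_attr, where_value):
    """Translate 'select A from region where isLocatedNear.B = V'."""
    condition = ATTRIBUTE_FRAGMENTS[(where_attr, where_value)]
    projection = ATTRIBUTE_FRAGMENTS[(select_attr, None)]
    return paragraph(condition, projection)


topic = translate("region.county", "fire.containment", "excellent")
print(topic)
```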
Limitations of Current Indexing Technologies:
“selection operation”
select county from region
<ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), WILDCARD),
                    <OR>(county, block, state)))
=> The returned patterns need post-processing (WILDCARD acts as a place-holder)
Problems:
– WILDCARD may match many words in the same sentence
– WILDCARD may match different words in different sentences
Using NLP and statistical techniques
• WILDCARD matches a number of words in the same sentence
    Yeltsin was appointed the Prime Minister when sleeping
    (the = article, Prime Minister = noun, when = conjunction, sleeping = verb)
  => Use part-of-speech tagging to reduce the number of possibilities
• WILDCARD matches different words in different sentences
    Yeltsin was appointed Prime Minister
    Yeltsin was appointed President
  => Use frequency statistics to give a level of confidence
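A minimal sketch of both ideas: filtering WILDCARD candidates by part of speech, and scoring the remaining candidates by relative frequency. The part-of-speech tags are given as input here (they would come from a tagger, which is not shown), and the candidate counts are made up for illustration.

```python
from collections import Counter


def filter_by_pos(tagged_candidates, wanted_tag="noun"):
    """Keep only the candidates whose part-of-speech tag matches."""
    return [word for word, tag in tagged_candidates if tag == wanted_tag]


def wildcard_confidence(candidates):
    """Map each candidate filler to its relative frequency across sentences."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Candidates for WILDCARD within one sentence, with assumed tagger output.
tagged = [
    ("the", "article"),
    ("Prime Minister", "noun"),
    ("when", "conjunction"),
    ("sleeping", "verb"),
]
nouns = filter_by_pos(tagged)

# Hypothetical fillers collected from several sentences of the form
# "Yeltsin was appointed WILDCARD".
extracted = ["President", "President", "President", "Prime Minister"]
confidence = wildcard_confidence(extracted)
best = max(confidence, key=confidence.get)
print(nouns, best, confidence[best])
```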
Definition Support
Sample text (as shown on the slide; several lines are cut off at the slide edge):
  INCIDENT MANAGEMENT SITUATION REPORT
  Friday August 1, 1997 - 0530 MDT
  NATIONAL PREPAREDNESS LEVEL II
  CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires ha
  staffed for structure protection.
  SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena
  The fire is active on the southern perimeter, which is burning into a continuous stand of black s
  fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern
  35% contained, while protection of the historic cabin continues.
  CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Weh
  assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up.
  contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up wher
  burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this we
  depending on the results of infrared scanning.
Definition support for the highlighted phrase:
  Phrase: SIMELS, Galena District, BLM.
  Slot: fire.name
  Value: SIMELS
  Structure: <name>, <place>, <unit>.
MIDAS*: Information Extraction from Multimedia Data
Query: Get me all regions (blocks, counties) having a population greater
than 500 and area greater than 50 acres having an urban land cover
select county, block, area, population,
       spatial_location, land_cover
from region
where area > 50
  and population > 500
  and land_cover = 'urban'
  and relief = 'moderate'
*Media Independent DomAin Specific correlation
Get me all regions (counties, blocks) having 50 < population < 100, 25 < area < 50, and low density urban area land cover ...
– media independent correlation across domain specific metadata
– correlation across image and structured data at an intensional domain level

[Diagram: sources consulted for each part of the query]
– SQL queries to structured data (Census DB): Population, Area
– SQL Gateway to textual data (TIGER/Line DB): Boundaries
– Image Processing routines for Image Data: Land cover, Relief
Mapping “domain specific” model elements
to media specific metadata
• contained(<concept>, <image>) gets mapped to:
– latitude/longitude, image-coordinates
– bounding box of region
– image type: LULC, DEM
• land_cover(x, "low density urban") gets mapped to:
– percentage(<pixel-color>, <bounding-box>)
• relief(x, "moderate") gets mapped to:
– standard-deviation(<pixel-value>, <bounding-box>)
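A minimal sketch of the two image measures named above, computed over a bounding box of a raster held as nested lists. The raster values, the class code used for "low density urban", and the function names are assumptions for illustration; the thresholds that would turn these numbers into "low density urban" or "moderate" are not given on the slide and are left out.

```python
from statistics import pstdev


def crop(image, box):
    """Return the pixels inside box = (row0, col0, row1, col1), end-exclusive."""
    row0, col0, row1, col1 = box
    return [pixel for row in image[row0:row1] for pixel in row[col0:col1]]


def percentage(image, pixel_value, box):
    """Percentage of pixels in the box that have the given value."""
    pixels = crop(image, box)
    return 100.0 * sum(1 for p in pixels if p == pixel_value) / len(pixels)


def standard_deviation(image, box):
    """Population standard deviation of the pixel values in the box."""
    return pstdev(crop(image, box))


# Hypothetical LULC raster (one class code per pixel; 1 is assumed to stand
# for low density urban) and a hypothetical DEM raster (elevation values).
lulc = [
    [1, 1, 2, 2],
    [1, 1, 2, 3],
    [1, 3, 3, 3],
    [2, 2, 3, 3],
]
dem = [
    [100, 102, 101, 103],
    [101, 103, 102, 104],
    [100, 101, 103, 102],
    [102, 104, 103, 105],
]

box = (0, 0, 2, 2)  # bounding box of a region, in image coordinates
urban_pct = percentage(lulc, 1, box)
relief_sd = standard_deviation(dem, box)
print(urban_pct, relief_sd)
```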
Need for characterization of Domain
Vocabularies
[Diagram: two classification hierarchies, with the note "Another source of domain ontology construction: Classification Standards"]
Geological Region (by land cover):
– Urban: Industrial, Residential, Commercial
– Water: Lakes, Reservoirs, Streams and Canals
– Forest Land: Evergreen, Deciduous, Mixed
Geological Region (by administrative unit):
– State, County, City, Rural Area, Tract, Block Group, Block
Conclusions and Future Work
• Role of semantic content in handling data/information overload
– Domain-specific ontologies: an approach for capturing semantic content
• Design and construction of domain ontologies
– Labor-intensive, time-consuming, difficult endeavor
– Re-use readily available information: schemas, queries, data dictionaries, thesauri
    • minimize the involvement of the domain expert
• Metadata is the key for Multimedia Information Retrieval
– Use an expanded notion of metadata as schema and a declarative SQL-like query language
– Pragmatic incorporation of NLP/Image+Speech+Video Processing/Computer Vision techniques
– Exploit synergy across multiple media for better precision and performance
• Extrapolate this technique into other domains:
– Medical and Bio-Informatics
– Telecommunication
– IP networks (use of CIM information model by DMTF)
• Ontology Extraction from Textual Data:
– Clustering techniques to identify central concepts and taxonomic relationships
– NLP techniques to identify concept associations
– Consensus analysis techniques to establish ontologies