32N1633 DBSG Future Database Needs

Future Database Needs
SC 32 Study Period
February 5, 2007
JTC1 SC32N1633
Bruce Bargmeyer,
Lawrence Berkeley National Laboratory
University of California
Tel: +1 510-495-2905
[email protected]
1
Topics
• Study period purpose
• New challenges
• A brief tutorial on semantics and semantic computing
• Where XMDR fits
• Semantic computing technologies
• Traditional data administration
• Some limitations of current relational technologies
• Some input from other sources
2
Future Database Needs
Study Period
• A one-year study period to identify and understand case studies related to this area.
• Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”.
• The workshop would provide input to existing SC32 projects and may provide background material for new proposals for upgrades or for new work within SC32 in time for the 2007 SC32 Plenary.
Source: Document 32N1451
3
The Internet Revolution
A world wide web of diverse content:
The information glut is nothing new. The access to it is astonishing.
4
Challenge: Find and process nonexplicit data
For example…
Patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril, …). However, we want to study patients taking analgesic agents.
[Figure: concept hierarchy linking brand names to drug classes]
Analgesic Agent
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
        Tylenol
        Anacin-3
        Datril
    Nonsteroidal Antiinflammatory Drug
5
Challenge: Specify and compute across
Relations, e.g., within a food web in an
Arctic ecosystem
An organism is connected to another organism for which it is a source
of food energy and material by an arrow representing the direction of
biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
6
Challenge: Combine Data, Metadata & Concept Systems
Inference Search Query:
“find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”
Data:
ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78
Concept system:
Contamination
  Biological
  Radioactive
  Chemical
    mercury
    lead
    cadmium
Metadata:
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
7
Challenge: Use data from systems that record the same facts with different terms
[Figure: many kinds of registries and repositories hold common content under different terms. Registry types shown: database catalogs, ISO 11179 registries, UDDI registries, OASIS/ebXML registries, CASE tool repositories, software component registries, ontological registries, and Dublin Core registries. Terms shown for the common content include data element, table column, business specification, XML tag, attribute, business object, coverage, term hierarchy, and country identifier.]
9
Same Fact, Different Terms
Data Element Concept
  Name: Country Identifiers
  Context:
  Definition:
  Unique ID: 5769
  Conceptual Domain:
  Maintenance Org.:
  Steward:
  Classification:
  Registration Authority:
  Others
  Values: Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe
Data Elements
  Name:
  Context:
  Definition:
  Unique ID: 4572
  Value Domain:
  Maintenance Org.:
  Steward:
  Classification:
  Registration Authority:
  Others
Value domains:
ISO 3166      ISO 3166     ISO 3166      ISO 3166      ISO 3166
English Name  French Name  2-Alpha Code  3-Alpha Code  3-Numeric Code
Algeria       L'Algérie    DZ            DZA           012
Belgium       Belgique     BE            BEL           056
China         Chine        CN            CHN           156
Denmark       Danemark     DK            DNK           208
Egypt         Egypte       EG            EGY           818
France        La France    FR            FRA           250
...           ...          ...           ...           ...
Zimbabwe      Zimbabwe     ZW            ZWE           716
10
Challenge: Draw information together from a
broad range of studies, databases, reports, etc.
11
Challenge: Gain Common Understanding of
meaning between Data Creators and Data Users
A common interpretation of what the data represents
[Figure: data creators (EEA, USGS, DoD, EPA, and others) and users exchange text and data through information systems. The same subject terms appear in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) and in Spanish (ambiente, agricultura, tiempo, salud humano, industria, turismo, tierra, agua, aero), alongside blocks of uninterpreted numeric data values.]
12
Challenge: Drawing Together Dispersed Data
A common interpretation of what the data represents
[Figure: data creators (EEA, USGS, DoD, EPA, and others) and users exchange text and data through information systems. The same subject terms appear in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) and in Spanish (ambiente, agricultura, tiempo, salud humano, industria, turismo, tierra, agua, aero), alongside blocks of uninterpreted numeric data values.]
13
Semantic Computing
• We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing
• How can we make use of semantic computing?
• What do organizations need to do to prepare for and stimulate semantic computing?
14
Coming: A Semantic Revolution
Searching and ranking
Pattern analysis
Knowledge discovery
Question answering
Reasoning
Semi-automated
decision making
15
The Nub of It
• Processing that takes “meaning” into account
• Processing based on the relations between things, not just computing about the things themselves
• Computing that takes people out of the processing, reducing the human toil
  • Data access, extraction, mapping, translation, formatting, validation, inferencing, …
• Delivering higher-level results that are more helpful for the user’s thought and action
16
Semantics Challenges
• Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing
• Managing, harmonizing, and vetting semantics is important for traditional data management
  • In the past we just covered the basics
• Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr
17
A Brief Tutorial on Semantics
• What is meaning?
• What are concepts?
• What are relations?
• What are concept systems?
• What is “reasoning”?
18
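Meaning: The Semiotic Triangle
[Figure: a triangle with Thought or Reference (Concept) at the top, and Symbol (e.g. “Rose”, “ClipArt”) and Referent at the base. The Symbol symbolises the Thought or Reference, the Thought or Reference refers to the Referent, and the Symbol stands for the Referent.]
C. K. Ogden and I. A. Richards. The Meaning of Meaning.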
19
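Semiotic Triangle: Concepts, Definitions and Signs
[Figure: the same triangle with CONCEPT (carrying a Definition) at the top, and Sign (e.g. “Rose”, “ClipArt”) and Referent at the base. The Sign symbolizes the CONCEPT, the CONCEPT refers to the Referent, and the Sign stands for the Referent.]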
20
Definitions in the EPA Environmental Data Registry
Mailing Address:
  http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
  The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
State USPS Code:
  http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
  The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
Mailing Address State Name:
  http://www.epa/gov/edr/sw/AdministeredItem#StateName
  The name of the state where mail is delivered
24
SNOMED – Terms Defined by
Relations
26
Computable Meaning
[Figure: the semiotic triangle (CONCEPT, Referent, and a sign such as “Rose” or “ClipArt”), with concepts related to one another by rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith.]
If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion is invalid if it states that a rose is also a daffodil (e.g., in a knowledge base).
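The disjointness check described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the class names come from the slide, while the dictionary layout, the `DISJOINT` set, and the function name are hypothetical and do not reflect any particular OWL reasoner.

```python
# Hypothetical mini knowledge base. Class names follow the slide; the data
# structures and function name are assumptions made for illustration.
DISJOINT = {frozenset({"rose", "daffodil"})}  # pairs declared owl:disjointWith


def inconsistent_assertions(types_by_individual, disjoint=DISJOINT):
    """Return individuals asserted to belong to two disjoint classes."""
    bad = {}
    for individual, classes in types_by_individual.items():
        for pair in disjoint:
            if pair <= set(classes):  # both disjoint classes are asserted
                bad[individual] = tuple(sorted(pair))
    return bad


kb = {"flower1": {"rose"}, "flower2": {"rose", "daffodil"}}
print(inconsistent_assertions(kb))
```

Only `flower2` is reported, because it is the only individual asserted to be both a rose and a daffodil.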
30
What are Relations?
[Figure: the concepts Merced River, Fletcher Creek, and Merced Lake, linked to the concept WaterBody by isA relations.]
Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.
31
Concept Systems have Nodes and may have Relations
• Nodes represent concepts
• Lines (arcs) represent relations
[Figure: a small graph with nodes labeled A, 1, 2, a, b, c, and d]
Concept systems are concepts and the relations between them.
Concept systems can be represented and queried as graphs.
32
A More Complex Concept Graph
Concept lattice of inland water features
[Figure: a lattice relating attributes (linear / non-linear, large / small, deep / shallow, natural / artificial, flowing / stagnant) to the features River, Stream, Canal, Reservoir, Lake, Marsh, and Pond.]
From “Supervaluation Semantics for an Inland Water Feature Ontology”, Paulo Santos and Brandon Bennett, http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22
33
Types of Concept System Graph Structures
Tree
Partial Order Tree
Ordered Tree
Partial Order Graph
Bipartite Graph
Faceted Classification
Powerset of 3 element set
Directed Acyclic Graph
Clique
Compound Graph
35
Graph Taxonomy
Graph
Directed Graph
Undirected Graph
Directed Acyclic Graph
Bipartite Graph
Clique
Partial Order Graph
Faceted Classification
Lattice
Partial Order Tree
Tree
Note: not all bipartite graphs
are undirected.
Ordered Tree
36
What Kind of Relations are There?
Lots!
Relationship class: A particular type of connection existing between people related to or having dealings with each other.
• acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
• ambivalentOf - A person towards whom this person has mixed feelings or emotions.
• ancestorOf - A person who is a descendant of this person.
• antagonistOf - A person who opposes and contends against this person.
• apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
• childOf - A person who was given birth to or nurtured and raised by this person.
• closeFriendOf - A person who shares a close mutual friendship with this person.
• collaboratesWith - A person who works towards a common goal with this person.
• …
37
Example of relations in a food web
in an Arctic ecosystem
An organism is connected to another organism for which it is a source
of food energy and material by an arrow representing the direction of
biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
38
Ontologies are a type of Concept System
• Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)
• An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.
• Why would someone want to develop an ontology? Some of the reasons are:
  • To share common understanding of the structure of information among people or software agents
  • To enable reuse of domain knowledge
  • To make domain assumptions explicit
  • To separate domain knowledge from the operational knowledge
  • To analyze domain knowledge
Source: http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html
39
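What is Reasoning?
Inference
[Figure: an is-a hierarchy]
Disease
  Infectious Disease (is-a Disease)
    Polio (is-a Infectious Disease)
    Smallpox (is-a Infectious Disease)
  Chronic Disease (is-a Disease)
    Diabetes (is-a Chronic Disease)
    Heart disease (is-a Chronic Disease)
[Legend: a marked arrow signifies an inferred is-a relationship]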
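A minimal sketch of how inferred is-a relationships can be derived from the asserted ones in this slide. The disease names come from the slide; the single-parent dictionary and the function names are assumptions made for illustration.

```python
# Asserted is-a links (child -> parent), following the slide's hierarchy.
IS_A = {
    "Infectious Disease": "Disease",
    "Chronic Disease": "Disease",
    "Polio": "Infectious Disease",
    "Smallpox": "Infectious Disease",
    "Diabetes": "Chronic Disease",
    "Heart disease": "Chronic Disease",
}


def ancestors(concept, is_a=IS_A):
    """Walk the is-a chain upward and return every ancestor, nearest first."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain


def inferred_is_a(is_a=IS_A):
    """Return (concept, ancestor) pairs that are implied but not asserted."""
    return {(c, a) for c in is_a for a in ancestors(c, is_a)[1:]}


print(ancestors("Polio"))
print(sorted(inferred_is_a()))
```

Here the inferred pairs are exactly the four leaf diseases linked to Disease, none of which is stated directly in `IS_A`.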
40
Reasoning: Taxonomies & partonomies can be used to support inference queries
E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
[Figure: a part-of hierarchy]
California
  Alameda County (part-of California)
    Oakland (part-of Alameda County)
    Berkeley (part-of Alameda County)
  Santa Clara County (part-of California)
    Santa Clara (part-of Santa Clara County)
    San Jose (part-of Santa Clara County)
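The county and state query described in this slide can be sketched as follows. The part-of links follow the slide; the event records and function names are invented purely for illustration.

```python
# part-of links (place -> containing region), following the slide.
PART_OF = {
    "Oakland": "Alameda County",
    "Berkeley": "Alameda County",
    "Santa Clara": "Santa Clara County",
    "San Jose": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}

# Hypothetical event data recorded by city only (no county or state codes).
EVENTS = [("book fair", "Berkeley"), ("marathon", "Oakland"), ("tech expo", "San Jose")]


def is_within(place, region):
    """True if place equals region or is transitively part-of it."""
    while place is not None:
        if place == region:
            return True
        place = PART_OF.get(place)
    return False


def events_in(region):
    """Events whose city lies inside the given county or state."""
    return [name for name, city in EVENTS if is_within(city, region)]


print(events_in("Alameda County"))
print(events_in("California"))
```

The event table never mentions a county or a state; the partonomy supplies that knowledge at query time.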
41
Reasoning: Relationship metadata can be used to infer non-explicit data
For example…
(1) patient data on drugs currently being taken contains brand names (e.g. Tylenol, Anacin-3, Datril, …);
(2) a concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
(3) so… patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as trade names explicitly stored as text strings in the database
[Figure: concept hierarchy linking brand names to drug classes]
Analgesic Agent
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
        Tylenol
        Anacin-3
        Datril
    Nonsteroidal Antiinflammatory Drug
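The three steps above can be sketched as a small lookup. The drug names come from the slide; the exact parent-child edges are one reading of the flattened diagram, and the patient records and function names are invented for illustration.

```python
# is-a links (term -> broader concept). The edges are an assumed reading of
# the slide's diagram, not an extract from a real thesaurus.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

# Hypothetical patient data: brand names stored as plain text strings.
PATIENT_DRUGS = {
    "patient-1": ["Tylenol"],
    "patient-2": ["Datril", "Vitamin C"],
    "patient-3": ["Vitamin C"],
}


def is_a(term, concept):
    """True if term equals concept or reaches it by following is-a links."""
    while term is not None:
        if term == concept:
            return True
        term = IS_A.get(term)
    return False


def patients_taking(concept):
    """Patients with at least one recorded drug that falls under concept."""
    return sorted(
        patient
        for patient, drugs in PATIENT_DRUGS.items()
        if any(is_a(drug, concept) for drug in drugs)
    )


print(patients_taking("Analgesic Agent"))
```

The query term "Analgesic Agent" never appears in the patient data; the match is made entirely through the concept system.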
42
Reasoning: Least Common Ancestor Query
What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)
[Figure: excerpt of the concept hierarchy]
Analgesic Agent
  Opioid
    Opiate
      Morphine Sulfate
      Codeine Phosphate
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
    Nonsteroidal Antiinflammatory Drug
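A least common ancestor query over this hierarchy can be sketched as below. The concept names come from the slide; the parent links are an assumed reading of the diagram, and the sketch assumes each concept has a single parent, which a real thesaurus such as the NCI Thesaurus does not guarantee.

```python
# child -> parent links; an assumed, single-parent reading of the slide.
PARENT = {
    "Opioid": "Analgesic Agent",
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Opiate": "Opioid",
    "Morphine Sulfate": "Opiate",
    "Codeine Phosphate": "Opiate",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Acetaminophen": "Analgesic and Antipyretic",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def least_common_ancestor(a, b):
    """First concept on b's path to the root that also lies on a's path."""
    seen = set(path_to_root(a))
    for node in path_to_root(b):
        if node in seen:
            return node
    return None


print(least_common_ancestor("Acetaminophen", "Morphine Sulfate"))
```

With multiple parents per concept the same idea still applies, but the ancestor sets would have to be computed with a graph traversal rather than a single upward walk.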
43
Reasoning: Example “sibling” queries: concepts that share a common ancestor
• Environmental:
  • “Siblings” of Wetland (in NASA SWEET ontology)
• Health:
  • Siblings of ERK1 finds all 700+ other kinase enzymes
  • Siblings of Novastatin finds all other statins
• 11179 Metadata:
  • Sibling values in an enumerated value domain
44
Reasoning: More complex “sibling” queries: concepts with multiple ancestors
• Health
  • Find all the siblings of Breast Neoplasm
  [Figure: Breast Neoplasm has two parents, breast disorders and site neoplasms. Under breast disorders it sits beside Non-Neoplastic Breast Disorder; under site neoplasms it sits beside Eye neoplasm and Respiratory System neoplasm.]
• Environmental
  • Find all chemicals that are a carcinogen (cause cancer) and toxin (are poisonous) and teratogenic (cause birth defects)
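The health example (siblings of Breast Neoplasm when a concept has more than one parent) can be sketched as follows. The concept names come from the slide; the parent assignments are an assumed reading of the diagram and the function name is hypothetical.

```python
# concept -> set of parents. Breast Neoplasm has two parents, so its siblings
# come from both branches. The assignments are an assumed reading of the slide.
PARENTS = {
    "Breast Neoplasm": {"breast disorders", "site neoplasms"},
    "Non-Neoplastic Breast Disorder": {"breast disorders"},
    "Eye neoplasm": {"site neoplasms"},
    "Respiratory System neoplasm": {"site neoplasms"},
}


def siblings(concept):
    """Other concepts that share at least one parent with the given concept."""
    parents = PARENTS.get(concept, set())
    return sorted(
        other
        for other, other_parents in PARENTS.items()
        if other != concept and other_parents & parents
    )


print(siblings("Breast Neoplasm"))
print(siblings("Eye neoplasm"))
```

The environmental example is the complementary case: instead of a union over parents, it intersects the members of three classes (carcinogen, toxin, teratogen).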
45
End of Tutorial about concept systems
What are the “Database Language”
challenges?
46
Metadata Registries & Database Technologies – Which Does What?
Traditional Data Registries (11179 Edition 2)
• Register metadata which describes data, whether in databases, applications, XML Schemas, data models, flat files, or paper
• Assist in harmonizing, standardizing, and vetting metadata
• Assist data engineering
• Provide a source of well formed data designs for system designers
• Record reporting requirements
• Assist data generation, by describing the meaning of data entry fields and the potential valid values
• Register provenance information that can be provided to end users of data
• Assist with information discovery by pointing to systems where particular data is maintained.
49
Traditional MDR: Manage Code Sets
[Figure repeated from the “Same Fact, Different Terms” slide: the data element concept “Country Identifiers” (unique ID 5769) and its data elements (unique ID 4572), whose value domains list each country by ISO 3166 English name, French name, 2-alpha code, 3-alpha code, and 3-numeric code, e.g. Denmark, Danemark, DK, DNK, 208.]
50
What Can XMDR Do?
Support a new generation of semantic computing
• Concept system management
• Harmonizing and vetting concept systems
• Linkage of concept systems to data
• Interrelation of multiple concept systems
• Grounding ontologies and RDF in agreed upon semantics
• Reasoning across XMDR content (concept systems and metadata)
• Provision of Semantic Services
51
We are trying to manage semantics in
an increasingly complex content space
Structured data
Semi-structured data
Unstructured data
Text
Pictographic
Graphics
Multimedia
Voice video
52
Case Study
• Combining Concept Systems, Data, and Metadata to answer queries.
53
Linking Concepts: Text Document
Title 40--Protection of Environment
CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS
§ 141.62 40 CFR Ch. I (7–1–02 Edition)
§ 141.62 Maximum contaminant levels
for inorganic contaminants.
(a) [Reserved]
(b) The maximum contaminant levels
for inorganic contaminants specified in
paragraphs (b) (2)–(6), (b)(10), and (b)
(11)–(16) of this section apply to community
water systems and non-transient,
non-community water systems.
The maximum contaminant level specified
in paragraph (b)(1) of this section
only applies to community water systems.
The maximum contaminant levels
specified in (b)(7), (b)(8), and (b)(9)
of this section apply to community
water systems; non-transient, noncommunity
water systems; and transient
non-community water systems.
Contaminant MCL (mg/l)
(1) Fluoride ............................ 4.0
(2) Asbestos .......................... 7 Million Fibers/liter (longer
than 10 μm).
(3) Barium .............................. 2
(4) Cadmium .......................... 0.005
(5) Chromium ......................... 0.1
(6) Mercury ............................ 0.002
(7) Nitrate ............................... 10 (as Nitrogen)
54
Thesaurus Concept System (From GEMET)
Chemical Contamination
Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose.
Broader Term: contamination
Narrower Terms: cadmium contamination, lead contamination, mercury contamination
Related Terms: chemical pollutant, chemical pollution
Deutsch: Chemische Verunreinigung
English (US): chemical contamination
Español: contaminación química
Source: General Multi-Lingual Environmental Thesaurus (GEMET)
55
Concept System (Thesaurus)
Contamination
  Biological
  Radioactive
  Chemical
    cadmium
    lead
    mercury
Related terms of Chemical: chemical pollutant, chemical pollution
56
Chemicals in EPA Environmental Data Registry
[Screenshot of an Environmental Data Registry search result. Columns: Name, Type, CAS Number, TSN, ICTV, EPA ID]
Name                                                                     Type                 CAS Number
Mercury                                                                  Chemical             7439-97-6
Mercury, bis(acetato.kappa.O)(benzenamine)-                              Chemical             63549-47-3
Mercury, (acetato.kappa.O)phenyl-, mixt. with phenylmercuric propionate  Chemical             No CAS Number
Acalypha ostryifolia                                                     Biological Organism  (TSN 28189)
EPA IDs shown: E17113275, E965269
57
Data
[Map: station X on Fletcher Creek, station B on Merced River, station A on Merced Lake]
Monitoring Stations
Name  Latitude  Longitude  Location
A     41.45 N   125.99 W   Merced Lake
B     43.23 N   120.50 W   Merced River
X     39.45 N   118.12 W   Fletcher Creek
Measurements
ID  Date        Temp  Hg
A   2006-09-13  4.4   4
B   2006-09-13  9.3   2
X   2006-09-15  5.2   3
X   2006-09-13  6.7   78
58
Metadata
Contaminants
Contaminant  Threshold
mercury      5
lead         42?
cadmium      250?
Metadata
System               Data Element  Definition                        Units                 Precision
Measurements         ID            Monitoring Station Identifier     not applicable        not applicable
Measurements         Date          Date sample was collected         not applicable        not applicable
Measurements         Temp          Temperature                       degrees Celsius       0.1
Measurements         Hg            Mercury contamination             micrograms per liter  0.004
Monitoring Stations  Name          Monitoring Station Identifier
Monitoring Stations  Latitude      Latitude where sample was taken
Monitoring Stations  Longitude     Longitude where sample was taken
Monitoring Stations  Location      Body of water monitored
Contaminants         Contaminant   Name of contaminant
Contaminants         Threshold     Acceptable threshold value
59
Relations among Inland Bodies of Water
[Figure: Fletcher Creek, Merced River, and Merced Lake connected by “feeds into” and “fed from” relations.]
60
Combining Data, Metadata & Concept Systems
Inference Search Query:
“find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”
Data
ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78
Concept system
Contamination
  Biological
  Radioactive
  Chemical
    mercury
    lead
    cadmium
Metadata
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
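A minimal sketch of how such an inference query could combine the three sources. The measurements, station locations, and concept system come from the case-study slides; the direction of flow (Fletcher Creek into Merced River into Merced Lake) is an assumed reading of the relations figure, micrograms per liter is treated as equivalent to parts per billion, and all structure and function names are hypothetical.

```python
from datetime import date

# Concept system: kinds of chemical contamination (from the slide).
NARROWER = {"Chemical": ["mercury", "lead", "cadmium"]}
# Metadata: which data column measures which contaminant, and its units.
COLUMN_MEASURES = {"Hg": ("mercury", "micrograms per liter")}
# Relations: ASSUMED direction of flow between the water bodies.
FEEDS_INTO = {"Fletcher Creek": "Merced River", "Merced River": "Merced Lake"}
# Data: station locations and measurements (from the slides).
STATION_LOCATION = {"A": "Merced Lake", "B": "Merced River", "X": "Fletcher Creek"}
MEASUREMENTS = [
    {"ID": "A", "Date": date(2006, 9, 13), "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": date(2006, 9, 13), "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": date(2006, 9, 13), "Temp": 6.7, "Hg": 78},
]


def downstream_of(water_body):
    """Follow feeds-into links and return every water body further down."""
    found = []
    while water_body in FEEDS_INTO:
        water_body = FEEDS_INTO[water_body]
        found.append(water_body)
    return found


def contaminated_downstream(source, kind, limit, start, end):
    """Water bodies downstream of source with any `kind` reading over limit."""
    # Concept system + metadata: find the data columns for this contamination.
    columns = [
        col
        for col, (contaminant, _units) in COLUMN_MEASURES.items()
        if contaminant in NARROWER.get(kind, [])
    ]
    targets = set(downstream_of(source))
    hits = set()
    for row in MEASUREMENTS:
        location = STATION_LOCATION[row["ID"]]
        if location in targets and start <= row["Date"] <= end:
            if any(row[col] > limit for col in columns):
                hits.add(location)
    return sorted(hits)


print(contaminated_downstream("Fletcher Creek", "Chemical", 2,
                              date(2006, 1, 1), date(2006, 12, 31)))
```

Note that the sample rows are dated 2006, so the date window in the slide's query (December 2001 to March 2003) would return nothing for them; the call above uses a 2006 window to exercise the logic.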
61
Example – Environmental Text Corpus
• Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1b of EPA-sponsored research.
• Prepare the corpus and make it available
  • Research results from years of ORD R&D
• Publish associated metadata and concept systems in XMDR
• Use open source software for EPA testing
62
Information Extraction & Semantic Computing
[Figure: an extraction engine, supported by an 11179-3 (E3) XMDR, that leads to actionable information and decision support. Steps shown: Segment, Classify, Associate, Normalize, Deduplicate; Discover patterns, Select models, Fit parameters, Inference, Report results.]
63
Metadata Registries are Useful
Registered semantics
• For “training” extraction engines
• The “Normalize” function can make use of standard code sets that have mappings between representation forms.
• The “Classify” function can interact with pre-established concept systems.
Provenance
• High precision for proper nouns, less precision (e.g., 70%) for other concepts. This impacts downstream processing, so precision needs to be tracked.
65
Normalize – Need Registered and Mapped Concepts/Code Sets
[Figure repeated from the “Same Fact, Different Terms” slide: the data element concept “Country Identifiers” (unique ID 5769) and its data elements (unique ID 4572), whose value domains list each country by ISO 3166 English name, French name, 2-alpha code, 3-alpha code, and 3-numeric code, e.g. Denmark, Danemark, DK, DNK, 208.]
66
Challenge for Database Languages
• The extraction database can contain graphs with more than a billion nodes.
• Types of queries that can be done
• Query performance
• Linkage of “extract database” concepts and relations to the same concepts and relations in traditional databases.
67
Example – 11179-3 (E3) Support Semantic Web Applications
XMDR may be used to “ground” the semantics of an RDF statement.
The address state code is “AB”. This can be expressed as a directed graph, e.g., an RDF statement:
Graph:  Node     Edge        Node
RDF:    Subject  Predicate   Object
        Address  State Code  AB
68
Example: Grounding RDF nodes and relations: URIs Reference a Metadata Registry
dbA:e0139  ai:MailingAddress  dbA:ma344
dbA:ma344  ai:StateUSPSCode   “AB”^^ai:StateCode
@prefix dbA: “http:/www.epa.gov/databaseA”
@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”
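A minimal sketch of what “grounding” could look like in practice: each predicate written with the `ai:` prefix is expanded to a full URI and looked up in a registry of definitions. The prefix and the definitions are taken from the slides; the in-memory `REGISTRY` dictionary and the function names are assumptions, standing in for a real metadata registry service.

```python
# Namespace for administered items, as given by the slide's ai: prefix.
AI = "http://www.epa.gov/edr/sw/AdministeredItem#"

# Hypothetical in-memory stand-in for a metadata registry: URI -> definition.
REGISTRY = {
    AI + "MailingAddress": (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"
    ),
    AI + "StateUSPSCode": (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"
    ),
}

# The two statements from the slide, kept in prefixed form.
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", '"AB"^^ai:StateCode'),
]


def expand(curie):
    """Expand an ai:-prefixed name to a full URI; other prefixes give None."""
    prefix, _, local = curie.partition(":")
    return AI + local.strip() if prefix == "ai" else None


def ground(triples):
    """Attach the registered definition of each predicate to its triple."""
    return [(s, expand(p), REGISTRY.get(expand(p)), o) for s, p, o in triples]


for subject, predicate_uri, definition, obj in ground(TRIPLES):
    print(subject, predicate_uri, obj)
    print("  definition:", definition)
```

A consumer of the RDF data can then interpret `ai:StateUSPSCode` by its registered definition rather than by guessing from the name.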
69
Definitions in the EPA Environmental Data Registry
Mailing Address:
  http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
  The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
State USPS Code:
  http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
  The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
Mailing Address State Name:
  http://www.epa/gov/edr/sw/AdministeredItem#StateName
  The name of the state where mail is delivered
70
Ontologies for Data Mapping
Ontologies can help to capture and express semantics
[Figure: an ontology of concepts (Geographic Area, Geographic Sub-Area, Country) linked to data elements that represent a country in different ways: Country Identifier; Country Name as Short Name or Long Name; Mailing Address Country Name; Distributor Country Name; and Country Code as ISO 3166 2-Character Code, ISO 3166 3-Character Code, ISO 3166 3-Numeric Code, or FIPS Code.]
72
Example: Content Mapping Service
• Collect data from many sources. Files contain data that has the same facts represented by different terms. E.g., one system responds with Danemark, DK, another with DNK, another with 208; map all to Denmark.
• XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
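The mapping step can be sketched with the ISO 3166 rows shown in the country-identifier slides. The table contents come from those slides; the lookup structure and function name are hypothetical, and a real service would read the code sets from the registry rather than from a hard-coded list.

```python
# Rows: English name, French name, 2-alpha, 3-alpha, 3-numeric (from the slides).
ISO_3166 = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]

# Index every representation form (case-insensitively) to the English name.
LOOKUP = {form.casefold(): row[0] for row in ISO_3166 for form in row}


def to_english_name(value):
    """Map any known representation to the English name, or None if unknown."""
    return LOOKUP.get(str(value).strip().casefold())


responses = ["Danemark", "DK", "DNK", "208"]
print([to_english_name(r) for r in responses])
```

Numeric codes are kept as strings so that leading zeros (e.g. 012 for Algeria) are preserved.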
73
Actions to Manage Enterprise Semantics
• Define data, concepts, and relations
• Harmonize and vet data and concept systems
• Ground semantics for RDF, concept systems, ontologies
• Provide semantics services
74
Challenge: Concept System Store
[Figure: a Metadata Registry holding a concept system (thesaurus, themes, ontology, GEMET), structured metadata, and data standards.]
Concept systems:
• Keywords
• Controlled Vocabularies
• Thesauri
• Taxonomies
• Ontologies
• Axiomatized Ontologies
(Essentially graphs: node-relation-node + axioms)
75
Challenge: Management of Concept Systems
[Figure: a Metadata Registry holding a concept system (thesaurus, themes, ontology, GEMET), structured metadata, and data standards.]
Concept system:
• Registration
• Harmonization
• Standardization
• Acceptance (vetting)
• Mapping (correspondences)
76
Challenge: Life Cycle Management
[Figure: a Metadata Registry holding a concept system (thesaurus, themes, ontology, GEMET), structured metadata, and data standards.]
Life cycle management: data and concept systems (ontologies)
77
Challenge: Grounding Semantics
[Figure: Metadata Registries (a Metadata Registry holding a concept system with thesaurus, themes, ontology, and GEMET, plus structured metadata and data standards) ground the Semantic Web.]
Semantic Web:
• RDF Triples
  • Subject (node URI)
  • Verb (relation URI)
  • Object (node URI)
• Ontologies
78
Some Limitations of Relational Technologies & SQL
• Limited graph computations
• Weak graph query language
• Limited object computations
• Weak object query language
• Inadequate linkage of metadata to data (underspecified “catalog”)
• CASE tools also disable, rather than enable, data administration & semantics management
79
Limitations (Cont.)
• Limited linkage of concept system (graphs) to data (relational, graph, object)
80
Some Input From WG 2 and XMDR
• Look at recent work on a graph query language by David Silberberg of Johns Hopkins University Applied Physics Lab.
81
Input from WG 2 and XMDR
• David Jensen, of the University of Massachusetts Amherst (http://kdl.cs.umass.edu/people/jensen/), has been developing a very interesting Proximity system and in the process has worked with complex patterns in very large data sets, including alternative query languages and database technologies (http://kdl.cs.umass.edu/proximity/index.html). QGRAPH is a new visual language for querying and updating graph databases. A key feature of QGRAPH is that the user can draw a query consisting of vertices and edges with specified relations between their attributes. The response will be the collection of all subgraphs of the database that have the desired pattern.
82
Input from WG 2 and XMDR
Query languages are necessary to extract useful information from massive data sets. Moreover, annotated corpora require thousands of hours of manual annotation to create, revise and maintain. Query languages are also useful during this process. For example, queries can be used to find parse errors or to transform annotations into different schemes. However, they suffer from several problems.
• First, updates are not supported as query languages focus on the needs of linguists searching for syntactic constructions.
• Second, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well.
• Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually-inconsistent approaches.
Catherine Lai and Steven Bird
Department of Computer Science and Software Engineering
University of Melbourne, Victoria 3010, Australia
83
Input from WG 2 and XMDR
• Try to keep an eye on companies that are grappling with advanced database, knowledge management, information extraction, and analysis requirements, such as Metamatrix, I2, NetViz, Top Quadrant, OntologyWorks, Franz, Cogito, or Objectivity, with new ones cropping up very often.
• Check out the EU sites given the large investments being made there in areas of interest. For example, KAON.
• Watch the outcome of an NSF-funded project on querying linguistic databases, including annotated corpora (http://projects.ldc.upenn.edu/QLDB/). Steven Bird at U. Melbourne is one of the principals on that project.
84
Input from WG 2 and XMDR
• Need for graph query languages that go beyond RDF and XML
• Frank Olken: Make SQL a strongly typed language with respect to measurement dimensionality.
• Performance: project graph structured queries against graph structured data. Express with great difficulty the query in SQL. Complex objects. Model gets complex. Putting Humpty Dumpty together again at query time.
• Political problem in govt. Vendors on board, hard to pursue other technologies.
• Object systems. OMG working on it? (OQL?). Java has ugly layer that maps into relational system. Franz has SPARQL built on top of a graph store.
85
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. However, and perhaps more important, it also represents an important and essential set of techniques for constructing useful applications of data mining in a wide variety of real and important domains, especially those involving complex event detection from highly structured data. Imagine a complete “link mining toolkit.” What would such a toolkit look like?
86
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
• Most important, it would require a language that enabled the natural representation of entities and links. Such a language would also allow for the representation of pattern templates and for specifying matches between the templates and their instantiations.
• The language would have to accept an arbitrary database schema as input, with a specified mapping between relations in the database and fundamental link types in the language.
• It would have to compile into efficient and rapidly executable database queries.
• It would need to be able to represent grouped entities and multiple abstraction hierarchies and reason at all levels.
• It would have to enable the creation of new schema elements in the database to represent newly discovered concepts.
87
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
• It would need to represent both pattern templates and pattern instances, and to have a mechanism for tracking matches between the two.
• It would have to have constructs for representing fundamental relationships such as part-of, is-a, and connected-to (the most generic link relationship), as well as perhaps other high-level link types such as temporal relationships (e.g., before, after, during, overlapping, etc.), geo-spatial relationships, organizational relationships, trust relationships, and activities and events.
• The toolkit would include at least one and possibly many pattern matchers.
• It would require tools for creating and editing patterns. It would have to include visualizations for many different types of structured data.
• It would need mechanisms for handling uncertainty and confidence.
• It would have to track the dependence of any conclusion (e.g., pattern match or discovered pattern) back to the underlying data, and perhaps incorporate backtracking so the impact of data corrections could be detected.
88
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
• It would need configuration management tools to track the history of discovered and matched patterns.
• It would need workflow mechanisms to support multiple users in an organizational structure.
• It would need mechanisms for ingesting domain-specific knowledge.
• It would have to be able to deal with multiple data types including text and imagery.
• And it would have to be able to rapidly incorporate new link mining techniques as they are developed.
• Finally, it would need to include mechanisms for maximum privacy protection.
89
Where to Progress Semantics Management?
• SC 32, in WG 2 and WG 3, as extensions to ongoing work or as New Work Items
• W3C, as XQuery, SPARQL, Semantic Web Deployment WG (RDF vocabularies, SKOS)
• OMG, as extensions to the MOF
• …
90
Thanks & Acknowledgements
• John McCarthy
• Karlo Berket
• Kevin Keck
• Frank Olken
• Harold Solbrig
• L8 and SC 32/WG 2 Standards Committees
Major XMDR Project Sponsors and Collaborators
• U.S. Environmental Protection Agency
• Department of Defense
• National Cancer Institute
• U.S. Geological Survey
• Mayo Clinic
• Apelon
91