Ontology Generation from Tables


TANGO: Table ANalysis for Generating Ontologies
Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy**
* Brigham Young University
** Rensselaer Polytechnic Institute

List of contents
Motivation
Applications
Table understanding
Concept matching
Ontology merging/growing
Example
Future direction

Motivation
Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
Keyword search and link analysis are not enough to find information in tables
Structure in tables can lead to domain knowledge, which includes concepts, relationships and constraints (ontologies)
Tables on the web, created for human use, can lead to robust domain ontologies

TANGO Applications
Extraction ontologies (generation)
Data integration
Semantic web
Multiple-source query processing
Document image analysis for documents that contain tables

Table understanding
What is a table?
Why table normalization?
What is table understanding?
What is mini-ontology generation?

Table understanding:
What is a table?
“…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy)
Normalized tables (row-column format)
Small paper tables (using OCR) and/or electronic tables (marked up), intended for human use

Table understanding:
What is table normalization?
Table normalization means taking any table and producing a standard row-column table in which all data cells contain expanded values and type information.
Raw table → normalized table:
Country             | GDP/PPP          | GDP/PPP Per Capita | Real Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800               | ?                | ?
Albania             | $13,200,000,000  | $3,800             | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600             | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000            | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330             | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000            | 3.5%             | 0.4%
…                   | …                | …                  | …                | …

Table understanding:
What is table normalization?
??             | Population    | Population Growth Rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan    | 25,824,882    | 3.95%                  | 39.88 persons/km2  | 4.19%      | 1.70%      | 1.46%          | 47.82 years          | 46.82 years            | 14.06%
Albania        | 3,364,571     | 1.05%                  | 122.79 persons/km2 | 2.07%      | 0.74%      | -0.29%         | 65.92 years          | 72.33 years            | 4.29%
Algeria        | 31,133,486    | 2.10%                  | 13.07 persons/km2  | 2.70%      | 0.55%      | -0.05%         | 68.07 years          | 70.46 years            | 4.38%
American Samoa | 63,786        | 2.64%                  | 320.53 persons/km2 | 2.65%      | 0.40%      | 0.39%          | 71.23 years          | 79.95 years            | 1.02%
Andorra        | 65,939        | 2.24%                  | 146.53 persons/km2 | 1.03%      | 0.55%      | 1.76%          | 80.55 years          | 86.55 years            | 0.41%
Angola         | 11,510        | 2.84%                  | 8.97 persons/km2   | 4.31%      | 1.64%      | 0.16%          | 46.08 years          | 50.82 years            | 12.92%
…              | …             | …                      | …                  | …          | …          | …              | …                    | …                      | …
Western Sahara | 239,333       | 2.34%                  | 0.90 persons/km2   | 4.54%      | 1.66%      | -0.54%         | 47.98 years          | 50.57 years            | 13.67%
World          | 5,995,544,836 | 1.30%                  | 14.42 persons/km2  | 2.20%      | 0.90%      | ?              | 61.00 years          | 65.00 years            | 5.60%
Yemen          | 16,942,230    | 3.34%                  | 32.09 persons/km2  | 4.33%      | 0.99%      | 0.00%          | 58.17 years          | 61.88 years            | 6.98%
Zambia         | 9,663,535     | 2.12%                  | 13.05 persons/km2  | 4.45%      | 2.26%      | 0.08%          | 36.72 years          | 37.21 years            | 9.19%
Zimbabwe       | 11,163,160    | 1.02%                  | 28.87 persons/km2  | 3.06%      | 2.04%      | ?              | 38.77 years          | 38.94 years            | 6.12%
…              | …             | …                      | …                  | …          | …          | …              | …                    | …                      | …

Table understanding:
Information useful for normalization
Captions: in the vicinity of the table (above, below, etc.)
Footnotes: on annotated column labels or data cells
Embedded information: in rows, columns or cells (e.g., $, %, (1,000), billions)
Links to other views of the table, possibly with new information

What is table understanding?
Normalize table
Take a table as an input and produce standard records in the form of attribute-value pairs as output
Discover constraints among columns
Understand the data values

Example:
Country             | GDP/PPP          | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800               | ?                | ?
Albania             | $13,200,000,000  | $3,800             | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600             | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000            | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330             | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000            | 3.5%             | 0.4%
…                   | …                | …                  | …                | …

Left-most column (Country): primary key
Constraints: {has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}
Value types (from data frames): Country names (Country); Dollar amount (GDP/PPP, GDP/PPP Per Capita); Percentage (Real-Growth Rate, Inflation)
Output record: {<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>, <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}
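
The record-producing step can be illustrated with a small Python sketch. It is not TANGO's implementation: the function name `to_records` is hypothetical, and the sketch assumes the table is already in row-column form with one header row. Unknown cells ("?") are passed through unchanged.

```python
from typing import Dict, List


def to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn a row-column table into one attribute-value record per row."""
    records = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row has {len(row)} cells, expected {len(header)}")
        # Pair each column label with the cell in the same position.
        records.append(dict(zip(header, row)))
    return records


header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
]
records = to_records(header, rows)
```

Here `records[0]` corresponds to the Afghanistan record shown on the slide, with "?" kept for the unknown values.
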
Example:
Creating a domain ontology
[Figure: domain ontology diagram. Concepts: Geopolitical Entity, Name, Location, Latitude, Longitude, GMT Time, Country, City. Relationship labels: "names", "has", "Latitude and longitude designates location". Annotations: "Has associated data frames", "Includes procedural knowledge", "Distances", "Duration between Time zones".]

Example:
Table understanding to mini-ontology generation
Agglomeration         | Population | Continent         | Country
Tokyo                 | 31,139,900 | Asia              | Japan
New York-Philadelphia | 30,286,900 | The Americas      | United States of America
Mexico                | 21,233,900 | The Americas      | Mexico
Seoul                 | 19,969,100 | Asia              | Korea (South)
Sao Paulo             | 18,847,400 | The Americas      | Brazil
Jakarta               | 17,891,000 | Asia              | Indonesia
Osaka-Kobe-Kyoto      | 17,621,500 | Asia              | Japan
…                     | …          | …                 | …
Niigata               | 503,500    | Asia              | Japan
Raurkela              | 503,300    | Asia              | India
Homjel                | 502,200    | Europe            | Belarus
Zunyi                 | 501,900    | Asia              | China
Santiago              | 501,800    | The Americas      | Dominican Republic
Pingdingshan          | 501,500    | Asia              | China
Fargona               | 501,000    | Asia              | Uzbekistan
Kirov                 | 500,200    | Europe            | Russia
Newcastle             | 500,000    | Australia/Oceania | Australia
[Figure: generated mini-ontology with concepts Agglomeration, Country, Population, Continent]

Example:
Concept matching to ontology merging
[Figure: a mini-ontology and the domain ontology joined by "Merge", followed by "Results". Labels shown before the merge: Agglomeration, Population, Country, Continent, Geopolitical Entity, Name, Location, Latitude, Longitude, "names", "has", "Latitude and longitude designates location". Labels shown in the results: Geopolitical Entity, Name, Location, Latitude, Longitude, GMT Time, Population, Continent, Country, Agglomeration, City.]

Concept matching
We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
  Lexical and Natural Language Processing
  Value Similarity
  Value Features
  Data Frame Comparison
  Constraints

Concept Matching
(Lexical & NLP)
Lexical
  Direct comparisons (substring/superstring)
  WordNet (Synonyms, Word Senses, Hypernyms/Hyponyms)
Natural Language Processing
  Phrases in column headers
  Footnotes (for columns, rows, values)
  Explanations of symbols, rows, columns
  Titles and subtitles
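
A minimal sketch of the direct (substring/superstring) comparison of column labels, assuming labels are compared after lowercasing and collapsing punctuation. The names `normalize_label` and `lexical_match` are illustrative only, and WordNet lookups are left out.

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase a label and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def lexical_match(a: str, b: str) -> str:
    """Classify label a against label b: 'equal', 'substring', 'superstring' or 'none'."""
    na, nb = normalize_label(a), normalize_label(b)
    if not na or not nb:
        return "none"
    if na == nb:
        return "equal"
    if na in nb:
        return "substring"    # a is contained in b
    if nb in na:
        return "superstring"  # a contains b
    return "none"
```

For example, `lexical_match("Population", "Population Density")` reports a substring match, which is evidence of a relationship between the two columns but not of identity.
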
Concept Matching
(Value Similarity)
Compute overlap of string values by comparing data sets
Compute overlap of numeric values by comparing Gaussian probability distributions
Compute similarity of numeric values using regression
Concept Matching
(Value Similarity)
Real-world example:
A: Afghanistan, Albania, Algeria, Andorra, …, Yemen, Zambia, Zimbabwe
B: Afghanistan, Albania, Algeria, American Samoa, …, World, Yemen, Zambia, Zimbabwe
Total of 193 cells in A
Total of 267 cells in B
190 total matches
3 fields in A not in B
77 fields in B not in A
Proportion of matches with respect to A = 190/193 = 98%
Proportion of matches with respect to B = 190/267 = 71%
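
The two proportions can be computed with plain set operations. The sketch below assumes values are compared as distinct, case-insensitive strings; the function name `value_overlap` and the short sample lists are illustrative, not the full columns from the example.

```python
from typing import Iterable, Tuple


def value_overlap(a: Iterable[str], b: Iterable[str]) -> Tuple[int, float, float]:
    """Return (matches, proportion with respect to A, proportion with respect to B)."""
    set_a = {v.strip().lower() for v in a}
    set_b = {v.strip().lower() for v in b}
    if not set_a or not set_b:
        raise ValueError("both value sets must be non-empty")
    matches = len(set_a & set_b)
    return matches, matches / len(set_a), matches / len(set_b)


col_a = ["Afghanistan", "Albania", "Algeria", "Andorra", "Yemen", "Zambia", "Zimbabwe"]
col_b = ["Afghanistan", "Albania", "Algeria", "American Samoa", "World",
         "Yemen", "Zambia", "Zimbabwe"]
matches, prop_a, prop_b = value_overlap(col_a, col_b)
```

With the counts from the slide, the same arithmetic gives 190/193 and 190/267.
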
Concept Matching
(Value Similarity)
Gaussian PDF example:
A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; …; 3,567,203; 2,300,531; 1,400,112
B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; …; 7,000,000; 3,500,050; 2,300,000; 1,500,000
Total of 170 cells in A
Total of 240 cells in B
168 total matches
2 fields in A not in B
50 fields in B not in A
Proportion of matches with respect to A = 168/170 = 99%
Proportion of matches with respect to B = 168/240 = 70%
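
The slides do not say how the Gaussian distributions are compared. One possible measure, used here purely as an assumption, is the Bhattacharyya coefficient between Gaussians fitted to each column (1.0 for identical distributions, near 0 for disjoint ones). The function name `gaussian_overlap` is hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def gaussian_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Bhattacharyya coefficient between Gaussians fitted to columns a and b."""
    mu_a, mu_b = mean(a), mean(b)
    var_a, var_b = pstdev(a) ** 2, pstdev(b) ** 2
    if var_a == 0 or var_b == 0:
        raise ValueError("each column needs at least two distinct values")
    # Closed-form Bhattacharyya distance for two univariate Gaussians.
    distance = (0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
                + 0.5 * math.log((var_a + var_b) / (2 * math.sqrt(var_a * var_b))))
    return math.exp(-distance)


col_a = [31900600, 30521550, 25335200, 3567203, 2300531, 1400112]
col_b = [31500900, 30400111, 25500100, 3500050, 2300000, 1500000]
score = gaussian_overlap(col_a, col_b)
```

A score close to 1 suggests the two columns draw from similar value ranges, which supports (but does not prove) that they describe the same concept.
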
Concept Matching
(Value Features)
We can also compute similarities from value characteristics such as:
  Character/numeric length, ratio
  Mean, variance and standard deviation of numeric values
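
As a rough illustration, a column can be summarized by a few character-level features and two columns compared on those summaries. The feature names below (`mean_length`, `length_variance`, `digit_ratio`) are chosen for this sketch and are not taken from the slides.

```python
from statistics import mean, pvariance
from typing import Dict, List


def value_features(values: List[str]) -> Dict[str, float]:
    """Summarize a column's values with simple character-level features."""
    lengths = [len(v) for v in values]
    digits = sum(ch.isdigit() for v in values for ch in v)
    total = sum(lengths)
    return {
        "mean_length": mean(lengths),           # average number of characters per value
        "length_variance": pvariance(lengths),  # spread of value lengths
        "digit_ratio": digits / total if total else 0.0,  # share of characters that are digits
    }


percent_features = value_features(["7.3%", "3.0%", "3.8%"])
name_features = value_features(["Albania", "Algeria"])
```

Columns of percentages and columns of country names end up with clearly different feature values, even before any value-by-value comparison.
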
Concept Matching
(Data frames)
Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional, etc.)
We have used data frames to:
  Recognize data types
  Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
  Provide conversion routines
  Match headers, labels, footnotes and values
  Compose or split columns (e.g., addresses)
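
A data frame can be pictured as a recognizer plus a conversion routine. The sketch below is a simplified stand-in, not the data-frame machinery used in TANGO: the class `DataFrameSketch`, its fields and the two regular expressions are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DataFrameSketch:
    name: str                       # type label, e.g. "Percentage"
    pattern: str                    # regular expression recognizing a value
    convert: Callable[[str], float]  # conversion routine to a numeric value

    def recognizes(self, value: str) -> bool:
        return re.fullmatch(self.pattern, value.strip()) is not None


FRAMES = [
    DataFrameSketch("Dollar amount", r"\$\d{1,3}(,\d{3})*(\.\d+)?",
                    lambda v: float(v.strip().lstrip("$").replace(",", ""))),
    DataFrameSketch("Percentage", r"-?\d+(\.\d+)?%",
                    lambda v: float(v.strip().rstrip("%")) / 100),
]


def recognize_column(values: List[str]) -> Optional[str]:
    """Return the first frame that recognizes every known value; '?' cells are ignored."""
    known = [v for v in values if v.strip() != "?"]
    for frame in FRAMES:
        if known and all(frame.recognizes(v) for v in known):
            return frame.name
    return None
```

For instance, `recognize_column(["?", "7.3%", "3.0%"])` yields "Percentage", and `FRAMES[0].convert("$3,800")` yields 3800.0.
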
Concept Matching
(Constraints)
Keys in tables (as well as nonkeys)
Functional relationships
1-1, 1-*, *-1 or *-* correspondences
Subset/superset of value sets
Unknown and null values
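
The correspondence between two columns can be estimated from the data itself. The following sketch classifies a pair of equally long columns by checking whether each side determines the other; the function name `correspondence` is hypothetical, and the result is only evidence from the observed rows, not a proven constraint.

```python
from collections import defaultdict
from typing import Hashable, List


def correspondence(left: List[Hashable], right: List[Hashable]) -> str:
    """Classify the pairing of two columns as '1-1', '1-*', '*-1' or '*-*'."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for l, r in zip(left, right):
        left_to_right[l].add(r)
        right_to_left[r].add(l)
    # A side is functional if each of its values pairs with exactly one value on the other side.
    left_functional = all(len(s) == 1 for s in left_to_right.values())
    right_functional = all(len(s) == 1 for s in right_to_left.values())
    if left_functional and right_functional:
        return "1-1"
    if left_functional:
        return "*-1"  # many left values may share one right value
    if right_functional:
        return "1-*"  # one left value may pair with many right values
    return "*-*"


agglomerations = ["Tokyo", "Osaka-Kobe-Kyoto", "Seoul"]
countries = ["Japan", "Japan", "Korea (South)"]
populations = ["31,139,900", "17,621,500", "19,969,100"]
```

On these rows, Agglomeration to Country comes out as *-1 (a functional relationship), while Agglomeration to Population comes out as 1-1, consistent with Agglomeration acting as a key.
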

Ontology merging/growing
Direct merge (no conflicts)
  Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
Conflict resolution
  Interactively identify evidence and counter evidence of functional relationships among mini-ontologies using constraint resolution
IDS interaction with human knowledge engineer
  Issues: identify
  Default strategy: apply
  Suggestions: make

Example:
Another mini-ontology generation
[Figure: mini-ontology diagram. Labels shown: Place, Place Name, Longitude, Latitude, Elevation, State, Country, USGS Quad, Area, City/town, Lake, Reservoir, Mine, and a ⊎ symbol.]

Example:
Another mini-ontology generation
[Figure: the Place mini-ontology (Place, Place Name, Longitude, Latitude, Elevation, State, Country, USGS Quad, Area, City/town, Lake, Reservoir, Mine, ⊎) marked "Merge" with the growing ontology. Labels shown in the growing ontology: Geopolitical Entity, Name, Location, Latitude, Longitude, Population, GMT Time, Continent, Country, Agglomeration, City, "names", "has", "Latitude and longitude designates location".]

Example:
Concept Mapping to Ontology Merging
[Figure: merged ontology. Labels shown: Geopolitical Entity, "Geopolitical Entity with population", Name, Location, Latitude, Longitude, Population, GMT Time, Elevation, State, Place, Country, Continent, Agglomeration, City/town, Lake, USGS Quad, Area, Reservoir, Mine, ⊎, "names", "has", "Latitude and longitude designates location".]

Future direction
Start with multiple tables (or URLs) and generate mini-ontologies
Identify the most suitable mini-ontologies to merge by calculating which tables have the most overlap of concepts
Generate multiple domain ontologies
Integrate with form-based data extraction tools (smarter Web search engines)
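
One way to pick the next pair of mini-ontologies to merge is to score every pair by concept overlap. The sketch below uses Jaccard similarity over concept names, which is an assumption of this example rather than a measure named in the slides; `best_merge_pair` and the three sample concept sets are likewise illustrative.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def best_merge_pair(ontologies: Dict[str, Set[str]]) -> Tuple[str, str, float]:
    """Return the two mini-ontologies whose concept sets have the highest Jaccard overlap."""
    best = None
    for (name_a, a), (name_b, b) in combinations(ontologies.items(), 2):
        union = a | b
        score = len(a & b) / len(union) if union else 0.0
        if best is None or score > best[2]:
            best = (name_a, name_b, score)
    if best is None:
        raise ValueError("need at least two mini-ontologies")
    return best


mini_ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Continent", "Country"},
    "places": {"Place", "Longitude", "Latitude", "Elevation", "State", "Country"},
    "geopolitical": {"Geopolitical Entity", "Name", "Location", "Longitude",
                     "Latitude", "Country", "City"},
}
pair = best_merge_pair(mini_ontologies)
```

In practice the overlap would come from the concept matching phase (lexical, value and data-frame evidence) rather than from exact name equality as in this sketch.
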
