Wok: A Web of Knowledge - Brigham Young University

Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages
Cui Tao
PhD Dissertation Defense
1
Motivation
• Birth date of my great-grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino-acid information for gene cdk-4
• US states with property crime rates above 1%
2
Search by Search Engine
3
Search the Hidden Web
• The Hidden Web:
– Hidden behind forms
– Hard to query
“cdk-4”
4
Query for Data
• The Hidden Web:
– Hidden behind forms
– Hard to query
Find the protein and amino-acid information for gene “cdk-4”
5
A Web of Pages → A Web of Knowledge
• Web of Knowledge
– Machine-“understandable”
– Publicly accessible
– Queryable by standard query languages
• Semantic annotation
– Domain ontologies
– Populated conceptual model
• Problems to resolve
– How do we create ontologies?
– How do we annotate pages for ontologies?
6
Contributions of Dissertation Work
• Web of Pages → Web of Knowledge
– Knowledge & meta-knowledge extraction
– Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
– Sibling tables (TISP/TISP++)
– User-created forms (FOCIH)
7
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
8
Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
9
Find Label/Value Associations
Example:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
1
2
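A minimal sketch of how such an association could be stored, assuming each value is indexed by a pair of dot-separated label paths. The names `Associations`, `add_association`, and `path_steps` are hypothetical and are not taken from TISP itself.

```python
from typing import Dict, List, Tuple

# Key: (first label path, second label path); value: the cell content.
# This representation is an assumption made for illustration.
Associations = Dict[Tuple[str, str], str]


def add_association(store: Associations, path_a: str, path_b: str, value: str) -> None:
    """Record one label/value association under a pair of label paths."""
    store[(path_a, path_b)] = value


def path_steps(path: str) -> List[str]:
    """Split a dot-separated label path into its steps, outermost first."""
    return path.split(".")


store: Associations = {}
add_association(
    store,
    "Identification.Gene model(s).Protein",
    "Identification.Gene model(s).2",
    "WP:CE28918",
)
```

Splitting a path on the dots recovers the nesting of labels, from the outermost table label down to the innermost one.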
10
Interpretation Technique:
Sibling Page Comparison
11
Interpretation Technique:
Sibling Page Comparison
Same
12
Interpretation Technique:
Sibling Page Comparison
Almost Same
13
Interpretation Technique:
Sibling Page Comparison
Different
Same
14
Technique Details
• Unnest tables
• Match tables in sibling pages
– “Perfect” match (table for layout → discard)
– “Reasonable” match (sibling table)
• Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
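The matching step can be sketched roughly as follows, assuming each table has already been unnested and reduced to rows of cell text. The overlap measure, the 0.3 threshold, and the function names are illustrative assumptions, not the measure TISP actually uses; the sample cell values are made up.

```python
from typing import List, Tuple

Table = List[List[str]]  # rows of cell text


def cell_overlap(a: Table, b: Table) -> float:
    """Fraction of cell positions whose text is identical in both tables."""
    cells_a = [cell for row in a for cell in row]
    cells_b = [cell for row in b for cell in row]
    if not cells_a or not cells_b:
        return 0.0
    same = sum(1 for x, y in zip(cells_a, cells_b) if x == y)
    return same / max(len(cells_a), len(cells_b))


def classify_pair(a: Table, b: Table, low: float = 0.3) -> str:
    """'layout' for a perfect match, 'sibling' for a reasonable one."""
    overlap = cell_overlap(a, b)
    if overlap == 1.0:
        return "layout"      # identical on both pages: discard
    if overlap >= low:       # threshold is an assumption
        return "sibling"
    return "unmatched"


def split_labels_values(a: Table, b: Table) -> Tuple[List[str], List[str]]:
    """Cells that stay the same are label candidates; cells that differ are values."""
    labels, values = [], []
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            (labels if x == y else values).append(x)
    return labels, values


# Made-up sibling tables for illustration only.
page1 = [["Gene", "cdk-4"], ["Length", "342 aa"]]
page2 = [["Gene", "lin-12"], ["Length", "1429 aa"]]
nav = [["Home", "Search"]]
```

Here `classify_pair(page1, page2)` gives `"sibling"` because half of the cells agree, while `classify_pair(nav, nav)` gives `"layout"`.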
15
Table Unnesting
16
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
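As a rough illustration of the two patterns, reading {L} as a label position and {V} as a value position: the first pattern is a run of rows that each hold a label followed by its value, and the second is one row of labels followed by one or more rows of values. The sketch below assumes a table is already a list of rows of cell text; the function names and sample values are hypothetical.

```python
from typing import Dict, List

Table = List[List[str]]  # rows of cell text


def interpret_label_value_rows(table: Table) -> Dict[str, str]:
    """First pattern: every row is a label cell followed by a value cell."""
    return {row[0]: row[1] for row in table if len(row) >= 2}


def interpret_header_then_values(table: Table) -> List[Dict[str, str]]:
    """Second pattern: a row of labels, then one or more rows of values."""
    if not table:
        return []
    header, *data_rows = table
    return [dict(zip(header, row)) for row in data_rows]


# Made-up sample tables for illustration only.
vertical = [["Make", "Nissan"], ["Year", "1994"]]
horizontal = [["Make", "Year"], ["Nissan", "1994"], ["Honda", "1998"]]
```

`interpret_label_value_rows(vertical)` yields one record, while `interpret_header_then_values(horizontal)` yields one record per value row.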
17
Table Structure Patterns
<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
18
Pattern Usage
19
Dynamic Pattern Adjustment
20
TISP++
• Automatic ontology generation
• Automatic information annotation
21
Ontology Generation – OSM
• Object set: table labels
– Lexical: labels that associate with actual values
– Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
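The mapping from interpreted tables to object and relationship sets might look roughly like this, assuming an interpreted table is a nested dictionary in which a label maps either to a value (a string) or to another table (a dictionary). Treating the root name as a non-lexical object set is an assumption, as are the function name and the sample data; constraints are not modeled.

```python
from typing import Dict, Set, Tuple, Union

Interpreted = Dict[str, Union[str, "Interpreted"]]


def derive_object_sets(
    name: str, table: Interpreted
) -> Tuple[Set[str], Set[str], Set[Tuple[str, str]]]:
    """Return (lexical, non_lexical, relationships) for a nested table."""
    lexical: Set[str] = set()
    non_lexical: Set[str] = {name}
    relationships: Set[Tuple[str, str]] = set()
    for label, content in table.items():
        relationships.add((name, label))  # nesting becomes a relationship
        if isinstance(content, dict):
            # Label associated with another table: non-lexical object set.
            sub_lex, sub_non, sub_rel = derive_object_sets(label, content)
            lexical |= sub_lex
            non_lexical |= sub_non
            relationships |= sub_rel
        else:
            # Label associated with an actual value: lexical object set.
            lexical.add(label)
    return lexical, non_lexical, relationships


# Made-up interpreted table for illustration only.
sample = {
    "Identification": {
        "Name": "cdk-4",
        "Gene model(s)": {"Protein": "WP:CE28918"},
    }
}
lexical, non_lexical, relationships = derive_object_sets("Gene", sample)
```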
22
Ontology Generation – OWL
• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
– OWL data type property
– Different annotation properties to keep track of the provenance
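One way to picture this mapping is a small generator that writes Turtle text: a class for each non-lexical object set, an object property for each relationship between non-lexical sets, and a datatype property for each lexical object set. The `http://example.org/wok#` namespace, the property naming scheme, and the function names are assumptions made for illustration; the provenance annotation properties are left out.

```python
import re
from typing import Iterable, Set, Tuple


def local_name(label: str) -> str:
    """Turn a table label into a safe identifier, e.g. 'Gene model(s)' -> 'Gene_model_s'."""
    return re.sub(r"\W+", "_", label).strip("_")


def owl_turtle(
    non_lexical: Set[str], lexical: Set[str], relationships: Iterable[Tuple[str, str]]
) -> str:
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix : <http://example.org/wok#> .",  # hypothetical namespace
        "",
    ]
    for label in sorted(non_lexical):
        lines.append(f":{local_name(label)} a owl:Class .")
    for parent, child in sorted(relationships):
        p, c = local_name(parent), local_name(child)
        if child in lexical:
            # Lexical object set: datatype property on the parent class.
            lines.append(f":has_{c} a owl:DatatypeProperty ; rdfs:domain :{p} .")
        else:
            # Relationship between two classes: object property.
            lines.append(
                f":{p}_has_{c} a owl:ObjectProperty ; rdfs:domain :{p} ; rdfs:range :{c} ."
            )
    return "\n".join(lines)


# Made-up object and relationship sets for illustration only.
turtle = owl_turtle(
    non_lexical={"Identification", "Gene model(s)"},
    lexical={"Protein"},
    relationships={("Identification", "Gene model(s)"), ("Gene model(s)", "Protein")},
)
```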
23
Generated Ontology
Generated Ontology
RDF Graph
26
Query the Data
Find the protein and amino-acid information for gene “cdk-4”
27
TISP Evaluation
• Applications
– Commercial: car ads
– Scientific: molecular biology
– Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
– Initial two sibling pages
   • Correct separation of data tables from layout tables?
   • Correct pattern recognition?
– Remaining tables in site
   • Information properly extracted?
   • Able to detect and adjust for pattern variations?
28
Experimental Results
• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments → all correct
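Expressed as percentages, the counts above work out as follows. The helper name `percent` is only illustrative.

```python
def percent(correct: int, total: int) -> float:
    """Share of correct cases, as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)


layout_rate = percent(157, 158)   # layout tables correctly discarded
pattern_rate = percent(69, 72)    # structure patterns correctly found

print(f"Table recognition:   {layout_rate}%")   # 99.4%
print(f"Pattern recognition: {pattern_rate}%")  # 95.8%
```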
29
TISP++ Performance
• Performance depends on TISP
• TISP test set
– Generates all ontologies correctly
– Annotates all information in tables correctly
30
Form-based Ontology Creation and Information Harvesting (FOCIH)
• Personalized ontology creation by form
– General familiarity
– Reasonable conceptual framework
– Appropriate correspondence
• Transformable to ontological descriptions
• Capable of accepting source data
• Automated ontology creation
• Automated information harvesting
31
Form Creation
32
Created Sample Form
33
Generated Ontology View
34
Source-to-Form Mapping
35
Source-to-Form Mapping
36
Source-to-Form Mapping
37
Source-to-Form Mapping
38
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
– Pattern recognition
– Instance recognition
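A reading path through the DOM tree can be pictured as an XPath-like string that leads from the document root to the node holding a value. The sketch below uses Python's standard `html.parser` and assumes markup in which every element has an explicit closing tag; the `tag[index]` path notation, the class name, and the sample page are assumptions, not FOCIH's actual format.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class PathFinder(HTMLParser):
    """Find the DOM path of the first text node equal to target_text."""

    def __init__(self, target_text: str) -> None:
        super().__init__()
        self.target = target_text
        self.stack: List[str] = []                # open elements as "tag[index]"
        self.counts: List[Dict[str, int]] = [{}]  # sibling counters per depth
        self.path: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if self.path is None and data.strip() == self.target:
            self.path = "/" + "/".join(self.stack)


# Made-up page fragment for illustration only.
html = (
    "<table><tr><td>Price</td><td>$4,500</td></tr>"
    "<tr><td>Mileage</td><td>120,000</td></tr></table>"
)
finder = PathFinder("120,000")
finder.feed(html)
```

After parsing, `finder.path` holds the path to the second cell of the second row, which can then be replayed on sibling pages to reach the corresponding value.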
39
Reading Path
40
Pattern & Instance Recognition
41
Pattern & Instance Recognition
42
Pattern & Instance Recognition
regular expression for decimal number
left context
right context
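A substring value of this kind can be sketched as a regular expression for the value, anchored by literal left and right context. The function name, the decimal-number expression, and the sample sentence are assumptions made for illustration.

```python
import re
from typing import Optional

# A simple expression for a decimal number such as 3 or 3.25 (an assumption).
DECIMAL = r"\d+(?:\.\d+)?"


def extract_with_context(text: str, left: str, value_pattern: str, right: str) -> Optional[str]:
    """Return the first value matching value_pattern between left and right context."""
    pattern = re.escape(left) + "(" + value_pattern + ")" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Made-up sample text for illustration only.
sample = "Property crime rate: 3.25 per 100 residents"
rate = extract_with_context(sample, "rate: ", DECIMAL, " per")
```

Escaping the context strings keeps punctuation in the surrounding text from being read as regular-expression syntax.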
43
Pattern & Instance Recognition
list pattern, delimiter is “,”
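For a plain delimiter, a list pattern reduces to splitting the selected string and trimming each item. A minimal sketch, with a hypothetical function name and made-up sample text:

```python
from typing import List


def split_list(text: str, delimiter: str = ",") -> List[str]:
    """Split a list-valued string on a literal delimiter and drop empty items."""
    return [item.strip() for item in text.split(delimiter) if item.strip()]


# Made-up sample value for illustration only.
languages = split_list("English, Spanish, French")
```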
44
Pattern & Instance Recognition
list pattern,
delimiter is regular expression
for percentage numbers and a comma
45
Pattern & Instance Recognition
list pattern,
delimiter is regular expression
for percentage numbers and a comma
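When the delimiter is itself a pattern, the same idea applies with a regular-expression split. The sketch below assumes the list items are names separated by a percentage and an optional comma; the expression, the function name, and the sample text are illustrative assumptions.

```python
import re
from typing import List

# Delimiter: a percentage such as 60% or 30.5%, optionally followed by a comma.
PERCENT_DELIMITER = r"\s*\d+(?:\.\d+)?%\s*,?\s*"


def split_on_percentages(text: str) -> List[str]:
    """Split a string on percentage-and-comma delimiters, keeping only the items."""
    return [item for item in re.split(PERCENT_DELIMITER, text) if item]


# Made-up sample value for illustration only.
groups = split_on_percentages("Alpha 60%, Beta 30.5%, Gamma 9.5%")
```

The trailing percentage leaves an empty final piece after the split, which the filter removes.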
46
Can Now Harvest
47
Can Now Harvest
48
Can Now Harvest
49
Semantic Annotation
50
Semantic Annotation
51
Semantic Annotation
52
Semantic Annotation
53
Semantic Annotation
54
Semantic Query
55
FOCIH Performance
• Ontology creation
• Semantic annotation
– Depends on TISP performance
– Depends on pattern and instance recognition performance
56
FOCIH Performance
• Pattern and instance recognition:
– Works with highly regular data
– Tested 71 mappings
– 25 full-string values (25/25 correct)
– 38 substring values (29/38 correct)
– 8 list patterns (6/8 correct)
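Summing the three categories gives the overall figure for the 71 tested mappings. The variable names are only illustrative.

```python
# (correct, tested) per mapping category, taken from the counts above.
results = {
    "full-string": (25, 25),
    "substring": (29, 38),
    "list": (6, 8),
}

total_correct = sum(correct for correct, _ in results.values())
total_tested = sum(tested for _, tested in results.values())
overall = round(100 * total_correct / total_tested, 1)

for name, (correct, tested) in results.items():
    print(f"{name}: {correct}/{tested} = {round(100 * correct / tested, 1)}%")
print(f"overall: {total_correct}/{total_tested} = {overall}%")  # 60/71 = 84.5%
```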
57
FOCIH Difficulties
58
FOCIH Difficulties
59
FOCIH Difficulties
No selection
60
WoK via TISP
61
WoK via TISP
62
WoK via FOCIH
63
WoK via FOCIH
64
Contributions
• TISP: automatic sibling table interpretation
• TISP++:
– Automatic ontology generation based on interpreted tables
– Automatic semantic annotation for interpreted tables
• FOCIH:
– Semi-automatic personalized ontology creation
– Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge
65
Future Work
• Sibling pages in addition to sibling tables
• Reverse-engineer forms from ontologies as a basis for harvesting information for already-defined ontologies
66