Information Extraction and
Ontology Learning Guided by
Web Directory
Authors:
Martin Kavalec
Vojtěch Svátek
Presenter:
Mark Vickers
Outline
Introduction
– Mining Indicator Terms
– Integrating Rainbow
– Ontological Analysis of Web Directories
– IE and Ontology Learning
Future Work
Related Work
Assessment
Introduction
Goal:
“…to extract information about (mostly generic) products, services and
areas of competence of companies, from the free text chunks
embedded in web presentations.”
Taking advantage of:
– Collections of extraction patterns
– Ontologies of problem domains
Approach: Combine Information Extraction With
Ontologies
– Ontologies can improve quality of IE
– Extracted information can improve/extend ontologies
– Bootstrapping
Introduction
Uses Open Directory (http://dmoz.org)
– Obtain labeled training data
– Lightweight ontologies
“The Open Directory Project is the largest, most
comprehensive human-edited directory of the Web.”
Mining Indicator Terms
Informative terms = generic names of products
Indicator terms = situated near informative terms
– Example: ‘our assortment includes…’
‘in our shop you can buy…’
Assumption: Directory headings coincide with informative terms
Purpose: Generate extraction patterns based on Indicator
terms
They use deeper linguistic techniques
Mining Indicator Terms
Example:
…/Manufacturing/Materials/Metals/Steel/…
Informative terms
Match headings with text pages to find
sentences containing informative terms
Grab nearby words as indicator terms
Generate extraction patterns from
indicator terms
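A minimal sketch of the matching step described above, not the authors' implementation. It assumes single-word headings, a naive regex sentence splitter, and a fixed word window standing in for "nearby"; the function names and the sample page text are hypothetical.

```python
import re

def heading_terms(path):
    """Split a directory path into lowercase heading terms."""
    return [part.lower() for part in path.strip("/").split("/") if part]

def split_sentences(text):
    """Naive sentence splitter (assumption: ., ! or ? ends a sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def nearby_words(sentence, term, window=3):
    """Return the words within `window` positions of each occurrence of `term`."""
    words = re.findall(r"[a-z]+", sentence.lower())
    near = []
    for i, word in enumerate(words):
        if word == term:
            near.extend(words[max(0, i - window):i])
            near.extend(words[i + 1:i + 1 + window])
    return near

def candidate_indicators(path, page_text, window=3):
    """Map each heading term found on the page to the words seen near it."""
    found = {}
    for sentence in split_sentences(page_text):
        for term in heading_terms(path):
            near = nearby_words(sentence, term, window)
            if near:
                found.setdefault(term, []).extend(near)
    return found

# Hypothetical page text; the path segments come from the slide example.
page = "Our assortment includes steel and other alloys. Contact us today."
result = candidate_indicators("Manufacturing/Materials/Metals/Steel", page)
print(result)
```

Only the sentence containing "steel" contributes candidates; turning the collected words into extraction patterns is a separate step not sketched here.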
Mining Indicator Terms
Choosing Indicator Terms
– Syntactical analysis: Link Grammar Parser
– Choose verbs occurring closest in parse tree to
informative word
– Arrange verbs into a frequency table
– Order by ratio of frequency near informative
term to frequency in general
– Choose 8 most promising verbs
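The frequency-table ranking can be sketched as follows. This is an illustration only: it assumes the verbs have already been pulled from the parses (the Link Grammar Parser step is not reproduced), the counts are made up, and the `min_near` cut-off is an added assumption to keep very rare verbs from dominating the ratio.

```python
from collections import Counter

def rank_indicator_verbs(near_counts, general_counts, top_n=8, min_near=2):
    """Rank verbs by (frequency near informative terms) / (general frequency)."""
    scored = []
    for verb, near in near_counts.items():
        general = general_counts.get(verb, 0)
        # Skip verbs that are too rare to trust or missing from the general table.
        if near < min_near or general == 0:
            continue
        scored.append((near / general, verb))
    # Highest ratio first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [verb for _, verb in scored[:top_n]]

# Toy counts, invented for illustration.
near = Counter({"include": 40, "offer": 30, "be": 50, "buy": 5})
general = Counter({"include": 80, "offer": 100, "be": 5000, "buy": 8})
ranked = rank_indicator_verbs(near, general, top_n=3)
print(ranked)
```

A very common verb such as "be" occurs often near informative terms but even more often elsewhere, so its ratio is low and it drops out of the top of the list.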
Mining Indicator Terms
Preliminary Testing
– Sampled 14,500 sentences containing heading
terms
– Randomly chose 130 sentences with indicators
– Manually labeled to estimate if informative term
was present or not
Example:
“We are equipped to run any grade of corrugated from
E-flute to Triplewall, including all government
grades.”
Mining Indicator Terms
Preliminary Test Results
Coverage
– Non-filtered: 10 – 20 %
– Pre-filtered: 70 – 80 %
Integration into Rainbow
RAINBOW
(Reusable Architecture for INtelligent Brokering Of Web information access)
– Web Analysis Tasks:
Sentence Extraction
Explicit Metadata
HTML Structure*
Inline Image*
Link Topology Structure*
Page Similarity
– Internal Communication: based on SOAP
– Will use ontologies for verifying semantic consistency of web
services provided within the distributed system
Integration into Rainbow
Rainbow will help solve “coverage”
problem of directory links pointing to
‘barren’ pages
– Using Analysis of:
Keywords and HTML Structure on start-up pages
URLs of embedded links
– Metadata Extractor will be navigated towards
promising pages.
– Looking for ‘about-us’ or ‘profile’ to find more
syntactically correct text, for example.
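A rough sketch of how embedded links might be scored for "promising" pages. The hint list uses only the two cues named on the slide; the scoring rule, the function names, and the example URLs are assumptions, not part of Rainbow.

```python
from urllib.parse import urlparse

# Cues taken from the slide; any further hints would be an assumption.
PROMISING_HINTS = ("about-us", "profile")

def link_score(url, anchor_text=""):
    """Count how many hints appear in the URL path or in the anchor text."""
    path = urlparse(url).path.lower()
    text = anchor_text.lower().replace(" ", "-")
    return sum(1 for hint in PROMISING_HINTS if hint in path or hint in text)

def most_promising(links, limit=2):
    """Return up to `limit` URLs with a positive score, best first."""
    scored = [(link_score(url, text), url) for url, text in links]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: -item[0])  # stable: ties keep page order
    return [url for _, url in scored[:limit]]

# Hypothetical links found on a start-up page.
links = [
    ("http://example.com/products.html", "Products"),
    ("http://example.com/about-us.html", "About us"),
    ("http://example.com/company/profile", "Company profile"),
    ("http://example.com/contact", "Contact"),
]
picked = most_promising(links)
print(picked)
```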
Ontological Analysis of Web Directories
- Industries
  - Construction_and_Maintenance
    - Materials_and_supplies
      - Masonry_and_Stone
        - Natural_Stone
          - International_Sources
            - Mexico
Terms and phrases in a single heading belong to
a small set of classes
Parent-child relations belong to particular
classes corresponding to ‘deep’ ontological
relations.
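To make the two observations concrete, the sketch below enumerates the phrases inside each heading and the parent-child pairs along one path. It assumes the headings above form a single nested chain and that `_and_` separates the phrases of a heading; deciding which class each phrase or relation belongs to is the actual analysis and is not attempted here.

```python
def heading_phrases(heading):
    """Split a heading such as 'Masonry_and_Stone' into its phrases."""
    return [phrase.replace("_", " ") for phrase in heading.split("_and_")]

def parent_child_pairs(path):
    """List the (parent, child) heading pairs along a directory path."""
    headings = [h for h in path.split("/") if h]
    return list(zip(headings, headings[1:]))

# Path assembled from the headings on the slide (assumed to be one chain).
path = ("Industries/Construction_and_Maintenance/Materials_and_supplies/"
        "Masonry_and_Stone/Natural_Stone/International_Sources/Mexico")

pairs = parent_child_pairs(path)
for parent, child in pairs:
    print(heading_phrases(parent), "->", heading_phrases(child))
```

The last pair, International_Sources -> Mexico, is clearly a different kind of relation from Masonry_and_Stone -> Natural_Stone, which is the point of classifying parent-child relations.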
Ontological Analysis of Web Directories
[Figure: Meta-ontology of directory headings. Labels: Class, Class-subclass Relations, Named Relations, Reflexive Binary Relations]
Ontological Analysis of Web Directories
Interpretation Rules
IE and Ontology Learning
Extraction using plain indicator terms and
simple heuristics works
But Even Better:
– Learn indicators for each class
– Use ontology analysis to classify indicators
found
– Fill in database templates: true IE
IE and Ontology Learning
Closed Loop Strategy:
[Diagram: loop linking "Learn class-specific indicators", "Classify Headings", "Human Classifies Directory Headings", and "(WordNet)"]
Future Work
Complete the Information extraction & ontology
learning loop.
With respect to the Semantic Web, they want to
adapt the technique to the usual standards for
explicit metadata
– Example: The extracted information can be converted into
RDF triples, with indicator collections accessible over
the web
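As an illustration of the RDF idea, the sketch below serializes one extracted fact as an N-Triples line. The `http://example.org/ie#` namespace, the `offers` property, and the sample fact are all hypothetical; the slides do not say which vocabulary the authors intend to use.

```python
def to_ntriple(subject_iri, predicate_iri, literal):
    """Format one (subject, predicate, string literal) as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_iri}> <{predicate_iri}> "{escaped}" .'

# Hypothetical vocabulary namespace and extracted fact.
EX = "http://example.org/ie#"
extracted = [("http://example.com/", "offers", "steel")]

lines = [to_ntriple(site, EX + prop, value) for site, prop, value in extracted]
print("\n".join(lines))
```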
Related Work
Combining IE and Ontologies (without use of web
directories)
– Bootstrapping an Ontology-Based Information Extraction System
Advantages of using Link Grammar Parser
– Learning to Generate Semantic Annotation for Domain Specific
Sentences
Using Yahoo to classify whole documents
– Turning Yahoo into an Automatic Web-Page Classifier
Similar work aimed at more structured information
using search engines
– Extracting Patterns and Relations from the World Wide Web
Bootstrapping and other statistical methods for IE
– Text Classification by Bootstrapping with Keywords
– Learning Dictionaries for Information Extraction by Multi-Level
Bootstrapping
Assessment
I don’t think indicator term learning is done
(even though they say it is)
Relies on ontology learning techniques
that have not yet been decided
Need to develop an official directory