Slide 1 - National Yunlin University of Science and Technology
國立雲林科技大學
National Yunlin University of Science and Technology
Information Extraction from Wikipedia:
Moving Down the Long Tail
Presenter : Cheng-Feng Weng
Authors : Fei Wu, Raphael Hoffmann,
Daniel S. Weld
2008/11/18
KDD.9 (2008)
Intelligent Database Systems Lab
Outline
N.Y.U.S.T.
I. M.
Motivation
Objective
Methods and Experiments
Conclusion
Comments
Introduction
Kylin automatically constructs and completes infoboxes for Wikipedia articles.
Motivation
The number of article instances per infobox class has a long-tailed distribution.
Many articles simply do not have much information to be extracted.
Objective
This paper presents three novel techniques for increasing recall from Wikipedia’s long tail of sparse classes:
Shrinkage over an automatically learned subsumption taxonomy
A retraining technique for improving the training data
Supplementing results by extracting from the broader Web
Shrinkage
This paper uses shrinkage when training an extractor for an instance-sparse infobox class, aggregating data from its parent and child classes.
[Figure: taxonomy of infobox classes. Nodes: Person, Scientist, Performer, Actor, Comedian; example instance: ChungChian Hsu. Annotations: Person.birth_plc=taiwan, Performer.location=?]
Shrinkage using the KOG Ontology
The Kylin Ontology Generator (KOG) is an autonomous system that builds a rich ontology by combining Wikipedia infoboxes with WordNet using statistical-relational machine learning [27].
The overall shrinkage procedure is as follows:
Collect the related class set
Query KOG for the mapped attribute
Assign weights to the training examples
[Figure: taxonomy of infobox classes. Nodes: Person, Scientist, Performer, Actor, Comedian; example instance: ChungChian Hsu]
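The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENT` taxonomy, the `ATTRIBUTE_MAP` lookup, the `max_path_len` threshold and both function names are hypothetical stand-ins for what KOG would supply, and the slides do not say how Kylin implements any of it.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); in the slides this comes from KOG.
PARENT = {"Scientist": "Person", "Performer": "Person",
          "Actor": "Performer", "Comedian": "Performer"}

# Hypothetical mapping: (target class, attribute) -> {related class: mapped attribute}.
ATTRIBUTE_MAP = {("Performer", "location"): {"Person": "birth_plc", "Actor": "location"}}


def related_classes(target, max_path_len=1):
    """Step 1: collect classes within max_path_len edges of the target class."""
    neighbors = {}
    for child, parent in PARENT.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    distance, queue = {target: 0}, deque([target])
    while queue:
        cls = queue.popleft()
        if distance[cls] == max_path_len:
            continue  # do not expand past the path-length threshold
        for nxt in neighbors.get(cls, ()):
            if nxt not in distance:
                distance[nxt] = distance[cls] + 1
                queue.append(nxt)
    del distance[target]
    return set(distance)


def build_training_set(target, attribute, examples, weight_fn):
    """Steps 2 and 3: look up mapped attributes, then add weighted examples."""
    # The target class's own examples keep full weight.
    training = [(s, 1.0) for s in examples.get((target, attribute), [])]
    mapped = ATTRIBUTE_MAP.get((target, attribute), {})
    for cls in sorted(related_classes(target)):
        if cls in mapped:
            for s in examples.get((cls, mapped[cls]), []):
                training.append((s, weight_fn(cls)))
    return training
```

With `Performer` as the sparse class, `related_classes("Performer")` returns its parent `Person` and its children `Actor` and `Comedian`, and examples borrowed from `Person.birth_plc` are added with whatever weight `weight_fn` assigns.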
Shrinkage Experiments
Three strategies are considered for determining the weights:
Uniform: W = 1
Size adjusted: W = min{1, k/(|C|+1)}
Precision directed: W = p (extraction precision)
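As a rough illustration, the three weighting strategies can be written as small Python functions. The sketch assumes that |C| is the number of article instances in a class, that k is a tunable constant, and that precision is the fraction of extractions that are correct; the function and parameter names are hypothetical and do not come from the paper.

```python
def uniform_weight():
    """Uniform: every borrowed training example gets weight 1."""
    return 1.0


def size_adjusted_weight(class_size, k):
    """Size adjusted: W = min{1, k / (|C| + 1)}.

    class_size stands for |C| (assumed: number of articles in the class),
    k is an assumed tunable constant. Larger classes yield smaller weights.
    """
    return min(1.0, k / (class_size + 1))


def precision_directed_weight(num_correct, num_extracted):
    """Precision directed: W = p, estimated here as correct / extracted."""
    if num_extracted == 0:
        return 0.0  # assumption: no evidence means no weight
    return num_correct / num_extracted
```

For example, with k = 5 a class of 9 articles gets weight 5/10 = 0.5, while a class of 1 article is capped at 1.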
Shrinkage Experiments (cont.)
Retraining
A complementary idea is to harvest additional training data from the outside Web.
It uses TextRunner, which extracts relations from a crawl of about 100 million Web pages.
TextRunner’s crawl includes the top ten pages returned by Google.
Using TextRunner for Retraining
The retrainer uses this mapped set (C.a) from TextRunner to augment and clean the training data for C’s extractors in two ways:
Adding positive examples
Filtering negative examples
[Figure: retraining illustration; recoverable labels: "Position example", "Most common"]
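The two uses of the mapped set can be sketched as follows. This is a simplified guess at the mechanics, not the authors' procedure: it assumes the mapped set is a collection of lowercase (predicate, value) pairs, that sentences are plain strings, and that substring matching is good enough to decide whether a sentence expresses the attribute. The function name and data shapes are hypothetical.

```python
def retrain_examples(positives, negatives, unlabeled, mapped_tuples):
    """Augment positives and clean negatives using (predicate, value) pairs.

    mapped_tuples: assumed set of (predicate, value) pairs for attribute C.a,
    with predicates given in lowercase.
    """
    predicates = {pred for pred, _ in mapped_tuples}

    # Adding positive examples: an unlabeled sentence that contains both a
    # mapped predicate and its value is treated as a new positive.
    new_positives = list(positives)
    for sentence in unlabeled:
        text = sentence.lower()
        if any(pred in text and value.lower() in text
               for pred, value in mapped_tuples):
            new_positives.append(sentence)

    # Filtering negative examples: a negative that contains a mapped predicate
    # may really express the attribute, so it is dropped from the negatives.
    kept_negatives = [s for s in negatives
                      if not any(pred in s.lower() for pred in predicates)]
    return new_positives, kept_negatives
```

The design intent in this sketch is asymmetric: a sentence is promoted to positive only when both predicate and value match, but a negative is discarded on a predicate match alone, since a wrongly kept negative is assumed to hurt the extractor more than a discarded one.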
Retraining Experiments
Extracting From the Web
Extractors are trained on Wikipedia articles and applied to relevant Web pages.
Choosing search engine queries
Weighting extractions
Combining Wikipedia and Web extractions
Extracting From the Web (cont.)
Choosing search engine queries
Example: birthday of Andrew Murray
“Andrew Murray”
“Andrew Murray” birth date
Weighting extractions
A set of queries
Combining Wikipedia and Web extractions
score_web : s* r* c*
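The query example above (the quoted article title, then the quoted title followed by the attribute name) can be generated mechanically. The helper below is a hypothetical sketch of that pattern only; it does not cover how the extractions are weighted or how score_web is computed, since the slide does not spell those out.

```python
def build_queries(title, attribute_name):
    """Return search queries for one article title and one attribute.

    The first query is the quoted title alone; the second adds the
    attribute name after the quoted title.
    """
    quoted = f'"{title}"'
    return [quoted, f"{quoted} {attribute_name}"]


if __name__ == "__main__":
    for query in build_queries("Andrew Murray", "birth date"):
        print(query)
```

Quoting the title keeps the search engine from matching the two name tokens separately, which matters for common first or last names.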
Web Experiments
Combining Experiments
Conclusions
This paper describes three powerful methods for increasing recall with respect to the above long-tailed challenges: shrinkage, retraining, and supplementing Wikipedia extractions with those from the Web.
Comments
Advantage
It uses a good idea to overcome the long-tail problem.
Drawback
It only improves the performance of Kylin, the system the authors themselves developed.
Application
Constructing a knowledge network