Semantifying Wikipedia - University of Washington


454 Project Ideas
Administrivia

Office Hours 11-noon, Fridays in 588


Project proposals due today




Or by email
Not binding (at least not yet)
To be elaborated
In-person project reviews next week.
HW 1 – due next Tues @ noon
Autonomously Semantifying Wikipedia
Fei Wu
Dept. Computer Science & Eng.
University of Washington
(Joint work with Dan Weld)
Motivation

Semantic Web [Berners-Lee 01] is great.


Web content machine readable
Software agents find, share and integrate information
Chicken-egg problem:
Semantic Data
Applications
Bootstrapping:
Automatically Semantifying Data
Idea: “Semantify” Wikipedia

Wikipedia [http://wikipedia.org]




Comprehensive (1.7 million English articles)
High-quality
Important: 6th most popular web-site & growing
Benefits:
User-tagged data (links, infobox, lists, categories, etc.)
Large, but not too large
Wikipedia Challenges
  Much natural-language text
  Missing data
  Inconsistency
  Low information redundancy
Kylin: Autonomously Semantifying Wikipedia [Wu & Weld CIKM-07]
  Totally autonomous, with no additional human effort
  Information extraction from both semi-structured and unstructured data
Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)
Outline
Semantics in Wikipedia
  Opportunities
  Challenges
Kylin System
  Infobox Generation
  Link Creation
Conclusion
Semantics in Wikipedia







Infobox
Link
List
Category
Redirection
Disambiguation
……
{{Infobox U.S. County|
county = Clearfield County|
state = Pennsylvania |
seal = |
map = Map of Pennsylvania highlighting Clearfield County.svg |
map size = 225|
founded = [[March 26]], [[1804]]|
seat = [[Clearfield, Pennsylvania|Clearfield]] |
area = 2,988 [[km²]] (1,154 [[square mile|mi²]]) |
area water = 17 km² (6 mi²) |
area percentage = 0.56% |
census yr = 2000|
pop = 83,382 |
density = 28|
|}}
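As a rough illustration of how infobox wikitext like the above can be turned into attribute-value pairs, here is a minimal Python sketch. The function name `parse_infobox` is hypothetical (it is not part of Kylin or MediaWiki), and the sketch assumes values contain `[[...]]` links but no nested `{{...}}` templates.

```python
def parse_infobox(wikitext):
    """Return (template_name, {attribute: value}) for one {{...}} template."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a {{...}} template")
    body = body[2:-2]

    # Split on '|' characters that are not inside a [[...]] link.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "]]"):
            depth += 1 if pair == "[[" else -1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    # First part is the template name; the rest are "key = value" fields.
    name = " ".join(parts[0].split())
    attributes = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[" ".join(key.split())] = " ".join(value.split())
    return name, attributes


# Shortened copy of the Clearfield County infobox, for illustration only.
SAMPLE = """{{Infobox U.S. County|
county = Clearfield County|
state = Pennsylvania |
seal = |
founded = [[March 26]], [[1804]]|
seat = [[Clearfield, Pennsylvania|Clearfield]] |
pop = 83,382 |
|}}"""

name, attributes = parse_infobox(SAMPLE)
print(name)
print(attributes["seat"])
```

Note that every value comes back as an untyped string, and that an empty field such as `seal` is kept with an empty value.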
Self-Supervised Learning of Infoboxes
Infobox Challenges

Incompleteness


US County: ~50% of articles have infoboxes
Inconsistency



Manual process -> contradictions between text & infobox
16% of US County articles had an error (revision)
Schema Drift



U.S. County (1428), US County (574), Counties (50), County (19)
Attribute drift & duplication
Rare attributes: only 29% used by 30% or more articles
Infobox Challenges (Continued)

Type-free System


Deliberate low-tech design
“King county” has the following attributes:



Land area = 2126 sq miles
Land area (km) = 5506 sq km
Irregular lists



Some separate information in items
Others use tables with different schemata
Others are hierarchical
List of cities & towns in US
Places in Florida
List of counties in Florida
Infobox Challenges (Continued)

Infoboxes are themselves hierarchical

Country leader: instead of a name, has a nested element listing the title as “king”, with the name at a lower level
Semantics in Wikipedia






Infobox
Link
List
Category
Redirection
Disambiguation
“Seattle, Washington”
Semantics in Wikipedia






Infobox
Link
List
Category
Redirection
Disambiguation
Opportunities
  Semantic source
  Training dataset
Challenges
  Missing data
  Inconsistency
Kylin: Autonomously Semantifying Wikipedia
Infobox Generation
[Figure: Kylin pipeline with components Preprocessor, Classifier and Extractor, producing an Infobox]
Preprocessor: Schema Refinement
Free edit -> schema drift
Duplicate templates:
U.S.County(1428), US County(574), Counties(50), County(19)

Duplicate attributes:
“Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”

Low usage of attribute
[Figure: bar chart of attribute usage in the U.S. County Infobox; y-axis from 0 to 1, one bar per attribute]
Kylin: strict name match; >15% occurrences
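The two refinement rules named on this slide (strict template-name match and a usage threshold) can be sketched as follows. This is only an illustration under assumptions: the function name `refine_schema`, the toy data, and the reading of ">15% occurrences" as "keep attributes filled in by more than 15% of matching articles" are mine, not taken from the Kylin code.

```python
from collections import Counter


def refine_schema(infoboxes, template_name, min_usage=0.15):
    """Return {attribute: usage fraction} for attributes used often enough.

    infoboxes: list of (template_name, {attribute: value}) pairs.
    """
    # Strict name match: near-duplicates such as "US County" are ignored.
    matching = [attrs for name, attrs in infoboxes if name == template_name]
    if not matching:
        return {}
    counts = Counter()
    for attrs in matching:
        # Count an attribute only when it actually has a value.
        counts.update(a for a, v in attrs.items() if v.strip())
    total = len(matching)
    return {a: c / total for a, c in counts.items() if c / total > min_usage}


# Toy data: ten "U.S. County" infoboxes plus one with a drifted template name.
infoboxes = [("U.S. County", {"county": "County %d" % i, "seal": ""})
             for i in range(10)]
infoboxes[0][1]["seal"] = "Seal A.png"
infoboxes[1][1]["seal"] = "Seal B.png"
infoboxes[2][1]["mayor"] = "Jane Doe"
infoboxes.append(("US County", {"county": "Other", "executive": "John Roe"}))

usage = refine_schema(infoboxes, "U.S. County")
print(usage)
```

Here `mayor` is dropped because only one of ten matching articles uses it, and `executive` is never seen because its template name does not match exactly.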
Preprocessor: Training Dataset Construction
Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
As of 2005, the population density was 28.2/km².
Steps:
1. Segment to sentences
2. Find unique match (heuristics)
Problems:
  Missing data
  Noise
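The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not Kylin's actual heuristics: the sentence splitter is a naive regular expression, "unique match" is taken to mean "the infobox value occurs verbatim in exactly one sentence", and the function names and the small infobox dictionary are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace
    # and an uppercase letter or digit.
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p.strip() for p in pieces if p.strip()]


def build_training_set(article_text, infobox):
    """Label each sentence with the attributes whose value it uniquely contains."""
    sentences = split_sentences(article_text)
    labels = [set() for _ in sentences]
    for attribute, value in infobox.items():
        if not value:
            continue
        hits = [i for i, s in enumerate(sentences) if value in s]
        if len(hits) == 1:  # keep only unambiguous matches
            labels[hits[0]].add(attribute)
    return [(s, sorted(l)) for s, l in zip(sentences, labels)]


ARTICLE = (
    "Clearfield County was created on 1804 from parts of Huntingdon and "
    "Lycoming Counties but was administered as part of Centre County until 1812. "
    "Its county seat is Clearfield. "
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water. "
    "As of 2005, the population density was 28.2/km²."
)
INFOBOX = {
    "founded": "1804",
    "seat": "Clearfield",
    "area water": "17 km²",
    "area percentage": "0.56%",
    "pop": "83,382",
}

examples = build_training_set(ARTICLE, INFOBOX)
for sentence, attrs in examples:
    print(attrs, "|", sentence)
```

The toy run shows both problems listed on the slide: `pop` has no matching sentence (missing data), and `seat` is left unlabeled because "Clearfield" also occurs in the first sentence (noise), which is where better matching heuristics would be needed.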
Classifier
Document Classifiers (1 per article type)
  List & Category
  Fast
  Precision (98.5%) – with no learning!
  Recall (68.8%)

Sentence Classifier (1 per article type x attribute)
  Trained on preprocessor output
  Features: bag of words, POS tags
  Maximum Entropy Classifier with Bagging: multi-class, multi-label, missing data
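To make the sentence classifier concrete, here is a small pure-Python sketch of a multi-label maximum entropy classifier, realized as one binary logistic-regression model per attribute trained by stochastic gradient ascent. It is a simplification under stated assumptions: only bag-of-words features are used (no POS tags), bagging is omitted, and the class and function names as well as the toy training sentences are hypothetical.

```python
import math
import re
from collections import Counter


def features(sentence):
    # Bag of words: lowercase alphanumeric tokens with their counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


class BinaryMaxEnt:
    """Binary maximum entropy (logistic regression) model for one attribute."""

    def __init__(self):
        self.weights = Counter()
        self.bias = 0.0

    def prob(self, feats):
        z = self.bias + sum(self.weights[f] * v for f, v in feats.items())
        z = max(-30.0, min(30.0, z))  # guard against overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, data, epochs=300, lr=0.1):
        for _ in range(epochs):
            for feats, y in data:
                err = y - self.prob(feats)  # gradient of the log-likelihood
                self.bias += lr * err
                for f, v in feats.items():
                    self.weights[f] += lr * err * v
        return self


def train(examples):
    """examples: list of (sentence, [attributes]); returns {attribute: model}."""
    attributes = sorted({a for _, labels in examples for a in labels})
    data = [(features(s), set(labels)) for s, labels in examples]
    return {a: BinaryMaxEnt().fit([(f, 1.0 if a in l else 0.0) for f, l in data])
            for a in attributes}


def predict(models, sentence, threshold=0.5):
    feats = features(sentence)
    return sorted(a for a, m in models.items() if m.prob(feats) > threshold)


# Toy training data in the shape of preprocessor output (made up here).
TRAIN = [
    ("Its county seat is Clearfield.", ["seat"]),
    ("The county seat is Abbeville.", ["seat"]),
    ("The county was incorporated on September 13, 1852.", ["founded"]),
    ("Clearfield County was created in 1804.", ["founded"]),
    ("Of the total area, 17 square kilometers is water and the rest is land.",
     ["water area", "land area"]),
    ("The river floods every spring.", []),
]

models = train(TRAIN)
print(predict(models, "Its county seat is Clearfield."))
print({a: round(m.prob(features("The county seat is Bellefonte.")), 3)
       for a, m in models.items()})
```

Training one independent binary model per attribute is what lets a single sentence carry several labels at once, as in the water/land example.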
Extractor

Input


A sentence predicted to contain an attribute:
“After considerable debate, the county was incorporated on September 13, 1852”
Output

<founding date, September 13, 1852>
Landscape of Extraction Techniques
Running example: Abraham Lincoln was born in Kentucky.
Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
Classify Pre-segmented Candidates: classifier decides which class?
Sliding Window: classifier decides which class? Try alternate window sizes.
Boundary Models: classifiers mark BEGIN and END.
Finite State Machines: most likely state sequence?
Context Free Grammars: parse tree over NNP, V, P, NP, PP, VP, S.
…and beyond
Any of these models can be used to capture words, formatting or both.
Slides from Cohen & McCallum
Extraction by Sliding Window
E.g. looking for seminar location
CMU UseNet Seminar Announcement:
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Slides from Cohen & McCallum
A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
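The bullets above can be sketched as a small scoring model. This is a simplified illustration rather than Freitag's implementation: it scores a window by the (log) product of a length probability and smoothed per-slot word probabilities, tries all windows up to a maximum length, and extracts the best one if it clears a threshold. The class name, the smoothing constant, the threshold value, and the two made-up training announcements are all assumptions.

```python
import math
from collections import Counter


class NaiveBayesWindow:
    def __init__(self, context=2, max_len=6, alpha=0.01):
        self.context, self.max_len, self.alpha = context, max_len, alpha
        self.counts = {"prefix": Counter(), "contents": Counter(), "suffix": Counter()}
        self.lengths = Counter()
        self.vocab = set()
        self.n = 0

    def _parts(self, tokens, start, end):
        return {"prefix": tokens[max(0, start - self.context):start],
                "contents": tokens[start:end],
                "suffix": tokens[end:end + self.context]}

    def fit(self, examples):
        """examples: list of (tokens, (start, end)) with the field at tokens[start:end]."""
        for tokens, (start, end) in examples:
            self.vocab.update(tokens)
            self.lengths[end - start] += 1
            self.n += 1
            for slot, words in self._parts(tokens, start, end).items():
                self.counts[slot].update(words)
        return self

    def _log_word(self, slot, word):
        # Smoothed estimate of Pr(word in slot | LOCATION).
        total = sum(self.counts[slot].values())
        size = len(self.vocab) + 1  # +1 for unseen words
        return math.log((self.counts[slot][word] + self.alpha)
                        / (total + self.alpha * size))

    def score(self, tokens, start, end):
        # Independence assumption: length, prefix, contents and suffix multiply.
        s = math.log((self.lengths[end - start] + self.alpha)
                     / (self.n + self.alpha * self.max_len))
        for slot, words in self._parts(tokens, start, end).items():
            s += sum(self._log_word(slot, w) for w in words)
        return s

    def best_window(self, tokens):
        # Try all "reasonable" windows: every position, lengths 1..max_len.
        return max((self.score(tokens, i, j), i, j)
                   for i in range(len(tokens))
                   for j in range(i + 1, min(len(tokens), i + self.max_len) + 1))


def extract(model, tokens, threshold=-12.0):
    if not tokens:
        return None
    score, start, end = model.best_window(tokens)
    return " ".join(tokens[start:end]) if score > threshold else None


# Two made-up training announcements with the location span marked.
TRAIN = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Time : 2:00 pm Place : Doherty Hall Rm 1112 Speaker : Tom Mitchell".split(), (6, 10)),
]
model = NaiveBayesWindow().fit(TRAIN)

location = extract(model, "Place : Wean Hall Rm 1112 Speaker : Jaime Carbonell".split())
nothing = extract(model, "The river floods every spring".split())
print(location, nothing)
```

Because the score is an unnormalized log-likelihood, windows of different lengths are compared only through the length term; a fuller model would also compare against a background (non-field) distribution.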
Slides from Cohen & McCallum
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%
Slides from Cohen & McCallum
State of the Art Performance

Named entity recognition
  Person, Location, Organization, …
  F1 in high 80’s or low- to mid-90’s
Binary relation extraction
  Contained-in (Location1, Location2)
  Member-of (Person1, Organization1)
  F1 in 60’s or 70’s or 80’s
Wrapper induction
  Extremely accurate performance obtainable
  Human effort (~30min) required on each site
Slides from Cohen & McCallum
CRF Extractor

Conditional Random Fields Model [Lafferty 01]
Attribute value extraction: sequential data labeling
  CRF model for each attribute independently

Relabel – filter false negative training examples
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Preprocessor: Water_area
Classifier: Water_area; Land_area

Pipeline – prune irrelevant sentences
Precision +, Recall -
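The relabel and pipeline ideas can be sketched without the CRF itself. The following is an illustration under assumptions, not Kylin's code: "relabel" is read as "when building an attribute's CRF training set, drop sentences that the preprocessor left unlabeled but the sentence classifier flags for that attribute", and "pipeline" as "at extraction time, give the attribute's CRF only the sentences the classifier flags". Both function names are hypothetical.

```python
def crf_training_sets(sentences, preprocessor_labels, classifier_labels, attributes):
    """Build per-attribute positive/negative sentence sets for CRF training."""
    sets = {}
    for attribute in attributes:
        positive, negative = [], []
        for sentence, pre, clf in zip(sentences, preprocessor_labels, classifier_labels):
            if attribute in pre:
                positive.append(sentence)
            elif attribute in clf:
                # Suspected false negative (the infobox lacked the value):
                # filter it out instead of training on it as a negative.
                continue
            else:
                negative.append(sentence)
        sets[attribute] = {"positive": positive, "negative": negative}
    return sets


def pipeline(sentences, classifier_labels, attribute):
    """Keep only sentences the classifier predicts to contain the attribute."""
    return [s for s, clf in zip(sentences, classifier_labels) if attribute in clf]


sentences = [
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.",
    "Its county seat is Clearfield.",
]
preprocessor_labels = [{"Water_area"}, {"Seat"}]
classifier_labels = [{"Water_area", "Land_area"}, {"Seat"}]

training = crf_training_sets(sentences, preprocessor_labels, classifier_labels,
                             ["Land_area", "Water_area"])
candidates = pipeline(sentences, classifier_labels, "Land_area")
print(training["Land_area"])
print(candidates)
```

In the slide's example the first sentence really does state the land area, so keeping it out of the `Land_area` negatives avoids teaching the extractor that such a sentence contains no value. Pruning with the classifier trades recall for precision, since any sentence the classifier misses never reaches the extractor.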
Infobox Generation Experiments

Dataset
2007.02.06 Wikipedia Dump Data
  4 popular classes: U.S. County (1245), Actor (3819), Airline (791), University (4025)
  50 random test articles per class
Kylin performance
Kylin performance (detailed view)

U.S.County (better than manual labeling)


Strict expression
Number-typed
Abbeville County is a county located in the U.S. state of South Carolina.
The county has a total area of 2,988 square kilometers (1,154 mi²).
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Kylin performance (detailed view)

University (worse than manual labeling)

Flexible expression:
The College began first in 1855 as a one room schoolhouse.
UCL was founded in 1826 under the name “University of London”.
The college opened in 1973 with the Charlestown campus.

Global context:
Former U.S. President Dwight D. Eisenhower served as President of the University.

Implicit:
E.g., the student counts at 3 campuses sum to the total student number
Effect of Relabel, Pipeline
Default Project

Reimplement Kylin (or build on Fei’s code)


Improve it
See how much information we can extract


Post on web: DBpedia
Merge back into Wikipedia?



Bot issues
Associate javascript
Extraction from the Greater WWW


Self-verify accuracy by external extraction
Add infobox facts which are missing from articles
Extensions

Semi-automated bot interface


Firefox plugin
Displays improved infobox – user checks & says ok


For general Wikipedia authors




Safer than a bot
Extraction in real-time & error checking
Attribute values
Guide towards best schema & attribute
Typing & microformats
Extensions

Other Wikipedia issues







Auto-generate disambiguation pages
Extract events & create a timeline view
Citation assistance


Learn author reputation
Watch for changes
Look for framing or biased language
Recognize vandalism
Identify correspondence between text and citation
Semiautomatic article generation
Extensions

Where could this be applied besides Wikipedia?

Broader Questions



Internet enables generation of structured content
How to integrate methods?
Overwrite, training data, ???