Course Overview


CSE 5539: Web Information Extraction
Instructor: Alan Ritter
Motivation
• Data Analytics / Big Data
– Companies have lots of data lying around
– Computing cycles are cheap
– Using data to get insights:
• Business, Healthcare, Science, Government, Politics
• Challenge: Most of the world’s data is Unstructured
– Text
– Speech
– Images
[Figure labels: Structured Data; Bigger Unstructured Data]
Extracting Knowledge from Text
[Figure: The Web, News, Text -> Extractors -> Structured Data]
Example: Information Extraction from
Twitter
“Yess! Yess! Its official Nintendo announced today that
they Will release the Nintendo 3DS in north America
march 27 for $250”
Information Extraction
“Yess! Yess! Its official Nintendo announced today that
they Will release the Nintendo 3DS in north America
march 27 for $250”
PRODUCT RELEASE
COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America
Information Extraction
Samsung Galaxy S5 Coming to All Major U.S. Carriers
Beginning April 11th
PRODUCT RELEASE
COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America
Information Extraction
News
PRODUCT RELEASE
COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America
…         …          …         …      …
Example Applications
• Question Answering / Structured Queries
– Which companies are releasing new smartphones in Europe this Spring?
– Alert me anytime a new smartphone is announced
in the U.S.
• Data Mining
– Analyze trends in product releases across different
industries
– Is there a correlation between price and date of
release?
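Once releases are stored as records, queries like the ones above reduce to simple filters. The following is a minimal sketch, assuming records shaped like the PRODUCT RELEASE table from the earlier slides; the `ProductRelease` class and the `releases_in` helper are illustrative names, not part of any course code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProductRelease:
    company: str
    product: str
    date: str
    price: Optional[str]  # None when the text gave no price ("?" on the slide)
    region: str


# The two rows shown in the PRODUCT RELEASE table.
RECORDS = [
    ProductRelease("Samsung", "Galaxy S5", "April 11", None, "U.S."),
    ProductRelease("Nintendo", "3DS", "March 27", "$250", "North America"),
]


def releases_in(records: List[ProductRelease], region: str) -> List[ProductRelease]:
    """Structured query: all releases announced for the given region."""
    return [r for r in records if r.region == region]


us_products = [r.product for r in releases_in(RECORDS, "U.S.")]
print(us_products)
```

An alert ("tell me when a new product is announced in the U.S.") could run the same filter on each newly extracted record.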
Knowledge Graphs
Things not strings!
[Figure: knowledge graph with nodes Alan Ritter, CSE 5539, Ohio State Univ., and Columbus, OH; one edge is labeled "Course offered at"]
Data Sources
Available Data Sources
All of these databases are sparsely populated and out of date.
We need to extract this type of knowledge from text!
The Long Term Goal
…
Traditional Information Extraction
[Cowie and Wilks]
Example Text from MUC-4 (1992)
[Cowie and Wilks]
Example Output from MUC-4 (1992)
…
Approaches
• Initially: Rule Based
– Basically just write a bunch of regular expressions
• Machine Learning (Freitag 1998), (Soderland 1999), (Mooney 1999)
– Annotate training / dev / test documents
– Train machine learning models
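To make the rule-based approach concrete, here is a minimal sketch that fills three slots of the PRODUCT RELEASE schema with regular expressions. The patterns and the `extract` helper are illustrative assumptions written for the Nintendo tweet from the earlier slides, not rules from any real system.

```python
import re

TWEET = ("Yess! Yess! Its official Nintendo announced today that they Will release "
         "the Nintendo 3DS in north America march 27 for $250")

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# Hand-written rules, one per slot (illustrative only).
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE)
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")
REGION_RE = re.compile(r"\b(?:north america|europe)\b", re.IGNORECASE)


def extract(text):
    """Return the first match for each slot, or None if a rule does not fire."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(0) if match else None

    return {
        "DATE": first(DATE_RE),
        "PRICE": first(PRICE_RE),
        "REGION": first(REGION_RE),
    }


record = extract(TWEET)
print(record)
```

Rules like these are easy to write but brittle: every new date format, currency, or region name needs another pattern, which is part of the motivation for the learned models.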
[Slide from William Cohen]
Extraction by Sliding Window
E.g. looking for the seminar location
CMU UseNet Seminar Announcement:
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
[Slide from William Cohen]
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
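The steps above can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not Freitag's implementation: the three training announcements and the test sentence are made up, the prefix/suffix size `M` and maximum length `MAX_LEN` are arbitrary, add-one smoothing is used, and each window is scored by its log-odds against a background unigram model as a stand-in for Pr(LOCATION|window).

```python
import math
from collections import Counter

M = 2        # prefix/suffix size (assumption)
MAX_LEN = 4  # longest window tried (assumption)


def parts(tokens, start, end):
    """Split the text around the window tokens[start:end]."""
    return {
        "prefix": tokens[max(0, start - M):start],
        "contents": tokens[start:end],
        "suffix": tokens[end:end + M],
    }


def train(examples):
    """examples: (tokens, start, end) with the LOCATION at tokens[start:end]."""
    model = {name: Counter() for name in
             ("prefix", "contents", "suffix", "length", "background")}
    for tokens, start, end in examples:
        for name, words in parts(tokens, start, end).items():
            model[name].update(words)
        model["length"][end - start] += 1
        model["background"].update(tokens)
    return model


def smoothed(counter, key, n_outcomes):
    """Add-one smoothed relative frequency."""
    return (counter[key] + 1) / (sum(counter.values()) + n_outcomes)


def score(model, tokens, start, end):
    """Log-odds of the window under the field model vs. the background model,
    treating length, prefix, contents, and suffix words as independent."""
    vocab = len(model["background"]) + 1  # +1 reserves mass for unseen words
    total = math.log(smoothed(model["length"], end - start, MAX_LEN))
    for name, words in parts(tokens, start, end).items():
        for w in words:
            total += math.log(smoothed(model[name], w, vocab)
                              / smoothed(model["background"], w, vocab))
    return total


def extract(model, tokens, threshold=0.0):
    """Try all windows up to MAX_LEN; return the best one if above threshold."""
    windows = [(s, e) for s in range(len(tokens))
               for e in range(s + 1, min(s + MAX_LEN, len(tokens)) + 1)]
    start, end = max(windows, key=lambda w: score(model, tokens, *w))
    return tokens[start:end] if score(model, tokens, start, end) > threshold else None


# Hypothetical training data: (text, start, end) of the location.
TRAIN = [
    ("Place : Wean Hall 5409 Speaker : Thrun".split(), 2, 5),
    ("Place : Baker Hall 100 Speaker : Smith".split(), 2, 5),
    ("Time : 3 pm Place : Wean Hall 7500 Host : Lee".split(), 6, 9),
]
model = train(TRAIN)
test_tokens = "Time : 4 pm Place : Doherty Hall 2315 Speaker : Jones".split()
location = extract(model, test_tokens)
print(location)
```

Even with unseen content words, the "Place :" prefix and "Speaker :" suffix carry enough evidence to pick out the location window in this toy case.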
[Slide from William Cohen]
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field         F1
Person Name:  30%
Location:     61%
Start Time:   98%
[Slide from William Cohen]
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi):
$\arg\max_{s} P(s, o)$
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Pedro Domingos
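The decoding step can be sketched as follows. This is a minimal Viterbi implementation in log space; the start, transition, and emission probabilities are hand-set for illustration (a trained HMM would estimate them from annotated data), and `UNSEEN` is a crude floor for unknown words rather than a proper smoothing scheme.

```python
import math

STATES = ["background", "person name", "location name"]

# Hand-set, illustrative parameters (assumptions, not learned values).
START = {"background": 0.8, "person name": 0.1, "location name": 0.1}
TRANS = {
    "background":    {"background": 0.8,  "person name": 0.1,  "location name": 0.1},
    "person name":   {"background": 0.45, "person name": 0.5,  "location name": 0.05},
    "location name": {"background": 0.45, "person name": 0.05, "location name": 0.5},
}
EMIT = {
    "background": {w: 0.15 for w in
                   ["Yesterday", "spoke", "this", "example", "sentence", "."]},
    "person name": {"Pedro": 0.4, "Domingos": 0.4},
    "location name": {"Seattle": 0.4, "Pittsburgh": 0.4},
}
UNSEEN = 1e-3  # floor for words a state never emitted


def log_emit(state, word):
    return math.log(EMIT[state].get(word, UNSEEN))


def viterbi(observations):
    """Return the state sequence s maximizing P(s, o)."""
    best = [{s: math.log(START[s]) + log_emit(s, observations[0]) for s in STATES}]
    backpointers = []
    for word in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_spans(observations, path, target):
    """Join consecutive words whose decoded state is `target`."""
    spans, current = [], []
    for word, state in zip(observations, path):
        if state == target:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


words = ["Yesterday", "Pedro", "Domingos", "spoke", "this", "example", "sentence", "."]
path = viterbi(words)
people = extract_spans(words, path, "person name")
print(people)
```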
Finite State Models
[Figure: family of models. Generative directed models: Naïve Bayes, HMMs (sequence), generative directed models (general graphs). Conditional counterparts: Logistic Regression, Linear-chain CRFs (sequence), General CRFs (general graphs).]
Various Annotated Datasets for Event /
Relation Extraction
• ACE
– Automatic Content Extraction
– Newswire
– Successor to MUC
• GENIA
– Medline abstracts
– Similar extraction task in the Biomedical domain
Schemas -> Triples
“Yess! Yess! Its official Nintendo announced today that
they Will release the Nintendo 3DS in north America
march 27 for $250”
PRODUCT RELEASE
COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America
Relation Extraction:
Manufacturer(3DS, Nintendo)
ReleaseDate(3DS, March 27)
Price(3DS, $250)
…
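The schema-to-triples conversion can be sketched as a small mapping from slots to binary relations. This is an illustration only: the `RELATIONS` table covers just the three relations shown on the slide, and `record_to_triples` is a hypothetical helper name.

```python
# Slot -> relation name, limited to the relations shown on the slide.
RELATIONS = {
    "COMPANY": "Manufacturer",
    "DATE": "ReleaseDate",
    "PRICE": "Price",
}


def record_to_triples(record):
    """Turn one PRODUCT RELEASE record into (relation, subject, object) triples,
    using the product as the subject and skipping empty slots."""
    subject = record["PRODUCT"]
    return [(relation, subject, record[slot])
            for slot, relation in RELATIONS.items()
            if record.get(slot) is not None]


record = {
    "COMPANY": "Nintendo",
    "PRODUCT": "3DS",
    "DATE": "March 27",
    "PRICE": "$250",
    "REGION": "North America",
}
triples = record_to_triples(record)
for relation, subj, obj in triples:
    print(f"{relation}({subj}, {obj})")
```

Breaking a schema into binary relations lets each fact be extracted, stored, and missing independently of the others.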
Open Information Extraction (Banko et al. 2007)
Demo (TextRunner)
• http://openie.allenai.org/
Distant (weak) Supervision for Relation Extraction, e.g. [Mintz et al. 2009]
(Albert Einstein, Ulm)
(Mitt Romney, Detroit)
(Barack Obama, Honolulu)
Person           Birth Location
Barack Obama     Honolulu
Mitt Romney      Detroit
Albert Einstein  Ulm
Nikola Tesla     Smiljan
…                …
“Barack Obama was born on
August 4, 1961 at … in the city
of Honolulu ...”
“Birth notices for Barack Obama were
published in the Honolulu Advertiser…”
“Born in Honolulu, Barack Obama went
on to become…”
…
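The labeling heuristic behind this slide can be sketched as follows: any sentence that mentions both entities of a known (person, birth location) pair is taken as a training example for the relation. This is a simplified illustration using plain substring matching; the relation name `BirthLocation`, the `label_sentences` helper, and the last sentence in `SENTENCES` are assumptions added for the example.

```python
# Knowledge base rows from the slide.
KB = [
    ("Barack Obama", "Honolulu"),
    ("Mitt Romney", "Detroit"),
    ("Albert Einstein", "Ulm"),
    ("Nikola Tesla", "Smiljan"),
]

SENTENCES = [
    "Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...",
    "Birth notices for Barack Obama were published in the Honolulu Advertiser…",
    "Born in Honolulu, Barack Obama went on to become…",
    "Mitt Romney gave a speech in Ohio.",  # hypothetical, mentions no KB pair
]


def label_sentences(sentences, kb, relation="BirthLocation"):
    """Heuristically label every sentence that mentions both entities of a KB pair."""
    examples = []
    for sentence in sentences:
        for person, place in kb:
            if person in sentence and place in sentence:
                examples.append((sentence, relation, person, place))
    return examples


training_examples = label_sentences(SENTENCES, KB)
for sentence, relation, person, place in training_examples:
    print(f"{relation}({person}, {place}) <- {sentence}")
```

Note that the second sentence is labeled even though it only mentions a newspaper name; this kind of noise is why the supervision is called weak.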
Demo (NELL)
• http://rtw.ml.cmu.edu/rtw/kbbrowser/
Demo (Literome)
• http://literome.azurewebsites.net/
Knowledge Base Population Subtasks
• Entity Recognition/Classification/Linking
• Relation Extraction
• Event Extraction
• Knowledge Base Inference
Applications
• Google knowledge graph
• Facebook graph search
• Biomedical knowledge bases
• -> Your application domain here
– Geoscience knowledge graph?
– Patent knowledge graph?
– Cybersecurity knowledge graph?
Research Groups at Other Places
Why learn about this stuff?
Paper Selection Form!
(please fill out before next class)
https://goo.gl/AghZ1f
Administrative Details
• Course Webpage
– http://aritter.github.io/courses/5539_fall15.html