Transcript: R. Rastogi (Yahoo!)

Information Extraction Research @
Yahoo! Labs Bangalore
Rajeev Rastogi
Yahoo! Labs Bangalore
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
– News, finance, sports
– Answers, flickr, del.icio.us
– Mail, messaging
– Search
Unparalleled scale
• 25 terabytes of data collected each day
– Over 4 billion clicks every day
– Over 4 billion emails per day
– Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable
No other company on the planet
processes as much data as we do!
Yahoo! Labs Bangalore
• Focus is on basic and applied research
– Search
– Advertising
– Cloud computing
• University relations
– Faculty research grants
– Summer internships
– Sharing data/computing
infrastructure
– Conference sponsorships
– PhD co-op program
What does search look like today?
Search results of the future: Structured abstracts
• Example structured abstracts from: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd
Search results of the future: Intelligent ranking
• Rank by price
A key technology for enabling
search transformation
Information extraction (IE)
Information extraction (IE)
• Goal: Extract structured records from Web pages
– Example attributes: Name, Category, Address, Map, Phone, Price, Reviews
Multiple verticals
• Business, social networking, video, …
• One schema per vertical (see the sketch below)
– Example attributes across verticals: Price, Category, Address, Phone, Title, Name, Posted by, Date, Education, Connections, Rating, Views
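To make "one schema per vertical" concrete, here is a minimal sketch of two hypothetical vertical schemas as Python dataclasses; the field groupings are illustrative assumptions, not the schemas actually used at Yahoo!.

```python
# Hypothetical per-vertical record schemas (illustrative only; the talk does
# not spell out exact field groupings).
from dataclasses import dataclass
from typing import Optional

@dataclass
class BusinessRecord:           # business vertical (e.g. restaurant listings)
    name: str
    category: str
    address: str
    phone: str
    price: Optional[str] = None

@dataclass
class VideoRecord:              # video vertical
    title: str
    posted_by: str
    date: str
    rating: Optional[float] = None
    views: Optional[int] = None

record = BusinessRecord(
    name="Chinese Mirch",
    category="Chinese, Indian",
    address="120 Lexington Avenue, New York, NY 10016",
    phone="(212) 532 3663",
)
print(record)
```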
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts
Web page types
• Template-based
• Hand-crafted
Template-based pages
• Pages within a Web site are generated by scripts and have very similar structure
– Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top
results of search queries
• E.g. search query: “Chinese Mirch New York”
– 9 template-based pages in the top 10 results
Wrapper Induction
• Enables extraction from template-based pages
• Workflow: sample pages from the Web site → annotate the sample pages → learn wrappers (XPath rules) from the annotations → apply the wrappers to the site's pages → extract records
• Example: generalize the learned XPath (applied in the sketch below)
– /html/body/div/div/div/div/div/div/span → /html/body//div//span
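As a rough illustration (assumed, not the speaker's code), the sketch below applies an exact and a generalized XPath with lxml; the HTML snippet and paths are made-up stand-ins for a learned wrapper.

```python
# Rough sketch (assumed, not the speaker's code): applying an exact and a
# generalized XPath wrapper with lxml. The HTML snippet and paths are
# made-up stand-ins for a learned wrapper.
from lxml import html

page = html.fromstring("""
<html><body>
  <div><div><div><span>Chinese Mirch</span></div></div></div>
  <div><span>(212) 532 3663</span></div>
</body></html>""")

exact_xpath = "/html/body/div/div/div/span"   # tied to one exact template path
general_xpath = "/html/body//div//span"       # tolerates extra/missing nesting

print([e.text for e in page.xpath(exact_xpath)])    # ['Chinese Mirch']
print([e.text for e in page.xpath(general_xpath)])  # ['Chinese Mirch', '(212) 532 3663']
```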
Filters
• Apply filters to prune from multiple candidates
that match XPath expression
XPath: /html/body//div//span
Regex Filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
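A minimal sketch (assumed, not the actual filter code) of pruning XPath candidates with the phone-number regex, so only fields that look like phone numbers survive. The regex here also accepts a space separator, as in the example pages shown later.

```python
# Minimal sketch (assumed): pruning XPath candidates with a phone-number
# regex filter, so only fields that look like phone numbers survive.
import re

PHONE_RE = re.compile(r"\(\d{3}\) \d{3}[- ]\d{4}")

candidates = [
    "Chinese Mirch",          # matched by the generalized XPath, but not a phone
    "(212) 532 3663",
    "Open 11am - 10pm",
]

phones = [c for c in candidates if PHONE_RE.search(c)]
print(phones)  # ['(212) 532 3663']
```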
Limitations of wrappers
• Won’t work across Web sites due to different
page layouts
• Scaling to thousands of sites can be a challenge
– Need to learn a separate wrapper for each site
– Annotating example pages from thousands of sites
can be time-consuming & expensive
Research challenge
• Unsupervised IE: Extract attribute values from
pages of a new Web site without annotating a
single page from the site
• Only annotate pages from a few sites initially as
training data
Conditional Random Fields
(CRFs)
• Models conditional probability distribution of label sequence
y=y1,…,yn given input sequence x=x1,…,xn
P(y | x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{|x|} \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x, t) \right)
– f_k: feature functions, λ_k: weights
• Choose λ_k to maximize the log-likelihood of the training data
• Use Viterbi algorithm to compute label sequence y with
highest probability
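A minimal training sketch, assuming the third-party sklearn-crfsuite package as a stand-in for whatever CRF implementation was actually used; the toy page, features, and labels are fabricated for illustration.

```python
# Minimal sketch (assumed library and toy data, not the speaker's setup):
# a linear-chain CRF over token sequences with sklearn-crfsuite.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features; real Web-page features would be richer."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_all_caps": tok.isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
    }

# One toy page as a token sequence with attribute labels.
pages = [["Chinese", "Mirch", "(212)", "532", "3663"]]
labels = [["Name", "Name", "Phone", "Phone", "Phone"]]

X = [[token_features(p, i) for i in range(len(p))] for p in pages]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)           # weights chosen to maximize conditional log-likelihood
print(crf.predict(X))   # Viterbi decoding of the most probable label sequence
```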
CRFs-based IE
• Web pages can be viewed as labeled sequences
– Example label sequence over page fields: Name, Noise, Category, Address, Phone
• Train a CRF using pages from a few Web sites
• Then use the trained CRF to extract from the remaining sites
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short
strings with similar structure
• However, may not work too well across Web
sites that
– contain long pages with lots of noise
– have very different structure
An alternate approach that
exploits site knowledge
• Build attribute classifiers for each attribute
– Use pages from a few initial Web sites
• For each page from a new Web site
– Segment page into sequence of fields (using static repeating
text)
– Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels (see the sketch after this list)
– Uniqueness: an attribute occurs at most once in a page
– Proximity: attribute values appear close together in a page
– Structural: relative positions of attributes are identical across
pages of a Web site
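A toy sketch of the constraint step (my own simplification, not the actual algorithm): enumerate candidate label assignments, keep those that satisfy uniqueness and a Name-before-Category precedence, and prefer the assignment with the fewest Noise labels. In the real system the classifier scores would presumably be combined with the constraints rather than ignored.

```python
# Toy constraint-based disambiguation (a simplification, not the actual
# algorithm): search over candidate label assignments, keep those that
# satisfy uniqueness and Name-before-Category precedence, and prefer the
# assignment with the fewest Noise labels.
from itertools import product

def satisfies(labels):
    non_noise = [l for l in labels if l != "Noise"]
    if len(non_noise) != len(set(non_noise)):        # uniqueness
        return False
    if "Name" in labels and "Category" in labels:    # precedence: Name < Category
        if labels.index("Name") > labels.index("Category"):
            return False
    return True

def resolve(fields):
    """fields: list of (field_text, candidate_labels) in page order."""
    candidate_sets = [sorted(cands) for _, cands in fields]
    best = None
    for labels in product(*candidate_sets):
        if satisfies(labels):
            noise = labels.count("Noise")
            if best is None or noise < best[1]:
                best = (labels, noise)
    return [(text, label) for (text, _), label in zip(fields, best[0])]

# Page 3 from the example on the next slide.
page3 = [
    ("21 Club", {"Name", "Noise"}),
    ("American", {"Category", "Name"}),
    ("21 W 52nd St, New York, NY 10019", {"Address"}),
    ("(212) 582 7200", {"Phone"}),
]
print(resolve(page3))
# [('21 Club', 'Name'), ('American', 'Category'),
#  ('21 W 52nd St, New York, NY 10019', 'Address'), ('(212) 582 7200', 'Phone')]
```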
Attribute classifiers + constraints example
• Page 1: "Chinese Mirch" → Name; "Chinese, Indian" → Category; "120 Lexington Avenue, New York, NY 10016" → Address; "(212) 532 3663" → Phone
• Page 2: "Jewel of India" → Name; "Indian" → Category; "15 W 44th St, New York, NY 10016" → Address; "(212) 869 5544" → Phone
• Page 3 (ambiguous classifier output): "21 Club" → Name, Noise; "American" → Category, Name; "21 W 52nd St, New York, NY 10019" → Address; "(212) 582 7200" → Phone
• Uniqueness constraint: Name
• Precedence constraint: Name < Category
• Page 3 after applying the constraints: "21 Club" → Name; "American" → Category; "21 W 52nd St, New York, NY 10019" → Address; "(212) 582 7200" → Phone
Performance evaluation:
Datasets
• 100 pages from 5 restaurant Web sites with very
different structure
– www.citysearch.com
– www.fromers.com
– www.nymag.com
– www.superpages.com
– www.yelp.com
• Extract attributes: Name, Address, Phone num,
Hours of operation, Description
Methods considered
• CRFs, attribute classifiers + constraints
• Features (see the sketch below)
– Lexicon: Words in the training Web pages
– Regex: isAlpha, isAllCaps, isNum, is5DigitNum,
isDay,…
– Attribute-level: Num of words, Overlap with title,…
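To make the feature families above concrete, here is a small sketch (assumed, not the actual feature code) computing a few regex-style and attribute-level features for a single extracted field.

```python
# Sketch of the regex-style and attribute-level features listed above
# (assumed implementation; names and details are illustrative).
import re

def field_features(text, page_title=""):
    words = text.split()
    return {
        # Regex features
        "isAlpha": text.replace(" ", "").isalpha(),
        "isAllCaps": text.isupper(),
        "isNum": text.isdigit(),
        "is5DigitNum": bool(re.fullmatch(r"\d{5}", text)),
        # Attribute-level features
        "numWords": len(words),
        "overlapWithTitle": len(set(words) & set(page_title.split())),
    }

print(field_features("10016"))
print(field_features("Chinese Mirch", page_title="Chinese Mirch - New York"))
```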
Evaluation methodology
• Metrics
– Precision, recall, F1 for attributes
• Test on one site, use pages from the remaining 4 sites as training data
• Average measures over all 5 sites (see the sketch below)
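A minimal sketch of the leave-one-site-out protocol and the metrics, with an assumed triple-based representation of extractions; this is not the actual evaluation harness.

```python
# Sketch of the evaluation protocol (assumed representation: extractions as
# (page_id, attribute, value) triples), not the actual harness.
sites = ["citysearch", "fromers", "nymag", "superpages", "yelp"]

def precision_recall_f1(extracted, gold):
    tp = len(extracted & gold)
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Leave-one-site-out: train on 4 sites, test on the held-out site,
# then average the metrics over all 5 folds.
# for test_site in sites:
#     train_sites = [s for s in sites if s != test_site]
#     ... train on train_sites, extract from test_site,
#     then call precision_recall_f1(predicted, gold)
```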
Experimental results
             Precision          Recall
             CRF    Constraint  CRF    Constraint
Name         .39    1           .34    1
Phone        .02    1           .2     .99
Address      .01    .81         .16    .83
Hours        .22    1           .36    1
Desc         .13    .25         0      .15
Overall      .15    .81         .21    .76
Other IE scenarios: Browse page extraction
• Similar-structured records
IE big picture/taxonomy
• Things to extract from
– Template-based, browse, hand-crafted pages, text
• Things to extract
– Records, tables, lists, named entities
• Techniques used
– Structure-based (HTML tags, DOM tree paths) – e.g.
Wrappers
– Content-based (attribute values/models) – e.g. dictionaries
– Structure + Content (sequential/hierarchical relationships
among attribute values) – e.g. hierarchical CRFs
• Level of automation
– Manual, supervised, unsupervised