BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carlson and Charles Schafer
Abstract
• A system that requires no human supervision
• Previous work:
• Required significant human effort
• Their solution:
• Requires 2-5 annotated pages from 4-6 web sites to train the model
• No human supervision for the target web site
• Result:
• Accuracy of 83.8% and 91.1% for different sites.
Introduction
• Extracting structured records from detail pages of semi-structured web pages
Introduction
• Why the semi-structured web?
• Great source of information
• Attribute/value structure is useful to downstream learning or querying systems
Related Work
• Problem of Previous Work
• No labeled example pages are needed, but the output must be labeled manually
• Irrelevant fields (20 data fields and 7 schema columns)
• DeLa system (automatically labels extracted data)
• Problem of labeling detected data fields
• A data field does not have a label
• Multiple fields of the same data type
Methods
• Terms:
• Domain schema: a set of attributes
• Schema column: a single attribute
• Detail page: a page that corresponds to a single data record
• Data field: a location within a site's page template
• Data value: an instance of a data field
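The terms above can be pictured as a small set of data structures. This is only an illustrative sketch: the class names, attribute names, and the example schema columns are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

# Hypothetical domain schema: each entry is one schema column.
DOMAIN_SCHEMA = ["title", "price", "location"]

@dataclass
class DataField:
    """A location within a site's template."""
    field_id: str
    values: list = field(default_factory=list)    # one data value per detail page
    contexts: list = field(default_factory=list)  # text found just before each value

@dataclass
class Site:
    """A web site whose detail pages share one template."""
    name: str
    data_fields: dict = field(default_factory=dict)  # field_id -> DataField
    labels: dict = field(default_factory=dict)       # field_id -> schema column (annotated sites only)
```

In this picture, an annotated training site has entries in labels, while the target site has data fields but an empty labels mapping that the method is meant to fill in.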
Methods
• Detecting Data Fields
• Partial Tree Alignment Algorithm
Methods
• Classifying Data Fields
• Assign a score to each schema column
• c: a schema column, represented by data values from the training data
• f: a data field, compared against contexts from the training data
• Compute the score:
• Use a classifier to map data fields to schema columns
• Use a model with K different feature types
Methods
• Feature Types
• Precontext character 3-grams
• Lowercase value tokens
• Lowercase value character 3-grams
• Value token types
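The four feature types listed above could be extracted roughly as follows. This is a minimal sketch under stated assumptions: the tokenizer, the token-type inventory (NUMBER, ALLCAPS, and so on), and the function names are hypothetical, since the slides name the feature types but do not define them.

```python
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def token_type(token):
    """Map a token to a coarse type (assumed inventory, not from the slides)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "CAPITALIZED"
        return "LOWERCASE"
    if any(ch.isdigit() for ch in token):
        return "ALPHANUMERIC"
    return "PUNCT"

def extract_features(precontext, value):
    """Count feature values of each of the four feature types for one data value."""
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "lower_value_tokens": Counter(t.lower() for t in tokens),
        "lower_value_char_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }
```

For example, extract_features("Price: ", "$1,200 per Week") yields one Counter per feature type. Summing these Counters over all values of a data field gives the feature-value counts for that field.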
Methods
• Comparing Distributions of Feature Values
• Advantages
• Similar data values
• Avoids over-fitting
• with high-dimensional feature spaces
• with a small number of training examples
Methods
• KL-Divergence
• Smoothed version
• Skew Similarity Score
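The formulas for this slide did not survive in the transcript, so the following is a sketch based on the standard definitions rather than the authors' exact equations. It shows KL divergence and the skew divergence, a smoothed variant that mixes a little of the first distribution into the second so the result stays finite. The value alpha=0.99 and the conversion of the divergence into a similarity via exp(-d) are assumptions made for illustration.

```python
import math

def normalize(counts):
    """Turn a mapping of feature value -> count into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    Infinite when q assigns zero probability to something p can produce.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def skew_divergence(p, q, alpha=0.99):
    """D(p || alpha*q + (1 - alpha)*p): KL against a mixture, finite for alpha < 1."""
    mixed = {x: alpha * q.get(x, 0.0) + (1 - alpha) * px for x, px in p.items()}
    return kl_divergence(p, mixed)

def skew_similarity(p, q, alpha=0.99):
    """Assumed conversion to a similarity in (0, 1]; higher means more alike."""
    return math.exp(-skew_divergence(p, q, alpha))
```

With two distributions that share no feature values, plain KL divergence is infinite while the skew divergence is log(1 / (1 - alpha)), which is why the smoothed version is easier to use when many feature values are unseen.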
Methods
• Combining Skew Similarity Scores
• Combine the skew similarity scores for the different feature types using a linear regression model
• Stacked classifier model
• Labeling the Target Site
• For each schema column c, choose the data field with the highest score
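One way to picture the combination and labeling steps is sketched below. It fits least-squares weights over the K per-feature-type similarity scores, then labels a target site by picking the highest-scoring data field for each schema column. The function names, the 0/1 regression targets, and the pick-the-maximum rule are assumptions for illustration, and the stacking details of the original model are not reproduced here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular; a singular system raises ZeroDivisionError.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_combiner(score_vectors, targets):
    """Least-squares weights via the normal equations; the last weight is the bias."""
    rows = [list(s) + [1.0] for s in score_vectors]
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear_system(xtx, xty)

def combined_score(weights, scores):
    """Weighted sum of the K similarity scores plus the bias."""
    return sum(w * s for w, s in zip(weights, scores)) + weights[-1]

def label_site(weights, field_scores, columns):
    """field_scores maps (data_field, schema_column) to its K similarity scores."""
    labels = {}
    for c in columns:
        candidates = {f: combined_score(weights, s)
                      for (f, col), s in field_scores.items() if col == c}
        if candidates:
            labels[c] = max(candidates, key=candidates.get)
    return labels
```

In this sketch the training targets would be 1 for (data field, schema column) pairs that match in the annotated sites and 0 otherwise, so a higher combined score means a more plausible match.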
Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
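The slide says only "cross-validation", so the loop below is one plausible reading: hold out each annotated site in turn as the target, train on the rest, and measure how many of the held-out site's data fields receive the correct schema column. The function names, the train_and_label callback, and this definition of accuracy are assumptions for illustration.

```python
def leave_one_site_out(sites, train_and_label):
    """Estimate per-site labeling accuracy by holding out one site at a time.

    sites: {site_name: {data_field: gold_schema_column}}
    train_and_label(training_sites, target_name) -> {data_field: predicted_column}
    (hypothetical callback standing in for the full training and labeling pipeline)
    """
    accuracy = {}
    for target, gold in sites.items():
        # Train only on the other sites, never on the held-out target.
        training = {name: ann for name, ann in sites.items() if name != target}
        predicted = train_and_label(training, target)
        correct = sum(1 for f, col in gold.items() if predicted.get(f) == col)
        accuracy[target] = correct / len(gold)
    return accuracy
```

Averaging the returned per-site accuracies would give a single figure for the domain.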
Results by Site
Results by Schema Column
Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%
Conclusion