Feature Generation and Selection in SRL


Feature Generation and Selection in SRL
Alexandrin Popescul & Lyle H. Ungar
Presented By Stef Schoenmackers
Overview
• Structural Generalized Linear Regression
(SGLR) Overview
• Design Motivations
• Experiments
• Conclusions
SGLR Overview
• Adds statistical methods to ILP
 SQL as the logical language
 Generalized Linear Regression as statistical method
• Uses clustering to generate new relations
• Builds discriminative models
 Targeted at large problems where generative models are infeasible
• Integrates feature generation and problem
modeling
SGLR Loop
[Diagram: the SGLR feature generation and model selection loop]
SGLR Method
• Clusters data and adds clusters as new relations
• Searches the space of SQL query refinements
 Features are numerical SQL aggregates
 Tests features with a statistical measure (e.g. AIC, BIC)
 Adds only significantly predictive features
 Examines each feature only once
 Uses the current set of features to guide the search
Overview
Structural Generalized Linear Regression
(SGLR) Overview
• Design Motivations
• Experiments
• Conclusions
SQL Motivation
• Most of the world’s data is in relational
databases
 Can exploit schema and meta-information
• SQL is a fairly expressive language
 Non-recursive first-order logic formulas
• Relational DBs have been studied and
optimized for decades, so should be more
scalable than other alternatives
Clustering Motivation
• Dimensionality reduction
• Clusters are added as relations (new first-class concepts)
 Increases expressivity of the language
describing patterns in the data
 Can lead to a more rapid discovery of
predictive features
• Done as a pre-processing step
 cost(clustering) << cost(feature search)
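The pre-processing step above can be sketched as plain k-means whose labels are exposed as a new relation. A minimal sketch with deterministic farthest-point initialization; `kmeans` and `cluster_relation` are illustrative names, and `DocCluster` is a hypothetical relation.

```python
def _dist2(p, q):
    # Squared Euclidean distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: _dist2(p, centers[c]))
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign

def cluster_relation(ids, assign):
    # Expose cluster membership as tuples of a new relation, e.g.
    # DocCluster(doc_id, cluster_id), joinable like any other table.
    return [(i, c) for i, c in zip(ids, assign)]
```

The query search can then join `DocCluster` exactly like the original relations, which is what makes the clusters first-class concepts.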
Aggregation Motivation
• Summarizes the information in a table into scalar
values usable by a statistical model
 average, max, min, count, empty/exists (0/1)
• Exploits decades of database optimization work for efficiency
• Provides a richer space of features to choose from
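These aggregates can be computed directly in SQL. A toy example using Python's built-in sqlite3, with a hypothetical Cites(d, d1) table; exists is just COUNT(*) > 0:

```python
import sqlite3

# Toy Cites(d, d1) table; aggregates over the rows a query returns for each
# target document become the scalar features a statistical model can consume.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cites(d TEXT, d1 TEXT)")
conn.executemany("INSERT INTO Cites VALUES (?, ?)",
                 [("p1", "p2"), ("p1", "p3"), ("p2", "p3")])

# count and count-distinct aggregates, one row per target document
features = conn.execute(
    "SELECT d, COUNT(*), COUNT(DISTINCT d1) FROM Cites GROUP BY d ORDER BY d"
).fetchall()
print(features)  # -> [('p1', 2, 2), ('p2', 1, 1)]
```

Because the database engine evaluates the GROUP BY and aggregates, the learner only ever sees small numeric tables.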
Dynamic Feature Generation
• Most features do not provide useful information
• In large domains, feature generation is expensive, and precomputing all possible features is far too time consuming
• Solution: Use a smarter search strategy and dynamically generate features. Let the features already selected influence which features are added
• Focuses only on the promising areas in the search space
Feature Streams
• Put features into different evaluation
queues
• Choose next feature from the ‘best’ stream
• If feature in multiple streams, only evaluate
once
• Stream design can use prior
knowledge/bias
Refinement Graphs (in ILP)
• Starts with the most general rule and refines it to produce more specific clauses
 Single variable substitution
 Add predicate involving 1+ existing variables
• Uses top-down breadth-first search to find
the most general rule that covers only
positive examples
• Performs poorly in noisy domains
Refinement Graphs (in SGLR)
• Adds one relation to a query and expands it into all possible configurations of equality conditions between the new relation's attributes and new or existing attributes
 Contains at least one equality condition between a
new and old attribute
 Any attribute can be set to a constant
 High-level variable typing/classes are enforced
• Not all refinements are most general, but this simplifies pruning of equivalent subspaces (pruning accounts only for the type and number of relations joined in a query)
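The expansion step can be sketched as follows, assuming a toy two-relation schema; setting attributes to constants and full equivalence pruning are omitted, and all names are illustrative.

```python
from itertools import product

# Hypothetical schema: relation name -> attribute types.
SCHEMA = {"Cites": ("doc", "doc"), "Author_of": ("doc", "person")}

def refine(query):
    """query: list of (relation, variables). Yield refinements that join one
    new relation, requiring at least one equality between a new attribute and
    an existing variable of the same type."""
    old_vars = {}  # variable -> type, collected from the current query
    for rel, binding in query:
        for pos, var in enumerate(binding):
            old_vars[var] = SCHEMA[rel][pos]
    n_used = sum(len(b) for _, b in query)
    for rel, types in SCHEMA.items():
        # Each attribute binds either to a type-compatible old variable
        # or to a fresh one (high-level typing is enforced here).
        options = []
        for pos, t in enumerate(types):
            compat = [v for v, vt in old_vars.items() if vt == t]
            options.append(compat + [f"v{n_used + pos}"])
        for combo in product(*options):
            if any(v in old_vars for v in combo):  # at least one join condition
                yield query + [(rel, combo)]
```

For example, refining Cites(d, d1) yields joins such as Cites(d, d1), Author_of(d1, a) alongside the second-citation joins.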
Example Refinement Graph
Query(d)
 Cites(d,d1)
  Cites(d,d1), Cites(d1,d2)
  Cites(d,d1), Author_of(d1, a)
   Cites(d,d1), Author_of(d1, a=“Domingos”)
 Author_of(d, a)
  Author_of(d, a=“Smith”)
 Word_count(d, w, int)
(Queries are evaluated against the DB tables.)
Overview
Structural Generalized Linear Regression
(SGLR) Overview
Design Motivations
• Experiments
• Conclusions
Experiments
• Used CiteSeer data
 Citation(doc1, doc2), Author(doc, person),
PublishedIn(doc, venue), HasWord(doc,word)
 60k Docs, 131k Authors, 173k Citations, 6.8M
Words
• Two Tasks
 Predict the publication venue
 Predict existence of a citation
Experiments
• Cluster all many-to-many relations
 K-means
 Added 6 new relations
• Use logistic regression for prediction
• BFS of the search space
• 5k+/5k− examples for venue prediction
• 2.5k+/2.5k− examples for citation prediction
Results
• Venue prediction: 87.2% accuracy
• Citation prediction: 93.1% accuracy
Dynamic Feature Generation
• Query expressions generated Breadth-First
• Baseline puts all queries into one queue
• Dynamic strategy enqueues queries into
separate streams
 Stream 1: exists and count over table
 Stream 2: other aggregates (counts of unique
elements in individual columns)
 Chooses next feature from stream where
(featuresAdded+1)/(featuresTried+1) is max
 Stop when a stream is empty
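The stream policy above can be sketched as follows, with `evaluate` standing in for the statistical significance test; all names are illustrative.

```python
def stream_search(streams, evaluate):
    """streams: dict name -> list of candidate features (evaluation queues).
    evaluate(feature) -> True when a feature is significantly predictive.
    Picks from the stream maximizing (added+1)/(tried+1) and stops as soon
    as any stream is exhausted."""
    stats = {s: {"tried": 0, "added": 0} for s in streams}
    selected, seen = [], set()
    while all(streams.values()):
        s = max(streams, key=lambda t: (stats[t]["added"] + 1) / (stats[t]["tried"] + 1))
        f = streams[s].pop(0)
        if f in seen:                # a feature in multiple streams: evaluate once
            continue
        seen.add(f)
        stats[s]["tried"] += 1
        if evaluate(f):
            stats[s]["added"] += 1
            selected.append(f)
    return selected
```

The +1 smoothing means an untried stream starts with ratio 1, so every stream gets probed before the search commits to the most productive one.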
Results
[Chart: accuracy on the venue and citation tasks, with vs. without cluster relations]
Time Results
[Chart: feature-search time on the venue and citation tasks, with vs. without cluster relations]
Domain Independent Learning
• Most citation prediction features are
research-area generic
• Can we train a model for one area and test
on another?
Domain Independent Results
• Used KDD-Cup 2003 data (High Energy
Physics papers in arXiv)
Train On    Test On     Accuracy
CiteSeer    arXiv       92.9%
CiteSeer    CiteSeer    92.6%
arXiv       arXiv       96.0%
Conclusions
• Cluster-based features add expressivity,
and apply to any domain or SRL method
• Generating queries dynamically can
reduce search time and increase accuracy
Questions?