Analyzing Execution Strategies for Text-Based Tasks
SDBI 2007
Maayan Shiloach
[email protected]
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano
Text is everywhere
2 Approaches
Crawl
Query
Outline
Analyze
Compare
Optimize
Two Examples of Text Tasks
Information Extraction
Ynet archive
Year | Country | Magnitude | Casualties
2008 | Mexico  | 5.7       | 0
2008 | Greece  | 5.2       | 0
Content Summary Construction
www.cs.huji.ac.il
Word     | Frequency
database | 439
Internet | 622
crawl    | 36
Abstract View
[Diagram: DB → Execution Strategy → Tokens]
Execution time = Train + Query evaluation + Retrieve + Filter + Process
Metrics: Execution time, Recall
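Putting the components above into one formula (a reconstruction in the paper's style; the per-unit times t_T, t_Q, t_R, t_F, t_P and the counts Q_sent, D_retr, D_proc are notational assumptions here):

Time(S, D) ≈ t_T + |Q_sent| · t_Q + |D_retr| · (t_R + t_F) + |D_proc| · t_P

where t_T is the training time, t_Q the time per query sent, and t_R, t_F, t_P the times to retrieve, filter, and process a single document.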
Four Strategy Examples
(1) Scan
Given a target recall T:
  while recall < T:
    retrieve an unprocessed document
    process it and extract tokens
Scan Execution Time
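Specializing the general model to Scan (no training, no querying, no filtering, and every retrieved document is processed; a reconstruction, not quoted from the slides):

Time(Scan) ≈ |D_retr| · (t_R + t_P)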
But what is D_retr?
After scanning S documents, what is the probability that a given token appears in at least one of them?
To obtain recall T, take the minimal S such that
expected tokens in S documents ≥ T · |Tokens|
This S is D_retr.
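A minimal sketch of this computation in Python, assuming the token degrees g(t) (the number of documents containing each token) are known and documents are sampled uniformly; the with-replacement approximation 1 - (1 - g/|D|)^S is used instead of the exact hypergeometric form to keep it short, and all names are illustrative:

def expected_tokens(token_degrees, num_docs, s):
    # Probability that a token appearing in g documents shows up in at least
    # one of s uniformly sampled documents (with-replacement approximation).
    return sum(1.0 - (1.0 - g / num_docs) ** s for g in token_degrees)

def scan_d_retr(token_degrees, num_docs, target_recall):
    # Smallest S whose expected token yield reaches target_recall * |Tokens|.
    needed = target_recall * len(token_degrees)
    for s in range(1, num_docs + 1):
        if expected_tokens(token_degrees, num_docs, s) >= needed:
            return s
    return num_docs  # approximation never reaches the target; a full scan is needed

Since the expected yield is monotone in S, a binary search over S would be faster; the linear loop keeps the sketch minimal.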
(2) Filtered Scan
Given a target recall T:
  while recall < T:
    retrieve an unprocessed document
    if the document is useful (passes the classifier):
      process it and extract tokens
Filtered Scan Execution Time
Classifier selectivity: the fraction of retrieved documents that the classifier judges useful and passes on for processing.
D_retr? Using calculations similar to Scan:
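The same expected-yield computation applies, except that only tokens in classifier-accepted documents count, so an imperfect classifier also caps the achievable recall. For the cost, with σ denoting the classifier selectivity (a reconstruction, not quoted from the slides):

Time(Filtered Scan) ≈ |D_retr| · (t_R + t_F) + σ · |D_retr| · t_P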
(3) Iterative Set Expansion (ISE)
[Diagram: Seed tokens → Create query → DB evaluates the query and returns documents → Processor extracts tokens → new tokens feed the next query]
ISE Execution Time
Querying Graph: a bipartite graph between tokens and documents.
What are the execution time and recall if the seed is a single token?
Finding Q_sent and D_retr
Compute:
• the number of documents retrieved after Q token queries
• the number of tokens contained in those retrieved documents
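The paper derives these quantities analytically with generating functions over the degree distributions; as an illustration only, here is a direct simulation on an explicit (or sampled) querying graph, where token_to_docs maps a token to the documents a query for it returns and doc_to_tokens maps a document to the tokens extracted from it (both names are assumptions):

def simulate_ise(token_to_docs, doc_to_tokens, seed_tokens, max_queries):
    # Count queries sent, documents retrieved, and tokens discovered
    # after querying with up to max_queries tokens, starting from the seed.
    retrieved, known = set(), set(seed_tokens)
    frontier = list(seed_tokens)          # tokens not yet used as queries
    queries_sent = 0
    while frontier and queries_sent < max_queries:
        token = frontier.pop()
        queries_sent += 1
        for doc in token_to_docs.get(token, ()):
            if doc in retrieved:
                continue
            retrieved.add(doc)            # document retrieved and processed
            for t in doc_to_tokens.get(doc, ()):
                if t not in known:        # newly extracted token feeds later queries
                    known.add(t)
                    frontier.append(t)
    return queries_sent, len(retrieved), len(known)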
Reachability Graph
What is the maximal recall in this case?
Recall Limit
The total size of the connected components that contain seed tokens (as a fraction of all tokens).
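Using the same graph representation as the sketch above, the recall limit is just the fraction of tokens reachable from the seed; a minimal sketch:

def ise_recall_limit(token_to_docs, doc_to_tokens, seed_tokens, total_tokens):
    # Tokens reachable from the seed in the querying graph bound ISE's recall.
    reachable, stack = set(seed_tokens), list(seed_tokens)
    while stack:
        token = stack.pop()
        for doc in token_to_docs.get(token, ()):
            for t in doc_to_tokens.get(doc, ()):
                if t not in reachable:
                    reachable.add(t)
                    stack.append(t)
    return len(reachable) / total_tokens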
(4) Automatic Query Generation (AQG)
Train a classifier
Derive queries
Execute and process retrieved documents
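One simple way to realize the "derive queries" step (a sketch only; the actual query-generation technique in the paper may differ): take the terms that the trained classifier weights most heavily toward the "useful" class and use them as keyword queries. classifier_weights is an assumed {term: weight} mapping.

def derive_queries(classifier_weights, num_queries=10):
    # Highest positively weighted terms become single-keyword queries.
    ranked = sorted(classifier_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, weight in ranked[:num_queries] if weight > 0]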
AQG Execution Time
Estimating Q_sent and D_retr
D_retr, in turn, depends on the number of queries sent.
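In the same notation as before, a sketch of the AQG cost (a reconstruction, not quoted from the slides), where t_T covers training the classifier and deriving the queries:

Time(AQG) ≈ t_T + |Q_sent| · t_Q + |D_retr| · (t_R + t_P)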
Optimization
Just pick the best strategy for a given recall?
Well…
Some missing parameters:
• Number of tokens
• Classifier selectivity and recall
• Token degree distribution
• Document degree distribution
These are estimated by sampling.
Distribution Examples
Estimating Distributions
Known distribution families:
Task                         | Document degree distribution | Token degree distribution
Information extraction       | Power-law                    | Power-law
Content summary construction | Lognormal                    | Power-law
And the distribution parameters?
Estimating Distributions
Naïve approach: sample until high confidence.
Better approach: estimate during execution.
But is the sample really random?
• Yes for Scan
• ISE: random within each iteration; use generating functions
• Problematic for Filtered Scan and AQG; requires adjustments
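As an illustration of estimating distribution parameters during execution, a minimal sketch that fits a power-law exponent to the token degrees observed so far, using the standard continuous-approximation maximum-likelihood estimator (this particular estimator is an assumption, not something the slides specify):

import math

def fit_power_law_exponent(degrees, x_min=1):
    # MLE for alpha in P(x) ~ x^(-alpha), x >= x_min (continuous approximation):
    # alpha = 1 + n / sum(ln(x_i / (x_min - 0.5)))
    sample = [x for x in degrees if x >= x_min]
    if not sample:
        raise ValueError("no degrees at or above x_min")
    return 1.0 + len(sample) / sum(math.log(x / (x_min - 0.5)) for x in sample)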
Optimization
• Pick a strategy given the known distribution family
• Execute it on a batch of documents and update the statistics
• Replace the strategy if needed
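A minimal sketch of that loop, assuming each strategy object exposes an estimated_time(target_recall, stats) method based on the current statistics and a process_batch(batch_size, stats) method that executes a batch and returns updated statistics (all of these names are illustrative, not the paper's API):

def optimize(strategies, stats, target_recall, batch_size=100):
    # Repeatedly run the currently cheapest strategy, re-estimating after every batch.
    while stats.recall() < target_recall:
        best = min(strategies, key=lambda s: s.estimated_time(target_recall, stats))
        stats = best.process_batch(batch_size, stats)
    return stats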
Experimental Results
Related Work
Optimizing SQL Queries Over Text Databases
A. Jain, A. Doan, L. Gravano
4 Main Differences
1. The document processor is not perfect: its results only partially overlap the ideal token set, so quality should be measured accordingly.
2. Considers on-the-fly SQL query evaluation.
3. Multiple relations and extraction systems
4. No predefined target recall; instead, a "goodness" measure is optimized.
Summary
✓ Defines building blocks for comparing execution strategies for text-centric tasks
✓ Helps predict execution times
✓ Suggests optimizations
× Not easily extendable
× Relies on prior knowledge of the task
Possible Extensions
• Analyze more strategies
• Minimize dependence on prior knowledge
• Optimize classifier training
• Semi-online SQL query extraction
Thank You