To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
Panos Ipeirotis – New York University
Joint work with Luis Gravano, Eugene Agichtein, Pranay Jain
Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text.

  "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in
  the front line of the world's response to the deadly Ebola epidemic in Zaire, is
  finding itself hard pressed to cope with the crisis…"

An information extraction system (e.g., NYU's Proteus) turns such articles into a table
of disease outbreaks in The New York Times:

  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire
Text-Centric Task II: Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct
queries appropriately.

  "Friday June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list
  of CSPI, a consumer-health group that this week sued the operator of the KFC
  restaurant chain…"

A content summary extractor processes such documents to build the content summary of
Forbes.com:

  Word        Frequency
  Starbucks   102 → 103
  consumer    215 → 216
  soccer      1295
  …           …
Text-Centric Task III: Focused Resource Discovery
Identify web pages about a given topic (multiple techniques have been proposed: simple
classifiers, focused crawlers, focused querying, …).

A web page classifier produces the list of web pages about Botany:

  URL
  http://biology.about.com/
  http://www.amjbot.org/
  http://www.sysbot.org/
  http://www.botany.ubc.ca/
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

  Task                     Token
  Information Extraction   Relation Tuple
  Database Selection       Word (+Frequency)
  Focused Crawling         Web Page about a Topic

For the rest of the talk, Information Extraction is the running example.
Executing a Text-Centric Task
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Two major execution paradigms (similar to the relational world):
  Scan-based: retrieve and process documents sequentially
  Index-based: query the database (e.g., [case fatality rate]), retrieve and process
  the documents in the results
→ The underlying data distribution dictates which plan is best.

Unlike the relational world:
  Indexes are only "approximate": the index is on keywords, not on the tokens of interest.
  The choice of execution plan affects output completeness (not only speed).
Execution Plan Characteristics
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Execution plans have two main characteristics:
  Execution Time
  Recall (fraction of tokens retrieved)

Question: How do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in
The New York Times archive?"
Outline
Description and analysis of crawl- and query-based plans
  Crawl-based: Scan, Filtered Scan
  Query-based (Index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Scan
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Process documents
3. Extract output tokens

Scan retrieves and processes documents sequentially (until reaching the target recall).

Execution time = |Retrieved Docs| · (R + P)
  R: time for retrieving a document
  P: time for processing a document

Question: How many documents does Scan retrieve to reach the target recall?
Estimating Recall of Scan
Modeling Scan for a token t (e.g., <SARS, China>):
  What is the probability of seeing t (with frequency g(t)) after retrieving S of the
  database's D documents? This is a "sampling without replacement" process.

After retrieving S documents, the frequency of token t follows a hypergeometric
distribution. The recall for token t is the probability that the frequency of t in the
S retrieved documents is greater than 0.
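To make the model concrete, here is a minimal Python sketch (not part of the talk; the
numbers are hypothetical) of the per-token recall under the hypergeometric model:

    from math import comb

    def scan_recall(g_t: int, S: int, D: int) -> float:
        # Probability of seeing a token that occurs in g_t of the D documents
        # at least once after retrieving S documents without replacement:
        # 1 - P(the S retrieved documents contain none of the g_t matches).
        if g_t <= 0:
            return 0.0
        return 1.0 - comb(D - g_t, S) / comb(D, S)

    # Hypothetical example: a token mentioned in 10 of 100,000 documents,
    # after scanning 20,000 documents.
    print(round(scan_recall(g_t=10, S=20_000, D=100_000), 3))  # ≈ 0.89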
Estimating Recall of Scan
Modeling Scan over all tokens (e.g., <SARS, China>, <Ebola, Zaire>):
  Multiple "sampling without replacement" processes, one for each token t1, …, tM.
  Overall recall is the average recall across tokens.

→ We can compute the number of documents required to reach a target recall.

Execution time = |Retrieved Docs| · (R + P)
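Continuing the sketch above (again with made-up token frequencies), the number of
documents Scan needs can be found by inverting the average per-token recall, which is
monotone in S:

    def scan_documents_needed(token_freqs, D, target_recall):
        # Smallest S such that the average of scan_recall over all tokens reaches
        # the target recall; binary search works because the average recall only
        # grows as more documents are retrieved.
        def avg_recall(S):
            return sum(scan_recall(g, S, D) for g in token_freqs) / len(token_freqs)
        lo, hi = 0, D
        while lo < hi:
            mid = (lo + hi) // 2
            if avg_recall(mid) >= target_recall:
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Hypothetical token frequencies; the execution time is then S · (R + P).
    print(scan_documents_needed([1, 2, 2, 5, 10, 40], D=100_000, target_recall=0.5))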
Scan vs. Filtered Scan
[Diagram: Text Database → Classifier → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Filter documents
3. Process filtered documents
4. Extract output tokens

Scan retrieves and processes all documents (until reaching the target recall).
Filtered Scan uses a classifier to identify and process only promising documents
(e.g., the Sports section of the NYT is unlikely to describe disease outbreaks).

Execution time = |Retrieved Docs| · (R + F + σ·P)
  R: time for retrieving a document
  F: time for filtering a document
  P: time for processing a document
  σ: classifier selectivity (σ ≤ 1)

Question: How many documents does (Filtered) Scan retrieve to reach the target recall?
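The two cost formulas are easy to compare directly; the sketch below uses hypothetical
per-document costs (none of the numbers come from the talk):

    def scan_time(S, R, P):
        # Scan: every retrieved document is also processed.
        return S * (R + P)

    def filtered_scan_time(S, R, F, P, sigma):
        # Filtered Scan: every retrieved document is classified (cost F), but only
        # the fraction sigma accepted by the classifier is processed (cost P).
        return S * (R + F + sigma * P)

    # Hypothetical costs (seconds per document): retrieval 0.05, filtering 0.01,
    # processing 2.0; the classifier accepts 20% of the documents.
    print(scan_time(10_000, 0.05, 2.0))                      # 20500.0
    print(filtered_scan_time(10_000, 0.05, 0.01, 2.0, 0.2))  # 4600.0

For the same target recall, however, Filtered Scan generally has to retrieve more
documents than Scan, since the classifier also discards some documents that contain
tokens; the next slide quantifies this effect.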
Estimating Recall of Filtered Scan
Modeling Filtered Scan:
  The analysis is similar to Scan. The main difference is that the classifier rejects
  documents, and this
  decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity),
  because documents rejected by the classifier shrink the effective database, and
  decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall),
  because tokens in rejected documents have a lower effective frequency.
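A minimal sketch of this adjustment, reusing the scan_recall function from the earlier
Scan sketch (the σ and r values below are hypothetical):

    def filtered_scan_recall(g_t, S, D, sigma, r):
        # The classifier accepts roughly sigma * S of the S retrieved documents;
        # sampling then happens over the sigma * |D| accepted documents, in which
        # the token appears only r * g(t) times (occurrences in rejected documents
        # are lost).
        D_eff = int(sigma * D)
        g_eff = int(r * g_t)
        S_eff = min(int(sigma * S), D_eff)
        if g_eff == 0:
            return 0.0
        return scan_recall(g_eff, S_eff, D_eff)

    # Hypothetical: the classifier keeps 20% of the documents with 85% recall.
    print(round(filtered_scan_recall(g_t=10, S=20_000, D=100_000, sigma=0.2, r=0.85), 2))  # ≈ 0.83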
Outline
Description and analysis of crawl- and query-based plans
  Crawl-based: Scan, Filtered Scan
  Query-based (Index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Iterative Set Expansion
[Diagram: Text Database → Extraction System → Output Tokens, with a Query Generation
step that feeds extracted tokens back as new queries]
1. Query the database with seed tokens, transformed into keyword queries
   (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>)
4. Augment the seed tokens with the new tokens

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query

Question: How many queries and how many documents does Iterative Set Expansion need to
reach the target recall?
Querying Graph
The querying graph is a bipartite graph containing tokens and documents:
  Each token (transformed into a keyword query) retrieves documents.
  Documents contain tokens.

  Tokens:    t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>,
             t4 <Cholera, Sudan>, t5 <H5N1, Vietnam>
  Documents: d1, …, d5
Using Querying Graph for Analysis
We need to compute:
  The number of documents retrieved after sending Q tokens as queries (estimates time).
  The number of tokens that appear in the retrieved documents (estimates recall).

To estimate these, we need to compute:
  The degree distribution of the tokens discovered by retrieving documents.
  The degree distribution of the documents retrieved by the tokens.
  (These are not the same as the degree distribution of a randomly chosen token or
  document: it is easier to discover documents and tokens with high degrees.)

Elegant analysis framework based on generating functions (SIGMOD 2006).
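For illustration, the sketch below computes plain document- and token-degree
distributions from a toy bipartite graph (the edges are made up); the analysis in the
talk additionally needs the biased distributions of the documents actually retrieved
and the tokens actually discovered, which favor high degrees:

    from collections import Counter

    def degree_distributions(doc_tokens):
        # doc_tokens maps each document to the set of tokens it contains
        # (the bipartite querying graph).  Returns two Counters:
        # how many documents have degree k, and how many tokens have degree k.
        doc_degrees = Counter(len(toks) for toks in doc_tokens.values())
        token_degree = Counter()
        for toks in doc_tokens.values():
            token_degree.update(toks)
        token_degrees = Counter(token_degree.values())
        return doc_degrees, token_degrees

    # Toy querying graph with hypothetical edges.
    docs = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t2"}}
    print(degree_distributions(docs))
    # (Counter({2: 2, 1: 1}), Counter({1: 2, 3: 1}))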
Recall Limit: Reachability Graph
[Diagram: the querying graph over tokens t1, …, t5 and documents d1, …, d5 induces a
reachability graph over the tokens: an edge t1 → t2 means that t1 retrieves a document
d1 that contains t2.]

Upper recall limit: determined by the size of the biggest connected component of the
reachability graph.
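A rough sketch of this bound on a toy graph (made-up edges, not the ones on the slide);
it simplifies the directed reachability graph to an undirected one by assuming that a
token's query retrieves every document containing it, so tokens that share a document
fall into the same component:

    from collections import defaultdict, deque

    def recall_upper_bound(doc_tokens):
        # Simplification: tokens that co-occur in a document are connected.
        # The recall of Iterative Set Expansion is then bounded by the size of
        # the largest connected component divided by the total number of tokens.
        adj = defaultdict(set)
        tokens = set()
        for toks in doc_tokens.values():
            toks = list(toks)
            tokens.update(toks)
            for i, a in enumerate(toks):
                for b in toks[i + 1:]:
                    adj[a].add(b)
                    adj[b].add(a)
        seen, largest = set(), 0
        for t in tokens:
            if t in seen:
                continue
            size, queue = 0, deque([t])
            seen.add(t)
            while queue:
                u = queue.popleft()
                size += 1
                for v in adj[u] - seen:
                    seen.add(v)
                    queue.append(v)
            largest = max(largest, size)
        return largest / len(tokens)

    # Toy graph: t1-t3 are connected through shared documents; t4 and t5 are
    # isolated, so at most 3 of the 5 tokens can ever be reached.
    docs = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t4"}, "d4": {"t5"}}
    print(recall_upper_bound(docs))  # 0.6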
Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query
generation. Automatic Query Generation avoids this problem by creating queries offline
(using machine learning); the queries are designed to return documents that contain
tokens.
Automatic Query Generation
[Diagram: Offline Query Generation → Text Database → Extraction System → Output Tokens]
1. Generate queries that tend to retrieve documents with tokens (offline)
2. Query the database
3. Process the retrieved documents
4. Extract tokens from the documents

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query
Estimating Recall of Automatic Query Generation
Each query q retrieves g(q) documents and has precision p(q):
  p(q)·g(q) useful documents (they contain tokens)
  (1 - p(q))·g(q) useless documents

We compute the total number of useful (and useless) documents retrieved.

The analysis is similar to Filtered Scan:
  The effective database size is |D_useful|.
  The sample size S is the number of useful documents retrieved.
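A minimal sketch of this analysis, again reusing the scan_recall function from the Scan
sketch; the query statistics are hypothetical, and overlap between the result sets of
different queries is ignored:

    def aqg_recall(g_t, queries, D_useful):
        # queries: list of (g_q, p_q) pairs, i.e. the number of documents each
        # offline-generated query retrieves and its precision.  The sample for
        # the hypergeometric model is the number of useful documents retrieved;
        # the effective database is the set of useful documents.
        useful_retrieved = sum(int(p_q * g_q) for g_q, p_q in queries)
        S = min(useful_retrieved, D_useful)
        return scan_recall(g_t, S, D_useful)

    # Hypothetical: 50 queries, each retrieving 200 documents at 40% precision,
    # over a useful-document set of 30,000 documents.
    print(round(aqg_recall(g_t=10, queries=[(200, 0.4)] * 50, D_useful=30_000), 2))  # ≈ 0.76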
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
Summary of Cost Analysis
Our analysis so far:
  Takes as input a target recall.
  Gives as output the time for each plan to reach the target recall
  (time = infinity if a plan cannot reach the target recall).

Time and recall depend on task-specific properties of the database:
  Token degree distribution
  Document degree distribution

Next, we show how to estimate these degree distributions on the fly.
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families:

  Task                           Document Distribution   Token Distribution
  Information Extraction         Power-law               Power-law
  Content Summary Construction   Lognormal               Power-law (Zipf)
  Focused Resource Discovery     Uniform                 Uniform

[Log-log plots of the document and token degree distributions for the information
extraction task, with power-law fits y = 43060·x^(-3.3863) and y = 5492.2·x^(-2.0254);
axes: Document Degree vs. Number of Documents, and Token Degree vs. Number of Tokens.]
Can characterize distributions with only a few parameters!
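Since the distribution families are known, only a handful of parameters have to be
estimated from the documents seen so far. As one example, here is a standard
maximum-likelihood estimator for a power-law exponent (the Clauset-Shalizi-Newman
continuous approximation, not code from the talk), applied to made-up token degrees:

    import math

    def powerlaw_alpha_mle(degrees, x_min=1):
        # MLE of the power-law exponent alpha for observations >= x_min, using the
        # continuous approximation alpha = 1 + n / sum(ln(x_i / (x_min - 0.5))).
        xs = [x for x in degrees if x >= x_min]
        return 1.0 + len(xs) / sum(math.log(x / (x_min - 0.5)) for x in xs)

    # Hypothetical token degrees observed during execution.
    sample_degrees = [1, 1, 1, 2, 1, 3, 1, 2, 7, 1, 1, 2]
    print(round(powerlaw_alpha_mle(sample_degrees), 2))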
Parameter Estimation
Naïve solution for parameter estimation:
  Start with a separate "parameter-estimation" phase.
  Perform random sampling on the database.
  Stop when cross-validation indicates high confidence.

We can do better than this! There is no need for a separate sampling phase: sampling is
equivalent to executing the task, so we piggyback parameter estimation onto the
execution itself.
On-the-fly Parameter Estimation
[Figure: the correct (but unknown) distribution vs. the initial "default" estimate and
successively updated estimates.]

Pick the most promising execution plan for the target recall assuming "default"
parameter values.
Start executing the task.
Update the parameter estimates during execution using MLE.
Switch plans if the updated statistics indicate so.

(ACM TODS, Dec. 2007)
Important: only Scan acts as "random sampling"; all other execution plans need
parameter adjustment.
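Putting the pieces together, the optimizer's control loop can be sketched as follows;
the plan objects, the estimate_params callable, and the batch granularity are
hypothetical stand-ins, not the interface from the ACM TODS paper:

    def run_with_optimizer(plans, target_recall, default_params, estimate_params, batch_size):
        # Pick the plan that looks cheapest under the default parameter values,
        # execute it in batches, re-estimate the degree-distribution parameters
        # (e.g., by MLE) from the documents seen so far, and switch plans whenever
        # another plan now looks cheaper for the remaining recall.
        params = default_params
        observed = []          # documents (and their tokens) retrieved so far
        recall = 0.0
        current = min(plans, key=lambda p: p.estimated_time(target_recall, params))
        while recall < target_recall:
            docs, recall = current.execute(batch_size)   # run one batch of the plan
            observed.extend(docs)
            params = estimate_params(observed)           # parameter update
            best = min(plans, key=lambda p: p.estimated_time(target_recall, params))
            if best is not current:
                current = best                           # plan switch
        return recall

As the slide notes, estimates based on documents retrieved by the non-Scan plans are
biased and need adjustment before being plugged into the cost models.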
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
Correctness of Theoretical Analysis
[Chart: Execution Time (secs, log scale from 100 to 100,000) vs. Recall (0.0 to 1.0)
for Scan, Filtered Scan, Iterative Set Expansion, and Automatic Query Generation.
Solid lines: actual time. Dotted lines: predicted time with correct parameters.]

Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT, 16,921 tokens
Experimental Results (Information Extraction)
[Chart: Execution Time (secs, log scale from 100 to 100,000) vs. Recall (0.0 to 1.0)
for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and
OPTIMIZED. Solid lines: actual time. Green line: time with the optimizer.]

(Results are similar in the other experiments; see the ACM TODS paper.)
Conclusions
Common execution plans for multiple text-centric tasks.
Analytic models for predicting the execution time and recall of various crawl- and
query-based plans.
Techniques for on-the-fly parameter estimation.
An optimization framework that picks, on the fly, the fastest plan for a target recall.
Future Work
Incorporate the precision and recall of the extraction system into the framework using
ROC curves.
Create a non-parametric optimizer (i.e., make no assumptions about distribution
families).
Examine other text-centric tasks and analyze new execution plans.
Thank you!
ありがとう (thank you)
References for the execution plans by task:
  Information Extraction: Filtered Scan (Grishman et al., J. of Biomed. Inf. 2002),
    Iterative Set Expansion (Agichtein and Gravano, ICDE 2003),
    Automatic Query Generation (Agichtein and Gravano, ICDE 2003)
  Content Summary Construction: Iterative Set Expansion (Callan et al., SIGMOD 1999),
    Automatic Query Generation (Ipeirotis and Gravano, VLDB 2002)
  Focused Resource Discovery: Filtered Scan (Chakrabarti et al., WWW 1999),
    Automatic Query Generation (Cohen and Singer, AAAI WIBIS 1996)
Overflow Slides
Experimental Results (IE, Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT, 16,921 tokens
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens
Experimental Results (Content Summaries)
ISE is a cheap plan for low target recall but becomes the most expensive plan for high
target recall.

Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens
Experimental Results (Content Summaries)
Underestimated recall for AQG; the optimizer switched to ISE.

Content Summary Extraction: 19,997 documents from 20newsgroups, 120,024 tokens
Experimental Results (Information Extraction)
[Chart: Execution Time (secs, log scale from 100 to 100,000) vs. Recall (0.0 to 1.0)
for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and
OPTIMIZED.]

OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan's recall, but
after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
Focused Resource Discovery
800,000 web pages, 12,000 tokens