Presented by
Avinash S Bharadwaj
How can data be extracted from the web?
- Execution plans for text-centric tasks follow two general paradigms for processing a text database:
- The entire web (or database) can be crawled, scanning every document for the text of interest automatically.
- Search engine indexes can be used to retrieve only the documents of interest, using carefully constructed queries that depend on the task.
Introduction
- Text is ubiquitous, and many applications rely on the text in web pages to perform a variety of tasks.
- Examples of text-centric tasks:
- Reputation management systems download web pages to track the buzz around companies.
- Comparative shopping agents locate e-commerce web sites and add the products offered on those pages to their own index.
Examples of text-centric tasks
- The paper identifies three main types of text-centric tasks:
- Task 1: Information Extraction
- Task 2: Content Summary Construction
- Task 3: Focused Resource Discovery
Task 1: Information Extraction
- Information extraction applications extract structured relations from unstructured text.

Example input (from "Disease Outbreaks in The New York Times"):

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

An information extraction system (e.g., NYU's Proteus) turns such text into structured tuples:

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

(Example from the Information Extraction tutorial by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Task 2: Content Summary Construction
- Many text databases have valuable contents "hidden" behind search interfaces.
- Metasearchers search over multiple such databases through a unified query interface.
- To route queries to the right databases, a metasearcher relies on a content summary for each database: the words that appear in it, together with their frequencies.
Task 3: Focused Resource Discovery
- This task collects web pages (resources) about a particular topic.
- The simplest approach is to crawl the entire web and classify each page accordingly.
- A much more efficient approach is to use a focused crawler.
- Focused crawlers follow only documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.
An Abstract View of Text-Centric Tasks

A text-centric task runs an extraction system over a text database to produce output tokens:
1. Retrieve documents from the database.
2. Process the documents.
3. Extract output tokens.

What a "token" is depends on the task:

Task                   | Token
Information Extraction | Relation Tuple
Database Selection     | Word (+Frequency)
Focused Crawling       | Web Page about a Topic
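The three steps above can be sketched as a generic execution loop. Everything here is illustrative: `extract` stands in for whatever extraction system the concrete task uses, and the toy database and names are invented.

```python
def run_task(documents, extract, target_tokens):
    """Generic execution loop for a text-centric task:
    1. retrieve documents from the database,
    2. process each document,
    3. extract output tokens,
    stopping once enough distinct tokens are found (the recall target)."""
    tokens = set()
    retrieved = 0
    for doc in documents:              # step 1: retrieve from the database
        retrieved += 1
        tokens |= set(extract(doc))    # steps 2-3: process and extract tokens
        if len(tokens) >= target_tokens:
            break
    return tokens, retrieved

# Toy database; the "extraction system" just splits a document into one tuple.
docs = ["ebola zaire", "malaria ethiopia", "sars china"]
found, n_retrieved = run_task(docs, lambda d: [tuple(d.split())], target_tokens=2)
```

The loop stops after two documents here, since two tokens already satisfy the target.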
Execution Strategies
- The paper describes four execution strategies:
- Crawl-based: Scan and Filtered Scan.
- Query- (index-) based: Iterative Set Expansion and Automatic Query Generation.
Execution Strategy: Scan
- Scan processes each document in the database exhaustively until the number of tokens extracted satisfies the target recall.
- Scan needs no training and sends no queries to the database.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P the time to process it.
- Prioritizing the documents (processing the most promising ones first) may help improve efficiency.
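The Scan cost model is just this product; a minimal sketch, with parameter values made up purely for illustration:

```python
def scan_time(num_docs: int, R: float, P: float) -> float:
    """Scan cost model: every retrieved document pays retrieval time R
    plus processing time P."""
    return num_docs * (R + P)

# Assumed, illustrative parameters: 0.5 s to retrieve, 2.0 s to process.
total = scan_time(10_000, R=0.5, P=2.0)   # 25,000 seconds
```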
Execution Strategy: Filtered Scan
- Filtered Scan is an improvement over the basic Scan strategy.
- Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document is likely to contribute at least one token before processing it.
- Execution time = |Retrieved Docs| · (R + P + C), where R is the time to retrieve a document, P the time to process it, and C the time to classify it. In practice P is paid only for the documents the classifier accepts, which is what makes Filtered Scan cheaper when useful documents are rare.
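A sketch of the trade-off, under the assumption that only the classifier-accepted fraction of documents is fully processed; all numbers are invented for illustration:

```python
def scan_time(num_docs, R, P):
    """Plain Scan: every document is retrieved and processed."""
    return num_docs * (R + P)

def filtered_scan_time(num_docs, R, P, C, accept_fraction):
    """Filtered Scan: every document is retrieved (R) and classified (C);
    only the fraction the classifier accepts is fully processed (P)."""
    return num_docs * (R + C) + num_docs * accept_fraction * P

# Illustrative parameters: classification is cheap, processing is expensive,
# and only 10% of documents are useful.
plain = scan_time(10_000, R=0.5, P=2.0)                                  # 25,000 s
filtered = filtered_scan_time(10_000, R=0.5, P=2.0, C=0.1,
                              accept_fraction=0.1)                       # ~8,000 s
```

When most documents are useful, the extra classification cost makes Filtered Scan slower than plain Scan; the classifier only pays off when it can skip many documents.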
Execution Strategy: Iterative Set Expansion
- Iterative Set Expansion queries the database with tokens it has already extracted:
1. Query the database with seed tokens (e.g., [Ebola AND Zaire]).
2. Process the retrieved documents.
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>).
4. Augment the seed tokens with the new tokens, and repeat.
Execution Strategy: Iterative Set Expansion contd…
- Iterative Set Expansion has been successfully applied to many tasks.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time for the database to answer a query.
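The Iterative Set Expansion cost model as a sketch, again with made-up parameter values:

```python
def ise_time(num_docs, num_queries, R, P, Q):
    """Iterative Set Expansion cost: retrieval (R) and processing (P) time
    per document, plus query-answering time Q per query sent."""
    return num_docs * (R + P) + num_queries * Q

# Illustrative: 2,000 documents reached through 500 queries.
total = ise_time(2_000, 500, R=0.5, P=2.0, Q=1.0)   # 5,500 seconds
```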
Execution Strategy: Automatic Query Generation
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation: it can only reach tokens connected to the seeds in the querying graph.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents rich in tokens.
- Automatic Query Generation works in two stages:
- In the first (training) stage, it trains a classifier to categorize documents as useful or not for the task.
- In the execution stage, it searches the database using queries that are expected to retrieve useful documents.
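The two stages can be caricatured as follows. This is only a toy stand-in: a real system would train a proper classifier and derive queries from its learned rules, whereas here a simple word-frequency difference picks the query terms, and all the data is invented.

```python
from collections import Counter

def learn_queries(useful_docs, other_docs, k=3):
    """Stage 1 (offline): pick the k words most characteristic of useful
    documents, scoring each word by the difference of its per-document
    frequency in the two sets. Only a sketch of query learning."""
    useful = Counter(w for d in useful_docs for w in d.split())
    other = Counter(w for d in other_docs for w in d.split())
    score = {w: useful[w] / len(useful_docs) - other[w] / len(other_docs)
             for w in useful}
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:k]]

def run_queries(queries, database):
    """Stage 2 (execution): retrieve every document matching some query."""
    return [d for d in database if any(q in d.split() for q in queries)]

useful = ["ebola outbreak zaire", "malaria outbreak ethiopia"]
other = ["stock market report", "football results today"]
queries = learn_queries(useful, other, k=1)
hits = run_queries(queries, useful + other)
```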
Estimating Execution Plan Costs: Scan
Modeling Scan for a token t (e.g., <SARS, China>) that appears in g(t) of the database's D documents:
- What is the probability of seeing t after retrieving S documents?
- Scan is a "sampling without replacement" process.
- After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
- The recall for token t is the probability that its frequency in the S retrieved documents is greater than 0.
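Under the sampling-without-replacement model, this probability has a standard closed form, P(seen) = 1 − C(D − g(t), S) / C(D, S); a sketch using Python's exact binomial coefficients:

```python
from math import comb

def scan_recall(g_t: int, S: int, D: int) -> float:
    """Probability that a token appearing in g_t of the D documents shows
    up at least once in a random sample of S documents, i.e. the
    hypergeometric P(frequency > 0) = 1 - C(D - g_t, S) / C(D, S)."""
    return 1 - comb(D - g_t, S) / comb(D, S)

# A token appearing in 1 of 100 documents is seen in a 50-document
# sample with probability 0.5.
p = scan_recall(g_t=1, S=50, D=100)
```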
Estimating Execution Plan Costs: Iterative Set Expansion
- The querying graph is a bipartite graph containing tokens and documents.
- Each token (transformed into a keyword query) retrieves documents.
- Documents contain tokens.
- Example: tokens t1…t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) are linked to the documents d1…d5 they retrieve and that contain them.
Estimating Execution Plan Costs: Iterative Set Expansion contd…
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates execution time).
- The number of tokens that appear in the retrieved documents (estimates recall).
To estimate these, we need to compute:
- The degree distribution of the tokens discovered by retrieving documents.
- The degree distribution of the documents retrieved by the tokens.
- (Note that this is not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
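The reachability structure of this querying graph is exactly what limits Iterative Set Expansion's recall. A sketch that runs the expansion as a graph traversal over a toy bipartite graph (the graph below is invented for illustration):

```python
def iterative_set_expansion(seeds, retrieves, contains):
    """Traverse the bipartite querying graph: each queried token retrieves
    documents, and each new document yields more tokens to query with.
    Returns the tokens reached and the documents retrieved."""
    seen_tokens, seen_docs = set(seeds), set()
    frontier = list(seeds)
    while frontier:
        token = frontier.pop()
        for doc in retrieves.get(token, ()):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            for t in contains.get(doc, ()):
                if t not in seen_tokens:
                    seen_tokens.add(t)
                    frontier.append(t)
    return seen_tokens, seen_docs

# Invented toy graph: t4 and t5 are not connected to t1, so a run seeded
# with t1 can never reach them, no matter how long it iterates.
retrieves = {"t1": ["d1"], "t2": ["d1", "d2"], "t3": ["d2"],
             "t4": ["d3"], "t5": ["d5"]}
contains = {"d1": ["t1", "t2"], "d2": ["t2", "t3"],
            "d3": ["t4"], "d4": ["t4"], "d5": ["t5"]}
tokens, docs = iterative_set_expansion({"t1"}, retrieves, contains)
```

Here the expansion reaches only t1, t2, and t3 via documents d1 and d2, illustrating the recall ceiling imposed by the graph's connectivity.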
Experimental Results
[Figure: execution time in seconds (log scale, 100 to 100,000) versus recall (0.00 to 1.00) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion.]

Experimental Results contd…
[Figure: the same comparison with the OPTIMIZED strategy added.]
Conclusions
- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for the target recall.