Modeling Query-Based Access to Text Databases

Download Report

Transcript Modeling Query-Based Access to Text Databases

Modeling Query-Based Access to
Text Databases
Eugene Agichtein
Panagiotis Ipeirotis
Luis Gravano
Computer Science Department
Columbia University
1
Scalable Text “Mining”
2

Often only a tiny fraction of a text database is
relevant, so processing every document is
unnecessarily expensive.

Often relevant information is not crawlable,
but available only via a search engine.
Search engines can help:
efficiency and accessibility
Task1: Extracting Structured
Information “Buried” in Text
Documents
Organization
Microsoft's central headquarters in Redmond
is home to almost every product group and division.
Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer’s headquarters
in Cupertino, was fired Monday for "thinking
a little too different."
Location
Microsoft
Redmond
Apple
Computer
Cupertino
Nike
Portland
Apple's programmers "think different" on a "campus" in
3
Cupertino, Cal. Nike employees "just do it" at what the
company refers to as its "World Campus," near Portland,
Ore.
Information Extraction
Applications



Over a corporation’s customer report or email
complaint database: enabling sophisticated
querying and analysis
Over biomedical literature: identifying
drug/condition interactions
Over newspaper archives: tracking disease
outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]
4
Goal: Extract All Tuples of a Relation
from a Document Database
Information
Extraction
System
Extracted Tuples


5
One approach: feed every document to
information extraction system
Problem: efficiency, accessibility!
A Query-Based Strategy for
Information Extraction
Intuition: Documents with one tuple for the relation
are also likely to contain other tuples.
0 While seed has unprocessed tuple t
6
seed
1
Retrieve up to MaxResults
documents matching t
2
Extract new tuples te from
these documents
t1
3
Augment seed with te
t2
t0
Problem: May run out of tuples (and queries) 
incomplete relation!
“Hidden Web” Databases
Keywords
SUBMIT
“Surface” Web
–
–
7
–
Link structure
Crawlable
Documents indexed
by search engines
CLEAR
“Hidden” Web
–
–
–
–
No link structure
Documents “hidden” in databases
Documents not indexed by search engines
Need to query each collection individually
Search Over the “Hidden Web”
2: relies
Database
Content
Summary
DatabaseTask
selection
on
thrombopenia
simple content summaries:
Construction
vocabulary, word frequencies
PubMed
• Typically the vocabulary
Metasearcher
(3,868,552 documents)
Problem:ofDatabases
don’t export
each database
plus
…
content simple
summaries!
frequency statistics


NYTimes
Archives
PubMed
8
?
cancer
1,398,178
aids
106,512
US Patents
heart
281,506
hepatitis 23,481
...
...
...
thrombopenia
24,826
thrombopenia 24,826
thrombopenia 0
…
...
...
thrombopenia 18
...
A Query-Based Strategy for
Constructing Database Summaries
0 While seed has unprocessed word t
seed
1
Retrieve up to MaxResults
documents matching t
2
Extract new words te from
these documents
t1
3
Augment seed with te
t2
t0
Problem: May run out of words (and queries)
 incomplete summary!
9
Query-Based Information Extraction
and Database Summary Construction
seed
10
“connected”
seed
“disconnected”
Model: Querying Graph

Tokens T:
d1
t2
d2
t3
d3
Tokens (as queries) retrieve
documents in D
t4
d4
Documents contain tokens
t5
d5
–
11

D
t1
Task 1: tuple attributes
“microsoft” AND “redmond”
“acm” AND “new york”

T
Task 2: words
“sigmod”, “webdb”
Model: Reachability Graph
T
12
D
t1
d1
t2
d2
t3
d3
t4
d4
t5
t1
t2
t3
t5
t4
t1 retrieves document d1
that contains t2
d5
t2, t3, and t4 “reachable” from t1
Model (cont.): Connectivity
t1
In
t2
t3
Core
(strongly
connected)
Out
t4
Tokens that retrieve
other tokens but
are not reachable
13
Tokens that retrieve
other tokens and
themselves
Reachable tokens,
do not retrieve
core tokens
Power-law Graphs
14

Conjecture: Degree distribution in the reachability
graph is described by a power-law:

Completely described by only two parameters,
a and b.

Power-law random graphs are expected to have at most
one giant connected component (~Core+In+Out).
Other connected components are small.
Model (cont.): Reachability
Giant Component CRG
In
Core
(strongly
connected)
Out
t1
seed
t2
reachable
Reachability:
15
relative size of the
largest Core + Out:
t3
t4
Outline



Task 1: Information Extraction
Task 2: Constructing Database Summary
Model:
–
–
–



16

Querying, reachability graphs
Power-law graphs
Reachability
Querying Real Databases
Estimation
Experimental Results
Discussion
Querying Real Databases

Task 1: NYT
DiseaseOutbreaks
Date
DiseaseName
Location
Jan. 1995
Malaria
Ethiopia
(date, disease, location) June 1995 Ebola
Africa
The New York Times
July 1995 Mad Cow Disease The U.K.
|D|=137,000
Feb. 1995 Pneumonia
The U.S.
|T|=8859
…
…
…

Task 2: 20NG
–
–
17
–
Postings from 20 Newsgroups
|D| ~ 20,000
|T| ~ 109,000
NYT Reachability Graph:
Outdegree Distribution
Matches the power-law distribution
MaxResults=10
18
MaxResults=50
NYT: Component Size Distribution
Not “reachable”
MaxResults=10
19
CG / |T| = 0.297
“reachable”
MaxResults=50
CG / |T| = 0.620
20NG: Outdegree Distribution
6
7
actual
best fit
5
4
3
4
3
2
2
1
1
0
Approximated by power-law
distribution
2
4
6
log(degree)
MaxResults=1
20
5
log(frequency)
log(frequency)
6
actual
best fit
8
4
6
log(degree)
8
MaxResults=10
CG / |T| = 1 (completely connected)
Estimating Reachability
In a power-law random graph G a giant component
CG emerges* if d (the average outdegree) > 1, and:


21
Estimate: Reachability ~ CG / |T|
Depends only on d (average outdegree)
* For b < 3.457
Estimating Reachability using Sampling
Choose S random seed
2.
Query the database for seed
Compute the outgoing edges t2
of nodes in seed.
t3
Estimate d as average
outdegree of seed tokens.
t4
3.
4.
22
T
1.
tokens
d =1.5
t1
t5
D
d1
d2
d3
d4
d5
Estimating Reachability for NYT
S=10
S=50
S=100
S=200
Real Graph
1
0.9
Reachability
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
MR=1
MR=10
MR=50
MR=100
MR=200
MR=1000
MaxResults
23
Approximate reachability is estimated after ~ 50 queries.
 Can be used to predict success (or failure) of a Task 1 algorithm.
Reachability of NYT (cont.)
S=10
S=50
S=100
S=200
Real Graph
1
0.9
Reachability
0.8
0.7
0.6
0.5
0.4
0.3
0.2
.46
0.1
0
MR=1
MR=10
MR=50
MR=100
MR=200
MR=1000
MaxResults
Reachability correctly predicts
performance of the Tuples
24
strategy for Task 1 (described in
[Agichtein and Gravano, ICDE 2003])
Estimating Reachability of 20NG
Reachability
S=10
S=50
S=100
S=200
Real Graph
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
MR=1
MR=10
MaxResults
Estimates reachability closely, after just 10 queries
25
Corroborates Callan’s results
[Callan et al., SIGMOD 1999]
Summary

Presented graph model for query-based
access to text databases
–
–


The reachability metric: predictions for
algorithm performance
Efficient estimation techniques
–
26
Querying and Reachability graphs
Formal tool for analyzing heuristic algorithms
Power-law random graph properties + Document
sampling
Future Work

Other properties of the reachability graph
–
–

“Real-life” limitations:
–
–

27
Edge Density
Diameter
Total number of queries?  querying graph
Total number of documents?  querying graph
Analyze other (heuristic) algorithms.
Modeling Query-Based Access to
Text Databases
Eugene Agichtein
Panagiotis Ipeirotis
Luis Gravano
Questions?
Computer Science
Columbia University
28
Overflow Slides
29
30
Information Extraction Example:
Organizations’ Headquarters
Input: Documents
Brent Barlow, a software analyst and beta-tester at Apple
Computer's headquarters in Cupertino, was fired Monday for "thinking
a little too different."
doc2
doc4
Named-Entity Tagging
<PERSON>Brent Barlow</PERSON>,
a software analyst and beta-tester at
<ORGANIZATION>Apple Computer</ORGANIZATION>'s
headquarters in <LOCATION>Cupertino</LOCATION>, was fired
Monday for "thinking a little too different."
doc4
Pattern Matching
31
Output: Tuples
<ORGANIZATION> = Apple
Computer
<LOCATION> = Cupertino
Pattern = p1
Extraction Patterns
doc4
p1
<ORGANIZATION>'s
headquarters in <LOCATION>
<ORGANIZATION>,
based in <LOCATION>
tid
Organization
Location
W
1
2
Eastman Kodak
Apple Computer
Rochester
Cupertino
0.9
0.8
p2
Useful
doc2
doc4
Efficient Information Extraction:
Alternatives
Given a large text database and an information
extraction task, how to proceed?

If a large fraction of
documents are relevant:
–

Else
–
32
Scan (not always possible)
Tuples ?
Query Interface
??
Text Database
Will Tuples retrieve “enough” of the relation?
Search Over “Hidden Web” Databases
 Metasearchers


33
Database Selection:
Choosing best databases
for a query
Database Selection Needs
“Content Summaries”:
Typically the vocabulary of
each database plus simple
frequency statistics
PubMed
(3,868,552 documents)
…
cancer
aids
heart
hepatitis
1,398,178
106,512
281,506
23,481
thrombopenia 24,826
…
Model



34
Is there a common model for
algorithms for Query-Based Information
Extraction and Database Summary
Construction?
What are the limitations of these
algorithms?
Given a new database, will such an
algorithm for “work”?