Querying Text Databases and the Web:
Beyond Traditional Keyword Search
Luis Gravano
Columbia University
Many thanks to Eugene Agichtein, AnHai Doan,
Panos Ipeirotis, and Alpa Jain!
Information Extraction, or Deriving Structured
Information from Text Documents

• Natural-language text embeds “structured” data
• Properly trained extraction systems extract this data
• Extraction output can be regarded as “relations”

Example (BBC, May 2006): “A Cholera epidemic in Angola has now killed more than 1,200 people in the past three months …”

Extracted by Proteus [Grishman et al., HLT 2000] into a Disease Outbreaks relation:

Disease   Country   Year   Virus type   Deaths
Cholera   Angola    …      …            1,200
…         …         …      …            …

Significant progress over the last decade [MUC]
Thomson Reuters’s OpenCalais
www.opencalais.com
A Dutch court ruled against
navigation systems company
TomTom on Thursday in a patent
infringement lawsuit against rival
navigation device maker IBM from
the United States.
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/_id_/Instance/5">
<rdf:type rdf:resource ="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:docId rdf:resource ="http://d.opencalais.com/dochash-1/_id_"/>
<c:subject rdf:resource ="http://d.opencalais.com/genericHasher-1/_id_"/>
<!--CompanyLegalIssues: company_sued: IBM; company_plaintiff: TomTom;
lawsuitclass: lawsuit; -->
<c:detection>[court ruled against navigation systems company ]TomTom on Thursday
in a patent infringement lawsuit against rival navigation device maker IBM from the
United States.
</c:detection>
<c:offset>138</c:offset>
<c:length>94</c:length>
</rdf:Description>
Information Extraction has Many Applications

• Over a corporation’s customer report or email complaint database: enabling sophisticated querying and analysis
• Over biomedical literature: identifying drug/condition interactions
• Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
[Figure: example text sources and the fact tables extracted from them.
Health: AllAfrica News Archive and health care blogs (• Disease Outbreaks • Disease Remedies • Gene Names).
Movies: MTV Movies Blog (• Awards • “Buzz” • Critics Choice).
Web at large (• Authors and Titles • Citations • Job Announcements • PC Committees • Headquarters • Joint Ventures • Person Political Past • Natural Disasters • …)]
Focus of this Talk:
“On-The-Fly” Information Extraction

• Information extraction task short-lived, new
• Text database previously unseen or too vast
• “Blackbox” view of information extraction systems
• Substantial recent progress in unsupervised or semi-supervised information extraction

Not covered in talk:
• “Offline” information extraction (e.g., [Mansuri and Sarawagi, ICDE 2006], [Cafarella et al., CIDR 2007], [Chen et al., ICDE 2008])
• Internal design of information extraction systems (effectiveness, efficiency) (e.g., [McCallum, ACM Queue 2005], [Chandel et al., ICDE 2006], [Shen et al., SIGMOD 2008])
• Entity extraction and ranking (e.g., [Stoyanovich et al., WebDB 2007], [Cheng et al., VLDB 2007], [Agrawal et al., PVLDB 2008])
“SQL” Queries Over Text Databases
Example queries over extracted relations:

SELECT *
FROM DiseaseOutbreaks
WHERE Country = ‘Sudan’

SELECT Company, CEO
FROM Headquarters H, Executives E
WHERE H.Company = E.Company
AND H.Location = ‘Redmond’ …

DiseaseOutbreaks(Disease, Country, Year, Virus subtype, Identified cases, Deaths, Case fatality), with rows such as:
Ebola   Sudan   2004   Ebola-Sudan   17   7    41%
Ebola   Sudan   1979   Ebola-Sudan   34   22   65%
…       …       …      …             …    …    …

Executives(Company, CEO), with rows such as:
Microsoft                   Steve Ballmer
IDD Aerospace Corporation   Steve Hintzke
Eddie Bauer, Inc.           Neil Fiske

These relations are produced by information extraction over a text database of documents such as:

• BBC, May 2006: “A Cholera epidemic in Angola has now killed more than 1,200 people in the past three months …”
• “Carnegie Group Inc., of Pittsburgh won a $627,068 contract from the Army Research Lab in Adelphi for research and development.”
• “Lehman Brothers, one of America's largest investment banks, plans to sue a Japanese trading house …”
• “Mr. Song-nam conveyed greetings and best wishes of the CEO of Tokyo-based Mitsubishi, Mr. Ichiro Taniguchi.”
• “No-one from Asclepius or its parent firm LTT Bio-Pharma have commented.”
• “Two hundred flights in and out of T5 were cancelled in its first three days.”
• “We are confident in our legal claim which we will pursue until we receive repayment from Marubeni.”
Processing a SQL Query over a Text
Database: A Naïve Execution

For every database document:
1. Retrieve document
2. Feed document to all relevant information extraction systems
3. Add extracted tuples to corresponding “relations”
Then evaluate the SQL query over the extracted relations.

Any problems?
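The naïve execution above can be sketched in a few lines of Python (a hedged illustration; `extractors` maps each relation name to a blackbox extraction function, and `evaluate_sql` stands in for a SQL evaluator over the extracted relations; all names are hypothetical):

```python
def naive_execution(documents, extractors, evaluate_sql):
    # One "relation" (a list of tuples) per extraction system.
    relations = {name: [] for name in extractors}
    for doc in documents:  # retrieve every document
        for name, extract in extractors.items():
            # Feed the document to all relevant extraction systems.
            relations[name].extend(extract(doc))
    # Finally, evaluate the SQL query over the extracted relations.
    return evaluate_sql(relations)
```

Every document is retrieved and processed, which is exactly the cost problem the next slides examine.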
Information Extraction is Expensive
Complex text analysis often required:
• parsing
• named-entity tagging
• cleaning
• …

[Source: Ronen Feldman, “Information Extraction – Theory and Practice,” ICML 2006]
Information Extraction is Error-Prone
Extraction pattern: <ORGANIZATION> in <LOCATION>

“Pittsburgh-based Carnegie Group Inc., won a $627,068 contract from the Army Research Lab in Adelphi for research and development.”

Headquarters
Company               Location
Army Research Lab     Adelphi
Carnegie Group Inc.   Pittsburgh   (missing tuple; requires the pattern <LOCATION>-based <ORGANIZATION>)

“Greenpeace took backwards steps today as a result of their failed protests of Exxon in New York.”

Headquarters
Company   Location
Exxon     New York   (incorrect tuple)
Information Extraction is Error-Prone

• Extraction systems may miss valid tuples
  Recall: fraction of correct tuples that are extracted
• Extraction systems may generate erroneous tuples
  Precision: fraction of extracted tuples that are correct
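The two measures can be written out directly (a small illustrative helper over sets of tuples; the function name is ours, not from the talk):

```python
def precision_recall(extracted, correct):
    extracted, correct = set(extracted), set(correct)
    true_positives = len(extracted & correct)  # correctly extracted tuples
    # Precision: fraction of extracted tuples that are correct.
    precision = true_positives / len(extracted) if extracted else 1.0
    # Recall: fraction of correct tuples that are extracted.
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall
```

On the Headquarters example above, extracting (Army Research Lab, Adelphi) plus the spurious (Exxon, New York), against the correct tuples (Army Research Lab, Adelphi) and (Carnegie Group Inc., Pittsburgh), gives precision 0.5 and recall 0.5.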
Top-k Query Variants to the Rescue

• Often “complete” query results are not necessary
• Results can be sorted by expected accuracy
• Top-k query results are preferable for efficiency and correctness
• Appropriate value of k is user-specific:
  • Some users are after relatively comprehensive results
  • Other users are after “quick-and-dirty” results
Executing an “Any-k” Query over a Text
Database
1. Select information extraction system for each relation in query
2. Select document retrieval strategy for each selected extraction system
3. Retrieve documents using selected retrieval strategies
4. Process documents using selected extraction systems
5. Clean and normalize extracted relations
6. Return query results

Goal: Return k correct extracted results as fast as possible:
• Information extraction output is not trusted
• Extracted tuples are estimated to include incorrect tuples
• But no indication of tuple correctness is returned in the results (“any-k” query results)

[Ipeirotis et al., SIGMOD 2006, TODS 2007]
[Jain et al., ICDE 2008, 2009]
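The six steps, together with the goal of stopping as soon as k results are available, can be sketched as follows (a simplified single-relation illustration; `retrieve_docs`, `extract`, `clean`, and `query` are hypothetical stand-ins for the selected retrieval strategy, extraction system, cleaning/normalization step, and SQL evaluation):

```python
def any_k(k, retrieve_docs, extract, clean, query):
    tuples = []
    for doc in retrieve_docs():         # steps 2-3: selected retrieval strategy
        tuples.extend(extract(doc))     # step 4: selected extraction system
        results = query(clean(tuples))  # steps 5-6: clean/normalize, evaluate
        if len(results) >= k:
            return results[:k]          # stop as soon as k results exist
    return query(clean(tuples))         # database exhausted before k results
```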
An Execution of an “Any-k” Query
Query:
SELECT H.Company, E.CEO
FROM Headquarters H, Executives E
WHERE H.Company = E.Company
AND H.Location = ‘Redmond’

Target relations: Headquarters(Company, Location) and Executives(Company, CEO).

Documents are retrieved either by a Scan or through the search interface, here with the PromC keyword query [based AND shares AND Redmond]. The retrieved documents then flow through the pipeline:

Extracted tuples:
Headquarters: (Microsoft, Redmond), (IBM, Armonk), (Microsoft Corp., New York), (Microsoft Corp., Redmond), (Sybase Corp., Mountain View)
Executives: (Microsoft Corp., Bill Gates), (Apple, Amelio), (Kmart Corp., Floyd Hall)

Normalized relations (e.g., Microsoft → Microsoft Corp., Apple → Apple Inc.):
Headquarters: (Microsoft Corp., Redmond), (IBM, Armonk), (Microsoft Corp., New York), (Microsoft Corp., Redmond), (Sybase Corp., Mountain View)
Executives: (Microsoft Corp., Bill Gates), (Apple Inc., Amelio), (Kmart Corp., Floyd Hall)

Consistent relations (duplicates and conflicts resolved):
Headquarters: (Microsoft Corp., Redmond), (IBM, Armonk), (Sybase Corp., Mountain View)
Executives: (Microsoft Corp., Bill Gates), (Apple Inc., Amelio), (Kmart Corp., Floyd Hall)

Joined relations (on Company): (Microsoft Corp., Redmond, Bill Gates), …

Query results (after the Location = ‘Redmond’ selection):
Company           CEO
Microsoft Corp.   Bill Gates
Optimizing an “Any-k” Query
1. Select information extraction system for each relation in query
2. Select document retrieval strategy for each selected extraction system
3. Retrieve documents using selected retrieval strategies
4. Process documents using selected extraction systems
5. Clean and normalize extracted relations
6. Return query results
Executions Vary in their Document Retrieval
Strategy
Two major execution paradigms:
• Scan-based: Retrieve documents sequentially (as in the Naïve execution)
• Index-based: Retrieve documents via keyword queries (e.g., [case fatality rate])

Similar to relational world: the underlying data distribution dictates what is best.
Unlike relational world:
• Indexes are only “approximate”: the index is on keywords, not on tuples of interest
• Choice of execution plan affects output completeness (not only speed)
Document Retrieval Strategies for
Any-k Queries: Outline

Scan- and index-based document retrieval strategies:
• Scan-based: Scan, Filtered Scan
• Index-based: Iterative Set Expansion, Automatic Query Generation (“PromD”), “Const”, “PromC” (PromD + Const)

Analysis: execution time and number of correct extracted tuples

[Ipeirotis et al., SIGMOD 2006, TODS 2007]
[Jain et al., ICDE 2008, 2009]
Scan for Any-k Queries
Scan retrieves and processes documents sequentially, until k correct tuples are extracted:
1. Retrieve documents from database
2. Process documents with information extraction system
3. Extract relation tuples

Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P is the time for processing a document.

Key question: How many documents does Scan retrieve to extract k correct tuples?
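The cost formula above is straightforward to encode (assuming, for illustration, uniform per-document retrieval and processing times):

```python
def scan_time(num_retrieved_docs, retrieval_time, processing_time):
    # Execution time = |Retrieved Docs| * (R + P)
    return num_retrieved_docs * (retrieval_time + processing_time)
```

The open question is the first argument: how many documents Scan must pull before k correct tuples appear.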
Iterative Set Expansion for Any-k Queries
1. Query database with seed tuples (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tuples from documents
4. Augment seed tuples with new tuples (e.g., <Malaria, Ethiopia, …>)

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.

Key questions: How many queries and how many documents does Iterative Set Expansion need to extract k correct tuples?
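A minimal sketch of Iterative Set Expansion, assuming a `search` function that maps a tuple (as a keyword query) to documents and an `extract` function that maps a document to tuples (both hypothetical stand-ins):

```python
from collections import deque

def iterative_set_expansion(search, extract, seed_tuples, k):
    tuples = set(seed_tuples)
    pending = deque(seed_tuples)  # tuples not yet used as queries
    seen_docs = set()
    while pending and len(tuples) < k:
        t = pending.popleft()          # 1. turn a tuple into a keyword query
        for doc in search(t):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)         # 2. process each retrieved document once
            for new in extract(doc):   # 3. extract tuples from the document
                if new not in tuples:  # 4. augment the seed tuples
                    tuples.add(new)
                    pending.append(new)
    return tuples
```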
Understanding Iterative Set Expansion:
Querying Graph
• Querying graph is bipartite, with tuples and documents
• Each tuple (transformed into a keyword query) retrieves documents
• Documents contain tuples

Example: tuples t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>, t4 <Cholera, Sudan>, t5 <H5N1, Vietnam> on one side, documents d1, …, d5 on the other, with edges between tuples and the documents they retrieve or appear in.
Reachability Graph: Can k Tuples be Extracted?
The querying graph over tuples t1, …, t5 and documents d1, …, d5 induces a reachability graph over the tuples: for example, t1 retrieves document d1, which contains t2, so there is an edge from t1 to t2.

Upper limit on the number of extracted tuples, if starting with one seed tuple: the size of the biggest connected component of the reachability graph.
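The set of tuples reachable from one seed is a simple traversal of the querying graph (an illustrative sketch; `retrieves` and `contains` are the two edge directions of the bipartite graph):

```python
def reachable_tuples(seed, retrieves, contains):
    # retrieves: tuple -> documents its keyword query returns
    # contains:  document -> tuples extracted from it
    seen = {seed}
    frontier = [seed]
    while frontier:
        t = frontier.pop()
        for doc in retrieves.get(t, []):
            for t2 in contains.get(doc, []):
                if t2 not in seen:
                    seen.add(t2)
                    frontier.append(t2)
    return seen
```

Starting from a seed in a small component caps the output at that component's size, which is exactly the upper limit stated above.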
Automatic Query Generation (“PromD”)
for Any-k Queries

• Iterative Set Expansion might not fully answer an any-k query, due to the iterative nature of its query generation
• Automatic Query Generation avoids this problem by learning queries offline, designed to return documents with tuples
Automatic Query Generation for Any-k Queries
1. Offline, learn keyword queries to retrieve documents with tuples (e.g., [case fatality rate])
2. Query database with learned queries
3. Process retrieved documents
4. Extract tuples from documents

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.

Key questions: How many queries and how many documents does Automatic Query Generation need to extract k correct tuples?
Const for Any-k Tuples

• Exploits selection “constants” in the SQL query: documents without them cannot contribute tuples for the corresponding relations
• Avoids processing all documents, conservatively
• Still potentially retrieves useless documents

For the query:
SELECT *
FROM DiseaseOutbreaks
WHERE Country = ‘Sudan’
Const issues the keyword query [Sudan] against the text database.

[Diagram: documents that generate tuples, documents that do not generate tuples, and documents that contain the SQL query constants]
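The Const strategy amounts to a keyword filter over the database (a sketch; a real implementation would issue the constants as a query to the search interface rather than test each document locally):

```python
def const_filter(documents, constants):
    # Keep only documents that mention every selection constant;
    # documents without them cannot contribute qualifying tuples.
    return [doc for doc in documents
            if all(c.lower() in doc.lower() for c in constants)]
```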
PromC for Any-k Correct Tuples

• Exploits selection “constants” in the SQL query
• Avoids processing all documents, aggressively
• May miss useful documents

Combining [case fatality rate] from PromD with [Sudan] from Const, for the query:
SELECT *
FROM DiseaseOutbreaks
WHERE Country = ‘Sudan’
PromC issues the keyword query [case fatality rate Sudan] against the text database.

[Diagram: documents that generate tuples, documents that do not generate tuples, and documents that contain the SQL query constants]
Properties of Alternative Execution Plans

Execution plans vary in their extraction systems and document retrieval strategies, and differ in:
• Execution time
• Number of correct tuples that a plan manages to extract

Which is the fastest plan that produces k correct tuples for an any-k query?

[Ipeirotis et al., SIGMOD 2006, TODS 2007]
[Jain et al., ICDE 2008, 2009]
Document Retrieval Strategies for
Any-k Queries: Outline

Scan- and index-based document retrieval strategies:
• Scan-based: Scan, Filtered Scan
• Index-based: Iterative Set Expansion, Automatic Query Generation (“PromD”), “Const”, “PromC” (PromD + Const)

Analysis: execution time and number of correct extracted tuples

[Ipeirotis et al., SIGMOD 2006, TODS 2007]
[Jain et al., ICDE 2008, 2009]
Any-k Query Results Do Not Report
Tuple “Correctness”
Headquarters relation, k = 4. A simplistic any-k query execution returns four arbitrary extracted tuples, with no indication of which are correct; a top-k execution would instead return the four tuples with the highest scores:

Company                             Location        Score
Microsoft                           Redmond         0.97
Delta Airlines                      Atlanta         0.89
Vocaltec communications Ltd.        Tel Aviv        0.87
Sun Microsystems Inc.               Mountain View   0.73
Lighthearted Pocahontas-Lion King   Central Park    0.2
NEC Corp.                           Indonesia       0.01

The scoring function for top-k queries should reflect confidence in tuple correctness.

[Answers from New York Times archive, extracted by the Snowball information extraction system]
Modeling Expected Tuple Correctness
Headquarters
Company                             Location        Score
Microsoft                           Redmond         0.97
Delta Airlines                      Atlanta         0.89
Vocaltec communications Ltd.        Tel Aviv        0.87
Sun Microsystems Inc.               Mountain View   0.73
Lighthearted Pocahontas-Lion King   Central Park    0.2
NEC Corp.                           Indonesia       0.01

• One approach: use “confidence” scores as returned by extraction systems, and handle them within probabilistic databases (e.g., [Gupta and Sarawagi, VLDB 2006], [Cafarella et al., CIDR 2007], [Ilyas et al., 2008])
• Alternative approach: do some “external” computation
Modeling Expected Tuple Correctness

• Redundancy in text databases is key to countering noise and extraction errors
• Intuitively, the more times a tuple is extracted (independently), the higher its expected correctness
  • Snowball [Agichtein and Gravano, DL 2000]
  • URNS model [Downey et al., IJCAI 2005]
  • REALM model [Downey et al., ACL 2007], for “sparse” data
  • …
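As a toy illustration of the redundancy intuition (not the actual Snowball, URNS, or REALM model), a noisy-or over independent extractions makes expected correctness grow with the number of times a tuple is seen; the per-extraction probability `p_single` is an assumed parameter:

```python
def expected_correctness(num_independent_extractions, p_single=0.5):
    # Noisy-or: probability that at least one of the independent
    # extractions of this tuple is correct.
    return 1 - (1 - p_single) ** num_independent_extractions
```

A tuple extracted once scores 0.5, twice 0.75, four times 0.9375: more independent sightings, higher expected correctness.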
Executing a “Top-k” Query over a Text Database
1. Retrieve documents from database
2. Process documents with information extraction system
3. Extract relation tuples
4. Evaluate SQL query over extracted relation, returning top-k results

Goal: Return top-k extracted results as fast as possible:
• Information extraction output is not trusted
• Extracted tuples are ranked according to expected correctness
• Execution should avoid processing all documents, for efficiency
These are “top-k” query results.
Top-k Query Processing in this Scenario
is an Open Problem!

• Recall the “on-the-fly” extraction goal: we need to avoid processing all documents, for efficiency
• Can we apply well-known top-k query processing algorithms (e.g., TA, Upper)?
  • The problem is the lack of “sorted access”: no efficient access to tuples in sorted order of expected correctness
  • But interesting approximations, with probabilistic guarantees, are possible
“Good(k, l)” Queries

• Exploit redundancy of text databases to predict tuple correctness
• Define expected tuple correctness by aggregating the confidence of extracted instances
• Define good(k, l) query results as including any k tuples from the top l%
  • This relaxation of the top-k query definition enables efficient executions
  • Adaptation of the Upper algorithm [Bruno et al., ICDE 2002], with a probabilistic formulation

[Jain and Srivastava, ICDE 2009]
Processing Top-k Query Variants
over Text Databases: Summary

• We discussed how to process top-k query variants over text databases: “any-k,” “good(k, l),” and “top-k” queries
• We focused on “on-the-fly” information extraction:
  • Information extraction task short-lived, new
  • Text database previously unseen or too vast
  • So exhaustive extraction approaches may be impractical
  • But an intriguing connection with “open information extraction” [Banko et al., IJCAI 2007] is still unexplored
• We focused on “blackbox” information extraction systems
  • But information extraction system “knobs” (e.g., for tuning precision and recall) can play an important role in query optimization [Jain and Ipeirotis, TODS 2009]

Exciting research area, with many challenging open problems!
Thank you!