Transcript 9-20-2006-overview

Surfacing Information in Large Text Collections
Eugene Agichtein
Microsoft Research
Example: Angina treatments

Sample user queries:
• guideline for unstable angina
• unstable angina management
• herbal treatment for angina pain
• medications for treating angina
• alternative treatment for angina pain
• treatment for angina
• angina treatments

Sources to search: structured databases (e.g., drug info, WHO drug adverse effects DB, etc.), medical reference and literature (MedLine, PDR), and web search results.
Research Goal
Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources.

Some approaches:
• Retrieve the relevant documents or passages
• Question answering
• Construct domain-specific “verticals” (e.g., MedLine)
• Extract entities and relationships
• Build a network of relationships: the Semantic Web
Semantic Relationships “Buried” in Unstructured Text

“… A number of well-designed and executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris …”

Extracted RecommendedTreatment relation:

Drug      Condition
statins   recurrent myocardial infarction
statins   strokes
statins   unstable angina pectoris

Sources of such relationships:
• Web, newsgroups, web logs
• Text databases (PubMed, CiteSeer, etc.)
• Newspaper archives, as in the Message Understanding Conferences (MUC):
  • Corporate mergers, succession, location
  • Terrorist attacks
What Structured Representation Can Do for You:

Large Text Collection → Structured Relation

• … allow precise and efficient querying
• … allow returning answers instead of documents
• … support powerful query constructs
• … allow data integration with (structured) RDBMS
• … provide useful content for the Semantic Web
Challenges in Information Extraction

• Portability
  • Reduce the effort needed to tune for new domains and tasks
  • MUC systems: experts would take 8-12 weeks to tune
• Scalability, efficiency, access
  • Enable information extraction over large collections
  • 1 sec/document × 5 billion docs = 158 CPU-years (a quick sanity check follows this list)
• Approach: learn from data (“bootstrapping”)
  • Snowball: Partially Supervised Information Extraction
  • Querying Large Text Databases for Efficient Information Extraction
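A quick sanity check of the arithmetic above (a minimal sketch; the one-second-per-document rate is the slide's own assumption):

```python
# One second per document over 5 billion documents, on a single CPU.
docs = 5_000_000_000
seconds_per_year = 60 * 60 * 24 * 365

cpu_years = docs * 1.0 / seconds_per_year
print(f"{cpu_years:.0f} CPU-years")  # ~159, matching the slide's ~158
```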
The Snowball System: Overview
Text Database → Snowball → Extracted Relation:

Organization          Location        Conf
Microsoft             Redmond         1
IBM                   Armonk          1
Intel                 Santa Clara     1
AG Edwards            St Louis        0.9
Air Canada            Montreal        0.8
7th Level             Richardson      0.8
3Com Corp             Santa Clara     0.8
3DO                   Redwood City    0.7
3M                    Minneapolis     0.7
MacWorld              San Francisco   0.7
…                     …               …
157th Street          Manhattan       0.52
15th Party Congress   China           0.3
15th Century Europe   Dark Ages       0.1
Snowball: Getting User Input
[ACM DL 2000]

Seed examples:

Organization   Headquarters
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara

The loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples

User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.

(a sketch of the pattern-generation step follows)
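To make the loop concrete, here is a minimal sketch of the pattern-generation and extraction steps, under simplifying assumptions: patterns are reduced to the literal string between the two entities (Snowball's actual patterns are weighted term vectors over left/middle/right contexts), and the corpus sentences are invented for illustration:

```python
import re

# Seed tuples for the Organization-Headquarters relation.
seeds = {("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Intel", "Santa Clara")}

corpus = [
    "Microsoft, whose headquarters are in Redmond, announced ...",
    "IBM, whose headquarters are in Armonk, said ...",
    "Google, whose headquarters are in Mountain View, launched ...",
]

# 1. Find seed occurrences and keep the context between the two entities.
patterns = set()
for org, loc in seeds:
    for sent in corpus:
        if org in sent and loc in sent:
            middle = sent.split(org, 1)[1].split(loc, 1)[0]
            patterns.add(middle)  # e.g. ", whose headquarters are in "

# 2. Apply the patterns to extract new candidate tuples.
candidates = set()
for middle in patterns:
    regex = re.compile(r"(\w[\w ]*)" + re.escape(middle) + r"(\w[\w ]*)")
    for sent in corpus:
        for org, loc in regex.findall(sent):
            candidates.add((org.strip(), loc.strip()))

print(candidates - seeds)  # {('Google', 'Mountain View')}
```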
Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy algorithm:
• “Hide” the labels for some seed tuples (the “spies”)
• Iterate the EM algorithm to convergence on the tuple/pattern confidence values
• Set the confidence threshold t so that over 90% of the spy tuples score above t
• Re-initialize Snowball using the new seed tuples

Organization          Headquarters    Initial   Final
Microsoft             Redmond         1         1
IBM                   Armonk          1         0.8
Intel                 Santa Clara     1         0.9
AG Edwards            St Louis        0         0.9
Air Canada            Montreal        0         0.8
7th Level             Richardson      0         0.8
3Com Corp             Santa Clara     0         0.8
3DO                   Redwood City    0         0.7
3M                    Minneapolis     0         0.7
MacWorld              San Francisco   0         0.7
…                     …               0         …
157th Street          Manhattan       0         0.52
15th Party Congress   China           0         0.3
15th Century Europe   Dark Ages       0         0.1

(a sketch of the spy-based threshold selection follows)
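A minimal sketch of the spy-based threshold selection (the tuple names and the 90% retention rate follow the slide; the confidence values are illustrative):

```python
# Tuple confidences after EM convergence; "spies" are seed tuples whose
# labels were hidden before the run (values are illustrative).
confidences = {
    ("Microsoft", "Redmond"): 1.0,          # spy
    ("IBM", "Armonk"): 0.8,                 # spy
    ("Intel", "Santa Clara"): 0.9,          # spy
    ("157th Street", "Manhattan"): 0.52,
    ("15th Party Congress", "China"): 0.3,
}
spies = [("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Intel", "Santa Clara")]

# Choose the largest threshold t that keeps over 90% of the spies:
# with the spy scores sorted ascending, t sits at the 10th percentile.
spy_scores = sorted(confidences[s] for s in spies)
t = spy_scores[int(0.1 * len(spy_scores))]  # here: min(spy_scores) = 0.8

# Tuples scoring at least t become the new seed set.
new_seeds = {tup for tup, c in confidences.items() if c >= t}
print(t, new_seeds)
```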
Adapting Snowball for New Relations

• Large parameter space:
  • Initial seed tuples (randomly chosen, multiple runs)
  • Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  • Feature selection techniques: OR, NB, Freq, “support”, combinations
  • Feature weights: TF*IDF, TF, TF*NB, NB
  • Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
• Automatically estimate parameter values:
  • Estimate operating parameters based on occurrences of the seed tuples
  • Run cross-validation on hold-out sets of seed tuples for optimal performance
  • Discard seed occurrences that do not have close “neighbors”
[SDM 2006]
Example Task 1: DiseaseOutbreaks

• Proteus: 0.409
• Snowball: 0.415
[ISMB 2003]
Example Task 2: Bioinformatics

• 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
• Approximately 40% of the confirmed synonyms were not previously listed in the curated authoritative reference (SWISSPROT)

Examples: “APO-1, also known as DR6…”; “MEK4, also called SEK1…”
Snowball Used in Various Domains
• News: NYT, WSJ, AP [DL’00, SDM’06]
  • CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
• Medical literature: PDRHealth, Micromedex, … [Ph.D. Thesis]
  • AdverseEffects, DrugInteractions, RecommendedTreatments
• Biological literature: GeneWays corpus [ISMB’03]
  • Gene and Protein Synonyms
[CIKM 2005]
Limits of Bootstrapping for Extraction

• The task is “easy” when the context term distributions diverge from the background distribution, e.g., the contexts of “President George W Bush’s three-day visit to India”

[Figure: relative frequencies of context terms such as “the”, “to”, “and”, “said”, “’s”, “company”, “mrs”, “won”, “president”]

• Quantify the divergence as relative entropy (Kullback-Leibler divergence):

  $KL(LM_C \,\|\, LM_{BG}) = \sum_{w \in V} LM_C(w) \log \frac{LM_C(w)}{LM_{BG}(w)}$

• After calibration, the metric predicts whether bootstrapping is likely to work (see the sketch below)
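A minimal sketch of this computation on unigram language models (the add-one smoothing and the toy word lists are my assumptions, not part of the paper):

```python
import math
from collections import Counter

def kl_divergence(context_words, background_words):
    """KL(LM_C || LM_BG) over the joint vocabulary, with add-one smoothing."""
    vocab = set(context_words) | set(background_words)
    c, bg = Counter(context_words), Counter(background_words)
    n_c = len(context_words) + len(vocab)
    n_bg = len(background_words) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (c[w] + 1) / n_c    # LM_C(w)
        q = (bg[w] + 1) / n_bg  # LM_BG(w)
        kl += p * math.log(p / q)
    return kl

context = "president bush three day visit to india said president".split()
background = "the to and said the company the mrs won the to and".split()
print(kl_divergence(context, background))  # higher divergence -> "easier" task
```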
Extracting All Relation Instances from a Text Database

Text Database → Information Extraction System → Structured Relation

• Brute-force approach: feed all documents to the information extraction system (expensive for large collections)
• Often only a tiny fraction of the documents is useful
• Many databases are not crawlable
• Often a search interface is available, with an existing keyword index
• How to identify “useful” documents?
Accessing Text DBs via Search Engines
Text Database → Search Engine → Information Extraction System → Structured Relation

Search engines impose limitations:
• A limit on the number of documents retrieved per query
• Support for only simple keywords and phrases
• “Stopwords” (e.g., “a”, “is”) are ignored
Text-Centric Task I: Information Extraction

• Information extraction applications extract structured relations from unstructured text:

“May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…”

Disease outbreaks in The New York Times, extracted by an Information Extraction System (e.g., NYU’s Proteus):

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

(See also the Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Executing a Text-Centric Task
Text Database → 1. Retrieve documents from the database → 2. Process documents → 3. Extract output tokens → Output Tokens

Similar to the relational world, there are two major execution paradigms:
• Scan-based: retrieve and process documents sequentially
• Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
The underlying data distribution dictates which is best.

Unlike the relational world:
• Indexes are only “approximate”: the index is on keywords, not on the tokens of interest
• The choice of execution plan affects output completeness (not only speed)

A sketch contrasting the two paradigms follows.
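A minimal sketch contrasting the two paradigms over a toy in-memory collection (the extract function and keyword index are stand-ins for the real extraction system and search interface):

```python
# Toy database: doc id -> text.
docs = {
    1: "Ebola outbreak reported in Zaire, case fatality rate high",
    2: "Quarterly earnings rose for the third quarter",
    3: "Malaria cases in Ethiopia; case fatality rate unknown",
}

def extract(text):
    """Stand-in extractor: returns (disease, location) tokens, if any."""
    pairs = [("Ebola", "Zaire"), ("Malaria", "Ethiopia")]
    return [(d, l) for d, l in pairs if d in text and l in text]

# Scan-based: process every document sequentially.
scan_tokens = [t for text in docs.values() for t in extract(text)]

# Index-based: query a keyword index, process only matching documents.
index_query = "case fatality rate"
matches = (text for text in docs.values() if index_query in text)
index_tokens = [t for text in matches for t in extract(text)]

print(scan_tokens)   # complete, but touches all documents
print(index_tokens)  # cheaper, complete only if the query reaches all useful docs
```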
QXtract: Querying Text Databases for Robust Scalable Information EXtraction

User-provided seed tuples:

DiseaseName   Location   Date
Malaria       Ethiopia   Jan. 1995
Ebola         Zaire      May 1995

Pipeline: Query Generation → Queries → Search Engine → Promising Documents (from the Text Database) → Information Extraction System → Extracted Relation:

DiseaseName       Location   Date
Malaria           Ethiopia   Jan. 1995
Ebola             Zaire      May 1995
Mad Cow Disease   The U.K.   July 1995
Pneumonia         The U.S.   Feb. 1995

Problem: learn keyword queries to retrieve “promising” documents.
Learning Queries to Retrieve Promising Documents

1. Get a document sample with “likely negative” and “likely positive” examples, by seed sampling from the user-provided seed tuples.
2. Label the sample documents, using the information extraction system as an “oracle.”
3. Train classifiers to “recognize” useful documents.
4. Generate queries from the classifier model/rules.

A sketch of steps 3-4 follows.
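A minimal sketch of steps 3-4: train a linear classifier on the oracle-labeled sample and turn its highest-weighted terms into keyword queries. The training data are invented, and the scikit-learn usage is an illustrative assumption, not QXtract's actual implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Oracle-labeled sample: 1 = extractor found a tuple, 0 = it did not.
sample = [
    ("ebola outbreak kills dozens in zaire", 1),
    ("malaria epidemic spreads in ethiopia", 1),
    ("stocks close higher on wall street", 0),
    ("team wins championship game", 0),
]
texts, labels = zip(*sample)

# Step 3: train a classifier to recognize useful documents.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Step 4: the top positively weighted terms become keyword queries.
terms = vectorizer.get_feature_names_out()
weights = clf.coef_[0]
queries = [t for _, t in sorted(zip(weights, terms), reverse=True)[:5]]
print(queries)  # e.g., ['outbreak', 'epidemic', ...]
```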
[SIGMOD 2003]
Demonstration
Querying Graph

• The querying graph is a bipartite graph containing tokens and documents
• Each token (transformed into a keyword query) retrieves documents
• Documents contain tokens

Example tokens: t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam>, linked to documents d1…d5. (A sketch of this graph follows.)
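A minimal sketch of the querying graph and of the reachability it induces, with invented edges: starting from a seed token, alternate "token retrieves document" and "document contains token" steps:

```python
from collections import deque

# Bipartite edges: token -> documents it retrieves as a keyword query,
# and document -> tokens it contains (illustrative toy data).
retrieves = {"<SARS, China>": ["d1"], "<Ebola, Zaire>": ["d1", "d2"],
             "<Malaria, Ethiopia>": ["d3"], "<Cholera, Sudan>": ["d3", "d4"],
             "<H5N1, Vietnam>": ["d5"]}
contains = {"d1": ["<SARS, China>", "<Ebola, Zaire>"],
            "d2": ["<Malaria, Ethiopia>"],
            "d3": ["<Malaria, Ethiopia>", "<Cholera, Sudan>"],
            "d4": ["<H5N1, Vietnam>"], "d5": ["<H5N1, Vietnam>"]}

def reachable_tokens(seed):
    """BFS alternating token -> document -> token steps from a seed token."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        token = frontier.popleft()
        for doc in retrieves.get(token, []):
            for t in contains.get(doc, []):
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen

print(reachable_tokens("<SARS, China>"))  # all five tokens, in this toy graph
```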
Sizes of Connected Components

How many tuples are in the largest Core + Out?

[Figure: “bow-tie” structure of the reachability graph: In → Core (strongly connected) → Out]

Conjecture:
• The degree distribution in reachability graphs follows a “power law.”
• Then the reachability graph has at most one giant component.

Define reachability as the fraction of tuples in the largest Core + Out (a sketch of this computation follows).
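A minimal sketch of this reachability metric on a token-level reachability graph, using networkx (assumed available) for the strongly connected components; the edges are invented:

```python
import networkx as nx

# Token-to-token reachability graph: an edge t1 -> t2 means a document
# retrieved by querying t1 contains t2 (illustrative toy data).
G = nx.DiGraph([("t1", "t2"), ("t2", "t3"), ("t3", "t2"), ("t3", "t4"),
                ("t5", "t2")])

# Core = largest strongly connected component; Out = nodes reachable from it.
core = max(nx.strongly_connected_components(G), key=len)
out = set()
for node in core:
    out |= nx.descendants(G, node)

reachability = len(core | out) / G.number_of_nodes()
print(core, out - core, reachability)  # {'t2','t3'}, {'t4'}, 0.6
```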
NYT Reachability Graph: Outdegree Distribution

[Figure: outdegree distributions for MaxResults=10 and MaxResults=50; both match the power-law distribution]
NYT: Component Size Distribution

[Figure: component size distributions]
• MaxResults=10: not “reachable”, CG / |T| = 0.297
• MaxResults=50: “reachable”, CG / |T| = 0.620
Connected Components Visualization

[Figure: connected components for DiseaseOutbreaks, New York Times 1995]
[SIGMOD 2006]
Estimate Cost of Retrieval Methods

• Alternatives: Scan, Filtered Scan, Tuples, QXtract
• A general cost model for text-centric tasks:
  • Information extraction, summary construction, etc.
• Estimate the expected cost of each access method:
  • A parametric model describing all retrieval steps
  • Analysis extended to arbitrary degree distributions
  • Parameter estimates can be “piggybacked” at runtime
• Cost estimates can be provided to a query optimizer for nearly optimal execution (a sketch follows)
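A minimal sketch of how such estimates might drive the choice of access method; the cost formulas are simplified stand-ins for the paper's parametric model, and all numbers are illustrative:

```python
# Illustrative parameters, estimated or "piggybacked" at runtime.
n_docs = 1_000_000      # documents in the collection
useful_frac = 0.01      # fraction of documents that yield tuples
c_retrieve = 0.01       # cost to retrieve one document (sec)
c_filter = 0.001        # cost to run a cheap document classifier (sec)
c_process = 1.0         # cost to run full extraction on one document (sec)
target_recall = 0.9     # fraction of useful documents that must be processed

def cost_scan():
    # Retrieve and fully process documents until the target recall is met.
    return target_recall * n_docs * (c_retrieve + c_process)

def cost_filtered_scan(filter_recall=0.95):
    # Classify every document cheaply; fully process only predicted-useful ones.
    if filter_recall < target_recall:
        return float("inf")  # the filter alone cannot reach the target
    docs_processed = useful_frac * n_docs * filter_recall
    return n_docs * (c_retrieve + c_filter) + docs_processed * c_process

best = min([("Scan", cost_scan()), ("Filtered Scan", cost_filtered_scan())],
           key=lambda x: x[1])
print(best)  # here, Filtered Scan wins by a wide margin
```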
Optimized Execution of Text-Centric Tasks

[Figure: execution costs of the Scan, Filtered Scan, and Tuples strategies]
Current Research Agenda

• Seamless, intuitive, and robust access to knowledge in biological and medical sources
• Some research problems:
  • Robust query processing over unstructured data
  • Intelligently interpreting user information needs
  • Text mining for bio- and medical informatics
  • Modeling implicit network structures:
    • Entity graphs in Wikipedia
    • Protein-protein interaction networks
    • Semantic maps of MedLine
Deriving Actionable Knowledge from Unstructured (Text) Data

• Extract actionable rules from medical text (Medline, patient reports, …)
  • Joint project (early stages) with the medical school, GT
• Epidemiology surveillance (w/ SPH)
• Query processing over unstructured data:
  • Tune extraction for the query workload
  • Index structures to support effective extraction
  • Queries over extracted and “native” tables
Text Mining for Bioinformatics
• Impossible to keep up with the literature and experimental notes
• Automatically update ontologies and indexes
• Automate the tedious work of post-wetlab search
• Identify (and assign text labels to) DNA structures
Mining Text and Sequence Data
[PSB 2004]
[Figure: ROC50 scores for each class and method]