Transcript: 9-28-2006-grad
Accessing, Managing, and Mining
Unstructured Data
Eugene Agichtein
1
The Web
20B+ pages of machine-readable text (some of it useful)
(Mostly) human-generated for human consumption
Both an “artificial” and a “natural” phenomenon
Still growing?
Local and global structure (links)
Headaches:
Dynamic vs. static content
People figured out how to make money
Positives:
Everything (almost) is on the web
People (eventually) can find info
People (on average) are not evil
2
Wait, there is more
Blogs, wikipedia
Hidden web: > 25 million databases
Accessible via keyword search interfaces
E.g., MedLine, CancerLit, USPTO, …
100x more data than surface web
(Transcribed) speech from
Classified
Genetic sequence annotations
Biological & Medical literature
Medical records, reports, alerts, 911 calls
3
Outline
Unstructured data (text, web, …) is
Important (really!)
Not so unstructured
Main tasks/requirements and challenges
Example problem: query optimization for text-centric
tasks
Fundamental research problems/directions
4
Unstructured data = natural language text
(for this talk)
Incredibly powerful and flexible means of
communicating knowledge
Papers, news, web pages, lecture notes, patient records, shopping lists…
Local structures: English syntax, HTML layout
Semantics implicit, ambiguous, subjective
I saw a man with a chainsaw
Need incredibly powerful and flexible decoder
5
Some more structure
Explicit link structure
Web, Blogs, Wikipedia, citations
Implicit link structure
Co-occurrence of entities within same document/context
implies link between entities
Occurrence of same entity in multiple documents implies
link between documents
Physical location
Page primarily “about” Atlanta
User somewhere around N. Decatur Rd
E-mail sender is two floors down
More on this later
6
Global Problem Space
Crawling (accessing) the data
Storing (multiple versions of) data
“Understanding” the data → information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery
Building a nuclear/hydro/wind/… power plant
7
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured
relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease Control
and Prevention, which is in the front line of the world's
response to the deadly Ebola epidemic in Zaire ,
is finding itself hard pressed to cope with the crisis…
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU’s Proteus)
Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire
8
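As a toy illustration of the kind of extraction shown above, here is a minimal pattern-based sketch in Python; the regular expression and the sample text are for illustration only, and the real systems mentioned here (NYU's Proteus, Snowball) are far more sophisticated than a single pattern.

```python
import re

# Toy pattern for <Date, Disease, Location> tuples; illustrative only.
PATTERN = re.compile(
    r"(?P<date>\w+ \d{1,2},? \d{4}).*?"
    r"(?P<disease>Ebola|Malaria|Pneumonia|Mad Cow Disease).*?"
    r"epidemic in (?P<location>\w+)",
    re.DOTALL,
)

text = ("May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, "
        "which is in the front line of the world's response to the deadly Ebola "
        "epidemic in Zaire, is finding itself hard pressed to cope with the crisis...")

m = PATTERN.search(text)
if m:
    # -> May 19 1995 Ebola Zaire
    print(m.group("date"), m.group("disease"), m.group("location"))
```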
An Abstract View of Text-Centric Tasks
Output Tokens
Text Database
…
Extraction
System
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Task                     Token
Information Extraction   Relation Tuple
Database Selection       Word (+Frequency)
Focused Crawling         Web Page about a Topic
For the rest of the talk: Information Extraction
9
Executing a Text-Centric Task
Output Tokens
Text Database
Extraction
…
System
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Two major execution paradigms (similar to the relational world):
Scan-based: Retrieve and process documents sequentially
Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results
→ Underlying data distribution dictates what is best
Indexes are only “approximate”: the index is on keywords, not on the tokens of interest
Unlike the relational world, the choice of execution plan affects output completeness (not only speed)
10
Execution Plan Characteristics
Output Tokens
Text Database
Extraction
System
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Question: How do we choose the fastest execution plan for reaching a target recall?
Execution Plans have two main characteristics:
Execution Time
Recall (fraction of tokens retrieved)
“What is the fastest plan for discovering 10% of the disease
outbreaks mentioned in The New York Times archive?”
11
Outline
Description and analysis of crawl- and query-based plans
Crawl-based: Scan, Filtered Scan
Query-based (Index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
12
Scan
Output Tokens
Text Database
Extraction
…
System
1. Retrieve docs from database
2. Process documents
3. Extract output tokens
Scan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| · (R + P)
(R: time for retrieving a document, P: time for processing a document)
Question: How many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper)
13
Estimating Recall of Scan
<SARS, China>
Modeling Scan for Token t:
What is the probability of seeing t (with
frequency g(t)) after retrieving S documents?
A “sampling without replacement” process
After retrieving S documents, the frequency of token t follows a hypergeometric distribution
Recall for token t is the probability that the frequency of t in the S retrieved documents is > 0
g(t) = frequency of token t, i.e., the number of database documents containing t
[Diagram: “sampling without replacement” of S of the documents d1 … dN in database D, for token t]
14
Estimating Recall of Scan
<SARS, China>
<Ebola, Zaire>
Modeling Scan:
Multiple “sampling without replacement”
processes, one for each token
Overall recall is average recall across
tokens
→ We can compute the number of documents required to reach a target recall
Execution time = |Retrieved Docs| · (R + P)
[Diagram: tokens t1 … tM, each a separate “sampling without replacement” process over documents d1 … dN in database D]
15
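A minimal sketch of this recall estimate in Python, assuming we know the number of database documents N and each token's document frequency g(t); math.comb supplies the binomial coefficients of the hypergeometric model.

```python
from math import comb

def scan_recall(token_freqs, N, S):
    """Expected recall of Scan after retrieving S of the N database documents.
    For a token t contained in g(t) documents, the number of retrieved documents
    containing t is hypergeometric, so P[t is seen] = 1 - C(N - g(t), S) / C(N, S).
    Overall recall is this probability averaged over all tokens."""
    total_choices = comb(N, S)
    seen = sum(1.0 - comb(N - g, S) / total_choices for g in token_freqs)
    return seen / len(token_freqs)

def docs_for_target_recall(token_freqs, N, target):
    """Smallest S with expected recall >= target (linear search for clarity)."""
    for S in range(1, N + 1):
        if scan_recall(token_freqs, N, S) >= target:
            return S
    return None  # target recall is unreachable

# Toy example: 1,000 documents; made-up token document frequencies g(t).
freqs = [1, 2, 5, 20, 100]
S = docs_for_target_recall(freqs, N=1000, target=0.5)
# The estimated execution time of Scan is then S * (R + P).
```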
Outline
Description and analysis of crawl- and query-based plans
Crawl-based: Scan, Filtered Scan
Query-based: Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
16
Iterative Set Expansion
Output Tokens
Text Database
Extraction System
Query Generation
1. Query database with seed tokens (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tokens from docs (e.g., <Malaria, Ethiopia>)
4. Augment seed tokens with new tokens
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
(R: time for retrieving a document, P: time for processing a document, Q: time for answering a query)
Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
17
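A minimal sketch of this loop; search(token) (the keyword-query interface) and extract(doc) (the extraction system) are placeholder callables, not components named on the slides.

```python
from collections import deque

def iterative_set_expansion(seed_tokens, search, extract):
    """Sketch of Iterative Set Expansion: query with seed tokens, process the
    retrieved documents, extract new tokens, and feed them back as queries.
    Total time is roughly |retrieved docs| * (R + P) + |queries| * Q."""
    tokens = set(seed_tokens)        # e.g., {("Malaria", "Ethiopia")}
    retrieved = set()
    queue = deque(seed_tokens)
    queries = 0
    while queue:
        token = queue.popleft()      # becomes a keyword query, e.g., [Ebola AND Zaire]
        queries += 1
        for doc in search(token):    # 1./2. query the database, process results
            if doc in retrieved:
                continue
            retrieved.add(doc)
            for t in extract(doc):   # 3. extract tokens from the document
                if t not in tokens:  # 4. augment the seed tokens
                    tokens.add(t)
                    queue.append(t)
    return tokens, retrieved, queries
```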
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed into a keyword query) retrieves documents
Documents contain tokens
[Diagram: tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) linked to the documents d1 … d5 they retrieve and appear in]
18
Using Querying Graph for Analysis
We need to compute the:
Number of documents retrieved after
sending Q tokens as queries (estimates time)
Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the:
Degree distribution of the tokens
discovered by retrieving documents
Degree distribution of the documents
retrieved by the tokens
(Not the same as the degree distribution of a
randomly chosen token or document – it is easier to
discover documents and tokens with high degrees)
Elegant analysis framework based on generating functions – details in the paper
19
Recall Limit: Reachability Graph
[Diagram: querying graph (tokens t1 … t5, documents d1 … d5) and the corresponding reachability graph over tokens: t1 retrieves document d1, which contains t2, so there is an edge t1 → t2]
Upper recall limit: determined by the size
of the biggest connected component
20
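A minimal sketch of computing the tokens reachable from a set of seed tokens in this graph; docs_for_token and tokens_in_doc are hypothetical adjacency maps standing in for the querying graph.

```python
from collections import deque

def reachable_tokens(seeds, docs_for_token, tokens_in_doc):
    """Tokens reachable from the seed tokens in the querying graph (a token
    retrieves documents, which contain further tokens). The size of this set
    is the recall ceiling of Iterative Set Expansion; in the best case it is
    the biggest connected component of the reachability graph."""
    seen_tokens, seen_docs = set(seeds), set()
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        for d in docs_for_token.get(t, ()):
            if d not in seen_docs:
                seen_docs.add(d)
                for t2 in tokens_in_doc.get(d, ()):
                    if t2 not in seen_tokens:
                        seen_tokens.add(t2)
                        queue.append(t2)
    return seen_tokens

# Hypothetical adjacency maps standing in for the graph on the slide:
docs_for_token = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"]}
tokens_in_doc = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3"]}
print(reachable_tokens({"t1"}, docs_for_token, tokens_in_doc))  # {'t1', 't2', 't3'}
```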
Automatic Query Generation
Iterative Set Expansion has recall limitation due to
iterative nature of query generation
Automatic Query Generation avoids this problem by
creating queries offline (using machine learning), which
are designed to return documents with tokens
Details in the papers
21
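One simple way to sketch such offline query generation: score candidate words by how strongly they are associated with documents that contain tokens, using a small labeled sample. This is only an illustrative stand-in for the machine-learning approach referenced above; the labeled documents below are made up.

```python
from collections import Counter
from math import log

def generate_queries(useful_docs, useless_docs, num_queries=10):
    """Rank candidate single-word queries by a smoothed log-odds score of
    appearing in 'useful' documents (those containing tokens) vs. 'useless' ones."""
    useful = Counter(w for doc in useful_docs for w in set(doc.lower().split()))
    useless = Counter(w for doc in useless_docs for w in set(doc.lower().split()))
    nu, nn = len(useful_docs), len(useless_docs)

    def score(w):
        p = (useful[w] + 1) / (nu + 2)     # smoothed P(word | useful)
        q = (useless[w] + 1) / (nn + 2)    # smoothed P(word | useless)
        return log(p / q)

    return sorted(set(useful), key=score, reverse=True)[:num_queries]

# Hypothetical labeled sample:
useful_docs = ["ebola outbreak reported in zaire", "malaria epidemic hits ethiopia"]
useless_docs = ["stock markets rallied today", "the team won the championship"]
print(generate_queries(useful_docs, useless_docs, num_queries=3))
```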
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
22
Summary of Cost Analysis
Our analysis so far:
Takes as input a target recall
Gives as output the time for each plan to reach target recall
(time = infinity, if plan cannot reach target recall)
Time and recall depend on task-specific properties of database:
Token degree distribution
Document degree distribution
Next, we show how to estimate degree distributions on-the-fly
23
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families
Task                           Document Distribution   Token Distribution
Information Extraction         Power-law               Power-law
Content Summary Construction   Lognormal               Power-law (Zipf)
Focused Resource Discovery     Uniform                 Uniform
[Log-log plots: Number of Documents vs. Document Degree (fit y = 43060·x^-3.3863) and Number of Tokens vs. Token Degree (fit y = 5492.2·x^-2.0254)]
Can characterize distributions with only a few parameters!
24
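A minimal sketch of fitting the two parameters of such a power law by least squares on a log-log scale, in the same form as the fits in the plots (y = a·x^-b); the degree histogram below is invented for illustration.

```python
import numpy as np

def fit_power_law(degrees, counts):
    """Fit counts ~ a * degree**(-b) by linear regression in log-log space,
    the same functional form as the slide's fits (e.g., y = 43060 x^-3.3863).
    Log-log regression is only a rough estimator, but it needs just two parameters."""
    slope, intercept = np.polyfit(np.log(degrees), np.log(counts), 1)
    return np.exp(intercept), -slope   # (a, b)

# Hypothetical degree histogram: how many documents have each degree.
degrees = np.array([1, 2, 3, 5, 10, 20, 50], dtype=float)
counts = np.array([40000, 4000, 1500, 300, 40, 4, 1], dtype=float)
a, b = fit_power_law(degrees, counts)  # b comes out roughly 3 for this toy data
```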
Parameter Estimation
Naïve solution for parameter estimation:
Start with a separate “parameter-estimation” phase
Perform random sampling on database
Stop when cross-validation indicates high confidence
We can do better than this!
No need for separate sampling phase
Sampling is equivalent to executing the task:
→Piggyback parameter estimation into execution
25
On-the-fly Parameter Estimation
Correct (but unknown) distribution
Pick most promising execution
plan for target recall assuming
“default” parameter values
Start executing task
Update parameter estimates
during execution
Switch plan if updated statistics
indicate so
[Diagram: initial default estimate and successively updated estimates converging toward the correct (but unknown) distribution]
Important: only Scan acts as “random sampling”; all other execution plans need parameter adjustment (see paper)
26
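A minimal sketch of this plan-switching loop; plans, estimate_cost, execute_batch, and update_stats are placeholders for the execution plans, cost model, and degree-distribution estimates from the preceding slides.

```python
def run_with_optimizer(plans, estimate_cost, execute_batch, update_stats, target_recall):
    """Pick the plan that looks fastest under default parameter estimates,
    refine the estimates while executing, and switch plans whenever the
    updated statistics favor a different one."""
    stats = {}                     # default (empty) parameter estimates
    recall = 0.0
    plan = min(plans, key=lambda p: estimate_cost(p, stats, target_recall))
    while recall < target_recall:
        recall, observations = execute_batch(plan, recall)  # retrieve & process a batch
        stats = update_stats(stats, observations)           # refine degree-distribution estimates
        best = min(plans, key=lambda p: estimate_cost(p, stats, target_recall))
        if best is not plan:
            plan = best                                      # switch execution plan on the fly
    return plan, recall
```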
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
27
Correctness of Theoretical Analysis
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filt. Scan, Automatic Query Gen., and Iterative Set Expansion]
Solid lines: Actual time
Dotted lines: Predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens
28
Experimental Results (Information Extraction)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
Solid lines: Actual time
Green line: Time with optimizer
(results similar in other experiments – see paper)
29
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall
of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan
for target recall
30
Global Problem Space
Crawling (accessing) the data
“Understanding” the data → information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery
31
Some Research Directions
Modeling explicit and implicit network structures
Knowledge Discovery from Biological and Medical Data
Automatic sequence annotation (bioinformatics, genetics)
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing
Modeling evolution of explicit structure on web, blogspace, wikipedia
Modeling implicit link structures in text, collections, web
Exploiting implicit & explicit social networks (e.g., for epidemiology)
Integrating information in structured and unstructured sources
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources
Information propagation on the web
In collaborative sources (wikipedia, MedLine)
32
Page Quality: In Search of an Unbiased Web Ranking
[Cho, Roy, Adams, SIGMOD 2005]
“popular pages tend to get even more popular, while
unpopular pages get ignored by an average user”
33
Sic Transit Gloria Telae: Towards an Understanding of the
Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
34
Modeling Social Networks for
Epidemiology, security, …
Email exchange mapped onto cubicle locations.
35
Some Research Directions
Modeling explicit and implicit network structures
Knowledge Discovery from Biological and Medical Data
Automatic sequence annotation (bioinformatics, genetics)
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing
Modeling evolution of explicit structure on web, blogspace, wikipedia
Modeling implicit link structures in text, collections, web
Exploiting implicit & explicit social networks (e.g., for epidemiology)
Integrating information in structured and unstructured sources
Query processing over unstructured text
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Information diffusion/propagation in online sources
Information propagation on the web
In collaborative sources (wikipedia, MedLine)
36
ISMB 2003
Applying Text Mining for Bioinformatics
100,000+ gene and protein synonyms
extracted from
50,000+ journal articles
Approximately 40% of confirmed
synonyms not previously listed
in curated authoritative reference
(SWISSPROT)
“APO-1, also known as DR6…”
“MEK4, also called SEK1…”
37
Examples of Entity-Relationship
Extraction
“We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.”
[Diagram: extracted relations – CBF-A interacts with CBF-C; CBF-A and CBF-C form the CBF-A-CBF-C complex; CBF-B associates with the CBF-A-CBF-C complex]
38
Another Example
Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various
immunomodulatory activities, such as the induction of interleukin 12, interferon gamma
(IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency
virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are
investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only
macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes
that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G
envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after
infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv
vector (in which the env gene is defective and the nef gene is replaced with the firefly
luciferase gene) when this vector was transfected directly into MDMs. These findings
suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs,
suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100
induced IFN-beta production in these cells, resulting in induction of the 16-kDa
CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses
HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a
specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the
p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1
replication in MDMs. These findings suggest that Z-100 might be a useful
immunomodulator for control of HIV-1 infection.
39
AliBaba (Ulf Leser), http://wbi.informatik.hu-berlin.de:80 – PubMed visualized
[Screenshot: query, extracted info, links to databases]
40
Agichtein & Eskin, PSB 2004
Mining Text and Sequence Data
ROC50 scores for each class and method
41
Some Research Directions
Modeling explicit and implicit network structures
Knowledge Discovery from Biological and Medical Data
Automatic sequence annotation (bioinformatics, genetics)
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing
Modeling evolution of explicit structure on web, blogspace, wikipedia
Modeling implicit link structures in text, collections, web
Exploiting implicit & explicit social networks (e.g., for epidemiology)
Integrating information in structured and unstructured sources
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources
Information propagation on the web
In collaborative sources (wikipedia, MedLine)
42
Structure and evolution of blogspace
[Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraph, by week.
43
Structure of implicit entity-entity networks in text [Agichtein & Gravano, ICDE 2003]
Connected Components Visualization
Disease Outbreaks, New York Times 1995
44
Some Research Directions
Modeling explicit and implicit network structures
Knowledge Discovery from Biological and Medical Data
Automatic sequence annotation (bioinformatics, genetics)
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing
Modeling evolution of explicit structure on web, blogspace, wikipedia
Modeling implicit link structures in text, collections, web
Exploiting implicit & explicit social networks (e.g., for epidemiology)
Integrating information in structured and unstructured sources
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources
Information propagation on the web, news
In collaborative sources (wikipedia, MedLine)
45
Thank You
Details:
http://www.mathcs.emory.edu/~eugene/
46