Untangling Text Data Mining
Marti Hearst
UC Berkeley SIMS
ACL’99 Plenary Talk
June 23, 1999
Outline
Untangling several different fields
– DM, CL, IA, TDM
TDM examples
TDM as Exploratory Data Analysis
– New Problems for Computational Linguistics
– Our current efforts
Classifying Application Types
                    Patterns                    Non-Novel Nuggets        Novel Nuggets
Non-textual data    Standard data mining        Database queries         ?
Textual data        Computational linguistics   Information retrieval    Real text data mining
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns
from very large datasets.
A “regime” which enables people to interact
effectively with massive data stores.
Deriving new information from data.
Why Data Mining?
Because the data is there.
Because
– larger disks
– faster CPUs
– high-powered visualization
– networked information
are becoming widely available.
The Knowledge Discovery
from Data Process (KDD)
KDD: The non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in data.
(Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process
DM Touchstone Applications
(CACM 39 (11) Special Issue)
Finding patterns across data sets:
– Reports on changes in retail sales
» to improve sales
– Patterns of sizes of TV audiences
» for marketing
– Patterns in NBA play
» to alter, and so improve, performance
– Deviations in standard phone calling behavior
» to detect fraud
» for marketing
What is Data Mining?
Potential point of confusion:
– The extracting ore from rock
metaphor does not really apply to the
practice of data mining
– If it did, then standard database
queries would fit under the rubric of
data mining
– In practice, DM refers to:
» finding patterns across large datasets
» discovering heretofore unknown
information
What is Text Data Mining?
Many people’s first thought:
– Make it easier to find things on the Web.
– But this is information retrieval!
Needles in Haystacks
The emphasis in IR is in finding documents
that already contain answers to questions.
Information Retrieval
A restricted form of Information Access
The system has available only pre-existing,
“canned” text passages.
Its response is limited to selecting from these
passages and presenting them to the user.
It must select, say, 10 or 20 passages out of
millions.
What is Text Data Mining?
The metaphor of extracting ore from rock:
– Does make sense for extracting
documents of interest from a huge pile.
– But does not reflect notions of DM in
practice:
» finding patterns across large collections
» discovering heretofore unknown information
Real Text DM
What would finding a pattern across a large
text collection really look like?
Bill Gates + MS-DOS
in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
Real Text DM
The point:
– Discovering heretofore unknown
information is not what we usually do
with text.
– (If it weren’t known, it could not have
been written by someone!)
However:
– There is a field whose goal is to learn
about patterns in text for their own
sake ...
Computational Linguistics!
Goal: automated language understanding
– this isn’t possible
– instead, go for subgoals, e.g.,
» word sense disambiguation
» phrase recognition
» semantic associations
Common current approach:
– statistical analyses over very large text
collections
Why CL Isn’t TDM
A linguist finds it interesting that
“cloying” co-occurs significantly with
“Jar Jar Binks” ...
… But this doesn’t really answer a
question relevant to the world outside
the text itself.
Why CL Isn’t TDM
We need to use the text indirectly
to answer questions about the world
Direct:
– Analyze patent text; determine which word
patterns indicate various subject categories.
Indirect:
– Analyze patent text; find out whether private
or public funding leads to more inventions.
Why CL Isn’t TDM
Direct:
– Cluster newswire text; determine which terms
are predominant
Indirect:
– Analyze newswire text; gather evidence about
which countries/alliances are dominating which
financial sectors
Nuggets vs. Patterns
TDM: we want to discover new information …
… As opposed to discovering which statistical
patterns characterize occurrence of known
information.
Example: WSD
– not TDM: computing statistics over a corpus to
determine what patterns characterize Sense S.
– TDM: discovering the meaning of a new sense of a
word.
Nuggets vs. Patterns
Nugget: a new, heretofore unknown item of
information.
Pattern: distributions or rules that
characterize the occurrence (or nonoccurrence) of a known item of information.
Application of rules can create nuggets in
some circumstances.
Example: Lexicon Augmentation
Application of a lexico-syntactic pattern:
NP0 such as NP1, {NP2 …, (and | or) NPi }
i >= 1, implies that
forall NPi, i>=1, hyponym(NPi, NP0)
Extracts out a new hypernym:
– “Agar is a substance prepared from a mixture of red
algae, such as Gelidium, for laboratory or industrial
use.”
– implies hyponym(“Gelidium”, “red algae”)
However, this fact was already known to
the author of the text.
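The "such as" pattern above can be sketched in a few lines. This is a minimal illustration, not the original implementation: it assumes the sentence has already been split into chunks tagged "NP" (noun phrase) or "O" (other) by some upstream chunker, and the function name such_as_hyponyms and the chunk format are made up for this sketch.

```python
# Words that may separate the hyponym noun phrases NP1 ... NPi.
SEPARATORS = {",", "and", "or"}


def such_as_hyponyms(chunks):
    """Return (hyponym, hypernym) pairs for 'NP0 such as NP1, NP2 ... (and | or) NPi'.

    `chunks` is a list of (text, tag) tuples, tag "NP" or "O".
    This pre-chunked input format is an assumption of the sketch.
    """
    pairs = []
    for i, (text, _tag) in enumerate(chunks):
        if text != "such as":
            continue
        # NP0 is the chunk just left of "such as", skipping one optional comma.
        j = i - 1
        if j >= 0 and chunks[j][0] == ",":
            j -= 1
        if j < 0 or chunks[j][1] != "NP":
            continue
        hypernym = chunks[j][0]
        # Collect NP1 ... NPi; stop at the first chunk that is neither
        # a noun phrase nor a separator.
        k = i + 1
        while k < len(chunks):
            if chunks[k][1] == "NP":
                pairs.append((chunks[k][0], hypernym))
            elif chunks[k][0] not in SEPARATORS:
                break
            k += 1
    return pairs


# The Agar sentence from the slide, chunked by hand.
agar = [
    ("Agar", "NP"), ("is", "O"), ("a substance", "NP"), ("prepared from", "O"),
    ("a mixture", "NP"), ("of", "O"), ("red algae", "NP"), (",", "O"),
    ("such as", "O"), ("Gelidium", "NP"), (",", "O"), ("for", "O"),
    ("laboratory or industrial use", "NP"),
]
print(such_as_hyponyms(agar))  # [('Gelidium', 'red algae')]
```

The hard part in practice is the noun-phrase chunking, which this sketch takes as given.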
The Quandary
How do we use text to both
– Find new information not known to the
author of the text, and
– Find information that is not about the
text itself?
Idea: Exploratory Data Analysis
Use large text collections to gather
evidence to support (or refute)
hypotheses
– Not known to author: links across many
texts
– Not self-referential: work within the
domain of discourse
Example: Etiology
Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise
find causal links among titles
– symptoms
– drugs
– results
Swanson Example (1991)
Problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability
All extracted from medical journal titles
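One way to picture how these title-level facts combine: treat each fact as a link between two terms, then look for terms linked to both migraine and magnesium even though the two are never linked directly. The sketch below is only an illustration of that idea, not Swanson's actual procedure (which was only partially automated). The LINKS list is a hand encoding of the bullets above, and bridging_terms is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-encoded links from the list above; each pair means "these two
# terms are connected by at least one title".
LINKS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("calcium channel blockers", "magnesium"),
    ("spreading cortical depression", "migraine"),
    ("spreading cortical depression", "magnesium"),
    ("platelet aggregability", "migraine"),
    ("platelet aggregability", "magnesium"),
]


def bridging_terms(links, start, target):
    """Return terms linked to both `start` and `target`, sorted by name.

    Returns an empty list when the two are already directly linked,
    since then there is no hidden connection to suggest.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    if target in neighbors[start]:
        return []
    return sorted(neighbors[start] & neighbors[target])


print(bridging_terms(LINKS, "migraine", "magnesium"))
```

Four independent bridges between two otherwise unlinked terms is the kind of evidence that makes the migraine-magnesium hypothesis worth testing.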
Gathering Evidence
[Figure: diagram with nodes stress, CCB, SCD, PA, and migraine, and four copies of magnesium]
Gathering Evidence
[Figure: diagram with nodes CCB, PA, SCD, stress, migraine, and a single magnesium node]
Swanson’s TDM
Two of his hypotheses have received
some experimental verification.
His technique
– Only partially automated
– Required medical expertise
Few people are working on this.
How to Automate This?
Idea: mixed-initiative interaction
– User applies tools to help explore the
hypothesis space
– System runs suites of algorithms to help
explore the space, suggest directions
Our Proposed Approach
Three main parts
– UI for building/using strategies
– Backend for interfacing with various
databases and translating different
formats
– Content analysis/machine learning for
figuring out good hypotheses/throwing
out bad ones
How to find functions of genes?
Important problem in molecular
biology
– Have the genetic sequence
– Don’t know what it does
– But …
» Know which genes it coexpresses with
» Some of these have known function
– So … Infer function based on function of
co-expressed genes
» This is new work by Michael Walker and
others at Incyte Pharmaceuticals
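The inference step can be sketched as a simple vote among co-expressed genes with known function. This is a toy illustration under stated assumptions, not the Incyte method: the gene names, function labels, and the infer_function helper are all placeholders invented for the example.

```python
from collections import Counter


def infer_function(gene, coexpressed, known_function):
    """Rank candidate functions for `gene` by votes from co-expressed genes.

    coexpressed:    dict mapping a gene to the genes it co-expresses with
    known_function: dict mapping a gene to its known function label
    Genes without a known function cast no vote.
    """
    votes = Counter(
        known_function[g]
        for g in coexpressed.get(gene, ())
        if g in known_function
    )
    return votes.most_common()


# Placeholder data: none of these names refer to real genes or functions.
coexpressed = {"mystery": ["gene_a", "gene_b", "gene_c", "gene_d"]}
known_function = {
    "gene_a": "function_1",
    "gene_b": "function_1",
    "gene_c": "function_2",
}
print(infer_function("mystery", coexpressed, known_function))
# [('function_1', 2), ('function_2', 1)]
```

A ranked list rather than a single answer fits the exploratory setting: the researcher still decides which candidate is worth a lab test.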
Gene Co-expression:
Role in the genetic pathway
[Figure: pathway diagrams with nodes Kall., PSA, PAP, and unknown genes g? and h?]
Other possibilities as well
Make use of the literature
Look up what is known about the
other genes.
Different articles in different
collections
Look for commonalities
– Similar topics indicated by Subject
Descriptors
– Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
Developing Strategies
Different strategies seem needed for
different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many.
– AND the result with “disease” category
» If result is non-empty, this might be an
interesting gene
– Now get 803 documents
– AND the result with PSA
» Get 11 documents. Better!
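The narrowing strategy above is a chain of Boolean ANDs, which can be sketched as set intersection over document IDs. The index below is made-up toy data (it does not reproduce the 7341 / 803 / 11 counts), and narrow is a hypothetical helper, not part of any real retrieval system.

```python
def narrow(index, terms):
    """AND the document sets for `terms` in order.

    index: dict mapping a term or category to a set of document IDs
    Returns the final set plus a trace of (term, result size) after each
    step, so the user can see where the result became manageable.
    """
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace


# Toy index with invented document IDs.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 6, 9},
    "PSA": {3, 6, 7},
}
docs, trace = narrow(index, ["kallikrein", "disease", "PSA"])
print(trace)  # [('kallikrein', 6), ('disease', 4), ('PSA', 2)]
```

Keeping the trace is the design point: the size after each step is what tells the user whether a strategy is converging.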
Developing Strategies
Look for commonalities among these
documents
– Manual scan through ~100 category
labels
– Would have been better if
» Automatically organized
» Intersections of “important” categories
scanned for first
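The automatic organization wished for above could start as simply as counting which category labels, and which pairs of labels, recur across the retrieved documents. A rough sketch follows; the document IDs and their label assignments are invented for illustration, and rank_categories is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def rank_categories(doc_categories):
    """Count documents per category label and per pair of labels.

    doc_categories: dict mapping a document ID to its list of labels
    Returns two lists, most frequent first: single labels and label pairs.
    """
    singles = Counter()
    pairs = Counter()
    for labels in doc_categories.values():
        unique = sorted(set(labels))  # count each label once per document
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles.most_common(), pairs.most_common()


# Invented assignments, using label names that appear on the slides.
doc_categories = {
    "doc1": ["prostatic neoplasms", "tumor markers"],
    "doc2": ["prostatic neoplasms", "tumor markers", "antibodies"],
    "doc3": ["prostatic neoplasms", "adenocarcinoma"],
}
singles, pairs = rank_categories(doc_categories)
print(singles[0])  # ('prostatic neoplasms', 3)
print(pairs[0])    # (('prostatic neoplasms', 'tumor markers'), 2)
```

Showing the top pairs first is one way to scan intersections of "important" categories before anything else; raw frequency is only a stand-in for importance here.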
Try a new tack
Researcher uses knowledge of field to
realize these are related to prostate
cancer and diagnostic tests
New tack: intersect search on all three
known genes
– Hope they all talk about diagnostics and
prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this
cancer
Formulate a Hypothesis
Hypothesis: mystery gene has to do with
regulation of expression of genes leading
to prostate cancer
New tack: do some lab tests
– See if mystery gene is similar in
molecular structure to the others
– If so, it might do some of the same
things they do
Strategies again
In hindsight, combining all three
genes was a good strategy.
– Store this for later
Might not have worked
– Need a suite of strategies
– Build them up via experience and a good
UI
The System
Doing the same query with slightly
different values each time is time-consuming and tedious
Same goes for cutting and pasting results
– IR systems don’t support varying queries
like this very well.
– Each situation is a bit different
Some automatic processing is needed in the
background to eliminate/suggest
hypotheses
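Generating the query variants is the easy part to automate. The sketch below enumerates every AND-combination of a set of terms; the " AND " query syntax and the query_variants name are assumptions for illustration, not the interface of any particular IR system.

```python
from itertools import combinations


def query_variants(terms, min_size=2):
    """Yield every AND-combination of `terms` with at least `min_size` terms."""
    for size in range(min_size, len(terms) + 1):
        for combo in combinations(terms, size):
            yield " AND ".join(combo)


for query in query_variants(["kallikrein", "PSA", "PAP"]):
    print(query)
# kallikrein AND PSA
# kallikrein AND PAP
# PSA AND PAP
# kallikrein AND PSA AND PAP
```

The number of variants grows exponentially with the number of terms, which is one reason background processing to prune hypotheses is needed.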
The UI part
Need support for building strategies
Mixed-initiative system
– Trade off between user-initiated hypotheses
exploration and system-initiated suggestions
Information visualization
– Another way to show lots of choices
Candidate Associations
Suggested Strategies
Current Retrieval Results
LINDI: Linking Information for
Novel Discovery and Insight
Just starting up now (fall 98)
Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman
Summary
The future: analyzing what the text
is about
– We don’t know how; text is tough!
– Idea: bring the user into the loop.
– Build up piecewise evidence to support
hypotheses
– Make use of partial domain models.
The Truth is Out There!
Summary
Text Data Mining:
– Extracting heretofore undiscovered
information from large text collections
Information Access ≠ TDM
– IA: locating already known information
that is currently of interest
Finding patterns across text is
already done in CL
– Tells us about the behavior of language
– Helps build very useful tools!