Text Data Mining
Prof. Marti Hearst
UC Berkeley SIMS
Guest Lecture, ME 290M
Prof. Agogino
May 4, 1999
There’s Lots of Text Out There
Is it Information Overload?
Why not TURBO-Text?
How can we SYNTHESIZE what’s there
to make new discoveries?
Talk Outline
Definitions
– What is Data Mining?
– What is Text Data Mining?
Text data mining examples
– Lexical knowledge acquisition
– Merging textual records
– Finding cures for diseases (from medical literature)
Future Directions
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns
from very large datasets.
A “regime” which enables people to interact
effectively with massive data stores.
Deriving new information from data.
– finding patterns across large datasets
– discovering heretofore unknown
information
What is Data Mining?
Potential point of confusion:
– The extracting ore from rock metaphor
does not really apply to the practice of
data mining
– If it did, then standard database queries
would fit under the rubric of data mining
» Find all employee records in which employee
earns $300/month less than their managers
– In practice, DM refers to:
» finding patterns across large datasets
» discovering heretofore unknown information
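The employee-salary query above really is just standard database retrieval, which is why it falls outside data mining. A minimal sketch using sqlite3 (the table, names, and salaries are invented for illustration; the query is a plain self-join, not pattern discovery):

```python
import sqlite3

# Toy employee table; schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL, manager TEXT)")
rows = [("ann", 2000, "carol"), ("bob", 2600, "carol"), ("carol", 2900, None)]
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", rows)

# "Find all employees who earn at least $300/month less than their
# manager": an ordinary self-join, i.e. retrieval rather than mining.
query = """
SELECT e.name
FROM emp e JOIN emp m ON e.manager = m.name
WHERE m.salary - e.salary >= 300
"""
underpaid = [r[0] for r in conn.execute(query)]
print(sorted(underpaid))  # ['ann', 'bob']
```

Nothing here looks across records for unexpected regularities; the answer is fully specified by the query, which is exactly the contrast the slide is drawing.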
Why Data Mining?
Because the data is there.
Because current DBMS technology
does not support data analysis.
Because
– larger disks
– faster cpus
– high-powered visualization
– networked information
are becoming widely available.
DM Touchstone Applications
(CACM 39 (11) Special Issue)
Finding patterns across data sets:
– Reports on changes in retail sales
» to improve sales
– Patterns of sizes of TV audiences
» for marketing
– Patterns in NBA play
» to alter, and so improve, performance
– Deviations in standard phone calling behavior
» to detect fraud
» for marketing
DM Touchstone Applications
(CACM 39 (11) Special Issue)
Separating signal from noise:
– Classifying faint astronomical objects
– Finding genes within DNA sequences
– Discovering novel tectonic activity
What is Text Data Mining?
People’s first thought:
– Make it easier to find things on the Web.
– This is information retrieval!
The metaphor of extracting ore from rock
does make sense for extracting documents
of interest from a huge pile.
But does not reflect notions of DM in
practice:
– finding patterns across large collections
– discovering heretofore unknown information
Text DM ≠ IR
Data Mining:
» Patterns, Nuggets, Exploratory Analysis
Information Retrieval:
– Finding and ranking documents that
match users’ information need
» ad hoc query
» filtering/standing query
– Rarely Patterns, Exploratory Analysis
Real Text DM
The point:
– Discovering heretofore unknown
information is not what we usually do
with text.
– (If it weren’t known, it could not have
been written by someone.)
However:
– There is a field whose goal is to learn
about patterns in text for its own sake
...
Computational Linguistics
Goal: automated language understanding
– this isn’t possible
– instead, go for subgoals, e.g.,
» word sense disambiguation
» phrase recognition
» semantic associations
Current approach:
– statistical analyses of very large text
collections
WordNet: A Lexical Database
A list of hypernyms for each sense of “crow”
Lexicographic
Knowledge Acquisition
Given a large lexical database ...
– Wordnet: Miller, Fellbaum et al. at
Princeton
– http://www.cogsci.princeton.edu/~wn
… and a huge text collection
– How to automatically add new relations?
Idea: Use Simple Lexico-Syntactic Analysis
Patterns of the following type work:
NP0 such as {NP1, NP2 …, (and | or)} NPi, i >= 1
implies
forall NPi, i >= 1, hyponym(NPi, NP0)
Example:
– “Agar is a substance prepared from a
mixture of red algae, such as Gelidium,
for laboratory or industrial use.”
– implies hyponym(“Gelidium”, “red algae”)
More Examples
“Felonies, such as shootings and
stabbings …” implies
– hyponym(shootings, felonies)
– hyponym(stabbings, felonies)
Is this in the WordNet hierarchy?
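The “such as” pattern can be approximated with a regular expression over raw text. A minimal sketch: noun phrases are crudely approximated here by single words (a real system would run a phrase parser first, so that “red algae” stays one NP):

```python
import re

# Crude approximation of the lexico-syntactic pattern
#   NP0 such as NP1, NP2 ..., (and|or) NPi  =>  hyponym(NPi, NP0)
# NPs are approximated by single words; a real system uses a phrase parser.
NP = r"[A-Za-z][A-Za-z-]*"
PATTERN = re.compile(
    rf"({NP}),?\s+such as\s+((?:{NP},\s*)*{NP}(?:,?\s+(?:and|or)\s+{NP})?)"
)

def hyponyms(sentence):
    """Return (hyponym, hypernym) pairs suggested by one sentence."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group(1)
        parts = re.split(r",\s*|\s+(?:and|or)\s+", m.group(2))
        pairs.extend((p, hypernym) for p in parts if p)
    return pairs

print(hyponyms("Felonies, such as shootings and stabbings, are rising."))
# [('shootings', 'Felonies'), ('stabbings', 'Felonies')]
```

The example sentence is the felonies one from the slides; the regex itself is only an illustration of the idea, not the implementation used in the original work.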
Linking Killing to Felonies
Another Example
Einstein is (was) a physicist.
Is/was he a genius?
Making Einstein a Genius
Results from the “such as” lexico-syntactic relation
Results with the “or other” lexico-syntactic relation
Procedure
Discover a pattern that indicates a
lexical relationship
Scan through a large collection;
extract sentences that match the
pattern
Extract the NPs from the sentence
– requires some phrase parsing
Check if suggested relation is in
WordNet or not
– this part not automated, but could be
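The final check, whether a suggested relation already holds in WordNet, amounts to walking the hypernym chain upward from the candidate hyponym. A minimal sketch over an invented toy hierarchy (real code would query WordNet itself, at the URL given above):

```python
# Toy stand-in for WordNet's hypernym links (hierarchy is invented).
HYPERNYM = {
    "stabbing": "felony",
    "shooting": "felony",
    "felony": "crime",
    "crow": "bird",
}

def is_known_hyponym(word, ancestor):
    """Follow hypernym links upward; True if `ancestor` is reached."""
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word == ancestor:
            return True
    return False

print(is_known_hyponym("stabbing", "crime"))  # True: stabbing -> felony -> crime
print(is_known_hyponym("crow", "felony"))     # False
```

A pair extracted from text but absent from this check is a candidate new relation to add to the database.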
Discovering New Patterns
Suggested algorithm:
– Decide on a lexical relation of interest, e.g.,
hyponymy
– Derive a list of word pairs from WordNet that
are known to hold that relation
» e.g., (crow, bird)
– Extract sentence from text collection in which
both terms occur
– Find commonalities among lexico-syntactic
context
– Test these out against other word pairs known
to hold the relationship in WordNet
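The algorithm above can be sketched concretely: take word pairs known to hold the relation, find sentences containing both words, and count the lexical contexts that recur between them. The corpus sentences below are invented for illustration, and the context extraction is deliberately simplistic:

```python
from collections import Counter
import re

# Known hyponym/hypernym pairs, as would be drawn from WordNet.
pairs = [("crow", "bird"), ("stabbing", "felony")]

# Stand-in text collection (sentences invented for illustration).
corpus = [
    "A bird such as a crow can mimic sounds.",
    "Any felony such as a stabbing is prosecuted.",
    "The crow is a very social bird.",
]

contexts = Counter()
for hypo, hyper in pairs:
    for sentence in corpus:
        # Grab the words between the hypernym and the hyponym,
        # allowing an article before the hyponym.
        m = re.search(rf"{hyper}\s+(.+?)\s+an?\s+{hypo}", sentence)
        if m:
            contexts[m.group(1)] += 1

print(contexts.most_common(1))  # [('such as', 2)]
```

The recurring context (“such as”) is exactly the kind of candidate pattern the procedure would then test against further word pairs from WordNet.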
Text Merging Example:
Discovering Hypocritical Congresspersons
Discovering Hypocritical
Congresspersons
Feb 1, 1996
– US House of Reps votes to pass
Telecommunications Reform Act
– this contains the CDA (Communications
Decency Act)
– violators subject to fines of $250,000
and 5 years in prison
– eventually struck down by court
Discovering Hypocritical
Congresspersons
Sept 11, 1998
– US House of Reps votes to place the
Starr report online
– the content would (most likely) have
violated the CDA
365 people were members for both
votes
– 284 members voted aye both times
» 185 (94%) Republicans voted aye both times
» 96 (57%) Democrats voted aye both times
How to find Hypocritical
Congresspersons?
This must have taken a lot of work
– Hand cutting and pasting
– Lots of picky details
» Some people voted on one but not the other bill
» Some people share the same name
Check for different county/state
Still messed up on “Bono”
– Taking stats at the end on various attributes
» Which state
» Which party
Tools should help streamline, reuse results
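The hand cutting-and-pasting above is essentially a record merge, and the picky details (absent voters, shared names) are join problems. A minimal sketch keying on (name, state) to disambiguate members who share a surname; all names and votes here are invented, not the actual roll calls:

```python
# Two roll-call votes as dicts keyed on (name, state); keying on the
# pair handles members who share a surname (the "Bono" problem).
vote_cda = {
    ("Smith", "TX"): "aye",
    ("Smith", "WA"): "nay",
    ("Jones", "OH"): "aye",
    ("Brown", "CA"): "aye",
}
vote_starr = {
    ("Smith", "TX"): "aye",
    ("Smith", "WA"): "aye",
    ("Jones", "OH"): "nay",
    # Brown missing: voted on one bill but not the other.
}

both = set(vote_cda) & set(vote_starr)  # members present for both votes
aye_both = [m for m in both
            if vote_cda[m] == vote_starr[m] == "aye"]
print(sorted(aye_both))  # [('Smith', 'TX')]
```

Once the merge is expressed this way, the end-of-analysis statistics (by state, by party) are simple group-bys over the joined records, which is the streamlining and reuse the slide calls for.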
How to find Hypocritical
Congresspersons?
The hard part?
– Knowing to compare these two sets of voting
records.
How to find causes of disease?
Don Swanson’s Medical Work
Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise
find causal links among titles
– symptoms
– drugs
– results
Swanson Example (1991)
Problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability
All extracted from medical journal titles
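Swanson's chain of title-level facts can be framed as linking: titles about migraine mention intermediate terms (stress, calcium channel blockers, SCD, platelet aggregability), and separate titles connect those same terms to magnesium. A minimal sketch over invented title keyword sets, not the actual MEDLINE data:

```python
# Swanson-style linking over (invented) journal-title keyword sets.
# A = migraine, C = magnesium; bridge terms B co-occur with both.
titles_migraine = [                    # titles mentioning migraine
    {"migraine", "stress"},
    {"migraine", "calcium channel blockers"},
    {"migraine", "spreading cortical depression"},
    {"migraine", "platelet aggregability"},
]
titles_magnesium = [                   # titles mentioning magnesium
    {"magnesium", "stress"},
    {"magnesium", "calcium channel blockers"},
    {"magnesium", "spreading cortical depression"},
    {"magnesium", "platelet aggregability"},
]

terms_a = set().union(*titles_migraine) - {"migraine"}
terms_c = set().union(*titles_magnesium) - {"magnesium"}
bridges = terms_a & terms_c  # B terms linking migraine to magnesium
print(sorted(bridges))
```

Each bridge term is only a candidate; as the next slide notes, medical expertise was still needed to decide which links were worth turning into hypotheses.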
Swanson’s TDM
Two of his hypotheses have received
some experimental verification.
His technique
– Only partially automated
– Required medical expertise
Few people are working on this.
How to Automate This?
Idea: mixed-initiative interaction
– User applies tools to help explore the
hypothesis space
– System runs suites of algorithms to help
explore the space, suggest directions
Our Proposed Approach
Three main parts
– UI for building/using strategies
– Backend for interfacing with various
databases and translating different
formats
– Content analysis/machine learning for
figuring out good hypotheses/throwing
out bad ones
The UI part
Need support for building strategies
Mixed-initiative system
– Trade off between user-initiated hypotheses
exploration and system-initiated suggestions
Information visualization
– Another way to show lots of choices
Candidate Associations
Suggested Strategies
Current Retrieval Results
Lindi: Linking Information for
Novel Discovery and Insight
Just starting up now (fall 98)
Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman
Summary
Text Data Mining:
– Extracting heretofore undiscovered
information from large text collections
– Not the same as information retrieval
Examples
– Lexicographic knowledge acquisition
– Merging of text representations
– Linking related information
The truth is out there!