
Text Data Mining
Prof. Marti Hearst
UC Berkeley SIMS
Guest Lecture, ME 290M
Prof. Agogino
May 4, 1999
There’s Lots of Text Out There

Is it Information Overload?
Why not TURBO-Text?
How can we SYNTHESIZE what’s there
to make new discoveries?
Talk Outline

Definitions
– What is Data Mining?
– What is Text Data Mining?

Text data mining examples
– Lexical knowledge acquisition
– Merging textual records
– Finding cures for diseases (from medical literature)

Future Directions
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)



 Fitting models to or determining patterns from very large datasets.
 A “regime” which enables people to interact effectively with massive data stores.
 Deriving new information from data.
– finding patterns across large datasets
– discovering heretofore unknown information
What is Data Mining?

Potential point of confusion:
– The extracting ore from rock metaphor
does not really apply to the practice of
data mining
– If it did, then standard database queries
would fit under the rubric of data mining
» Find all employee records in which the employee earns $300/month less than their manager
– In practice, DM refers to:
» finding patterns across large datasets
» discovering heretofore unknown information
Why Data Mining?
 Because the data is there.
 Because current DBMS technology does not support data analysis.
 Because
– larger disks
– faster CPUs
– high-powered visualization
– networked information
are becoming widely available.
DM Touchstone Applications
(CACM 39 (11) Special Issue)

Finding patterns across data sets:
– Reports on changes in retail sales
» to improve sales
– Patterns of sizes of TV audiences
» for marketing
– Patterns in NBA play
» to alter, and so improve, performance
– Deviations in standard phone calling behavior
» to detect fraud
» for marketing
DM Touchstone Applications
(CACM 39 (11) Special Issue)

Separating signal from noise:
– Classifying faint astronomical objects
– Finding genes within DNA sequences
– Discovering novel tectonic activity
What is Text Data Mining?

People’s first thought:
– Make it easier to find things on the Web.
– This is information retrieval!


 The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
 But it does not reflect notions of DM in practice:
– finding patterns across large collections
– discovering heretofore unknown information
Text DM ≠ IR

Data Mining:
» Patterns, Nuggets, Exploratory Analysis

Information Retrieval:
– Finding and ranking documents that
match users’ information need
» ad hoc query
» filtering/standing query
– Rarely Patterns, Exploratory Analysis
Real Text DM

The point:
– Discovering heretofore unknown
information is not what we usually do
with text.
– (If it weren’t known, it could not have been written down by someone.)

However:
– There is a field whose goal is to learn
about patterns in text for its own sake
...
Computational Linguistics

Goal: automated language understanding
– this isn’t possible
– instead, go for subgoals, e.g.,
» word sense disambiguation
» phrase recognition
» semantic associations

Current approach:
– statistical analyses of very large text
collections
WordNet: A Lexical Database
[Figure: a WordNet screen showing the list of hypernyms for each sense of “crow”]
Lexicographic
Knowledge Acquisition

Given a large lexical database ...
– Wordnet: Miller, Fellbaum et al. at
Princeton
– http://www.cogsci.princeton.edu/~wn

… and a huge text collection
– How to automatically add new relations?
Idea: Use Simple Lexico-Syntactic Analysis

Patterns of the following type work (a sketch follows the example below):
NP0 such as {NP1, NP2 …, (and | or)} NPi, i >= 1
implies
for all NPi, i >= 1: hyponym(NPi, NP0)

Example:
– “Agar is a substance prepared from a
mixture of red algae, such as Gelidium,
for laboratory or industrial use.”
– implies hyponym(“Gelidium”, “red algae”)
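To make this concrete, here is a minimal Python sketch of matching the “such as” pattern (our illustration, not the original system’s code): the two-word hypernym window and the stopword cutoff are crude stand-ins for the phrase parsing a real system would use.

import re

# Words that signal the coordinated NP list has ended (crude heuristic).
STOP = {"a", "an", "the", "for", "of", "in", "on", "to", "with",
        "is", "are", "was", "were"}

def hyponym_candidates(sentence):
    """Yield (hyponym, hypernym) pairs suggested by
    'NP0 such as NP1, NP2 ... (and|or) NPn'."""
    # Up to two words before 'such as' stand in for the hypernym NP.
    m = re.search(r"(\w+(?:\s+\w+)?)\s*,?\s+such as\s+(.*)", sentence)
    if not m:
        return
    hypernym, rest = m.group(1), m.group(2)
    # Split the coordinated list on commas and a final 'and'/'or'.
    for chunk in re.split(r",|\band\b|\bor\b", rest):
        words = re.findall(r"[\w-]+", chunk)
        if not words or words[0].lower() in STOP:
            break  # the coordinated list has ended
        yield " ".join(words[:2]), hypernym

sent = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use.")
print(list(hyponym_candidates(sent)))
# -> [('Gelidium', 'red algae')]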
More Examples

“Felonies, such as shootings and
stabbings …” implies
– hyponym(shootings, felonies)
– hyponym(stabbings, felonies)

Is this in the WordNet hierarchy?
Linking Killing to Felonies
Another Example
 Einstein is (was) a physicist.
 Is/was he a genius?

Making Einstein a Genius
Results from the “such as” lexico-syntactic relation
Results with the “or other” lexico-syntactic relation
Procedure
 Discover a pattern that indicates a lexical relationship
 Scan through a large collection; extract sentences that match the pattern
 Extract the NPs from the sentence
– requires some phrase parsing
 Check if the suggested relation is in WordNet or not
– this part not automated, but could be (see the sketch below)
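The WordNet lookup in that last step could be automated along these lines. A minimal sketch assuming NLTK’s WordNet interface; the helper name in_wordnet is ours, not part of the original system.

# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def in_wordnet(hyponym_word, hypernym_word):
    """True if some sense of hyponym_word has some sense of
    hypernym_word among its transitive hypernyms."""
    hyper_senses = set(wn.synsets(hypernym_word))
    for sense in wn.synsets(hyponym_word):
        # Walk the full hypernym chain of this sense.
        if hyper_senses & set(sense.closure(lambda s: s.hypernyms())):
            return True
    return False

print(in_wordnet("crow", "bird"))        # True: already in WordNet
print(in_wordnet("shooting", "felony"))  # a suggested relation to review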
Discovering New Patterns

Suggested algorithm:
– Decide on a lexical relation of interest, e.g.,
hyponymy
– Derive a list of word pairs from WordNet that
are known to hold that relation
» e.g., (crow, bird)
– Extract sentences from the text collection in which both terms occur
– Find commonalities among the lexico-syntactic contexts
– Test these against other word pairs known to hold the relationship in WordNet (a sketch follows)
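A minimal sketch of that loop, over an illustrative corpus and pair list; a real system would normalize morphology, try both term orders, and scan a far larger collection.

import re
from collections import Counter

# Word pairs known from WordNet to stand in a hyponym/hypernym relation,
# plus a toy corpus; both are illustrative placeholders.
known_pairs = [("crow", "bird"), ("stabbing", "crime")]
corpus = [
    "A crow is a kind of bird found worldwide.",
    "Birds such as crows and ravens are highly intelligent.",
    "Crimes such as stabbings carry long sentences.",
]

contexts = Counter()
for hypo, hyper in known_pairs:
    for sentence in corpus:
        low = sentence.lower()
        # Record the text between '<hypernym> ... <hyponym>'; a real system
        # would also try the reverse order.
        m = re.search(rf"\b{hyper}s?\b(.*?)\b{hypo}s?\b", low)
        if m:
            contexts[m.group(1).strip()] += 1

# Contexts that recur across many known pairs are candidate patterns.
print(contexts.most_common())   # -> [('such as', 2)]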
Text Merging Example:
Discovering Hypocritical Congresspersons
Discovering Hypocritical
Congresspersons

Feb 1, 1996
– US House of Reps votes to pass
Telecommunications Reform Act
– this contains the CDA (Communications
Decency Act)
– violators subject to fines of $250,000
and 5 years in prison
– eventually struck down by court
Discovering Hypocritical
Congresspersons

Sept 11, 1998
– US House of Reps votes to place the
Starr report online
– the content would (most likely) have
violated the CDA

365 people were members for both
votes
– 284 members voted aye both times
» 185 (94%) Republicans voted aye both times
» 96 (57%) Democrats voted aye both times
How to find Hypocritical
Congresspersons?

This must have taken a lot of work
– Hand cutting and pasting
– Lots of picky details
» Some people voted on one bill but not the other
» Some people share the same name
 Check for different county/state
 Still messed up on “Bono”
– Taking stats at the end on various attributes
» Which state
» Which party

 Tools should help streamline and reuse results (see the sketch below)
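A minimal sketch of the merge such a tool would automate, with illustrative records rather than the real 1996 and 1998 roll calls. Note that keying on (name, state) silently conflates Sonny and Mary Bono, exactly the failure noted above.

from collections import Counter

# Illustrative records, not the real CDA and Starr roll calls:
# (name, state) -> (party, vote).  "Bono" appears in both, but is a
# different person each time (Sonny in 1996, Mary in 1998), so this
# join quietly merges them.
vote_cda = {("Smith", "TX"): ("R", "aye"),
            ("Jones", "CA"): ("D", "nay"),
            ("Bono",  "CA"): ("R", "aye")}    # Sonny Bono
vote_starr = {("Smith", "TX"): ("R", "aye"),
              ("Jones", "CA"): ("D", "aye"),
              ("Bono",  "CA"): ("R", "aye")}  # Mary Bono

both_aye = Counter()
for key in vote_cda.keys() & vote_starr.keys():   # present for both votes
    (party, v1), (_, v2) = vote_cda[key], vote_starr[key]
    if v1 == v2 == "aye":
        both_aye[party] += 1

print(dict(both_aye))   # -> {'R': 2}  (the Bonos wrongly counted as one)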
How to find Hypocritical
Congresspersons?

The hard part?
– Knowing to compare these two sets of voting records.
How to find causes of disease?
Don Swanson’s Medical Work

Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise

find causal links among titles
– symptoms
– drugs
– results
Swanson Example (1991)

Problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability

All extracted from medical journal titles
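A minimal sketch of the linking idea over the titles above; the term list is an illustrative stand-in for a medical thesaurus, and in Swanson’s actual setting the candidate treatment C is not known in advance but searched for outward from each intermediate term B.

# Toy titles drawn from the slide; terms are illustrative placeholders.
titles = [
    "stress associated with migraine",
    "stress leads to loss of magnesium",
    "calcium channel blockers prevent some migraines",
    "magnesium is a natural calcium channel blocker",
    "high platelet aggregability in migraine patients",
    "magnesium can suppress platelet aggregability",
]
terms = {"stress", "magnesium", "calcium channel blocker",
         "platelet aggregability", "migraine"}

def terms_in(title):
    return {t for t in terms if t in title}

A, C = "migraine", "magnesium"   # problem term and candidate treatment
linked_to_A = set().union(*(terms_in(t) - {A} for t in titles if A in t))
linked_to_C = set().union(*(terms_in(t) - {C} for t in titles if C in t))

# Intermediate B-terms connecting A and C via two separate literatures:
print(linked_to_A & linked_to_C)
# -> {'stress', 'calcium channel blocker', 'platelet aggregability'}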
Swanson’s TDM
 Two of his hypotheses have received some experimental verification.
 His technique
– Only partially automated
– Required medical expertise
 Few people are working on this.
How to Automate This?

Idea: mixed-initiative interaction
– User applies tools to help explore the
hypothesis space
– System runs suites of algorithms to help
explore the space, suggest directions
Our Proposed Approach

Three main parts
– UI for building/using strategies
– Backend for interfacing with various
databases and translating different
formats
– Content analysis/machine learning for
figuring out good hypotheses/throwing
out bad ones
The UI part


 Need support for building strategies
 Mixed-initiative system
– Trade-off between user-initiated hypothesis exploration and system-initiated suggestions

Information visualization
– Another way to show lots of choices
[Screenshots: Candidate Associations, Suggested Strategies, Current Retrieval Results]
Lindi: Linking Information for
Novel Discovery and Insight
 Just starting up now (fall 98)
 Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

Summary

Text Data Mining:
– Extracting heretofore undiscovered
information from large text collections
– Not the same as information retrieval

Examples
– Lexicographic knowledge acquisition
– Merging of text representations
– Linking related information

The truth is out there!