lindi-guir - UC Berkeley School of Information
Text Tango:
A New Text Data
Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998
Talk Outline
What is Data Mining?
What isn’t Text Data Mining?
What is Text Data Mining?
Examples
A proposal for a system for Text Data
Mining
Marti A. Hearst
UC Berkeley SIMS 1998
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining
patterns from very large datasets.
A “regime” which enables people to
interact effectively with massive data
stores.
Deriving new information from data.
finding patterns across large datasets
discovering heretofore unknown
information
What is Data Mining?
Potential point of confusion:
The extracting ore from rock metaphor
does not really apply to the practice of data
mining
If it did, then standard database queries
would fit under the rubric of data mining
Find all employee records in which the employee
earns $300/month less than their manager
In practice, DM refers to:
finding patterns across large datasets
discovering heretofore unknown information
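To make the contrast concrete, the employee query above is an ordinary lookup that a relational database answers with a self-join. The sketch below uses Python's built-in sqlite3 module; the table layout, names, and salaries are invented for illustration, and "$300/month less" is read here as "at least $300 less", which is an assumption.

```python
import sqlite3

# Hypothetical toy table; every name and salary below is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", 5000, None),  # top of the hierarchy, has no manager
        (2, "Ben", 4800, 1),     # 200 below manager
        (3, "Cy", 4500, 1),      # 500 below manager
        (4, "Dee", 4200, 3),     # 300 below manager
    ],
)

# Self-join: pair each employee with their manager, then filter on the gap.
# Assumption: "$300/month less" means "at least $300 less".
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.id
    """
).fetchall()
underpaid_names = [row[0] for row in rows]
print(underpaid_names)
```

The query returns exactly the records that match a condition stated in advance. Nothing previously unknown is discovered, which is why it does not count as data mining in the sense used here.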
DM Touchstone Applications
(CACM 39 (11) Special Issue)
Finding patterns across data sets:
Reports on changes in retail sales
to improve sales
Patterns of sizes of TV audiences
for marketing
Patterns in NBA play
to alter, and so improve, performance
Deviations in standard phone calling behavior
to detect fraud
for marketing
What is Text Data Mining?
People’s first thought:
Make it easier to find things on the Web.
This is information retrieval!
The metaphor of extracting ore from
rock does make sense for extracting
documents of interest from a huge pile.
But does not reflect notions of DM in
practice:
finding patterns across large collections
discovering heretofore unknown information
Text DM != IR
Data Mining:
Patterns, Nuggets, Exploratory Analysis
Information Retrieval:
Finding and ranking documents that match
users’ information need
ad hoc query
filtering/standing query
Real Text DM
What would finding a pattern across a
large text collection really look like?
Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)
Real Text DM
The point:
Discovering heretofore unknown
information is not what we usually do with
text.
(If it weren’t known, it could not have been
written by someone.)
However:
There are some interesting problems of
this type!
Combining Data Types
for Novel Tasks
Text + Links to find “authority pages”
(Kleinberg at Cornell, Page at Stanford)
Usage + Time + Links to study evolution
of web and information use (Pitkow et al. at
PARC)
Ore-Filled Text Collections
Congressional Voting Records
Answer questions like:
Who are the most hypocritical congresspeople?
Medical Articles
Create hypotheses about causes of rare
diseases
Create hypotheses about gene function
Patent Law
Answer questions like:
Is government funding of research worthwhile?
How to find Hypocritical
Congresspersons?
This must have taken a lot of work
Hand cutting and pasting
Lots of picky details
Some people voted on one but not the other bill
Some people share the same name
Check for different county/state
Still messed up on “Bono”
Taking stats at the end on various
attributes
Which state
Which party
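The picky details listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the vote records are invented, members are keyed by (name, state) so that people sharing a name stay distinct, and "hypocritical" is reduced to a hypothetical pattern of "yes" on one bill and "no" on the other. The function and variable names are not from the talk.

```python
from collections import Counter

# Hypothetical vote records: (name, state, party, vote). All values invented.
bill_a = [
    ("Smith", "CA", "X", "yes"),
    ("Smith", "TX", "Y", "no"),
    ("Jones", "NY", "X", "yes"),
    ("Lee", "OH", "Y", "yes"),
]
bill_b = [
    ("Smith", "CA", "X", "no"),
    ("Smith", "TX", "Y", "no"),
    ("Lee", "OH", "Y", "no"),
    ("Park", "WA", "X", "yes"),
]


def index_votes(records):
    """Key each record by (name, state) so same-named members stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}


def flag_inconsistent(votes_a, votes_b, pattern=("yes", "no")):
    """Return (name, state, party) for members whose two votes match `pattern`."""
    a, b = index_votes(votes_a), index_votes(votes_b)
    flagged = []
    # Only members who voted on both bills can be compared at all.
    for key in sorted(a.keys() & b.keys()):
        party, vote_a = a[key]
        _, vote_b = b[key]
        if (vote_a, vote_b) == pattern:
            flagged.append((key[0], key[1], party))
    return flagged


flagged = flag_inconsistent(bill_a, bill_b)
# Stats at the end on various attributes.
by_party = Counter(party for _, _, party in flagged)
by_state = Counter(state for _, state, _ in flagged)
print(flagged, dict(by_party), dict(by_state))
```

Keying on name alone would merge the two invented Smiths into one record, which is the kind of error the slide warns about.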
How to find functions of genes?
Important problem in molecular
biology
Have the genetic sequence
Don’t know what it does
But …
Know which genes it coexpresses with
Some of these have known function
So … Infer function based on function of
co-expressed genes
This is new work by Michael Walker and others
at Incyte Pharmaceuticals
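The inference step can be illustrated with a simple tally. This is not the Incyte method, which the talk does not describe; it is a minimal sketch that assumes the only evidence is the set of co-expressed genes and their known functions. Gene names, function labels, and the helper name guess_function are placeholders.

```python
from collections import Counter


def guess_function(gene, coexpressed, known_function):
    """Tally the known functions among a gene's co-expression partners.

    Partners with no known function are skipped. The result lists
    (function, count) pairs, most frequent first.
    """
    partners = coexpressed.get(gene, [])
    tally = Counter(known_function[p] for p in partners if p in known_function)
    return tally.most_common()


# Placeholder data: gene names and function labels are invented.
coexpressed = {"mystery": ["g1", "g2", "g3", "g4"]}
known_function = {"g1": "function A", "g2": "function A", "g3": "function B"}

candidates = guess_function("mystery", coexpressed, known_function)
print(candidates)
```

The tally only ranks candidate hypotheses. As the later slides note, a researcher still has to check them against the literature and in the lab.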
Gene Co-expression:
Role in the genetic pathway
[Diagram: candidate positions of the unknown gene (g?, h?) relative to Kall., PSA, and PAP in the pathway]
Other possibilities as well
Make use of the literature
Look up what is known about the
other genes.
Different articles in different
collections
Look for commonalities
Similar topics indicated by Subject
Descriptors
Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
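One rough way to surface shared vocabulary is to count how many documents each word appears in. The sketch below assumes plain-text titles, a tiny hand-written stopword list, and a hypothetical helper named shared_terms. The three titles are invented, although they borrow words from the slide.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "of", "the", "to", "with"}


def shared_terms(texts, min_docs=2):
    """Return (term, document_count) for terms found in at least `min_docs` texts."""
    doc_freq = Counter()
    for text in texts:
        # A set counts each term at most once per document.
        words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_freq.update(words)
    hits = [(term, n) for term, n in doc_freq.items() if n >= min_docs]
    # Most widely shared first, ties broken alphabetically.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Invented titles, for illustration only.
titles = [
    "Tumor markers in prostate adenocarcinoma",
    "Antibodies to a prostate antigen",
    "Serum tumor markers as diagnostic tests",
]
common = shared_terms(titles)
print(common)
```

Subject descriptors could be handled the same way by treating each assigned descriptor as a single term instead of splitting the text into words.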
Developing Strategies
Different strategies seem needed
for different situations
First: see what is known about Kallikrein.
7341 documents. Too many
AND the result with “disease” category
If result is non-empty, this might be an
interesting gene
Now get 803 documents
AND the result with PSA
Get 11 documents. Better!
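The narrowing steps above amount to intersecting result sets one term at a time and watching the counts shrink. A minimal sketch, assuming a toy inverted index that maps each term to a set of document ids; the ids are invented and do not reproduce the 7341, 803, and 11 counts from the talk.

```python
def narrow(index, terms):
    """AND the terms together in order, recording the result size after each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        # The first term seeds the result; each later term is ANDed in.
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace


# Toy inverted index: term -> ids of documents that mention it (invented).
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7, 8},
    "PSA": {3, 4, 9},
}

docs, trace = narrow(index, ["kallikrein", "disease", "PSA"])
print(trace)
print(sorted(docs))
```

The trace keeps the count after every step, so a strategy that narrows too quickly or not at all is easy to spot.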
Developing Strategies
Look for commonalities among these
documents
Manual scan through ~100 category labels
Would have been better if
Automatically organized
Intersections of “important” categories scanned
for first
Try a new tack
Researcher uses knowledge of field to
realize these are related to prostate
cancer and diagnostic tests
New tack: intersect search on all three
known genes
Hope they all talk about diagnostics and
prostate cancer
Fortunately, 7 documents returned
Bingo! A relation to regulation of this
cancer
Formulate a Hypothesis
Hypothesis: mystery gene has to do
with regulation of expression of genes
leading to prostate cancer
New tack: do some lab tests
See if mystery gene is similar in molecular
structure to the others
If so, it might do some of the same things
they do
Strategies again
In hindsight, combining all three
genes was a good strategy.
Store this for later
Might not have worked
Need a suite of strategies
Build them up via experience and a good
UI
The System
Doing the same query with slightly different
values each time is time-consuming and
tedious
Same goes for cutting and pasting results
IR systems don’t support varying queries
like this very well.
Each situation is a bit different
Some automatic processing is needed in the
background to eliminate/suggest hypotheses
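The tedium described above is the kind of thing a small loop can take over. The sketch below is only an illustration, not the proposed system: it runs the same AND query over every combination of the known genes and discards combinations that return nothing. The index contents, the gene labels used as keys, and the helper name sweep_combinations are assumptions.

```python
from itertools import combinations


def sweep_combinations(index, terms, max_size=3):
    """Run an AND query for every combination of terms up to `max_size`.

    Returns {combination: matching document ids}, keeping only the
    combinations that match at least one document.
    """
    results = {}
    for size in range(1, max_size + 1):
        for combo in combinations(terms, size):
            hits = set.intersection(*(index.get(t, set()) for t in combo))
            if hits:  # an empty result suggests nothing to follow up
                results[combo] = hits
    return results


# Toy index: gene label -> ids of documents mentioning it (invented).
gene_index = {"Kall": {1, 2, 3}, "PSA": {2, 3, 4}, "PAP": {4, 5}}

results = sweep_combinations(gene_index, ["Kall", "PSA", "PAP"])
for combo, hits in results.items():
    print(" AND ".join(combo), sorted(hits))
```

In a mixed-initiative setting, a sweep like this could run in the background and pass only the non-empty combinations to the user as suggestions.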
The System
Three main parts
UI for building/using strategies
Backend for interfacing with various
databases and translating different formats
Content analysis/machine learning for
figuring out good hypotheses/throwing out
bad ones
The UI part
Need support for building strategies
Lots of info lying around, so a nice option is ...
Two-handed interface
Big table display
Mixed-initiative system
Trade off between user-initiated hypotheses
exploration and system-initiated suggestions
Information visualization
Another way to show lots of choices
[UI mockup with three panels: Candidate Associations, Suggested Strategies, Current Retrieval Results]
Other applications
Patent example
Political example
The truth’s out there!
Text Tango
Just starting up now.
Let me know if you’d like to work
on it!