lindi-guir - UC Berkeley School of Information


Text Tango: A New Text Data Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998

Talk Outline

  What is Data Mining?
  What isn’t Text Data Mining?
  What is Text Data Mining?
  Examples
  A proposal for a system for Text Data Mining
Marti A. Hearst
UC Berkeley SIMS 1998

What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)

  Fitting models to or determining patterns from very large datasets.
  A “regime” which enables people to interact effectively with massive data stores.
  Deriving new information from data.
    finding patterns across large datasets
    discovering heretofore unknown information

What is Data Mining?

  Potential point of confusion:
    The “extracting ore from rock” metaphor does not really apply to the practice of data mining
    If it did, then standard database queries would fit under the rubric of data mining
      Find all employee records in which the employee earns $300/month less than their manager
  In practice, DM refers to:
    finding patterns across large datasets
    discovering heretofore unknown information
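
To make the contrast concrete: the employee/manager lookup on this slide is an ordinary self-join, not data mining. The sketch below uses Python's built-in sqlite3 module. The table layout, names, and salaries are invented for illustration, and “earns $300/month less” is read here as “at least $300 less”, which is an assumption.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", 5000, None),  # top-level manager
        (2, "Ben", 4800, 1),     # 200 less than Ana
        (3, "Cal", 4500, 1),     # 500 less than Ana
        (4, "Dee", 4100, 3),     # 400 less than Cal
    ],
)

# Assumption: "$300/month less" means "at least $300 less than the manager".
rows = conn.execute(
    """
    SELECT e.name, m.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.name
    """
).fetchall()

for employee, manager in rows:
    print(f"{employee} earns at least $300/month less than {manager}")
```

The query only retrieves records that were already stored. No pattern across the dataset is derived, which is why it falls outside DM as used in practice.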

DM Touchstone Applications
(CACM 39 (11) Special Issue)

  Finding patterns across data sets:
    Reports on changes in retail sales
      to improve sales
    Patterns of sizes of TV audiences
      for marketing
    Patterns in NBA play
      to alter, and so improve, performance
    Deviations in standard phone calling behavior
      to detect fraud
      for marketing

What is Text Data Mining?

  People’s first thought:
    Make it easier to find things on the Web.
    This is information retrieval!
  The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
  But it does not reflect notions of DM in practice:
    finding patterns across large collections
    discovering heretofore unknown information

Text DM != IR

  Data Mining:
    Patterns, Nuggets, Exploratory Analysis
  Information Retrieval:
    Finding and ranking documents that match users’ information need
      ad hoc query
      filtering/standing query

Real Text DM

  What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)

Real Text DM

  The point:
    Discovering heretofore unknown information is not what we usually do with text.
      (If it weren’t known, it could not have been written by someone.)
  However:
    There are some interesting problems of this type!

Combining Data Types for Novel Tasks

  Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
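
The “authority pages” idea can be illustrated with a small sketch of hub/authority scoring in the style of Kleinberg's HITS algorithm. The toy link graph, page names, and iteration count are assumptions for illustration. This is a simplified version that uses links only, not a description of any deployed system.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a {page: [linked pages]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy web: three pages point at "survey", one also points at "blog".
toy_links = {
    "a": ["survey"],
    "b": ["survey"],
    "c": ["survey", "blog"],
    "survey": [],
    "blog": [],
}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_authority, best_hub)
```

Pages that many good hubs point to end up with high authority scores, and pages that point to many good authorities end up with high hub scores.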

Ore-Filled Text Collections

  Congressional Voting Records
    Answer questions like:
      Who are the most hypocritical congresspeople?
  Medical Articles
    Create hypotheses about causes of rare diseases
    Create hypotheses about gene function
  Patent Law
    Answer questions like:
      Is government funding of research worthwhile?

How to find Hypocritical Congresspersons?

  This must have taken a lot of work
    Hand cutting and pasting
    Lots of picky details
      Some people voted on one but not the other bill
      Some people share the same name
        Check for different county/state
        Still messed up on “Bono”
    Taking stats at the end on various attributes
      Which state
      Which party
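
The bookkeeping on this slide can be sketched as a join of two roll-call votes. The records, and the reading of “hypocritical” as casting opposite votes on two related bills, are assumptions for illustration; the slides do not define the criterion. Keying on (name, state) is one way to keep apart members who share a name, and members who voted on only one bill are skipped.

```python
from collections import Counter

# Hypothetical roll-call records: (name, state, party, vote). Invented data.
bill_a = [
    ("Smith", "CA", "D", "yes"),
    ("Smith", "TX", "R", "yes"),
    ("Jones", "NY", "D", "no"),
    ("Lee", "OH", "R", "yes"),
]
bill_b = [
    ("Smith", "CA", "D", "yes"),
    ("Smith", "TX", "R", "no"),
    ("Jones", "NY", "D", "no"),
    # "Lee" did not vote on bill B.
]

def index_votes(records):
    """Key each vote by (name, state) so that shared surnames stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}

votes_a = index_votes(bill_a)
votes_b = index_votes(bill_b)

# Only members who voted on both bills can be compared.
voted_on_both = sorted(votes_a.keys() & votes_b.keys())

# Assumed criterion: opposite votes on the two bills.
inconsistent = [
    key for key in voted_on_both if votes_a[key][1] != votes_b[key][1]
]

# Summary statistics by attribute, as the slide suggests.
by_state = Counter(state for _, state in inconsistent)
by_party = Counter(votes_a[key][0] for key in inconsistent)

print(inconsistent, dict(by_state), dict(by_party))
```

A (name, state) key would still conflate two members with the same surname from the same state, which is the kind of picky detail the slide warns about.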


How to find functions of genes?

  Important problem in molecular biology
    Have the genetic sequence
    Don’t know what it does
    But …
      Know which genes it coexpresses with
      Some of these have known function
    So … Infer function based on function of co-expressed genes
  This is new work by Michael Walker and others at Incyte Pharmaceuticals
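
A minimal sketch of the inference step, assuming a simple vote over the known functions of co-expressed genes. The gene names, function labels, and the voting rule are hypothetical; the slides do not describe the method actually used at Incyte.

```python
from collections import Counter

# Hypothetical data: which genes co-express, and which functions are known.
coexpressed_with = {
    "mystery_gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
}
known_function = {
    "gene_a": "tumor marker",
    "gene_b": "tumor marker",
    "gene_c": "protease",
    # gene_d has no known function yet.
}

def infer_function(gene):
    """Rank candidate functions by how many co-expressed genes carry them."""
    votes = Counter(
        known_function[partner]
        for partner in coexpressed_with.get(gene, [])
        if partner in known_function
    )
    return votes.most_common()

candidates = infer_function("mystery_gene")
print(candidates)
```

The ranked list is only a set of hypotheses to check against the literature and in the lab, not a conclusion.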

Gene Co-expression: Role in the genetic pathway

  [Figure: pathway diagrams with nodes labeled Kall., PSA, PAP, and the unknown genes g? and h?]
  Other possibilities as well

Make use of the literature

  Look up what is known about the other genes.
    Different articles in different collections
    Look for commonalities
      Similar topics indicated by Subject Descriptors
      Similar words in titles and abstracts
        adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
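
Looking for commonalities can be sketched as finding the terms that appear in the literature on every gene of interest. The titles below are invented stand-ins, and the whitespace tokenizer and tiny stopword list are simplifying assumptions.

```python
# Hypothetical titles/abstracts retrieved for each known gene (invented text).
docs_by_gene = {
    "Kallikrein": [
        "kallikrein expression in prostate adenocarcinoma",
        "tumor markers and kallikrein antibodies",
    ],
    "PSA": [
        "PSA as tumor markers for prostate neoplasm",
    ],
    "PAP": [
        "PAP antibodies in prostate tumor diagnosis",
    ],
}

STOPWORDS = {"in", "and", "as", "for", "of", "the"}

def vocabulary(docs):
    """Set of lowercase terms across a gene's documents, minus stopwords."""
    return {
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in STOPWORDS
    }

# Terms that occur in the literature of every gene.
shared_terms = set.intersection(*(vocabulary(d) for d in docs_by_gene.values()))
print(sorted(shared_terms))
```

The same intersection could be run over Subject Descriptors instead of title words when the collections provide them.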

Developing Strategies

  Different strategies seem needed for different situations
    First: see what is known about Kallikrein.
      7341 documents. Too many
    AND the result with “disease” category
      If result is non-empty, this might be an interesting gene
      Now get 803 documents
    AND the result with PSA
      Get 11 documents. Better!
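
The narrowing sequence on this slide amounts to successive Boolean AND (set intersection) over result sets. Below is a minimal sketch with an invented toy index; the document IDs are placeholders and do not reproduce the 7341/803/11 counts from the slide.

```python
# Hypothetical inverted index: term or category -> set of document IDs.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6, 7, 8},
    "category:disease": {2, 3, 5, 8, 9},
    "psa": {3, 8, 10},
}

def narrow(index, terms):
    """AND the terms together one at a time, recording the size at each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace

result, trace = narrow(index, ["kallikrein", "category:disease", "psa"])
for term, size in trace:
    print(f"after AND {term!r}: {size} documents")
```

Recording the size after each step mirrors the slide's reasoning: too many documents means keep narrowing, and an empty result means back off.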

Developing Strategies

  Look for commonalities among these documents
    Manual scan through ~100 category labels
    Would have been better if
      Automatically organized
      Intersections of “important” categories scanned for first


Try a new tack

  Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
  New tack: intersect search on all three known genes
    Hope they all talk about diagnostics and prostate cancer
    Fortunately, 7 documents returned
    Bingo! A relation to regulation of this cancer

Formulate a Hypothesis

  Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
  New tack: do some lab tests
    See if mystery gene is similar in molecular structure to the others
    If so, it might do some of the same things they do

Strategies again

  In hindsight, combining all three genes was a good strategy.
    Store this for later
  Might not have worked
    Need a suite of strategies
    Build them up via experience and a good UI


The System

  Doing the same query with slightly different values each time is time-consuming and tedious
  Same goes for cutting and pasting results
    IR systems don’t support varying queries like this very well.
    Each situation is a bit different
  Some automatic processing is needed in the background to eliminate/suggest hypotheses
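
One way to reduce the repetition described on this slide is to store a strategy as a reusable function and replay it with different values. This is a sketch under assumptions: `search` is a hypothetical stand-in for whatever retrieval backend is used, and the toy index is invented.

```python
# Hypothetical stand-in for a retrieval backend: term -> set of document IDs.
TOY_INDEX = {
    "kallikrein": {1, 2, 3, 4},
    "psa": {2, 3, 5},
    "pap": {3, 4, 5},
    "category:disease": {1, 2, 3, 5},
}

def search(term):
    """Assumed backend call; a real system would query a database here."""
    return TOY_INDEX.get(term.lower(), set())

def intersect_all(terms):
    """Stored strategy: AND together the result sets of all given terms."""
    results = [search(t) for t in terms]
    return set.intersection(*results) if results else set()

def run_strategy(strategy, variations):
    """Replay one strategy over many slightly different queries."""
    return {tuple(terms): strategy(terms) for terms in variations}

variations = [
    ["Kallikrein", "category:disease"],
    ["Kallikrein", "PSA"],
    ["Kallikrein", "PSA", "PAP"],
]
outcomes = run_strategy(intersect_all, variations)
for terms, docs in outcomes.items():
    print(" AND ".join(terms), "->", sorted(docs))
```

Keeping strategies as named functions means a tack that worked once, such as intersecting all known genes, can be stored and tried again in a new situation.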

The System

  Three main parts
    UI for building/using strategies
    Backend for interfacing with various databases and translating different formats
    Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones


The UI part

  Need support for building strategies
  Lots of info lying around, so a nice option is ...
    Two-handed interface
    Big table display
  Mixed-initiative system
    Trade off between user-initiated hypothesis exploration and system-initiated suggestions
  Information visualization
    Another way to show lots of choices

[Figure: display with panels labeled Candidate Associations, Suggested Strategies, and Current Retrieval Results]

Other applications

  Patent example
  Political example
  The truth’s out there!


Text Tango

  Just starting up now.
  Let me know if you’d like to work on it!
