SIMS 290-2: Applied Natural Language Processing: Marti
SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
October 20, 2004
1
Untangling Text Data Mining
(updated from a 1999 lecture)
2
Outline
Untangling several different fields
DM, CL, IA, TDM
TDM examples
TDM as Exploratory Data Analysis
New Problems for Computational Linguistics
Our current efforts
3
Classifying Application Types
                   Patterns                   Non-Novel Nuggets       Novel Nuggets
Non-textual data   Standard data mining       Database queries        Automated reasoning (AI)
Textual data       Computational linguistics  Information retrieval   Real text data mining
4
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns from very
large datasets.
A “regime” which enables people to interact
effectively with massive data stores.
Deriving new information from data.
5
Why Data Mining?
Because the data is there.
Because
larger disks
faster CPUs
high-powered visualization
networked information
are now widely available.
6
Knowledge Discovery from Data
(KDD)
KDD: The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data.
(Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process
7
Data Mining Applications
(CACM 39 (11) Special Issue)
Finding patterns across data sets:
Reports on changes in retail sales
– to improve sales
Patterns of sizes of TV audiences
– for marketing
Patterns in NBA play
– to alter, and so improve, performance
Deviations in standard phone calling behavior
– to detect fraud
– for marketing
8
What is Data Mining?
Potential point of confusion:
The “extracting ore from rock” metaphor does not
really apply to the practice of data mining
If it did, then standard database queries would
fit under the rubric of data mining
In practice, DM refers to:
– finding patterns across large datasets
– discovering heretofore unknown information
9
What is Text Data Mining?
Many people’s first thought:
Make it easier to find things on the Web.
But this is information retrieval!
10
Needles in Haystacks
The emphasis in IR is on finding documents that already contain
answers to questions.
11
Information Retrieval
A restricted form of Information Access
The system has only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and
presenting them to the user.
It must select, say, 10 or 20 passages out of millions.
12
What is Text Data Mining?
The metaphor of extracting ore from rock:
Does make sense for extracting documents of
interest from a huge pile.
But does not reflect notions of DM in practice:
– finding patterns across large collections
– discovering heretofore unknown information
What would finding a pattern across a large text collection
really look like …?
13
Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)
14
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
More info: http://cs.anu.edu.au/~bdm/dilugim/gatesdet.txt
http://cs.anu.edu.au/~bdm/dilugim/torah.html
15
Real Text DM
The point:
Discovering heretofore unknown information is not
what we usually do with text.
(If it weren’t known, it could not have been written
by someone!)
However:
There is a field whose goal is to learn about patterns
in text for their own sake ...
16
Computational Linguistics!
Goal: automated language understanding
this isn’t possible (yet)
instead, go for subgoals, e.g.,
– word sense disambiguation
– phrase recognition
– semantic associations
Common current approach:
statistical analyses over very large text collections
17
Why CL Isn’t TDM
A linguist finds it interesting that “cloying” co-occurs
significantly with “Jar Jar Binks” ...
… But this doesn’t really answer a question relevant
to the world outside the text itself.
18
Why CL Isn’t TDM
We need to use the text indirectly to answer
questions about the world
Direct:
Analyze patent text; determine which word patterns
indicate various subject categories.
Indirect:
Analyze patent text; find out whether private or
public funding leads to more inventions.
19
Why CL Isn’t TDM
Direct:
Cluster newswire text; determine which terms are
predominant
Indirect:
Analyze newswire text; gather evidence about which
countries/alliances are dominating which financial sectors
20
Nuggets vs. Patterns
TDM: we want to discover new information …
… As opposed to discovering which statistical patterns
characterize occurrence of known information.
Example: WSD
not TDM: computing statistics over a corpus to
determine what patterns characterize Sense S.
TDM: discovering the meaning of a new sense of a word.
21
Nuggets vs. Patterns
Nugget:
a new, heretofore unknown item of information.
Pattern:
distributions or rules that characterize the occurrence
(or non-occurrence) of a known item of information.
Application of rules can create nuggets in some
circumstances.
22
Example: Lexicon Augmentation
Application of a lexico-syntactic pattern:
NP0 such as NP1 {, NP2 …, (and | or) NPi}
i >= 1, implies that
for all NPi, i >= 1: hyponym(NPi, NP0)
Extracts a new hyponym relation:
“Agar is a substance prepared from a mixture of red algae, such
as Gelidium, for laboratory or industrial use.”
implies hyponym(“Gelidium”, “red algae”)
However, this fact was already known to the author
of the text.
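The "such as" pattern can be sketched with a regular expression. This is a minimal illustration, not the original implementation: it assumes noun phrases were already bracketed by some upstream chunker (the slides do not say which), and the names `SUCH_AS` and `extract_hyponyms` are hypothetical.

```python
import re

# Assumption of this sketch: noun phrases were bracketed upstream, e.g. by a
# chunker, so "[red algae]" marks one NP.
NP = r"\[([^\]]+)\]"
SEP = r"(?:, (?:and |or )?| and | or )"

# NP0 such as NP1 {, NP2 ..., (and | or) NPi}
SUCH_AS = re.compile(rf"{NP},? such as ((?:\[[^\]]+\]{SEP}?)+)")

def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs licensed by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group(1)
        # Every bracketed NP in the list after "such as" is a hyponym of NP0.
        for hyponym in re.findall(NP, match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

sentence = (
    "Agar is [a substance] prepared from [a mixture] of [red algae], "
    "such as [Gelidium], for [laboratory or industrial use]."
)
print(extract_hyponyms(sentence))
```

On the slide's sentence this yields the single pair for Gelidium and red algae; the hard part in practice is the noun-phrase bracketing, which the sketch takes as given.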
23
The Quandary
How do we use text to both
Find new information not known to the author of
the text
Find information that is not about the text itself
24
Idea: Exploratory Data Analysis
Use large text collections to gather evidence to
support (or refute) hypotheses
Not known to author: links across many texts
Not self-referential: work within the domain of
discourse
25
Example: Etiology
Given
medical titles and abstracts
a problem (incurable rare disease)
some medical expertise
find causal links among titles
symptoms
drugs
results
26
Swanson Example (1991)
Problem: Migraine headaches (M)
Facts extracted from medical journal titles:
stress associated with M
stress leads to loss of magnesium
calcium channel blockers prevent some M
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) implicated in M
high levels of magnesium inhibit SCD
M patients have high platelet aggregability
magnesium can suppress platelet aggregability
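The facts above can be treated as links between terms, and the candidate hypothesis falls out as a pair of terms that share intermediate terms but are never linked directly. The sketch below is only an illustration under that assumption: the facts are hand-encoded from the slide (Swanson read the titles himself), and `linking_terms` is a hypothetical helper name.

```python
from collections import defaultdict

# Facts from the slide, hand-encoded as undirected (term, term) links.
FACTS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]

def linking_terms(facts, a, c):
    """Return terms linked to both a and c, provided a and c are not linked directly."""
    neighbors = defaultdict(set)
    for x, y in facts:
        neighbors[x].add(y)
        neighbors[y].add(x)
    if c in neighbors[a]:
        return set()  # a direct link is already known; nothing to discover
    return neighbors[a] & neighbors[c]

print(sorted(linking_terms(FACTS, "migraine", "magnesium")))
```

Four intermediate terms connect migraine and magnesium here, which is the kind of piecewise evidence that suggests a hypothesis worth testing.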
27
Gathering Evidence
[Figure: diagram with nodes for stress, CCB, SCD, PA, migraine, and magnesium (magnesium appears once per link)]
28
Gathering Evidence
[Figure: diagram with one node each for CCB, migraine, PA, magnesium, SCD, and stress]
29
Swanson’s TDM
Two of his hypotheses have received some
experimental verification.
His technique
Only partially automated
Required medical expertise
Some researchers are pursuing this further.
30
How to find functions of genes?
Important problem in molecular biology
Have the genetic sequence
Don’t know what it does
But …
– Know which genes it coexpresses with
– Some of these have known function
So … Infer function based on function of coexpressed genes
– This idea suggested to me by Michael Walker and
others at Incyte Pharmaceuticals
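One simple reading of "infer function based on function of coexpressed genes" is a majority vote over the annotated neighbors. The sketch below assumes exactly that; the gene ids and function labels are placeholders, and `infer_function` is a hypothetical name rather than anything from the lecture.

```python
from collections import Counter

def infer_function(gene, coexpressed, known_function):
    """Guess a gene's function as the most common known function among the
    genes it is coexpressed with; return None if none of them is annotated."""
    votes = Counter(
        known_function[g]
        for g in coexpressed.get(gene, ())
        if g in known_function
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Placeholder data for illustration only.
coexpressed = {"mystery": ["g1", "g2", "g3", "g4"], "orphan": ["g4"]}
known_function = {"g1": "F1", "g2": "F1", "g3": "F2"}

print(infer_function("mystery", coexpressed, known_function))
```

A vote like this only proposes a candidate; the slides that follow use the literature to build evidence for or against it.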
31
Gene Co-expression:
Role in the genetic pathway
[Figure: candidate positions of unknown genes g? and h? relative to Kallikrein (Kall.), PSA, and PAP in the pathway]
Other possibilities as well
32
Make use of the literature
Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
Similar topics indicated by Subject Descriptors
Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
33
Developing Strategies
Different strategies seem needed for different
situations
First: see what is known about
Kallikrein.
7341 documents. Too many
AND the result with “disease” category
– If result is non-empty, this might be an
interesting gene
Now get 803 documents
AND the result with PSA
– Get 11 documents. Better!
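The narrowing steps above amount to successive boolean ANDs over sets of document ids. A minimal sketch follows, assuming a toy inverted index with made-up ids (not the counts on the slide) and a hypothetical helper `narrow`:

```python
def narrow(index, *terms):
    """Boolean AND: return ids of documents indexed under every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Toy inverted index: term or category -> set of document ids (made-up data).
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "psa": {3, 4, 7},
}

# Each added term can only shrink (or keep) the result set.
for query in (
    ["kallikrein"],
    ["kallikrein", "disease"],
    ["kallikrein", "disease", "psa"],
):
    print(query, len(narrow(index, *query)))
```

The same call with all three known genes as terms corresponds to the "intersect search on all three" strategy tried later.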
34
Developing Strategies
Look for commonalities among these documents
Manual scan through ~100 category labels
Would have been better if
– Automatically organized
– Intersections of “important” categories scanned for first
35
Try a new tack
Researcher uses knowledge of field to realize these
are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
Hope they all talk about diagnostics and
prostate cancer
Fortunately, 7 documents returned
Bingo! A relation to regulation of this
cancer
36
Formulate a Hypothesis
Hypothesis: mystery gene has to do with regulation
of expression of genes leading to prostate cancer
New tack: do some lab tests
See if mystery gene is similar in molecular structure
to the others
If so, it might do some of the same things they do
37
38
Strategies again
In hindsight, combining all three genes was a good
strategy.
Store this for later
Might not have worked
Need a suite of strategies
Build them up via experience and a good UI
39
Text Merging Example
Discovering Hypocritical Congresspersons
40
Discovering Hypocritical
Congresspersons
Feb 1, 1996
US House of Reps votes to pass Telecommunications
Reform Act
This contains the CDA (Communications Decency Act)
– Sought to criminalize posting to the Internet any material
deemed indecent and patently offensive, with no exception for
socially redeeming material.
Violators subject to fines of $250,000 and 5 years in
prison
Eventually struck down by courts
http://www.tbtf.com/resource/hypocrites.html
41
Discovering Hypocritical
Congresspersons
Sept 11, 1998
US House of Reps votes to place the Starr report online
The content would (most likely) have violated the CDA
365 people were members for both votes
284 members voted aye both times
– 185 (94%) Republicans voted aye both times
– 96 (57%) Democrats voted aye both times
http://www.tbtf.com/resource/hypocrites.html
42
How to find Hypocritical
Congresspersons?
This must have taken a lot of work
Hand cutting and pasting
Lots of picky details
– Some people voted on one but not the other bill
– Some people share the same name
Check for different county/state
Still messed up on “Bono”
Taking stats at the end on various attributes
– Which state
– Which party
Tools should help streamline, reuse results
The hardest part?
Knowing to compare these two sets of voting records in the first
place.
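The merging step itself is mechanical once the two records are keyed consistently. The sketch below assumes each vote is a mapping from (name, state) to "aye" or "nay"; the records are made up for illustration and `aye_both_times` is a hypothetical helper name.

```python
from collections import Counter

def aye_both_times(first_vote, second_vote, party):
    """Each vote maps (name, state) -> "aye" or "nay". Keying on state as
    well as name keeps apart members who share a name."""
    # Members who voted on only one of the two bills drop out here.
    present_both = first_vote.keys() & second_vote.keys()
    aye_both = {
        member
        for member in present_both
        if first_vote[member] == "aye" and second_vote[member] == "aye"
    }
    return present_both, aye_both, Counter(party[m] for m in aye_both)

# Made-up records for illustration only; not real members or votes.
cda_vote = {("Smith", "CA"): "aye", ("Smith", "TX"): "nay", ("Lee", "NY"): "aye"}
starr_vote = {("Smith", "CA"): "aye", ("Smith", "TX"): "aye", ("Diaz", "FL"): "aye"}
party = {
    ("Smith", "CA"): "R",
    ("Smith", "TX"): "D",
    ("Lee", "NY"): "D",
    ("Diaz", "FL"): "R",
}

present_both, aye_both, by_party = aye_both_times(cda_vote, starr_vote, party)
print(len(present_both), sorted(aye_both), dict(by_party))
```

A (name, state) key would still not separate two members with the same name from the same state, so some manual checking remains.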
45
Summary
Text Data Mining:
Extracting heretofore undiscovered information from
large text collections
Information Access ≠ TDM
IA: locating already known information that is
currently of interest
Finding patterns across text is already done in CL
Tells us about the behavior of language
Helps build very useful tools!
46
Summary on Text Data Mining
The future: analyzing what the text is about
We don’t know how; text is tough!
Idea: bring the user into the loop.
Build up piecewise evidence to support hypotheses
Make use of partial domain models.
The Truth is Out There!
47