Open Source Text Mining
Hinrich Schütze, Enkata
Text Mining 2003 @ SDM03
Cathedral Hill Hotel, San Francisco
May 3, 2003
1
Motivation
Open source used to be a crackpot idea.
Bill Gates on Linux (1999.03.24): “I really don't think in the
commercial market, we'll see it in any significant way.”
MS 10-Q quarterly filing (2003.01.31): “The popularization
of the open source movement continues to pose a
significant challenge to the company's business model.”
Open source is an enabler for radical new things
Google
Ultra-cheap web servers
Free news
Free email
Free …
Class projects
Walmart PC for $200
2
GNU/Linux
3
Web Servers:
Open Source Dominates
Source: Netcraft
4
Motivation (cont.)
Text mining has not had much impact.
Many small companies & small projects
No large-scale adoption
Exception: text-mining-enhanced search
Text mining could transform the world.
Unstructured → structured
Information explosion
Amount of information has exploded
Amount of accessible information has not
Can open source text mining make this
happen?
5
Unstructured vs Structured Data
[Bar chart: unstructured data dominates in data volume; structured data dominates in market capitalization. Source: Prabhakar Raghavan, Verity]
6
Business Motivation
High cost of deploying text mining solutions
How can we lower this cost?
100% proprietary solutions
Require re-invention of core infrastructure
Leave fewer resources for high-value
applications built on top of core
infrastructure
7
Definitions
Open source
Public domain, BSD, GPL (GNU General Public License)
Text mining
Like data mining but for text
NLP (Natural Language Processing)
subdiscipline
Has interesting applications now
More than just information retrieval /
keyword search
Usually: some statistical, probabilistic, or
frequentist component
8
Text Mining vs. NLP
(Natural Language Processing)
What is not text mining: speech, language
models, parsing, machine translation
Typical text mining: clustering, information
extraction, question answering
Statistical and high volume
9
Text Mining: History
80s: Electronic text gives birth to Statistical
Natural Language Processing (StatNLP).
90s: DARPA sponsors Message
Understanding Conferences (MUC) and
Information Extraction (IE) community.
Mid-90s: Data Mining becomes a discipline
and usurps much of IE and StatNLP as “text
mining”.
10
Text Mining: Hearst’s Definition
Finding nuggets
Finding patterns
Information extraction
Question answering
Clustering
Knowledge discovery
Text visualization
11
Information Extraction
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
12
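Template filling of this kind can be sketched in a few lines; the field patterns and the sample posting below are illustrative only (real IE systems typically learn their extractors rather than using hand-written regexes):

```python
import re

# Toy template-filling extractor for job postings.
# Patterns and field names are hypothetical, for illustration.
FIELD_PATTERNS = {
    "JobTitle": re.compile(r"(?:position|job title):\s*(.+)", re.I),
    "ContactPhone": re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"),
    "JobLocation": re.compile(r"location:\s*(.+)", re.I),
}

def extract_record(text):
    """Fill one slot per field from the first matching pattern."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            record[field] = m.group(1).strip()
    return record

posting = """Job Title: Ice Cream Guru
Location: Upper Midwest
Call 800-488-2611 to apply."""
print(extract_record(posting))
# → {'JobTitle': 'Ice Cream Guru', 'ContactPhone': '800-488-2611',
#    'JobLocation': 'Upper Midwest'}
```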
Knowledge Discovery:
Arrowsmith
Goal: Connect two disconnected subfields of
medicine.
Technique
Start with 1st subfield
Identify key concepts
Search for 2nd subfield with same concepts
Implemented in Arrowsmith system
Discovery: magnesium is potential treatment
for migraine
13
Knowledge Discovery:
Arrowsmith
14
When is Open Source
Successful?
“Important” problem
Adaptation
A little adaptation is easy
Most users do not need any adaptation (out of the box use)
Incremental releases are useful
Cost sharing without administrative/legal overhead
Many users (operating system)
Fun to work on (games)
Public funding available (OpenBSD, security)
Open source author gains fame/satisfaction/immortality/community
Dozens of companies with significant interest in Linux (IBM …)
Many of these companies contribute to open source
This is in effect an informal consortium
A formal effort probably would have killed Linux.
Same applies to text mining?
Also: bugs, security, high-availability, ideal for consulting &
hardware companies like IBM
15
When is Open Source Not
Successful?
Boring & rare problem
Complex integrated solutions
Print driver for 10 year old printer
QuarkXPress
ERP systems
Good UI experience for non-geeks
Apple
Microsoft Windows
(at least for now)
16
Text Mining and Open Source
Pro
Important problem: fame, satisfaction,
immortality, community can be gained
Pooling of resources / critical mass
Con
Non-incremental?
Most text mining requires significant
adaptation.
Most text mining requires data resources as
well as source code.
The need for data resources does not fit well
into the open source paradigm.
17
Text Mining Open Source Today
Lucene
Excellent for information retrieval, but not
much text mining.
Rain/bow, Weka, GTP, TDMAPI
Text mining algorithms / infrastructure, no
data resources
NLTK
NLP toolkit, some data resources
WordNet, DMOZ
Excellent data resources, but not enough
breadth/depth.
18
Open Source with Open Data
Spell checkers (e.g., emacs)
Antispam software (e.g., spamassassin)
Named entity recognition (GATE/ANNIE)
Free version less powerful than in-house
19
SpamAssassin: Code + Data
20
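The code + data split that SpamAssassin exemplifies can be sketched as a toy rule-scoring filter: the scoring code is generic, while the model (token weights and threshold, made up here) could ship as a separately licensed data file, just as SpamAssassin ships its rules and scores separately from its engine:

```python
import re

# Hypothetical model data: in a real deployment this would live in a
# separate, openly redistributable file, not in the code.
MODEL = {
    "threshold": 1.0,
    "weights": {"viagra": 2.5, "free": 0.8, "meeting": -1.2, "unsubscribe": 1.0},
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def score(text, model):
    """Sum the weights of known tokens, as SpamAssassin sums rule scores."""
    return sum(model["weights"].get(tok, 0.0) for tok in tokenize(text))

def is_spam(text, model):
    return score(text, model) >= model["threshold"]

print(is_spam("FREE viagra, click to unsubscribe", MODEL))  # → True
print(is_spam("Agenda for tomorrow's meeting", MODEL))      # → False
```

Retraining only touches the data file; the code never changes, which is what makes the data resource the interesting thing to share.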
Open Data Resources:
Examples
SpamAssassin
Named entity recognition
Word lists, dictionaries
Information extraction
Classification model for spam
Domain model, taxonomies, regular
expressions
Shallow parsing
Grammars
21
Code vs Data
[2x2 chart: data resources needed (none vs. significant) x code (proprietary vs. open source). Proprietary code: complex & integrated software, good UI design. Open source, no resources needed: Linux, web servers. Open source with data: spell checkers, spam filtering. Significant resources needed, open-source status an open question ("?"): text classification, named entity recognition, information extraction.]
22
Open Source with Data: Key Issues
Can data resources be recycled?
Assume there is a large library of data resources
available.
Problems have to be similar.
More difficult than one would expect: my first attempt
failed (Medline/Reuters).
Next: case study
How do we identify the data resources that can be
recycled?
How do we adapt them?
How do we get from here to there?
Need incremental approach that is sustained by
successes along the way.
23
Text Mining without Data
Resources
Premise: “Knowledge-poor” text mining taps
small part of potential of text mining.
Knowledge-poor text mining examples
Clustering
Phrase extraction
First story detection
Many success stories
24
Case Study: ODP -> Reuters
[Diagram: train classifiers on ODP, apply them to Reuters]
25
Case Study: Text Classification
Key Issues for text classification
Show that text classifiers can be recycled
How can we select reusable classifiers for a
particular task?
How do we adapt them?
Case Study
Train classifiers on open directory (ODP)
Apply classifiers to Reuters RCV1
165,000 docs (nodes), crawled in 2000, 505 classes
780,000 docs, >1000 classes
Hypothesis: A library of classifiers based on
ODP can be recycled for RCV1.
26
Experimental Setup
Train 505 classifiers on ODP
Apply them to Reuters
Compute chi-square for all ODP x Reuters pairs
Evaluate the n pairs with the best chi-square
Evaluation measures
Area under ROC curve
Plot false positive rate vs. true positive rate
Compute area under the curve
Average precision
Rank documents, compute precision at each rank
Average over all positive documents
Estimated based on a 25% sample
27
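The pairing and evaluation steps above can be sketched as follows, with made-up scores and a tiny 2x2 table (the actual experiment used 505 ODP classifiers and the full RCV1 collection):

```python
def chi2(a, b, c, d):
    """Chi-square for the 2x2 contingency table [[a, b], [c, d]]:
    (classifier fires?) x (document in the Reuters class?)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def roc_auc(scores, labels):
    """Area under the ROC curve = P(positive doc outranks negative doc)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of precision at the rank of each positive document."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits

print(chi2(40, 10, 10, 40))               # → 36.0
scores = [0.9, 0.8, 0.6, 0.4, 0.2]        # classifier scores on 5 documents
labels = [1, 1, 0, 1, 0]                  # membership in the Reuters class
print(roc_auc(scores, labels))            # → 0.8333...
print(average_precision(scores, labels))  # → 0.9166...
```

Chi-square over the contingency table picks candidate ODP-Reuters class pairs; AUC and average precision then grade the ranking each recycled classifier induces.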
Japan: ODP -> Reuters
[ROC curve: Japan classifier trained on ODP, applied to Reuters; true positive rate vs. false positive rate]
28
Some Results
29
BusIndTraMar0 / I76300: Ports
30
Discussion
Promising results
These are results without any adaptation.
Performance expected to be much better
after adaptation.
31
Discussion (cont.)
Class relationships are m:n, not 1:1
Reuters: GSPO
SpoBasCol0
SpoBasMinLea0
SpoBasReg0
SpoHocIceLeaNatPla0
SpoHocIceLeaPro0
ODP: RegEurUniBusInd0 (UK industries)
I13000 (petroleum & natural gas)
I17000 (water supply)
I32000 (mechanical engineering)
I66100 (restaurants, cafes, fast food)
I79020 (telecommunications)
I9741105 (radio broadcasting)
32
Why Recycling Classifiers is
Difficult
Autonomous vs relative decisions
ODP Japan classifier w/o modifications has
high precision, but only 1% recall on RCV1!
Most classifiers are tuned for optimal
performance in embedded system.
Tuning decreases robustness in recycling.
Tokenization, document length, numbers
Numbers throw off the Medline vs. non-Medline
categorizer (financial classified as medical)
Length-sensitive multinomial Naïve Bayes:
nonsensical results
33
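The length sensitivity of multinomial Naive Bayes can be seen in a small sketch (the word log-odds are made up): the score is a sum over tokens, so a document repeated ten times gets a ten-times-more-extreme score even though its content is unchanged, while dividing by length restores comparability:

```python
import math

# Hypothetical per-word log-odds log P(w|class) / P(w|other).
LOG_ODDS = {"stocks": math.log(0.2 / 0.1), "gene": math.log(0.05 / 0.3)}

def nb_score(tokens):
    """Multinomial NB log-odds: a plain sum over the document's tokens."""
    return sum(LOG_ODDS.get(t, 0.0) for t in tokens)

doc = ["stocks", "stocks", "gene"]
short, long_doc = doc, doc * 10      # same content, 10x the length
print(nb_score(short), nb_score(long_doc))   # magnitudes differ by 10x
print(nb_score(short) / len(short),
      nb_score(long_doc) / len(long_doc))    # per-token scores: identical
```

A classifier tuned against one corpus's length distribution therefore gives nonsensical scores on a corpus with different document lengths, which is one reason recycling is hard.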
Specifics
What would an open source text classification
package look like?
Code
Text mining algorithms
Customization component
Creation component
To create new data resources
Data
To adapt recycled data resources
Recycled data resources
Newly created data resources
Pick a good area
Bioinformatics: genes / proteins
Product catalogs
34
Other Text Mining Areas
Named entity recognition
Information extraction
Shallow parsing
35
Data vs Code
What about just sharing training sets?
What about just sharing models?
Often proprietary
Small preprocessing changes can throw you
off completely
Share a (simple?) classifier-cum-preprocessor
and models
Still proprietary issues
36
Open Source & Data
[Cycle diagram: public Code+Data V1.0 → adapt → proprietary Enhanced Code+Data → sanitize → Sanitized & Enhanced Code+Data → publish → new release: public Code+Data V1.1]
37
Free Riders?
Open source is successful because it makes
free riding hard.
Harder to achieve for some data resources
Viral nature of GPL.
Download models
Apply to your data
Retrain
You own 100% of the result
Less of a problem for dictionaries and
grammars
38
Data Licenses
Open Directory License
http://rdf.dmoz.org/license.html
BSD flavor
WordNet
http://www.cogsci.princeton.edu/~wn/license.shtml
Copyright
No license to sell derivative works?
Some criteria for derivative works
Substantially similar (Seinfeld trivia)
Potential damage to future marketing of derivative
works
39
Code vs Data Licenses
Some similarity
If I open-source my code, then I will benefit
from bug fixes & enhancements written by
others.
If I open-source my data resource, then my
classification model may become more robust
due to improvements made by others.
Some dissimilarity
Code is very abstract: few issues with
proprietary information creeping in.
Text mining resources are not very abstract:
there is a potential for sensitive information
to creep in.
40
Areas in Need of Research
How to identify reusable text mining components
How to adapt reusable text mining components
Active learning
Interactive parameter tweaking?
Combination of recycled classifier and new training
information
Estimate performance
ODP/Reuters case study does not address this.
Need (small) labeled sample to be able to do this?
Most estimation techniques require large labeled
samples.
The point is to avoid construction of a large labeled
sample.
Create viral license for data resources.
41
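One of the adaptation ideas above, active learning, can be sketched as uncertainty sampling: ask a human to label the documents the recycled classifier is least sure about (the probabilities below are made up):

```python
# Uncertainty sampling: of the unlabeled documents, request labels for
# those whose recycled-classifier probability is closest to 0.5.
def select_for_labeling(probs, k):
    """Return indices of the k most uncertain documents."""
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return by_uncertainty[:k]

probs = [0.95, 0.51, 0.10, 0.48, 0.80]   # recycled classifier's P(class|doc)
print(select_for_labeling(probs, 2))      # → [1, 3]
```

The point is exactly the one above: a small, well-chosen labeled sample stands in for the large one we are trying to avoid building.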
Summary
Many interesting research issues
Need institution/individual to take the lead
Need motivated network of contributors
data resource contributors
source code contributors
Start with small & simple project that proves
idea
If it works … text mining could become an
enabler on a par with Linux.
42
More Slides
43
ODP classifier     Reuters class  ROC area  Avg. precision
RegAsiJap0         JAP            0.86      0.62
RegAsiPhi0         PHLNS          0.91      0.56
RegAsiIndSta0      INDIA          0.85      0.53
SpoSocPla0         CCAT           0.60      0.53
RegEurRus0         CCAT           0.58      0.51
RegEurRus0         RUSS           0.85      0.51
SpoSocPla0         GSPO           0.78      0.42
SpoBasReg0         GSPO           0.75      0.33
RegAsiIndSta0      MCAT           0.56      0.32
SpoBasPla1         GSPO           0.80      0.31
SpoBasCol0         GSPO           0.78      0.31
SpoBasCol1         GSPO           0.74      0.26
RegEurSlo0         SLVAK          0.86      0.25
SpoBasPla0         GSPO           0.77      0.24
RegEurRus0         MCAT           0.49      0.23
BusIndTraMar0      I76300         0.81      0.23
SpoHocIceLeaPro0   GSPO           0.71      0.20
SpoBasMinLea0      GSPO           0.71      0.20
RegMidLeb0         LEBAN          0.83      0.19
RecAvi0            I36400         0.74      0.18
RegSou0            BRAZ           0.84      0.18
44
Resources
http://www-csli.stanford.edu/~schuetze (this talk, some
additional material)
Source of Gates quote:
http://www.techweb.com/wire/story/TWB19990324S0014
Kurt D. Bollacker and Joydeep Ghosh. A scalable method for
classifier knowledge reuse. In Proceedings of the 1997
International Conference on Neural Networks, pages 1474-79,
June 1997. (proposes measure for selecting classifiers for reuse)
W. Cohen, D. Kudenko: Transferring and Retraining Learned
Information Filters, Proceedings of the Fourteenth National
Conference on Artificial Intelligence, AAAI 97. (transfer within
the same dataset)
Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier
architecture for scalable knowledge reuse. In The 1998
International Conference on Machine Learning, pp. 64-72, July
1998. (transfer within the same dataset)
Motivation of open source contributors:
http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11,
http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0
45