Open Source Text Mining
Hinrich Schütze, Enkata
Text Mining 2003 @ SDM03
Cathedral Hill Hotel, San Francisco
May 3, 2003
1
Motivation
Open source used to be a crackpot idea.
Bill Gates on Linux (1999.03.24): “I really don't think in the
commercial market, we'll see it in any significant way.”
MS 10-Q quarterly filing (2003.01.31): “The popularization
of the open source movement continues to pose a
significant challenge to the company's business model.”
Open source is an enabler for radical new things
Google
Ultra-cheap web servers
Free news
Free email
Free …
Class projects
Walmart PC for $200
2
GNU/Linux
3
Web Servers:
Open Source Dominates
Source: Netcraft
4
Motivation (cont.)
Text mining has not had much impact.
Many small companies & small projects
No large-scale adoption
Exception: text-mining-enhanced search
Text mining could transform the world.
Unstructured → structured
Information explosion
Amount of information has exploded
Amount of accessible information has not
Can open source text mining make this
happen?
5
Unstructured vs Structured Data
[Bar chart: unstructured data dominates in data volume; structured data dominates in market capitalization. Source: Prabhakar Raghavan, Verity]
6
Business Motivation
High cost of deploying text mining solutions
How can we lower this cost?
100% proprietary solutions
Require re-invention of core infrastructure
Leave fewer resources for high-value
applications built on top of core
infrastructure
7
Definitions
Open source
Public domain, BSD, GPL (GNU General Public License)
Text mining
Like data mining but for text
NLP (Natural Language Processing)
subdiscipline
Has interesting applications now
More than just information retrieval /
keyword search
Usually: some statistical, probabilistic, or
frequentist component
8
Text Mining vs. NLP
(Natural Language Processing)
What is not text mining: speech, language
models, parsing, machine translation
Typical text mining: clustering, information
extraction, question answering
Statistical and high volume
9
Text Mining: History
80s: Electronic text gives birth to Statistical
Natural Language Processing (StatNLP).
90s: DARPA sponsors Message
Understanding Conferences (MUC) and
Information Extraction (IE) community.
Mid-90s: Data Mining becomes a discipline
and usurps much of IE and StatNLP as “text
mining”.
10
Text Mining: Hearst’s Definition
Finding nuggets
Finding patterns
Information extraction
Question answering
Clustering
Knowledge discovery
Text visualization
11
Information Extraction
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
12
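Template filling of this kind can be sketched in a few lines; the field patterns and the sample posting below are illustrative only (real IE systems typically learn their extractors rather than using hand-written regexes):

```python
import re

# Toy template-filling extractor for job postings.
# Patterns and field names are hypothetical, for illustration.
FIELD_PATTERNS = {
    "JobTitle": re.compile(r"(?:position|job title):\s*(.+)", re.I),
    "ContactPhone": re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"),
    "JobLocation": re.compile(r"location:\s*(.+)", re.I),
}

def extract_record(text):
    """Fill one slot per field from the first matching pattern."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            record[field] = m.group(1).strip()
    return record

posting = """Job Title: Ice Cream Guru
Location: Upper Midwest
Call 800-488-2611 to apply."""
print(extract_record(posting))
# → {'JobTitle': 'Ice Cream Guru', 'ContactPhone': '800-488-2611',
#    'JobLocation': 'Upper Midwest'}
```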
Knowledge Discovery:
Arrowsmith
Goal: Connect two disconnected subfields of
medicine.
Technique
Start with 1st subfield
Identify key concepts
Search for 2nd subfield with same concepts
Implemented in Arrowsmith system
Discovery: magnesium is potential treatment
for migraine
13
Knowledge Discovery:
Arrowsmith
14
When is Open Source
Successful?
“Important” problem
Adaptation
A little adaptation is easy
Most users do not need any adaptation (out of the box use)
Incremental releases are useful
Cost sharing without administrative/legal overhead
Many users (operating system)
Fun to work on (games)
Public funding available (OpenBSD, security)
Open source author gains fame/satisfaction/immortality/community
Dozens of companies with significant interest in Linux (IBM …)
Many of these companies contribute to open source
This is in effect an informal consortium
A formal effort probably would have killed Linux.
Same applies to text mining?
Also: bugs, security, high-availability, ideal for consulting &
hardware companies like IBM
15
When is Open Source Not
Successful?
Boring & rare problem
Complex integrated solutions
Print driver for 10 year old printer
QuarkXPress
ERP systems
Good UI experience for non-geeks
Apple
Microsoft Windows
(at least for now)
16
Text Mining and Open Source
Pro
Important problem: fame, satisfaction,
immortality, community can be gained
Pooling of resources / critical mass
Con
Non-incremental?
Most text mining requires significant
adaptation.
Most text mining requires data resources as
well as source code.
The need for data resources does not fit well
into the open source paradigm.
17
Text Mining Open Source Today
Lucene
Excellent for information retrieval, but not
much text mining.
Rain/bow, Weka, GTP, TDMAPI
Text mining algorithms / infrastructure, no
data resources
NLTK
NLP toolkit, some data resources
WordNet, DMOZ
Excellent data resources, but not enough
breadth/depth.
18
Open Source with Open Data
Spell checkers (e.g., emacs)
Antispam software (e.g., spamassassin)
Named entity recognition (GATE/ANNIE)
Free version less powerful than in-house
19
SpamAssassin: Code + Data
20
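The code + data split that SpamAssassin exemplifies can be sketched as a toy rule-scoring filter: the scoring code is generic, while the model (token weights and threshold, made up here) could ship as a separately licensed data file, just as SpamAssassin ships its rules and scores separately from its engine:

```python
import re

# Hypothetical model data: in a real deployment this would live in a
# separate, openly redistributable file, not in the code.
MODEL = {
    "threshold": 1.0,
    "weights": {"viagra": 2.5, "free": 0.8, "meeting": -1.2, "unsubscribe": 1.0},
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def score(text, model):
    """Sum the weights of known tokens, as SpamAssassin sums rule scores."""
    return sum(model["weights"].get(tok, 0.0) for tok in tokenize(text))

def is_spam(text, model):
    return score(text, model) >= model["threshold"]

print(is_spam("FREE viagra, click to unsubscribe", MODEL))  # → True
print(is_spam("Agenda for tomorrow's meeting", MODEL))      # → False
```

Retraining only touches the data file; the code never changes, which is what makes the data resource the interesting thing to share.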
Open Data Resources:
Examples
SpamAssassin
Named entity recognition
Word lists, dictionaries
Information extraction
Classification model for spam
Domain model, taxonomies, regular
expressions
Shallow parsing
Grammars
21
Code vs Data
[2x2 chart: data resources needed (none vs. significant) x code (proprietary vs. open source). Proprietary code: complex & integrated software, good UI design. Open source, no resources needed: Linux, web servers. Open source with data: spell checkers, spam filtering. Significant resources needed, open-source status an open question ("?"): text classification, named entity recognition, information extraction.]
22
Open Source with Data: Key Issues
Can data resources be recycled?
Assume there is a large library of data resources
available.
Problems have to be similar.
More difficult than one would expect: my first attempt
failed (Medline/Reuters).
Next: case study
How do we identify the data resources that can be
recycled?
How do we adapt them?
How do we get from here to there?
Need incremental approach that is sustained by
successes along the way.
23
Text Mining without Data
Resources
Premise: “Knowledge-poor” text mining taps
small part of potential of text mining.
Knowledge-poor text mining examples
Clustering
Phrase extraction
First story detection
Many success stories
24
Case Study: ODP -> Reuters
[Diagram: train classifiers on ODP, apply them to Reuters]
25
Case Study: Text Classification
Key Issues for text classification
Show that text classifiers can be recycled
How can we select reusable classifiers for a
particular task?
How do we adapt them?
Case Study
Train classifiers on open directory (ODP)
Apply classifiers to Reuters RCV1
165,000 docs (nodes), crawled in 2000, 505 classes
780,000 docs, >1000 classes
Hypothesis: A library of classifiers based on
ODP can be recycled for RCV1.
26
Experimental Setup
Train 505 classifiers on ODP
Apply them to Reuters
Compute chi-square for all ODP x Reuters pairs
Evaluate the n pairs with the best chi-square
Evaluation measures
Area under ROC curve
Plot false positive rate vs. true positive rate
Compute area under the curve
Average precision
Rank documents, compute precision at each rank
Average over all positive documents
Estimated based on a 25% sample
27
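The pairing and evaluation steps above can be sketched as follows, with made-up scores and a tiny 2x2 table (the actual experiment used 505 ODP classifiers and the full RCV1 collection):

```python
def chi2(a, b, c, d):
    """Chi-square for the 2x2 contingency table [[a, b], [c, d]]:
    (classifier fires?) x (document in the Reuters class?)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def roc_auc(scores, labels):
    """Area under the ROC curve = P(positive doc outranks negative doc)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of precision at the rank of each positive document."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits

print(chi2(40, 10, 10, 40))               # → 36.0
scores = [0.9, 0.8, 0.6, 0.4, 0.2]        # classifier scores on 5 documents
labels = [1, 1, 0, 1, 0]                  # membership in the Reuters class
print(roc_auc(scores, labels))            # → 0.8333...
print(average_precision(scores, labels))  # → 0.9166...
```

Chi-square over the contingency table picks candidate ODP-Reuters class pairs; AUC and average precision then grade the ranking each recycled classifier induces.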
Japan: ODP -> Reuters
[ROC curve: Japan classifier trained on ODP, applied to Reuters; true positive rate vs. false positive rate]
28
Some Results
29
BusIndTraMar0 / I76300: Ports
30
Discussion
Promising results
These are results without any adaptation.
Performance expected to be much better
after adaptation.
31
Discussion (cont.)
Class relationships are m:n, not 1:1
Reuters: GSPO
SpoBasCol0
SpoBasMinLea0
SpoBasReg0
SpoHocIceLeaNatPla0
SpoHocIceLeaPro0
ODP: RegEurUniBusInd0 (UK industries)
I13000 (petroleum & natural gas)
I17000 (water supply)
I32000 (mechanical engineering)
I66100 (restaurants, cafes, fast food)
I79020 (telecommunications)
I9741105 (radio broadcasting)
32
Why Recycling Classifiers is
Difficult
Autonomous vs relative decisions
ODP Japan classifier w/o modifications has
high precision, but only 1% recall on RCV1!
Most classifiers are tuned for optimal
performance in embedded system.
Tuning decreases robustness in recycling.
Tokenization, document length, numbers
Numbers throw off the Medline vs. non-Medline
categorizer (financial classified as medical)
Length-sensitive multinomial Naïve Bayes:
nonsensical results
33
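The length sensitivity of multinomial Naive Bayes can be seen in a small sketch (the word log-odds are made up): the score is a sum over tokens, so a document repeated ten times gets a ten-times-more-extreme score even though its content is unchanged, while dividing by length restores comparability:

```python
import math

# Hypothetical per-word log-odds log P(w|class) / P(w|other).
LOG_ODDS = {"stocks": math.log(0.2 / 0.1), "gene": math.log(0.05 / 0.3)}

def nb_score(tokens):
    """Multinomial NB log-odds: a plain sum over the document's tokens."""
    return sum(LOG_ODDS.get(t, 0.0) for t in tokens)

doc = ["stocks", "stocks", "gene"]
short, long_doc = doc, doc * 10      # same content, 10x the length
print(nb_score(short), nb_score(long_doc))   # magnitudes differ by 10x
print(nb_score(short) / len(short),
      nb_score(long_doc) / len(long_doc))    # per-token scores: identical
```

A classifier tuned against one corpus's length distribution therefore gives nonsensical scores on a corpus with different document lengths, which is one reason recycling is hard.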
Specifics
What would an open source text classification
package look like?
Code
Text mining algorithms
Customization component
Creation component
To create new data resources
Data
To adapt recycled data resources
Recycled data resources
Newly created data resources
Pick a good area
Bioinformatics: genes / proteins
Product catalogs
34
Other Text Mining Areas
Named entity recognition
Information extraction
Shallow parsing
35
Data vs Code
What about just sharing training sets?
What about just sharing models?
Often proprietary
Small preprocessing changes can throw you
off completely
Share a (simple?) classifier-cum-preprocessor
and models
Still proprietary issues
36
Open Source & Data
[Cycle diagram: public Code+Data V1.0 → adapt → proprietary Enhanced Code+Data → sanitize → Sanitized & Enhanced Code+Data → publish → new release: public Code+Data V1.1]
37
Free Riders?
Open source is successful because it makes
free riding hard.
Harder to achieve for some data resources
Viral nature of GPL.
Download models
Apply to your data
Retrain
You own 100% of the result
Less of a problem for dictionaries and
grammars
38
Data Licenses
Open Directory License
http://rdf.dmoz.org/license.html
BSD flavor
WordNet
http://www.cogsci.princeton.edu/~wn/license.shtml
Copyright
No license to sell derivative works?
Some criteria for derivative works
Substantially similar (Seinfeld trivia)
Potential damage to future marketing of derivative
works
39
Code vs Data Licenses
Some similarity
If I open-source my code, then I will benefit
from bug fixes & enhancements written by
others.
If I open-source my data resource, then my
classification model may become more robust
due to improvements made by others.
Some dissimilarity
Code is very abstract: few issues with
proprietary information creeping in.
Text mining resources are not very abstract:
there is a potential for sensitive information
to creep in.
40
Areas in Need of Research
How to identify reusable text mining components
How to adapt reusable text mining components
Active learning
Interactive parameter tweaking?
Combination of recycled classifier and new training
information
Estimate performance
ODP/Reuters case study does not address this.
Need (small) labeled sample to be able to do this?
Most estimation techniques require large labeled
samples.
The point is to avoid construction of a large labeled
sample.
Create viral license for data resources.
41
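One of the adaptation ideas above, active learning, can be sketched as uncertainty sampling: ask a human to label the documents the recycled classifier is least sure about (the probabilities below are made up):

```python
# Uncertainty sampling: of the unlabeled documents, request labels for
# those whose recycled-classifier probability is closest to 0.5.
def select_for_labeling(probs, k):
    """Return indices of the k most uncertain documents."""
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return by_uncertainty[:k]

probs = [0.95, 0.51, 0.10, 0.48, 0.80]   # recycled classifier's P(class|doc)
print(select_for_labeling(probs, 2))      # → [1, 3]
```

The point is exactly the one above: a small, well-chosen labeled sample stands in for the large one we are trying to avoid building.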
Summary
Many interesting research issues
Need institution/individual to take the lead
Need motivated network of contributors
data resource contributors
source code contributors
Start with small & simple project that proves
idea
If it works … text mining could become an
enabler on a par with Linux.
42
More Slides
43
ODP classifier     Reuters class  ROC area  Avg. precision
RegAsiJap0         JAP            0.86      0.62
RegAsiPhi0         PHLNS          0.91      0.56
RegAsiIndSta0      INDIA          0.85      0.53
SpoSocPla0         CCAT           0.60      0.53
RegEurRus0         CCAT           0.58      0.51
RegEurRus0         RUSS           0.85      0.51
SpoSocPla0         GSPO           0.78      0.42
SpoBasReg0         GSPO           0.75      0.33
RegAsiIndSta0      MCAT           0.56      0.32
SpoBasPla1         GSPO           0.80      0.31
SpoBasCol0         GSPO           0.78      0.31
SpoBasCol1         GSPO           0.74      0.26
RegEurSlo0         SLVAK          0.86      0.25
SpoBasPla0         GSPO           0.77      0.24
RegEurRus0         MCAT           0.49      0.23
BusIndTraMar0      I76300         0.81      0.23
SpoHocIceLeaPro0   GSPO           0.71      0.20
SpoBasMinLea0      GSPO           0.71      0.20
RegMidLeb0         LEBAN          0.83      0.19
RecAvi0            I36400         0.74      0.18
RegSou0            BRAZ           0.84      0.18
44
Resources
http://www-csli.stanford.edu/~schuetze (this talk, some
additional material)
Source of Gates quote:
http://www.techweb.com/wire/story/TWB19990324S0014
Kurt D. Bollacker and Joydeep Ghosh. A scalable method for
classifier knowledge reuse. In Proceedings of the 1997
International Conference on Neural Networks, pages 1474-79,
June 1997. (proposes measure for selecting classifiers for reuse)
W. Cohen, D. Kudenko: Transferring and Retraining Learned
Information Filters, Proceedings of the Fourteenth National
Conference on Artificial Intelligence, AAAI 97. (transfer within
the same dataset)
Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier
architecture for scalable knowledge reuse. In The 1998
International Conference on Machine Learning, pp. 64-72, July
1998. (transfer within the same dataset)
Motivation of open source contributors:
http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11,
http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0
45