An Overview of Text Mining
Rebecca Hwa
4/25/2002
References
M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37th Annual
Meeting of the Association for Computational Linguistics, 1999.
E. Riloff and R. Jones, “Learning Dictionaries for Information Extraction by
Multi-level Bootstrapping,” in the Proceedings of AAAI-99, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text Classification from
Labeled and Unlabeled Documents using EM,” in Machine Learning, 2000.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of
Several Related Research Areas: Report on KDD’2000 Workshop on Text
Mining,” 2000.
What Is Text Mining?
“The objective of Text Mining is to exploit information
contained in textual documents in various ways,
including …discovery of patterns and trends in data,
associations among entities, predictive rules, etc.”
(Grobelnik et al., 2001)
“Another way to view text data mining is as a process of
exploratory data analysis that leads to heretofore
unknown information, or to answers for questions for
which the answer is not currently known.” (Hearst,
1999)
Text Mining
• How does it relate to data mining in general?
• How does it relate to computational linguistics?
• How does it relate to information retrieval?
Finding Patterns
– Non-textual data: General data-mining
– Textual data: Computational Linguistics
Finding “Nuggets”
– Novel: Exploratory Data Analysis
– Non-Novel, non-textual data: Database queries
– Non-Novel, textual data: Information Retrieval
Challenges in Text Mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on
many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text
typically need annotated training examples
• Consider bootstrapping techniques
Text Mining Tasks
• Exploratory Data Analysis
– Using text to form hypotheses about diseases (Swanson
and Smalheiser, 1997).
• Information Extraction
– (Semi-)automatically create (domain-specific)
knowledge bases, and then use standard data-mining
techniques.
• Bootstrapping methods (Riloff and Jones, 1999).
• Text Classification
– Useful intermediary step for information extraction
• Bootstrapping method using EM (Nigam et al., 2000).
Biomedical Data Exploration
(Swanson and Smalheiser, 1997)
• Extract pieces of evidence from article titles in the
biomedical literature
– “stress is associated with migraines”
– “stress can lead to loss of magnesium”
– “calcium channel blockers prevent some migraines”
– “magnesium is a natural calcium channel blocker”
• Induce a new hypothesis not in the literature by
combining culled text fragments with human
medical expertise
• Magnesium deficiency may play a role in some kinds of
migraine headache
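The chaining of evidence on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LINKS` triples are hand-written paraphrases of the four title fragments above, and `candidate_links` is a hypothetical helper, not the procedure used by Swanson and Smalheiser. It proposes concepts that are connected to a target only indirectly, and leaves the judgment of whether a proposal is plausible to the human expert.

```python
from collections import defaultdict

# Hypothetical (concept, relation, concept) links paraphrasing the fragments above.
LINKS = [
    ("stress", "is associated with", "migraine"),
    ("stress", "can lead to loss of", "magnesium"),
    ("calcium channel blockers", "prevent some", "migraine"),
    ("magnesium", "is a natural", "calcium channel blockers"),
]

def candidate_links(links, target):
    """Return concepts linked to `target` only through an intermediate concept.

    The result maps each candidate concept to the set of intermediate
    concepts that connect it to the target.
    """
    neighbors = defaultdict(set)
    for left, _relation, right in links:
        # Treat every link as an undirected association between two concepts.
        neighbors[left].add(right)
        neighbors[right].add(left)

    direct = neighbors[target]
    candidates = defaultdict(set)
    for middle in direct:
        for concept in neighbors[middle]:
            # Keep only concepts with no direct link to the target.
            if concept != target and concept not in direct:
                candidates[concept].add(middle)
    return dict(candidates)

hypotheses = candidate_links(LINKS, "migraine")
for concept, via in hypotheses.items():
    print(f"{concept} may be related to migraine via: {sorted(via)}")
```

Even in this toy form, the number of two-step paths grows quickly with the number of links, which is the combinatorial problem raised on the next slide.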
Challenges in Data Exploration
• How can valid inference links be found
without succumbing to a combinatorial
explosion of possibilities?
– Need better models of lexical relationships and
semantic constraints (very hard)
• How should the information be presented to
the human experts to facilitate their
exploration?
Information Extraction (IE)
• Extract domain-specific information from natural
language text
– Need a dictionary of extraction patterns (e.g.,
“traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data
– Need a semantic lexicon (dictionary of words with
semantic category labels)
• Typically constructed by hand
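A dictionary of extraction patterns can be approximated, very roughly, with regular expressions. The sketch below assumes that the filler `<x>` is a run of capitalized words; real IE systems would normally rely on a parser or noun-phrase chunker instead. The names `PATTERNS` and `extract` and the sample sentence are made up for illustration.

```python
import re

# Assumption: the slot <x> is filled by one or more capitalized words.
SLOT = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

# Hypothetical pattern dictionary: pattern name -> compiled regular expression.
PATTERNS = {
    "traveled to <x>": re.compile(r"\btraveled to " + SLOT),
    "presidents of <x>": re.compile(r"\bpresidents of " + SLOT),
}

def extract(text):
    """Return (pattern name, extracted phrase) pairs found in the text."""
    results = []
    for name, regex in PATTERNS.items():
        for match in regex.finditer(text):
            results.append((name, match.group(1)))
    return results

TEXT = "The delegation traveled to Costa Rica, where the presidents of Honduras and Peru met."
print(extract(TEXT))
```

Note that the second pattern picks up only "Honduras" and misses "Peru", a small reminder of why hand-built pattern dictionaries are laborious to get right.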
Challenges in IE
• Automatic learning methods are typically
supervised (i.e., need labeled examples)
• But annotating training data is a time-consuming and expensive task.
• Can we develop better unsupervised
algorithms?
• Can we make better use of a small set of
labeled examples?
Learning Dictionaries for IE via
Bootstrapping (Riloff and Jones, 1999)
• Simultaneously learn extraction patterns and
domain-specific semantic lexicons
• Input requires a small set of seed words (for the
semantic categories) and a large collection of text
• Mutual bootstrapping
– Learn extraction patterns from seed words
– Use extraction patterns to identify new words to add to
the semantic categories
– Meta-bootstrapping to reduce noise
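The mutual bootstrapping loop can be sketched as follows. This is a simplified illustration, not the algorithm as published: the `EXTRACTIONS` pairs are invented, the scoring function (fraction of a pattern's extractions already in the lexicon, times a log of their count) is an assumption chosen for the example, and the meta-bootstrapping step is omitted.

```python
import math
from collections import defaultdict

# Hypothetical (pattern, extracted phrase) pairs, as if every candidate
# pattern had already been applied to a large text collection.
EXTRACTIONS = [
    ("traveled to <x>", "france"), ("traveled to <x>", "peru"),
    ("traveled to <x>", "chile"),
    ("capital of <x>", "peru"), ("capital of <x>", "chile"),
    ("capital of <x>", "bolivia"),
    ("spoke to <x>", "france"), ("spoke to <x>", "reporters"),
    ("spoke to <x>", "students"),
]

def mutual_bootstrap(extractions, seeds, iterations=2):
    """Alternately pick the best pattern and grow the semantic lexicon."""
    by_pattern = defaultdict(set)
    for pattern, phrase in extractions:
        by_pattern[pattern].add(phrase)

    lexicon = set(seeds)
    chosen = []
    for _ in range(iterations):
        best, best_score = None, 0.0
        for pattern, phrases in by_pattern.items():
            if pattern in chosen:
                continue
            hits = len(phrases & lexicon)  # extractions already in the lexicon
            if hits == 0:
                continue
            # Assumed score: precision against the lexicon, weighted by log frequency.
            score = (hits / len(phrases)) * math.log2(hits + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break
        chosen.append(best)
        lexicon |= by_pattern[best]  # add everything the chosen pattern extracts
    return chosen, lexicon

patterns, lexicon = mutual_bootstrap(EXTRACTIONS, seeds={"france", "peru"})
print(patterns)
print(sorted(lexicon))
```

Because every extraction of a chosen pattern enters the lexicon, one bad pattern can pollute later iterations; that is the noise the meta-bootstrapping step is meant to reduce.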
Text classification (TC)
• Tag a document as belonging to one of a set
of pre-defined classes
– “This does not lead to discovery of new
information…” (Hearst, 1999).
– Many practical uses
• Group documents into different domains (useful for
domain-specific information extraction)
• Learn reading interests of users
• Automatically sort e-mail
• On-line New Event Detection
Challenges in TC
• Like IE, TC also needs lots of labeled examples
as training data
– After a user had labeled 1,000 Usenet news
articles, the system was right only ~50% of the
time when selecting articles of interest to that user.
• What other sources of information can
reduce the need for labeled examples?
TC from Labeled and Unlabeled Documents
using EM (Nigam et al., 2000)
• Expectation-Maximization
– Iterative algorithm for maximum likelihood estimation (MLE) in parametric
estimation problems with missing data (e.g., the labels of the examples)
• Nigam et al. combined the EM algorithm with a
Naïve Bayes classifier, using both labeled and
unlabeled data as input
– Dynamically adjust strength of unlabeled data’s
contribution to parameter estimation in EM
– Reduce the bias of naïve Bayes by modeling each class
with multiple mixture components
Probabilistic Framework for TC
• Assumption #1: Doc produced by mixture model
– Generate docs according to probability distribution defined by
the model parameters $\theta$
• Assumption #2: Each class is modeled by one mixture
component: $C = \{c_1, \dots, c_{|C|}\}$
Prob. of model generating doc $d_i$ is:
$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$
Naïve Bayes Model
• Assumes words in the document are generated
independently (no context)
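• Assumes all documents have the same length
$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$
$$P(d_i \mid c_j; \theta) = P(w_1, \dots, w_{|d_i|} \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_k \mid c_j; \theta)$$
• Model parameters: $\theta = \{\theta_{w|c}, \theta_c\}$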
Using a Trained Model
• What class should a new document d be
assigned to?
$$P(\mathrm{Label}(d) = c \mid d; \theta) = \frac{P(c \mid \theta)\, P(d \mid c; \theta)}{P(d \mid \theta)}$$
• Pick the class with the highest probability
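As a small worked example of this decision rule, the sketch below computes the class posterior in log space and picks the most probable class. The parameter values in `CLASS_PRIOR` and `WORD_PROB` are hand-picked for illustration and are not estimated from any data; the function names are likewise hypothetical. Words outside the toy vocabulary are not handled.

```python
import math

# Hypothetical parameters for two classes over a four-word vocabulary.
CLASS_PRIOR = {"sports": 0.5, "politics": 0.5}  # P(c)
WORD_PROB = {                                    # P(w | c)
    "sports":   {"game": 0.4, "team": 0.4, "vote": 0.1, "law": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.4, "law": 0.4},
}

def posterior(words):
    """Return P(Label(d) = c | d) for every class c, given the document's words."""
    # Numerator of Bayes' rule in log space: log P(c) + sum_k log P(w_k | c).
    log_joint = {
        c: math.log(CLASS_PRIOR[c]) + sum(math.log(WORD_PROB[c][w]) for w in words)
        for c in CLASS_PRIOR
    }
    # Denominator P(d) = sum over classes of the numerators (log-sum-exp for stability).
    top = max(log_joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in log_joint.items()}

def classify(words):
    """Pick the class with the highest posterior probability."""
    post = posterior(words)
    return max(post, key=post.get)

doc = ["team", "game", "vote"]
print(posterior(doc))
print(classify(doc))
```

For this document the unnormalized scores are 0.5 × 0.4 × 0.4 × 0.1 = 0.008 for sports and 0.5 × 0.1 × 0.1 × 0.4 = 0.002 for politics, so the posterior for sports is 0.008 / 0.010 = 0.8.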
Parameter Estimation with
Labeled Documents
• Estimating model parameters:
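$\theta = \{\theta_{w|c}, \theta_c\}$
$$\theta_{w|c} = P(w \mid c; \theta) = \frac{\sum_{d \in D} \#(w \in d)\, \mathrm{Ind}(\mathrm{Label}(d) = c)}{\sum_{w' \in V} \sum_{d' \in D} \#(w' \in d')\, \mathrm{Ind}(\mathrm{Label}(d') = c)}$$
$$\theta_c = P(c \mid \theta) = \frac{\sum_{d \in D} \mathrm{Ind}(\mathrm{Label}(d) = c)}{|D|}$$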
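These two estimates amount to counting. The sketch below reads #(w ∈ d) as the number of times word w occurs in document d and Ind(·) as a 0/1 indicator, which is an interpretation of the slide's notation. The function name `estimate` and the toy corpus are illustrative, and the optional `smoothing` argument is an addition not shown on the slide; with its default of 0 the code follows the formulas as written.

```python
from collections import Counter

def estimate(docs, labels, vocab, smoothing=0.0):
    """Estimate class priors and per-class word probabilities by counting.

    docs   : list of documents, each a list of words
    labels : list of class labels, one per document
    vocab  : list of words in the vocabulary V
    """
    classes = sorted(set(labels))
    # theta_c: fraction of documents carrying each label.
    class_prior = {c: labels.count(c) / len(docs) for c in classes}

    word_prob = {}
    for c in classes:
        counts = Counter()
        for words, label in zip(docs, labels):
            if label == c:  # Ind(Label(d) = c)
                counts.update(w for w in words if w in vocab)
        # Denominator: all word occurrences in documents of class c.
        total = sum(counts.values()) + smoothing * len(vocab)
        word_prob[c] = {w: (counts[w] + smoothing) / total for w in vocab}
    return class_prior, word_prob

DOCS = [["game", "team", "game"], ["vote", "law"], ["team", "vote"]]
LABELS = ["sports", "politics", "sports"]
VOCAB = ["game", "team", "vote", "law"]

class_prior, word_prob = estimate(DOCS, LABELS, VOCAB)
print(class_prior)
print(word_prob["sports"])
```

Without smoothing, any word never seen with a class gets probability zero (here, "game" under politics), which would zero out the whole product for a document containing it. Setting `smoothing=1.0` gives add-one smoothing as one common remedy.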
Parameter Estimation with
Unlabeled Documents
• EM: for “incomplete data” problems
• Maximize prob. of model generating observed data
• Build initial classifier (initialize the parameters to
“reasonable” starting values)
• Repeat until convergence
– E-Step: Use current classifier params, $\theta_t$, to estimate
$P(c \mid d; \theta_t)$ for all $d$ in $D^u$
– M-Step: Re-estimate the classifier, $\theta_{t+1}$, using the expected
counts from the E-Step
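The loop above can be sketched in plain Python with a naive Bayes classifier. This is a minimal illustration under several assumptions: add-one smoothing is used so that no probability is zero, a fixed number of iterations stands in for a convergence test, every word is assumed to be in the vocabulary, and the toy data and function names are invented for the example.

```python
import math

def m_step(docs, memberships, classes, vocab):
    """Re-estimate the parameters from (possibly fractional) class memberships."""
    prior, word_prob = {}, {}
    for c in classes:
        # Smoothed class prior (add-one smoothing is an assumption of this sketch).
        prior[c] = (1.0 + sum(m[c] for m in memberships)) / (len(classes) + len(docs))
        counts = {w: 1.0 for w in vocab}
        for words, m in zip(docs, memberships):
            for w in words:
                counts[w] += m[c]  # expected count of w in class c
        total = sum(counts.values())
        word_prob[c] = {w: counts[w] / total for w in vocab}
    return prior, word_prob

def e_step(words, prior, word_prob):
    """Estimate P(c | d) for one document under the current parameters."""
    log_joint = {c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in words)
                 for c in prior}
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return {c: math.exp(v - top) / norm for c, v in log_joint.items()}

def em(labeled, unlabeled, classes, vocab, iterations=10):
    lab_docs = [words for words, _ in labeled]
    lab_mem = [{c: 1.0 if c == label else 0.0 for c in classes} for _, label in labeled]
    # Build the initial classifier from the labeled documents only.
    prior, word_prob = m_step(lab_docs, lab_mem, classes, vocab)
    for _ in range(iterations):
        unl_mem = [e_step(words, prior, word_prob) for words in unlabeled]  # E-step
        prior, word_prob = m_step(lab_docs + unlabeled, lab_mem + unl_mem,
                                  classes, vocab)                           # M-step
    return prior, word_prob

CLASSES = ["sports", "politics"]
VOCAB = ["game", "team", "vote", "law", "season", "senate"]
LABELED = [(["game", "team"], "sports"), (["vote", "law"], "politics")]
UNLABELED = [["game", "season"], ["team", "season", "season"],
             ["vote", "senate"], ["law", "senate", "senate"]]

prior, word_prob = em(LABELED, UNLABELED, CLASSES, VOCAB)
print(e_step(["season"], prior, word_prob))
```

In this toy corpus "season" never appears in a labeled document, yet it co-occurs with "game" and "team" in the unlabeled ones, so the re-estimated model ends up associating it with the sports class.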
Augmented EM
• Weight the unlabeled data
– Otherwise, unlabeled data overwhelms the small
amount of labeled data
– Modify M-step to multiply expected counts by a
weight factor
• Relax the one-class, one-mixture-component
assumption
– Allow labeled data to fall into “topics” within a class
– Modify E-step to allow each labeled document to
probabilistically belong to sub-topics
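The weighting idea can be illustrated on the word-probability part of the M-step alone. In the sketch below the `weight` parameter, the function name, and the toy data are hypothetical, and add-one smoothing is again an assumption. Labeled documents contribute full counts, while the expected counts from unlabeled documents are scaled by `weight`. The multiple-mixture-component extension is not covered here.

```python
def weighted_word_probs(labeled, unlabeled_soft, classes, vocab, weight):
    """M-step for word probabilities with down-weighted unlabeled data.

    labeled        : list of (words, label) pairs
    unlabeled_soft : list of (words, {class: P(c | d)}) pairs from the E-step
    weight         : factor between 0 and 1 applied to the unlabeled counts
    """
    word_prob = {}
    for c in classes:
        counts = {w: 1.0 for w in vocab}  # add-one smoothing (assumption)
        for words, label in labeled:
            if label == c:
                for w in words:
                    counts[w] += 1.0  # labeled counts at full strength
        for words, membership in unlabeled_soft:
            for w in words:
                counts[w] += weight * membership[c]  # scaled expected counts
        total = sum(counts.values())
        word_prob[c] = {w: counts[w] / total for w in vocab}
    return word_prob

CLASSES = ["sports", "politics"]
VOCAB = ["game", "vote"]
LABELED = [(["game"], "sports")]
# Ten unlabeled documents that the current model (perhaps wrongly) assigns to sports.
UNLABELED_SOFT = [(["vote"], {"sports": 0.9, "politics": 0.1})] * 10

full = weighted_word_probs(LABELED, UNLABELED_SOFT, CLASSES, VOCAB, weight=1.0)
damped = weighted_word_probs(LABELED, UNLABELED_SOFT, CLASSES, VOCAB, weight=0.1)
print(full["sports"]["game"], damped["sports"]["game"])
```

With `weight=1.0` the single labeled document is swamped by the ten unlabeled ones; with `weight=0.1` the labeled evidence keeps most of its influence, and `weight=0.0` reduces to training on labeled data only.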