Transcript Document
Text Mining
What is Text Mining?
• There are many examples of text-based documents (all in
‘electronic’ format…)
– e-mails, corporate Web pages, customer surveys, résumés, medical records,
DNA sequences, technical papers, incident reports, news stories and
more…
• Not enough time or patience to read them all
– Can we extract the most vital kernels of information?
• So, we wish to find a way to gain knowledge (in summarised form)
from all that text, without reading or examining it fully first…!
– Some of it (e.g. DNA sequences) is hard to comprehend anyway!
What is Text Mining?
• Traditional data mining uses ‘structured data’ (n x p
matrix)
• Free-form text is also referred to as ‘unstructured data’
– successful categorisation of such data can be a difficult and
time-consuming task…
• Often, free-form text and structured data can be combined to
derive valuable, actionable information… (e.g. as in typical
surveys) – such data is semi-structured
“Search” versus “Discover”
                           Structured Data    Unstructured Data (Text)
Search (goal-oriented)     Data Retrieval     Information Retrieval
Discover (opportunistic)   Data Mining        Text Mining
© 2002, AvaQuest Inc.
Data Retrieval
• Find records within a structured database.
– Database Type: Structured
– Search Mode: Goal-driven
– Atomic entity: Data Record
– Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
– Example Query: SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true
Information Retrieval
• Find relevant information in an unstructured information source (usually text)
– Database Type: Unstructured
– Search Mode: Goal-driven
– Atomic entity: Document
– Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
– Example Query: “Japanese restaurant Boston” or Boston -> Restaurants -> Japanese
Data Mining
• Discover new knowledge through analysis of data
– Database Type: Structured
– Search Mode: Opportunistic
– Atomic entity: Numbers and Dimensions
– Example Information Need: “Show trend over time in # of visits to Japanese restaurants in Boston”
– Example Query: SELECT date, SUM(visits) FROM restaurants WHERE city = 'boston' AND type = 'japanese' GROUP BY date ORDER BY date
Text Mining
• Discover new knowledge through analysis of text
– Database Type: Unstructured
– Search Mode: Opportunistic
– Atomic entity: Language feature or concept
– Example Information Need: “Find the types of food poisoning most often associated with Japanese restaurants”
– Example Query: Rank diseases found associated with “Japanese restaurants”
Motivation for Text Mining
• Approximately 90% of the world’s data is held in
unstructured formats (source: Oracle Corporation)
• Information-intensive business processes demand that we move
beyond simple document retrieval to “knowledge” discovery.

[Pie chart: 90% unstructured or semi-structured information vs. 10% structured numerical or coded information]
Text Mining: Examples
• Text mining is an exercise to gain knowledge from stores
of language text.
• Text:
– Web pages
– Medical records
– Customer surveys
– Email filtering (spam)
– DNA sequences
– Incident reports
– Drug interaction reports
– News stories (e.g. predict stock movement)
What is Text Mining
• Data examples
– Web pages
– Customer surveys

Customer | Age | Sex | Tenure   | Comments                                  | Outcome
123      | 24  | M   | 12 years | Incorrect charges on bill; customer angry | Y
243      | 26  | F   | 1 month  | Inquiry about charges to India            | N
346      | 54  | M   | 3 years  | Question about charges on bill            | N
Text Mining
• Typically falls into one of two categories
– Analysis of text: I have a bunch of text I am
interested in, tell me something about it
• E.g. sentiment analysis, “buzz” searches
– Retrieval: There is a large corpus of text documents,
and I want the one closest to a specified query
• E.g. web search, library catalogs, legal and medical
precedent studies
Text Mining: Analysis
• Which words are most present?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?
Text Mining: Retrieval
• Find k objects in the corpus of documents which
are most similar to my query.
• Can be viewed as “interactive” data mining – the query is not specified a priori.
• Main problems of text retrieval:
– What does “similar” mean?
– How do I know if I have the right documents?
– How can I incorporate user feedback?
IR architecture
[Figure: IR architecture – from CS583, Bing Liu, UIC]
Text Retrieval: Challenges
• Calculating similarity is not obvious - what is the distance between
two sentences or queries?
• Evaluating retrieval is hard: what is the “right” answer? (no
ground truth)
• Users can query things you have not seen before, e.g. misspelled,
foreign, or new terms.
• Goal (score function) is different than in classification/regression:
we are not looking to model all of the data, just to get the best results
for a given user.
• Words can hide semantic content
– Synonymy: A keyword T does not appear anywhere in the document, even
though the document is closely related to T, e.g., data mining
– Polysemy: The same keyword may mean different things in different
contexts, e.g., mining
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
• Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Precision vs. Recall
                        | Truth: Relevant | Truth: Not Relevant
Algorithm: Relevant     | TP              | FP
Algorithm: Not Relevant | FN              | TN
• We’ve been here before!
– Precision = TP/(TP+FP)
– Recall = TP/(TP+FN)
– Trade off:
• If algorithm is ‘picky’: precision high, recall low
• If algorithm is ‘relaxed’: precision low, recall high
– BUT: recall often hard if not impossible to calculate
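As a small illustration (a minimal Python sketch with made-up counts), precision and recall follow directly from the confusion-matrix cells:

    # Precision and recall from confusion-matrix counts (toy numbers).
    def precision_recall(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall

    # Hypothetical retrieval run: 40 relevant docs retrieved,
    # 10 irrelevant docs retrieved, 20 relevant docs missed.
    p, r = precision_recall(tp=40, fp=10, fn=20)
    print(p, r)   # 0.8 0.666...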
Precision Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point
for precision vs. recall. (similar to thresholds in ROC curves)
• Different retrieval algorithms might have very different curves –
it can be hard to tell which is “best”
Term / document matrix
• Most common form of representation in text
mining is the term - document matrix
– Term: typically a single word, but could be a word
phrase like “data mining”
– Document: a generic term meaning a collection of
text to be retrieved
– Can be large - terms are often 50k or larger,
documents can be in the billions (www).
– Can be binary, or use counts
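A minimal sketch of building such a matrix in Python with scikit-learn; the toy documents are hypothetical, and note that scikit-learn produces the transpose (a document-term matrix, rows = documents):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["database sql index sql",
            "regression likelihood linear regression",
            "database index"]
    vec = CountVectorizer()        # binary=True gives a 0/1 matrix instead of counts
    X = vec.fit_transform(docs)    # sparse matrix, documents x terms
    print(vec.get_feature_names_out())
    print(X.toarray())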
Term document matrix
Example: 10 documents, 6 terms

       Database  SQL  Index  Regression  Likelihood  linear
D1        24     21     9        0           0          3
D2        32     10     5        0           3          0
D3        12     16     5        0           0          0
D4         6      7     2        0           0          0
D5        43     31    20        0           3          0
D6         2      0     0       18           7          6
D7         0      0     1       32          12          0
D8         3      0     0       22           4          4
D9         1      0     0       34          27         25
D10        6      0     0       17           4         23
• Each document now is just a vector of terms, sometimes
boolean
Term document matrix
• We have lost all semantic content
• Be careful constructing your term list!
– Not all words are created equal!
– Words that are the same should be treated the same!
• Stop Words
• Stemming
Stop words
• Many of the most frequently used words in English are worthless in
retrieval and text mining – these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain specific stop words list may be
constructed
• Why do we need to remove stop words?
– Reduce indexing (or data) file size
• stop words account for 20-30% of total word counts
– Improve efficiency
• stop words are not useful for searching or text mining
• stop words always have a large number of hits
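A minimal stop-word filter might look like the sketch below; the tiny hand-made list is for illustration only, real lists run to hundreds of words:

    STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

    def remove_stop_words(tokens):
        # Keep only tokens that are not on the stop-word list.
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("the analysis of free form text".split()))
    # -> ['analysis', 'free', 'form', 'text']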
Stemming
• Techniques used to find the root/stem of a word, e.g.:
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
• Usefulness:
– improving effectiveness of retrieval and text mining
• matching similar words
– reducing indexing size
• combining words with the same root may reduce indexing size by as
much as 40-50%
Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only of
one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …...
• transform words
– if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”
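A toy Python implementation of a few of the rules above is sketched below; real systems use a full rule set such as the Porter stemmer (e.g. nltk.stem.PorterStemmer):

    VOWELS = set("aeiou")

    def stem(word):
        # "ies" -> "y", unless preceded by e or a.
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"
        # Ends in "es": drop the s.
        if word.endswith("es"):
            return word[:-1]
        # Ends in a consonant (other than s) followed by "s": drop the s.
        if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS | {"s"}:
            return word[:-1]
        # Ends in "ing": drop it unless only one letter or "th" would remain.
        if word.endswith("ing") and len(word[:-3]) > 1 and word[:-3] != "th":
            return word[:-3]
        # Ends in consonant + "ed": drop "ed" unless a single letter would remain.
        if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
            return word[:-2]
        return word

    print([stem(w) for w in ["users", "engineered", "engineering", "flies"]])
    # -> ['user', 'engineer', 'engineer', 'fly']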
Feature Selection
• Performance of text classification algorithms can be optimized by
selecting only a subset of the discriminative terms
– Even after stemming and stopword removal.
• Greedy search
– Start from full set and delete one at a time
– Find the least important variable
• Can use Gini index for this if a classification problem
• Often performance does not degrade even with orders of
magnitude reductions
– Only 140 out of 20,000 terms needed for classification!
Distances in TD matrices
• Given a term-document matrix representation, we can now define
distances between documents (or terms!)
• Elements of matrix can be 0,1 or term frequencies (sometimes
normalized)
• Can use Euclidean or cosine distance
• Cosine distance is based on the angle between the two vectors
• Not intuitive, but it has been proven to work well
• If docs are identical, the cosine similarity dc = 1; if they have nothing in common, dc = 0
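A small numpy sketch of both distances, using two document rows from the example matrix above:

    import numpy as np

    d1 = np.array([24, 21, 9, 0, 0, 3])   # D1 from the example matrix
    d2 = np.array([32, 10, 5, 0, 3, 0])   # D2

    euclidean = np.linalg.norm(d1 - d2)
    cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(euclidean, cosine)   # identical docs -> cosine 1; disjoint docs -> 0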
• We can calculate cosine and Euclidean distance
for this matrix
• What would you want the distances to look like?
(The 10 x 6 term-document matrix from the earlier slide is repeated here.)
Document distance
• Pairwise distances between documents
• [Image plots of cosine, Euclidean, and scaled Euclidean distances – R function: ‘image’]
Information retrieval models
• An IR model governs how a document and a
query are represented and how the relevance of
a document to a user query is defined.
• Main models:
– Boolean model
– Vector space model
– Statistical language model
– etc.
Boolean model
• Each document or query is treated as a “bag” of
words or terms. Word sequence is not
considered.
• Given a collection of documents D, let V = {t1,
t2, ..., t|V|} be the set of distinct words/terms
in the collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti
of a document dj ∈ D. For a term that does not
appear in document dj, wij = 0.
dj = (w1j, w2j, ..., w|V|j),
Boolean model (contd)
• Query terms are combined logically using the
Boolean operators AND, OR, and NOT.
– E.g., ((data AND mining) AND (NOT text))
• Retrieval
– Given a Boolean query, the system retrieves every
document that makes the query logically true.
– Called exact match.
• The retrieval results are usually quite poor
because term frequency is not considered.
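A minimal sketch of exact-match Boolean retrieval over a hypothetical inverted index (term -> set of document ids):

    index = {
        "data":   {1, 2, 5},
        "mining": {1, 5, 7},
        "text":   {5, 9},
    }

    # Query: (data AND mining) AND (NOT text)
    hits = (index["data"] & index["mining"]) - index["text"]
    print(hits)   # -> {1}: every document that makes the query logically true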
Vector space model
• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term
weight is computed based on some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) Scheme: The weight of a term ti in
document dj is the number of times that ti appears in dj,
denoted by fij. Normalization may also be applied.
TF-IDF term weighting scheme
• The most well-known weighting scheme
– TF: still term frequency
– IDF: inverse document frequency, idfi = log(N / dfi)
• N: total number of docs
• dfi: the number of docs in which term ti appears
• The final TF-IDF term weight is: wij = tfij × idfi
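A sketch of this weighting in numpy, using the log form given above (rows are documents, columns are terms; the data rows are from the example matrix):

    import numpy as np

    td = np.array([[24, 21, 9,  0, 0, 3],
                   [32, 10, 5,  0, 3, 0],
                   [ 2,  0, 0, 18, 7, 6]])
    N = td.shape[0]               # total number of documents
    df = (td > 0).sum(axis=0)     # number of documents containing each term
    idf = np.log(N / df)
    tfidf = td * idf              # w_ij = tf_ij * idf_i
    # Note: a term that occurs in every document gets idf = log(1) = 0.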
Retrieval in vector space model
• Query q is represented in the same way or slightly
differently.
• Relevance of di to q: Compare the similarity of query q
and document di.
• Cosine similarity (the cosine of the angle between the two
vectors): cos(q, di) = (q · di) / (‖q‖ ‖di‖)
• Cosine is also commonly used in text clustering
An Example
• A document space is defined by three terms:
– hardware, software, users
– the vocabulary
• A set of documents is defined as:
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
– A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
– A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the Query is “hardware and software”
• what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
– document A4, A7 will be retrieved (“AND”)
– retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
– q = (1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
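These similarity scores are easy to check with a few lines of numpy:

    import numpy as np

    q = np.array([1, 1, 0])
    A = [(1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1),
         (0,1,1), (1,1,1), (1,0,1), (0,1,1)]          # A1..A9

    for i, d in enumerate(A, start=1):
        d = np.array(d)
        s = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
        print(f"S(q, A{i}) = {s:.2f}")   # A4 -> 1.00, A7 -> 0.82, A1/A2 -> 0.71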
Okapi relevance method
• Another way to assess the degree of relevance is to directly
compute a relevance score for each document to the query.
• The Okapi method and its variations are popular
techniques in this setting.
Relevance feedback
• Relevance feedback is one of the techniques for improving
retrieval effectiveness. The steps:
– the user first identifies some relevant (Dr) and irrelevant
documents (Dir) in the initial list of retrieved documents
– the system expands the query q by extracting some additional
terms from the sample relevant and irrelevant documents to
produce qe
– Perform a second round of retrieval.
• Rocchio method (α, β and γ are parameters):
qe = α·q + (β/|Dr|)·Σ(d in Dr) d - (γ/|Dir|)·Σ(d in Dir) d
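A numpy sketch of the expansion step; the default weights below (α=1, β=0.75, γ=0.25) are common choices in the literature, not values from the slide:

    import numpy as np

    def rocchio(q, D_r, D_ir, alpha=1.0, beta=0.75, gamma=0.25):
        # qe = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dir|)*sum(Dir)
        qe = alpha * np.asarray(q, dtype=float)
        if len(D_r):
            qe += beta * np.mean(D_r, axis=0)
        if len(D_ir):
            qe -= gamma * np.mean(D_ir, axis=0)
        return np.clip(qe, 0, None)   # negative term weights are usually dropped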
Rocchio text classifier
• In fact, a variation of the Rocchio method above, called
the Rocchio classification method, can be used to
improve retrieval effectiveness too
– as can other machine learning methods. Why?
• The Rocchio classifier is constructed by producing a prototype
vector ci for each class i (relevant or irrelevant in this case);
in classification, cosine similarity is used.
Queries
• A query is a representation of the user’s information needs
– Normally a list of words.
• Once we have a TD matrix, queries can be represented as a vector in
the same space
– “Database Index” = (1,0,1,0,0,0)
• Query can be a simple question in natural language
• Calculate cosine distance between query and the TF x IDF version of
the TD matrix
• Returns a ranked vector of documents
Latent Semantic Indexing
• Criticism: queries can be posed in many ways, but still
mean the same
– Data mining and knowledge discovery
– Car and automobile
– Beet and beetroot
• Semantically, these are the same, and documents with
either term are relevant.
• Synonym lists and thesauri are one solution, but they are messy
and difficult.
• Latent Semantic Indexing (LSI): tries to extract hidden
semantic structure in the documents
• Search what I meant, not what I said!
LSI
• Approximate the T-dimensional term space using
principal components calculated from the TD matrix
• The first k PC directions provide the best set of k
orthogonal basis vectors - these explain the most
variance in the data.
– Data is reduced to an N x k matrix, without much loss of
information
• Each “direction” is a linear combination of the input
terms, and defines a clustering of “topics” in the data.
LSI
• Typically done using Singular Value Decomposition
(SVD) to find principal components
[Figure: SVD of the term-document matrix (10 x 6 term weighting by document) into a new orthogonal basis for the data (PC directions) and a diagonal matrix of singular values]
For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8)
Fraction of the variance explained (PC1 & PC2) =
(77.4² + 69.5²) / (77.4² + 69.5² + 22.9² + 13.5² + 12.1² + 4.8²) = 92.5%
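These numbers can be reproduced with numpy's SVD of the term-document matrix above:

    import numpy as np

    td = np.array([[24, 21,  9,  0,  0,  3],
                   [32, 10,  5,  0,  3,  0],
                   [12, 16,  5,  0,  0,  0],
                   [ 6,  7,  2,  0,  0,  0],
                   [43, 31, 20,  0,  3,  0],
                   [ 2,  0,  0, 18,  7,  6],
                   [ 0,  0,  1, 32, 12,  0],
                   [ 3,  0,  0, 22,  4,  4],
                   [ 1,  0,  0, 34, 27, 25],
                   [ 6,  0,  0, 17,  4, 23]])   # D1..D10

    U, s, Vt = np.linalg.svd(td, full_matrices=False)
    print(np.round(s, 1))                        # ~ (77.4, 69.5, 22.9, ...)
    print((s[0]**2 + s[1]**2) / np.sum(s**2))    # ~ 0.925
    docs_2d = U[:, :2] * s[:2]                   # documents in the 2-D LSI space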
LSI
[Figure: documents plotted on the top 2 PCs, which act as new pseudo-terms describing the documents]
• Also, we can look at the first two principal components:
– (0.74, 0.49, 0.27, 0.28, 0.18, 0.19) -> emphasizes the first two terms
– (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31) -> separates the two clusters
• Note how distance from the origin shows the number of terms,
and angle (from the origin) shows similarity as well
LSI
• Here we show the same
plot, but with two new
documents, one with the
term “SQL” 50 times,
another with the term
“Databases” 50 times.
• Even though they have no
phrases in common, they
are close in LSI space
Textual analysis
• Once we have the data into a nice matrix
representation (TD, TDxIDF, or LSI), we can
throw the data mining toolbox at it:
– Classification of documents
• If we have training data for classes
– Clustering of documents
• unsupervised
Automatic document classification
• Motivation
– Automatic classification for the tremendous number of on-line text documents (Web
pages, e-mails, etc.)
– Customer comments: Requests for info, complaints, inquiries
• A classification problem
– Training set: Human experts generate a training data set
– Classification: The computer system discovers the classification rules
– Application: The discovered rules can be applied to classify new/unknown documents
• Techniques
– Linear/logistic regression, naïve Bayes
– Trees not so good here due to massive dimension, few interactions
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Also called “multivariate Bernoulli”
– Assumes conditional independence of the terms given the class:
p(x | ck) = ∏j p(xj | ck)
– Note that we model each term xj as a discrete random variable
In other words, the probability that a bunch of words comes from a given class
equals the product of the individual probabilities of those words.
.
Multinomial Classifier for Text
• Multinomial Classification model
– Assumes that the data are generated by a p-sided die (multinomial model)
p(x | ck) = p(Nx | ck) · ∏(j=1..p) p(tj | ck)^xj
– where Nx = number of terms (total count) in document x
– xj = number of times term j occurs in the document
– ck = class k
– Based on training data, each class has its own multinomial probability across all
words.
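A sketch with scikit-learn, using hypothetical labels for a few rows of the example matrix (swap in BernoulliNB for the multivariate Bernoulli model of the previous slide):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X = np.array([[24, 21, 9,  0,  0, 3],     # term counts per document
                  [32, 10, 5,  0,  3, 0],
                  [ 2,  0, 0, 18,  7, 6],
                  [ 0,  0, 1, 32, 12, 0]])
    y = np.array(["databases", "databases", "statistics", "statistics"])

    clf = MultinomialNB().fit(X, y)
    print(clf.predict([[10, 5, 2, 0, 0, 0]]))  # -> ['databases']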
Document Clustering
• Can also do clustering, or unsupervised learning, of docs.
• Automatically group related documents based on their content.
• Requires no training sets or predetermined taxonomies.
Major steps
– Preprocessing
• Remove stop words, stem, feature extraction, lexical analysis, …
– Hierarchical clustering
• Compute similarities applying clustering algorithms, …
– Slicing
• Fan out controls, flatten the tree to desired number of levels.
• Like all clustering examples, success is relative
Document Clustering
• To Cluster:
– Can use LSI
– Another model: Latent Dirichlet Allocation (LDA)
– LDA is a generative probabilistic model of a corpus. Documents are
represented as random mixtures over latent topics, where a topic is
characterized by a distribution over words.
• LDA:
– Three concepts: words, topics, and documents
– Documents are a collection of words and have a probability
distribution over topics
– Topics have a probability distribution over words
– Fully Bayesian Model
LDA
• Assume the data was generated by a generative process:
– θ is a document, made up from topics drawn from a probability distribution
– z is a topic, made up from words drawn from a probability distribution
– w is a word, the only real observable (N = number of words in all documents)
– α = parameter of the per-document topic distributions
• Then, the LDA equations are specified in a fully Bayesian model:
[Figure: the LDA model equations]
• These can be solved via advanced computational techniques (see Blei et al., 2003)
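In practice, fitting LDA is a library call; a sketch with scikit-learn on a hypothetical toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["database sql index query", "sql database index",
            "regression likelihood linear model", "linear regression likelihood"]
    X = CountVectorizer().fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    topic_word = lda.components_    # topic-word weights (topics over words)
    doc_topic = lda.transform(X)    # per-document topic mixtures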
LDA output
• The result can be an often-useful classification of documents into topics, and a
distribution of each topic across words:
Another Look at LDA
• Model: Topics made up of words used to generate documents
Another Look at LDA
• Reality: Documents observed, infer topics
Text Mining: Helpful Data
• WordNet – a lexical database of English that groups words into sets of synonyms
[Figure courtesy of Luca Lanzi]
Text Mining - Other Topics
• Part of Speech Tagging
– Assign grammatical tags to words (verb, noun, etc)
– Helps in understanding documents; uses Hidden Markov Models
• Named Entity Classification
– Classification task: can we automatically detect proper nouns and tag them
– “Mr. Jones” is a person; “Madison” is a town.
– Helps with disambiguation, e.g. “spears”
Text Mining - Other Topics
• Sentiment Analysis
– Automatically determine tone in text: positive, negative or neutral
– Typically uses collections of good and bad words
– “While the traditional media is slowly starting to take John McCain’s straight talking
image with increasingly large grains of salt, his base isn’t quite ready to give up on their
favorite son. Jonathan Alter’s bizarre defense of McCain after he was caught telling an
outright lie, perfectly captures that reluctance[.]”
– Often fit using Naïve Bayes
• There are sentiment word lists out there:
– See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
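A minimal word-list scorer of the kind described above; the toy lists here stand in for a real sentiment lexicon such as those linked:

    POSITIVE = {"good", "great", "excellent", "love", "favorite"}
    NEGATIVE = {"bad", "awful", "lie", "bizarre", "reluctance"}

    def sentiment(text):
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentiment("a great and excellent product"))   # -> positive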
Text Mining - Other Topics
• Summarizing text: Word Clouds
– Takes text as input, finds the most interesting words, and
displays them graphically
– Blogs do this
– Wordle.net
[Figure: word cloud of Modest Mouse lyrics]
References
• Text classification
– Excellent lecture by William Cohen – quite detailed; uses kNN, TF-IDF, neural nets, and other models
– http://videolectures.net/mlas06_cohen_tc/
• LDA and topic models
– Seminal paper: Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). “Latent Dirichlet allocation”. Journal of Machine Learning Research 3
– Tutorial on text topic modelling from David Blei:
http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
• General text mining topics
– Many text mining tutorials are available at LingPipe:
• http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html
– Code available but written in Java
• Sentiment Analysis
– Bing Liu tutorial: http://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf
– Searching for Bing Liu will find many resources on this topic
– Sentiment analysis in Twitter: http://danzambonini.com/self-improving-bayesian-sentiment-analysis-for-twitter/
• Twitter text mining tutorial
– http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/