Text Mining - Georges Gardarin

Text Mining
Huaizhong KOU, PhD student of Georges Gardarin
PRiSM Laboratory
0. Content
1. Introduction
– What happens
– What is Text Mining
– Text Mining vs Data Mining
– Applications
2. Feature Extraction
– Task
– Indexing
– Dimensionality Reduction
3. Document Categorization
– Task
– Architecture
– Categorization Classifiers
– Application: Trend Analysis
4. Document Clustering
– Task
– Algorithms
– Application
5. Product
6. Reference
1. Text Mining:Introduction
1.1 What happens
1.2 What’s Text Mining
1.3 Text Mining vs Data Mining
1.4 Application
1.1 Introduction:What happens(1)

• Information explosion
• 80% of information is stored in text documents: journals, web pages, emails...
• It is difficult to extract specific information
• Current technologies...
1.1 Introduction: What happens(2)
• It is necessary to automatically analyze, organize, summarize...
1.2 Introduction: What’s Text Mining(1)
• Text Mining ::= the procedure of synthesizing information by analyzing the relations, patterns, and rules among textual data - semi-structured or unstructured text.
• Techniques ::=
– data mining
– machine learning
– information retrieval
– statistics
– natural-language understanding
– case-based reasoning
Ref:[1]-[4]
1.2 Introduction: What’s Text Mining(2)
(Figure: the text mining process - sample documents are transformed into representation models; a learning phase produces domain-specific templates/models, which are then applied to new text documents in the working phase and the results are presented as visualizations.)
1.3 Introduction:TM vs DM
                      Data Mining                           Text Mining
Data object           numerical & categorical data          textual data
Data structure        structured                            unstructured & semi-structured
Data representation   straightforward                       complex
Space dimension       < tens of thousands                   > tens of thousands
Methods               data analysis, machine learning,      data mining, information retrieval,
                      statistics, neural networks           NLP, ...
Maturity              broad implementation since 1994       broad implementation starting 2000
Market                10^5 analysts at large and            10^8 analysts, corporate workers
                      mid-size companies                    and individual users
Ref:[12][13]
1.4 Introduction:Application
• The potential applications are countless.
• Customer profile analysis
• Trend analysis
• Information filtering and routing
• Event tracking
• News story classification
• Web search
• ...
Ref:[12][13]
2. Feature Extraction
2.1 Task
2.2 Indexing
2.3 Weighting Model
2.4 Dimensionality Reduction
Ref:[7][11][14][18][20][22]
2.1 Feature Extraction:Task(1)
Task: Extract a good subset of words to represent documents
(Figure: document collection → all unique words/phrases → feature extraction → all good words/phrases)
Ref:[7][11][14][18][20][22]
2.1 Feature Extraction:Task(2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of the text content.
Example: the 16 words of the sentence above,
while-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content
are reduced by feature extraction to 5 feature terms:
text-information-online-retrieval-index
Ref:[7][11][14][18][20][22]
2.2 Feature Extraction:Indexing(1)
Training documents
→ Identification of all unique words
→ Removal of stop words (non-informative words, e.g. {the, and, when, more})
→ Word stemming: removal of suffixes to generate word stems, grouping word variants to increase relevance, e.g. {walker, walking} → walk
→ Term weighting: naive terms weighted by the importance of each term in the document
Ref:[7][11][14][18][20][22]
2.2 Feature Extraction:Indexing(2)
• Document representation: vector space model
d = (w1, w2, …, wt) ∈ R^t
wi is the weight of the ith term in document d.
Ref:[7][11][14][18][20][22]
2.3 Feature Extraction:Weighting Model(1)
• tf - term frequency weighting
wij = Freqij
Freqij ::= the number of times the jth term occurs in document Di.
Drawback: does not reflect an importance factor for document discrimination.
• Ex.
D1 = ABRTSAQWAXAO
D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   3  1  0  1  1  1  1  1  1  1
D2   3  2  1  0  1  1  1  1  0  1
Ref:[11][22]
2.3 Feature Extraction:Weighting Model(2)
•tfidf - Inverse Document Frequency weighting
wij = Freqij * log(N/ DocFreqj) .
N : := the number of documents in the training
document collection.
DocFreqj ::= the number of documents in
which the jth term occurs.
Advantage: with reflection of importance factor for
document discrimination.
Assumption:terms with low DocFreq are better discriminator
than ones with high DocFreq in document collection
•Ex.
A B K O Q R S T W X
Ref:[11][22]
Ref:[13]
D1
0
0
0 0.3 0
D2
0
0 0.3 0
0
0 0 0 0.3 0
0 0 0
0
0
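To make the tf and tfidf schemes concrete, here is a minimal Python sketch (not from the original slides; the toy documents and the base-10 logarithm are assumptions for illustration):

```python
import math
from collections import Counter

# Toy tokenized documents; any tokenization works here.
docs = {
    "D1": "text information online retrieval index text".split(),
    "D2": "data mining information retrieval index".split(),
}
terms = sorted({t for tokens in docs.values() for t in tokens})
N = len(docs)

# tf: Freq_ij = number of times term j occurs in document Di
tf = {name: Counter(tokens) for name, tokens in docs.items()}

# DocFreq_j = number of documents in which term j occurs
doc_freq = {t: sum(1 for c in tf.values() if c[t] > 0) for t in terms}

# tfidf: w_ij = Freq_ij * log(N / DocFreq_j)
tfidf = {name: {t: c[t] * math.log10(N / doc_freq[t]) for t in terms}
         for name, c in tf.items()}

for name in docs:
    print(name, {t: round(w, 2) for t, w in tfidf[name].items() if w > 0})
```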
2.3 Feature Extraction:Weighting Model(3)
• Entropy weighting
wij = log(Freqij + 1.0) * (1 + entropy(wj))
where
entropy(wj) = (1/log N) * Σ_{i=1..N} [ (Freqij / DocFreqj) * log(Freqij / DocFreqj) ]
is the average entropy of the jth term:
-1: if the word occurs once in every document
 0: if the word occurs in only one document
Ref:[11][22][13]
2.4 Feature Extraction:Dimension Reduction
•2.4.1 Document Frequency Thresholding
•2.4.2 X2-statistic
•2.4.3 Latent Semantic Indexing
Ref:[11][20][21][27]
2.4.1 Dimension Reduction:DocFreq Thresholding
• Document Frequency Thresholding
Training documents D → naive terms
1. Calculate DocFreq(w) for each term w
2. Set a threshold θ
3. Remove all words with DocFreq < θ
→ Feature terms
Ref:[11][20][21][27]
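A minimal sketch of this thresholding step, assuming per-term document frequencies are already computed (the threshold value and term names are illustrative):

```python
def df_threshold(doc_freq, theta=2):
    """Keep only the terms that occur in at least `theta` documents."""
    return {term for term, df in doc_freq.items() if df >= theta}

# Illustrative counts: terms appearing in fewer than 2 documents are dropped.
doc_freq = {"text": 5, "retrieval": 3, "zebra": 1}
print(df_threshold(doc_freq))  # {'text', 'retrieval'} (set order may vary)
```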
2.4.2 Dimension Reduction: X2-statistic
• Assumption: a pre-defined category set C = {c1, c2, …, cm} for a training collection D
• Goal: estimate the independence between each term and each category

Term categorical score:
X²(w, cj) = N * (A*D - C*B)² / [ (A+C) * (B+D) * (A+B) * (C+D) ]
X²max(w) = max_j X²(w, cj)
where
A := |{d | d ∈ cj and w ∈ d}|
B := |{d | d ∉ cj and w ∈ d}|
C := |{d | d ∈ cj and w ∉ d}|
D := |{d | d ∉ cj and w ∉ d}|
N := |{d | d ∈ D}|

Procedure: start from the naive terms, set a threshold θ, and remove all words with X²max(w) < θ; the remaining words are the feature terms.
Ref:[11][20][21][27]
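A small Python sketch of the X² score, taking the contingency counts A, B, C, D directly (function names and the example counts are illustrative):

```python
def chi2_score(A, B, C, D):
    """X^2(w, cj) from the 2x2 contingency counts of term w and category cj."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi2_max(counts_per_category):
    """X^2_max(w): maximum of X^2(w, cj) over all categories cj."""
    return max(chi2_score(*counts) for counts in counts_per_category)

# Example: one term scored against two categories, counts given as (A, B, C, D).
print(chi2_max([(10, 5, 2, 83), (1, 14, 11, 74)]))
```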
2.4.3 Dimension Reduction:LSI(1)
• LSI = Latent Semantic Indexing
1. SVD model: singular value decomposition of the term-document matrix X

X  =  T0 * S0 * D0'        with T0'T0 = I and D0'D0 = I
(t,d)  (t,m) (m,m) (m,d)

X' =  D0 * S0 * T0'
(d,t)  (d,m) (m,m) (m,t)

T0: t×m orthogonal eigenvector matrix; its columns are eigenvectors of X*X'.
D0: d×m orthogonal eigenvector matrix; its columns are eigenvectors of X'*X.
S0: m×m diagonal matrix of singular values (square roots of eigenvalues) in decreasing order of importance.
m: rank of matrix X, m <= min(t,d).
Ref:[11][20][21][27]
2.4.3 Dimension Reduction:LSI(2)
2. Approximate model: select k <= m and keep only the k largest singular values

X  ≈  appr(X)  =  T * S * D'
(t,d)             (t,k) (k,k) (k,d)

X' ≈  appr(X)' =  D * S * T'

• Every row of the matrix appr(X)' approximately represents one document.
• Given a row xi of appr(X)' and its corresponding row di of D, the following holds:
xi = di * S * T'   and   di = xi * T * S^-1
Ref:[11][20][21][27]
2.4.3 Dimension Reduction:LSI(3)
3. Document representation model:
A new document d over the t naive terms, d = (w1, w2, …, wt) ∈ R^t, is mapped to
d  ≈  appr(d)  =  d * T * S^-1  ∈ R^k
(1,t)            (1,t) (t,k) (k,k)
• There is no good method to determine k; it depends on the application domain.
• Some experiments suggest 100 <= k <= 300.
Ref:[11][20][21][27]
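A minimal numpy sketch of the LSI projection described above (the toy term-document matrix and the choice k = 2 are illustrative):

```python
import numpy as np

# Toy term-document matrix X (t terms x d documents).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

# SVD: X = T0 @ diag(s0) @ D0t, singular values in decreasing order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
T, S = T0[:, :k], np.diag(s0[:k])

# Project a new document vector d (1 x t) into the k-dimensional LSI space:
# appr(d) = d @ T @ S^-1
d = np.array([[1., 0., 2., 1.]])
d_k = d @ T @ np.linalg.inv(S)
print(d_k)  # the (1, k) reduced representation of d
```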
3. Document Categorization
3.1 Task
3.2 Architecture
3.3 Categorization Classifiers
3.4 Application
Ref:[1][2][4][5][11][18][23][24]
3.1 Categorization:Task
• Task: assign one or more predefined categories (topics, themes) to a document.
3.2 Categorization:Architecture
(Figure: training documents are preprocessed - weighting, feature selection - and used together with the predefined categories to build a classifier; a new document d is then given to the classifier, which outputs the category or categories assigned to d.)
3.3 Categorization Classifiers
3.3.1 Centroid-based Classifier
3.3.2 k-Nearest Neighbor Classifier
3.3.3 Naive Bayes Classifier
3.3.1 Model:Centroid-Based Classifier(1)
1. Input: a new document d = (w1, w2, …, wn);
2. Predefined categories: C = {c1, c2, …, cl};
3. // Compute the centroid vector of each category
ci = ( Σ_{d'∈ci} d' ) / |ci| ,  ci ∈ C
4. // Similarity model - cosine function
Simil(di, dj) = cos(di, dj) = Σk wik*wjk / ( sqrt(Σk wik²) * sqrt(Σk wjk²) )
5. // Compute similarities
Simil(ci, d) = cos(ci, d) for each ci ∈ C
6. // Output: assign to document d the category cmax such that
Simil(ci, d) <= Simil(cmax, d) for all ci ∈ C
3.3.1 Model:Centroid-Based Classifier(2)
(Figure: two document vectors with angles α > β, hence cos(α) < cos(β); d2 is closer to d1 than d3 is.)
• The cosine-based similarity model can reflect the relations between features.
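A compact sketch of the centroid-based classifier on dense document vectors, assuming toy category data (a simplified illustration, not the slides' exact procedure):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def train_centroids(docs_by_category):
    """Centroid of each category = mean of its training document vectors."""
    return {c: np.mean(vs, axis=0) for c, vs in docs_by_category.items()}

def classify(d, centroids):
    """Assign d to the category whose centroid is most cosine-similar."""
    return max(centroids, key=lambda c: cosine(d, centroids[c]))

# Toy 3-term document vectors for two categories.
training = {
    "sports": [np.array([3., 0., 1.]), np.array([4., 1., 0.])],
    "health": [np.array([0., 2., 3.]), np.array([1., 3., 2.])],
}
centroids = train_centroids(training)
print(classify(np.array([2., 0., 1.]), centroids))  # likely "sports"
```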
3.3.2 Model:K-Nearest Neighbor Classifier
1. Input: a new document d;
2. Training collection: D = {d1, d2, …, dn};
3. Predefined categories: C = {c1, c2, …, cl};
4. // Compute similarities
for (di ∈ D) { Simil(d, di) = cos(d, di); }
5. // Select the k nearest neighbors
Construct the k-document subset Dk so that
Simil(d, di) <= min( Simil(d, doc) | doc ∈ Dk ) for all di ∈ D − Dk.
6. // Compute a score for each category
for (ci ∈ C) {
  score(ci) = 0;
  for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) ? 1 : 0); }
}
7. // Output: assign to d the category c with the highest score:
score(c) >= score(ci) for all ci ∈ C − {c}
3.3.3 Model:Naive Bayes Classifier
Basic assumption: all terms are distributed in documents independently of each other.
1. Input: a new document d;
2. Predefined categories: C = {c1, c2, …, cl};
3. // Compute the probability that d belongs to each class ci ∈ C
for (ci ∈ C) {
  Pr(ci | d) = Pr(d | ci) * Pr(ci) / Pr(d)
             = Pr(d | ci) * Pr(ci) / Σ_{cj∈C} Pr(d | cj) * Pr(cj)
  // note that the terms wj of the document are assumed to be independent of each other
             = Pr(ci) * Π_{j=1..|d|} Pr(wj | ci) / Σ_{ck∈C} [ Pr(ck) * Π_{j=1..|d|} Pr(wj | ck) ]
}
4. // Output: assign to d the category c with the highest probability:
Pr(c | d) = max_{ci∈C} Pr(ci | d)
3.4 Categorization: Application
Trend Analysis - the EAnalyst system
• Goal: predict trends in stock prices based on news stories
• Trend ::= (slope, confidence)
(Figure: learning process - piecewise linear fitting finds trends in the time-series data; relevant sample documents are retrieved and aligned with the trends; language models are learned from the aligned sample documents. Categorization - a Bayes classifier uses the language models to assign new news documents to a trend.)
Ref:[28]
4. Document Clustering
4.1 Task
4.2 Algorithms
4.3 Application
Ref:[5][7][8][9][10][15][16][29]
4.1 Document Clustering:Task
• Task: group all documents so that the documents in the same group are more similar to each other than to documents in other groups.
• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Ref:[5][7][8][9][10][15][16][29]
4.2 Document Clustering:Algorithms
•4.2.1 k-means
•4.2.2 Hierarchic Agglomerative Clustering(HAC)
•4.2.3 Association Rule Hypergraph Partitioning (ARHP)
Ref:[5][7][8][9][10][15][16][29]
4.2.1 Document Clustering:k-means
•k-means: distance-based flat clustering
0. Input: D ::= {d1, d2, …, dn}; k ::= the number of clusters;
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids no longer change
7. Output: k clusters of documents
• Advantages:
  • linear time complexity
  • works relatively well in low-dimensional spaces
• Drawbacks:
  • distance computation in high-dimensional spaces is expensive
  • the centroid vector may not summarize the cluster's documents well
  • the initial k clusters affect the quality of the final clusters
Ref:[5][7][8][9][10][15][16][29]
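A bare-bones k-means sketch on document vectors; this is the standard batch variant with Euclidean distance rather than the slide's one-document-at-a-time update, and the toy data are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """X: (n_docs, n_terms) array. Returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned documents.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[3., 0., 1.], [4., 1., 0.], [0., 2., 3.], [1., 3., 2.]])
labels, _ = kmeans(X, k=2)
print(labels)
```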
4.2.2 Document Clustering:HAC
• Hierarchical agglomerative clustering (HAC): distance-based hierarchical clustering
0. Input: D ::= {d1, d2, …, dn};
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until a single cluster (or a specified number of clusters) remains
6. Output: dendrogram of clusters
• Advantages:
  • produces better-quality clusters
  • works relatively well in low-dimensional spaces
• Drawbacks:
  • distance computation in high-dimensional spaces is expensive
  • quadratic time complexity
Ref:[5][7][8][9][10][15][16][29]
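A small single-linkage HAC sketch following these steps (a pure-Python illustration; in practice a library implementation such as scipy's would be used):

```python
import numpy as np

def hac(X, n_clusters=1):
    """Single-linkage agglomerative clustering; returns the final clusters
    (lists of document indices) and the merge history."""
    clusters = [[i] for i in range(len(X))]
    dist = lambda a, b: np.linalg.norm(X[a] - X[b])
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

X = np.array([[3., 0., 1.], [4., 1., 0.], [0., 2., 3.], [1., 3., 2.]])
print(hac(X, n_clusters=2)[0])
```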
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning(1)
• Hypergraph H = (V, E)
V ::= a set of vertices
E ::= a set of hyperedges
(Figure: an example hypergraph with vertices A-G connected by weighted hyperedges, e.g. weights 1, 5/6, 1/3, 1/2, 3/4, 3/2.)
Ref:[30]-[35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning(2)
• Transactional view of documents and features
item ::= document
transaction ::= feature

      Doc1  Doc2  Doc3  …  Docn
w1    5     5     2     …  1
w2    2     4     3     …  5
w3    0     0     0     …  1
…     …     …     …     …  …
wt    6     0     0     …  3

(Transactional database of documents and features: each feature wi is a transaction and the documents are the items.)
Ref:[30]-[35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning(3)
• Clustering
Document-feature transaction database
→ discover association rules (Apriori algorithm)
→ construct the association rule hypergraph
→ partition the hypergraph (hypergraph partitioning algorithm)
→ k partitions
• Hyperedges ::= frequent item sets
• Hyperedge weight ::= average of the confidences of all rules
• Assumption: documents occurring in the same frequent item set are more similar
Ref:[30]-[35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning(4)
• Advantages
  • No calculation of cluster means is required.
  • Linear time complexity.
  • The quality of the clusters is not affected by the space dimensionality.
  • Performs much better than traditional clustering in high-dimensional spaces in terms of cluster quality and runtime.
Ref:[30]-[35]
4.3 Document Clustering:Application
•Summarization of documents
•Navigation of large document collections
•Organization of Web search results
Ref:[10][15]-[17]
5. Product:Intelligent Miner for Text(IMT)(1)
(Figure: overview of the IMT tools.
Text analysis tools: feature extraction - name extraction, term extraction, abbreviation extraction, relationship extraction; categorization; summarization; clustering - hierarchical clustering, binary relational clustering.
Web searching tools: text search engine, NetQuestion Solution, Web crawler.)
Ref:[5][36]
5. Product:Intelligent Miner for Text(IMT)(2)
1.Feature extraction tools
1.1 Information extraction
•Extract linguistic items that represent document contents
1.2 Feature extraction
• Assign different categories to the vocabulary in documents,
•Measure their importance to the document content.
1.3 Name extraction
•Locate names in text,
•Determine what type of entity the name refers to
1.4 Term extraction
• Discover terms in text, e.g. multiword technical terms
•Recognize variants of the same concept
1.5 Abbreviation recognition
• Find abbreviations and match them with their full forms.
1.6 Relation extraction
5. Product:Intelligent Miner for Text(IMT)(3)
Feature extraction Demo.
5. Product:Intelligent Miner for Text(IMT)(4)
2.Clustering tools
2.1 Applications
• Provide an overview of the content of a large document collection
• Identify hidden structures between groups of objects
• Improve the browsing process to find similar or related information
• Find outstanding documents within a collection
2.2 Hierarchical clustering
• Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
2.3 Binary relational clustering
• Relationship of topics.
• document  cluster topic.
NB:preprocessing step for the categorization tool
5.Product:Intelligent Miner for Text(IMT)(5)
•Clustering demo.:navigation of document collection
5. Product:Intelligent Miner for Text(IMT)(6)
3.Summarization tools
3.1 Steps
• Compute the relevancy of each sentence to the document
• Select the most relevant sentences
• Produce a summary of the document with a length set by the user
3.2 Applications
•Judge the relevancy of a full text
Easily determine whether the document is relevant to read.
•Enrich search results
The results of a query to a search engine can be enriched with a short
summary of each document.
• Get a fast overview of a document collection: read the summary instead of the full document
5.Product:Intelligent Miner for Text(IMT)(7)
4.Categorization tool
•Applications
• Organize intranet documents
• Assign documents to folders
• Dispatch requests
• Forward news to subscribers
(Figure: a categorizer assigns incoming news articles to categories such as sports, culture, health, politics, economics and vacations, and forwards them to subscribers, e.g. a user who asks for health news.)
6. Reference (1)
Bibliography
[1] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL’99 the 37th
Annual Meeting of the Association for Computational Linguistics, University of
Maryland, June 20-26, 1999 (invited paper) http://www.sims.berkeley.edu/~hearst
[2] Feldman and Dagan 1995 KDT - knowledge discovery in texts.
In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal.
[3] IJCAI-99 Workshop TEXT MINING: FOUNDATIONS, TECHNIQUES AND APPLICATIONS Stockholm,
Sweden August 2, 1999 http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html
[4] Taeho C. Jo, Text Categorization considering Categorical Weights and Substantial Weights of Informative Keywords,
1999 (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)
[5] A White Paper from IBM Technology: Text Mining: Turning Information Into Knowledge, February 17, 1998,
editor: Daniel Tkach IBM Software Solutions ( http://allen.comm.virginia.edu/jtl5t/whiteweb.html)
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al, Introduction to Modern Information Retrieval, McGraw-Hill Book company, 1983
[8] Michael Steinbach and George Karypis and Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather :
A Cluster-based Approach to Browsing large Document Collections, SIGIR ’92,Pages 318 - 329
[10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents,
KDD ’97, Pages 287 – 290, 1997
6. Reference (2)
Bibliography
[11] Kjersti Aas et al. Text Categorization: Survey, 1999
[12] Text mining White Paper
http://textmining.krdl.org.sg/whiteppr.html
[13] Gartner Group, Text mining White Paper , June 2000
http://www.xanalys.com/intelligence_tools/products/text_mining_text.html
[14] Yiming Y. and Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, In the 14th Int.
Conf. On Machine Learning, PP. 412- 420, 1997
[15] C. J. van Rijsbergen, (1979), Information Retrieval, Buttersworth, London.
[16] Chris Buckley and Alan F. Lewit, Optimizations of inverted vector searches, SIGIR ’85, PP 97 – 110,1985.
[17] Daphe Koller and Mehran Sahami, Hierarchically classifying documents using very few words, proceedings of
the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, PP170 – 178.
[18] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, In Int. Conf.
Machine Learning, 1997.
[19] K. Lang, NewsWeeder: Learning to Filter Netnews, International conference on Machine learning, 1995,
http://anther.learning.cs.cmu.edu/ml95.ps
[20] Hart, S.P., A Probabilistic Approach to Automatic Keyword Indexing, Journal of the American Society for
Information Science, July-August, 1975
6. Reference (3)
Bibliography
[21] Scott D., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science,
Vol. 41,No,6, P.P. 391-407, 1990
[22] S.T. Dumais, Improving the retrieval information from external sources, Behavior Research Methods,
Instruments and Computers, Vol23, No.2,PP.229-236, 1991
[23] T. Yavuz and H. A. Guvenir, Application of k-Nearest Neighbor on Feature Projections Classifier
to Text Categorization, 1998
[24]Eui-Hong H. and George K., Centroid-Based Document Classification: Analysis & Experimental Results,
In European Conference on Principles of Data Mining and Knowledge Discovery(PKDD), 2000
[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995
[26]Ji He,A-Hwee Tan and Chew-Lim Tan, A comparative Study on Chinese Text Categorization Methods,
, PRICAI 2000 Workshop on Text And Web Mining, Melbourne, pp.24-35,August 2000
(http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)
[27] Erik W. and Jan O. Pedersen and Andreas S. Weigend, A Neural Network Approach to Topic Spotting,
In Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993
[28] Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. and Allan, J. in the Proceedings of KDD 2000
Conference, pp. 37-44.
[29] Peter W., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing &
Management Vol.24, No. 5 pp. 577-597, 1988
6. Reference (4)
Bibliography
[30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas. Document clustering and text summarization. Proc.
4th Int. Conf. Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), 41-55. London: The
Practical Application Company. 2000.
[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering Based On Association
Rule Hypergraphs , SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar,
Bamshad Mobasher, and Jerome Moore, Parititioning-Based Clustering for Web Document Categorization,
Decision Support Systems Journal, Vol 27, No. 3, pp 329-341, 1999.
[34] George K. and Rajat A. and Vipin K. and Shashi S., Multilevel Hypergraph Partitioning: Applications in VLSI
Domain, In Proceedings of the Design Automation Conference, 1997.
[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering in a High-Dimensional Space Using Hypergraph Models, Technical report #97-019, http://www-users.cs.umn.edu/~han/.
[36] IBM White paper: Information Mining with the IBM Intelligent Miner Family, Daniel S. Tkach, February, 1998