Transcript Wrap-up
CSC 594 Topics in AI –
Text Mining and Analytics
Fall 2015/16
11. Wrap Up
Text Mining
• [from Wikipedia]
- “Text mining refers to the process of deriving
high-quality information from text.”
- “The overarching goal is, essentially, to turn
text into data for analysis, via application of natural
language processing (NLP) and analytical methods.”
Text Mining is Growing (1)
“North America text analytics market is expected to
reach a value of $1,995.8 million by 2019 according to
new research report”
(Information Communications Media Technology Market News, October 15, 2015)
This market is estimated to grow from $827.1 million in 2014 to $1,995.8
million by 2019, at a Compound Annual Growth Rate (CAGR) of 19.3%
from 2014 to 2019.
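As a quick sanity check on the quoted figures, the growth rate can be recomputed from the two market sizes. This is only a sketch; the function name `cagr` is made up for illustration, and the inputs are the numbers quoted above (in millions of dollars).

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.193 means 19.3%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted on this slide, in millions of US dollars.
rate = cagr(827.1, 1995.8, 2019 - 2014)
print(f"CAGR: {rate:.1%}")  # prints "CAGR: 19.3%"
```

The result agrees with the 19.3% stated in the report.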
Text Mining is Growing (2)
“Discover the text analytics market -- ”
(Information Communications Media Technology Market News, November 4, 2015)
•Factors which are driving the growth of global text analytics
service market are growing demand of social media analysis for
effective brand building, development of multilingual text analytics to
overcome language barriers, increasing concern of financial frauds and
growing big data market.
•On the other hand, factors which are restraining the growth of
global text analytics market are lack of awareness among end users
about software handling, high deployment cost and compliance issue
with present IT infrastructure.
•However, added advantage of predictive analytics and credibility to
analyse big data is expected to create great opportunity for text
analytics market in future.
Text Mining is Hard…(?)
• Data Collection:
– Raw texts are ‘dirty’ – markup tags, nonsense words/symbols, irregular
punctuation, misspellings, etc.
– Collected data can become very large.
• Text (Pre-)Processing:
– There are many ‘options’:
• Segmentation (text unit) – whole document vs. paragraph vs. sentence vs.
n-word context window, or specific patterns (e.g. <Adj><Noun>).
• Tokenization – stemming/lemmatization, case normalization, removing
punctuation.
• Term – removing stop words, defining a ‘keep’ list, POS tags, synonyms.
• Transformation – various term weighting schemes, dimensionality reduction
(by top N terms, PCA, model parameter coefficients, etc.).
– We don’t know how each option affects the result until we generate the
result, so iterative experiments are needed (i.e., a feedback loop).
• Mining/Analysis Step:
– All of machine learning and data mining comes only after structured data is
obtained.
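To make the ‘options’ concrete, here is a minimal sketch of a pre-processing function in which each choice is a switch. The function name `preprocess`, its parameters, and the tiny stop-word list are all assumptions for illustration; a real project would likely use a fuller stop list and a proper stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def preprocess(text, strip_markup=True, lowercase=True, remove_stop_words=True):
    """Turn raw text into a list of word tokens; each flag is one 'option'."""
    if strip_markup:
        # Replace anything that looks like a markup tag with a space.
        text = re.sub(r"<[^>]+>", " ", text)
    if lowercase:
        text = text.lower()
    # Keeping only alphabetic runs also drops punctuation and digits.
    tokens = re.findall(r"[A-Za-z]+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


raw = "<p>The movie is GREAT, and the acting is great!!</p>"
print(preprocess(raw))
print(preprocess(raw, lowercase=False, remove_stop_words=False))
```

Flipping a single flag changes the tokens that reach the mining step, which is why the effect of each option can only be judged by rerunning the experiment.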
Survey
In your midterm project, did you do any of the following?
– Stemming
– Case normalization
– Removing punctuation
– Removing stop words
– POS-tagging
– Synonym creation
– Term weighting schemes
– Dimensionality reduction
Word Frequency
• The most naïve form of text mining is to look at word frequency.
• But surprisingly, word frequency provides a lot of useful information
(when the data size is large)…
– A good article, “Where to start with text mining”
(http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/)
– Google Ngram Viewer (https://books.google.com/ngrams/)
– Word Cloud
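Counting words takes only a few lines. The sketch below uses Python's standard `collections.Counter`; the function name `word_frequencies` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


sample = "The cat sat on the mat. The mat was warm."
freq = word_frequencies(sample)
for word, count in freq.most_common(2):
    print(word, count)
```

On a corpus this small the counts say little; the point of the slide is that the same counting becomes informative once the data is large.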
Word Association, Concept Linking
• Slightly more sophisticated analysis
– But still based on frequency.
– Two words/concepts occurring TOGETHER more often than chance would predict.
– Typically PMI (pointwise mutual information) or a likelihood-based measure is
used to measure the strength of the co-occurrence.
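As a rough illustration of PMI, the sketch below estimates probabilities from document-level co-occurrence, using PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). Treating the whole document as the co-occurrence window is an assumption made here for simplicity; the function name `pmi_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """PMI for every word pair that co-occurs in at least one document.

    documents: list of token lists. Probabilities are estimated as the
    fraction of documents that contain the word (or both words of a pair).
    """
    n_docs = len(documents)
    word_count = Counter()
    pair_count = Counter()
    for doc in documents:
        words = sorted(set(doc))  # sorted so each pair has one canonical order
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    scores = {}
    for (x, y), n_xy in pair_count.items():
        p_xy = n_xy / n_docs
        p_x = word_count[x] / n_docs
        p_y = word_count[y] / n_docs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["new", "car"],
    ["city", "car"],
]
scores = pmi_scores(docs)
print(scores[("new", "york")])  # positive: together more often than chance
print(scores[("car", "new")])   # negative: together less often than chance
```

A positive score means the pair co-occurs more often than independence would predict, and a negative score means less often.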
Clustering, Topic Extraction
• Discover the overall grouping of the corpus
– Clustering – a document is assigned to exactly one cluster.
– Topic – a document could be assigned to multiple clusters/topics.
• Cluster/topic definitions through terms/words
– Look at cluster centroids or term-cluster relevancy scores.
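The difference between the two views can be sketched with a single document's relevance scores. All names and numbers below are hypothetical, and the 0.3 threshold is an arbitrary choice for illustration, not a recommended value.

```python
def hard_assignment(scores):
    """Clustering view: the document goes to the single best-scoring group."""
    return max(scores, key=scores.get)


def soft_assignment(scores, threshold=0.3):
    """Topic view: keep every group whose score reaches the threshold."""
    return sorted(group for group, s in scores.items() if s >= threshold)


def top_terms(centroid, n=3):
    """Describe a cluster/topic by its highest-weighted terms."""
    return sorted(centroid, key=centroid.get, reverse=True)[:n]


# Made-up relevance scores of one document against three groups.
doc_scores = {"sports": 0.55, "politics": 0.40, "cooking": 0.05}
# Made-up term weights for the "sports" centroid.
sports_centroid = {"game": 0.9, "team": 0.7, "season": 0.6, "the": 0.1}

print(hard_assignment(doc_scores))    # sports
print(soft_assignment(doc_scores))    # ['politics', 'sports']
print(top_terms(sports_centroid, 2))  # ['game', 'team']
```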
Text Categorization
• Build a classification/prediction model for texts
– Goal 1: An optimal classifier (for the purpose of classification/prediction)
– Goal 2: Learn the domain of the texts (e.g. important features for each
target category such as POS/NEG reviews).
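Both goals can be illustrated with one small model. The sketch below uses multinomial Naive Bayes with add-one smoothing, chosen here only because it is compact; it is an assumption, not necessarily the classifier used in the course. The function names and the four toy reviews are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs.

    Returns (log_prior, log_likelihood), with add-one (Laplace) smoothing.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Goal 1: predict the most probable class (unseen words are ignored)."""
    def score(c):
        known = (w for w in tokens if w in log_likelihood[c])
        return log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(log_prior, key=score)


def informative_terms(log_likelihood, target, other, n=3):
    """Goal 2: terms whose log-probability ratio most favors `target`."""
    ll_t, ll_o = log_likelihood[target], log_likelihood[other]
    return sorted(ll_t, key=lambda w: ll_t[w] - ll_o[w], reverse=True)[:n]


train = [
    ("great fun great cast".split(), "POS"),
    ("great story".split(), "POS"),
    ("boring plot".split(), "NEG"),
    ("boring and slow".split(), "NEG"),
]
log_prior, log_likelihood = train_naive_bayes(train)
print(classify("great cast".split(), log_prior, log_likelihood))
print(informative_terms(log_likelihood, "POS", "NEG", n=1))
```

The same fitted parameters serve both purposes: `classify` uses them for prediction, while `informative_terms` reads them to see which words characterize each category.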