A journey into Text Analytics
John McConnell
Analytical People
ASC Winchester
7th September 2013
© analytical-people 2013
Contents
• Background & Objectives
• Our current view on Text Analytics
– Value
– Process
• An example application
• Conclusions
Background
• Text Analytics and Text Mining are largely synonymous
• Interest in, and use of, Text Analytics is growing
– Social Media sources are largely responsible for this
– And that often means “Big Data”
• This should lead to further improvements in technology and
methodology which will benefit survey practitioners
Objectives
• We’ve been involved in more Text Analytics work in the last 2
years than in all previous years
• Our objective in this presentation is to share some of our
experience and thoughts around some of the technology we
have used
The Value Propositions
1. Reduce cost (and time)
*http://wp.eaagle.com/
2. Generate actionable insights
– Improve public and commercial processes
Using Text Analytics to find Text Analytics software
http://www.isvworld.com/
3 Software tools
R
• Open Source Statistical Platform
• Command driven
Rapid Miner
• Open Source Data Mining Workbench
• GUI
• Integrates Weka and R (Rapid Miner itself is written in Java)
SPSS Text Analytics for Surveys
• Commercial Text Analytics
• GUI
The Process – Highest Level
Unstructured data → Structured data
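As a rough illustration of what "unstructured to structured" can mean in practice, the base R sketch below turns a few free-text comments into a respondent-by-term count matrix. The comments are invented for illustration and the `tokenise` helper is a hypothetical name, not part of any of the three tools.

```r
# Toy comments, invented for illustration (not taken from the APS survey)
comments <- c("More job fairs please",
              "The job fair was great",
              "Great meetings, great talks")

# Lower-case a comment and split it into alphabetic tokens
tokenise <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}
tokens_by_doc <- lapply(comments, tokenise)

# Structured output: one row per respondent, one column per term,
# each cell holding the number of times the term occurs
terms <- sort(unique(unlist(tokens_by_doc)))
dtm <- t(vapply(tokens_by_doc,
                function(tok) as.integer(table(factor(tok, levels = terms))),
                integer(length(terms))))
dimnames(dtm) <- list(paste0("resp", seq_along(comments)), terms)
print(dtm)
```

Note that "fair" and "fairs" end up as separate columns here; merging such variants is the kind of work the Refine step deals with.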
Process – Level 2
1. Extract
2. Refine
3. Analyse
How can we tell if we are using the right tool(s)?
Extract
• How good is the first extraction?
• How long to get to an acceptable extraction?
Refine
• How easy is it to refine?
• How easy is it to capture refinements so they can be re-used in
future?
Analyse
• What tools exist to support the Text Analytics process?
• What tools exist to use the Structured Text in other analyses?
How well do the tools/methods deliver on the value propositions?
Algorithms and Dictionaries
1. Extract
Algorithms
• e.g. Natural Language Processing (NLP)
Dictionaries
• Variously called Lexicons, Resources, Libraries, etc.
• Are usually contextual, e.g. Customer Satisfaction
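A dictionary can be thought of as a lookup from terms (including misspellings) to concepts. The base R sketch below is one minimal way to express that idea; the concept names and term lists are hypothetical and far smaller than any real library.

```r
# Hypothetical mini-dictionary: concept -> terms that should map to it
dictionary <- list(
  positive = c("excellent", "great", "awesome", "excelent"),
  careers  = c("job", "jobs", "career", "careers")
)

# Invert the dictionary into a term -> concept lookup vector
lookup <- setNames(rep(names(dictionary), lengths(dictionary)),
                   unlist(dictionary, use.names = FALSE))

# Map extracted tokens to concepts; tokens not in the dictionary become NA
tokens <- c("excelent", "job", "fair", "great")
concepts <- unname(lookup[tokens])
print(concepts)
```

Tokens that come back as NA ("fair" here) are candidates for extending the dictionary to fit the study's context.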
Example Data
• The American Physical Society (APS)
• Student Survey Comments from 2009 (Base=1304)
• Q4.2 Comments about the best features of, and what could be
added or improved to, the special programs for Student
Members*
*http://www.aps.org/about/governance/committees/commemb/upload/2009-student-comments.pdf
The first extraction with R
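library("tm", lib.loc="C:/Users/jmcconnell/Documents/R/winlibrary/3.0")
# Read the survey verbatims, keeping the text as character rather than factor
APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE, stringsAsFactors = FALSE)
# Assumption: the comments are in the first column of the file.
# Passing the whole data frame to VectorSource() would treat each column as one document.
text_corpus <- Corpus(VectorSource(APS2009df[[1]]), readerControl = list(language = "en"))
summary(text_corpus) # check what went in
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, stripWhitespace)
# Recent versions of tm need base functions wrapped in content_transformer()
text_corpus <- tm_map(text_corpus, content_transformer(tolower))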
We apply a basic set of text handling methods (simple NLP), e.g. removePunctuation.
We also apply a small dictionary of known “Stopwords” (not shown).
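The stop-word step and the term count behind a "Top 20" chart are not shown on the slide. The sketch below illustrates the idea in base R, with invented comments and a tiny hand-made stop-word list. It is not the code used for the talk; in tm the stop-word step would more typically be done with `removeWords` and `stopwords("english")`.

```r
# Toy comments and a tiny hand-made stop-word list (both invented)
comments <- c("the job fair was the best part",
              "more talks on the job market",
              "the talks were great")
stop_words <- c("the", "was", "were", "on", "more")

# Tokenise, drop stop words, then count occurrences of each remaining term
tokens <- unlist(strsplit(tolower(comments), "[^a-z]+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
term_counts <- sort(table(tokens), decreasing = TRUE)

# Keep at most the 20 most frequent terms
top_terms <- head(term_counts, 20)
print(top_terms)
```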
R Extraction Results – Top 20 Terms
The first extraction with Rapid Miner
We visually construct a similar set of steps
Rapid Miner Extraction Results – Top 20 Terms
Improving and creating new data
2. Refine
Improve the extraction
• Correct mistakes
• Add omissions
Map the extraction to structured data
• Group and combine meaningful terms that
will become data for further analysis
In second and subsequent waves (where applicable) Refine should be a
shorter step where we look for new concepts
Rapid Miner - Refine
We add one process step to fix up some of the issues in the first extraction
Filter Tokens sets a lower limit for the length of an extracted term/attribute
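The same kind of length filter can be sketched in base R for comparison. The tokens and the `min_chars` threshold below are assumptions chosen for illustration, not the settings used in the Rapid Miner process.

```r
# Toy tokens from a first extraction (invented)
tokens <- c("i", "a", "job", "fair", "to", "aps", "meetings", "of")

# Hypothetical threshold: drop terms shorter than min_chars characters
min_chars <- 3
kept <- tokens[nchar(tokens) >= min_chars]
print(kept)
```

A length filter is blunt: it removes noise such as single letters, but it would also drop short meaningful terms, so the threshold needs checking against the data.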
Rapid Miner results after first refinement
The first extraction with SPSS
SPSS uses a wizard to specify the extraction steps
SPSS Extraction Results – Top 20 Terms
SPSS is counting respondents, not occurrences
Synonyms are used from the dictionaries
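The difference between counting respondents and counting occurrences matters when comparing Top 20 lists across tools. A minimal base R sketch with invented tokens shows how the two counts diverge:

```r
# Toy tokens, one vector per respondent (invented)
tokens_by_resp <- list(
  c("great", "talks", "great", "venue"),
  c("great", "job", "fair"),
  c("job", "fair", "job", "listings")
)

# Occurrences: every mention counts
occurrences <- table(unlist(tokens_by_resp))

# Respondents: a term counts at most once per respondent
respondents <- table(unlist(lapply(tokens_by_resp, unique)))

# "great" and "job" each occur 3 times but are mentioned by only 2 respondents
print(occurrences[c("great", "job")])
print(respondents[c("great", "job")])
```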
Synonyms for “Excellent”
10 stars, 10/10, 100 % correct, 100% accurate, 100% correct, 100% grade a, 5 star, 5 stars, 5-star, ^ best $, ^ great $, a must, a
nice plus, a plus, a+, a++, aagood, above and beyond, above excellence, absolute life saver, absolute word class, acceptional,
admirable, all was well, allright, alright, always a please, amazing, among the best, among the very best, appreciable,
appreciative, award winning, awesome, awesopme, awsome, beenfantastic, best asset, best of all, best possible, beyond
expectation, beyond expectations, big asset, big beast, big hit, big hits, big kudos to, big plus, blow all others away, blows all
others away, blows the doors off, brilliant $, can not be beat, can't be beat, can't beat, cannot be beat, capable, capible, class
service, compliment, compliment one another well, congrats, congratulations, copious, cutting edge, cutting-edge, dandy, delight,
deluxe, deserves a raise, deserves credit, does that well, doing her best, doing his best, doing their best, done very well,
dynamite, exccellent, excelent, excellant, excellence, excellet, excelllent, excepional, exceptional, exceptionl, execellent, exelant,
exelent, exellant, exellecent, exellent, exlt, expectional, exquis, exquise, exquises, exquisite, exquisitely $, extraordinary,
extrodinary, fabulous, fairly well, fanatstic, fantabulous, fantasic, fantastic, fantatic, finest, first class, first-class, first-rate, five stars,
formidable, frantastic, given me the most, godsend, goes over well, goodd, gooood, graet, grat, grea, greaat, great pleasure,
greate, greatest, greeeeeeeaaattttt, gret, greta, hats down, hats off, head and shoulders better, heavenly, high hats off, ideal,
impecable, impeccable, impress, impresses me most, impressive, in an orderly fashion, incomparable, incredibe, incredible,
increible, indisputable, ingenious, inpecable, invaluable, is still the best, it was a pleasure, knock socks off, knock spots off, kudos,
kudos to, laudable, lifesaver, made an impression, made the difference, magnificent, marvellous, marvelous, my compliments to,
nicest, number 1, number one, oustanding, out of the woods, out of the world, out of this world, outperform, outperforming,
outsanding, outstanding, peachy, perfect, perfection, perfectly done, phenomenal, phenominal, pleasure of working with, prettier,
pretty good, quintessential, reach a ten, real good, real nice, remarkable, right direction, rock $, rocked my world, second to none,
sensational, smashing, spectacular, spendid, splendid, stand head & shoulders above, stand head and shoulders above, standing
head & shoulders above, standing head and shoulders above, stands head & shoulders above, stands head and shoulders
above, stood head & shoulders above, stood head and shoulders above, strong positive, superb, supurb, surpassed my
expectations, surreal, sweetheart, ten stars, terric, terrific, terrifig, the best, the best one so far, the best thing, the highlight of, the
only one that works, thebest, think highly, think very highly, to die for, top notch, top quality, top ranked, top-flight, top-notch, top-of-the-line, top-ranked, top-ranking, topflight, topnotch, topranked, topranking, tremedous, tremendous, tried and proven,
trmendous, turn out good, two thumbs up, unbeatable, unmatched, unmnatched, unparalleled, unquestionable, unquestionnable,
unsurpassed, up 2 standard, up 2 standards, up 2 usual standards, up to standard, up to standards, up to usual standards, up to
your usual standards, up-beat, upbeat, utmost, v-good, well done, went above and beyond my expectations, woderful, womderful,
wondeful, wonderful, wonderfull, wonedeful, wonederful, would be the smartest, wounderful
Adding Wordnet to our R (/RapidMiner) analysis
library("wordnet")
setDict("C:/Wordnet/WordNet-3.0/dict")
synonyms("excellent", "ADJECTIVE")
[1] "excellent" "fantabulous" "first-class" "splendid"
Analytics to aid refinement
Job … Fair
Students are asking for more “stuff” at the job fair
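One simple analytic that can surface a pairing like "job fair" is a count of adjacent word pairs (bigrams). The sketch below is a base R illustration on invented comments; the `bigrams` helper is a hypothetical name and this is not necessarily the method used to produce the slide.

```r
# Toy comments (invented)
comments <- c("more companies at the job fair",
              "the job fair needs more recruiters",
              "a bigger job fair would help")

# Return the adjacent word pairs within one comment
bigrams <- function(x) {
  tok <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tok <- tok[nzchar(tok)]
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1))
}

# Count bigrams across all comments, most frequent first
bigram_counts <- sort(table(unlist(lapply(comments, bigrams))), decreasing = TRUE)
print(head(bigram_counts, 5))
```

Pairs that occur far more often than the rest are candidates for being combined into a single concept during Refine.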
R Extraction Results – Top 20 Terms
Onward to analysis
Key Drivers of Recommendation* (bar chart)
• Teaching quality: 40%
• Support Services: 30%
• Accommodation: 20%
• Job Fair - Would like more: 10%
*This is an anonymised example
Onward to analysis
3. Analyse
R
• In R we are in a statistical platform already
• Text Analytics outputs are part of the data in the current
“Workspace”
• For Research style charts and tables we may need to export data
Rapid Miner
• In RM we are in a Data Mining platform already
• Text Analytics is part of the current process flow
SPSS Text Analytics for Surveys
• Data needs to be exported elsewhere for Analysis
• To SPSS .sav, Excel or Data Collection
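Where structured text has to leave R for charting or tabulation elsewhere, a plain CSV export is often enough. The sketch below assumes a toy data frame of 0/1 concept flags per respondent; the column names and the file name are hypothetical.

```r
# Toy structured output: one row per respondent, one 0/1 flag per concept
structured <- data.frame(
  resp_id       = 1:3,
  job_fair_more = c(1L, 0L, 1L),
  positive      = c(0L, 1L, 1L)
)

# Write to a temporary location (hypothetical file name) and read it back
out_file <- file.path(tempdir(), "structured_text.csv")
write.csv(structured, out_file, row.names = FALSE)
round_trip <- read.csv(out_file)
print(round_trip)
```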
A High Level Comparison
Help & Support
• R: Lots of user-generated content
• Rapid Miner: Lots of user-generated content; paid support option
• SPSS TAfS: Paid support
Usability
• R: Low-level coding control
• Rapid Miner: Visual programming
• SPSS TAfS: Visual UI
Scalability
• R: R in itself isn’t too scalable, but many scalable implementations exist, e.g. Revolution, Hadoop
• Rapid Miner: Radoop
• SPSS TAfS: We experienced issues with data sets around 100,000 cases*
Extensibility
• R: Various options
• Rapid Miner: Various options
• SPSS TAfS: None
Automation
• R: Can be run in batch
• Rapid Miner: Can be run in batch
• SPSS TAfS: None
Overall
• R: Great for the coder and those familiar with R
• Rapid Miner: The power of R with a GUI
• SPSS TAfS: The most graphical, and tuned for generic survey types, e.g. Opinions
*IBM/SPSS have a Text Analytics option for Data Mining which may be more scalable – we haven’t tested yet
Our current conclusions
• Dictionaries help in the initial extraction
– But it is almost inevitable you will want to extend them to get to the
specificity of the study. If the study domain is very specific you can build
your own dictionaries in all 3 tools. A lot of social media monitoring
starts with libraries of regular expressions built from the ground up.
• Open Source tools like R and Rapid Miner will continue to
improve with “packages” added by the R community
• There is no “silver bullet”. The Refine step will typically require
a lot of manual input
– Especially in the initial “build” phase
– More is required on larger surveys
• But the ROI – in time and/or cost - should be clear
– And the results more robust and reliable
Thank-you & Questions