Text Analytics Using JMP®
Melvin Alexander - Social Security Administration
JMP 12 Roadshow
Rockville MD
June 4, 2015
Disclaimer
• The views expressed in this presentation are
those of the presenters and do not necessarily
represent the views of the Social Security
Administration (SSA) or SAS Institute, Inc.
Agenda
• Purpose: help MAJUG leadership plan better
meetings from feedback comments using the text
analytic tools of JMP®
• Review MAJUG meeting data and text-analytic
methods
• Demonstrate text mining techniques with JMP JSL
and other analytical tools and utilities (e.g., JMP®
free-text analyses from the Analyze > Consumer
Research > Categorical platform, JSL commands,
the SVD matrix function, Analyze platforms, etc.)
• Summary and Conclusions
• Q&A
Text Mining/Text Analytics
• Text mining: “refers to the process of deriving high-quality
information from text…through the devising
of patterns and trends through means such
as statistical pattern learning.”
• Text analytics: “a set of linguistic, statistical,
and machine learning techniques that model and
structure the information content of textual sources
for business intelligence, exploratory data
analysis, research, or investigation”
Source: http://en.wikipedia.org/wiki/Text_mining#Text_mining_and_text_analytics (accessed 03/31/2015).
Text Mining Flow
Define Problem Statement
• Determine clear study objectives and end-state
• Identify relevant data sources to answer research questions
Get and Extract Data
• Scrape internet with web crawling and social media tools
• Extract text from disparate file types (pptx, doc, txt, pdf, html)
• Strip off code, figures, extraneous characters
Parse and Filter Text
• Clean manually with character functions, queries, filters, R&R
• Remove punctuation, numbers, stop words
• Stem and tokenize text, change to lowercase, identify multiwords
Transform Text
• Create document term matrix
• Weight matrix based on analysis objectives
• Use Singular Value Decomposition to get structured data
Structure and Explore Text
• Discover topics and common themes
• Group like documents and words
• Subset documents and link concepts
Visualize and Analyze Text
• Combine with structured data
• Visualize exploitable patterns
• Understand sentiments and trends
Source: WH Rushing, J Wisnowski, “Harness the Power of Text Mining: Analyse FDA Recalls and Inspection Observations”, Discovery Summit –
Europe: Brussels, March 24, 2015, https://community.jmp.com/docs/DOC-7204 (accessed 03/19/2015).
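The "Parse and Filter Text" stage of the flow can be sketched in JSL roughly as follows. This is a minimal illustration, not the presenter's script: the comment strings, the stop-word list, and the variable names are all hypothetical, and stemming and multiword detection are left out.

```jsl
// Hypothetical feedback comments and stop words, for illustration only (not MAJUG data)
comments = {"Start the meeting at 10.", "Charge a fee; start later!"};
stopWords = {"the", "at", "a"};

// Term counts keyed by word; missing keys read as 0
counts = Associative Array();
counts << Set Default Value( 0 );

For( d = 1, d <= N Items( comments ), d++,
	// Lowercase, then split on spaces and common punctuation
	tokens = Words( Lowercase( comments[d] ), " .,;:!?" );
	For( i = 1, i <= N Items( tokens ), i++,
		// Keep only tokens that are not in the stop-word list
		If( !Contains( stopWords, tokens[i] ),
			counts[tokens[i]] = counts[tokens[i]] + 1
		)
	);
);
Show( counts );
```

Numeric tokens such as "10" are deliberately kept here, since the deck later recodes them into meaningful terms rather than discarding them.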
MAJUG meetings are held three or four times per year and are posted on the www.majug.com website.
MAJUG Meeting Evaluation Form
Sample Data Table of Respondents’ Feedback from MAJUG Meeting Evaluations
MAJUG Meeting Evaluation Comments about Suggested Improvements
Select Where Clause using a stopwords list from the Term Frequency Vector (TFV)
Results after Stop words were removed from the TFV
Recode Word Column to change values of “$5” to “Charge-$5-fee” and “10” to “Start-at-10”
Create a single string of comma-separated words from the TFV. This string is then copied and pasted into the
“Text Column” role of Julian Parris’ “Word Counts for k words as columns” JSL script
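Building the comma-separated string could be scripted along these lines. This is a sketch under assumptions: the table name, the column names "Word" and "Count", and the example terms are hypothetical stand-ins for the TFV data table shown on the slide.

```jsl
// Hypothetical stand-in for the Term Frequency Vector (TFV) table
dt = New Table( "TFV example",
	New Column( "Word", Character, Values( {"speakers", "location", "Start-at-10"} ) ),
	New Column( "Count", Numeric, Values( [4, 3, 2] ) )
);

// A character column returns its values as a list of strings
words = Column( dt, "Word" ) << Get Values;

// Join the list into one comma-separated string
termString = Concat Items( words, "," );
Show( termString );
```

The resulting string can be copied from the log and pasted wherever a delimited term list is expected.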
Julian Parris’ “Word Counts to Columns” JSL script with pasted Terms Frequency Data Table Subset formed from the
original MAJUG Meeting Evaluation Comments about Suggested Improvements
Terms Frequency Data Table Subset formed from the original MAJUG Meeting Evaluation Comments about
Suggested Improvements
JSL script that created the Term-Document Matrix (TDM) and Document-Term Matrix (DTM)
A = Data Table( "Terms Matrix" );
DTM = A << Get As Matrix;
/* B transposes DTM to form the Term-Document Matrix (TDM) */
B = DTM`;
/* Log output for B:
[1 0 0 2 0 0 0 0 0 1 0, 0 0 0 0 0 0 0 0 2 1 0,
0 0 1 2 0 0 0 0 0 0 0, 2 0 0 1 0 0 0 0 0 0 0,
0 0 2 1 0 0 0 0 0 0 0, 0 0 1 0 1 0 0 0 0 1 0,
0 0 1 1 0 0 0 0 1 0 0,
etc.
0 0 0 1 0 0 0 0 0 0 0, 1 0 0 0 0 0 0 0 0 0 0,
1 0 0 0 0 0 0 0 0 0 0, 0 0 0 1 0 0 0 0 0 0 0,
0 0 0 1 0 0 0 0 0 0 0, 0 0 0 1 0 0 0 0 0 0 0,
0 1 0 0 0 0 0 0 0 0 0, 1 0 0 0 0 0 0 0 0 0 0]
*/
JSL scripts to compute Log Term Frequency Weights replacing raw frequencies with their logs
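// Raw term frequencies (TDM)
B = DTM`;
// Log base 10 weights: 1 + log10(count) where count > 0, otherwise 0
D = J( N Row( B ), N Col( B ), 0 );
For( i = 1, i <= N Row( B ), i++,
	For( j = 1, j <= N Col( B ), j++,
		If( B[i, j] > 0,
			D[i, j] = 1 + Log10( B[i, j] ),
			D[i, j] = 0
		)
	)
);
Show( D );
// Log base 2 weights: log2(count + 1)
C = J( N Row( B ), N Col( B ), 0 );
For( i = 1, i <= N Row( B ), i++,
	For( j = 1, j <= N Col( B ), j++,
		C[i, j] = Log( B[i, j] + 1, 2 )
	)
);
Show( C );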
B=
[1 0 0 2 0 0 0 0 0 1 0,
0 0 0 0 0 0 0 0 2 1 0,
0 0 1 2 0 0 0 0 0 0 0,
2 0 0 1 0 0 0 0 0 0 0,
… etc. …
1 0 0 0 0 0 0 0 0 0 0];
D=
[1 0 0 1.30102999566398 0 0 0 0 0 1 0,
0 0 0 0 0 0 0 0 1.30102999566398 1 0,
0 0 1 1.30102999566398 0 0 0 0 0 0 0,
1.30102999566398 0 0 1 0 0 0 0 0 0 0,
… etc. …
1 0 0 0 0 0 0 0 0 0 0];
C=
[1 0 0 1.58496250072116 0 0 0 0 0 1 0,
0 0 0 0 0 0 0 0 1.58496250072116 1 0,
0 0 1 1.58496250072116 0 0 0 0 0 0 0,
1.58496250072116 0 0 1 0 0 0 0 0 0 0,
… etc. …
1 0 0 0 0 0 0 0 0 0 0];
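The two weightings computed by the scripts can be written out as formulas, with a worked check against the values printed in D and C. The symbols tf and w are notation introduced here for illustration; tf stands for a raw count in B.

```latex
% Notation (tf_{ij}, w) is introduced here for illustration; tf_{ij} is the raw count B[i,j]
\[
w^{(10)}_{ij} =
\begin{cases}
1 + \log_{10} tf_{ij}, & tf_{ij} > 0 \\
0, & tf_{ij} = 0
\end{cases}
\qquad
w^{(2)}_{ij} = \log_{2}\left(tf_{ij} + 1\right)
\]
% Worked values for the counts that appear in B
\[
tf_{ij} = 2:\quad 1 + \log_{10} 2 \approx 1.30103, \qquad \log_{2} 3 \approx 1.58496
\]
\[
tf_{ij} = 1:\quad 1 + \log_{10} 1 = 1, \qquad \log_{2} 2 = 1
\]
```

Both schemes leave zero counts at 0 and counts of 1 at 1, so they differ only in how strongly repeated terms are damped.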
SVD Formula Definition
TDM[t x d] = U[t x r] D[r x r] V^T[r x d]
• TDM: t x d (t terms, d documents); sparse term-document matrix
• U: t x r (t terms, r concepts); left-singular {LS}, rank-reduced eigenvector term matrix
• D: r x r (r = rank of matrix; strength of each ‘concept’); diagonal {D} matrix of singular values
• V^T: r x d (r concepts, d documents); right-singular {RS}, rank-reduced eigenvector document matrix
JSL script snippet that created the Singular Value Matrices (LS, RS) and Singular Values (D) from the SVD function
{LS, D, RS} = SVD( B ); /* singular value decomposition: B = LS * Diag( D ) * RS` */
/* Log output:
{[0.378096662351085 0.131824388966428 0.0211184899994071 0.10589219870066
0.180893870586793 -0.249947997346168 0 0 -0.0598591783533297
0.132522620357014 0,
0.0331182960691775 -0.0292243730113285 0.337183161237851 0.583127882442963
0.125047226003007 0.154020292284754 0 0 0.16821759320533
-0.185471956068679 0,
0.350648279538604 0.163307416493917 -0.00534077688514531 -0.0449366248671385
0.190147431306457 -0.0225114783847062 0 0 0.142431314885967
0.68019673189248 0,
0.2492189633787 0.419512224556798 0.0192515782022323 0.0343837845368551
-0.0279803345343858 0.0579557417687915 0 0
etc.
0 0 0 0 0 0 0 0 0 -1 0, 0 0 0 0 0 0 0 0 1 0 0, 0 0 0 0 0 0 0 1 0 0 0,
0.0511449194105313 -0.0617494272704414 0.375812713862112
0.725075996234368 0.0094742784022233 0.571411451830774 0 0 0 0 0,
0.0931094176656951 0.00702825750170741 0.404132013103739
0.379678618426814 0.32464030058507 -0.76053361389797 0 0 0 0 0,
0 0 0 0 0 0 0 0 0 0 1]}
*/
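As a quick sanity check on the decomposition, the three pieces returned by SVD can be multiplied back together and compared with the input. The sketch below uses a small hypothetical 3 x 3 matrix rather than the MAJUG term-document matrix; the name Bhat is introduced only for this illustration.

```jsl
// Hypothetical 3 x 3 term-document counts (not the MAJUG matrix)
B = [1 0 2, 0 1 0, 2 0 1];

// SVD returns the singular values D as a vector
{LS, D, RS} = SVD( B );

// Diag() rebuilds the diagonal matrix before multiplying back
Bhat = LS * Diag( D ) * RS`;

// The largest absolute reconstruction error should be near zero
Show( D, Max( Abs( B - Bhat ) ) );
```

Because D comes back as a vector, writing LS * D * RS` directly would not reproduce B; the Diag() call is what makes the product conform.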
JSL scripts to create Principal Components and Left Singular Value (SVD) Data Tables
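The scripts themselves are not reproduced in this transcript. One plausible way to turn the left singular vectors into a data table is sketched below; it is an assumption for illustration, not the presenter's code, and the input matrix, table name, and column names are hypothetical.

```jsl
// Hypothetical 3 x 3 term-document counts (not the MAJUG matrix)
B = [1 0 2, 0 1 0, 2 0 1];
{LS, D, RS} = SVD( B );

// One row per term, one column per left singular vector
dtLS = As Table( LS );
dtLS << Set Name( "Left Singular Vectors" );

// Rename the first two columns for plotting SVD2 by SVD1
Column( dtLS, 1 ) << Set Name( "SVD1" );
Column( dtLS, 2 ) << Set Name( "SVD2" );
```

With the vectors in a table, the first two columns can be plotted against each other in Graph Builder or any other platform.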
Bi-plots of Principal Components (PC2 by PC1) and SVDs (SVD2 by SVD1)
[Figure: two bi-plot panels with three themes marked, labeled 1, 2, and 3]
Conclusions
• JMP’s free-text tools captured the essence of what
MAJUG meeting participants meant in their comments
in a more analytical way.
• The principal components and SVDs provided inputs
for estimating probability models, enabling further
exploration.
• Employing these JMP tools will eventually lead to
greater satisfaction and added value for MAJUG
attendees at future meetings.
• That’s a worthy goal that all users group leaders want
to achieve.
References
1. Albright, R (2004), “Taming Text with the SVD”, Cary, NC: SAS Institute, Inc.,
ftp://ftp.dataflux.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf (accessed 03/06/2015).
2. Alexander, M and Klick, J (2014), “Text Mining Feedback Comments from JMP® Users Group
Meeting Participants”, https://community.jmp.com/docs/DOC-6748 (accessed 02/13/2015).
3. Bogard, M (2012), “An Intuitive Approach to Text Mining with SAS IML”,
http://econometricsense.blogspot.com/2012/05/intuitive-approach-to-text-mining-vis.html (accessed 02/13/2015).
4. Hastie, T, Tibshirani, R, and Friedman, J (2009), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd ed. New York: Springer-Verlag.
5. Karl, A, and Rushing, H (2013), “Text Mining with JMP and R”,
http://www.jmp.com/about/events/summit2013/resources/Paper_Karl_Rushing.pdf (accessed 02/26/2015).
6. McNeill, F (2014), “The Text Frontier – SAS Blog”, http://blogs.sas.com/content/text-mining/
(accessed 02/26/2015).
7. Mroz, P (2014), “Word Cloud in Graph Builder?”, https://community.jmp.com/thread/58441
(accessed 03/24/2015).
8. Rushing, H and Wisnowski, J (2015), “Harness the Power of Text Mining: Analyse FDA Recalls and
Inspection Observations”, https://community.jmp.com/docs/DOC-7204 (accessed 03/19/2015).
9. Parris, J (2014), “Word Counts to Columns”, https://community.jmp.com/docs/DOC-7056
(accessed 02/13/2015).
10. Porter, MF (2006), “The Porter Stemming Algorithm”, http://tartarus.org/martin/PorterStemmer/
(accessed 02/26/2015).
11. Wicklin, R (2015), “Compute the rank of a matrix in SAS”,
http://blogs.sas.com/content/iml/2015/04/08/rank-of-matrix.html (accessed 04/08/2015).
12. Sall, J (2015), “Wide data discriminant analysis”,
http://blogs.sas.com/content/jmp/2015/05/11/wide-data-discriminant-analysis/ (accessed 05/11/2015).
Acknowledgements
(Thanks to the following for their help with this presentation)
Josh Klick
SAS Institute, Inc:
Robin Moran
Gail Massari
Tom Donnelly
John Sall & JMP’s Development/Support Team
Lucia Ward-Alexander
Questions?
Contact:
[email protected]
JMP, SAS, and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.