knime - Meetup

Network Analytics meets Text Mining for Social Media Analysis
Dr. Rosaria Silipo
Social Media Data
Water Water Everywhere, and not a drop to drink
What companies do with it:
• Download and keep
• Topic [Shift] Detection (email content routing, detect market interest shift, clinical studies, query non-structured DBs, ...)
• Sentiment Analysis (marketing, polls, elections, ...)
• Connection Analysis (influencers, risk analysis, ...)
• ...
The Analysis Tools:
• Web Crawlers
• Visual Exploration
• Topic Detection (Text Mining, NLP, Ontologies)
• Sentiment Score (Text Mining, NLP)
• Influence Score (Network Analytics)
• Find Groups (Predictive Analytics)
Case Study Example: Slashdot Data
Basic Numbers:
• 24532 users
• 491 threads with 15 – 843 responses and 12 – 507 users each
• 113505 posts
• 60 main topics
• Selected Topic: Politics
[Figure labels: Post, Comments]
Case Study Example: Slashdot
• Very rich data sources about customers!
• We want to establish:
• How users feel about the discussed topic
• Whether it matters how users feel
• A more general abstraction of the results
Sentiment Analysis
Workflow annotations:
• Remove anonymous users, group by PostID
• Words Tagging: positive words and negative words from the MPQA Corpus
• BoW, Entity Filter, Word Frequency, Attitude Calculation by Document
• Total Attitude by User
• User Bins
• Word cloud for selected users
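The steps above can be sketched in a few lines of Python. This is only an illustration, not the KNIME workflow itself: the two word sets are tiny stand-ins for the MPQA lexicon, the attitude of a post is assumed to be the count of positive words minus the count of negative words, the user attitude is assumed to be the mean over the user's posts, and the anonymous user name and the zero threshold for the bins are assumptions as well.

```python
from collections import defaultdict

# Tiny stand-in word lists (assumption); the slides use the MPQA lexicon.
POSITIVE = {"good", "great", "freedom", "support"}
NEGATIVE = {"bad", "corrupt", "fail", "wrong"}


def document_attitude(text):
    """Return (#positive words - #negative words) for one post."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos - neg


def user_attitudes(posts, anonymous="Anonymous Coward"):
    """Average the per-post attitude for each user, skipping anonymous posts.

    `posts` is an iterable of (user, text) pairs. The anonymous user name
    is an assumption about how the data marks anonymous posts.
    """
    per_user = defaultdict(list)
    for user, text in posts:
        if user == anonymous:
            continue
        per_user[user].append(document_attitude(text))
    return {u: sum(v) / len(v) for u, v in per_user.items()}


def user_bin(attitude):
    """Bin a user as positive, negative, or neutral (threshold 0 is assumed)."""
    if attitude > 0:
        return "positive"
    if attitude < 0:
        return "negative"
    return "neutral"
```

Like the workflow described in the notes, this sketch does no sentence splitting and no negation handling, so "not good" still counts as one positive word.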
Slashdot – Text Mining
Most Negative User: pNutz
Slashdot – Text Mining
Most Positive User: dada21
Slashdot – Sentiment Analysis
• 16016 positive users
• 7107 negative users
• Most positive user: dada21 (2838 positive/1725 negative words)
• Most negative user: pNutz (43 positive/109 negative words)
• Which topics have positive users in common?
– Government
– People
– Law/s
– Money
– Market
– Parties
Network Creation
[Figure: example network with nodes User1 to User6]
Topic Graphs
Topic Graph: NASA
Topic Graph: Sci-Fi
Hubs & Authorities
• Hubs = Followers
• Authorities = Leaders
Workflow annotations:
• Filtering anonymous users and creating network
• Centrality index to define hub weight and authority weight
• Users with hub and authority weights and other features
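Hub and authority weights are commonly computed with the HITS power iteration. The sketch below is a plain Python version under stated assumptions: the slides do not say how edges are defined, so an edge is assumed to point from the user who replies to the user being replied to, and the function names are hypothetical. It is not the KNIME Network Mining implementation.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority weights by power iteration (HITS).

    `edges` is a list of (source, target) pairs. Here an edge is assumed
    to mean "source replied to target".
    """
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub weight: sum of the authority weights of nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


def _normalize(scores):
    """Scale the scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}
```

With this edge direction, users who receive many replies get high authority weights (leaders) and users who reply to them get high hub weights (followers).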
Hubs & Authorities
Labeled users: dada21, Carl Bialik from the WSJ, Tube Steak, Doc Ruby, pNutz, 99BottlesOfBeerInMyF
KNIME: Bringing it all together
Network Analysis: users with hub and authority weights and other features
Text Analysis: user bins (positive, negative, neutral)
Labeled users: dada21, Carl Bialik from the WSJ, WebHosting Guy, Catbeller, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF, pNutz
What we have found ...
- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users
What identifies each group?
How do I identify a new user?
How do I handle each user?
Why Clustering?
- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required
k-Means algorithm
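For readers new to k-Means, a minimal pure-Python sketch of the standard procedure (Lloyd's algorithm) follows. It is an illustration only, not the KNIME k-Means node: the random initialization, the `seed` parameter, and the function names are assumptions, and each user is assumed to be a tuple of numeric features such as attitude, hub weight, and authority weight.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-Means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k data points as start centroids
    assignment = None
    for _ in range(iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignment


def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))
```

Because the centroids are ordinary feature vectors, each cluster can be interpreted by reading its centroid, and a new user can be assigned by calling `nearest`.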
Re-sampling the Training Set
k = 10
The k-Means Clusters
Cluster labels: Superfans, Neutral users, Fans, Negative users
Additional Discoveries
• There are only very few real leaders! Authority and hub scores identify active participants rather than leaders.
• Superfans can be found in cluster_3
• Negative and (sigh!) active users are collected in cluster_1.
• Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8)
• Positive users with different degrees of activity are scattered across the remaining clusters.
The operational Workflow
Cluster Extraction
Pre-processing
Assignment of new data
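The three stages can be illustrated with a small sketch of the last one: a new user is pre-processed with the parameters learned on the training data and then assigned to the closest extracted cluster. The min-max scaling, the function names, and the centroid values in the usage note are assumptions made for this example; the slides do not specify the pre-processing.

```python
def min_max_params(rows):
    """Learn (min, max) per feature column from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, params):
    """Min-max scale one row with the training parameters (assumed step)."""
    return tuple(
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, (lo, hi) in zip(row, params)
    )


def assign_cluster(row, params, centroids):
    """Return the name of the centroid closest to the scaled row.

    `centroids` maps a cluster name to a centroid in scaled feature space.
    """
    scaled = scale(row, params)
    dists = {
        name: sum((a - b) ** 2 for a, b in zip(scaled, c))
        for name, c in centroids.items()
    }
    return min(dists, key=dists.get)
```

For example, with `params = min_max_params([(0, 10), (10, 30)])` and the made-up centroids `{"cluster_1": (0.0, 0.0), "cluster_3": (1.0, 1.0)}`, the call `assign_cluster((9, 28), params, centroids)` picks `cluster_3`.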
Notes
• MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)
• User Characterization is Sum -> Mean
• NLP: No sentence splitting, no negation identification.
• For a more refined syntax-based sentiment analysis -> "External Tool" node
External Tool Node
The "External Tool" node executes any external program from the command line:
1. Writes input data to an input file
2. Calls the tool to run on the input file with command line options and to write results to an output file
3. Reads the output file and presents the data at the output port
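The same three-step pattern can be sketched in Python. This is not the node's implementation: the function name, the file names, and the convention that the input and output paths are passed as the last two arguments are assumptions for this example, and `UPPERCASE_TOOL` is a stand-in program used only for demonstration.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in "tool" (assumption): copies the input file to the output file
# in upper case. A real sentiment tool would be called the same way.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
]


def run_external_tool(rows, command):
    """Write rows to an input file, run `command` on it, read the output file.

    `command` is a list of arguments; the input and output file paths are
    appended as the last two arguments (a convention assumed here).
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.txt")
        out_path = os.path.join(tmp, "output.txt")
        # 1. Write input data to an input file.
        with open(in_path, "w", encoding="utf-8") as f:
            f.write("\n".join(rows) + "\n")
        # 2. Call the tool on the input file; it writes the output file.
        subprocess.run(command + [in_path, out_path], check=True)
        # 3. Read the output file and return its rows.
        with open(out_path, encoding="utf-8") as f:
            return f.read().splitlines()
```

Calling `run_external_tool(["good", "bad"], UPPERCASE_TOOL)` runs the stand-in program once for the whole batch, which is why the pattern only suits tools that can run non-interactively.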
Alternative Sentiment Analysis
No free, non-interactive command-line tools for sentiment analysis were found
SentiStrength v2.2 (still interactive)
External Tool and Generic Web Service Client
Web Crawling Workflow
Community Web Crawler Node
XML Parsing Nodes
Next Steps
- Integrate topic information
- Integrate user demographic and behavioural information
- Discover [time series] patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments
Where do I find more?
Whitepaper: [email protected]
Complete Workflows + Data: www.knime.com
- text mining
- network mining
- combined analysis
(note: the three workflows above process very large data sets and require 16 GB of memory)
- clustering
Open Source Software: KNIME, www.knime.com
Next Appointment
User Day US Boston (free)
October 22nd 2013, 10:00 - 17:00
Microsoft New England R&D Center (NERD)
One Memorial Drive, Suite 100, Cambridge
http://www.knime.com/user-day-boston-2013
Hands-on Session
1. Download KNIME from www.knime.com
Hands-on Session
2. Install Extensions
Help -> Install New Software
Select:
• KNIME & Extensions
In KNIME Labs Extensions, select:
• KNIME Network Mining
• KNIME Textprocessing
Hands-on Session
3. Get workflows and Slashdot data
• Get workflows from USB stick (KNIMEBoston2013.zip)
– Text Mining
– Network Analytics
– Text and Network Mining
– Social Media Clustering
• Slashdot Raw Data is included in the downloaded workflows
• A smaller data set, Slashdot Reduced Data, is available for lower memory requirements
• Both data sets are available from the USB stick
Hands-on Session
4. Import Workflows
Hands-on Session
Memory Increase in knime.ini
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
-vmargs
-Xmx2G
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64
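The memory setting is the `-Xmx` line after `-vmargs`; the listing above shows `-Xmx2G`. As a convenience, the edit can be scripted. The helper below is a hypothetical sketch that works on the text of the file only; it assumes the Eclipse launcher convention that JVM options follow `-vmargs`, and it leaves reading and writing the actual knime.ini file to the caller.

```python
def set_max_heap(ini_text, size):
    """Return ini_text with the -Xmx line set to `size` (for example "4G").

    If no -Xmx line exists after -vmargs, one is appended at the end,
    adding -vmargs first when it is missing as well.
    """
    lines = ini_text.splitlines()
    vmargs_seen = False
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "-vmargs":
            vmargs_seen = True
        elif vmargs_seen and stripped.startswith("-Xmx"):
            lines[i] = "-Xmx" + size
            replaced = True
    if not replaced:
        if not vmargs_seen:
            lines.append("-vmargs")
        lines.append("-Xmx" + size)
    return "\n".join(lines) + "\n"
```

KNIME has to be restarted for a changed knime.ini to take effect, and the value should stay below the physical memory of the machine.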
Hands-on Session
5. Improve Workflows: Text Mining
Workflow annotations: Data Reading, Data Tagging, Preprocessing Words, Reading Tag Corpus, Scoring and Tag Cloud, BoW
Hands-on Session
6. Improve Workflows: Network Analytics
Workflow annotations: Data Reading and preprocessing, Create Network Object, Visualize Network, Clean up Network
zoomba
nahdude812