Qian (Steve) He's Association Rules Mining over Twitter Data

Data Preparation – Project 3: Part II
Steve Qian He
Prof. Carolina Ruiz
CS 548 – Data Mining
Overview
• Project Description
• Data Collection
• Data Preprocessing
• Data Transformation
• Results
Project Description
• Find the word set in the Diabetes domain
• Find the associations between words in this set
• Example words: sugar, wound, Diabetes, ice cream, insulin, chocolate
Data Collection
• Source: Twitter Public Timeline
• http://api.twitter.com/1/statuses/public_timeline.format
• format: json, xml, rss, atom
• Tool:
• The Archivist Desktop Version
• http://visitmix.com/work/archivist-desktop/
• (You can do this with any programming
language. But please don’t waste your
time reinventing the wheel…)
Data Collection
Just in case you want to REINVENT it…
Libraries for different programming languages:
• C++: Twitcurl
• Ruby: Twitter
• Java: Twitter4J
• Perl: Net::Twitter
• .NET: Twitterizer
• Objective-C: MGTwitterEngine
• PHP: TmhOAuth
• Python: Tweepy
Data Collection
Just in case you want to REINVENT the
tools for REINVENTING the wheel…
1. Parse XML (RSS, Atom) or JSON in your language;
2. Follow the Twitter API resource documentation.
https://dev.twitter.com/docs
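For step 1 above, the parsing part is mostly a matter of decoding the response and pulling out the tweet text. Here is a minimal sketch in Python, assuming a JSON response shaped as a list of status objects with "id" and "text" fields. The sample payload and the function name extract_texts are made up for illustration; check the actual field layout against the Twitter API documentation.

```python
import json

# Hypothetical sample payload: a JSON array of status objects.
# The real response layout must be taken from the API documentation.
SAMPLE_RESPONSE = """
[
  {"id": 1, "text": "Weight-loss surgery may stem diabetes in some"},
  {"id": 2, "text": "Study links lack of sleep to Diabetes risk"},
  {"id": 3, "user": "this status has no text field"}
]
"""


def extract_texts(raw_json, keyword):
    """Return the text of every status that mentions keyword (case-insensitive)."""
    statuses = json.loads(raw_json)
    texts = []
    for status in statuses:
        text = status.get("text")  # skip statuses without a text field
        if text and keyword.lower() in text.lower():
            texts.append(text)
    return texts


tweets = extract_texts(SAMPLE_RESPONSE, "diabetes")
print(len(tweets), "tweets mention the keyword")
```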
Data Collection
Kept “The Archivist” running on my lab computer
for about 7 days (Mar. 20 – Mar. 27).
Data Collection
Collected 40,545 tweets from Twitter
with keyword “diabetes”.
Data Preprocessing
Why do we need to preprocess the data
before importing it into Weka?
• Weka doesn’t understand the file format.
• We only care about “instance” (tweet) and
“attribute” (word).
• We need to manually pick some words
(attributes) that make sense in the Diabetes
domain.
The “word” here means a substring
of a string separated by one of
“!\"#$%&'()*+,./0123456789:;<=>?@[\\]^_`{|}~\t\n\r\f” characters.
My bad, we have better ways to do this…
java.util.StringTokenizer does not do a good job there.
Please use java.util.regex instead.
Data Preprocessing
1. Choose the high-frequency words:
“t co http rt a to in the de of and for i you with
type is have la s y that it terlalu new has el may en
diet on weight juan my what que loss can risk me
surgery study at he manis some now this or un does
d be para - your blood kicks”
2. Remove the obvious noise tweets.
“Juan has 40 chocolate bars. He eats 35. What does Juan have now?
Diabetes. Juan has diabetes.” – cold joke…
“Weight-loss surgery may stem diabetes in some –
http://t.co/2sZQzoER: Two new clinical trials show that patients ...
http://t.co/Gq0d3VMs” – pop news
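The tokenizing and word-counting steps above can be sketched with a regular expression. The slides recommend java.util.regex; the sketch below uses Python's re module as a stand-in. Two assumptions are mine and not from the slides: a word is a maximal run of ASCII letters (so the split differs slightly from the delimiter list above), and frequency is measured as the number of tweets containing the word. The function names are illustrative, and the third sample tweet is made up.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of ASCII letters, lower-cased.
WORD_RE = re.compile(r"[a-z]+")


def tokenize(tweet):
    """Split a tweet into lower-case words."""
    return WORD_RE.findall(tweet.lower())


def document_frequencies(tweets):
    """Count, for each word, how many tweets contain it at least once."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(tokenize(tweet)))  # set(): one count per tweet
    return counts


sample = [
    "Juan has 40 chocolate bars. He eats 35. What does Juan have now?",
    "Weight-loss surgery may stem diabetes in some - http://t.co/2sZQzoER",
    "RT: weight loss and diabetes risk",  # made-up tweet
]
freq = document_frequencies(sample)
for word, count in freq.most_common(5):
    print(word, count)
```

Note that URLs fall apart into pieces such as "http", "t" and "co" under this kind of splitting, which is consistent with those tokens showing up at the top of the high-frequency list.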
Data Preprocessing
“surgery weight loss alert risk
health treatment bariatric high
help medicine disease remission
record test study heart combat
diet sugar chocolate obesity con
pro blood leading learn support
diabetic stroke patient”
31,621 tweets remained after filtering out obvious noise
31 meaningful words selected from 150 words
(with minimum frequency 0.1)
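The slides do not say how the noise tweets were identified. One plausible approach, sketched below, is to drop every tweet that contains a known marker phrase taken from the joke or news item. The marker list and the function names are assumptions for illustration only.

```python
# Assumption: noise is recognized by marker phrases picked by hand.
NOISE_MARKERS = (
    "juan has 40 chocolate bars",
    "weight-loss surgery may stem diabetes",
)


def is_noise(tweet, markers=NOISE_MARKERS):
    """True if the tweet contains any marker phrase (case-insensitive)."""
    lowered = tweet.lower()
    return any(marker in lowered for marker in markers)


def remove_noise(tweets, markers=NOISE_MARKERS):
    """Keep only the tweets that match none of the marker phrases."""
    return [tweet for tweet in tweets if not is_noise(tweet, markers)]


tweets = [
    "Juan has 40 chocolate bars. He eats 35. What does Juan have now? "
    "Diabetes. Juan has diabetes.",
    "Experimental study suggests lack of sleep may pose risk for "
    "development of Diabetes.",
    "RT Weight-loss surgery may stem diabetes in some",
]
kept = remove_noise(tweets)
print(len(kept), "of", len(tweets), "tweets kept")
```

Substring matching also catches retweets and lightly edited copies, but it only removes noise that has already been spotted by hand.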
Data Transformation
• Generate Weka file (.arff)
• Instance: tweet
• Attribute: selected word in tweet
e.g. “Experimental study suggests lack of sleep may
pose risk for development of Diabetes.”
tweet id   weight   loss   surgery   risk
xxxxxx     0        0      0         1
Weka File
@relation diabetes
@attribute diet {0,1}
@attribute weight {0,1}
@attribute risk {0,1}
@attribute surgery {0,1}
@data
0,0,1,0
0,0,0,0
0,1,0,0
0,0,0,0
0,0,0,0
…
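The transformation into this file layout can be sketched as follows: one row per tweet, one {0,1} attribute per selected word. This is a minimal Python sketch rather than the code used for the project. The function names are illustrative, and the letters-only tokenization is an assumption.

```python
import re


def tweet_to_row(tweet, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in the tweet."""
    words = set(re.findall(r"[a-z]+", tweet.lower()))
    return [1 if word in words else 0 for word in vocabulary]


def to_arff(tweets, vocabulary, relation="diabetes"):
    """Build the text of an ARFF file with one binary attribute per word."""
    lines = ["@relation " + relation]
    lines += ["@attribute %s {0,1}" % word for word in vocabulary]
    lines.append("@data")
    for tweet in tweets:
        row = tweet_to_row(tweet, vocabulary)
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


vocabulary = ["diet", "weight", "risk", "surgery"]
tweets = [
    "Experimental study suggests lack of sleep may pose risk for "
    "development of Diabetes.",
]
arff = to_arff(tweets, vocabulary)
print(arff)
```

For the example tweet only "risk" is present, so its data row has a single 1 in the third column.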
Results
Results I got from Weka:
1. surgery loss ==> weight conf:(1) lift:(10.41) lev:(0.04) [1141] < conv:(571.36)>
2. bariatric ==> surgery conf:(0.95) lift:(7.68) lev:(0.03) [901] < conv:(18.67)>
3. loss ==> weight conf:(0.94) lift:(9.77) lev:(0.08) [2451] < conv:(14.46)>
4. surgery weight ==> loss conf:(0.92) lift:(10.03) lev:(0.04) [1137] < conv:(11.72)>
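To help read the conf, lift, lev and conv columns, here is a small sketch that computes the four measures from their textbook definitions. The transactions are made-up toy data, not the project's tweets, and the function name is illustrative. Weka's own numbers may differ in detail, for example through rounding or the way it handles rules with confidence 1.

```python
def rule_metrics(transactions, premise, consequence):
    """Textbook confidence, lift, leverage and conviction for premise ==> consequence.

    Assumes the premise and the consequence each occur in at least one transaction.
    """
    premise, consequence = set(premise), set(consequence)
    n = len(transactions)
    n_a = sum(1 for t in transactions if premise <= t)
    n_b = sum(1 for t in transactions if consequence <= t)
    n_ab = sum(1 for t in transactions if (premise | consequence) <= t)

    confidence = n_ab / n_a                       # P(B | A)
    lift = confidence / (n_b / n)                 # P(B | A) / P(B)
    leverage = n_ab / n - (n_a / n) * (n_b / n)   # P(A,B) - P(A)P(B)
    if confidence == 1:
        conviction = float("inf")                 # rule is never violated
    else:
        conviction = (1 - n_b / n) / (1 - confidence)
    return {"conf": confidence, "lift": lift, "lev": leverage, "conv": conviction}


# Toy data: each transaction is the set of selected words found in one tweet.
toy = [
    {"loss", "weight"}, {"loss", "weight", "surgery"}, {"loss"}, {"weight"},
    {"sugar"}, {"sugar", "diet"}, {"risk"}, {"loss", "weight", "diet"},
    {"blood"}, {"risk", "study"},
]
metrics = rule_metrics(toy, {"loss"}, {"weight"})
print(metrics)
```

A lift well above 1, as in the four rules listed, means the words co-occur far more often than they would if they were independent.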
Future Work
• Word segmentation
• “I like [ice] [cream].”
• Or “I like [ice cream].”
• Polysemy
• “I'm sick of sugar!”
• “For people with diabetes, being sick can
also affect blood sugar levels.”
• I need a “cold joke & news” detection
tool!!
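One simple way to approach the word segmentation item is to merge a hand-made list of multi-word terms into single tokens before splitting. The sketch below assumes such a list exists; the phrases, the underscore convention and the function name are all illustrative, and it does nothing for the polysemy problem.

```python
import re

# Hypothetical list of multi-word terms to keep together.
PHRASES = ("ice cream", "blood sugar")


def tokenize_with_phrases(text, phrases=PHRASES):
    """Tokenize text, keeping each listed phrase as one underscore-joined token."""
    lowered = text.lower()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        lowered = re.sub(pattern, phrase.replace(" ", "_"), lowered)
    return re.findall(r"[a-z_]+", lowered)


tokens = tokenize_with_phrases("I like ice cream.")
print(tokens)
```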
Thanks
Q&A