ChatterGrabber.py Methods and Development

A System for High Throughput Social Media
Data Collection
By James Schlitt, in collaboration with Elizabeth
Musser, Dr. Bryan Lewis, and Dr. Stephen Eubank
Introduction
Social media surveillance is a valuable tool for epidemiological research:
– Pros: Cheap, consistent, and easy to parse data source.
– Cons: Volume & specificity vary, content cannot be easily verified.
– Tweepy provides easy Twitter API access via Python.
The Twitter Norovirus Study
• The Virginia Department of Health requested a tool, developed under MIDAS
funding, to track Norovirus and gastrointestinal illness (GI) outbreaks within
Montgomery County, VA, with the following capabilities:
– Automated surveillance of social media.
– No special skills required to use.
– Forward compatible for GIS applications.
• Twitter was well suited to GI outbreak surveillance due to the short duration of
infection and Tweet-worthy symptoms. For example:
@ATweeter20 My hubs did some vomiting w/ his flu. Had stuff messing w/ his
tummy - high fever & snot was bad. Get better!
• Challenged by low population density and a high degree of linguistic confounders.
Methods Considered
• Tweepy.py: Python wrapper for Twitter RESTful APIs
• Gnip: Twitter commercial partner
Twitter Method Comparison
Streaming:
• Up to 1% of total stream volume by location or keywords from Twitter.
• About 10 keywords per query, 1 query per stream, and 1 stream per OAuth key.
• Tweets come in real-time.
• Tweet pull rate limited by Twitter.
• Most commonly used, great for whole country studies with simple queries.
Search:
• Up to 100% of all tweets matching query, may use multiple queries.
• Search by location and/or keywords.
• ~35 mile search radius limit.
• All tweets within the last week.
• Query rate limited per Twitter OAuth key to 180 searches every 15 minutes.
• Narrower geographic coverage, but very flexible.
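The search limits above put a floor on how often an experiment can re-poll every query. The following is a rough budget calculation only, not code from ChatterGrabber: the function name and the example counts (3 queries, 4 search locations, 2 keys) are illustrative assumptions, while the figure of 180 searches per 15 minutes per key comes from the comparison above.

```python
SEARCHES_PER_WINDOW = 180   # per OAuth key, from the comparison above
WINDOW_SECONDS = 15 * 60    # 15-minute rate-limit window


def min_cycle_seconds(n_queries, n_locations, n_keys=1):
    """Shortest sustainable time between full passes over all query/location pairs."""
    calls_per_pass = n_queries * n_locations
    calls_per_window = n_keys * SEARCHES_PER_WINDOW
    return calls_per_pass * WINDOW_SECONDS / calls_per_window


# Hypothetical experiment: 3 keyword queries over 4 search circles, 2 keys.
print(min_cycle_seconds(3, 4, n_keys=2))  # 30.0
```

Under these assumptions a full pass can be repeated at most every 30 seconds; adding queries or locations lengthens the cycle, and adding keys shortens it.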
Gnip Method Comparison
• Official Twitter data partner.
• Historic or real-time, variety of services.
• Large volume, representative sample. Excellent choice
when affordable!
• Prices not public, quoted in a 2010 interview as:
– 5% stream for $60k/year.
– 50% stream for $360k/year.
Challenges
Given a partial data sample, how can we accurately track
tweets in an area with low engagement?
– 12 potential NRV Norovirus/GI Tweets per day.
– 4 suspected hits after human confirmation.
– Long keyword list requires multiple queries.
Twitter 1% streaming is limited by query length and volume, and
Gnip was not affordable; that leaves the search method...
ChatterGrabber Introduction
ChatterGrabber: A search-method-based social media data
miner developed in Python.
– Google Docs interface (GDI) included for simplified partner
access.
– Specialized hunters pull from GDI Spreadsheets to set
run parameters.
– Multiple logins may be used to increase search
frequency during collaborative experiments.
– No limits on query length.
– Data sent nightly to subscribers as CSV.
– Summary of history presented in dashboard (under
development)
ChatterGrabber Reliability
High redundancy & error tolerance for long term experiments:
– If multiple API keys are used, functional keys take up the
work of failed keys until they can be reconnected.
– Daemon automatically executes & resumes experiments
on start up and after an interruption.
– Any hunter may be resumed up to 1 week after
termination without loss of incoming data.
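The key failover behaviour described above can be sketched as a small rotation pool. This is an illustration under stated assumptions, not ChatterGrabber's implementation: `KeyPool`, `retry_after`, and the `search_fn(key, query)` callable are hypothetical names, and a failed key is modelled simply as one whose call raises `ConnectionError`.

```python
import time


class KeyPool:
    """Rotate over API keys, skipping keys that recently failed."""

    def __init__(self, keys, retry_after=900.0):
        self.keys = list(keys)
        self.retry_after = retry_after   # seconds before a failed key is retried
        self.failed_at = {}              # key -> time of last failure
        self._next = 0                   # index of the key to try first

    def _usable(self, key, now):
        failed = self.failed_at.get(key)
        return failed is None or now - failed >= self.retry_after

    def call(self, search_fn, query, now=None):
        """Try each usable key in round-robin order until one succeeds."""
        now = time.time() if now is None else now
        for offset in range(len(self.keys)):
            idx = (self._next + offset) % len(self.keys)
            key = self.keys[idx]
            if not self._usable(key, now):
                continue
            try:
                result = search_fn(key, query)
            except ConnectionError:
                self.failed_at[key] = now    # bench this key for a while
                continue
            self.failed_at.pop(key, None)    # key works again
            self._next = (idx + 1) % len(self.keys)
            return result
        raise RuntimeError("no functional API keys available")
```

Working keys absorb the calls of benched keys, and a benched key is tried again once `retry_after` seconds have passed.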
General Execution
1. Pull list of condition phrases & config from Google spreadsheet.
2. Partition conditions into {x} queries.
3. Search radius > 35 miles?
– Yes: Generate {y} coordinate sets via covering algorithm, then prepare
search with |x|*|y| queries.
– No: Prepare search with |x| queries.
4. Run Twitter search, from last tweet ID recorded for location and query pair.
5. Filter results by phrases, classifiers, and location; sleep.
6. Has a new day begun?
– Yes: Store data, send subscribers CSV and config link.
– No: Return to step 4.
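The partitioning and query-count steps above can be sketched as follows. This is a minimal illustration, not the author's code: the function names and the `max_chars` limit are assumptions, and the sketch relies only on the Twitter search operators `OR` and quoted exact phrases.

```python
def partition_conditions(conditions, max_chars=400):
    """Greedily pack phrases into OR-joined queries no longer than max_chars.

    max_chars is a hypothetical per-query length limit. A single phrase
    longer than max_chars still becomes its own (over-long) query.
    """
    queries, current = [], []
    for phrase in conditions:
        # Quote multi-word phrases so they are matched exactly.
        term = '"%s"' % phrase if " " in phrase else phrase
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_chars:
            queries.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(" OR ".join(current))
    return queries


def search_jobs(queries, coordinate_sets):
    """Pair every query with every search circle: |x| * |y| jobs."""
    return [(query, coords) for coords in coordinate_sets for query in queries]
```

With |x| queries and |y| coordinate sets, `search_jobs` returns |x|*|y| pairs; each pair would keep its own last-seen tweet ID so that a search resumes where it left off.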
ChatterGrabber GDI Interface Example
ChatterGrabber Search Methods
Pure Query Based:
• Conditions, qualifiers, & exclusions.
• Searches by conditions, keeps if qualifier and no exclusions present.
• Simple, easy to set up, but vulnerable to complexities of wording.
NLTK* Based:
• Take output from conditions search, manually classify.
• Train NLTK MaxEnt or Naïve Bayesian classifier via content n-grams.
• Classifier discards tweets that don’t fit desired categories.
• Powerful, but requires longer setup, representative tweet sample.
*NLTK: Natural Language Tool Kit
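To make "content n-grams" concrete, here is a sketch of one way a tweet could be turned into features. The tokenisation rule, the feature naming, and the function name are assumptions for illustration, not the features ChatterGrabber uses. The output is a plain dict, which is the featureset shape NLTK classifiers accept when trained on (featureset, label) pairs.

```python
import re


def ngram_features(text, max_n=2):
    """Map a tweet to {"contains(<n-gram>)": True} for all 1..max_n word n-grams."""
    # Lowercase and keep word-like tokens, including @mentions and #hashtags.
    tokens = re.findall(r"[a-z0-9@#']+", text.lower())
    features = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features["contains(%s)" % " ".join(tokens[i:i + n])] = True
    return features


print(sorted(ngram_features("Stomach flu again")))
```

A manually scored sample would then supply the labels, for instance a list of `(ngram_features(tweet), label)` pairs passed to a classifier's training routine.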
Tweet Linguistic Classification
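1. Tweet passed for classification.
2. Using NLTK mode?
– Yes: Extract features from Tweet and classify Tweet by features. Is Tweet
classification sought? If yes, go to step 3; if no, go to step 4.
– No: Does Tweet contain an exclusion? If yes, go to step 4. If no, does
Tweet contain a qualifier? If yes, go to step 3; if no, go to step 4.
3. Store Tweet data and derived data.
4. Keeping non-hits? If yes, store Tweet data and derived data; if no,
discard Tweet.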
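The classification flow above can be expressed as one decision function. This is a sketch under assumptions rather than ChatterGrabber's code: the function and parameter names are hypothetical, `classify` stands in for any trained classifier that returns a label, and phrase matching is simplified to lowercase substring tests.

```python
def keep_tweet(text, qualifiers, exclusions, sought=None, classify=None,
               keep_non_hits=False):
    """Return (stored, is_hit) for one tweet.

    qualifiers and exclusions are lowercase phrases; classify, if given,
    is a callable mapping tweet text to a label (NLTK mode).
    """
    lowered = text.lower()
    if classify is not None:
        # NLTK mode: a hit is a tweet whose predicted label is sought.
        is_hit = classify(text) in (sought or ())
    else:
        # Pure query mode: needs a qualifier and no exclusion.
        has_exclusion = any(phrase in lowered for phrase in exclusions)
        has_qualifier = any(phrase in lowered for phrase in qualifiers)
        is_hit = has_qualifier and not has_exclusion
    # Non-hits are stored only when the experiment keeps them.
    return (is_hit or keep_non_hits), is_hit
```

For example, `keep_tweet("I have the stomach flu", ["i have"], ["bieber"])` stores the tweet as a hit, while a tweet containing the hypothetical exclusion "bieber" is discarded unless `keep_non_hits` is set.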
NLTK Classifier Example
ChatterGrabber Geographic Methods
• Large lat/lon boxes filled via covering algorithm.
• Fine and coarse geolocations obtained via GoogleMapsV3 API:
– If coordinates to tweet are present, finds street address.
– If common name present, finds coordinates, then searches by
coordinates for proper name/street address of position.
– If location is outside of lat/lon box, discards tweet.
• All geo queries cached locally, shared between experiments, and
pulled on demand to reduce API utilization.
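The slides do not say which covering algorithm is used, so the following is only one simple possibility: a square grid of search circles. It assumes a flat-earth approximation with about 69 miles per degree of latitude and does not handle boxes that cross the antimeridian; all names are hypothetical.

```python
import math

MILES_PER_DEGREE_LAT = 69.0   # rough average; an assumption of this sketch


def cover_box(south, west, north, east, radius_miles=35.0):
    """Return (lat, lon) circle centres whose circles jointly cover the box.

    Grid cells have side at most r*sqrt(2), so every point of a cell lies
    within r of the cell's centre.
    """
    side = radius_miles * math.sqrt(2)
    # A degree of longitude spans the most miles nearest the equator.
    min_abs_lat = 0.0 if south <= 0.0 <= north else min(abs(south), abs(north))
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(math.radians(min_abs_lat))
    rows = max(1, math.ceil((north - south) * MILES_PER_DEGREE_LAT / side))
    cols = max(1, math.ceil((east - west) * miles_per_degree_lon / side))
    dlat = (north - south) / rows
    dlon = (east - west) / cols
    return [(south + (i + 0.5) * dlat, west + (j + 0.5) * dlon)
            for i in range(rows) for j in range(cols)]


def in_box(lat, lon, south, west, north, east):
    """True if a geolocated tweet falls inside the experiment's lat/lon box."""
    return south <= lat <= north and west <= lon <= east
```

A hexagonal layout would need fewer circles for the same area; the square grid is used here only because its coverage guarantee is easy to verify.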
Work Flow
Basic Execution:
1. Create GDI sheet, run initial experiment.
2. Check first results for confounders, update keyword lists.
3. Rerun experiment with new keywords, monitor periodically
for new keywords & memes.
If Greater Specificity Desired:
1. Run whole country experiment with desired query list.
2. Score output manually & enable NLTK classification.
3. Expand area as desired.
Results
• Found and geolocated 4,000-8,000 suspected Norovirus tweets per day
across the US during peak Norovirus season.
• Preliminary estimates of 70-80% accuracy with 2,000 tweet training set.
Results Continued
Limitations
• Results exceed the geographic and temporal resolution of
existing surveillance systems, complicating verification.
• No true denominator: ChatterGrabber only collects queried
hits.
• Not all desired information is available in social media;
some may be incomplete or falsified.
• ChatterGrabber is just an information gathering method;
external analysis and review are needed for validity.
• Twitter users will differ from population at large.
Conclusions
● ChatterGrabber provides an easy to use social media
surveillance tool.
– Natural Language Processing speeds illness identification.
– Geographic region directed searching allows complete
coverage of a user defined jurisdiction.
● ChatterGrabber can successfully identify GI illness related
tweets in a population.
– 220 Million USA Tweets per day.
– 6,000 matches per day by Nationwide NLTK search.
– 353 matches per day by Virginia keyword search.
– 136 matches per day by Virginia NLTK search.
Next Steps
• Streamlined web interface needed for NDSSL long term
studies.
• Real-time bioterrorism surveillance methods under
evaluation using gun violence as a proof of concept.
• Norovirus visualization & dashboard under development by
Elizabeth Musser.
• Tick bite zoonosis and unlicensed tattoo hepatitis risk
tracking underway by Pyrros Telionis.
• Vaccine sentiment tracking underway by Meredith Wilson.
Next Steps
Figure: Firearm violence related tweets by time of day
Next Steps: Public Health Outreach
• Design and execution of real-world use by state and local
public health offices.
• Dashboard deployed and customized for users across
Virginia.
• Evaluation of pre and post deployment practice.
• Assessment of utility and iterative refinement.
• If interested contact: [email protected]
References
Python Resources
I. Roesslein, J. (2009). Tweepy (Version 1.8) [Computer program]. Available at
https://github.com/tweepy/tweepy (Accessed 1 November 2013)
II. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python.
O’Reilly Media Inc. (Accessed 14 January 2014)
III. Google Developers (2012). gdata-python-client (Version 3.0) [Computer program]. Available at
http://code.google.com/p/gdata-python-client/ (Accessed 6 January 2014)
IV. McKinney, W. (2010). Data structures for statistical computing in Python. In Proc. 9th Python Sci.
Conf (pp. 51-56)
V. Tigas, M. (2014). GeoPy (Version 0.99) [Computer program]. Available at
https://github.com/geopy/geopy (Accessed 21 December 2013)
VI. KilleBrew, K. (2013). query_places.py [Computer program]. Available at
https://gist.github.com/flibbertigibbet/7956133 (Accessed 27 January 2014)
VII. Coutinho, R. (2007, August 22nd) Sending emails via Gmail with Python [Web log Post]. Retrieved
January 5th from http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html
Relevant Papers
I. Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data.
F1000Research, 3.
II. Young, S. D., Rivers, C., & Lewis, B. (2014). Methods of using real-time social media technologies
for detection and remote monitoring of HIV outcomes. Preventive medicine.
III. Chakraborty, P., Khadivi, P., Lewis, B., Mahendiran, A., Chen, J., Butler, P., ... & Ramakrishnan, N.
Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. SDM14
Questions?