Twitter Update
Update
By: Brian Klug, Li Fan
Presentation Overview:
• API we plan to use (Syntax and commands)
• Obtainable Data Types (Location, Text, Time, User, Reply)
• Infrastructure (Hardware, Storage Req’s, Design)
• Tentative Work Plan (Timeline and Schedule)
Update
API: Streaming API
• Enables near-real-time access to a subset of public Twitter statuses
  – Currently in alpha test
  – Access to further restricted resources is extremely limited and granted only after acceptance of an additional TOS document
• We have applied for credentials that grant access to these increased resources (namely a larger sampling, more statuses)
  – http://apiwiki.twitter.com/Streaming-API-Documentation
• Features of streaming API (a minimal connection sketch follows below)
  – Continual connection that streams statuses over HTTP; opened indefinitely and only requires basic authentication for the most basic level
  – Output data is in XML or JSON format, both of which are easy to parse
  – Can focus on certain tracking predicates that, when specific enough, return all occurrences in the full Firehose stream
    • E.g. "track=basketball,football,baseball,footy,soccer". Execute: curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
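The Python sketch below makes the same request as the curl command, assuming the requests library is available and using the endpoint and placeholder credentials from the slide; this is only an illustration of the documented mechanism, not a finished downloader.

import json
import requests  # assumed HTTP client; any library with streaming support works

# Endpoint and placeholder credentials taken from the curl example above.
STREAM_URL = "http://stream.twitter.com/1/statuses/filter.json"

def stream_statuses(username, password, track_terms):
    """Hold one long-lived HTTP connection open and yield one parsed status per line."""
    resp = requests.post(
        STREAM_URL,
        data={"track": ",".join(track_terms)},  # tracking predicates, comma-separated
        auth=(username, password),              # basic authentication (lowest access level)
        stream=True,                            # keep the connection open indefinitely
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:                                # skip blank keep-alive lines
            yield json.loads(line)

if __name__ == "__main__":
    terms = ["basketball", "football", "baseball", "footy", "soccer"]
    for status in stream_statuses("AnyTwitterUser", "Password", terms):
        print(status.get("text"))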
Update
Streaming API data
• Example data:
–
{"truncated":false,"text":"@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not
busy.","favorited":false,"in_reply_to_screen_name":"FreedomProject","source":"<a href=\"http://www.tweetdeck.com/\"
rel=\"nofollow\">TweetDeck</a>","created_at":"Fri Nov 20 06:37:58 +0000
2009","in_reply_to_user_id":20688076,"in_reply_to_status_id":5882468251,"geo":null,"user":{"favourites_count":0,"ve
rified":false,"notifications":null,"profile_text_color":"34da43","time_zone":"Tijuana","profile_link_color":"e98907","descri
ption":"I'm a Robot created in Mexican soil, therefore my name is Mexican
Robot","profile_background_image_url":"http://a3.twimg.com/profile_background_images/4329659/d2e513deb84e6fd
c10de6ac70ef2f637f8f62f26.jpg","created_at":"Mon Dec 22 07:34:02 +0000
2008","profile_sidebar_fill_color":"b03636","profile_background_tile":false,"location":"Surfin' tubular Innernet
waves","following":null,"profile_sidebar_border_color":"050e61","protected":false,"profile_image_url":"http://a3.twimg.c
om/profile_images/515614231/jessicaavvy_normal.png","statuses_count":946,"followers_count":59,"name":"Mexican
Robot","friends_count":173,"screen_name":"MexicanRobot","id":18303131,"geo_enabled":false,"utc_offset":28800,"profile_background_color":"000000","url":"http://sharkwithwheels.webs.com"},"id":5882552501}
• Data Classes (a small parsing sketch follows below):
  • Who the message is in response to, if anyone
  • Client user agent
  • Location-tagged geo-aware data, if any
  • Time of creation and time zone of poster
  • Information about avatar, background, profile
  • User metrics: statuses posted, followers, friends
  • User description: short user-defined string
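As a sketch of how those data classes map onto the example JSON above, the hypothetical helper below pulls each one out of a single raw status line; the field names come from the example, while the function name is our own.

import json

def extract_fields(raw_line):
    """Pull the data classes listed above out of one raw JSON status line."""
    status = json.loads(raw_line)
    user = status.get("user", {})
    return {
        "in_reply_to": status.get("in_reply_to_screen_name"),  # who the message replies to, if anyone
        "source": status.get("source"),                        # client user agent
        "geo": status.get("geo"),                              # location-tagged geo-aware data, if any
        "created_at": status.get("created_at"),                # time of creation
        "time_zone": user.get("time_zone"),                    # poster's time zone
        "statuses_count": user.get("statuses_count"),          # user metrics: statuses posted
        "followers_count": user.get("followers_count"),        # ... followers
        "friends_count": user.get("friends_count"),            # ... friends
        "profile_image_url": user.get("profile_image_url"),    # avatar / profile information
        "description": user.get("description"),                # short user-defined string
    }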
Update
Infrastructure
• Streaming API expected volume: 3-4 million entries/day
• Storage Consideration:
  – Average total JSON example output size: ~1400 characters
  – Messages are UTF-8; we’ll assume most characters are 1 byte
  – ~1400 bytes/status × 3.5 million statuses/day ≈ 4.56 GB/day
  – 1 year ≈ 1.6 terabytes (arithmetic checked in the sketch below)
• Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
  – May require additional public IP addresses depending on rate limits, additional servers depending on load
• Download first, parse later (see the raw-archiving sketch below)
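Below is a back-of-the-envelope check of the storage numbers above, followed by a minimal sketch of the "download first, parse later" step; the daily figures are the slide's assumptions, and the output directory and file naming are hypothetical choices, not a settled design.

import datetime

# Assumptions from the slide: ~1,400 bytes per status, ~3.5 million statuses per day.
BYTES_PER_STATUS = 1400
STATUSES_PER_DAY = 3.5e6

bytes_per_day = BYTES_PER_STATUS * STATUSES_PER_DAY
print(bytes_per_day / 2**30)        # ~4.56 GiB/day
print(bytes_per_day * 365 / 2**40)  # ~1.63 TiB/year

def archive_raw(raw_lines, out_dir="/data/twitter"):  # out_dir is a hypothetical path
    """Append raw JSON byte lines (e.g. from resp.iter_lines()) to one file per day.
    Parsing is deferred to a later pass."""
    current_day, out = None, None
    for line in raw_lines:
        day = datetime.date.today().isoformat()
        if day != current_day:                        # roll over to a new file at midnight
            if out:
                out.close()
            out = open(f"{out_dir}/statuses-{day}.json", "ab")
            current_day = day
        out.write(line + b"\n")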
Update
Tentative Timeline
• Work Plan
  – Continue investigating using RSS to download status updates from farther in the past than the 15,000 statuses we are allowed to go back using the streaming API
  – 1-2 weeks: test our environment and make sure everything is working well
    • Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test
    • Await a possible response from Twitter regarding access to additional restricted resources (an even higher-rate firehose)
  – 2 weeks to explore how to parse the content into a DB, and whether this can realistically be done in real time in another process (a rough sketch follows below)
  – Additional time for data mining, research topics, etc.
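As a rough sketch of the DB-parsing step mentioned above, the snippet below loads already-parsed statuses into SQLite; SQLite and the table layout are placeholders, since no engine or schema has been chosen yet, and it assumes the extract_fields() helper sketched on the data slide.

import sqlite3

def load_into_db(parsed_statuses, db_path="statuses.db"):
    """Insert parsed status dicts (a subset of the extract_fields output) into a local table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS statuses (
                        in_reply_to TEXT, source TEXT, geo TEXT, created_at TEXT,
                        time_zone TEXT, statuses_count INTEGER,
                        followers_count INTEGER, friends_count INTEGER,
                        description TEXT)""")
    conn.executemany(
        "INSERT INTO statuses VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (s["in_reply_to"], s["source"], str(s["geo"]), s["created_at"], s["time_zone"],
             s["statuses_count"], s["followers_count"], s["friends_count"], s["description"])
            for s in parsed_statuses
        ],
    )
    conn.commit()
    conn.close()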