BIG Data: Crawling Large-Scale and Real-Time
Tweets With MySQL Database
2013 Open Seminar Series 6
Open Geospatial Informatics
Cheng-Ying Liu (Sean)
http://bermuda.citi.sinica.edu.tw
BIG Data & Twitter
WHAT IS BIG DATA ?
In information technology, big data is a loosely-defined term
used to describe data sets so large and complex that they
become awkward to work with using on-hand database
management tools.
《Wikipedia Big data》
Source: http://en.wikipedia.org/wiki/Big_data
WHAT IS BIG DATA ?
• In 2001, Doug Laney used the 3V model to describe Big Data
‒ Volume: amount of data
‒ Velocity: speed of data in and out
‒ Variety: range of data types and sources
‒ Veracity: truth or fact of data (a fourth V, added later to the original three)
WHAT IS BIG DATA ?
• In 2012, Gartner updated the definition
– Still advocates the 3V model for describing data
– Requires new forms of processing
– Enhanced decision making
– Insight discovery
– Process optimization
HOW BIG IS BIG DATA ?
• Beyond the ability of commonly used software tools to capture, manage, and process
• A few dozen terabytes (10^7) to many petabytes (10^8)
− 2008: Google processes 20 PB a day
− 2009: Facebook has 2.5 PB user data + 15 TB/day
− 2009: eBay has 6.5 PB user data + 50 TB/day
− 2011: Yahoo! has 180-200 PB of data
− 2012: Facebook ingests 500 TB/day
NEW TECHNOLOGY FOR BIG DATA
• Hadoop
– Developed by the Apache Software Foundation
– Derived from Google's MapReduce and Google File System
– Able to process petabyte-scale data
• NoSQL (Not Only SQL)
– Relational databases are not suitable for all cases
– NoSQL is a new choice for non-relational databases
– Adopted by Google, Facebook, Twitter, etc.
WHAT IS TWITTER?
• The fastest, simplest way to communicate
• More than 140M active users
• The majority of usage comes from mobile
• 60% of users are outside the U.S.
• More than 400M twitter.com visitors
• More than 400M tweets/day (peak: 25K/sec)
• 1,000 employees (majority in San Francisco)
• 50% of employees are engineers
• Expected to hit nearly $1 billion in global ad revenue in 2014, according to eMarketer
TWITTER HISTORY
• Evan Williams on the genesis of Twitter, ICWSM, April 2007:
− A side project started from Jack Dorsey’s idea (Oct. 2006)
− Wanted a ubiquitous status message
− A community of people answering the question “what are you doing?”
− Exploded at SXSW, SF earthquakes (2011)
− Good for collective “backchanneling”
− High “ambient intimacy”
− Huge API usage was unexpected, as was the rise of the @ sign for replies
HOW BIG IS TWITTER ?
Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
IT’S NOT JUST BIG! IT’S FRESH!
Source: http://xkcd.com/723/
WHAT IS A TWEET ?
TWITTER TOWN HALL
July 6, 2011
TWITTER STATS
Mapping the global Twitter heartbeat: The geography of Twitter, May 2013
Source: http://www.sgi.com/go/twitter/images/hires/figure4.png
TWITTER STATS
Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users.
**Represents significant difference compared with all other rows in group.
Twitter Dev
TWITTER ACCOUNT
• Register a Twitter account (required)
REGISTER A TWITTER APPLICATION
• Twitter developer web site: https://dev.twitter.com/
• Select “My applications”
• Click “Create a new application”
• Fill in the required information
• Agree to the developer rules and fill in the captcha
• Go back to the application list and click your application
• Click “Settings”
• Select “Read, Write and Access direct messages”
• Click “Update this Twitter application’s settings”
• Click “Create my access token”
Twitter API Resource
REST API
Source: https://dev.twitter.com/docs/streaming-apis
STREAMING API
Source: https://dev.twitter.com/docs/streaming-apis
TWEET CRAWL API
Resource | Description | Request Limit (Per User) | Request Limit (Via OAuth)
GET statuses/show/:id | Returns a single Tweet, specified by the id parameter. | 180 / 15 mins | 180 / 15 mins
POST statuses/update | Updates the authenticating user's current status, also known as tweeting. | - | -
GET search/tweets | Returns a collection of relevant Tweets matching a specified query. | 180 / 15 mins | 450 / 15 mins
POST statuses/filter | Returns public statuses that match one or more filter predicates. | - | -
GET statuses/firehose | This endpoint requires special permission to access. Returns all public statuses. | - | -
Source: https://dev.twitter.com/docs/api/1.1
Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits
tmhOAuth LIBRARY
• Website: https://github.com/themattharris/tmhOAuth
• $ git clone https://github.com/themattharris/tmhOAuth.git
• Current version: 0.8.2
• Author: Matt Harris (@themattharris)
• Goals:
‒ Support OAuth 1.0A
‒ Use authorization headers instead of query string or POST parameters
‒ Allow uploading of images
‒ Provide enough information to assist with debugging
CRAWLING WITH REST API
• Create an OAuth object containing the authentication tokens
• Set the parameters for the API
• Use the Twitter REST API to obtain tweets
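The three steps above can be sketched in Python as follows. This is only an illustration of the request assembly, not the tmhOAuth code from the talk: the `Credentials` class, the `build_search_url` helper, and the placeholder key strings are hypothetical names, the endpoint URL is assumed from the v1.1 API documentation, and signing and sending the request are left out.

```python
from dataclasses import dataclass
from urllib.parse import quote, urlencode

# Assumed v1.1 search endpoint; not stated on the slide.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


@dataclass
class Credentials:
    """Step 1: the four OAuth 1.0A values issued for a registered application."""
    consumer_key: str
    consumer_secret: str
    access_token: str
    access_token_secret: str


def build_search_url(params: dict) -> str:
    """Step 3 (request part): attach the encoded parameters to the endpoint."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return SEARCH_URL + "?" + urlencode(sorted(params.items()), quote_via=quote)


# Placeholder strings, to be replaced by the values from the application page.
creds = Credentials("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Step 2: parameters for GET search/tweets.
params = {"q": "happy hour", "count": 100, "result_type": "recent"}

url = build_search_url(params)
print(url)
```

A real crawler would sign this URL with the credentials and issue the GET request through an OAuth library.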
CRAWLING WITH STREAMING API
• Create an OAuth object containing the authentication tokens
• Set the parameters for the API
• Open a connection to the Twitter server
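Once the connection is open, the stream arrives as one JSON message per line, with blank lines sent as keep-alives. A minimal sketch of the consuming side is shown below, assuming the connection is exposed as an iterable of byte lines; `iter_tweets` is a hypothetical helper and the sample messages are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield tweet objects from a line-delimited streaming response."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank keep-alive line
        message = json.loads(line)
        # Non-tweet messages (for example limit notices) carry no text field.
        if "id" in message and "text" in message:
            yield message


# Made-up sample standing in for the bytes read from the open connection.
sample = [
    b'{"id": 1, "text": "hello"}\r\n',
    b"\r\n",
    b'{"limit": {"track": 5}}\r\n',
    b'{"id": 2, "text": "world"}\r\n',
]
tweets = list(iter_tweets(sample))
print([t["id"] for t in tweets])
```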
WHAT IS OAuth ?
• OAuth = Open Authorization
• What is OAuth:
‒ An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications.
• URLs used in the OAuth flow:
‒ Request token URL
‒ Authorize URL
‒ Access token URL
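To make the authorization step more concrete, here is a sketch of how an OAuth 1.0A request signature (HMAC-SHA1) is computed. It follows the general OAuth 1.0 scheme rather than any code from the talk; the function names, the secrets, and the nonce value are placeholders chosen for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value) -> str:
    """Encode everything except the RFC 3986 unreserved characters."""
    return quote(str(value), safe="-._~")


def signature_base_string(method: str, url: str, params: dict) -> str:
    """Join the method, the encoded URL, and the sorted, encoded parameters."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def sign(method: str, url: str, params: dict,
         consumer_secret: str, token_secret: str = "") -> str:
    """Return the base64-encoded HMAC-SHA1 signature for the request."""
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Placeholder request: the nonce and secrets are not real values.
demo_url = "https://api.twitter.com/1.1/search/tweets.json"
demo_params = {"q": "happy hour", "oauth_nonce": "abc"}
base = signature_base_string("get", demo_url, demo_params)
signature = sign("get", demo_url, demo_params, "CONSUMER_SECRET", "TOKEN_SECRET")
print(base)
```

In practice a library such as tmhOAuth performs this step, so the sketch is mainly useful for debugging signature mismatches.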
NORMAL SEARCH OPERATORS
Operator | Finds tweets...
twitter search | containing both "twitter" and "search". This is the default operator.
"happy hour" | containing the exact phrase "happy hour".
love OR hate | containing either "love" or "hate" (or both).
beer -root | containing "beer" but not "root".
#haiku | containing the hashtag "haiku".
from:alexiskold | sent from person "alexiskold".
to:techcrunch | sent to person "techcrunch".
@mashable | referencing person "mashable".
"happy hour" near:"san francisco" | containing the exact phrase "happy hour" and sent near "san francisco".
near:NYC within:15mi | sent within 15 miles of "NYC".
SEARCH PARAMETERS (REST)
Parameter | Description
q | A UTF-8, URL-encoded search query of 1,000 characters maximum.
geocode | Returns tweets within a given radius of the given coordinates.
lang | Restricts tweets to the given language, given by an ISO 639-1 code.
locale | Specifies the language of the query you are sending (only ja is supported).
result_type | Specifies the type of results: mixed, recent, or popular.
count | The number of tweets to return per page (<= 100).
until | Returns tweets generated before the given date.
since_id | Returns results with an ID greater than the specified ID.
max_id | Returns results with an ID less than or equal to the specified ID.
include_entities | The entities node is omitted when set to false.
callback | The response will use the JSONP format with a callback.
Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets
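Because max_id is inclusive, a crawler that pages backwards through search results usually asks for IDs strictly below the oldest one it has already seen. The sketch below illustrates that loop under assumptions: `fetch_page` is a hypothetical callable that performs the signed request and returns a list of tweet dicts, and `fake_fetch` is a stand-in so the example runs without network access.

```python
from typing import Callable, Dict, Iterator, List


def crawl_backwards(fetch_page: Callable[[Dict], List[dict]], query: str,
                    count: int = 100, max_pages: int = 10) -> Iterator[dict]:
    """Page through search results from newest to oldest using max_id."""
    params = {"q": query, "count": count, "result_type": "recent"}
    for _ in range(max_pages):
        page = fetch_page(dict(params))
        if not page:
            break  # nothing older is available
        yield from page
        # max_id is inclusive, so continue just below the oldest ID seen.
        params["max_id"] = min(t["id"] for t in page) - 1


def fake_fetch(params: Dict) -> List[dict]:
    """Stand-in for the API: serves tweet IDs 10..1, newest first."""
    ids = [i for i in range(10, 0, -1) if i <= params.get("max_id", 10)]
    return [{"id": i} for i in ids[: params["count"]]]


ids = [t["id"] for t in crawl_backwards(fake_fetch, "taiwan", count=4)]
print(ids)
```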
SEARCH PARAMETERS (STREAMING)
Parameter | Description
follow | Indicates the users to return statuses for in the stream.
track | Keywords to track.
locations | Specifies a set of bounding boxes to track.
delimited | Specifies whether messages should be length-delimited.
stall_warnings | Specifies whether stall warnings should be delivered.
Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter
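The track, follow, and locations values are each sent as a single comma-separated string, and a bounding box is written longitude first, south-west corner before north-east. A small sketch of assembling the form fields is given below; `filter_form` is a hypothetical helper, and the coordinates are only a rough illustrative box, not values from the talk.

```python
from typing import Dict, Iterable, Tuple

# (sw_longitude, sw_latitude, ne_longitude, ne_latitude)
Box = Tuple[float, float, float, float]


def filter_form(track: Iterable[str] = (), follow: Iterable[int] = (),
                boxes: Iterable[Box] = ()) -> Dict[str, str]:
    """Build the POST fields for statuses/filter as comma-separated strings."""
    form: Dict[str, str] = {}
    track, follow, boxes = list(track), list(follow), list(boxes)
    if track:
        form["track"] = ",".join(track)
    if follow:
        form["follow"] = ",".join(str(user_id) for user_id in follow)
    if boxes:
        for sw_lon, sw_lat, ne_lon, ne_lat in boxes:
            if not (sw_lon < ne_lon and sw_lat < ne_lat):
                raise ValueError("south-west corner must come before north-east")
        form["locations"] = ",".join(str(v) for box in boxes for v in box)
    return form


# Illustrative values only.
form = filter_form(track=["youtube", "news"], boxes=[(119.5, 21.5, 122.5, 25.5)])
print(form)
```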
WHAT DOES A TWEET LOOK LIKE?
CRAWLING EFFICIENCY
Keyword | Streaming API Total | Streaming API TPS | REST API Total | REST API TPS | Proportion (S/R)
YouTube | 143,869,821 | 30.28 | 6,306,355 | 1.33 | 22.81
News | 41,482,108 | 8.73 | 7,906,215 | 1.66 | 5.25
Google | 28,720,525 | 6.04 | 7,474,687 | 1.57 | 3.84
Obama | 8,503,834 | 1.79 | 5,271,187 | 1.11 | 1.61
*TPS: Tweets Per Second
*S/R: Streaming/REST
• Duration: May 6th to June 30th, 2012 (55 days)
• REST API
– Maximum TPS: 450 × 100 ÷ 15 ÷ 60 = 50 (tweets/sec)
• Streaming API
– Randomly returns tweets containing a specific search keyword
– The total quantity never exceeds 1% of the full public stream
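The figures in this table can be reproduced with a few lines of arithmetic. The sketch below recomputes the REST ceiling and the per-keyword TPS and S/R values from the totals, assuming the 55-day duration is counted as 55 × 24 hours; the variable and function names are ours.

```python
# Crawl duration stated on the slide, converted to seconds.
SECONDS = 55 * 24 * 60 * 60

# REST ceiling: 450 requests per 15-minute window, up to 100 tweets each.
rest_ceiling = 450 * 100 / (15 * 60)

# (streaming total, REST total) per keyword, copied from the table.
totals = {
    "YouTube": (143_869_821, 6_306_355),
    "News": (41_482_108, 7_906_215),
    "Google": (28_720_525, 7_474_687),
    "Obama": (8_503_834, 5_271_187),
}


def tps(total: int) -> float:
    """Average tweets per second over the whole crawl."""
    return total / SECONDS


for keyword, (streaming, rest) in totals.items():
    print(keyword, round(tps(streaming), 2), round(tps(rest), 2),
          round(streaming / rest, 2))
```

Even the busiest keyword stays well below the theoretical REST ceiling of 50 tweets per second, while the Streaming API delivers several times more tweets for popular keywords.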
LARGE-SCALE CRAWLING
Track Word | Size | Duration | From | To | #Tweet | 1 Year (est. size)
YouTube | 12.0 G | 21 days | 2013-07-07 15:12:25 | 2013-07-28 13:10:01 | 52,913,498 | 209 G
News | 5.7 G | 22 days | 2013-07-07 15:07:15 | 2013-07-28 13:10:00 | 21,894,823 | 95 G
Http | 15.0 G | 21 days | 2013-07-07 15:44:13 | 2013-07-28 13:10:00 | 62,976,451 | 261 G
Apple | 1.0 G | 22 days | 2013-07-07 15:07:20 | 2013-07-28 13:10:01 | 4,038,241 | 17 G
Android | 4.1 G | 20 days | 2013-07-07 15:20:43 | 2013-07-28 13:10:00 | 16,605,070 | 75 G
Obama | 682 M | 22 days | 2013-07-07 15:07:05 | 2013-07-28 13:10:01 | 2,768,149 | 11 G
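The last column appears to be a linear extrapolation of the observed size to 365 days. The sketch below checks that reading against the table; the interpretation is our assumption (the slide does not state the formula), and `project_one_year` is a hypothetical name.

```python
def project_one_year(size_gb: float, days: int) -> float:
    """Scale the size observed over `days` days to a full year, linearly."""
    return size_gb * 365 / days


# (observed size in GB, crawl duration in days), copied from the table.
observed = {
    "YouTube": (12.0, 21),
    "News": (5.7, 22),
    "Http": (15.0, 21),
    "Apple": (1.0, 22),
    "Android": (4.1, 20),
    "Obama": (0.682, 22),
}

projected = {word: round(project_one_year(size, days))
             for word, (size, days) in observed.items()}
print(projected)
```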
Twitter + MySQL
SINGLE NODE CRAWLING TYPE
[Diagram: Tweets Streaming A, B, C, … flow from the Twitter Server to a single Tweet Crawler]
• Guideline for single node crawling:
− Each stream needs to authenticate itself
− Total data size seems bounded (i.e. the number of Tweets delivered to a crawler is limited)
− Avoid connecting to the Twitter server too aggressively
− Crawling with different Twitter accounts is recommended
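One common way to avoid aggressive reconnection is to wait longer after each failed attempt. The sketch below shows a capped exponential backoff schedule; the initial delay, factor, and cap are illustrative assumptions rather than values from the talk, and `backoff_delays` is a hypothetical helper.

```python
import itertools
from typing import Iterator


def backoff_delays(initial: float = 5.0, factor: float = 2.0,
                   cap: float = 320.0) -> Iterator[float]:
    """Yield reconnect delays in seconds, multiplied each time up to a cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First eight waits a crawler would use after consecutive failures.
schedule = list(itertools.islice(backoff_delays(), 8))
print(schedule)
```

A crawler would sleep for the next value before each reconnect attempt and start a fresh schedule once a connection succeeds.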
MULTI-NODE CRAWLING TYPE
[Diagram: Tweets Streaming A and B flow from the Twitter Server to separate Tweet Crawlers]
• Guideline for multi-node crawling:
− Automatically check connection status
− Automatically update the database summary information
− Design the crawler with good log-file reporting
− Design a good database schema for distributed access
DESIGN TWEET TABLE
Name | Type | Description | Index Type
id | BIGINT UNSIGNED | Unique index ID in database | PRIMARY
tweet_id | BIGINT UNSIGNED | Official Tweet ID | UNIQUE
text | VARCHAR( 150 ) | Tweet content | -
screen_name | VARCHAR( 255 ) | User screen name | -
user_id | BIGINT UNSIGNED | User ID | -
followers_count | INT | Number of followers | -
friends_count | INT | Number of friends | -
created_at | DATETIME | Tweet creation time | -
language | VARCHAR( 5 ) | Language of the Tweet | -
source | VARCHAR( 150 ) | Device or browser used to Tweet | -
urls_count | INT | Number of URLs in the Tweet | -
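The table design can be tried out quickly with the sketch below. It uses Python's built-in SQLite as a stand-in so that it runs without a MySQL server, which is an assumption of this example: the column types are approximations of the MySQL ones, and the sample row is made up. The UNIQUE index on tweet_id lets the crawler discard duplicates at insert time (MySQL's counterpart to `INSERT OR IGNORE` is `INSERT IGNORE`).

```python
import sqlite3

# SQLite approximation of the MySQL table described above.
SCHEMA = """
CREATE TABLE tweet (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    tweet_id        INTEGER NOT NULL UNIQUE,
    text            VARCHAR(150),
    screen_name     VARCHAR(255),
    user_id         INTEGER,
    followers_count INTEGER,
    friends_count   INTEGER,
    created_at      DATETIME,
    language        VARCHAR(5),
    source          VARCHAR(150),
    urls_count      INTEGER
)
"""

INSERT = """
INSERT OR IGNORE INTO tweet
    (tweet_id, text, screen_name, user_id, followers_count,
     friends_count, created_at, language, source, urls_count)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up sample row; the same tweet delivered twice is stored once.
row = (1234567890, "hello", "alice", 42, 10, 20,
       "2013-07-07 15:12:25", "en", "web", 0)
conn.execute(INSERT, row)
conn.execute(INSERT, row)

stored = conn.execute("SELECT COUNT(*) FROM tweet").fetchone()[0]
print(stored)
```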
SETTING ENVIRONMENT
• Install packages
‒ # apt-get install php5 php5-curl
‒ # apt-get install mysql-client mysql-server
‒ # apt-get install phpmyadmin
‒ Select Apache2 as the web server when installing phpMyAdmin
SETTING ENVIRONMENT
• Create the database and table for Tweet crawling
− Create a *.sql file describing the database schema
− Change directory to the location of that file
− # mysql -h {$HOST} -u {$USER} -p{$PASSWORD}
− mysql> \. {$SQL_FILE}
SETTING ENVIRONMENT
• Check the database with phpMyAdmin
− Open a browser and go to http://localhost/phpmyadmin
− Select the database and check its structure
CRAWLING REAL-TIME TWEETS
• Connect to the database
• Save Tweets into the database
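Before a Tweet can be saved, its JSON has to be flattened into the columns of the tweet table. A sketch of that mapping is given below, assuming the standard v1.1 tweet fields (id, text, user, created_at, lang, source, entities.urls); `tweet_to_row` is a hypothetical helper and the sample tweet is made up.

```python
from datetime import datetime


def tweet_to_row(tweet: dict) -> tuple:
    """Flatten a tweet object into the column order of the tweet table."""
    user = tweet["user"]
    # created_at looks like "Sun Jul 07 15:12:25 +0000 2013".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return (
        tweet["id"],
        tweet["text"],
        user["screen_name"],
        user["id"],
        user["followers_count"],
        user["friends_count"],
        created.strftime("%Y-%m-%d %H:%M:%S"),  # DATETIME literal
        tweet.get("lang"),
        tweet.get("source"),
        len(tweet.get("entities", {}).get("urls", [])),
    )


# Made-up sample tweet with only the fields used above.
sample_tweet = {
    "id": 1234567890,
    "text": "hello http://example.com",
    "created_at": "Sun Jul 07 15:12:25 +0000 2013",
    "lang": "en",
    "source": "web",
    "user": {"id": 42, "screen_name": "alice",
             "followers_count": 10, "friends_count": 20},
    "entities": {"urls": [{"expanded_url": "http://example.com"}]},
}
row = tweet_to_row(sample_tweet)
print(row)
```

The resulting tuple can be passed directly as the parameters of a prepared INSERT statement.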
CRAWLING REAL-TIME TWEETS
• Copy all files in twitter_watch to /var/www/twitter_watch
‒ # cp twitter_watch/server.php /var/www/twitter_watch
‒ # cp twitter_watch/logic.hjs /var/www/twitter_watch
‒ # cp twitter_watch/index.html /var/www/twitter_watch
• Start crawling tweets
‒ $ php5 watch.php
CRAWLING REAL-TIME TWEETS
• Click “Browse” to show the crawled Tweets in the database
CRAWLING REAL-TIME TWEETS
• Update Tweets in real time with jQuery
‒ Browse http://localhost/twitter_watch/index.html
TROUBLESHOOTING
• Access denied for user 'root'@'localhost' (using password: NO)
− # /etc/init.d/mysql stop
− # mysqld_safe --skip-grant-tables &
− # mysql -u root mysql
− mysql> UPDATE user SET Password=PASSWORD('xxx') WHERE User='root';
− mysql> FLUSH PRIVILEGES;
− mysql> quit;
− # /etc/init.d/mysql restart
• Be aware of time synchronization (OAuth requests carry a timestamp, so a skewed clock causes authentication failures)
− # apt-get install ntp
− # ntpdate -s time.stdtime.gov.tw
− # hwclock --systohc
URL @ Tweet
SURLMINE
Incremental Mining of Significant URLs in Real-Time and Large-Scale Social Streams
PAKDD 2013
WHY URL?
• A high percentage of Tweets have URLs embedded in them
− Content length limitation and information completeness
Social Media | Character Limit | Nature
Twitter | 140 characters | Short message
Plurk | 140 characters | Short message
LinkedIn | 200 ~ 689 characters | Job opportunities
Google+ | 100,000 characters | Mixed information
Facebook | 63,206 characters | Mixed information
YouTube | 1,000 characters | Video sharing
• A URL is a universal language without linguistic differences
• A URL is able to connect different social media platforms
• Tweets with URLs have been verified to have a low spam probability
CHALLENGE
• URL shorteners make URLs hard to analyze
• Usage of the various URL shortening services differs by keyword
Keyword | original | bit.ly | tinyurl | ow.ly | goo.gl | others | URL %
YouTube | 96.49% | 0.95% | 0.14% | 0.10% | 0.12% | 2.20% | 90.80%
News | 37.92% | 17.92% | 1.10% | 0.00% | 2.17% | 40.89% | 75.77%
Google | 54.49% | 16.30% | 0.98% | 2.28% | 4.12% | 21.83% | 60.67%
Obama | 30.20% | 23.33% | 2.27% | 2.62% | 2.87% | 38.71% | 54.22%
• Shortened URLs are time-limited and may expire at any time
• A general solution is needed to expand shortened URLs to the original URLs
• Some URLs link to phishing websites
EXPAND URL SHORTENERS
• Recursively track web page redirections
− Beware of being identified as a DNS attack (use a cache table)
− Redirection links may change across browsers
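The redirection tracking with a cache table can be sketched as follows. This is an illustration under assumptions, not the SURLMINE implementation: `resolve` is a hypothetical callable that returns the redirect target of a URL (or None when there is none), the chase is written as a bounded loop rather than literal recursion, and the example URLs are made up.

```python
from typing import Callable, Dict, Optional


def expand_url(url: str, resolve: Callable[[str], Optional[str]],
               cache: Dict[str, str], max_hops: int = 10) -> str:
    """Follow redirects until a final URL is reached, caching every hop."""
    if url in cache:
        return cache[url]
    seen = [url]
    current = url
    for _ in range(max_hops):
        if current in cache:
            current = cache[current]  # rest of the chain is already known
            break
        target = resolve(current)
        if target is None or target in seen:
            break  # final page reached, or a redirect loop detected
        seen.append(target)
        current = target
    for visited in seen:
        cache[visited] = current  # later lookups need no network request
    return current


# Made-up redirect chain standing in for real HTTP lookups.
redirects = {
    "http://bit.ly/a": "http://t.co/b",
    "http://t.co/b": "http://example.com/page",
}
calls = []


def fake_resolve(u: str) -> Optional[str]:
    calls.append(u)
    return redirects.get(u)


cache: Dict[str, str] = {}
first = expand_url("http://bit.ly/a", fake_resolve, cache)
second = expand_url("http://t.co/b", fake_resolve, cache)
print(first, second, len(calls))
```

Caching every hop means each shortened URL triggers lookups only once, which keeps the request rate toward the shortening services low.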
URL STATS @ TWEET
Track Word | #Tweet | #URL | URL % | URL Per Second
YouTube | 52,982,166 | 49,975,035 | 94.32 % | 27.62
News | 21,948,837 | 15,572,228 | 70.95 % | 8.60
Http | 62,976,451 | 42,249,898 | 67.09 % | 23.41
Apple | 4,045,333 | 2,670,731 | 66.02 % | 1.48
Android | 16,605,070 | 15,242,497 | 91.79 % | 8.44
Obama | 2,771,791 | 950,780 | 34.30 % | 0.53
TRACK “TAIWAN” ON TWITTER
We demand the truth and justice!
Thank You
Q&A