Diffusion and Cascading Behavior in Networks

Download Report

Transcript Diffusion and Cascading Behavior in Networks

Microsoft Instant Messenger
Communication Network
How does the world communicate?
Jure Leskovec ([email protected])
Machine Learning Department
http://www.cs.cmu.edu/~jure
Joint work with: Eric Horvitz, Microsoft Research
Networks: Why?
 Today: large on-line systems leave detailed
records of social activity
 On-line communities: MyScace, Facebook
 Email, blogging, instant messaging
 On-line publications repositories, arXiv, MedLine
 Emerging behavior (need lots of data):
 Actions of individual nodes are
independent but global patterns
and regularities emerge
The Largest Social Network
 What is the largest social network in the world
(that we can relatively easily obtain)? 
For the first time we had a chance to look at
complete (anonymized) communication of the
whole planet (using Microsoft MSN instant
messenger network)
3
Instant Messaging
• Contact (buddy) list
• Messaging window
4
Instant Messaging as a Network
Buddy
Conversation
5
IM – Phenomena at planetary scale
Observe social phenomena at planetary scale:
 How does communication change with user
demographics (distance, age, sex)?
 How does geography affect communication?
 What is the structure of the communication
network?
6
Communication data
The record of communication
 Presence data
 user status events (login, status change)
 Communication data
 who talks to whom
 Demographics data
 user age, sex, location
7
Data description: Presence
 Events:





Login, Logout
Is this first ever login
Add/Remove/Block buddy
Add unregistered buddy (invite new user)
Change of status (busy, away, BRB, Idle,…)
 For each event:
 User Id
 Time
8
Data description: Communication
 For every conversation (session) we have a list
of users who participated in the conversation
 There can be multiple people per conversation
 For each conversation and each user:





User Id
Time Joined
Time Left
Number of Messages Sent
Number of Messages Received
9
Data description: Demographics
 For every user (self reported):





Age
Gender
Location (Country, ZIP)
Language
IP address (we can do reverse geo IP lookup)
10
Data collection





Log size: 150Gb/day
Just copying over the network takes 8 to 10h
Parsing and processing takes another 4 to 6h
After parsing and compressing ~ 45 Gb/day
Collected data for 30 days of June 2006:
 Total: 1.3Tb of compressed data
11
Network: Conversations
Conversation
12
Data statistics
Activity over June 2006 (30 days)
 245 million users logged in
 180 million users engaged in conversations
 17,5 million new accounts activated
 More than 30 billion conversations
13
Data statistics per day
Activity on June 1 2006
 1 billion conversations
 93 million users login
 65 million different users talk (exchange
messages)
 1.5 million invitations for new accounts sent
14
User characteristics: age
15
Age piramid: MSN vs. the world
16
Conversation: Who talks to whom?
 Cross gender edges:
 300 male-male and 235 female-female edges
 640 million female-male edges
17
Number of people per conversation
 Max number of people simultaneously talking
is 20, but conversation can have more people
18
Conversation duration
 Most conversations are short
19
Conversations: number of messages
Sessions between fewer people run out of steam
20
Time between conversations
 Individuals are highly diverse
 What is probability to login
into the system after t
minutes?
 Power-law with exponent 1.5
 Task queuing model [Barabasi]
 My email, Darvin’s and
Einstein’s letters follow the
same pattern
21
Age: Number of conversations
User self reported age
High
Low
22
Age: Total conversation duration
User self reported age
High
Low
23
Age: Messages per conversation
User self reported age
High
Low
24
Age: Messages per unit time
User self reported age
High
Low
25
Who talks to whom:
Number of conversations
26
Who talks to whom:
Conversation duration
27
Geography and communication
 Count the number of users logging in from
particular location on the earth
28
How is Europe talking
 Logins from Europe
29
Users per geo location
Blue circles have
more than 1 million
logins.
30
Users per capita
Fraction of population
using MSN:
•Iceland: 35%
•Spain: 28%
•Netherlands, Canada,
Sweden, Norway: 26%
•France, UK: 18%
•USA, Brazil: 8%
31
Communication heat map
 For each conversation between geo points (A,B) we
increase the intensity on the line between A and B
32
Homophily (gliha v kup štriha) 
 Probability:
Age vs. Age
 Correlation:
33
Per country statistics
 On a particular typical day…
Country
USA
Brazil
France
Unknown
Spain
UK
Canada
China
Turkey
Mexico
# of logins
# of users # of messages Messages per user
38,319,363 13,261,337
412,729,278
31.12
20,582,613 7,864,424
467,972,522
59.50
19,163,131 6,475,858
518,931,785
80.13
18,444,352 6,872,347
191,167,085
27.81
16,868,549 6,140,895
503,759,240
82.03
16,659,009 5,724,826
487,018,470
85.07
14,558,692 5,021,185
160,249,686
31.91
14,225,163 5,314,463
101,003,729
19.00
13,619,789 4,696,555
353,540,475
75.27
10,756,989 4,359,932
209,195,100
47.98
Note that global usage and market share statistics are higher if we accumulate data
over longer time periods.
34
Per typical user per country
 On a typical day MSN user from a country …
Logins on a
Users on a
Messages
Messages
Country particular day particular day
sent
per user
Slovenia
364,988
130,884 15,919,892 121.6335992
Malta
122,846
41,829
4,993,316 119.3745009
Hungary
1,214,268
427,320 47,623,604 111.4471684
Bosnia
105,584
35,689
3,254,170 91.18131637
Teunion
100,335
33,399
3,041,635 91.0696428
Gibraltar
19,096
6,452
581,195 90.07982021
UK
16,659,009
5,724,826 487,018,470 85.07131396
Macedonia
126,729
43,754
3,669,977 83.87751977
Netherlands
7,399,160
2,696,669 221,300,210 82.06428375
Spain
16,868,549
6,140,895 503,759,240 82.03352117
Note that global usage and market share numbers are higher if we accumulate
data over longer time periods.
35
What about Slovenia (per capita)?
Statistic
Conversations inside
Conversation to outside
Total conversations
Avg. time inside
Avg. time outside
Avg. time inside (pct.)
Messages sent inside
Messages sent outside
Messages inside (pct.)
Number
19,868,886
7,868,483
27,737,369
309.49
314.39
0.4960
9.78
9.46
0.5083
Rank
(per capita)
22
48
29
147
80
32
19
36
Who is Slovenia talking to?
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Target
Pairs of
Number of Avg. time Avg. # of
Country
people conversations per conv. messages
Slovenia
13,41,250
19,868,886
309.4
9.78
USA
61,794
922,527
303.4
9.14
Spain
27,650
310,357
289.4
7.97
UK
14,709
204,335
325.4
9.02
Germany
9,047
129,551
350.3
10.20
Bosnia
9,956
114,509
385.9
14.62
Yugoslavia
8,194
104,270
381.7
12.55
Italy
8,612
100,698
358.8
9.89
Croatia
6,838
84,362
359.0
11.00
Turkey
10,763
77,651
292.4
8.08
Albania
9,517
76,440
320.7
10.88
Sweden
5,083
69,019
306.9
8.34
Netherlands
5,061
68,287
315.9
8.87
Canada
5,003
60,617
301.8
7.38
37
Instant Messaging as a Network
Buddy
38
IM Communication Network
 Buddy graph:
 240 million people (people that login in June ’06)
 9.1 billion edges (friendship links)
 Communication graph:
 There is an edge if the users exchanged at least
one message in June 2006
 180 million people
 1.3 billion edges
 30 billion conversations
39
Buddy network: Number of buddies
 Buddy graph: 240 million nodes, 9.1 billion
edges (~40 buddies per user)
40
Network: Small-world
 6 degrees of separation [Milgram ’60s]
 Average distance 5.5
 90% of nodes can be reached in < 8 hops
Hops
Nodes
1
10
2
78
3
396
4
8648
5
3299252
6
28395849
7
79059497
8
52995778
9
10321008
10
1955007
11
518410
12
149945
13
44616
14
13740
15
4476
16
1542
17
536
18
167
19
71
20
29
21
16
22
10
23
3
24
2
25
3
Network: Searchability
 Milgram’s experiment showed:
v
 (1) short paths exist in networks
 (2) humans are able to find them
 Assume the following setting:
 Nodes are scattered on a plane
 Given starting node u and we want to
reach target node v
 Algorithm: always navigate to a
neighbor that is geographically closest
to target node v
 Surprise: Geo-routing finds the
short paths (for appropriate
distance measure)
u
43
Communication network: Clustering
 How many triangles
are closed?
 Clustering normally
decays as k-1
 Communication
network is highly
clustered: k-0.37
High clustering
Low clustering
44
Communication Network Connectivity
45
k-Cores decomposition
 What is the structure of the core of the
network?
46
k-Cores: core of the network
 People with k<20 are the periphery
 Core is composed of 79 people, each having 68
edges among them
47
Network robustness
 We delete nodes (in some order) and observe
how network falls apart:
 Number of edges deleted
 Size of largest connected component
48
Robustness: Nodes vs. Edges
49
Robustness: Connectivity
50
Conclusion
 A first look at planetary scale social network
 The largest social network analyzed
 Strong presence of homophily: people that
communicate share attributes
 Well connected: in only few hops one can
research most of the network
 Very robust: Many (random) people can be
removed  and the network is still connected
51
References
 Leskovec and Horvitz: Worldwide Buzz:
Planetary-Scale Views on an InstantMessaging Network, 2007
 http://www.cs.cmu.edu/~jure
52