Average number of friends
Download
Report
Transcript Average number of friends
CS 1944: Sophomore Seminar
Big Data and Machine Learning
B. Aditya Prakash
Assistant Professor
Nov 3, 2015
About me
Assistant Professor, CS
– Member, Discovery Analytics Center
Previously
– Ph.D. in Computer Science,
Carnegie Mellon University
– B.Tech in Computer Science and Engg,
Indian Institute of Technology (IIT) – Bombay
– Internships at Sprint, Yahoo, Microsoft
Research
Prakash 2015
2
Prakash 2015
3
Data contains value and knowledge
Prakash 2015
4
Data and Business
Recommended'linksA
+79%''clicksA
Prakash 2015
Top'SearchesA
Personalized''
News'InterestsA
+250%'clicksA
+43%'clicksA
Source: A. Machhanavajjhala
5
Data and Science
5*
Data1and1science*
Red:1official1numbers1from1Center1for1Disease1Control1and1Prevention;1weekly11
Black:1based1on1Google1search1logs;1daily1(potentially1instantaneously)*
Detecting'influenza'epidemics'using'search'
engine'query'data1
http:/ / www.nature.com/ nature/ journal/ v457/ n7232/ full/
nature07634.html
Prakash 2015
6
Data and Government
6*
Data1and1government*
http:/ / www.washingtonpost.com/ opinions/ obama-the-big-datapresident/ 2013/ 06/ 14/ 1d71fe2e-d391-11e2b05f-3ea3f0e7bb5a_stor y.html
http:/ / www.washingtonpost.com/
business/ economy/ democratspush-to-redeploy-obamas-voterdatabase/ 2012/ 11/ 20/
d14793a4-2e83-11e2-89d4-040c93
30702a_story.html
http:/ / www.whitehouse.gov/ blog/
Democratizing-Data
Prakash 2015
http:/ / www.theguardian.com/ world/ 2013/ jun/ 23/
edward-snowden-nsa-fi les-timeline
Source: A. Machhanavajjhala
7
Data and Culture
•
http:/ / blogs.plos.org/ everyone/
2013/ 03/ 20/ what-are-you-inthe-mood-for-emotionaltrends-in-20th-centur y-books/
Prakash 2015
Source: A. Machhanavajjhala
8
8*
Data1and1____1☜ your1favorite1subject*
Sports*
Prakash 2015
Journalism*
9
Good news: Demand for Data Mining
Prakash 2015
10
How to extract value from data?
Manipulate Data
– CS, Domain expertise
Analyze Data
– Math, CS, Stat…
Communicate your results
– CS, Domain Expertise
Prakash 2015
11
Communication is important!
“The"British"government"spends""
£13"billion"a"year"on"universities.”F
–
–
http:/ / wheredoesmymoneygo.org/ "
bubbletree-map.html#/ ~/ total/ education/ university
“On"average,"1"in"every"15"Europeans""
is"totally"illiterate.”F
–
–
http:/ / datajournalismhandbook.org/ 1.0/ en/ understanding_data_0.html
Prakash 2015
12
What is Data Mining?
Given lots of data
Discover patterns and models that are:
– Valid: hold on new data with some certainty
– Useful: should be possible to act on the item
– Unexpected: non-obvious to the system
– Understandable: humans should be able to
interpret the pattern
Prakash 2015
13
Data Mining Tasks
Descriptive methods
– Find human-interpretable patterns that
describe the data
• Example: Clustering
Predictive methods
– Use some variables to predict unknown
or future values of other variables
• Example: Recommender systems
Prakash 2015
14
Theory
& Algo.
Biology
Physics
Comp.
Systems
ML &
Stats.
Social
Science
Big data
Econ.
15
Prakash 2015
Data at CS, VT
Knowledge, Information and Data
http://www.cs.vt.edu/undergraduate/tracks/k
id
People: Fox, Harrison, Huang, Lu (in NVA),
Ramakrishnan (in NVA), Rozovskaya, Prakash
Prakash 2015
16
Courses
Background in some areas:
– CS3414 (Numerical Methods); also prob/stat
4000 level
– 4244 Internet Software Development
– 4604 Database Management Systems
– 4624 Capstone (Multimedia, Information Access)
– 4634 Design of Information (Capstone)
– 4804 AI
– 4984 Computational Linguistics (Capstone)
Prakash 2015
17
Discovery Analytics Center
Prakash 2015
18
MY RESEARCH
Prakash 2015
19
Networks are everywhere!
Facebook Network [2010]
Gene Regulatory Network
[Decourty 2008]
Human Disease Network
[Barabasi 2007]
The Internet [2005]
Prakash 2015
20
What else do they have in
common?
Prakash 2015
21
High School Dating Network
Bearman et. al. Am. Jnl. of Sociology,
2004. Image: Mark Newman
Blue: Male
Pink: Female
Interesting
observations?
Prakash 2015
22
The Internet
Skewed Degrees
Robustness
Prakash 2015
23
Karate Club Network
Prakash 2015
24
Dynamical Processes over networks
are also everywhere!
Prakash 2015
25
Why do we care?
Social collaboration
Information Diffusion
Viral Marketing
Epidemiology and Public Health
Cyber Security
Human mobility
Games and Virtual Worlds
Ecology
........
Prakash 2015
26
Why do we care? (1: Epidemiology)
Dynamical Processes over networks
[AJPH 2007]
Diseases over contact networks
Prakash 2015
CDC data: Visualization of
the first 35 tuberculosis
(TB) patients and their
1039 contacts
27
Why do we care? (1: Epidemiology)
Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients
transferred
[US-MEDICARE
NETWORK 2005]
Prakash 2015
Problem: Given k units of
disinfectant, whom to immunize?
28
Why do we care? (1: Epidemiology)
~6x
fewer!
CURRENT PRACTICE
[US-MEDICARE
NETWORK 2005]
OUR METHOD
Prakash 2015
29
Hospital-acquired
inf. took 99K+ lives, cost $5B+ (all per year)
Why do we care? (2: Online Diffusion)
> 800m users, ~$1B
revenue [WSJ 2010]
~100m active users
> 50m users
Prakash 2015
30
Why do we care? (2: Online Diffusion)
Dynamical Processes over networks
Buy Versace™!
Followers
Celebrity
Prakash 2015
Social Media Marketing
31
Social Biological Contagion
Automatically
learn
models
Prakash 2014
32
Why do we care?
(3: To change the world?)
Dynamical Processes over networks
Social networks and Collaborative Action
Prakash 2015
33
High Impact – Multiple Settings
epidemic out-breaks
Q. How to squash rumors faster?
products/viruses
Q. How do opinions spread?
transmit s/w patches
Q. How to market better?
Prakash 2015
34
Dynamical Processes
= (a lot of) Networks
+ (some) Time-Series
Prakash 2015
35
Research Theme
ANALYSIS
Understanding
DATA
Large real-world
networks
& processes
Prakash 2015
POLICY/
ACTION
Managing
36
Research Theme – Public Health
ANALYSIS
Will an epidemic
happen?
DATA
Modeling # patient
transfers
Prakash 2015
POLICY/
ACTION
How to control
out-breaks?
37
Research Theme – Social Media
ANALYSIS
# cascades in
future?
DATA
Modeling Tweets
spreading
Prakash 2015
POLICY/
ACTION
How to market
better?38
A Question
How many of you think your friends have more friends
than you?
A recent Facebook study
– Examined all of FB’s users: 721 million people with 69
billion friendships.
• about 10 percent of the world’s population!
– Found that user’s friend count was less than the average
friend count of his or her friends, 93 percent of the time.
– Users had an average of 190 friends, while their friends
averaged 635 friends of their own.
Prakash 2015
39
Possible Reasons?
You are a loner?
Your friends are extroverts?
There are more extroverts than introverts in
the world?
Prakash 2015
40
Example
Average number of friends?
Source: S. Strogatz, NYT 2012
Prakash 2015
41
Example
Average number of friends
=(1+3+2+2)/4
=2
Source: S. Strogatz, NYT 2012
Prakash 2015
42
Example
Average number of friends
=(1+3+2+2)/4
=2
Average number of friends of
friends
Source: S. Strogatz, NYT 2012
Prakash 2015
43
Example
Average number of friends
=(1+3+2+2)/4
=2
Average number of friends of
friends
= (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8
= ((1x1) + (3x3) + (2x2) + (2x2))/8
Source: S. Strogatz, NYT 2012
Prakash 2015
44
Example
Average number of friends
=(1+3+2+2)/4
=2
Average number of friends of
friends
= (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8
= ((1x1) + (3x3) + (2x2) + (2x2))/8
= 2.25!
Source: S. Strogatz, NYT 2012
Prakash 2015
45
Actually it is (almost) always true!
Proof?
Prakash 2015
46
Actually it is (almost) always true!
Proof?
E[X] = å xi / N
Prakash 2015
47
Actually it is (almost) always true!
Proof?
E[X] = å xi / N
Var[X] = E[(X - E[X]) ]
2
= E[X ]- E[X]
2
Prakash 2015
2
48
Actually it is (almost) always true!
Proof?
E[X] = å xi / N
Var[X] = E[(X - E[X]) ]
2
= E[X ]- E[X]
2
2
E[X ]
Var[X]
= E[X]+
E[X]
E[X]
2
Prakash 2015
49
Actually it is (almost) always true!
Essentially, it is true if there is any
spread in # of friends (non-zero
variance)!
Proof?
E[X] = å xi / N
Var[X] = E[(X - E[X]) ]
2
= E[X ]- E[X]
2
2
E[X ]
Var[X]
= E[X]+
E[X]
E[X]
2
Prakash 2015
50
Implications
Immunization
Figure 1. Network Illustrating Structural Parameters. This real
network of 105 students shows variation in structural attributes and
topological position. Each circle represents a person and each line
represents a friendship tie. Nodes A and B have different ‘‘degree,’’ a
measure that indicates the number of ties. Nodes with higher degree
also tend to exhibit higher ‘‘centrality’’ (node A with six friends is more
central than B and C who both only have four friends). If contagions
infect people at random at the beginning of an epidemic, central
individuals are likely to be infected sooner because they lie a shorter
number of steps (on average) from all other individuals in the network.
Finally, although nodes B and C have the same degree, they differ in
‘‘transitivity’’ (the probability that any two of one’s friends are friends
with each other). Node B exhibits high transitivity with many friends
that know one another. In contrast, node C’s friends are not connected
to one another and therefore they offer more independent possibilities
for becoming infected earlier in the epidemic.
doi:10.1371/journal.pone.0012948.g001
– acquaintance immunization
• Immunize friend-of-friend
the variance of the degree distribution divided by m. Hence, when
there is variance in degree in a population, and especially when
there ishigh variance, the mean number of contacts for the friends
will be greater (and potentially much greater) than the mean for
the random sample. This is sometimes known as the ‘‘friendship
paradox’’ (‘‘your friends have more friends than you do’’) [15–19].
While the idea of immunizing such friends of randomly chosen
people has previously been explored in a stimulating theoretical
paper [12], to our knowledge, a method that uses nominated
friends as sensors for early detection of an outbreak has not
previously been proposed, nor hasit been tested on any sort of real
outbreak. To evaluate the effectiveness of nominated friends as
social network sensors, we therefore monitored the spread of flu at
Harvard College from September 1 to December 31, 2009. In the
fall of 2009, both seasonal flu (which typically kills 41,000
Americans each year [20]) and the H1N1 strain were prevalent in
the US, though the great majority of cases in 2009 have been
attributed to the latter.[1] It is estimated that this H1N1 epidemic,
which began roughly in April 2009, infected over 50 million
Americans. Unlike seasonal flu, which typically affects individuals
older than 65, H1N1 tends to affect young people. Nationally,
according to the CDC, the epidemic peaked in late October 2009,
and vaccination only became widely available in December 2009.
Whether another outbreak of H1N1 will occur (for example, in
areas and populations that have heretofore been spared) is a
Early warning of outbreaks
– Again, monitor friends of friends
Prakash 2015
Figure 2. Theoretical expectatio ns of differences in contagion between central individuals and the populatio n as a whole. A
contagious process passes through two phases, one in which the number of infected individuals exponentially increases as the contagion spreads,
and one in which incidence exponentially decreases as susceptible individuals become increasingly scarce. These dynamics can be modeled by a
logistic function. Central individuals lie on more paths in a network compared to the average person in a population and are therefore more likely to
be infected early by a contagion that randomly infects some individuals and then spreads from person to person within the network. This shifts the Sshaped logistic cumulative incidence function forward in time for central individuals compared to the population as a whole (left panel). It also shifts
the peak infection rate forward (right panel).
doi:10.1371/journal.pone.0012948.g002
51
Thanks---Questions?
B. Aditya Prakash
3160 F Torgersen Hall
[email protected]
See my homepage for more details and papers:
http://www.cs.vt.edu/~badityap
Prakash 2015
52