Transcript Slide 1

EK Ch 17: Power laws and
rich-get-richer phenomena
(with an application of
Web Spam detection
Spam, Damn Spam and Statistics)
Your Quiz grades so far
25
Quiz grades
20
15
10
5
0
A
B
C
D
E
F
Students
G
H
I
J
Numbers
Your grades so far in this class.
The weight of an apple.
The temperature in Chicago on July 4th.
The height of a Dutch man.
The speed of a car on I-90.
Most instances are typical.
Seeing a rare number is very surprising.
These numbers are well-characterized by
the average and the standard deviation.
City populations
1. New York
2. Los Angeles
3. Chicago
4. Houston
5. Phoenix
6. Philadelphia
7. San Antonio
8. San Diego
9. Dallas
10. San Jose
8,310,212
3,834,340
2,836,658
2,208,180
1,552,259
1,449,634
1,328,984
1,266,731
1,266,372
939,899
City populations
1. New York
2. Los Angeles
3. Chicago
8,310,212
3,834,340
2,836,658
230. Cambridge, MA
101,335
240. Gainesville, FL
95,447
250. McKinney, TX
54,369
A few cities with
high population
Many cities with
low population
City populations
City populations
Power Law: The number of cities with
population > k is proportional to k-c.
“fraction of items”
Power Law: The number of cities with
population > k is proportional to k-c.
“popularity = k”
Power Law: Fraction f(k) of items with
popularity k is proportional to k-c.
f(k)
log [f(k)]
log [f(k)]
k-c
log [k-c]
-c log [k]
A power law is a straight line
on a log-log plot.
City populations
Number of Web page in-links
(Broder+)
Other examples
Length of the URL’s host
Number of host name resolutions to a single IP
Web page out-degrees
Web page in-degrees
Word count variance
Content evolution
Cluster size
… because they care to know ;-)
Why does data exhibit power laws?
Imitation
Power law
Constructing the web
1. Pages are created in order, named 1, 2, …, N
2. When created, page j links to a page by
a)
b)
With probability p, picking a page i uniformly at random from 1,
…, j-1
With probability (1-p), pick page i uniformly at random and link
to the page that i links too
Imitation
The rich get richer
2 b) With prob. (1-p), pick page i uniformly
at random and link to the page that i links
too
3/4
1/4
The rich get richer
2 b) With prob. (1-p), pick page i uniformly
at random and link to the page that i links
too
Equivalently,
2 b) With prob. (1-p), pick a page proportional to its indegree and link to it
Food for thought
Why is Harry Potter popular?
If we could re-play history, would we still read Harry Potter,
or would it be some other book?
Information cascades and the rich
Information cascade = so some people get a little bit richer
by chance
and then rich-get-richer dynamics = the random rich people
get a lot richer very fast
Music download site – 8 worlds
1. “Let’s go driving,”
Barzin
18
2. “Silence is sexy,”
Einstürzende
3
Neubauten
3. “Go it alone,”
Noonday
47
Underground
1. “Let’s go driving,”
Barzin
59
2. “Silence is sexy,”
Einstürzende
7
Neubauten
3. “Go it alone,”
Noonday
10
Underground
10.“Picadilly Lilly,”
Tiger Lillies
10.“Picadilly Lilly,”
Tiger Lillies
2
1