Transcript Slide 1

Power Laws and
Rich-Get-Richer Phenomena
(with an application of
Web Spam detection)
CS315-Web Search and Mining
What do these have in common?
The grades of students in a class.
The weights of apples.
The high temperatures in Boston on July 4th.
The heights of Dutch men.
The speed of cars on I-90.
Most instances are typical.
Seeing an outlier is very surprising.
These measurements are well-characterized by
the average and the standard deviation.
City populations
1. New York
2. Los Angeles
3. Chicago
4. Houston
5. Phoenix
6. Philadelphia
7. San Antonio
8. San Diego
9. Dallas
10. San Jose
8,310,212
3,834,340
2,836,658
2,208,180
1,552,259
1,449,634
1,328,984
1,266,731
1,266,372
939,899
City populations
1. New York
2. Los Angeles
3. Chicago
8,310,212
3,834,340
2,836,658
21. Boston, MA
625,087
248. Cambridge, MA
106,038
25,375. Lost Springs, WY
1
A few cities with
high population
Many cities with
low population
City populations
Cities ordered on population range
Word Frequencies
Power Law: The number of cities with
population > k is proportional to k-c.
“fraction of items”
Power Law: The number of cities with
population > k is proportional to k-c.
“popularity = k”
Power Law: Fraction f(k) of items with
popularity k is proportional to k-c.
f(k)
log [f(k)]
log [f(k)]
y
k-c
log [k-c]
-c log [k]
-c x
A power law is a straight line
on a log-log plot.
Number of Web page in-links
(Broder+)
Examples (some better than others)
frequency of words
protein-interaction degree distribution
Internet (AS) degree distribution
severity of inter-state wars
severity of terrorist attacks
frequency of bird sightings
size of blackouts
book sales
population of US cities
size of religions
number of citations
papers authored
popularity of surnames
number of web hits
number of web links, with cut-off
number of phone calls
size of email address book
number of species per genus
What is going on?
Nature
seems to create
bell curves
(range around an average)
Human activity
seems to create
power laws
(popularity skewing)
“seems to”
Network Science: Scale-Free Property 2012
How can we use this to… fight spam?
The main idea behind “Spam, Damn Spam and Statistics”
Spammers manufacture pages and links to fool search
engines
In this process, they will overdo it
Their actions would likely fall outside the normal human
activity
Let’s look for outliers in the power laws!
Web page out-degrees
There are 158,290 pages with out-degree 1301,
while according to the overall trend
only 1,700 such pages are expected.
Web page in-degrees
There are 369,457 pages have the in-degree of 1001,
while according to the trend
only 2,000 such pages are expected
Length of the URL’s host
The 100 longest hostnames reveal that
80 of them belong to adult site and
11 refer to the financial and credit related sites
Number of host name resolutions to a single IP
There are 100,000’s host names mapped to a single IP,
The record-breaking IP is referred by 8,967,154 host names
Clusters of similar pages (shingling)
The red group has duplicated content, not spam).
The blue group is mainly spam.
15 of 20 largest clusters have 2,080,112 spam pages
Spammers are studious!
Why does data exhibit power laws?
imitation
Can imitation
explain the
size of the
Web parts?
Power law
Constructing a model of the Web
1. Pages are created in order, named 1, 2, …, N
2. When created, page j links to a page randomly:
a)
With probability p, picking a page i uniformly at random
from pages 1, …, j-1
randomness
b)
With probability (1-p), pick page i uniformly at random
and link to the page that i links too
imitation
This is the well-studied
“preferential attachment” model
of Web generation
The rich get richer
2 b) With prob. (1-p), pick page i uniformly
at random and link to the page that i links
too
3/4
1/4
The rich get richer
2 b) With prob. (1-p), pick page i uniformly
at random and link to the page that i links
too
Equivalently,
2 b) With prob. (1-p), pick a page
proportional to its in-degree
and link to it
Information cascades and the rich
Information cascade = some people get a little bit richer
by chance
and then
rich-get-richer dynamics = the random rich people
get a lot richer very fast
Is popularity predictable?
Why is Harry Potter popular?
If we could re-play history,
would we still read Harry Potter en masse,
or would it be some other book?
(But then, why JK Rowling had troubles
publishing it at first?)
Is popularity… random?
Why “hits” in cultural markets are much more successful
than average (and yet so hard to predict)?
Can we study it with an experiment?
“Experimental Study of Inequality and
Unpredictability in an Artificial Cultural Market”
14,000 participants randomly assigned to
“social influence” and “independent” conditions
chose between 48 songs by unknown bands
in 8+1 parallel worlds
World 1
Subject
See what others
downloaded
World 8
No information
World 0
Music download site – 8+1 worlds
The best songs never went to the bottom, the worse never became popular.
But their order changed a lot.
1. “Let’s go driving,”
Barzin
18
2. “Silence is sexy,”
Einstürzende
3
Neubauten
3. “Go it alone,”
Noonday
47
Underground
1. “Let’s go driving,”
Barzin
59
2. “Silence is sexy,”
Einstürzende
7
Neubauten
3. “Go it alone,”
Noonday
10
Underground
10.“Picadilly Lilly,”
Tiger Lillies
10.“Picadilly Lilly,”
Tiger Lillies
2
1