How Google Works and why you should care


Detecting Web Spam through
Backward Propagation of Distrust
CS315-Web Search and Mining
And Now For Something Completely(?) Different
Propaganda:

Attempt to modify human behavior,
and thus influence people’s actions
in ways beneficial to propagandists
Theory of Propaganda

Developed by the Institute for Propaganda Analysis 1938-42
Propagandistic Techniques (and ways of detecting propaganda)

Word games - associate a good or bad concept with a social entity
 Glittering Generalities / Name Calling
Transfer - use special privileges (e.g., office) to breach trust
Testimonial - famous non-experts’ claims
Plain Folk - people like us think this way
Bandwagon - everybody’s doing it, jump on the wagon
Card Stacking - use of bad logic
Web Spammers as Propagandists
Web Spammers can be seen as
employing propagandistic techniques
in order to modify the Web Graph
There is a pattern to how they spam!
Anti-Spam Lessons from Society
What would you do if you realized that you should not trust a member of your trust network?
[Figure: a trust network around YOU. Nodes: Famous Actress, Democracy, Rev. Y, Mom, Partner, The Coffee Joint, NYTimes, Your Boss, Prof. X, Joe (a plumber), US Pres. Several elements are marked "?" and one is marked "X".]
Anti-Propagandistic Lessons for Web
How do you deal with propaganda in real life?
Backwards propagation of distrust:
the recommender of an untrustworthy message becomes untrustworthy
Can you transfer this technique to the web?
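The rule stated on this slide can be sketched on a toy "who recommends what" graph. This is only an illustration under stated assumptions: the function name `propagate_distrust`, the `steps` parameter, and the toy names are all hypothetical and do not come from the slides.

```python
def propagate_distrust(recommends, distrusted, steps=1):
    """Mark as distrusted anyone who recommends something already distrusted.

    recommends: dict mapping a recommender to the set of things it vouches for
                (hypothetical toy structure, not from the slides).
    distrusted: iterable of initially distrusted items.
    steps:      how many times the rule is applied backwards.
    """
    distrusted = set(distrusted)
    for _ in range(steps):
        newly = {
            source
            for source, targets in recommends.items()
            if source not in distrusted and distrusted & set(targets)
        }
        if not newly:  # nothing left to propagate
            break
        distrusted |= newly
    return distrusted


# Toy example: "friend" forwards a bad message, "colleague" vouches for "friend".
toy_recommends = {
    "friend": {"bad_message"},
    "colleague": {"friend"},
    "neighbor": {"good_message"},
}
one_step = propagate_distrust(toy_recommends, {"bad_message"}, steps=1)
two_steps = propagate_distrust(toy_recommends, {"bad_message"}, steps=2)
```

With one step only the direct recommender is marked; each further step moves distrust one more hop backwards along the recommendation edges.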
An Anti-Propagandistic Algorithm
Start from untrustworthy site s
S = {s}
Using BFS for depth D do:
    Find the set U of sites linking to sites in S
    (using the Google API, for up to B backlinks per site)
    Ignore blogs, directories, .edu sites
    S = S + U
Find the bi-connected component BCC of U that includes s
The BCC shows multiple paths to boost the reputation of s
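The steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. `get_backlinks` is a hypothetical callable standing in for the backlink query that the slides attribute to the Google API, and `skip` is a hypothetical filter for blogs, directories, and .edu sites. The sketch treats links as undirected when looking for biconnected components, expands only the newest BFS frontier at each level, and, when s belongs to several components, picks the largest one. All three are assumptions. The toy link data is invented.

```python
from collections import defaultdict


def backward_bfs(seed, get_backlinks, depth, max_backlinks, skip=lambda site: False):
    """Explore sites linking (directly or indirectly) to `seed`.

    Returns an undirected adjacency map {site: set(neighbors)}.
    """
    graph = defaultdict(set)
    graph[seed]  # make sure the seed is present even if it has no backlinks
    explored, frontier = {seed}, [seed]
    for _ in range(depth):
        next_frontier = []
        for target in frontier:
            for source in list(get_backlinks(target))[:max_backlinks]:
                if source == target or skip(source):
                    continue
                graph[source].add(target)
                graph[target].add(source)
                if source not in explored:
                    explored.add(source)
                    next_frontier.append(source)
        frontier = next_frontier
    return graph


def biconnected_components(graph):
    """Return vertex sets of biconnected components (recursive DFS with an
    edge stack; fine for small graphs, deep graphs may hit the recursion limit)."""
    index, low, edge_stack, components = {}, {}, [], []

    def visit(node, parent):
        index[node] = low[node] = len(index)
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr not in index:
                edge_stack.append((node, nbr))
                visit(nbr, node)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= index[node]:  # node separates nbr's subtree
                    component = set()
                    while True:
                        edge = edge_stack.pop()
                        component.update(edge)
                        if edge == (node, nbr):
                            break
                    components.append(component)
            elif index[nbr] < index[node]:  # back edge to an ancestor
                edge_stack.append((node, nbr))
                low[node] = min(low[node], index[nbr])

    for node in list(graph):
        if node not in index:
            visit(node, None)
    return components


def bcc_containing(graph, seed):
    """Largest biconnected component that contains `seed` (an assumption)."""
    candidates = [c for c in biconnected_components(graph) if seed in c]
    return max(candidates, key=len, default={seed})


# Invented toy data: toy_backlinks[x] lists the sites that link to x.
toy_backlinks = {"s": ["a", "b"], "a": ["b", "c"], "b": ["a"], "c": ["d"]}
lookup = lambda site: toy_backlinks.get(site, [])

graph = backward_bfs("s", lookup, depth=3, max_backlinks=10)
bcc = bcc_containing(graph, "s")
periphery = set(graph) - bcc
```

In the toy graph, s, a, and b link to one another along more than one path, so they form the BCC, while c and d hang off it along a single path and fall into the periphery.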
BCC vs Periphery
Since the BCC reveals multiple paths to boost the reputation of s, we expect it to contain a higher percentage of untrustworthy sites
The Periphery of the BCC, on the other hand, should have a significantly lower percentage of untrustworthy sites
[Figure: explored neighborhoods, showing the BCC and its Periphery]
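One way to check this expectation on an explored neighborhood is to compare the share of untrustworthy sites inside the BCC with the share in the periphery. The sketch below assumes that each explored site has already been labeled by hand. The function names and the labels are hypothetical and invented for illustration. They are not results from the slides.

```python
def untrustworthy_share(sites, labels):
    """Fraction of `sites` labeled untrustworthy (0.0 for an empty set).

    labels: dict mapping site -> True if judged untrustworthy
            (hypothetical manual labels).
    """
    sites = list(sites)
    if not sites:
        return 0.0
    return sum(1 for site in sites if labels.get(site, False)) / len(sites)


def compare_bcc_and_periphery(explored, bcc, labels):
    """Return (share in BCC, share in periphery) for one explored neighborhood."""
    periphery = set(explored) - set(bcc)
    return untrustworthy_share(bcc, labels), untrustworthy_share(periphery, labels)


# Invented toy labels, only to show the call.
toy_explored = {"s", "a", "b", "c", "d"}
toy_bcc = {"s", "a", "b"}
toy_labels = {"s": True, "a": True, "b": False, "c": False, "d": False}

bcc_share, periphery_share = compare_bcc_and_periphery(toy_explored, toy_bcc, toy_labels)
```

Sites without a label are counted as trustworthy here, which is a design choice of the sketch. A stricter version could raise an error on unlabeled sites instead.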
Evaluated Experimental Results
The trustworthiness of the starting site is a very good predictor of the trustworthiness of the BCC sites
The BCC is significantly more predictive of untrustworthiness than the Periphery
[Figure: results for the BCC and the Periphery]
Link Farms vs MAS