Transcript Document

Taking over Search Engines
Web Spamming

What is Spamming ?
–

Why ?
–

Spamming is the art of increasing
the rank of a page.
Having more visits means gaining more money.
How ?
–
–
Web search engines are the gateways to the web.
Get listed in the top results.
How much Spam out there ?


Real-Web data from the MSN crawler
collected during August 2004
105,484,446 Web pages
Why is spam bad ?

For Users:
–

Useless pages.
For Search Engines:
–
–
–
Wastes bandwidth, CPU cycles, storage space.
Pollutes corpus.
Distorts ranking of results.
(Again bad news for users !)
Techniques

Web Search Engines use a number of
measure to estimate the importance of a
page
–
–

Content Analysis: TF x IDF, …
Link Analysis: PageRank, …
Also spammers use a number of techniques !
–
–
Content Manipulation, i.e. terms
Stucture Manipulation, i.e. links
Content Manipulation 1

Repetition Repetition Repetition Repetition
Repetition Repetition Repetition Repetition :
–

Increases the Term Frequency
dumortierite dumose dumous dump dumper
dumpage Dumping dumper dumpily :
–
–
Makes a document relevant to many queries.
It is effective when using rare words
(Inverse Document Frequency).
Where ?

Body, Title, Meta Tag, Anchor, Url.
Content Manipulation 2

Content Repurposing:
–
Weaving :

–
Phrase Stitching :


Insertion of spam words into a well formed page copied
another web-site.
Gluing well formed sentences copied from many other
web-sites.
Why ?
–
Overcomes simple statistics that may be taken
into account by web search engines
The Big Picture (1)
Techniques / Boosting / Term
<a href=“target.html”>
free, great deals, cheap,
inexpensive, cheap, free
</a>
Link Bombing
Link Manipulation

Links and pages from the attacker point of
view
Creating (Hijacked) In-Links

Honey pots.
–

Web Directories, Blogs, Wikis
–



copies of valuable content (e.g. Unix Man Pages) with
hidden links to spam farms or target pages.
all of the above usually have high Page Rank, and it is
possible to add outgoing links to owned pages.
Link Exchange
Buy Expired domain
Creating Link Farms
Spamming HITS

HITS algorithm:
–
–

Searches for Hubs and Authorities
Top ranked pages are the more authoritative ones
Spam on HITS
–
–
–
Find a collection of good Hubs
Add links from Hubs to the target page
The target page is now linked to good Hubs !!
PageRank

PageRank in one equation:
–
–
–
–
–

PR(p) =  M + (1- ) Vp
M is the adjacency matrix of the Web Graph.
 is the damping factor. (usually .85)
in case of fairness Vp=1/N
(N = # of pages in the Web).
V is the personalization vector.
What happens if a page p has no outgoing links ?
–
–
 of its PR is lost --> all the PR will be lost eventually.
solution: normalize rows of M.
(i.e. insert links to every other page)
Aggregate Page Rank

Total page rank is affected by
–
–
–
–

Number of pages
Incoming Links
Outgoing Links
Dangling Nodes
Topologies that:
–
–
–
incoming links
WEB-SITE
Use as many pages as possible
minimize outgoing links
minimize dangling nodes
outgoing links
Chain topology (more is better)
I
a
O
0.18
0.34
0.47
I
0.11
PR (Web Site) = 0.34
a
b
O
0.21
0.29
0.37
PR (Web Site)
= 0.21+0.29 = 0.50
I
a
b
c
d
e
f
O
0.03
0.07
0.09
0.12
0.14
0.16
0.17
0.18
PR (Web Site) = 0.77
Ring topology
I
0.18
a
0.34
O
0.47
0.18
a
I
0.03
0.15
b
f
e
0.14
O
0.11
0.11
c
d
0.13
0.12
PR (Web Site) = 0.86
Clique topology
I
0.18
a
0.34
O
0.47
0.18
a
I
0.03
0.15
b
f
e
0.15
O
0.04
0.15
c
d
0.15
0.15
PR (Web Site) = 0.93
Increasing Page Rank of a
single target page

Complicated structures do not help
–

chain, ring, clique
waste page rank among every node in the website
To maximize the page rank of a target page a
–
–
–
–
all hijacked pages I must point to a
all boosting pages (b,c,d,e,f) must point to a
no links among boosting pages
the target page must point to all of the boosting pages
Star topology
I
0.18
a
0.34
b
O
0.47
0.09
0.09
f
I
0.03
c
0.09
a
e
0.09
O
0.09
d
0.09
PR (a) = 0.43
Putting all together

Given many spam farms
–

Create highly connected topologies among target
pages
Link Exchange
–
every target page will be rewarded proportionally
to their previous page rank
Is it worth ?

Page rank has a power low distribution
–
–

if a page has a low initial PageRank
it is easy to improve it and to get higher ranking
if a page as an higher initial PageRank
it is hard to improve it and it is harder to overcome other
pages
Consider that:
–
–
it is cheap to generate automatically a link farm, but
spamming is expensive in terms of
registered domains and IPs.
Hiding Techniques

Discriminate between real users and
crawlers in order to hide spam activity to both
of them
Content Hiding

Use background color for text.
–

add keywords
Use small 1 pixel anchor images.
–
add links
Cloaking


Identify whether the request comes from a real user
or a search engine and provide different content.
To users:
–

To Search Engines
–
–

provide target pages.
provide useful and interesting text.
provide a link structure that increase PageRank.
Solution:
–
Download the same page twice.
Redirection


The redirection mechanism is used to create
doorways to target pages
Search Engines:
–

download the page and crawl its links.
Users:
–
are immediately redirected to a target page.
Why content hiding is tough




HTML code can be parsed trying to detect
spam intrusions.
Javascript code can be parsed too, but it is
more difficult.
Eventually, it is needed to interpret the code.
Crawling is already very expensive !
Link analysis algorithms against
web spamming




TrustRank
Anti-Trust Rank
Truncated Page Rank
SpamRank
Trust Rank

Observation
–
–

Good pages tend to link good pages.
Human is the best spam detector
Algorithm
–
–
Select a small subset of pages and let a human
classify them
Propagate goodness of pages
Trust Rank: Selection

The seed set S should:
–
–

Covering is related to out-links in the very same way
PageRank is related to in-link
–

be as small as possible
cover a large part of the Web
Inverse PageRank !
A small number of pages with the highest Inverse
PageRank is labeled by a human expert.
Trust Rank: Propagation

Initial values
–
–

TR(p) = 1, if p was found to be a good page
TR(p) = 0, otherwise
Iterations:
–
propagate Trust in the same way as PageRank


–
splitting through out-links
damping (attenuation) 
only a fixed number of iteration M.
Trust Rank: Results
Anti-Trust Rank

Goal
–

find spam pages
Algorithm
–
–
–
–
Obtain a seed set of spam pages labeled by hand.
(prefer high PageRank)
Compute PageRank Algorithm on the trasnposed adjacency
matrix.
Use the seed set in the personalization vector.
Rank the pages in descending order of their scores.
Anti-Trust Rank
Truncated Page Rank

Observation
–
Good pages have high page rank because of
pages between 5 and 10 hops away
Truncated Page Rank

Observation
–
–
Good pages have high page rank because of
pages between 5 and 10 hops away
Spam pages gain page rank because of pages in
their neighborhood
Truncated Page Rank

Observation
–
–

Good pages have high page rank because of
pages between 5 and 10 hops away
Spam pages gain page rank because of pages in
their neighborhood
Solution
–
–
promote rank coming from far away
demote rank coming from the closest pages
Truncated Page Rank

Rank propagates through links
–

5 steps of propagation mean
–

M · M · M · M · M = 5·M5
We can calculate the page rank of a page by summing up the
contributions from different distances:
–

only a fraction  propagates according to the adjacency matrix M
PR(p) =  t · Mt =  damping(t) · Mt
We can replace n with a function like this:
Truncated Page Rank

Strategy:
–

Pages whose PageRank is largely different from
its Truncated PageRank are likely to be spam
Results:
–
Comparable with TrustRank
Spam Rank

Observations:
–
–

Spam pages are usually supported by low PageRank Pages.
Spammers have a limited budget, so they replicate only
what they need for boosting PageRank.
Idea:
–
–
Find missing statistical features of dishonest supporters.
Due to the self-similarity, the honest supporter set should
have a power-law distribution of PageRank.
Spam Rank: Algorithm





Find supporters for each page.
Check whether each set of supporters follows a
power-law distribution of its PageRank.
Create penalties for suspicious pages.
Run PageRank using a personalization vector based
on penalties.
Spam Rank is a Measure of Undeserved
PageRank
Spam Rank: Results
fine.