Web Spam Detection: link-based and content-based techniques
Reporter : 鄭志欣
Advisor : Hsing-Kuo Pao
2010/11/8
Outline
• Introduction
• Web Spam: a debatable problem
• Characterizing Spam Pages
• DataSets
• Method
• Combined Classifier
• Conclusion
Introduction
• Characteristics of Web spam pages [1][2]
– Inclusion of many unrelated keywords and links.
– Use of many keywords in the URL.
– Redirection of the user to another page.
– Creation of many copies with substantially duplicate content.
– Insertion of hidden text by writing in the same color as the background of the page.
[3]
Web Spam: a debatable problem
• Some definitions
– All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam or spamdexing.
– An unjustifiably favorable relevance or importance score for some web page, considering the page’s true value. [4]
– Any attempt to deceive a search engine’s relevancy algorithm.
• Search Engine Optimization (SEO)
Characterizing Spam Pages
• Content spam
– Inserting a large number of keywords.
– It is shown that 82-86% of spam pages of this type can be detected by an automatic classifier [5].
• Link spam
– A link farm is a densely connected set of pages, created explicitly with the purpose of deceiving a link-based ranking algorithm.
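
To make the content-spam bullet concrete, here is a minimal sketch of two content signals that tend to react to keyword stuffing: how well the page text compresses, and how much of the text is covered by its few most frequent words. The function name `content_features`, the default `top_k`, and the sample strings are assumptions made for illustration; this is not the feature set or the classifier evaluated in [5].

```python
import zlib
from collections import Counter


def content_features(text, top_k=3):
    """Return simple content features for one page's visible text."""
    words = text.lower().split()
    raw = text.encode("utf-8")
    # Repetitive (keyword-stuffed) text compresses well, so this ratio grows.
    compression_ratio = len(raw) / len(zlib.compress(raw)) if raw else 0.0
    counts = Counter(words)
    top = sum(c for _, c in counts.most_common(top_k))
    # Share of all word occurrences covered by the top_k most frequent words.
    top_fraction = top / len(words) if words else 0.0
    return {
        "num_words": len(words),
        "compression_ratio": compression_ratio,
        "top_word_fraction": top_fraction,
    }


# Hypothetical sample pages, invented for this example.
stuffed = "cheap pills " * 50
normal = ("The workshop covers link analysis, content features "
          "and evaluation on a labelled host graph.")
stuffed_features = content_features(stuffed)
normal_features = content_features(normal)
```

In a real system such values would be a few columns among many in the input of a trained classifier; on their own they only hint that a page is unusually repetitive.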
Link Farm[6]
“manipulation of the link structure by a group of users with the intent of improving the rating of one or more users in the group”.
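
One simple way to quantify "densely connected" is the edge density of a candidate group of pages: the fraction of all possible directed links among them that actually exist. The sketch below computes it on a toy graph. The helper name `edge_density` and the graph are hypothetical, and this is only a density check on a given set, not the dense-subgraph discovery algorithm of [6].

```python
def edge_density(nodes, edges):
    """Fraction of possible directed links (no self-loops) present among `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep only links whose both endpoints lie inside the candidate group.
    inside = {(u, v) for u, v in edges if u in nodes and v in nodes and u != v}
    return len(inside) / (n * (n - 1))


# Toy graph: a, b, c all link to each other; d and e are loosely attached.
edges = [
    ("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("c", "b"),
    ("d", "a"), ("e", "d"),
]
farm_density = edge_density(["a", "b", "c"], edges)   # fully interlinked group
mixed_density = edge_density(["a", "d", "e"], edges)  # sparsely linked group
```

A density close to 1 for a sizeable group is unusual for organically linked pages, which is why it can serve as a hint of a link farm.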
High and low-ranked pages are different
DataSet[7]
• WEBSPAM-UK2006
– .uk domain
– 77.9 million pages, over 3 billion links, 11,400 hosts, May 2006.
http://barcelona.research.yahoo.net/webspam/
TrustRank[4]
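
The general idea of TrustRank [4] is a PageRank computation whose random jumps land only on a hand-picked set of trusted seed pages, so trust flows outward from the seeds along links. The sketch below is a minimal version of that idea under stated assumptions: the function name `trustrank`, the damping value `alpha=0.85`, the fixed iteration count, the handling of dangling pages, and the toy graph are all illustrative choices, and the seed set is simply given rather than selected as in [4].

```python
def trustrank(out_links, trusted_seeds, alpha=0.85, iterations=50):
    """Propagate trust from seed pages along out-links (biased PageRank)."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    seeds = set(trusted_seeds) & nodes
    if not seeds:
        raise ValueError("need at least one trusted seed in the graph")
    # Teleportation vector: uniform over the trusted seeds, zero elsewhere.
    d = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * d[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                continue  # dangling page: its trust is simply not passed on
            share = alpha * trust[u] / len(targets)
            for v in targets:
                nxt[v] += share
        trust = nxt
    return trust


# Hypothetical graph: the spam pages link to a good page but receive no
# links from any page reachable from the seed.
graph = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = trustrank(graph, trusted_seeds=["good1"])
```

In this toy graph the two spam pages end up with zero trust because no path leads from the seed to them, while every page reachable from the seed receives a positive score.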
Truncated PageRank(1/2)[2]
Truncated PageRank(2/2)
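
As a rough sketch of the idea behind Truncated PageRank, assume the formulation in which PageRank is written as a damped sum over path lengths and the truncated variant drops the contribution of the first T levels of links. Pages whose score comes mostly from very short paths, as with a link farm pointing straight at a target, then lose most of their score. The function name `truncated_pagerank`, the constant `C`, the `max_len` cutoff, and the toy graph are assumptions for illustration, and the scores are not normalised.

```python
def truncated_pagerank(out_links, T=2, alpha=0.85, max_len=50):
    """Sum PageRank-style contributions of paths longer than T links."""
    nodes = sorted(set(out_links) | {v for ts in out_links.values() for v in ts})
    n = len(nodes)
    C = 1.0 - alpha  # assumed scaling constant; it only rescales all scores
    # r[v]: contribution of paths of length exactly t that end at v (t = 0 here).
    r = {v: C / n for v in nodes}
    score = {v: 0.0 for v in nodes}
    for t in range(1, max_len + 1):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                nxt[v] += alpha * r[u] / len(targets)
        r = nxt
        if t > T:
            # Only paths longer than T links count; shorter ones are skipped.
            score = {v: score[v] + r[v] for v in nodes}
    return score


# Hypothetical graph: "target" is boosted only by direct links from pages
# that nobody links to; "deep" sits at the end of a longer chain.
graph = {
    "f1": ["target"], "f2": ["target"], "f3": ["target"],
    "a": ["b"], "b": ["c"], "c": ["deep"],
}
near_full = truncated_pagerank(graph, T=0)   # drops only the length-0 term
truncated = truncated_pagerank(graph, T=1)   # also ignores direct in-links
```

With `T=0` the farm-boosted page outranks the chain page; with `T=1` its score drops to zero, because all of its support arrives over paths of length one.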
Estimation of Supporters[2]
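
For intuition, the sketch below counts supporters exactly on a small graph, taking "supporters of a page within distance d" to mean the pages that can reach it by following at most d links (an assumed reading of the term). It uses a plain breadth-first search over reversed links, which is a deliberate simplification: [2] relies on probabilistic counting to estimate these numbers at Web scale. The function name `supporters_within` and the toy graph are hypothetical.

```python
from collections import deque


def supporters_within(out_links, page, d):
    """Count pages that can reach `page` by following at most d links."""
    # Build the reversed graph so we can walk in-links outward from `page`.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)
    dist = {page: 0}
    queue = deque([page])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue  # do not expand beyond distance d
        for u in in_links.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return len(dist) - 1  # exclude the page itself


# Hypothetical graph: x has three direct supporters and a short chain behind one.
graph = {"a": ["b"], "b": ["c"], "c": ["x"], "e": ["x"], "f": ["x"]}
counts = [supporters_within(graph, "x", d) for d in (1, 2, 3)]
```

How quickly this count grows with d is the kind of quantity that can be turned into link-based features.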
Link and Content features
Topological dependencies : in-links[6]
Topological dependencies : out-links
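
Only the slide titles survive here, so the following is a sketch under an explicit assumption: that "topological dependencies" refers to how the spam labels of linked hosts correlate. Given a labelled host graph, it computes for each host the fraction of spam among the hosts linking to it (in-links) and among the hosts it links to (out-links). The function name `spam_neighbor_fractions`, the graph, and the labels are invented for illustration.

```python
def spam_neighbor_fractions(out_links, is_spam):
    """For each labelled host, return (spam fraction of in-links, of out-links)."""
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, []).append(u)

    def fraction(neighbors):
        labelled = [n for n in neighbors if n in is_spam]
        if not labelled:
            return None  # no labelled neighbours to measure
        return sum(is_spam[n] for n in labelled) / len(labelled)

    return {
        host: (fraction(in_links.get(host, [])), fraction(out_links.get(host, [])))
        for host in is_spam
    }


# Hypothetical labelled host graph: s* are spam hosts, n* are normal hosts.
graph = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s3"],
    "s3": ["s1", "n1"],
    "n1": ["n2"],
    "n2": ["n1"],
}
labels = {"s1": True, "s2": True, "s3": True, "n1": False, "n2": False}
result = spam_neighbor_fractions(graph, labels)
```

Fractions like these could be plotted per class, or used to adjust a classifier's prediction for a host according to the predictions of its neighbours.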
Conclusion
• The current precision and recall of Web spam detection algorithms can be improved using a combination of factors already used by search engines.
• Such factors include user interaction features (e.g. data collected via toolbar or by observing clicks in search engine results).
References
• [1] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA, August 2006. (cita 57)
• [2] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. (cita 49)
• [3] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR), pages 423–430, Amsterdam, Netherlands, 2007. ACM Press. (cita 90)
• [4] Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann. (cita 455)
• [5] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland, May 2006. (cita 196)
• [6] Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment. (cita 96)
• [7] http://barcelona.research.yahoo.net/webspam/