Comprehensive Literature Review on Machine Learning Structures


Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification
Source: Procedia Computer Science(2015)70:434-441
Authors: Kwang Leng Goh,Ashutosh Kumar Singh
Speaker: Jia Qing Wang
Date: 2016/10/13
Outline
• Introduction
• Related Work
• Methodology
• Experiment
• Results & Discussions
• Conclusion & Future Work
Introduction(1/3)
The website of a municipal health bureau was attacked with advertisements for detectaphones (eavesdropping devices).
Introduction(2/3)
Introduction(3/3)
• The intention of Web spam is to mislead search engines by boosting a page to an undeserved rank, which leads Web users to irrelevant information.
• Spam detection is needed; however, the process is often time-consuming, expensive, and difficult to automate because of the massive amount of data and its multidimensional attributes.
• Machine learning methods provide an ideal solution because of their adaptive ability to learn the underlying patterns for classifying spam and non-spam [22].
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121
Related Work(1/3)
Web spam detection problem → classification problem
Pipeline: extract features → dimensionality reduction → classifier → experiment result
• Extract features: novel, high-quality features for web pages (link features, content features, …)
• Dimensionality reduction: use feature selection and feature extraction methods (PCA, LDA, …)
Related Work(2/3)
Datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007
A & B: number of words in the page, number of words in the title, average word length, fraction of anchor text and visible text, …
C: in-degree, out-degree, PageRank, TrustRank, truncated PageRank
D: simple numeric transformations and combinations of the link-based features
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121
[24] Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: ICML, vol 96, pp 148–156
[26] Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29(2-3):131–163
Related Work(3/3)
Performance Evaluation
Because this is a binary classification problem and the datasets used are highly imbalanced, common evaluation measures are not well suited; the paper therefore uses AUC as the performance measure.
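To illustrate why plain accuracy misleads on imbalanced data, here is a minimal sketch that compares accuracy with a rank-based AUC (the probability that a randomly chosen spam page is scored above a randomly chosen non-spam page, with ties counted as one half). The function names `auc` and `accuracy` are hypothetical, and the "classifier" is a deliberately useless one that treats every page the same; only the class counts (223 spam, 5248 non-spam) come from the WEBSPAM-UK2007 slide.

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Class counts taken from the slides: 223 spam (1), 5248 non-spam (0).
labels = [1] * 223 + [0] * 5248

# A useless model: same score for every page, always predicts non-spam.
constant_scores = [0.0] * len(labels)
always_nonspam = [0] * len(labels)

majority_accuracy = accuracy(labels, always_nonspam)  # 5248 / 5471, about 0.959
majority_auc = auc(labels, constant_scores)           # no ranking ability at all

print(f"accuracy = {majority_accuracy:.3f}, AUC = {majority_auc:.3f}")
```

The always-non-spam model looks strong by accuracy yet has chance-level AUC, which is the kind of gap that motivates using AUC here.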
Methodology(1/1)
Several machine learning algorithms from the top 10 data mining algorithms are described and evaluated in the paper:
• Support Vector Machine (SVM)
• Multilayer Perceptron Neural Network (MLP)
• Bayesian Network (BN)
• C4.5 Decision Tree (DT) (used as the example)
• Random Forest (RF)
• Naïve Bayes (NB)
• K-nearest Neighbour (KNN)
Furthermore, several meta-algorithms are presented to improve the AUC results of the selected machine learning algorithms:
• Boosting algorithms (used as the example)
• Bagging
• Dagging
• Rotation Forest
Experiment(1/3)
Dataset: WEBSPAM-UK2007 (feature sets B+C, 137 features)
Spam: 223
Non-spam: 5248
AUC = 0.693
Experiment(2/3)
[Figure: decision tree. Input: sample set; nodes: attributes/features; branches: possible decisions (selected by information gain); leaves: class]
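The information-gain criterion named on the decision tree slide can be sketched as follows. This is an illustration only: the attribute names (`many_outlinks`, `long_title`) and the four toy samples are hypothetical, and the functions `entropy`, `information_gain` and `best_attribute` are names chosen for this sketch. Note that C4.5 itself normalises information gain into a gain ratio, which is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical
    attribute whose per-sample values are given in `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Pick the attribute with the highest information gain.
    `rows` is a list of dicts mapping attribute name -> value."""
    return max(rows[0].keys(),
               key=lambda a: information_gain([r[a] for r in rows], labels))


# Hypothetical toy sample set (not from the paper).
rows = [
    {"many_outlinks": "yes", "long_title": "yes"},
    {"many_outlinks": "yes", "long_title": "no"},
    {"many_outlinks": "no", "long_title": "yes"},
    {"many_outlinks": "no", "long_title": "no"},
]
labels = ["spam", "spam", "non-spam", "non-spam"]

gain_outlinks = information_gain([r["many_outlinks"] for r in rows], labels)
gain_title = information_gain([r["long_title"] for r in rows], labels)
root = best_attribute(rows, labels)
print(root, gain_outlinks, gain_title)
```

In this toy set `many_outlinks` separates the two classes perfectly while `long_title` tells us nothing, so the tree would test `many_outlinks` at the root.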
Experiment(3/3)
DT-based AdaBoost algorithm
AUC = 0.769 > 0.693
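The boosting idea can be sketched in a few lines. The following is a minimal discrete AdaBoost for binary labels in {-1, +1}, assuming one-feature decision stumps as the weak learner in place of the C4.5 trees used in the paper; the function names and the six-point toy data are hypothetical, and the sketch does not attempt to reproduce the reported AUC values.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump on one numeric feature: `polarity` if x >= threshold."""
    return polarity if x >= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train up to `rounds` stumps, re-weighting misclassified samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        if err >= 0.5:          # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)   # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of mistakes, decrease the weight of hits.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_correct = sum(predict(single, x) == y for x, y in zip(xs, ys))
boosted_correct = sum(predict(boosted, x) == y for x, y in zip(xs, ys))
print(single_correct, boosted_correct)
```

On this toy data a single stump misclassifies some points, while the weighted vote of three stumps fits all six, which mirrors in miniature the improvement the slide reports for the boosted decision tree.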
Results & Discussions(1/3)
B+C: AUC = 0.693
Results & Discussions(2/3)
Results & Discussions(3/3)
Conclusion & Future Work
Random Forest has proven to be a more powerful classifier than most top data mining tools, including SVM and MLP, in Web spam detection.
For future work, the features for Web spam detection are intended to be comprehensively compared and studied.
• Comprehensively compare features
• Add new features
Thanks