Web Page Classification


Features and Algorithms
Paper by: XIAOGUANG QI and BRIAN D. DAVISON
Presentation by: Jason Bender
Outline
Introduction to Classification
 Background

 Classification Types
 Classification Methods
Applications
 Features
 Algorithms
 Evolution of Websites

What is web page classification?

The process of assigning a web page to
one or more predefined category labels
(e.g., news, sports, business…)

Classification is generally posed as a
supervised learning problem
 A set of labeled data is used to train a
classifier, which is then applied to label future
examples
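
The train-then-label loop above can be sketched with a tiny Naive Bayes text classifier. This is only an illustration under stated assumptions: the slides do not name a specific learner, and the training pages, labels, and function names below are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_pages):
    """Train on (text, label) pairs; returns per-label word and page counts."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of training pages
    vocabulary = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return {"word_counts": word_counts,
            "label_counts": label_counts,
            "vocabulary": vocabulary}

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    words = text.lower().split()
    total_pages = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, n_pages in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_pages / total_pages)          # class prior
        for word in words:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data standing in for real web pages.
TRAINING = [
    ("match score goal team league", "sports"),
    ("team wins final match", "sports"),
    ("stock market shares profit", "business"),
    ("company profit quarterly earnings", "business"),
]
model = train_naive_bayes(TRAINING)
print(classify(model, "team scored a late goal"))
```

The labeled pages play the role of the training set, and `classify` is the trained classifier applied to a future example.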
Background - Classification Types

Supervised learning problem broken into
sub problems:
 Subject Classification
 Functional Classification
 Sentiment Classification
 Other types of Classification
Subject Classification

Concerned with subject or topic of the
web page
 Judging whether a page is about arts,
business, sports, etc…
Functional Classification

Role that the page is playing
 Deciding whether a page is a personal
homepage, course page, admissions page,
etc…
Sentiment Classification

Focuses on the opinion that is presented
in a web page
Other types of Classification

Such as genre classification and search
engine spam classification
Background - Classification
Methods
Binary vs. Multiclass
 Single Label vs. Multi Label
 Soft vs. Hard
 Flat vs. Hierarchical
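
A rough sketch of how three of these distinctions show up in code, assuming a classifier that already returns a soft score per category. The scores, thresholds, and function names are assumptions made for illustration.

```python
def hard_single_label(scores):
    """Hard, single-label: commit to exactly one category."""
    return max(scores, key=scores.get)

def hard_multi_label(scores, threshold=0.3):
    """Hard, multi-label: keep every category whose score passes the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)

def binary_decision(scores, label, threshold=0.5):
    """Binary: answer yes or no for one category only."""
    return scores.get(label, 0.0) >= threshold

# Soft classification output: a score per category (hypothetical values).
SOFT_SCORES = {"news": 0.15, "sports": 0.50, "business": 0.35}

print(hard_single_label(SOFT_SCORES))
print(hard_multi_label(SOFT_SCORES))
print(binary_decision(SOFT_SCORES, "sports"))
```

Soft classification keeps the scores themselves, while hard classification turns them into yes/no assignments.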

Binary vs. Multiclass Classification
Single-Label vs. Multi-Label
Classification
Soft vs. Hard Classification
Flat vs. Hierarchical Classification
Applications

Why is classification important and how
can we use it efficiently?
Constructing, maintaining, or
expanding web directories

Web directories provide an efficient way to
browse for information within a predefined
set of categories

Example:
 Open Directory Project

Currently constructed by human effort
 78,940 editors of ODP
Improving the quality of search
results

A big problem with search results is
search ambiguity
Helping question answering
systems
Can use classification systems to help
improve the quality of answers
 Example: Wolfram Alpha

Other applications

Contextual advertising
Features

What features can we extract from a
web page to use to help classify it?
Features - Introduction

Because of features such as the hyperlink
<a> … </a>, webpage classification is vastly
different from other forms of classification
such as plaintext classification.

Features organized into two groups:
○ On-page features – directly located on page
○ Neighbor features – found on related pages
On Page Features

Textual Contents & Tags
 Bag-of-words
○ N-gram feature
 Rather than analyzing individual words, group them into
sequences of n consecutive words.
- Ex: the 2-gram "New York" vs. "new" ….. ….. "York" separated by other words
 Yahoo! has used a 5-gram feature
 HTML tags – title, heading, metadata, main text
 URL
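
A minimal sketch of two of the on-page features listed above: word n-grams from page text and tokens from the URL. The tokenization rules and the example URL are assumptions, not something the slides specify.

```python
import re
from collections import Counter

def word_ngrams(text, n):
    """Return all runs of n consecutive words, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_words(text):
    """Bag-of-words is the n = 1 case, with counts instead of order."""
    return Counter(word_ngrams(text, 1))

def url_tokens(url):
    """Split a URL on every non-alphanumeric character."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]

print(word_ngrams("New York is large", 2))
print(bag_of_words("new york new jersey"))
print(url_tokens("http://www.example.edu/courses/cs101.html"))
```

With 2-grams, "new york" becomes a single feature, so a page that mentions the city is not confused with one that merely contains both words somewhere.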
On Page Features

Visual Analysis
 Each page has two representations
○ Text via HTML
○ Visual via the browser
 Each page can be represented as a visual
adjacency multigraph
Features of Neighbors

What happens when a page’s features
are missing or are unrecognizable?
Features of Neighbors

Assumptions
 If page1 is in the neighborhood of many
“sports” pages then there is an increased
probability that page1 is also a “sports”
page.
 Linked pages are more likely to have terms
in common
Features of Neighbors

Neighbor Selection
 Focus on pages within 2 steps of target
 6 types: parent, child, sibling, spouse,
grandparent, and grandchild
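
The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (a parent links to the target, a sibling shares a parent with it, a spouse shares a child with it); the page names and the edge list are hypothetical.

```python
from collections import defaultdict

def neighbor_sets(links, target):
    """links: iterable of (source, destination) hyperlinks between pages."""
    children_of = defaultdict(set)
    parents_of = defaultdict(set)
    for source, destination in links:
        children_of[source].add(destination)
        parents_of[destination].add(source)

    def union(pages, table):
        # Collect table[p] for every page p, skipping pages with no entry.
        result = set()
        for page in pages:
            result |= table.get(page, set())
        return result

    parents = set(parents_of.get(target, set()))
    children = set(children_of.get(target, set()))
    return {
        "parent": parents,                                   # link to target
        "child": children,                                   # target links to
        "sibling": union(parents, children_of) - {target},   # share a parent
        "spouse": union(children, parents_of) - {target},    # share a child
        "grandparent": union(parents, parents_of),           # two steps up
        "grandchild": union(children, children_of),          # two steps down
    }

# Hypothetical link graph: g -> p -> t -> c -> x, plus p -> s and q -> c.
LINKS = [("g", "p"), ("p", "t"), ("p", "s"), ("t", "c"), ("q", "c"), ("c", "x")]
neighbors = neighbor_sets(LINKS, "t")
print(neighbors)
```

Every set returned is within two link steps of the target, which matches the neighborhood limit stated above.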
Features of Neighbors
Labels
 Anchor Text
 Surrounding Anchor Text


By using the anchor text, surrounding
text, and page title of a parent page in
combination with text from target page,
classification can be improved.
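
One way to collect anchor text and its surrounding text from a parent page, sketched with Python's standard `html.parser`. The window size, the example HTML, and the URL are assumptions; a real system would also need to skip script and style content.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Records visible words and where anchors to the target URL sit."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.words = []          # all text words, in document order
        self.anchor_spans = []   # (start, end) word indices of matching anchors
        self._start = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._start is not None:
            self.anchor_spans.append((self._start, len(self.words)))
            self._start = None

    def handle_data(self, data):
        self.words.extend(data.split())

def anchor_features(parent_html, target_url, window=3):
    """Return (anchor words, words within `window` positions of the anchor)."""
    parser = AnchorCollector(target_url)
    parser.feed(parent_html)
    parser.close()
    anchor, surrounding = [], []
    for start, end in parser.anchor_spans:
        anchor += parser.words[start:end]
        surrounding += parser.words[max(0, start - window):start]
        surrounding += parser.words[end:end + window]
    return anchor, surrounding

# Hypothetical parent page linking to the target.
PARENT_HTML = ('<p>Read the latest football results on '
               '<a href="http://example.org/scores">Premier League scores</a> '
               'updated every weekend.</p>')
anchor, surrounding = anchor_features(PARENT_HTML, "http://example.org/scores")
print(anchor, surrounding)
```

The two word lists can then be appended to the target page's own text before it is handed to the classifier.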
Features of Neighbors

Implicit Links
 Connections between pages that appear in
the results of the same query and are both
clicked by users
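
A small sketch of deriving implicit links from a click log, assuming the log is available as (query, clicked URL) pairs. The log entries are made up, and counting shared queries as link strength is an illustrative choice.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Return {(url_a, url_b): number of queries for which both were clicked}."""
    clicked = defaultdict(set)            # query -> set of clicked URLs
    for query, url in click_log:
        clicked[query].add(url)
    link_counts = defaultdict(int)
    for urls in clicked.values():
        # Every pair of pages clicked for the same query gets an implicit link.
        for url_a, url_b in combinations(sorted(urls), 2):
            link_counts[(url_a, url_b)] += 1
    return dict(link_counts)

# Hypothetical click log.
CLICK_LOG = [
    ("jaguar", "a.example/cars"),
    ("jaguar", "b.example/cars"),
    ("jaguar speed", "a.example/cars"),
    ("jaguar speed", "b.example/cars"),
    ("big cats", "c.example/zoo"),
]
print(implicit_links(CLICK_LOG))
```

Pages connected this way need no hyperlink between them, which is why the links are called implicit.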
Algorithms

What are the algorithmic approaches to
webpage classification?
 Dimension reduction
 Relational learning
 Hierarchical classification
 Information combination
Dimension Reduction

Boost classification by emphasizing
certain features that are more useful in
classification
 Feature Weighting
○ Reduces the dimensions of feature space
○ Reduces computational complexity
○ Classification can be more accurate as a result of
the reduced space
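
As an illustration of feature weighting, the sketch below weights each term by the part of the page it appears in and then keeps only the top-weighted terms. The field names, the weights, and the cutoff are assumptions chosen for the example, not values from the paper.

```python
from collections import Counter

# Assumed weights: terms in the title count more than terms in the body.
TAG_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def weighted_term_vector(page_fields, tag_weights=TAG_WEIGHTS):
    """page_fields: {field name: text}. Returns term -> accumulated weight."""
    vector = Counter()
    for field, text in page_fields.items():
        weight = tag_weights.get(field, 1.0)
        for word in text.lower().split():
            vector[word] += weight
    return vector

def top_k_features(vector, k):
    """Keep only the k highest-weighted terms (a smaller feature space)."""
    return [term for term, _ in vector.most_common(k)]

# Hypothetical page, already split into fields.
PAGE = {
    "title": "Cup final",
    "heading": "final score",
    "body": "the final was played in the rain",
}
vector = weighted_term_vector(PAGE)
print(top_k_features(vector, 2))
```

Dropping the low-weight terms is what shrinks the feature space and the cost of training on it.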
Dimension Reduction

Methods
 Use first fragment
 K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML Tags
○ Metrics
 Expected mutual information
 Mutual information
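
The slides do not give formulas for these metrics. The sketch below assumes the standard information-theoretic definition of expected mutual information between "term occurs in page" and "page has label", computed from a 2x2 count table; the documents are hypothetical.

```python
import math

def term_class_counts(docs, term, label):
    """docs: list of (set of terms, label). Returns the 2x2 count table."""
    n11 = sum(1 for terms, lab in docs if term in terms and lab == label)
    n10 = sum(1 for terms, lab in docs if term in terms and lab != label)
    n01 = sum(1 for terms, lab in docs if term not in terms and lab == label)
    n00 = sum(1 for terms, lab in docs if term not in terms and lab != label)
    return n11, n10, n01, n00

def expected_mutual_information(n11, n10, n01, n00):
    """Sum of P(t, c) * log2(P(t, c) / (P(t) * P(c))) over the four cells."""
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),   # term present, label matches
        (n10, n11 + n10, n10 + n00),   # term present, other label
        (n01, n01 + n00, n11 + n01),   # term absent, label matches
        (n00, n01 + n00, n10 + n00),   # term absent, other label
    )
    total = 0.0
    for n_tc, n_t, n_c in cells:
        if n_tc > 0:                   # empty cells contribute nothing
            total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return total

# Hypothetical labeled pages reduced to term sets.
DOCS = [
    ({"goal", "team"}, "sports"),
    ({"goal", "match"}, "sports"),
    ({"profit", "team"}, "business"),
    ({"profit", "market"}, "business"),
]
for term in ("goal", "team"):
    counts = term_class_counts(DOCS, term, "sports")
    print(term, expected_mutual_information(*counts))
```

A term that tracks the label perfectly scores high and would be kept; a term spread evenly across labels scores zero and is a candidate for removal.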
Relational Learning

Relaxation Labeling
Hierarchical Classification

Based on “divide and conquer”
 Classification problems split into hierarchical
set of sub problems.

Error Minimization
 When a lower-level category is uncertain
whether a page belongs to it, shift the
assignment one level up.
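
A toy sketch of top-down hierarchical classification with the back-off rule above: descend while the local decision is confident, otherwise keep the assignment at the current level. The taxonomy, the keyword-overlap scorer, and the threshold are all assumptions standing in for real per-node classifiers.

```python
def keyword_score(page_words, node):
    """Stand-in local classifier: number of node keywords found in the page."""
    return len(page_words & node.get("keywords", set()))

def classify_hierarchical(page_words, root, threshold=0.6):
    """Return the path of category names from the root to the assigned node."""
    node, path = root, [root["name"]]
    while node.get("children"):
        scores = {child["name"]: keyword_score(page_words, child)
                  for child in node["children"]}
        total = sum(scores.values())
        if total == 0:
            break                              # no evidence: stay at this level
        best = max(scores, key=scores.get)
        if scores[best] / total < threshold:
            break                              # uncertain: assign one level up
        node = next(c for c in node["children"] if c["name"] == best)
        path.append(best)
    return path

# Hypothetical two-level taxonomy.
TAXONOMY = {"name": "root", "children": [
    {"name": "sports", "keywords": {"team", "match", "goal", "race"},
     "children": [
         {"name": "soccer", "keywords": {"goal", "penalty"}},
         {"name": "cycling", "keywords": {"race", "peloton"}},
     ]},
    {"name": "business", "keywords": {"profit", "market"}, "children": []},
]}

print(classify_hierarchical({"team", "goal", "penalty"}, TAXONOMY))
print(classify_hierarchical({"team", "goal", "race"}, TAXONOMY))
```

In the second call the soccer and cycling scores tie, so the page stays at "sports" rather than being forced into a subcategory.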
Information Combination

Combine several methods into one
 Information from different sources is used
to train multiple classifiers, and the collective
work of those classifiers makes the final
decision.
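
One simple way to combine classifiers is a weighted sum of their per-label scores. The slides do not say which combination rule is used, so the rule, the source names, the scores, and the weights below are all illustrative assumptions.

```python
from collections import defaultdict

def combine_scores(per_source_scores, weights=None):
    """per_source_scores: {source: {label: score}}. Returns the winning label."""
    weights = weights or {}
    combined = defaultdict(float)
    for source, scores in per_source_scores.items():
        weight = weights.get(source, 1.0)    # unlisted sources get weight 1
        for label, score in scores.items():
            combined[label] += weight * score
    return max(combined, key=combined.get)

# Hypothetical outputs of three classifiers, each trained on one source.
PER_SOURCE = {
    "content": {"sports": 0.6, "business": 0.4},
    "url": {"sports": 0.3, "business": 0.7},
    "anchor_text": {"sports": 0.8, "business": 0.2},
}

print(combine_scores(PER_SOURCE))
print(combine_scores(PER_SOURCE, weights={"url": 3.0}))
```

With equal weights the content and anchor-text classifiers outvote the URL classifier; raising the URL weight flips the decision, which shows how much the weighting matters.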
Conclusion

Webpage classification is a type of
supervised learning problem aiming to
categorize a webpage into a predefined
set of categories.

In the future, efforts will most likely be
focused on effectively combining content
and link information to build a more
accurate classifier
Evolution of Websites

Apple in 1998
Evolution of Websites

Apple in 2008
Evolution of Websites

Nike in 2000
Evolution of Websites

Nike in 2008
Evolution of Websites

Yahoo in 1996
Evolution of Websites

Yahoo in 2008
Evolution of Websites

Microsoft in 1998
Evolution of Websites

Microsoft in 2008
Evolution of Websites

MTV in 1998
Evolution of Websites

MTV in 2008
Sources

Web Page Classification: Features and Algorithms
by Xiaoguang Qi & Brian D. Davison

Visual Adjacency Multigraphs – A Novel Approach for a
Web Page Classification
by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko
Milutinovic

The Evolution of Websites
http://www.wakeuplater.com/website-building/evolution-of-websites-10popular-websites.aspx