Web Page Classification
Web Page Classification:
Features and Algorithms
Paper by: XIAOGUANG QI and BRIAN D. DAVISON
Presentation by: Jason Bender
Outline
Introduction to Classification
Background
Classification Types
Classification Methods
Applications
Features
Algorithms
Evolution of Websites
What is web page classification?
The process of assigning a web page to
one or more predefined category labels
(e.g., news, sports, business…)
Classification is generally posed as a
supervised learning problem
A set of labeled data is used to train a
classifier, which is then applied to label
future examples
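The train-then-label workflow above can be sketched in a few lines. This is only an illustration: the slides do not name a specific learner, so the multinomial naive Bayes model with add-one smoothing, the function names `train` and `classify`, and the toy training pages are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of training pages
    vocab = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_pages)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled data (invented for illustration).
training = [
    ("team wins the final match", "sports"),
    ("player scores a late goal", "sports"),
    ("stocks fall as markets close", "business"),
    ("company reports quarterly profit", "business"),
]
model = train(training)
predicted = classify(model, "the team scores a goal")
print(predicted)
```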
Background - Classification Types
Supervised learning problem broken into
sub problems:
Subject Classification
Functional Classification
Sentiment Classification
Other types of Classification
Subject Classification
Concerned with subject or topic of the
web page
Judging whether a page is about arts,
business, sports, etc…
Functional Classification
Role that the page is playing
Deciding a page to be a personal
homepage, course page, admissions page,
etc…
Sentiment Classification
Focuses on the opinion that is presented
in a web page
Other types of Classification
Such as genre classification and search
engine spam classification
Background - Classification
Methods
Binary vs. Multiclass
Single Label vs. Multi Label
Soft vs. Hard
Flat vs. Hierarchical
Binary vs. Multiclass Classification
Single-Label vs. Multi-Label
Classification
Soft vs. Hard Classification
Flat vs. Hierarchical Classification
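As a rough illustration of two of the distinctions listed above: a soft classifier reports a likelihood for every category, a hard classifier commits to a yes/no decision, and a multi-label classifier may keep more than one category. The sketch below assumes a classifier has already produced non-negative per-category scores; the function names and the threshold value are hypothetical.

```python
def soft_assignment(scores):
    """Soft: normalize raw scores into a likelihood per category."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def hard_single_label(scores):
    """Hard, single-label: commit to exactly one category."""
    return max(scores, key=scores.get)

def hard_multi_label(scores, threshold):
    """Hard, multi-label: keep every category whose likelihood passes the threshold."""
    probs = soft_assignment(scores)
    return sorted(label for label, p in probs.items() if p >= threshold)

# Invented scores for one page.
scores = {"news": 2.0, "sports": 6.0, "business": 2.0}
print(soft_assignment(scores))
print(hard_single_label(scores))
print(hard_multi_label(scores, threshold=0.15))
```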
Applications
Why is classification important and how
can we use it efficiently?
Constructing, maintaining, or
expanding web directories
Web directories provide an efficient way to
browse for information within a predefined
set of categories
Example:
Open Directory Project
Currently constructed by human effort
78,940 editors of ODP
Improving the quality of search
results
A major problem with search results is
query ambiguity
Helping question answering
systems
Can use classification systems to help
improve the quality of answers
Example: Wolfram Alpha
Other applications
Contextual advertising
Features
What features can we extract from a
web page to use to help classify it?
Features - Introduction
Because of features such as the hyperlink
<a> … </a>, webpage classification is vastly
different from other forms of classification
such as plaintext classification.
Features organized into two groups:
○ On-page features – directly located on page
○ Neighbor features – found on related pages
On Page Features
Textual Contents & Tags
Bag-of-words
○ N-gram feature
Rather than analyzing individual words, group them into
sequences of n consecutive words.
- Ex: New York vs. new ….. ….. York
Yahoo! has used a 5-gram feature
HTML tags – title, heading, metadata, main text
URL
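The n-gram idea from the list above can be sketched as follows. The helper name `word_ngrams` and the whitespace tokenization are assumptions for the example; a real system would also need to strip HTML and punctuation.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive lowercased words in text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# "new york" appears as a bigram only when the two words are adjacent.
adjacent = word_ngrams("I love New York", 2)
separated = word_ngrams("a new hotel near York", 2)
print("new york" in adjacent)
print("new york" in separated)
```

A plain bag-of-words representation would treat both sentences as containing "new" and "york"; the bigram feature tells them apart.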
On Page Features
Visual Analysis
Each page has two representations
○ Text via HTML
○ Visual via the browser
Each page can be represented as a visual
adjacency multigraph
Features of Neighbors
What happens when a page’s features
are missing or are unrecognizable?
Features of Neighbors
Assumptions
If page1 is in the neighborhood of many
“sports” pages then there is an increased
probability that page1 is also a “sports”
page.
Linked pages are more likely to have terms
in common
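One simple way to act on the first assumption is to estimate a page's label from the share of its neighbors that carry each label. This is a sketch only; the function name and the example labels are hypothetical, and the slides do not prescribe this particular estimate.

```python
from collections import Counter

def neighbor_label_distribution(neighbor_labels):
    """Fraction of neighboring pages carrying each label."""
    counts = Counter(neighbor_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Labels of the pages around page1 (invented for illustration).
dist = neighbor_label_distribution(["sports", "sports", "sports", "news"])
print(dist)
```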
Features of Neighbors
Neighbor Selection
Focus on pages within 2 steps of target
6 types: parent, child, sibling, spouse,
grandparent, and grandchild
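The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (a parent links to the target, a child is linked from it, siblings share a parent, spouses share a child) and a hypothetical `links` dictionary mapping each page to the set of pages it links to.

```python
def neighbors(links, target):
    """Group the pages within two link steps of target by relationship."""
    def children(page):
        return set(links.get(page, ()))

    def parents(page):
        return {src for src, outs in links.items() if page in outs}

    kids, pars = children(target), parents(target)
    groups = {
        "parent": pars,
        "child": kids,
        "sibling": {s for p in pars for s in children(p)},      # share a parent
        "spouse": {s for c in kids for s in parents(c)},        # share a child
        "grandparent": {g for p in pars for g in parents(p)},
        "grandchild": {g for c in kids for g in children(c)},
    }
    # The target is not its own neighbor.
    return {kind: pages - {target} for kind, pages in groups.items()}

# Toy graph: G -> P -> {T, S}, T -> C, X -> C, C -> D
links = {"G": {"P"}, "P": {"T", "S"}, "T": {"C"}, "X": {"C"}, "C": {"D"}}
result = neighbors(links, "T")
print(result)
```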
Features of Neighbors
Labels
Anchor Text
Surrounding Anchor Text
By using the anchor text, surrounding
text, and page title of a parent page in
combination with text from target page,
classification can be improved.
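A minimal sketch of that combination: words from the target page and words from parent-page snippets (anchor text, surrounding text, title) are merged into one weighted bag of words. The function name and the `neighbor_weight` value of 0.5 are assumptions for illustration, not values from the paper.

```python
from collections import Counter

def combined_bag(target_text, parent_snippets, neighbor_weight=0.5):
    """Merge target-page words with down-weighted words from parent pages."""
    bag = Counter()
    for word in target_text.lower().split():
        bag[word] += 1.0
    for snippet in parent_snippets:
        for word in snippet.lower().split():
            bag[word] += neighbor_weight  # hypothetical weight for neighbor text
    return bag

# The target page itself never mentions football, but its parents' anchors do.
bag = combined_bag("welcome home", ["football scores", "football news"])
print(bag["football"], bag["welcome"])
```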
Features of Neighbors
Implicit Links
Connections between pages that appear in
the results of the same query and are both
clicked by users
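Implicit links of this kind could be derived from a click log, as in the sketch below. The log format (pairs of query and clicked URL) and the example URLs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Pair up URLs that were clicked for the same query."""
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    pairs = set()
    for urls in clicked.values():
        # Sorting gives each unordered pair one canonical form.
        pairs.update(combinations(sorted(urls), 2))
    return pairs

log = [
    ("jaguar", "a.example/cars"),
    ("jaguar", "b.example/autos"),
    ("python", "c.example/docs"),
]
pairs = implicit_links(log)
print(pairs)
```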
Algorithms
What are the algorithmic approaches to
webpage classification?
Dimension reduction
Relational learning
Hierarchical classification
Information combination
Dimension Reduction
Boost classification by emphasizing
certain features that are more useful in
classification
Feature Weighting
○ Reduces the dimensions of feature space
○ Reduces computational complexity
○ Classification more accurate as a result of
reduced space
Dimension Reduction
Methods
Use first fragment
K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML Tags
○ Metrics
Expected mutual information
Mutual information
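As a sketch of how such a metric scores a feature, the code below computes the expected mutual information (in bits) between the presence of a term and the class label, using the standard definition: the sum over term presence t and class c of P(t, c) · log2(P(t, c) / (P(t) · P(c))). The document representation (a set of words plus a label) and the toy data are assumptions for the example.

```python
import math
from collections import Counter

def expected_mutual_information(docs, term):
    """docs: list of (set_of_words, label). Returns I(term presence; label) in bits."""
    n = len(docs)
    joint = Counter((term in words, label) for words, label in docs)
    term_counts, label_counts = Counter(), Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        label_counts[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_term * p_label))
    return mi

docs = [
    ({"the", "goal", "team"}, "sports"),
    ({"the", "goal", "match"}, "sports"),
    ({"the", "market", "profit"}, "business"),
    ({"the", "stocks", "market"}, "business"),
]
informative = expected_mutual_information(docs, "goal")   # separates the classes
uninformative = expected_mutual_information(docs, "the")  # appears everywhere
print(informative, uninformative)
```

Terms with low scores are candidates for removal, which is one way to shrink the feature space.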
Relational Learning
Relaxation Labeling
Hierarchical Classification
Based on “divide and conquer”
Classification problems split into hierarchical
set of sub problems.
Error Minimization
When the classifier for a lower-level category is
uncertain whether a page belongs to it, shift the
assignment one level up.
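The descend-then-back-off behavior can be sketched as a walk down a category tree that stops at the parent whenever no child is confident enough. The keyword-overlap `confidence` function, the tree layout, and the 0.5 threshold are hypothetical stand-ins for trained per-node classifiers.

```python
def confidence(node, page_words):
    """Hypothetical stand-in for a trained classifier: keyword overlap ratio."""
    return len(node["keywords"] & page_words) / len(node["keywords"])

def classify_hierarchical(root, page_words, threshold=0.5):
    """Descend while some child is confident; otherwise stay one level up."""
    current = root
    while current["children"]:
        scored = [(confidence(child, page_words), child) for child in current["children"]]
        best_conf, best_child = max(scored, key=lambda pair: pair[0])
        if best_conf < threshold:
            break  # uncertain at the lower level: keep the parent category
        current = best_child
    return current["name"]

tree = {
    "name": "Top", "keywords": set(), "children": [
        {"name": "Sports", "keywords": {"team", "game"}, "children": [
            {"name": "Soccer", "keywords": {"goal", "pitch"}, "children": []},
            {"name": "Tennis", "keywords": {"racket", "serve"}, "children": []},
        ]},
        {"name": "Business", "keywords": {"market", "profit"}, "children": []},
    ],
}
specific = classify_hierarchical(tree, {"team", "game", "goal", "pitch"})
backed_off = classify_hierarchical(tree, {"team", "game", "score"})
print(specific, backed_off)
```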
Information Combination
Combine several methods into one
Information from different sources is used
to train multiple classifiers, and the collective
work of those classifiers makes the final
decision.
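One simple way to let several classifiers reach a collective decision is a majority vote, sketched below. Voting is only one possible combination rule and is an assumption here; the source names used as dictionary keys are hypothetical.

```python
from collections import Counter

def combine_by_vote(predictions):
    """predictions maps each information source to the label its classifier chose."""
    votes = Counter(predictions.values())
    label, _count = votes.most_common(1)[0]
    return label

# Each classifier was trained on a different source of information.
predictions = {"page_text": "sports", "anchor_text": "sports", "url": "news"}
final = combine_by_vote(predictions)
print(final)
```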
Conclusion
Webpage classification is a type of
supervised learning problem aiming to
categorize a webpage into a predefined
set of categories.
In the future, efforts will most likely be
focused on effectively combining content
and link information to build a more
accurate classifier
Evolution of Websites
Apple in 1998
Evolution of Websites
Apple in 2008
Evolution of Websites
Nike in 2000
Evolution of Websites
Nike in 2008
Evolution of Websites
Yahoo in 1996
Evolution of Websites
Yahoo in 2008
Evolution of Websites
Microsoft in 1998
Evolution of Websites
Microsoft in 2008
Evolution of Websites
MTV in 1998
Evolution of Websites
MTV in 2008
Sources
Web Page Classification: Features and Algorithms
by Xiaoguang Qi & Brian D. Davison
Visual Adjacency Multigraphs – A Novel Approach for a
Web Page Classification
by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko
Milutinovic
The Evolution of Websites
http://www.wakeuplater.com/website-building/evolution-of-websites-10popular-websites.aspx