hypertext-icml011 - Carnegie Mellon University

Hypertext Categorization using
Hyperlink Patterns and Meta
Data
Rayid Ghani
Seán Slattery
Yiming Yang
Carnegie Mellon University
How is hypertext different?
 Link Information (possibly useful but noisy)
 Diverse Authorship
 Short text - topic not obvious from the text
 Structure / position within the web graph
 Author-supplied features (meta-tags)
 External Sources of Information (Meta-Data)
 Bold, italics, headings, etc.
Goal
 Present several hypotheses about regularities
in hypertext classification tasks
 Describe methods to exploit these regularities
 Evaluate the different methods and
regularities on real-world hypertext datasets
Regularities in Hypertext
 No Regularity
 “Encyclopedia” Regularity
 “Co-Referencing” Regularity
 Partial “Co-Referencing” Regularity
 Preclassified Regularity
 Meta-Data Regularity
Regularities in Hypertext
No Regularity
The documents are linked at random, or at least independently of the
document class
“Encyclopedia” Regularity
The majority of linked documents share the same class as the linking document.
Encyclopedia articles generally reference other articles which are
topically similar.
“Co-Referencing” Regularity
Documents with the same class tend to link to documents not of that
class, but which are topically similar to each other.
University student index pages tend not to link to other student index
pages, but do link mostly to home pages of students.
Regularities in Hypertext
Partial “Co-Referencing” Regularity
“Co-Referencing” regularity with more than a few “noisy” links
Many students may point to pages about their hobbies, but also link to
a wide variety of other pages which are less unique to student home
pages
Pre-Classified Regularity
Either one page, or some small set of pages, may contain lists of
hyperlinks to pages that are mostly members of the same class
Any page from the Yahoo topic hierarchy
Meta-Data Regularity
Metadata available from external sources on the web can be exploited
in the form of additional features.
Movie reviews for movie classification, online discussion boards for
various other topic classification tasks
Ignore Links
Use standard text classifiers on the text of the document itself
This also serves as the baseline
Use All the Text From Neighbors
Augment the text of each document with the text of its neighbors
This adds more topic-related words to the document.
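As a rough sketch, the augmentation amounts to concatenating token lists. The pages and links below are hypothetical illustrative data, not from the paper:

```python
# Sketch: augment each page's tokens with the tokens of pages it links to.
# `pages` and `links` are hypothetical toy data.

pages = {
    "a.html": ["machine", "learning"],
    "b.html": ["hypertext", "classification"],
    "c.html": ["web", "graph"],
}
links = {"a.html": ["b.html", "c.html"], "b.html": [], "c.html": []}

def augment_with_neighbors(page, pages, links):
    """Return the page's own tokens plus all tokens from linked pages."""
    tokens = list(pages[page])
    for neighbor in links.get(page, []):
        tokens.extend(pages[neighbor])
    return tokens

print(augment_with_neighbors("a.html", pages, links))
```

The augmented token list then feeds into the same bag-of-words classifier as before.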
Use All the Text From Neighbors Separately
Add the words of linked documents, but treat them as if they come
from a separate vocabulary.
A simple way to do this is to prefix the words in the linked documents
with a tag, such as linkedword:
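A minimal sketch of this tagging, using the linkedword: prefix mentioned above on hypothetical tokens:

```python
# Sketch: keep neighbor words in a separate vocabulary by prefixing them,
# so a local "java" never collides with a linked page's "java".
# Token data here is hypothetical.

def tag_neighbor_tokens(own_tokens, neighbor_tokens, prefix="linkedword:"):
    """Combine local tokens with prefix-tagged tokens from linked pages."""
    return list(own_tokens) + [prefix + t for t in neighbor_tokens]

features = tag_neighbor_tokens(["java", "course"], ["java", "tutorial"])
print(features)
```

The classifier can then learn separate weights for a word occurring locally versus in a neighbor.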
Look for Linked Document Subsets
Search for topically similar linked pages
At the top level, this is a clustering problem to find similar documents
among all the documents linked to documents in the same class.
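One crude stand-in for such clustering, assuming hypothetical token sets for the linked pages, is to group pages by pairwise Jaccard similarity; a real system would use a proper clustering algorithm:

```python
# Sketch: among pages linked from documents of one class, find topically
# similar pairs via Jaccard similarity of their token sets. Toy data.

linked_pages = {
    "home1.html": {"student", "homepage", "cs"},
    "home2.html": {"student", "homepage", "research"},
    "hobby.html": {"hiking", "photos"},
}

def jaccard(a, b):
    """Overlap of two token sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b)

def similar_pairs(pages, threshold=0.3):
    """Return page pairs whose token sets overlap at least `threshold`."""
    names = sorted(pages)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if jaccard(pages[p], pages[q]) >= threshold
    ]

print(similar_pairs(linked_pages))
```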
Use the Identity of the Linked Documents
Search for these pages by representing each page with only the names
of the pages it links to.
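A minimal sketch of this identity-based representation, with hypothetical link data: each page becomes a binary vector over the names of possible link targets:

```python
# Sketch: represent each page only by the identities (names) of the pages
# it links to, as a binary feature vector. Link data is hypothetical.

links = {
    "student1.html": ["hobby.html", "dept.html"],
    "student2.html": ["dept.html", "cv.html"],
}

# Feature space: every page name that appears as a link target.
vocab = sorted({target for targets in links.values() for target in targets})

def identity_features(page):
    """Binary vector: 1 if this page links to that target, else 0."""
    targets = set(links[page])
    return [1 if name in targets else 0 for name in vocab]

print(vocab)                              # -> ['cv.html', 'dept.html', 'hobby.html']
print(identity_features("student1.html"))  # -> [0, 1, 1]
```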
Use External Features / Meta-Data
Use information extraction techniques to collect features that relate
two or more of the entities/documents being classified. These extracted
features can then be used in the same ways: via the identity of the
related documents, or via the text of related documents.
Learning Algorithms Used
 Naïve Bayes (NB)
– Probabilistic, Builds a Generative Model
 k Nearest Neighbor (kNN)
– Example-based
 First Order Inductive Learner (FOIL)
– Relational Learner
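As an illustration of the first of these, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing on toy token data; it is not the paper's implementation:

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes sketch with add-one smoothing.
# Training data is hypothetical toy data.

train = [
    (["machine", "learning", "model"], "ml"),
    (["learning", "theory"], "ml"),
    (["web", "page", "link"], "web"),
    (["hypertext", "link", "web"], "web"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # label -> word -> count
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior + sum of smoothed log likelihoods
        score = math.log(class_counts[label] / len(train))
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(["learning", "model"]))  # -> "ml"
print(predict(["link", "page"]))       # -> "web"
```

kNN and FOIL would consume the same token or relational features but are omitted here for brevity.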
Datasets
 Hoovers
– A collection of up to 50 web pages from 4285 companies
(as used in Ghani et al. 2000)
– Two types of classifications (labels obtained from
www.hoovers.com)
• Coarse-grained Classification - 28 classes
• Fine-grained Classification - 255 classes
 University Dataset
– 4165 web pages collected by the WebKB Group at CMU
– Web pages classified into student, faculty, course,
project, staff, other
Accuracy for 28 Class Task
[Figure: F1 scores (0-80) for NB, kNN, and FOIL under six feature
representations: Page Only, All Linked Words, All Linked Words Tagged,
Names of Linked Companies, Names of Competitors, All Competitors' Words]
Accuracy for 255 Class Task
[Figure: F1 scores (0-60) for NB, kNN, and FOIL under the same six
feature representations: Page Only, All Linked Words, All Linked Words
Tagged, Names of Linked Companies, Names of Competitors, All
Competitors' Words]
Accuracy Vs. Feature Size
[Figure: Accuracy (roughly 5-35) vs. number of features (100 to 50,000)
for FOIL, kNN, and NB]
Conclusions
 Hyperlinks can be extremely noisy and harmful
for classification
 Meta-Data about websites can be useful and
techniques for automatically finding meta-data
should be explored
 Naïve Bayes and kNN are suitable since they scale
well with the noise and feature-set size, while
FOIL has the power to discover relational
regularities that the others cannot explicitly
identify.