
Practical Issues for Automated Categorization of Web Sites
John M. Pierre
[email protected]
Metacode Technologies, Inc.
139 Townsend Street
San Francisco, CA 94107
(Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)
Outline
Project overview
Web content
•Automated Categorization
•Feature Selection
•Metadata
Experimental Setup
•Data
•Targeted Spidering
•System Architecture
Results
Conclusions
Project Overview
Specific:
•Categorize large number of domain names by industry category
•NAICS classification scheme
•~30,000 domain names for testing (.com)
•Text categorization approach
General:
•Domain specific classification
•Metadata
•Targeted spidering
•Feature selection
•Classifier training
Web Content: Automated Categorization
Challenges:
•Vast (over 1 Billion pages)
•Heterogeneous (content, formats, not just HTML)
•Dynamic (growing, changing)
Benefits:
•Good source of information
•Accessible!
•Machine readable (vs. machine understandable)
•Semi-structured
Tools:
•Classification
•Automated classification
•Text Categorization/Machine Learning
•Intelligent agents
Related Work
Manual:
•Yahoo!
•Open Directory Project
•Looksmart
Automatic:
•Northern Light
•Thunderstone/Texis
•Inktomi
Other:
•EU Project DESIRE II
•Pharos
•Attardi, Sebastiani et al
•L. Page et al
•McCallum et al
Web Content: Feature Selection
Text Features: (D. Lewis)
•Relatively few in number
•Moderate in frequency of assignment
•Low in redundancy
•Low in noise
•Related in semantic scope to the classes to be assigned
•Relatively unambiguous in meaning
Preliminary Experiment
•1125 web domains
•SEC+NAICS training set
                 Precision  Recall  micro F1
Body             0.47       0.34    0.39
Body + Metatags  0.55       0.34    0.42
Metatags         0.64       0.39    0.48

Use metadata if possible, use body text as a last resort!
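The metadata-first strategy from the table above can be sketched in Python with the stdlib `html.parser`; the class and function names here are illustrative, not from the talk:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collect <title>, meta description/keywords, and body text from a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # {"description": ..., "keywords": ...}
        self.body_text = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content", "")
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self.body_text.append(data.strip())

def features_for(html):
    """Prefer metadata features; fall back to body text only when none exist."""
    p = MetadataExtractor()
    p.feed(html)
    meta_text = " ".join([p.title, p.meta.get("description", ""),
                          p.meta.get("keywords", "")]).strip()
    return meta_text if meta_text else " ".join(t for t in p.body_text if t)
```

With title and keyword metatags present, body text is never consulted; only a metadata-free page falls through to the `<body>` content.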
Web Content: Metadata
[Chart: percentage of web pages whose Title, Meta-Description, Meta-Keywords, and Body contain 0, 1-10, 11-50, or 51 or more words]
Experimental Setup: Targeted Spidering
[Flowchart] For each domain name:
1. Live? HTTP GET the home page; if no response, try the www. prefix.
2. Frames? If so, fetch the frame sources; otherwise use the <body>.
3. Metatags? If present, use them; if not, follow <a href=...> links containing: prod, service, about, info, press, news.
4. Send the collected 'query' pages on as the query.
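A minimal sketch of this targeted-spidering flow, assuming plain HTTP and regex link extraction (function names are illustrative; only the pure helpers avoid the network):

```python
import re
import urllib.request, urllib.error

# Anchor keywords the flowchart targets.
TARGET_WORDS = ("prod", "service", "about", "info", "press", "news")

def has_metatags(html):
    """True if the page carries description/keywords metatags."""
    return bool(re.search(r'<meta[^>]+name=["\']?(description|keywords)',
                          html, re.I))

def frame_sources(html):
    """src attributes of <frame> tags, so we can descend past framesets."""
    return re.findall(r'<frame[^>]+src=["\']?([^"\'> ]+)', html, re.I)

def target_links(html):
    """hrefs suggesting product/service/about/info/press/news pages."""
    return [h for h in re.findall(r'<a[^>]+href=["\']?([^"\'> ]+)', html, re.I)
            if any(w in h.lower() for w in TARGET_WORDS)]

def fetch(url, timeout=10):
    """HTTP GET; None means the host did not answer (the 'live?' test)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError):
        return None

def query_pages(domain):
    """Walk the flowchart: live? -> frames? -> metatags? -> targeted links."""
    page = fetch(f"http://{domain}/") or fetch(f"http://www.{domain}/")
    if page is None:
        return []
    pages = [page]
    for src in frame_sources(page):       # use frame bodies, not the frameset
        sub = fetch(f"http://{domain}/{src.lstrip('/')}")
        pages += [sub] if sub else []
    if not has_metatags(page):            # no metadata: spider the key links
        for href in target_links(page):
            sub = fetch(f"http://{domain}/{href.lstrip('/')}")
            pages += [sub] if sub else []
    return pages
```

Restricting the spider to a handful of keyword-matched links per domain is what keeps ~30,000 domains tractable.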
Experimental Setup: Data
Classification scheme: NAICS
11     Agriculture, Forestry, Fishing and Hunting
21     Mining
23     Construction
31-33  Manufacturing
42     Wholesale Trade
44-45  Retail Trade
48-49  Transportation and Warehousing
51     Information
52     Finance and Insurance
53     Real Estate and Rental and Leasing
54     Professional, Scientific and Technical Services
55     Management of Companies and Enterprises
56     Admin. Support, Waste Mgmt and Remediation Srvcs
61     Educational Services
62     Health Care and Social Assistance
71     Arts, Entertainment & Recreation
72     Accommodation and Food Services
81     Other Services (except 92)
92     Public Administration
99     Unclassified Establishments
Test Data
~30,000 domain names (SIC)
~13,500 pre-classified/content
Training Data
“SEC-NAICS”:
•1504 SEC 10-K filings (SIC)
•426 NAICS labels/descriptions
“Web pages”:
•3618 pre-classified domains
Crosswalk
•SIC <-> NAICS
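The crosswalk amounts to a lookup table from SIC codes to NAICS codes; a sketch with three illustrative pairs (the real concordance covers every 1987 SIC code):

```python
# Illustrative fragment of a SIC -> NAICS crosswalk; the real table maps
# all 1987 SIC codes to 1997 NAICS codes (these three pairs are examples).
SIC_TO_NAICS = {
    "7372": "511210",   # Prepackaged Software -> Software Publishers
    "5812": "722110",   # Eating Places -> Full-Service Restaurants
    "6021": "522110",   # National Commercial Banks -> Commercial Banking
}

def naics_sector(naics_code):
    """The top-level NAICS sector is the first two digits of the code."""
    return naics_code[:2]

def relabel(sic_code):
    """Map a SIC-labeled training example to its NAICS sector label."""
    naics = SIC_TO_NAICS.get(sic_code)
    return naics_sector(naics) if naics else "99"   # 99 = Unclassified
```

Relabeling through the crosswalk is what lets SIC-coded test domains and SEC filings serve as NAICS training and evaluation data.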
Experimental Setup: System Architecture
[Diagram] Domain names → Spider → The Web; spidered text becomes a query to the IR Engine, indexed over the SEC-NAICS and web-page training documents; matching documents feed a Decision step, e.g. Foo.com → 11, 21, 23
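The query-then-decide step can be sketched as nearest-neighbor retrieval over bag-of-words vectors; this is a toy stand-in for the IR engine, with illustrative names and thresholds:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(page_text, training_docs, top_n=3, threshold=0.1):
    """Use spidered text as a query against labeled training documents;
    the decision assigns the categories of the best-matching documents."""
    query = Counter(page_text.lower().split())
    scores = Counter()
    for label, doc in training_docs:
        scores[label] = max(scores[label],
                            cosine(query, Counter(doc.lower().split())))
    return [lab for lab, s in scores.most_common(top_n) if s >= threshold]
```

A similarity threshold keeps the decision step robust to noisy spidered text: weak matches simply yield no label rather than a wrong one.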
Results
P = Precision = # correctly assigned / # assigned
R = Recall = # correctly assigned / # total correct
F1 = 2PR / (P + R)
micro-averaged = computed over all categories pooled
macro-averaged = computed per category, then averaged
           micro P  micro R  micro F1  macro P  macro R  macro F1
SEC-NAICS  0.66     0.35     0.45      0.23     0.18     0.09
Web pages  0.71     0.75     0.73      0.70     0.37     0.40
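The micro/macro metrics above can be computed as follows; a sketch assuming per-category sets of assigned and correct documents, with macro F1 taken as the average of per-category F1 scores (one common convention):

```python
def f1(p, r):
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(assigned, correct, categories):
    """Micro- and macro-averaged P/R/F1.
    assigned/correct map each category to its set of documents."""
    tp = {c: len(assigned.get(c, set()) & correct.get(c, set())) for c in categories}
    n_asgn = {c: len(assigned.get(c, set())) for c in categories}
    n_corr = {c: len(correct.get(c, set())) for c in categories}
    # Micro: pool the counts over all categories, then compute P and R once.
    P = sum(tp.values()) / max(sum(n_asgn.values()), 1)
    R = sum(tp.values()) / max(sum(n_corr.values()), 1)
    micro = (P, R, f1(P, R))
    # Macro: compute P, R, F1 per category, then average each.
    ps = [tp[c] / n_asgn[c] if n_asgn[c] else 0.0 for c in categories]
    rs = [tp[c] / n_corr[c] if n_corr[c] else 0.0 for c in categories]
    fs = [f1(p, r) for p, r in zip(ps, rs)]
    macro = (sum(ps) / len(ps), sum(rs) / len(rs), sum(fs) / len(fs))
    return micro, macro
```

Averaging per-category F1 explains why macro F1 can sit well below F1(macro P, macro R): rare categories with zero scores drag the mean down.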
Conclusions
Domain Specific Classification
•Knowledge Gathering
  –Use of specialized knowledge
•Targeted Spidering
  –Efficient use of resources
  –Extract key features, Metadata
•Training
  –Prior knowledge
  –Bootstrapping
•Classification
  –Robust, tolerant of noisy data
Benefits of Semantic Web
•Better Metadata
•Semantic linking & intelligent spidering