Web Archaeology
Raymie Stata
Compaq Systems Research Center
[email protected]
www.research.compaq.com/SRC
What is Web Archaeology?
The study of the content of the Web
– exploring the Web
– sifting through data
– making valuable discoveries
Difficult! Because the Web is:
– Boundless
– Dynamic
– Radically decentralized
Some recent results
Empirical studies
– Quality of almost-breadth-first crawling
– Structure of the Web
– Random walks (size of search engines)
Improving the Web experience
– Better and more precise search results
– Surfing assistants and tools
Data mining
– Technologies for “page scraping”
Tools for “Web scale” research
The tools form a layered stack, from data collection at the bottom to applications at the top:
Apps: use the data
• Search quality
• Crawl quality
• Duplicate elimination
• Web characterization
Feature databases: fast access to subsets of the data
• Full-text index, shingleprints
• Connectivity, term vectors
Data storage: store and access web pages
• Myriad
Data collection: download web pages
• Mercator, Web Language
Web-scale crawling
Mercator
Atrax
The Mercator web crawler
A high-performance web crawler
– Downloads and processes web pages
Written entirely in Java
– Runs on any Java-capable platform
Extensible
– Wide range of possible configurations
– Users can plug in new modules at run-time
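To make the extensibility point concrete, here is a minimal sketch of a pluggable processing-module interface in the spirit of Mercator's design. The names (ProcessingModule, LinkExtractor, register) are illustrative assumptions, not Mercator's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in interface: each module sees every downloaded page.
interface ProcessingModule {
    void process(String url, byte[] content);
}

// An example module; a real one would parse HTML and enqueue new URLs.
class LinkExtractor implements ProcessingModule {
    public void process(String url, byte[] content) {
        System.out.println("extracting links from " + url);
    }
}

class CrawlerCore {
    private final List<ProcessingModule> modules = new ArrayList<>();

    // Modules are registered from configuration, not compiled in.
    void register(ProcessingModule m) { modules.add(m); }

    void onPageDownloaded(String url, byte[] content) {
        for (ProcessingModule m : modules) m.process(url, content);
    }

    public static void main(String[] args) {
        CrawlerCore core = new CrawlerCore();
        core.register(new LinkExtractor());
        core.onPageDownloaded("http://example.com/", new byte[0]);
    }
}
```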
Mercator design points
Extensible
– Well-chosen extensibility points
– Framework for configuration
Multiple threads, synchronous I/O
– vs. single thread, asynchronous I/O
Checkpointing
– Allows crawls to be restarted
– Modules export prepare, commit
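A minimal sketch of the prepare/commit idea, assuming a hypothetical Checkpointable interface rather than Mercator's actual one: every module flushes its state in phase one, and the checkpoint is adopted only when all modules commit in phase two, so a restarted crawl never sees a half-written checkpoint.

```java
import java.io.IOException;

// Hypothetical two-phase checkpoint protocol.
interface Checkpointable {
    void prepare() throws IOException;  // phase 1: write state to a temp file
    void commit() throws IOException;   // phase 2: atomically adopt the new state
}

class FrontierQueue implements Checkpointable {
    public void prepare() { /* serialize queued URLs to frontier.ckpt.tmp */ }
    public void commit()  { /* rename frontier.ckpt.tmp over frontier.ckpt */ }
}

class CheckpointCoordinator {
    // If any prepare() fails, nothing is committed and the old
    // checkpoint remains valid for restart.
    void checkpoint(Checkpointable[] modules) throws IOException {
        for (Checkpointable m : modules) m.prepare();
        for (Checkpointable m : modules) m.commit();
    }
}
```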
System Architecture
Crawl quality
Atrax, a distributed version of Mercator
Distributes load across cluster of crawlers
Partitions data structures across crawlers
No central bottleneck
Network bandwidth is limiting factor
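A common way to partition the URL space in a distributed crawler, sketched below, is to hash the hostname so all URLs from a host land on the same node, which also keeps per-host politeness decisions local. Whether Atrax partitions exactly this way is an assumption here.

```java
import java.net.URI;

class UrlPartitioner {
    private final int numCrawlers;

    UrlPartitioner(int numCrawlers) { this.numCrawlers = numCrawlers; }

    // Hash only the host, so http://a.com/x and http://a.com/y
    // are always assigned to the same crawler.
    int crawlerFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) host = url;  // fall back for odd URLs
        return Math.floorMod(host.hashCode(), numCrawlers);
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(8);
        System.out.println(p.crawlerFor("http://kiwi.emse.fr/"));
    }
}
```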
Performance of Atrax vs Mercator
[Chart: downloads per day for the 2000/1/31 Mercator crawl vs. the 2000/12/06 Atrax crawl; y-axis: documents, 0 to 70,000,000; x-axis: days 1 to 10]
Myriad -- new project
A very large, archival storage system
– Scalable to a petabyte
With function shipping
– Supports data mining
Myriad Requirements
Large (up to 10K disks)
Commodity hardware (low cost)
Easy to manage
Easy to use (queries vs. code)
Fault tolerance & containment
– No backups, tape or otherwise
Two phases of Myriad project
Define service-level interface
– Implemented to run on collections of files
– Testing and tuning
Build scalable implementation
– Cluster storage and processing
– Designing now, prototype in summer
– Won’t describe today
New service level interface
[Stack: applications sit on a storage service (file systems, databases, Myriad), which sits on blocks]
Better suited to this problem and scale
Supports “function shipping”
Myriad interface
Single table database
Stored vs. virtual columns
– Virtual columns computed by injected code
Bulk input of new records
Management of code defining virtual columns
Output via select/project queries
– User-defined code run implicitly
– Support for repeatable random sampling
Example Myriad query
[samplingprob=0.1,
samplingseed=321223421332]
select name, length where
insertionDate < Date(00/01/01)
&& mimeType == "text/html";
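To illustrate the two injected-code ideas, here is a hypothetical sketch of a user-defined virtual column plus seeded sampling that is repeatable because the same record id and seed always produce the same decision. All names are illustrative, not Myriad's actual interface:

```java
import java.util.Random;

// Hypothetical injected code: computes a virtual column from the
// stored record, run implicitly during select/project.
interface VirtualColumn<T> {
    T compute(byte[] record);
}

class MimeTypeColumn implements VirtualColumn<String> {
    public String compute(byte[] record) {
        // A real implementation would sniff the stored HTTP headers.
        return "text/html";
    }
}

class RepeatableSampler {
    // Repeatable random sampling: deterministic in (recordId, seed),
    // so re-running the query yields the same sample.
    static boolean include(long recordId, long seed, double samplingProb) {
        return new Random(recordId ^ seed).nextDouble() < samplingProb;
    }
}
```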
Model for large-scale data mining
Step 1: make an extract
– Do data-parallel select and project
– Don’t do any sorts, joins, groupings
Step 2: put extract into high-power analysis tool
– Sorts, joins, groupings
Feature Databases
URL DB
– URL → pgid
Host DB
– pgid → hostid
Link DB
– out: pgid → pgid*
– in: pgid → pgid*
Term vector DB
– pgid → term vector
URL database: prefix compression
http://kiva.net/~markh/surnames.html
http://kiwi-us.com/~amigo/links/index.htm
http://kiwi.emse.fr/
http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
http://kiwi.etri.re.kr/~ksw/bookmark
http://kiwi.futuris.net/linen
http://kiwi.futuris.net/linen/special/backiss.html
Prefix compress
0 http://kiva.net/~markh/surnames.html
9 wi-us.com/~amigo/links/index.htm
11 .emse.fr/
13 tri.re.kr/~khshim/internet/bookmark.html
25 sw/bookmark
12 futuris.net/linen
29 /special/backiss.html
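A minimal sketch of this scheme: with the URLs sorted, each entry stores the length of the prefix shared with its predecessor plus the remaining suffix.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixCompressor {
    // Emit each URL as "<shared-prefix-length> <suffix>".
    static List<String> compress(List<String> sortedUrls) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int n = sharedPrefixLength(prev, url);
            out.add(n + " " + url.substring(n));
            prev = url;
        }
        return out;
    }

    static int sharedPrefixLength(String a, String b) {
        int n = 0, max = Math.min(a.length(), b.length());
        while (n < max && a.charAt(n) == b.charAt(n)) n++;
        return n;
    }

    public static void main(String[] args) {
        compress(List.of(
            "http://kiva.net/~markh/surnames.html",
            "http://kiwi-us.com/~amigo/links/index.htm",
            "http://kiwi.emse.fr/")).forEach(System.out::println);
    }
}
```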
URL compression
Prefix compression
– 44 → 14.5 bytes/URL
– Fast to decompress (~10 µs)
ZIP compression
– 14.5 → 9.2 bytes/URL
– Slow to decompress (~80 µs)
Term vector basics
• Basic abstraction for information retrieval
• Useful for measuring “semantic” similarity of text
Document Id | Automobile | Carburetor | Feline | Jaguar | …
Doc 1       | 2          | 3          | 0      | 2      | …
Doc 2       | 0          | 0          | 2      | 2      | …
Doc 3       | 2          | 0          | 0      | 2      | …
• A row in the above table is a “term vector”
• Columns are word stems and phrases
• Trying to capture “meaning”
Compressing term vectors
Sparse representation
– Only store columns with non-zero counts
Lossy representation
– Only store “important” columns
– “Importance” determined by:
Count of term on page (high ==> important)
Number of pages with term (low ==> important)
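A sketch of a sparse, lossy vector under these rules: keep only non-zero terms, rank them by a tf-idf-style score (count on the page pushes importance up, number of pages containing the term pushes it down), and truncate to the top k. The exact weighting used at SRC is an assumption here.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

class SparseTermVector {
    // counts: term -> occurrences on this page
    // docFreq: term -> number of pages containing the term
    static Map<String, Integer> compress(Map<String, Integer> counts,
                                         Map<String, Integer> docFreq,
                                         int totalDocs, int k) {
        Comparator<Map.Entry<String, Integer>> byImportanceDesc =
            Comparator.comparingDouble(e ->
                -e.getValue() * Math.log((double) totalDocs
                        / docFreq.getOrDefault(e.getKey(), 1)));
        Map<String, Integer> kept = new LinkedHashMap<>();
        counts.entrySet().stream()
              .filter(e -> e.getValue() > 0)   // sparse: non-zero only
              .sorted(byImportanceDesc)
              .limit(k)                        // lossy: top-k by importance
              .forEach(e -> kept.put(e.getKey(), e.getValue()));
        return kept;
    }
}
```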
TVDB Builder
Applications
Categorizing pages
Topic distillation
Filtering pages
Identifying languages
Identifying running text
Relevance feedback (“more like this”)
Abstracting pages
Categorization
“Bulls take over”
How to categorize a page
Off line:
– Collect training set of pages per category (~30K)
– Combine training pages into category vectors
~10K terms per category vector
On line:
– Use term vector DB to look up vector of page
– Find category vector that best matches this page vector
Use a Bayesian classifier to match vectors
Give no category if match not “definitive”
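The slides call for a Bayesian classifier; as a simpler stand-in showing the shape of the online step, this sketch scores the page vector against each category vector with cosine similarity and returns no category when the best match falls below a threshold.

```java
import java.util.Map;

class Categorizer {
    // Cosine similarity between sparse vectors (term -> count).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += (double) e.getValue() * bv;
        }
        for (int v : b.values()) nb += (double) v * v;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Null means "no definitive category".
    static String categorize(Map<String, Integer> pageVector,
                             Map<String, Map<String, Integer>> categoryVectors,
                             double threshold) {
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> c : categoryVectors.entrySet()) {
            double s = cosine(pageVector, c.getValue());
            if (s > bestScore) { bestScore = s; best = c.getKey(); }
        }
        return bestScore >= threshold ? best : null;
    }
}
```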
Topic drift in topic distillation
Some Web IR algorithms have this structure:
– Compute a “seed set” on a query
– Find neighborhood by following links
– Rank this neighborhood
Topic drift (a problem):
– The neighborhood graph includes off-topic nodes
“Download Microsoft Explorer” → MS Home page
Avoid topic drift with term vectors
Combine term vectors of seed set into topic vector
Detecting topic drift in neighboring nodes:
– Combine topic vector with node’s term vector
Inner product works fine
– Expunge or weight
Integration of feature databases helps!!
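A short sketch of the pruning step: score each neighborhood node by the inner product of its term vector with the topic vector, then expunge (or down-weight) nodes below a cutoff. The cutoff value is an assumption.

```java
import java.util.Map;

class DriftFilter {
    static double innerProduct(Map<String, Integer> topic,
                               Map<String, Integer> node) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : topic.entrySet()) {
            Integer v = node.get(e.getKey());
            if (v != null) dot += (double) e.getValue() * v;
        }
        return dot;
    }

    // Nodes scoring below the cutoff are treated as off-topic.
    static boolean onTopic(Map<String, Integer> topic,
                           Map<String, Integer> node, double cutoff) {
        return innerProduct(topic, node) >= cutoff;
    }
}
```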
Link database
Goals
– Fit links into RAM (fast lookup)
– Build in 24 hours
Applications
– Page ranking
– Web structure
– Mirror site detection
– Related page detection
Link storage baseline design
[Diagram: a “Starts” array, indexed by page id (104 to 108), points into a “Links” array of destination page ids (106, 115, 101, 72, 208, 111)]
Link storage: deltas
[Diagram: the same Starts/Links structure, with destination ids replaced by small deltas (2, 9, -4, -31, 136, 4) that compress well]
Link storage: compression
Variable-length encoding: 1.7 bytes/link
[Diagram: the deltas (2, 9, -4, -31, 136, 4) encoded in 4, 8, 8, 8, 12, and 8 bits respectively]
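The slides don't give the bit-level format (their code reaches 4-bit granularity, while this sketch uses whole bytes), but the general delta-plus-variable-length technique looks like this: sort each page's destination ids, take differences starting from the source id, zig-zag them since deltas can be negative, and emit varints.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class LinkEncoder {
    static byte[] encode(int sourceId, int[] destIds) {
        int[] d = destIds.clone();
        Arrays.sort(d);                        // sorted ids give small gaps
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = sourceId;                   // first delta is from the source id
        for (int id : d) {
            writeVarint(out, zigZag(id - prev));
            prev = id;
        }
        return out.toByteArray();
    }

    // Map signed deltas to unsigned so small magnitudes stay small.
    static int zigZag(int v) { return (v << 1) ^ (v >> 31); }

    // 7 bits of payload per byte; high bit marks continuation.
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    public static void main(String[] args) {
        byte[] enc = encode(104, new int[]{106, 115});  // deltas 2 and 9
        System.out.println(enc.length + " bytes for 2 links");
    }
}
```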
LDBng
The future of Web Archaeology
Driving applications
– Web search -- “finding things on the web”
Page classification (topic, community, type)
Purpose-specific search
– Web “asset management” (what’s on my site?)
– Automated information extraction (price robots)
Multi-billion page web
Dynamics