Web Archaeology
Raymie Stata
Compaq Systems Research Center
[email protected]
www.research.compaq.com/SRC
What is Web Archaeology?
 The study of the content of the Web
– exploring the Web
– sifting through data
– making valuable discoveries
 Difficult! Because the Web is:
– Boundless
– Dynamic
– Radically decentralized
Some recent results
 Empirical studies
– Quality of almost-breadth-first crawling
– Structure of the Web
– Random walks (size of search engines)
 Improving the Web experience
– Better and more precise search results
– Surfing assistants and tools
 Data mining
– Technologies for “page scraping”
Tools for “Web scale” research
[Stack diagram, four layers, top to bottom:]
Apps: use data
– Search quality
– Crawl quality
– Duplicate elimination
– Web characterization
Feature databases: access subset of data fast
– Full-text index, shingleprints
– Connectivity, term vectors
Data storage: store and access web pages
– Myriad
Data collection: download web pages
– Mercator, Web Language
Web-scale crawling
Mercator
Atrax
The Mercator web crawler
 A high-performance web crawler
– Downloads and processes web pages
 Written entirely in Java
– Runs on any Java-capable platform
 Extensible
– Wide range of possible configurations
– Users can plug in new modules at run-time
Mercator design points
 Extensible
– Well-chosen extensibility points
– Framework for configuration
 Multiple threads, synchronous I/O
– vs. single thread, asynchronous I/O
 Checkpointing
– Allows crawls to be restarted
– Modules export prepare, commit
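The prepare/commit protocol can be sketched as a two-phase checkpoint: the crawler asks every module to prepare (stage its state in a temporary file) and commits only if all prepares succeed. This is a minimal illustration in Python, not Mercator's actual Java API; the module name and file layout are invented.

```python
import json
import os
import tempfile

class CountingModule:
    """Toy crawler module whose checkpointable state is a page counter."""
    def __init__(self):
        self.pages_seen = 0
        self._pending = None

    def prepare(self, checkpoint_dir):
        # Phase 1: stage state in a temp file; nothing is visible yet.
        fd, path = tempfile.mkstemp(dir=checkpoint_dir)
        with os.fdopen(fd, "w") as f:
            json.dump({"pages_seen": self.pages_seen}, f)
        self._pending = path
        return True

    def commit(self, checkpoint_dir):
        # Phase 2: atomically rename the staged file into place.
        os.replace(self._pending, os.path.join(checkpoint_dir, "counting.json"))
        self._pending = None

def checkpoint(modules, checkpoint_dir):
    """Two-phase checkpoint over all modules; commit only if all prepared."""
    if all(m.prepare(checkpoint_dir) for m in modules):
        for m in modules:
            m.commit(checkpoint_dir)
        return True
    return False
```

A restarted crawl would then load the last committed files, knowing they were written consistently across all modules.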
System Architecture
Crawl quality
Atrax, a distributed version of Mercator
 Distributes load across cluster of crawlers
 Partitions data structures across crawlers
 No central bottleneck
 Network bandwidth is limiting factor
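One way to partition the crawl without a central bottleneck is to hash each URL's host, so that every URL from a given host is owned by exactly one crawler; this also keeps per-host politeness decisions local. A sketch under that assumption (Atrax's actual partitioning function is not described on the slide):

```python
import hashlib
from urllib.parse import urlparse

def crawler_for(url: str, num_crawlers: int) -> int:
    """Assign a URL to a crawler by hashing its host, so all URLs from
    one host land on the same crawler."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers
```

Each crawler then forwards discovered URLs it does not own to the owning peer, which is where network bandwidth becomes the limiting factor.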
Performance of Atrax vs Mercator
[Chart: documents downloaded per day for the 2000/1/31 crawl vs the
2000/12/06 crawl; y-axis 0 to 70,000,000 documents, x-axis days 1 to 10.]
Myriad -- new project
 A very large, archival storage system
– Scalable to a petabyte
 With function shipping
– Supports data mining
Myriad Requirements
 Large (up to 10K disks)
 Commodity hardware (low cost)
 Easy to manage
 Easy to use (queries vs. code)
 Fault tolerance & containment
– No backups, tape or otherwise
Two phases of Myriad project
 Define service-level interface
– Implemented to run on collections of files
– Testing and tuning
 Build scalable implementation
– Cluster storage and processing
– Designing now, prototype in summer
– Won’t describe today
New service level interface
[Diagram: applications sit on a storage service, which sits on raw blocks;
file systems, databases, and Myriad are all such storage services.]
 Better suited to this problem and scale
 Supports “function shipping”
Myriad interface
 Single table database
 Stored vs. virtual columns
– Virtual columns computed by injected code
 Bulk input of new records
 Management of code defining virtual columns
 Output via select/project queries
– User-defined code run implicitly
– Support for repeatable random sampling
Example Myriad query
[samplingprob=0.1, samplingseed=321223421332]
select name, length
where insertionDate < Date(00/01/01)
  && mimeType == "text/html";
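The samplingprob/samplingseed pragmas imply repeatable sampling: the same seed must select the same records on every run. One standard way to get this, shown here as an assumption rather than Myriad's documented mechanism, is to hash each record's key together with the seed and compare the result against the sampling probability:

```python
import hashlib

def in_sample(record_key: str, seed: int, prob: float) -> bool:
    """Deterministically decide membership: hash (seed, key) to a
    uniform value in [0, 1) and compare against the sampling probability."""
    h = hashlib.sha256(f"{seed}:{record_key}".encode()).digest()
    u = int.from_bytes(h[:8], "big") / 2**64
    return u < prob

# Re-running a query with the same seed selects the same records:
sample1 = [k for k in ("a", "b", "c", "d") if in_sample(k, 321223421332, 0.5)]
sample2 = [k for k in ("a", "b", "c", "d") if in_sample(k, 321223421332, 0.5)]
```

Because membership depends only on the key and the seed, a sample can be re-drawn later (e.g. to fetch more columns for the same sampled rows).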
Model for large-scale data mining
 Step 1: make an extract
– Do data-parallel select and project
– Don’t do any sorts, joins, groupings
 Step 2: put extract into high-power analysis tool
– Sorts, joins, groupings
Feature Databases
 URL DB
– URL → pgid
 Host DB
– pgid → hostid
 Link DB
– out: pgid → pgid*
– in: pgid → pgid*
 Term vector DB
– pgid → term vector
URL database: prefix compression
http://kiva.net/~markh/surnames.html
http://kiwi-us.com/~amigo/links/index.htm
http://kiwi.emse.fr/
http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
http://kiwi.etri.re.kr/~ksw/bookmark
http://kiwi.futuris.net/linen
http://kiwi.futuris.net/linen/special/backiss.html
Prefix compress
0 http://kiva.net/~markh/surnames.html
9 wi-us.com/~amigo/links/index.htm
11 .emse.fr/
13 tri.re.kr/~khshim/internet/bookmark.html
25 sw/bookmark
12 futuris.net/linen
29 /special/backiss.html
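The scheme above stores, for each URL in sorted order, the number of characters shared with the previous URL plus the remaining suffix. A small sketch; run on the slide's URLs it reproduces the prefix lengths shown:

```python
def prefix_compress(sorted_urls):
    """Encode each URL as (n, suffix): n characters shared with the
    previous URL, plus the remaining suffix."""
    out, prev = [], ""
    for url in sorted_urls:
        n = 0
        while n < min(len(prev), len(url)) and prev[n] == url[n]:
            n += 1
        out.append((n, url[n:]))
        prev = url
    return out

def prefix_decompress(encoded):
    """Invert prefix_compress by rebuilding each URL from its predecessor."""
    urls, prev = [], ""
    for n, suffix in encoded:
        url = prev[:n] + suffix
        urls.append(url)
        prev = url
    return urls
```

Decompression only needs the previous decoded URL, which is why it stays fast.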
URL compression
 Prefix compression
– 44 → 14.5 bytes/URL
– Fast to decompress (~10 µs)
 ZIP compression
– 14.5 → 9.2 bytes/URL
– Slow to decompress (~80 µs)
Term vector basics
• Basic abstraction for information retrieval
• Useful for measuring “semantic” similarity of text

Document Id   Automobile   Carburetor   Feline   Jaguar   …
Doc 1              2            3          0        2     …
Doc 2              0            0          2        2     …
Doc 3              2            0          0        2     …

• A row in the above table is a “term vector”
• Columns are word stems and phrases
• Trying to capture “meaning”
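"Semantic" similarity between pages is commonly measured as the cosine of the angle between their term vectors. A toy sketch using counts like those in the table, with columns (automobile, carburetor, feline, jaguar):

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term vectors: (automobile, carburetor, feline, jaguar)
doc1 = (2, 3, 0, 2)
doc2 = (0, 0, 2, 2)
doc3 = (2, 0, 0, 2)
```

Here doc1 scores closer to doc3 (both about cars) than to doc2 (about cats), even though all three mention "jaguar".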
Compressing term vectors
 Sparse representation
– Only store columns with non-zero counts
 Lossy representation
– Only store “important” columns
– “Importance” determined by:
 Count of term on page (high ==> important)
 Number of pages with term (low ==> important)
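The two bullets above describe a tf-idf-style weight: a term matters more when it is frequent on the page and rare across pages. A sketch of the lossy step under that reading (the exact weighting used in the term vector DB is an assumption here):

```python
import math

def compress_term_vector(counts, doc_freq, total_docs, k=3):
    """Sparse, lossy term vector: drop zero counts, then keep only the k
    terms with the highest weight, where weight grows with the count on
    the page and shrinks with the number of pages containing the term."""
    def weight(term):
        return counts[term] * math.log(total_docs / doc_freq[term])
    nonzero = [t for t, c in counts.items() if c > 0]
    keep = sorted(nonzero, key=weight, reverse=True)[:k]
    return {t: counts[t] for t in keep}
```

Common words like "the" appear on nearly every page, so their weight collapses toward zero and they are dropped first.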
TVDB Builder
Applications
 Categorizing pages
 Topic distillation
 Filtering pages
 Identifying languages
 Identifying running text
 Relevance feedback (“more like this”)
 Abstracting pages
Categorization
 “Bulls take over”
How to categorize a page
 Off line:
– Collect training set of pages per category (~30K)
– Combine training pages into category vectors
 ~10K terms per category vector
 On line:
– Use term vector DB to look up vector of page
– Find category vector that best matches this page vector
 Use a Bayesian classifier to match vectors
 Give no category if match not “definitive”
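The online step can be sketched as follows. The slide specifies a Bayesian classifier; as a simplification this sketch matches vectors by cosine similarity and returns no category when the best match falls below a threshold (the threshold value here is arbitrary):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term vectors (dicts term -> weight)."""
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def categorize(page_vec, category_vecs, threshold=0.3):
    """Return the best-matching category, or None if no match is definitive."""
    best, score = None, 0.0
    for name, cvec in category_vecs.items():
        s = cosine(page_vec, cvec)
        if s > score:
            best, score = name, s
    return best if score >= threshold else None
```

The threshold implements the last bullet: an off-topic page matches no category vector well, so it stays uncategorized.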
Topic drift in topic distillation
 Some Web IR algorithms have this structure:
– Compute a “seed set” on a query
– Find neighborhood by following links
– Rank this neighborhood
 Topic drift (a problem):
– The neighborhood graph includes off-topic nodes
 “Download Microsoft Explorer” → MS Home page
Avoid topic drift with term vectors
 Combine term vectors of seed set into topic vector
 Detecting topic drift in neighboring nodes:
– Combine topic vector with node’s term vector
 Inner product works fine
– Expunge or weight
 Integration of feature databases helps!!
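A sketch of the drift filter described above: score each neighborhood node by the inner product of its term vector with the topic vector, then expunge nodes below a cutoff and use the score as a weight for the rest (the cutoff value is illustrative):

```python
def dot(u, v):
    """Inner product of two sparse term vectors (dicts term -> weight)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def filter_neighborhood(topic_vec, node_vecs, expunge_below=0.1):
    """Score neighborhood nodes against the topic vector; drop likely
    off-topic nodes, return topic-affinity weights for the rest."""
    weights = {}
    for node, vec in node_vecs.items():
        score = dot(topic_vec, vec)
        if score >= expunge_below:
            weights[node] = score
    return weights
```

A "Download Microsoft Explorer" page shares almost no terms with a topical seed set, so its score is near zero and it is expunged before ranking.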
Link database
 Goals
– Fit links into RAM (fast lookup)
– Build in 24 hours
 Applications
– Page ranking
– Web structure
– Mirror site detection
– Related page detection
Link storage baseline design
[Diagram: a “Starts” table maps each source page id (104–108) to an offset
into a flat “Links” array of destination page ids (106, 115, 101, 72, 208, 111).]
Link storage: deltas
[Diagram: as above, but the Links array stores small deltas (2, 9, -4, -31,
136, 4) in place of absolute destination ids.]
Link storage: compression
[Diagram: the link deltas are stored with a variable-length encoding (4 to
12 bits per delta in this example), yielding 1.7 bytes/link.]
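The slide's bit-level code (4 to 12 bits per delta) is not fully specified, but the idea (delta-encode an adjacency list, then spend fewer bits on small deltas) can be sketched with byte-aligned varints plus a zigzag mapping for negative deltas:

```python
def zigzag(n):
    """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) if n >= 0 else (-n << 1) - 1

def unzigzag(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def encode_deltas(link_ids):
    """Delta-encode a link list and emit each delta as a varint
    (7 bits per byte, high bit set while more bytes follow)."""
    out, prev = bytearray(), 0
    for lid in link_ids:
        z = zigzag(lid - prev)
        prev = lid
        while True:
            byte, z = z & 0x7F, z >> 7
            out.append(byte | (0x80 if z else 0))
            if not z:
                break
    return bytes(out)

def decode_deltas(data):
    """Invert encode_deltas: reassemble varints, undo zigzag, re-add deltas."""
    ids, prev, z, shift = [], 0, 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            prev += unzigzag(z)
            ids.append(prev)
            z, shift = 0, 0
    return ids
```

On the slide's six destination ids this comes to 1.5 bytes/link; the bit-packed code on the slide does better because it is not byte-aligned.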
LDBng
The future of Web Archaeology
 Driving applications
– Web search -- “finding things on the web”
 Page classification (topic, community, type)
 Purpose-specific search
– Web “asset management” (what’s on my site?)
– Automated information extraction (price robots)
 Multi-billion page web
 Dynamics