
Web Crawling and Automatic Discovery
Donna Bergmark
March 14, 2002
Web Resource Discovery
• Surfing → Serendipity
• Search → Specific Information
• Inverted keyword list → Page lookup
• Crawler → Text for keyword indexing
Hence, crawlers are needed for discovery of
Web resources
Definition
Spider = robot = crawler
Crawlers are computer programs that roam
the Web with the goal of automating
specific tasks related to the Web.
Some History
• First crawlers appeared in 1994
• Why? Web growth
• April 1993: 62 registered web servers
• In 1994, Web (http) traffic grew 15 X faster than the Internet itself
• Lycos was announced in 1994 as a search engine.
So, why not write a robot?
You’d think a crawler would be easy to write:
Pick up the next URL
Connect to the server
GET the URL
When the page arrives, get its
links (optionally do other stuff)
REPEAT
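A minimal sketch of that loop in Python. The requests and BeautifulSoup libraries, the function name, and the page limit are illustrative assumptions, not part of the slides; politeness, robots.txt, and visit order are deliberately ignored here, which is exactly what the following slides take issue with.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def simple_crawl(seed, max_pages=100):
    frontier = [seed]          # URLs waiting to be fetched
    seen = {seed}              # do not fetch the same URL twice
    while frontier and len(seen) <= max_pages:
        url = frontier.pop(0)                        # pick up the next URL
        try:
            page = requests.get(url, timeout=10)     # connect to the server, GET the URL
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True): # when the page arrives, get its links
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)                # REPEAT with the new URLs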
Crawler Issues
• The URL itself
• Politeness
• Visit Order
• Robot Traps
• The hidden web
• System Considerations
Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbid access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
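A sketch of how a crawler might honor the standard using Python's urllib.robotparser; the server URL comes from the slide and the agent name is made up for illustration.

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://any-server:80/robots.txt")   # the webmaster-maintained file
robots.read()                                       # fetch and parse it

# Ask before fetching; adherence is up to the crawler
if robots.can_fetch("ExampleCrawler", "http://any-server:80/cgi-bin/search"):
    print("allowed to fetch")
else:
    print("excluded by robots.txt -- skip this URL")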
The Four Laws of Web Robotics
• A Crawler must show identification
• A Crawler must obey the robots.txt
• A Crawler must not hog resources
• A Crawler must report errors
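One way these laws might look in code; a hedged sketch where the agent string, delay, and logging choices are assumptions rather than anything prescribed by the slides.

import time
import logging
import requests

# Law 1: identify the crawler and give a contact address
HEADERS = {"User-Agent": "ExampleCrawler/0.1 (contact: crawler-admin@example.org)"}

def polite_get(url, delay_seconds=2.0):
    # Law 2: a robots.txt check (previous sketch) would gate this call
    time.sleep(delay_seconds)                 # Law 3: do not hog the server's resources
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as err:
        logging.warning("fetch failed for %s: %s", url, err)   # Law 4: report errors
        return None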
Visit Order
• The frontier
• Breadth-first: FIFO queue
• Depth-first: LIFO queue
• Best-first: Priority queue
• Random
• Refresh rate
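The visit orders above correspond to different data structures for the frontier; a small Python sketch (URLs and the priority score are placeholders).

from collections import deque
import heapq
import random

frontier = deque()                 # the frontier: URLs discovered but not yet visited

# Breadth-first: FIFO queue
frontier.append("http://example.org/a")
next_url = frontier.popleft()

# Depth-first: LIFO queue (a stack)
frontier.append("http://example.org/b")
next_url = frontier.pop()

# Best-first: priority queue ordered by an estimated score
heap = []
heapq.heappush(heap, (-0.9, "http://example.org/c"))   # negate so higher scores pop first
score, next_url = heapq.heappop(heap)

# Random: pick any frontier entry
pool = ["http://example.org/d", "http://example.org/e"]
next_url = random.choice(pool)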
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
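Typical defensive checks against such traps; this is a sketch of common safeguards, not something prescribed in the slides, and the limits are arbitrary examples.

from urllib.parse import urldefrag

MAX_DEPTH = 10            # cuts off endless chains of generated links
MAX_URL_LENGTH = 256      # suspiciously long URLs are often trap artifacts
visited = set()           # breaks cycles in the Web graph

def should_visit(url, depth):
    url, _fragment = urldefrag(url)        # canonicalize a little: drop #fragments
    if url in visited or depth > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return False
    visited.add(url)
    return True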
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers
can “get into” the hidden web
System Issues
• Crawlers are complicated systems
• Efficiency is of utmost importance
• Crawlers are demanding of system and
network resources
Mercator - 1
• Written in Java
• One file configures a crawl
– How many threads
– What analyzers to use
– What filters to use
– How to place links on the frontier
– How long to run
Mercator - 2
• Tell it what seed URL[s] to start with
• Can add your own code
– Extend one or more of M’s base classes
– Add totally new classes called by your own code
• Is very efficient at memory usage
– URLs are hashed
– Documents are finger-printed
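The memory-saving idea, sketched generically in Python (not Mercator's actual code): keep fixed-size hashes of URLs already seen, and fixed-size fingerprints of page contents to spot duplicate documents.

import hashlib

seen_url_hashes = set()        # compact "URL already seen" test
seen_fingerprints = set()      # compact "content already seen" test

def url_already_seen(url):
    digest = hashlib.sha1(url.encode("utf-8")).digest()   # 20 bytes instead of the full string
    if digest in seen_url_hashes:
        return True
    seen_url_hashes.add(digest)
    return False

def duplicate_document(body_bytes):
    fingerprint = hashlib.sha1(body_bytes).digest()        # same page bytes => same fingerprint
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False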
Mercator - 3
• Industrial-strength crawler:
– Multi-threaded for parallel crawls
– Polite: one thread for one server (see sketch below)
– Mercator implements its own host lookup
– Mercator uses its own DNS
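The "one thread for one server" rule can be approximated by routing every URL from the same host to the same worker queue, so no host is ever hit by two threads at once. This is a generic sketch of the idea, not Mercator's implementation; the worker count is arbitrary.

from urllib.parse import urlparse
import hashlib

NUM_WORKERS = 8
worker_queues = [[] for _ in range(NUM_WORKERS)]   # one pending-URL list per crawler thread

def assign_to_worker(url):
    host = urlparse(url).netloc
    index = int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16) % NUM_WORKERS
    worker_queues[index].append(url)    # all URLs for this host go to the same thread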
The Web as a Graph
• Crawling is meant to traverse the web
• Remove some edges to create a tree
– I.e. do not revisit URLs
• You can only crawl forwards
– I.e. need explicit back-links
• Page rank
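Page rank treats the crawled pages as a graph and repeatedly redistributes each page's score over its out-links. A minimal power-iteration sketch; the toy graph, damping factor, and handling of dangling pages are illustrative assumptions.

def pagerank(out_links, damping=0.85, iterations=50):
    # out_links: dict mapping each page to the list of pages it links to
    pages = list(out_links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in out_links.items():
            if not targets:
                continue                          # dangling page: ignored in this sketch
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] = new_rank.get(target, 0.0) + share
        rank = new_rank
    return rank

# Example: three pages in a small cycle with one extra edge
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))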
The Web is a BIG Graph
• “Diameter” of the Web
• Cannot crawl even the static part,
completely
• New technology: the focused crawl
Conclusion
• Clearly crawling is not simple
• Hot research topic of the late 90's
• Good technologies as a result
• Focused crawling is where crawling is going next (hot topic of the early 2000's)