Web Robots, Crawlers, & Spiders
Webmaster - Fort Collins, CO
Introduction to Web Robots, Crawlers & Spiders
Instructor: Joseph DiVerdi, Ph.D., MBA
Copyright © XTR Systems, LLC
Web Robot Defined
• A Web Robot Is a Program
– That Automatically Traverses the Web
• Using Hypertext Links
– Retrieving a Particular Document
• Then Retrieving All Documents That Are Referenced
– Recursively
• "Recursive" Doesn't Limit the Definition
– To Any Specific Traversal Algorithm
– Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period
• It Is Still a Robot
Web Robot Defined
• Normal Web Browsers Are Not Robots
– Because They Are Operated by a Human
– And Don't Automatically Retrieve Referenced Documents
• Other Than Inline Images
Web Robot Defined
• Sometimes Referred to As
– Web Wanderers
– Web Crawlers
– Spiders
• These Names Are a Bit Misleading
– They Give the Impression the Software Itself Moves
Between Sites
• Like a Virus
– This Is Not the Case
• A Robot Visits Sites by Requesting Documents From Them
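Below is a minimal Python sketch of the recursive traversal described above, using only the standard library; the seed URL, depth limit, and names are illustrative, and a real robot would do much more (honor /robots.txt, space out requests, deduplicate more carefully):

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href values of the anchor tags in one document."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, seen, depth=2):
    """Retrieve a document, then recursively retrieve the documents it references."""
    if depth == 0 or url in seen:
        return
    seen.add(url)
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):   # unreachable host, non-HTTP link, etc.
        return
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        crawl(urljoin(url, link), seen, depth - 1)

crawl("http://www.example.com/", set())   # placeholder seed URL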
Agent Defined
• The Term Agent Is (Over) Used These Days
• Specific Agents Include:
– Autonomous Agent
– Intelligent Agent
– User-Agent
Autonomous Agent Defined
• An Autonomous Agent Is a Program
– That Automatically Travels Between Sites
– Makes Its Own Decisions
• When To Move, When To Stay
– Is Limited to Travel Between Selected Sites
– Currently Not Widespread on the Web
Intelligent Agent Defined
• An Intelligent Agent Is a Program
– That Helps Users With Certain Activities
• Choosing a Product
• Filling Out a Form
• Finding Particular Items
– Generally Has Little to Do With Networking
– Usually Created & Maintained by an Organization
• To Assist Its Own Viewers
User-Agent Defined
• A User-Agent Is a Program
– That Performs Networking Tasks for a User
• Web User-Agent
– Navigator
– Internet Explorer
– Opera
• Email User-Agent
– Eudora
• FTP User-Agent
– HTML-Kit
– Fetch
– CuteFTP
Search Engine Defined
• A Search Engine Is a Program
– That Examines a Database
• Upon Request or Automatically
• Delivers Results or Creates a Digest
– In the Context of the Web, a Search Engine Is
• A Program That Examines Databases of HTML
Documents
– Databases Gathered by a Robot
• Upon Request
• Delivers Results Via HTML Document
Robot Purposes
• Robots Are Used for a Number of Tasks
– Indexing
• Just Like a Book Index
– HTML Validation
– Link Validation
• Searching for Broken Links
– What's New Monitoring
– Mirroring
• Making a Copy of a Primary Web Site
• On a Separate Server
– More Local to Some Users
– Shares the Work Load With the Primary Server
Other Popular Names
• All Names for the Same Sort of Program
– With Slightly Different Connotations
• Web Spiders
– Sounds Cooler in the Media
• Web Crawlers
– WebCrawler Is a Specific Robot
• Web Worms
– A Worm Is a Replicating Program
• Web Ants
– Distributed Cooperating Robots
Robot Ethics
• Robots Have Enjoyed a Checkered History
– Certain Robot Programs Can
• And Have in the Past
– Overload Networks & Servers
• With Numerous Requests
• This Happens Especially With Programmers
– Who Are Just Starting to Write a Robot Program
• These Days There Is Sufficient Information on
Robots to Prevent Many of These Mistakes
– But Does Everyone Read It?
Robot Ethics
• Robots Have Enjoyed a Checkered History
– Robots Are Operated by Humans
• Who Can Make Mistakes in Configuration
• Or Don't Consider the Implications of Their Actions
• This Means
– Robot Operators Need to Be Careful
– Robot Authors Need to Make It Difficult for
Operators to Make Mistakes
• With Bad Effects
Robot Ethics
• Robots Have Enjoyed a Checkered History
– Indexing Robots Build a Central Database of Documents
– Which Doesn't Always Scale Well
• To Millions of Documents
• On Millions of Sites
– Many Different Problems Occur
• Missing Sites & Links
• High Server Loads
• Broken Links
Robot Ethics
• Robots Have Enjoyed a Checkered History
– The Majority of Robots
• Are Well Designed
• Are Professionally Operated
• Cause No Problems
• Provide a Valuable Service
• Robots Aren't Inherently Bad
– Nor Are They Inherently Brilliant
• They Just Need Careful Attention
Robot Visitation Strategies
• Robots Generally Start From a Historical URL List
– Especially Documents With Many or Certain Links
• Server Lists
• What's New Pages
• Most Popular Sites on the Web
• Other Sources for URLs Are Used
– Scans Through USENET Postings
– Published Mailing List Archives
• The Robot Selects URLs to Visit, Index, & Parse
– And to Use As a Source for New URLs
Robot Indexing Strategies
• If an Indexing Robot Is Aware of a Document
– Robot May Decide to Parse Document
– Insert Document Content Into Robot's Database
• Decision Depends on the Robot
– Some Robots Index
• Only HTML Titles
• Or the First Few Paragraphs
– Others Parse the Entire HTML & Index All Words
• With Weightings Depending on HTML Constructs
– Others Parse the META Tag
• Or Other Special Internal Tags
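As a rough illustration of the simplest of these strategies, here is a Python sketch (standard library html.parser; the class name and sample document are illustrative) that captures a document's title and its remaining words for an index:

from html.parser import HTMLParser

class TitleAndTextParser(HTMLParser):
    """Capture the <title> text and the words of the rest of the document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.words = []
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.words.extend(data.split())

parser = TitleAndTextParser()
parser.feed("<html><head><title>Web Robots</title></head>"
            "<body><p>Crawlers and spiders</p></body></html>")
print(parser.title)   # Web Robots
print(parser.words)   # ['Crawlers', 'and', 'spiders']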
Robot Visitation Strategies
• Many Indexing Services Also Allow Web
Developers to Submit a URL Manually
– Which Is Queued
– Visited by the Robot
• Exact Process Depends on Robot Service
– Many Services Have a Link to a URL Submission
Form on Their Search Page
• Certain Aggregators Exist
– Which Purport to Submit to Many Robots at Once
http://www.submit-it.com/
Determining Robot Activity
• Examine Server Logs
– Examine User-Agent, If Available
– Examine Host Name or IP Address
– Check for Many Accesses in Short Time Period
– Check for Robot Exclusion Document Access
• Found at: /robots.txt
Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atwcrawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si;
[email protected]; http://www.inktomi.com/slurp.html)"
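A short Python sketch of this kind of log check (the log file name and the list of robot hints are placeholders; the field handling assumes quoted User-Agent fields like those in the entries above):

import re

ROBOT_HINTS = ("scooter", "crawler", "slurp", "archiver", "libwww")

with open("access_log") as log:                      # placeholder file name
    for line in log:
        request = re.search(r'"GET (\S+) HTTP', line)
        parts = line.rsplit('"', 2)                  # last quoted field is the User-Agent
        agent = parts[-2].lower() if len(parts) == 3 else ""
        if (request and request.group(1) == "/robots.txt") or \
           any(hint in agent for hint in ROBOT_HINTS):
            print(line.strip())                      # likely a robot visit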
After Robot Visitation
• Some Webmasters Panic After Being Visited
– Generally Not a Problem
– Generally a Benefit
– No Relation to Viruses
– Little Relation to Hackers
– Close Relation to Lots of Visits
Controlling Robot Access
• Excluding Robots Is Feasible Using Server Access Control Techniques
– .htaccess File & Directives
• Deny from 0.0.0.0 (an IP Address)
• SetEnvIf User-Agent Robot is_a_robot
– Combined With: Deny from env=is_a_robot
• Can Increase Server Load
• Seldom Required
– More Often (Mis) Desired
Robot Exclusion Standard
• Robot Exclusion Standard Exists
– Consists of a Single Site-wide File
• /robots.txt
• Contains Directives, Comment Lines, & Blank Lines
– Not a Locked Door
– More of a "No Entry" Sign
– Represents a Declaration of Owner's Wishes
– May Be Ignored by Incoming Traffic
• Much Like a Red Traffic Light
– If Everyone Follows The Rules, The World's a Better Place
Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail [email protected] for constructive criticism
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
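A short Python sketch of how a well-behaved robot would interpret the sample file above, using the standard library's urllib.robotparser (the URLs checked are illustrative):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("webcrawler", "http://webcrawler.com/tmp/x.html"))  # True
print(parser.can_fetch("lycra", "http://webcrawler.com/"))                 # False
print(parser.can_fetch("otherbot", "http://webcrawler.com/tmp/x.html"))    # False
print(parser.can_fetch("otherbot", "http://webcrawler.com/index.html"))    # True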
Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail [email protected] for constructive criticism
• Lines Beginning With '#' Are Comments
• Comment Lines Are Ignored
– Comments May Also Appear at the End of a Directive Line, but Are Best Kept on Their Own Lines
Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
• Specifies That the Robot Named 'webcrawler'
• Has Nothing Disallowed
– It May Go Anywhere on This Site
Exclusion Standard Syntax
User-agent: lycra
Disallow: /
• Specifies That the Robot Named 'lycra'
• Has All URLs Starting With '/' Disallowed
– It May Go Nowhere on This Site
– Because All URLs On This Server
• Begin With Slash
Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
• Specifies That All Other Robots
• Have URLs Starting With '/tmp' & '/logs' Disallowed
– They May Not Access Any URLs Beginning With Those Strings
• Note: The '*' Is a Special Token
– Meaning "Any Other User-agent"
• Regular Expressions Cannot Be Used
Exclusion Standard Syntax
• Two Common Configuration Errors
– Wildcards Are Not Supported
• Do Not Use 'Disallow: /tmp/*'
• Use 'Disallow: /tmp'
– Put Only One Path on Each Disallow Line
• This May Change in a Future Version of the Standard
robots.txt File Location
• The Robot Exclusion File Must be Placed at
The Server's Document Root
• For example:
Site URL                     Corresponding robots.txt URL
http://www.w3.org/        -> http://www.w3.org/robots.txt
http://www.w3.org:80/     -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/   -> http://www.w3.org:1234/robots.txt
http://w3.org/            -> http://w3.org/robots.txt
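A short Python sketch of that mapping, using urllib.parse from the standard library (the function name and example URL are illustrative):

from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    """Keep the scheme, host, & port; replace the rest with /robots.txt."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org:1234/People/index.html"))
# -> http://www.w3.org:1234/robots.txt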
Common Mistakes
• URLs Are Case Sensitive
– "/robots.txt" must be all lower-case
• Pointless robots.txt URLs
http://www.w3.org/admin/robots.txt
http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users
– Like linus.ulltra.com
– robots.txt Cannot Be Placed in Individual Users'
Directories
– It Must Be Placed in the Server Root
• By the Server Administrator
For Non-System Administrators
• Sometimes Users Have Insufficient Authority
to Install a /robots.txt File
– Because They Don't Administer the Entire Server
• Use a META Tag in Individual HTML Documents to Exclude Robots
<META NAME="ROBOTS" CONTENT="NOINDEX">
– Prevents Document From Being Indexed
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
– Prevents Document Links From Being Followed
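A short Python sketch of how a robot might detect these tags before indexing a document (standard library html.parser; the class name and sample document are illustrative; the two directives may also be combined as CONTENT="NOINDEX, NOFOLLOW"):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Record NOINDEX / NOFOLLOW directives found in META ROBOTS tags."""
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = {name: (value or "") for name, value in attrs}
        if fields.get("name", "").lower() == "robots":
            content = fields.get("content", "").lower()
            self.noindex = self.noindex or "noindex" in content
            self.nofollow = self.nofollow or "nofollow" in content

parser = RobotsMetaParser()
parser.feed('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>')
print(parser.noindex, parser.nofollow)   # True False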
Bottom Line
• Use Robots Exclusion to Prevent Time-Variant Content From Being Improperly Indexed
• Don't Use It to Exclude Visitors
• Don't Use It to Secure Sensitive Content
– Use Authentication If It's Important
– Use SSL If It's Really Important