Privacy - FTP Directory Listing



“Privacy is the claim of individuals, groups or institutions to determine for themselves
when, how, and to what extent information about them is communicated to others”
- Alan Westin, Privacy and Freedom, 1967
Wasim Rangoonwala
ID# 00506259
CS-460 Computer Security
What are WWW Robots?
A robot is a program that
automatically traverses the Web's
hypertext structure by retrieving a
document, and recursively retrieving
all documents that are referenced.
Web robots are sometimes referred
to as Web Wanderers, Web Crawlers,
Spiders, or Bots.
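The recursive retrieval described above can be sketched in a few lines of Python. This is only an illustration, not how any real crawler is built: the `PAGES` dictionary is a hypothetical stand-in for the Web (each page name maps to the pages it references), and `crawl` is an illustrative helper name.

```python
# Hypothetical in-memory "web": each page maps to the pages it references.
PAGES = {
    "index.html": ["about.html", "news.html"],
    "about.html": ["index.html"],
    "news.html": ["story1.html", "about.html"],
    "story1.html": [],
}


def crawl(page, visited=None):
    """Retrieve a page, then recursively retrieve every page it references."""
    if visited is None:
        visited = []
    if page in visited or page not in PAGES:
        return visited  # already retrieved, or a dead link
    visited.append(page)  # "retrieve" the document
    for link in PAGES[page]:
        crawl(link, visited)
    return visited


print(crawl("index.html"))
# prints: ['index.html', 'about.html', 'news.html', 'story1.html']
```

The `visited` list is what keeps the robot from looping forever when two pages link to each other.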
Web Spiders / Robots Collecting Data
Controlling how search engines access
and index your website
Google refers to its spiders as
Googlebot and Googlebot-Image.
Google has a set of computers that
continually crawl the web. Together
these machines are known as the
Googlebot. In general you want
Googlebot to access your site so
your web pages can be found by
people searching on Google.
One key question is: how does
Google know what parts of a website
the site owner wants to have show
up in search results? Can publishers
specify that some parts of the site
should be private and non-searchable?
The good news is that
those who publish on the web have a
lot of control over which pages
should appear in search results and
which pages can be kept private.
Answer:
Robots.txt File
1. Robots.txt has been an industry
standard for many years that lets
a site owner control how compliant
search engines access their web site.
2. The robots.txt file contains a list
of the pages that search engines
shouldn't access.
3. You can exclude pages from
Google's crawler by creating a
text file called robots.txt and
placing it in the root directory
of your site.
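Because the file sits in the root directory, a crawler can work out where to look for it from any page address on the site. A minimal Python sketch, assuming a hypothetical helper name `robots_url` and the placeholder domain `www.example.com`:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the address of robots.txt for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the top level.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/cart.html?item=3"))
# prints: http://www.example.com/robots.txt
```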
Making Use of
Robots.txt File
Examples of pages you want to
keep private from search engines:
1. A directory that contains internal
logs.
2. News articles that require
payment to access.
3. Administration area of website.
4. Database configuration string,
stored passwords, credit card
details.
5. Images that you want to keep
private.
Achieving Privacy through Robots.txt File
# robots.txt file
# Currently disallow all images to the Google image crawler
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
Example of
Robots.txt File
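One way to check what such a file permits is Python's standard `urllib.robotparser` module, which applies the rules the way a compliant crawler would. The sketch below uses a shortened rule set modeled on the sample file; the wildcard `User-agent: *` group, the site `www.example.com`, the paths tested, and the helper name `allowed` are assumptions for illustration. Keep in mind that robots.txt is advisory: it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Shortened rules modeled on the sample file; the wildcard group is an assumption.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /login.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://www.example.com"  # placeholder site


def allowed(agent, path):
    """True if a compliant crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, BASE + path)


print(allowed("Googlebot-Image", "/photos/cat.jpg"))  # False
print(allowed("Googlebot", "/admin/index.html"))      # False
print(allowed("Googlebot", "/products.html"))         # True
```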
Privacy through Robots <META> tag
You can use a special HTML <META> tag to tell robots not to index
the content of a page, and/or not scan it for links to follow.
Example
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
• The "NAME" attribute must be "ROBOTS".
• Valid values for the "CONTENT" attribute are: "INDEX",
"NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated
values are allowed, but obviously only some combinations make
sense. If there is no robots <META> tag, the default is
"INDEX,FOLLOW", so there's no need to spell that out.
Example of
<META> Tag
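To see how a robot might read this tag, here is a small Python sketch built on the standard `html.parser` module. The class name `RobotsMetaParser` and the helper `robots_directives` are hypothetical names chosen for this example; the defaults follow the "INDEX,FOLLOW" rule stated above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Read the robots <META> tag of a page, if there is one."""

    def __init__(self):
        super().__init__()
        self.index = True   # default is "INDEX,FOLLOW"
        self.follow = True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag != "meta" or (attrs.get("name") or "").upper() != "ROBOTS":
            return
        content = attrs.get("content") or ""
        values = {v.strip().upper() for v in content.split(",")}
        if "NOINDEX" in values:
            self.index = False
        if "NOFOLLOW" in values:
            self.follow = False


def robots_directives(html):
    """Return (may_index, may_follow) for the given page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = ('<html><head><title>...</title>'
        '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>')
print(robots_directives(page))                      # (False, False)
print(robots_directives("<html><head></head>"))     # (True, True)
```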
Search Engine Web Spiders Names
• Yahoo! Search: Yahoo Slurp
• AltaVista: Scooter
• Ask Jeeves: Ask Jeeves/Teoma
• MSN Search: MSNbot
• Visit http://www.robotstxt.org/db.html
for more details on search engine
web spider names.
Bonus
Google: Anatomy
Google Crawlers (GoogleBot)
• Multiple distributed crawlers
• Own DNS cache
• Roughly 300 connections open at once per crawler
• Send fetched pages to the Store Server
• Originally written in Python
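Two of the ideas in this list, a private DNS cache and a cap on simultaneous connections, can be illustrated with a toy Python sketch. It is not Google's code: the resolver hands out fake addresses, `asyncio.sleep(0)` stands in for network I/O, and every name here (`resolve`, `fetch`, `crawl`, the `example` hosts) is hypothetical.

```python
import asyncio

MAX_CONNECTIONS = 300  # figure quoted on the slide

dns_cache = {}  # host -> address, so each host is resolved only once
lookups = 0     # counts how often the (fake) resolver actually runs


def resolve(host):
    """Return the address for host, consulting the cache first."""
    global lookups
    if host not in dns_cache:
        lookups += 1
        dns_cache[host] = "10.0.0.%d" % (len(dns_cache) + 1)  # fake address
    return dns_cache[host]


async def fetch(url, limit):
    """Pretend to download url while holding one connection slot."""
    host = url.split("/")[2]
    async with limit:  # waits if MAX_CONNECTIONS fetches are in flight
        address = resolve(host)
        await asyncio.sleep(0)  # stands in for network I/O
        return url, address


async def crawl(urls):
    limit = asyncio.Semaphore(MAX_CONNECTIONS)
    return await asyncio.gather(*(fetch(u, limit) for u in urls))


urls = ["http://a.example/%d" % i for i in range(5)] + ["http://b.example/"]
pages = asyncio.run(crawl(urls))
print(len(pages), "pages fetched with", lookups, "DNS lookups")
# prints: 6 pages fetched with 2 DNS lookups
```

Six pages on two hosts need only two lookups, which is the point of keeping a cache next to the crawler.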
Google: Technology
PageRank™ Algorithm
Hypertext-Matching Analysis
Google Webmaster Central
Webmaster Central offers services that let you:
• see which parts of a site Googlebot had
problems crawling
• upload an XML Sitemap file
• analyze and generate robots.txt files
• remove URLs already crawled by
Googlebot
• specify the preferred domain
• identify issues with title and description
meta tags
• understand the top searches used to
reach a site
• get a glimpse at how Googlebot sees
pages
• remove unwanted site links that Google
may use in results
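As an illustration of the XML Sitemap file mentioned in the list above, the following Python sketch builds a minimal one with the standard `xml.etree.ElementTree` module. The helper name `build_sitemap` and the `www.example.com` addresses are assumptions for this example; a real Sitemap may carry further optional fields that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML Sitemap listing the given page addresses."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/products.html",
])
print(sitemap)
```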
Protect your privacy on the Web
• When surfing the internet, avoid "free" offers and protect your information!
• Beware of phishing: fake e-mails sent to try to gain your personal and financial information.
• Don't even open spam. Download a spam buster!
• E-mail is not secure and should never be thought of as private.
• Chatting: guard your information unless you are 100% sure who you are chatting with.
• Cookies aren't just for eating. They may be sending your personal information to others.
• Protect your passwords like you would your wallet or car keys. Make them complicated!
• http://www.google.com/support/webmasters/bin/answer.py?answer=80553
• http://www.google.com/bot.html
• http://www.googleguide.com
• http://www.searchengineposition.com
• http://www.google-watch.org
• http://www.robotstxt.org/db.html
• http://www.googleblog.blogspot.com
• For more details, visit http://techwasim.blogspot.com