The History of Search Engines


Lecture Number Two
Web Searching and How it Evolved
The web is basically a jungle!
What do you mean “jungle”?
More than 2 billion pages of accessible
info
No consistent id system (like _______ )
for books
No cataloguing principle (like __________
or _________________)
And furthermore…
Many documents don’t name the author and
you can’t tell how current the information is
Most engines index every word so too much
comes back
And remember: You’re NOT searching live,
you’re looking at a fixed database compiled
before you searched
Two Ways to Find Things
Search Engines (all electronic)
Subject Directories (electronic search
of human-maintained content)
Search Engines
electronic
index every page of a Web site
3 Types of Search Engines
--global : search _______________
--subject-specific: search only within a
defined area
--meta-engines (aka _________) :
- Combine results from several search engines,
ranked by relevance
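
A minimal sketch of how a meta-engine might combine ranked lists. The slides do not say how results are merged, so the scoring rule here (each engine contributes 1/rank for a page it lists) is an assumption, and the engine lists and URLs are made up for illustration.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # Assumed scoring rule: a page earns 1/rank from each engine listing it.
            scores[url] += 1.0 / rank
    # Highest combined score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from three made-up engines.
engine_a = ["a.example/peru", "b.example/inca", "c.example/andes"]
engine_b = ["b.example/inca", "a.example/peru"]
engine_c = ["b.example/inca", "d.example/lima"]

merged = merge_results([engine_a, engine_b, engine_c])
print(merged)
```

A page listed near the top by several engines rises above one listed by a single engine, which is the general idea behind "ranked by relevance" in a combined list.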
Examples of Search Engines
Google
Altavista
Alltheweb
A History of Search Engines
1990 to the present
Precursors to Search Engines
Before the internet there was….
1969
New York Times Project
1990
ARCHIE- the first search engine
_________ University, Alan Emtage
called Archie because _______________
search term had to match exactly
1993
- EXCITE
- __________ undergraduates
- different from Archie
because____________
1993 (cont)
WORLD WIDE WEB WANDERER - first “bot” (robot)
counted ________
_______
and
eventually
- bots evolved into Spiders
catalogued links to make searchable
index
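
A toy illustration of the spider idea above: follow links from page to page and catalogue them into a searchable index. The miniature "web" is a hypothetical in-memory dictionary rather than real pages, and indexing by the words in the link text is an assumption made for this sketch.

```python
from collections import deque

# Hypothetical miniature "web": page URL -> list of (link text, target URL).
TOY_WEB = {
    "home.example": [("search engines", "engines.example"),
                     ("directories", "dirs.example")],
    "engines.example": [("archie history", "archie.example")],
    "dirs.example": [("search engines", "engines.example")],
    "archie.example": [],
}

def crawl(start, web):
    """Follow links breadth-first from start and index targets by link-text words."""
    index = {}             # word -> set of URLs reached by a link containing that word
    seen = {start}         # pages already queued, so no page is visited twice
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for text, target in web.get(page, []):
            for word in text.lower().split():
                index.setdefault(word, set()).add(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index

index = crawl("home.example", TOY_WEB)
print(sorted(index["search"]))
```

The index is built before anyone searches it, which matches the earlier point that you are looking at a fixed database rather than searching live.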
Some Spiders
JumpStation
World Wide Web Worm
RSBE
( __________________________)
first to rank results by relevance
But spiders caused trouble
Because they
_______________________
Jump Station famous for that
1994
- Web Crawler - first to index TEXT on
webpages (rather than just url/page title)
- Yahoo! 2 Stanford grad students' favorite
pages - the first __________
- Infoseek and Lycos
(Lycos reputedly best for technical
searches)
1995
____________: (DEC)– December
- got big fast
-lots of firsts:
natural language queries
boolean techniques
search tips
1996
Hot Bot
-indexes up to _________ pages/day
- until _________, the most powerful
engine
- has boasted it can index the entire web
Metacrawler -- 1st meta-engine
1999
Google!
first engine to pass a billion pages
reports pages ranked by number of hits
Of these
-Google commercially dominant
(about 75% of most websites’ external referrals)
-Microsoft wants to buy it, but can’t because
of antitrust laws
-Yahoo owned, as of 2003:
Overture, Alltheweb, Alta Vista, Inktomi
Privacy and Google
- every time you access a page, you get a
cookie on your hard drive, recording your
IP address, the date/time, your search
terms, your browser configuration
- the cookies are basically “immortal”
(expire ________)
How is this info used?
Google customizes your search results
using your IP number
The latest on Google and privacy
Google changed its privacy policy in July 2004
now they
- pool the information they collect on you from
all their various services.
- may keep this information indefinitely
- may give this information to whomever they
wish.
if they "have a good faith belief that
access, preservation or disclosure of
such information is reasonably
necessary to protect the rights, property
or safety of Google, its users or the
public."
Focus
- Before 9/11, privacy issues turned on
consumer protection
- but now government is thinking about
looking at your information in the name
of national security
TIA (total information awareness)
Goal: To anticipate terrorist activities
What: credit card, travel, email, telephone records
2002 - Google chief declined comment when asked by
NY Times if Google had been subpoenaed to turn
over information it gathered
Subject Directories
human-compiled and maintained
(review: search engines are ______)
index only home pages
(review: search engines index______)
(Dis)advantages of Subject Directories
use hierarchies
Smaller Content
may be annotated
But quality control varies
Virtual Libraries (some SDs)
Created and maintained by info
professionals
Internet Public Library
Resource Discovery Network ( from
Britain)
Subject Directory Approaches
General - searching from one site
Clearinghouses – searching from multiple
sites
Examples of Subject Directories
general
- www.yahoo.com
- www.looksmart.com
Clearing houses
- Argus (www.clearinghouse.net)
- About.com
- Virtual Library (www.Vlib.org)
Search Tips
Get specific by using Boolean Logic
AND OR NOT
(often ___ and ____)
A Boolean Example
1. Tupac Amaru AND Peru
2. Tupac Amaru OR MRTA (Movimiento
Revolucionario Tupac Amaru)
3. Tupac Amaru NOT Shakur (the rap singer
killed in 1996)
To be exact, use quotes “Tupac Amaru”
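
The three Boolean queries above can be sketched with Python sets. The three-document collection is made up for illustration, and real engines differ in syntax and matching rules, so this is only a picture of the logic: AND is intersection, OR is union, NOT is difference.

```python
# Hypothetical mini-collection: document id -> text.
DOCS = {
    1: "Tupac Amaru led a rebellion in Peru",
    2: "Tupac Amaru Shakur was a rap singer",
    3: "The MRTA operated in Peru",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Documents containing a single word."""
    return INDEX.get(word.lower(), set())

def phrase(text):
    """Documents containing the quoted phrase (simple substring check)."""
    needle = text.lower()
    return {doc_id for doc_id, body in DOCS.items() if needle in body.lower()}

tupac = phrase("Tupac Amaru")
and_peru = tupac & term("Peru")        # 1. Tupac Amaru AND Peru
or_mrta = tupac | term("MRTA")         # 2. Tupac Amaru OR MRTA
not_shakur = tupac - term("Shakur")    # 3. Tupac Amaru NOT Shakur

print(and_peru, or_mrta, not_shakur)
```

AND and NOT narrow the result set, while OR widens it, which is why OR is useful for alternate names such as MRTA.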
More Search Tips
Use Wildcards
like * # ?
for roots like psychol*
for variant spellings
like color/colour: colo*r
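
A small sketch of the wildcard idea, assuming shell-style matching where * stands for any run of characters. Individual engines use different symbols and rules, so the word list and the `wildcard_match` helper are illustrative only.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary to search.
WORDS = ["psychology", "psychologist", "psychiatry", "color", "colour", "colony"]

def wildcard_match(pattern, words):
    """Return the words matching a shell-style wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return [w for w in words if fnmatchcase(w.lower(), pattern)]

roots = wildcard_match("psychol*", WORDS)    # one root, many endings
variants = wildcard_match("colo*r", WORDS)   # variant spellings of one word

print(roots)
print(variants)
```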
More Search Tips
Many urls are predictable, so guess first
utampa.ed
Don’t look at every returned page
Use your Tools
Pay attention to the relevance
rankings some engines give you
Organize your bookmarks
The Invisible Web: What Most
Search Engines don’t find
Specialized databases (7,000+)
What’s a Specialized Database
Searchable indexes of subjects like
email addresses, magazine
archives, government data files, census
info, medical info, etc.
2 types: full text and bibliographic
How is that different from a subject
directory?
Subject dir are collections of urls
Specialized dbs are collections of actual
data/information
Why they aren’t found
-search engines are databases
themselves- programming one
database to search another is difficult
-specialized databases often require
search forms
-databases don’t rely on fixed urls
-text in databases is in a form not usable by
search engines (like Adobe PDF)
What can you do?
pick your search engine carefully
google for instance lets you use the
keyword database plus the subject you
want
Some helpful sites
Beaucoup
Librarian's Index to the Internet
Gary Price
Two kinds of web databases
full text -- FindLaw
(yahoo)
bibliographic -- medline
(librarian's index to the internet)