Search Engine Comparisons - Pennsylvania State University
Search Engine Comparisons
By: Thomie Ventura
Search Engines
Today, much, but not all, of the work we do
revolves around the web
Internet is accessible to almost anyone
Impact on businesses, schools, professionals,
home users
The Web changes every day, but not everything on it
is accessible
FTP Servers
Main way of sharing files on the Internet up to 1990
FTP Servers and FTP Clients
Downside
Servers were mostly known through word of mouth
Not everyone was setting up their servers
Grandfather, Grandmother, Mother
Archie (Grandfather)
Used FTP file servers
Veronica (Grandmother)
Used Gopher file servers
World Wide Web Wanderer (Mother)
First Robot
Caused Controversy
Are Robots a good or bad thing for the Internet?
“Web Search”
What exactly does it mean?
Does it involve tools?
Accessing proprietary databases such as
www.Factiva.com or www.dialog.com
We’ll focus on “web search” as an open web
source, and look at it from a searcher’s point of view
Difficulty Coping
Volume and Speed of the web and Search
Engines
Something new happens each day
So many things to do, so little time to do them
Dynamic nature of web searching (indexing new
documents)
Staying up to date with traditional tools (which also undergo
changes)
Other random issues that arise every day
Will an “open web” search engine
always have my answers?
Questions that should arise about searching the
web
How long did it take to get it?
What is the database or search engine?
What kinds of questions will it help me answer?
Open web will not always give me the answer
What can it be used for?
Quality of Information
Anyone can become a publisher
Evaluating content is crucial
Reputation
Background
Qualifications
Where did it come from?
What is its purpose?
Relevant to my topic?
Limitations of General Web Search
Tools
Spiders don’t crawl in real-time
Recency
Linked or Submitted Sites
If a website contains 1,000 pages, that does not mean
search engines make all of them accessible
Invisible or Hidden Web resources
Examples:
Interactive resources that return “custom” pages
Registration
Why is it hidden?
Created on the fly
Spiders don’t fill in registration forms
“No-Robot” Tag
Hidden is not always bad
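The “No-Robot” point above can be illustrated with a minimal sketch. It assumes the slide refers to the Robots Exclusion convention (a robots.txt file); the rules, the spider name "ExampleSpider", and the example.com URLs are made up for illustration. Python's standard urllib.robotparser module does the checking.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, agent="ExampleSpider"):
    """Return True if a well-behaved spider may fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))       # True
print(allowed("http://example.com/private/report"))   # False
```

Pages excluded this way never reach a general search engine's index, which is one reason they count as "hidden" even though they are publicly reachable.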
Research and Effort
Without proper tools, we can make large
databases even larger
Google
Altavista
Excite
Distributing Information Properly
Specialized, Focused, and Site-Specific
Search Tools
Necessary and Important
Hidden Web is out of reach of general purpose
Search Engines
More Precision than Recall
Examples:
www.Psychcrawler.com
www.Inomics.com
[http://newssearch.bbc.co.uk/ksenglish/query.htm]
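"More Precision than Recall" can be made concrete with a small calculation. Precision is the share of returned documents that are relevant; recall is the share of all relevant documents that were returned. The document IDs below are invented for illustration and do not come from any real engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one set of search results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four documents on the open web are truly relevant.
relevant = {"a", "b", "c", "d"}
specialized = ["a", "b"]                               # few results, all on topic
general = ["a", "b", "c", "d", "e", "f", "g", "h"]     # everything, plus noise

print(precision_recall(specialized, relevant))  # (1.0, 0.5)
print(precision_recall(general, relevant))      # (0.5, 1.0)
```

In this toy case the specialized tool misses half of the relevant documents but returns nothing irrelevant, while the broad tool finds everything at the cost of wading through noise.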
Identifying and Collecting
Specialized Engines
Profusion
[http://www.profusion.com]
Librarians' Index
Covers a large number of specialized and invisible web
databases
[http://www.lii.org]
Metasearch Engines
Major Disadvantages
You get it all! High recall, low precision
Use only the basic features of each search engine
Send queries to “pay for placement” engines
A good metasearch Engine
www.vivisimo.com
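As a rough sketch of what a metasearch engine does with the answers it collects, the function below interleaves several ranked result lists and drops duplicates. It is an assumption about one simple merging strategy, not a description of how Vivisimo or any other real service works, and the URLs are placeholders.

```python
from itertools import zip_longest

def merge_results(*ranked_lists):
    """Interleave ranked result lists round-robin, keeping the first copy of each URL."""
    seen = set()
    merged = []
    # zip_longest yields the 1st result of every engine, then the 2nd, and so on;
    # shorter lists are padded with None.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical results from two engines for the same query.
engine_a = ["site1.example", "site2.example", "site3.example"]
engine_b = ["site2.example", "site4.example"]

print(merge_results(engine_a, engine_b))
# ['site1.example', 'site2.example', 'site4.example', 'site3.example']
```

Because every engine's list is included in full, the merged list tends toward high recall and low precision, which matches the disadvantage noted on the slide.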
Old Pages, GONE!
Trying to find old pages?
Contact webmaster
Fortunately
Archiving Old Material
Example:
[http://www.clinton.nara.gov/index.html]
Alexa Research
[http://archive.alexa.com/]
Carries over 18 terabytes of data covering some 5 million Web
sites and some 1.9 billion pages
Search Engine Sizes
Search engine size analysis as of December 11, 2001
Google Dominates
Sizes Over Time
Closer Look
Ways of Coping
Use the Search Engine
Conduct research on a topic
This will familiarize you with the search engine
You can see how results are displayed
You can judge the relevancy of returned documents
It lets you gather your own bookmarks
Understanding limitations
What to do with these limitations?
Know limitations
Use more than one search engine
Use “specialized” search engines that go deeper into
a site to collect more information
Use “invisible web” resources
Use web directories, and bookmark important sites
Ability to Search Multimedia
Now Available, but still expanding
A wait of weeks now becomes instant
Search tools provide access to video and audio
material using a non-text mechanism to access the
material (e.g., searching for a specific background or type of color)
Still image tools
Google, Altavista, and Fast use the text surrounding an image
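The idea of finding still images through nearby text can be sketched as follows. This is only an illustration under stated assumptions: "surrounding text" is taken to mean the alt attribute plus the words in the same paragraph, and the HTML snippet and file names are invented. It does not describe how Google, Altavista, or Fast actually index images.

```python
from html.parser import HTMLParser

class ImageTextIndexer(HTMLParser):
    """Map each image in a page to the words found in the same <p> paragraph."""

    def __init__(self):
        super().__init__()
        self.index = {}     # image src -> set of descriptive words
        self._images = []   # images seen in the current paragraph
        self._words = []    # words seen in the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if attrs.get("src"):
                self._images.append(attrs["src"])
                # alt text counts as part of the surrounding text
                self._words.extend((attrs.get("alt") or "").lower().split())

    def handle_data(self, data):
        self._words.extend(data.lower().split())

    def handle_endtag(self, tag):
        if tag == "p":  # paragraph finished: attach its words to its images
            for src in self._images:
                self.index[src] = set(self._words)
            self._images, self._words = [], []

def search_images(html, keyword):
    """Return the images whose surrounding text contains the keyword."""
    indexer = ImageTextIndexer()
    indexer.feed(html)
    indexer.close()
    return sorted(src for src, words in indexer.index.items()
                  if keyword.lower() in words)

PAGE = ('<p><img src="cat.jpg" alt="Tabby cat"> sleeping on a sofa</p>'
        '<p><img src="dog.jpg"> A dog in the park</p>')

print(search_images(PAGE, "sofa"))  # ['cat.jpg']
print(search_images(PAGE, "dog"))   # ['dog.jpg']
```

The search never looks at the pixels, so an image with no descriptive text nearby stays invisible to this kind of tool.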
Become Aware of Multimedia Search
Video Searches
Virage www.virage.com
TVeyes www.tveyes.com
ShadowTv www.shadowtv.com
Wordwave www.wordwave.com
SpeechBot (keyword search engine demo by Compaq, uses speech
technology to create real-time transcripts) www.speechbot.com
Image Searches
Webseek (search or browse by criteria in the image)
www.ctr.columbia.edu/webseek/
Visoo (uses software that looks for words embedded in images)
www.visoo.com
Making Old Pages Stay
Long Term?
Offer comments (suggest how material can be made more
accessible and searchable; a great archive of content without
the correct means of accessing it is a hassle and of little
use)
Short Term?
Take advantage of Google's cache feature (Google crawls a
site, makes a copy unless the site owner disallows it, and stores
the copy on its server; if the site is gone, the copy is still on
Google's server. In the search results, go to “cached” next to the
URL. The copy will not always be there: the next time the spider
crawls the site and finds the page missing, the copy is no longer kept.)
www.savethis.com (lets you save web pages, and access them)