Start Your Search Engines
A. Craig Dixon
Madisonville Community College
August 24, 2007
http://www.madisonville.kctcs.edu/facstaf/cdixon
Content
• The Challenge of Searching the Web
• Directories, Engines, and Meta Engines
• Constructing an Efficient Search Query
• Finding Multimedia Content
• Other Good Places to Find Stuff
The Challenge
• There are literally billions of pages on the World Wide
Web (often loosely called the Internet) on topics from aardvarks to
Zunis. How do you find what you are looking for?
• Chances are, you've already used some type of web
search. You may have turned up thousands or even
millions of hits that may or may not have been relevant
to your interest.
• Doing a web search is not the problem. Doing an
efficient, tightly focused web search should be the goal.
Web Resources
• A resource is a file found on the Internet. Some common
resource types are:
– HTML: “Hypertext Markup Language”; the language of web
pages. By far, the most common web resource.
– PDF: “Portable Document Format”; a proprietary document
format by Adobe Systems. Viewable using Adobe's free Acrobat
Reader software.
– JPG, GIF, and PNG: web graphics.
• When you perform a web search, you are searching for
a web resource.
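As a rough illustration of resource types, the sketch below guesses a type from a file name's extension using Python's standard mimetypes module. The file names are made up for the example, and real search engines rely on more than the extension.

```python
import mimetypes

# Hypothetical file names, used only for illustration.
resources = ["index.html", "brochure.pdf", "photo.jpg", "logo.gif", "chart.png"]

def resource_type(name):
    """Return the MIME type guessed from the file extension, or 'unknown'."""
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "unknown"

for name in resources:
    print(name, "->", resource_type(name))
```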
Defining Terms
• A query is the string of characters submitted to a
web search mechanism. A query can be a
combination of words, quoted phrases,
operators, and keywords.
• A hit is a result returned by the search. It usually
includes the resource's title, a short description
of the resource, and the Uniform Resource
Locator (URL), or “web address” in the vernacular.
Types of Web Search Mechanisms
• Although the term “search engine” is often used as a
“catch all” for web search mechanisms, there are
actually three types:
– Search engines
– Meta-search engines
– Directories
• Knowing how each type of mechanism works can help
you find what you are looking for more easily.
Search Engines
• Traditional search engines return relevant results based
on complex algorithms.
• Their “knowledge” of the web is based on their use of
Internet spiders.
– Internet spiders are programs that request a web page, parse
its contents, and return the results to the search engine.
– As part of the parsing process, the spider identifies hyperlinks
on the page. It chooses one of these hyperlinks to decide
which page to parse next.
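The request, parse, and follow-links cycle can be sketched as below. This is a minimal illustration, not how any particular engine works: the PAGES dictionary is a made-up in-memory "web" standing in for real HTTP requests, and the sketch queues every new link it finds rather than choosing just one.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny made-up "web" held in memory so the sketch runs without network access.
PAGES = {
    "/home": '<a href="/cats">Cats</a> <a href="/dogs">Dogs</a>',
    "/cats": '<a href="/home">Home</a> <a href="/siamese">Siamese</a>',
    "/dogs": '<a href="/home">Home</a>',
    "/siamese": "No links here.",
}

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start):
    """Visit pages breadth-first and return them in the order parsed."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)  # stands in for requesting the page
        if html is None:
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never parse the same page twice
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/home"))
```

The seen set matters: without it, the links back to /home would send the spider in circles.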
Advantages and Disadvantages of
Search Engines
• Constantly “crawling” the Internet with spiders means
content is relatively current.
• Algorithms are usually good at ranking page results, but
can be manipulated.
• Most search engines have policies against deceitfully
manipulating their algorithms. Some sites have been
“de-listed” for violating these policies.
Examples of Search Engines
• A few popular traditional search engines are:
– Google (www.google.com)
– Ask Jeeves! (www.ask.com)
– MSN Search (www.msn.com)
– Lycos (www.lycos.com)
• Yahoo has a traditional search engine component, but
also has a directory component (discussed later).
Meta-Search Engines
• While meta-search engines do use page-ranking
algorithms, they do not use spiders to gather web
content.
• Instead, meta search engines query multiple traditional
search engines and rank the hits returned from all of
them.
• Some meta-search engines list results by the engine that
returned them, while others drop the information about which
engine returned each hit altogether.
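One way a meta-search engine might combine ranked lists is sketched below. The engine names, URLs, and the 1/rank scoring rule are all assumptions chosen for illustration; real meta-search engines use their own ranking methods.

```python
from collections import defaultdict

# Made-up ranked results from three hypothetical engines (best hit first).
ENGINE_RESULTS = {
    "engine_a": ["cats.example/siamese", "cats.example/care", "pets.example/cats"],
    "engine_b": ["cats.example/care", "cats.example/siamese"],
    "engine_c": ["pets.example/cats", "cats.example/siamese", "vets.example/faq"],
}

def merge(results_by_engine):
    """Combine ranked lists; a URL earns 1/rank from each engine that returned it."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
            sources[url].append(engine)
    # Highest combined score first; ties broken alphabetically by URL.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ranked]

for url, engines in merge(ENGINE_RESULTS):
    print(url, "from", ", ".join(engines))
```

Keeping the list of source engines with each hit supports either display style: show it, or simply leave it out.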
Advantages and Disadvantages of
Meta-Search Engines
• Because it draws on several engines, a meta-search engine
typically covers more of the web than any single traditional search engine.
• The quality of the hits returned is only as good as the
weakest algorithm used by any one of the traditional
search engines.
• Though most meta-search engines try to eliminate
duplicate hits, they sometimes aren't successful.
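Duplicate elimination is harder than it sounds because the same page can be written as several different URLs. The sketch below shows one simple approach, assuming that differences in letter case, a leading "www.", and a trailing slash do not matter. Those rules are illustrative assumptions and will not catch every duplicate.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to a comparison key (assumed rules, for illustration only)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key

def remove_duplicates(urls):
    """Keep the first occurrence of each page, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

# Made-up hits as they might arrive from different engines.
hits = [
    "http://www.Example.com/cats/",
    "example.com/cats",
    "http://example.com/dogs",
]
print(remove_duplicates(hits))
```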
Examples of Meta-Search Engines
• A few popular meta-search engines are:
– Dogpile (www.dogpile.com)
– Excite (www.excite.com)
– MetaCrawler (www.metacrawler.com)
Web Directories
• Web directories are large quantities of links,
grouped by categories and including
descriptions. Directories are entirely created and
maintained by humans.
• Instead of typing a query, directory users
navigate through the various categories until
they find what they are looking for.
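Navigating a directory amounts to walking down a tree of categories. A minimal sketch, using a made-up directory with hypothetical category names and links:

```python
# A made-up directory: categories are dicts, and the lowest categories hold links.
DIRECTORY = {
    "Recreation": {
        "Pets": {
            "Cats": ["cats.example/siamese", "cats.example/care"],
            "Dogs": ["dogs.example/training"],
        },
    },
    "Science": {
        "Biology": ["bio.example/intro"],
    },
}

def browse(directory, path):
    """Follow a list of category names and return what is found there."""
    node = directory
    for category in path:
        node = node[category]  # raises KeyError for an unknown category
    return node

# Each step in the path corresponds to one click in a real directory.
print(sorted(browse(DIRECTORY, ["Recreation", "Pets"])))
print(browse(DIRECTORY, ["Recreation", "Pets", "Cats"]))
```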
Advantages and Disadvantages of Web
Directories
• Because they are maintained by humans, directories
usually are not subject to deceitful coding practices the
way algorithm-driven searches are.
• Maintaining a directory can be burdensome, and the
information may not be up-to-date.
• Because going from category to category requires
visiting different pages, it often takes several clicks to
get to what you want.
Examples of Web Directories
• A few popular web directories are:
– The Open Directory Project (www.dmoz.org)
– Answers.com (www.answers.com)
– LookSmart (www.looksmart.com)
• Yahoo also includes a directory component.
Constructing More Efficient Queries
• Use Quoted Phrases
• Use Boolean Operators
• Use as Many Keywords as Possible
Use Quoted Phrases
• If you are looking for Microsoft Windows, type
“Microsoft Windows” (with quotes) instead of
Microsoft Windows (without quotes).
• The former searches for pages that contain the
phrase Microsoft Windows, while the latter
searches for pages containing both the words
Microsoft and Windows anywhere in the page.
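The difference between the two searches can be sketched as follows. The three pages are made up for the example, and the matching is deliberately simple (lowercase words only), which is an assumption rather than how any real engine tokenizes text.

```python
import re

# Made-up page contents for illustration.
PAGES = {
    "page1": "Microsoft Windows is an operating system.",
    "page2": "Microsoft sells software. Clean your windows weekly.",
    "page3": "Linux is an operating system.",
}

def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_all_words(text, query):
    """Unquoted search: every query word appears somewhere on the page."""
    page_words = set(words(text))
    return all(word in page_words for word in words(query))

def contains_phrase(text, phrase):
    """Quoted search: the query words appear next to each other, in order."""
    page_words = words(text)
    target = words(phrase)
    n = len(target)
    return any(page_words[i:i + n] == target
               for i in range(len(page_words) - n + 1))

query = "Microsoft Windows"
print("Unquoted:", [p for p, t in PAGES.items() if contains_all_words(t, query)])
print("Quoted:  ", [p for p, t in PAGES.items() if contains_phrase(t, query)])
```

Here page2 matches the unquoted search even though it has nothing to do with the operating system, which is exactly the kind of hit the quotes filter out.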
Use Boolean Operators
• The Boolean operators are AND, OR, and NOT.
– Some search engines automatically include AND between
search terms. Others require you to add the word AND. Still
others require a plus sign (+).
– Some search engines automatically include OR instead of
AND between search terms. Others require the word OR.
– Some search engines require the word NOT; others require a
minus sign (-).
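Whatever syntax a given engine uses, the three operators behave like set operations on the pages that contain each word. A minimal sketch with made-up pages (the index structure here is an illustrative assumption):

```python
import re

# Made-up page contents for illustration.
PAGES = {
    "page1": "Siamese cat care and feeding",
    "page2": "Cat and dog training tips",
    "page3": "Dog grooming guide",
    "page4": "Siamese fighting fish care",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = {}
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index

INDEX = build_index(PAGES)

def pages_with(word):
    """Return the set of pages containing the word (empty if none)."""
    return INDEX.get(word.lower(), set())

cat_and_siamese = pages_with("cat") & pages_with("siamese")   # cat AND siamese
cat_or_dog = pages_with("cat") | pages_with("dog")            # cat OR dog
siamese_not_cat = pages_with("siamese") - pages_with("cat")   # siamese NOT cat

print(sorted(cat_and_siamese))
print(sorted(cat_or_dog))
print(sorted(siamese_not_cat))
```

AND narrows a search, OR widens it, and NOT removes an unwanted meaning of a word.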
Use As Many Keywords As Possible
• Consider that a test search I did for the term
“cat” returned 664,000,000 results; restricting
the search to “siamese cat” returns 2,150,000.
Still too many, but a substantial reduction.
• You can always reduce the number of keywords
later if you don't find what you are looking
for.
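To put the reduction in perspective, the short calculation below uses only the two hit counts quoted above.

```python
# Hit counts reported in the test search above.
hits_cat = 664_000_000
hits_siamese_cat = 2_150_000

reduction_factor = hits_cat / hits_siamese_cat
share_remaining = hits_siamese_cat / hits_cat

print(f"About {reduction_factor:.0f} times fewer hits")
print(f"{share_remaining:.2%} of the original hits remain")
```

Adding one keyword cut the results by a factor of roughly 300, leaving about a third of one percent of the original hits.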
Finding Multimedia Content
• Recently, many search engines have begun to index
graphical and multimedia content to be searched.
• The same efficiency measures apply to multimedia
searches.
• It should be noted that the hits returned by multimedia
searches may include copyrighted content. Be sure to
check the legality of using multimedia content returned
by these searches.
What to Find Where
• Images
– Google (images.google.com)
– Yahoo (images.search.yahoo.com)
– Ask Jeeves (images.ask.com)
– AltaVista (www.altavista.com/image/default)
• Audio
– AltaVista (www.altavista.com/audio/default)
– Yahoo (audio.search.yahoo.com)
• Video
– AltaVista (www.altavista.com/video/default)
– Yahoo (video.search.yahoo.com)
Other Places to Find Stuff
• Forums and Newsgroups
• Blogs
• Wikis
Forums and Newsgroups
• A forum (sometimes called a message board) is a location
on the web where users can post questions or other
musings and other users can view and comment on
them.
– A post on an original topic starts a thread.
– All subsequent posts on the same topic are part of that
thread.
• A newsgroup is like a forum that is not tied to a single web
site. Messages are distributed over Usenet and read with a newsreader
program, and some services also deliver them by email to subscribers.
Advantages of Forums and
Newsgroups
• The main advantage of forums and newsgroups is that
they are often focused on very narrow subjects, which
means generally only people with an interest in or
knowledge of that subject will post messages.
• Often, a strong sense of community is built between
members on forums and newsgroups.
• Forums are usually staffed by at least one moderator, who
tries to eliminate garbage posts.
Disadvantages of Forums and
Newsgroups
• Usually, anyone can post content to a forum or newsgroup; be wary of
anything you read.
• As with standard email, forums and newsgroups can be used as media for
spam.
• Users sometimes resort to name-calling or insults directed at other users.
These posts are called flames. When two or more individuals begin exchanging
flames, the resultant conversation is called a flame war.
• Some users deliberately post offensive or controversial comments to draw the
ire of other users. These users are called trolls. Users who bite on these
enticements are said to be feeding the trolls.
• Active newsgroups generate a lot of content. If one is deeply interested in
the subject, this may be desirable, but if one is only casually interested, the
deluge of messages from a newsgroup can be overwhelming and annoying.
Blogs
• A relatively recent phenomenon, a blog is an online
journal of sorts.
– The name “blog” is a contraction of “web log.”
– The group of all blogs on the Internet is affectionately
referred to as the “blogosphere.”
• Blogging sites provide users with a method of
generating content on the web without the messy
details of HTML coding.
• Blogs may be maintained by an individual or a group.
Advantages and Disadvantages of
Blogs
• Because they are a quick and easy way to create web
content, companies may use blogs to distribute updates
about product development or company news.
• Creating a popular blog can elevate a person to
“expert” (or at least “celebrity”) status on the Internet.
• Effective blogs usually focus on a single topic or niche,
but many personal blogs are little more than a
hodgepodge of the author’s thoughts or opinions.
Examples of Blogging Sites
• Several popular free sites have blogging capability
including:
– MySpace
– LiveJournal
– Facebook
– Blogger.com
– Yahoo! 360
– Xanga
Wikis
• A wiki is software that allows
multiple users to collaborate in creating web
content. For example, many editors could work
on a single web article.
• By far, the most popular example of a wiki is
Wikipedia, an online encyclopedia that can be
edited by anyone.
Advantages and Disadvantages of
Wikis
• Having a multitude of editors brings in several
viewpoints and yields a more complete
overview of a given subject.
• Quality control in large wikis is difficult, so
fact checking by the user of the
information is essential. Older information has
presumably been reviewed more often, and is
thus usually more reliable than recent updates.