Class5SearchingTheInternet

Exploring the Internet
91.113-001
Instructor: Michael Krolak
Authors: P. D. & M. S. Krolak Copyright 2005
Blog of the Week
“Does my computer really know me? What an interesting thought. Could my computer
come alive? Does it really know what I am thinking or how I am feeling? Sherry Turkle's
article "Who Am We" made me stop to ponder for a bit.
Sherry Turkle writes:
"Granting a psychology to computers can mean that objects in the category 'machine,'
like objects in the categories 'people' and 'pets,' are fitting partners for dialog and
relationship. Although children increasingly regard computers as mere machines, they
are also increasingly likely to attribute qualities to them that undermine the
machine/person distinction."
This passage really struck a chord with me. Before this article, I always thought about my
computer as a machine with no real function other than to provide me with information or
maintain data as I needed. Sherry Turkle changed that ever so slightly. The computer is
after all just a box with electrical components and wires. It needs no life-sustaining oxygen
or food to survive.
The gaming lifestyle is a point that I could relate to. Having a teenage daughter at home,
she too plays Sims. Although I never asked her if she imagined herself in those roles that
she creates. She did mold her characters after her and her friends. Often stopping to let
me know that this was her friend so and so or that was what her husband was going to
be like. At the age of 13 we bought Sims for her. She has since added several updates or
expansion packs for the original game.”
Blog of the Week (cont.)
“She still plays the game but not like she did in the past. She has returned to the real
world. She does not proclaim the figures in the game as being figures of herself or
friends. I asked her if she still creates figures of someone she would like to be, and she told me
that was crazy. "Dad it's like not real life, It's like only a game, be real Dad." then I get the
look... If you are a parent you know the one, the get a life Dad look.
The so called MUDers are a totally different breed. This is how I see it anyway. They
probably do have some identity or self esteem issues. I have friends who are very big into
computers, and I recall them talking about these "MUDs" or rooms as they called them.
This was when computers were first being introduced. I was amazed at what they would
do on the computer. I was also confused about the whole secret identity they would
maintain. One friend, I will call him Fred, would be in his MUD for hours. I would ask him
what he was doing he tried to explain it to me to no avail. I just did not comprehend the
whole idea of the room lifestyle. I never thought about it again till now. We are going back
about eighteen years now. Man am I old.
Although Fred never took it to as far an extreme as some of the folks in the article, he did
get consumed by his persona in the room.
Humans are a very strange species for sure. Some of the people from the article may
have some serious identity issues for sure. Perhaps maybe they just want to be in a
different place or a different time. I am not sure why they do what they do, but all the
power to them if they can truly differentiate between their multiple personae. Heaven help
them if they cannot.”
Search for Information on the Web
Finding information on the web
requires some understanding of how
the various types of search
engines work.
Archives that capture the
changes in the documents on
the web are highly useful for
those in the social sciences,
technology, and business
dynamics.
“Intelligence is not the ability to store
information, but to know where to find it.”
- Albert Einstein
How do we find information?
• Memory
• Media
– Books
– Movies
– Music
– Art
• Observe
• Ask other people
The Problem with the Internet
• The “Surface Web” contains 2.5 Billion pages.
• Each day 7.5 million web pages are added to the World Wide Web
• Information is submitted to the web without any context or test of
validity
The Archives of the Web
1. Archival of the Web’s websites
2. Google’s archive of the Internet
newsgroups.
The Wayback Machine
• Frustrated by dead links? There is an answer: the Wayback
Machine at http://www.archive.org/
• Enter the URL of the dead link and the archive shows the
link's history (how the page changed over time) and lets you
view archived copies of the page.
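A small sketch of how an archived copy can be addressed directly. It assumes the Wayback Machine's commonly used address pattern, in which a capture date (digits such as YYYYMMDD) is placed in front of the original URL and `*` asks for the list of all captures. The function name `wayback_url` is made up for this illustration, and the code only builds the address; it does not contact the archive.

```python
def wayback_url(original_url, timestamp=""):
    """Build a Wayback Machine address for original_url.

    timestamp is an optional string of digits (for example "20041021").
    With no timestamp, "*" requests the overview of all captures.
    The URL pattern is an assumption based on how the archive's
    addresses are commonly written.
    """
    if timestamp and not timestamp.isdigit():
        raise ValueError("timestamp must contain digits only")
    marker = timestamp if timestamp else "*"
    return f"https://web.archive.org/web/{marker}/{original_url}"


if __name__ == "__main__":
    # A capture near a given date, then the full history of captures.
    print(wayback_url("http://www.boston.com/", "20041021"))
    print(wayback_url("http://www.boston.com/"))
```

Pasting either printed address into a browser should lead to the archived copy closest to that date, or to the capture history, if the archive holds the page.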
Google’s Newsgroup archive
http://groups.google.com/
• Archives over 100,000 groups.
• Goes back over 30 years for some groups.
• There are for-fee sites that provide competing services.
• Depending on the group, it can provide a treasure trove
of insight into the cyber information society and its early
history.
• Not all messages in the database are true, have merit or
redeeming value, or are appropriate for children.
Searching for
Information on the web
1. Search engines
2. Meta-Search engines
What is a Search Engine?
search engine
n.
1. A software program that searches a database and
gathers and reports information that contains or is
related to specified terms.
2. A website whose primary function is providing a
search engine for gathering and reporting information
available on the Internet or a portion of the Internet.
Source: The American Heritage® Dictionary
Copyright © 2002, 2001, 1995 by Houghton Mifflin Company. Published by Houghton
Mifflin Company.
Search Engines
Search engines have two parts:
1. The search system sends a program called a spider or bot
(robot) out onto the Internet.
• The spider traces all the links and returns all the pages found.
• The pages are characterized by algorithms and stored in
databases.
2. The retrieval system takes a query and matches it against the
databases.
• The retrieval rank orders the responses by relevance
• Each search engine uses a unique technique for retrieval and
ranking.
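The two parts can be sketched on a toy example. Everything here is an assumption made for illustration: `TOY_WEB` is a made-up four-page "web" held in memory, and ranking by how often the query words appear is only one simple stand-in, since each real search engine uses its own technique.

```python
from collections import Counter, deque

# Hypothetical miniature web: URL -> (page text, links found on the page).
TOY_WEB = {
    "http://example.com/": ("red sox win the world series",
                            ["http://example.com/sox", "http://example.com/yankees"]),
    "http://example.com/sox": ("red sox sox schedule and scores",
                               ["http://example.com/"]),
    "http://example.com/yankees": ("yankees schedule", []),
    "http://example.com/orphan": ("sox page nobody links to", []),
}


def crawl(seed):
    """Part 1a: the spider follows links from the seed and collects pages."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        pages[url] = text
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Part 1b: store, for each word, the pages it occurs on and how often."""
    index = {}
    for url, text in pages.items():
        for word, count in Counter(text.lower().split()).items():
            index.setdefault(word, {})[url] = count
    return index


def search(index, query):
    """Part 2: match the query against the index and rank by word count."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]


if __name__ == "__main__":
    index = build_index(crawl("http://example.com/"))
    print(search(index, "sox"))
```

The page at `http://example.com/orphan` is never found because no other page links to it, so it can never appear in the results even though it mentions the query word.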
What is a spider?
n.
1. An automated program
which crawls over the World
Wide Web, gathering web
pages for search engines.
Spiders will ignore sites that
explicitly state that they should
not be indexed by search engines.
Also referred to as a
webcrawler, crawler, or bot
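One common way a site states this is a robots.txt file at the top of the site. The sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved spider might check the rules before fetching a page. The rules, the bot name `ExampleBot`, and the URLs are made up for the illustration, and the rules are supplied as text rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is asked to stay
# out of the /private/ part of the site.
ROBOTS_TXT_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT_LINES)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("http://example.com/index.html"))
    print(allowed("http://example.com/private/diary.html"))
```

Following these rules is voluntary on the spider's part; the file is a request, not a lock.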
What are Meta Tags
meta tags
n.
1. Attributes that describe information about
the content of the document. Some spiders
use these tags to determine the relevance of a
site to future queries.
Example
<META NAME="keywords" CONTENT="red sox world champions schilling manny damon">
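As a rough sketch of how a spider might read such a tag, the code below uses Python's standard `html.parser` module to pull the keywords out of a page. The class name `MetaKeywordParser` and the helper `extract_keywords` are made up for this example, and real spiders may weigh or ignore these keywords in their own ways.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the words listed in <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Keywords may be separated by spaces or by commas.
            self.keywords.extend(content.replace(",", " ").split())


def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


if __name__ == "__main__":
    page = ('<META NAME="keywords" '
            'CONTENT="red sox world champions schilling manny damon">')
    print(extract_keywords(page))
```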
How do search engines work?
Meta Search Engines
• Meta search engines are search engines that may use their own
resources to answer a query,
• but mostly they package the user's query and send it to many
other search engines simultaneously (the process is called
spawning) and then wait for the replies to come back.
• After a fixed time the meta search engine takes the responses
received and pulls them together into a report.
• There are many ways to build a meta search engine on this idea.
Some search only the web; others search newsgroups,
newspapers, and scientific journals.
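The spawn, wait, and merge steps can be sketched with Python's standard `concurrent.futures` module. The three "engines" are made-up stand-in functions that return fixed results (one is deliberately slow), so this is only an illustration of the idea, not how any particular meta search engine is built.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]


def engine_b(query):
    return ["http://shared.example/x", "http://b.example/2"]


def slow_engine(query):
    time.sleep(1.0)  # answers too late to be included
    return ["http://slow.example/3"]


def meta_search(query, engines, timeout_seconds):
    """Send the query to every engine at once, wait a fixed time,
    then merge whatever came back, dropping duplicate results."""
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = [pool.submit(engine, query) for engine in engines]
    done, _not_done = wait(futures, timeout=timeout_seconds)
    pool.shutdown(wait=False)  # do not hold up the report for stragglers

    report = []
    for future in futures:  # keep the engines' original order
        if future in done and future.exception() is None:
            for url in future.result():
                if url not in report:
                    report.append(url)
    return report


if __name__ == "__main__":
    print(meta_search("red sox", [engine_a, engine_b, slow_engine], 0.2))
```

The result that both fast engines returned appears only once in the report, and the slow engine's result is left out because it missed the deadline.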
Why is an understanding of how a
search engine works important?
• From the view of a user:
– The user wants to find the information with as few downloads
as possible.
– The easier to use and the more accurate the ranking the better.
• From the view of a web site developer:
– The developer wants the site to be found in the first 5-10
ranked responses to a query.
– The merit of a web design is often judged by its search
rankings. This requires knowing how a given search engine
ranks a page.
When in doubt ask a librarian:
• The librarian is a trained professional who is well
versed in using the various WWW resources for
finding answers on a vast array of subjects.
• The librarian should be used for difficult searches;
but the student will wisely observe, learn, and
contemplate the librarian's techniques, resources,
and methods.
What is a Subject Directory?
subject directory
n.
1. An Internet research tool on the World Wide
Web that organizes Internet resources by
subject headings and subheadings. Subject
directories are usually compiled by human
beings who apply some selection criteria to
resources included in the database.
Examples of Subject Directories
• www.yahoo.com Yahoo!
• http://bubl.ac.uk/ BUBL
• http://www.ipl.org/ Internet Public Library
• www.about.com About.com
• www.jumpcity.com Jump City
• http://www.joeant.com/ Joe Ant
What is a Meta Search Engine?
meta search engine
n.
1. Meta search engines are search engines that
use their own database as well as sending the
query to many other search engines
simultaneously (called spawning) and report the
unique responses from other search engines.
2. Some meta search engines are limited to particular
sources, such as only the web, newsgroups, newspapers,
or scientific journals.
Examples of Meta Search Engines
• Ask Jeeves -- frequently get the answer in the first
pass. Jeeves allows queries in natural language.
• Dogpile -- for its variety of sources (web,
newsgroups, newspapers)
• Ixquick
• Metacrawler
• ProFusion
The Deep Web
What is the invisible or deep web
• Invisible Web (n.) Also referred to as the deep Web, the term
refers to either Web pages that cannot be indexed by a typical
search engine or Web pages that a search engine purposely does
not index, rendering the data “invisible” to the general user. One of
the most common reasons that a Web site’s content is not indexed
is because of the site’s use of dynamic databases, which opens
the door for a potential spider trap. Web pages can also fall into
the invisible Web if there are no links leading to them, since search
engine spiders typically crawl through links that lead them from
one destination to another. Data on the invisible Web is not
inaccessible; the information is out there—it is stored on a Web
server somewhere and can be accessed using a browser—but the
data must be found using means other than the general-purpose
search engines, such as Google and Yahoo!.
Source: http://www.webopedia.com/TERM/I/invisible_Web.html
The deep web
• The deep web is not mysterious, it simply means
that normal search engines that use spiders that
go from one link to another will not work with
pages that are generated on the fly from data
requested from a database, or not linked to other
data, etc.
• Examples of deep web sites are the yellow or
white pages, catalogues, and patents.
• Google can index and search PDF, text, and Word
documents.
What is the Deep Web?
• Estimated to be 500 times the size of the surface web:
500 × 2.5 billion = 1.25 trillion web pages.
Using the Search tools to
find information on the web
Successful searching
Plan your search:
1. What words will appear only on the right
web page? Must they all be there, or are there
alternatives? The most specific concept is the
best.
2. If you do not know your topic well, use a
meta search engine to get up to speed. Then refine
your search with a search engine like Google or
AltaVista.
3. If the topic is technical, use a virtual library site
to find information reviewed by experts.
What is Boolean Logic?
We use Boolean Logic to evaluate the
truth of one or more propositions. There
are three important operators: AND, OR,
NOT
• AND – true only if A and B are both
true.
• OR – true if A, B, or both are true.
• NOT – true only when A is false.
When searching for information, we use
Boolean logic to find results that are
relevant to our search terms. If a web
page is relevant to a search term, the
search engine evaluates the page as true.
Examples of Searching
with Boolean Logic
• Yankees and Choke
– All web pages that contain the
terms Yankees and Choke.
• Yankees or Choke
– All web pages that contain the
word Yankees.
– All web pages that contain the
word Choke
– All web pages that contain the
terms Yankees and Choke
• Choke and not Yankees
– All web pages that contain the
word Choke, but don’t contain the
word Yankees
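The three examples above map directly onto set operations, as this small sketch shows. The four "pages" are made up for the illustration; AND becomes set intersection, OR becomes union, and "and not" becomes set difference.

```python
# Hypothetical pages: name -> text.
PAGES = {
    "page1": "the yankees choke in the playoffs",
    "page2": "yankees schedule for october",
    "page3": "how not to choke under pressure",
    "page4": "red sox world champions",
}


def pages_containing(term):
    """Return the set of pages on which the word appears (any letter case)."""
    term = term.lower()
    return {name for name, text in PAGES.items()
            if term in text.lower().split()}


yankees = pages_containing("Yankees")
choke = pages_containing("Choke")

yankees_and_choke = yankees & choke      # both words on the page
yankees_or_choke = yankees | choke       # either word, or both
choke_not_yankees = choke - yankees      # Choke without Yankees

if __name__ == "__main__":
    print(sorted(yankees_and_choke))
    print(sorted(yankees_or_choke))
    print(sorted(choke_not_yankees))
```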
More Advanced Uses
of Boolean Logic
• If you are looking for a proper name, a
phrase, or another collection of words
that are normally found together, then
enclose them in double quotes, e.g.
"President Gerald Ford".
• If one or more words must all be on the
web page, then use the logical And, e.g.
President And Ford And "United States".
• If the web page may have different forms
of the name, titles, etc., then use the
logical Or, e.g. (President Or "Vice President"
Or Representative) And "Gerald Ford".
• If the document should exclude a word or
phrase, then use the logical Not, e.g.
"Gerald Ford" And Not "Ford automotive" And Not
"Ford car" And Not "Ford truck".
Other Helpful Hints
• While not Boolean logic, some search engines
also allow operators such as NEAR and FOLLOWED BY
to indicate how words or phrases relate to
other words and phrases. Typically the relation is
which word comes first, or whether a word is within
a certain number of words of another word. This
concept is called proximity logic.
• Not all search engines use the AND, OR, NOT
notation; some, like AltaVista, use "+" for AND
and "-" for NOT.
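Quoted phrases and proximity operators both depend on where words sit on the page, not just whether they are present. The sketch below is one simple way to express the three ideas; the function names and the `max_distance` parameter are made up for the illustration, punctuation is not handled, and real search engines define NEAR and FOLLOWED BY in their own ways.

```python
def positions(words, term):
    """Indexes at which term occurs in the list of words."""
    return [i for i, word in enumerate(words) if word == term]


def contains_phrase(text, phrase):
    """Quoted phrase: the words appear next to each other, in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target
               for i in range(len(words) - n + 1))


def near(text, first, second, max_distance):
    """NEAR: the two words are within max_distance words, in either order."""
    words = text.lower().split()
    return any(abs(i - j) <= max_distance
               for i in positions(words, first.lower())
               for j in positions(words, second.lower()))


def followed_by(text, first, second, max_distance):
    """FOLLOWED BY: second comes after first, within max_distance words."""
    words = text.lower().split()
    return any(0 < j - i <= max_distance
               for i in positions(words, first.lower())
               for j in positions(words, second.lower()))


if __name__ == "__main__":
    text = "Gerald Ford became President in 1974"
    print(contains_phrase(text, "Gerald Ford"))
    print(near(text, "President", "Ford", 2))
    print(followed_by(text, "President", "Ford", 3))
```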
Tips for Using Search Engines
• When searching a large-scale
database, it is important to be extremely
precise.
• Avoid vague or common words that
will only produce millions of pages.
• Read the instructions for each new search
engine you use. Search methods differ
among search engines and subject
directories.
Finding Audio and Video
• http://www.alltheweb.com/ video, audio, news
• http://images.google.com – Good source of
images
• www.dogpile.com – One of the few search
engines that provides searches for video.
• www.fazzle.com – Provides limited video and
image searching capabilities
• http://video.google.com/ -- A new beta product;
it may have bugs.
• http://www.tssphoto.com/ Stock photo images
Dogpile for finding non-text files
The number of sites that allow so-called
“anonymous or guest” FTP directories is now greatly
diminished. Due to security considerations most
sites do not have non-text directories that are
open to search and file download. Dogpile still
allows searches for images, audio, and video.
Lab Exercise
1. Tell me in 100 words or less what the Flying
Spaghetti Monster is
2. Show me a video of the Spaghetti Harvest
3. Show me an image of an atoll
4. Show me the headline image on the boston.com
website from Oct 21, 2004