Topic 4. Searching the Web
Document Management Systems
1
Introduction
• The WWW dates from the late 1980s.
• It is growing at an exponential rate.
• It holds textual information, but also multimedia.
• The Web can be viewed as an enormous unstructured database.
2
Introduction
The problem is how to find information on the Web. There are three different ways to search:
• Use search engines (they index part of the Web as documents in a text database).
• Use Web directories (they classify documents by subject).
• Search by exploiting the hyperlink structure.
3
Introduction
The main problems we face are:
• Distributed data.
• A high percentage of volatile data.
• An enormous amount of information.
• Redundant and unstructured data.
• Data quality.
• Heterogeneous data.
4
Types of Search Tools
Search Engines (& Meta-Search Engines)
Characteristics
• Full text of selected Web pages
• Search by keyword, trying to match exactly the words in the pages
• No browsing, no subject categories
• Databases compiled by "spiders" (computer-robot programs) with minimal human oversight
• Search-engine size: from small and specialized to huge (about 20 billion websites or pages)
• Meta-search engines quickly and superficially search several individual search engines at once and return results compiled into a sometimes convenient format. Caveat: they only catch about 1% of search results in any of the search engines they visit.
Examples
• Search engines: Google, Yahoo Search, Ask.com
• Meta-search engines: Dogpile, Copernic
5
Types of Search Tools
Subject Directories
Characteristics
• Human-selected sites picked by editors (sometimes experts in a subject)
• Often carefully evaluated and kept up to date, but not always -- frequently not if large and general
• Usually organized into hierarchical subject categories
• Often annotated with descriptions (not in Yahoo!)
• Can browse subject categories or search using broad, general terms
• NO full text of documents. Searches need to be less specific than in search engines, because you are not matching on the words in the pages you eventually want. In directories you are searching only the subject categories and descriptions you see in their pages.
Examples
• Librarians' Index, Infomine, Google Directory, About.com, AcademicInfo
• There are thousands more subject directories on practically every topic you can think of.
6
Types of Search Tools
Specialized Databases (The Invisible Web)
Characteristics
• The Web provides access through a search box into the contents of a database in a computer somewhere
• Can be on any topic, can be trivial, commercial, task-specific, governmental, or a rich treasure devoted to your topic
• Also includes many pages generated as search results from libraries' online catalogs, and the many copyright-protected articles in the databases of journal and magazine publishers.
Examples
• Locate specialized databases by looking for them in good Subject Directories like the Librarians' Index, Yahoo!, or AcademicInfo; in special guides to searchable databases; and sometimes by keyword searching in general search engines
7
Search Engines
How do they work?
• They do not search the Web directly.
• They use a database of Web pages.
• The databases are built by spiders or crawlers, which find pages by following the links that pages contain.
• A page that is not linked from anywhere will never be indexed.
• The spiders send the Web pages to indexing programs, which identify text, links, and so on, and store the indexed terms in the database.
• Some kinds of pages are excluded from indexing by rule (pages not found, unsuitable content, formats that cannot be processed, dynamically generated information, etc.).
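The crawl-then-index pipeline described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for the example: the toy "web" is an in-memory dictionary with made-up URLs, and the function names (crawl, build_index, search) are hypothetical, not taken from any real search engine.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("search engines index the web", ["a.example/spiders", "b.example/"]),
    "a.example/spiders": ("spiders follow links between pages", ["a.example/"]),
    "b.example/": ("directories classify pages by subject", []),
    "c.example/orphan": ("nobody links to this page", []),
}

def crawl(seed_urls, web):
    """Breadth-first spider: visits only pages reachable through links."""
    seen, queue = set(seed_urls), deque(seed_urls)
    while queue:
        url = queue.popleft()
        if url not in web:  # page not found: excluded from indexing
            continue
        text, links = web[url]
        yield url, text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)

def build_index(pages):
    """Indexer: maps each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages:
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    """Answers a query from the index alone (implicit AND between words)."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*[index.get(t, set()) for t in terms])

index = build_index(crawl(["a.example/"], TOY_WEB))
```

Note that the query never touches the toy web itself, only the index, and that the page no other page links to ("c.example/orphan") is never reached by the spider, so a search for its words returns nothing.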
8
Search Engines
Search Engine | Google (www.google.com) | Yahoo! Search (search.yahoo.com) | Ask.com (www.ask.com)
Size, type (size varies frequently and widely) | HUGE. Size not disclosed in any way that allows comparison. Probably the biggest. Biggest in tests. | HUGE. Claims over 20 billion total "web objects." | LARGE. Claims to have 2 billion fully indexed, searchable pages. Strives to become #1 in size.
Noteworthy features and limitations | Popularity ranking using PageRank™. Indexes the first 101KB of a Web page, and 120KB of PDFs. ~ before a word finds synonyms sometimes (~help > FAQ, tutorial, etc.) | Shortcuts give quick access to dictionary, synonyms, patents, traffic, stocks, encyclopedia, and more. | Subject-Specific Popularity™ ranking. Suggests broader and narrower terms.
Phrase searching | Yes. Use " ". Searches common "stop words" if in phrases in quotes. | Yes. Use " " | Yes. Use " ". Searches common "stop words" if in phrases in quotes.
Boolean logic | Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting. In Advanced Search, partial Boolean available in boxes. | Accepts AND, OR, NOT or AND NOT, and ( ). Must be capitalized. You must enclose terms joined by OR in parentheses (classic Boolean). | Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting.
+Requires / -Excludes | - excludes. + will allow you to retrieve "stop words" (e.g., +in) | - excludes. + will allow you to search common words: "+in truth" | - excludes. + will allow you to retrieve "stop words" (e.g., +in)
9
Search Engines
Search Engine | Google (www.google.com) | Yahoo! Search (search.yahoo.com) | Ask.com (www.ask.com)
Sub-Searching | Sort of. At bottom of results page, click "Search within results" and enter more terms. Adds terms. | Add terms. | Sort of. Add terms.
Results Ranking | Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version. | Automatic Fuzzy AND. | Based on Subject-Specific Popularity™, links to a page by related pages.
Field limiting | link:, site:, intitle:, inurl:. Advanced Search boxes for most of these. Offers Uncle Sam for US federal pages and other special searches. | link:, site:, intitle:, inurl:, url:, hostname: | intitle:, inurl:, site:
10
Search Engines
Search Engine | Google (www.google.com) | Yahoo! Search (search.yahoo.com) | Ask.com (www.ask.com)
Truncation, Stemming | No truncation. Stems some words. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines | Neither. Search with OR as in Google. | Neither. Search with OR as in Google.
Case sensitivity | No. | No. | No.
Language | Yes. Major Romanized and non-Romanized languages in Advanced Search. | Yes. Major Romanized and non-Romanized languages. | Yes. Major Romanized languages. Use Advanced Search to limit.
Limit by age of documents | In Advanced Search. | In Advanced Search. | In Advanced Search.
Translation | Yes, in "Translate this page" link following some pages. To and sometimes from English and major European languages and Chinese, Japanese, Korean. | Yes. | No.
11
Search Engines
Features Chart
Last updated Oct. 1, 2007.
Search Engines | Boolean | Default | Proximity | Truncation | Fields | Limits | Stop | Sorting
Google | -, OR | and | Phrase | No (stems); word in phrase | intitle, inurl, link, site, more | Language, filetype, date, domain | Few, + searches | Relevance, site
Yahoo! | AND, OR, NOT, ( ), - | and | Phrase | No; word in phrase | intitle, inurl, link, site, more | Language, file type, date, domain | No | Relevance, site
Ask | -, OR | and | Phrase | No | intitle, inurl, site | Language, site, date | Yes, + searches | Relevance, metasites
Live Search | AND, OR, NOT, ( ), - | and | Phrase | No | intitle, link, site, loc, url | Language, site | Varies, + searches | Relevance, site, sliders
Gigablast | AND, OR, AND NOT, ( ), +, - | and | Phrase | No | title, site, ip, more | Domain, type | Varies, + searches | Relevance
Exalead | AND, OR, NOT, ( ), - | and | Phrase, NEAR | Yes and stems | intitle, inurl, link, site | Language, file type, date, domain | Varies, + searches | Relevance, date
12
Search Engines (different?)
http://www.bruceclay.com/searchenginerelationshipchart.htm
13
Metasearch
Meta-Search Tool | What's Searched (as of date at bottom of page; they change often) | Complex Search Ability | Results Display
Clusty (clusty.com) | Currently searches a number of free search engines and directories, not Google or Yahoo. | Accepts and "translates" complex searches with Boolean operators and field limiting. | Results accompanied with subject subdivisions based on words in search results, giving usually the major themes (Vivisimo Clustering Engine™). Click on these to search within results on each theme.
Dogpile (www.dogpile.com) | Searches Google, Yahoo, LookSmart, AskJeeves/Teoma, Google ADS, MSN search. Sites that have purchased ranking and inclusion are blended in. Watch for "Sponsored by..." links below search results. | Accepts Boolean logic, especially in advanced search modes. | Dogpile allows you to see each search engine's results separately in a useful list for comparison. Click the search engine icons by "Best of Breed."
18
Metasearch
Meta-Search Tool | What's Searched (as of date at bottom of page; they change often) | Complex Search Ability | Results Display
SurfWax (www.surfwax.com) | A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories. | Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results. | Click on source link to view complete search results there. Click on the icon to view helpful "SiteSnap™" extracted from most sites in frame on right. Many additional features for probing within a site.
Copernic Agent (www.copernic.com) | Select from list of search engines by clicking the Properties button following Advanced Search search box. | ALL, ANY, Phrase, and more. Also Boolean searching within results under Refine (powerful!). | Must be downloaded and installed, but Basic version is free of charge.
19
Metasearch
Dogpile http://www.dogpile.com
Popular metasearch site owned by InfoSpace that sends a search to a customizable list of search engines, directories and specialty search sites, then displays results from each search engine individually.
Vivisimo http://vivisimo.com/
Enter a search term, and Vivisimo will not only pull back matching responses from major search engines but also automatically organize the pages into categories. Slick and easy to use.
Kartoo http://www.kartoo.com
If you like the idea of seeing your web results visually, this meta search site shows the results with sites being interconnected by keywords.
Mamma http://www.mamma.com
Founded in 1996, Mamma.com is one of the oldest meta search engines on the web. Mamma searches against a variety of major crawlers, directories and specialty search sites. The service also provides a paid listings option for advertisers, Mamma Classifieds.
SurfWax http://www.surfwax.com
Searches against major engines or provides those who open free accounts the ability to choose from a list of hundreds. Using the "SiteSnaps" feature, you can preview any page in the results and see where your terms appear in the document. Allows results or documents to be saved for future use.
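The metasearch idea described above (send one query to several engines, then combine what comes back) can be sketched in a few lines of Python. This is only an illustration: the engine names, the result lists, and the reciprocal-rank scoring are assumptions made for the example, not how Dogpile or any other tool listed here actually merges results.

```python
from collections import defaultdict

def merge_results(results_by_engine, top_n=10):
    """Merge ranked URL lists from several engines into one list.

    Each URL earns 1/rank points per engine (a simple reciprocal-rank
    score chosen for this sketch); duplicates are collapsed.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
            sources[url].append(engine)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return [(url, sources[url]) for url in ranked[:top_n]]

# Hypothetical result lists; a real metasearch tool would obtain them
# by sending the same query to each engine.
sample = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "x.example"],
    "engine_c": ["y.example", "w.example"],
}
merged = merge_results(sample)
```

Keeping the list of source engines next to each URL mirrors the way some of the tools above let you see which engine returned which result.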
20
Metasearch
Clusty
http://www.clusty.com
InfoGrid
http://www.infogrid.com
MetaEureka
http://www.metaeureka.com
CurryGuide
http://web.curryguide.com/
Infonetware RealTerm Search
http://www.infonetware.com
ProFusion
http://www.profusion.com
Excite
http://www.excite.com
Ixquick
http://www.ixquick.com/
Query Server
http://www.queryserver.com/web.htm
Fazzle
http://www.fazzle.com/
iZito
http://www.izito.com
Turbo10
http://turbo10.com
Gimenei
http://gimenei.com/
Jux2
http://www.jux2.com/
Search.com
http://www.search.com
IceRocket
http://www.icerocket.com/
Meceoo
http://www.meceoo.com/
Ujiko
http://www.ujiko.com/
Info.com
http://www.info.com
MetaCrawler
http://www.metacrawler.com
WebCrawler
http://www.webcrawler.com
ZapMeta
http://www.zapmeta.com
21
Directories
Subject Directories | Size, type | Phrase searching
Librarians' Index (www.lii.org) | Over 16,000. Compiled by public librarians in information supply business. Highest quality sites only. Great, reliable annotations. | Yes. Use " "
Infomine (infomine.ucr.edu) | Over 120,000. Great, reliable annotations. Cooperatively compiled by university & college-level academic librarians of the UC campuses. | Yes. Use " ". |term term| requires exact match
Academic Info (www.academicinfo.us), Recommend Browsing | Rich selection of about 25,000 pages, selected as "college and research level Internet resources" aimed "at the undergraduate level or above." Brief annotations. | No. " " make searches fail.
About.com (www.about.com) | Over 2 million. Generally good annotations done by "Guides" with various levels of expertise. | Yes. Use " "
Google Directory (directory.google.com) | About 5 million web pages, selected by the Open Directory Project and enhanced by Google searching and ranking. Often useful to find "better" results, especially on broad or widely covered topics. | Yes. Use " "
Yahoo! (dir.yahoo.com) | About 4 million. Scarce descriptions and annotations. Often useful, especially for popular and commercial topics. | Yes. Use " "
22
Directories
Subject Directories | Boolean logic | Truncation | Field searching
Librarians' Index (www.lii.org) | AND implied between words. Also accepts OR and NOT, and ( ). | Use *. Also stems. Can turn stemming off on Advanced Search page. | Advanced Search allows Boolean searching within subject, titles, description, parts of URLs, and more.
Infomine (infomine.ucr.edu) | AND implied between words. Also accepts OR, NOT, and ( ). | Use *. Also stems. Can turn stemming off. Use " " or | | to search exact terms. | Select boxes under search box to limit.
Academic Info (www.academicinfo.us), Recommend Browsing | OR implied between words. Accepts AND, OR, NOT and ( ). Recommend AND between words in most searches. | No. | No.
About.com (www.about.com) | No. | Use *. Not accepted consistently. | No.
Google Directory (directory.google.com) | OR, capitalized, as in Google's web search engine. | No. | Same as in Google's web search engine.
Yahoo! (dir.yahoo.com) | Yes, as in Yahoo! Search web search engine. | No. | As in Yahoo! Search web search engine.
23
The Invisible Web
What is it?
• The visible Web is what you see in the results of a query to a search engine or in the directories.
• The invisible Web consists of all the pages and content that search engines cannot process and catalog in their indexes. For example:
  - Dynamic information.
  - Searchable databases.
  - Pages excluded from search engines by some processing policy.
• Search engines cannot find the information offered on these pages.
• To reach information on the invisible Web, you have to go directly to the page that offers it and search there.
24
The Invisible Web
How do you search the invisible Web?
• Keep the concept of "databases" in mind and stay alert to any information the search engines and directories may offer.
• Such pages can turn up at any point while browsing or running queries.
• To find invisible-Web pages, you can use search engines and add the term "database" to the query. Example: plane crash database
• Besides planning a good search with a suitable strategy in a search engine or directory, take time to explore the databases you find on the topics of your information need.
25
The Invisible Web
When dealing with the Deep Web, keep these points in mind:
• Information that is likely to be stored in a database is a part of the deep Web.
• Information that is new and dynamically changing in content will appear on the deep Web.
• Web sites of searchable databases can be retrieved via directories and search engines.
• Many search engine sites and commercial portals feature searchable databases as part of their package of services.
• Some search engines will search the deep Web for related content subsequent to an initial search.
• Topical coverage on the deep Web is extremely varied.
• Some of the information stored on Web-accessible databases may not be substantive or useful to most searchers.
26
The Invisible Web
The Invisible Web: Databases not accessible to ordinary search engines.
Librarians' Internet Index (lii.org): Lots of categorized databases.
Complete Planet (www.completeplanet.com): Hundreds of databases by category.
All Academic (www.allacademic.com): Journals & other free academic content.
Invisible-Web.net (www.invisible-web.net): Companion site to Invisible Web book.
Findarticles.com (www.findarticles.com): Search hundreds of journals.
Magportal (www.magportal.com): Full text magazine articles.
Infomine (infomine.ucr.edu): Scholarly Internet Resource Collections.
Online Books Page (onlinebooks.library.upenn.edu): Full text of more than 18,000 books.
27
Some statistics
Millions Of Textual Documents Indexed
30
Some statistics
Billions Of Textual Documents Indexed
December 1995-September 2003
Search Engine Size, November 2004
Search Engine | Reported Size | Page Depth
Google | 8.1 billion | 101K
MSN | 5.0 billion | 150K
Yahoo | 4.2 billion (estimate) | 500K
Ask Jeeves | 2.5 billion | 101K+
31
Some statistics
How many searches are performed each day? The figures below show searches within the United States in March 2006, based on comScore figures.
Searches | Per Day (Millions) | Per Month (Millions)
Google | 91 | 2,733
Yahoo | 60 | 1,792
MSN | 28 | 845
AOL | 16 | 486
Ask | 13 | 378
Others | 6 | 166
Total | 213 | 6,400
41
How others search the Web
45
How to search the Web
Strategies
Step #1. Analyze your topic to decide where to begin
Does your topic...
• have distinctive words or phrases?
  methernitha, unique meaning
  "affirmative action", specific, accepted meaning in word cluster
• have NO distinctive words or phrases you can think of? You have only common or general terms that get the "wrong" pages.
  "order out of chaos", used in too many contexts to be useful
  sundiata, retrieves a myth, a rock group, a person, etc.
• seek an overview of a broad topic?
  victorian literature, alternative energy sources
• specify a narrow aspect of a broad or common topic?
  automobile recyclability, want current research, future designs, not how to recycle or oil recycling or other community efforts
• have synonymous, equivalent terms, or variant spellings or endings that need to be included?
  echinoderm OR echinoidea OR "sea urchin", any may be in useful pages
  "cold fusion energy" OR "hydrogen energy", some use one term, some the other; you want both, although not precisely equivalent
  millennium OR millennial OR millenium OR millenial OR "year 2000", etc. Pages you want may contain any or all.
• make you feel confused? Don't really know much about the topic yet? Need guidance?
46
How to search the Web
Strategies
Step #2. Pick the right starting place using this table:
YOUR TOPIC'S FEATURES | Search Engines | Subject Directories
Distinctive word or phrase? | Enclose phrases in " ". Test run your word or phrase in Google. | Search the broader concept, what your term is "about."
NO distinctive words or phrases? | Use more than one term or phrase in " " to get fewer results. | Try to find distinctive terms in Subject Directories.
Seek an overview? | NOT RECOMMENDED | Look for a specialized Subject Directory focused on your topic.
Narrow aspect of broad or common topic? | Boolean searching as in Yahoo! Search. | Look for a Directory focused on the broad subject.
Synonyms, equivalent terms, variants | Choose search engines with Boolean OR, or Truncation, or Field limiting. | NOT RECOMMENDED
Confused? Need more information? | NOT RECOMMENDED | Look for a Gateway Page (Subject Guide). Try an encyclopedia. Ask at a library reference desk.
The remaining columns apply to every kind of topic:
Specialized Databases ("Invisible Web"): Want data? Facts? Statistics? All of something? One of many like things? Schedules? Maps? Look for a specialized database on the Invisible Web. Hard to predict what you might find.
Find an Expert: Look for a specialized subject directory on your topic. E-mail the author of a good page you find. Ask a discussion group or blog. Never hurts to seek help.
LUCK: Always on your side. Keep your mind open. Learn as you search.
47
How to search the Web
Strategies
Step #3. Learn as you go & VARY your approach with what you learn.
Don't assume you know what you want to find. Look at search results and see
what you might use in addition to what you've thought of.
Step #4. Don't bog down in any strategy that doesn't work.
Switch from search engines to directories and back. Find specialized
directories on your topic. Think about possible databases and look
for them.
Step #5. Return to previous strategies better informed.
48
How to search the Web
Strategies
Search Strategies We Do NOT Recommend
Because of their inefficiency and often haphazard and frustrating results, we do not recommend either of
the following two approaches to finding Web documents:
• Browsing searchable directories. If you can find a search box, search a directory. BROWSING is
sometimes fun but rarely as efficient. The term "directories" refers here to any collection of web
resources organized into subject categories or some other breakdown appropriate to the content
(Subject Directories or directories of specialized databases). Browsing locates documents by your
trying to match your topic in first the top, broadest layer of a subject hierarchy, then by choosing
narrower sub-subject-categories in the hierarchy that you hope will lead to your target. Browsing
encounters the difficulty of guessing under which subject category your topic is classified. The
taxonomy in every directory differs, making browsing inconsistent from one search tool to another.
The category "health" may contain documents on medicine, homeopathy, psychiatry, and fitness in
one directory. In another "medicine" may include health, mental health, and alternative medicine,
but not the term psychiatry and may classify fitness only under "lifestyle." Searching (typing
keywords in a search box) retrieves occurrences of your words no matter where they may be
classified by subject. Use broad terms in searching any directory.
• Following links to sites recommended by heavy use or commercial interest. Often in
search engine results, you will see links to sites that are selected based on how often they are
visited by others, or based on fees paid to the browser. Or you may see recommended "cool"
sites. Use these with caution! Others may visit sites for reasons having no relation to your
information interests, and the best sites for you may still be largely undiscovered by the vast public
searching the Web. Taste varies and should vary. Make your own evaluations.
49
How to search the Web
Strategies
Features of your search inquiry, each followed by the matching search tool features worth learning

Are you looking for a proper name or a distinct phrase?
• The name of an organization or society or movement
• A proper name of an individual
• A distinctive string of words generally associated with your topic
Can you think of an organization, proper name, or phrase to search for? It might help zoom in on the pages you want.
Matching feature: PHRASE SEARCHING is a feature you want in every search tool you choose.
Requires your terms all to appear in exactly the order you enter them.
Enclose the phrase in double quotations " "
Examples:
"affirmative action"
"world health organization"
"a person's name"
In some search engines, capitalizing initial letters will cause the terms to be searched as a phrase:
World Health Organization

Are some of your terms common words with many meanings and contexts?
• Children in conjunction with television and also violence
• Censorship as an aspect of ethics in journalism
Matching feature: BOOLEAN AND will help:
children AND television AND violence
journalism AND ethics AND censorship
Google, AllTheWeb, and most other search engines put AND in between words automatically (by default):
children television violence
journalism ethics censorship

Do you anticipate lots of search results with terms you do not want?
• Your search for biomedical engineering and cancer brings you lots of academic programs, and you want research reports. So you try to exclude documents containing Department of or School of
Matching feature: BOOLEAN AND NOT will help:
"biomedical engineering" AND cancer AND NOT "Department of" AND NOT "School of"
or its -EXCLUDES near equivalent:
"biomedical engineering" cancer -"Department of" -"School of"
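To see why AND narrows a search and AND NOT removes unwanted pages, the operators above can be modeled as set operations over the documents that contain each term. The sketch below is a toy illustration in Python: the four documents are invented, the function names are hypothetical, and matching is done by plain substring search, which is much cruder than what a real engine does.

```python
# Hypothetical mini-collection: document id -> text.
DOCS = {
    1: "children television violence study",
    2: "television schedule for children",
    3: "biomedical engineering cancer research report",
    4: "department of biomedical engineering cancer program",
}

def postings(term):
    """Set of document ids whose text contains the term or phrase."""
    return {doc_id for doc_id, text in DOCS.items() if term.lower() in text}

def boolean_and(*terms):
    """Documents containing every term (what engines assume by default)."""
    return set.intersection(*(postings(t) for t in terms))

def and_not(include, exclude):
    """Documents matching all `include` terms and none of the `exclude` terms."""
    unwanted = set().union(*(postings(t) for t in exclude))
    return boolean_and(*include) - unwanted
```

For example, and_not(["biomedical engineering", "cancer"], ["Department of"]) keeps the research report (document 3) and drops the departmental page (document 4), which is the effect the -"Department of" exclusion is meant to have.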
50
How to search the Web
Strategies
Features of your search inquiry, each followed by the matching search tool features worth learning

Are there synonyms, spelling variations, or foreign spellings for some of your terms?
• women, females with networking
• Sarajevo, Sarayevo with peace
• literature, litterature with French, francaise
Matching feature: BOOLEAN OR will help:
(women OR females) AND networking
(Sarajevo OR Sarayevo) AND peace
(literature OR litterature) AND (French OR francaise)
In Google, capitalize OR (no need to type "and"):
peace sarajevo OR sarayevo
literature OR litterature french OR francaise
In AllTheWeb, use parentheses and omit the OR:
peace (sarajevo sarayevo)
(literature litterature) (french francaise)

Are you looking for home pages and/or other documents primarily about your term(s)?
• The home page of the American Dietetic Association
• Pages primarily about Affirmative Action
Matching feature: LIMIT TO TITLE FIELD IN DOCUMENTS
intitle:"American Dietetic Association"
intitle:"affirmative action"
In Google, use intitle:"affirmative action"

Are you looking for terms with many possible endings?
• Feminism, feminist, feminine
• Children, child
Matching feature: Some systems search word ending variants automatically (stemming). See the specific instructions for each of the recommended search tools.
To be sure, use OR searches:
children OR child
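The query patterns above (implied AND, capitalized OR, minus-sign exclusion, quoted phrases, intitle:) can be assembled mechanically. The following Python helper is a hypothetical sketch that produces strings in the Google-style syntax shown in these slides; the function and parameter names are invented for the example, and other engines use different syntax, as the tables in this topic show.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they match as a phrase."""
    return f'"{term}"' if " " in term else term

def build_query(all_of=(), any_of=(), none_of=(), title=None):
    """Assemble a query string in the Google-style syntax shown above.

    all_of  : terms that must all appear (AND is implied by a space)
    any_of  : groups of alternatives, each joined with capitalized OR
    none_of : terms to exclude with a leading minus sign
    title   : phrase that must appear in the page title (intitle:)
    """
    parts = [quote(t) for t in all_of]
    parts += [" OR ".join(quote(t) for t in group) for group in any_of]
    parts += ["-" + quote(t) for t in none_of]
    if title:
        parts.append(f'intitle:"{title}"')
    return " ".join(parts)
```

Calling build_query(all_of=["peace"], any_of=[["sarajevo", "sarayevo"]]) gives peace sarajevo OR sarayevo, the same form as the Google example above.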
51
How to search the Web
Commands
Command
How
Supported By
Must Include Term
+
All
Must Exclude Term
-
All
Must Include Phrase
""
All
Match All Terms
Automatic at
All
Via Advanced Search
AllTheWeb, AltaVista, Google,
Lycos, MSN Search, Teoma, Yahoo
(HotBot offers but failed to work when tested)
OR
AltaVista, AOL Search, Ask Jeeves,
Google, HotBot, MSN Search, Teoma, Yahoo
(must be done in ALL CAPS)
AllTheWeb, Lycos
(only works for two words)
Match Any Terms
52
How to search the Web
Commands
Command | How | Supported By
Title Search (Updated March 11, 2003) | title: | AltaVista, AllTheWeb, Inktomi
Title Search | intitle: | Google, Teoma
Title Search | allintitle: | Google
Site Search | host: | AltaVista
Site Search | site: | Excite, Google (Netscape, Yahoo)
Site Search | url.host: | AllTheWeb, Lycos (for AllTheWeb results only)
Site Search | domain: | Inktomi (HotBot, iWon, LookSmart)
Site Search | none | AOL, Direct Hit, HotBot, LookSmart, Lycos, MSN, Netscape, Northern Light, Open Directory, Yahoo
53
How to search the Web
Commands
Command | How | Supported By
URL Search | url: | AltaVista, Excite, Northern Light
URL Search | url.all: | AllTheWeb, Lycos (for AllTheWeb results only)
URL Search | allinurl:, inurl: | Google
URL Search | originurl: | Inktomi (AOL, GoTo, HotBot)
URL Search | u: | Yahoo
URL Search | none | AOL, Direct Hit, HotBot, LookSmart, MSN. Not yet updated, but may be still correct: Open Directory
Link Search | link: | AltaVista, Google, Northern Light
Link Search | linkdomain: | Inktomi (AOL, HotBot, iWon, MSN) (NOTE: measures links to entire domains)
Link Search | link.all: | AllTheWeb, Lycos (for AllTheWeb results only)
Link Search | none | AOL, Direct Hit, Excite, HotBot, LookSmart, Northern Light. Not yet updated, but may be still correct: Netscape, Yahoo (n/a)
54
How to search the Web
Commands
Command | How | Supported By
Wildcard | * | AltaVista, Inktomi (iWon), Northern Light. Not yet updated, but may be still correct: Yahoo
Wildcard | ? | AOL Search, Inktomi (iWon)
Wildcard | % | Northern Light
Wildcard | none | AllTheWeb, Direct Hit, Excite, Google, HotBot, LookSmart, Lycos, MSN (MSN's help says it offers wildcard, but it failed to during testing)
Anchor Search | anchor: | AltaVista
Anchor Search | none | AllTheWeb, AOL Search, Direct Hit, Excite, Google, Inktomi, HotBot, Lycos
55
How to search the Web
Aids
Feature
Offered By
Related Searches
AltaVista, AllTheWeb, Excite,
HotBot, Lycos, MSN, Yahoo
Not yet updated, but may be still
correct:
iWon
Clustering
AltaVista, AllTheWeb, Excite, Google,
HotBot, MSN, Northern Light
Find Similar
AltaVista, AOL Search, Google
Stemming
AOL Search, Direct Hit, HotBot, Inktomi
(HotBot, MSN)
Search Within
AltaVista, Google, HotBot, Lycos
Spidered Version
Google
Search By Language
AltaVista, AllTheWeb, Excite, Google,
HotBot, Lycos, MSN, Northern Light
Page Translation
AltaVista, Google, Lycos
Porn Filter
AltaVista, AllTheWeb, Google
Porn Warning
HotBot, MSN, Northern Light
56
How to search the Web
Aids
Feature
Supported By
Number Of Listings Shown
(10 unless noted)
AltaVista, AllTheWeb, AOL Search (5), Direct Hit, Excite, Google,
HotBot, LookSmart (15), Lycos, MSN (15), Northern Light
Not yet updated, but may be still correct:
iWon, Netscape, Yahoo (20)
Ability To Increase Number Of Listings?
AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
Not yet updated, but may be still correct: Yahoo
See 20 Results
AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
Not yet updated, but may be still correct: Yahoo
See 50 Results
AltaVista, AllTheWeb, Excite, Google, HotBot, MSN
Not yet updated, but may be still correct: Yahoo
See 100 Results
AllTheWeb, Google, HotBot,
Not yet updated, but may be still correct: Yahoo
Sort By Date
MSN Search, Northern Light
Date Range
AltaVista, Google, HotBot, MSN, Northern Light
Not yet updated, but may be still correct: iWon, Yahoo
Date Displayed?
AltaVista, HotBot (for Inktomi results), Northern Light
Display Titles Only?
AltaVista, Excite, HotBot (URLs only option), MSN
Other Major Customize Options
AltaVista, AllTheWeb, Google
57
How to search the Web
Operators
Command | How | Supported By
Or | OR | AltaVista, AOL Search, Excite, Google, Inktomi (HotBot, MSN), Lycos, Northern Light
Or | None | AllTheWeb, Direct Hit, LookSmart. Not yet updated, but may be still correct: Yahoo
And | AND | AltaVista, AOL Search, Excite, Inktomi (HotBot, MSN), Lycos, Northern Light
And | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Not | NOT | AOL Search, Excite, Inktomi (HotBot), Lycos, Northern Light
Not | AND NOT | AltaVista, Inktomi (MSN). Not yet updated, but may be still correct: Netscape
Not | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Nesting | () | AltaVista, AOL Search, Excite, Inktomi (MSN), Northern Light
Nesting | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot), LookSmart, Lycos. Not yet updated, but may be still correct: Yahoo
Near | NEAR | AltaVista (10 words), AOL Search (specify number), Lycos (25 words)
Near | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot, MSN), LookSmart
Notes
• At AltaVista, Boolean only works on the advanced search page.
• At Excite, Google & MSN, Boolean commands must be in UPPERCASE.
• At Inktomi-powered services, set menu to "Boolean".
58
An example: Google
• Google = [googol] = 10^100
• Goal at its creation (1997): to improve on the existing search engines in terms of search quality.
  - E.g., of the 4 main search engines of the time, only 1 could find itself.
  - It aims for very high precision at the expense of recall.
  - It uses both the text and the link structure as an improvement over other systems.
59
An example: Google
Features:
• It uses the link structure to compute the ranking of each page, through a measure called PageRank.
• It uses links to improve search results. The information in a link is recorded both for the page that contains it and for the page it points to (in some cases, the link text describes the target page better than the page's own content).
• It keeps information on the location of terms. This allows proximity searches and lets proximity be used when computing relevance.
• It keeps information on the typography and visual presentation of characters (bold, quotation marks, ...) to determine the importance of a term.
• It stores all the pages it analyzes in compressed form (HTML content only).
60
An example: Google
PageRank
• An objective measure of the importance of a page, based on the number of references to it from other pages.
• It takes into account:
  • The number of references to the page.
  • The quality of the pages that reference it.
  • The total number of references contained in each page that references it.
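The three factors above can be sketched as a short iterative computation. The slides do not give a formula; this sketch uses the one from the original Google paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that link to A, where C(T) is the number of links leaving T. The damping factor d = 0.85, the iteration count and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    `links` maps each page to the list of pages it links to
    (assumed to contain no duplicate targets).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    out_count = {p: len(links.get(p, [])) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Pages that reference `page` (factor 1: how many there are).
            incoming = [src for src in pages if page in links.get(src, [])]
            # Factor 2: the rank of each referrer; factor 3: its link count.
            new_rank[page] = (1 - d) + d * sum(
                rank[src] / out_count[src] for src in incoming
            )
        rank = new_rank
    return rank

# Hypothetical graph: C is referenced by A, B and D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
```

In this graph C ends up with the highest value because it has the most referrers, while D, which nobody links to, keeps only the base value 1 - d.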
61
An example: Google
Considerations:
• The web is not a controlled collection.
• Improving search need not be limited to improving the query (a user can ask whatever they want, however they want).
• There is no control over what people put on the web.
• Commercial companies exploit the way search engines work in order to manipulate them and obtain high rankings.
62
An example: Google
Architecture
63
An example: Google
How it works
• The URLServer sends URLs to the crawlers.
• The pages fetched are sent to the StoreServer, which stores them (compressed) in the Repository.
• The Indexer reads the repository, decompresses the documents and parses them. It converts each document into a set of word occurrences called hits. A hit records the word, its position in the document, the font size and the capitalization. The Indexer distributes the hits into the barrels, creating the partially sorted forward index. It also stores information about the links found in the pages.
• The URLResolver converts relative addresses into absolute ones and generates the document identifiers. It builds the links database used to compute PageRank.
• The Sorter re-sorts the information in the barrels by word identifier instead of by document identifier. This generates the inverted file.
• The Searcher is in charge of answering queries.
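As an illustration of the URLResolver step, here is a minimal sketch that turns relative links into absolute URLs, assigns document identifiers and records a links database of (source, target) pairs. The class layout, the sequential integer IDs and the example URLs are assumptions made for this sketch; only `urllib.parse.urljoin` is a real library call.

```python
from urllib.parse import urljoin

class URLResolver:
    """Hypothetical sketch: resolve relative links and assign docIDs."""

    def __init__(self):
        self.doc_ids = {}   # absolute URL -> docID
        self.links = []     # (source docID, target docID) pairs

    def doc_id(self, url):
        # Assign the next integer ID the first time a URL is seen.
        return self.doc_ids.setdefault(url, len(self.doc_ids))

    def add_links(self, page_url, hrefs):
        """Record every link found on `page_url`."""
        src = self.doc_id(page_url)
        for href in hrefs:
            absolute = urljoin(page_url, href)  # relative -> absolute
            self.links.append((src, self.doc_id(absolute)))

resolver = URLResolver()
resolver.add_links(
    "http://example.com/a/index.html",
    ["b.html", "../c.html", "http://other.example/"],
)
```

The resulting `links` list is the kind of input a PageRank computation needs: one entry per link, expressed in document identifiers rather than URLs.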
64
An example: Google
Data structures:
• BigFiles. Virtual files.
• Repository. Compressed documents.
• Document indexes.
• Lexicon. Complete list of words.
• Hit lists.
• Forward index. Partially sorted (barrels).
• Inverted index. Fully sorted (barrels).
65
An example: Google
The indexing process:
• Parsing
  • Many problems arise from syntax errors and content types.
• Indexing documents into the barrels
  • Parsing produces documents that are encoded into the barrels.
• Sorting
  • The inverted index is generated by sorting by word identifier.
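The three stages above (parsing, indexing into a forward index, sorting into an inverted index) can be sketched in a few lines. This is a simplified illustration, not Google's encoding: a hit here is just (position, capitalization flag), font size is omitted because the sample input is plain text, barrels are collapsed into a single dictionary, and the function names are hypothetical.

```python
from collections import defaultdict

def parse_hits(text):
    """Parsing: turn a document into (word, position, is_capitalized) hits."""
    return [(w.lower(), pos, w[:1].isupper()) for pos, w in enumerate(text.split())]

def build_forward_index(docs, lexicon):
    """Indexing: docID -> [(wordID, position, is_capitalized)], in document order."""
    forward = {}
    for doc_id, text in docs.items():
        entries = []
        for word, pos, cap in parse_hits(text):
            # The lexicon assigns a wordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            entries.append((word_id, pos, cap))
        forward[doc_id] = entries
    return forward

def sort_to_inverted(forward):
    """Sorting: regroup by wordID -> [(docID, position, is_capitalized)]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, pos, cap in forward[doc_id]:
            inverted[word_id].append((doc_id, pos, cap))
    return dict(sorted(inverted.items()))

docs = {0: "Google ranks pages", 1: "pages link to pages"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = sort_to_inverted(forward)
```

The forward index answers "which words are in this document?", while the inverted index answers "which documents contain this word?", which is the question a query needs answered.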
66
An example: Google
The search process
• Parse the query.
• Convert the words into identifiers.
• Find the start of the document list for each word.
• Find the documents that contain all the words.
• Compute the ranking of each document.
• Sort and show the top k documents.
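The six steps above can be sketched as follows. The lexicon, the inverted index (here reduced to hit counts per document), the PageRank values and the scoring formula (hit count multiplied by PageRank) are all invented for illustration; the slides do not specify how the ranking is computed.

```python
import heapq

# Hypothetical data: word -> wordID, wordID -> {docID: number of hits}.
LEXICON = {"web": 0, "search": 1, "engines": 2}
INVERTED = {
    0: {10: 2, 11: 1, 12: 1},
    1: {10: 1, 12: 3},
    2: {12: 1, 13: 1},
}
PAGERANK = {10: 0.9, 11: 0.4, 12: 1.5, 13: 0.2}

def search(query, k=10):
    # 1. Parse the query.
    words = query.lower().split()
    # 2. Convert words into wordIDs; an unknown word matches nothing.
    try:
        word_ids = [LEXICON[w] for w in words]
    except KeyError:
        return []
    if not word_ids:
        return []
    # 3. Locate the document list of each word.
    doclists = [INVERTED[w] for w in word_ids]
    # 4. Keep only the documents that contain every word.
    matches = set(doclists[0]).intersection(*doclists[1:])

    # 5. Rank each document (illustrative formula, see lead-in).
    def score(doc_id):
        hits = sum(doclist[doc_id] for doclist in doclists)
        return hits * PAGERANK[doc_id]

    # 6. Sort and return the top k documents.
    return heapq.nlargest(k, matches, key=score)
```

For example, `search("web search")` keeps only documents 10 and 12, because document 11 lacks "search", and places 12 first because of its higher hit count and PageRank.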
67
An example: Google
Some statistics (1997)

Storage Statistics
  Total Size of Fetched Pages                    147.8 GB
  Compressed Repository                          53.5 GB
  Short Inverted Index                           4.1 GB
  Full Inverted Index                            37.2 GB
  Lexicon                                        293 MB
  Temporary Anchor Data (not in total)           6.6 GB
  Document Index Incl. Variable Width Data       9.7 GB
  Links Database                                 3.9 GB
  Total Without Repository                       55.2 GB
  Total With Repository                          108.7 GB

Web Page Statistics
  Number of Web Pages Fetched                    24 million
  Number of Urls Seen                            76.5 million
  Number of Email Addresses                      1.7 million
  Number of 404's                                1.6 million
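The totals in the storage table can be checked with a little arithmetic, which also yields the compression ratio of the repository. The figures are copied from the table; treating 293 MB as 0.293 GB (decimal units) is an assumption.

```python
# Sizes in GB, copied from the storage statistics table.
fetched_pages = 147.8
repository = 53.5
components = {
    "short inverted index": 4.1,
    "full inverted index": 37.2,
    "lexicon": 0.293,        # 293 MB, assuming 1 GB = 1000 MB
    "document index": 9.7,
    "links database": 3.9,
}
# Temporary anchor data (6.6 GB) is excluded, as the table notes.

total_without_repository = sum(components.values())
total_with_repository = total_without_repository + repository
compression_ratio = fetched_pages / repository
fetched_fraction = 24 / 76.5   # pages fetched vs. URLs seen, in millions
```

The sums come to about 55.2 GB and 108.7 GB, matching the table. The repository is compressed by a factor of roughly 2.8, and just under a third of the URLs seen had actually been fetched.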
68
An example: Google
69
An example: Google
70
Evaluating pages
• Search engines retrieve information but (for now) give no data about the quality of the pages found.
• In some cases the ranking of the results of a query tries to take page quality into account (PageRank in Google), but there are no objective criteria for assessing it.
• The pages found need to be evaluated objectively. This requires:
  • Using techniques to identify the characteristics of the pages and the information needed.
  • Thinking critically about the contents, and asking a series of questions to decide on their quality.
71
Evaluating pages
1. What can the URL tell you?

Question to ask: Is it somebody's personal page?
• Read the URL* carefully:
  • Look for a personal name (e.g., jbarker or barker) following a tilde ( ~ ), a percent sign ( % ), or the words "users," "members," or "people."
  • Is the server a commercial ISP* or other provider mostly of web page hosting (like aol.com or geocities.com)?
What are the implications?
Personal pages are not necessarily "bad," but you need to investigate the author very carefully. For personal pages, there is no publisher or domain owner vouching for the information in the page.

Question to ask: What type of domain does it come from? (educational, nonprofit, commercial, government, etc.)
• Is the domain appropriate for the content?
  • Government sites: look for .gov, .mil, .us, or other country code
  • Educational sites: look for .edu
  • Nonprofit organizations: look for .org
• If from a foreign country, look at the country code and read the page to be sure who published it.
What are the implications?
Look for appropriateness, fit. What kind of information source do you think is most reliable for your topic?

Question to ask: Is it published by an entity that makes sense? Who "published" the page?
• In general, the publisher is the agency or person operating the "server" computer from which the document is issued.
  • The server is usually named in the first portion of the URL (between http:// and the first /)
• Have you heard of this entity before?
• Does it correspond to the name of the site? Should it?
What are the implications?
You can rely more on information that is published by the source:
• Look for New York Times news from www.nytimes.com
• Look for health information from any of the agencies of the National Institutes of Health on sites with nih somewhere in the domain name.
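The URL checks described in this section can be partly automated. The sketch below extracts the publisher host (the part between http:// and the first /), classifies the top-level domain, and flags the personal-page markers mentioned above. The function name `inspect_url`, the lookup tables and the example URLs are assumptions for illustration, and the result is only a hint: it does not replace reading the page.

```python
from urllib.parse import urlparse

# Markers and domain types taken from the checklist above.
PERSONAL_MARKERS = ("users", "members", "people")
DOMAIN_TYPES = {
    "gov": "government", "mil": "government", "us": "government",
    "edu": "educational", "org": "nonprofit", "com": "commercial",
}

def inspect_url(url):
    """Return simple hints about who published a URL."""
    parts = urlparse(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    # A path segment starting with ~ or %, or named users/members/people,
    # suggests a personal page.
    personal = any(
        s.startswith(("~", "%")) or s.lower() in PERSONAL_MARKERS
        for s in segments
    )
    tld = host.rsplit(".", 1)[-1]
    return {
        "publisher_host": host,
        "domain_type": DOMAIN_TYPES.get(tld, "other (check country code)"),
        "looks_personal": personal,
    }

personal_page = inspect_url("http://www.example.edu/~jbarker/notes.html")
news_page = inspect_url("http://www.nytimes.com/news/index.html")
```

Here `personal_page` is flagged as personal on an educational host, while `news_page` is a commercial host with no personal-page markers.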
72
Evaluating pages
2. Scan the perimeter of the page

Question to ask: Who wrote the page?
• Look for the name of the author, or the name of the organization, institution, agency, or whatever who is responsible for the page
  • An e-mail contact is not enough
• If there is no personal author, look for an agency or organization that claims responsibility for the page.
  • If you cannot find this, locate the publisher by truncating back the URL (see technique above). Does this publisher claim responsibility for the content? Does it explain why the page exists in any way?
What are the implications?
Web pages are all created with a purpose in mind by some person or agency or entity. They do not simply "grow" on the web like mildew grows in moist corners. You are looking for someone who claims accountability and responsibility for the content. An e-mail address with no additional information about the author is not sufficient for assessing the author's credentials. If this is all you have, try e-mailing the author and asking politely for more information about him/her.

Question to ask: Is the page dated? Is it current enough?
• Is it "stale" or "dusty" information on a time-sensitive or evolving topic?
• CAUTION: Undated factual or statistical information is no better than anonymous information. Don't use it.
What are the implications?
How recent the date needs to be depends on your needs. For some topics you want current information. For others, you want information put on the web near the time it became known. In some cases, the importance of the date is to tell you whether the page author is still maintaining an interest in the page, or has abandoned it.

Question to ask: What are the author's credentials on this subject?
• Does the purported background or education look like someone who is qualified to write on this topic?
• Might the page be by a hobbyist, self-proclaimed expert, or enthusiast?
  • Is the page merely an opinion? Is there any reason you should believe its content more than any other page?
  • Is the page a rant, an extreme view, possibly distorted or exaggerated?
• If you cannot find strong, relevant credentials, look very closely at documentation of sources (next section).
What are the implications?
Anyone can put anything on the web for pennies in just a few minutes. Your task is to distinguish between the reliable and questionable. Many web pages are opinion pieces offered in a vast public forum. You should hold the author to the same degree of credentials, authority, and documentation that you would expect from something published in a reputable print resource (book, journal article, good newspaper).
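"Truncating back the URL" means removing one path segment at a time, from the right, until only the server name is left, and looking at each shorter address for a publisher. A minimal sketch of generating those candidate addresses follows; the function name `truncate_back` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def truncate_back(url):
    """Return the URL followed by each shorter version, ending at the server root."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    segments = [s for s in path.split("/") if s]
    candidates = [url]
    while segments:
        segments.pop()  # drop the last path segment
        shorter = "/" + "/".join(segments) + ("/" if segments else "")
        candidates.append(urlunsplit((scheme, netloc, shorter, "", "")))
    return candidates

steps = truncate_back("http://www.example.edu/~jbarker/papers/essay.html")
```

Visiting the addresses in `steps` in order moves from the document, to its folder, to the author's directory, and finally to the home page of the server that publishes it.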
73
Evaluating pages
3. Look for indicators of quality information

Question to ask: Are sources documented with footnotes or links?
• Where did the author get the information?
  • As in published scholarly/academic journals and books, you should expect documentation.
• If there are links to other pages as sources, are they to reliable sources?
• Do the links work?
What are the implications?
In scholarly/research work, the credibility of most writings is proven through footnote documentation or other means of revealing the sources of information. Saying what you believe without documentation is not much better than just expressing an opinion or a point of view. What credibility does your research need? An exception can be journalism from highly reputable newspapers. But these are not scholarly. Check with your instructor before using this type of material. Links that don't work or are to other weak or fringe pages do not help strengthen the credibility of your research.

Question to ask: If reproduced information (from another source), is it complete, not altered, not fake or forged?
• Is it retyped? If so, it could easily be altered.
• Is it reproduced from another publication?
  • Are permissions to reproduce and copyright information provided?
  • Is there a reason there are not links to the original source if it is online (instead of reproducing it)?
What are the implications?
You may have to find the original to be sure a copy of something is not altered and is complete. Look at the URL: is it from the original source? If you find a legitimate article from a reputable journal or other publication, it should be accompanied by the copyright statement and/or permission to reprint. If it is not, be suspicious. Try to find the source. If the URL of the document is not to the original source, it is likely that it is illegally reproduced, and the text could be altered, even with the copyright information present.

Question to ask: Are there links to other resources on the topic?
• Are the links well chosen, well organized, and/or evaluated/annotated?
• Do the links work?
• Do the links represent other viewpoints?
• Do the links (or absence of other viewpoints) indicate a bias?
What are the implications?
Many well developed pages offer links to other pages on the same topic that they consider worthwhile. They are inviting you to compare their information with other pages. Links that offer opposing viewpoints as well as their own are more likely to be balanced and unbiased than pages that offer only one view. Anything not said that could be said? And perhaps would be said if all points of view were represented? Always look for bias. Especially when you agree with something, check for bias.
74
Evaluating pages
4. What do others say?

Question to ask: Who links to the page?
• Are there many links?
• What kinds of sites link to it?
• What do they say?
• Are any of them directories? Try looking at what directories say.
What are the implications?
Sometimes a page is linked to only by other parts of its own site (not much of a recommendation). Sometimes a page is linked to by its fan club, and by detractors. Read both points of view. If a page or its site is in a bona fide directory, think about whether there is much critical evaluation of the links in the directory.

Question to ask: Is the page listed in one or more reputable directories or pages?
What are the implications?
Good directories include a tiny fraction of the web, and inclusion in a directory is therefore noteworthy. But read what the directory says! It may not be 100% positive.

Question to ask: What do others say about the author or responsible authoring body?
What are the implications?
"Googling someone" (new term for this) can be revealing. Be sure to consider the source. If the viewpoint is radical or controversial, expect to find detractors. Think critically about all points of view.
75
Evaluating pages
5. Does it all add up?

Question to ask: Why was the page put on the web?
• Inform, give facts, give data?
• Explain, persuade?
• Sell, entice?
• Share?
• Disclose?
So what? What are the implications?
These are some of the reasons to think of. The web is a public place, open to all. You need to be aware of the entire range of human possibilities of intentions behind web pages.

Question to ask: Might it be ironic? Satire or parody?
• Think about the "tone" of the page.
• Humorous? Parody? Exaggerated? Overblown arguments?
• Outrageous photographs or juxtaposition of unlikely images?
• Arguing a viewpoint with examples that suggest that what is argued is ultimately not possible.
So what? What are the implications?
It is easy to be fooled, and this can make you look foolish in turn.

Question to ask: Is this as good as resources I could find if I used the library, or some of the web-based indexes available through the library, or other print resources?
• Are you being completely fair? Too harsh? Totally objective? Requiring the same degree of "proof" you would from a print publication?
• Is the site good for some things and not for others?
• Are your hopes biasing your interpretation?
So what? What are the implications?
What is your requirement (or your instructor's requirement) for the quality of reliability of your information? In general, published information is considered more reliable than what is on the web. But many, many reputable agencies and publishers make great stuff available by "publishing" it on the web. This applies to most governments, most institutions and societies, many publishing houses and news sources. But take the time to check it out.
76
Evaluating search engines
Index building
• How is the index compiled?
• Size: number of pages indexed
• Coverage (http, ftp, www, news, …)
• Are there special inclusion criteria?
• Does the spider have access to password-protected sites?
• Where does the engine not search?
• Which elements of the pages are indexed?
• Is there vocabulary control?
• Are stopwords used?
• Update frequency
• Time to index a submitted page
• Pages indexed per day
• Dead-link checking
77
Evaluating search engines
Search capabilities
• Where does it search (what is in the index)?
• Searching several places at once
• Handling of stopwords
• Range of search functions
• Search refinement
• Advanced options
• Use of fields
• Use of Boolean logic (yes/no, easy/hard, …)
• Handling of synonyms / use of thesauri
• Can the search be saved?
78
Evaluating search engines
Quality of the results
• Response time
• Number of results
• Quality of the hit-list summary (host, reason, link, ranking, …)
• Detail given about the relevance criterion used
• Elimination of duplicates
• Handling of results (display, sorting, export, find-similar, …)
• Saving the search results
• Methodological analysis (precision, recall, relevance, coverage, reliability, usefulness, novelty, …)
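Of the measures in the methodological analysis, precision and recall are the easiest to compute once the relevant documents for a query are known. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved. The sketch below assumes documents are represented by hashable identifiers, and the sample lists are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 5 documents actually relevant.
precision, recall = precision_recall(
    retrieved=[1, 2, 3, 4],
    relevant=[2, 4, 5, 6, 7],
)
```

Here two of the four results are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4). On the web the full set of relevant pages is rarely known, so recall can usually only be estimated.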
79
Evaluating search engines
Usability
• Interface (clarity, simplicity, …)
• Readability (font size, text layout, paragraph arrangement, …)
• Ease of use (navigation)
• Online help
• Query construction process
• Customization options
• Saving preferences
• Load and response times
80