Internet and WWW for finding information


1
Internet and WWW
for finding information
[email protected]
Vrije Universiteit Brussel,
Pleinlaan 2, B-1050 Brussel.
February 2004
VUB-IDLO
2
The slides are available from
http://www.vub.ac.be/BIBLIO/nieuwenhuysen/courses/
(note: BIBLIO and not biblio)
3
Plan for the day:
morning
• About “information”
• The information market
• Information retrieval
• Thesauri
(+ practising query formulation)
• Networks, and the Internet in particular
• World-Wide Web (+ practising “browsing” + “saving”)
• LUNCH
4
Plan for the day:
afternoon (part 1)
• Online accessible information sources!
»Global Internet directories
(+ practice)
»Internet indexes
(+ practice)
»Book databases
(+ practice)
»Fee-based databases
»Databases with titles of journal articles
»Finding illustrations/images/photos
(+ practice)
5
Plan for the day:
afternoon (part 2)
• Evaluation of information sources
• Free searching based on your own interests, with assistance
6
-Interruptions
-Questions
-Remarks
-Discussions
are welcome
7
About “information”
Information concepts
8
The flow of documentary information
with primary and secondary sources
Author /
Creator /
Sender
Primary sources / systems: mainly
Journal articles / Books /
Electronic mail / Online sources /...
Secondary sources / systems: mainly
Reference works (printed, CD-ROM, online)
Library catalogues, including OPACs...
Reader /
User /
Receiver
9
The role of secondary information
sources
• The secondary information flow is generated on the basis
of the primary flow, mainly because the great amounts of
primary information lower the chance to retrieve and use
the appropriate information item.
• Secondary information tries to bring some order in the
great chaos.
10
Various categorisations of
documentary information sources
Information sources can be categorised in various ways.
For instance:
•Books / Serials
•Primary / Secondary
•Text / Image / Sound / Animation or video / Software / Data
•Hard copy (not digital) / Digital
•Offline / Online / Interactive
11
Retrospective searching versus
current awareness: scheme
Past → Now: retrospective searching
Now → Future: current awareness
12
Information retrieval: evolution of
storage and distribution media
• 1450: printing with reusable characters/fonts
• 1975: + online access databases
(from the 1970s: growing Internet)
• 1985: + CD-ROM
• 1990: + World-Wide Web
(based on the Internet)
13
Information retrieval:
end user or information intermediaries
End-user
Information intermediary
(Broker or library or ...)
Information
14
End user versus information
intermediary
• People can retrieve information themselves, directly, as
so-called “end-users”.
• However,
»the information landscape is complex,
»it may take a lot of time to find the right information,
»it may be costly to search for information
• Therefore it may be wise to obtain the assistance of an
expert information intermediary, such as a reference
librarian or an information broker.
15
About “information”
Computer- and network-based information
16
Information: from bits
to meaningful information
Digital
computer data
= bits
01
Information = “documents”,
meaningful for and
to be interpreted by
human beings
or
Program code,
meaningful for and
to be interpreted / executed by
a suitable / compatible computer
17
Information: digitally stored and
managed information
Categories of digital, computer readable
information / data, forming electronic “documents”,
understandable by human beings.
text
numbers
images
video
+ sounds
01
multimedia
18
Information:
types of digital information
Linear text
Hypertext
Sound
Static images
Video
Multimedia / Hypermedia
Programs for computers
Digital information
01
19
Some publication media
compared
[Figure: update speed plotted against volume for online / networked, printed and CD-ROM media]
20
Scientific publishing in Utopia:
an ideal scheme
Many authors
author = reader
in science
Many editors / publishers
Online remote access multimedia database server
one global, international computer
data communication network
Many database search clients
and user interfaces
Many readers / users
21
?? Question ??
Indicate the differences
between reality
and that simplified, ideal scheme
of the information flow.
22
?? Question ??
Which basic problems/difficulties
hinder people
from finding / accessing / using information?
23
Information retrieval:
basic difficulties
(Part 1)
• In many cases it is not completely clear to the user of an
information retrieval system which information is in fact
needed, required.
• In many cases the need for information cannot be
expressed completely in the form of a query.
One of the reasons is that the complete context of the
information need should ideally be expressed, including
the knowledge and background of the searcher.
24
Information retrieval:
basic difficulties
(Part 2)
• Computer systems are artificial, but nevertheless most
use human language in their interface with the human
users, for instance in database search systems.
This may cause difficulties related to language and
vocabulary in particular. Some examples:
• People use different languages and different terms
(vocabularies) to describe a similar concept.
• Concepts, vocabularies and meanings of words and terms
may change over time.
• Meanings of words / terms may depend on their context.
25
Information retrieval:
basic difficulties
(Part 3)
• Many different and imperfect retrieval systems should or
must be used.
»To retrieve and access the information that is in principle
available, many different retrieval systems must be
available and be mastered.
»Furthermore, perfect information retrieval software does
not (yet) exist; the science and technology of information
retrieval software have been evolving fast since about
1970.
26
Information retrieval:
basic difficulties
(Part 4)
• Information overload
Users are often overwhelmed
by the amount of available information and
by the large influx of new information.
27
Information retrieval:
basic difficulties
(Part 5)
• The price (or inaccessibility) of particular information
A lot of information cannot be obtained or at least not free
of charge.
28
The information industry and the
information market
The components of the information industry
29
The components of the
information industry
• Authors
• Publishers
• Distributors
• Users
• Related organizations
30
The information industry and the
information market
Overview and evolution
31
Increase in the number of scientific
and technical serial publications
[Figure: number of serial publications (logarithmic axis, 1 to 1 000 000) against year (1650 to 2000)]
32
The information market:
growth in the database industry
[Figure: number of living databases, number of database producers and number of vendors (axis 0 to 10 000), 1975 to 1995]
Source: Williams, in: Gale Directory of Databases, 1998.
33
The information industry / market:
future trends
(Part 1)
• Growth in the production of databases.
• Less analogue / hard-copy production
= more digital production, storage, and distribution of
information.
• More integration of information types
into multimedia and hypermedia.
34
The information industry / market:
future trends
(Part 2)
• Growth in the number of
»producers and distributors,
»end-users searching databases
due to
easier use
and
lower costs of information technology
35
Databases and computerized
information retrieval
Introduction
36
What is a
database?
A database is a collection of similar data records stored in a
common file (or collection of files).
37
Types of databases:
examples
Examples: The databases that form the basis for
»catalogues of books or other types of documents
»computerized bibliographies
»address directories
»a full text newspaper, newsletter, magazine, journal
+ collections of these
»WWW and Internet search engines
»intranet search engines
»...
38
Information retrieval:
the basic processes in search systems
Information problem → Representation → Query
Text documents → Representation → Indexed documents
Query + Indexed documents → Comparison → Retrieved, sorted documents
Retrieved, sorted documents → Evaluation and feedback
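The processes in this scheme can be illustrated with a very small sketch. It is not how any particular search system works; the documents, the word-splitting "representation" and the scoring by number of matching words are all assumptions chosen for brevity.

```python
from collections import defaultdict

# Hypothetical mini-collection; ids and texts are invented for illustration.
documents = {
    1: "monitoring of seawater pollution",
    2: "pollution by effluents in coastal waters",
    3: "thesaurus construction for marine science",
}

def represent(text):
    """Representation step: reduce a text to a list of lowercase words."""
    return text.lower().split()

def build_index(docs):
    """Indexed documents: map every word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in represent(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Comparison step: score each document by the number of query words it contains."""
    scores = defaultdict(int)
    for word in represent(query):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Retrieved, sorted documents: best match first, ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(documents)
results = search(index, "seawater pollution")
```

Evaluation and feedback remain the task of the searcher: inspect `results`, adapt the query, and search again.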
39
Databases and computerized
information retrieval
Text retrieval and language
40
Text retrieval and language:
a word is not a concept (a)
Problem:
A word or phrase or term is not the same as a concept or
subject or topic.
Word
Concept
Word
41
Text retrieval and language:
a word is not a concept (a’)
So, to ‘cover’ a concept in a search,
to increase the recall of a search,
the user of a retrieval system should consider an
expansion of the query;
that is:
the user should also include other words in the query to
‘cover’ the concept.
42
Text retrieval and language:
a word is not a concept (a’’)
»synonyms!
(such as :
Latin names of species in biology besides the common
names,
scientific names besides common names of substances in
chemistry…)
43
Text retrieval and language:
a word is not a concept (a’’’)
»narrower terms, more specific terms
(such as particular brand names);
including terms with prefixes
(for instance: viruses, retroviruses, rotaviruses,...)
»spelling variations
(such as UK English versus US English);
possible variations after transliteration
44
Text retrieval and language:
a word is not a concept (a’’’’)
»singular or plural forms of a noun
(when this is used as a search term)
»(relevant) related terms
»various forms of a verb
(when this is used in the query)
»broader terms (perhaps)
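As a rough sketch of such a query expansion: the function below turns one search term into an OR-group of the term and its variants. The variant lists are a hypothetical, hand-made table; in practice they would come from a thesaurus or from the searcher's own knowledge.

```python
# Hypothetical variant lists, invented for illustration only.
VARIANTS = {
    "virus": ["viruses", "retrovirus", "retroviruses", "rotavirus", "rotaviruses"],
    "colour": ["color", "colours", "colors"],
}

def expand(term, variants=VARIANTS):
    """Return an OR-group that covers a concept with the term and its known variants."""
    words = [term] + variants.get(term, [])
    if len(words) == 1:
        return term  # nothing known to add
    return "(" + " OR ".join(words) + ")"

expanded = expand("colour")
```

Here `expanded` combines the UK and US spellings with their plural forms; a term without known variants is returned unchanged.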
45
?? Question ??
Which problems in text retrieval
are illustrated by the following sentences?
46
Examples
Time flies like an arrow.
Fruit flies like a banana.
?
48
Examples
Time flies like an arrow.
Fruit flies like a banana.
OK!
49
Text retrieval and language:
ambiguity of meaning (a)
• Problem:
A word or phrase can have more than 1 meaning.
Ambiguity of the meaning of a word is a problem for
retrieval.
This decreases the precision of many searches.
The meaning can depend on the context.
The meaning may depend on the region where the term is
used.
50
Example
Text retrieval and language:
ambiguity of meaning (a’)
• Example of a word:
»Pascal the philosopher
»Pascal the computer language
51
Example
Text retrieval and language:
ambiguity of meaning (a’’)
• Example of sentences:
»The banks of New Zealand flooded our mailboxes with
free account proposals.
»The banks of New Zealand flooded with heavy rains
account for the economic loss.
52
Text retrieval and language:
ambiguity of meaning (a’’’)
Problem:
Ambiguity of meaning
may be the cause of low precision.
Concept
Word
Concept
53
A word is not a concept
A concept is not a word
Word1
Concept1
Word2
Concept2
Word3
Concept3
A concept cannot be “covered” by only 1 word or term;
this may be the cause of low recall of a search.
The meaning of many words is ambiguous;
this may be the cause of low precision of a search.
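Recall and precision can be made concrete with the usual definitions: recall is the fraction of all relevant documents that the search retrieved, precision is the fraction of the retrieved documents that are relevant. The sketch below uses invented document numbers.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Hypothetical search: 4 documents are relevant, the search returns 3 documents.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}

r = recall(retrieved, relevant)        # 2 of the 4 relevant documents were found
p = precision(retrieved, relevant)     # 2 of the 3 retrieved documents are relevant
```

A query that misses synonyms lowers `r`; an ambiguous search term (document 5 here) lowers `p`.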
54
Databases and computerized
information retrieval
Hints on how to use information sources
55
Hints on how to use information
sources: overview
(Part 1)
• Know the purpose and motivation for each search.
• Do not be lazy: search on your own, before bothering
experts with requests for advice.
• Plan your search in advance.
• Choose the best source(s) for each search.
• Use the available tools for subject searching well.
• Try to cope with the language problems;
avoid spelling errors in your search query;
use spelling variations in your search query
56
Hints on how to use information
sources: overview
(Part 2)
• Match your search strategy with the type of source.
• Work cost-effectively.
• Use special care when searching for names.
• Be specific.
Avoid broad searches.
Limit your search to a specific country or region if
required.
• Work iteratively.
• Keep a record of your work.
57
Hints on how to use information
sources: overview
(Part 3)
• Do not only focus on a single source.
• Consider citation indexes besides subject-oriented
databases, as useful secondary information sources.
• Stop searching when “enough is enough”
• Give up if necessary... (Not all questions have an answer.)
• Be critical: not all information is correct or useful.
58
Hints on how to use information
sources: overview
(Part 4)
• In computer-based retrieval systems, consider applying
»truncation of search terms (using a symbol like * or ?)
»combine search terms, using
—Boolean operators:
OR
AND / +
NOT / AND NOT / -
—proximity operators
(for instance “NEAR”)
—phrase searching (“word1 word2”)
»searching limited to a field (for instance URL, title…)
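Truncation can be imitated with a small sketch. It assumes that * stands for any run of characters and ? for exactly one character; real retrieval systems differ in the symbols they accept and in their exact meaning, so check the help pages of each system. The word list is invented.

```python
from fnmatch import fnmatchcase

def truncation_matches(pattern, words):
    """Return the words that match a truncated search term (case-insensitive)."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# Hypothetical index vocabulary.
vocabulary = ["pollutant", "pollution", "polluted", "pollen", "woman", "women"]

right_truncated = truncation_matches("pollut*", vocabulary)
single_wildcard = truncation_matches("wom?n", vocabulary)
```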
59
Hints on how to use information
sources: subject searching
• When you search for information on a particular
topic/subject: investigate if the database producer offers
»a subject classification scheme and/or
»a list of controlled/approved/accepted subject terms, and/or
»a subject thesaurus
• Exploit these, if they are available.
• In most cases you should find and use
synonyms and narrower terms
• Use broader and/or related terms, if appropriate.
60
Hints on how to use information
sources: Boolean combinations
Most text search systems understand the basic
Boolean operators:
OR
= obtain records that contain one or both
search terms
AND
= obtain records that contain both search
terms
NOT
= exclude records that contain a search term
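The three operators behave like set operations on the records that contain each search term. The sketch below assumes a hypothetical table of record numbers per term.

```python
# Hypothetical postings: search term -> set of record numbers that contain it.
postings = {
    "seawater": {1, 2, 5},
    "pollution": {2, 3, 5, 7},
    "oil": {5, 7},
}

either = postings["seawater"] | postings["pollution"]   # seawater OR pollution
both = postings["seawater"] & postings["pollution"]     # seawater AND pollution
without_oil = both - postings["oil"]                    # ... NOT oil
```

OR widens the result set, AND narrows it, and NOT removes records from it.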
61
Hints on how to use information
sources: Boolean combinations
In the case of computer-based information sources, use
Boolean combinations of search terms when appropriate
and when possible.
(term x1 OR term x2 OR term x3)
AND (term y1 OR term y2 OR term y3)
AND (term z1 OR term z2 OR term z3)
AND ...
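This pattern (OR within a concept, AND between concepts) can be written down mechanically. The helper below is a sketch only; the name `build_query` and the generic output format are assumptions, and each real search system has its own syntax.

```python
def build_query(facets):
    """Join the terms of each facet with OR, then join the facets with AND."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in facets if terms]
    return " AND ".join(groups)

# One inner list per concept/facet; the terms are placeholders.
facets = [
    ["x1", "x2", "x3"],
    ["y1", "y2", "y3"],
    ["z1", "z2", "z3"],
]
query = build_query(facets)
```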
62
Hints on how to use information
sources: Boolean queries
Most text search systems understand the basic Boolean
operators typed in capital characters:
OR
AND
63
Hints on how to use information
sources: default Boolean operator
• Find out if there is a default, implicit Boolean operator
in the search system that you use.
• Such an operator is applied whenever no operator is typed
explicitly between words.
• This can be OR, AND, NEAR...
64
?? Question ??
How many (and which) concepts/facets
do you see in a search for
“general reviews
about
monitoring seawater pollution
that is due to effluents in Tanzania”?
65
!! Task - Assignment !!
Prepare off-line, on paper, a suitable search query
in a generic format, to find
“general reviews
about
monitoring seawater pollution that is due to effluents”
as the basis for later, concrete searches in databases.
(Limit yourself to 1 of the concepts.)
66
?? Question ??
What did you learn
from the exercise
on the formulation of a query?
67
Hints on how to use information
sources: work iteratively
Work iteratively =
search, investigate your results, refine your search, search
again, and so on;
do not try to find everything in 1 step, with 1 search.
Query
Feedback
Results
Searching
68
Hints on how to use information
sources: work iteratively: example
When you search a database with subject keywords from a
controlled list, added to each record:
1. Search with search terms that you know
2. Investigate the results and select good, relevant items
3. Look for the keywords added to these items
4. Select the good, relevant keywords
5. Formulate a new search with these keywords added
6. Execute the new search
7. Repeat the procedure
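The procedure can be sketched in a few lines. The records, their controlled keywords and the simple substring matching are all invented for illustration; a real database offers this through its own search interface.

```python
from collections import Counter

# Hypothetical records, each with controlled keywords added by an indexer.
records = [
    {"id": 1, "title": "Oil spills at sea", "keywords": {"marine pollution", "oil spills"}},
    {"id": 2, "title": "Sewage and coastal waters", "keywords": {"marine pollution", "sewage"}},
    {"id": 3, "title": "Effluent discharge in estuaries", "keywords": {"sewage", "estuaries"}},
]

def search(terms):
    """Return ids of records whose title or keywords contain any of the terms."""
    hits = []
    for rec in records:
        text = rec["title"].lower() + " " + " ".join(sorted(rec["keywords"]))
        if any(t.lower() in text for t in terms):
            hits.append(rec["id"])
    return hits

def keywords_of(ids):
    """Count the controlled keywords attached to the selected records."""
    counts = Counter()
    for rec in records:
        if rec["id"] in ids:
            counts.update(rec["keywords"])
    return counts

first = search(["oil"])                        # step 1: a term that you know
suggested = keywords_of(first)                 # steps 2-3: keywords of the good items
second = search(["oil", "marine pollution"])   # steps 4-6: add a keyword, search again
```

The second search finds a relevant record that the first one missed, because the added keyword describes the subject rather than the wording of the title.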
69
“The ability to ask the right question
is more than half the battle of finding the answer.”
Thomas J. Watson
?
70
Hints on how to use information
sources: when to stop searching?
Develop a feel for the “curve of diminishing returns”:
If you spend too much time, effort, and/or money
with too few benefits, you should stop.
[Figure: payoff plotted against time / effort / money; the point where the curve flattens is marked “Time to stop?”]
71
Knowledge organisation:
classifications, and thesaurus systems
Introduction
72
Knowledge organisation:
introduction
• To organise knowledge / documents / books / reports /
information / data / records / things / items / materials
for more efficient storage and retrieval, some related,
similar tools / systems / methods / approaches are used.
• Often but not yet always, this process is assisted by a
computer system.
• Good systems are expanded and updated when the need
arises.
• The organization system applied should ideally be clearly
and immediately visible or even searchable on computer,
by the user of the materials.
73
Knowledge organisation:
classifications, and thesaurus systems
Classifications
74
Examples
Classification systems:
examples of universal systems
• Universal means here: covering all subjects
• Not just one but several competing systems exist.
Examples
»Universal Decimal Classification = UDC
used mainly outside U.S.A.
»Dewey Decimal Classification = DDC
used mainly in U.S.A.
»Library of Congress Classification
used mainly in U.S.A.
»...
75
Knowledge organisation:
classifications, and thesaurus systems
Thesaurus systems
76
Thesaurus:
description
• Thesaurus (contents) =
»system to control a vocabulary
(= words and phrases + their relations)
»+ the contents of this vocabulary
• Thesaurus program =
program to create, manage, modify and/or search a
thesaurus using a computer
77
Thesaurus
relations
BT (= Broader Term): term(s) with broader meaning
NT (= Narrower Term): term(s) with narrower meaning
RT (= Related Term): other term(s)
UF (= Use(d) For): synonym(s)
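A thesaurus entry with these relations can be stored as a simple data structure. The sketch below is an assumption about one possible layout, not the format of any thesaurus program; the entry itself is invented.

```python
# Hypothetical thesaurus entry; the related terms are invented for illustration.
thesaurus = {
    "viruses": {
        "BT": ["microorganisms"],
        "NT": ["retroviruses", "rotaviruses"],
        "RT": ["viral diseases"],
        "UF": ["virus"],
    },
}

def search_terms(term, relations=("UF", "NT")):
    """Collect the term plus the terms reached through the chosen relations."""
    entry = thesaurus.get(term, {})
    terms = [term]
    for rel in relations:
        terms.extend(entry.get(rel, []))
    return terms

narrow = search_terms("viruses")
broad = search_terms("viruses", relations=("BT",))
```

By default only synonyms and narrower terms are added; broader and related terms are requested explicitly, because they widen the search.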
78
!! Task - Assignment - Exercise !!
Try to find suitable search terms
to retrieve documents on “pollution”
from a database on marine science,
by using for instance the thesaurus
included in the program for word processing
that you use.
79
Knowledge organisation:
classifications, and thesaurus systems
Classification systems
versus
thesaurus systems
80
Knowledge organization:
classifications versus thesauri
• Classification
»Good for placement of documents in a library (because
documents on many related subjects can be kept together)
»Not well suited for computer searching (too complicated)
• Thesaurus
»Not suited for placement of documents in a library
(because documents with related subjects would NOT be
kept together)
» Well suited for computer searching
(relatively simple alphabetic listing of keywords)
81
Computer networks,
data communication and Internet
Introduction
82
Computer networks:
summary
The following gives an overview of computer networks and
data communication:
»The basic principles
»Local area networks
»National computer networks
»International computer networks
»The Internet
»Future impact of digital communication networks
83
Computer networks:
prerequisites
Before using computer networks, you should ideally have
some knowledge and skills related to
• computer hardware
• computer software
84
Data communication:
a definition
• Interpersonal communication
»Telecommunication
—Broadcast
—Telephone
—Data communication
–Remote login
–File transfer
–Hypertext transfer
–Electronic mail
85
Data communication:
which types of ‘data’?
Linear text
Hypertext
Sound
Static images
Video
Multimedia / Hypermedia
Programs for computers
Digital information
01
86
Data communication:
which types of ‘data’?
• The same types of data (information) that can be stored
and managed on a computer can be transferred over
computer networks to one or several other computers.
• So the networks form an important extension of the
stand-alone computers.
• “The network is the computer”
87
Data communication:
applications (Part 1)
• Hard-copy transfer (Fax)
• Online use of the processing power of a remote computer
• Online access to information sources !
»library catalogues,
»bookshop catalogues,
»publisher’s catalogues,
»campus-wide and community information systems,
»(text or multimedia) databases,
»network-based journals, ...
88
Data communication:
applications (Part 2)
• Software-downloading
• Electronic mail from a person to one or several persons
• Computer-network based interest groups
• Online talking / chatting (IRC,...)
• Video conferencing (CU-SeeMe, ...)
• Selling, shopping, buying, ...
• ...
89
Data communication:
modems
• description: MODulator-DEModulator: device to convert
digital data signals into a suitable form for transmission
along a telecommunications channel, and to convert them
back upon receipt into machine readable form.
• types
»(Acoustic coupler)
»Free standing box
»Board/card to plug into a
microcomputer
90
Computer network protocols:
definition
• When 2 computer systems communicate via a network,
they do that by exchanging messages.
• The structure of network messages varies from network
to network.
• Thus the message structure in a particular network is
agreed upon a priori and is described in a set of rules,
each defined in a protocol.
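To make the idea of an agreed message structure concrete, here is a sketch of a tiny, invented protocol: every message starts with a 1-byte message type and a 2-byte payload length, followed by the payload. This layout is an assumption for illustration and is not one of the real Internet protocols.

```python
import struct

# Hypothetical message layout (not a real network protocol):
# 1 byte message type, 2 bytes payload length (network byte order), then the payload.
HEADER = struct.Struct("!BH")

def encode(msg_type, payload):
    """Sender side: put the agreed header in front of the payload."""
    return HEADER.pack(msg_type, len(payload)) + payload

def decode(message):
    """Receiver side: read the header, then take exactly 'length' payload bytes."""
    msg_type, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    return msg_type, payload

wire = encode(1, b"hello")
received = decode(wire)
```

Both sides can only understand each other because they apply the same rules, which is exactly what a protocol fixes in advance.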
91
Computer networks,
data communication and Internet
Local Area Networks
92
Data communication with a
server in a Local Area Network
• (Terminal)
• Microcomputer with
serial line
communications software /
terminal emulation software
• Microcomputer with
network card and
network software
Network
server
93
Examples
LAN software packages for
heterogeneous networks: examples
Based on TCP/IP (protocol suite used in Internet)
• For DOS:
NCSA (= National Center for Supercomputing Applications)
CUTCP, PC/NFS,...
• For Windows 3.x:
PC/NFS, PC/TCP, Trumpet TCP Manager,...
• For Windows 95, 98,...: included!
• For Windows NT, 2000,...: included!
94
Computer networks,
data communication and Internet
National Wide Area Networks
95
National
Wide Area Networks
• Public access national packet switching networks
• Research computer networks
• Public access made available by
Internet Service Providers
• ...
96
Examples
National research computer networks:
examples
• Belgium:
BELNET
• Finland:
FUNET
• Germany:
DFN
• The Netherlands:
Surfnet
• United Kingdom:
JANET (Joint Academic Network)
• ...
97
Computer networks,
data communication and Internet
International computer networks
98
Examples
International computer networks:
examples
• National public data communication networks linked together
• FidoNet
• Bitnet / EARN
• Usenet
• Internet!
• ...
99
Computer networks,
data communication and Internet
The Internet data communication network
100
?? Question ??
What is the Internet?
101
The Internet
data communications network (Part 1)
• “Internet” is not well-defined.
• A network of smaller networks:
The global collection of interconnected local area,
regional and wide-area (national backbone) networks
which use the TCP/IP suite of data communication
protocols.
102
The Internet
data communications network (Part 2)
• Links computers of various types.
• Is constantly growing.
• The analogy of a superhighway has been used to describe
the emerging system of networked computers.
• The Internet has no owner, and is not managed by one
organization.
103
The Internet:
access from your Local Area Network
Your microcomputer
Local Area Network (LAN)
One of the national networks
The global Internet
104
Host computers in the Internet:
definition
• A host (computer) is a machine on the network, identified by
a domain name that has an IP address record associated with it.
• Could be any computer connected to the Internet by any
means.
• For instance:
www.vub.ac.be
105
Transmission Control Protocol /
Internet Protocol (TCP/IP)
• the main suite of transport protocols used on the Internet
for connectivity and transmission of data across
heterogeneous systems
• “glue that holds the Internet together”
• an open standard
• available on most Unix systems, VMS and other
minicomputer systems, many mainframe and
supercomputing systems and some microcomputer and
PC systems
106
Internet: addresses of computers
with the Domain Name System
• Internet style = Domain name system
• The Internet naming scheme consists of a hierarchical
sequence of names from the most specific to the most
general (left to right), separated by dots.
computer.subdomain.domain.(country if not USA)
OR
n1.n2.n3.n4
where each n is a natural number (8-bit, so 0 to 255)
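Both address styles can be taken apart with a short sketch. The helper names are assumptions; `192.0.2.1` is an address reserved for documentation, not the address of a real host.

```python
import ipaddress

def domain_labels(name):
    """Split a domain name into its parts, from most specific to most general."""
    return name.rstrip(".").split(".")

def is_dotted_quad(text):
    """True if text consists of four dot-separated numbers that each fit in 8 bits."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ipaddress.AddressValueError:
        return False

labels = domain_labels("www.vub.ac.be")
```

In `labels` the computer name comes first and the country code last, which matches the left-to-right order described above.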
107
Internet: growth in number of hosts
worldwide: linear plot
[Figure: number of Internet hosts worldwide (linear axis, 0 to 20 000 000), January of each year, 1993 to 1998]
108
Internet Service Provider
= ISP
Internet Service Providers give their clients access to
the Internet + in many cases
»an email address / server
»space for a web site
»software tools to start
»training
»technical support
»an accessible location for a WWW site of the client
»assistance with WWW site design and promotion
109
Microcomputer -- external computer:
some ways of data communication
Microcomputer
Modem
Voice telecommunication network
Telephone
ISDN
TelePAD
LAN
Public data
comm. network
Local
PAD
Gateway computer system
Private/academic data comm.
network (e.g. Internet)
Leased, fixed communication line
Intern
Extern
External
computer
110
Online communication:
remote login and file transfer
Remote terminal log-in / access
111
Remote terminal log-in / access:
definition
The ability to access a computer from outside a building in
which it is housed.
This requires communications hardware, software, and
actual physical links,
although this can be as simple as common carrier
(telephone) lines or as complex as telnet login to another
computer across the Internet.
112
Online communication:
remote login and file transfer
Telnet in the Internet
113
Telnet:
description
• The Internet standard protocol for remote terminal
connection service; on top of the TCP/IP protocol suite
• Allows a user at one site to interact with a remote
timesharing system at another site as if the user's
terminal was connected directly to the remote computer
• Includes VT100 terminal emulation
114
Online communication:
remote login and file transfer
Downloading and file transfer
115
Data communication:
downloading by copying a fragment
Capturing a small fragment of the information displayed:
1. select information on the display,
2. copy, and
3. paste in a document managed by another program.
116
Online communication:
remote login and file transfer
File transfer
ftp in the Internet
117
Data communication:
file transfer
• Copying + downloading / transfer of a whole file
• Requires a transfer protocol with error correction
118
World-Wide Web = WWW
Introduction
119
The World-Wide Web:
prerequisites
Before using the WWW you should ideally already have
learned to understand and to use
• computer hardware
• computer software
• the Internet
• older methods for online communication, such as telnet
120
Example
The WWW:
example of a welcome page
121
URL =
Uniform Resource Locator
(also called Universal Resource Locator in early documents)
• = draft standard for specifying an object on the Internet
• the structure is in most cases
protocol://computer_address[/path_name/file_name]
• examples:
»telnet://biblio.vub.ac.be
»ftp://ftp.vub.ac.be/
»gopher://gopher.vub.ac.be/
»http://www.vub.ac.be/BIBLIO/index.html
»news://news.server.edu/comp.infosystems.www
122
URL
format / structure
1. The first part of a URL, before the colon “:”, specifies the
access method = protocol
2. The second part of the URL, after the colon “:”, is
interpreted specific to the access method.
In general, two slashes after the colon
indicate a machine /computer name.
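The two parts of a URL can be separated with Python's standard `urllib.parse` module. The sketch below applies it to some of the example URLs from these slides; the helper name `describe` is an assumption.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return the access method (protocol), the computer name and the path of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

web = describe("http://www.vub.ac.be/BIBLIO/index.html")
ftp = describe("ftp://ftp.vub.ac.be/")
telnet = describe("telnet://biblio.vub.ac.be")
```

The part before the colon becomes the protocol; the name after the two slashes becomes the computer address; the rest, if any, is the path.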
123
?? Question ??
What is the difference between
Internet and the World-Wide Web?
124
The WWW is an application of
Internet
• The World-Wide Web (WWW) is a service, an application
of Internet.
• It is based on the Internet infrastructure.
• So the WWW is newer than the Internet.
The concept of the WWW was created at the end of the
1980s when the Internet was already well established.
125
The WWW is an application of
Internet: scheme
Data communication
Internet
WWW
126
The WWW:
the essential elements
• Information delivery and access using
hypertext/hypermedia documents/objects
»html documents
»http protocol:
http clients
http servers
• Integration of protocols in the Internet:
»http servers offering html documents including links to
other http servers, telnet servers, ftp servers, nntp servers,
gopher servers...
127
The WWW:
hyperlinks
Hyperlinks can link a part of a hypermedia document to
• another part of the same document file
• another document file on the same server computer
• another document file on a server computer located
elsewhere in the world
Computer 1
Computer 2
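The three kinds of link targets can be shown by resolving link addresses against the address of the page that contains them. The link targets below (`#contact`, `courses/list.html`, `www.example.org`) are hypothetical.

```python
from urllib.parse import urljoin

# Address of the page that contains the links.
base = "http://www.vub.ac.be/BIBLIO/index.html"

same_document = urljoin(base, "#contact")                         # other part of the same file
same_server = urljoin(base, "courses/list.html")                  # other file, same server
other_server = urljoin(base, "http://www.example.org/info.html")  # file on another server
```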
128
The WWW:
hypertext mark-up language = HTML
• Hypertext mark-up language = HTML =
the system of codes used by authors to build the
hypertext-pages/files in WWW, for instance to create a
title or an anchor.
• The codes are invisible / transparent for the user / reader.
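What these codes look like, and how a program reads them, can be sketched with Python's standard `html.parser` module. The small page and the class name `TitleAndLinks` are invented for illustration.

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the title of a page and the targets of its anchors (links)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Hypothetical page source: the codes between < and > are not shown to the reader.
page = """<html><head><title>Library</title></head>
<body><a href="catalogue.html">Catalogue</a> <a href="http://www.vub.ac.be/">VUB</a></body></html>"""

parser = TitleAndLinks()
parser.feed(page)
```

A browser does something similar: it interprets the codes and displays only the title, the text and the clickable links.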
129
The WWW:
hypertext transfer protocol = HTTP
• Hypertext transfer protocol = HTTP =
the software conventions used by client and server
programs for WWW to request and transfer hypermedia
documents.
• The user / reader does not need to know the protocol
= the protocol is invisible / transparent for the user.
• Analogous with the telnet, ftp and gopher protocol.
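The conventions themselves are plain text. The sketch below composes a minimal request as a client program might send it, and splits the first line of a server reply; it does not open any network connection, and the helper names are assumptions.

```python
def build_request(host, path="/"):
    """Compose the text of a minimal HTTP/1.0 GET request."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"

def parse_status_line(line):
    """Split a status line such as 'HTTP/1.0 200 OK' into version, code and reason."""
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason

request = build_request("www.vub.ac.be", "/BIBLIO/index.html")
status = parse_status_line("HTTP/1.0 200 OK\r\n")
```

The browser writes and reads such lines on behalf of the user, which is why the protocol stays out of sight.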
130
?? Question ??
Briefly compare
TCP/IP and HTTP.
131
The WWW:
pages and forms
• Pages
Many documents developed for WWW are kept small and
are named “pages”.
These often refer to several other “pages”.
• Forms = gateways to services and databases on server
computers in WWW
Some pages contain electronic forms, to be filled in by the
user.
132
The WWW
applications
Analogous to gopher applications:
• Access to online public access catalogues
• Campus-wide information systems
• Access to subject-oriented information
• Access to computer file archives
• Traveling / navigating through the Internet
via linked html-pages
• Access to intranets within institutes / companies
133
World-Wide Web = WWW
WWW client programs
134
WWW:
client / browser programs
• To access the WWW, you run a browser program.
• The browser reads documents, and can fetch documents
from other sources. Information providers set up
hypermedia servers which browsers can get documents
from.
• The browser can display hypertext documents.
Hypertext is text with pointers to other text. The browsers
let you deal with the pointers in a transparent way:
select the pointer, and you are presented with the text that
is pointed to.
135
WWW: examples of
browsers for your own computer
Browsers are available for many computer platforms;
in particular:
browsers for Windows + Winsock:
»Netscape
»Microsoft Internet Explorer
»...
136
?? Question ??
Which client program
do YOU use or will YOU use
to access the WWW?
137
!! Task - Assignment - Exercise !!
Browse the WWW,
using an available
browser client program.
138
!! Task - Assignment - Exercise !!
Visualise the HTML source code
of a WWW page,
using a WWW client program.
What do you learn from this exercise
about the basic properties of HTML?
139
!! Task - Assignment - Exercise !!
Exploit the possibility
to open more than one window,
using a WWW client program
in Windows.
140
?? Question ??
Why would you want
to open more than one window
on WWW servers,
using a WWW client program?
141
World-Wide Web = WWW
Saving information from the web
142
WWW: How to save information
from the web?
Information displayed by your web browser/client program
can be saved,
• by select, copy, paste in another document (and save)
• by saving a complete page to your disk
»in separate files
(for instance 1 HTML file + some image files)
»in 1 file, using Microsoft Internet Explorer 5 or a later
version
• by copying the information into an e-mail message that
you send to your own e-mail account
143
!! Task - Assignment - Exercise !!
Copy some text fragment from WWW
and paste it into another document
on your computer.
144
!! Task - Assignment - Exercise !!
Save a text from WWW
to disk, as HTML,
using a browser program.
145
!! Task - Assignment - Exercise !!
Display an HTML file
that you have saved
from the WWW to your disk,
in a program for word processing.
Is the file displayed properly?
146
World-Wide Web = WWW
The success of WWW
147
WWW: growing number of
WWW servers
[Figure: number of WWW servers (axis 0 to 7 000 000), 1993 to 2000]
148
WWW as popular method to access
information from computers
• The WWW has quickly become the most popular medium
to access information that resides on various computers
that are connected to a computer network.
149
Online access information
sources and services
Introduction
150
Online information sources:
summary
• The following gives a general overview of online
accessible information sources.
• This overview is not limited to or focusing on a particular
concrete subject domain/area.
151
Online access to information:
avoid network traffic jams
To access online information sources in the US from Europe,
work when the lines are not saturated
(the morning is better than the afternoon).
152
Internet based information sources:
problems / difficulties (Part 1)
• Redundancy and overlap:
On the one hand, there is too much information on some
topics; in other words, the redundancy and overlap are high in
many cases.
• Too few information sources:
On the other hand, there are too few information sources on
some topics.
153
Internet based information sources:
problems / difficulties (Part 2)
• No order is imposed on most sources.
Quality checks / quality controls are not performed.
Related to this: it is not required to register new information
offered.
Is the information that you find real, honest, authentic?
154
Internet based information sources:
problems / difficulties (Part 3)
• Change is the only constant:
Information sources are constantly changing and growing, and
sometimes disappearing.
155
Internet based information sources:
problems / difficulties (Part 4)
• Scattering:
There is no single simple but powerful system to find
relevant information through the Internet.
In other words:
integration / aggregation is still far from perfect.
156
Internet based information sources:
problems / difficulties (Part 5)
• Slow:
The Internet is in many places and for many applications not
yet fast enough.
157
Internet based information sources:
problems / difficulties (Part 6)
• In conclusion:
Surfing, using the
Internet, the WWW,
can be a time sink instead
of a productive activity.
158
Internet based information sources:
how many? how much information?
• More than 10 million WWW sites
(in 2003)
• More than 2000 million (= 2 billion) unique URLs in the
total Internet (in 2002)
• More than 10 terabytes (= 10 000 gigabytes) of text data
(in 2001)
159
Online access information
sources and services
Types of online access information systems
160
Types of online access information
systems: “free” versus “fee”
Public access information sources
free of charge
Fee-based online information services
(NOT free of charge)
161
Online access information
sources and services
Dictionaries and encyclopaedias
accessible through the WWW
162
Dictionaries and encyclopedias
through the WWW: introduction
• Dictionaries and encyclopedias are the first choice among
many types of information sources,
»when we do not need detailed information on a common
topic
»when we want to prepare a more detailed search on an
unfamiliar topic, by searching for the right spelling,
synonyms, context,…
• Some dictionaries and encyclopedias are available
through the WWW free of charge.
163
Example
Dictionaries accessible through
Internet and the WWW: example
• The American Heritage® Dictionary of the English
Language
»Over 200,000 entries,
70,000 audio word pronunciations,
900 full-page color illustrations
»Available free of charge from
http://education.yahoo.com/reference/dictionary/
Example
Dictionaries accessible through
Internet and the WWW: compilation
• A compilation/collection of dictionaries can be searched
simultaneously and free of charge:
http://www.onelook.com/
164
Example
Encyclopedias accessible through
Internet and the WWW: examples
• Encarta Concise Free Encyclopedia
»http://encarta.msn.com/
»Available in English and in some other languages
165
Example
Encyclopedias accessible through
Internet and the WWW: examples
• Encyclopædia Britannica
only a small part is available free of charge
+ links to selected WWW sites
»http://www.britannica.com/
• Encyclopædia Britannica Concise
»http://education.yahoo.com/reference/encyclopedia/
166
Example
Encyclopedias accessible through
Internet and the WWW: examples
• The Canadian Encyclopedia
(in English and in French):
»http://thecanadianencyclopedia.com/
167
Example
Encyclopedias accessible through
Internet and the WWW: overviews
• A list / overview of encyclopedias on the Internet:
http://www.internetoracle.com/encyclop.htm
• Other lists of encyclopedias on the Internet
can be found as a part of more general directories of
Internet-based information sources.
168
169
Online access information
sources and services
Internet directories and indexes
170
Internet: meta-information about
Internet information sources
• in printed manuals and guides:
- it is not always possible to get a copy fast
- it costs money to get a copy
- they are soon out of date
• offered on the WWW!:
+ directly available when we want to use the Internet
+ many systems are accessible free of charge
+ most systems are regularly updated
• (“intelligent agent” software on client PC)
171
Internet: subject-oriented meta-information
offered via WWW
Information about information sources: in the form of
»subject guides = texts with references
»subject hypertext directories = subject guides
»key word indexes, generated automatically, for searching
»collections of links or forms to the above
»(multi-threaded search systems)
172
Internet global subject directories:
introduction
• They are virtual libraries with open shelves, for browsing.
• They are compiled manually, by many people.
• They can be browsed following a tree structure or a more
complicated variation.
• The most famous of these systems belong to the most
popular and most visited sites on the WWW: e.g. Yahoo!
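Browsing along a tree structure can be sketched as follows in Python. The categories and the example.org addresses are invented; a real directory holds far more categories and may offer cross-links between them.

```python
# Hypothetical miniature directory: categories contain subcategories,
# and the lowest level contains links to selected sites.
DIRECTORY = {
    "Science": {
        "Biology": ["http://example.org/botany", "http://example.org/zoology"],
        "Physics": ["http://example.org/optics"],
    },
    "Arts": {
        "Music": ["http://example.org/jazz"],
    },
}


def browse(tree, path):
    """Follow a list of category names down the tree and return what is found there."""
    node = tree
    for category in path:
        node = node[category]  # raises KeyError if the category does not exist
    return node


def list_choices(node):
    """Show what the user can choose at this level: subcategories or site links."""
    return sorted(node) if isinstance(node, dict) else list(node)


top_level = list_choices(browse(DIRECTORY, []))
science = list_choices(browse(DIRECTORY, ["Science"]))
biology = list_choices(browse(DIRECTORY, ["Science", "Biology"]))
print(top_level)
print(science)
print(biology)
```

Each step narrows the subject without the user having to formulate a query.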
173
Internet global subject directories:
structure
The structure corresponds to a classification that is in most
cases specific for the particular overview.
In other words: the well-known and classical universal
classification systems are not used in most Internet
directories.
174
Internet global subject directories:
pros and cons
• They cover a small number of selected WWW sites,
in comparison with the total number of sites that are
accessible.

+ The selected, included sites should be better than average.
- They are not suitable for deep, detailed, specific searches
with a high coverage.
175
Internet global subject directories:
why use one?
• They are suitable mainly for broad searches that can be
difficult to formulate in words,
but NOT for more specific searches that require
combinations of several concepts.
176
Internet global subject directories:
searching directories with a query
• Many of the Internet directories include an index to
search their contents with a query.
• However, then the assisting classification structure is not
well exploited and the user should be aware of the
problems and difficulties of information retrieval with
natural language queries.
• Furthermore, the possibility to use the system in this way
may be confusing, as these directories are not real full-text
Internet indexes, like those provided by other search tools.
177
Internet global subject directories:
Yahoo!
• A hypertext global subject directory can be found at
http://www.yahoo.com/
and at many other sites, including
http://www.yahoo.co.uk/
• Entries are NOT rated.
• Accessible free of charge.
178
Internet global subject directories:
Google directory
• A hypertext global subject directory can be found at
http://directory.google.com/
• Accessible free of charge.
• Based on the Netscape DMOZ
Open Directory Project.
• Do not confuse this with the famous Google WWW search
engine.
179
Internet global subject directories:
Open Directory Project
• A hypertext global subject directory can be found at
http://www.dmoz.org/
• The contents are also used in other systems,
such as Google Directory and Webbrain.
• Accessible free of charge.
180
!! Task - Assignment - Exercise !!
Try to find Internet sources
which are relevant for you,
by using an Internet-based
global subject directory.
181
Internet local subject directories:
examples in Belgium
• http://yellow.advalvas.be/weblist.html
• http://search.msn.be/
• The guide developed by the public libraries in Flanders:
http://www.bib.vlaanderen.be/webwijzer
182
Internet indexes:
automated search tools
• Several systems allow you to search for and locate many
items (addressable resources) on the Internet in a more
systematic, direct way than by only browsing/navigating.
• These systems do NOT search the contents of computers
through the real Internet in real time and completely
when a user makes a query.
Searching in that way would be much too slow due to
limitations in the technology.
183
Internet indexes:
scheme of the mechanism
User searching for Internet based information
Internet client hardware and software
user interface to a search engine
Internet index search engine
Internet information source
Internet crawler and indexing system
database of Internet files, including an index
184
Internet indexes:
description of the mechanism
Each of these search systems is based on:
• a database of links to pages / URLs that can be retrieved
by searching with queries through a big index that is built
automatically from the contents, the texts, of
these pages
(to build this database and to keep it up to date, pages are
continuously collected from the Internet by a “robot”
computer software system)
• a search system with a user interface in a WWW form, to
allow the user to search through that database
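The two components described above can be sketched in Python as a small "inverted index". This is a simplified illustration, not the implementation of any particular search engine. The collected pages are invented, and the function names tokenize, build_index and search are chosen for this example only.

```python
import re
from collections import defaultdict

# Hypothetical pages, standing in for what a "robot" has collected earlier.
COLLECTED_PAGES = {
    "http://example.org/a": "Thesaurus construction for information retrieval",
    "http://example.org/b": "Information about library catalogues",
    "http://example.org/c": "Retrieval of images from the web",
}


def tokenize(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain all words of the query."""
    words = tokenize(query)
    if not words:
        return []
    # .get avoids adding empty entries to the index for unknown words.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(COLLECTED_PAGES)
both_words = search(INDEX, "information retrieval")
one_word = search(INDEX, "retrieval")
print(both_words)
print(one_word)
```

The query is answered from the index that was built beforehand, not by visiting the pages at the moment of the search.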
185
Internet indexes:
AltaVista
• The primary search interface can be found in the US.
The following addresses all lead to the same information:
»http://www.altavista.com/
»http://www.av.com/
»http://av.com/
• Mirror site in UK:
»http://uk.altavista.com/
»http://www.altavista.co.uk/
186
Internet indexes:
AltaVista: features
• Allows full text searching of the WWW
• Offers relevance ranking of search results
• Allows also advanced Boolean searching
(in “Advanced” mode)
• Offers a link to an Internet subject directory (Looksmart)
• Offers links to systems to find
images, sounds… (multimedia) in the Internet
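Boolean searching combines the sets of pages found for each word. A minimal sketch in Python with an invented index; it illustrates the principle only and does not reproduce the query syntax of AltaVista.

```python
# Hypothetical index: for each word, the set of page numbers that contain it.
INDEX = {
    "library": {1, 2, 4},
    "catalogue": {2, 4, 5},
    "museum": {3, 4},
}

# library AND catalogue: pages that contain both words
both = INDEX["library"] & INDEX["catalogue"]

# library OR museum: pages that contain at least one of the words
either = INDEX["library"] | INDEX["museum"]

# (library AND catalogue) AND NOT museum: exclude pages that mention museum
without_museum = both - INDEX["museum"]

print(sorted(both))
print(sorted(either))
print(sorted(without_museum))
```

AND narrows a search, OR broadens it, and NOT excludes unwanted pages.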
187
Internet indexes:
All the Web
• The search interface can be found at:
http://www.alltheweb.com/
http://alltheweb.com/
• You can search the WWW and ftp servers.
• The database is one of the biggest.
• Not only HTML and plain text files, but also the full text
of many Adobe PDF files is indexed.
• Offers also a module to search for pictures/images.
• Offers spelling suggestions in the search interface.
188
Internet indexes:
Google (Part 1)
• http://www.google.com/
• Full-text searching is possible of many files that are
available through the WWW.
• Not only HTML and plain text pages are covered, but also
the first part is indexed of many files in other file formats,
such as
»Adobe PDF,
»Microsoft Word, Microsoft Excel, Microsoft PowerPoint
»Rich Text Format…
189
Internet indexes:
Google (Part 2)
• One of the most popular systems in 2001, 2002, 2003…
• For retrieval an algorithm is used that takes into account
the links between WWW pages.
A retrieved page is ranked higher when
»many sites/pages point to it
»“important” sites/pages point to it
• Some other famous search systems are based on Google
such as Netscape Search and the WWW searches of
Yahoo! (at least in 2003).
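The idea of ranking by links can be sketched in Python as follows. This is a simplified link-based scoring in the spirit of the PageRank method, not the algorithm as actually run by Google. The link graph, the function name rank_pages and the damping value 0.85 are assumptions made for this illustration.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by incoming links; links from high-scoring pages count more."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score equally over the pages it points to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                for other in pages:
                    new_scores[other] += damping * scores[page] / n
        scores = new_scores
    return scores


# Hypothetical link graph: each page is followed by the pages it points to.
LINKS = {
    "A": ["C"],
    "B": ["C", "E"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C"],
}

scores = rank_pages(LINKS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this graph, C comes first because many pages point to it. A and E each receive exactly one link, but A is ranked higher because its link comes from the "important" page C.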
190
Internet indexes:
Google computer servers
• Google uses a system of more than 10 000 small computer
servers to offer its information services.
191
Internet indexes:
Google additional features
• Besides a system to search for WWW pages,
Google offers also
»a subject directory
»searching for images/pictures on the WWW
»searching an archive of Usenet messages +
posting to Usenet groups
»searching for news
• Thus Google has become a great integrator / aggregator.
192
Internet indexes:
coverage
• Internet indexes do not cover all static documents on the
WWW.
• Most indexes grow and their “size ranking” is variable.
• If exhaustive results are desired, then more than one
Internet index search system should be used.
193
Internet indexes:
coverage and size of each index
• Most indexes grow and their “size ranking” is variable.
• The biggest systems in 2003:
»Google !
»AltaVista
»All the Web (serving also Lycos)
»Systems based on the INKTOMI database of WWW
pages.
194
!! Task - Assignment - Exercise !!
Try to find Internet sources
which are relevant for you,
by using an Internet index.
195
Coverage of Internet directories and
Internet indexes
Internet information sources
A global Internet directory
A global Internet index
196
Global Internet search tools:
a comparison
Global Internet directories:
• Only a limited selection of Internet sources
• Browsing information sources is easy
• Good for broad searches
Global Internet indexes:
• About 1/3 of the Internet is covered by an index
• Searching requires some skills and knowledge
• Good for specific, narrow searches
Multi-threaded search systems:
• These get information from directories and indexes
• Searching requires some skills and knowledge
• Good when even 1 index does not yield information
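What a multi-threaded search system does with the answers from several indexes can be sketched in Python. The result lists and the function name merge_results are invented for this illustration; real systems use more refined ways to merge and rank.

```python
def merge_results(*result_lists):
    """Combine URL lists from several indexes, dropping duplicates.

    A URL found by more indexes comes first; ties keep first-seen order.
    """
    counts = {}
    for results in result_lists:
        for url in dict.fromkeys(results):  # ignore repeats within one list
            counts[url] = counts.get(url, 0) + 1
    # sorted() is stable, so URLs with equal counts stay in first-seen order.
    return sorted(counts, key=lambda url: -counts[url])


# Hypothetical answers from three indexes to the same query.
index_1 = ["http://example.org/a", "http://example.org/b"]
index_2 = ["http://example.org/b", "http://example.org/c"]
index_3 = ["http://example.org/d", "http://example.org/b", "http://example.org/a"]

merged = merge_results(index_1, index_2, index_3)
print(merged)
```

The merged list contains 4 different addresses, although no single index found more than 3 of them. This is why more than one index helps when exhaustive results are desired.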
197
Internet:
who owns the search tools?
In 2003:
• The company Yahoo! owns
»the most famous global Internet subject directory
»3 (!) Internet full-text search engines:
All the Web, AltaVista, Inktomi
• The company Google owns
»the most famous Internet full-text search engine
»one of the best Internet image search engines
»a gateway to old and new Usenet news messages
198
Online access information
sources and services
Public access book databases
199
Public access book databases:
introduction
• Even in this age of Internet-based information sources, a
lot of information is still distributed in the form of printed
books.
• The contents of most books are (still) not available on the
Internet.
• Most general Internet search tools do NOT allow you
to find out about the existence of books that may be
interesting for you.
• So, specific search tools to find books can be useful.
200
Public access book databases:
an overview
• (Databases by publishers.)
• Fee-based databases by commercial providers
• Databases by book distributors / bookshops!
• Online public access catalogues of
»local libraries,
»national libraries (which produce and offer normally
their national bibliography)!
»big, famous libraries!!
• (Databases of computer-based versions of books.)
201
Public access book databases:
which one to use?
• For years, the market of bibliographic information
on books was limited to the services and databases of
subscription-based bibliographic providers.
• Nowadays, the WWW provides a key to unlock many
possibilities to find bibliographic information.
• Which book database should be preferred for
particular applications is not clear for most
librarians or end-users.
202
Public access book databases
by commercial producers
• To find currently available books, some databases
assembled by commercial producers can be
interesting.
• Example: Global Books in Print
• These databases offer formal descriptions of books,
prices of the books, short descriptions of the contents
with subject terms…
• However, access to such a database is not free of
charge and can be expensive
(in comparison with alternatives).
203
Public access book databases
provided by bookshops
• To find currently available books, the bibliographic
databases assembled by big bookshops are interesting.
• Several offer a good coverage and
are accessible free of charge.
• The added price information can be useful for the
acquisition and accounting department of a library or if
an individual user wants to buy a book.
• Some provide a current awareness service,
also free of charge.
Examples
Book databases accessible free of
charge: examples in U.S.A.
• Amazon.com (US):
http://www.amazon.com/
http://www.amazon.co.uk/
note: amazon, NOT amazone
Subject description is poor.
• Barnes and Noble (US):
http://www.bn.com/
204
205
Free public access bibliographic book
database + price comparisons
• Even comparisons of the catalogues of shops that sell books
(as well as music, movies and many other goods)
are available free of charge.
• See for instance
»http://www.bookfinder.com/
»http://www.dealtime.com/
206
!! Task - Assignment - Exercise !!
Search for titles of books
which are relevant for you,
using an online database provided by
a book publisher or bookshop.
207
Online Public Access Catalogues of
libraries
• Mainly to find older books, the catalogues of libraries can
be useful.
• Most are accessible online and free of charge.
208
Online access information
sources and services
Fee-based online public access
information services
209
Types of online access information
systems: “free” versus “fee”
• A lot of the information on the Internet is available free of
charge, but another part is only accessible when a fee is
paid to the producer and / or the distributor.
• The first commercial computer systems that made
information available online appeared around 1975.
Most of them are now also available through the Internet.
• Some organisations pay these fees for some sources and
then organise access, so that the members of the
organisation can retrieve and exploit the information as if
it were free of charge.
210
Types of online access information
systems: “free” versus “fee”
Public access information sources
free of charge
Fee-based online information services
(NOT free of charge)
211
Types of online access information
systems: “free” for members only
Public access information sources
free of charge
Fee-based online information services,
made accessible “free of charge”
by an institute to its members
Fee-based online information services
(NOT free of charge)
212
Online information services:
total size of their databases
In 1999:
The big host systems and the public access WWW pages
offer a comparable quantity of information:
• WWW offered about 8 terabytes (= 8 000 gigabytes) of
text data
(according to Lawrence and Lee Giles, Nature, 1999, Vol. 400, pp. 107-109.)
• Dialog offered about 9 terabytes (= 9 000 gigabytes)
(in 1998)
»6 billion pages of text
»3 million images
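The figures above can be compared with a short calculation in Python. It assumes decimal units, as on the slide (1 terabyte = 1 000 gigabytes), and it assumes that the 9 terabytes are spread evenly over the text pages, ignoring the images. The result is therefore only a rough indication.

```python
TERABYTE = 10**12  # decimal terabyte, matching "9 terabytes (= 9 000 gigabytes)"

dialog_bytes = 9 * TERABYTE
dialog_pages = 6 * 10**9  # 6 billion pages of text

# Average size of one page of text in Dialog, under the assumptions above.
bytes_per_page = dialog_bytes / dialog_pages
print(bytes_per_page)  # 1500.0

# Size of the public WWW text data relative to Dialog.
www_bytes = 8 * TERABYTE
www_to_dialog = www_bytes / dialog_bytes
print(round(www_to_dialog, 2))
```

About 1 500 bytes per page is plausible for a page of plain text, and the two quantities of text differ by little more than 10 %, which supports the word "comparable".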
213
Online access information
sources and services
Online access databases about journal articles
214
Online access databases
about journal articles: overview
• Thousands of fee-based online access databases offer
bibliographies or full-texts of journal articles in
particular subject domains and published by many
publishers.
• Many publishers offer searchable bibliographies, but only
of their own publications. (for instance Emerald, Elsevier)
• Only few large databases offer access to bibliographies of
articles published in journals from many publishers, free
of charge.
215
Online access databases
about journal articles: Article@INIST
• Article@INIST allows you to search in a bibliographic
database, NOT full-text, (Journal articles, journal issues,
books, reports, conferences, doctoral dissertations)
at the Institut de l'Information Scientifique et Technique,
France.
• Does not offer usage of classification or thesaurus.
• Searching is free of charge.
• Available from http://form.inist.fr/public/eng/conslt.htm
• Payment is required to receive the full text of an article.
216
Online access databases
about journal articles: Ingenta (1)
• Ingenta Journals allows you to search a bibliographic
database of millions of journal articles,
including titles, authors, in many cases abstracts.
• Searching is free of charge.
217
Online access databases
about journal articles: Ingenta (2)
• Payment is required to receive the full text of an article.
• Available from
»http://www.ingenta.co.uk/
»http://www.ingenta.com/
• Ingenta acquired Uncover in 2000.
218
Online access databases
about journal articles: Infotrieve
• Infotrieve allows you to search free of charge in a
bibliographic database of the articles of more than 20 000
journal titles and conference proceedings, NOT full-text.
• Available from http://www3.infotrieve.com/
• Payment is required to receive the full text of a document.
• Current awareness services are also offered free of
charge:
the tables of contents of new issues of the journals that you
have selected are sent to you by e-mail.
219
Example
Online access databases
about journal articles: Scirus
• This is a specialised Internet index that allows you to
search for selected scientific information (only) on the
WWW.
This includes the peer-reviewed articles in the journals
that are published in ScienceDirect by Elsevier.
• An article can be downloaded in full-text format only
when a fee has been paid to the publisher.
• The search interface: http://www.scirus.com
Example
Online access databases
about journal articles: Scirus features
• Offered free of charge by Elsevier.
• Is partly based on the Fast WWW search system that is
also used by Alltheweb.
• Offers access to information ordered according to some
classification system / taxonomy.
220
221
Online access information
sources and services
Finding multimedia files on the Internet
222
Finding multimedia files on the
Internet: introduction
Several public access search systems are available
free of charge, to search the Internet for multimedia files:
»images / pictures (artwork, photos, or both)
»sound / audio files (music, speeches...); video
223
Finding images on the Internet:
introduction
• Several public access search systems are available free of
charge to search for
images / pictures (artwork, photos, or both)
on the Internet.
• When searching for images, the search results from such
a system offer not only links to the image files on the
Internet, but also directly small versions of the images
(so-called “thumbnails”).
Examples
Finding images on the Internet:
screen shot of a Google image search
224
225
Examples
Finding images on the Internet:
examples of search engines (1)
• http://alltheweb.com/ !!
• http://gallery.yahoo.com/ !
• http://images.google.com/ !!!
or through http://www.google.com/
The largest database in this category (at least in 2002,
2003). For each result, not only a thumbnail is offered,
but also directly the origin with the readable URL;
this makes it easier to guess the relevance of the
document.
226
Examples
Finding images on the Internet:
examples of search engines (2)
• http://multimedia.lycos.com/
• http://www.altavista.com/ !!
(also audio and video; in the user interface, choose IMAGES
rather than the normal text search.)
227
!! Task - Assignment - Exercise !!
Use a specialised search engine
to find images
about a particular subject
on the Internet.
228
Online access information
sources and services
Evolution and future trends
229
Online access information:
evolution and future trends
• An increasing amount of information becomes available
online.
• A growing amount of this online information becomes
available free of charge.
• The quality and ease of use of software
on server as well as client is growing.
A consequence is:
• An increasing number of end-users searching for
information online.
230
Online access information:
easier and more complicated?!
• At the same time, information retrieval becomes
both easier and also more complicated.
This may seem strange and contradictory, but it is reality.
This is a paradox.
231
Online access information:
easier information retrieval systems
• Individual information retrieval systems become easier:
»they react faster;
»they can provide access to more data/information in one
action;
»their user interfaces are simple,
but more sophisticated, intelligent retrieval algorithms can
nevertheless deliver satisfactory results in most simple
cases.
232
Online access information:
more complicated information market
• The whole information landscape consists of more and
more decentralised information sources, each one
bringing an individual user interface that should be
mastered.
Making the right choice among the sources does not become
easier, and perhaps becomes more complicated every day.
233
Online access information:
more complicated information market
• Furthermore, for many sources
the accessibility / availability,
the user interface,
the interlinking,
depend on the organisation in which the searcher is
active.
234
Online access information:
conclusion
• In the case of simple information needs, the WWW and the
search tools can work like “magic”.
• However, in the case of more complicated information
needs, there is still no “magic button” that brings you
immediately to all the required information.
235
Evaluating the quality
of information
Documentary information sources:
evaluating their quality
236
Documentary information sources:
evaluating their quality
• We should always be critical when using information
sources, in view of
»the widely varying degrees of quality of information
sources, and of
»the costs associated with searching, finding, using
information.
237
Documentary information sources:
evaluation criteria (1)
• Is the information valid, reliable, trustworthy, genuine,
authentic?
Is the author honest?
Is the source objective, not subjective, without cultural or
political or ideological or commercial bias?
Is the origin an individual or a company or an
organisation?
Is the publication sponsored by some company or
organisation?
238
Documentary information sources:
evaluation criteria (2)
• Is the information accurate, correct?
Who is the author or producer?
Does the source have an author or a producer with
high expertise, a good reputation, good qualifications?
Can the author be contacted for clarification or
discussion?
Was the information reviewed, edited, improved,
corrected, censored, approved, verified, before
publication?
Do experts agree on the information provided?
239
Documentary information sources:
evaluation criteria (3)
• Is the information source unique?
Does it offer a great amount of primary information,
which is not obtainable from other sources?
• Is the information complete?
Is the work available in its entirety?
• Does the source offer a wide coverage?
Is the source comprehensive, substantive?
• Is the information current enough, up to date?
Is a publication date provided?
Is an expiration date provided?
240
Documentary information sources:
evaluation criteria (4)
• Does the document provide suitable references, so that
you can verify statements and find older suitable
information sources?
• Are the format and lay-out of the information good and clear?
Is the information system user-friendly?
Is it easy for users to orientate themselves within the resource
and to find their way around it?
• Good user support / Good customer support?
• Is the type of distribution medium appropriate?
(print, e-mail, online,...)
241
Documentary information sources:
evaluation criteria (5)
• Is the information what you want?
If not, then reassess your needs and consider other types
of information as well.
242
Documentary information sources:
evaluation criteria (6)
• Is the information suitable for your level of
understanding of the subject?
Is the document popular, suitable for the general public,
for students, for professionals, for scholarly/academic
use…?
Does it report new, primary research (survey, experiment,
observation, measurement, invention) or is it a review of
sources published earlier?
• Does the information repeat or confirm what you already
know, or is it complementary, contradictory, new?