Nomadic Digital Library Research at Cornell

Automated Digital Libraries
William Y. Arms
Department of Computer Science
Cornell University
1
Two Questions
2
Before Digital Libraries
Access to scientific, medical, legal
information
In the United States:
-- excellent if you belonged to a rich organization
(e.g., a major university)
-- very poor otherwise
In many countries of the world:
-- very poor for everybody
3
Question 1
Must access to scientific and
professional information be expensive?
4
Research Libraries are Expensive
[Chart labels: library materials; buildings & facilities; staff]
5
The Potential of Digital Libraries
[Chart labels: open access (?); materials; computers & networks; staff]
6
Question 2
How effectively can computers be used
for the skilled tasks of professional
librarianship?
-- Time horizon: 5 to 20 years
-- All materials in digital form
7
Automated Library Services
8
Skilled Librarianship
People are skilled at judgment, understanding,
discrimination, etc.:
-- selection
-- cataloguing, indexing
-- seeking information
-- evaluating information
-- reference service
Can computers provide equivalent services?
9
Equivalent Services
Example: Cataloguing rules
-- Application of cataloguing rules to monographs is skilled
-- It is hard to imagine a computer system with these
skills
but ...
-- Catalogs and cataloguing rules are the means, not the
end
10
Equivalent Services
Information discovery
Why are web search services the most widely used
information discovery tools in universities today?
11
Conventional Criteria
Web search services have many weaknesses
-- selection is arbitrary
-- index records are crude
-- no authority control
-- duplicate detection is weak
-- search precision is deplorable
yet they clearly satisfy important requirements ...
12
Effectiveness of Web Search
Inspec v. Google
Google is usually superior for general computing and
computer science questions
> Broader coverage
> Adequate indexing records
> Better ranking
13
Simple Algorithms
+
Immense Computing Power
14
History: Licklider
J. C. R. Licklider
Libraries of the Future, 1965
-- envisaged digital libraries for scientists at their place
of work
-- listed desiderata for a digital library
-- studied construction of fully automated digital libraries
-- put emphasis on artificial intelligence and natural
language processing
15
History: Licklider
Licklider's predictions for digital libraries
were remarkably good, but ...
-- overoptimistic about progress in artificial
intelligence
-- underestimated what can be done by brute force
computing
16
Brute Force Computing
Few people can appreciate the power of
Moore's Law
-- Computing power doubles every 18 months
-- Increases 100 times in 10 years
-- Increases 10,000 times in 20 years
Simple algorithms + immense computing power
may outperform human intelligence
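
The growth figures on this slide follow directly from the doubling period. A minimal sketch of the arithmetic, assuming the 18-month doubling period quoted above (the function name growth_factor is illustrative, not from the talk):

```python
# Growth implied by a fixed doubling period.
# Assumption: the 18-month figure quoted on the slide.
DOUBLING_PERIOD_MONTHS = 18


def growth_factor(years: float, doubling_months: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Return how many times computing power multiplies over `years`."""
    doublings = years * 12 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    for years in (10, 20):
        print(f"{years} years: about {growth_factor(years):,.0f} times")
```

Ten years is about 6.7 doublings and twenty years about 13.3, which gives roughly the 100-fold and 10,000-fold increases stated on the slide.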
17
Brute Force Computing
Example
Creators of the world champion chess program
(Deep Thought, later Deep Blue)
-- moderate chess players
-- simple tree-search algorithm
-- very, very fast computer hardware
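
To illustrate what a "simple tree-search algorithm" can look like, here is a minimal sketch of exhaustive game-tree search on a toy game. It is not the chess program's actual algorithm. The game is a hypothetical one chosen for brevity: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The names negamax and best_move are illustrative.

```python
MAX_TAKE = 3  # hypothetical rule: a move removes 1 to 3 stones


def legal_moves(stones: int) -> range:
    """All legal numbers of stones to take from a pile of `stones`."""
    return range(1, min(MAX_TAKE, stones) + 1)


def negamax(stones: int) -> int:
    """Score the position for the player to move: +1 is a win, -1 a loss.

    The search simply tries every move, then every reply, and so on
    to the end of the game. No game-specific knowledge is used.
    """
    if stones == 0:
        return -1  # the previous player took the last stone and won
    # A position is as good as the best move available from it, where each
    # move is worth the negation of the opponent's resulting score.
    return max(-negamax(stones - take) for take in legal_moves(stones))


def best_move(stones: int) -> int:
    """Return the number of stones to take (requires stones >= 1)."""
    return max(legal_moves(stones), key=lambda take: -negamax(stones - take))


if __name__ == "__main__":
    print("From 7 stones, take", best_move(7))
```

The search contains no expertise about the game. Its playing strength comes only from how many positions the hardware can examine, which is the point of the slide.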
18
An Anecdote
The question (Marvin Minsky)
-- How would you design a computer system
that can answer questions such as, "Why
was the space station a bad idea?"?
The answer (Danny Hillis)
-- Design much more powerful computers!
19
Examples of
Automated Digital Library
Services
20
Web Search
Brute force indexing and retrieval
-- retrieve every page on the web
-- index every word
-- repeat every month
Getting better all the time
-- improved algorithms
-- faster computers and networks
-- analysis of users
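
The "index every word" step can be sketched as an inverted index: a map from each word to the set of pages that contain it. This is a minimal illustration, not how any particular search service is built. The page identifiers and text below are made-up examples, and the function names are illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of page identifiers containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical pages, standing in for the output of a web crawl.
    pages = {
        "page-1": "Automated digital libraries",
        "page-2": "Digital preservation of web pages",
    }
    index = build_index(pages)
    print(sorted(search(index, "digital")))
    print(sorted(search(index, "digital libraries")))
```

Nothing in this step requires judgment about the pages. The cost is dominated by the volume of text, which is why it scales with computing power.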
21
Web Search
Ranking algorithms
Closeness of match
-- vector space and statistical methods
(Salton, et al., c. 1970)
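
A minimal sketch of the vector space idea: the query and each document become vectors of word counts, and closeness of match is the cosine of the angle between them. Raw counts are an assumption made for brevity. Practical systems usually weight terms, for example by how rare they are across the collection. The function names and sample documents are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def term_vector(text: str) -> Counter:
    """Represent text as a bag of lowercase words with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs, best match first."""
    q = term_vector(query)
    scores = [(doc_id, cosine(q, term_vector(text))) for doc_id, text in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical documents for illustration.
    docs = {
        "doc-1": "digital library services for digital collections",
        "doc-2": "chess programs and tree search",
    }
    for doc_id, score in rank("digital library", docs):
        print(doc_id, round(score, 3))
```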
Importance of digital object
-- Google ranks web pages by how many other pages link
to them, gives greater weight to links from higher
ranking pages.
(NSF/DARPA/NASA Digital Libraries Initiative)
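
The link-based ranking described above can be sketched as a repeated redistribution of scores along links: a page's score is shared among the pages it links to, so links from high-scoring pages count for more. This is a simplified illustration, not Google's production method. The damping value of 0.85, the iteration count, the function name link_rank, and the sample graph are all assumptions.

```python
from typing import Dict, List


def link_rank(
    links: Dict[str, List[str]],
    damping: float = 0.85,  # assumed value, not taken from the talk
    iterations: int = 50,
) -> Dict[str, float]:
    """Score pages by incoming links, weighting links from high-scoring pages more."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            # A page with no outgoing links spreads its score over all pages.
            receivers = targets if targets else pages
            share = damping * rank[page] / len(receivers)
            for target in receivers:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical link graph: a and b both link to c, and c links back to a.
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(link_rank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In the sample graph, c ranks highest because two pages link to it. Page a ranks above b because its single incoming link comes from the highly ranked c, while b has no incoming links.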
22
Archiving and Preservation
Internet Archive
-- Monthly, a web crawler gathers every open access web
page with associated images
-- Web pages are preserved for future generations
-- Files are available for scholarly research
not perfect ...
-- HTML pages, images; no Java applets, style sheets
-- materials are dumped with no organization or indexing
-- access for scholars is rudimentary
23
Reference Linking
Web of Science
(ISI)
-- input: combination of automatic means, skilled people
-- limited number of journals
-- very expensive
ResearchIndex (a.k.a. CiteSeer, a.k.a. ScienceIndex)
(NEC)
-- fully automatic
-- all open access material in computer science
-- a free service
24
Beyond Text
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television
news. Algorithms for:
-- dividing raw video into discrete items
-- generating short summaries
-- indexing the sound track using speech recognition
-- recognizing faces
-- searching using natural language processing
(NSF/DARPA/NASA Digital Libraries Initiative)
25
Costs and Benefits
26
Costs of Catalogs and Indexes
Catalog, index and abstracting records are very
expensive when created by skilled professionals
-- only available for certain categories of material
(e.g., monographs, scientific journals)
-- contain limited fields of information
(e.g., no contents page)
-- restricted to static information
High costs reduce effectiveness and access
27
Costs of Automated Digital
Libraries
The Google company
-- 5.5 million searches daily
-- 85 people (half technical, 14 with Ph.D. in computing)
-- 2,500 PCs running Linux, with 80 terabytes of disk
The Internet Archive
-- 7 people with support from Alexa
(March 2000)
28
Overall
If you are rich ...
-- Research libraries, using commercial
information services, provide excellent
service at very high cost to a favored few
-- Automated digital libraries are a long way
from providing the personal reference
service available to a faculty member at a
well-endowed university
but ...
29
The Model T Library
The Model T Ford, with mass production, brought car
travel to the masses ...
-- Automated digital libraries, with open access materials,
can already provide good service at low cost
-- In the future automated digital libraries can bring
scientific, scholarly, medical and legal information
to everybody at negligible cost
30
A Footnote
31
Library Expertise
The future of scientific and professional information is
tied to computing, but ...
-- automated digital libraries need small teams of highly
skilled people
-- development of automated digital libraries is bypassing
libraries
(Google, ResearchIndex, Informedia, Internet Archive)
The level of computing expertise in U.S. research
libraries is depressingly low
32
Further reading
William Y. Arms, "Automated digital libraries." To be
submitted to D-Lib Magazine, July/August 2000.
William Y. Arms, "Economic models for open-access
publishing." iMP, March 2000.
http://www.cisp.org/imp/march_2000/03_00arms.htm
33