Building the Digital Library:
Setting the Standards, Building the Tool Kit
Roy Tennant
California Digital Library
Outline

Project Planning
 Selecting Material to Digitize
 Basic Imaging Principles
 Capturing Images
 Editing Images
 Conversion to Text
 Best Practices
 Metadata
 Case Studies
 Skills Required of Staff
 Final Thoughts
Project Planning

What are the project goals and objectives?
Which audiences do you wish to serve and how?
Who will do the work?
What systems will be required?
What are the specifications for images and metadata?
How much will the project cost?
Who will own and manage the digital products that will be produced?
See Handbook for Digital Projects, section by Steve Chapman
Start at the End

What do you want the end user experience to be like?
For example, do you want them to be able to:
  View thumbnail images that lead to one or more larger sizes? How big?
  Search full-text?
  See text as page images or fully converted text?
  Etc.
Answers to questions like these will dictate how you will need to digitize, what metadata you must capture, etc.
Images w/Descriptive Text

Benefits:
  Relatively easy
  Can provide an enjoyable and instructive overview of a collection
Drawbacks:
  Does not scale
  Lack of metadata limits uses
Requires:
  HTML
Images w/Metadata

Benefits:
  Can provide reasonable access to many images
  Can repurpose and combine with other collections
Drawbacks:
  Can be expensive to produce and maintain
Requires:
  Item-level metadata
  Database
Page Images

Benefits:
  Depicts page accurately
  Retains historical accuracy
Drawbacks:
  Unsearchable
  Cannot repurpose (e.g., other access formats)
Requires:
  Structural metadata
  Page-turning system
Page Images w/OCR

Benefits:
  Same as w/page images only, but with searchable text
Drawbacks:
  Typically, OCR is left “dirty” (uncorrected), so searching is not 100%
  Cannot repurpose
Requires:
  Structural metadata
  Page-turning system
  Full-text index
Full Text Basic

Benefits:
  Searchable
  Lower cost to produce than enriched texts
Drawbacks:
  Difficult to repurpose
  Loses fidelity to original
Requires:
  HTML or database (browse)
  Full-text index (search)
Full Text Enriched

Benefits:
  Searchable
  Repurposable
Drawbacks:
  Loses fidelity to original (although less than basic full text)
  Expensive to produce
Requires:
  An XML-serving infrastructure
  HTML or database (browse)
  Full-text index (search)
Selecting Material to Digitize

Publishing rights
Available support/funding opportunity
Critical mass
Uniqueness
Reputation
Audience and potential use
Diversity of material type
Ability to stand on its own and fit in with other collections
Types of Materials
Printed text/Simple line art
Halftones
Mixed
Manuscripts
Continuous Tone
From Anne Kenney, et al., Moving Theory into Practice
Benchmarking

The process whereby you determine your digitization requirements using the material you will digitize
Resolution
The number of pixels in a given area defines the resolution of an image
[Figure labels: one pixel; 1”; 500 x 1,000 pixels]
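The relationship between physical size, resolution, and pixel count can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name and the letter-size example are assumptions chosen for the sketch.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of an original scanned at `dpi` pixels per inch."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical example: a letter-size page at two common scanning resolutions.
for dpi in (300, 600):
    w, h = pixel_dimensions(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels")
```

Doubling the resolution doubles each dimension, so the pixel count (and the uncompressed file size) grows fourfold.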
Dynamic Range (bit-depth)
1 bit = black or white
8 bits = 256 shades
16 bits = thousands
24 bits = millions
36 bits = billions
[Figure: sample images at 1 bit, 8 bit grayscale (GIF), 8 bit color (GIF), and 24 bit color (JPEG)]
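The counts above follow from the fact that n bits can represent 2 to the power n distinct values. A small sketch (the function name is an assumption for illustration):

```python
def tonal_values(bit_depth):
    """Number of distinct shades or colors one pixel of this bit depth can hold."""
    return 2 ** bit_depth


for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bits = {tonal_values(bits):,} values")
```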
RGB Color Space
Color channels: Red, Green, Blue
8 bits per channel = 24 bit color image
12 bits per channel = 36 bit color image
Image Compression
Lossless: the image is unchanged after compression (no image data is lost)
  Typical file size: 50% of original
  Example: LZW compression
Lossy: the image is altered after compression (image data is lost)
  Example: JPEG
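The difference between the two families can be demonstrated with the Python standard library. This is a sketch under stated assumptions: zlib (DEFLATE) stands in for LZW as the lossless method, and a crude bit-dropping step stands in for a lossy method. It is not JPEG, and the sample data is synthetic.

```python
import zlib


def lossless_roundtrip(data: bytes):
    """Compress and decompress; the restored bytes should equal the input."""
    compressed = zlib.compress(data, 9)
    return compressed, zlib.decompress(compressed)


def crude_lossy(data: bytes) -> bytes:
    """Discard the low 4 bits of every byte (illustrative only, not JPEG)."""
    return bytes(b & 0xF0 for b in data)


# Synthetic 8 bit "scan": 1,000 rows of a white page with a black rule.
image = bytes([255] * 90 + [0] * 10) * 1000
compressed, restored = lossless_roundtrip(image)
print(len(image), "bytes ->", len(compressed), "bytes; identical:", restored == image)

gradient = bytes(range(256))
print("lossy result identical to original:", crude_lossy(gradient) == gradient)
```

The lossless path returns exactly the original bytes; the lossy path cannot be reversed, which is why master versions are normally kept in a lossless form.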
TIFF
Tagged Image File Format
Most often used to save “master versions” of images (unedited)
Can be compressed or uncompressed (typically lossless)
CompuServe GIF

Graphics Interchange Format (GIF)
Maximum 8 bits/pixel: 256 colors (shades)
Good for:
  Text and line art
  Thumbnails
Not good for:
  Full-color pictures
  Anything that requires more than 256 colors
JPEG

Joint Photographic Experts Group
JPEG is actually a compression scheme; the image file format is JFIF (JPEG File Interchange Format)
Good for:
  Full-color pictures
  Anything that requires more than 256 colors
Not good for:
  Text or line art
New Image Formats

Portable Network Graphics (PNG) - from the W3C to replace the CompuServe GIF format and provide more capabilities
JPEG2000 - An upgrade of the JPEG format
Flashpix - from a consortium of commercial companies, to provide much higher-resolution images in a way that allows speedy network delivery
MrSID - From LizardTech, good for large format materials (maps, panoramic photos, etc.)
Capturing Images
Rendering Intent
Technologies
  Digital Cameras
  Flatbed Scanners
  Film Scanners
  Kodak PhotoCD
Outsourcing
Standards and Best Practices
What is Your Rendering Intent?

The Artifact:
  The “look and feel”
  The experience of interacting with a specific object
  Possible consequences:
    Choices for providing access are limited
    Time and money spent on recreating the artifact may be better spent on increasing access
    In some cases, preserving the look and feel actually harms other uses
The Intellectual Object:
  The content and the use of that content is central
  Possible consequences:
    The experience of interacting with a specific object may be lost
    The “look and feel” of a specific object may be lost
    The user may not be aware of the actual physical state of the original
Digital Cameras
Phase One PowerPhase FX
10,500 x 12,600 pixels, 760MB (48 bit RGB)
BetterLight Super6K
6,000 x 8,000 pixels, 136MB (24 bit RGB)
$16,990
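The file sizes quoted for these cameras follow from pixel count times bit depth. A rough check, assuming uncompressed storage (the helper names are hypothetical):

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of an uncompressed image: one sample set per pixel, 8 bits per byte."""
    return width_px * height_px * bits_per_pixel // 8


def to_mib(n_bytes):
    """Convert bytes to binary megabytes (1,048,576 bytes each)."""
    return n_bytes / (1024 * 1024)


powerphase = uncompressed_bytes(10_500, 12_600, 48)
super6k = uncompressed_bytes(6_000, 8_000, 24)
print(f"PowerPhase FX: {to_mib(powerphase):.0f} MB")
print(f"Super6K: {to_mib(super6k):.0f} MB")
```

The results land close to the 760MB and 136MB figures on the slide; the small differences presumably come from rounding or from how a megabyte is counted, which is an assumption here.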
Flatbed Scanners
Minimum requirements:
  2400 dpi optical resolution
  42-bit color
Not for slides or transparencies, best for 8 1/2” x 11” or 8 1/2” x 14” originals
Sheet feeder (often optional) helpful for digitizing text
Film Scanners
For 35mm slides and negatives; others available for larger formats
$400 - $3,000
Most around 2700-4000 dpi, 30-36 bit color
Kodak PhotoCD
Take pictures with a normal camera, but have your pictures “developed” onto a PhotoCD
A proprietary image format: ImagePAC, but very high resolution (4 different resolutions)
Outsourcing: Pros and Cons

Benefits:
  No ramp-up costs (both time and money)
  Probably higher quality, at least to begin with
  High volume capability
Drawbacks:
  May be more costly if you have underutilized staff time
  No internal capability or experience developed (that is, when the money runs out, so does your chance to do anything more)
  Rare items may require in-house digitization
Outsourcing: How

Write an RFQ (Request for Quote) outlining:
  Type and amount of material being digitized
  Quality requirements
  Volume per unit of time requirements
For RFQ guidance and samples, see RLG Tools for Digital Imaging:
  www.rlg.org/preserv/RLGtools.html
Digital Image Work Flow
[Diagram: Original TIFF or PCD, 10-100+MB, RGB color space, stored offline. Editing steps: Rotate, Crop, Retouch, Brightness/Contrast, then Resize, Sharpen. Outputs: JPEG (100K) and GIF (10K, indexed color space), stored online.]
Editing Images
 Rotating
 Cropping
 Retouching
 Adjusting
 Resizing
 Sharpening
 Saving
Image Editing Demonstration
Conversion to Text

Optical Character Recognition (OCR) software is required (Abbyy FineReader, Caere OmniPage Pro, Xerox TextBridge, etc.)
Quality and typography of originals is key
If OCR accuracy is below 99.5%, it is less expensive to have the text re-keyed offshore than to correct the OCR output
For some applications, uncorrected text is sufficient
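To see why a threshold like 99.5% matters, it helps to translate accuracy into errors per page. A minimal sketch; the figure of 2,000 characters per page is an assumption for illustration, not a number from the slides.

```python
def expected_errors(chars_per_page, accuracy):
    """Approximate number of wrong characters on a page at a given OCR accuracy."""
    return round(chars_per_page * (1 - accuracy))


CHARS_PER_PAGE = 2000  # assumed typical page, for illustration only
for accuracy in (0.99, 0.995, 0.999):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.1%} accurate: about {errors} errors per page")
```

Even at 99.5%, every page of this size needs roughly ten corrections, which is the manual effort that re-keying competes with.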
Imaging Best Practices
General guidelines for archival versions:
Photos, illustrations, maps, etc.:
  300-600dpi
  24-36 bit color
B/W text document:
  300-600dpi
  8 bit grayscale
Negatives and slides:
  3000-4000 pixels in longest dimension
  24-36 bit color for color; 8 bit grayscale for B/W
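The pixel target for negatives and slides can be turned into a scanner setting. A sketch assuming a standard 35mm frame whose long side is 36 mm (a fact about the film format, not stated on the slides); the function name is hypothetical.

```python
MM_PER_INCH = 25.4


def required_dpi(target_pixels, long_side_mm):
    """Scanning resolution needed to get `target_pixels` across the long side."""
    return target_pixels / (long_side_mm / MM_PER_INCH)


for target in (3000, 4000):
    print(f"{target} pixels across 36 mm needs about {required_dpi(target, 36):.0f} dpi")
```

Under that assumption the upper target works out to roughly 2,800 dpi, which sits inside the 2700-4000 dpi range quoted for film scanners.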

Imaging Best Practices
“The key to image quality is not to capture at the highest resolution or bit depth possible, but to match the conversion process to the informational content of the original, and to scan at that level--no more, no less.” - Moving Theory Into Practice
The Importance of Metadata






First definition: Cataloging by those paid better than librarians
Second definition: Structured description of an object or collection of objects
No matter what access system you use, having the right metadata is essential
The services you want to offer will define the metadata you must capture
The storage format is not that important as long as you lose nothing and you can output it in all the ways you wish to support
Capturing it at the correct granularity is key
Metadata Granularity
The degree to which you segment or “chop up” your metadata
Gross: <name>John Doe</name>
Fine:
<name>
  <given>John</given>
  <family>Doe</family>
</name>
Metadata: Qualification
<name role="creator">William Randolph Hearst</name>
<subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
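Qualifiers such as role and scheme let software select exactly the values it needs. A minimal sketch using the two elements above; the wrapping <record> element and the function names are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The <record> wrapper is hypothetical; the slides show only the two inner elements.
RECORD = """<record>
  <name role="creator">William Randolph Hearst</name>
  <subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
</record>"""


def names_with_role(root, role):
    """Return the text of every <name> whose role attribute matches."""
    return [n.text for n in root.iter("name") if n.get("role") == role]


def subjects_in_scheme(root, scheme):
    """Return the text of every <subject> drawn from the given scheme."""
    return [s.text for s in root.iter("subject") if s.get("scheme") == scheme]


root = ET.fromstring(RECORD)
print(names_with_role(root, "creator"))
print(subjects_in_scheme(root, "LCSH"))
```

Without the qualifiers, a program could not tell a creator from a publisher or an LCSH heading from a local keyword.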
Metadata: Machine Parseability
The ability to pull apart and reconstruct metadata via software
For example, this:
<name>
  <first>William</first>
  <middle>Randolph</middle>
  <last>Hearst</last>
</name>
Can easily become this:
<DC.creator>Hearst, William Randolph</DC.creator>
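The transformation above takes only a few lines once the name is stored in parts. This sketch uses Python's standard ElementTree; the function name is an assumption, and a production crosswalk would more likely be written in XSLT or a similar tool.

```python
import xml.etree.ElementTree as ET

SOURCE = """<name>
  <first>William</first>
  <middle>Randolph</middle>
  <last>Hearst</last>
</name>"""


def to_dc_creator(name_xml):
    """Rebuild a fine-grained <name> as a single 'Last, First Middle' element."""
    name = ET.fromstring(name_xml)
    last = name.findtext("last", default="").strip()
    parts = (name.findtext("first"), name.findtext("middle"))
    forenames = " ".join(p.strip() for p in parts if p and p.strip())
    creator = ET.Element("DC.creator")
    creator.text = f"{last}, {forenames}" if forenames else last
    return ET.tostring(creator, encoding="unicode")


print(to_dc_creator(SOURCE))
```

Going the other way, from a single undivided string back to first, middle, and last, would require guessing, which is the argument for capturing fine-grained metadata up front.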
Metadata: Types

descriptive - e.g., title, creator, subject - used for discovery
administrative - e.g., resolution, bit depth - used for managing the collection
structural - e.g., table of contents page, page 34, etc. - used for navigation
preservation - metadata useful for preserving the item
Item v. Collection Metadata

Collection-level metadata:
  Discovery metadata describes the collection
  Example: Kentuckiana Digital Library; see www.kyvl.org/kentuckiana/digilibcoll/digilibcoll.shtml
Item-level metadata:
  Discovery metadata describes the item
  Example: MARC or Dublin Core records for each item; see californiadigitallibrary.org
Both types may be appropriate
Doing both often takes very little extra effort
Key Metadata Standards

Encoded Archival Description (EAD)
  Used to describe archival collections
  www.loc.gov/ead/
  Expressed in SGML, or increasingly XML
Benefits:
  Allows for the description of large collections without individual item cataloging
  Is the only relevant standard in the field and has the support of the key professional association
Drawbacks:
  Is often used to encapsulate individual digitized items as the only method of access to those items
Key Metadata Standards

Metadata Object Description Schema (MODS)
  Purpose is not completely clear, but it is a bibliographic record standard similar to MARC
  www.loc.gov/standards/mods/
  Expressed in XML
Benefits:
  May be our best bet for leaving some of the baggage of MARC/AACR2 behind
  Has the backing and support of the Library of Congress
Drawbacks:
  Appears to be under the control of the Library of Congress, which is finding it difficult to think outside of the MARC box
  Is not yet fully described
Key Metadata Standards

Dublin Core (DC)
  Used to provide a “lowest common denominator” for resource discovery among collections with more complex and unique metadata formats
  dublincore.org
  Expressed in various ways: HTML, XML, RDF, etc.
Benefits:
  Provides a useful way to unify resource discovery for disparate collections
  Is the only standard addressing this need and has the support of major players in digital libraries
Drawbacks:
  Is not yet fully described
  There is often a mistaken assumption that adherence to DC for internal metadata needs is both sufficient and desirable
Key Metadata Standards

Metadata Encoding and Transmission Standard (METS)
  Used to encapsulate digital objects (both simple and complex)
  www.loc.gov/standards/mets/
  Expressed in XML
Benefits:
  Provides a method for unifying access to one or more packages of descriptive metadata, all of the relevant digital files, and structural information
  Is the only standard addressing this need and has the support of major players in digital libraries
Drawbacks:
  Is at an early stage of development, although projects are using it now
Key Metadata Standards

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
  Used to provide a method of harvesting metadata from compliant repositories
  www.openarchives.org
  Is both a protocol and a syntax (expressed in XML)
Benefits:
  Provides a useful way to unify resource discovery for disparate collections
  Is the only standard addressing this need and has the support of major players in digital libraries
Drawbacks:
  Is in an early state, with only one standard metadata format (DC)
  It is a harvesting protocol, not a searching protocol, and therefore your ability to get only those records that interest you is limited
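An OAI-PMH harvest is an ordinary HTTP request with a verb and a few arguments. The sketch below only builds the request URL; the repository address is hypothetical, and the helper name is an assumption. The verb ListRecords, the oai_dc metadata prefix, and the from, until, and set arguments come from the protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request; date and set are the only selection criteria."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, for illustration only.
print(list_records_url("http://repository.example.org/oai", from_date="2002-01-01"))
```

Note what is missing: there is no argument for a keyword or field query, which is the limitation named in the drawbacks above. The harvester collects records in bulk and does its own searching afterward.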
Databases: Pick Your Poison

Virtually any database or indexing product will in most cases work
Key considerations:
  What do you already have?
  Which platform are you on?
  Which product will your IT staff be willing to support?
  Are there search features you must have?
  How much money do you have to spend?
Databases: Examples
Targeted to the market and purpose; e.g., ContentDM from OCLC
General purpose commercial; e.g., Oracle, Sybase, SQL Server
General purpose open source; e.g., MySQL, SWISH-E
Shrink-wrapped consumer; e.g., MS Access
Case Study: SWISH-E


Free, open source indexing software for Unix (including Mac OS X) and Windows
Is HTML and XML aware (you can limit searches to specific tags)
http://escholarship.cdlib.org/ucpress/
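The idea of limiting a search to specific tags can be shown with a toy index. This is not SWISH-E's implementation or query syntax; it is a minimal sketch in Python, with made-up sample documents and hypothetical function names, of what a tag-aware indexer does in principle.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample documents, for illustration only.
DOCS = {
    "book1.xml": "<book><title>Castles of California</title>"
                 "<body>Hearst built a castle.</body></book>",
    "book2.xml": "<book><title>Newspaper Barons</title>"
                 "<body>Hearst owned many newspapers in California.</body></book>",
}


def build_index(docs):
    """Map (tag, word) to the set of documents containing that word in that tag."""
    index = defaultdict(set)
    for name, xml_text in docs.items():
        for elem in ET.fromstring(xml_text).iter():
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                index[(elem.tag, word)].add(name)
    return index


def search(index, word, tag=None):
    """Find documents containing `word`, optionally only inside `tag`."""
    word = word.lower()
    if tag is not None:
        return sorted(index.get((tag, word), set()))
    return sorted({doc for (t, w), names in index.items() if w == word for doc in names})


index = build_index(DOCS)
print(search(index, "california"))
print(search(index, "california", tag="title"))
```

Because the tag is recorded alongside each word, the same index answers both a full-text query and a query restricted to titles.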
[Diagram sequence, built up over several slides. Components shown: books encoded in TEI and stored as XML in the file system; full text, structure, and selected fields extracted to create the search index; a METS repository with project profile, MODS record, and UC Press record; the library catalog; the UC Press database; a Java servlet; XSLT. Steps shown: user queries, and search results are returned; user requests a book (METS record in XML, XSLT); user requests a book segment, and the book segment is returned; user wants to see search words in context, and the book segment is returned with terms highlighted.]
Skills Required of Staff

Imaging
 OCR
 Markup languages (HTML, XML)
 Cataloging & metadata
 Indexing and database technology
 User interface design
 Programming
 Web technology
 Project management
Final Thoughts
Be careful what you wish for…
“Once you start a digital project, you are committed to it for life” - Peter Hirtle
Hardware is cheap, people are expensive
For any given project, there are several ways it can succeed (there is no one right answer)
Never forget for whom you are doing this! (it’s the customer, stupid)
