Object-Relational Database
Applications -- The UC Berkeley
Environmental Digital Library
University of California, Berkeley
School of Information Management and
Systems
SIMS 257: Database Management
11/21/2000
Database Management -- Spring 1998 -- R. Larson
Today
• Object Relational Database Applications
– The Berkeley Digital Library Project
• Slides from RRL and Robert Wilensky, EECS
– Use of DBMS in DL project.
Final Presentations and Reports
• Specifications for final report are on the
Web Site under assignments
• Presentations (1 on Nov. 28, Others on Nov
30, Dec 5th and 7th (Full))
Today
• Object Relational Applications
• The UCB Digital Library
Overview
• What is a Digital Library?
• Overview of Ongoing Research on
Information Access in Digital Libraries
Digital Libraries Are Like
Traditional Libraries...
• Involve large repositories of information
(storage, preservation, and access)
• Provide information organization and
retrieval facilities (categorization, indexing)
• Provide access for communities of users
(communities may be as large as the general
public or as small as the employees of a
particular organization)
Traditional Library System
Originators
Libraries
Users
But Digital Libraries Are
Different From Libraries...
• Not a physical location with local copies;
objects held closer to originators
• Decoupling of storage, organization, access
• Enhanced Authoring (origination,
annotation, support for work groups)
• Subscription, pay-per-view supported in
addition to “free” browsing.
• Integration into user tasks.
A Digital Library Infrastructure Model
[Figure: diagram with components labeled Originators, Index Services, Repositories, Users, and Network]
UC Berkeley Digital Library
Project
• Focus: Work-centered digital information services
• Testbed: Digital Library for the California
Environment
• Research: Technical agenda supporting user-oriented
access to large distributed collections of
diverse data types.
• Part of the NSF/NASA/DARPA Digital Library
Initiative (Phases 1 and 2)
UCB Digital Library Project:
Research Organizations
• UC Berkeley EECS, SIMS, CED, IS&T
• UCOP
• Xerox PARC’s Document Image Decoding group and
Work Practices group
• Hewlett-Packard
• NEC
• SUN Microsystems
• IBM Almaden
• Microsoft
• Ricoh California Research
• Philips Research
Testbed: An Environmental
Digital Library
• Collection: Diverse material relevant to
California’s key habitats.
• Users: A consortium of state agencies,
development corporations, private
corporations, regional government alliances,
educational institutions, and libraries.
• Potential: Impact on state-wide
environmental system (CERES)
The Environmental Library Users/Contributors
• California Resources Agency, California
Environment Resources Evaluation System
(CERES)
• California Department of Water Resources
• The California Department of Fish & Game
• SANDAG
• UC Water Resources Center Archives
• New Partners: CDL and SDSC
The Environmental Library Contents
• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection databases for the classification and
distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency
The Environmental Library Contents
• As of late 2000, the collection represents
about one terabyte of data, including over
165,000 digital images, about 300,000
pages of environmental documents, and
nearly 2 million records in geographical and
botanical databases.
Botanical Data:
• The CalFlora Database contains taxonomical
and distribution information for more than
8000 native California plants. The Occurrence
Database includes over 600,000 records of
California plant sightings from many federal,
state, and private sources. The botanical
databases are linked to our CalPhotos
collection of California plants, and are also
linked to external collections of data, maps,
and photos.
Geographical Data:
• Much of the geographical data in our collection is
being used to develop our web-based GIS Viewer.
The Street Finder uses 500,000 TIGER records of
S.F. Bay Area streets along with the 70,000 records
from the USGS GNIS database. California
Dams is a database of information about the 1395
dams under state jurisdiction. An additional 11 GB
of geographical data represents maps and imagery
that have been processed for inclusion as layers in
our GIS Viewer. This includes Digital Ortho
Quads and DRG maps for the S.F. Bay Area.
Documents:
• Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins, and
county plans. Documents in this collection come from the
California Department of Water Resources (DWR),
California Department of Fish and Game (DFG), San
Diego Association of Governments (SANDAG), and many
other agencies. Among the most frequently accessed
documents are County General Plans for every California
county and a survey of 125 Sacramento Delta fish species.
Documents - cont.
The collection also includes about 20Mb of
full-text (HTML) documents from the
World Conservation Digital Library. In
addition to providing online access to
important environmental documents, the
document collection is the testbed for our
Multivalent Document research.
Testbed Success Stories
• LUPIN: CERES’ Land Use Planning Information Network
– California County General Plans and other environmental
documents.
– Enter at Resources Agency Server, documents stored at and
retrieved from UCB DLIB server.
• California flood relief efforts
– High demand for some data sets only available on our server
(created by document recognition).
• CalFlora: Creation and interoperation of repositories
pertaining to plant biology.
• Cloning of services at Cal State Library, FBI
Research Highlights
• Documents
– Multivalent Document prototype
• Page images, structured documents, GIS data, photographs
• Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing, scene
retrieval
– Natural Language Processing: categorizing the web,
Cheshire II, TileBar Interfaces
Multivalent Documents
• MVD Model
– radically distributed, open, extensible
– “behaviors” and “layers”
• behaviors conform to a protocol suite
• inter-operation via “IDEG”
• Applied to “enlivening legacy documents”
– various nice behaviors, e.g., lenses
Document Presentation
• Problem: Digital libraries must deliver
digital documents -- but in what form?
• Different forms have advantages for
particular purposes
– Retrieval
– Reuse
– Content Analysis
– Storage and archiving
• Combining forms (Multivalent documents)
Spectrum of Digital Document
Representations
Adapted from Fox, E.A., et al. “Users, User Interfaces and Objects: Evision, an Electronic Library”, JASIS 44(8), 1993
Document Representation:
Multivalent Documents
• Primary user interface/document model for
UCB Digital Library (Wilensky & Phelps)
• Goal: An approach to new document
representations and their authoring.
• Supports active, distributed, composable
transformations of multimedia documents.
• Enables sophisticated annotations,
intelligent result handling, user-modifiable
interface, composite documents.
Multivalent Documents
[Figure: layers of a multivalent document, labeled Cheshire Layer, GIS Layer, Table Layer, OCR Layer, OCR Mapping Layer, and Scanned Page Image; also labeled Network Protocols & Resources]
Valence: 2: The relative capacity to unite, react, or interact
(as with antigens or a biological substrate).
Webster’s 7th Collegiate Dictionary
MVD Third Party Work
• Japanese support by NEC; application to
office document management
• Printing, support for other OCR formats,
by HP
• Chinese character and multilingual lens by
UCB Instructional Support staff (Owen
McGrath)
• Automatic enlivening of documents via
Transcend proxy.
MVD Forthcoming
• Support for XML + style sheets
• More robust parsing
• Saving where you want
• Media adaptors for
– Continuous media
– Near image formats, word proc. formats
• Improve authoring tools
• Interoperation with paper
• Application versus applet?
• Release to community, get feedback, iterate.
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java (to be merged with MVD-1 code
line?)
GIS Viewer: Recent
Developments
• Annotation and saving
– points, rectangles (w. labels and links), vectors
– saving of annotations as separate layer
• Integration with address, street finding,
gazetteer services
• Application to image viewing: tilePix
• Castanet client
GIS Viewer Example
http://elib.cs.berkeley.edu/annotations/gis/buildings.html
Geographic Information: Plans
and Ideas
• More annotations, flexible saving
• Support for large vector data sets
• Interoperability
– On-the-fly
• conversion of formats
• generation of “catalogs”
– Via OGDI/GLTP
– Experimenting with various CERES servers
Documents: Information from
scanned document
• Built document recognizers for some important
documents, e.g. “Bulletin 17”, “TR-9”.
• Recognized document structure, with an order of
magnitude better OCR.
• Automatically generated a 1395-item relational
database of dams.
• Enabled access via forms, map interfaces.
• Enable interoperation with image DB.
Document Recognition: Future
Plans
• Document recognizers: for about a dozen
document types
• Development and integration of
mathematical OCR and recognition.
• Eventually produce document recognizer
generator, i.e., make it easier to write
recognizers.
Vision-Based Image Retrieval
Find objects by grouping coherent low-level properties
• Stuff-based queries: “blobs”
– Basic blobs: colors, sizes, variable number
• demonstrated utility for interesting queries
– “Blob world”: Above plus texture, applied to
• retrieving similar images
• successful learning scene classifier
• Thing-finding: Successfully deployed
detectors adding body plans (adding shape,
geometry and kinematic constraints)
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
• Other Vision Research
(Old “stuff”-based image retrieval: Query)
(Old “stuff”-based image retrieval: Result)
Blobworld: use regions for retrieval
• We want to find general objects
→ Represent images based on coherent regions
(“Thing”-based image retrieval using
“body plans”: Result)
Natural Language Processing
Automatic Topic Assignment
• Developed an automatic
categorization/disambiguation method to the point
where topic assignment (but not disambiguation)
appears feasible.
• Ran controlled experiment:
– Took Yahoo as ground truth.
– Chose 9 overlapping categories; took 1000 web pages
from Yahoo as input.
– Result: 84% precision; 48% recall (using top 5 of 1073
categories)
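The precision and recall figures above can be read against the usual definitions. The sketch below assumes micro-averaging over pages, which the slide does not state; the page ids and category labels are hypothetical toy data, not the experiment's.

```python
def precision_recall(assigned, truth):
    """Micro-averaged precision and recall over all pages.

    assigned, truth: dicts mapping a page id to a set of category labels.
    (Micro-averaging is an assumption; the slide does not say how the
    scores were aggregated.)
    """
    # True positives: assigned categories that also appear in the ground truth.
    tp = sum(len(assigned[p] & truth.get(p, set())) for p in assigned)
    n_assigned = sum(len(cats) for cats in assigned.values())
    n_truth = sum(len(cats) for cats in truth.values())
    precision = tp / n_assigned if n_assigned else 0.0
    recall = tp / n_truth if n_truth else 0.0
    return precision, recall


# Hypothetical toy data for illustration only.
truth = {"p1": {"music", "film"}, "p2": {"environment"}}
assigned = {"p1": {"music"}, "p2": {"environment", "film"}}
precision, recall = precision_recall(assigned, truth)
print(precision, recall)
```

Assigning the top 5 categories per page, as in the experiment, raises the number of assigned labels, so it tends to trade precision for recall.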
(Isaac’s Automatically Generated Ontology)
IAGO (0.1)! = Yahoo - labor + NLP
• We categorized (part of) the Web:
– 1073 categories; 8000 web pages
– ~80% precision for good categories
• E.g., “motion pictures”, “the environment”, “music”
• IAGO 1.0 in the works:
– Eliminate pages with little text.
– Eliminate proper nouns.
– Retrained with MS Encarta - Improved performance
dramatically (perhaps enough to disambiguate the
web)!
– Need to compute word sense priors using the web.
– [Recode implementation to keep up with web crawler.]
Cheshire II: Cross-Domain
Resource Discovery: Integrated
Discovery and Use of Textual,
Numeric and Spatial Data
Ray R. Larson, PI
Kirby Zhang – Yonghui Zhang
School of Information Management & Systems
University of California, Berkeley
Paul Watry, Co-PI
Robert Sanderson
University of Liverpool
Archives and Special Collections
Overview
• Goals are
– Practical application of existing DL
technologies to some large-scale cross-domain
collections
– Theoretical examination and evaluation of next-generation
designs for systems architecture and
distributed cross-domain searching for DLs
Current Usage of Cheshire II
• Web clients for:
– Berkeley NSF/NASA/ARPA Digital Library
– World Conservation Digital Library
– SunSite (UC Berkeley Science Libraries)
– University of Liverpool
– DeMontfort University (MASTER)
– Higher Education Archives Hub
– University of Essex, HDS (part of AHDS)
– Oxford Text Archive (test only)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Berkeley Metadata project cross-language demo
– Univ. of Virginia (test implementations)
• Glasgow, Edinburgh, Bath, Liverpool, Kings College London,
University College London, Nottingham, Durham, School of
Oriental and African Studies, Manchester, Southampton,
Warwick and others (to be expanded)
– Use in NESSTAR (NEtworked Social Science Tools and
Resources)
– Cheshire ranking algorithm is basis for Inktomi
The Participants
• NSF/JISC International Digital Library Grant
Berkeley working with
– University of Liverpool/Manchester Computing
– DeMontfort University (MASTER)
– Arts and Humanities Data Service (http://ahds.ac.uk/)
• OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York),
VADS (Surrey & Northumbria)
– Consortium of University Research Libraries (CURL)
– UC Berkeley Library
• Making of America II
• Online Archive of California
Approach
• For the first goal, we are implementing a
distributed search system based on
international standards (Z39.50 and
SGML/XML) (existing Cheshire II technology)
which will be used for cross-domain
searching. Databases include:
– HE Archives hub
– Arts and Humanities Data Service (AHDS)
– MASTER
– CURL (Consortium of University Research
Libraries)
– Online Archive of California (OAC)
– Making of America II (MOA2)
Approach
• The second goal will be addressed in the
design, development, and evaluation of the
distributed information retrieval system
architecture, its client-side systems that aid
the user in exploiting distributed resources
and in the design and evaluation of
protocols for efficient and effective retrieval
in an internationally distributed multi-database
environment. (Cheshire III?)
Research Issues
• Appropriate system architecture for information
retrieval in distributed network environment
(distributed object architecture)
• Management of vocabulary control in a Cross-Domain
context
• Distributed access to existing metadata resources
• Navigating Collections
• Support for Cross-Domain resource clumps to
facilitate resource discovery
Architecture Overview
• Focus on high performance N.O.W. style
operations: A scalable, extensible platform
for IR
• Current design uses JavaSpaces – a high-level
coordination mechanism for
distributed systems using a light-weight
publish/subscribe distributed programming
model
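To make the coordination idea concrete, here is a toy in-memory analogue of a space with write, read, and take operations. It is a sketch only: it is not the JavaSpaces API (which is a Java interface with entries, leases, and transactions), and the class name `TupleSpace` and the task tuples are hypothetical.

```python
import threading


class TupleSpace:
    """Toy in-memory space with JavaSpaces-like write/read/take.

    Hypothetical illustration; not the real JavaSpaces interface.
    """

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    @staticmethod
    def _matches(template, entry):
        # A None field in the template acts as a wildcard.
        return len(template) == len(entry) and all(
            t is None or t == e for t, e in zip(template, entry)
        )

    def write(self, entry):
        """Publish an entry into the space."""
        with self._lock:
            self._entries.append(tuple(entry))

    def read(self, template):
        """Return the first matching entry without removing it, or None."""
        with self._lock:
            for entry in self._entries:
                if self._matches(template, entry):
                    return entry
        return None

    def take(self, template):
        """Remove and return the first matching entry, or None."""
        with self._lock:
            for i, entry in enumerate(self._entries):
                if self._matches(template, entry):
                    return self._entries.pop(i)
        return None


# One process publishes search tasks; a worker claims one by template.
space = TupleSpace()
space.write(("search", "collection-a", "cat"))
space.write(("search", "collection-b", "cat"))
task = space.take(("search", None, None))
remaining = space.read(("search", None, None))
```

Because a worker claims work by matching a template rather than by addressing a specific node, the same code path can serve a single node, a cluster, or a federation.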
Current Design
• A single operational model for Cheshire that
encompasses single node installations,
uniformly administered clusters, as well as
independently administered federations.
– every operation is a distributed operation
– an operation is applied over a set of collections
Collections:
• Single node or cluster
– can be partitions of other collections
• Federation
– can be partitions or subsets of other
collections. In other words, collections in a
loosely coupled federation may have
overlapping records
• Virtual Collections
Virtual Collections
• The external interface to collections
– A VC may only present part of the underlying real collection in its
interface
– A VC may grow or shrink dynamically within the bounds of the
real collection. A search only needs to be done over documents in
VC, not all documents in the collection
– Ability to logically partition a collection across a number of
machines for performance increase, with built in redundancy in the
case of node failures.
– When a node fails, its VC is simply distributed (logically) to
other nodes in the cluster.
– Cheshire servers can be organized into server groups. A server
group can be thought of as an administrative unit.
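The partitioning and failover behavior described above might be sketched as follows. This is an illustration under stated assumptions, not Cheshire code: the function names, the round-robin assignment, and the node names are hypothetical, and the sketch assumes surviving nodes can reach the failed node's records (the built-in redundancy the slide mentions).

```python
def partition(doc_ids, nodes):
    """Assign document ids to nodes round-robin; each list is one node's VC."""
    vcs = {node: [] for node in nodes}
    for i, doc in enumerate(doc_ids):
        vcs[nodes[i % len(nodes)]].append(doc)
    return vcs


def fail_node(vcs, failed):
    """Logically redistribute a failed node's VC across the surviving nodes."""
    survivors = sorted(node for node in vcs if node != failed)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the VC")
    orphaned = vcs.pop(failed)
    for i, doc in enumerate(orphaned):
        vcs[survivors[i % len(survivors)]].append(doc)
    return vcs


# Nine documents spread over three hypothetical nodes, then one node fails.
vcs = partition(list(range(9)), ["n1", "n2", "n3"])
vcs = fail_node(vcs, "n2")
all_docs = sorted(doc for docs in vcs.values() for doc in docs)
```

Only the logical assignment changes on failure; no document leaves the real collection, which is why a search still covers every record.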
Distributed Access to Existing
Metadata Resources
• Use of current (Z39.50) and new (SDLIP)
protocols for access to other metadata
systems
– Support for common semantics (e.g. Dublin
Core mappings for disparate systems)
– Cross-system use of EVMs
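A Dublin Core mapping for disparate systems can be as simple as a per-system lookup table from local field names to shared element names. In the sketch below, the Dublin Core element names (`title`, `creator`, `subject`) are standard, while the system names and the choice of local fields are illustrative assumptions.

```python
# Per-system mapping from local field names to Dublin Core elements.
# System names and field selections are hypothetical examples.
MAPPINGS = {
    "marc_catalog": {"245a": "title", "100a": "creator", "650a": "subject"},
    "ead_archive": {"unittitle": "title", "origination": "creator"},
}


def to_dublin_core(system, record):
    """Translate a local record into a dict of Dublin Core elements.

    Fields with no mapping are dropped; repeated elements accumulate in lists.
    """
    mapping = MAPPINGS[system]
    dc = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            dc.setdefault(element, []).append(value)
    return dc


ead_record = {"unittitle": "Papers", "origination": "Smith", "extent": "3 boxes"}
dc_record = to_dublin_core("ead_archive", ead_record)
```

Once every system exposes the same element names, a single query on `title` or `creator` can be sent to all of them.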
Navigating Collections
• Support for “drilling down” from broad
Collection-level descriptions, to sub-collection
descriptions to individual digital
objects.
– Primary test bases will be EAD collection
descriptions linked to digital objects as in
MOA2.
Cross-Domain Resource
Discovery
• Initially -- Use of Z39.50 Cross-domain element
set for search (Dublin Core based)
• Support for new protocols and semantics (such as
SDLIP)
• Research into a metaprotocol for communicating
information about databases, search elements and
collections between systems
– Initially based on Z39.50 Explain
Meta-Search for Cross-Domain Resource Discovery
• Hundreds or Thousands of servers with databases
ranging widely in content, topic, format
– Broadcast search is expensive in terms of bandwidth
and in processing too many irrelevant results
– How to select the “best” ones to search?
• What to search first
• Which to search next
– Topical /domain constraints on the search selections
(EVMs for databases?)
Cross-Domain Resource
Discovery
• Meta-Search
– New approach to building metasearch based on Z39.50
– Instead of using broadcast search we will explore
• Extraction of GlOSS-like indexes using Z39.50 SCAN
• GIPSY2 extraction of place coverages from index data
– We will also investigate
• How to choose databases using the index
• How to merge search results from multiple sources
• Hierarchies of servers (general/meta-topical/individual)
– Other methods
• Treating database contents as distributed objects
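One simple way to choose databases from such an index is to score each database by the summed frequencies of the query terms it contains. The sketch below is a deliberately simplified, GlOSS-like estimate and not the project's method; the index layout, database names, and frequencies are hypothetical.

```python
def rank_databases(meta_index, query_terms):
    """Rank databases by the total frequency of the query terms they hold.

    meta_index: dict mapping term -> {database name: term frequency}.
    Returns (database, score) pairs, best first; ties break by name.
    """
    scores = {}
    for term in query_terms:
        for database, freq in meta_index.get(term, {}).items():
            scores[database] = scores.get(database, 0) + freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index contents for illustration only.
meta_index = {
    "cat": {"db_a": 3, "db_b": 10},
    "fish": {"db_a": 5},
}
ranking = rank_databases(meta_index, ["cat", "fish"])
```

The top of the ranking answers "what to search first"; walking down it answers "which to search next", and databases that never appear can be skipped instead of broadcast to.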
Distributed Metadata Servers
[Figure: server tiers labeled General Servers, Meta-Topical Servers, Database Servers, and Replicated servers]
Meta-Search Server Index
Creation
• For all servers, or a topical subset…
– Get Explain information (especially DC
mappings)
– For each index (or each DC index)
• Use SCAN to extract terms and frequency
• Add term + freq + source index + database to the
meta-search index
– Post-Process indexes (especially Geo Names,
etc) for special types of data
• e.g. create “geographical coverage” indexes
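The index-creation steps above can be sketched as a loop over servers and indexes. This is an outline under assumptions: `build_meta_index` and `fake_scan` are hypothetical names, the `scan` callable stands in for a real Z39.50 SCAN request, and the server list stands in for what Explain would report.

```python
from collections import defaultdict


def build_meta_index(servers, scan):
    """Collect term + freq + source index + database into one meta-index.

    servers: iterable of (database, [index names]) pairs, as Explain might report.
    scan: callable (database, index) -> iterable of (term, frequency) pairs.
    """
    meta = defaultdict(list)
    for database, indexes in servers:
        for index in indexes:
            for term, freq in scan(database, index):
                meta[term].append(
                    {"freq": freq, "index": index, "database": database}
                )
    return dict(meta)


def fake_scan(database, index):
    """Stub with canned, made-up results; a real client would issue SCAN."""
    canned = {
        ("db_a", "title"): [("cat", 2), ("dog", 1)],
        ("db_a", "topic"): [("cat", 5)],
    }
    return canned.get((database, index), [])


meta = build_meta_index([("db_a", ["title", "topic"])], fake_scan)
```

Keeping the source index and database with each posting lets later post-processing, such as building geographic coverage indexes, run over the meta-index without contacting the servers again.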
Z39.50 SCAN Results
% zscan title cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 27}
{cat-fight 1}
{catalan 19}
{catalogu 37}
{catalonia 8}
{catalyt 2}
{catania 1}
{cataract 1}
{catch 173}
{catch-all 3}
{catch-up 2} …
% zscan topic cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 706}
{cat-and-mouse 19}
{cat-burglar 1}
{cat-carrying 1}
{cat-egory 1}
{cat-fight 1}
{cat-gut 1}
{cat-litter 1}
{cat-lovers 2}
{cat-pee 1}
{cat-run 1}
{cat-scanners 1} …
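SCAN output of this shape is easy to turn into (term, frequency) pairs for a meta-index. The parser below is a minimal sketch that assumes the brace-delimited layout shown on the slide; `parse_scan` is a hypothetical helper, not part of the Cheshire client.

```python
import re

# One "{name number}" group, e.g. "{cat 27}" or "{Status 0}".
PAIR = re.compile(r"\{(\S+) (\d+)\}")


def parse_scan(text):
    """Split zscan-style output into header fields and (term, count) pairs."""
    # The header is the "{SCAN ...}" group, which ends at the first "}}".
    header_match = re.search(r"\{SCAN\s+(.*?\})\}", text, flags=re.DOTALL)
    if header_match is None:
        raise ValueError("no SCAN header found")
    header = {key: int(val) for key, val in PAIR.findall(header_match.group(1))}
    # Everything after the header is a sequence of term entries.
    body = text[header_match.end():]
    terms = [(term, int(count)) for term, count in PAIR.findall(body)]
    return header, terms


# First few lines of a result, copied from the slide.
sample = """{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 27}
{cat-fight 1}
{catalan 19}"""
header, terms = parse_scan(sample)
```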
Conclusions
• A lot of interesting work to be done
– Redesign and development of the Cheshire II system
– Evaluating new meta-indexing methods
– Developing and Evaluating methods for merging cross-domain
results (or, perhaps, when to keep them
separate)
– Developing, Testing and evaluating GIPSY2
– User interface development and testing for distributed
resource and object access
Further Information
• Berkeley DL web site
http://elib.cs.berkeley.edu
• Full Cheshire II client and server source is
available
ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
• Project Web Site
http://cheshire.berkeley.edu/