Slides from Lecture 19 - Courses - University of California, Berkeley

Database Applications -- The UC
Berkeley Environmental Digital
Library
University of California, Berkeley
School of Information Management
and Systems
SIMS 257: Database Management
IS 257 - Fall 2002
2002.11.07- SLIDE 1
Lecture Outline
• Review
– Database Administration
• Database Applications
– Berkeley’s Environmental Digital Library
Final Project Requirements
• See WWW site:
– http://sims.berkeley.edu/courses/is257/f02/index.html
• Report on personal/group database including:
– Database description and purpose
– Data Dictionary
– Relationships Diagram
– Sample queries and results (Web or Access tools)
– Sample forms (Web or Access tools)
– Sample reports (Web or Access tools)
– Application Screens (Web or Access tools)
Final Presentations and Reports
• Specifications for final report are on the
Web Site under assignments
• Presentations (1 on Nov. 28, Others on
Nov 30, Dec 5th and 7th (Full))
Lecture Outline
• Review
– Database Administration
• Database Applications
– Berkeley’s Environmental Digital Library
Terms and Concepts (trad)
• Data Administration
– Responsibility for the overall management
of data resources within an organization
• Database Administration
– Responsibility for physical database design
and technical issues in database
management
• These roles are often combined or
overlapping in some organizations
Database System Life Cycle
[Diagram: a cycle of six stages: Database Planning, Database Analysis, Database Design, Database Implementation, Operation & Maintenance, Growth & Change]
Note: this is a different version of this life cycle than discussed previously
Database Planning: DA & DBA functions
• Develop corporate database strategy (DA)
• Develop enterprise model (DA)
• Develop cost/benefit models (DA)
• Design database environment (DA)
• Develop data administration plan (DA)
Database Analysis: DA & DBA functions
• Define and model data requirements (DA)
• Define and model business rules (DA)
• Define operational requirements (DA)
• Maintain corporate Data Dictionary (DA)
Database Design: DA & DBA functions
• Perform logical database design (DA)
• Design external models (subschemas)
(DBA)
• Design internal model (Physical design)
(DBA)
• Design integrity controls (DBA)
Database Implementation: DA & DBA functions
• Specify database access policies (DA & DBA)
• Establish Security controls (DBA)
• Supervise Database loading (DBA)
• Specify test procedures (DBA)
• Develop application programming standards (DBA)
• Establish procedures for backup and recovery (DBA)
• Conduct User training (DA & DBA)
Operation and Maintenance: DA & DBA functions
• Monitor database performance (DBA)
• Tune and reorganize databases (DBA)
• Enforce standards and procedures (DBA)
• Support users (DA & DBA)
Growth & Change: DA & DBA functions
• Implement change control procedures (DA
& DBA)
• Plan for growth and change (DA & DBA)
• Evaluate new technology (DA & DBA)
Functions in Database Administration
• Planning and Design (we have already
looked at these processes in detail)
• Data Integrity
• Backup and Recovery
• Security Management
Data Integrity
• Intrarecord integrity (enforcing constraints
on contents of fields, etc.)
• Referential Integrity (enforcing the validity
of references between records in the
database)
• Concurrency control (ensuring the validity
of database updates in a shared multiuser
environment)
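
The first two kinds of integrity above can be illustrated with a small sketch. It uses Python's standard sqlite3 module rather than any DBMS named in the lecture, and the agency and report tables are hypothetical names invented for the example. A CHECK constraint stands in for intrarecord integrity and a foreign key for referential integrity; concurrency control is not shown.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        """CREATE TABLE report (
               id INTEGER PRIMARY KEY,
               -- intrarecord rule: a field value must be sensible
               pages INTEGER NOT NULL CHECK (pages > 0),
               -- referential rule: must point at an existing agency
               agency_id INTEGER NOT NULL REFERENCES agency(id)
           )"""
    )
    conn.execute("INSERT INTO agency (id, name) VALUES (1, 'DWR')")
    conn.commit()
    return conn


def try_insert(conn, pages, agency_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO report (pages, agency_id) VALUES (?, ?)",
                (pages, agency_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, try_insert(conn, 10, 1) is accepted, while a row with zero pages or with an agency_id that matches no agency is rejected by the database itself rather than by application code.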
Database Security
• Views or restricted subschemas
• Authorization rules to identify users and
the actions they can perform
• User-defined procedures (and rule
systems) to define additional constraints or
limitations in using the database
• Encryption to encode sensitive data
• Authentication schemes to positively
identify a person attempting to gain access
to the database
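
As a rough illustration of the first item, a view can act as a restricted subschema that hides a sensitive column. The sketch below uses Python's standard sqlite3 module; the staff table, its columns, and the sample rows are hypothetical. SQLite has no GRANT or REVOKE statements, so authorization rules, which server DBMSs typically express that way, are not shown.

```python
import sqlite3


def build_staff_db():
    """Create a hypothetical staff table plus a view that hides salaries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staff ("
        "id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO staff (name, dept, salary) VALUES (?, ?, ?)",
        [("Ana", "Maps", 50000), ("Ben", "Docs", 60000)],
    )
    # The view is the restricted subschema: it omits the salary column.
    conn.execute(
        "CREATE VIEW staff_public AS SELECT id, name, dept FROM staff"
    )
    conn.commit()
    return conn


def public_columns(conn):
    """List the column names visible through the restricted view."""
    cur = conn.execute("SELECT * FROM staff_public")
    return [col[0] for col in cur.description]
```

Users who are only given access to staff_public can query names and departments but never see the salary field.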
Database Backup and Recovery
• Backup
• Journaling (audit trail)
• Checkpoint facility
• Recovery manager
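
The backup facility alone can be sketched with Python's standard sqlite3 module, which exposes SQLite's online backup API as Connection.backup. This is only an illustration under stated assumptions: the dam table and its single row are hypothetical, and journaling, checkpoints, and a recovery manager are not modeled.

```python
import sqlite3


def backup_database(source):
    """Copy the full contents of `source` into a new in-memory database."""
    target = sqlite3.connect(":memory:")
    source.backup(target)  # sqlite3 online backup API (Python 3.7+)
    return target


def demo():
    """Take a backup, lose the live data, and count rows in each copy."""
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE dam (id INTEGER PRIMARY KEY, name TEXT)")
    live.execute("INSERT INTO dam (name) VALUES ('Example Dam')")
    live.commit()

    copy = backup_database(live)

    # Simulate accidental destruction of the live data after the backup.
    live.execute("DELETE FROM dam")
    live.commit()

    live_rows = live.execute("SELECT COUNT(*) FROM dam").fetchone()[0]
    copy_rows = copy.execute("SELECT COUNT(*) FROM dam").fetchone()[0]
    return live_rows, copy_rows
```

In practice the target would be a file on separate media or at an offsite location rather than a second in-memory database.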
Disaster Recovery Planning
[Diagram: planning activities shown: Risk Analysis, Recovery Strategies, Plan Maintenance, Testing and Training, Budget & Implement, Procedures Development]
From Toigo “Disaster Recovery Planning”
Threats to Assets and Functions
• Water
• Fire
• Power Failure
• Mechanical breakdown or software failure
• Accidental or deliberate destruction of hardware or software
– By hackers, disgruntled employees, industrial saboteurs, terrorists, or others
Threats
• Between 1967 and 1978 fire and water
damage accounted for 62% of all data
processing disasters in the U.S.
• The water damage was sometimes
caused by fighting fires
• More recently, improvements in fire
suppression (e.g., Halon) for DP centers
have meant that water is the primary
danger to DP centers
Kinds of Records
• Class I: VITAL
– Essential, irreplaceable or necessary to recovery
• Class II: IMPORTANT
– Essential or important, but reproducible with difficulty
or at extra expense
• Class III: USEFUL
– Records whose loss would be inconvenient, but which
are replaceable
• Class IV: NONESSENTIAL
– Records which upon examination are found to be no
longer necessary
Offsite Storage of Data
• Early offsite storage facilities were often
intended to survive atomic explosions
• PRISM International directory
• Mirror sites (Hot sites)
– E.g. Cantor-Fitzgerald
Lecture Outline
• Review
– Database Administration
• Database Applications
– Berkeley’s Environmental Digital Library
Berkeley DL Project
• Object Relational Database Applications
– The Berkeley Digital Library Project
• Slides from RRL and Robert Wilensky, EECS
– Use of DBMS in DL project
Overview
• What is a Digital Library?
• Overview of Ongoing Research on
Information Access in Digital Libraries
Digital Libraries Are Like Traditional Libraries...
• Involve large repositories of information
(storage, preservation, and access)
• Provide information organization and
retrieval facilities (categorization, indexing)
• Provide access for communities of users
(communities may be as large as the
general public or small as the employees
of a particular organization)
Traditional Library System
[Diagram: Originators, Libraries, Users]
But Digital Libraries Are Different From
Libraries...
• Not a physical location with local copies;
objects held closer to originators
• Decoupling of storage, organization,
access
• Enhanced Authoring (origination,
annotation, support for work groups)
• Subscription, pay-per-view supported in
addition to “free” browsing.
• Integration into user tasks.
A Digital Library Infrastructure Model
[Diagram: Originators, Index Services, Repositories, Network, Users]
UC Berkeley Digital Library Project
• Focus: Work-centered digital information
services
• Testbed: Digital Library for the California
Environment
• Research: Technical agenda supporting
user-oriented access to large distributed
collections of diverse data types.
• Part of the NSF/NASA/DARPA Digital
Library Initiative (Phases 1 and 2)
UCB Digital Library Project: Research
Organizations
• UC Berkeley EECS, SIMS, CED, IS&T
• UCOP/CDL
• Xerox PARC’s Document Image Decoding group
and Work Practices group
• Hewlett-Packard
• NEC
• SUN Microsystems
• IBM Almaden
• Microsoft
• Ricoh California Research
• Philips Research
Testbed: An Environmental Digital Library
• Collection: Diverse material relevant to
California’s key habitats.
• Users: A consortium of state agencies,
development corporations, private
corporations, regional government
alliances, educational institutions, and
libraries.
• Potential: Impact on state-wide
environmental system (CERES)
The Environmental Library Users/Contributors
• California Resources Agency, California
Environmental Resources Evaluation
System (CERES)
• California Department of Water Resources
• The California Department of Fish & Game
• SANDAG
• UC Water Resources Center Archives
• New Partners: CDL and SDSC
The Environmental Library - Contents
• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency
The Environmental Library - Contents
• As of late 2002, the collection represents
over one terabyte of data, including over
183,000 digital images, about 300,000
pages of environmental documents, and
over 2 million records in geographical and
botanical databases.
Botanical Data:
• The CalFlora Database contains
taxonomical and distribution information
for more than 8000 native California
plants. The Occurrence Database includes
over 600,000 records of California plant
sightings from many federal, state, and
private sources. The botanical databases
are linked to the CalPhotos collection of
California plants, and are also linked to
external collections of data, maps, and
photos.
Geographical Data:
• Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 TIGER
records of S.F. Bay Area streets along with the
70,000 records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data
represents maps and imagery that have been
processed for inclusion as layers in our GIS
Viewer. This includes Digital Ortho Quads and
DRG maps for the S.F. Bay Area.
Documents:
• Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins,
and county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game (DFG),
San Diego Association of Governments (SANDAG), and
many other agencies. Among the most frequently
accessed documents are County General Plans for
every California county and a survey of 125 Sacramento
Delta fish species.
Testbed Success Stories
• LUPIN: CERES’ Land Use Planning Information
Network
– California County General Plans and other
environmental documents.
– Enter at Resources Agency Server, documents stored
at and retrieved from UCB DLIB server.
• California flood relief efforts
– High demand for some data sets only available on our
server (created by document recognition).
• CalFlora: Creation and interoperation of
repositories pertaining to plant biology.
• Cloning of services at Cal State Library, FBI
Research Highlights
• Documents
– Multivalent Document prototype
• Page images, structured documents, GIS data,
photographs
• Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing,
scene retrieval
– Natural Language Processing: categorizing
the web, Cheshire II, TileBar Interfaces
Multivalent Documents
• MVD Model
– radically distributed, open, extensible
– “behaviors” and “layers”
• behaviors conform to a protocol suite
• inter-operation via “IDEG”
• Applied to “enlivening legacy documents”
– various nice behaviors, e.g., lenses
Document Presentation
• Problem: Digital libraries must deliver
digital documents -- but in what form?
• Different forms have advantages for
particular purposes
– Retrieval
– Reuse
– Content Analysis
– Storage and archiving
• Combining forms (Multivalent documents)
Spectrum of Digital Document
Representations
Adapted from Fox, E.A., et al. “Users, User Interfaces and Objects: Evision, an Electronic Library”, JASIS 44(8), 1993
Document Representation: Multivalent
Documents
• Primary user interface/document model for
UCB Digital Library (Wilensky & Phelps)
• Goal: An approach to new document
representations and their authoring.
• Supports active, distributed, composable
transformations of multimedia documents.
• Enables sophisticated annotations,
intelligent result handling, user-modifiable
interface, composite documents.
Multivalent Documents
[Diagram: a scanned page image with layers (Cheshire Layer, GIS Layer, Table Layer, OCR Layer, OCR Mapping Layer) linked to Network Protocols & Resources; the page text in the figure is filler]
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). (Webster’s 7th Collegiate Dictionary)
MVD availability
• The MVD Browser is now available as
open source on SourceForge
– http://sourceforge.net/project/showfiles.php?group_id=44509
• See also:
– http://http.cs.berkeley.edu/~phelps/Multivalent/
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java
GIS Viewer: Features
• Annotation and saving
– points, rectangles (w. labels and links),
vectors
– saving of annotations as separate layer
• Integration with address, street finding,
gazetteer services
• Application to image viewing: tilePix
• Castanet client
GIS Viewer Example
http://elib.cs.berkeley.edu/annotations/gis/buildings.html
Geographic Information: Plans and Ideas
• More annotations, flexible saving
• Support for large vector data sets
• Interoperability
– On-the-fly
• conversion of formats
• generation of “catalogs”
– Via OGDI/GLTP
– Experimenting with various CERES servers
Documents: Information from scanned
documents
• Built document recognizers for some
important documents, e.g., “Bulletin 17” and
“TR-9”.
• Recognized document structure, with
order-of-magnitude better OCR.
• Automatically generated a 1395-item
relational database of dams.
• Enabled access via forms, map interfaces.
• Enabled interoperation with image DB.
Document Recognition: Ongoing Work
• Document recognizers: for about a dozen
document types
• Development and integration of
mathematical OCR and recognition.
• Eventually produce document recognizer
generator, i.e., make it easier to write
recognizers.
Vision-Based Image Retrieval
• Stuff-based queries: “blobs”
– Basic blobs: colors, sizes, variable number
• demonstrated utility for interesting queries
– “Blob world”: Above plus texture, applied to
• retrieving similar images
• successful learning scene classifier
• Thing-finding: Successfully deployed
detectors adding body plans (adding
shape, geometry and kinematic
constraints)
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
• Other Vision Research
(Old “stuff”-based image retrieval: Query)
(Old “stuff”-based image retrieval: Result)
Blobworld: use regions for retrieval
• We want to find general objects
→ Represent images based on coherent
regions
(“Thing”-based image retrieval using
“body plans”: Result)
Natural Language Processing
Automatic Topic Assignment
• Developed automatic
categorization/disambiguation method to
point where topic assignment (but not
disambiguation) appears feasible.
• Ran controlled experiment:
– Took Yahoo as ground truth.
– Chose 9 overlapping categories; took 1000
web pages from Yahoo as input.
– Result: 84% precision; 48% recall (using top
5 of 1073 categories)
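
The slide does not say how precision and recall were computed, so the following is only a sketch using the standard set-based definitions: precision is the share of assigned categories that are correct, and recall is the share of correct categories that were assigned. The function name and the sample category labels are hypothetical.

```python
def precision_recall(assigned, truth):
    """Return (precision, recall) for one page's category assignment.

    `assigned` is the set of categories the system proposed and `truth`
    is the set of ground-truth categories for the same page.
    """
    correct = len(assigned & truth)
    # Guard against empty sets so the ratios are always defined.
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: four categories assigned, three are true,
# and two of the assigned ones are correct.
example_assigned = {"Arts", "Science", "Sports", "News"}
example_truth = {"Science", "Sports", "Health"}
example_precision, example_recall = precision_recall(
    example_assigned, example_truth
)
```

Here two of four assigned categories are correct (precision 0.5) and two of three true categories were found (recall 2/3). A pattern like the reported one, high precision with lower recall, means most assigned topics were right but many true topics were missed.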
Further Information
• Berkeley DL web site
http://elib.cs.berkeley.edu