Slides from Lecture 21 - Courses - University of California, Berkeley

Future of Database Systems 2: XML
Databases and Grid-based Digital
Libraries
University of California, Berkeley
School of Information Management
and Systems
SIMS 257: Database Management
IS 257 – Fall 2005
2005.11.22- SLIDE 1
Lecture Outline
• Review
– Future of Database Systems
• XML and DBMS
• Grid-Based Digital Libraries
– Data Grids
– Grid-based IR
• DBMS and usability
2005.11.22- SLIDE 2
• Radio has no future. Heavier-than-air
flying machines are impossible. X-rays will
prove to be a hoax.
– William Thomson (Lord Kelvin), 1899
2005.11.22- SLIDE 4
• This “Telephone” has too many
shortcomings to be seriously considered
as a means of communication. The device
is inherently of no value to us.
– Western Union, Internal Memo, 1876
2005.11.22- SLIDE 5
• I think there is a world market for maybe
five computers
– Thomas Watson, Chair of IBM, 1943
2005.11.22- SLIDE 6
• By the turn of this century, we will live in a
paperless society.
– Roger Smith, Chair of GM, 1986
2005.11.22- SLIDE 7
• I predict the internet… will go
spectacularly supernova and in 1996
catastrophically collapse.
– Bob Metcalfe (3Com founder and inventor of
Ethernet), 1995
2005.11.22- SLIDE 8
Accomplishments of DBMS Research
• DBMS are now used in almost every
computing environment to create, organize
and maintain large collections of
information, and this is largely due to the
results of the DBMS research community’s
efforts, in particular:
– Relational DBMS
– Transaction management
– Distributed DBMS
2005.11.22- SLIDE 9
Next Generation Database Systems
• Where are we going from here?
– Hardware is getting faster and cheaper
– DBMS technology continues to improve and
change
• OODBMS
• ORDBMS
– Bigger challenges for DBMS technology
• Medicine, design, manufacturing, digital libraries,
sciences, environment, planning, etc...
2005.11.22- SLIDE 10
Examples
• NASA EOSDIS
– Estimated 10^16 bytes (about 10 petabytes)
• Computer-Aided design
• The Human Genome
• Department Store tracking
– Mining non-transactional data (e.g. Scientific
data, text data?)
• Insurance Company
– Multimedia DBMS support
2005.11.22- SLIDE 11
New Features
• New Data types
• Rule Processing
• New concepts and data models
• Problems of Scale
• Parallelism/Grid-based DB
• Tertiary Storage vs Very Large-Scale Disk Storage
• Heterogeneous Databases
• Memory Only DBMS
2005.11.22- SLIDE 12
Coming to a Database Near You…
• Browsability
• User-defined access methods
• Security
• Steering Long processes
• Federated Databases
• IR capabilities
• XML
• The Semantic Web(?)
2005.11.22- SLIDE 13
Some things to consider
• Bandwidth will keep increasing and getting
cheaper (and go wireless)
• Processing power will keep increasing
– Moore’s law: Number of circuits on the most
advanced semiconductors doubling every 18 months
• Memory and Storage will keep getting cheaper
(and probably smaller)
– “Storage law”: Worldwide digital data storage capacity
has doubled every 9 months for the past decade
• Put it all together and what do you have?
– “The ideal database machine would have a single
infinitely fast processor with infinite memory with
infinite bandwidth – and it would be infinitely cheap
(free)” : David DeWitt and Jim Gray, 1992
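As a rough illustration of the two growth laws quoted above, the sketch below compounds a doubling period over a fixed span. The function name `growth_factor` and the ten-year horizon are assumptions chosen for illustration; only the 18-month and 9-month doubling periods come from the slide.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Return how many times larger a quantity becomes after elapsed_months
    if it doubles once every doubling_months."""
    return 2.0 ** (elapsed_months / doubling_months)


DECADE_MONTHS = 120  # assumed horizon: ten years

# Doubling periods taken from the slide.
moore = growth_factor(18, DECADE_MONTHS)    # circuits on advanced chips
storage = growth_factor(9, DECADE_MONTHS)   # worldwide storage capacity

print(f"Moore's law over a decade: x{moore:,.0f}")
print(f"Storage law over a decade: x{storage:,.0f}")
```

Because the storage doubling period is half the Moore's law period, the storage factor is the square of the Moore's law factor over any span: roughly a hundredfold versus roughly ten-thousandfold in ten years.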
2005.11.22- SLIDE 14
Lecture Outline
• Review
– Future of Database Systems
• XML and DBMS
• Grid-Based Digital Libraries
– Data Grids
– Grid-based IR
• DBMS and usability
2005.11.22- SLIDE 15
Standards: XML/SQL
• As part of SQL3, an extension called XML/SQL
(named SQL/XML in the standard itself) is being
created to provide a mapping between XML and the DBMS
• The (draft) standard is very complex, but
the ideas are actually pretty simple
• Suppose we have a table called
EMPLOYEE that has columns EMPNO,
FIRSTNAME, LASTNAME, BIRTHDATE,
SALARY
2005.11.22- SLIDE 16
Standards: XML/SQL
• That table can be mapped to:
<EMPLOYEE>
<row><EMPNO>000020</EMPNO>
<FIRSTNAME>John</FIRSTNAME>
<LASTNAME>Smith</LASTNAME>
<BIRTHDATE>1955-08-21</BIRTHDATE>
<SALARY>52300.00</SALARY>
</row>
<row> … etc. …
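The table-to-XML mapping shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not the standard's full mapping rules: the helper name `table_to_xml` is hypothetical, the rows are supplied as plain tuples instead of being read from a DBMS, and NULL handling and XML Schema generation are left out.

```python
import xml.etree.ElementTree as ET

# Column names of the EMPLOYEE table described on the slide.
COLUMNS = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]


def table_to_xml(table_name, columns, rows):
    """Wrap each row in a <row> element with one child element per column,
    all inside a root element named after the table."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root


# Sample row copied from the slide.
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]

employee_xml = ET.tostring(
    table_to_xml("EMPLOYEE", COLUMNS, rows), encoding="unicode"
)
print(employee_xml)
```

In practice the rows would come from a query such as a `SELECT` over the table; the element-per-column structure stays the same.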
2005.11.22- SLIDE 17
Standards: XML/SQL
• In addition the standard says that
XMLSchemas must be generated for each
table, and also allows relations to be
managed by nesting records from tables in
the XML.
• Don’t know whether this has actually been
implemented by anyone
– There is actually something very similar in the
Cheshire II interface to RDBMS
2005.11.22- SLIDE 18
Lecture Outline
• Review
– Future of Database Systems
• XML and DBMS
• Grid-Based Digital Libraries
– Data Grids
– Grid-based IR
• DBMS and usability
2005.11.22- SLIDE 19
Grid-based Digital Libraries
• So what’s this Grid thing anyhow?
• Data Grids and Distributed Storage
• Grid-Based IR
• Grid-Based Digital Libraries
This lecture borrows heavily from presentations by Ian Foster (Argonne
National Laboratory & University of Chicago), Reagan Moore and others
from San Diego Supercomputer Center
2005.11.22- SLIDE 20
The Grid: On-Demand Access to Electricity
[Figure: quality and economies of scale (vertical axis) plotted against time (horizontal axis)]
Source: Ian Foster
2005.11.22- SLIDE 21
By Analogy, A Computing Grid
• Decouples production and consumption
– Enable on-demand access
– Achieve economies of scale
– Enhance consumer flexibility
– Enable new devices
• On a variety of scales
– Department
– Campus
– Enterprise
– Internet
Source: Ian Foster
2005.11.22- SLIDE 22
Not Exactly a New Idea …
• “The time-sharing computer system can
unite a group of investigators …. one can
conceive of such a facility as an …
intellectual public utility.”
– Fernando Corbato and Robert Fano, 1966
• “We will perhaps see the spread of
‘computer utilities’, which, like present
electric and telephone utilities, will service
individual homes and offices across the
country.”
– Len Kleinrock, 1967
Source: Ian Foster
2005.11.22- SLIDE 23
But, Things are Different Now
• Networks are far faster (and cheaper)
– Faster than computer backplanes
• “Computing” is very different than pre-Net
– Our “computers” have already disintegrated
– E-commerce increases size of demand peaks
– Entirely new applications & social structures
• We’ve learned a few things about software
Source: Ian Foster
2005.11.22- SLIDE 24
Computing isn’t Really Like Electricity
• I import electricity but must export data
• “Computing” is not interchangeable but highly
heterogeneous: data, sensors, services, …
• This complicates things; but also means that the
sum can be greater than the parts
– Real opportunity: Construct new capabilities
dynamically from distributed services
• Raises three fundamental questions
– Can I really achieve economies of scale?
– Can I achieve QoS across distributed services?
– Can I identify apps that exploit synergies?
Source: Ian Foster
2005.11.22- SLIDE 25
Why the Grid?
(1) Revolution in Science
• Pre-Internet
– Theorize &/or experiment, alone
or in small teams; publish paper
• Post-Internet
– Construct and mine large databases of
observational or simulation data
– Develop simulations & analyses
– Access specialized devices remotely
– Exchange information within
distributed multidisciplinary teams
Source: Ian Foster
2005.11.22- SLIDE 26
Why the Grid?
(2) Revolution in Business
• Pre-Internet
– Central data processing facility
• Post-Internet
– Enterprise computing is highly distributed,
heterogeneous, inter-enterprise (B2B)
– Business processes increasingly
computing- & data-rich
– Outsourcing becomes feasible =>
service providers of various sorts
Source: Ian Foster
2005.11.22- SLIDE 27
New Opportunities
Demand New Technology
“Resource sharing & coordinated
problem solving in dynamic, multi-institutional virtual organizations”
Source: Ian Foster
2005.11.22- SLIDE 28
Building an Open Grid
[Diagram, built up step by step over slides 29-34: Open Standards, Open Source, and Open Infrastructure arranged around Open Grid]
2005.11.22- SLIDE 34
Grids and Open Standards
[Figure: increased functionality and standardization (vertical axis) over time (horizontal axis)]
• Custom solutions
• Globus Toolkit: de facto standards (GGF: GridFTP, GSI), built on X.509, LDAP, FTP, …
• Open Grid Services Arch (GGF: OGSI, … + OASIS, W3C), built on Web services; multiple implementations, including Globus Toolkit
• App-specific Services
2005.11.22- SLIDE 35
Open Grid Services Architecture
• Service-oriented architecture
– Key to virtualization, discovery, composition,
local-remote transparency
• Leverage industry standards
– Internet, Web services
• Distributed service management
– A “component model for Web services”
• A framework for the definition of
composable, interoperable services
“The Physiology of the Grid: An Open Grid Services Architecture for
Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
2005.11.22- SLIDE 36
Realizing a Service-Oriented Architecture: How Do I
• Create, name, manage, discover services?
• Render resources, data, sensors as services?
• Negotiate service level agreements?
• Express & negotiate policy?
• Organize & manage service collections?
• Establish identity, negotiate authentication?
• Manage VO membership & communication?
• Compose services efficiently?
• Achieve interoperability?
2005.11.22- SLIDE 37
Web Services
• XML-based distributed computing technology
• Web service = a server process that exposes
typed ports to the network
• Described using the Web Services Description
Language (WSDL), in an XML document that contains
– Type of message(s) the service understands & types
of responses & exceptions it returns
– “Methods” bound together as “port types”
– Port types bound to protocols as “ports”
• A WSDL document completely defines a service
and how to access it
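The relationship between messages, port types, and ports described above can be modelled with a few small data classes. This is a hedged sketch of the concepts only, not a WSDL parser: the class names, the `describe` helper, the quote service, and the example.org address are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One "method": the message it accepts, the response, and its faults."""
    name: str
    input_message: str
    output_message: str
    faults: tuple = ()


@dataclass(frozen=True)
class PortType:
    """A named group of operations, independent of any protocol."""
    name: str
    operations: tuple


@dataclass(frozen=True)
class Port:
    """A port type bound to a concrete protocol and network address."""
    name: str
    port_type: PortType
    protocol: str
    address: str


def describe(port: Port) -> str:
    """Summarize what a client needs to know to call the service."""
    lines = [f"port {port.name} ({port.protocol}) at {port.address}"]
    for op in port.port_type.operations:
        faults = ", ".join(op.faults) or "none"
        lines.append(
            f"  {op.name}({op.input_message}) -> {op.output_message}; faults: {faults}"
        )
    return "\n".join(lines)


# Hypothetical service used only for illustration.
quote_type = PortType(
    "QuotePortType",
    (Operation("getQuote", "QuoteRequest", "QuoteResponse", ("UnknownSymbol",)),),
)
port = Port("QuotePort", quote_type, "SOAP over HTTP", "http://example.org/quote")
summary = describe(port)
print(summary)
```

Keeping the port type separate from the port mirrors the slide's point that the same set of "methods" can be bound to more than one protocol.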
2005.11.22- SLIDE 38
Open Grid Services Infrastructure
[Diagram: a client reaches a Grid service by handle resolution, which turns a Grid Service Handle into a Grid Service Reference]
• Introspection:
– What port types?
– What policy?
– What state?
• Lifetime management
– Explicit destruction
– Soft-state lifetime
• GridService (required): data access to the service data elements
• Other standard interfaces: factory, notification, collections
• Implementation
• Hosting environment/runtime (“C”, J2EE, .NET, …)
2005.11.22- SLIDE 39
The Grid
as Enabler of 21st Century Science
• Entirely new approaches to enquiry based
on
– Deep analysis of huge quantities of data
– Interdisciplinary collaboration
– Large-scale simulation
– Smart instrumentation
• Supported by an infrastructure that enables
access to, and integration of, resources &
services without regard for location
2005.11.22- SLIDE 40
Grid Infrastructure
• Broadly deployed services in support of
fundamental collaborative activities
– Formation & operation of virtual organizations
– Authentication, authorization, discovery, …
• Services, software, and policies enabling on-demand access to critical resources
– Computers, databases, networks, storage, software
services,…
• Operational support for 24x7 availability
• Integration with campus and commercial
infrastructures
2005.11.22- SLIDE 41
The Foundations are Being Laid
[Map of sites: Edinburgh, Glasgow, DL, Belfast, Newcastle, Manchester, Cambridge, Oxford, Cardiff, RAL, Hinxton, London, Soton]
[Map legend: Tier0/1, Tier2, and Tier3 facilities; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links]
2005.11.22- SLIDE 42
Data Grid Problem
• “Enable a geographically distributed
community [of thousands] to pool their
resources in order to perform
sophisticated, computationally intensive
analyses on Petabytes of data”
• Note that this problem:
– Is common to many areas of science
– Overlaps strongly with other Grid problems
2005.11.22- SLIDE 43
Data Grids for High Energy Physics
[Diagram: tiered data flow from the online system down to physicist workstations]
• There is a “bunch crossing” every 25 nsecs
• There are 100 “triggers” per second; each triggered event is ~1 MByte in size
• ~PBytes/sec into the Online System
• ~100 MBytes/sec from the Online System to the Offline Processor Farm (~20 TIPS)
• Tier 0: CERN Computer Centre, fed at ~100 MBytes/sec
• Tier 1: regional centres (France, Germany, Italy, and FermiLab at ~4 TIPS), reached at ~622 Mbits/sec or by Air Freight (deprecated)
• Tier 2: Tier2 Centres at ~1 TIPS each (including Caltech), reached at ~622 Mbits/sec
• Institutes (~0.25 TIPS) with a physics data cache, reached at ~622 Mbits/sec
• Tier 4: physicist workstations, ~1 MBytes/sec
• 1 TIPS is approximately 25,000 SpecInt95 equivalents
• Physicists work on analysis “channels”. Each institute will have ~10 physicists
working on one or more channels; data for these channels should be cached by the
institute server
Image courtesy Harvey Newman, Caltech
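The ~100 MBytes/sec figure follows directly from the trigger rate and event size quoted on the slide. The sketch below makes that arithmetic explicit and extends it to a yearly volume; the assumption of continuous running for a full year is mine and gives an upper bound, since real accelerators do not take data around the clock.

```python
TRIGGERS_PER_SECOND = 100            # from the slide
EVENT_SIZE_MBYTES = 1.0              # from the slide (approximate)
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: continuous running


def recorded_rate_mbytes_per_s(triggers_per_s: float, event_mbytes: float) -> float:
    """Data rate leaving the online system, in MBytes per second."""
    return triggers_per_s * event_mbytes


rate = recorded_rate_mbytes_per_s(TRIGGERS_PER_SECOND, EVENT_SIZE_MBYTES)

# 1 PByte = 10^9 MBytes (decimal units).
yearly_pbytes = rate * SECONDS_PER_YEAR / 1e9

print(f"Recorded rate: {rate:.0f} MBytes/sec")
print(f"Upper bound per year: {yearly_pbytes:.2f} PBytes")
```

Even under this simple estimate the archive reaches petabyte scale within a year, which is the scale the Data Grid problem statement refers to.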
2005.11.22- SLIDE 44
Data Intensive Issues Include …
• Harness [potentially large numbers of]
data, storage, network resources located
in distinct administrative domains
• Respect local and global policies
governing what can be used for what
• Schedule resources efficiently, again
subject to local and global constraints
• Achieve high performance, with respect to
both speed and reliability
• Catalog software and virtual data
2005.11.22- SLIDE 45
Data Intensive Computing and Grids
• The term “Data Grid” is often used
– Implies a distinct infrastructure, which it isn’t; but easy
to say
• Data-intensive computing shares numerous
requirements with collaboration, instrumentation,
computation, …
– Security, resource mgt, info services, etc.
• Important to exploit commonalities as very
unlikely that multiple infrastructures can be
maintained
• Fortunately this seems easy to do!
2005.11.22- SLIDE 46
Examples of
Desired Data Grid Functionality
• High-speed, reliable access to remote data
• Automated discovery of “best” copy of data
• Manage replication to improve performance
• Co-schedule compute, storage, network
• “Transparency” wrt delivered performance
• Enforce access control on data
• Allow representation of “global” resource allocation policies
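One way to picture automated discovery of the “best” copy is to rank replicas by estimated transfer time. The sketch below is a minimal illustration under stated assumptions: the `Replica` fields, the cost model (startup latency plus size divided by predicted bandwidth), the logical file name, and the site names are all hypothetical, and a real Data Grid would obtain the predictions from a monitoring service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    location: str
    bandwidth_mbytes_per_s: float  # predicted throughput to the client
    latency_s: float               # startup cost, e.g. a tape mount


def estimated_transfer_time(replica: Replica, size_mbytes: float) -> float:
    """Assumed cost model: startup latency plus streaming time."""
    return replica.latency_s + size_mbytes / replica.bandwidth_mbytes_per_s


def select_best_replica(replica_catalog, logical_name, size_mbytes):
    """Pick the replica of logical_name with the lowest estimated time."""
    replicas = replica_catalog[logical_name]
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mbytes))


# Hypothetical catalog: one logical file with three physical copies.
LFN = "lfn://physics/run42.dat"
catalog = {
    LFN: [
        Replica("disk-cache-site-1", bandwidth_mbytes_per_s=10.0, latency_s=0.1),
        Replica("tape-library-site-2", bandwidth_mbytes_per_s=50.0, latency_s=60.0),
        Replica("disk-array-site-3", bandwidth_mbytes_per_s=25.0, latency_s=0.5),
    ]
}

best_large = select_best_replica(catalog, LFN, size_mbytes=1000.0)
best_small = select_best_replica(catalog, LFN, size_mbytes=1.0)
print(best_large.location, best_small.location)
```

The “best” copy depends on the request: with these numbers a large file favors the higher-bandwidth disk array, while a small file favors the low-latency disk cache.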
2005.11.22- SLIDE 47
A Model Architecture for Data Grids
[Diagram: how an application locates and fetches a replica]
• The Application sends an Attribute Specification to the Metadata Catalog
• The Metadata Catalog returns a Logical Collection and Logical File Name
• The Replica Catalog maps the logical name to Multiple Locations
• Replica Selection chooses the Selected Replica using Performance Information & Predictions from MDS and NWS
• Data moves over GridFTP (Control Channel and Data Channel) from Replica Locations 1, 2, and 3 (disk caches, a tape library, a disk array)
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 48
Data Grid Requirements
• Seamless access to data and information stored
at local and remote sites
• Virtualization of data, collection and meta information
• Handle Dataset Scaling – size & number
• Integrate Data Collections & Associated Metadata
• Handle Multiplicity of Platforms,
Resource & Data Types
• Handle Seamless Authentication
• Handle Access Control
• Provide Auditing Facilities
• Handle Legacy Data & Methods
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 49
SRB as a Solution
• The Storage Resource Broker (SRB) is middleware
• It virtualizes resource access
• It mediates access to distributed heterogeneous resources
• It uses a MetaCATalog (MCAT) to facilitate the brokering
• It integrates data and metadata
[Diagram: the Application talks to the SRB Server, which consults MCAT and reaches the Distributed Storage Resources]
• Database systems: DB2, Oracle, Illustra, ObjectStore
• Archival storage systems: HPSS, ADSM, UniTree, HRM
• File systems: UNIX, NTFS
• Other: HTTP, FTP, …
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 50
SDSC Storage Resource Broker
& Meta-data Catalog
[Diagram: layered architecture with applications on top, SRB and MCAT in the middle, storage at the bottom]
• Application interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog; Python; Web
• SRB with MCAT: Dublin Core, Resource, User, User Defined, and Application Meta-data
• Services: Third-party copy, Remote Proxies, DataCutter
• Archives: HPSS, ADSM, HRM, UniTree, DMF
• File Systems: Unix, NT, Mac OSX
• Databases: DB2, Oracle, Sybase
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 51
SRB Single SignOn
[Diagram: numbered sign-on steps (1-4) linking the Application, the SRB Master, the spawned SRB agents, MCAT, and the CA]
• Diagram labels: Authentication (Secure Password, GSI or SEA); (Host, port); Identification & Initialization; (port); Server spawned; Session Established
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 52
Federated SRB Operation
[Diagram: a Read Application issues a request by Logical Name or Attribute Condition; numbered steps (1-6) link two SRB servers, their spawned SRB agents, MCAT, and replicas R1 and R2]
• Diagram labels: Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 53
SRB Concepts
• Abstraction of User Space
– Single sign-on
– Multiple authentication schemes
• certificates, (secure) passwords, tickets, group permissions, roles
• Virtualization of Resources
– Resource Location, Type & Access transparency
– Logical Resource Definitions - bundling
• Abstraction of Data and Collections
– Virtual Collections: Persistent Identifier and Global Name Space
– Replication & Segmentation
• Data Discovery – system & application metadata
– User-defined Metadata – Structural & Descriptive
– Attribute-based Access (path names become irrelevant)
• Uniform Access Methods
– APIs, Command Line, GUI Browsers, Web-Access (Portal,WSDL, CGI)
– Parallel Access with both Client and Server-driven strategies
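Two of the concepts above, logical-to-physical mapping and attribute-based access, can be sketched with a toy in-memory catalog. This is not the SRB or MCAT API: the class `MetadataCatalog`, its methods, the logical paths, and the example.org locations are all hypothetical and serve only to show why path names become irrelevant once data can be found by attribute.

```python
class MetadataCatalog:
    """Toy catalog: logical names map to physical replicas and attributes."""

    def __init__(self):
        self._replicas = {}    # logical name -> list of physical locations
        self._attributes = {}  # logical name -> dict of metadata

    def register(self, logical_name, physical_location, **attributes):
        """Record one physical copy and merge any descriptive metadata."""
        self._replicas.setdefault(logical_name, []).append(physical_location)
        self._attributes.setdefault(logical_name, {}).update(attributes)

    def resolve(self, logical_name):
        """Logical-to-physical mapping: return every known replica."""
        return list(self._replicas[logical_name])

    def find(self, **conditions):
        """Attribute-based access: logical names whose metadata match."""
        return sorted(
            name
            for name, attrs in self._attributes.items()
            if all(attrs.get(key) == value for key, value in conditions.items())
        )


catalog = MetadataCatalog()
catalog.register(
    "/home/collection/survey.dat",
    "hpss://archive.example.org/a1/survey.dat",
    project="sky-survey",
    year=2005,
)
catalog.register(
    "/home/collection/survey.dat",
    "file://cache.example.org/tmp/survey.dat",
)
catalog.register(
    "/home/collection/notes.txt",
    "file://cache.example.org/tmp/notes.txt",
    project="sky-survey",
    year=2004,
)

replicas = catalog.resolve("/home/collection/survey.dat")
matches = catalog.find(project="sky-survey", year=2005)
print(replicas)
print(matches)
```

The caller never handles a storage-system path directly: it either names a logical object or states the attributes it wants, and the catalog supplies the physical locations.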
Source: Arcot Rajasekar (SDSC)
2005.11.22- SLIDE 54
OceanStore:
Everyone’s data, One big Utility
“The data is just out there”
• Separate information from location
– Locality is only an optimization (an important one!)
– Wide-scale coding and replication for durability
• All information is globally identified
– Unique identifiers are hashes over names & keys
– Single uniform lookup interface replaces: DNS, server
location, data location
– No centralized namespace required (such as SDSI)
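A minimal sketch of an identifier that is a hash over a name and a key is shown below. The choice of SHA-1, the length-prefixed encoding, the function name `object_guid`, and the sample keys are assumptions for illustration; the slide only states that identifiers are hashes over names and keys.

```python
import hashlib


def object_guid(owner_public_key: bytes, name: str) -> str:
    """Derive a location-independent identifier from an owner key and a name.

    Each field is length-prefixed so that different (key, name) pairs
    cannot produce the same byte stream by accident.
    """
    name_bytes = name.encode("utf-8")
    digest = hashlib.sha1()
    for field in (owner_public_key, name_bytes):
        digest.update(len(field).to_bytes(4, "big"))
        digest.update(field)
    return digest.hexdigest()


# Hypothetical keys; real keys would be public-key material.
alice_key = b"alice-public-key"
bob_key = b"bob-public-key"

guid_a = object_guid(alice_key, "/papers/draft.txt")
guid_b = object_guid(bob_key, "/papers/draft.txt")
print(guid_a)
print(guid_b)
```

Because the key is part of the hash input, two owners can use the same human-readable name without colliding, which is what removes the need for a centralized namespace.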
Source: John Kubiatowicz (UCB)
2005.11.22- SLIDE 55
Basic Structure:
Irregular Mesh of “Pools”
Source: John Kubiatowicz (UCB)
2005.11.22- SLIDE 56
Amusing back of the envelope calculation
• How many files in the OceanStore?
– Assume 10^10 people in world
– Say 10,000 files/person (very conservative?)
– So 10^14 files in OceanStore!
– If 1 gig files (not likely), get on the order of 1 mole of bytes!
• Truly impressive number of elements… but small relative to physical constants
– (courtesy Bill Bolosky, Microsoft)
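The back-of-the-envelope estimate above can be checked with exact integer arithmetic. The population, files per person, and file size are the slide's own round numbers; treating 1 gig as 10^9 bytes and comparing against Avogadro's number are my additions.

```python
PEOPLE = 10 ** 10               # slide's assumption
FILES_PER_PERSON = 10 ** 4      # slide's assumption
BYTES_PER_FILE = 10 ** 9        # "1 gig" taken as 10^9 bytes
AVOGADRO = 6.02214076e23        # entities per mole

total_files = PEOPLE * FILES_PER_PERSON
total_bytes = total_files * BYTES_PER_FILE
moles_of_bytes = total_bytes / AVOGADRO

print(f"Files: 10^{len(str(total_files)) - 1}")
print(f"Bytes: 10^{len(str(total_bytes)) - 1}")
print(f"Moles of bytes: {moles_of_bytes:.2f}")
```

The byte count lands within an order of magnitude of a mole, which is the sense in which the estimate is "amusing": enormous for a storage system, yet comparable to the number of molecules in a few grams of matter.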
Source: John Kubiatowicz (UCB)
2005.11.22- SLIDE 57
Utility-based Infrastructure
[Diagram: OceanStore pools run by different providers: Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
• Service provided by confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
Source: John Kubiatowicz (UCB)
2005.11.22- SLIDE 58
Lecture Outline
• Review
– Future of Database Systems
• Grid-Based Digital Libraries
– Data Grids
– Grid-based IR
• DBMS and usability
2005.11.22- SLIDE 59
DBMS and Usability
• What features would you like to see in
DBMS?
2005.11.22- SLIDE 60
DBMS and Usability
• What do you hate about Database
Management Systems?
– From your experience
– In general
• What do you like about Database
Management Systems?
– From your experience
– In general
2005.11.22- SLIDE 61
Next Week
• Workshops to help you develop the final
reports and presentations.
2005.11.22- SLIDE 62