SDSC Staff Home Pages - San Diego Supercomputer Center

Data Grid Management Systems
(DGMS)
Arun Swaran Jagatheesan
San Diego Supercomputer Center
University of California, San Diego
Tutorial at the fourth IEEE International Conference on Data Mining
Brighton, UK
November 01 - 04, 2004
San Diego Supercomputer Center
University of Florida
Grid Physics Network (GriPhyN)
Dynamic Calibration of Content
• Academic researchers and students
• Business analysts & Office of the CTO
• Software architects
• Software developers
• Savvy users
Just to make sure all of us get the most out of this tutorial when we leave this room
University of Florida
2
San Diego Supercomputer Center
ICDM 2004
Some Questions
Is there a problem worth my (student’s) Ph.D. thesis?
What is a Data Grid?
Can data grid technologies help the IT department save costs?
What are the concepts I should know to design a data grid product? What are the undocumented tricks of the trade that worked?
Are there people using these, or is it still in research labs?
I have been hearing about it a lot. Can I see something for real?
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
Tutorial Outline
• Introduction to Data Grids
o The “Grid” Vision
o Hype/Reality
o Where: Data Grid Infrastructures in production
o Why: Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
We All Know the Story
• Information is growing …, information explosion
• In 2002, 5 exabytes of data were produced by mankind (equivalent to all words ever spoken by human beings)
• In 2006, 62 billion e-mails in 800~1335 PB
• Mostly files, e-mails, multimedia
Source: “How much Information? 2003” – SIMS, UC Berkeley
Enterprise Data Storage
• Distributed in multiple locations and resources
• Geographically or administratively distributed
• Multiple “autonomous administrative domains”
• Independent control over operations
• Business driven - collaborations, acquisitions etc.,
• Heterogeneous storage and data handling
systems
• File systems, databases, archives, e-mail/web servers…
Can we have a single logical namespace for data storage?
Can enterprises reduce TCO by treating data storage as a single utility or infrastructure?
Distributed Computing
© Images courtesy of Computer History Museum
The “Grid” Vision
Data Grids – Hype / Reality?
• Forrester research
• “Data and infrastructure are top of mind for grid at more than 50 percent of firms … the vision of data grids will become part of a greater vision of storage virtualization and information life cycle management” – May 2004
• CIO magazine
• “While most people think of computational grids, enterprises are looking into data grids” – May 2004
• Why talk about Busine$$ in an IEEE conference?
• Necessity drives business; business drives standards and technology evolution; …; Grid is not just technology alone, but also standards evolution (which requires business participation)
Media Perspective
• CNN: The world's biggest brain: Distributed
[grid] computing
• BBC: Grid virtualises processing and
storage resources and lets people use, or
rent, the capacity they need for
particular tasks
• Lots more
Visionaries Perspective
• Computing and storage
• shared amongst autonomous organizations [using a grid-enabled cyberinfrastructure]
• As a utility, like commodities, used within inter/intra-organizational collaborations
• Government agencies and policy makers have subscribed to this and don’t want their country to be left behind
Academia Perspective
• “Same questions, different answers”
• Parallel distributed computing (with multiple organizations)
• Autonomous administrative domains
• Heterogeneous Infrastructure
• Hide heterogeneity using logical resource namespace
• Change in computing models to take advantage of grid?
• Large bandwidth, large storage space, large computing power
everywhere – does the “large” affect the models/algorithms?
• Old wine in a new bottle
• It’s a solved problem. Nothing new
• “Greed Computing”: Use it to get funding
User / Vendor Perspective
• Vendor
• Yes, it’s a brand new paradigm.
• We were ready for this a long time ago. In fact, our product X had all the concepts. Do you want to test drive our product and give feedback?
• Users
• Our resources (human, computer) are distributed world-wide
• Collaborations that can span across multiple resources from
autonomous administrations of the same company
• Reducing the Total Cost of Operation (TCO), flexibility to create or use
logical resource pools
• Automobile Industry, Bio-Tech, Electronics, … (Mostly distributed
teams with multiple locations and very large data/computing)
Reality?
• Navigating hype wave
• Touting immediate soft benefits
• Just an application (“Seti@home”,…)
• Need for standards
• Standards that can facilitate new algorithms that can take advantage
of heterogeneous infrastructure
• No use without standards on which interoperable products are developed
• How long is the wait?
• Will grid computing deliver? How soon? Or is it just hype that will fade away?
• Some quite promising technologies are already available
Autonomous Administrative Domain
• A Grid Entity that:
• Manages one or more grid resources
• Can make its own policies
• Might abide by a superior or global policy
• Can act as a resource provider or requestor or both
• Examples:
• A department or research lab in a university
• An HR or finance department of a company (sub-organization)
• Or simply a single computational or storage resource that manages itself, governed by some policies
• A Grid / Enterprise contains one or more autonomous administrative domains with distributed heterogeneous resources
Data Grid Resources
• Context (Information)
• Information about digital entities (location, size, owners, ...)
• Relationships between digital entities (replicas, collections, ...)
• Behavior of the digital entities (services)
• Content (Data)
• Structured and unstructured
• Virtual or derived
• Commodity (Producers and consumers)
• Storage resources
• Also providers, brokers and requestors
Very Large Scale Data Storage
[Figure: Grid Resource Providers (GRP) providing content and/or storage; file shown: /txt3.txt]
Very Large Scale Data Storage
[Figure: a Research Lab as an Autonomous Administrative Domain with one or more Grid Resource Providers (GRP); file shown: /txt3.txt]
Very Large Scale Data Storage
[Figure: three autonomous administrative domains, each with several GRPs]
• Research Lab: data + storage (10)
• Storage-R-Us Resource Providers: data + storage (50)
• Finance Department: data + storage (40)
• Files shown: /…/text1.txt, /…//text2.txt, /txt3.txt
Very Large Scale Data Storage
[Figure: the three domains presented through one logical namespace, data + storage (100)]
/home/arun.sdsc/exp1
/home/arun.sdsc/exp1/text1.txt
/home/arun.sdsc/exp1/text2.txt
/home/arun.sdsc/exp1/text3.txt
Logical Namespace (need not be the same as the physical view of resources)
• Research Lab: data + storage (10)
• Storage-R-Us Resource Providers: data + storage (50)
• Finance Department: data + storage (40)
• Physical files shown: /…/text1.txt, /…//text2.txt, /txt3.txt
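One way to picture the logical namespace on this slide is as a catalog that maps logical paths to physical locations. The sketch below is only an illustration under stated assumptions: the catalog is a plain in-memory dictionary, and the provider names and most physical paths are hypothetical (the slide elides them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalLocation:
    provider: str  # hypothetical Grid Resource Provider (GRP) name
    path: str      # physical path on that provider (mostly hypothetical)


# Hypothetical catalog: logical name -> where the bytes actually live.
CATALOG = {
    "/home/arun.sdsc/exp1/text1.txt": PhysicalLocation("finance-grp-1", "/vault/a/text1.txt"),
    "/home/arun.sdsc/exp1/text2.txt": PhysicalLocation("storage-r-us-grp-2", "/pool/b/text2.txt"),
    "/home/arun.sdsc/exp1/text3.txt": PhysicalLocation("research-lab-grp-1", "/txt3.txt"),
}


def resolve(logical_path: str) -> PhysicalLocation:
    """Translate a logical path into its physical location."""
    try:
        return CATALOG[logical_path]
    except KeyError:
        raise FileNotFoundError(logical_path) from None


def list_collection(collection: str) -> list[str]:
    """List logical names under a collection, wherever they are stored."""
    prefix = collection.rstrip("/") + "/"
    return sorted(name for name in CATALOG if name.startswith(prefix))
```

A user browsing `/home/arun.sdsc/exp1` sees three files in one collection, although in this sketch they sit in three different administrative domains.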
Tutorial Outline
• Introduction to Data Grids
o The “Grid” Vision
o Hype/Reality
o Where: Data Grid Infrastructures in production
o Why: Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
DGMS Technology Usage
• NSF Southern California Earthquake Center digital library
• Worldwide Universities Network data grid
• NASA Information Power Grid
• NASA Goddard Data Management System data grid
• DOE BaBar High Energy Physics data grid
• NSF National Virtual Observatory data grid
• NSF ROADnet real-time sensor collection data grid
• NIH Biomedical Informatics Research Network data grid
• NARA research prototype persistent archive
• NSF National Science Digital Library persistent archive
• NHPRC Persistent Archive Test bed
Southern California Earthquake Center
• Build community digital library
• Manage simulation and observational data
• 60 TBs, several million files
• Provide web-based interface
• Support standard services on digital library
• Manage data distributed across multiple sites
• USC, SDSC, UCSB, SDSU, SIO
• Provide standard metadata
• Community based descriptive metadata
• Administrative metadata
• Application specific metadata
SCEC Data Management
Technologies
• Portals
• Knowledge interface to the library, presenting a coherent view of the services
• Knowledge Management Systems
• Organize relationships between SCEC concepts and semantic labels
• Process management systems
• Data processing pipelines to create derived data products
• Web services
• Uniform capabilities provided across SCEC collections
• Data grid
• Management of collections of distributed data
• Computational grid
• Access to distributed compute resources
• Persistent archive
• Management of technology evolution
NASA Data Grids
• NASA Information Power Grid
• NASA Ames, NASA Goddard
• Distributed data collection using the SRB
• ESIP federation
• Led by Joseph JaJa (U Md)
• Federation of ESIP data resources using the SRB
• NASA Goddard Data Management System
• Storage repository virtualization (Unix file system, Unitree
archive, DMF archive) using the SRB
• NASA EOS Petabyte store
• Storage repository virtualization for EMC persistent store
using the Nirvana version of SRB
NIH BIRN SRB Data Grid
• Biomedical Informatics Research Network
• Access and analyze biomedical image data
• Data resources distributed throughout the country
• Medical schools and research centers across the US
• Stable high performance grid based environment
• Coordinate data sharing
• Federate collections
• Support data mining and analysis
BIRN: Inter-organizational Data
SRB Collections at SDSC
[Table: SRB collections at SDSC in successive snapshots (as of 12/22/2000, 5/17/2002, 3/3/2004, 6/1/2004, 7/15/2004, 8/2/2004 and 9/3/2004). Columns: Project Instance, Data_size (in GB), Count (files), Users, Comments, Funding Agency. Groups: Data Grid, Digital Library, Persistent Archive. Project instances: NPACI, Digsky, DigEmbryo, HyperLter, Hayden, Portal, SLAC, LDAS/SALK, TeraGrid, BIRN, AfCS, UCSDLib, NSDL/SIO Exp, NSDL/CI, TRA, SCEC, NARA/Collection.]
TOTAL in the largest snapshot: 256,575 GB (256 TB), 49,454,310 files (49.5 million), 4,900 users (4 thousand)
** Does not cover data brokered by SRB spaces administered outside SDSC.
Does not cover databases; covers only files stored in file systems and archival storage systems.
Does not cover shadow-linked directories.
Tutorial Outline
• Introduction to Data Grids
o The “Grid” Vision
o Hype/Reality
o Where: Data Grid Infrastructures in production
o Why: Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
Why They Require Data Grids
• Inter/Intra Organizational Sharing
• Inter/Intra Organizational Data Storage Utility
• Data Storage Resource Plug-n-play provisioning
• Data Preservation (Technology Migration)
• Information Lifecycle Management (ILM)
Inter/intra Organizational Sharing
[Figure: two research labs sharing through GRPs; file shown: /txt3.txt]
• Research Lab 1: You can use our storage for this project
• Research Lab 2: We have relevant data for this project
Inter/intra Organizational Sharing
• Sharing of resources between autonomous
domains
• Either the same (intra) or different (inter) organizations
• Shared resources
• Data, storage, IT staff
• Logical namespace for Collaboration
• Shared data and physical resources available in the
logical namespace for usage
• Inter-organizational digital libraries, Personal digital
libraries
Inter/intra Organizational Utility
[Figure: Data Center, West Coast Offices and East Coast Office connected through GRPs; files shown: /…/text1.txt and /txt3.txt]
Inter/intra Organizational Utility
[Figure: Data Center, West Coast Offices and East Coast Office connected through GRPs; files shown: /…/text1.txt (twice) and /txt3.txt]
That was so easy in a slide show :)
Inter/intra Organizational Utility
• Create a logical data storage utility
• Virtualization of enterprise resources (data and storage)
• Manage resource usage based on demand and
supply
• Distribute the quantity and QoS of resources in an
enterprise based on the project demands, priorities, usage
• Saves a lot in TCO
• Total Cost of Operation, managing logically unified distributed resources without losing flexibility and local autonomy
Data Storage Resource Plug-n-play
• Plug-n-play Provisioning
• Add a storage resource in grid without major
reconfiguration
• Logical resources
• Logical namespace of all the resources to which
resources can be added (or removed gracefully without
affecting the applications)
• Update resources based on demand and supply
• Add resources to the storage pool from another inter/intra
organizational partner (another department or data center)
Data Preservation (Technology Migration)
• Facilitate “Technology Migration”
• Flexible enterprise data architecture for technology
evolution
• Update seamlessly to new file/storage system resources
• Hardware changes, Software changes
• Change archival resource from digital tape to disks
• Change from magnetic to optical
• Applications and users are not aware of any change
• Significant savings by avoiding downtime
• Create a replicated resource of all or selected data
Information Lifecycle Management (ILM)
• Business oriented management of data and resources
• If data is in demand: replicate it, move it to higher QoS
• Archive only the less accessed data
• Grid middleware facilitates ILM in the background
• More than one physical resource as a single logical
resource
• Logical namespace of online and offline data
• Irrespective of number of replicas and resources added
• Compliance with federal regulations
• Health, Finance and many domains now have regulations
on digital backup of transactions and communication
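The ILM rules on this slide (replicate and promote data that is in demand, archive only the less accessed data) can be sketched as a small background policy. This is a minimal illustration, not how any particular grid middleware implements ILM; the thresholds, tier names and the `DataObject` type are all assumptions made for the example.

```python
from dataclasses import dataclass

HOT_THRESHOLD = 100  # assumed: accesses per review window that count as "in demand"
COLD_THRESHOLD = 1   # assumed: at or below this, data counts as "less accessed"


@dataclass
class DataObject:
    logical_name: str
    accesses_in_window: int
    tier: str = "disk"   # hypothetical tiers: "fast-disk", "disk", "archive"
    replicas: int = 1


def apply_ilm_policy(obj: DataObject) -> str:
    """Pick a background action; the logical name never changes."""
    if obj.accesses_in_window >= HOT_THRESHOLD:
        obj.replicas += 1          # replicate data that is in demand
        obj.tier = "fast-disk"     # and move it to a higher QoS tier
        return "replicate-and-promote"
    if obj.accesses_in_window <= COLD_THRESHOLD and obj.tier != "archive":
        obj.tier = "archive"       # archive only the less accessed data
        return "archive"
    return "no-op"
```

Because users address the object by its logical name, neither action is visible to applications.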
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
Using a Data Grid – in Abstract
Data Grid
• User asks for data from the data grid
• The data is found and returned
• Where & how details are managed by data grid
• But access controls are specified by owner
Real World : Physical Heterogeneities
• Multiple autonomous administrative domains
• Distributed digital entities in different domains
• Heterogeneous storage resources and systems
• Distributed users and authentication mechanisms
• Different user preferences of usage
• Logical hierarchy
• Users, groups, sub-organization/departments, administrative domains, enterprises, virtual enterprise
Data Grid: Every Thing Is Logical
• Logical namespace of grid resources
• Collection hierarchy
• Logical resources
• Grid Users and Virtual organizations
Transparencies/Virtualizations
(bits, data, information, ...)
[Figure: layers of Interorganizational Information Storage Management, with example names]
• Semantic data Organization (with behavior)
myActiveNeuroCollection, patientRecordsCollection
• Virtual Data Transparency
image.cgi, image.wsdl, image.sql
• Data Replica Transparency
image_0.jpg … image_100.jpg
• Data Identifier Transparency
E:\srbVault\image.jpg, /users/srbVault/image.jpg, Select … from srb.mdas.td where...
• Storage Location Transparency
• Storage Resource Transparency
Data Grid Transparencies
• Find data without knowing the identifier
• Descriptive attributes
• Access data/storage without knowing the
location
• Logical name space
• Access data without knowing the type of storage
• Storage repository abstraction
• Provide transformations for any data collection
• Data behavior abstraction
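Finding data without knowing its identifier amounts to querying descriptive attributes and getting logical names back. A minimal sketch, assuming a hypothetical in-memory metadata table; the paths, attribute names and values are invented for illustration.

```python
# Hypothetical metadata table: logical name -> descriptive attributes.
METADATA = {
    "/demo/scan_001.img": {"modality": "MRI", "species": "mouse"},
    "/demo/scan_002.img": {"modality": "MRI", "species": "human"},
    "/demo/scan_003.img": {"modality": "PET", "species": "human"},
}


def find_by_attributes(**conditions: str) -> list[str]:
    """Return logical names whose attributes match every given condition."""
    return sorted(
        name
        for name, attrs in METADATA.items()
        if all(attrs.get(key) == value for key, value in conditions.items())
    )
```

For example, `find_by_attributes(modality="MRI", species="human")` selects a file by what it describes; the caller never supplies a file name, location or storage type.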
Data Grid Abstractions
• Storage repository virtualization
• Standard operations supported on storage systems
• Data virtualization
• Logical name space for files - Global persistent identifier
• Information repository virtualization
• Standard operations to manage collections in databases
• Access virtualization
• Standard interface to support alternate APIs
• Latency management mechanisms
• Aggregation, parallel I/O, replication, caching
• Security interoperability
• GSSAPI, inter-realm authentication, collection-based
authorization
Data Organization
• Physical Organization of the data
• Distributed Data
• Heterogeneous resources
• Multiple formats (structured and unstructured)
• Logical Organization
• Impose logical structure for data sets
• Collections of semantically related data sets
• Users create their own views (collections) of the data grid
Data Identifier Transparency
Four Types of Data Identifiers:
• Unique name
• OID or handle
• Descriptive name
• Descriptive attributes – metadata
• Semantic access to data
• Collective name
• Logical name space of a collection of data sets
• Location independent
• Physical name
• Physical location of resource and physical path of data
Mappings on Resource Name Space
• Define logical resource name
• List of physical resources
• Replication
• Write to logical resource completes when all physical
resources have a copy
• Load balancing
• Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
• Write to a logical resource completes when copies exist on “k” of “n” physical resources
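The three mappings on this slide differ only in when a write to the logical resource is considered complete. The sketch below models each physical resource as a hypothetical function that stores the data and reports success; it illustrates the completion rules only and is not SRB code.

```python
from typing import Callable, Sequence

# Assumption: a physical resource is a callable that stores the data and
# returns True on success, False on failure.
StoreFn = Callable[[bytes], bool]


def write_replicated(resources: Sequence[StoreFn], data: bytes) -> bool:
    """Replication: complete only when every physical resource has a copy."""
    results = [store(data) for store in resources]  # attempt all resources
    return all(results)


class RoundRobinResource:
    """Load balancing: each write goes to the next physical resource in the list."""

    def __init__(self, resources: Sequence[StoreFn]) -> None:
        self._resources = list(resources)
        self._next = 0

    def write(self, data: bytes) -> bool:
        store = self._resources[self._next]
        self._next = (self._next + 1) % len(self._resources)
        return store(data)


def write_k_of_n(resources: Sequence[StoreFn], data: bytes, k: int) -> bool:
    """Fault tolerance: complete once copies exist on k of the n resources.

    In this sketch the remaining resources are skipped once k copies exist.
    """
    copies = 0
    for store in resources:
        if store(data):
            copies += 1
            if copies >= k:
                return True
    return False
```

In all three cases the client writes to one logical resource name; the list of physical resources behind it can change without the client noticing.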
Data Replica Transparency
• Replication
• Improve access time
• Improve reliability
• Provide disaster backup and preservation
• Physically or semantically equivalent replicas
• Replica consistency
• Synchronization across replicas on writes
• Updates might use “m of n” or any other policy
• Distributed locking across multiple sites
• Versions of files
• Time-annotated snapshots of data
Latency Management -Bulk Operations
• Bulk register
• Create a logical name for a file
• Bulk load
• Create a copy of the file on a data grid storage repository
• Bulk unload
• Provide containers to hold small files and pointers to each file location
• Bulk delete
• Mark as deleted in metadata catalog
• After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …
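The bulk delete described on this slide is a two-phase operation: mark entries as deleted in the metadata catalog, then remove them after a specified interval. A minimal sketch of that idea, assuming a hypothetical in-memory catalog and caller-supplied timestamps in seconds; the class and method names are illustrative, not the SRB API.

```python
from typing import Dict, List, Optional


class MetadataCatalog:
    """Hypothetical catalog that tracks logical names and deletion marks."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        # logical name -> time it was marked deleted (None means live)
        self._marked: Dict[str, Optional[float]] = {}

    def bulk_register(self, names: List[str]) -> None:
        """Create logical names for a batch of files."""
        for name in names:
            self._marked[name] = None

    def bulk_delete(self, names: List[str], now: float) -> None:
        """Mark entries as deleted; nothing is physically removed yet."""
        for name in names:
            if name in self._marked:
                self._marked[name] = now

    def purge(self, now: float) -> List[str]:
        """Drop entries whose grace interval has elapsed; return their names."""
        expired = sorted(
            name
            for name, marked in self._marked.items()
            if marked is not None and now - marked >= self.grace_seconds
        )
        for name in expired:
            del self._marked[name]
        return expired

    def visible(self) -> List[str]:
        """Logical names that are not marked as deleted."""
        return sorted(n for n, m in self._marked.items() if m is None)
```

Handling many names per call keeps the number of catalog round trips low, which is the latency-management point of the bulk operations.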
In Short…
• The whole data grid infrastructure is logical
• All physical heterogeneities and distribution are hidden
• Flexibility needed for ever changing business demands in
enterprise data management
• The presentation of the underlying physical infrastructure
is controlled by the autonomous administrative domains in
the grid
• Grid middleware has to be the “plumber”
• Has to do a lot of “plumbing” to provide all these transparencies without any significant degradation in performance or QoS
• Distributed data management: latency, replicas, logical namespace, metadata, P2P, database tuning, etc.
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
o SRB Architecture
o SRB Clients
o SRB Demo
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
Storage Resource Broker
• Distributed data management technology
• Developed at San Diego Supercomputer Center (Univ. of California,
San Diego)
• 1996 - DARPA Massive Data Analysis
• 1998 - DARPA/USPTO Distributed Object Computation Test bed
• 2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC
• Applications
• Data grids - data sharing
• Digital libraries - data publication
• Persistent archives - data preservation
• Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology
Acknowledgement: SDSC SRB Team
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar
• Reagan Moore
• Michael Wan
• Roman Olschanowsky
• Bing Zhu
• Charlie Cowart
Not In Picture:
• Wayne Schroeder
• Tim Warnock (BIRN)
• Lucas Gilbert
• Marcio Faerman (SCEC)
• Antoine De Torcy
Students:
Jonathan Weinberg
Yufang Hu
Daniel Moore
Grace Lin
Allen Ding
Yi Li
Emeritus:
Vicky Rowley (BIRN)
Qiao Xin
Ethan Chen
Reena Mathew
Erik Vandekieft
Xi (Cynthia) Sheng
Three Tier Architecture
• Clients (Any interface/API)
• Your preferred access mechanism
• Servers (SRB Server)
• Manage interactions with storage systems
• Federated to support direct interactions between servers
• Metadata catalog (MCAT)
• Separation of metadata management from data storage
• State persistence using a well-tuned database
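The three tiers can be sketched as a read path: the client names a logical file, a server asks the metadata catalog where it lives, and federated servers talk to each other directly when the data is held elsewhere. Everything below (class names, methods, the in-memory storage) is a hypothetical illustration of that separation, not the SRB implementation or its API.

```python
class MCAT:
    """Hypothetical metadata catalog: holds state, never the data itself."""

    def __init__(self) -> None:
        self._locations = {}  # logical name -> (server name, physical path)

    def register(self, logical_name: str, server_name: str, physical_path: str) -> None:
        self._locations[logical_name] = (server_name, physical_path)

    def lookup(self, logical_name: str):
        return self._locations[logical_name]


class Server:
    """Hypothetical server tier: fronts one storage system, knows its peers."""

    def __init__(self, name: str, mcat: MCAT) -> None:
        self.name = name
        self.mcat = mcat
        self.storage = {}  # physical path -> bytes (stands in for real storage)
        self.peers = {}    # peer server name -> Server

    def federate(self, other: "Server") -> None:
        self.peers[other.name] = other
        other.peers[self.name] = self

    def read(self, logical_name: str) -> bytes:
        server_name, path = self.mcat.lookup(logical_name)
        if server_name == self.name:
            return self.storage[path]
        # Server-to-server interaction: the client never contacts the peer.
        return self.peers[server_name].read(logical_name)


def client_read(server: Server, logical_name: str) -> bytes:
    """Client tier: any interface that can hand a logical name to a server."""
    return server.read(logical_name)
```

Keeping the catalog separate from storage means a file can move between servers by updating one catalog entry, with no change to the client call.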
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: layered architecture, from application to storage]
• Application
• Access APIs: C, C++ Libraries; Linux I/O; Unix Shell; DLL / Python; Java, NT Browsers; GridFTP; OAI; WSDL
• Consistency Management / Authorization-Authentication
• Logical Name Space, Latency Management, Data Transport, Metadata Transport
• Catalog Abstraction: Databases (DB2, Oracle, Sybase, SQLServer)
• Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); HRM; File Systems (Unix, NT, Mac OSX); Databases (DB2, Oracle, Postgres)
• SRB Server; Drivers
Federated SRB server model
[Figure: a read client presents a logical name or attribute condition; two SRB servers, each with an SRB agent, handle the request through peer-to-peer brokering (steps 1 to 6, with server spawning and parallel data access) and perform data access on storage resources R1 and R2. The MCAT provides:]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
SRB Latency Management
[Figure: latency management mechanisms between source and destination across the network: remote proxies, staging, data aggregation, containers, prefetch, replication, streaming, caching, server-initiated I/O, parallel I/O, client-initiated I/O]
SRB Name Spaces
• Digital Entities (files, blobs, Structured data, …)
• Logical name space for files for global identifiers
• Resources
• Logical names for managing collections of resources
• User names (user-name / domain / SRB-zone)
• Distinguished names for users to manage access controls
• MCAT metadata
• Standard metadata attributes, Dublin Core, administrative
metadata
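The three-part user name (user-name / domain / SRB-zone) can be modelled as a small value type that access controls are keyed on. This is a sketch under assumptions: the slash-separated string form mirrors the slide's notation rather than any documented SRB syntax, and the zone name and access table are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridUser:
    user_name: str
    domain: str
    zone: str

    @classmethod
    def parse(cls, text: str) -> "GridUser":
        """Parse 'user-name / domain / zone' (notation assumed from the slide)."""
        parts = [part.strip() for part in text.split("/")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected user-name / domain / zone, got {text!r}")
        return cls(*parts)


# Hypothetical access table: logical collection -> {user: permission}.
ACCESS = {
    "/home/arun.sdsc/exp1": {GridUser("arun", "sdsc", "zoneA"): "write"},
}


def permission(user: GridUser, collection: str) -> str:
    """Return the user's permission on a collection, or 'none'."""
    return ACCESS.get(collection, {}).get(user, "none")
```

Including the zone in the identity lets two users with the same name and domain, but registered in different zones, be told apart.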
Logical Name Space
• Global, location-independent identifiers for
digital entities
• Organized as collection hierarchy
• Attributes mapped to logical name space
• Attributes managed in a database
• Types of administrative metadata
• Physical location of file
• Owner, size, creation time, update time
• Access controls
Remote Proxies
• Extract image cutout from Digital Palomar Sky
Survey
• Image size 1 Gbyte
• Shipping the image to a server to extract the cutout took 2-4 minutes (5-10 Mbytes/sec)
• Remote proxy performed cutout directly on
storage repository
• Extracted cutout by partial file reads
• Image cutouts returned in 1-2 seconds
• Remote proxies are a mechanism to aggregate I/O
commands
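The numbers on this slide can be checked with simple arithmetic, and the "partial file reads" idea can be shown on a toy image. The sketch assumes 1 Gbyte = 1000 Mbytes and a hypothetical raw, row-major image with one byte per pixel; the real sky survey files use their own format, so this only illustrates the access pattern.

```python
import io


def transfer_seconds(size_mbytes: float, rate_mbytes_per_sec: float) -> float:
    """Time to ship a whole file at a given sustained rate."""
    return size_mbytes / rate_mbytes_per_sec


def read_cutout(f, row_bytes: int, top: int, left: int, height: int, width: int) -> bytes:
    """Read a rectangular cutout with one seek and one short read per row."""
    rows = []
    for row in range(top, top + height):
        f.seek(row * row_bytes + left)  # jump straight to the cutout's columns
        rows.append(f.read(width))      # read only the bytes that are needed
    return b"".join(rows)


# Shipping the full 1 Gbyte image: 100 s at 10 Mbytes/sec, 200 s at 5 Mbytes/sec,
# i.e. roughly the 2-4 minutes quoted on the slide.
fast = transfer_seconds(1000, 10)
slow = transfer_seconds(1000, 5)

# A 4x4 toy image; the 2x2 cutout starting at row 1, column 1.
image = io.BytesIO(bytes(range(16)))
cutout = read_cutout(image, row_bytes=4, top=1, left=1, height=2, width=2)
```

The proxy running at the storage repository touches only `height * width` bytes, which is why the cutout can come back in seconds while shipping the whole image takes minutes.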
Virtual Data Abstraction
• Virtual Data or “On Demand Data”
• Created on demand if not already available
• Recipe to create derived data
• Grid based computation to create derived data product
• Object based storage (extended data operations)
• Data subsetting at the remote storage repository
• Data formatting at the remote storage repository
• Metadata extraction at the remote storage repository
• Bulk data manipulation at the remote storage repository
Grid Bricks
• Integrate data management system, data processing system, and data storage system into a modular unit
• Commodity based disk systems (1 TB)
• Memory (1 GB)
• CPU (1.7 GHz)
• Network connection (Gig-E)
• Linux operating system
• Data Grid technology to manage name spaces
• User names (authentication, authorization)
• File names
• Collection hierarchy
Grid Bricks at SDSC
• Used to implement “picking” environments for 10-TB
collections
• Web-based access
• Web services (WSDL/SOAP) for data subsetting
• Implemented 15-TBs of storage
• Astronomy sky surveys, NARA prototype persistent archive,
NSDL web crawls
• Must still apply Linux security patches to each Grid
Brick
• Grid bricks managed through SRB
• Logical name space, User Ids, access controls
• Load leveling of files across bricks
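"Load leveling of files across bricks" could be done in several ways; the slide does not say which one was used. The sketch below assumes one simple policy, placing each new file on the brick with the most free space, with a hypothetical `Brick` type and sizes in GB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Brick:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def place_file(bricks: List[Brick], size_gb: float) -> Brick:
    """Assumed policy: put the file on the brick with the most free space."""
    candidates = [brick for brick in bricks if brick.free_gb >= size_gb]
    if not candidates:
        raise RuntimeError("no brick has enough free space")
    target = max(candidates, key=lambda brick: brick.free_gb)
    target.used_gb += size_gb
    return target
```

Since files are addressed through the logical name space, the choice of brick stays invisible to users, and adding another brick only extends the list.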
Data Grid Federation
• Data grids provide the ability to name, organize,
and manage data on distributed storage
resources
• Federation provides a way to name, organize,
and manage data on multiple data grids.
SRB Zones
• Each SRB zone uses a metadata catalog (MCAT)
to manage the context associated with digital
content
• Context includes:
• Administrative, descriptive, authenticity attributes
• Users
• Resources
• Applications
SRB Peer-to-Peer Federation
• Mechanisms to impose consistency and access
constraints on:
• Resources
• Controls on which zones may use a resource
• User names (user-name / domain / SRB-zone)
• Users may be registered into another domain, but retain their home
zone, similar to Shibboleth
• Data files
• Controls on who specifies replication of data
• MCAT metadata
• Controls on who manages updates to metadata
Peer-to-Peer Federation
1. Occasional Interchange - for specified users
2. Replicated Catalogs - entire state information replication
3. Resource Interaction - data replication
4. Replicated Data Zones - no user interactions between zones
5. Master-Slave Zones - slaves replicate data from master zone
6. Snow-Flake Zones - hierarchy of data replication zones
7. User / Data Replica Zones - user access from remote to home zone
8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent
9. Free-floating “myZone” - synchronize without a parent zone
10. Archival “BackUp Zone” - synchronize to an archive
SRB Version 3.0.1 released December 19, 2003
Comparison of peer-to-peer federation approaches
[Figure: federation approaches (Free Floating, Occasional Interchange, Replicated Data, Resource Interaction, User and Data Replica, Nomadic, Snow Flake, Replicated Catalog, Master Slave, Archival) related by:
• User-ID sharing: partial, complete, one shared user-ID
• Resource sharing: none, partial, complete
• Metadata synchronization: none, system controlled partial, system controlled complete
• Replication and control: system managed replication, system set access controls, system controlled complete synch, connection from any zone, hierarchical zone organization, super administrator zone control]
Data Grid Federation - zoneSRB
[Figure: zoneSRB layered architecture, top to bottom]
• Application
• C, C++, Java Libraries; Linux I/O; Java, NT Browsers; Unix Shell; DLL / Python, Perl; HTTP; OAI, WSDL, OGSA
• Federation Management
• Consistency & Metadata Management / Authorization-Authentication Audit
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction
  • Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Virtualization
  • Archives - Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  • ORB
  • File Systems: Unix, NT, Mac OSX
  • Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
SDSC SRB Clients
• C library calls
• Provide access to all SRB functions
• Shell commands
• Provide access to all SRB functions
• mySRB web browser
• Provides hierarchical collection view
• inQ Windows browser
• Provides Windows style directory view
• Jargon Java API
• Similar to java.io. API
• Matrix WSDL/SOAP Interface
• Aggregate SRB requests into a SOAP request. Has a Java API and
GUI
• Python, Perl, C++, OAI, Windows DLL, Mac DLL, Linux I/O
redirection, GridFTP (soon)
SDSC SRB Demo
• If possible from the venue
• Constraints: Wireless, Murphy’s live demo laws; WAN,
SRB, Storage, …
What we are familiar with …
What we are not familiar with, yet =)
inQ Windows Browser Interface
How do they differ?
• A folder does NOT mean a physical folder
• Files do NOT mean physical files
• Everything is logical
• Everything is distributed
• Permissions are NOT rwxrwxrwx
• Permissions are set on an object-by-object basis
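The difference can be made concrete with a small sketch: one logical name maps to several physical replicas, and permissions hang off the object itself rather than off Unix mode bits. All names, paths, and the `LogicalObject` class are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalObject:
    logical_path: str                              # what the user sees
    replicas: list = field(default_factory=list)   # (resource, physical path) pairs
    acl: dict = field(default_factory=dict)        # user -> set of permissions

    def allowed(self, user: str, permission: str) -> bool:
        """Per-object check; there is no owner/group/other triple."""
        return permission in self.acl.get(user, set())


obj = LogicalObject(
    logical_path="/home/demo.sdsc/survey/image001.fits",
    replicas=[
        ("unix-disk-sdsc", "/data/vault/a1/0001"),
        ("tape-archive", "/archive/x/0001.dat"),
    ],
    acl={"alice": {"read", "write"}, "bob": {"read"}},
)

bob_can_write = obj.allowed("bob", "write")
```

Moving or adding a replica changes only the `replicas` list; the logical path and the access controls stay the same.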
inQ
• Windows OS only
• User Guide at
http://www.npaci.edu/dice/srb/inQ/inQ.html
• Download .exe from
http://www.npaci.edu/dice/srb/inQ/downloads.html
mySRB
• Web-based access to the SRB
• Secure HTTP
• https://srb.npaci.edu/mySRB2v7.shtml
• Uses Cookies for Session Control
mySRB Features
• Access to Both Data and Metadata
• Data & File Management
• Collection Creation and Management
• Metadata Handling
• Browsing & Querying Interface
• Access Control
• New file creation without upload
mySRB Interface to a SRB Collection
Provenance Metadata
SDSC SRB Information
• [email protected]
• SRB user community posts problems and
solutions
• [email protected]
• Request copy of source
• http://www.npaci.edu/DICE/SRB/
• Access FAQ, installation instructions, papers
• http://srb.npaci.edu/bugs/
• SRB-Zilla (bugzilla)
SRB Availability
• SRB source distributed to academic and
research institutions
• Commercial use access through UCSD
Technology Transfer Office
• William Decker [email protected]
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
o Introduction
o Matrix Project
o Data Grid Language
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session
Work in progress
GfMS is ‘Hard Hat Area’ (Research)
Data handling pipeline in SCEC
(data → information pipeline)
Pipeline steps shown:
• Metadata derivation
• Ingest Data
• Ingest Metadata
• Determine analysis pipeline
• Initiate automated analysis
• Organize result data into distributed data grid collections
Callouts:
• Pipeline could be triggered by input at data source or by a data request from user
• Use the optimal set of resources based on the task, on demand
• All gridflow activities stored for data flow provenance
Gridflows (Grid Workflow)
• Automation of an execution pipeline
• Data and/or tasks processed by multiple autonomous grid
resources
• According to set of procedural rules
• Confluence of multiple autonomous administrative
domains
• GridFlow Execution Servers
• May themselves belong to autonomous administrative
domains
• P2P (Distributed) Control
Need for Gridflows
• Data-intensive and/or compute-intensive processes
• Long-running processes or pipelines on the Grid
• e.g. If job A completes, execute jobs x, y, z; else execute job B.
• Self-organization/management of data
• Semi-automation of data, storage distribution, curation processes
• e.g. After each data insert into a collection, update the meta-data
information about the collection or replicate the collection
• Knowledge Generation
• Offline data analysis and knowledge generation pipelines
• e.g. What can be inferred from the new seismology
graphs added to this collection? Which domain scientists will be
interested in studying these new possible pre-results?
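The first example above ("if job A completes, execute jobs x, y, z; else execute job B") can be sketched in a few lines. This is only a local, single-machine illustration using Python threads; the function name `run_gridflow` and the jobs are hypothetical, and a real gridflow would dispatch the jobs to autonomous grid resources.

```python
from concurrent.futures import ThreadPoolExecutor


def run_gridflow(job_a, on_success, on_failure):
    """Run job A; on success run the follow-up jobs in parallel, else run the fallback."""
    try:
        job_a()
    except Exception:
        return [on_failure()]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(job) for job in on_success]
        # Results are collected in submission order, not completion order.
        return [f.result() for f in futures]


results = run_gridflow(
    job_a=lambda: "A done",
    on_success=[lambda: "x", lambda: "y", lambda: "z"],
    on_failure=lambda: "B",
)
```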
SDSC Matrix Project
• CS Research & Development
• Gridflow Description, Data Grid Administration Rules
• Gridflow P2P protocols for Gridflow Server
Communication
• Development
• SRB Data Grid Web Services
• SRB Datagrid flow automation and provenance
• Theory → Practice
• Help in customized development & deployment of gridflow
concepts in scientific / grid applications
• Visibility for, and assistance with, standardization efforts at GGF
Advantages from Data Grid Perspective
• Reduces the Client-Server Communication
• The whole execution logic is sent to the server
• Fewer WAN messages
• Our experiments show a significant increase in performance
• Datagrid Information Lifecycle Management
• Autonomic: “Move data at 9:00 PM on weekdays and on
weekends”
• Data Grid Administration
• Power-users and Sophisticated Users
• Data Grid Administrator (Rules to manage data grid)
• Scientist or Librarian (Visualized data flow programming)
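The reduction in client-server communication can be illustrated by counting request messages. The sketch below uses a hypothetical in-process `CountingServer` as a stand-in for a remote server; it counts requests only and is not a performance measurement.

```python
class CountingServer:
    """Stand-in for a remote server; each call to handle() is one WAN request."""

    def __init__(self):
        self.messages = 0

    def handle(self, operations):
        self.messages += 1
        return [op() for op in operations]


def run_one_by_one(server, operations):
    """Client drives the logic: one request per operation."""
    results = []
    for op in operations:
        results.extend(server.handle([op]))
    return results


def run_as_flow(server, operations):
    """Whole execution logic shipped to the server in a single request."""
    return server.handle(list(operations))


ops = [lambda i=i: i * i for i in range(5)]

chatty, batched = CountingServer(), CountingServer()
chatty_results = run_one_by_one(chatty, ops)
batched_results = run_as_flow(batched, ops)
```

Both clients get the same results; the difference is five requests in the first case against one in the second.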
What they want?
We know the
business
(scientific) process
CyberInfrastructure is
all we care about (why bother
about atoms or DNA)
What they want?
Use DGL to describe
your process logic with
abstract references to
datagrid infrastructure
dependencies
Why a Gridflow Language?
• Infrastructure independent description
• Abstract references to hardware and cyberinfrastructure
• Description of execution flow logic
• Separate the execution flow logic from application logic
• e.g. MonteCarlo is an application; executing it 10
times, or until a variable becomes zero, is execution logic
• Procedural Rules associated with execution flow
• Provenance
• What happened, when, who, how …? (and querying)
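The separation of execution flow logic from application logic can be sketched as follows. `monte_carlo_step` is a stand-in application (a crude estimate of pi), while `run_n_times` and `run_until_zero` are hypothetical names for the two execution patterns mentioned above; neither is part of DGL.

```python
import random


def monte_carlo_step(rng):
    """Application logic (stand-in): estimate pi from 1000 random points."""
    inside = sum(
        1 for _ in range(1000) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / 1000


def run_n_times(task, n):
    """Execution logic: repeat a task a fixed number of times."""
    return [task() for _ in range(n)]


def run_until_zero(task, counter):
    """Execution logic: repeat while a flow variable is non-zero.

    The task receives the current value and returns the new one.
    """
    runs = 0
    while counter != 0:
        counter = task(counter)
        runs += 1
    return runs


rng = random.Random(0)
estimates = run_n_times(lambda: monte_carlo_step(rng), 10)
countdown_runs = run_until_zero(lambda c: c - 1, 3)
```

The application function never changes when the repetition policy does, which is the point of describing the flow separately.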
Gridflow Language Requirements
• High level Abstract descriptions
• Abstract description of cyberinfrastructure dependencies
• Simple yet flexible
• Flexible to describe complex requirements (no brute force)
• Gridflow dependency patterns
• Based on execution structure and data semantics
• (Parallel, Sequential, fork-new), (milestones, for-each,
switch-case)..
• Asynchronous execution
• For long-run requests
• Querying using existing standard
• XQuery
Gridflow Language Requirements
• Process meta data and annotations
• Runtime definition, update and querying of meta-data
• Runtime Management of Gridflows
• Stop gridflow at run time
• Partitioning
• Facility in language to divide a gridflow request to multiple
requests (Excellent Research Topic)
• Import descriptions
• Refer other gridflows in execution
Data Grid Language (DGL)
• XML based gridflow description
• Describes execution flow logic
• ECA-based rule description for execution
• ECA = Event, Condition, Action
• Querying of Status of Gridflow
• XQuery / Simple query of a Gridflow Execution
• Scoped variables and gridflow patterns
• For control of execution flow logic
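An Event-Condition-Action rule can be pictured with a minimal sketch. This is plain Python, not DGL syntax; the `Rule` class, `dispatch` function, event name, and collection path are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    event: str                          # Event: what happened
    condition: Callable[[dict], bool]   # Condition: should the rule fire?
    action: Callable[[dict], str]       # Action: what to do if it does


def dispatch(rules, event, context):
    """Run the action of every rule whose event matches and whose condition holds."""
    return [
        r.action(context)
        for r in rules
        if r.event == event and r.condition(context)
    ]


rules = [
    Rule(
        event="insert",
        condition=lambda ctx: ctx["collection"] == "/demo/seismology",
        action=lambda ctx: f"replicate {ctx['collection']}",
    )
]

fired = dispatch(rules, "insert", {"collection": "/demo/seismology"})
```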
DGL Requests
• Data Grid Flow
• An XML Structure that describes the execution logic,
associated procedural rules and grid environment
variables
• Status Query
• An XML Structure used to query the execution status of any
gridflow or sub-flow at any level of granularity
• A DGL or Matrix client sends any of these to the
Matrix Server
Data Grid Request
Annotations about
the Data Grid
Request
Can be either a
Flow or a Status
Query
Grid User
<GridUser>
<userID>Matrix-demo</userID>
<organization>
<organizationName>sdsc</organizationName>
</organization>
<challenge-Response>******</challenge-Response>
<homeDirectory>/home/Matrixdemo.sdsc</homeDirectory>
<defaultStorageResource>sdscunix</defaultStorageResource>
<phoneNumber>0</phoneNumber>
<e-mail>[email protected]</e-mail>
</GridUser>
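A client could read such a fragment with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a reduced copy of the element above (the credential, phone, and e-mail fields are left out); the function `read_grid_user` is a hypothetical helper and not part of the Matrix client API.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the GridUser element from the slide.
GRID_USER_XML = """
<GridUser>
  <userID>Matrix-demo</userID>
  <organization>
    <organizationName>sdsc</organizationName>
  </organization>
  <homeDirectory>/home/Matrixdemo.sdsc</homeDirectory>
  <defaultStorageResource>sdscunix</defaultStorageResource>
</GridUser>
"""


def read_grid_user(xml_text):
    """Return the main GridUser fields as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "userID": root.findtext("userID"),
        "organization": root.findtext("organization/organizationName"),
        "homeDirectory": root.findtext("homeDirectory"),
        "defaultStorageResource": root.findtext("defaultStorageResource"),
    }


user = read_grid_user(GRID_USER_XML)
```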
Grid Ticket
VO Info
Flow
• Scoped variables that can control the flow
• Logic used by the sub-members
• Sub-members that are the real execution statements
Using DG-Modeler
• GUI for dataflow programming
Gridflow Process I
Gridflow Description
Data Grid Language
End User using DGBuilder
Gridflow Process II
Abstract Gridflow using Data Grid Language → Planner → Concrete Gridflow using Data Grid Language
Gridflow Process III
Gridflow Processor
Concrete Gridflow
Using Data Grid Language
Gridflow P2P Network
Other Gridflow Research Projects
• GriPhyN Pegasus, Sphinx, Matrix
• Taverna (MyGrid)
• Kepler (also from SDSC)
• GridAnt
• …
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hand-on/Demo Session
DGMS Philosophy
• Collective view of
• Inter-organizational data
• Operations on datagrid space
• Local autonomy and global state consistency
• Collaborative datagrid communities
• Multiple administrative domains or “Grid Zones”
• Self-describing and self-manipulating data
• Horizontal and vertical behavior
• Loose coupling between data and behavior (dynamically)
• Relationships between a digital entity and its Physical
locations, Logical names, Meta-data, Access control,
Behavior, “Grid Zones”.
DGMS Research Issues
• Self-organization of datagrid communities
• Using knowledge relationships across the datagrids
• Inter-datagrid operations based on semantics of data in
the communities (different ontologies)
• High speed data transfer
• Terabytes to transfer
• Protocols, routers needed
• Latency Management
• Data source speed >> data sink speed
• Datagrid Constraints
• Data placement and scheduling
• How many replicas, where to place them…
Work Vision Ahead
Half-baked research ahead
Active Datagrid Collections
[Figure: three columns labeled Resources, Data Sets, and Behavior]
• Resources: National Lab, SDSC, University of Gators
• Data sets: 121.Event, Thit.xml, 121.Event, Hits.sql
• Behavior: getEvents(), addEvent()
Active Datagrid Collections
[Figure: the same resources and items, with two callouts]
• Callouts: Heterogeneous, distributed physical data; Dynamic or virtual data
• Items: 121.Event, Thit.xml, 121.Event, Hits.sql, getEvents(), addEvent()
• Resources: National Lab, SDSC, University of Gators
Active Datagrid Collections
[Figure: a logical collection placed over the distributed data]
• Callout: Logical Collection gives location and naming transparency
• Collection: myHEP-Collection, with Meta-data
• Items: 121.Event, Thit.xml, 121.Event, Hits.sql
• Resources: National Lab, SDSC, University of Gators
Active Datagrid Collections
[Figure: the logical collection extended with behavior]
• Callout: Now add behavior or services to this logical collection
• Collection: myHEP-Collection, with Meta-data, Horizontal Services, and Collection state and services
• Items: 121.Event, Thit.xml, 121.Event, Hits.sql, getEvents(), addEvent()
• Resources: National Lab, SDSC, University of Gators
Active Datagrid Collections
[Figure: the complete Active Datagrid Collection (ADC)]
• Callouts: ADC Logical view of data & operations; ADC specific Operations + Model View Controllers
• Collection: myHEP-Collection, with Meta-data, Horizontal Services, and Collection state and services
• Items: 121.Event, Thit.xml, 121.Event, Hits.sql, getEvents(), addEvent()
• Resources: National Lab, SDSC, University of Gators
Active Datagrid Collections
• Digital entities: Physical and virtual data present in the datagrid
• Meta-data: Standardized schema with domain specific schema extensions
• Services: Horizontal datagrid services and vertical domain specific services (portType) or pipelines (DGL)
• State: Events, collective state, mappings to domain services to be invoked
Related Technologies/Links
• A complete history of the Grid
• SDSC Storage Resource Broker
• Globus Data Grid
• The Legion Project
Global Grid Forum (GGF)
• Global Forum for Information Exchange and
Collaboration
• Promote and support the development and deployment of
Grid Technologies
• Creation and documentation of “best practices”, technical
specifications (standards), user experiences, …
• Modeled after Internet Standards Process (IETF, RFC
2026)
• http://www.ggf.org
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hand-on/Demo Session
Data Grid Mining
• Distributed data mining
• Underlying infrastructure heterogeneous (not uniform LAN
or bandwidth or memory)
• Mining software (algorithms) to take advantage of the
logical resource namespace to select execution site
• Can the mining software estimate and acquire required
cyber infrastructure resources before it starts? Grid
standards must be evolved to communicate this
infrastructure dependent information
• Co-location of dependent data or tasks; Distribution
(parallel execution) of independent tasks at different
domains
Data Grid Mining
• Using the Data Grid to mine data
• Replicating or selecting the right resources for mining
• Cost-based analysis for the best utilization of the
heterogeneous infrastructure
• Data, mining software, execution location
• Move data or code or execution location to alternative
location based on QoS and available budget
• Co-locating or distributing appropriate data mining steps
• e.g. NVO co-add at FNAL and SDSC (distribution + co-location)
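The "move data or code" decision can be illustrated with a toy cost model. The sketch assumes cost is simply transfer time plus compute time; the function names and all numbers are made-up assumptions, not measurements from any deployment.

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to ship a payload over a link of the given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def cheapest_plan(data_bytes, code_bytes, bandwidth,
                  compute_s_at_data, compute_s_at_remote):
    """Compare running where the data lives against shipping the data elsewhere."""
    plans = {
        "move code to data": transfer_seconds(code_bytes, bandwidth) + compute_s_at_data,
        "move data to code": transfer_seconds(data_bytes, bandwidth) + compute_s_at_remote,
    }
    best = min(plans, key=plans.get)
    return best, plans[best]


# 1 TB of data, 10 MB of mining code, 100 MB/s link;
# the remote site computes twice as fast as the data site.
best, cost = cheapest_plan(
    data_bytes=10**12,
    code_bytes=10**7,
    bandwidth=10**8,
    compute_s_at_data=7200,
    compute_s_at_remote=3600,
)
```

With these assumed numbers, moving the code costs about 7200 seconds against 13600 seconds for moving the data, so the slower site still wins. A fuller model would add the QoS and budget terms the slide mentions.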
Q & A Session; feedback
There is a significant difference
between being here in this room
today and flipping through the
slides on the internet later, so
let's make sure we all benefit
from the 3 hours we spend here
Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hand-on/Demo Session
For More Information
Arun Swaran Jagatheesan
San Diego Supercomputer Center
University of California, San Diego
[email protected]