Jim Demmel's "Societal Scale Information Systems" (PowerPoint)
UC Santa Cruz
Societal Scale Information Systems
Jim Demmel, Chief Scientist
EECS and Math Depts.
www.citris.berkeley.edu
Outline
Scope
Problems to be solved
Overview of projects
Societal-Scale Information System (SIS)
[Diagram: information appliances and MEMS sensors ("clients") connect over gigabit Ethernet to massive clusters ("servers") that provide scalable, reliable, secure services]
Desirable SIS Features / Problems to Solve
• Integrates diverse components seamlessly
• Easy to build new services from existing ones
• Adapts to interfaces/users
• Non-stop, always connected
• Secure
Projects
Sahara
Service Architecture for Heterogeneous Access, Resources, and
Applications
Katz, Joseph, Stoica
OceanStore and Tapestry
Kubiatowicz, Joseph
ROC and iStore
Recovery Oriented Computing and Intelligent Store
Patterson, Yelick, Fox (Stanford)
Millennium and PlanetLab
Culler, Kubiatowicz, Stoica, Shenker
Many faculty away on retreats
www.cs.berkeley.edu/~bmiller/saharaRetreat.htm
www.cs.berkeley.edu/~bmiller/ROCretreat.htm
The “Sahara” Project
Service
Architecture for
Heterogeneous
Access,
Resources, and
Applications
www.cs.berkeley.edu/~bmiller/saharaRetreat.html
Scenario: Service Composition
[Diagram: a user in Tokyo (via NTT DoCoMo) and a user in Salt Lake City (via Sprint) reach a JAL restaurant guide service composed from a JAL UI, the Babelfish translator, and the Zagat guide]
Sahara Research Focus
New mechanisms, techniques for end-to-end services w/
desirable, predictable, enforceable properties spanning
potentially distrusting service providers
Tech architecture for service composition & inter-operation across
separate admin domains, supporting peering & brokering, and diverse
business, value-exchange, access-control models
Functional elements
Service discovery
Service-level agreements
Service composition under constraints
Redirection to a service instance
Performance measurement infrastructure
Constraints based on performance, access control,
accounting/billing/settlements
Service modeling and verification
Technical Challenges
Trust management and behavior verification
Meet promised functionality, performance, availability
Adapting to network dynamics
Actively respond to shifting server-side workloads and network
congestion, based on pervasive monitoring & measurement
Awareness of network topology to drive service selection
Adapting to user dynamics
Resource allocation responsive to client-side workload variations
Resource provisioning and management
Service allocation and service placement
Interoperability across multiple service providers
Interworking across similar services deployed by different providers
Service Composition Models
Cooperative
Individual component service providers interact in distributed
fashion, with distributed responsibility, to provide an end-to-end
composed service
Brokered
Single provider, the Broker, uses functionalities provided by
underlying service providers, encapsulates these to compose an
end-to-end service
Examples
Cooperative: roaming among separate mobile networks
Brokered: JAL restaurant guide
Mechanisms for Service Composition (1)
Measurement-based Adaptation
Examples
Host distance monitoring and estimation service
Universal In-box: exchange network and server load
Content Distribution Networks: redirect client to closest service
instance
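The closest-instance redirection above can be sketched in a few lines. This is an illustrative model, not a real CDN's API: the instance names and RTT figures are made up, and `measure_rtt_ms` stands in for the host distance monitoring and estimation service mentioned earlier.

```python
import random

# Hypothetical measured RTTs (ms); a real deployment would query the
# host distance monitoring/estimation service instead of this stub.
def measure_rtt_ms(instance: str) -> float:
    sample = {"sfo": 12.0, "nyc": 71.0, "tok": 145.0}
    return sample[instance] + random.uniform(0.0, 2.0)  # measurement noise

def redirect(candidates: list[str], samples: int = 3) -> str:
    """Redirect the client to the instance with the lowest median RTT."""
    def median_rtt(inst: str) -> float:
        rtts = sorted(measure_rtt_ms(inst) for _ in range(samples))
        return rtts[samples // 2]
    return min(candidates, key=median_rtt)

print(redirect(["sfo", "nyc", "tok"]))  # "sfo": lowest base RTT
```

Taking the median of several samples makes the choice robust to transient spikes, which matters when redirection decisions feed back into load.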
Mechanisms for Service Composition (2)
Utility-based Resource Allocation Mechanisms
Examples
Auctions to dynamically allocate resources; applied for
spectrum/bandwidth resource assignments
Congestion pricing (same idea for power)
– Voice port allocation to user-initiated calls in H.323
gateway/Voice over IP service management
– Wireless LAN bandwidth allocation and management
– H.323 gateway selection, redirection, and load balancing for
Voice over IP services
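The auction idea can be illustrated with a sealed-bid second-price (Vickrey) auction, a common mechanism for this kind of dynamic allocation; the bidder names and amounts below are invented for the example.

```python
def second_price_auction(bids: dict[str, float]) -> tuple[str, float]:
    """Sealed-bid second-price auction: the highest bidder wins but pays
    the second-highest bid, which makes truthful bidding a dominant
    strategy -- attractive for spectrum/bandwidth assignment."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

winner, price = second_price_auction({"voip_gw": 9.0, "video": 7.5, "bulk": 2.0})
print(winner, price)  # voip_gw wins, pays 7.5
```

The same pricing logic extends to repeated rounds, which is how congestion pricing adjusts allocations as demand shifts.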
Mechanisms for Service Composition (3)
Trust Mgmt/Verification of Service & Usage
Authentication, Authorization, Accounting Services
Credential transformations to enable cross-domain service invocation
Federated admin domains with credential transformation rules based
on established agreements
AAA server makes authorization decisions
Service Level Agreement Verification
Verification and usage monitoring to ensure properties specified in SLA
are being honored
Border routers monitoring control traffic from different providers to
detect malicious route advertisements
OceanStore
Global-Scale Persistent Storage
OceanStore Context:
Ubiquitous Computing
Computing everywhere:
Desktop, Laptop, Palmtop
Cars, Cellphones
Shoes? Clothing? Walls?
Connectivity everywhere:
Rapid growth of bandwidth in the interior of the net
Broadband to the home and office
Wireless technologies such as CDMA, satellite, laser
Questions about information:
Where is persistent information stored?
Want: Geographic independence for availability, durability, and freedom to adapt to
circumstances
How is it protected?
Want: Encryption for privacy, signatures for authenticity, and Byzantine
commitment for integrity
Can we make it indestructible?
Want: Redundancy with continuous repair and redistribution for long-term
durability
Is it hard to manage?
Want: automatic optimization, diagnosis and repair
Who owns the aggregate resources?
Want: Utility Infrastructure!
Utility-based Infrastructure
[Diagram: a federation of providers (Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM) jointly hosting the OceanStore]
Transparent data service provided by a federation of companies:
Monthly fee paid to one service provider
Companies buy and sell capacity from each other
OceanStore Assumptions
Untrusted Infrastructure:
The OceanStore is comprised of untrusted components
Only ciphertext within the infrastructure
Responsible Party:
Some organization (e.g., a service provider) guarantees that your data is
consistent and durable
Not trusted with content of data, merely its integrity
Mostly Well-Connected:
Data producers and consumers are connected to a high-bandwidth
network most of the time
Exploit multicast for quicker consistency when possible
Promiscuous Caching:
Data may be cached anywhere, anytime
Optimistic Concurrency via Conflict Resolution:
Avoid locking in the wide area
Applications use object-based interface for updates
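Optimistic concurrency with conflict resolution can be sketched as follows. The class and method names here are illustrative, not OceanStore's actual object interface: each update records the version it was based on, and a concurrent update is detected and merged by an application-supplied resolver instead of taking a wide-area lock.

```python
class VersionedObject:
    """Toy optimistic-concurrency object (illustrative names)."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def update(self, base_version, new_value, resolve):
        if base_version == self.version:   # no concurrent writer: apply
            self.value = new_value
        else:                              # stale base: merge, don't lock
            self.value = resolve(self.value, new_value)
        self.version += 1
        return self.value

obj = VersionedObject({"a"})
obj.update(0, {"a", "b"}, resolve=set.union)           # clean update
merged = obj.update(0, {"a", "c"}, resolve=set.union)  # conflict: merged
print(sorted(merged))  # ['a', 'b', 'c']
```

The resolver here is a set union; real applications supply semantics-aware merge rules through the object-based update interface.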
First Implementation [Java]:
Event-driven state-machine model
Included Components
Initial floating replica design
Conflict resolution and Byzantine agreement
Routing facility (Tapestry)
Bloom Filter location algorithm
Plaxton-based locate and route data structures
Introspective gathering of tacit info and adaptation
Language for introspective handler construction
Clustering, prefetching, adaptation of network routing
Initial archival facilities
Interleaved Reed-Solomon codes for fragmentation
Methods for signing and validating fragments
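Interleaved Reed-Solomon coding is too involved for a slide note, but a single-parity XOR erasure code shows the principle: any one lost fragment is recoverable, and Reed-Solomon generalizes this to many simultaneous erasures. This is a toy sketch, not OceanStore's archival code.

```python
from functools import reduce

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k fragments plus one XOR parity fragment."""
    assert len(data) % k == 0, "pad data to a multiple of k first"
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags + [parity]

def decode(frags: list, k: int) -> bytes:
    """Reconstruct the data; tolerates one missing (None) fragment."""
    missing = [i for i, f in enumerate(frags) if f is None]
    assert len(missing) <= 1, "single-parity code tolerates one erasure"
    if missing and missing[0] < k:   # rebuild a lost data fragment
        others = [f for f in frags if f is not None]
        frags[missing[0]] = bytes(reduce(lambda a, b: a ^ b, col)
                                  for col in zip(*others))
    return b"".join(frags[:k])

frags = encode(b"archival data!", 2)  # 2 data fragments + 1 parity
frags[0] = None                       # lose a fragment
print(decode(frags, 2))               # b'archival data!'
```

Signing each fragment (as the slide's validation methods do) lets a reader reject corrupted fragments before reconstruction.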
Target Applications
Unix file-system interface under Linux (“legacy apps”)
Email application, proxy for web caches, streaming multimedia applications
OceanStore Conclusions
OceanStore: everyone’s data, one big utility
Global Utility model for persistent data storage
OceanStore assumptions:
Untrusted infrastructure with a responsible party
Mostly connected with conflict resolution
Continuous on-line optimization
OceanStore properties:
Provides security, privacy, and integrity
Provides extreme durability
Lower maintenance cost through redundancy, continuous
adaptation, self-diagnosis and repair
Large scale system has good statistical properties
OceanStore Prototype
Running with 5 other sites worldwide
Recovery-Oriented Computing Philosophy
• People/HW/SW failures are facts to cope with,
not problems to solve (“Peres’s Law”)
• Improving recovery/repair improves availability
– UnAvailability = MTTR / MTTF (assuming MTTR << MTTF)
– 1/10th MTTR is just as valuable as 10X MTBF
• Recovery/repair is how we cope with above facts
• Since major Sys Admin job is recovery after
failure, ROC also helps with maintenance/TCO,
and Total Cost of Ownership is 5-10X HW/SW
• www.cs.berkeley.edu/~bmiller/ROCretreat.htm
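The claim that a 10x shorter MTTR is worth as much as a 10x longer MTTF can be checked with a few lines of arithmetic; the hour figures below are illustrative, not from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR); for MTTR << MTTF,
    unavailability ~= MTTR / MTTF, the slide's approximation."""
    return mttf_hours / (mttf_hours + mttr_hours)

base        = availability(mttf_hours=1_000,  mttr_hours=1)
better_mttf = availability(mttf_hours=10_000, mttr_hours=1)    # 10x MTTF
better_mttr = availability(mttf_hours=1_000,  mttr_hours=0.1)  # 1/10th MTTR

# both improvements cut unavailability by the same factor of ~10
print(1 - base, 1 - better_mttf, 1 - better_mttr)
```

Both improved cases give identical unavailability (MTTR/MTTF shrinks tenfold either way), which is the point of the "1/10th MTTR just as valuable as 10X MTBF" bullet.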
ROC approach
1. Collect data to see why services fail
   • Operators cause > 50% of failures
2. Create benchmarks to measure dependability
   • Benchmarks inspire and enable researchers, name names to spur commercial improvements
3. Margin of Safety (from Civil Engineering)
   • Overprovision to handle the unexpected
4. Create and evaluate techniques to help
   • Undo for system administrators in the field
   • Partitioning to isolate errors, upgrade in the field
   • Fault insertion to test emergency systems in the field
Availability benchmarking 101
Availability benchmarks quantify system behavior under
failures, maintenance, recovery
[Figure: QoS metric over time; behavior stays within a normal band (99% conf.), drops after a failure (QoS degradation), then returns to normal over the repair time]
They require
A realistic (fault) workload for the system
Fault-injection to simulate failures
Human operators to perform repairs
New winner is the system fastest to recover, vs. the fastest performer
Source: A. Brown, and D. Patterson, “Towards availability benchmarks: a case
study of software RAID systems,” Proc. USENIX, 18-23 June 2000
ISTORE –
Hardware Techniques for Availability
Cluster of Storage Oriented Nodes (SON)
Scalable, tolerates partial failures, automatic redundancy
Heavily instrumented hardware
Sensors for temp, vibration, humidity, power, intrusion
Independent diagnostic processor on each node
Remote control of power; collects environmental data for diagnosis
Diagnostic processors connected via independent network
On-demand network partitioning/isolation
Allows testing, repair of online system
Managed by diagnostic processor
Built-in fault injection capabilities
Used for hardware introspection
Important for AME benchmarking
ISTORE
Software Techniques for Availability
Reactive introspection
“Mining” available system data
Proactive introspection
Isolation + fault insertion => test recovery code
Semantic redundancy
Use of coding and application-specific checkpoints
Self-Scrubbing data structures
Check (and repair?) complex distributed structures
Load adaptation for performance faults
Dynamic load balancing for “regular” computations
Benchmarking
Define quantitative evaluations for AME
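Load adaptation for performance faults can be sketched as proportional re-partitioning: each node's share of the next batch tracks its measured throughput, so a degraded node automatically receives less work. The node names and rates below are made up for illustration.

```python
def partition(total_items: int, rates: dict[str, float]) -> dict[str, int]:
    """Split total_items across nodes in proportion to measured rates."""
    total_rate = sum(rates.values())
    shares = {n: int(total_items * r / total_rate) for n, r in rates.items()}
    # hand any rounding remainder to the fastest node
    fastest = max(rates, key=rates.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares

print(partition(1000, {"n0": 100.0, "n1": 100.0}))  # even split
# n1 suffers a performance fault: its share shrinks on the next round
print(partition(1000, {"n0": 100.0, "n1": 25.0}))
```

Re-running this each round turns static "regular" computations into dynamically balanced ones, which is what the slide's dynamic load balancing refers to.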
ISTORE Status
ISTORE Hardware
All 80 Nodes (boards) manufactured
PCB backplane: in layout
32 node system running May 2002
Software
2-node system running -- boots OS
Diagnostic Processor SW and device driver done
Network striping done; fault adaptation ongoing
Load balancing for performance heterogeneity done
Benchmarking
Availability benchmark example complete
Initial maintainability benchmark complete, revised strategy
underway
ISTORE Prototype
UCB Clusters
Millennium Central Cluster
99 Dell 2300/6400/6450 Xeon Dual/Quad: 332 processors
Total: 211GB memory, 3TB disk
Myrinet 2000 + 1000Mb fiber ethernet
OceanStore/ROC cluster, Astro cluster, Math cluster, Cory cluster, more
CITRIS Cluster 1: 3/2002 deployment (Intel Donation)
4 Dell Precision 730 Itanium Duals: 8 processors
Total: 8GB memory, 128GB disk
Myrinet 2000 + 1000Mb copper ethernet
CITRIS Cluster 2: 9/2002 deployment (Intel Donation)
~128 Dell McKinley class Duals: 256 processors
Total: ~512GB memory, ~8TB disk
Myrinet 2000 + 1000Mb copper ethernet
Network Expansion Needed! - and underway
UCB shift from NorTel plus expansion
Ganglia cluster management in SuSE distribution
hundreds of companies using it
CITRIS Network Rollout
Millennium Top Users
800 users total on central cluster, many of which are CITRIS users
75 major users for 2/2002: average 65% total CPU utilization
Independent component analysis – machine learning algorithms (fbach)
ns-2, a packet-level network simulator (machi)
parallel AI algorithms for controlling 4-legged robot (ang)
Image recognition (lwalker) 2 hours on cluster vs. 2 weeks on local resources
Network simulations for infrastructure to track moving objects over a wide area
(mdenny)
Analyzing trends in BGP routing tables (sagarwal)
Boundary extraction and segmentation of natural images (dmartin)
Optical simulation and high quality rendering (adamb)
Titanium – compiler and runtime system design for high performance parallel
programming languages (bonachea)
AMANDA – neutrino detection from polar ice core samples (amanda)
http://ganglia.millennium.berkeley.edu
Planet-Lab Motivation
A new class of services & applications is emerging that spread
over a sizable fraction of the web
CDNs as the first examples
Peer-to-peer, ...
Architectural components are beginning to emerge
Distributed hash tables to provide scalable translation
Distributed storage, caching, instrumentation, mapping, ...
The next internet will be created as an overlay on the current
one
as did the last one
it will be defined by its services, not its transport
translation, storage, caching, event notification, management
There is NO vehicle to try out the next n great ideas in this area
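The distributed hash tables mentioned above can be sketched with a minimal consistent-hashing ring: keys map to the first node clockwise on a hash circle, so adding or removing a node only remaps a small fraction of keys. Node names here are illustrative, and real systems add virtual nodes and replication.

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    """Hash a string onto the ring (SHA-1 as an arbitrary stable hash)."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hashing ring: one point per node."""
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        hashes = [p for p, _ in self.points]
        i = bisect_right(hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.lookup("some-object-guid"))  # deterministic owner for this key
```

Tapestry and other overlay-routing schemes refine this idea with locality-aware, prefix-based routing rather than a flat ring.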
Structure of the PlanetLab
>1000 viewpoints on the internet
10-100 resource-rich sites at network crossroads
Typical use involves a slice across substantial subset of nodes
Dual-role by design
Research testbed
large set of geographically distributed machines
diverse & realistic network conditions
classic ‘controlled’ sigcomm, infocomm studies
Deployment platform
services: design -> evaluation -> client base -> composite services
nodes: proxy path -> physical path
make it useful to people
Initial Researchers (Mar 02)
Washington
Tom Anderson
Steven Gribble
David Wetherall
MIT
Frans Kaashoek
Hari Balakrishnan
Robert Morris
David Andersen
Berkeley
Ion Stoica
Joe Hellerstein
Eric Brewer
John Kubi
Intel Research
David Culler
Timothy Roscoe
Sylvia Ratnasamy
Gaetano Borriello
Satya
Milan Milenkovic
Duke
Amin Vahdat
Jeff Chase
Princeton
Larry Peterson
Randy Wang
Vivek Pai
see http://www.cs.berkeley.edu/~culler/planetlab
Rice
Peter Druschel
Utah
Jay Lepreau
CMU
Srini Seshan
Hui Zhang
UCSD
Stefan Savage
Columbia
Andrew
Campbell
ICIR
Scott Shenker
Mark Handley
Eddie Kohler
Initial Planet-Lab Candidate Sites
UBC
UW
WI
Chicago
UPenn
Harvard
Utah
Intel Seattle
Intel
MIT
Intel OR
Intel Berkeley
Cornell
CMU
ICIR
Princeton
UCB
St. Louis
Columbia
Duke
UCSB
Washu
KY
UCLA
Rice GIT
UCSD
UT
ISI
Uppsala
Copenhagen
Cambridge
Amsterdam
Karlsruhe
Barcelona
Beijing
Tokyo
Melbourne
Planned as of July 2002