Jim Demmel's "Societal Scale Information Systems" (PowerPoint transcript)

UC Santa Cruz
Societal Scale Information Systems
Jim Demmel, Chief Scientist
EECS and Math Depts.
www.citris.berkeley.edu
Outline
Scope
 Problems to be solved
 Overview of projects

Societal-Scale Information System (SIS)
Diagram: "client" information appliances and MEMS sensors at the edge, connected over Gigabit Ethernet to "server" massive clusters, providing scalable, reliable, secure services.
Desirable SIS Features / Problems to Solve
• Integrates diverse components seamlessly
• Easy to build new services from existing ones
• Adapts to interfaces/users
• Non-stop, always connected
• Secure
Projects
 Sahara
 Service Architecture for Heterogeneous Access, Resources, and
Applications
 Katz, Joseph, Stoica
 Oceanstore and Tapestry
 Kubiatowicz, Joseph
 ROC and iStore
 Recovery Oriented Computing and Intelligent Store
 Patterson, Yelick, Fox (Stanford)
 Millennium and PlanetLab
 Culler, Kubiatowicz, Stoica, Shenker
 Many faculty away on retreats
 www.cs.berkeley.edu/~bmiller/saharaRetreat.htm
 www.cs.berkeley.edu/~bmiller/ROCretreat.htm
The “Sahara” Project
 Service Architecture for Heterogeneous Access, Resources, and Applications
www.cs.berkeley.edu/~bmiller/saharaRetreat.html
Scenario: Service Composition
Diagram: a user in Tokyo (on NTT DoCoMo) and a user in Salt Lake City (on Sprint) access a JAL restaurant guide service composed from the Zagat Guide, a Babelfish translator, and the JAL UI.
Sahara Research Focus
 New mechanisms, techniques for end-to-end services w/
desirable, predictable, enforceable properties spanning
potentially distrusting service providers
 Tech architecture for service composition & inter-operation across
separate admin domains, supporting peering & brokering, and diverse
business, value-exchange, access-control models
 Functional elements
 Service discovery
 Service-level agreements
 Service composition under constraints
 Redirection to a service instance
 Performance measurement infrastructure
 Constraints based on performance, access control, accounting/billing/settlements
 Service modeling and verification
Technical Challenges
 Trust management and behavior verification
 Meet promised functionality, performance, availability
 Adapting to network dynamics
 Actively respond to shifting server-side workloads and network
congestion, based on pervasive monitoring & measurement
 Awareness of network topology to drive service selection
 Adapting to user dynamics
 Resource allocation responsive to client-side workload variations
 Resource provisioning and management
 Service allocation and service placement
 Interoperability across multiple service providers
 Interworking across similar services deployed by different providers
Service Composition Models
 Cooperative
 Individual component service providers interact in distributed
fashion, with distributed responsibility, to provide an end-to-end
composed service
 Brokered
 Single provider, the Broker, uses functionalities provided by
underlying service providers, encapsulates these to compose an
end-to-end service
 Examples
 Cooperative: roaming among separate mobile networks
 Brokered: JAL restaurant guide
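As a minimal illustration of the brokered model (the JAL restaurant-guide case above), the Python sketch below has a single broker encapsulate two underlying providers; the class and method names are hypothetical stand-ins, not Sahara interfaces.

```python
# Hypothetical sketch of the brokered composition model (names are illustrative,
# not Sahara APIs): one broker encapsulates the component service providers.

class ZagatGuide:
    def lookup(self, city: str) -> str:
        return f"Top restaurants in {city}"        # stand-in for the real guide service

class Translator:
    def translate(self, text: str, lang: str) -> str:
        return f"[{lang}] {text}"                  # stand-in for a translation service

class RestaurantGuideBroker:
    """The broker composes guide + translation into one end-to-end service."""
    def __init__(self, guide: ZagatGuide, translator: Translator):
        self.guide = guide
        self.translator = translator

    def guide_for(self, city: str, lang: str) -> str:
        listing = self.guide.lookup(city)                 # underlying provider 1
        return self.translator.translate(listing, lang)   # underlying provider 2

broker = RestaurantGuideBroker(ZagatGuide(), Translator())
print(broker.guide_for("Tokyo", "ja"))   # the user only ever sees the broker
```

In the cooperative model, by contrast, the component providers would interact with each other directly rather than through one encapsulating broker.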
Mechanisms for Service Composition (1)
 Measurement-based Adaptation
 Examples
 Host distance monitoring and estimation service
 Universal In-box: exchange network and server load
 Content Distribution Networks: redirect client to closest service instance
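As a hedged illustration of the CDN example above (redirect the client to the closest instance), the sketch below simply chooses the instance with the smallest measured latency; the replica names and numbers are made up.

```python
# Illustrative measurement-based redirection: send the client to the service
# instance with the lowest measured round-trip time. Replica names and latency
# numbers are hypothetical.

def closest_instance(latency_ms: dict) -> str:
    """Return the instance with the smallest measured latency."""
    return min(latency_ms, key=latency_ms.get)

# Pretend these measurements came from a host-distance monitoring service.
measurements = {"replica-tokyo": 12.0, "replica-sf": 85.0, "replica-nyc": 140.0}
print(closest_instance(measurements))   # -> replica-tokyo
```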
Mechanisms for Service Composition (2)
 Utility-based Resource Allocation Mechanisms
 Examples
 Auctions to dynamically allocate resources; applied for spectrum/bandwidth resource assignments
 Congestion pricing (same idea for power)
– Voice port allocation to user-initiated calls in H.323 gateway/Voice over IP service management
– Wireless LAN bandwidth allocation and management
– H.323 gateway selection, redirection, and load balancing for Voice over IP services
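One concrete way an auction can allocate a resource such as a bandwidth slot is a sealed-bid second-price auction; the sketch below is illustrative only, and the bidders and prices are invented.

```python
# Illustrative sealed-bid second-price auction for one bandwidth slot.
# Bidders and prices are invented for the example.

def second_price_auction(bids: dict) -> tuple:
    """Highest bidder wins but pays the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

bids = {"voip-gateway-A": 4.0, "wlan-user-B": 2.5, "cdn-C": 3.2}   # $/Mbps, hypothetical
winner, price = second_price_auction(bids)
print(f"{winner} gets the bandwidth slot at {price} per Mbps")     # voip-gateway-A wins, pays 3.2
```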
Mechanisms for Service Composition (3)
 Trust Mgmt/Verification of Service & Usage
 Authentication, Authorization, Accounting Services
 Credential transformations to enable cross-domain service invocation
 Federated admin domains with credential transformation rules based on established agreements
 AAA server makes authorization decisions
 Service Level Agreement Verification
 Verification and usage monitoring to ensure properties specified in SLA are being honored
 Border routers monitoring control traffic from different providers to detect malicious route advertisements
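A hedged sketch of the credential-transformation idea above: a home-domain role is mapped to a locally honored role per an established agreement, and the AAA-style decision is made against local policy. The rule table, domains, and actions are hypothetical.

```python
# Hypothetical credential transformation for cross-domain service invocation:
# map (home domain, home role) to a role the visited domain honors, then apply
# the visited domain's own policy. All names and rules are invented.

TRANSFORM_RULES = {
    ("docomo.example", "gold-subscriber"): "premium-guest",
    ("sprint.example", "subscriber"): "guest-user",
}

LOCAL_POLICY = {"premium-guest": {"read", "stream"}, "guest-user": {"read"}}

def authorize(home_domain: str, home_role: str, action: str) -> bool:
    """AAA-style decision: transform the credential, then check local policy."""
    local_role = TRANSFORM_RULES.get((home_domain, home_role))
    return local_role is not None and action in LOCAL_POLICY.get(local_role, set())

print(authorize("docomo.example", "gold-subscriber", "stream"))  # True
print(authorize("sprint.example", "subscriber", "stream"))       # False
```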
OceanStore
Global-Scale Persistent Storage
OceanStore Context:
Ubiquitous Computing
 Computing everywhere:
 Desktop, Laptop, Palmtop
 Cars, Cellphones
 Shoes? Clothing? Walls?
 Connectivity everywhere:
 Rapid growth of bandwidth in the interior of the net
 Broadband to the home and office
 Wireless technologies such as CDMA, satellite, laser
Questions about information:
 Where is persistent information stored?
 Want: Geographic independence for availability, durability, and freedom to adapt to
circumstances
 How is it protected?
 Want: Encryption for privacy, signatures for authenticity, and Byzantine
commitment for integrity
 Can we make it indestructible?
 Want: Redundancy with continuous repair and redistribution for long-term
durability
 Is it hard to manage?
 Want: automatic optimization, diagnosis and repair
 Who owns the aggregate resources?
 Want: Utility Infrastructure!
Utility-based Infrastructure
Diagram: a federation of providers, e.g. Canadian OceanStore, Sprint, AT&T, Pac Bell, and IBM.
 Transparent data service provided by federation
of companies:
 Monthly fee paid to one service provider
 Companies buy and sell capacity from each other
OceanStore Assumptions
 Untrusted Infrastructure:
 The OceanStore is comprised of untrusted components
 Only ciphertext within the infrastructure
 Responsible Party:
 Some organization (i.e. service provider) guarantees that your data is
consistent and durable
 Not trusted with content of data, merely its integrity
 Mostly Well-Connected:
 Data producers and consumers are connected to a high-bandwidth
network most of the time
 Exploit multicast for quicker consistency when possible
 Promiscuous Caching:
 Data may be cached anywhere, anytime
 Optimistic Concurrency via Conflict Resolution:
 Avoid locking in the wide area
 Applications use object-based interface for updates
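A toy sketch of optimistic concurrency via conflict resolution: replicas accept updates without wide-area locking, and an application-supplied resolver later merges conflicting versions. The set-union resolver below is illustrative only, not OceanStore's actual update model.

```python
# Toy illustration of conflict resolution in place of wide-area locking: two
# replicas accept different updates while disconnected, then a resolver merges
# them. The set-union resolver is only an example policy.

def resolve(versions):
    """Example resolver for a set-valued object: union of all concurrent writes."""
    merged = set()
    for v in versions:
        merged |= v
    return merged

replica_a = {"addr-book/alice", "addr-book/bob"}     # update accepted at replica A
replica_b = {"addr-book/alice", "addr-book/carol"}   # conflicting update at replica B
print(resolve([replica_a, replica_b]))               # conflict resolved by merging
```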
First Implementation [Java]:
 Event-driven state-machine model
 Included Components
 Initial floating replica design
– Conflict resolution and Byzantine agreement
 Routing facility (Tapestry)
– Bloom filter location algorithm
– Plaxton-based locate and route data structures
 Introspective gathering of tacit info and adaptation
– Language for introspective handler construction
– Clustering, prefetching, adaptation of network routing
 Initial archival facilities
– Interleaved Reed-Solomon codes for fragmentation (see the sketch after this list)
– Methods for signing and validating fragments
 Target Applications
 Unix file-system interface under Linux (“legacy apps”)
 Email application, proxy for web caches, streaming multimedia applications
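The archival facility above uses interleaved Reed-Solomon codes; as a much simpler stand-in that conveys the same idea, the sketch below splits a block into fragments plus one XOR parity fragment and rebuilds any single lost fragment.

```python
# The real prototype uses interleaved Reed-Solomon codes; this simpler stand-in
# uses a single XOR parity fragment, which can rebuild any one lost fragment.

from functools import reduce

def fragment(data: bytes, k: int = 4) -> list:
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                                   # ceiling division
    frags = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*frags))
    return frags + [parity]

def recover_missing(frags: list) -> list:
    """Rebuild a single missing fragment (marked None) by XOR of the survivors."""
    i = frags.index(None)
    survivors = [f for f in frags if f is not None]
    frags[i] = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*survivors))
    return frags

pieces = fragment(b"everyone's data, one big utility")
pieces[2] = None                                             # lose one fragment
print(b"".join(recover_missing(pieces)[:4]).rstrip(b"\0"))   # original data back
```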
OceanStore Conclusions
 OceanStore: everyone’s data, one big utility
 Global Utility model for persistent data storage
 OceanStore assumptions:
 Untrusted infrastructure with a responsible party
 Mostly connected with conflict resolution
 Continuous on-line optimization
 OceanStore properties:
 Provides security, privacy, and integrity
 Provides extreme durability
 Lower maintenance cost through redundancy, continuous
adaptation, self-diagnosis and repair
 Large scale system has good statistical properties
OceanStore Prototype
Running with 5 other sites worldwide
Recovery-Oriented Computing Philosophy
• People/HW/SW failures are facts to cope with,
not problems to solve (“Peres’s Law”)
• Improving recovery/repair improves availability
– UnAvailability = MTTR / MTTF (assuming MTTR << MTTF)
– 1/10th MTTR just as valuable as 10X MTBF
• Recovery/repair is how we cope with above facts
• Since major Sys Admin job is recovery after
failure, ROC also helps with maintenance/TCO,
and Total Cost of Ownership is 5-10X HW/SW
• www.cs.berkeley.edu/~bmiller/ROCretreat.htm
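A small worked example of the UnAvailability = MTTR / MTTF approximation above, showing why cutting MTTR by 10x buys as much availability as a 10x increase in MTTF; the hour figures are made up.

```python
# Worked example of UnAvailability = MTTR / MTTF (valid when MTTR << MTTF).
# The hour figures below are made up for illustration.

def unavailability(mttr_hours: float, mttf_hours: float) -> float:
    return mttr_hours / mttf_hours

base           = unavailability(mttr_hours=1.0, mttf_hours=1000.0)   # 0.001
tenth_mttr     = unavailability(0.1, 1000.0)                         # 0.0001
ten_times_mttf = unavailability(1.0, 10000.0)                        # 0.0001
print(base, tenth_mttr, ten_times_mttf)   # cutting MTTR 10x helps as much as 10x MTTF
```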
ROC approach
1. Collect data to see why services fail
• Operators cause > 50% failures
2. Create benchmarks to measure dependability
• Benchmarks inspire and enable researchers, name names to spur commercial improvements
3. Margin of Safety (from Civil Engineering)
• Overprovision to handle the unexpected
4. Create and Evaluate techniques to help
• Undo for system administrators in the field
• Partitioning to isolate errors, upgrade in the field
• Fault insertion to test emergency systems in the field
Availability benchmarking 101
 Availability benchmarks quantify system behavior under
failures, maintenance, recovery
Diagram: QoS metric vs. time; after a failure, QoS degrades below the normal-behavior band (99% conf.) and returns to normal once the repair time has elapsed.
 They require
 A realistic (fault) workload for the system
 Fault-injection to simulate failures
 Human operators to perform repairs
 New winner is the system fastest to recover, vs. simply the fastest
Source: A. Brown, and D. Patterson, “Towards availability benchmarks: a case
study of software RAID systems,” Proc. USENIX, 18-23 June 2000
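A hedged sketch of the benchmark idea in Python: track a QoS metric over time around an injected fault and report the degradation and repair time. The trace below is simulated; a real benchmark drives a live system with a fault workload and human operators performing repairs.

```python
# Simulated availability micro-benchmark: given a QoS trace around one injected
# fault, report how far QoS degraded and how long it stayed degraded (repair time).
# The trace and the 1% "normal behavior" tolerance are illustrative.

def availability_benchmark(qos_trace, normal, tolerance=0.01):
    """Return (degradation, repair_time) measured in trace samples."""
    degraded = [i for i, q in enumerate(qos_trace) if q < normal * (1 - tolerance)]
    if not degraded:
        return 0.0, 0
    worst = min(qos_trace[i] for i in degraded)
    return normal - worst, degraded[-1] - degraded[0] + 1

# Steady QoS of 100 requests/s, a fault injected at sample 4, recovery by sample 8.
trace = [100, 100, 100, 100, 40, 55, 70, 90, 100, 100]
print(availability_benchmark(trace, normal=100))   # -> (60, 4)
```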
ISTORE – Hardware Techniques for Availability
 Cluster of Storage Oriented Nodes (SON)
 Scalable, tolerates partial failures, automatic redundancy
 Heavily instrumented hardware
 Sensors for temp, vibration, humidity, power, intrusion
 Independent diagnostic processor on each node
 Remote control of power; collects environmental data
 Diagnostic processors connected via independent network
 On-demand network partitioning/isolation
 Allows testing, repair of online system
 Managed by diagnostic processor
 Built-in fault injection capabilities
 Used for hardware introspection
 Important for AME benchmarking
ISTORE – Software Techniques for Availability
 Reactive introspection
 “Mining” available system data
 Proactive introspection
 Isolation + fault insertion => test recovery code
 Semantic redundancy
 Use of coding and application-specific checkpoints
 Self-Scrubbing data structures
 Check (and repair?) complex distributed structures
 Load adaptation for performance faults
 Dynamic load balancing for “regular” computations
 Benchmarking
 Define quantitative evaluations for AME
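As an illustration of self-scrubbing plus proactive fault insertion, the toy structure below keeps redundant copies, deliberately corrupts one, and repairs it by majority vote during a scrub pass; it is a stand-in for the idea, not ISTORE code.

```python
# Toy self-scrubbing structure: keep redundant copies of a value, let a fault be
# inserted deliberately, and repair by majority vote during a scrub pass.

class ScrubbedCounter:
    """Keeps three replicas of one value; scrub() repairs divergent replicas."""
    def __init__(self, value: int = 0):
        self.replicas = [value, value, value]

    def increment(self):
        self.replicas = [v + 1 for v in self.replicas]

    def scrub(self) -> int:
        majority = max(set(self.replicas), key=self.replicas.count)
        self.replicas = [majority] * 3        # repair any divergent replica
        return majority

c = ScrubbedCounter(7)
c.replicas[1] = 999      # proactive fault insertion: corrupt one replica
print(c.scrub())         # -> 7, and the corrupted replica is repaired
```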
ISTORE Status
 ISTORE Hardware
 All 80 Nodes (boards) manufactured
 PCB backplane: in layout
 32 node system running May 2002
 Software
 2-node system running -- boots OS
 Diagnostic Processor SW and device driver done
 Network striping done; fault adaptation ongoing
 Load balancing for performance heterogeneity done
 Benchmarking
 Availability benchmark example complete
 Initial maintainability benchmark complete, revised strategy
underway
ISTORE Prototype
UCB Clusters
 Millennium Central Cluster
 99 Dell 2300/6400/6450 Xeon Dual/Quad: 332 processors
 Total: 211GB memory, 3TB disk
 Myrinet 2000 + 1000Mb fiber ethernet
 OceanStore/ROC cluster, Astro cluster, Math cluster, Cory cluster, more
 CITRIS Cluster 1: 3/2002 deployment (Intel Donation)
 4 Dell Precision 730 Itanium Duals: 8 processors
 Total: 8GB memory, 128GB disk
 Myrinet 2000 + 1000Mb copper ethernet
 CITRIS Cluster 2: 9/2002 deployment (Intel Donation)
 ~128 Dell McKinley class Duals: 256 processors
 Total: ~512GB memory, ~8TB disk
 Myrinet 2000 + 1000Mb copper ethernet
 Network Expansion Needed! - and underway
 UCB shift from NorTel plus expansion
 Ganglia cluster management in the SuSE distribution
 hundreds of companies using it
CITRIS Network Rollout
Millennium Top Users
 800 users total on central cluster, many of which are CITRIS users
 75 major users for 2/2002: average 65% total CPU utilization
 Independent component analysis – machine learning algorithms (fbach)
 ns-2, a packet-level network simulator (machi)
 Parallel AI algorithms for controlling a 4-legged robot (ang)
 Image recognition (lwalker): 2 hours on cluster vs. 2 weeks on local resources
 Network simulations for infrastructure to track moving objects over a wide area (mdenny)
 Analyzing trends in BGP routing tables (sagarwal)
 Boundary extraction and segmentation of natural images (dmartin)
 Optical simulation and high quality rendering (adamb)
 Titanium – compiler and runtime system design for high performance parallel programming languages (bonachea)
 AMANDA – neutrino detection from polar ice core samples (amanda)
http://ganglia.millennium.berkeley.edu
Planet-Lab Motivation
 A new class of services & applications is emerging that spread
over a sizable fraction of the web
 CDNs as the first examples
 Peer-to-peer, ...
 Architectural components are beginning to emerge
 Distributed hash tables to provide scalable translation (see the sketch after this list)
 Distributed storage, caching, instrumentation, mapping, ...
 The next internet will be created as an overlay on the current one
 as was the last one
 it will be defined by its services, not its transport
 translation, storage, caching, event notification, management
 There is NO vehicle to try out the next n great ideas in this area
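A minimal sketch of the distributed-hash-table idea mentioned above: keys are hashed onto a ring of node identifiers and routed to the first node at or past the key (consistent hashing). The node names are hypothetical, not actual PlanetLab hosts.

```python
# Minimal consistent-hashing lookup: hash node names and keys onto one ring and
# route each key to the first node at or past it. Node names are hypothetical.

import hashlib
from bisect import bisect_left

def h(name: str, bits: int = 32) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

nodes = sorted(["node1.mit.example", "node2.berkeley.example",
                "node3.princeton.example"], key=h)
ring = [h(n) for n in nodes]   # sorted node identifiers

def lookup(key: str) -> str:
    """Return the node responsible for key (first node clockwise from h(key))."""
    i = bisect_left(ring, h(key)) % len(nodes)
    return nodes[i]

print(lookup("/services/cdn/index.html"))   # deterministic owner for this key
```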
Structure of the PlanetLab
 >1000 viewpoints on the internet
 10-100 resource-rich sites at network crossroads
 Typical use involves a slice across substantial subset of nodes
 Dual-role by design
 Research testbed
 large set of geographically distributed machines
 diverse & realistic network conditions
 classic ‘controlled’ SIGCOMM, INFOCOM studies
 Deployment platform
 services: design → evaluation → client base → composite services
 nodes: proxy path → physical path
 make it useful to people
Initial Researchers (Mar 02)
Washington: Tom Anderson, Steven Gribble, David Wetherall
MIT: Frans Kaashoek, Hari Balakrishnan, Robert Morris, David Anderson
Berkeley: Ion Stoica, Joe Hellerstein, Eric Brewer, John Kubiatowicz
Intel Research: David Culler, Timothy Roscoe, Sylvia Ratnasamy, Gaetano Borriello, Satya, Milan Milenkovic
Duke: Amin Vahdat, Jeff Chase
Princeton: Larry Peterson, Randy Wang, Vivek Pai
Rice: Peter Druschel
Utah: Jay Lepreau
CMU: Srini Seshan, Hui Zhang
UCSD: Stefan Savage
Columbia: Andrew Campbell
ICIR: Scott Shenker, Mark Handley, Eddie Kohler
see http://www.cs.berkeley.edu/~culler/planetlab
Initial Planet-Lab Candidate Sites
Map of candidate sites: UBC, UW, WI, Chicago, UPenn, Harvard, Utah, Intel Seattle, Intel, MIT, Intel OR, Intel Berkeley, Cornell, CMU, ICIR, Princeton, UCB, St. Louis, Columbia, Duke, UCSB, Washu, KY, UCLA, Rice, GIT, UCSD, UT, ISI, Uppsala, Copenhagen, Cambridge, Amsterdam, Karlsruhe, Barcelona, Beijing, Tokyo, Melbourne.
Planned as of July 2002