Sun Cluster Architecture


Sun Clusters
Ira Pramanick
Sun Microsystems, Inc.
Outline

Today’s vs. tomorrow’s clusters
- How they are used today and how this will change

Characteristics of future clusters
- Clusters as general-purpose platforms

How they will be delivered
- Sun’s Full Moon architecture

Summary & conclusions
Clustering Today

Mostly for HA

Little sharing of
resources

Exposed topology

Hard to use

Layered on OS

Reactive Solution
[Diagram: today's cluster -- nodes behind IP switches on the LAN/WAN]

Clustering Tomorrow

[Diagram: tomorrow's cluster -- a central console managing nodes with global networking and global storage over the LAN/WAN]
Sun Full Moon architecture

Turns clusters into general-purpose platforms
- Cluster-wide file systems, devices, networking
- Cluster-wide load-balancing and resource management

Integrated solution
- HW, system SW, storage, applications, support/service

Embedded in Solaris 8

Builds on existing Sun Cluster line
- Sun Cluster 2.2 -> Sun Cluster 3.0
Characteristics of tomorrow’s clusters

High-availability

Cluster-wide resource sharing: files, devices, LAN

Flexibility & Scalability

Close integration with the OS

Load-balancing & Application management

Global system management

Integration of all parts: HW, SW, applications,
support, HA guarantees
High Availability

End-to-end application availability
- What matters: Applications as seen by network clients
are highly-available
- Enable Service Level Agreements

Failures will happen
- SW, HW, operator errors, unplanned maintenance, etc.
- Mask failures from applications as much as possible
- Mask application failures from clients
High Availability...

No single point of failure
- Use multiple components for HA & scalability

Need strong HA foundation integrated into OS
- Node group membership, with quorum
- Well-defined failure boundaries--no shared memory
- Communication integrated with membership
- Storage fencing
- Transparently restartable services
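The membership-with-quorum and fencing bullets above can be illustrated with a toy majority-vote check (a minimal sketch; the strict-majority rule and the node names are illustrative assumptions, not Sun Cluster's actual quorum algorithm, which also supports quorum devices):

```python
def has_quorum(votes_present: int, votes_total: int) -> bool:
    """A partition may continue only if it holds a strict majority of
    the configured votes -- this prevents both halves of a split
    cluster from claiming the shared storage (split-brain)."""
    return votes_present > votes_total // 2

def surviving_partition(partitions, votes_total):
    """Return the member list of the partition that keeps quorum;
    the other partitions are fenced off from storage. None if no
    partition holds a majority."""
    for members in partitions:
        if has_quorum(len(members), votes_total):
            return members
    return None

# A 4-node cluster splits 3/1: only the 3-node side keeps quorum.
split = [["node1", "node2", "node3"], ["node4"]]
print(surviving_partition(split, 4))  # ['node1', 'node2', 'node3']
```

Note that with an even vote count a 2/2 split leaves no majority on either side, which is why real clusters add a tie-breaking quorum device.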
High Availability...

Applications are the key
- Most applications are not cluster-aware
  - Mask most errors from applications
  - Restart when node fails, with no recompile
- Provide support for cluster-aware apps
  - Cluster APIs, fast communication

Disaster recovery
- Campus-separation and geographical data replication
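Restarting a failed node's applications elsewhere, with no recompile, can be sketched as a placement function (a hypothetical illustration; the `failover` name and least-loaded policy are assumptions for this sketch, not Sun Cluster's API):

```python
def failover(services_by_node, healthy_nodes):
    """Reassign every service hosted on a failed node to the least
    loaded healthy node. The application itself is unchanged -- it is
    simply restarted on the target node."""
    load = {n: len(services_by_node.get(n, [])) for n in healthy_nodes}
    placement = {}
    for node, services in services_by_node.items():
        if node in healthy_nodes:
            continue  # node is alive; its services stay put
        for svc in services:
            target = min(load, key=load.get)  # least loaded survivor
            placement[svc] = target
            load[target] += 1
    return placement

# node1 dies; its NFS and http services restart on node2.
hosted = {"node1": ["nfs", "http"], "node2": ["ldap"]}
print(failover(hosted, ["node2"]))  # {'nfs': 'node2', 'http': 'node2'}
```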
Resource Sharing

What is important to applications?
- Ability to run on any node in cluster
- Uniform global access to all storage and network
- Standard system APIs

What to hide?
- Hardware topology, disk interconnect, LAN adapters,
hardwired physical names
Resource Sharing...

What is needed?
- Cluster-wide access to existing file systems, volumes,
devices, tapes
- Cluster-wide access to LAN/WAN
- Standard OS APIs: no application rewrite/recompile

Use SMP model
- Apps run on machine (not “CPU 5, board 3, bus 2”)
- Logical resource names independent of actual path
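The "logical resource names independent of actual path" bullet amounts to a cluster-wide name registry. A minimal sketch (the `/global/...` name and the device paths are made-up examples, not real Sun Cluster namespace entries):

```python
# Location-independent naming: applications open a stable logical
# name; the cluster resolves it to whatever physical path is valid
# on the node the application happens to run on.
GLOBAL_DEVICES = {
    "/global/dsk/d10": {"node1": "/dev/dsk/c0t0d0s2",
                        "node2": "/dev/dsk/c1t2d0s2"},
}

def resolve(logical_name: str, local_node: str) -> str:
    """Map a cluster-wide logical device name to the physical path
    visible from a given node."""
    return GLOBAL_DEVICES[logical_name][local_node]

# The same logical name works no matter where the app is placed,
# so hardware topology can change without touching the application.
print(resolve("/global/dsk/d10", "node1"))  # /dev/dsk/c0t0d0s2
print(resolve("/global/dsk/d10", "node2"))  # /dev/dsk/c1t2d0s2
```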
Resource Sharing...

Cluster-wide location-independent resource access
- Run applications on any node
- Failover/switchover apps to any node
- Global job/work queues, print queues, etc.
- Change/maintain hardware topology without affecting
applications

But this need not require a fully-connected SAN
- Storage can be reached over the main interconnect through
software support
Flexibility

Business needs change all the time
- Therefore, platform must be flexible

System must be dynamic -- all done on-line
- Resources can be added and removed
- Dynamic reconfiguration of each node
  - Hot-plug in and out of IO, CPUs, memory, storage, etc.
- Dynamic reconfiguration between nodes
  - More nodes, load-balancing, application reconfiguration
Scalability

Cluster SMP nodes

Choose nodes as big as needed to scale application
- Need expansion room within nodes too

Don’t use clustering exclusively to scale
applications
- Interconnect speed slower than backplane speed
- Few cluster-aware applications
- Clustering a large number of small nodes is like herding
chickens
Close integration with OS

Currently: multi-CPU SMP support in OS
- Does not make sense otherwise

Next step: cluster support in the OS
- Next dimension of OS support: across nodes

Clustering will become part of the OS
- Not a loosely-integrated layer
Advantages of OS integration

Ease of use
- Same administration model, commands, installation

Availability
- Integrated heartbeat, membership, fencing, etc.

Performance
- In-kernel support, inter-node/process messaging, etc.

Leverage
- All OS features/support available for clustering
Load-balancing

Load-balancing done at various levels
- Built-in network load-balancing
  - For example, incoming http requests; TCP/IP bandwidth
- Transactions at middleware level
- Global job queues

All nodes have access to all storage and network
- Therefore any node can be eligible to perform the work
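Because any node can perform the work, incoming requests can simply be spread across the membership. A round-robin sketch (the dispatcher shape and node names are illustrative assumptions; the built-in network load-balancing would work below the application, not as a Python callable):

```python
import itertools

def make_dispatcher(nodes):
    """Round-robin dispatcher: since every node has access to all
    storage and networking, any node is eligible for any request."""
    ring = itertools.cycle(nodes)
    return lambda request: (request, next(ring))

dispatch = make_dispatcher(["node1", "node2", "node3"])
for req in ["GET /a", "GET /b", "GET /c", "GET /d"]:
    print(dispatch(req))
# The fourth request wraps around to node1 again.
```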
Resource management

Cluster-wide resource management
- CPU, network, interconnect, IO bandwidth
- Cluster-wide application priorities

Global resource requirements guaranteed locally
- Need per-node resource management

High-availability is not just making sure an
application is started
- Must guarantee resources to finish job
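The guarantee above implies admission control: an application is only started on a node that can actually reserve the resources it needs. A minimal sketch (the resource names and `place` policy are assumptions for illustration, not a Sun resource-manager interface):

```python
def can_place(app_needs, node_free):
    """Admit an application only if the node can reserve every
    resource it needs; starting it without the reservation would
    break the guarantee that the job can finish."""
    return all(node_free.get(r, 0) >= amt for r, amt in app_needs.items())

def place(app_needs, cluster_free):
    """Pick the first node with enough headroom and reserve on it.
    Returns None when no node can honor the guarantee."""
    for node, free in cluster_free.items():
        if can_place(app_needs, free):
            for r, amt in app_needs.items():
                free[r] -= amt  # reserve, enforced per-node
            return node
    return None

free = {"node1": {"cpu": 2, "mem": 4}, "node2": {"cpu": 8, "mem": 32}}
print(place({"cpu": 4, "mem": 16}, free))  # node2
```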
Global cluster management

System management
- Perform administrative functions once
- Maintain same model as single node
- Same tools/commands as base OS--minimize retraining

Hide complexity
- Most administrative operations should not deal with
HW topology
- But still enable low-level diagnostics and management
A Total Clustering Solution
[Diagram: integration of all components -- applications, middleware, cluster OS software, system management, servers, storage, cluster interconnect, service and support, HA guarantee practice]
Roadmap

Sun Cluster 2.2: currently shipping
- Solaris 2.6, Solaris 7, Solaris 8 3/00
- 4 nodes
- Year 2000 compliant
- Choice of servers, storage, interconnects, topologies, networks
- 10 km separation

Sun Cluster 3.0
- External Alpha 6/99, Beta Q1 CY‘00, GA 2H CY‘00
- 8 nodes
- Extensive set of new features: cluster fs, global devices, network
load-balancing, new APIs (RGM), diskless application failover,
SyMON integration
Wide Range of Applications

Agents developed, sold, and supported by Sun
- Databases (Oracle, Sybase, Informix, Informix XPS), SAP
- Netscape (http, news, mail, LDAP), Lotus Notes
- NFS, DNS, Tivoli

Sold and supported by 3rd parties
- IBM DB2 and DB2 PE, BEA Tuxedo

Agents developed and supported by Sun Professional Services
- A large list, including many in-house applications

Toolkit for agent development
- Application management API, training, Sun PS support
Full Moon clustering

[Diagram: Full Moon clustering embedded in Solaris 8 -- global file system, global devices, global networking, global storage, built-in load balancing, global resource management, global application management, dynamic domains, cluster APIs, single management console, wide range of HW]
Summary

Clusters as general-purpose platforms
- Shift from reactive to proactive clustering solution

Clusters must be built on a strong foundation
- Embed into a solid operating system
- Full Moon -- bakes clustering technology into Solaris

Make clusters easy to use
- Hide complexity, hardware details

Must be an integrated solution
- From platform, service/support, to HA guarantees