
Lecture 12
Scalable Computing
Graduate Computer Architecture
Fall 2005
Shih-Hao Hung
Dept. of Computer Science and Information Engineering
National Taiwan University
Scalable Internet Services
• Lessons from Giant-Scale Services
http://www.computer.org/internet/ic2001/w4046abs.htm
– Access anywhere, anytime.
– Availability via multiple devices.
– Groupware support.
– Lower overall cost.
– Simplified service updates.
Giant-Scale Services: Components
Network Interface
• A simple network connecting two machines
• Message
Network Bandwidth vs Message Size
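The relationship this slide's figure illustrates can be summarized with a simple latency/bandwidth model: effective bandwidth approaches the link's peak only for large messages. A minimal sketch, with the startup latency and peak bandwidth chosen purely for illustration (they are not figures from the slide):

```python
# Effective bandwidth vs. message size under the simple model
#   transfer_time(n) = startup latency + n / peak bandwidth.
# Both constants below are illustrative assumptions.

LATENCY_S = 50e-6              # per-message startup cost: 50 microseconds (assumed)
PEAK_BW_BYTES_PER_S = 100e6    # link peak: 100 MB/s (assumed)

def effective_bw_mbps(msg_bytes: float) -> float:
    """Delivered bandwidth in MB/s for one message of msg_bytes."""
    transfer_time = LATENCY_S + msg_bytes / PEAK_BW_BYTES_PER_S
    return msg_bytes / transfer_time / 1e6

for size in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{size:>9} bytes -> {effective_bw_mbps(size):6.1f} MB/s")
```

Small messages are dominated by the fixed startup cost, which is why aggregating messages matters on real networks.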
Switch: Connecting More than 2 Machines
Switch
Network Topologies
• Relative performance for 64 nodes
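The relative comparison the slide makes is usually summarized by two static measures: diameter (worst-case hop count) and bisection width (links cut when the machine is split in half). A small sketch computing both for 64 nodes from the standard formulas; the particular topologies listed are a common illustrative set, not necessarily the ones on the slide:

```python
# Diameter and bisection width for common topologies at N = 64 nodes,
# from the standard formulas (ring, 8x8 mesh/torus, 6-D hypercube).
import math

N = 64
side = math.isqrt(N)        # 8 x 8 grid for the 2-D mesh and torus
dim = int(math.log2(N))     # 6-dimensional hypercube

topologies = {
    "ring":      (N // 2,         2),
    "2-D mesh":  (2 * (side - 1), side),
    "2-D torus": (side,           2 * side),
    "hypercube": (dim,            N // 2),
}
print(f"{'topology':<12}{'diameter':>9}{'bisection':>11}")
for name, (diameter, bisection) in topologies.items():
    print(f"{name:<12}{diameter:>9}{bisection:>11}")
```

Lower diameter and higher bisection width generally buy better performance at 64 nodes, at the cost of more links and switch ports.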
Packets
Load Management
• Balancing loads (load balancer)
– Round-robin DNS
– Layer-4 (Transport layer, e.g. TCP) switches
– Layer-7 (Application layer) switches
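As a concrete reference point for the options above, here is a minimal round-robin sketch; rotating through a fixed pool is the same policy round-robin DNS applies to the address list it returns. The server names and the class are placeholders for illustration; a layer-4 or layer-7 switch makes an analogous choice per TCP connection or per HTTP request, with correspondingly more information available.

```python
# Minimal round-robin load balancer: each call returns the next server
# in the pool.  Server names are placeholders, not real hosts.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._next = cycle(servers)

    def pick(self) -> str:
        """Server that should receive the next request."""
        return next(self._next)

balancer = RoundRobinBalancer(["web1", "web2", "web3"])
for request_id in range(6):
    print(f"request {request_id} -> {balancer.pick()}")
```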
The 7 OSI (Open System Interconnection) Layers
• Application (Layer 7): This layer supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific; it provides services for file transfers, e-mail, and other network software. Telnet and FTP are examples.
• Presentation (Layer 6): This layer provides independence from
differences in data representation (e.g., encryption) by translating
from application to network format, and vice versa. The presentation
layer works to transform data into the form that the application layer
can accept. This layer formats and encrypts data to be sent across a
network, providing freedom from compatibility problems. It is
sometimes called the syntax layer.
• Session (Layer 5): This layer establishes, manages and terminates
connections between applications. The session layer sets up,
coordinates, and terminates conversations, exchanges, and
dialogues between the applications at each end. It deals with session
and connection coordination.
The 7 OSI (Open System Interconnection) Layers
• Transport (Layer 4): This layer provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer. TCP is the best-known example.
• Network (Layer 3): This layer provides switching and routing technologies,
creating logical paths, known as virtual circuits, for transmitting data from
node to node. Routing and forwarding are functions of this layer, as well as
addressing, internetworking, error handling, congestion control and packet
sequencing.
• Data Link (Layer 2): At this layer, data packets are encoded and decoded into
bits. It furnishes transmission protocol knowledge and management and
handles errors in the physical layer, flow control and frame synchronization.
The data link layer is divided into two sublayers: The Media Access Control
(MAC) layer and the Logical Link Control (LLC) layer. The MAC sublayer
controls how a computer on the network gains access to the data and
permission to transmit it. The LLC layer controls frame synchronization, flow
control and error checking.
• Physical (Layer 1): This layer conveys the bit stream (electrical impulse, light, or radio signal) through the network at the electrical and mechanical level. It
provides the hardware means of sending and receiving data on a carrier,
including defining cables, cards and physical aspects. Fast Ethernet, RS232,
and ATM are protocols with physical layer components.
OSI
Simple Web Farm
Search Engine Cluster
High Availability
• High availability is a major driving requirement
behind giant-scale system design.
– Uptime: typically measured in nines, and traditional
infrastructure systems such as the phone system aim for
four or five nines (“four nines” implies 0.9999 uptime, or
less than 60 seconds of downtime per week).
– Mean-time-between-failures (MTBF)
– Mean-time-to-repair (MTTR)
– uptime = (MTBF – MTTR)/MTBF
– yield = queries completed/queries offered
– harvest = data available/complete data
– DQ Principle: Data per query × queries per second → constant (a worked sketch follows this list)
– Graceful Degradation
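A short worked sketch of these metrics, using made-up numbers: uptime follows from MTBF and MTTR with the formula above, and losing half of the DQ capacity forces a trade between yield and harvest, which is exactly what graceful degradation manages.

```python
# Worked example of the availability metrics above.
# All input numbers are illustrative assumptions, not measurements.

mtbf_hours = 1000.0        # mean time between failures (assumed)
mttr_hours = 1.0           # mean time to repair (assumed)
uptime = (mtbf_hours - mttr_hours) / mtbf_hours
print(f"uptime = {uptime:.4f}")           # 0.9990, i.e. "three nines"

# DQ principle: data per query x queries per second is roughly constant.
# Suppose a fault removes half of the installation's DQ capacity.
full_qps = 1000                 # queries/s at full capacity (assumed)
data_per_query = 1000           # data items examined per query (assumed)
remaining_dq = (full_qps * data_per_query) / 2

# Choice 1: keep harvest at 1.0 and shed queries, so yield drops.
yield_ = (remaining_dq / data_per_query) / full_qps
# Choice 2: keep yield at 1.0 and search less data, so harvest drops.
harvest = (remaining_dq / full_qps) / data_per_query

print(f"yield if harvest stays 1.0  = {yield_:.2f}")   # 0.50
print(f"harvest if yield stays 1.0  = {harvest:.2f}")  # 0.50
```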
Clusters in Giant-Scale Services
– Scalability
– Cost/performance
– Independent components
Cluster Example
Lesson Learned
• Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
• Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.
• Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.
• Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.
• Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.
• Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.
• Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
Deep Scientific Computing
Kramer et al., IBM J. R&D, March 2004
• High-performance computing (HPC)
– Resolution of a simulation
– Complexity of an analysis
– Computational power
– Data storage
• New paradigms of computing
– Grid computing
– Network
Themes (1/2)
• Deep science applications must now integrate simulation with
data analysis. In many ways this integration is inhibited by
limitations in storing, transferring, and manipulating the data
required.
• Very large, scalable, high-performance archives, combining
both disk and tape storage, are required to support this deep
science. These systems must respond to large amounts of
data—both many files and some very large files.
• High-performance shared file systems are critical to large
systems. The approach here separates the project into three
levels—storage systems, interconnect fabric, and global file
systems. All three levels must perform well, as well as scale, in
order to provide applications with the performance they need.
• New network protocols are necessary as the data flows are
beginning to exceed the capability of yesterday's protocols. A
number of elements can be tuned and improved in the interim,
but long-term growth requires major adjustments.
Themes (2/2)
• Data management methods are key to being able to organize
and find the relevant information in an acceptable time.
• Security approaches are needed that allow openness and
service while providing protection for systems. The security
methods must understand not just the application levels but
also the underlying functions of storage and transfer systems.
• Monitoring and control capabilities are necessary to keep pace
with the system improvements. This is key, as the application
developers for deep computing must be able to drill through
virtualization layers in order to understand how to achieve the
needed performance.
Simulation: Time and Space
More Space
NERSC System
High-Performance Storage System
(HPSS)
Networking for HPC Systems
• End-to-end network performance is a product of
– Application behavior
– Machine capabilities
– Network path
– Network protocol
– Competing traffic
• Difficult to ascertain the limiting factor without
monitoring/diagnostic capabilities
– End host issues
– Routers and gateways
– Deep Security
End Host Issues
• Throughput limit
– Time to copy data from user memory to kernel
across memory bus (2 memory cycles)
– Time to copy from kernel to NIC (1 I/O cycle)
– If limited by memory BW:
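The last bullet breaks off before the bound itself; below is a hedged sketch of the accounting, taking the bullets literally (two memory-bus crossings for the user-to-kernel copy, one I/O-bus crossing for the DMA to the NIC) and using bandwidth figures from the neighboring slides. The min() bound and the exact crossing counts are simplifying assumptions, not a formula from the lecture.

```python
# Rough ceiling on send throughput for a host that copies data
# user -> kernel (2 memory-bus crossings) and DMAs kernel -> NIC
# (1 I/O-bus crossing).  The accounting is a simplifying assumption.

def send_throughput_bound_mb(mem_bw_mb: float, io_bw_mb: float) -> float:
    mem_limited = mem_bw_mb / 2.0   # each payload byte crosses the memory bus twice
    io_limited = io_bw_mb           # each payload byte crosses the I/O bus once
    return min(mem_limited, io_limited)

# DDR400-class memory (~850 MB/s) with 32-bit/33 MHz PCI (132 MB/s): PCI limits.
print(send_throughput_bound_mb(850, 132))     # 132 MB/s, ~1 Gbit/s ceiling
# ~1100 MB/s memory with 64-bit/133 MHz PCI-X (1056 MB/s): the copies limit.
print(send_throughput_bound_mb(1100, 1056))   # 550 MB/s, ~4.4 Gbit/s ceiling
```

Even before DMA and protocol overheads, this kind of ceiling is consistent with the later observation that a PCI-X host reaches only about half of a 10 Gb NIC's line rate.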
Memory & I/O Bandwidth
• Memory BW
– DDR: 650-2500 MB/s
• I/O BW
– 32-bit/33 MHz PCI: 132 MB/s
– 64-bit/33 MHz PCI: 264 MB/s
– 64-bit/66 MHz PCI: 528 MB/s
– 64-bit/133 MHz PCI-X: 1056 MB/s
– PCI-E x1: ~1 Gbit/s
– PCI-E x16: ~16 Gbit/s
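The PCI rows in this list follow directly from bus width times clock rate; a quick sketch that reproduces them (the PCI-X entry is computed at an effective 132 MHz, i.e. 4 × 33 MHz, which is how the 1056 MB/s figure above arises):

```python
# Peak parallel-bus bandwidth = (width in bytes) x (clock in MHz), in MB/s.
def bus_bw_mb(width_bits: int, clock_mhz: float) -> float:
    return (width_bits / 8) * clock_mhz

buses = [
    ("32-bit / 33 MHz PCI",    32,  33),
    ("64-bit / 33 MHz PCI",    64,  33),
    ("64-bit / 66 MHz PCI",    64,  66),
    ("64-bit / 132 MHz PCI-X", 64, 132),
]
for name, width, clock in buses:
    print(f"{name:<24} {bus_bw_mb(width, clock):6.0f} MB/s")   # 132, 264, 528, 1056
```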
Network Bandwidth
• VT600, 32-bit/33 MHz PCI, DDR400, AMD 2700+, 850 MB/s
memory BW
– 485 Mbit/s
• 64-bit/133 MHz PCI-X, 1100-2500 MB/s memory BW
– Limited to 5000 Mbit/s
– Also limited by DMA overhead
– Only reach half of 10Gb NIC
• I/O architecture
– On-chip NIC?
• OS architecture
– Reduce the number of memory copies? Zero copy?
– TCP/IP overhead
– TCP/IP offload
– Maximum Transfer Unit (MTU)
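The MTU item above matters because per-packet costs (headers plus per-packet protocol and interrupt processing) are amortized over the payload, so larger frames raise the achievable rate. A rough sketch; the 40-byte header reflects plain TCP/IPv4, while the per-packet CPU cost and 10 Gbit/s link are illustrative assumptions:

```python
# Effect of MTU on achievable goodput when every packet carries a fixed
# per-packet cost.  Overhead and link-rate numbers are assumptions.

LINK_GBPS = 10.0
HEADER_BYTES = 40          # TCP + IPv4 headers without options
PER_PACKET_CPU_US = 2.0    # assumed per-packet processing cost

def goodput_gbps(mtu_bytes: int) -> float:
    payload_bytes = mtu_bytes - HEADER_BYTES
    wire_time_us = mtu_bytes * 8 / (LINK_GBPS * 1000)     # serialization time
    packet_time_us = max(wire_time_us, PER_PACKET_CPU_US) # slower of wire vs. CPU
    return payload_bytes * 8 / (packet_time_us * 1000)

for mtu in (1500, 9000):
    print(f"MTU {mtu:>4}: ~{goodput_gbps(mtu):.1f} Gbit/s")
```

With these assumptions, standard 1500-byte frames leave the host CPU-bound at well under line rate, while 9000-byte jumbo frames get close to 10 Gbit/s, which is why MTU sits alongside zero copy and TCP offload in the list above.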
Conclusion
• High performance storage and network
• End host performance
• Data management
• Security
• Monitoring and control
Petaflop Computing
Science-driven System Architecture
• Leadership Computing Systems
– Processor performance
– Interconnect performance
– Software: scalability & optimized lib
• Blue Planet
– Redesigned Power5-based HPC system
• Single core node
• High memory bandwidth per processor
– ViVA (Virtual Vector Architecture) allows the eight
processors in a node to be treated as a single processor
with peak performance of 60+ Gigaflop/s.
ViVA-2: Application Accelerator
• Accelerates particular application-specific or
domain-specific features.
– Irregular access patterns
– High load/store issue rates
– Low cache line utilization
• ISA enhancement
– Instructions to support prefetching of irregular data accesses
– Instructions to support sparse, non-cache-resident loads
– More registers for software pipelining
– Instructions to initiate many dense/indexed/sparse loads
• Proper compiler support will be a critical component
Leadership Computing Applications
• Major computational advances
– Nanoscience
– Combustion
– Fusion
– Climate
– Life Sciences
– Astrophysics
• Teamwork
– Project team
– Facilities
– Computational scientist
Supercomputers 1993-2000
• Clusters vs MPPs
Clusters
• Cost-performance
Total Cost of Ownership (TCO)
Google
• Built with lots of PCs
• 80 PCs in one rack
Google
• Performance
– Latency: <0.5s
– Bandwidth, scaled with # of users
• Cost
– Cost of PC keeps shrinking
– Switches, Rack, etc.
– Power
• Reliability
– Software failure
– Hardware failure (1/10 of SW failure)
• DRAM (1/3)
• Disks (2/3)
– Switch failure
– Power outage
– Network outage
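A closing sketch of why these commodity failure rates are tolerable at Google's scale: with each index shard replicated on a few PCs, the chance that every replica of some shard is down at once stays small. The per-node failure probability, replica count, and shard count below are assumptions for illustration, not Google figures.

```python
# Probability that a replicated service loses a whole shard, assuming
# independent node failures.  All numbers are illustrative assumptions.

p_node_down = 0.01     # probability a given PC is down at any instant (assumed)
replicas = 3           # copies of each shard on distinct PCs (assumed)
shards = 1000          # number of index shards (assumed)

p_shard_dark = p_node_down ** replicas                  # all replicas down at once
p_any_shard_dark = 1 - (1 - p_shard_dark) ** shards     # at least one shard dark

print(f"P(a given shard unavailable) = {p_shard_dark:.1e}")
print(f"P(any of {shards} shards dark) = {p_any_shard_dark:.2%}")
```

Even when a shard does go dark, the effect is reduced harvest rather than a total outage, which ties back to the harvest/yield framing from the giant-scale services discussion.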