How Computer Architecture Trends
May Affect Future Distributed Systems
Mark D. Hill
Computer Sciences Department
University of Wisconsin--Madison
http://www.cs.wisc.edu/~markhill
PODC ‘00 Invited Talk
(C) 2000 Mark D. Hill
Three Questions
• What is a System Area Network (SAN)
and how will it affect clusters?
– E.g., InfiniBand
• How fat will multiprocessor servers be
and how do we build larger ones?
– E.g. Wisconsin Multifacet’s Multicast & Timestamp Snooping
• Future of multiprocessor servers & clusters?
– A merging of both?
PODC00: Computer Architecture Trends
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends
Technology Push: Moore’s Law
• What do the following intervals have in common?
– Prehistory to 2000
– 2001 to 2002
• Answer: Equal progress in absolute processor speed
(and more doubling 2003-4, 2005-6, etc.)
– Consider salary doubling
• Corollary: Cost halves every two years
– Jim Gray: In a decade you can buy a computer
for less than its sales tax today
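The arithmetic behind both claims can be checked with a short sketch. It assumes an idealized doubling every two years; the helper `speed_at` and the choice to normalise year-2000 speed to 1.0 are illustrative and not from the talk.

```python
# Idealized Moore's Law: speed doubles (and cost halves) every two years.
DOUBLING_PERIOD_YEARS = 2  # assumption read off the slide's two-year intervals

def speed_at(year, base_year=2000):
    """Relative processor speed, normalised so that speed is 1.0 in base_year."""
    return 2.0 ** ((year - base_year) / DOUBLING_PERIOD_YEARS)

# Progress from prehistory to 2000 is everything accumulated so far.
progress_to_2000 = speed_at(2000) - 0.0
# Progress during 2001-2002 is one more doubling.
progress_2001_2002 = speed_at(2002) - speed_at(2000)

# Price of a fixed capability after a decade of halving every two years.
cost_fraction_after_decade = 0.5 ** (10 / DOUBLING_PERIOD_YEARS)

print(progress_to_2000, progress_2001_2002)  # the two absolute gains are equal
print(cost_fraction_after_decade)            # 1/32 of today's price
```

Under these assumptions the machine costs about 3% of today's price after ten years, which is below a sales tax of several percent.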
Application Pull
• Should use computers in currently wasteful ways
– Already computers in electric razors & greeting cards
• New business models
– B2C, B2B, C2B, C2C
– Mass customization
• More proactive (beyond interactive) [Tennenhouse]
– Today: P2C where P==Person & C==Computer
– More C2P: mattress adjusts to save your back
– More C2C: Agents surf the web for optimal deal
– More sensors (physical/logic worlds coupled)
– More hidden computers (cf. electric motors)
• Furthermore, I am wrong
The Internet Iceberg
• Internet Components
– Clients -- mobile, wireless
– “On Ramp” -- LANs/DSL/Cable Modems
– WAN Backbone -- IPv6, massive BW
– and ...
• SERVICES
– Scale Storage
– Scale Bandwidth
– Scale Computation
– High Availability
Outline
• Motivation
• System Area Networks
– What is a SAN?
– InfiniBand
– Virtualizing I/O with Queue Pairs
– Predictions
• Designing Multiprocessor Servers
• Server & Cluster Trends
Regarding Storage/Bandwidth
• Currently resides on I/O Bus (PCI)
– HW & SW protocol stacks
– Must add hosts to add storage/bandwidth
[Figure: two processors (proc) and memory on a memory interconnect; a bridge connects the interconnect to the i/o bus, which has i/o slots 0 through n-1]
Want System Area Network (SAN)
• SAN vs. Local Area Network (LAN)
– Higher bandwidth (10 Gbps)
– Lower latency (few microseconds or less)
– More limited size
– Other (e.g., single administrative domain, short distance)
– Examples: Tandem Servernet & Myricom Myrinet
• Emerging Standard: InfiniBand
– www.inifinibandTA.org w/ spec 1.0 Summer 2000
– Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, & others
– 2.5 Gbits/s times 1, 4, or 12 wires
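As a quick check of the link-width figures, the sketch below multiplies the per-wire rate by each width. It computes raw signalling rates only; any encoding overhead is deliberately not modelled, and the names are illustrative.

```python
LANE_SIGNALLING_GBPS = 2.5   # per-wire rate from the slide
LINK_WIDTHS = (1, 4, 12)     # wires per link from the slide

def raw_link_gbps(width):
    """Raw signalling rate of a link with `width` wires, ignoring encoding overhead."""
    return LANE_SIGNALLING_GBPS * width

rates = {width: raw_link_gbps(width) for width in LINK_WIDTHS}
print(rates)  # {1: 2.5, 4: 10.0, 12: 30.0}
```

The 4-wire link matches the 10 Gbps figure quoted for SANs above.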
InfiniBand Model (from website)
[Figure: InfiniBand model. Processors (proc) and memory sit on a memory interconnect and reach a switch through an HCA (host channel adapter) and a link. Other labels: router to other networks, xCA, TCA at a target (disks), and other switches, hosts, targets, etc.]
InfiniBand Advantages
• Storage/Network made orthogonal from Computation
• Reduce “hardware” stack -- no i/o bridge
• Reduce “software” stack; hardware support for
– Connected Reliable
– Connected Unreliable
– Datagram
– Reliable Datagram
– Raw Datagram
• Can eliminate system call for SAN use (next slide)
Virtualizing InfiniBand
• I/O traditionally virtualized with system call
– System enforces isolation
– System permits authorized sharing
• Memory virtualized
– System trap/call for setup
– Virtual memory hardware for common-case translation
• InfiniBand exploits “queue pairs” (QPs) in memory
– Cf. Intel Virtual Interface Architecture (VIA) [IEEE Micro, Mar/Apr ‘98]
– Users issue sends, receives, & remote DMA reads/writes
Queue Pair
• QP setup system call
– Connect with process
– Connect with remote QP (not shown here)
• QP placed in “pinned” virtual memory
• User directly accesses QP
– E.g., sends, receives & remote DMA reads/writes
[Figure: proc, main memory, and HCA; the queue pair in main memory holds entries send1, send2, receive1, receive2, dma-R3, and dma-W4]
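The queue-pair idea can be illustrated with a toy model: one setup call connects two QPs, after which the user posts sends and receives directly into queues in memory. This is a sketch only, not the InfiniBand or VIA programming interface. Every name in it (`QueuePair`, `setup_qp_pair`, `post_send`, `post_receive`, `hca_process`) is hypothetical, and remote DMA is not modelled.

```python
from collections import deque

class QueuePair:
    """Toy queue pair: a send queue and a receive queue living in user memory."""
    def __init__(self, owner):
        self.owner = owner
        self.send_queue = deque()
        self.recv_queue = deque()
        self.remote = None  # filled in at setup time

def setup_qp_pair(owner_a, owner_b):
    """Stands in for the one-time setup system call: creates and connects two QPs."""
    a, b = QueuePair(owner_a), QueuePair(owner_b)
    a.remote, b.remote = b, a
    return a, b

def post_send(qp, payload):
    """User-level operation: enqueue a send descriptor, no system call involved."""
    qp.send_queue.append(payload)

def post_receive(qp, buffer):
    """User-level operation: enqueue an empty buffer (a list) for an incoming message."""
    qp.recv_queue.append(buffer)

def hca_process(qp):
    """Stands in for the HCA: moves each queued send into a posted remote receive buffer."""
    delivered = 0
    while qp.send_queue and qp.remote.recv_queue:
        payload = qp.send_queue.popleft()
        buffer = qp.remote.recv_queue.popleft()
        buffer.append(payload)
        delivered += 1
    return delivered

client_qp, server_qp = setup_qp_pair("client", "server")
inbox = []
post_receive(server_qp, inbox)          # receiver posts a buffer first
post_send(client_qp, "read block 7")    # sender posts without entering the kernel
delivered = hca_process(client_qp)
print(delivered, inbox)
```

The point of the design is that only `setup_qp_pair` needs operating system involvement; the common-case operations touch memory that the adapter reads directly.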
InfiniBand, cont.
• Roadmap
– NGIO/FIO merger in ‘99
– Spec in ‘00
– Products in ‘03-’10
• My Assessment
– PCI needs successor
– InfiniBand has the necessary features (but also many others)
– InfiniBand has considerable industry buy-in (but it is recent)
– Gigabit Ethernet will be only competitor
• Good name with backing from Cisco et al.
• But TCP/IP is a killer
– InfiniBand for storage will be key
InfiniBand Research Issues
• Software Wide Open
– Industry will do local optimization
(e.g., still have device driver virtualized with system calls)
– But what is the “right” way to do software?
– Is there a theoretical model for this software?
• Other SAN Issues
– A theoretical model of a service provider’s site?
– How to trade performance and availability?
– Utility of broadcast or multicast support?
– Obtaining quasi-real-time performance?
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
– How Fat?
– Coherence for Servers
– E.g., Multicast Snooping
– E.g., Timestamp Snooping
• Server & Cluster Trends
How Fat Should Servers Be?
• Use
– PCs -- cheap but small
– Workgroup servers -- medium cost; medium size
– Large servers -- premium cost & size
• One answer: “yes”
[Figure: PCs w/ “soft” state; servers running databases for “hard” state]
How Do We Build the Big Servers?
• (Industry knows how to build the small ones)
• A key problem is the memory system
– Memory Wall: E.g., 100ns memory access =
400 instruction opportunities for 4-way 1GHz processor
• Use per-processor caches to reduce
– Effective Latency
– Effective Bandwidth Used
• But cache coherence problem ...
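The Memory Wall figure follows from a one-line calculation, sketched below with the slide's numbers. The function name is illustrative.

```python
def lost_instruction_slots(memory_latency_ns, clock_ghz, issue_width):
    """Instruction issue opportunities that pass while one memory access is outstanding."""
    cycles = memory_latency_ns * clock_ghz   # ns * (cycles per ns)
    return cycles * issue_width

slots = lost_instruction_slots(memory_latency_ns=100, clock_ghz=1, issue_width=4)
print(slots)  # 400
```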
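Coherence 101
[Figure: processors P0, P1, ..., Pn-1, each with a cache, connected by an interconnection network to memory, where location 100 holds 4. P0 executes r0<-m[100] and gets “4”; P1 executes r1<-m[100] and gets “4”, leaving 100 : 4 in its cache. P0 then executes m[100]<-5, changing its cached copy of location 100 from 4 to 5. The later loads r2<-m[100] and r3<-m[100] get “?”.]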
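The problem can be reproduced with a deliberately broken model: private write-back caches with no coherence protocol at all. Which processor issues which of the later loads is an assumption of this sketch, and the `Cache` class is hypothetical.

```python
memory = {100: 4}

class Cache:
    """Private write-back cache with no coherence protocol (deliberately broken)."""
    def __init__(self):
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value            # update only the local copy

p0, p1, p_last = Cache(), Cache(), Cache()
r0 = p0.load(100)        # P0 reads 4
r1 = p1.load(100)        # P1 reads 4
p0.store(100, 5)         # P0 writes 5 into its own cache only
r2 = p1.load(100)        # P1 hits its stale cached copy
r3 = p_last.load(100)    # Pn-1 misses and reads stale memory
print(r0, r1, r2, r3, memory[100])
```

Both later loads miss the new value 5, which is the inconsistency a coherence protocol must prevent.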
Broadcast Snooping
[Figure: P1 and P2 each issue a GETX on the ordered address network, which delivers P2:GETX and then P1:GETX, in that same order, to Mem, P0, P1, and P2; data moves over a separate data network]
Broadcast Snooping
• Symmetric Multiprocessor (SMP)
– Most commercially-successful parallel computer architecture
– Performs well by finding data directly
– Scales poorly
• Improvements, e.g., Sun E10000
– Split address & data transactions
– Split address & data network (e.g., bus & crossbar)
– Multiple address buses (e.g., four multiplexed by address)
– Address bus is broadcast tree (not shared wires)
• But…
– Broadcast all address transactions (expensive)
– All processors must snoop all transactions
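A minimal sketch of broadcast snooping follows, assuming a two-state (shared/modified) invalidation protocol and a bus that delivers every transaction to every cache in issue order. The class names and the simplification that an owner supplies data by writing it back to memory are assumptions of this sketch, not details from the talk.

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}   # addr -> (state, value); state is "S" (shared) or "M" (modified)

    def snoop(self, requester, kind, addr, memory):
        """Called for every transaction on the bus, including this cache's own."""
        if requester is self or addr not in self.lines:
            return
        state, value = self.lines[addr]
        if state == "M":
            memory[addr] = value              # owner supplies its dirty data
        if kind == "GETX":
            del self.lines[addr]              # another writer: invalidate
        else:
            self.lines[addr] = ("S", value)   # another reader: downgrade

class Bus:
    """Totally ordered broadcast: every cache snoops every transaction in issue order."""
    def __init__(self, caches, memory):
        self.caches, self.memory, self.log = caches, memory, []

    def broadcast(self, requester, kind, addr):
        self.log.append((requester.name, kind, addr))
        for cache in self.caches:
            cache.snoop(requester, kind, addr, self.memory)
        return self.memory[addr]

def load(bus, cache, addr):
    if addr not in cache.lines:
        cache.lines[addr] = ("S", bus.broadcast(cache, "GETS", addr))
    return cache.lines[addr][1]

def store(bus, cache, addr, value):
    if cache.lines.get(addr, ("I", None))[0] != "M":
        bus.broadcast(cache, "GETX", addr)
    cache.lines[addr] = ("M", value)

memory = {100: 4}
p0, p1 = SnoopingCache("P0"), SnoopingCache("P1")
bus = Bus([p0, p1], memory)
first = load(bus, p1, 100)    # P1 caches the value 4
store(bus, p0, 100, 5)        # P0's GETX invalidates P1's copy
second = load(bus, p1, 100)   # P1 misses and gets P0's value
print(first, second, bus.log)
```

The cost named on the slide is visible in `Bus.broadcast`: every cache is visited for every transaction, whether or not it holds the block.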
Directories
[Figure: P1 and P2 each send a GETX over the address network to Dir/Mem, which handles P1:GETX and then P2:GETX and sends messages on to processors; data moves among Dir/Mem, P0, P1, and P2 over a separate data network]
Directories
• Directory Based Cache Coherence
– E.g., SGI/Cray Origin2000
– Allows arbitrary point-to-point interconnection network
– Scales up well
• But
– Cache-to-cache transfers common in demanding apps
(55-62% sharing misses for OLTP [Barroso ISCA ‘98])
– Many applications can’t use 100s of processors
– Must also “scale down” well
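The cost of cache-to-cache transfers under a directory can be seen by counting point-to-point messages. The sketch below tracks only the owner of each block; sharer tracking, invalidations, and acknowledgments are omitted, and the message names (`FWD_GETX`, `DATA`) are hypothetical.

```python
class Directory:
    """Toy directory that records which processor owns each block."""
    def __init__(self):
        self.owner = {}   # block -> processor holding the block exclusively

    def getx(self, block, requester):
        """Return the (source, destination, type) messages for one exclusive request."""
        msgs = [(requester, "Dir", "GETX")]
        owner = self.owner.get(block)
        if owner is not None and owner != requester:
            # Cache-to-cache transfer: the directory must forward the request.
            msgs.append(("Dir", owner, "FWD_GETX"))
            msgs.append((owner, requester, "DATA"))
        else:
            # Memory has the data: reply directly.
            msgs.append(("Dir", requester, "DATA"))
        self.owner[block] = requester
        return msgs

directory = Directory()
from_memory = directory.getx(100, "P1")
cache_to_cache = directory.getx(100, "P2")
print(len(from_memory), len(cache_to_cache))  # 2 3
```

The extra forwarding step is the indirection that snooping avoids by finding data directly.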
Wisconsin Multifacet: Big Picture
• Build Servers For Internet economy
– Moderate multiprocessor sizes: 2-8 then 16-64, but not 1K
– Optimize for these workloads (e.g. cache-to-cache transfers)
• Key Tool: Multiprocessor Prediction & Speculation
– Make a guess... verify it later
– Uniprocessor predecessors: branch & set predictors
– Recent multiprocessor work: [Mukherjee/Hill ISCA98],
[Kaxiras/Goodman HPCA99] & [Lai/Falsafi ISCA99]
– Multicast Snooping
– Timestamp Snooping
Comparison of Coherence Methods
Coherence Attribute             Snooping   Multicast Snooping   Directories
Find previous owner directly?   Yes        Usually (good)       Sometimes
Always broadcast?               Yes        No (good)            No
Ordering w/o acks?              Yes        Yes (good)           No
Stateless at memory?            Yes        No but simpler       No
Ordered network?                Yes        Yes, a challenge     No
Use prediction to improve on both?
Multicast Snooping
• On cache miss
– Predict "multicast mask" (e.g., bit vector of processors)
– Issue transaction on multicast address network
• Networks
– Address network that totally-orders address multicasts
– Separate point-to-point data network
• Processors snoop all incoming transactions
– If it's your own, it "occurs" now
– If another's, then invalidate and/or respond
• Simplified directory (at memory)
– Purpose: Allows masks to be wrong (explained later)
Predicting Masks
[Figure: the Mask Predictor takes a block address and feedback as inputs and produces a predicted mask]
• Performed at Requesting Processor
– Include owner (GETS/GETX) & all sharers (GETX only)
– Exclude most other processors
• Techniques
– Many straightforward cases (e.g., stack, code,
space-sharing)
– Many options (network load, PC, software, local/global)
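The inclusion rule for masks can be written as a bit-vector check, paired here with one very simple predictor. The `LastOwnerPredictor` is a hypothetical stand-in chosen for brevity and is not the predictor from the Multicast Snooping work. Including the requester in its own mask, and ignoring the case where memory is the owner, are also assumptions of this sketch.

```python
def make_mask(processors):
    """Encode a collection of processor ids as a bit vector."""
    mask = 0
    for p in processors:
        mask |= 1 << p
    return mask

def mask_is_sufficient(kind, mask, owner, sharers):
    """A mask must cover the owner, and for GETX every sharer as well."""
    needed = make_mask([owner])
    if kind == "GETX":
        needed |= make_mask(sharers)
    return (mask & needed) == needed

class LastOwnerPredictor:
    """Hypothetical predictor: remember the last known owner of each block."""
    def __init__(self, self_id):
        self.self_id = self_id
        self.last_owner = {}

    def predict(self, block):
        guess = {self.self_id}
        if block in self.last_owner:
            guess.add(self.last_owner[block])
        return make_mask(guess)

    def feedback(self, block, owner):
        self.last_owner[block] = owner

predictor = LastOwnerPredictor(self_id=0)
cold = predictor.predict(0x100)                 # no history: only processor 0
cold_ok = mask_is_sufficient("GETS", cold, owner=3, sharers=[])
predictor.feedback(0x100, owner=3)              # learn the true owner
warm = predictor.predict(0x100)                 # processors 0 and 3
warm_ok = mask_is_sufficient("GETS", warm, owner=3, sharers=[])
getx_ok = mask_is_sufficient("GETX", warm, owner=3, sharers=[5])
print(bin(cold), cold_ok, bin(warm), warm_ok, getx_ok)
```

An insufficient mask, such as the cold prediction or the GETX that misses sharer 5, is the case the simplified directory at memory exists to catch.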
Implementing an Ordered Multicast Network
• Address Network
– Must create the illusion of total order of multicasts
– May deliver a multicast to destinations at different times
• Wish List
– High throughput for multicasts
– No centralized bottlenecks
– Low latency and cost (~ pipelined broadcast tree)
– ...
• Sample Solutions
– Isotach Networks [Reynolds et al., IEEE TPDS 4/97]
– Indirect Fat Tree [ISCA ‘99]
– Direct Torus
Indirect Fat Tree [ISCA ‘99]
[Figure: indirect fat tree; nodes labeled P$ and DM]
Indirect Fat Tree, cont.
• Basic Idea
– Processors send transactions up to roots
– Roots send transactions down with logical timestamp
– Switches stall transactions to keep in order
– Null transaction sent to avoid deadlock
• Assessment
– Viable & high cross-section bandwidth
– Many "backplane" ASICs means higher cost
– Often stalls transactions
• Want
– Lower cost of direct connections
– Always deliver transactions as soon as possible (ASAP)
– Sacrifice some cross-section bandwidth
Direct 2-D Torus (work in progress)
• Features
– Each processor is a switch
– Switches directly connected
– E.g., network of Compaq 21364
[Figure: 2-D torus with nodes numbered 0, 1, ..., 14, 15]
• Network order?
– Broadcasts unordered
– Snooping needs total order
• Solution
– Create order with logical timestamps
instead of network delivery order
– Called Timestamp Snooping [ASPLOS ‘00]
Timestamp Snooping
• Timestamp Snooping
– Snooping with order determined by logical timestamps
– Broadcast (not multicast) in ASPLOS ‘00
• Basic Idea
– Assign timestamp to coherence transactions at sender
– Broadcast transactions over unordered network ASAP
– Transactions carry timestamp (2 bits)
– Processors process transactions in timestamp order
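The receiving side of this idea can be sketched with a priority queue: transactions arrive in any order and are released in timestamp order. Several things here are assumptions of the sketch rather than details from the talk: timestamps are plain integers instead of a 2-bit encoding, ties are broken by source id, and `guaranteed_time` stands in for whatever assurance the network gives that no older transaction is still in flight.

```python
import heapq

class TimestampOrderer:
    """Buffers transactions that arrive in any order and releases them in timestamp order."""
    def __init__(self):
        self.pending = []   # min-heap of (timestamp, source, payload)

    def receive(self, timestamp, source, payload):
        heapq.heappush(self.pending, (timestamp, source, payload))

    def release(self, guaranteed_time):
        """Return, in order, everything stamped at or before guaranteed_time."""
        ready = []
        while self.pending and self.pending[0][0] <= guaranteed_time:
            ready.append(heapq.heappop(self.pending))
        return ready

orderer = TimestampOrderer()
orderer.receive(2, 1, "P1:GETX")   # arrives first but carries the later timestamp
orderer.receive(1, 2, "P2:GETX")
orderer.receive(3, 0, "P0:GETS")   # too new to process yet
processed = orderer.release(guaranteed_time=2)
print(processed)
```

Every processor that applies the same rule sees the same order, which is what snooping needs from the network.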
Timestamp Snooping Issues
• More address bandwidth
– For 16-processors, 4-ary butterfly, 64-byte blocks
– Directory: 3*8 + 3*72 + more = 240 + more
– Timestamp Snooping 21*8 + 3*72 = 384 (< 60% more)
• Network must guarantee timestamps
– Assert future transactions will have greater timestamps
(so processor can process older transactions)
– Isotach [Reynolds IEEE TPDS 4/97] more aggressively
• Other
– Priority queue at processor to order transactions
– Flow control and buffering issues
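The address-bandwidth comparison can be reproduced directly. The sketch reads the slide's factors as link crossings multiplied by message size in bytes, with 8-byte address messages and 72-byte data messages; that reading is an assumption, as are the names below.

```python
ADDRESS_MESSAGE_BYTES = 8    # assumed meaning of the "8" on the slide
DATA_MESSAGE_BYTES = 72      # assumed: 64-byte block plus 8 bytes of header

def traffic(address_link_crossings, data_link_crossings):
    """Bytes moved per miss: link crossings weighted by message size."""
    return (address_link_crossings * ADDRESS_MESSAGE_BYTES
            + data_link_crossings * DATA_MESSAGE_BYTES)

directory_bytes = traffic(3, 3)    # 3*8 + 3*72, before the slide's "+ more"
timestamp_bytes = traffic(21, 3)   # 21*8 + 3*72
extra = timestamp_bytes / directory_bytes - 1
print(directory_bytes, timestamp_bytes, extra)
```

The ratio comes out at 60% more traffic; because the directory figure omits its "+ more" term, the true overhead is below that, as the slide states.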
Initial Multifacet Results
• Multicast Snooping [ISCA ‘99]
– Ordered multicast of coherence transactions
– Find data directly from memory or caches
– Reduce bandwidth to permit some scaling
– 32-processor results show 2-6 destinations per multicast
• Timestamp Snooping [ASPLOS ‘00]
– Broadcast snooping with “order” determined by
logical timestamps carried by coherence transactions
– No bus: Allows arbitrary memory interconnects
– No directory or directory indirection
– 16-processor results show 25% faster for 25% more traffic
Selected Issues
• Multicast Snooping
– What program property are mask predictors exploiting?
– Why is there no good model of locality
or the “90-10” rule in general?
– How does one build multicast networks?
– What about fault tolerance?
• Timestamp Snooping
– What is an optimal network topology?
– What about buffering, deadlock, etc.?
– Implementing switches and priority queues?
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends
– Out-of-box and highly-available servers
– High-performance communication for clusters
Multiprocessor Servers
• High-Performance Communication “within box”
– SMPs (e.g., Intel PentiumPro Quads)
– Directory-based (SGI Origin2000)
• Trend toward hierarchical “out of box” solutions
– Build bigger servers from smaller ones
– Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)
[Figure: Sun WildFire, built from four SMPs]
Multiprocessor Servers, cont.
• Traditionally had poor error isolation
– Double-bit ECC error crashes everything
– Kernel error crashes everything
– Poor match for highly available Internet infrastructure
• Improve error isolation
– IBM 370 “virtual machines”
– Stanford HIVE “cells”
Clusters
• Traditionally
– Good error isolation
– Poor communication performance (especially latency)
– LANs are not optimized for clusters
• Enter Early SANs
– Berkeley NOW w/ Myricom Myrinet
– IBM SP w/ proprietary network
• What now with InfiniBand SAN (or alternatives)?
A Prediction
• Blurring of cluster & server boundaries
• Clusters
– High communication performance
• Servers
– Better error isolation
– Multi-box solutions
• Use same hardware & configure in the field
• Issues
– How do we model these hybrids?
– Should PODC & SPAA also converge?
Three Questions
• What is a System Area Network (SAN)
and how will it affect clusters?
– E.g., InfiniBand
– Make computation, storage, & network orthogonal
• How fat will multiprocessor servers be
and how do we build larger ones?
– Varying sizes for soft & hard state
– E.g., Multicast Snooping & Timestamp Snooping
• Future of multiprocessor servers & clusters?
– Servers will support higher availability & extra-box solutions
– Clusters will get server communication performance