Transcript PPT

Lecture 11
Multithreaded Architectures
Graduate Computer Architecture
Fall 2005
Shih-Hao Hung
Dept. of Computer Science and Information Engineering
National Taiwan University
Concept
• Data Access Latency
– Cache misses (L1, L2)
– Memory latency (remote, local)
– Often unpredictable
• Multithreading (MT)
– Tolerate or mask long and often unpredictable
latency operations by switching to another
context, which is able to do useful work.
Why Multithreading Today?
• ILP is exhausted, TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.
Classical Problem, '60s & '70s
• I/O latency prompted multitasking
• IBM mainframes
• Multitasking
• I/O processors
• Caches within disk controllers
Requirements of Multithreading
• Storage needed to hold each context's PC,
registers, status word, etc. (see the sketch after this list)
• Coordination to match an event with a saved
context
• A way to switch contexts
• Long latency operations must use resources
not in use
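A minimal C sketch of the per-context state such a machine must replicate, and of matching a returning long-latency event to its saved context. The structure fields, sizes, and the tag-matching helper are illustrative assumptions, not a description of any particular processor.

```c
#include <stdint.h>

#define NUM_REGS     32   /* architectural registers per context (illustrative) */
#define NUM_CONTEXTS  4   /* hardware thread contexts (illustrative)            */

/* Per-thread state the hardware must hold for each context. */
typedef struct {
    uint64_t pc;                 /* program counter                        */
    uint64_t regs[NUM_REGS];     /* this context's slice of the regfile    */
    uint64_t status;             /* status/condition word                  */
    int      waiting_tag;        /* tag of an outstanding long-latency op,
                                    -1 if the context is runnable          */
} hw_context;

static hw_context contexts[NUM_CONTEXTS];

/* Coordination: when a tagged memory reply arrives, wake the matching context. */
void on_memory_reply(int tag) {
    for (int i = 0; i < NUM_CONTEXTS; i++)
        if (contexts[i].waiting_tag == tag)
            contexts[i].waiting_tag = -1;   /* context becomes runnable again */
}
```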
Processor Utilization vs. Latency
R = the run length to a long latency event
L = the amount of latency
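With these definitions, the usual back-of-the-envelope utilization model (assuming a zero-cost context switch and independent contexts) is:

```latex
% R = run length between long-latency events, L = latency of the event.
\[
U_{1} \;=\; \frac{R}{R + L}
\qquad\text{(single context)}
\]
\[
U_{N} \;=\; \min\!\left(1,\; \frac{N \cdot R}{R + L}\right)
\qquad\text{($N$ contexts, zero switch cost)}
\]
% Utilization saturates at 1 once N >= 1 + L/R.
```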
Problem of the '80s
• Problem was revisited due to the advent of
graphics workstations
– Xerox Alto, TI Explorer
– Concurrent processes are interleaved to allow for
the workstations to be more responsive.
– These processes could drive or monitor display,
input, file system, network, user processing
– Process switches were slow, so the subsystems
were microprogrammed to support multiple
contexts
Scalable Multiprocessor ('90s)
• Dance hall – a shared interconnect with memory on one side
and processors on the other.
• Or processors may have local memory
How do the processors communicate?
• Shared Memory
– Potential long latency on every load
– Cache coherency becomes an issue
– Examples include NYU’s Ultracomputer, IBM’s RP3, BBN’s
Butterfly, MIT’s Alewife, and later Stanford’s Dash.
– Synchronization occurs through shared variables, locks, flags, and
semaphores.
• Message Passing
– The programmer deals with latency explicitly. This lets them minimize
the number of messages while maximizing their size, and a message can
be sent early enough that it reaches the receiver by the time the
receiver needs it.
– Examples include Intel's iPSC and Paragon, Caltech's Cosmic
Cube, and Thinking Machines' CM-5
– Synchronization occurs through send and receive (see the sketch
after this list)
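A small, hypothetical C (POSIX) sketch contrasting the two styles: a lock-protected shared variable versus an explicit send/receive over a channel, where a pipe stands in for the interconnect. All names are illustrative.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* --- Shared-memory style: communicate through a variable, synchronize with a lock. --- */
static int             shared_value;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *shared_memory_producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* synchronization via a lock      */
    shared_value = 42;              /* communicate through shared data */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* --- Message-passing style: explicit send/receive; latency is visible to the programmer. --- */
static int channel[2];              /* a pipe stands in for the network */

static void *message_passing_producer(void *arg) {
    (void)arg;
    int msg = 42;
    write(channel[1], &msg, sizeof msg);   /* "send" */
    return NULL;
}

int main(void) {
    pthread_t t;

    /* Shared memory: consumer reads the variable under the same lock. */
    pthread_create(&t, NULL, shared_memory_producer, NULL);
    pthread_join(t, NULL);
    pthread_mutex_lock(&lock);
    printf("shared memory saw %d\n", shared_value);
    pthread_mutex_unlock(&lock);

    /* Message passing: consumer blocks on an explicit receive. */
    pipe(channel);
    pthread_create(&t, NULL, message_passing_producer, NULL);
    int msg;
    read(channel[0], &msg, sizeof msg);    /* "receive" */
    printf("message passing saw %d\n", msg);
    pthread_join(t, NULL);
    return 0;
}
```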
Cycle-by-Cycle Interleaved Multithreading
• Denelcor HEP1 (1982), HEP2
• Horizon, which was never built
• Tera MTA
Cycle-by-Cycle Interleaved Multithreading
• Features
– An instruction from a different context is launched at each
clock cycle
– No interlocks or bypasses thanks to a non-blocking
pipeline
• Optimizations:
– Leaving context state in the processor (PC, register number, status)
– Assigning tags to remote requests and matching them on
completion (see the sketch below)
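A toy C sketch of the issue policy described above: each cycle one instruction is launched from the next runnable context, so adjacent pipeline stages never hold dependent instructions from the same thread and no interlocks are needed. The context count and state are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CONTEXTS 8   /* illustrative; the HEP/Tera held many more contexts */

/* Hypothetical per-context state: runnable unless waiting on memory. */
static bool runnable[NUM_CONTEXTS] = { true, true, true, true,
                                       true, true, true, true };

/* Each cycle, pick the next runnable context in round-robin order. */
static int pick_context(int last) {
    for (int i = 1; i <= NUM_CONTEXTS; i++) {
        int c = (last + i) % NUM_CONTEXTS;
        if (runnable[c])
            return c;
    }
    return -1;   /* nothing runnable: the pipeline bubbles this cycle */
}

int main(void) {
    int ctx = NUM_CONTEXTS - 1;
    for (int cycle = 0; cycle < 8; cycle++) {
        ctx = pick_context(ctx);
        printf("cycle %d: issue from context %d\n", cycle, ctx);
    }
    return 0;
}
```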
Challenges with this approach
• I-Cache:
– Instruction bandwidth
– I-Cache misses: since instructions are fetched from many different
contexts, instruction locality is degraded and the I-cache miss rate rises.
• Register file access time:
– Register file access time increases because the regfile must grow
significantly to accommodate many separate contexts.
– In fact, the HEP and Tera use SRAM to implement the regfile, which
means longer access times.
• Single thread performance:
– Single-thread performance is significantly degraded, since the
processor switches to another context every cycle even if no other
thread has useful work.
• A very high bandwidth network, which must be fast and wide
• Retries on load-empty or store-full operations (see the sketch below)
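The retries refer to synchronizing loads and stores on memory words tagged with full/empty bits, as in the HEP and Tera. The C sketch below is a software model of that behavior under assumed, illustrative names; real hardware performs the retry itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a memory word with a full/empty tag bit. */
typedef struct {
    uint64_t value;
    bool     full;     /* full/empty tag bit */
} tagged_word;

/* "Load when full": returns false (retry later) if the word is empty. */
bool load_full(tagged_word *w, uint64_t *out) {
    if (!w->full)
        return false;          /* load hit an empty word: retry */
    *out    = w->value;
    w->full = false;           /* consume: mark empty */
    return true;
}

/* "Store when empty": returns false (retry later) if the word is full. */
bool store_empty(tagged_word *w, uint64_t v) {
    if (w->full)
        return false;          /* store hit a full word: retry */
    w->value = v;
    w->full  = true;           /* produce: mark full */
    return true;
}
```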
Improving Single Thread Performance
• Do more operations per instruction (VLIW)
• Allow multiple instructions to issue into pipeline from
each context.
– This could lead to pipeline hazards, so other safe
instructions could be interleaved into the execution.
– For Horizon & Tera, the compiler detects such data
dependencies and the hardware enforces them by switching to
another context when one is detected.
• Switch on load
• Switch on miss
– Switching on load or miss increases the context-switch
time (see the sketch after this list).
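A toy C sketch of a switch-on-miss (blocked multithreading) policy, in contrast to the cycle-by-cycle scheme: the running thread keeps issuing until it misses, and only then is a context switch, with its penalty, taken. The penalty value and names are illustrative assumptions.

```c
#include <stdbool.h>

#define SWITCH_PENALTY 3   /* pipeline cycles lost on each switch (illustrative) */

typedef struct {
    int  id;
    bool blocked;          /* waiting on an outstanding miss */
} thread_ctx;

/* Returns the thread to run next cycle; charges the switch cost on a miss. */
int next_thread(thread_ctx *threads, int n, int current, bool missed,
                int *stall_cycles) {
    if (!missed)
        return current;                    /* keep running: no switch         */
    threads[current].blocked = true;       /* current thread waits on memory  */
    *stall_cycles += SWITCH_PENALTY;       /* switching is not free           */
    for (int i = 1; i <= n; i++) {         /* find another ready thread       */
        int c = (current + i) % n;
        if (!threads[c].blocked)
            return c;
    }
    return current;                        /* none ready: stall on current    */
}
```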
Simultaneous Multithreading (SMT)
• Tullsen et al. (U. of Washington), ISCA '95
• A way to better utilize the pipeline with increased
parallelism from multiple threads.
Simultaneous Multithreading
SMT Architecture
• Straightforward extension to conventional
superscalar design.
– multiple program counters and some mechanism by which
the fetch unit selects one each cycle (one policy is sketched after this list),
– a separate return stack for each thread for predicting
subroutine return destinations,
– per-thread instruction retirement, instruction queue flush,
and trap mechanisms,
– a thread id with each branch target buffer entry to avoid
predicting phantom branches, and
– a larger register file, to support logical registers for all
threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the
scheduling of load-dependent instructions.
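A small C sketch of the per-cycle fetch selection mentioned in the first bullet above. It assumes an ICOUNT-style heuristic (favor the thread with the fewest instructions in the queues), one of the fetch policies studied in the follow-up Tullsen '96 work; the structure and names are illustrative, not the actual hardware interface.

```c
#include <limits.h>
#include <stdint.h>

#define NUM_THREADS 4

/* Hypothetical per-thread front-end state in an SMT core. */
typedef struct {
    uint64_t pc;            /* per-thread program counter                  */
    int      iq_count;      /* instructions in the queues from this thread */
    int      active;        /* context is loaded and runnable              */
} smt_thread;

/* Each cycle the fetch unit picks one PC to fetch from.  This sketch
   favors the thread with the fewest instructions already queued, so
   fast-moving threads keep the pipeline fed. */
int select_fetch_thread(const smt_thread *t, int n) {
    int best = -1, best_count = INT_MAX;
    for (int i = 0; i < n; i++) {
        if (t[i].active && t[i].iq_count < best_count) {
            best_count = t[i].iq_count;
            best = i;
        }
    }
    return best;    /* -1 if no active thread */
}
```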
SMT Performance
Tullsen ‘96
Commercial Machines w/ MT Support
• Intel Hyper-Threading (HT)
– 2 threads per core
– Pentium 4, Xeon
• Sun CoolThreads
– UltraSPARC T1
– 4 threads per core
• IBM
– POWER5
IBM Power5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
SMT Summary
• Pros:
– Increased throughput w/o adding much cost
– Fast response for multitasking environment
• Cons:
– Slower single-thread performance
Multicore
• Multiple processor cores on a chip
– Chip multiprocessor (CMP)
– Sun’s Chip Multithreading (CMT)
• UltraSPARC T1 (Niagara)
– Intel’s Pentium D
– AMD dual-core Opteron
• Also a way to utilize TLP, but
– 2 cores → 2X cost
– Not good for single-thread performance
• Can be used together with SMT
Chip Multithreading (CMT)
Sun UltraSPARC T1 Processor
http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers
8 Cores vs 2 Cores
• Is 8 cores too aggressive?
– Good for server applications, given
• Lots of threads
• Scalable operating environment
• Large memory space (64-bit)
– Good for power efficiency
• Simple pipeline design for each core
– Good for availability
– Not intended for PCs, gaming, etc
SPECweb2005
IBM x346: 3GHz Xeon
T2000: 8-core 1.0GHz T1 processor
Sun Fire T2000 Server
Server Pricing
• UltraSPARC
– Sun Fire T1000 Server: 6-core 1.0GHz T1 processor, 2GB memory, 1x 80GB disk. List price: $5,745
– Sun Fire T2000 Server: 8-core 1.0GHz T1 processor, 8GB DDR2 memory, 2x 73GB disk. List price: $13,395
• x86
– Sun Fire X2100 Server: dual-core AMD Opteron 175, 2GB memory, 1x 80GB disk. List price: $2,295
– Sun Fire X4200 Server: 2x dual-core AMD Opteron 275, 4GB memory, 2x 73GB disk. List price: $7,595