System LSI and Architecture Technology


Computer Architecture
Guidance
Keio University
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Contents
Techniques of two key architectures for future system LSIs:
• Parallel Architectures
• Reconfigurable Architectures
Advanced uni-processor architecture
→ Special Course of Microprocessors (by Prof. Yamasaki, Fall term)
Class
• Lecture using PowerPoint (70 min.)
  • The ppt file is uploaded to the web site http://www.am.ics.keio.ac.jp, and you can download and print it before the lecture.
  • When the file is uploaded, a notification is sent to you by e-mail.
• Textbook: “Parallel Computers” by H. Amano (Shoko-do)
• Exercise (20 min.)
  • Simple design or calculation on design issues
Evaluation
• Exercise on parallel programming using SCore on RHiNET-2 (50%)
• Exercise after every lecture (50%)
Computer Architecture 1
Introduction to Parallel Architectures
Parallel Architecture
A parallel architecture consists of multiple processing units that work simultaneously.
• Purposes
• Classifications
• Terms
• Trends
Boundary between Parallel Machines and Uniprocessors
Uniprocessors
• ILP (Instruction Level Parallelism)
  • A single program counter
  • Parallelism inside/between instructions
Parallel Machines
• TLP (Thread Level Parallelism)
  • Multiple program counters
  • Parallelism between processes and jobs
Definition: Hennessy & Patterson’s “Computer Architecture: A Quantitative Approach”
Increasing the number of simultaneously issued instructions vs. tight coupling
[Figure: two directions of development. Labels along the performance-improvement axis: single pipeline; multiple instruction issue; multiple thread execution. Labels along the tight-coupling axis: connecting multiple processors; shared memory, shared register; on-chip implementation.]
Purposes of providing multiple processors
• Performance
  • A job can be executed quickly with multiple processors
• Fault tolerance
  • If a processing unit is damaged, the total system remains available: redundant systems
• Resource sharing
  • Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
  • High performance with low-frequency operation
Parallel Architecture: Performance Centric!
Flynn’s Classification
• The number of instruction streams: M (Multiple) / S (Single)
• The number of data streams: M / S
• SISD: Uniprocessors (including superscalar and VLIW)
• MISD: Does not exist (analog computer)
• SIMD
• MIMD
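As a minimal sketch of the classification above, the two stream counts can be mapped directly to the four class names. The function name `flynn_class` and its integer parameters are assumptions made for illustration only.

```python
def flynn_class(instruction_streams, data_streams):
    """Return Flynn's class name for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    # S (Single) when there is exactly one stream, otherwise M (Multiple).
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"
```

For example, a uniprocessor corresponds to `flynn_class(1, 1)` and a machine with one instruction stream driving many data streams to `flynn_class(1, 65536)`.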
SIMD (Single Instruction Stream, Multiple Data Streams)
• All processing units execute the same instruction
• Low degree of flexibility
• Illiac-IV/MMX (coarse grain)
• CM-2 type (fine grain)
[Figure: an instruction memory issues one instruction to all processing units, each with its own data memory]
Two types of SIMD
• Coarse grain: Each node performs floating-point numerical operations
  • ILLIAC-IV, BSP, GF-11
  • Multimedia instructions in recent high-end CPUs
  • Dedicated on-chip approach: NEC’s IMEP
• Fine grain: Each node performs operations on only a few bits
  • ICL DAP, CM-2, MP-2
  • Image/signal processing
  • The Connection Machine extended the applications to artificial intelligence (CmLisp)
A processing unit of CM-2
[Figure labels: Flags, A, B, F, OP, s, c, C, Context, 256-bit memory, 1-bit serial ALU]
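To illustrate how a fine-grain processing unit with a 1-bit serial ALU can still handle multi-bit values, the following sketch adds two integers one bit per step while keeping the carry in a flag. It is a generic bit-serial adder written for illustration, not a model of the actual CM-2 ALU; the function names are hypothetical.

```python
def full_adder(a, b, carry):
    """One step of a 1-bit ALU: returns (sum bit, carry out)."""
    s = a ^ b ^ carry
    c = (a & b) | (carry & (a ^ b))
    return s, c


def bit_serial_add(x, y, width):
    """Add two unsigned integers bit by bit, least significant bit first."""
    carry = 0  # plays the role of a carry flag kept between steps
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result  # wraps around modulo 2**width
```

An addition of `width` bits therefore takes `width` steps on each processing unit, which is acceptable when tens of thousands of units perform the same steps at once.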
Element of CM2
[Figure: each LSI chip contains a 4x4 processor array (16 PEs) and a router, with 256 bit x 16 PE RAM; instructions are supplied to the chips]
• 4096 chips = 64K PEs
• The 4096 chips are connected as a hypercube with 12 links per chip
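The numbers on this slide fit together as follows: a hypercube with 12 links per node has 2^12 = 4096 nodes, and 16 PEs per chip give 64K PEs. The sketch below checks this arithmetic and shows how the neighbors of a node are found by flipping one address bit. The function and constant names are assumptions for illustration.

```python
DIMENSIONS = 12            # links per chip, from the slide
CHIPS = 1 << DIMENSIONS    # 2**12 chips
PES_PER_CHIP = 16          # 4x4 processor array per chip


def hypercube_neighbors(node, dimensions):
    """Neighbors differ from the node in exactly one address bit."""
    return [node ^ (1 << d) for d in range(dimensions)]


def hop_distance(src, dst):
    """Minimum hops equal the number of differing address bits."""
    return bin(src ^ dst).count("1")
```

With these definitions, `CHIPS * PES_PER_CHIP` is 65536 (64K), every chip has 12 neighbors, and the farthest pair of chips is 12 hops apart.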
Thinking Machines’ CM2 (1987)
The future of SIMD
• Coarse grain SIMD
  • Large-scale supercomputers like Illiac-IV/GF-11 will not be revived.
  • Multimedia instructions will be used in the future.
  • Special-purpose on-chip systems will become popular.
• Fine grain SIMD
  • Advantageous for specific applications like image processing
  • General-purpose machines are difficult to build
    Example: CM2 → CM5
MIMD
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
[Figure: processors connected through interconnection networks to memory modules (instructions and data)]
Classification of MIMD machines
Structure of shared memory
• UMA (Uniform Memory Access Model): provides shared memory which can be accessed from all processors in the same manner.
• NUMA (Non-Uniform Memory Access Model): provides shared memory, but it is not uniformly accessed.
• NORA/NORMA (No Remote Memory Access Model): provides no shared memory. Communication is done with message passing.
UMA
• The simplest structure of shared memory machine
• The extension of uniprocessors
• An OS extended from a single-processor OS can be used.
• Programming is easy.
• System size is limited.
  • Bus connected
  • Switch connected
• A total system can be implemented on a single chip
  • On-chip multiprocessor
  • Chip multiprocessor
  • Single chip multiprocessor
An example of UMA: Bus connected
[Figure: four PUs, each with a snoop cache, connected through a shared bus to the main memory]
SMP (Symmetric MultiProcessor)
On chip multiprocessor
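The slide names snoop caches without specifying a protocol. As a rough sketch of the idea, the code below assumes the simplest possible scheme, a write-through cache with invalidation: every write goes to main memory over the shared bus, and every other cache watching the bus drops its copy of that address. The class names and the protocol choice are assumptions, not taken from the lecture.

```python
class SharedBus:
    """Shared bus with the main memory attached (illustrative model)."""

    def __init__(self):
        self.memory = {}   # address -> value
        self.caches = []   # every cache snoops this bus

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value          # write-through to main memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr)          # other caches see the write


class SnoopCache:
    """Per-PU cache that invalidates a line when another PU writes it."""

    def __init__(self, bus):
        self.lines = {}    # address -> value, valid lines only
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:         # miss: fetch from main memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def snoop(self, addr):
        self.lines.pop(addr, None)         # invalidate the stale copy


def demo():
    bus = SharedBus()
    pu0, pu1 = SnoopCache(bus), SnoopCache(bus)
    pu0.write(0x10, 1)
    first = pu1.read(0x10)     # pu1 now holds a cached copy
    pu0.write(0x10, 2)         # invalidates the copy in pu1
    second = pu1.read(0x10)    # miss, so the new value is fetched
    return first, second
```

Because every write occupies the single shared bus, this structure also hints at why the system size of a bus-connected UMA is limited.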
Switch connected UMA
[Figure: CPUs, each with a local memory and an interface, connected through a switch to the main memory]
The gap between switch-connected and bus-connected structures is becoming small
NUMA
• Each processor provides a local memory, and accesses other processors’ memory through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable:
  • Programs for UMA can run without modification.
  • The performance improves as the system size grows.
• Competitive with WS/PC clusters using software DSM
Typical structure of NUMA
[Figure: Nodes 0 to 3 are connected by an interconnection network; the memories of nodes 0, 1, 2 and 3 form regions 0, 1, 2 and 3 of a single logical address space]
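The mapping from the logical address space to the nodes can be sketched as follows. The memory size per node and the two access costs are hypothetical values chosen only to make the example concrete; real machines differ.

```python
NUM_NODES = 4                 # Node 0 .. Node 3, as in the figure
NODE_MEMORY_SIZE = 1 << 20    # assumed: 1 MiB of memory per node
LOCAL_COST = 1                # assumed relative cost of a local access
REMOTE_COST = 10              # assumed relative cost via the network


def locate(address):
    """Split a logical address into (home node, offset in that node)."""
    node, offset = divmod(address, NODE_MEMORY_SIZE)
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside the logical address space")
    return node, offset


def access_cost(requesting_node, address):
    """Non-uniform access: local memory is cheaper than remote memory."""
    home, _ = locate(address)
    return LOCAL_COST if home == requesting_node else REMOTE_COST
```

Every node can reach every address, so a program written for UMA still runs, but the cost depends on which node owns the address.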
Classification of NUMA
• Simple NUMA:
  • Remote memory is not cached.
  • Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
  • Cache consistency is maintained with hardware.
  • The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
  • No home memory
  • Complicated control mechanism
Cray’s T3D: A simple NUMA supercomputer (1993)
• Using Alpha 21064
The Earth Simulator (2002)
SGI Origin
• Bristled hypercube
• Main memory is connected directly to the Hub chip
• 1 cluster consists of 2 PEs.
[Figure: main memory and Hub chip, connected to the network]
SGI’s CC-NUMA Origin3000 (2000)
• Using R12000
DDM(Data Diffusion Machine)
NORA/NORMA
• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance
• The fastest machines are always NORA (except the Earth Simulator)
• Hard to program
• Inter-PU communications
• Cluster computing
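The message-passing style can be sketched as below: each node sums only its own data and sends the partial result to node 0 as a message. Python threads and `queue.Queue` mailboxes stand in for nodes and network links here; this is an emulation under that assumption, since real NORA nodes have physically separate memories. All names are hypothetical.

```python
import queue
import threading


def worker(rank, private_data, mailboxes):
    """A node works on its private data, then sends a message to node 0."""
    partial = sum(private_data)
    mailboxes[0].put((rank, partial))


def parallel_sum(values, num_nodes=4):
    """Sum a list by distributing it over nodes that only exchange messages."""
    mailboxes = [queue.Queue() for _ in range(num_nodes)]
    # Distribute the data: node r receives every num_nodes-th element.
    chunks = [values[r::num_nodes] for r in range(num_nodes)]
    threads = [
        threading.Thread(target=worker, args=(r, chunks[r], mailboxes))
        for r in range(1, num_nodes)
    ]
    for t in threads:
        t.start()
    total = sum(chunks[0])                 # node 0 handles its own part
    for _ in range(num_nodes - 1):
        _, partial = mailboxes[0].get()    # blocking receive
        total += partial
    for t in threads:
        t.join()
    return total
```

The explicit distribution of data and the explicit send and receive are what make this style harder to program than shared memory.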
Early Hypercube machine nCUBE2
Fujitsu’s NORA AP1000 (1990)
• Mesh connection
• SPARC
Intel’s Paragon XP/S (1991)
• Mesh connection
• i860
PC Cluster
• Beowulf cluster (NASA’s Beowulf Project, 1994, by Sterling)
  • Commodity components
  • TCP/IP
  • Free software
• Others
  • Commodity components
  • High performance networks like Myrinet / Infiniband
  • Dedicated software
RHiNET-2 cluster
Terms(1)
• Multiprocessors:
  • MIMD machines with shared memory
  • Strict definition (by Enslow Jr.):
    • Shared memory
    • Shared I/O
    • Distributed OS
    • Homogeneous
  • Extended definition: all parallel machines (wrong usage)
Terms(2)
• Multicomputer
  • MIMD machines without shared memory, that is, NORA/NORMA
  • Don’t use if possible
• Arraycomputer
  • A machine consisting of an array of processing elements: SIMD
  • A supercomputer for array calculation (wrong usage)
• Loosely coupled ・ Tightly coupled
  • Loosely coupled: NORA, Tightly coupled: UMA
  • But are NORAs really loosely coupled?
Classification
• Stored programming based
  • SIMD
    • Fine grain
    • Coarse grain
  • MIMD
    • Multiprocessors
      • Bus connected UMA
      • Switch connected UMA
      • NUMA: Simple NUMA, CC-NUMA, COMA
    • NORA: Multicomputers
• Others
  • Systolic architecture
  • Data flow architecture
  • Mixed control
  • Demand driven architecture
Systolic Architecture
[Figure: data streams x and y flow through a computational array]
Data streams are inserted into an array of special-purpose computing nodes in a certain rhythm.
Introduced in Reconfigurable Architectures
Data flow machines
[Figure: data flow graph of (a+b)×(c+(d×e)), with inputs a, b, c, d and e feeding + and × nodes]
A process is driven by the data
Also introduced in Reconfigurable Systems
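The data-driven rule can be sketched for the expression (a+b)×(c+(d×e)): a node fires as soon as all of its input tokens have arrived, with no program counter deciding the order. The graph encoding and the names `GRAPH`, `t1`, `t2`, `t3`, `out` and `run_dataflow` are assumptions made for this illustration.

```python
import operator

# node name -> (operation, names of the input tokens it waits for)
GRAPH = {
    "t1": (operator.add, ("a", "b")),     # a + b
    "t2": (operator.mul, ("d", "e")),     # d x e
    "t3": (operator.add, ("c", "t2")),    # c + (d x e)
    "out": (operator.mul, ("t1", "t3")),  # (a + b) x (c + (d x e))
}


def run_dataflow(graph, inputs):
    """Fire every node whose inputs are available until none is left."""
    tokens = dict(inputs)     # values that have arrived so far
    pending = dict(graph)     # nodes that have not fired yet
    fired_order = []
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in tokens for s in srcs)]
        if not ready:
            raise ValueError("some input tokens never arrive")
        for name in ready:    # these nodes could all fire in parallel
            op, srcs = pending.pop(name)
            tokens[name] = op(*(tokens[s] for s in srcs))
            fired_order.append(name)
    return tokens, fired_order
```

Nodes `t1` and `t2` depend only on the inputs, so they become ready in the same step; this is the parallelism that a data flow machine exploits.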
Exercise 1


In this class, a PC cluster, RHiNET, is used for practicing parallel programming. Is RHiNET classified as a Beowulf cluster? Why do you think so?
If you take this class, send the answer with your name and student number to
[email protected]