CAPS project

CAPS team
Compiler and Architecture for superscalar and embedded processors
CAPS members
 2 INRIA researchers: A. Seznec, P. Michaud
 2 professors: F. Bodin, J. Lenfant
 11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec
 3 engineers: S. Bihan, P. Villalon, J. Simonnet
CAPS themes
 Two interacting activities
 High-performance microprocessor architecture
 Performance-oriented compilation
CAPS Grail
 Performance at the best cost
Progress in computer science and applications is driven by performance
CAPS path to the Grail
 Defining the tradeoffs between:
 what should be done through hardware
 what can be done by the compiler
 for maximum performance
 or for minimum cost
 or for minimum size, power, …
Need for high-performance processors
 Current applications
 general purpose: scientific, multimedia, databases, …
 embedded systems: cell phones, automotive, set-top boxes, …
 Future applications
 don’t worry: users have a lot of imagination!
 New software engineering techniques are CPU hungry:
 reusability, generality
 portability, extensibility (indirections, virtual machines)
 safety (run-time verifications)
 encryption/decryption
CAPS (ancient) background
 « ancient » background in hardware and software management of ILP
 decoupled pipeline architectures
 OPAC, a hardware matrix floating-point coprocessor
 software pipelining for LIW
 « Supercomputing » background
 interleaved memories
 Fortran-S
CAPS background in architecture
 Solid knowledge in microprocessor architecture
 technological watch on microprocessors
 A. Seznec worked with the Alpha Development Group in 1999-2000
 Research in cache architecture
 Research in branch prediction mechanisms
CAPS background in compilers
 Software optimizations for cache memories
 Numerical algorithms on dense structures
 Optimizing data layout
 Many prototype environments for parallel compilers:
 CT++ (with CEA): image processing C++ library for a SIMD architecture
 Menhir: a parallel compiler for MatLab
 IPF (with Thomson-LER): Fortran compiler for image processing on Maspar
 Sage (with Indiana): infrastructure for source-level transformation
We build on
 SALTO: System for Assembly-Language Transformations and Optimizations
 retargetable assembly source-to-source preprocessor
 Erven Rohou’s Ph.D.
 TSF:
 Scripting language for program transformation on top of ForeSys (Simulog)
 Yann Mevel’s Ph.D.
Salto overview
 Assembly source-to-source preprocessor
 Fine grain machine description
 Independent from compilers
[Diagram labels: assembly language (input), Machine Description, Transformation tool (C++), SALTO, assembly language (output)]
Compiler activities
 Code optimizations for embedded applications
 infrastructures rather than compilers
 optimizing compiler strategies rather than new code optimizations
 Global constraints
 performance / code size / low power (starting)
 Focus on interactive tools rather than automatic
 code tuning
 case-based reasoning
 assembly code optimizations
Computer aided hand tuning
 Automatic optimization has many shortcomings
 rather provide the user with a testbed to hand-tune applications
 Target applications
 Fortran codes and embedded C applications
 Our approach
 case-based reasoning
 static code analysis and pattern matching
 profiling
 learning techniques
 the user has the final responsibility
CAHT
Prototype built on
Foresys: Fortran interactive front-end (from Simulog)
TSF: scripting language for program transformation
Sage++: infrastructure for source-level transformation
Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)
 ATLLAS objectives:
 Has the compiler done a good job?
 Try to match source and optimized assembly at fine grain
 Development/analysis environment:
 Models for both source and assembly
 Global and local analysis (WCET, …) at both levels
 Interactive environment for code visualization and manual/automatic analysis and optimization
 Built using Salto and Sage++:
 Retargetable across compilers and architectures
ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: Tuning method
[Flowchart labels: Source Code; compilation; Assembly Code; Post-Processing; Atllas Processing; Support (profiling); matching analysis and evaluations; Graphic Display of Ass. and Src. Code; Code Good? (Yes: End); Half-Automatic or Manual Source Optimisations; Half-Automatic or Manual Assembly Optimisations]
Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)
 ALISE
 enhanced SALTO for code optimization:
• better integration with code generation
– interface with front-end
– interface for profiling data
• targets global optimization
• based on component software optimization engines
 Answer to a real need from industry:
 A retargetable infrastructure
ALISE
 Environment for:
 global assembly code optimization
 providing optimization alternatives
 Support for new embedded processors
 ISAs with ILP support (VLIW, EPIC)
 Predicated instructions
 Functional unit clusters, ..
ALISE
[Diagram labels: Architecture Description; D to M; Architecture Model; Text Input; Intermediate Code; P to IR; Intermediate representation; High Level API; Opt 1, Opt 2, ..., Opt n; User interface (G.U.I.); IR to Ass (Emit); Optimized Program; Interfaces; External Infrastructure (twice)]
Preprocessor for media processors (MEDEA+ Mesa project)
 Multimedia instructions on embedded and general-purpose processors, but:
 no consensus on multimedia (MMD) instructions among manufacturers:
• saturated arithmetic or not, different instructions, …
 Multimedia instructions are not well handled by compilers:
• but performance is very dependent on them
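To make the "saturated arithmetic or not" point concrete, here is a minimal sketch of the two behaviours for 8-bit unsigned addition. The function names are illustrative only and do not correspond to any particular instruction set.

```python
def add_wrap_u8(a: int, b: int) -> int:
    """8-bit unsigned add with wraparound (modulo 256)."""
    return (a + b) & 0xFF


def add_sat_u8(a: int, b: int) -> int:
    """8-bit unsigned add that clamps at 255 instead of wrapping."""
    return min(a + b, 0xFF)


if __name__ == "__main__":
    # Brightening a nearly white pixel: wraparound turns it dark,
    # saturation keeps it white.
    print(add_wrap_u8(250, 10))  # 4
    print(add_sat_u8(250, 10))   # 255
```

A source-level idiom such as "add, then clamp to 255" can map to a single instruction on a target with saturated adds, and has to stay a compare-and-select sequence on a target without them.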
Preprocessor for media processors: our approach
 C source-to-source preprocessor
 user oriented idioms recognition:
 easy to retarget
 target dedicated recognition
 exploiting loop parallelism
 vectorization techniques
 multiprocessor systems
 available soon
 Collaboration with STMicroelectronics
Iterative compilation
 Embedded systems:
 Compile time is not critical
 Performance/code size/power are critical
 One can often rely on profiling
 Classical compiler: local optimizations
 but constraints are GLOBAL
 Proof of concept for code size (Rohou’s Ph.D.)
 new Ph.D. beginning in September 2000
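The idea of spending compile time to satisfy a global constraint can be sketched as an exhaustive search over transformation choices. This is a toy under stated assumptions: the transformation names, their size/time effects, and the `evaluate` cost model are hypothetical stand-ins for compiling each variant and then measuring or profiling it.

```python
from itertools import product

# Hypothetical transformations and their modelled effect.
# In a real setting each candidate would be compiled and measured (or profiled).
TRANSFORMS = {
    "unroll": {"size": +40, "time": -15},
    "inline": {"size": +25, "time": -10},
    "compact": {"size": -30, "time": +5},
}
BASE = {"size": 200, "time": 100}


def evaluate(enabled):
    """Stand-in for 'compile, then measure code size and run time'."""
    size = BASE["size"] + sum(TRANSFORMS[t]["size"] for t in enabled)
    time = BASE["time"] + sum(TRANSFORMS[t]["time"] for t in enabled)
    return size, time


def iterative_search(max_time):
    """Try every on/off combination; keep the smallest code meeting the time bound."""
    best = None
    names = list(TRANSFORMS)
    for mask in product([False, True], repeat=len(names)):
        enabled = tuple(n for n, on in zip(names, mask) if on)
        size, time = evaluate(enabled)
        if time <= max_time and (best is None or size < best[0]):
            best = (size, time, enabled)
    return best


if __name__ == "__main__":
    print(iterative_search(95))  # (195, 95, ('inline', 'compact'))
```

With this model, neither transformation alone gives the best result under the bound; the combination does, which is the kind of global interaction a sequence of local decisions can miss.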
High performance instruction set simulation
 Embedded processors:
 parallel development of silicon, ISA, compiler and applications
 Need for flexible instruction set simulation:
 high performance
 simulation of large codes
 debugging
 retargetable to experiment:
• new ISA
• various microarchitecture options
 First results: up to 50x faster than an ad-hoc simulator
ABSCISS: Assembly Based System for Compiled Instruction Set Simulation
[Diagram labels: C Source; tmcc; TriMedia Assembly; tmas; TriMedia Binary; tmsim; Architecture Description; ABSCISS; C/C++ Source; gcc; Compiled simulator]
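The difference between interpretive and compiled instruction set simulation can be shown on a toy example. This is only a sketch, not the ABSCISS implementation: the three-opcode assembly language is hypothetical, it handles straight-line code only, and Python's `exec` stands in for the host compiler step (gcc in the diagram).

```python
def interpret(program, regs):
    """Interpretive simulation: decode every instruction each time it runs."""
    for line in program:
        op, dst, src = line.replace(",", " ").split()
        if op == "li":
            regs[dst] = int(src)
        elif op == "add":
            regs[dst] = regs[dst] + regs[src]
        elif op == "mul":
            regs[dst] = regs[dst] * regs[src]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs


def compile_simulator(program):
    """Compiled simulation: translate the assembly once into host code."""
    body = []
    for line in program:
        op, dst, src = line.replace(",", " ").split()
        if op == "li":
            body.append(f"    regs['{dst}'] = {int(src)}")
        elif op == "add":
            body.append(f"    regs['{dst}'] = regs['{dst}'] + regs['{src}']")
        elif op == "mul":
            body.append(f"    regs['{dst}'] = regs['{dst}'] * regs['{src}']")
        else:
            raise ValueError(f"unknown opcode {op}")
    source = "def simulate(regs):\n" + "\n".join(body) + "\n    return regs\n"
    namespace = {}
    exec(source, namespace)  # host "compiler" step
    return namespace["simulate"]


PROGRAM = ["li r1, 6", "li r2, 7", "mul r1, r2", "add r1, r2"]

if __name__ == "__main__":
    print(interpret(PROGRAM, {}))          # {'r1': 49, 'r2': 7}
    print(compile_simulator(PROGRAM)({}))  # {'r1': 49, 'r2': 7}
```

The decoding cost is paid once at translation time in the compiled version, instead of on every simulated instruction.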
Enabling superscalar processor simulation
 Complete O-O-O microprocessor simulation:
 10,000 to 100,000 times slower than real hardware
 cannot simulate realistic applications, only slices
 even fast-mode emulation is slow (50-100x):
• simulation generally limited to slices at the beginning of the application
• representativeness?
 Calvin2 + DICE:
 combines direct execution with simulation
 really fast mode: 1-2x slowdown
 enables simulating slices distributed over the whole application
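A back-of-the-envelope calculation shows why the speed of the fast mode matters. The slowdown factors below are the ones quoted on the slide; the 60-second native run time and the 0.1% detailed fraction are assumptions chosen for illustration only.

```python
def simulation_time(native_seconds, detailed_fraction, detailed_slowdown, fast_slowdown):
    """Wall-clock time when a fraction of the run is simulated in detail
    and the rest runs in a faster mode."""
    detailed = native_seconds * detailed_fraction * detailed_slowdown
    fast = native_seconds * (1.0 - detailed_fraction) * fast_slowdown
    return detailed + fast


if __name__ == "__main__":
    # Assumed: 60 s native run, 0.1% of it simulated in detail at 10,000x.
    with_emulation = simulation_time(60, 0.001, 10_000, 100)  # fast mode at 100x
    with_direct = simulation_time(60, 0.001, 10_000, 2)       # fast mode at 2x
    print(f"{with_emulation / 60:.0f} min vs {with_direct / 60:.0f} min")
```

Under these assumptions the detailed slices cost the same in both cases (about 10 minutes), so the total drops from roughly 110 minutes to roughly 12 minutes when the fast mode runs at 2x instead of 100x.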
Calvin2 + DICE
[Diagram labels: calvin2 (Static Code Annotation Tool); Original code; SPARC V9 assembly code; DICE; Host ISA Emulator; User analysis routines; checkpoints; Switching events; Emulation mode]
Moving tools to IA64

New 64-bit ISA from Intel/HP:
 Explicitly Parallel Instruction Computing
 Predicated Execution
 Advanced loads (i.e. speculative)
 A very interesting platform for research!

Porting SALTO and the Calvin2+DICE approach to IA64

Exploring new trade-offs enabled by instruction sets:
 predicting the predicates?
 advanced loads versus predicting dependencies
 ultimate out-of-order execution versus the compiler
Low power, compilation, architecture, …
(just beginning :=)

Power consumption becomes a major issue:
 Embedded and general purpose

Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
 Is it different from performance optimization?
 Global constraint optimization
 Instruction Set Architecture support?

Architecture:
 High order bits are generally null, …
 registers and memory
 ALUs
Caches and branch predictors
 International CAPS visibility in architecture =
 skewed associative cache
 + decoupled sectored cache
 + multiple block ahead branch prediction
 + skewed branch predictor
 Continue recurrent work on these topics:
 multiple block ahead + tradeoffs complexity/accuracy
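The principle of a skewed associative cache, where each way is indexed by a different function of the address, can be sketched as follows. This is a toy model under stated assumptions: the index functions and the replacement policy are illustrative only, not the ones used in the CAPS designs, and the number of sets per way is assumed to be a power of two.

```python
class SkewedCache:
    """Toy 2-way skewed-associative cache: one index function per way."""

    def __init__(self, sets_per_way=8):
        # sets_per_way is assumed to be a power of two so the XOR stays in range.
        self.n = sets_per_way
        self.ways = [[None] * sets_per_way, [None] * sets_per_way]

    def index(self, way, block):
        low = block % self.n
        high = (block // self.n) % self.n
        # Way 0 uses the conventional index; way 1 mixes in higher bits.
        # These hash functions are illustrative only.
        return low if way == 0 else low ^ high

    def access(self, block):
        """Return True on a hit; on a miss, fill a free way (else way 0)."""
        slots = [(w, self.index(w, block)) for w in range(2)]
        for w, i in slots:
            if self.ways[w][i] == block:
                return True
        for w, i in slots:
            if self.ways[w][i] is None:
                self.ways[w][i] = block
                return False
        w, i = slots[0]  # naive replacement: always evict from way 0
        self.ways[w][i] = block
        return False


if __name__ == "__main__":
    cache = SkewedCache()
    print([cache.access(b) for b in (0, 8, 16)])  # [False, False, False]
    print([cache.access(b) for b in (0, 8, 16)])  # [True, True, True]
```

Blocks 0, 8 and 16 all share index 0 in way 0, so a conventional 2-way set-associative cache of the same size could hold only two of them at once. Here they land on different lines of way 1 and all three stay resident.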
Simultaneous Multithreading
 Sharing functional units among several processes
 Among the first groups working on this topic
 S. Hily’s Ph.D.
 SMT behavior well understood for independent threads
 now, focus on parallel threads from a single application
 Current research directions:
 speculative multithreading
• ultimate performance with a single thread through predicting threads
 performance/complexity tradeoffs: SMT/CMP/hybrid
« Enlarging » the instruction window
(supported by Intel)
 In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.
 Limitations are:
 size of the window
 number of physical registers
 Prescheduling:
 separate data flow scheduling from resource arbitration.
 coarser units of work?
 Reducing the number of physical registers:
 how to detect when a physical register is dead?
 Per group validation? revisiting CISC/RISC war?
Unwritten rule on superscalar processor designs
 For general purpose registers:
Any physical register can be the source or the result of any instruction executed on any functional unit
4-cluster WSRS architecture
(supported by Intel)
[Diagram labels: clusters C0, C1, C2, C3; S0, S1, S2, S3]
• Half the read ports, one fourth the write ports
• Register file:
• Silicon area x 1/8
• Power x 1/2
• Access time x 0.6
• Gains on:
• bypass network
• selection logic
Multiprocessor on a chip
 Not just replicating board level solutions!
 A way to manage a large on-chip cache capacity:
 how can a sequential application efficiently use a distributed cache?
 architectural support for distributing a sequential application on several processors?
 how should instructions and data be distributed?
HIPSOR
HIgh Performance SOftware Random number generation
 Need for unpredictable random number generation:
 sequences that cannot be reproduced
 State of the art:
 < 100 bit/s using the operating system
 75 Kbit/s using the hardware generator on Pentium III
 The internal state of a superscalar processor cannot be reproduced
 use this state to generate unpredictable random numbers
HIPSOR (2)
 1000’s of unmonitorable states modified by OS interrupts
 Hardware clock counter to indirectly probe these states
 Combined with in-line pseudo-random number generation
 100 Mbit/s unpredictable random numbers
ARC INRIA with CODES
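The two ingredients named above, probing a clock counter and mixing the readings into an in-line pseudo-random generator, can be sketched as follows. This is an illustration only: it is not the HIPSOR algorithm, the function name is hypothetical, Python's `time.perf_counter_ns()` stands in for the hardware clock counter, and the output has not been vetted for cryptographic use.

```python
import time


def jitter_random_bytes(count):
    """Mix timer readings into a xorshift-style state and emit bytes.

    Illustrative only: not the HIPSOR algorithm, not vetted for cryptographic use.
    """
    mask = (1 << 64) - 1
    state = (time.perf_counter_ns() & mask) or 1
    out = bytearray()
    while len(out) < count:
        # Probe the clock: elapsed time varies with machine state
        # (caches, interrupts, scheduling) that the program cannot observe directly.
        state ^= time.perf_counter_ns() & mask
        # In-line pseudo-random step (64-bit xorshift) to spread the timing bits.
        state ^= (state << 13) & mask
        state ^= state >> 7
        state ^= (state << 17) & mask
        state = state or 1  # avoid the all-zero fixed point
        out.append(state & 0xFF)
    return bytes(out)


if __name__ == "__main__":
    print(jitter_random_bytes(16).hex())
```

Two runs give different output because the timer readings differ, which is the "cannot be reproduced" property the slides ask for. How much real unpredictability each reading carries depends on the platform and is not estimated here.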
