
Writing software for embedded systems
Part I: Introduction, design flow
Part II: The role of compilers
Alessandro Dalla Torre, Andrea Marongiu
{alessandro.dallatorre, andrea.marongiu3}@unibo.it
Introduction
Writing software for embedded systems
• Traditional method: microprocessor emulation.
  – The software engineer:
    1. develops the code on a PC or workstation;
    2. uses the emulator as a window into the system.
• Alternative approach:
  – ready-built prototype board;
  – a way to download and debug the code on the target board has to be supplied.
The Compilation Process
Middleware and OS integration
• The toolchain: a set of computer programs (tools) that are used to create a product (typically another program).
• Native versus cross-compilers.
• Run-time libraries:
  – Processor/architecture dependent (system calls)
    • Memory allocation, task control, semaphores.
  – I/O dependent.
• Writing and linking additional libraries:
  – Embedded OS like RTEMS.
  – Communication libraries, for instance a message-passing support library like MP-Queue.
Downloading binary code into the target platform
• Serial lines, parallel ports:
  1. The host is connected to the onboard debugger via a serial comms port.
  2. A download command sends the file to the target debugger.
  3. The target debugger converts the ASCII format back into binary and loads it at the correct location.
• Alternative: burn the program into EPROM or another form of non-volatile memory such as FLASH.
• If we are simulating on a virtual platform (emulator), the cross-compiled binary file has to be put into the correct path.
• From that location it will be picked up and the code loaded at bootstrap by the simulator:
  – Possibly a different binary may be bound to a different virtual core.
Instruction Set Simulator (ISS)
• An Instruction Set Simulator (ISS) is a simulation model, usually coded in a high-level language.
• It mimics the behavior of another hardware device or microprocessor by:
  – "reading" instructions;
  – maintaining internal variables which represent the processor's registers.
• The number of instructions needed to perform the basic "loop":
  – Fetch
  – Execute
  – Calculate new address
  depends on the hardware, but it always takes multiple local host instructions to simulate a single target instruction (a toy fetch/execute loop is sketched below).
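As an illustration (not from the slides), here is a minimal sketch of such a fetch/execute loop for a hypothetical two-byte toy ISA; the opcodes, register-file size and memory contents are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical toy ISA: one opcode byte plus one byte packing two
   register indices (destination in the high nibble, source in the low). */
enum { OP_HALT = 0, OP_ADD = 1, OP_MOV = 2 };

int main(void) {
    uint32_t regs[8] = {0};            /* internal variables modelling the registers */
    uint8_t  mem[]   = { OP_MOV, 0x01, /* r0 <- r1       */
                         OP_ADD, 0x01, /* r0 <- r0 + r1  */
                         OP_HALT, 0 };
    uint32_t pc = 0;                   /* modelled program counter */

    regs[1] = 21;
    for (;;) {
        uint8_t op = mem[pc], args = mem[pc + 1];   /* fetch                 */
        uint8_t rd = args >> 4, rs = args & 0x0F;
        if (op == OP_HALT) break;                   /* decode                */
        if (op == OP_MOV)  regs[rd] = regs[rs];     /* execute               */
        if (op == OP_ADD)  regs[rd] += regs[rs];
        pc += 2;                                    /* calculate new address */
    }
    printf("r0 = %u\n", (unsigned)regs[0]);         /* prints: r0 = 42       */
    return 0;
}

Note how even this trivial loop spends many host instructions (array reads, compares, branches) per simulated target instruction.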
Debugging techniques
• High-level language simulation
  – Directly on the host machine used for development, by means of Linux threads/processes and IPC facilities.
• Task-level debugging
  – The operating system may provide breakpointing facilities on system circumstances, like:
    • Events,
    • Messages,
    • Interrupt routines.
• Low-level simulation
  – ISS simulation, slower but more accurate (even cycle accurate).
• Onboard debugging
  – Code has to be downloaded into the target evaluation board;
  – A remote terminal can be attached and run on the host machine to monitor the program execution.
Cross Compiler
• A compiler capable of creating executable code for a
platform other than the one on which the compiler is run.
• Its fundamental use is that of separating the build
environment from the target environment.
– Embedded systems have limited resources, often not
powerful enough to run a compiler or a development
environment (debugging).
– A single build environment can be set up to compile
different (multiple) targets.
GNU Toolchain
• The set of programming tools used for developing
both applications and operating systems.
• A vital component in Linux kernel development.
• A standard tool when developing software for embedded
systems.
• Projects included in the GNU toolchain are:
– GNU make
– GNU Compiler Collection (GCC)
– GNU Binutils
Compiling code
#include <stdio.h>                      /* handled by the pre-processor */
#define TRUE 1                          /* handled by the pre-processor */
#define FALSE 0

int main(void)
{
    int i;                              /* handled by the compiler      */
    i = 5 * 2;
    printf("5 times 2 is %d.\n", i);    /* handled by library routines  */
    printf("TRUE is %d.\n", TRUE);
    printf("FALSE is %d.\n", FALSE);
    return 0;
}
Compiling code
source code (+ header files) -> pre-processor -> compiler -> assembly source (.s) -> assembler -> object code -> linker -> executable file
(the pre-processor and the compiler proper belong to the compiler package; the assembler and the linker are part of binutils)
Compiling code
• The pre-processor handles
– Macros (#define)
– Inclusions (#include)
– Conditional code inclusion (#ifdef, #if)
– Language extensions (#pragma).
• The compiler processes source code and turns it into assembler
modules.
• The assembler converts them into object (machine) code.
• The linker takes the object files and searches library files to find the
routines they call. It resolves the address references and
incorporates any symbolic information to create an executable file.
Runtime Libraries
• Compilers generate code directly only for a small subset of the high-level
language's facilities and commands (the built-in routines).
• They rely on libraries to provide the full range of functions that the language
offers:
  – Processor dependent: mathematical functions, string manipulation and
    similar features that use the processor and don't need to communicate
    with peripherals;
  – I/O dependent: define the hardware that the software needs to access.
    The library routine either drives the hardware directly or calls the
    operating system to perform its task;
  – System calls: typical routines are those which dynamically allocate
    memory, issue task control commands, use semaphores, etc.;
  – Exit routines: used to terminate programs and free up the memory used by
    the application.
Part II
The role of compilers

Program written in a Programming Language -> Compiler (Translation) -> Assembly Language
High-level View of a Compiler
Source code -> Compiler -> Machine code (+ Errors)

• Must recognize legal (and illegal) programs – understand and preserve the
meaning of the source program.
• Must generate correct code – map the functionality of the source program to
the target (usually the ISA of some computer system).
• Must introduce optimizations on the original code:
  1. Performance/Speed
  2. Code size
  3. Power consumption
Traditional Two-pass Compiler
Source code -> Front End -> IR -> Back End -> Machine code (+ Errors)
• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code
Multiple front and back ends
Fortran, C/C++, Java and PASCAL front ends all produce a common IR, which feeds separate back ends for Target 1, Target 2 and Target 3.

• Must encode all language-specific knowledge in each
front end
• Must encode all features in a single IR
• Must encode all target-specific knowledge in each
back end
Anatomy of a Compiler
FRONT END:  Program (character stream) -> Lexical Analyzer (Scanner) -> Token Stream -> Syntax Analyzer (Parser) -> Parse Tree -> Semantic Analyzer -> Intermediate Representation
MIDDLE END: Code Optimizer -> Optimized Intermediate Representation
BACK END:   Code Generator -> Assembly code
Lexical Analyzer (Scanner)
Input:  234 * (11 + -22)
Output: Num(234) mul_op lpar_op Num(11) add_op Num(-22) rpar_op

Error examples: "18..23" is not a number; "val#ue" is rejected because variable names cannot contain the '#' character.
(A minimal scanner sketch follows below.)
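As an illustration (not from the slides), a minimal scanner for the expression above; the token names mirror the slide, the code itself is only a sketch.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative scanner for expressions like "234 * (11 + -22)". */
static const char *p;                 /* cursor into the input string */

static void next_token(void) {
    while (isspace((unsigned char)*p)) p++;
    if (*p == '\0') return;
    if (isdigit((unsigned char)*p) ||
        (*p == '-' && isdigit((unsigned char)p[1]))) {
        char *end;
        long v = strtol(p, &end, 10); /* number literal               */
        p = end;
        printf("Num(%ld) ", v);
    } else if (*p == '*') { printf("mul_op ");  p++; }
    else   if (*p == '+') { printf("add_op ");  p++; }
    else   if (*p == '(') { printf("lpar_op "); p++; }
    else   if (*p == ')') { printf("rpar_op "); p++; }
    else { printf("<error '%c'> ", *p); p++; }    /* e.g. the '#' in val#ue */
}

int main(void) {
    p = "234 * (11 + -22)";
    while (*p) next_token();
    printf("\n");  /* Num(234) mul_op lpar_op Num(11) add_op Num(-22) rpar_op */
    return 0;
}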
Syntax Analyzer (Parser)
E.g., parse tree for: x + 2 - y

goal
└─ expr
   ├─ expr
   │  ├─ expr ── term ── <id,x>
   │  ├─ op   ── +
   │  └─ term ── <number,2>
   ├─ op   ── -
   └─ term ── <id,y>

This contains a lot of unneeded information.

Grammar:
1. goal → expr
2. expr → expr op term
3.       | term
4. term → number
5.       | id
6. op   → +
7.       | -

(A small recursive-descent sketch for this grammar follows below.)
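As an illustration (not part of the slides), a minimal recursive-descent recognizer for this grammar; rule 2's left recursion is rewritten as iteration, and the function names are invented for the example.

#include <ctype.h>
#include <stdio.h>

/* Recognizes:  expr -> term { (+|-) term } ,  term -> number | id */
static const char *p;

static void skip(void) { while (isspace((unsigned char)*p)) p++; }

static int term(void) {                       /* term -> number | id */
    skip();
    if (isdigit((unsigned char)*p)) { while (isdigit((unsigned char)*p)) p++; return 1; }
    if (isalpha((unsigned char)*p)) { while (isalnum((unsigned char)*p)) p++; return 1; }
    return 0;
}

static int expr(void) {                       /* expr -> term { op term } */
    if (!term()) return 0;
    skip();
    while (*p == '+' || *p == '-') { p++; if (!term()) return 0; skip(); }
    return 1;
}

int main(void) {
    p = "x + 2 - y";
    printf("%s\n", expr() && *p == '\0' ? "accepted" : "rejected");   /* accepted */
    return 0;
}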
Syntax Analyzer (Parser)
int * foo(i, j, k))              <- extra parentheses
int i;
int j;
{
    for(i=0; i j) {              <- "i j" is not an expression; the increment clause is missing
        fi(i>j)                  <- "fi" is not a keyword
            return j;
    }
}
Semantic Analyzer
int * foo(i, j, k)               <- the type of parameter k is not declared
int i;
int j;
{
    int x;
    x = x + j + N;               <- x is used uninitialized; N is an undeclared variable
    return j;                    <- mismatched return type (j is an int, foo returns int *)
}
Anatomy of a Compiler
FRONT END:  Program (character stream) -> Lexical Analyzer (Scanner) -> Token Stream -> Syntax Analyzer (Parser) -> Parse Tree -> Semantic Analyzer -> Intermediate Representation
MIDDLE END: Code Optimizer -> Optimized Intermediate Representation
BACK END:   Code Generator -> Assembly code
Constant Propagation
int i, x, y;
x = 0;
y = 0;
for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*0;                 /* was: x = x + b*y;  (y is known to hold the constant 0) */
}
return x;

Uses of variables initialized with a constant
value are replaced with the value itself
Algebraic Simplification
int i, x, y;
x = 0;
y = 0;
for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;                       /* was: x = x + b*0;  (b*0 folds to 0, x + 0 folds to x) */
}
return x;

Simple algebraic expressions are evaluated
and replaced with the resulting value
Copy Propagation
int i, x, y;
x = 0;
y = 0;
for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;
}
return x;

Targets of direct assignments of the form y = x
are replaced with their values
Another example:
    y = x;             y = x;
    z = 3 + y;   -->   z = 3 + x;
Common Subexpression Elimination
int i, x, y, t;
x = 0;
y = 0;
for(i = 0; i <= N; i++) {
    t = i+1;                     /* temporary holding the repeated subexpression */
    x = x + (4*a/b)*i + t * t;   /* was: ... + (i+1)*(i+1)                       */
    x = x;
}
return x;

A subexpression that occurs more than once is
replaced with the use of a temporary variable
initialized with the subexpression
Dead Code Elimination
int i, x, t;
x = 0;
for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t * t;
}
return x;

Code that is never reached during execution,
or assignments to variables that are never
used (here y = 0; and x = x;), are considered dead and removed
Loop Invariant Removal
int i, x, t, u;
x = 0;
u = (4*a/b);                     /* loop-invariant expression hoisted out of the loop */
for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u * i + t * t;       /* was: ... + (4*a/b)*i ...                          */
}
return x;

Expressions within a loop that are independent
from the iteration count are moved outside the
loop body and computed only once
Anatomy of a Compiler
FRONT END:  Program (character stream) -> Lexical Analyzer (Scanner) -> Token Stream -> Syntax Analyzer (Parser) -> Parse Tree -> Semantic Analyzer -> Intermediate Representation
MIDDLE END: Code Optimizer -> Optimized Intermediate Representation
BACK END:   Code Generator -> Assembly code
Code Generator
• The first role of the backend is that of mapping the optimized code (IR) into
instructions of the target machine ISA.

int sumcalc(int a, int b, int N)
{
    int i, x, t, u;
    x = 0;
    u = (4*a/b);
    for(i = 0; i <= N; i++) {
        t = i+1;
        x = x + u * i + t * t;
    }
    return x;
}

[The slide shows the x86-64 assembly emitted for sumcalc: register setup with xorl/movl, the division via cltd/idivl, a loop body around label .L5 built from leal/imull/addl, the loop test cmpl/jle .L5, and the exit path at .L7 that returns the result in %eax with ret.]

• ...but it is not the only one!
• Some optimizations (register allocation,
instruction scheduling) need knowledge of
some details of the target architecture.
They are implemented in the backend.
Back-end optimizations:
Instruction scheduling
• Many pipeline stages:
  – Pentium: 5
  – Pentium Pro: 10
  – Pentium IV (130nm): 20
  – Pentium IV (90nm): 31
• Different instructions take different amounts of time to execute
• Most modern processors have multiple execution units (superscalar)
  – If the instruction sequence is correct, multiple operations will happen in
    the same cycle
  – Even more important to have the right instruction sequence
• Reorder instructions so that pipeline stalls are minimized
Data Dependency between
Instructions
• If two instructions access the same variable,
they can be dependent
• Kinds of dependency:
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What to do if two instructions are dependent?
  – The order of execution cannot be reversed
  – This reduces the possibilities for scheduling (see the small example below)
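A small illustration (not from the slides) of the three kinds of dependency on plain C statements; the variable names are arbitrary.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c, d;

    c = a + b;   /* (1) writes c                                              */
    d = c * 2;   /* (2) reads c  -> true dependence on (1): write -> read     */
    c = b - a;   /* (3) writes c -> anti dependence on (2): read -> write,    */
                 /*                 output dependence on (1): write -> write  */

    printf("%d %d\n", c, d);   /* reordering (1)-(3) would change this output */
    return 0;
}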
Representing Dependencies
• Using a dependence DAG (Directed Acyclic Graph), one
per basic block
• Nodes are instructions, edges represent dependencies
1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1

• Each edge is labeled with its latency:
  1 → 3 (latency 2), 2 → 3 (latency 2), 1 → 4 (latency 2)
Example
1: lea  var_a, %rax          1 cycle
2: add  $4, %rax             1 cycle
3: inc  %r11                 1 cycle
4: mov  4(%rsp), %r10        3 cycles
5: add  %r10, 8(%rsp)        1 cycle
6: and  16(%rsp), %rbx       4 cycles
7: imul %rax, %rbx           3 cycles

Issued in program order, the block results in:
    1 2 3 4 st st 5 6 st st st 7 st st   ->   14 cycles
List Scheduling Algorithm
• Create a dependence DAG of a basic block
– Do a topological sort of the dependence DAG
– Consider when an instruction can be scheduled without causing a stall
– Schedule the instruction if it causes no stall and all its predecessors are
already scheduled
• Topological Sort
– READY = nodes with no predecessors. Loop until READY is empty
• Heuristics for selecting from the READY list
– pick the node with the longest path to a leaf in the dependence graph
– pick a node with most immediate successors
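A compact sketch (not from the slides) of the list-scheduling loop for the 7-instruction dependence DAG shown earlier; the latencies and edges are hard-coded, a single instruction is issued per cycle, and the READY node with the longest latency path to a leaf is picked first.

#include <stdio.h>

#define N 7                                /* instructions 1..7 of the example   */

/* latencies from the slide: lea, add, inc, mov, add, and, imul                  */
static const int lat[N + 1] = {0, 1, 1, 1, 3, 1, 4, 3};
/* dependence edges (producer -> consumer), also from the slide's DAG            */
static const int edge[][2]  = { {1,2}, {2,7}, {4,5}, {6,7} };
static const int n_edges    = sizeof edge / sizeof edge[0];

static int longest_path(int n) {           /* longest latency path from n to a leaf */
    int best = lat[n];
    for (int e = 0; e < n_edges; e++)
        if (edge[e][0] == n) {
            int p = lat[n] + longest_path(edge[e][1]);
            if (p > best) best = p;
        }
    return best;
}

int main(void) {
    int issued[N + 1] = {0};               /* cycle at which each instr was issued */
    int done = 0, cycle = 1;

    while (done < N) {
        int pick = 0, pick_path = -1;
        for (int n = 1; n <= N; n++) {     /* build the READY set on the fly       */
            if (issued[n]) continue;
            int ready = 1;
            for (int e = 0; e < n_edges; e++)
                if (edge[e][1] == n &&
                    (!issued[edge[e][0]] ||
                     issued[edge[e][0]] + lat[edge[e][0]] > cycle))
                    ready = 0;             /* predecessor missing or still in flight */
            if (ready && longest_path(n) > pick_path) {
                pick = n;
                pick_path = longest_path(n);
            }
        }
        if (pick) { issued[pick] = cycle; done++; printf("%d ", pick); }
        else      { printf("st "); }       /* nothing ready: a stall cycle          */
        cycle++;
    }
    printf("\n");                          /* prints: 6 1 2 4 7 3 5                 */
    return 0;
}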
Example
1: lea  var_a, %rax
2: add  $4, %rax
3: inc  %r11
4: mov  4(%rsp), %r10
5: add  %r10, 8(%rsp)
6: and  16(%rsp), %rbx
7: imul %rax, %rbx
8: mov  %rbx, 16(%rsp)
9: lea  var_b, %rax

[Dependence DAG of the block, with each node annotated with d = longest path to a leaf node and f = number of immediate successors: node 6 has d=7, node 1 d=5, node 2 d=4, nodes 4 and 7 have d=3, and the leaves 3, 5, 8, 9 have d=0 and f=0.]
Example
READY set evolution while scheduling:
{ 6, 1, 4, 3 }
{ 1, 4, 3 }
{ 4, 3 } → { 2, 4, 3 }
{ 4, 3 } → { 4, 7, 3 }
{ 7, 3 } → { 7, 3, 5 }
{ 3, 5 } → { 3, 5, 8, 9 }
{ 5, 8, 9 }
{ 8, 9 }
{ 9 }
{ }

Resulting schedule: 6 1 2 4 7 3 5 8 9
[Same annotated dependence DAG as on the previous slide.]
Example
1: lea  var_a, %rax
2: add  $4, %rax
3: inc  %r11
4: mov  4(%rsp), %r10
5: add  %r10, 8(%rsp)
6: and  16(%rsp), %rbx
7: imul %rax, %rbx
8: mov  %rbx, 16(%rsp)
9: lea  var_b, %rax

Program order (with stalls):   1 2 3 4 st st 5 6 st st st 7 st st
Scheduled order (no stalls):   6 1 2 4 7 3 5 8 9
14 cycles vs 9 cycles
Anatomy of a Compiler
FRONT END:  Program (character stream) -> Lexical Analyzer (Scanner) -> Token Stream -> Syntax Analyzer (Parser) -> Parse Tree -> Semantic Analyzer -> Intermediate Representation
MIDDLE END: Code Optimizer -> Optimized Intermediate Representation
BACK END:   Code Generator -> Assembly code
GCC Internals
Multiple front-ends share a common intermediate representation -> retargetable!
IR – Control Flow Graph
• Most analysis/optimization passes inside a compiler are
performed on CFGs
• A CFG is a directed graph which models flow of control
in the program.
• Each node corresponds to a basic block, i.e. a sequence
of non-branch instructions.
• Edges correspond to possible transfers of control between
blocks
Control Flow Graph
int add(n, k) {
    s = 0; a = 4; i = 0;
    if (k == 0)
        b = 1;
    else
        b = 2;
    while (i < n) {
        s = s + a*b;
        i = i + 1;
    }
    return s;
}

[CFG: the entry block { s = 0; a = 4; i = 0 } branches on k == 0 to { b = 1 } or { b = 2 }; both paths join at the loop-test block i < n, whose body block { s = s + a*b; i = i + 1 } has a back edge to the test; the loop exit leads to { return s }.]
Basic Block Construction
• Start with instruction control-flow graph (each basic
block contains a single instruction)
• Visit all edges in graph
• Merge adjacent nodes if
– Only one edge from first node
– Only one edge into second node
[Figure: successive merging steps on the add() example — the single-instruction blocks s = 0, a = 4, i = 0 and the test k == 0 collapse into one entry block; b = 1 and b = 2 remain separate blocks; the loop test i < n and the body s = s + a*b; i = i + 1 become their own blocks; return s is the exit block. A small sketch of the merge rule follows below.]
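A small sketch (not from the slides) of the merge rule on a tiny adjacency-list graph; the node numbering and helper arrays are invented for the illustration.

#include <stdio.h>

#define MAXN 16

/* Node a is merged into its unique successor b when a has exactly one
   outgoing edge and b exactly one incoming edge.  Each node initially
   stands for a single-instruction basic block.                         */
static int succ[MAXN][MAXN], n_succ[MAXN], n_pred[MAXN], merged_into[MAXN];
static int n_nodes = 5;

static void add_edge(int a, int b) {
    succ[a][n_succ[a]++] = b;
    n_pred[b]++;
}

int main(void) {
    /* straight-line chain 0 -> 1 -> 2, then a branch 2 -> 3 and 2 -> 4 */
    add_edge(0, 1); add_edge(1, 2); add_edge(2, 3); add_edge(2, 4);
    for (int n = 0; n < n_nodes; n++) merged_into[n] = n;

    for (int a = 0; a < n_nodes; a++) {      /* visit nodes in program order */
        int b = succ[a][0];
        if (n_succ[a] == 1 && n_pred[b] == 1)    /* the merge condition      */
            merged_into[b] = merged_into[a];     /* b joins a's block        */
    }
    for (int n = 0; n < n_nodes; n++)
        printf("instr %d -> block %d\n", n, merged_into[n]);
    /* instrs 0, 1, 2 end up in block 0; 3 and 4 stay in their own blocks    */
    return 0;
}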
Optimizing for parallel architectures
Automatic Loop Parallelization
Types of Parallelism
• Instruction Level Parallelism (ILP) → scheduling and hardware
• Task Level Parallelism (TLP) → mainly by hand
• Loop Level Parallelism (LLP) or Data Parallelism → generated by hand or by the compiler
• Pipeline Parallelism → hardware or streaming
• Divide and Conquer Parallelism → recursive functions
Loop parallelization
• Why loops?
  – 90% of the execution time is spent in 10% of the code
    • Mostly in loops
  – If parallel, can get good performance
    • Load balancing
  – Relatively easy to analyze
• How to automatically parallelize loops?
  – Find FORALL loops out of FOR loops
    • Data dependence analysis
• Definition
  – Loop-carried dependence: a dependence that crosses a loop iteration boundary
• If there are no loop-carried dependences → the loop is parallelizable (see the example below)
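For illustration (not from the slides), two loops contrasting the two cases; the arrays and sizes are arbitrary.

#include <stdio.h>

#define N 8

int main(void) {
    int A[N], B[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 0; }

    /* No loop-carried dependence: each iteration touches only A[i],
       so the iterations could run in parallel (a FORALL candidate).    */
    for (int i = 0; i < N; i++)
        A[i] = A[i] + 1;

    /* Loop-carried dependence: iteration i reads B[i-1], written by
       iteration i-1, so the iterations cannot simply run in parallel.  */
    for (int i = 1; i < N; i++)
        B[i] = B[i - 1] + A[i];

    printf("%d %d\n", A[N - 1], B[N - 1]);
    return 0;
}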
Programmer Defined Parallel Loop
• FORALL
– No “loop carried
dependences”
– Fully parallel
• FORACROSS
– Some “loop carried
dependences”
Loop Splitting
• Example
FORPAR I = 0 to N
A[I] = A[I] + 1
• Block Distribution: the program gets mapped into
    Iters = ceiling(N/NUMPROC);
    FOR P = 0 to NUMPROC-1
        FOR I = P*Iters to MIN((P+1)*Iters, N)
            A[I] = A[I] + 1
Code transformation

sequential code:
    int main() {
        ...
        for (i=0; i<N; i++)
            for (j=i; j<N; j++)
                A[i][j] = 1;
        ...
    }

        |  PARALLELIZING COMPILER
        v

parallel code:
    void parallel_routine() {
        for (i=N*cpuID/nprocs;
             i<N*(cpuID+1)/nprocs;
             i++)
            for (j=i; j<N; j++)
                A[i][j] = 1.0;
    }

    int start() {
        ...
        do_all();          /* replaces the original loop nest */
        ...
    }
Runtime support for parallelization

runtime library:
    void main() {
        initenv();
        if (cpuID == MASTER) {
            // gather workers on barrier
            start();
            // release workers
        } else {
            // spin until work provided
            parallel_routine();
            // spin until work provided
        }
    }

    void do_all() {
        // release workers
        parallel_routine();
        // gather workers on barrier
    }

    // Synchronization facilities
    // Lock implementation
    // Barrier implementation

parallel code:
    void parallel_routine() {
        for (i=N*cpuID/nprocs;
             i<N*(cpuID+1)/nprocs;
             i++)
            for (j=i; j<N; j++)
                A[i][j] = 1.0;
    }

    int start() {
        // sequential code
        do_all();          /* replaces the original loop nest */
        // sequential code
    }
Cooperative approaches: OpenMP
• The compiler may not be able to do the parallelization
in the way you would like to see it:
  – A loop is not parallelized
    • The data dependency analysis is not able to determine whether it
      is safe to parallelize or not
  – The granularity is not high enough
    • The compiler lacks information to parallelize at the highest
      possible level
• This is when explicit parallelization through
OpenMP directives comes into the picture
OpenMP and GCC
• Language extensions for shared memory concurrency
(#pragma)
• Supports C, C++ and Fortran
• Embedded directives specify
– Parallelism
– Data sharing semantics
– Work sharing semantics
• Standard and increasingly popular
OpenMP programming model
• A typical OpenMP implementation relies on the pthreads
library.
• With MPSoCs the SPMD paradigm is also used
• Based on fork-join semantics
– Master thread spawns teams of children threads
– Allows sequential and parallel execution
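A minimal example (not from the slides) of the fork-join model with a work-shared loop; the array and its size are arbitrary. Compile with gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int A[N];

    /* The master thread forks a team; the loop iterations are split
       among the threads (work sharing), then the team joins again.    */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = i * i;

    printf("A[N-1] = %d\n", A[N - 1]);   /* sequential again after the join */
    return 0;
}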
Supporting OpenMP in GCC
• Recognize OpenMP pragmas
– Within the parser in each frontend
• Lower into GIMPLE IR
– Augment it with new OMP nodes
• Identify data sharing
– The master thread creates a local structure which contains all
the data items marked for sharing and passes its address to
slaves
• Identify work sharing
– Split computation between different cores (omp for)
• Identify parallel regions
– Outline the body of parallel regions into functions that are used
as arguments to the thread creation routines
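A conceptual sketch (not GCC's actual generated code) of the outlining and data-sharing steps described above: the body of the parallel region becomes a separate function that receives a pointer to a structure holding the shared data. The names omp_fn and shared_data are invented for the illustration, and a single pthread stands in for the thread team. Compile with gcc -pthread.

#include <stdio.h>
#include <pthread.h>

/* Structure holding the data items marked for sharing. */
struct shared_data { int *A; int N; };

static void *omp_fn(void *arg) {              /* outlined parallel-region body */
    struct shared_data *d = arg;
    for (int i = 0; i < d->N; i++)
        d->A[i] = i;
    return NULL;
}

int main(void) {
    int A[8];
    struct shared_data d = { A, 8 };          /* master packs the shared data    */

    pthread_t worker;
    pthread_create(&worker, NULL, omp_fn, &d);   /* hand its address to a slave  */
    pthread_join(worker, NULL);                  /* join at the end of the region */

    printf("A[7] = %d\n", A[7]);
    /* A real implementation would also split the iterations among the threads
       (work sharing) and let the master participate in the team.               */
    return 0;
}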