Compiler Issues for
Embedded Processors
경종민
[email protected]
Contents
• Compiler Design Issues
• Problems of Compilers for Embedded Processors
• Structure of typical C compiler
• Front end
• IR optimizer
• Back end
• Embedded-code optimization
• Retargetable compiler
Compiler Design Issues
• For embedded systems the use of compilers is
less common.
• Designers still use assembly language to program
many embedded applications.
– Huge programming effort
– Far less code portability
– Poorer maintainability
• Why is assembly programming still common?
– The reason lies in embedded systems’ high-efficiency
requirements.
Problems of Compilers for Embedded
Processors
• Embedded systems frequently employ
application-specific instruction set processors
(ASIPs)
– Meet design constraints more efficiently than general-purpose processors
• E.g., performance, cost and power consumption
• Building the required software development tool
infrastructure for ASIPs is expensive and time-consuming
– Especially true for efficient C and C++ compiler design,
which requires a large amount of resources and expert
knowledge.
• Therefore, C compilers are often unavailable for
newly designed ASIPs.
Problems of Compilers for Embedded
Processors
• Many existing compilers for ASIPs (e.g., DSPs)
generate low-quality code.
– Compiled code may be several times larger and/or
slower than handwritten assembly code.
• Such poor code is virtually useless when efficiency
is a requirement.
Problems of Compilers for Embedded
Processors
• The cause of the poor code quality is the highly
specialized architecture of ASIPs, whose
instruction sets can be incompatible with high-level languages and traditional compiler
technology:
– Because an instruction set is generally designed
primarily from a hardware designer’s viewpoint, and
– the architecture is fixed before considering compiler
issues.
Problems of Compilers for Embedded
Processors
• Problems of compiler unavailability must be solved,
because:
– Assembly programming will no longer meet short time-to-market requirements
– Future human programmers are unlikely to outperform
compilers
• As processor architectures become increasingly complex
(e.g., deep pipelining, predicated execution, and high
parallelism)
– Application programs should be machine-independent
(e.g., written in C) to allow architecture exploration with
various cost/performance tradeoffs.
Coarse structure of typical C compiler
Source code
→ Front end (scanner, parser, semantic analyzer)
→ Intermediate representation (IR)
→ IR optimizer (constant folding, constant propagation, jump optimization, loop-invariant code motion, dead code elimination)
→ Optimized IR
→ Back end (code selection, register allocation, scheduling, peephole optimization)
→ Assembly code
Front end
• The front end translates the source
program into a machine-independent IR
• The IR is stored in a simple format
such as three-address code
– Each statement is either an assignment
with at most three operands, a label, or
a jump
• The IR serves as a common
exchange format between the
front end and the subsequent
optimization passes, and also forms
the back-end input
Example IR (MIR code)
L1:
i ← i+1
t1 ← i+1
t2 ← p+4
t3 ← *t2
p ← t2
t4 ← t1 < 10
*r ← t3
if t4 goto L1
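The three-address format described on this slide can be modeled with a small record type. The following is a minimal sketch, not the representation of any particular compiler: the `Stmt` class, its field names, and the `"deref"` operator name are assumptions made for illustration. It encodes the example MIR loop and checks that no statement names more than three operands.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stmt:
    """One three-address statement: an assignment, a label, or a jump (hypothetical layout)."""
    kind: str                      # "assign", "label", or "cjump"
    dest: Optional[str] = None     # assignment target, e.g. "t1" or "*r"
    op: Optional[str] = None       # operator; None means a plain copy
    args: tuple = ()               # source operands (at most two)
    target: Optional[str] = None   # label name for "label" and "cjump"

def operand_count(stmt: Stmt) -> int:
    """Number of operands a statement names: destination plus sources."""
    return (1 if stmt.dest else 0) + len(stmt.args)

# The example loop from the slide, one Stmt per IR line.
loop = [
    Stmt("label", target="L1"),
    Stmt("assign", dest="i", op="+", args=("i", "1")),
    Stmt("assign", dest="t1", op="+", args=("i", "1")),
    Stmt("assign", dest="t2", op="+", args=("p", "4")),
    Stmt("assign", dest="t3", op="deref", args=("t2",)),   # t3 <- *t2
    Stmt("assign", dest="p", args=("t2",)),
    Stmt("assign", dest="t4", op="<", args=("t1", "10")),
    Stmt("assign", dest="*r", args=("t3",)),               # store through r
    Stmt("cjump", args=("t4",), target="L1"),              # if t4 goto L1
]

print(max(operand_count(s) for s in loop))
```

Keeping every statement this small is what lets later passes treat the IR as a flat list and rewrite one statement at a time.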
Front end
• Main components of the front end
– Scanner
• Recognizes certain character strings in the source code
• Groups them into tokens
– Parser
• Analyzes the syntax according to the underlying source-language grammar
– Semantic analyzer
• Performs bookkeeping of identifiers, as well as additional
correctness checks that the parser cannot perform
• Many tools (e.g., lex and yacc) that automate the
generation of scanners and parsers are available
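As an illustration of the scanner's job, here is a minimal hand-written sketch built on Python's standard `re` module rather than on lex. The token classes (`NUMBER`, `IDENT`, `OP`, `PUNCT`) are assumptions chosen for this example and cover only a tiny subset of C.

```python
import re

# Token classes for a tiny C subset (illustrative, not a full C lexer).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[-+*/=<>]"),
    ("PUNCT",  r"[()\[\]{};,]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Group the characters of `source` into (token class, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":          # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("A[2] = 3 * 5;"))
```

A generated scanner does the same grouping from a declarative token specification; the parser then consumes the token list instead of raw characters.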
IR optimizer
• The IR generated for a source program normally
contains many redundancies
– such as multiple computations of the same value or jump
chains, because the front end does not pay much
attention to optimization issues
• Human programmers may also have built redundancies
into the source code, which must be removed by
subsequent optimization passes
IR optimizer
• Constant folding
– replaces compile-time constant expressions with their
respective values
• Constant propagation
– Replaces variables known to carry a constant value with
the respective constant
• Jump optimization
– Simplifies jumps and removes jump chains
• Loop-invariant code motion
– Moves loop-invariant computations out of the loop body
• Dead code elimination
– Removes computations whose results are never needed
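Two of the passes listed above can be sketched on a single basic block. This is a simplified illustration under stated assumptions: statements are hypothetical `(dest, op, args)` tuples, there is no control flow, and no statement has side effects. Real passes work on a control-flow graph with data-flow analysis.

```python
def propagate_constants(block):
    """Replace uses of variables known to hold a constant (single basic block only)."""
    known = {}
    out = []
    for dest, op, args in block:
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "copy" and isinstance(args[0], int):
            known[dest] = args[0]        # dest now carries a known constant
        else:
            known.pop(dest, None)        # dest was overwritten by a non-constant
        out.append((dest, op, args))
    return out

def eliminate_dead_code(block, live_out):
    """Drop assignments whose results are never needed (assumes no side effects)."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(block):
        if dest in live:
            live.discard(dest)
            live.update(a for a in args if isinstance(a, str))
            kept.append((dest, op, args))
    kept.reverse()
    return kept

block = [
    ("n", "copy", (4,)),
    ("t1", "*", ("n", "x")),
    ("t2", "+", ("n", "y")),     # result is never used
    ("r", "+", ("t1", 1)),
]
propagated = propagate_constants(block)
print(eliminate_dead_code(propagated, live_out={"r"}))
```

After propagation no statement reads `n` any more, so dead code elimination removes both the unused `t2` and the now-redundant assignment to `n`.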
Example: constant folding
C example: an element of array A is assigned a constant
void f() {
  int A[10];
  A[2] = 3 * 5;
}
Unoptimized IR with two compile-time constant expressions
(C-like IR notation of the Lance compiler system)
void f() {
  int A[10], t1, t3, *t5;
  char *t2, *t4;
  t1 = 3 * 5;        → from the source code
  t4 = (char *) A;
  t3 = 2 * 4;        → array index 2 times the number of memory words
  t2 = t4 + t3;
  t5 = (int *) t2;
  *t5 = t1;
}
• Now the IR optimizer can apply constant folding to replace
both constant expressions by constant numbers, thus
avoiding expensive computations at program runtime
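The folding step on this example can be sketched as follows. The `(dest, op, args)` tuple encoding and the cast operator names are assumptions for illustration, not the Lance data structures. Division is left out on purpose, and Python integers do not wrap, so a real compiler would have to fold with the target's arithmetic rules.

```python
import operator

# Operators this sketch is willing to evaluate at compile time.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(block):
    """Replace operations on constant operands by a copy of the computed value."""
    out = []
    for dest, op, args in block:
        if op in FOLDABLE and all(isinstance(a, int) for a in args):
            out.append((dest, "copy", (FOLDABLE[op](*args),)))
        else:
            out.append((dest, op, args))
    return out

# The unoptimized IR of f() from the slide, in the hypothetical tuple encoding.
ir = [
    ("t1", "*", (3, 5)),
    ("t4", "cast_char_ptr", ("A",)),
    ("t3", "*", (2, 4)),
    ("t2", "+", ("t4", "t3")),
    ("t5", "cast_int_ptr", ("t2",)),
    ("*t5", "copy", ("t1",)),
]
folded = fold_constants(ir)
print(folded[0], folded[2])
```

Only the two constant expressions change: `t1` becomes a copy of 15 and `t3` a copy of 8, while the pointer arithmetic is left for other passes.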
IR optimizer
• A good compiler consists of many such IR
optimization passes.
– Some of them are far more complex and require an
advanced code analysis.
• There is strong interaction and mutual
dependence between these passes.
– Some optimizations enable further opportunities for other
optimizations.
– Passes should be applied repeatedly to be most effective
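The interaction between passes can be shown by running two of them until nothing changes. This is a minimal sketch assuming a straight-line block of hypothetical `(dest, op, args)` tuples; the pass implementations are deliberately simplified.

```python
def propagate(block):
    """Constant propagation within one basic block."""
    known, out = {}, []
    for dest, op, args in block:
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "copy" and isinstance(args[0], int):
            known[dest] = args[0]
        else:
            known.pop(dest, None)
        out.append((dest, op, args))
    return out

def fold(block):
    """Constant folding for + and * on integer operands."""
    ops = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    return [
        (dest, "copy", (ops[op](*args),))
        if op in ops and all(isinstance(a, int) for a in args)
        else (dest, op, args)
        for dest, op, args in block
    ]

def optimize(block, passes=(propagate, fold)):
    """Apply all passes repeatedly until a full round changes nothing."""
    rounds = 0
    while True:
        before = block
        for run_pass in passes:
            block = run_pass(block)
        rounds += 1
        if block == before:
            return block, rounds

start = [
    ("a", "copy", (2,)),
    ("b", "*", ("a", 3)),
    ("c", "+", ("b", 1)),
]
result, rounds = optimize(start)
print(result, rounds)
```

One round is not enough here: folding `b` into a constant is what allows the next round of propagation to reach `c`. The last round only confirms that a fixed point has been reached.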
Back end
• The back end (or code generator)
– maps the machine-independent IR into a behaviorally
equivalent machine-specific assembly program.
– Statement-oriented IR is converted into a more
expressive control/dataflow graph representation.
– Front end and IR optimization technologies are quite
mature but the back end is often the most crucial
compiler phase for embedded processors.
Major back end passes
• Code selection
– maps IR statements into assembly instructions
• Register allocation
– assigns symbolic variables and intermediate results to
the physically available machine registers
• Scheduling
– arranges the generated assembly instructions in time
slots
– considers inter-instruction dependencies and limited
processor resources
• Peephole optimization
– relatively simple pattern-matching replacement of certain
expensive instruction sequences by less expensive ones.
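A peephole pass of the kind described above can be sketched as a scan over adjacent instruction pairs. The `(op, register, operand)` tuples and the single rewrite rule are assumptions for illustration; the rule is only valid when no label or jump target sits between the two instructions and the memory location is not volatile.

```python
def peephole(code):
    """Remove a load that re-reads the value the previous store just wrote."""
    out = []
    for ins in code:
        if out:
            prev = out[-1]
            # STO R, X followed by LOD R, X: the register already holds X
            if ins[0] == "LOD" and prev[0] == "STO" and prev[1:] == ins[1:]:
                continue
        out.append(ins)
    return out

code = [
    ("LOD", "R1", "C"),
    ("MUL", "R1", "D"),
    ("STO", "R1", "T"),
    ("LOD", "R1", "T"),   # redundant reload
    ("ADD", "R1", "B"),
]
print(peephole(code))
```

Production peephole optimizers hold a table of many such patterns and usually rerun until no pattern matches.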
Back end passes for embedded processors
• Code selection
– To achieve good code quality, it must use complex
instructions
• multiply-accumulate (MAC), load-with-autoincrement, etc.
– Or it must use subword-level instructions (which have no
counterpart in high-level languages)
• SIMD and network processor architectures
• Register allocation
– Must utilize a special-purpose register architecture to avoid
having too many stores and reloads between registers
and memory
• If the back end uses only traditional code
generation techniques, the resulting code quality
may be unacceptable
Example: Code selection with MAC
instructions
[Figure with three panels]
a) Dataflow graph (DFG) representation of a simple computation
b) Conventional tree-based code selectors must decompose the DFG
into two separate trees (linked by a temporary variable) and so fail to exploit the MAC instructions
c) Covering all DFG operations with only two MAC instructions requires
the code selector to consider the entire DFG
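To make the role of the MAC instruction concrete, the sketch below selects code for a sum of products on an accumulator machine. It is a simplification: it works on an expression tree (nested tuples), not on a DFG with shared nodes, so it does not solve the problem shown in the figure. The mnemonics `CLR`, `MAC`, and `ADD` and the tuple layout are assumptions for illustration.

```python
def sum_terms(expr):
    """Flatten nested '+' nodes into a list of addends."""
    if isinstance(expr, tuple) and expr[0] == "+":
        return sum_terms(expr[1]) + sum_terms(expr[2])
    return [expr]

def select_with_mac(expr):
    """Emit accumulator code, using one MAC per product term."""
    code = [("CLR", "acc")]
    for term in sum_terms(expr):
        is_product = (
            isinstance(term, tuple)
            and term[0] == "*"
            and all(isinstance(x, str) for x in term[1:])
        )
        if is_product:
            code.append(("MAC", "acc", term[1], term[2]))   # acc += a * b
        elif isinstance(term, str):
            code.append(("ADD", "acc", term))               # acc += a
        else:
            raise NotImplementedError("nested operands are outside this sketch")
    return code

expr = ("+", ("*", "a", "b"), ("*", "c", "d"))
print(select_with_mac(expr))
```

For `a*b + c*d` the result is one clear plus two MAC instructions, where a selector limited to separate multiply and add would need two multiplies, an add, and a temporary.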
Example: Register Allocation
Source program:
A=B+C*D
B=A-C*D
Simple register allocation:
LOD R1, C
MUL R1, D
STO R1, Temp1
LOD R1, B
ADD R1, Temp1
STO R1, A
LOD R1, C
MUL R1, D
STO R1, Temp2
LOD R1, A
SUB R1, Temp2
STO R1, B
Smart register allocation:
LOD R1, C
MUL R1, D    ;C*D
LOD R2, B
ADD R2, R1   ;B+C*D
STO R2, A
SUB R2, R1   ;A-C*D
STO R2, B
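The two instruction sequences above can be checked with a small interpreter. This is a sketch under assumptions: the semantics of `LOD`, `STO`, `MUL`, `ADD`, and `SUB` are inferred from the comments in the example, and the machine has only the two registers `R1` and `R2`. It runs both sequences on the same inputs and counts memory accesses.

```python
REGISTERS = {"R1", "R2"}

def parse(text):
    """Turn lines like 'LOD R1, C' into (op, register, operand) tuples."""
    program = []
    for line in text.strip().splitlines():
        op, operands = line.split(None, 1)
        reg, src = (part.strip() for part in operands.split(","))
        program.append((op, reg, src))
    return program

def run(program, memory):
    """Execute the program; return final memory and the number of memory accesses."""
    mem, regs, accesses = dict(memory), {}, 0
    for op, reg, src in program:
        if op == "STO":
            mem[src] = regs[reg]
            accesses += 1
            continue
        if src in REGISTERS:
            value = regs[src]
        else:
            value = mem[src]
            accesses += 1
        if op == "LOD":
            regs[reg] = value
        elif op == "MUL":
            regs[reg] *= value
        elif op == "ADD":
            regs[reg] += value
        elif op == "SUB":
            regs[reg] -= value
        else:
            raise ValueError(f"unknown instruction {op}")
    return mem, accesses

SIMPLE = parse("""
LOD R1, C
MUL R1, D
STO R1, Temp1
LOD R1, B
ADD R1, Temp1
STO R1, A
LOD R1, C
MUL R1, D
STO R1, Temp2
LOD R1, A
SUB R1, Temp2
STO R1, B
""")
SMART = parse("""
LOD R1, C
MUL R1, D
LOD R2, B
ADD R2, R1
STO R2, A
SUB R2, R1
STO R2, B
""")

inputs = {"B": 2, "C": 3, "D": 4}
for name, program in (("simple", SIMPLE), ("smart", SMART)):
    mem, accesses = run(program, inputs)
    print(name, len(program), "instructions,", accesses, "memory accesses, A =", mem["A"], "B =", mem["B"])
```

Both versions compute the same `A` and `B`, but the smart allocation keeps `C*D` and the running sum in registers, so it needs 7 instructions and 5 memory accesses instead of 12 and 12.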
Embedded-code optimization
• Dedicated code optimization techniques
– Single-instruction, multiple-data instructions
• Recent multimedia processors use SIMD instructions, which
operate at the subword level (e.g., Intel MMX)
– Address generation units (AGUs)
• Allow address computation in parallel with regular
computations in the central datapath
• Good use of AGUs is mandatory for high code quality
– Code optimization for low power
• In addition to performance and code size, power efficiency
is increasingly important
• Code must respect heat-dissipation constraints and make efficient use of
battery capacity in mobile systems
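What "operating at the subword level" means can be shown in software. The sketch below adds four 8-bit lanes packed into one 32-bit word with a single wide addition, keeping carries from crossing lane boundaries. A SIMD instruction does this in hardware; the function name and the choice of wrap-around (rather than saturating) arithmetic are assumptions for illustration.

```python
def packed_add_u8x4(a, b):
    """Add four 8-bit lanes packed in 32-bit words; each lane wraps modulo 256."""
    # Add the low 7 bits of every lane; any carry stays inside its own lane.
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fix up the top bit of every lane without propagating a carry out of it.
    return (low7 ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF

print(hex(packed_add_u8x4(0x01FF7F10, 0x01010101)))
```

Because C has no notion of such packed operands, a compiler has to recognize loops over small data types and map them onto these instructions itself.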
Embedded-code optimization
• Dedicated code optimization techniques (cont’d)
– Code optimization for low power (cont’d)
• Compiler can support power savings
• Generally, the shorter the program runtime, the less energy
is consumed
• “Energy-conscious” compilers, armed with an energy model
of the target machine, give priority to the lowest energy-consuming instruction sequences
• Since a significant portion of energy is spent on memory
accesses, another option is to move frequently used blocks
of program code or data into efficient cache or on-chip
memory
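The idea of an energy model guiding instruction choice can be sketched in a few lines. Everything numeric here is hypothetical: the per-instruction energy values are made-up units for illustration, not measurements of any processor, and real models also account for operand values and inter-instruction effects.

```python
# Hypothetical per-instruction energy costs in arbitrary units (not measured data).
ENERGY_MODEL = {"MUL": 4.0, "SHL": 1.0, "ADD": 1.2, "LOD": 6.0}

def sequence_energy(sequence, energy_model):
    """Total energy of an instruction sequence under the model."""
    return sum(energy_model[op] for op in sequence)

def pick_lowest_energy(candidates, energy_model):
    """Choose the functionally equivalent sequence with the lowest modeled energy."""
    return min(candidates, key=lambda seq: sequence_energy(seq, energy_model))

# Two ways to compute x * 3: one multiply, or a shift followed by an add.
candidates = [("MUL",), ("SHL", "ADD")]
print(pick_lowest_energy(candidates, ENERGY_MODEL))
```

The high cost assigned to `LOD` in this toy model mirrors the last bullet above: memory accesses dominate, so avoiding them or moving data on-chip pays off most.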
Retargetable compiler
• To support fast compiler design for new
processors and hence support architecture
exploration, researchers have proposed
retargetable compilers
• A retargetable compiler can be modified to
generate code for different target processors with
few changes in its source code.
Example: CHESS/CHECKERS
Retargetable Tool Suites
• CHESS/CHECKERS
– is a retargetable tool-suite for flexible embedded
processors in electronic systems.
– supports both the design and the use of embedded
processors. These processors form the heart of many
advanced systems in competitive markets like telecom,
consumer or automotive electronics.
– is developed and commercialized by Target Compiler
Technologies.
http://www.retarget.com
ASIP (Application-Specific Instruction Set Processor) Design
경종민
[email protected]
Reference
• J.H. Yang et al., “MetaCore: An Application-Specific DSP Development System”, 1998 DAC
Proceedings, pp. 800-803.
• J.H. Yang et al., “MetaCore: An Application-Specific Programmable DSP Development
System”, IEEE Trans. VLSI Systems, vol. 8, April
2000, pp. 173-183.
• B.W. Kim et al., “MDSP-II: 16-bit DSP with Mobile
Communication Accelerator”, IEEE JSSC, vol. 34,
March 1999, pp. 397-404.
Part I : ASIP in general
• ASIP is a compromise between a GPP (General-Purpose Processor), which can be used anywhere
but with low performance, and a full-custom ASIC, which
fits only a specific application but with very high
performance.
• GPP, DSP, ASIP, FPGA, ASIC (sea of gates),
CBIC (standard cell-based IC), and full-custom
ASIC are listed in order of increasing performance and
decreasing adaptability.
• Recently, ASICs as well as FPGAs contain
processor cores.
Cost, Performance, Programmability,
and TTM (Time-to-Market)
• ASIP (Application-Specific Instruction set Processor)
– ASIP is a tradeoff between the advantages of a ‘general-purpose processor’ (flexibility, short development time) and
those of an ‘ASIC’ (fast execution time).
[Figure: general-purpose processor, ASIP, and ASIC compared along four axes: execution time, rigidity, cost (NRE + chip area), and development time. Annotation: depends on volume of product.]
Comparison of Typical
Development Time
Approach: chip manufacturer time / customer time
MetaCore (ASIP): 20 months (MetaCore development) / 3 months (core generation + application code development)
General-purpose processor: 20 months (core generation) / 2 months (application code development)
ASIC: 10 months
Issues in ASIP Design
• For high execution speed, flexibility and small chip area:
– An optimal selection of micro-architecture & instruction set is
required, based on diverse exploration of the design space.
• For short design turnaround time:
– An efficient means of transforming higher-level specification into
lower-level implementation is required.
• For friendly support of application program development:
– Fast development of a suite of supporting software,
including a compiler and an ISS (Instruction Set Simulator), is
necessary.
Various ASIP Development Systems
System (organization), year: instruction set customization by selection from predefined superset / by user-defined instructions / application programming level
RISC-like micro-architecture (register-based operation)
– PEAS-I (Univ. Toyohashi), 1991: Yes / No / C-language
– ASIA (USC), 1993: generates proper instruction set based on predefined datapath / C-language
DSP-oriented micro-architecture (memory-based operation)
– EPICS (Philips), 1993: Yes / No / assembly
– CD2450 (Clarkspur), 1995: Yes / No / assembly
– MetaCore (KAIST), 1997: Yes / Yes / C-language
Part II : MetaCore System
• Verification with co-generated compiler and ISS
• MetaCore system
– ASIP development environment
– Re-configurable fixed-point DSP architecture
– Retargetable system software
• C-compiler, ISS, assembler
– MDSP-II : a 16-bit DSP targeted for GSM applications.
The Goal of MetaCore System
• Supports efficient design methodology for ASIP
targeted for DSP application field.
[Figure labels: diverse design exploration; performance/cost-efficient design; short chip/core design turnaround time; automatic design generation; in-situ generation of application program development tools]
Overview: How to Obtain a DSP
Core from MetaCore System
[Flowchart, reconstructed from its labels]
Inputs:
– Instructions: primitive class (add, and, or, sub, ....) and optional class (mac, max, min, ....)
– Functional blocks: adder, multiplier, shifter, ....
– Architecture template: bus structure, data-path structure, pipeline model, ....
– Benchmark programs
Steps:
– Select architectural parameter, select instructions, select functional blocks
– Simulation
– OK? If no: modify architecture, add or delete instructions, add or delete functional blocks, then simulate again
– If yes: HDL code generation, then logic synthesis
System Library & Generator Set:
Key Components of MetaCore System
[Block diagram, reconstructed from its labels]
Design loop: processor specification and benchmark programs → compiler generator → C compiler; ISS generator → ISS; simulation → evaluation → modify specification, or accept
On accept: HDL generator → synthesizable HDL code
System library (entries can be modified or added):
– Architecture template: bus structure, pipeline model, data-path structure
– Set of instructions: instruction definition, related functional block
– Set of functional blocks: parameterized HDL code, I/O port information, gate count
Generator set: compiler generator, ISS generator, HDL generator
Processor Specification (example)
• Specification of target core
– defines instruction set & hardware configuration.
– is easy for designer to use & modify due to high-level
abstraction.
//Specification of EM1
Hardware configuration:
(hardware
  ACC 1
  AR 4
  pmem 2k, [2047: 0]
  ...
)
Instruction set definition:
(def_inst ADD
  (operand type2)
  (ACC <= ACC + S1)
  (extension sign)
  (flag cvzn)
  (exestage 1)
  ...
)
Benchmark analysis
• is necessary for deciding the instruction set.
• produces information on
– the frequency of each instruction, to obtain a cost-effective
instruction set.
– frequent sequences of contiguous instructions, which can be reduced to
application-specific instructions.
Frequent sequence of contiguous instructions:
abs a0, ar1    ; a0=|mem[ar1]|
clr a1         ; a1=0
add a1, ar2    ; a1=a1+|mem[ar2]|
cmp a1, a0
bgtz L1        ; if(a1>a0) pc=L1
clr a1         ; a1=0
add a1, a0     ; a1=a1+a0
L1:
Same code with an application-specific instruction:
abs a0, ar1
clr a1
add a1, ar2
max a1, a0     ; a1=max(a1, a0)
L1:
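Both kinds of benchmark information named above, per-instruction frequency and frequent contiguous sequences, can be gathered with simple counting. This is a minimal sketch: the opcode trace is a made-up example in the style of the listing above, and `frequent_sequences` is a hypothetical helper, not part of MetaCore. A real analysis would count over compiled benchmark programs, weighted by execution counts.

```python
from collections import Counter

def frequent_sequences(trace, length, top=3):
    """Count every run of `length` contiguous opcodes and return the most common ones."""
    windows = (tuple(trace[i:i + length]) for i in range(len(trace) - length + 1))
    return Counter(windows).most_common(top)

# Hypothetical opcode trace containing the compare-and-select idiom twice.
trace = ["abs", "clr", "add", "cmp", "bgtz", "clr", "add",
         "mul", "clr", "add", "cmp", "bgtz", "clr", "add"]

print(Counter(trace).most_common(3))      # instruction usage counts
print(frequent_sequences(trace, 4))       # candidate application-specific instructions
```

Rarely used opcodes are candidates for removal, and a sequence that recurs often, such as `cmp`, `bgtz`, `clr`, `add`, is a candidate for a single new instruction such as `max`.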
HDL Code Generator
[Block diagram, reconstructed from its labels]
Processor specification → HDL code generator → synthesizable HDL code
Generator steps:
– Macro-block generation: instantiates the parameter variables of each functional block (memory size, address space, bit-width of functional blocks)
– Control-path synthesis: generates decoder logic for each pipeline stage
– Connectivity synthesis: connects I/O and control ports of each functional block to buses and control signals
Target core blocks: program memory, data memory0, data memory1, ALU, multiplier, shifter, BMU, register file, AGU1, decoder logic, controller, peripherals (timer, SIO)
Design Example (MDSP-II)
• GSM (Global System for Mobile communication)
• Benchmark programs
– C programs (one for each algorithm making up GSM)
• Procedure of design refinement
– EM0: initial design containing all predefined instructions
– Remove infrequent instructions based on instruction usage count → EM1
– Turn frequent sequences of contiguous instructions into new instructions → EM2 (MDSP-II)
– EM2 (MDSP-II): final design containing application-specific instructions
Evolution of MDSP-II Core
from the Initial Machine
Machine: number of clock cycles (for 1 sec. voice data processing), gate count
EM0 (initial): 53.0 million, 18.1K
EM1 (intermediate): 53.1 million, 15.0K
EM2 (MDSP-II): 27.5 million, 19.3K
[Plot: number of clock cycles (10M to 50M) versus gate count (5K to 20K) for EM0, EM1, and EM2 (MDSP-II)]
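The tradeoff in these figures is easier to see as ratios against the initial machine. The short calculation below uses only the numbers from the table above; the dictionary layout and function name are illustrative.

```python
# Cycle counts (millions) and gate counts (thousands) from the table above.
machines = {
    "EM0": {"cycles_millions": 53.0, "gates_k": 18.1},
    "EM1": {"cycles_millions": 53.1, "gates_k": 15.0},
    "EM2": {"cycles_millions": 27.5, "gates_k": 19.3},
}

def relative_to_em0(name):
    """Return (cycle ratio, gate ratio) of a machine relative to EM0."""
    base, machine = machines["EM0"], machines[name]
    return (machine["cycles_millions"] / base["cycles_millions"],
            machine["gates_k"] / base["gates_k"])

for name in ("EM1", "EM2"):
    cycles, gates = relative_to_em0(name)
    print(f"{name}: {cycles:.3f}x cycles, {gates:.3f}x gates, speedup {1 / cycles:.2f}x")
```

Removing infrequent instructions (EM1) cuts the gate count by about 17% at almost no cycle cost, and adding application-specific instructions (EM2) gives roughly a 1.93x speedup for about 6.6% more gates than EM0.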
Design Turnaround Time (MDSP-II)
• Design turnaround is significantly reduced due to the
reduction of HDL design & functional simulation time.
• Only hardware blocks for application-specific
instructions, if any, need to be designed by the user.
[Figure: MetaCore design progress versus time (months, 1 to 3) up to tape-out. Stages shown: application analysis; HDL design and functional simulation; layout and timing simulation. Durations shown: 7 weeks, 1 week, 5 weeks.]
Overview of EM2 (MDSP-II)
16-bit fixed-point DSP
Optimized for GSM
0.6 µm CMOS (TLM), 9.7 mm x 9.8 mm
55 MHz @ 5.0 V
[Chip photo labels: PU (SIO, Timer), Program Memory, MCAU, DALU, PCU, AGU, Data Memory]
MCAU (Mobile Comm. Acceleration Unit): consists of functional blocks
for application-specific instructions
– 16x16 multiplier
– 32-bit adder
DALU (Data Arithmetic Logic Unit)
– 16x16 multiplier
– 32-bit adder
– 16-bit barrel shifter
– Data switch network
PCU (Program Control Unit)
AGU (Address Generation Unit): supports linear, modulo and
bit-reverse addressing modes
PU (Peripheral Unit)
– Serial I/O
– Timer
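The three addressing modes named for the AGU can be sketched as address-update functions. These follow the usual DSP meaning of the terms; how MDSP-II implements them is not described here, so the function names, parameters, and buffer layout are assumptions for illustration.

```python
def linear(addr, step):
    """Linear addressing: plain post-increment by `step`."""
    return addr + step

def modulo(addr, step, base, length):
    """Modulo addressing: wrap inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length

def bit_reverse(index, bits):
    """Bit-reverse addressing: reverse the low `bits` bits of index (FFT reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

print(modulo(103, 1, base=100, length=4))
print([bit_reverse(i, 3) for i in range(8)])
```

Modulo addressing lets filter code walk a delay line without explicit wrap-around tests, and bit-reverse addressing produces the access order needed by radix-2 FFTs, both without spending datapath instructions on address arithmetic.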
Conclusions
• MetaCore, an effective ASIP design methodology
for DSP, is proposed.
1) Benchmark-driven, high-level abstraction of the processor
specification enables performance/cost-effective design.
2) The generator set with system library enables short design
turnaround time.
SoC Conference
Oct. 23-24, Coex Conference Center
Grand Challenges and Opportunities
laid by SoC for Korea
경종민
[email protected]
Can the success story be
continued?
Can the success story be continued?
• A country whose per capita GNP was under 100 dollars in the 1960s.
• Now a country with global players in major IT fields
such as semiconductors, automobiles, and mobile phones.
• We should be proud of our success despite all of
today’s agony on the NASDAQ and the terrible political
situation. More importantly, we need to understand why
we succeeded and how this can be continued.
What is critical for
success in SoC
Business?
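Sometimes the rules change in the middle of the game.
• Had market and technology demands (the rules of the game)
changed only gradually, catching up with Japan's technology would have seemed almost
impossible. The technology gap was more than 50 years. During the Japanese colonial period,
the one thing this land did not have was an engineering college. (Three things targeted for eradication: language, names, technology)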
It’s people, people!
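• With the growth of the Internet and transportation, people meet and goods and technology
circulate more often, and TTM (time-to-market) has become the key value.
What matters is the ability to pull together diverse resources
(designers, IP, tools) in the shortest time in a dynamic environment and deliver to the
customer; however many machines and systems one has,
in such extreme conditions the final differentiation comes
from people.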
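What kind of people should we
develop or bring in?
What kind of people should we develop?
• Situation analysis: few people have the attitude and habits
needed to meet the new challenges of SoC.
– Lack of creativity and passion (creativity keeps being created, whereas hearing
is determined in the womb and vision at 3 to 6 months after birth; synapse
interconnection)
– Most people appear to be wasting their time.
– Lack of judgment and experience (make or buy, fix or rewrite
from scratch)
– Lack of exchange with system houses (a source of new product ideas)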
SoC type
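• Thinks new thoughts (creativity),
• cooperates and compromises well with others (interpersonal skills),
• and follows the rules and stays faithful to logical thinking (complete
design ability)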
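What kind of people do we develop? AND type, or OR type?
• AND or OR?
• Important work is hard. Good things are hard to get. To have good things,
you must make an effort.
• The standard for what is good can vary with the field and the level. Long dribbling
in neighborhood soccer does not work at the World Cup.
• An ordinary talent only needs to be good at one thing. A 2-talent is someone who has achieved a nuclear fusion
of at least two elements. A 3-talent has 3 to 9 times the capability.
• (creativity, persistence), (confidence, humility), (drive, cooperation), (speaking well, listening
well), (excellent employee, good father, husband)…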
Can deep-thinker cooperate?
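• Can creativity and drive coexist?
– It is hard. But if they can, the reward and explosive power
are very large.
– It is worth training for. (Training means getting someone to do what is good to do
but unpleasant to do.)
Example: Chancellor Bismarck was originally a shy man.
• Is creativity + drive + cooperation + motivation also possible? Logical
thinking + imagination?
– It would be like the explosive power of a nuclear fusion reaction.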
Can deep-thinker cooperate?
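[Figure: value contributed to the team versus time, with curves for a person who cooperates and thinks deeply, a person who thinks deeply, a person who cooperates, and a person who works hard alone]
Big harvest comes later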
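Invest in the system that develops people
[Figure: company value versus time, with curves for investing in systems, investing in people, and investing in real estate]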
What is fundamental?
• Fundamental = integral of growth power
• Growth power = creativity × absorptive power × drive
• Absorptive power = intuition + ability to cooperate
• Principle (the optimal shape of the triangle): creativity or
cooperation? You need both. A science prodigy must be
good at English.
– To sustain growth in a dynamic environment, you
need a sharp edge AND a stable enough bottom.
(The bottom is a tool to let the edge, which is the
objective, work better.)
Back to the Basic!
• There are unique roles to be played by each
of government, university, and industry.
– Government must do long-term/global planning and
evaluation/resource allocation, and maintain national
research labs to perform research in basic science,
health, environment, and defense.
– University must excel in fundamentals and future-oriented research.
– Industry: excellence in technology and sales
• Strong interaction and cooperation
between:
– Government and private sector
– Industry and academia
– System industry and IC industry
– Hardware designers and software designers
– IC industries in the pre-competitive stage
Korean Semiconductor Industry:
Challenges now faced
Challenges now faced by Korean
Semiconductor Industry
• SWOT analysis:
– Strength: zeal for learning, “venture” mind
(can-do spirit)
– Weakness: strong trend to leave
technology careers in favor of becoming a lawyer, doctor,
star…
– Threat: China’s rush, protectionism in each
region
– What opportunities?
Future depends on dealing with
OPPORTUNITIES of SoC
• System-on-Chip (SoC) is THE
driver/market for the semiconductor AND
system industry in the 21st century.
• Korea’s expertise in DRAM and the memory
business needs to be connected via SoC
to system industries like communication,
automotive, consumer, and medical/health.
Five reasons why SoC differs from ASIC
• SoC is not just a BIG ASIC (Application-Specific
IC); SoC needs “Culture Change”
– Design reuse
– Software embedded in the IC chip
– System-level design methodology
– Heterogeneous mix of technology:
• memory, processor, MEMS (Micro Electro-Mechanical
System), and others
– VDSM (Very Deep Sub-Micron) effects for design and
manufacturing