Centaur Technology Inc.


Random Stuff
Centaur Technology Inc.
Glenn Henry
Quick Background
Our Security Functions
Centaur Build Methodology
Physical Design Example
Quick Background

We’re Centaur Technology Inc. (Austin, TX)
  - We design x86 processors
  - Have been alive for 11 yrs, have shipped processors for 8.5
  - We operate independently, but are owned by VIA
  - We are a tiny group, but shipping millions of processors/yr

Our processors are software & bus compatible with Intel x86
  - But are unique vs. Intel & AMD (re design & target market):
      + lower cost (price)
      + lower power consumption
      + smaller chip footprint
      + unique integrated security features
      – generally, lower performance
  - This fits some rapidly growing “new” markets for x86

Parent company is VIA Technologies (Taiwan)
  - They manufacture, market & sell our processor designs
  - They develop all other PC platform chips (including chip sets for Intel & AMD processors), etc.
C5J (aka VIA Esther, VIA C7-M)
  - First shipped 8/2005
  - 90nm IBM SOI technology
  - 400 MHz–2.0 GHz
  - P-M bus and new VIA “V4” bus (400-800 MHz), + 2-way SMP support
  - Unique nanoBGA package
  - 64KB 4-way I-cache, 64KB 4-way D-cache
  - 128KB, 32-way exclusive L2
  - P4 instructions (incl SSE2 & SSE3)
  - P-M power mngt features+
  - Exclusive security features
  - 31.2 mm2, 26.2 M transistors
  - Lowest Power/MHz: 3.5W @ 1 GHz TDP, 20W @ 2 GHz TDP
  - Die size comparison (our die cost): 90nm VIA C7-M 31 mm2 vs. 90nm Intel Pentium M (Dothan) 84 mm2
C5J Die
(die floorplan figure, 6.9 mm dimension marked; labeled regions)
  - Bus & APIC
  - 64 KB 4-way L1-D, DCU
  - 128 KB 32-way L2
  - 64 KB 4-way L1-I
  - Fetch, Decode & Translate; I-unit
  - x87 FP; SSE 1,2 & 3, MMX
  - Br pred
  - ROM, fuses
  - Security
  - PLLs etc
Our Security Strategy

Provide comprehensive set of data security functions
  - …That are very secure
  - …That are world’s fastest (for a single chip)

These goals require that the functions
  - …Be integrated tightly into the processor core
      Processor silicon & implementation is fastest hdw
      Only hdw can be “trusted” (no viruses, etc.)
  - …Require no operating system support/involvement
      available via non-privileged x86 instructions
      hardware must manage multi-tasking considerations

Available in all of our processors, for free
  - We believe data security should be built into all processors
  - It’s easy to do & small (effectively free)
  - It’s our hobby
Our Security Implementation
(feature table by processor generation; columns: C5XL (shipped 1/2003), C5P (shipped 1/2004), C5J (shipped 8/2005), CN (future))

Hardware RNG
  - Hardware RNG unit, then 2 units
  - Can also feed entropy to hardware SHA to get faster/better, high-quality output using built-in hdw hash functions
Encryption
  - Full AES (FIPS-197) standard in hdw; fastest in world!
  - ECB, CBC, CFB, OFB modes in hdw
Secure Hash
  - Full SHA-1 & -256 (FIPS-180-1) standard in hdw; fastest in world!
RSA Hdw Assist (Montgomery multiply)

Later additions marked “+” on the slide: CBC/CFB-MAC modes, CTR mode, unaligned support, faster
Entries shown as “xxx” on the slide are left undisclosed
Centaur Hardware RNG
(block diagram; recoverable labels)
  - 2 duplicate RNGs (A and B) in different physical areas (& rotated), with adj DC bias
  - asynch and clocked sections
  - whitener on each RNG
  - 1-of-n bit selector
  - 1 byte per delivery, from A, B, or both
  - 32 byte hardware collection buffer
  - up to 8-byte delivery per store request
  - x86 “store-rand” instruction: data on SSE store bus, status in EAX
RNG “Typical” Performance

“Randomness” too hard to describe here, but here’s some basics…

Key requirements for “truly random” (per Schneier)
  - Unbiased statistical distribution: determined by statistics
  - Unpredictability: determined by modeling
  - Unreproducibility: only hardware need apply

Many statistical tests defined & used (& argued about)
  - Collections of many different statistical analyses
      FIPS-140-2: useless (4 tests, broken, 20,000 bit sample!)
      Diehard (18 tests): oriented to software RNGs, 10Mb sample
      NIST (16 tests): we think the best (much overlap with Diehard)
      Ent, etc.: everyone has one, everyone has their favorite
  - Individual tests
      entropy: important & widely reported, but it’s not randomness
      chi2: heavily used, especially for huge samples, our favorite
      Maurer, etc.: everyone has their favorite
  - Many different evaluation approaches
      threshold value, fixed ranges, probability analysis (p-value)

Much analysis & interpretation needed to make sense here
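
To illustrate the point that entropy "is not randomness", here is a minimal Python sketch of the two individual tests named above, computed over a byte histogram. The function names and the sample data are illustrative assumptions, not part of Centaur's testbed.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (8.0 is the maximum)."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def chi_square(data: bytes) -> float:
    """Chi-square statistic of the byte histogram against a uniform distribution."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))

# A counter that cycles 0..255 is completely predictable...
predictable = bytes(range(256)) * 64

# ...yet its byte histogram is perfectly flat, so entropy is maximal
# and the chi-square statistic is zero.
print(byte_entropy(predictable))
print(chi_square(predictable))
```

Both statistics only look at how often each byte value occurs, so neither says anything about unpredictability. A chi-square of exactly zero is itself a warning sign: genuinely random bytes scatter around the expected counts, giving a statistic near the 255 degrees of freedom.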
RNG “Typical” Performance

Performance & randomness varies by part; these are “typical”

We have done extensive analysis
  - Many terabytes of data
  - Massive sample sizes (terabyte)
  - Hundreds of chips
  - Our own testbed software
  - Analysis & report by external group
      www.cryptography.com/research/evaluations.html
  - Here’s an embarrassingly simple summary…

Setting             Speed (Mb/s)   Entropy (byte)   1 MB sample random? (1)   Max sample size for random (2)
white8              1.7            7.9999+          Y                         50 MB-10 GB
white4              3.4            7.999            Y–N                       0-10 MB
raw                 28–240         7.5-7.95         N                         –
hashed raw (AES) (3) 150–1,000     7.9999+          Y                         1 TB up

1. Passes standard test collections: FIPS, NIST, Diehard
2. “Good” chi2 results
3. Many variations: SHA, random seed size, etc.
Centaur AES Encryption Features

Full FIPS-197 implemented in hardware
  - Encrypt & decrypt
  - 128b, 192b, & 256b keys
  - 128b data blocks

Multiple operating modes in hardware
  - ECB, CBC, CFB, OFB
  - CBC/CFB-MAC & CTR modes

Optional extended key generation in hardware
  - For 128b key (both E & D) only

Various “experimentation” options supported
  - Round count 1-16, intermediate round results, etc.

Accessed via new application-level x86 instructions
  - No OS support needed
  - Hardware provides inherent multitasking

US export licenses in place
Centaur AES Hardware
(block diagram; recoverable labels)
  - SSE load bus (16-byte blocks) feeds key, ctrl, input 0 and input 1 (can pipeline 2 blks in ECB)
  - Extended Key RAM (16x16B), round key, shared round key generation logic
  - block startup + CBC, CFB, OFB, etc.
  - round: S-box, row-shift, column mix, key add, with round fwd
  - block finish + CBC, CFB, OFB, etc., with blk-blk fwd
  - out 0 and out 1 to SSE store bus
  - Everything runs at processor clock speed
  - 0.3 mm2 total!
Centaur AES Performance

AES instruction performance (approx.)
  - 128-bit key & block size: usual instruction timing assumptions
    = data in cache, no interrupts, aligned, key done, etc.
  - Approximate clocks w/ 128b extended keys already loaded
      ECB, 1 block:                  17 clocks
      ECB, large block count:        11.8/blk
      CBC/CFB/etc, 1 block:          37
      CBC/etc, large block count:    22.5/blk
  - Additional extended key generation/load time (128b key)
      Hardware generated:            38
      Loaded from memory:            53
AES Performance

Measured Performance
  - P4 = Gladman library AES, C5J = replaced routine with AES inst
  - ECB mode (other modes slower, but same advantage over P4)
  - Same memory size (512MB), same bus speeds (533 MHz)

data size    2.53-GHz P4    2.0-GHz C5J
8 KB         0.56 Gb/s      21.5 Gb/s
64 KB        0.56           19.5
1 MB         0.56           5.45   (bus limited)
10 MB        0.56           5.23   (bus limited)

  - Another example: Gladman reports (his site) using his library (ECB)

data size    1.2-GHz C5P (earlier part)
16 Kb        15.2 Gb/s
C5J Montgomery Multiplier Features

Goal: Speed up RSA’s modular exponentiation
  - c = m^e mod n is dominated by repeated d = m x y mod(n) ops
  - where m, y, n are on the order of a thousand bits long!

This multiply is “always” done using the “Montgomery Multiply” algorithm
  - Uses special number space to make d’ = a’ x b’ mod(n) much faster by eliminating divide
  - But initial & result values must be transformed to/from Montgomery number space
  - In real usage, the transformation overhead is relatively small

Our hardware directly performs “Montgomery Multiply”
  - About as fast as an ordinary multiply!
  - For up to 32Kb numbers!

New application-level x86 MontMul instruction
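
As a software illustration of the idea (not the hardware's word-serial datapath or its microcode), here is a minimal Python sketch of textbook Montgomery multiplication and a square-and-multiply exponentiation built on it. All function names are hypothetical, and the choice of R as a power of two rounded up to 32-bit words is an assumption made for the example.

```python
def montgomery_setup(n: int, word_bits: int = 32):
    """Pick R = 2^k > n (k a multiple of word_bits) and n' = -n^-1 mod R."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    k = ((n.bit_length() + word_bits - 1) // word_bits) * word_bits
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # modular inverse via pow needs Python 3.8+
    return k, r, n_prime

def mont_mul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Return a*b*R^-1 mod n using only multiplies, masks and shifts (no divide)."""
    mask = (1 << k) - 1
    t = a * b
    u = ((t & mask) * n_prime) & mask   # makes t + u*n divisible by R
    t = (t + u * n) >> k                # exact division by R is just a shift
    return t - n if t >= n else t

def mont_pow(m: int, e: int, n: int) -> int:
    """Compute m^e mod n with left-to-right square-and-multiply."""
    k, r, n_prime = montgomery_setup(n)
    m_bar = (m * r) % n                 # transform into Montgomery space (one divide)
    acc = r % n                         # the value 1 in Montgomery space
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, m_bar, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)  # transform back

print(mont_pow(4, 13, 497) == pow(4, 13, 497))
```

The only true divisions are the two transformations at the start of `mont_pow`; every multiply inside the loop replaces the modular reduction with a mask and a shift, which is why the transformation overhead is small for long exponents.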
Centaur Montgomery Multiplier
(block diagram; recoverable labels)
  - SSE load bus and SSE store bus, 16-byte blocks
  - Usable with any size data (256 to 32Kb, 128b steps)
  - 32b temp regs: A[j], T[j], M[j], U, B[i]
  - Two 32 x 32 multipliers (hack of existing multipliers), 64b products
  - Adders producing Hi 33b, Bits 64:32 and Bits 31:0; T[j-1] on the store side
  - Ucode sequences loads & stores
  - 32b x 32b mod(32b) = 4 clks (2 clk pipelined)
Centaur MontMul Performance

Compared to GMP library
  - Perform c = m^e mod n (m, e, n chosen randomly)
  - An example (speeds vary slightly based on values)
  - Note: this is most of RSA time, but not the whole thing
  - Same hardware as for AES chart

mod size (bits)    2.53-GHz P4    2.0-GHz C5J
512                340 exp/s      1800 exp/s
1024               50             243
1536               15.6           78
2048               7.1            35
Centaur SHA Features

FIPS-180-1 completely implemented in hardware
  - SHA-1 (160-bit result)
  - SHA-256 (256-bit result)

Instruction timing
  - SHA-1: 251 clks
  - SHA-256: 262
  - where n is the number of 64B blocks to be compressed

Measured performance (Gb/s)
  - Same hardware as for AES chart, GPL SHA SW (Devine)

               2.53-GHz P4           2.0-GHz C5J
data size      SHA-1    SHA-256      SHA-1    SHA-256
10 B           0.07     0.04         0.38     0.35
100 B          0.43     0.24         2.41     2.24
1,000 B        0.59     0.33         3.81     3.60
1,000,000 B    0.62     0.34         2.97     2.97    (bus limited)
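
The formula relating the clock counts to n did not survive on the slide. As a rough cross-check, the Python sketch below assumes the cost is about n times the per-block figure (251 or 262 clocks), counts blocks after standard SHA padding, and assumes a 2.0 GHz clock. These are assumptions for illustration, and the result is an upper bound that ignores call overhead and bus limits.

```python
CLOCKS_PER_BLOCK = {"SHA-1": 251, "SHA-256": 262}  # figures from the slide

def padded_blocks(message_bytes: int) -> int:
    """64-byte blocks after standard SHA padding (0x80 byte plus 8-byte length)."""
    return (message_bytes + 9 + 63) // 64

def est_gbps(algo: str, message_bytes: int, clock_ghz: float = 2.0) -> float:
    """Estimated throughput in Gb/s, assuming clocks = n * per-block clocks."""
    clocks = padded_blocks(message_bytes) * CLOCKS_PER_BLOCK[algo]
    return message_bytes * 8 / clocks * clock_ghz

for size in (10, 100, 1000):
    print(size, round(est_gbps("SHA-1", size), 2), round(est_gbps("SHA-256", size), 2))
```

For a 1,000-byte message this estimate gives about 3.98 Gb/s for SHA-1, close to the 3.81 Gb/s measured; the gap is wider for the 10 B and 100 B cases, where fixed per-call overhead would dominate.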
C5J SHA Hardware
(block diagram; recoverable labels)
  - SSE load bus: initial digest (160b), next 64b data
  - accumulating digest regs
  - data scheduler (16 x 32b regs)
  - Function generators, 5-way add
  - SHA-1: 2 clks/32b rnd (5)
  - SHA-256: 3 clks/round, final sha-256 add
  - SSE store bus
Build Process

The Centaur Process
(flow diagram; recoverable labels)
  - Design Process: arch, logic, circuit, layout, technology
  - The Processor Source: verilog, schematics, layout, global wires, microcode, models; bugs
  - Physical Build & Tapeout Process: Mask Data, Foundry (“release to manufacturing”), Silicon, tech data & support
  - Timing Process (full chip): Power? Bugs? MHz?; timing data; feedback to process
  - Design Verification (full chip): Compatible? CPI? Power? MHz?; feedback to process
  - System Verification: fails
  - Manufacturing Engineering: mfg requirements; test vectors, test programs, test fixtures, packages, FA, silicon debug, qual, MHz calibration, etc.
  - VIA Manufacturing, Marketing & Sales: requirements; hdw & sw compatibility, benchmarks, power data, supported hdw, etc.
Centaur Build Methodology

Our challenges!
  - Complex logic with lots of architectural interconnections
  - 2-GHz & aggressive power/size objectives
  - Relatively few designers (30 logic & circuit)
  - Strong schedule pressure (must do it fast)
  - Industry tools not sufficient (oriented to APR methodology)

Our Basic Approach
  - Hundreds of top-level stand-alone “blocks”
      Allows parallel development of “one-person” blocks
      Facilitates fast “build” time (chip assembly, timing, etc.)
      Facilitates use of optimum process for particular logic
  - Hook blocks together with top-level routing, clocks, etc.
      Significant “content” added in top-level build
  - Full-chip timing with fast iterations
  - Fast full-chip build iterations
  - Develop our own tools & methodology to accomplish above
Centaur Chip Physical Build Process
(flow diagram; recoverable labels)

Sources
  - processor.v: defines the top-level blocks & the connecting global wiring (global wiring definitions)
  - APR blocks: Verilog sources for each physical APR (control logic)
  - datapath stacks: Verilog source (special format) for each physical datapath stack; Verilog for control logic to be placed in buffer section
  - circuit elements: schematics for each physical custom element/block; plus timing models, size models, RTL behaviorals, etc.
  - processor.mc: on-chip microcode

Flows
  - verilog-to-layout APR flow (standard cell library)
  - verilog-to-layout stack flow (stack element library)
  - custom layout flow: custom blocks
  - microcode flow: ROMs

Full-chip integration & build process
  - routing, RC repeaters, clock tree, power/grd, via add, cap fill
C5J Die
(annotated die plot, I/O drivers around the edges; recoverable labels)
  - 62 Full Custom Blocks (299 instances): 3.12 mm2, 4.82 mm2 (I/O), 1.32 M xistors
  - 20 APR Blocks: 2.32 mm2, 5.89 m routing, 1.18 M xistors
  - 63 Datapath Stacks: 6.62 mm2, 12.38 m routing, 3.38 M xistors
  - 10 Bit-Cell Arrays (18 instances): 8.02 mm2, 20.02 M xistors
  - 2 ROMs: 0.39 mm2, 0.48 M xistors
  - Global Wiring interconnecting all top-level blocks: 21,512 nets, 22.73 m routing, perf optimized routing widths & spacing
  - Global RC Repeaters: automatic insertion tool, 3,500 x 7 bfrs inserted
  - Clock Distribution Network: hand-drawn, 49 top-level elements/395 nets
  - Power/ground Grids: both hand-drawn & automatic
  - On-die Decoupling Caps: automatic insertion tool
Underlying Source Statistics

Verilog lines as written (small)
(no behaviorals, no comments, no clocks, no “top” chip)
  - APR logic:     112K lines, 129K cells
  - Stack logic:   41K lines, 172K cells
  - Note: this is “single instance” as written; much of this gets instantiated multiple times

Schematic “pages” as written (large)
  Primitive (inv, nand2, nor2, etc.)     110
  Standard cells                         712
  Datapath elements                     1308
  Full customs                          1332
  Total                                 3462

Circuit library size
                           avail    used
  Clock regens               445     277
  Std cell                   547     435
  G datapath elements        493     271
  W datapath elements        248     147
  Total                     1733    1130
C5J Security Components (metal 1-4 only)
(layout plot; recoverable labels)
  - stk (datapath stacks), custom, APR (control for all stacks)
  - 32b data “bfr” section
  - clock repeaters, global clk meanders
  - 7 RC bfrs
  - decoupling caps
  - Note: global interconnects not shown
C5J Security Components (metal 1-4 only)
(layout plot with functional regions labeled)
  - SHA sch & ALU
  - key RAM
  - common control logic
  - RNG buffers
  - 128b-wide AES engine
“Fast Build & Timing”


Every 1-5 days: full-chip “Release”
  - APRs synthesized, placed & RCs estimated
  - Stacks “cracked”, placed & RCs estimated
  - Full-chip timing done with estimated RCs
  - Takes < 1 day for full-chip timing report

Every 5-10 days: full-chip physical build
  - APRs routed
  - Stacks routed
  - Global chip routed
  - Global chip layout produced
  - APRs, stack & global route RC extraction
  - RCs feed back to calibrate estimated RCs
  - This goes on continuously, picking up new Releases as needed

Our experience at other companies: much slower
Basic “Release” Process
(flow diagram; recoverable labels)
  - Inputs: processor.v verilog source modules; phy lib, pwr lib, timing lib, tech lib; floorplan, clock tree, wire control, etc.; I/O constraints; tech file; element timing models
  - Netlist (nl) steps: merge, flatten, substitute, expand, rename, split
  - APR blks: synthesis & place (shape from floorplan, synthesis ctl), RC estimator, APR build
  - DP stacks: Stack place, dp RC estimator, Stack build
  - Global: global nl (gen auto or by hand), gen RC, build ctl nl, global build
  - Full-Chip Timing: 1-5 day cycle
  - APR build, Stack build & global build: 5-10 day cycle
RTL Design Rules
APR Blocks

Element instantiation OK
  - Registers (req’d: synthesis can’t infer them correctly)
  - Clock buffers & distribution (req’d: synthesis clocks are slow!!)
  - Occasional logic (this has diminished over time)
  - The instantiated elements are really macros
      Auto expanded to right size, number bits, etc. in the flow

Wires & continuous assignment OK
  - Including operators like ?, +, < etc.

Nothing else! (no procedural stuff)
  - No if/else, no case, no loops, no “always”, no “at”, etc.
  - No timing information/control
  - Synthesis generates bad logic for these
      Unexpected/superfluous elements, registers where not expected, timing doesn’t work, etc.

Stacks

Component instantiation & wires only!
APR RTL Example
As Written
assign idleNS = (T[0] | T[8])                | shaDone_P;
assign funcNS = (T[1] | T[3] | T[6] | T[10]) & ~shaDone_P;
assign add1NS = (T[2])                       & ~shaDone_P;
assign add2NS = (T[5])                       & ~shaDone_P;
assign faddNS = (T[4] | T[7] | T[9])         & ~shaDone_P;
rregs #(5) state (.q   ({idleState, funcState, add1State, add2State, faddState}),
                  .d   ({idleNS, funcNS, add1NS, add2NS, faddNS}),
                  .clk (ph1c));
-----------------
sha2cnst sha2cnst (.in   (iteration[5:0]),
                   .ksel (shKSel),
                   .algo (sha1_P),
                   .out  (KsubI));
-----------------
wire [6:0] nextIteration;
assign nextIteration = (shaDone_P | idleState) ? 7'b0000000 :
                       shIterationStall        ? iteration :
                                                 iteration + 1;
Stack RTL Example
Datapath Section
/*------------------- KeyGen XOR --------------------------*/
wire [31:0] aesKeyGenXorOut2_L;
zdxor #(32,15) keyg1 (.out (aesKeyGenXorOut2_L  ),
                      .in0 (aesWord2I_LB        ),
                      .in1 (aesKeyGenXorOut1_LB ));
zinv #(32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L);

wire [31:0] aesKeyGenXorOut2_MB;
wire [31:0] aesKeyGenXorOut2_M;
zregi_en #(32,10) keyg2 (.q   (aesKeyGenXorOut2_MB ),
                         .d   (aesKeyGenXorOut2_L  ),
                         .clk (EPH1                ),
                         .en  (aesDynEn_K          ));
zinv #(32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB);

Buffer Section
rregsi #(2,20) bf_kk (.qb  (aesKeyMuxSel_M  ),
                      .d   (aesKeyMuxSel_LB ),
                      .clk (evph1           ));
Stack Placement Tool Output
(32-bit AES stack)
Buffer section added
Inter-element routing (m2-6)
Global wires added
Sample Timing Report “Path”
time     path               element                   delta    load cap  wire     rise/fall
0.875ns  eeph1aesdp2 ^      aesdp2/eph1buf_aesdp2/    0.050ns  0.2423pF  0.000ns  0.000ns
0.925ns  aesdp2/eph1 ^      aesdp2/sc_c0ph1_48/       0.160ns  0.0321pF  0.000ns  0.000ns
1.085ns aesdp2/keyg2_ph1 ^ aesdp2/gxregi_x4_10…………………… 0.063ns 0.0035pF 0.000ns 0.004ns
1.148ns aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v
0.000ns 0.0035pF 0.000ns 0.004ns
1.148ns aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v
aesdp2/ginv_10………………………………… 0.026ns 0.0209pF 0.000ns 0.044ns
1.173ns aesdp2/aesdp2_dp_aeskeygenxorout2_m10 ^
0.000ns 0.0209pF 0.000ns 0.045ns
1.174ns aesdp2/aesdp2_dp_invk_stack_bit10_i0 ^
aesdp2/gemux3i_19………………………… 0.045ns 0.0336pF 0.000ns 0.031ns
1.219ns aesdp2/aesdp2_dp_key_mb10 v
0.000ns 0.0336pF 0.000ns 0.031ns
1.219ns aesdp2/aesdp2_dp_kml_stack_bit10_i0 v
aesdp2/ginv_31………………………………… 0.017ns 0.0188pF 0.000ns 0.013ns
1.236ns aesdp2/aesdp2_dp_key_m10 ^
0.001ns 0.0188pF 0.001ns 0.014ns
1.236ns aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0 ^
aesdp2/gxor8_10……………………………… 0.095ns 0.0170pF 0.000ns 0.029ns
1.331ns aesdp2/aesdp2_dp_decout_m10 v
0.000ns 0.0170pF 0.000ns 0.030ns
1.332ns aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v
aesdp2/gmux3i_10………………………… 0.030ns 0.0089pF 0.000ns 0.017ns
1.362ns aesdp2/aesdp2_dp_mcout_mb10 ^
0.000ns 0.0089pF 0.000ns 0.017ns
1.362ns aesdp2/aesdp2_dp_invm_stack_bit10_i0 ^
aesdp2/ginv_31……………………………… 0.030ns 0.1101pF 0.000ns 0.053ns
1.391ns aesdp2/aesdp2_dp_mcout_m10 v
0.012ns 0.1101pF 0.012ns 0.078ns
1.403ns aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v
aesdp2/gmux2i_16…………………………… 0.048ns 0.0249pF 0.000ns 0.030ns
1.451ns aesdp2/aesdp2_dp_aesword2i_kb10 ^
0.001ns 0.0249pF 0.001ns 0.032ns
1.452ns aesdp2/aesdp2_dp_byte1_indx_pb2 ^
Local reg clock-to-next reg input = 1.452-1.085 = 367ps
Random Circuit Topics

Clocking is very difficult & very critical
  - Very aggressive skew goals
      “0” ps clock skew across all top-level blocks
      <20 ps skew worst case within a block
      These are met in our designs ignoring on-chip silicon variations
  - Multiple clock domains required (for bus & various power states)
  - Many “early”, “late”, etc. versions of the clocks needed
  - Clocks must be gated (for power management)

Our clocking methodology is proprietary, but…
  - Hand-routed global clock tree (continually changing)
  - Our own tools to generate clock shields tuned to surroundings
  - Tunable “repeaters” (via fuse & via metal)
  - Hand instantiated clock elements within blocks
  - Many selectable clocks (xx ps for each reg)
  - Auto-generated clock grids within APRs & stacks
  - Fuse adjustable PLL characteristics (duty cycle, etc.)

Power/ground distribution critical
  - Extensive analysis & “management” required
Random Circuit Topics (cont)

Robust circuit design req’d across 12 “corner” models
  - 54 formal corners identified, we choose the most critical “12”
  - Covers variations in: Temp, V, N xistor, P xistor
  - Automated element simulation done across these models
  - Full-chip timing is done using 2 of these corners (hi V, lo V)

Extensive use of dynamic logic
  - Precharge in phase 1, evaluate in phase 2
  - Registers, adders, comparators, arrays, etc.
  - Customs, stacks (& APRs)

Two stack-element libraries
  - With different bit pitches

Element libraries have several versions of same function
  - Usually, at least “Fast/big/hot” & “slow/small/cool”
  - Example: C5J has 2 different “vanilla” 32-bit adders
      Fast (dynamic): 180 ps, 37.9 high
      Slow (static):  250 ps, 16.9 high
      Note: 25 total adders in library, instantiated 65 total times
Random Circuit Topics (cont)

Several families of registers available
  - Differ in function, speed, size & performance
  - Std cell, datapath & custom versions
  - Each comes in many drive strengths (sizes)
  - Many have built-in functions
      muxes, and/or logic, xors, compares, etc.
      These provide speed/size/power improvements vs. separate elements
  - Examples using C5J stack elements (figure; recoverable labels)
      normal reg (x-reg 10) vs. fast reg (k-reg 10), 26b: 82 ps vs. 32 ps (data-to-out)
      26b compare to 1b with separate elements (k-reg 10, static cmp-eq 20, inv 54): 90 + 32 + 17 = 139 ps
      26b compare to 1b with k-reg + dynamic cmp-eq 60: 88 ps
      size labels on the elements: 3.8, 5.0, 4.6, 3.8, 1.4, 9.5
C5J Security Component Sizes (mm2)
(annotated layout plot with a sample scale marked 227; block areas in mm2)
  0.014, 0.014, 0.080, 0.080, 0.091, 0.069, 0.046, 0.021, 0.080, 0.034
Total = 0.529 mm2 + 0.014 for 2 RNG’s (elsewhere)
= 0.54 (a few cents, but for this chip it’s really free)
C5J Security Component Sizes
(If we had only known about all
this space when we started…)
(same annotated layout plot; block areas in mm2 as on the previous slide)
Note: We had so much spare room on die that
we didn’t spend any effort making this smaller.
We estimate at least 30% smaller if we tried hard!
Startup, CBC, etc. muxes & registers
---register----------------------------------
S-box ROM
(2 x 256 x 8 bit) x 4 bytes
200 ps access (dynamic)
Row-shift muxes
(wires to other 32b stacks not visible)
---register----------------------------------
Column multiply (& key xor)
made out of 2-,3-,4-,5-,6-,
7- & 8-input xors
---register----------------------------------
Startup, CBC, etc. muxes & registers
(extra stuff at bottom for key generation)