Why Chips Fail

Why do so many chips fail?
Ira Chayut,
Verification Architect
(opinions are my own and do not necessarily represent the
opinion of my employer)
Failure rate of first silicon is rising
“… research by Collett International revealed that 52%
of complex application specific integrated circuits
(ASICs) required a respin and the reason was largely
due to functional errors.”
(http://www.techonline.com/community/ed_resource/feature_article/36655)
Who is to blame? (There must be someone to blame!)
Management – they didn’t provide enough resources
HW Engineering – they created the functional errors
Verification – they didn’t catch the functional errors
Architecture – they didn’t focus on testability
Marketing – they kept changing the specs
People don’t kill chips, complexity kills chips
http://www.cs.utexas.edu/users/dburger/teaching/cs395ts99/papers/2_src.pdf (1999) — Projected numbers are a bit
lower than current reality – a dual core AMD Opteron has 233 million
transistors and the Intel Itanium 2 has 592 million transistors
Complexity increases exponentially
[Figure: Transistors per chip, in millions of transistors (axis 0 to 1600), plotted by year from 1995 to 2015]
• Chip component count
increases exponentially over
time (Moore’s law)
• Interactions increase
super-exponentially
• IP reuse and parallel
design teams facilitate more
functions with fewer HW
engineers per function and
more functions per chip
• Verification effort gets
combinatorially more difficult
as functions are added
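A rough way to see how quickly interactions outgrow the function count is to count them under a simple model. The sketch below assumes that any group of two or more functions on a chip may interact; that model and the function names are illustrative assumptions, not figures from the slides.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n functions: n * (n - 1) / 2."""
    return comb(n, 2)


def multiway_interactions(n: int) -> int:
    """Number of groups of two or more functions: 2**n - n - 1.

    Assumes every such group is a potential interaction (illustrative model).
    """
    return 2**n - n - 1


if __name__ == "__main__":
    # Doubling the function count far more than doubles the interactions.
    for n in (4, 8, 16, 32):
        print(n, pairwise_interactions(n), multiway_interactions(n))
```

Even the pairwise count grows quadratically, while the multiway count roughly doubles with each added function, which is why adding staff linearly cannot keep pace.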
Why verification is not able to keep up
Verification effort gets combinatorially more difficult as
functions are added
BUT
Verification staffing/time cannot be made
combinatorially larger to compensate
AND
Chip lifetimes are too short to allow for complete
testing
THUS
Chips will continue to have ever-increasing functional
errors as chips get more complex
Limiting the number of architectural and
functional errors
Thorough unit-level verification testing
Small simulations run faster
Avoids combinatorial explosion of interactions
Well defined interfaces between blocks with assertions
and formal verification techniques to reduce inter-block
problems
Emulation or FPGA prototyping to accelerate testing
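As an illustration of an interface assertion between blocks, the sketch below checks a recorded signal trace against one rule of a hypothetical valid/ready handshake: once a sender raises valid, it must hold valid high and keep data stable until the receiver raises ready. The interface, the rule, and all names are assumptions for illustration; in practice such checks are usually written as assertions in the hardware description or verification language rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cycle:
    """Signal values sampled on one clock cycle of the hypothetical interface."""
    valid: bool
    ready: bool
    data: int


def check_handshake(trace):
    """Return a list of (cycle_index, message) rule violations.

    Rule: if the previous cycle had valid high and ready low (a stalled
    transfer), the current cycle must keep valid high and data unchanged.
    """
    violations = []
    prev = None
    for i, cur in enumerate(trace):
        if prev is not None and prev.valid and not prev.ready:
            if not cur.valid:
                violations.append((i, "valid dropped before transfer was accepted"))
            elif cur.data != prev.data:
                violations.append((i, "data changed while transfer was stalled"))
        prev = cur
    return violations
```

A check like this catches a misbehaving block at its own boundary, so the failure is found in a small unit-level simulation rather than deep inside a full-chip run.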
How to live with functional errors
Successful companies have learned how to ship chips
with functional and architectural errors; time-to-market
pressures and chip complexity force the delivery of
chips that are not perfect (even if perfection were possible).
How can this be done better?
For a long while, DRAMs have been made with extra
components to allow a less-than-perfect chip to provide
full device function and to ship
How to do the same with architectural features? How
can full device function exist in the presence of
architectural or implementation omissions or errors?
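The DRAM approach mentioned above can be sketched as a remapping table: rows found defective at test time are redirected to spare rows, so the user still sees a full-size, fully working device. The class below is a simplified software model under that assumption; the names and the layout (spares placed after the main array) are hypothetical and do not describe any specific memory part.

```python
class RepairableMemory:
    """Toy model of a memory array with spare rows for repair."""

    def __init__(self, rows, spare_rows, bad_rows):
        if len(bad_rows) > spare_rows:
            raise ValueError("more defective rows than spares; chip cannot be repaired")
        self.rows = rows
        # Redirect each defective row to a spare physical row after the main array.
        self.remap = {bad: rows + i for i, bad in enumerate(sorted(bad_rows))}
        self.cells = [0] * (rows + spare_rows)

    def _physical(self, row):
        """Translate a user-visible row number to a physical row."""
        if not 0 <= row < self.rows:
            raise IndexError("row out of range")
        return self.remap.get(row, row)

    def write(self, row, value):
        self.cells[self._physical(row)] = value

    def read(self, row):
        return self.cells[self._physical(row)]
```

For example, `RepairableMemory(8, 2, {3, 5})` still presents eight usable rows even though two physical rows are defective. The open question on this slide is what the equivalent of a spare row would be for an architectural feature.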
Architecture support
Embrace Perl’s motto: “There's More Than One Way to
Do It” — allow for multiple ways of accomplishing all
critical specified functions
Analogous to Design for Test (DFT) and Design for
Verification (DFV), we should start thinking about
Architect for Verification (AFV)
[Thanks to Dave Whipp for the AFV phrase and acronym]
In some problem domains, such as networking, upper-layer
protocols can recover from some silicon errors;
though there is a performance penalty when this is
used
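The networking case can be sketched as a checksum plus retransmission: if a lower layer (here standing in for faulty silicon) corrupts a frame, the upper layer detects the mismatch and sends again, paying for correctness with extra transmissions. `FlakyLink` and `send_reliably` are hypothetical names for illustration, and the fault model (corrupt the first few frames) is an assumption chosen to keep the example deterministic.

```python
import zlib


class FlakyLink:
    """Hypothetical link that corrupts the first `faults` frames it carries."""

    def __init__(self, faults):
        self.faults = faults
        self.sent = 0  # total frames transmitted, including retries

    def transmit(self, frame: bytes) -> bytes:
        self.sent += 1
        if self.faults > 0:
            self.faults -= 1
            # Flip one bit in the first byte to simulate a hardware error.
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame


def send_reliably(link, payload: bytes, max_attempts: int = 5) -> bytes:
    """Send payload with a CRC-32 trailer, retrying until the CRC matches."""
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    for _ in range(max_attempts):
        received = link.transmit(frame)
        body, crc = received[:-4], received[-4:]
        if zlib.crc32(body).to_bytes(4, "big") == crc:
            return body
    raise RuntimeError("gave up after repeated corrupted frames")
```

With `FlakyLink(2)`, the payload arrives intact but the link carries three frames instead of one, which is the performance penalty the slide refers to.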
Architecture support, continued
A programmable abstraction layer between the real
hardware and user’s API can hide functional warts —
hardware catches specific operations and either directs
them to one of multiple hardware implementations, or
signals a software trap
Pyramid minicomputers hid the assembly language from
users, so the compiler could work around problems
Transmeta maps standard machine language to a hidden
processor architecture, so translation software can work
around problems
Soft hardware can allow chip redesign after silicon is
frozen (and shipped!)
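The programmable abstraction layer described on this slide can be modeled as a routing table that is still editable after the hardware is fixed. In the sketch below, each operation has an ordered list of routes; a patch added later takes priority and can steer the failing cases to another implementation or to a software handler. The erratum (a multiplier that fails for operands above 255) and every name here are invented for illustration only.

```python
class AbstractionLayer:
    """Toy dispatch layer between a user-facing API and implementations."""

    def __init__(self):
        self.routes = {}  # operation name -> ordered list of (predicate, handler)

    def route(self, op, handler, when=lambda *args: True):
        """Register a handler at the lowest priority for this operation."""
        self.routes.setdefault(op, []).append((when, handler))

    def patch(self, op, handler, when):
        """Register a workaround that is checked before existing routes."""
        self.routes.setdefault(op, []).insert(0, (when, handler))

    def call(self, op, *args):
        for when, handler in self.routes.get(op, []):
            if when(*args):
                return handler(*args)
        raise LookupError(f"no implementation for operation {op!r}")


def fast_mul(a, b):
    """Stand-in for a hardware multiplier with a hypothetical erratum."""
    return a * b if a <= 255 else 0  # wrong result when a > 255


def software_mul(a, b):
    """Stand-in for a software trap handler: slow but correct for b >= 0."""
    result = 0
    for _ in range(b):
        result += a
    return result
```

Before a patch, `layer.call("mul", 300, 2)` returns the wrong answer from `fast_mul`; after `layer.patch("mul", software_mul, when=lambda a, b: a > 255)`, only the failing cases take the slow path and everything else still runs at full speed.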
Summary
Ever increasing chip complexity prevents total testing
before tape-out (or even before shipping)
AFV techniques can keep chip verification from being
subject to combinatorial explosion
We have to accept that there will be architectural and
functional failures in every advanced chip that is built
Architecture support is needed to allow failures to be
worked around or fixed post-silicon