This is a class presentation about A Micropower DSP for Sensor Applications
Nathan J. Ickes, “A Micropower DSP for Sensor Applications,” PhD thesis, MIT, 2008.
by: Majid Namaki
Custom Implementation of DSP Systems, Spring 2010
Instructor: Dr S. M. Fakhraei
May 2010
1
Introduction
Why low power? Heat dissipation limits and battery
lifetime concerns.
Wireless microsensor networks and implanted medical
devices are two examples of such applications.
2
Microsensor Applications
Microsensor networks may consist of many (perhaps
hundreds or thousands of) miniature sensor nodes
scattered throughout an area of interest and linked by a
wireless network.
The network of sensors collaborates as a whole, combining
measurements made by each individual node and
delivering high-quality observations to a central base
station.
Large number of nodes in a microsensor network =>
high-resolution, multi-dimensional observations and
fault tolerance superior to more traditional sensing systems.
3
Microsensor Applications (cont.)
Applications: inventory tracking, environmental
monitoring, machine-mounted sensing, medical
monitoring, and building climate control.
Primary advantage of microsensor networks: the
spatial diversity of the data collected by the network as
a whole.
Alternatively, the sensor network may be used to
imitate a single very large sensor, one that might be
impractically large to build or deploy.
4
Microsensor Applications (cont.)
Extremely small, yet long-lived sensors => power
efficiency (the central issue in the design of microsensors)
Self-powered node: scavenges energy from ambient
solar, thermal, or mechanical sources, but it is
physically large and limited to outdoor applications.
5
Common Characteristics
Low duty cycle: Nodes can be idle over 99% of the time
=> standby power must be minimized.
Event driven: Typical events handled by nodes include
sending or receiving radio data, and collecting
measurement data => Events must be handled quickly
and efficiently to maximize node lifetime.
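To illustrate why a low duty cycle makes standby power so important, here is a rough sketch of time-averaged node power. The power figures below are illustrative assumptions only, not measurements from the thesis; `average_power` is a hypothetical helper.

```python
def average_power(duty_cycle, active_w, standby_w):
    """Time-averaged power for a node active a fraction `duty_cycle` of the time."""
    return duty_cycle * active_w + (1.0 - duty_cycle) * standby_w


# Hypothetical numbers, chosen only for illustration.
ACTIVE_W = 40e-6   # assumed power while handling an event
STANDBY_W = 1e-6   # assumed power while idle
DUTY = 0.01        # active 1% of the time, i.e. idle 99% of the time

avg = average_power(DUTY, ACTIVE_W, STANDBY_W)
standby_share = (1.0 - DUTY) * STANDBY_W / avg
print(f"average power: {avg * 1e6:.2f} uW, standby share: {standby_share:.0%}")
```

With these assumed values, most of the average power is spent while idle, even though standby power is far below active power.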
6
Common Characteristics (cont.)
Localized data processing: Preliminary signal processing
and data analysis occur within the network. E.g., to save
energy, nearby nodes might aggregate their data, reducing
the amount of data that must be sent to the network
base station => increases the peak processing capability
required on each node.
Unpredictable performance requirements: Performance
demands on any given node are variable and unpredictable
before deployment => variations in the node's required
radio transmission power and in the amount and
type of signal processing required.
7
Acoustic Tracking Application
8
The µAMPS DSP
MIT µAMPS (micro, adaptive, multi-domain, power-aware
sensors) project.
µAMPS microsensors are designed for acoustic tracking
and other applications requiring sensor sampling rates of
1 to 100 kS/s and significant post-acquisition signal processing,
such as filtering, compression, or spectral analysis.
A 4 MIPS, 10 pJ per instruction DSP is designed to form the
core of a µAMPS sensor node.
The DSP is implemented in 90 nm low-power CMOS.
6.3 million transistors (6 million of which are contained in
the on-chip memory).
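As a quick sanity check on the headline numbers, the sketch below multiplies the instruction rate by the energy per instruction. It assumes the 10 pJ figure applies at the full 4 MIPS rate and that nothing else (leakage, ADC, radio) is counted, so treat the result as a rough estimate rather than a reported figure.

```python
INSTRUCTIONS_PER_SECOND = 4e6       # 4 MIPS, from the slide
ENERGY_PER_INSTRUCTION_J = 10e-12   # 10 pJ per instruction, from the slide


def core_power_watts(ips, joules_per_instruction):
    """Power implied by an instruction rate and an energy per instruction."""
    return ips * joules_per_instruction


power_w = core_power_watts(INSTRUCTIONS_PER_SECOND, ENERGY_PER_INSTRUCTION_J)
print(f"about {power_w * 1e6:.0f} uW at the full instruction rate")
```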
9
µAMPS Sensor Node Architecture
The node consists of three primary components: the
DSP, a custom 12-bit 100 kS/s ADC, and a commercial
ZigBee radio (the ChipCon CC2420).
10
DSP Block Diagram
11
Performance
12
Main Contributions
Memory power optimization
Instruction cache design
Modeling of power-gating
Hardware accelerators
13
Miniature Instruction cache
The cache is direct-mapped and organized as sixteen
lines of four words. The cache memory is implemented
using flip-flops (rather than SRAM), allowing it to
operate at the lower logic power supply voltage.
The tag comparison and valid-flag logic is
asynchronous, so that in the event of a cache miss, a
main memory access can be initiated on the same
cycle.
An instruction can therefore be fetched on every clock
cycle, regardless of whether a cache hit or miss occurs.
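The cache organization above can be sketched as a small behavioral model. This is not the thesis's implementation: it assumes word-addressed instruction fetches and that a miss fills the whole line at once, neither of which is stated on the slide. The class and method names are hypothetical.

```python
LINES = 16           # sixteen lines, from the slide
WORDS_PER_LINE = 4   # four words per line, from the slide
OFFSET_BITS = 2      # log2(WORDS_PER_LINE)
INDEX_BITS = 4       # log2(LINES)


class DirectMappedCache:
    """Behavioral model: tracks only tags and valid flags, not the data."""

    def __init__(self):
        self.valid = [False] * LINES
        self.tags = [0] * LINES

    @staticmethod
    def split(word_addr):
        """Split a word address into (tag, line index, word offset)."""
        offset = word_addr & (WORDS_PER_LINE - 1)
        index = (word_addr >> OFFSET_BITS) & (LINES - 1)
        tag = word_addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    def access(self, word_addr):
        """Return True on a hit; on a miss, fill the line (assumed) and return False."""
        tag, index, _ = self.split(word_addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


# An 8-instruction loop executed three times: only the first pass misses.
cache = DirectMappedCache()
trace = list(range(0x100, 0x108)) * 3
hits = sum(cache.access(addr) for addr in trace)
print(f"{hits} hits out of {len(trace)} fetches")
```

Small loops like this are the case the miniature cache targets: once the loop body is resident, fetches no longer go to main memory.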
14
Power Gating
Clock gating => reduces dynamic power consumption in
idle logic
Power gating => reduces idle-mode leakage power
consumption, particularly in deep-sleep states and in
modern sub-100 nm process technologies.
Power Gating is complicated: Power cannot be turned on
and off on a cycle-by-cycle basis as is the case in clock
gating. Some amount of planning ahead is required before
powering off a logic block, to ensure that power can be
restored in time before the logic is needed again.
I. Higher threshold voltage device for the power switch
II. Boosting the gate voltage to the power switch
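The "planning ahead" requirement for power gating can be illustrated with a simple break-even calculation. This is a generic sketch, not the power-gating model developed in the thesis: it assumes a fixed energy overhead per power-down/power-up cycle and a fixed wake-up latency, and all parameter names and values are hypothetical.

```python
def break_even_idle_s(overhead_j, leak_on_w, leak_off_w):
    """Idle time beyond which gating saves more leakage energy than it costs."""
    saved_w = leak_on_w - leak_off_w
    if saved_w <= 0:
        raise ValueError("gating must reduce leakage power")
    return overhead_j / saved_w


def should_power_gate(expected_idle_s, wake_latency_s,
                      overhead_j, leak_on_w, leak_off_w):
    """Gate only if the idle time left after wake-up latency exceeds break-even."""
    usable_idle_s = expected_idle_s - wake_latency_s
    return usable_idle_s > break_even_idle_s(overhead_j, leak_on_w, leak_off_w)


# Hypothetical block: 1 nJ switching overhead, 10 uW leakage when powered,
# 0.1 uW when gated, 10 us to restore power.
params = dict(wake_latency_s=10e-6, overhead_j=1e-9,
              leak_on_w=10e-6, leak_off_w=0.1e-6)
print(should_power_gate(expected_idle_s=1e-3, **params))   # long idle period
print(should_power_gate(expected_idle_s=50e-6, **params))  # short idle period
```

Under these assumptions, a short idle period is not worth gating, which is why the decision cannot be made cycle by cycle the way clock gating can.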
15
Power Gating (cont.)
12 independent power domains: nine memory banks,
the FFT and FIR accelerator cores, and the CPU.
16
µAMPS CPU Architecture
The primary design strategy was to minimize the complexity of the
control logic in the processor =>
All instructions execute in one clock cycle (CPI = 1).
All instructions have the same 16-bit length.
A second design goal was to minimize the number of data
memory accesses.
The processor contains three functional units: an ALU
implementing add, subtract, and bitwise logical operations
(AND, OR, XOR, NOT), a barrel shifter, and a
multiply-accumulate (MAC) unit. The MAC consists of a 16 x 16-bit
single-cycle multiplier and a 48-bit accumulator register. The
accumulator is readable and writable as special purpose registers
r8, r9, and r10.
3-stage (fetch, execute, and write back) pipeline
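A behavioral sketch of the MAC unit may help make the 48-bit accumulator concrete. Several details here are assumptions rather than facts from the slides: operands are treated as signed two's-complement values, the accumulator wraps on overflow, and r8 is taken to be the least significant 16-bit word with r10 the most significant.

```python
ACC_BITS = 48
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


class Mac:
    def __init__(self):
        self.acc = 0  # stored as an unsigned 48-bit pattern

    def mac(self, a, b):
        """acc += a * b with 16-bit signed operands (signedness is assumed)."""
        product = to_signed(a, 16) * to_signed(b, 16)
        self.acc = (self.acc + product) & ACC_MASK

    def read_regs(self):
        """Return (r8, r9, r10), assuming r8 is the low word and r10 the high word."""
        return (self.acc & 0xFFFF,
                (self.acc >> 16) & 0xFFFF,
                (self.acc >> 32) & 0xFFFF)

    def value(self):
        """Accumulator contents as a signed integer."""
        return to_signed(self.acc, ACC_BITS)


# Dot product of two short vectors, one multiply-accumulate per element.
unit = Mac()
for a, b in zip([1000, -2000, 3000], [4, 5, -6]):
    unit.mac(a, b)
print(unit.value(), [hex(r) for r in unit.read_regs()])
```

The wide accumulator lets many 32-bit products be summed without intermediate overflow, which fits the goal of minimizing data memory accesses.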
17
Accelerator Cores
The µAMPS DSP, being designed for acoustic sensing
applications, incorporates accelerators for both FIR
filtering and FFTs. The accelerators are implemented
as memory-mapped devices.
Energy savings obtained by using a hardware
accelerator:
Intrinsic savings in performing the actual computation
(e.g., reduced cycle count and control logic overhead),
Extrinsic savings from reduced utilization of global
resources (e.g., reducing the number of main memory
accesses).
18
FIR Accelerator
The FIR filter accelerator implements symmetric filters
with up to 16 taps. The accelerator consists of a
register file holding up to eight 16-bit tap coefficients,
a 16-word by 16-bit circular buffer for holding the input samples, a
single multiply-accumulate unit, an adder/subtracter,
and a control state machine.
Due to their small size, the sample and coefficient
memories are implemented using flip-flops, rather
than SRAM macros.
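The reason eight coefficients suffice for a 16-tap symmetric filter is that each coefficient multiplies the sum of two samples. The sketch below models that idea with a 16-entry circular buffer and one pre-add plus one multiply-accumulate per coefficient. It is a functional model only: it ignores 16-bit fixed-point overflow and the accelerator's actual control sequence, and the class name is hypothetical.

```python
TAPS = 16
HALF = TAPS // 2


class SymmetricFir:
    """16-tap symmetric FIR: h[k] == h[15 - k], so only h[0..7] are stored."""

    def __init__(self, half_coeffs):
        if len(half_coeffs) != HALF:
            raise ValueError("expected eight coefficients")
        self.coeffs = list(half_coeffs)
        self.buf = [0] * TAPS   # circular sample buffer
        self.head = 0           # index of the most recent sample

    def step(self, sample):
        """Push one input sample and return one output sample."""
        self.head = (self.head + 1) % TAPS
        self.buf[self.head] = sample
        acc = 0
        for k in range(HALF):
            newer = self.buf[(self.head - k) % TAPS]               # x[n - k]
            older = self.buf[(self.head - (TAPS - 1 - k)) % TAPS]  # x[n - 15 + k]
            acc += self.coeffs[k] * (newer + older)                # pre-add, then MAC
        return acc


half = [1, 2, 3, 4, 5, 6, 7, 8]
fir = SymmetricFir(half)
samples = list(range(1, 41))
outputs = [fir.step(s) for s in samples]
print(outputs[:4])
```

Folding the filter this way halves the number of multiplications per output, which is one plausible use of the adder alongside the single MAC unit.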
19
FFT Accelerator
The FFT core computes transforms on 128-, 256-, 512- or
1024-point real-valued inputs, with 16-bit precision.
The accelerator performs a complete butterfly in one clock
cycle, compared to ~95 cycles per butterfly required for a
software implementation.
The local memory for the accelerator is split into four
banks, based on the MSB and parity of each address.
Each butterfly computation operates on values from two
different banks, allowing both values to be fetched at the
same time. The butterfly operations are specifically
ordered so that sequential butterflies involve disjoint sets
of memory banks. This allows processing one butterfly per
clock cycle, with the results from one butterfly being
written back to two memory banks while the inputs to the
next butterfly are read from the other two banks. A small
number of hazards are unavoidable and result in stalling
the datapath for one cycle.
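The bank-splitting rule can be checked with a short sketch. It assumes that "parity" means the XOR of all address bits and that the transform is an in-place radix-2 FFT, where the two operands of a butterfly differ in exactly one address bit; the 9-bit address width is a hypothetical choice. The specific butterfly ordering that keeps sequential butterflies on disjoint banks is not modeled here.

```python
def parity(value):
    """XOR of all bits of value: 1 if an odd number of bits are set."""
    return bin(value).count("1") & 1


def bank(addr, addr_bits):
    """Bank number 0..3 built from the address MSB and the address parity."""
    msb = (addr >> (addr_bits - 1)) & 1
    return (msb << 1) | parity(addr)


def butterfly_pairs(addr_bits):
    """Yield (stage, a, b) for every radix-2 butterfly; a and b differ in one bit."""
    n = 1 << addr_bits
    for stage in range(addr_bits):
        stride = 1 << stage
        for a in range(n):
            if not a & stride:
                yield stage, a, a | stride


ADDR_BITS = 9  # hypothetical: 512-entry local memory
conflicts = [(s, a, b) for s, a, b in butterfly_pairs(ADDR_BITS)
             if bank(a, ADDR_BITS) == bank(b, ADDR_BITS)]
print(f"butterflies whose operands share a bank: {len(conflicts)}")
```

Because flipping a single address bit always flips the parity, the two operands of any butterfly land in different banks under this interpretation, so both can be fetched in the same cycle.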
20
Comparison of the µAMPS DSP
with other micropower processors
21