Fixed Point number and arithmetic

Download Report

Transcript Fixed Point number and arithmetic

Number Representation for
DSP processor
•
•
•
•
•
Numerical issues and data formats.
Fixed point.
Fractional number.
Floating point.
Comparison of formats and dynamic
ranges.
Numerical Issues and Data
Formats
C6000 Numerical
Representation
Fixed point arithmetic:
•
•
Floating point arithmetic:
16-bit (integer or fractional). •
Signed or unsigned.
•
32-bit single precision.
64-bit Double precision.
• How are numbers represented and
processed in DSP processors for
implementing DSP algorithms?
How numbers are represented?
• A collection of N binary digits (bits) has 2N possible
states.
– This can be seen from elementary counting theory, which tells us
that there are two possibilities for the first bit, two possibilities for
the next bit, and so on until the last bit, resulting in 2×2×2… = 2N
possibilities or states.
• In the most general sense, we can allow these states to
represent anything conceivable.
– The point is that there is no meaning inherent in a binary word,
although most people are tempted to think of them as positive
integers.
• However, the meaning of an N-bit binary word depends entirely on
its interpretation.
Fixed Point number and arithmetic
Binary representation of 4-bit signed number
Decimal
Value
Sign
Magnitude
One’s
Two’s
Complement Complement
7
1
+0
-0
-1
-7
-8
0111
0001
0000
1000
1001
1111
-
0111
0001
0000
1111
1110
1000
-
0111
0001
0000
1111
1001
1000
3 BIT NUMBER IN SIGNED 2’S COMPLEMENT
REPRESENTATION
Dynamic range of integer and fraction 4-bit numbers
Unsigned integer
Signed Integer
Smallest Value:0000(0)
Largest value :1111(15)
Most Positive Value:0111= (+7)
Least negative value:1000 =(-8)
Unsigned Fraction
Unsigned Fraction
Smallest Value:.0000(0)
Largest value :.1111(0.9375)
Most Positive Value:0.111=
(+0.875)
Least negative value:1.000 =(-1)
Dynamic Range
• Dynamic range in dB= 20 log10(Max/min)
• In fixed point Unsigned integer
representation using N-bit range of Max to
min number is 2N to 1.
• In fixed point signed integer representation
using N-bit range of Max to min number is
2N-1 to 1.
• In fixed point Unsigned fraction
representation using N-bit range of Max to
min number is 1-2-N to 2-N.
• In fixed point signed fraction
representation using N-bit range of Max to
min number is 1-2-N+1 to 2-N+1.
Number
format
Dynamic
Range
Dynamic
Range in dB
Precisio
n
Unsigned
Integer
signed
Integer
0 to 65536
20log10(2^16)=
96 dB
20log10(2^15)=
90 dB
1
Unsigned
Fraction
0 to
0.99998474
96 dB
2^-16
signed
Fraction
-1 to
0.99996948
90 dB
2^-15
-32768 to
32767
1
Fixed Point Arithmetic - Problems
• The following equation is the basis of many
DSP algorithms (See Chapter 1):
yn  
N 1
 ak xn  k 
k 0
• Two problems arise when using signed and
unsigned integers:
– Multiplication overflow.
– Addition overflow.
•
•
Multiplication Overflow
16-bit x 16-bit = 32-bit
Example: using 4-bit representation
0
•
0
0
0
0
1
1
x
1
0
0
0
1
1
0
0
0
3
x
8
24
24 cannot be represented with 4-bits.
•
•
•
Addition Overflow
32-bit + 32-bit = 33-bit
Example: using 4-bit representation
1
0
0
0
+
1
0
0
0
1
0
0
0
0
8
+
8
16
16 cannot be represented with 4-bits.
Fixed Point Arithmetic - Solution
•
The solutions for reducing the overflow
problem are:
–
–
–
–
Saturate the result.
Use double precision result.
Use fractional arithmetic.
Use floating point arithmetic.
•
Solution - Saturate the result
Unsigned numbers:
– If A x B  15  result = A x B
– If A x B > 15  result = 15
0
0
0
Saturated
0
0
1
1
3
x
1
0
0
0
8
1
1
0
0
0
24
1
1
1
1
15
•
Solution - Saturate the result
Signed numbers:
– If -8  A x B  7  result = A x B
– If
A x B > 7  result = 7
– If
A x B < -8  result = -8
1
1
1
Saturated
0
0
1
1
3
x
1
0
0
0
-8
0
1
0
0
0
-24
1
0
0
0
-8
Solution - Double precision result
•
•
For a 4-bit x 4-bit multiplication hold the
result in an 8-bit location.
Problems:
– Uses more memory for storing data.
– If the result is used in another multiplication
the data needs to be represented into single
precision format (e.g. prod = prod x sum).
– Results need to be scaled down if it is to be
sent to an A to D converter.
Solution - Fractional arithmetic
•
If A and B are fractional then:
– A x B < min(A, B)
– i.e. The result is less than the operands
hence it will never overflow.
•
Examples:
– 0.6 x 0.2 = 0.12 (0.12 < 0.6 and 0.12 < 0.2)
– 0.9 x 0.9 = 0.81 (0.81 < 0.9)
– 0.1 x 0.1 = 0.01 (0.01 < 0.1)
Fractional numbers - Sign Extension
a=
0
1
1
0
= 0.5 + 0.25 = 0.75
b= 1
1
1
0
= -1 + 0.5 + 0.25 = -0.25
1
0
0
0
1
1
0
1
1
0
0
1
0
.
0
0
.
.
0
.
.
.
1
1
1
0
1
0
0
x
1
Sign extension
•
To keep the same resolution as the operands we
need to select these 4-bits:
1
1
1
0
Fractional numbers - Sign Extension
•
a= 0
1
1
0 = 0.5 + 0.25 = 0.75
x b= 1
1
1
0 = -1 + 0.5 + 0.25 = -0.25
0
0
0
1
0
0
0
1
0
0
0
0
0
0
1
1
0
1
1
0
0
1
0
.
0
0
.
.
0
.
.
.
1
1
1
1
0
1
0
0
Sign extension bits
Sign extension
The way to do it is to shift left by one bit and
store upper 4-bits or right shift by three and
store the lower 4-bits:
1
1
1
0
• Although using 2's complement integers
we can implement both addition and
subtraction by usual binary addition (with
special care for the sign bit), the integers
are not convenient to handle to implement
DSP algorithms.
– For example, if we multiply two 8-bit words
together, we need 16 bits to store the result.
• The number of required word length increases
without bound as we multiply numbers together
more.
– Although not impossible, it is complicated to handle
this increase in word-length using integer arithmetic.
• The problem can be easily handled by using
numbers between -1 and 1, instead of integers,
because the product of two numbers in [-1,1] are
always in the same range.
• In the 2's complement fractional representation,
an N bit binary word can represent 2N equally
space numbers from
• For example, we interpret an 8-bit binary word
b7 b6 b5 b4 b3 b2 b1 b0
• as a fractional number
• This representation is called Q-format.
Integer Vs Fractional Number
• Different notation are used to represent
different binary formats.
• The Qm.n represent are most widely used
• In Qm.n representation m bit
representation integer portion and n bit
represent fraction number
• Eg Q15.0 and Q0.15
• If N is total number of bits then N=m+n+1.
• Number range is -2^m to 2^m – 2^-n
• Its resolution is 2^-n
• For eg Q14.1 representation
• Its range is
• [-2^14, 2^14 – 2^-1] = [-16384.0, +16383.5] =
So in hex [0x8000, 0x8001 … 0xFFFF, 0x0000,
0x0001 … 0x7FFE, 0x7FFF]
resolution is 2^-1=0.5
Conversion
• To convert a number from floating point to
Qm.n format:
• 1.Multiply the floating point number by 2n
• 2.Round to the nearest integer
Conversion
• To convert a number from Qm.n format to
floating point
• 1.Convert the number directly to floating
point
• 2.Divide by 2^n
• In C6211, it is easiest to handle Q-15
numbers represented by each 16 bit
binary word, because the multiplication of
two Q-15 numbers results in a Q-30
number that can still be stored in a 32-bit
wide register of C6211.
– The programmer needs to keep track of the
implied binary point when manipulating Qformat numbers.
15-bit * 15-bit Multiplication
Q15 s. x x x x x x x x x x x x x x x
x Q15 s. y y y y y y y y y y y y y y y
CPU
MPY
NOP
A3,A4,A6
Q30 s s. z z z z z z z z z z z z z z z z z z z z z z z z z z z z z z
Q15 s. y y y y y y y y y y y y y y y
Store to
Data Memory
SHR
STH
A6,15,A6
A6,*A7
Assembly language implementation
• When A0 and A1 contain two 16-bit numbers in
the Q-15 format, we can perform the
• multiplications using MPY followed by a right
shift.
• 1 MPY .M1 A0,A1,A2
• 2 NOP
• 3 SHR .S1 A2,15,A2 ;lower 16 bit contains result
• in Q-15 format
C language implementation
• Let's suppose we have two 16-bit numbers in Q-15 format, stored in
variable x and y as follows:
• short x = 0x0011; /* 0.000518799 in decimal */
• short y = 0xfe12; /* -0.015075684 in decimal */
• short z; /* variable to store x*y */
• The product of x and y can be computed and stored in Q-15 format
as follows:
• z = (x * y) > > 15;
• The result of x*y is a 32-bit word with 2 sign bits. Right shifting it by
15 bits ignores the last 15 bits, and storing the shifted result in z
that is a short variable (16 bit) removes the extended sign bit by
taking only lower 16 bits.
• However, care must be taken when adding
binary numbers.
– Because each Q-15 number can represent
numbers in the range [-1,1−215] , if the result
of summing two Q-15 numbers is not in this
range, we cannot represent the result in the
Q-15 format.
• When this happens, we say an overflow has
occurred.
• Unless carefully handled, the overflow makes the result
incorrect.
– Therefore, it is really important to prevent overflows from
occurring when implementing DSP algorithms.
• One way of avoiding overflow is to scale all the numbers
down by a constant factor, effectively making all the
numbers very small, so that any summation would give
results in the [-1,1) range.
– This scaling is necessary and it is important to figure out how
much scaling is necessary to avoid overflow.
• Because scaling results in loss of effective number of digits,
increasing quantization errors, we usually need to find the minimum
amount of scaling to prevent overflow.
• Most Fixed-Point DSP processor use two’s
complement fractional numbers in different
Q format. However assembler only
recognize integer value.
• So we have to keep track of binary point
when considering fractional number in
assembly program.
Steps of converting fractional number into Qformat into an integer that can be recognized by
the assembler
• 1. Normalize the fractional number to the
range determined by the desired Q-format.
• 2. Multiply the Normalized fractional by
2^n, where n is the total number of
fractional bits.
• 3. Round the product to the nearest
integer.
‘C6000 C Data Types
Type
Size
Representation
char, signed char
unsigned char
short
unsigned short
int, signed int
unsigned int
long, signed long
unsigned long
enum
float
double
long double
pointers
8 bits
8 bits
16 bits
16 bits
32 bits
32 bits
40 bits
40 bits
32 bits
32 bits
64 bits
64 bits
32 bits
ASCII
ASCII
2’s complement
binary
2s complement
binary
2’s complement
binary
2’s complement
IEEE 32-bit
IEEE 64-bit
IEEE 64-bit
binary
Exponential Notation
• The following are equivalent
representations of 1,234
123,400.0
x 10-2
12,340.0
x 10-1
1,234.0
x 100
123.4
x 101
12.34
1.234
x 102
x 103
0.1234 x 104
The representations differ
in that the decimal place –
the “point” -- “floats” to
the left or right (with the
appropriate adjustment in
the exponent).
p. 122
Parts of a Floating Point
Number
-0.9876 x
Sign of
mantissa
Location of
decimal point
Mantissa
-3
10
Exponent
Sign of
exponent
Base
p. 123
IEEE 754 Standard
• Most common standard for representing floating
point numbers
• Single precision: 32 bits, consisting of...
– Sign bit (1 bit)
– Exponent (8 bits)
– Mantissa (23 bits)
• Double precision: 64 bits, consisting of…
– Sign bit (1 bit)
– Exponent (11 bits)
– Mantissa (52 bits)
p. 133
Single Precision Format
32 bits
Mantissa (23 bits)
Exponent (8 bits)
Sign of mantissa (1 bit)
Normalization
• The mantissa is normalized
• Has an implied decimal place on left
• Has an implied “1” on left of the decimal
place
• E.g.,
– Mantissa 
– Represents…
10100000000000000000000
1.1012 = 1.62510
Excess Notation
• To include +ve and –ve exponents, “excess”
notation is used
• Single precision: excess 127
• Double precision: excess 1023
• The value of the exponent stored is larger than
the actual exponent
• E.g., excess 127, 10000111
– Exponent 
135 – 127 = 8
– Represents…
Example
• Single precision
0 10000010 11000000000000000000000
1.112
130 – 127 = 3
0 = positive mantissa
+1.112 x 23 = 1110.02 = 14.010
Hexadecimal
• It is convenient and common to represent
the original floating point number in
hexadecimal
• The preceding example…
0 10000010 11000000000000000000000
4
1
6
0
0 0 0
0
Converting from Floating Point
• E.g., What decimal value is represented
by the following 32-bit floating point
number?
C17B000016
• Step 1
– Express in binary and find S, E, and M
C17B000016 =
1 10000010 111101100000000000000002
S
E
1 = negative
0 = positive
M
• Step 2
– Find “real” exponent, n
– n = E – 127
= 100000102 – 127
= 130 – 127
=3
• Step 3
– Put S, M, and n together to form binary result
– (Don’t forget the implied “1.” on the left of the
mantissa.)
-1.11110112 x 2n =
-1.11110112 x 23 =
-1111.10112
• Step 4
– Express result in decimal
-1111.10112
-15
2-1 = 0.5
2-3 = 0.125
2-4 = 0.0625
0.6875
Answer: -15.6875
Converting to Floating Point
• E.g., Express 36.562510 as a 32-bit
floating point number (in hexadecimal)
• Step 1
– Express original value in binary
36.562510 =
100100.10012
• Step 2
– Normalize
100100.10012 =
1.0010010012 x 25
• Step 3
– Determine S, E, and M
+1.0010010012 x 25
n
S
M
S = 0 (because the value is positive)
E = n + 127
= 5 + 127
= 132
= 100001002
• Step 4
– Put S, E, and M together to form 32-bit binary
result
0 10000100 001001001000000000000002
S
E
M
• Step 5
– Express in hexadecimal
0 10000100 001001001000000000000002 =
0100 0010 0001 0010 0100 0000 0000 00002 =
4
2
1
2
4
0
Answer: 4212400016
0
016
Comparison of fixed point and
floating point
Fixed Point
Floating point
Limited Dynamic range
Large Dynamic Range
Overview flow and
quantization errors must
be resolved
Long product
development time
Cheaper
Easier to program since
no scaling is required
Lower power
consumption
High Power
consumption
Quick time to market
More expensive
Fixed Point
Floating point
Consumer Application
such as MP3 player
,multimedia gaming, and
digital camera
High-eng audio
application such as
ambient acoustic
simulation ,audio
encoding/decoding and
audio mixing
Thank you