Arithmetic in HDL and Fixed-Point
When generating code in Engee, as well as when working with Fixed-Point in the computation environment and in simulation, certain rules apply to arithmetic operations.
The arithmetic in C rules have many implicit transformations, and the result type of a mathematical operation is inferred based on the signs and ranks of the arguments.
The rules for arithmetic operations in HDL (hardware description languages, e.g. Verilog, Chisel) are concise, but are complicated by arbitrary argument widths as well as fixed point.
Rules: concise and in words
-
If all argument types are unsigned, the result type is unsigned. Otherwise, the result type is signed.
-
Type width is not limited to 8/16/32/64 bits, but can be an arbitrary natural number up to and including 128.
-
The result type is selected so that no overflow occurs.
Operation |
Result type |
Example |
---|---|---|
Addition |
Minimum required type to accommodate the full range of possible result values |
|
Multiplication |
The width of the result type is the sum of the widths of the argument types, the same with the fractional part. |
|
Fixed-Point
In addition to floating-point numbers, fixed-point numbers are used in software and hardware development. They are less flexible than floating-point numbers, but operations on fixed-point numbers are usually faster. The following are the basic concepts of fixed-point numbers.
Engee currently uses a binary-point-only (power-of-two) version of fixed-point numbers rather than slope-bias. |
To learn how to work with Fixed-Point in the Engee computing environment, see the article Fixed-Point Arithmetic (Fixed-Point) in Engee. |
Binary point
Integers are represented in binary as the sum of powers of two. For example, 9 in binary is :
Bit number from MSB (most significant) to LSB (least significant) |
3 |
2 |
1 |
0 |
Value |
1 |
0 |
0 |
1 |
Degree of two * bit value |
|
|
|
|
Contribution in response |
8 |
0 |
0 |
1 |
To represent non-integer numbers in binary, you can mentally put a dot after the zero bit, and start counting by negative powers of two. For example, 9.625
is encoded as :
Bit number from MSB (most significant) to LSB (least significant) |
3 |
2 |
1 |
0 |
-1 |
-2 |
-3 |
Value |
1 |
0 |
0 |
1 |
1 |
0 |
1 |
Degree of two * bit value |
|
|
|
|
|
|
|
Contribution in response |
8 |
0 |
0 |
1 |
0.5 |
0 |
0.125 |
Definitions
-
Type Width is the number of bits encoding the values of the type.
-
The number of "fractional" bits is the Fraction Length, also referred to as the Binary Point (BP).
-
The value in decimal notation is called the Real World Value (RWV).
-
The integer that would result if you numbered the bits from LSB to MSB starting from zero (even for non-integer real values) is referred to as a stored integer, SI. For example, the stored integer for is 9, and for it is 77.
In this case
Moreover, the length of the fractional part may be negative. In this case, the stored integer is multiplied by the modulus of the fractional part length. Thus, the number 112 can be represented using only three bits and scaling :
Bit number from MSB (most significant) to LSB (least significant) |
6 |
5 |
4 |
Value |
1 |
1 |
1 |
Degree of two * bit value |
|
|
|
Contribution to the answer |
64 |
32 |
16 |
Also note that the length of the fractional part can exceed the width of the type. For example, can be represented in type width 3 and BP 11 as :
Bit number from MSB (most significant) to LSB (least significant) |
-9 |
-10 |
-11 |
Value |
0 |
1 |
0 |
Degree of two * bit value |
|
|
|
Contribution to the response |
0 |
0.0009765625 |
0 |
Negative numbers are encoded as two’s complement (two’s complement).
Denotations
We denote the most general form as fixdt(S, W, f)
according to the article Fixed-Point Arithmetic in Engee. Here:
-
S
- 0 means unsigned type, 1 means signed type; -
W
- width of the type; -
f
- length of fractional part (BP).
Examples of specific type designations:
Designation | Meaning |
---|---|
|
Unsigned integer type 32 bits wide |
|
Unsigned integer type of width 123 |
|
Signed integer type of width 16 |
|
Signed fixed-point type, width (total length) 32, length of fractional part 5 |
|
Unsigned fixed-point type, width 32, length of fractional part 5 |
|
Unsigned fixed-point type, width 3, fractional length -4 |
From an arithmetic point of view, the following types are equivalent: uint42 , ufix42 , ufix42_En0 , ufix42_E0 .
|
Precision
The difference between numbers that differ only in the least significant bit is called precision. Precision depends on the length of the fractional part as follows:
For example, ufix8_En3
can represent numbers with a precision of 0.125, and 1.0625 will round to 1.125, and 1.0624 will round to 1.0. And ufix8_E3
represents numbers to the nearest 8.
Slope Bias
An alternative (and more flexible but less performant) variant of fixed-point numbers is the slope-bias representation, where numbers are encoded as:
Binary-point-only is a special case of slope-bias, where ,
Arithmetic
Before the operation, the binary points of the arguments are aligned and then the arguments are added/multiplied. For example, to add from ufix4_En3
and from ufix2_En1
, you must first add two zeros after the imaginary point in the second number:
1001 (ufix4_En3)
+ 01 (ufix2_En1)
= 1.001
+ 0.100
= 1.101
And with a fractional length of 3 is .
Rules: in detail
Operation | Argument1 type | Argument2 type | Result type |
---|---|---|---|
Addition |
|
|
Minimum required type, accommodating the full range of possible values of the result, in simple cases |
Multiplication |
|
|
|
Addition
The width is chosen to accommodate the possible range of values of the result of the argument addition.
*Algorithm:
-
Find the minimum possible and maximum possible values that could be obtained as a result of operand addition.
-
To calculate the minimum possible value of the result, add the minimum values of the argument types (0 for unsigned numbers and for signed numbers).
-
To calculate the maximum possible value of the result, add up the maximum values of the argument types.
-
-
Find the minimum number of bits that are needed to store the range of the result.
* Examples:
-
Adding two
sfix5_En2
will result insfix6_En2
, since we need to extend the result type by one bit to save a possible carry from the high bit somewhere. Addingufix8_En1
andufix8_En3
will result inufix11_En3
:1111111.1 + 11111.111 = 1111111.100 + 0011111.111 = 11011111.001
Since bit_width .
*Other examples:
Argument type1 | Argument type2 | Result type |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sum of several numbers
For multiple arguments, the minimum possible accumulator width is found for all arguments at once, rather than sequentially first for the first two arguments, then for the addition of the previous result and the third argument, and so on. Calculations considering the types of all arguments at once save the width needed to save the result.
Sum of vector elements
For the block Sum (summation of elements of the incoming vector, special mode of the Add block) an optimisation is applied to avoid calculations in a loop (as happens in Add), and the type of the result is determined quickly. Namely, since the type of all elements is the same fixdt(S, W, f)
, it can be observed that:
Similarly, the minimum value of the result is determined, and then the required type width is selected.
It can be estimated that .
Multiplication
The rules for inferring the type of the result of multiplication are fully described in the table of section Rules: in detail, and extend by induction and associativity to any number of arguments. There are no exceptions or any additional complexities.