Arithmetic in HDL and Fixed-Point

When generating code in Engee, as well as when working with Fixed-Point in the computation environment and in simulation, certain rules apply to arithmetic operations.

The arithmetic in C rules have many implicit transformations, and the result type of a mathematical operation is inferred based on the signs and ranks of the arguments.

The rules for arithmetic operations in HDL (hardware description languages, e.g. Verilog, Chisel) are concise, but are complicated by arbitrary argument widths as well as fixed point.

Rules: concise and in words

If all argument types are unsigned, the result type is unsigned. Otherwise, the result type is signed.
Type width is not limited to 8/16/32/64 bits, but can be an arbitrary natural number up to and including 128.
The result type is selected so that no overflow occurs.

Operation	Result type	Example
Addition	Minimum required type to accommodate the full range of possible result values	`uint8` + `uint8` = `uint9`, because a carry from the higher digit may occur.
Multiplication	The width of the result type is the sum of the widths of the argument types, the same with the fractional part.	`uint8` * `uint8` = `uint16`

Operation

Result type

Example

Addition

Minimum required type to accommodate the full range of possible result values

uint8 + uint8 = uint9, because a carry from the higher digit may occur.

Multiplication

The width of the result type is the sum of the widths of the argument types, the same with the fractional part.

uint8 * uint8 = uint16

Fixed-Point

In addition to floating-point numbers, fixed-point numbers are used in software and hardware development. They are less flexible than floating-point numbers, but operations on fixed-point numbers are usually faster. The following are the basic concepts of fixed-point numbers.

Engee currently uses a binary-point-only (power-of-two) version of fixed-point numbers rather than slope-bias.

To learn how to work with Fixed-Point in the Engee computing environment, see the article Fixed-Point Arithmetic (Fixed-Point) in Engee.

Binary point

Integers are represented in binary as the sum of powers of two. For example, 9 in binary is :

Bit number from MSB (most significant) to LSB (least significant)	3	2	1	0
Value	1	0	0	1
Degree of two * bit value
Contribution in response	8	0	0	1

To represent non-integer numbers in binary, you can mentally put a dot after the zero bit, and start counting by negative powers of two. For example, 9.625 is encoded as :

Bit number from MSB (most significant) to LSB (least significant)	3	2	1	0	-1	-2	-3
Value	1	0	0	1	1	0	1
Degree of two * bit value
Contribution in response	8	0	0	1	0.5	0	0.125

Definitions

Type Width is the number of bits encoding the values of the type.
The number of "fractional" bits is the Fraction Length, also referred to as the Binary Point (BP).
The value in decimal notation is called the Real World Value (RWV).
The integer that would result if you numbered the bits from LSB to MSB starting from zero (even for non-integer real values) is referred to as a stored integer, SI. For example, the stored integer for is 9, and for it is 77.

In this case

Moreover, the length of the fractional part may be negative. In this case, the stored integer is multiplied by the modulus of the fractional part length. Thus, the number 112 can be represented using only three bits and scaling :

Bit number from MSB (most significant) to LSB (least significant)	6	5	4
Value	1	1	1
Degree of two * bit value
Contribution to the answer	64	32	16

Also note that the length of the fractional part can exceed the width of the type. For example, can be represented in type width 3 and BP 11 as :

Bit number from MSB (most significant) to LSB (least significant)	-9	-10	-11
Value	0	1	0
Degree of two * bit value
Contribution to the response	0	0.0009765625	0

Negative numbers are encoded as two’s complement (two’s complement).

Denotations

We denote the most general form as fixdt(S, W, f) according to the article Fixed-Point Arithmetic in Engee. Here:

S - 0 means unsigned type, 1 means signed type;
W - width of the type;
f - length of fractional part (BP).

Examples of specific type designations:

Designation Meaning

Designation	Meaning
`uint32`	Unsigned integer type 32 bits wide
`uint123`	Unsigned integer type of width 123
`sint16`	Signed integer type of width 16
`sfix32_En5`	Signed fixed-point type, width (total length) 32, length of fractional part 5
`ufix32_En5`	Unsigned fixed-point type, width 32, length of fractional part 5
`ufix3_E4`	Unsigned fixed-point type, width 3, fractional length -4

uint32

Unsigned integer type 32 bits wide

uint123

Unsigned integer type of width 123

sint16

Signed integer type of width 16

sfix32_En5

Signed fixed-point type, width (total length) 32, length of fractional part 5

ufix32_En5

Unsigned fixed-point type, width 32, length of fractional part 5

ufix3_E4

Unsigned fixed-point type, width 3, fractional length -4

From an arithmetic point of view, the following types are equivalent: uint42, ufix42, ufix42_En0, ufix42_E0.

Precision

The difference between numbers that differ only in the least significant bit is called precision. Precision depends on the length of the fractional part as follows:

For example, ufix8_En3 can represent numbers with a precision of 0.125, and 1.0625 will round to 1.125, and 1.0624 will round to 1.0. And ufix8_E3 represents numbers to the nearest 8.

Slope Bias

An alternative (and more flexible but less performant) variant of fixed-point numbers is the slope-bias representation, where numbers are encoded as:

Binary-point-only is a special case of slope-bias, where ,

Arithmetic

Before the operation, the binary points of the arguments are aligned and then the arguments are added/multiplied. For example, to add from ufix4_En3 and from ufix2_En1, you must first add two zeros after the imaginary point in the second number:

  1001 (ufix4_En3)
+   01 (ufix2_En1)
= 1.001
+ 0.100
= 1.101

And with a fractional length of 3 is .

Rules: in detail

Operation Argument1 type Argument2 type Result type

Operation	Argument1 type	Argument2 type	Result type
Addition	`fixdt(S1, W1, f1)`	`fixdt(S2, W2, f2)`.	Minimum required type, accommodating the full range of possible values of the result, in simple cases `fixdt(S1 or S2, max{W1, W2} + 1, max{f1, f2})`
Multiplication	`fixdt(S1, W1, f1)`	`fixdt(S2, W2, f2)`	`fixdt(S1 or S2, W1 + W2, f1 + f2)`

Addition

fixdt(S1, W1, f1)

fixdt(S2, W2, f2).

Minimum required type, accommodating the full range of possible values of the result, in simple cases
fixdt(S1 or S2, max{W1, W2} + 1, max{f1, f2})

Multiplication

fixdt(S1, W1, f1)

fixdt(S2, W2, f2)

fixdt(S1 or S2, W1 + W2, f1 + f2)

Addition

The width is chosen to accommodate the possible range of values of the result of the argument addition.

*Algorithm:

Find the minimum possible and maximum possible values that could be obtained as a result of operand addition.
1. To calculate the minimum possible value of the result, add the minimum values of the argument types (0 for unsigned numbers and for signed numbers).
2. To calculate the maximum possible value of the result, add up the maximum values of the argument types.
Find the minimum number of bits that are needed to store the range of the result.

* Examples:

Adding two sfix5_En2 will result in sfix6_En2, since we need to extend the result type by one bit to save a possible carry from the high bit somewhere. Adding ufix8_En1 and ufix8_En3 will result in ufix11_En3:
```
  1111111.1
+ 11111.111
=  1111111.100
+  0011111.111
= 11011111.001
```
Since bit_width .

*Other examples:

Argument type1 Argument type2 Result type

Argument type1	Argument type2	Result type
`sfix5_En2`	`sfix5_En2`	`sfix6_En2`
`sfix18_En5`	`sfix16_En6`	`sfix20_En6`
`ufix5_En17`	`ufix32_E2`	`ufix51_En17`
`ufix4_En7`	`ufix5_E2`	`ufix14_En7`
`ufix4_En7`	`ufix5_E2`	`ufix15_En7`
`ufix4_En7`	`sfix14_E2`	`sfix24_En7`
`ufix4_En7`	`fix14_En2`	`sfix20_En7`
`sfix4_En38`	`sfix6_E78`	`sfix123_En38`

sfix5_En2

sfix6_En2

sfix18_En5

sfix16_En6

sfix20_En6

ufix5_En17

ufix32_E2

ufix51_En17

ufix4_En7

ufix5_E2

ufix14_En7

ufix4_En7

ufix5_E2

ufix15_En7

ufix4_En7

sfix14_E2

sfix24_En7

ufix4_En7

fix14_En2

sfix20_En7

sfix4_En38

sfix6_E78

sfix123_En38

Sum of several numbers

For multiple arguments, the minimum possible accumulator width is found for all arguments at once, rather than sequentially first for the first two arguments, then for the addition of the previous result and the third argument, and so on. Calculations considering the types of all arguments at once save the width needed to save the result.

Sum of vector elements

For the block Sum (summation of elements of the incoming vector, special mode of the Add block) an optimisation is applied to avoid calculations in a loop (as happens in Add), and the type of the result is determined quickly. Namely, since the type of all elements is the same fixdt(S, W, f), it can be observed that:

Similarly, the minimum value of the result is determined, and then the required type width is selected.

It can be estimated that .

Multiplication

The rules for inferring the type of the result of multiplication are fully described in the table of section Rules: in detail, and extend by induction and associativity to any number of arguments. There are no exceptions or any additional complexities.

Arithmetic in HDL and Fixed-Point

Rules: concise and in words

Fixed-Point

Binary point

Definitions

Denotations

Precision

Slope Bias

Arithmetic

Rules: in detail

Addition

Sum of several numbers

Sum of vector elements

Multiplication

Useful References