Exotic Floating Point Formats¶

Python floats are typically 64 bits long, but 32 and 16 bit sizes are also supported through the struct module. These are the well-known IEEE formats. Recently, lower precision floating points have become more widely used, largely driven by the requirements of machine learning algorithms and hardware.

As well as the ‘half’ precision 16 bit standard, a truncated version of the 32 bit standard called ‘bfloat16’ is used which has the range of a 32-bit float but less precision.

The #bits value in the tables below show how the available bits are split into sign + exponent + mantissa. There’s always 1 bit to determine the sign of the floating point value. The more bits in the exponent, the larger the range that can be represented. The more bits in the mantissa, the greater the precision (~significant figures) of the values.

Type	# bits	Standard	+ive Range	bitstring / struct format
Double precision	1 + 11 + 52	IEEE 754	10^-308 → 10³⁰⁸	`'float64'` / `'d'`
Single precision	1 + 8 + 23	IEEE 754	10^-38 → 10³⁸	`'float32'` / `'f'`
Half precision	1 + 5 + 10	IEEE 754	6×10^-8 → 65504	`'float16'` / `'e'`
bfloat	1 + 8 + 7	`-`	10^-38 → 10³⁸	`'bfloat'` / `-`

An example of creation and interpretation of a bfloat:

>>> a = Bits(bfloat=4.5e23)  # No need to specify length as always 16 bits
>>> a
Bits('0x66be')
>>> a.bfloat
4.486248158726163e+23  # Converted to Python float

IEEE 8-bit Floating Point Types¶

Note

This is an experimental feature and may be modified in future point releases.

In bitstring prior to version 4.2 the p4binary8 and p3binary8 formats were called e4m3float and e5m2float respectively. The two formats are almost identical, the difference being the addition of inf values that replace the largest positive and negative values that were previously available.

Neither should be confused with the e4m3mxfp and e5m2mxfp formats from the Open Compute Project described below.

The ‘binary8’ formats are part of an ongoing IEEE standardisation process. This implementation here is based on a publicly available draft of the standard. There are seven formats defined in the draft standard, but only two are supported here. If you’d like the other precisions supported then raise a feature request!

The p4binary8 has a single sign bit, 4 bits for the exponent and 3 bits for the mantissa. For a bit more range and less precision you can use p3binary8 which has 5 bits for the exponent and only 2 for the mantissa. Note that in the standard they are called binary8p4 and binary8p3, but for bitstring types any final digits should be the length of the type, so these slightly modified names were chosen.

Type	# bits	+ive Range	bitstring format
binary8p4	1 + 4 + 3	10^-3 → 224	`'p4binary8'`
binary8p3	1 + 5 + 2	8×10^-6 → 49152	`'p3binary8'`

As there are just 256 possible values, both the range and precision of these formats are extremely limited. It’s remarkable that any useful calculations can be performed, but both inference and training of large machine learning models can be done with these formats.

You can easily examine every possible value that these formats can represent using a line like this:

>>> [Bits(uint=x, length=8).p3binary8 for x in range(256)]

or using the Array type it’s even more concise - we can create an Array and pretty print all the values with this line:

>>> Array('p4binary8', bytearray(range(256))).pp(width=90)
<Array fmt='p4binary', length=256, itemsize=8 bits, total data size=256 bytes>
[
          0.0  0.0009765625   0.001953125  0.0029296875    0.00390625  0.0048828125
  0.005859375  0.0068359375     0.0078125  0.0087890625   0.009765625  0.0107421875
   0.01171875  0.0126953125   0.013671875  0.0146484375      0.015625   0.017578125
   0.01953125   0.021484375     0.0234375   0.025390625    0.02734375   0.029296875
      0.03125    0.03515625     0.0390625    0.04296875      0.046875    0.05078125
    0.0546875    0.05859375        0.0625     0.0703125      0.078125     0.0859375
      0.09375     0.1015625      0.109375     0.1171875         0.125      0.140625
      0.15625      0.171875        0.1875      0.203125       0.21875      0.234375
         0.25       0.28125        0.3125       0.34375         0.375       0.40625
       0.4375       0.46875           0.5        0.5625         0.625        0.6875
         0.75        0.8125         0.875        0.9375           1.0         1.125
         1.25         1.375           1.5         1.625          1.75         1.875
          2.0          2.25           2.5          2.75           3.0          3.25
          3.5          3.75           4.0           4.5           5.0           5.5
          6.0           6.5           7.0           7.5           8.0           9.0
         10.0          11.0          12.0          13.0          14.0          15.0
         16.0          18.0          20.0          22.0          24.0          26.0
         28.0          30.0          32.0          36.0          40.0          44.0
         48.0          52.0          56.0          60.0          64.0          72.0
         80.0          88.0          96.0         104.0         112.0         120.0
        128.0         144.0         160.0         176.0         192.0         208.0
        224.0           inf           nan -0.0009765625  -0.001953125 -0.0029296875
  -0.00390625 -0.0048828125  -0.005859375 -0.0068359375    -0.0078125 -0.0087890625
 -0.009765625 -0.0107421875   -0.01171875 -0.0126953125  -0.013671875 -0.0146484375
    -0.015625  -0.017578125   -0.01953125  -0.021484375    -0.0234375  -0.025390625
  -0.02734375  -0.029296875      -0.03125   -0.03515625    -0.0390625   -0.04296875
    -0.046875   -0.05078125    -0.0546875   -0.05859375       -0.0625    -0.0703125
    -0.078125    -0.0859375      -0.09375    -0.1015625     -0.109375    -0.1171875
       -0.125     -0.140625      -0.15625     -0.171875       -0.1875     -0.203125
     -0.21875     -0.234375         -0.25      -0.28125       -0.3125      -0.34375
       -0.375      -0.40625       -0.4375      -0.46875          -0.5       -0.5625
       -0.625       -0.6875         -0.75       -0.8125        -0.875       -0.9375
         -1.0        -1.125         -1.25        -1.375          -1.5        -1.625
        -1.75        -1.875          -2.0         -2.25          -2.5         -2.75
         -3.0         -3.25          -3.5         -3.75          -4.0          -4.5
         -5.0          -5.5          -6.0          -6.5          -7.0          -7.5
         -8.0          -9.0         -10.0         -11.0         -12.0         -13.0
        -14.0         -15.0         -16.0         -18.0         -20.0         -22.0
        -24.0         -26.0         -28.0         -30.0         -32.0         -36.0
        -40.0         -44.0         -48.0         -52.0         -56.0         -60.0
        -64.0         -72.0         -80.0         -88.0         -96.0        -104.0
       -112.0        -120.0        -128.0        -144.0        -160.0        -176.0
       -192.0        -208.0        -224.0          -inf
]

You’ll see that there is only 1 zero value and only one ‘nan’ value, together with positive and negative ‘inf’ values.

When converting from a Python float (which will typically be stored in 64-bits) unrepresented values are rounded to nearest, with ties-to-even. This is the standard method used in IEEE 754.

Microscaling Formats¶

Note

This is an experimental feature and may be modified in future point releases.

A range of formats from the Microscaling Formats (MX) Alliance are supported. These are part of the Open Compute Project, and will usually have an external scale factor associated with them.

Eight-bit floats similar to the IEEE p3binary8 and p4binary8 are available, though these seem rather arbitrary and ugly in places in comparison to the IEEE definitions. There is also a format to use for the scaling factor, an int-like format which is really a float, and some sensible six and four bit float formats.

Type	# bits	+ive Range	bitstring format
E5M2	1 + 5 + 2	10^-6 → 57344	`'e5m2mxfp'`
E4M3	1 + 4 + 3	2×10^-3 → 448	`'e4m3mxfp'`
E3M2	1 + 3 + 2	0.0625 → 28	`'e3m2mxfp'`
E2M3	1 + 2 + 3	0.125 → 7.5	`'e2m3mxfp'`
E2M1	1 + 2 + 1	0.5 → 6	`'e2m1mxfp'`
E8M0	0 + 8 + 0	10^-38 → 10³⁸	`'e8m0mxfp'`
INT8	8	0.015625 → 1.984375	`'mxint'`

The E8M0 format is unsigned and designed to use as a scaling factor for blocks of the other formats.
The INT8 format is like a signed two’s complement 8-bit integer but with a scaling factor of 2^-6. So despite its name it is actually a float. The standard doesn’t specify whether the largest negative value (-2.0) is a supported number or not. This implementation allows it.
The E4M3 format is similar to the p4binary8 format but with a different exponent bias and it wastes some values. It has no ‘inf’ values, instead opting to have two ‘nan’ values and two zero values.
The E5M2 format is similar to the p3binary8 format but wastes even more values. It does have positive and negative ‘inf’ values, but also six ‘nan’ values and two zero values.

The MX formats are designed to work with an external scaling factor. This should be in the E8M0 format, which uses a byte to encode the powers of two from 2^-127 to 2¹²⁷, plus a ‘nan’ value. This can be specified in bitstring as part of the Dtype, and is very useful inside an Array.

>>> d = b'some_byte_data'
>>> a = Array(Dtype('e2m1mxfp', scale=2**10), d)
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 10)', length=28, itemsize=4 bits, total data size=14 bytes> [
  0:  6144.0  1536.0  4096.0 -6144.0  4096.0 -3072.0  4096.0  3072.0  3072.0 -6144.0  4096.0  1024.0  6144.0  -512.0
 14:  6144.0  2048.0  4096.0  3072.0  3072.0 -6144.0  4096.0  2048.0  4096.0   512.0  6144.0  2048.0  4096.0   512.0
]

To change the scale, replace the dtype in the Array:

>>> a.dtype = Dtype('e2m1mxfp', scale=2**6)

When initialising an Array from a list of values, you can also use the string 'auto' as the scale, and an appropriate scale will be calculated based on the data.

>>> a = Array(Dtype('e2m1mxfp', scale='auto'), [0.0, 0.5, 40.5, 106.25, -52.0, -8.0])
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 4)', length=6, itemsize=4 bits, total data size=3 bytes> [
 0:     0.0     0.0    32.0    96.0   -48.0    -8.0
]

The scale is calculated based on the maximum absolute value of the data and the maximum representable value of the format. The auto-scale feature is only available for 8-bit and smaller floating point formats, plus the IEEE 16-bit format. If all of the data is zero, then the scale is set to 1. For more details on this and these formats in general see the OCP Microscaling formats specification.

Conversion¶

When converting from a Python float to an any of the 8-or-fewer bit formats, the ‘rounds to nearest, ties to even’ rule is used. This is the same as is used in the IEEE 754 standard.

Note that for efficiency reasons Python floats are converted to 16-bit IEEE floats before being converted to their final destination. This can mean that in edge cases the rounding to the 16-bit float will cause the next rounding to go in the other direction. The 16-bit float has 11 bits of precision, whereas the final format has at most 4 bits of precision, so this shouldn’t be a real-world problem, but it could cause discrepancies when comparing with other methods. I could add a slower, more accurate mode if this is a problem (add a bug report).

Values that are out of range after rounding are dealt with as follows:

p3binary8 - Out of range values are set to +inf or -inf.
p4binary8 - Out of range values are set to +inf or -inf.
e5m2mxfp - Out of range values are dealt with according to the bitstring.options.mxfp_overflow setting:
- 'saturate' (the default): values are set to the largest positive or negative finite value, as appropriate. Infinities will also be set to the largest finite value, despite the fact that the format has infinities.
- 'overflow': Out of range values are set to +inf or -inf.
e4m3mxfp - Out of range values are dealt with according to the bitstring.options.mxfp_overflow setting:
- 'saturate' (the default): values are set to the largest positive or negative value, as appropriate.
- 'overflow': Out of range values are set to nan.
e3m2mxfp - Out of range values are saturated to the largest positive or negative value.
e2m3mxfp - Out of range values are saturated to the largest positive or negative value.
e2m1mxfp - Out of range values are saturated to the largest positive or negative value.
mxint - Out of range values are saturated to the largest positive or negative value. Note that the most negative value in this case is -2.0, which it is optional for an implementation to support.
e8m0mxfp - No rounding is done. A ValueError will be raised if you try to convert a non-representable value. This is because the format is designed as a scaling factor, so it should generally be specified exactly.