Exotic Floating Point Formats
Python floats are typically 64 bits long, but 32 and 16 bit sizes are also supported through the struct module.
These are the well-known IEEE formats.
Recently, lower-precision floating point formats have become more widely used, largely driven by the requirements of machine learning algorithms and hardware.
As well as the ‘half’ precision 16 bit standard, a truncated version of the 32 bit standard called ‘bfloat16’ is used, which has the range of a 32-bit float but less precision.
The ‘# bits’ values in the tables below show how the available bits are split into sign + exponent + mantissa. There’s always 1 bit to determine the sign of the floating point value. The more bits in the exponent, the larger the range that can be represented. The more bits in the mantissa, the greater the precision (roughly, the number of significant figures) of the values.
Type | # bits | Standard | +ive Range | bitstring / struct format |
---|---|---|---|---|
Double precision | 1 + 11 + 52 | IEEE 754 | 10⁻³⁰⁸ → 10³⁰⁸ | float64 / 'd' |
Single precision | 1 + 8 + 23 | IEEE 754 | 10⁻³⁸ → 10³⁸ | float32 / 'f' |
Half precision | 1 + 5 + 10 | IEEE 754 | 6×10⁻⁸ → 65504 | float16 / 'e' |
bfloat | 1 + 8 + 7 | | 10⁻³⁸ → 10³⁸ | bfloat |
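The sign + exponent + mantissa split can be inspected with nothing more than the standard library. The sketch below is only an illustration for the single precision row (1 + 8 + 23); the helper name float32_fields is made up for this example and is not part of bitstring.

```python
import struct

def float32_fields(value):
    """Split a value, stored as an IEEE 754 single, into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, the fraction after the implicit leading 1
    return sign, exponent, mantissa

print(float32_fields(1.0))    # sign 0, biased exponent 127, mantissa 0
print(float32_fields(-2.5))   # -1.25 * 2**1: sign 1, biased exponent 128
```

The same shifting and masking works for the other rows once the field widths are changed.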
An example of creation and interpretation of a bfloat:
>>> from bitstring import Bits
>>> a = Bits(bfloat=4.5e23) # No need to specify length as always 16 bits
>>> a
Bits('0x66be')
>>> a.bfloat
4.486248158726163e+23 # Converted to Python float
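To see why bfloat is described as a truncated 32-bit float, the same bytes can be reproduced with the struct module alone. This is a sketch of the relationship between the two formats, not a description of how bitstring performs the conversion internally; the helper names are invented for the example.

```python
import struct

def float32_top_bytes(value):
    """Return the first two bytes of the big-endian float32 encoding of value."""
    return struct.pack(">f", value)[:2]

def bfloat_bytes_to_float(two_bytes):
    """Interpret two bytes as a bfloat by padding the dropped mantissa bits with zeros."""
    return struct.unpack(">f", two_bytes + b"\x00\x00")[0]

top = float32_top_bytes(4.5e23)
print(top.hex())                     # the sign, the 8 exponent bits and the top 7 mantissa bits
print(bfloat_bytes_to_float(top))    # the value those 16 bits represent
```

For this particular input the truncated bytes are the same 0x66be shown above.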
IEEE 8-bit Floating Point Types
Note
This is an experimental feature and may be modified in future point releases.
In bitstring prior to version 4.2 the p4binary8 and p3binary8 formats were called e4m3float and e5m2float respectively. The two formats are almost identical, the difference being the addition of inf values that replace the largest positive and negative values that were previously available.
Neither should be confused with the e4m3mxfp and e5m2mxfp formats from the Open Compute Project described below.
The ‘binary8’ formats are part of an ongoing IEEE standardisation process. The implementation here is based on a publicly available draft of the standard. There are seven formats defined in the draft standard, but only two are supported here. If you’d like the other precisions supported then raise a feature request!
The p4binary8 format has a single sign bit, 4 bits for the exponent and 3 bits for the mantissa.
For a bit more range and less precision you can use p3binary8, which has 5 bits for the exponent and only 2 for the mantissa.
Note that in the standard they are called binary8p4 and binary8p3, but for bitstring types any final digits should be the length of the type, so these slightly modified names were chosen.
Type | # bits | +ive Range | bitstring format |
---|---|---|---|
binary8p4 | 1 + 4 + 3 | 10⁻³ → 224 | p4binary8 |
binary8p3 | 1 + 5 + 2 | 8×10⁻⁶ → 49152 | p3binary8 |
As there are just 256 possible values, both the range and precision of these formats are extremely limited. It’s remarkable that any useful calculations can be performed, but both inference and training of large machine learning models can be done with these formats.
You can easily examine every possible value that these formats can represent using a line like this:
>>> [Bits(uint=x, length=8).p3binary8 for x in range(256)]
Or, using the Array type, it’s even more concise: we can create an Array and pretty-print all the values with this line:
>>> Array('p4binary8', bytearray(range(256))).pp(width=90)
<Array fmt='p4binary', length=256, itemsize=8 bits, total data size=256 bytes>
[
0: 0.0 0.0009765625 0.001953125 0.0029296875 0.00390625 0.0048828125
6: 0.005859375 0.0068359375 0.0078125 0.0087890625 0.009765625 0.0107421875
12: 0.01171875 0.0126953125 0.013671875 0.0146484375 0.015625 0.017578125
18: 0.01953125 0.021484375 0.0234375 0.025390625 0.02734375 0.029296875
24: 0.03125 0.03515625 0.0390625 0.04296875 0.046875 0.05078125
30: 0.0546875 0.05859375 0.0625 0.0703125 0.078125 0.0859375
36: 0.09375 0.1015625 0.109375 0.1171875 0.125 0.140625
42: 0.15625 0.171875 0.1875 0.203125 0.21875 0.234375
48: 0.25 0.28125 0.3125 0.34375 0.375 0.40625
54: 0.4375 0.46875 0.5 0.5625 0.625 0.6875
60: 0.75 0.8125 0.875 0.9375 1.0 1.125
66: 1.25 1.375 1.5 1.625 1.75 1.875
72: 2.0 2.25 2.5 2.75 3.0 3.25
78: 3.5 3.75 4.0 4.5 5.0 5.5
84: 6.0 6.5 7.0 7.5 8.0 9.0
90: 10.0 11.0 12.0 13.0 14.0 15.0
96: 16.0 18.0 20.0 22.0 24.0 26.0
102: 28.0 30.0 32.0 36.0 40.0 44.0
108: 48.0 52.0 56.0 60.0 64.0 72.0
114: 80.0 88.0 96.0 104.0 112.0 120.0
120: 128.0 144.0 160.0 176.0 192.0 208.0
126: 224.0 inf nan -0.0009765625 -0.001953125 -0.0029296875
132: -0.00390625 -0.0048828125 -0.005859375 -0.0068359375 -0.0078125 -0.0087890625
138: -0.009765625 -0.0107421875 -0.01171875 -0.0126953125 -0.013671875 -0.0146484375
144: -0.015625 -0.017578125 -0.01953125 -0.021484375 -0.0234375 -0.025390625
150: -0.02734375 -0.029296875 -0.03125 -0.03515625 -0.0390625 -0.04296875
156: -0.046875 -0.05078125 -0.0546875 -0.05859375 -0.0625 -0.0703125
162: -0.078125 -0.0859375 -0.09375 -0.1015625 -0.109375 -0.1171875
168: -0.125 -0.140625 -0.15625 -0.171875 -0.1875 -0.203125
174: -0.21875 -0.234375 -0.25 -0.28125 -0.3125 -0.34375
180: -0.375 -0.40625 -0.4375 -0.46875 -0.5 -0.5625
186: -0.625 -0.6875 -0.75 -0.8125 -0.875 -0.9375
192: -1.0 -1.125 -1.25 -1.375 -1.5 -1.625
198: -1.75 -1.875 -2.0 -2.25 -2.5 -2.75
204: -3.0 -3.25 -3.5 -3.75 -4.0 -4.5
210: -5.0 -5.5 -6.0 -6.5 -7.0 -7.5
216: -8.0 -9.0 -10.0 -11.0 -12.0 -13.0
222: -14.0 -15.0 -16.0 -18.0 -20.0 -22.0
228: -24.0 -26.0 -28.0 -30.0 -32.0 -36.0
234: -40.0 -44.0 -48.0 -52.0 -56.0 -60.0
240: -64.0 -72.0 -80.0 -88.0 -96.0 -104.0
246: -112.0 -120.0 -128.0 -144.0 -160.0 -176.0
252: -192.0 -208.0 -224.0 -inf
]
You’ll see that there is only one zero value and only one ‘nan’ value, together with positive and negative ‘inf’ values.
When converting from a Python float (which will typically be stored in 64 bits), values that cannot be represented exactly are rounded to nearest, with ties-to-even. This is the standard method used in IEEE 754.
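The p4binary8 listing above is consistent with a simple decoding rule: a sign bit, a 4-bit exponent with a bias of 8, a 3-bit mantissa, subnormals when the exponent field is zero, the negative-zero slot reused for ‘nan’ and the largest magnitude reused for ‘inf’. The sketch below was reconstructed from the printed values rather than from the draft standard, so treat it as an assumption; decode_p4binary8 is a hypothetical helper, not a bitstring function.

```python
import math

def decode_p4binary8(code):
    """Decode one byte with a rule inferred from the printed listing (an assumption)."""
    if code == 0x80:
        return math.nan                       # the single nan sits where -0.0 would be
    sign = -1.0 if code & 0x80 else 1.0
    magnitude = code & 0x7F
    if magnitude == 0x7F:
        return sign * math.inf                # largest magnitude is reserved for inf
    exponent = magnitude >> 3                 # 4-bit exponent field
    mantissa = magnitude & 0x07               # 3-bit mantissa field
    if exponent == 0:
        return sign * mantissa * 2.0 ** -10   # subnormals, in steps of 2**-10
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 8)

values = [decode_p4binary8(code) for code in range(256)]
print(sum(math.isfinite(v) for v in values))  # number of finite values among the 256 codes
```

Comparing a few entries against the listing (for example codes 1, 64, 126 and 254) is a quick way to check that the inferred rule holds.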
Microscaling Formats
Note
This is an experimental feature and may be modified in future point releases.
A range of formats from the Microscaling Formats (MX) Alliance are supported. These are part of the Open Compute Project, and will usually have an external scale factor associated with them.
Eight-bit floats similar to the IEEE p3binary8 and p4binary8 are available, though these seem rather arbitrary and ugly in places in comparison to the IEEE definitions. There is also a format to use for the scaling factor, an int-like format which is really a float, and some sensible six and four bit float formats.
Type | # bits | +ive Range | bitstring format |
---|---|---|---|
E5M2 | 1 + 5 + 2 | 10⁻⁶ → 57344 | e5m2mxfp |
E4M3 | 1 + 4 + 3 | 2×10⁻³ → 448 | e4m3mxfp |
E3M2 | 1 + 3 + 2 | 0.0625 → 28 | e3m2mxfp |
E2M3 | 1 + 2 + 3 | 0.125 → 7.5 | e2m3mxfp |
E2M1 | 1 + 2 + 1 | 0.5 → 6 | e2m1mxfp |
E8M0 | 0 + 8 + 0 | 10⁻³⁸ → 10³⁸ | e8m0mxfp |
INT8 | 8 | 0.015625 → 1.984375 | mxint |
The E8M0 format is unsigned and designed to be used as a scaling factor for blocks of the other formats.
The INT8 format is like a signed two’s complement 8-bit integer but with a scaling factor of 2⁻⁶. So despite its name it is actually a float. The standard doesn’t specify whether the largest negative value (-2.0) is a supported number or not. This implementation allows it.
The E4M3 format is similar to the p4binary8 format but with a different exponent bias and it wastes some values. It has no ‘inf’ values, instead opting to have two ‘nan’ values and two zero values.
The E5M2 format is similar to the p3binary8 format but wastes even more values. It does have positive and negative ‘inf’ values, but also six ‘nan’ values and two zero values.
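The two simplest formats in the table, INT8 and E8M0, can be decoded in a couple of lines each. The sketch below uses only the standard library; the function names are invented for illustration, and the E8M0 details (an exponent bias of 127, with the all-ones byte meaning ‘nan’) are stated as I understand the OCP specification rather than taken from bitstring’s source.

```python
import math

def decode_mxint8(byte):
    """Two's complement 8-bit integer scaled by 2**-6."""
    signed = byte - 256 if byte >= 128 else byte
    return signed / 64

def decode_e8m0(byte):
    """Unsigned power-of-two scale: 2**(byte - 127), with 0xFF assumed to mean nan."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)

print(decode_mxint8(0x01), decode_mxint8(0x7F), decode_mxint8(0x80))
print(decode_e8m0(127), decode_e8m0(0), decode_e8m0(254))
```

The first line prints the smallest positive, largest positive and most negative INT8 values, matching the range given in the table.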
The MX formats are designed to work with an external scaling factor.
This should be in the E8M0 format, which uses a byte to encode the powers of two from 2⁻¹²⁷ to 2¹²⁷, plus a ‘nan’ value.
This can be specified in bitstring as part of the Dtype, and is very useful inside an Array.
>>> from bitstring import Array, Dtype
>>> d = b'some_byte_data'
>>> a = Array(Dtype('e2m1mxfp', scale=2**10), d)
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 10)', length=28, itemsize=4 bits, total data size=14 bytes> [
0: 6144.0 1536.0 4096.0 -6144.0 4096.0 -3072.0 4096.0 3072.0 3072.0 -6144.0 4096.0 1024.0 6144.0 -512.0
14: 6144.0 2048.0 4096.0 3072.0 3072.0 -6144.0 4096.0 2048.0 4096.0 512.0 6144.0 2048.0 4096.0 512.0
]
To change the scale, replace the dtype in the Array:
>>> a.dtype = Dtype('e2m1mxfp', scale=2**6)
When initialising an Array from a list of values, you can also use the string 'auto' as the scale, and an appropriate scale will be calculated based on the data.
>>> a = Array(Dtype('e2m1mxfp', scale='auto'), [0.0, 0.5, 40.5, 106.25, -52.0, -8.0])
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 4)', length=6, itemsize=4 bits, total data size=3 bytes> [
0: 0.0 0.0 32.0 96.0 -48.0 -8.0
]
The scale is calculated based on the maximum absolute value of the data and the maximum representable value of the format. The auto-scale feature is only available for 8-bit and smaller floating point formats, plus the IEEE 16-bit format. If all of the data is zero, then the scale is set to 1. For more details on this and these formats in general see the OCP Microscaling formats specification.
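One common way to derive such a scale, following the recipe in the OCP specification as I understand it, is to take the exponent of the largest absolute value and subtract the exponent of the format’s largest value. Whether bitstring computes exactly this is an assumption; the sketch simply shows that the recipe gives 2 ** 4 for the data in the example above. The function name and the emax_elem parameter are hypothetical.

```python
import math

def auto_scale_exponent(values, emax_elem):
    """Return n such that 2**n is a plausible shared scale for values.

    emax_elem is the exponent of the format's largest value,
    e.g. 2 for e2m1mxfp, whose maximum is 6 == 1.5 * 2**2.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0                                   # all-zero data: scale of 2**0 == 1
    floor_log2 = math.frexp(max_abs)[1] - 1        # exact floor(log2(max_abs)) for finite floats
    return floor_log2 - emax_elem

data = [0.0, 0.5, 40.5, 106.25, -52.0, -8.0]
print(2 ** auto_scale_exponent(data, emax_elem=2))
```

Using math.frexp avoids the rounding problems that math.log2 followed by floor can have near exact powers of two.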
Conversion
When converting from a Python float to any of the 8-or-fewer bit formats, the ‘rounds to nearest, ties to even’ rule is used. This is the same as is used in the IEEE 754 standard.
Note that for efficiency reasons Python floats are converted to 16-bit IEEE floats before being converted to their final destination. This can mean that in edge cases the rounding to the 16-bit float will cause the next rounding to go in the other direction. The 16-bit float has 11 bits of precision, whereas the final format has at most 4 bits of precision, so this shouldn’t be a real-world problem, but it could cause discrepancies when comparing with other methods. I could add a slower, more accurate mode if this is a problem (raise a bug report).
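The double rounding edge case can be demonstrated without bitstring. The sketch below models a 4-bit-precision target with a hypothetical round_to_precision helper (exponent range limits are ignored) and uses the struct module’s 'e' format for the intermediate 16-bit step. It is an illustration of the effect, not bitstring’s actual conversion code.

```python
import math
import struct

def round_to_precision(x, bits):
    """Round x to `bits` significant binary digits, nearest with ties-to-even.

    Exponent range limits are ignored, so this only models normal values.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    mantissa, exponent = math.frexp(x)       # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scaled = round(mantissa * 2 ** bits)     # Python's round() sends ties to the even integer
    return math.ldexp(scaled, exponent - bits)

def via_float16(x):
    """Round x to IEEE half precision using the struct 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 1.0625 + 2.0 ** -20                      # just above the midpoint of 1.0 and 1.125
direct = round_to_precision(x, 4)            # a single rounding goes up
two_step = round_to_precision(via_float16(x), 4)   # float16 lands on the tie, which then goes to even
print(direct, two_step)
```

The input has to sit within about 2⁻¹¹ of a tie for the two results to differ, which is why this is unlikely to matter in practice.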
Values that are out of range after rounding are dealt with as follows:
- p3binary8 - Out of range values are set to +inf or -inf.
- p4binary8 - Out of range values are set to +inf or -inf.
- e5m2mxfp - Out of range values are dealt with according to the bitstring.options.mxfp_overflow setting:
  - 'saturate' (the default): values are set to the largest positive or negative finite value, as appropriate. Infinities will also be set to the largest finite value, despite the fact that the format has infinities.
  - 'overflow': Out of range values are set to +inf or -inf.
- e4m3mxfp - Out of range values are dealt with according to the bitstring.options.mxfp_overflow setting:
  - 'saturate' (the default): values are set to the largest positive or negative value, as appropriate.
  - 'overflow': Out of range values are set to nan.
- e3m2mxfp - Out of range values are saturated to the largest positive or negative value.
- e2m3mxfp - Out of range values are saturated to the largest positive or negative value.
- e2m1mxfp - Out of range values are saturated to the largest positive or negative value.
- mxint - Out of range values are saturated to the largest positive or negative value. Note that the most negative value in this case is -2.0, which it is optional for an implementation to support.
- e8m0mxfp - No rounding is done. A ValueError will be raised if you try to convert a non-representable value. This is because the format is designed as a scaling factor, so it should generally be specified exactly.
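The two mxfp_overflow behaviours listed above for e5m2mxfp and e4m3mxfp can be modelled in a few lines of plain Python. This is a sketch of the rules as described, applied to a value that has already been rounded; handle_out_of_range and the MAX_FINITE table are hypothetical names, with the limits taken from the range table earlier on this page.

```python
import math

# Largest finite magnitudes, taken from the Microscaling range table.
MAX_FINITE = {"e5m2mxfp": 57344.0, "e4m3mxfp": 448.0}

def handle_out_of_range(rounded, fmt, mode="saturate"):
    """Model the overflow rules for an already-rounded value (illustrative only)."""
    limit = MAX_FINITE[fmt]
    if math.isnan(rounded) or abs(rounded) <= limit:
        return rounded                           # in range (or nan): nothing to do
    if mode == "saturate":
        return math.copysign(limit, rounded)     # clamp, including for +/-inf inputs
    if mode == "overflow":
        if fmt == "e5m2mxfp":
            return math.copysign(math.inf, rounded)   # this format has infinities
        return math.nan                               # e4m3mxfp has no inf, so nan is used
    raise ValueError(f"unknown overflow mode: {mode!r}")

print(handle_out_of_range(1e6, "e5m2mxfp"))               # saturates
print(handle_out_of_range(1e6, "e5m2mxfp", "overflow"))   # becomes inf
print(handle_out_of_range(-1000.0, "e4m3mxfp"))           # saturates to the negative limit
```

In bitstring itself the mode is chosen globally through bitstring.options.mxfp_overflow rather than being passed per call.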