Floating Point Reference: EE480 Advanced Computer Architecture

The basic reality is that the IEEE 754 floating point standard is pretty complex. The good news is that the latest (2008) version is only 70 pages long. On the UK campus, you can get IEEE Std 754-2008 for free from this IEEE Xplore site. Better still, this is an old overview that stays readable despite its great detail.

Conforming to the standard is not easy, and demonstrating conformance requires a lot of testing. I'm not going to try to explain the details of how efficient 32-bit or 64-bit floating point hardware can be built... that could easily take a full course (assuming I knew it well enough to teach, which I don't). In particular, the details of denormalized arithmetic, predictive infinities and NaN, and rounding modes get pretty complex. So, instead we'll talk about operations on a simplified 16-bit format that is really the top 16 bits of a normal-form 32-bit float.

To be precise, we're using a 16-bit binary layout that is essentially what the standard calls binary32 format, but missing the 16 least significant bits of the mantissa. That is not the same as the binary16 format, although it is very similar to what some GPUs have implemented. We're also not going to be very careful about the arithmetic, producing (disturbingly) approximate answers. The only subnormal value you'll deal with is positive 0, and we'll also ignore +/- infinity, NaN (both the quiet and signaling types), and rounding modes.

There is a lovely little formula on page 9 of the standard that says the floating point value is:

(-1)^S * 2^(E-bias) * (1 + 2^(1-p) * T)

For our format described in IEEE 754-2008 terms:

S: sign, 1 bit
E: encoded exponent, 8 bits
T: trailing part of significand, 7 bits

S refers to the sign bit
E is the value in the 8-bit exponent field
The bias is 127 (which is also the maximum unbiased exponent value for a normal number)
The significand precision, p, is 8 (not 24, as it would be for binary32)
The trailing significand, T, is the 7-bit pattern that logically comes immediately after the binary point
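
To make the field definitions concrete, here is a minimal Python sketch that splits a 16-bit pattern into S, E, and T and evaluates the formula from the standard. The names split_fields and decode are made up for illustration, and the sketch assumes only zero and normal values show up, which matches the simplifications described above (no infinities, NaN, or other subnormals).

```python
BIAS = 127   # exponent bias, same as binary32
P = 8        # significand precision: 1 implicit bit + 7 stored bits

def split_fields(bits):
    """Split a 16-bit pattern into (S, E, T). Hypothetical helper name."""
    s = (bits >> 15) & 0x1    # 1-bit sign
    e = (bits >> 7) & 0xFF    # 8-bit encoded exponent
    t = bits & 0x7F           # 7-bit trailing significand
    return s, e, t

def decode(bits):
    """Evaluate (-1)^S * 2^(E-bias) * (1 + 2^(1-p) * T).

    Assumes the pattern is zero or a normal value; E == 255
    (infinity/NaN) is not treated specially in this sketch.
    """
    s, e, t = split_fields(bits)
    if e == 0 and t == 0:
        return 0.0            # zero is the only non-normal value handled
    return (-1.0) ** s * 2.0 ** (e - BIAS) * (1 + 2.0 ** (1 - P) * t)

print(split_fields(0x4040))   # (0, 128, 64)
print(decode(0x4040))         # 3.0
```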

That's slightly confusing. Think of it this way: treating what Verilog would call {1'b1, T} as an unsigned 8-bit integer gives the value:

(-1)^S * 2^(E-bias-7) * {1'b1, T}

For example, the decimal value 3 is (-1)^0 * 2^(128-127-7) * 8'b11000000, which is 1 * 2^(-6) * 192, which is 192/64. Thus, the encoding of 3.0 is 0x4040.
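
The integer-significand view can be checked with exact rational arithmetic. This is just a sketch: integer_form and value are hypothetical names, Python's Fraction stands in for whatever the hardware would do, and zero is deliberately not special-cased, so only normal values decode correctly.

```python
from fractions import Fraction

def integer_form(bits):
    """Return (S, power of two, {1'b1, T} as an unsigned integer)."""
    s = (bits >> 15) & 0x1
    e = (bits >> 7) & 0xFF
    t = bits & 0x7F
    sig = 0x80 | t             # Verilog's {1'b1, T}: hidden 1 above the 7 T bits
    return s, e - 127 - 7, sig # exponent is E - bias - 7

def value(bits):
    """Exact value (-1)^S * 2^(E-bias-7) * {1'b1, T} of a normal pattern."""
    s, exp2, sig = integer_form(bits)
    return (-1) ** s * sig * Fraction(2) ** exp2

print(integer_form(0x4040))   # (0, -6, 192)
print(value(0x4040))          # 3
```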

The really cool thing about using this odd format is that you can play with it using ordinary binary32 floating-point math, as implemented by languages like C, and simply ignore the last 16 bits of the binary32 result. The good news is that Verilog also knows about floating-point values as real, and provides built-in functions $bitstoreal() and $realtobits(); note that standard Verilog's real and these two functions work with 64-bit (binary64) values rather than binary32. The bad news is that Icarus Verilog doesn't implement either one of those functions. Oh well. They aren't supposed to be synthesizable anyway.
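
Here is a small sketch of that trick, written in Python instead of C only because the standard struct module makes the bit-casting short. The function names are hypothetical. Packing with the "f" format rounds to binary32, and shifting right by 16 simply drops the low 16 bits (truncation, no rounding).

```python
import struct

def float_to_bits16(x):
    """Top 16 bits of the binary32 encoding of x (low 16 bits dropped)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def bits16_to_float(bits16):
    """Value of a 16-bit pattern, padded with 16 zero bits to form a binary32."""
    (x,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return x

print(hex(float_to_bits16(3.0)))   # 0x4040
print(hex(float_to_bits16(0.1)))   # 0x3dcc
print(bits16_to_float(0x3DCC))     # 0.099609375
```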

I also created a little CGI form to play with math and conversions to/from our mutant 16-bit float format. Enjoy. Incidentally, this format really brings home the imprecision of using floating point. For example, adding 10 and 0.1 results in 10 + 0.0996094 = 10.0625. Really? Yup. By the way, 100 + 1 = 101. For the repeating fraction generated by 0.1, truncation is not your friend. ;-)
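
The 10 + 0.1 example can be reproduced with a few lines of Python. This is only one plausible model of the arithmetic, assumed for illustration: truncate each operand to the 16-bit format, add, then truncate the binary32 result again. The helper names truncate16 and add16 are made up, and a real hardware adder for this format could differ in its low-order bits.

```python
import struct

def truncate16(x):
    """Round x to binary32, then zero the low 16 bits (truncation)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", bits32 & 0xFFFF0000))
    return y

def add16(a, b):
    """Add two values as if both operands and the result were 16-bit floats."""
    return truncate16(truncate16(a) + truncate16(b))

print(truncate16(0.1))      # 0.099609375
print(add16(10.0, 0.1))     # 10.0625
print(add16(100.0, 1.0))    # 101.0
```

The sum 10.099609375 needs more significand bits than the format has, so truncation throws away everything below 1/16 at that magnitude. 101 survives because it fits in 8 significand bits.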

Here are my slides on floating point.
