The basic reality is that the IEEE 754 floating-point standard is pretty complex. The good news is that the latest (2008) version is only 70 pages long. On the UK campus, you can get IEEE Std 754-2008 for free from this IEEE Xplore site. Better still, this is an old, but still quite readable despite its great detail, overview.

Conforming to the standard is not easy, and demonstrating conformance requires a lot of testing. I'm not going to try to explain the details of how efficient 32-bit or 64-bit floating-point hardware can be built... that could easily take a full course (assuming I knew it well enough to teach, which I don't). In particular, the details of denormalized arithmetic, infinities, NaNs, and rounding modes get pretty complex. So, instead, we'll talk about operations on a simplified 16-bit format that is really just the top 16 bits of a normal-form 32-bit float.

To be precise, we're using a 16-bit binary layout that is
essentially what the standard calls **binary32**
format, but missing the 16 least significant bits of the
mantissa. *That is not the same as the binary16 format*,
although it is very similar to what some GPUs have implemented.
We're also not going to be very careful about the arithmetic,
producing (disturbingly) approximate answers. The only subnormal
value you'll deal with is positive 0, and we'll also ignore +/-
infinity, NaN (both the quiet and signaling types), and rounding
modes.

There is a lovely little formula on page 9 of the standard that says the floating point value is:

(-1)^{S} * 2^{E-bias} * (1 + 2^{1-p} * T)

For our format described in IEEE 754-2008 terms:

| S: sign, 1 bit | E: encoded exponent, 8 bits | T: trailing part of significand, 7 bits |
|---|---|---|

- S refers to the sign bit
- E is the value in the 8-bit exponent field
- The bias is 127 (which is also the maximum unbiased exponent value)
- The significand precision, p, is 8 (not 24, as it would be for binary32)
- The trailing significand, T, is the 7-bit pattern that logically comes immediately after the binary point

That's slightly confusing. Think of it this way:
treating what Verilog would call `{1'b1, T}` as an unsigned integer,
the value is:

(-1)^{S} * 2^{E-bias-7} * {1'b1, T}

For example, the decimal value 3 is `(-1)^{0} *
2^{128-127-7} * 8'b11000000`, which is
`192 * 2^{-6} = 192/64 = 3`.
The really cool thing about using this odd format is that you
can play with it using ordinary binary32 floating-point math, as
implemented by languages like C, and simply ignore the last
16 bits of the binary32 result. The good news is that Verilog
also knows about binary32 values as `real`, and provides the
built-in functions `$bitstoreal()` and
`$realtobits()`. The bad news is that Icarus Verilog
doesn't implement either of those functions. Oh well. They
aren't supposed to be synthesizable anyway.

I also created a little CGI form to play with math and conversions to/from our mutant 16-bit float format. Enjoy. Incidentally, this format really brings home the imprecision of using floating point. For example, adding 10 and 0.1 results in 10 + 0.0996094 = 10.0625. Really? Yup. By the way, 100 + 1 = 101. For the repeating fraction generated by 0.1, truncation is not your friend. ;-)

Here are my slides on floating point.
