This is aimed at new hires or anyone running into floating-point issues. It doesn't go into much detail; it's a high-level overview of floating-point concepts:
- A floating-point number = mantissa × 2^exponent, where the mantissa (sometimes called the significand) holds the significant digits and the exponent sets the scale. In fp32 the exponent takes 8 bits and the mantissa 23 bits (+1 sign bit).
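If you want to see that layout directly, here's a small sketch that pulls the sign, exponent, and mantissa bits out of an fp32 value (the stored exponent carries a +127 bias in IEEE 754):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 6.5f;                       // 6.5 = 1.625 * 2^2
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the 32 bits of the float

    std::uint32_t sign     = bits >> 31;           // 1 bit
    std::uint32_t exponent = (bits >> 23) & 0xFFu; // 8 bits, stored with a +127 bias
    std::uint32_t mantissa = bits & 0x7FFFFFu;     // 23 bits, implicit leading 1

    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)mantissa);
    return 0;
}
```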
- Between [2^p, 2^(p+1)), floating-point numbers are equally spaced (this spacing is the smallest difference between two consecutive floats). For example, fp32 numbers between 2.0 and 4.0 have a spacing of 2^-22, and numbers between 1024.0 and 2048.0 have a spacing of 2^-13. That spacing is what we call the ULP (and the rounding error) in that range.
- It follows that the smallest difference between two consecutive floats (the ULP) with exponent p is 2^-23 * 2^p. Notice how the spacing gets wider as the numbers get larger, which means the absolute rounding error of operations grows with magnitude (but the relative error stays the same).
- Exercise for the reader: google the corresponding values for fp64 and think about why they are what they are.
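You can also check these spacings directly in code with std::nextafter, which steps to the adjacent representable float; a quick sketch:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // The gap to the next representable float is exactly one ULP at that value.
    float ulpAt2    = std::nextafter(2.0f,    4.0f)    - 2.0f;     // expect 2^-22
    float ulpAt1024 = std::nextafter(1024.0f, 2048.0f) - 1024.0f;  // expect 2^-13
    std::printf("ULP near 2.0    = %g (2^-22 = %g)\n", ulpAt2,    std::ldexp(1.0, -22));
    std::printf("ULP near 1024.0 = %g (2^-13 = %g)\n", ulpAt1024, std::ldexp(1.0, -13));
    return 0;
}
```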
- Relative error measures how far the approximation is from the real value, relative to the size of that value: abs(realValue - approximation) / abs(realValue). The relative error of individual fp32 operations is ~2^-23.
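To make the formula concrete, here's a tiny sketch that measures the relative error of storing 0.1 in fp32 (using a double as the stand-in for the exact value):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double real   = 0.1;    // stand-in for the exact value
    float  approx = 0.1f;   // nearest fp32 to 0.1
    double relErr = std::fabs(real - (double)approx) / std::fabs(real);
    // Prints a value comfortably below 2^-23 (~1.19e-7).
    std::printf("relative error of 0.1f: %.3g\n", relErr);
    return 0;
}
```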
- You should almost always stay away from using floats as loop counters:

```cpp
for (float x = 0.0f; x <= 1e8f; x += 0.1f) {
    // do something
}
```

Spend a few minutes thinking about why this loop never reaches the end
.
.
.
The reason is that once x gets big enough (2^21, about 2.1 million), the spacing between consecutive floats is 0.25, so 0.1f is less than half a ULP; x += 0.1f rounds back to x, x stops growing, and the loop never terminates.
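You can reproduce the stall in isolation; at 2^21 the spacing between adjacent floats is already 0.25, so a 0.1f increment rounds away to nothing:

```cpp
#include <cstdio>

int main() {
    float x = 2097152.0f;   // 2^21: adjacent floats are 0.25 apart here
    float y = x + 0.1f;     // 0.1 is less than half a ULP, so the sum rounds back to x
    std::printf("x = %.2f, x + 0.1f = %.2f, changed: %s\n",
                x, y, (y != x) ? "yes" : "no");   // prints "changed: no"
    return 0;
}
```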
- The more operations you do, the more rounding steps occur and the more error can accumulate (error analysis is a whole rabbit hole in itself, and a fun topic).
- General rule of thumb: if you can rewrite the computation to use fewer operations, you generally get a more precise result.
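For example, adding 0.1f a million times performs a million rounded additions, while multiplying once performs a single rounding; a quick sketch:

```cpp
#include <cstdio>

int main() {
    const int   n  = 1000000;
    const float dx = 0.1f;

    // One million additions: one rounding per addition, and the errors pile up.
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += dx;

    // The same value computed with a single multiply: one rounding total.
    float direct = (float)n * dx;

    std::printf("repeated adds  : %.3f\n", sum);     // noticeably too large
    std::printf("single multiply: %.3f\n", direct);  // 100000.000
    return 0;
}
```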
- Don't rely on the compiler to make your floating-point computations more exact.
- Compilers may reorder operations in ways that lead to different fp results, but they optimize for speed, not fp accuracy, and many things can make a compiler give different results (different instructions such as AVX, or enabling fast-math). They are also very conservative: they generally just reorder operations or use different instructions, but never change the way something is computed the way Herbie does.
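The root cause is that floating-point addition isn't associative, so reordering is not a value-preserving transformation; a small illustration:

```cpp
#include <cstdio>

int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // (1e8 - 1e8) + 1  ->  1
    float right = a + (b + c);   // the 1 is absorbed into -1e8 first  ->  0
    // Same mathematical expression, different results, which is why a compiler
    // won't reassociate it unless you opt in with fast-math.
    std::printf("left = %g, right = %g\n", left, right);
    return 0;
}
```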
- Do rely on: 1. rewriting expressions to minimize rounding, 2. using FMA when applicable, and 3. carefully chosen algorithms (like we did in the n4ce shaders, where we used numerical methods to solve certain equations).
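For point 2, here's a minimal sketch of what FMA buys you (std::fmaf evaluates a*b + c with a single rounding at the end; the volatile is only there to stop the compiler from contracting the two-step version into an FMA on its own):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float a = 1.0e7f, b = 1.0000001f, c = -1.0e7f;

    volatile float prod = a * b;        // force the product to be rounded to fp32
    float twoStep = prod + c;           // then the add rounds again
    float fused   = std::fmaf(a, b, c); // single rounding: the tail of the product survives

    std::printf("round twice: %g\n", twoStep); // 1
    std::printf("fma        : %g\n", fused);   // ~1.1920929
    return 0;
}
```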
- Compare these two functions in Godbolt:
```cpp
float mulByRcp(float a, float b) {
    float x = 1.0f / b;
    return a * x;
}

float div(float a, float b) {
    return a / b;
}
```

mulByRcp is done with 2 instructions (divss + mulss) and div with only 1 (divss).
From a purely performance perspective it makes total sense to cache/precompute the reciprocal (only one div) and then multiply the rest of the way, and most of the time it's worth it (see the DIVSS and MULSS latencies in Instruction Tables - Agner Fog). Keep in mind, though, that a * (1/b) goes through two roundings instead of one, so it can differ from a / b by an ULP or so.
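The win shows up when the same divisor is reused many times, e.g. scaling a whole array; a sketch of that pattern (function names here are just illustrative):

```cpp
#include <cstddef>
#include <cstdio>

// One divss per element.
void scaleDiv(float* out, const float* in, std::size_t n, float b) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] / b;
}

// One divss total, then one mulss per element. Faster, but each result now
// carries two roundings (reciprocal + multiply) instead of one, so it can
// differ from in[i] / b by an ULP or so.
void scaleRcp(float* out, const float* in, std::size_t n, float b) {
    const float inv = 1.0f / b;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv;
}

int main() {
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f}, a[4], b[4];
    scaleDiv(a, in, 4, 3.0f);
    scaleRcp(b, in, 4, 3.0f);
    for (int i = 0; i < 4; ++i)
        std::printf("%.9g vs %.9g\n", a[i], b[i]);  // most match, some may differ by one ULP
    return 0;
}
```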
- Catastrophic cancellation happens when you subtract two nearly equal (usually large) numbers. The result is much smaller than the inputs, so the leading significant digits cancel and whatever rounding error was already in the inputs dominates what's left, making the result inaccurate.
100000000000 + 1 is accurate to within 1 ULP of 100000000001, but 10000100000 - 10000000000 might not be accurate to within one ULP of 100000.
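A classic way to see it in code: (1 - cos(x)) / x^2 should approach 0.5 as x goes to 0, but the naive form subtracts two nearly equal numbers; a sketch comparing it to an algebraically equivalent rewrite that avoids the subtraction:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float x = 1.0e-4f;
    // cos(x) is so close to 1 that, in fp32, the subtraction cancels all the
    // significant digits: prints something far from the true ~0.5 (often exactly 0).
    float naive = (1.0f - std::cos(x)) / (x * x);

    // Rewritten via the identity 1 - cos(x) = 2*sin(x/2)^2: no subtraction of
    // nearly equal values, so no cancellation. Prints ~0.5.
    float s = std::sin(0.5f * x);
    float stable = 2.0f * s * s / (x * x);

    std::printf("naive  = %g\nstable = %g\n", naive, stable);
    return 0;
}
```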
- What Every Computer Scientist Should Know About Floating-Point Arithmetic - David Goldberg
- Onboarding floating-point - Mike Acton
- The Herbie Project. Detects inaccurate floating-point expressions and finds more accurate replacements
- Instruction Tables - Agner Fog. Lists of instruction latencies, throughputs, and micro-operation breakdowns for Intel, AMD, and VIA CPUs