A Tricky Herbie Bug

Today I found and fixed a Herbie bug. It had been bothering me for months: even when run with the same seed, Herbie's results are not reproducible. They're almost reproducible! They're close! But a few benchmarks, something like 5 or so, would differ from run to run. The bug was so tricky, so bizarre, that I just had to write about it.

The bug

I'll keep things short. I first tested a number of benchmarks until I found one where non-determinism actually happened pretty often, with most runs differing slightly. Then, looking at our logs, I concluded that the divergence happened in the prune phase, where two different runs would have the same set of candidate programs come in, but a different set go out. Looking at it further (this part took forever) I eventually realized that the same program, on the same input, would evaluate to different results on different runs. This was totally unexpected, since floating-point evaluation should be deterministic.

More investigation led me to a clear culprit: fma(-1.0, NaN, NaN) would usually return -1.0 but sometimes it would return NaN or even (very rarely) some random other number. This was difficult to believe:

fma corresponds directly to a CPU instruction, those don't have bugs
It's not a complicated CPU instruction either
-1.0 is not the correct answer (NaN is)

I poked around a bit more and finally found the villain: our FFI bindings.

(define-libm c_erfcf (erfc float float))
(define-libm c_expm1f (expm1 float float))
(define-libm c_log1pf (log1p float float))
(define-libm c_hypotf (hypot float float float))
(define-libm c_fmaf (fma float float float float))

The issue is that I'm loading the fma function, not the fmaf function. So: how does this cause the bug above?

But why?

Well, we first have to understand what FFI bindings do. An FFI binding:

Finds a function pointer in a shared library; here it's the fma function
Wraps it in a Racket function that we can call
That Racket function needs to place its arguments into the place the C function expects, according to its calling convention

That calling convention, for x86-64, states that the first couple of single- or double-precision float arguments go in the xmm0, xmm1, and so on registers.

Now, the issue is that we're calling fma, not fmaf. This means that we are placing a single-precision value in xmm0 but the function is reading a double-precision value. That double-precision value will have its first 32 bits drawn from the actual argument, and the next 32 bits will just be whatever junk was in the register beforehand. Same for the other two arguments. fma will then write a 64-bit output to xmm0, of which we will read only the first 32 bits.

Endianness and Float Precision

Using this basic understanding, we can work out where the strange output, -1.0, actually came from, and also why it was sometimes NaN or another random value instead. The 32-bit representation of -1.0 is 0xbf800000 and the 32-bit representation of NaN is 0x7fc00000.

So the arguments we are actually passing to fma are the 64-bit floats 0xbf800000~~~~~~~~, 0x7fc00000~~~~~~~~, and 0x7fc00000~~~~~~~~, where the ~ represent unknown values.¹ [¹ You disagree? Keep reading.] What 64-bit float values do these possibilities actually represent?

Argument	32-bit float	32-bit hex	min 64-bit hex	max 64-bit hex	min 64-bit	max 64-bit
x	-1.0	bf800000	bf80000000000000	bf800000ffffffff	-0.0078125	-0.0078125074505805951891
y	nan	7fc00000	7fc0000000000000	7fc00000ffffffff	2.24711641857789488466e+307	2.24711856159510875824e+307
z	nan	7fc00000	7fc0000000000000	7fc00000ffffffff	2.24711641857789488466e+307	2.24711856159510875824e+307

This means that the output ranges from:

-0.0078125074505805951891 * 2.24711856159510875824e+307 + 2.24711641857789488466e+307 = 2.2295607880730951e+307

to:

-0.0078125 * 2.24711641857789488466e+307 + 2.24711856159510875824e+307 = 2.229562964574969e+307

Which is a hex range of:

0x7fbfbffff7ffffc0 ... 0x7fbfc001fffffffe

By the way, if you ever need to do this² [² And, by the way, perhaps arrange your life so as to not…] the float.exposed tool is perfect for this kind of work.

Anyway, recall that we ultimately end up reading only the top 32 bits of the result, meaning we read a 32-bit float range of 0x7fbfbfff through 0x7fbfc001, which is only 3 different possible values, all of which are NaNs. Wait? I thought we were getting an output of -1.0?

Ah, right: all of this happened on an x86-64 machine, so everything is little-endian. That means the junk bytes where the top 32 bits: 0x~~~~~~~~bf800000 and 0x~~~~~~~~7fc00000. So actually, a lot of different things can happen, depending on those top bits, and you might get all sorts of 32-bit outputs.

For example, one thing that can happen is that the unknown bits are all zero; then these are all double-precision subnormal positive numbers, about 1.59e-341 and 1.06e-341; the multiply way underflows and the correct output is 1.06e-341, which is 0x000000007fc00000 in double precision, of which the bottom 32 bits are NaN in single precision.

But another thing that can happen is that the unknown bits are all ones;³ [³ Or any left-over top bits from a NaN, which seems more likely.] then these are all double-precision NaN values. In that case the correct output is NaN, and in particular many chips return the first NaN argument, which is 0xffffffffbf800000. The bottom 32 bits of that are the original -1.0 input!

Conclusion

In short, FFIs are dangerous and you gotta be really careful. But what's particularly fun is that the actual error you run into, if you mess up your FFI types, is confusing and nondeterministic, but also ultimately comprehensible!

Footnotes:

You disagree? Keep reading.

And, by the way, perhaps arrange your life so as to not…

Or any left-over top bits from a NaN, which seems more likely.

By Pavel Panchekha

12 December 2024