Herbie Performance in 2024

The Herbie Project has grown and grown over the last—gosh—decade, and in 2024 we now have something like 15 people working on it, including myself and Zach and all of our students PhD, MS, and undergraduate. So for this semester I've chosen to focus my efforts a bit narrower than normal and worked with Zane and Artem on making Herbie faster.

It's been a super-productive few months, with Artem and Zane landing major algorithmic improvements to sampling and regimes while I've pushed out a long list of small performance tuning tweaks. In total, Herbie is something like 30% faster than when we started—though the other Herbie teams have meanwhile made Herbie smarter and slower, so the overall impact is less. Given all of these changes, I wanted to present a bit of a breakdown of Herbie's performance, especially since I've been talking to folks who want to use Herbie in more and more complex environments where performance is more of an issue.

High Level Components

From a performance perspective, it's convenient to split Herbie up into three major components:

Rival, our high-precision interval arithmetic library, which Herbie uses for sampling, localization, and many other phases;
Egg, our equality saturation engine (and related wrapper code), which Herbie uses for rewriting, simplification, and other things;
And everything else, including pruning, execution, series expansion, and regimes

At a high level, the performance breaks down like this:

Rival	40%
Egg	40%
Other	20%

In other words, Rival and Egg are, to a first approximation, where most of the time is spent. So let's dive into each component and try to answer some basic performance questions, including: Where does the time go? How necessary are the various parts? How fast could this component possibly be?

Rival

One way to think about performance is in terms of users, as in, why are we actually running a certain component. Herbie uses Rival in a couple of phases, shown below:

Sampling	25%
Localize	10%
Bsearch	2.5%
Explain	1.9%
Analysis	1.8%

Note that these total to (approximately) the 40% that Rival is responsible for, meaning that sampling (for example) is 25% of Herbie's total runtime.

Clearly the biggest and most expensive component here is sampling. This component is also one of the first ones in Herbie, tasked with computing the correctly-rounded result for the user's specification at 8000 + 256 randomly-sampled inputs. Almost all of the time here goes into the correctly-rounded computation, which uses arbitrary-precision floating-point and interval arithmetic to compute an expression to the needed precision. Of the 8000 + 256 points, the 256 are training points (which Herbie uses to choose between candidates) and the 8000 are test points, only used to evaluate the results.

Not all of this is necessary. For example, this is, honestly, a crazy test-train split; in machine learning, folks often use a test set that's several times smaller than the training set, instead of using one that's 30 times bigger. (Though, unlike in machine learning, we have a very unbalanced classification problems—most inputs don't hit floating-point error—so there's a limit here.) But also, these test points are only needed at the end of a run, so can be computed in parallel with the rest of Herbie. Likewise, the Analysis phase isn't necessary if you have some representative inputs already, and the Explain phase is not necessary if there isn't a user interface. So there are some options there to speed up Herbie by just dropping the less important uses.

Another way to look at performance is by code path. We can get a bit more insight into sampling, which is an iterative algorithm, by looking at how many points use how many iterations:

Iterations	Time	Count
0	10%	4,265,957
1	3.4%	478,779
2	1.6%	112,804
3	.5%	60,461
4	.1%	1,813
5	0%	10
exit	3.8%	60,461

Clearly the vast majority of points are handled in the 0th iteration. This is also the iteration where Herbie knows the least about the expression, so it's tough to speed up further. Artem's work, for example, on mixed-precision sampling just doesn't help very much (though it speeds up later iterations significantly). We do have some ideas for speeding up later iterations by a bit, but fundamentally you can see that we've mostly hit a wall here.

Another way to think about performance is by looking at a profile, which breaks down time by individual functions. These form a graph, but I usually like to identify a "core datapath" and then just look at how much branches off that data path. For Rival, it looks like this:

Function	Total	Self
Entry (*)	100%	17%
`compiled-spec`	83%	21%
`run`	62%	17%
- `backward-pass`	8%	8%
`ival-fn`	37%	10%
`bf-return-exact`	27%	16%
`mpfr_fn`	12%	12%

Here the first row, labeled "Entry" is the sum of a few different functions that funnel into Rival code, including batch-prepare-points, eval-prog-real, and find-intervals. Some of the other rows are also aggregates. For example, there's no actual function in Herbie called ival-fn, instead there are functions like ival-mul, ival-add, and so on.

A few things are really surprising here. For example, only 12% of the runtime is actually spent in the core MPFR function, and in some sense everything else is overhead. Can Rival really be made 8× faster? Sort of, but some of the other rows are hard to avoid. For example, self-time in bf-return-exact? is largely time spent allocating MPFR values. That's definitely a waste, but avoiding it would require rewriting both the math/bigfloat and rival libraries to avoid allocation, probably by using explicit three-argument form. And while that would cut down on a lot of allocations it wouldn't cut down on all of them. Similarly, the 10% of runtime inside the interval functions could be cut down somewhat, but not to zero: interval arithmetic requires tests and control flow besides the actual computation of endpoints. The easiest wins here are the self-time in the entry functions and in compiled-spec, where inlining a bunch of code could allow us to eliminate a bunch of needless allocations.

The summary here is that, while Rival-related code is 40% of runtime, and while it could be somewhat sped up, there are fundamental difficulties with optimizing this by a lot.

Egg

The egg-related time is the opposite story. The uses are fairly even:

Simplify	15%
Soundness	12%
Rewrite	9%
Localize	2.7%
Preprocess	2.4%

Equality saturation is used for a variety of rewriting- and equality-related tasks, and these take roughly similar runtimes. I do believe that simplify is actually unnecessary and am working on removing it, but the rest play an essential role in Herbie commensurate with their runtime.

However, the profile view shows something shocking:

Function	Total	Self
`run-egg`	100%	13%
- `expand-rules`	9%	9%
- `egg->racket`	48%	48%
- `cast`	7%	7%
- `racket->egg`	23%	23%

Here the actually useful stuff—rewriting and proving equality—is in the 13% of run-egg self-time. Almost the whole runtime is wasted on "overhead" related to converting between our Racket-side and egg-side intermediate representations.

This seems like a prime opportunity for speed-ups: if we could eliminate the two different IRs, we could get rid of most of the runtime. In fact, Brett is working on just that, so I do hope to see speed-ups here. The work is kind of tricky, because doing it well will require sharing more information between Racket and egg, but the speedups available are too juicy to pass up, even if it's quite a complex project.

Other components

The remaining 20% of Herbie's runtime is spent in a collection of unrelated algorithms:

Prune	7.0%
Eval	5.7%
Regimes	3.7%
Series	3.4%

Note that these numbers are after substantial work this quarter on the Prune, Eval, and Regimes phases. Some of these could be sped up further with some effort. I think the Prune time is mostly overhead, for example, and for Regimes I have some algorithmic ideas that could speed it up by quite a bit. But because all of these phases are pretty distinct, the optimizations are largely independent, and there's a limited payoff. So perhaps, with another few months of effort, this 20% could be sped up by a factor of, I don't know, two? But that effort might be better spent elsewhere.

How fast could Herbie go?

So, having seen all the details, how fast could Herbie potentially go? Well, it depends on what exactly you want. For example, suppose your use case involves using Herbie inside a compiler. Then you might be able to dump some phases (simplify, soundness, explain, bsearch, series) and quite cheaply achieve a 33% speedup for nothing. Add on optimizations for all the vague overheads and maybe you can get a 2× speedup. Not bad!

And if you are additionally interested in a JIT setup, where you have specific inputs available, then maybe sampling and analysis could be done much more effectively, and you could make that a 4× speedup. Though that might still not be fast enough (in this hypothetical, Herbie runs in a few seconds) so you might need to do additional optimizations like running Herbie for fewer iterations or with fewer input points (which could probably get you to 10–20×, under a second).

On the other hand, if you're interested in user-facing applications, like in Odyssey, latency might matter more than end-to-end runtime. Then the best move is probably things like parallelizing and incrementalizing sampling. Rewriting Herbie to have a more "server"-like setup, producing incremental results, might obviate questions about end-to-end runtime.

Why make Herbie faster?

You might also ask, why expend all this effort to make Herbie faster? While Herbie does have real-world users (at places like Intel, NVIDIA, Sandia, and NASA) it isn't necessarily a particularly widely-used tool, and performance isn't one of their top concerns. That's true, and in fact Herbie today, approaching the 2.1 release, is more or less as fast as Herbie 2.0 and 1.6 were, so there's been almost no net change in performance over the last two years, and perhaps longer. But that's because performance buys us something: the ability to run new, more expensive algorithms that ultimately produce better results. For example, the core team brough about a 30% speedup to Herbie between 2.0 and 2.1, but the Platforms team brough about a 25% slowdown, almost exactly canceling out those gains. But that slowdown had a purpose: it made Herbie produce much better code at the less-accurate end of the spectrum, improving 2.0's headline feature (cost-accuracy trade-offs) and answering specific user concerns about using Herbie for accuracy-aware optimization.

Similarly, two years ago I spent a while optimizing Herbie and the ultimate result of that effort was that we turned on cost-accuracy trade-offs by default. It might seem that we should turn the features on anyway—wouldn't users by OK with a 30% slowdown for the new features—but it usually doesn't work that way. For example, cost-accuracy trade-offs were hundreds of times slower than the default mode before being optimized, and that's clearly unacceptable. Something similar happens in lots of other optimizations. For example, I just fixed some accidentally-quadratic code in pruning that had been there for four years. That made pruning roughly twice as fast. But the only reason I looked into pruning is that recent changes had taken pruning from about 3% of runtime up to 10% or so! New algorithms elsewhere in Herbie increased the load on pruning, suddenly making it a bottleneck. Optimization work can resolve those bottlenecks, keeping Herbie runtime reasonable.

All that said, at the moment I don't see super attractive opportunities for optimization. Artem has boiled Rival runtime down significantly, and besides tinkering with overhead we're probably near-done there. Brett is working on egg-related components so, as Artem always says, let him cook. And the other components are not big bottlenecks. So I think core Herbie will has to turn to longer-term goals: refactors, better datastructures, and better support for new running modes like incrementalism and the server.

By Pavel Panchekha

23 April 2024