Hacker News | Sesse__'s comments

I find Meson's --help fairly useful, at least compared to the disaster that is CMake's. (Try to find out, as a user not experienced with either, how you'd make a debug build.) I agree that configure --help is more useful for surfacing project-specific options, though.

> Copying is generally considered pretty lame in the demoscene these days.

You will still see plenty of e.g. SID covers of existing pop music, without anyone really batting an eyelid.


Fair. Pretty lame tho.

That we can agree on.

At least they know who to cite, even if they don't. I like to have a diffusion model generate an image of my desired subject in whatever medium I choose, then look at it and make something close but not quite the same. I'm copying tons of people I don't even know. But I am also just practicing and don't try to pass it off as my own creation.

> if there are problems, they will eventually come to light anyway

Not necessarily; before Kyle started this one-man crusade against data loss, database vendors would claim generally whatever and it would go unchallenged for decades. (You might get the occasional bug report, which you could handwave away as “hardware” or “you're holding it wrong” or just ignore it.) Now you're _slightly_ less likely to succeed, but only as long as e.g. your product is sufficiently uninteresting or hard enough to set up that he doesn't test it. :-)


This is called “an arena” more generally, and it is in wide use across many forms of servers, compilers, and others.


xz is generally slower-but-more-dense; it's not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a “better gzip”, in that it compresses about as fast as gzip but packs somewhat denser (and decompresses faster).


And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like

  A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)

instead of the faster

  tmp = x * x;
  ((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds)


The reason for writing out all of the x multiplications like that is that I was hoping the compiler would detect such a pattern and perform the optimization for me. Matt Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.


Horner's form is typically also more accurate; at the very least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math-optimizations, and maybe not even then.


Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like

    typedef double v2d __attribute__ ((vector_size (16)));

    v2d x2 = { x * x, x * x };
    v2d packed = fma(As, x2, Bs);  // one lane per half, Horner in x^2
    packed = fma(packed, x2, Cs);
    // ...
    return x * packed[0] + packed[1];
smth like that

Actually, one project I was thinking of doing was creating SLP-vectorized versions of libm functions, since plenty of programs spend a lot of time in libm on single inputs, and the implementation is usually a bunch of scalar instructions.


The common subexpression elimination (CSE) pass in compilers takes care of that.


Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.

[1]: https://godbolt.org/z/8bEjE9Wxx


Right, I totally forgot about floating-point non-associativity.


The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.


Not in this case because the dependencies are the same:

Naive: https://godbolt.org/z/Gzf1KM9Tc

Horner's: https://godbolt.org/z/jhvGqcxj1


Still, it's no worse than the naïve formula, which has exactly the same data dependencies and then more.

_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?


For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.

There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.

There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.


Ah, I didn't know of either scheme. Still, am I right in that this mainly makes sense for degrees above five or six or so?


You can often eke something out for order-four, depending on uArch details. But basically yeah.


Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?


No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.


Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)


Isn't that for... readability...?


Not just for speed, Horner can also be essential for numerical stability.


Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).

But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.


You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.


Ah, fair point, it has been a while since I've needed fast inexact math.

Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc may each be computed once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.


> And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.

This is incorrect, for exactly the reason you are citing: A * x * x * x * x = (((A * x) * x) * x) * x, which means that (x * x) is nowhere to be seen in the expression and cannot be factored out. Now, if you wrote x * x * x * x * A instead, _then_ the compiler could have done partial CSE against the term with B, although still not as much as you'd like.


The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.


Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>


> now it feels overwhelming to have to try to grasp

Here's a dirty secret: It's just like IPv4, except with longer addresses and slightly different autoconfig. :-) (Well, you don't have the legacy of classful addressing and non-contiguous netmasks and stuff, but I don't really think most people care much about that in the IPv4 world either.) Getting up to speed is, thankfully, simple.


> There are more errors in the text

It seems most of it is AI-generated, without any real attempt at cleaning up errors.


Or use a cycle timer and run a PRNG from it.

Or wait for us to launch random() :-) (It's in development, available if you enable a flag)


> Overall, I think block bloom filters should be the default most people reach for.

I think this depends on how big your filters are. Most people think of Bloom filters as having to have hundreds of thousands of elements, but I frequently find them useful all the way down to 32 bits (!). (E.g., there are papers showing chained hash tables where each bucket has a co-sited tiny Bloom filter to check if it's worth probing the chain.) In the “no man's land” in between, with a few tens of thousands of buckets, the blocking seems to be mostly negative; it only makes sense as long as you actually keep missing the cache.


Are you talking about Cuckoo++ tables, perhaps? If not can you point me to the hash table you had in mind? Always fun to learn of a new approach.

https://github.com/technicolor-research/cuckoopp


IIRC, it's this paper: https://db.in.tum.de/~birler/papers/hashtable.pdf

I never implemented their hash table, but it opened my eyes to the technique of a tiny Bloom filter, which I've used now a couple of times to fairly good (if small) effect. :-)


Thanks! This'll be a fun read :)


Yeah, I agree with this. I think there are open addressing hash tables like Swiss Table that do something similar. IIRC, they have buckets with a portion at the beginning with lossy “fingerprints” of items, which kind of serve a similar purpose as a bloom filter.


Bloom filters are useful for sharding so it stands to reason that a hash table implemented with shards would benefit.

