More

ot · 2026-03-17T06:48:49 1773730129

That's a false dichotomy: you optimize both the application and the allocator.

A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.

sumtechguy · 2026-03-17T15:10:01 1773760201

exactly. I can think of at least 5 different projects I have been on where a better allocator would made a world of difference. I can also think of another 5 where it probably would have been a waste of time to even fiddle with.

but as usual there is an xkcd for that. https://xkcd.com/1205/

One project I spent a bunch of time optimizing the write path of I/O. It was just using standard fwrite. But by staging items correctly it was an easy 10x speed win. Those optimizations sometimes stack up and count big. But it also had a few edges on it, so use with care.

ot · 2026-03-17T06:46:20 1773729980

It's not just that zeroing got cheaper, but also we're doing a lot less of it, because jemalloc got much better.

If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.

Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually need it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".

However, the drawback is that system-level memory accounting becomes even more fuzzy.

(hi Alex!)

menaerus · 2026-03-17T13:33:44 1773754424

I am trying to understand the reason behind why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations that you can share?

Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated at FB is most likely much larger than what L1+L2+L3 can fit so I am wondering how much effect the vectorization engine might have had since bulk-zeroing operation for large datasets is anyways going to be bottlenecked by the single core memory bandwidth, which at the time was ~20GB/s.

Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization.

jcalvinowens · 2026-03-17T15:07:30 1773760050

My memory is that Ivy Bridge was when it started being different.

ahoka · 2026-03-17T15:20:00 1773760800

AVX maybe?

ot · 2026-03-10T02:11:05 1773108665

RSA was also not given that name by its authors, the name came later, which is usually the case.

In the original paper they do not give it any name: https://people.csail.mit.edu/rivest/Rsapaper.pdf

ot · 2026-03-09T01:39:11 1773020351

> are there eviction techniques to guard against this?

RE2 resets the cache when it reaches a (configurable) size limit. Which I found out the hard way when I had to debug almost-periodic latency spikes in a service I managed, where a very inefficient regex caused linear growth in the Lazy DFA, until it hit the limit, then all threads had to wait for its reset for a few hundred milliseconds, and then it all started again.

I'm not sure if dropping the whole cache is the only feasible mitigation, or some gradual pruning would also be possible.

Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may not be applicable anymore. RE2 uses RW-locking for this.

ieviev · 2026-03-09T18:06:32 1773079592

I have experienced this as well, the performance degradation of DFA to NFA is enormous and while not as bad as exponential backtracking, it's close to ReDoS territory.

The rust version of the engine (https://github.com/ieviev/resharp) just returns an Error instead of falling back to NFA, I think that should be a reasonable approach, but the library is still new so i'm still waiting to see how it turns out and whether i had any oversights on this.

ot · 2026-03-09T21:20:25 1773091225

Here RE2 does not fall back to the NFA, it just resets the Lazy DFA cache and starts growing it again. The latency spikes I was mentioning are due to the cost of destroying the cache (involving deallocations, pointer chasing, ...)

ieviev · 2026-03-10T16:33:26 1773160406

Ah, sorry then i misunderstood the comment

I'm not sure if it's with both RE2 or Rust, but some internal engines of Rust appear to allocate a fixed buffer that it constantly re-creates states into.

I'm not really familiar with the eviction technique of RE2 but I've done a lot of benchmark comparisons. A good way to really stress test RE2 is large Unicode classes, \w and \d in RE2 are ascii-only, i've noticed Unicode (\p{class}) classes very drastically change the throughput of the engine.

ot · 2026-02-25T12:39:30 1772023170

This is drawing broad conclusions from a specific RW mutex implementation. Other implementations adopt techniques to make the readers scale linearly in the read-mostly case by using per-core state (the drawback is that write locks need to scan it).

One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/

There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require to design the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.

PaulHoule · 2026-02-25T12:52:14 1772023934

I think it’s not unusual that reader-writer locks, even if well implemented, get in places where there are so many readers stacked up that writers never get to get a turn or 1 writer winds up holding up N readers which is not so scalable as you increase N.

Jyaif · 2026-02-25T13:05:43 1772024743

And a Rust equivalent of folly::SharedMutex: https://docs.rs/crossbeam-utils/latest/crossbeam_utils/sync/...

amluto · 2026-02-25T16:05:56 1772035556

Wow, folly::SharedMutex is quite an example of design tradeoffs. I wonder what application the authors wanted it for where using a global array was better than a per-mutex array.

mike_hearn · 2026-02-25T12:56:21 1772024181

Right, and if you're on the JVM you have access to things like ConcurrentHashMap which is lock free.

ot · 2026-02-09T13:15:21 1770642921

Glad that Moby Dick is in there.

ot · 2026-01-14T00:53:05 1768351985

You can do even faster, about 8ns (almost an additional 10x improvement) by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc call within a seqlock.

This is not well documented unfortunately, and I'm not aware of open-source implementations of this.

EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows to select only user time. The kernel can definitely do it, but I don't know if the wiring is there. However this definitely works for overall thread CPU time.

jerrinot · 2026-01-14T01:02:35 1768352555

That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!

ot · 2026-01-14T01:05:05 1768352705

Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)

catlifeonmars · 2026-01-14T07:39:51 1768376391

I guess if you need the concurrency/throughput you should use a userspace green thread implementation. I’m guessing most implementations of green threads multiplex onto long running os threads anyway

jerrinot · 2026-01-14T09:01:50 1768381310

In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.

nly · 2026-01-14T09:04:54 1768381494

Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?

Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Tbh I thought clock_gettime was a vdso based virtual syscall anyway

ot · 2026-01-15T17:08:03 1768496883

> Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Yes, that's exactly what a seqlock (reader) is.

mgaunard · 2026-01-14T14:40:42 1768401642

clock_gettime is not doing a syscall, it's using vdso.

jerrinot · 2026-01-14T15:07:50 1768403270

clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific CPU user time, the vDSO shim cannot resolve the request in user space and must transit into the kernel. In this specific case, there is absolutely a syscall.

ot · 2026-01-14T00:48:18 1768351698

If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).

jerrinot · 2026-01-14T00:49:37 1768351777

Exactly this.

ot · 2026-01-06T17:01:00 1767718860

That's probably true for small primitive types, but if your objects are expensive to move (like a large struct) it might be beneficial to minimize swaps.

praptak · 2026-01-06T19:20:20 1767727220

Yeah, it might be interesting to run some profiling of both algorithms and see how they perform dependent on the size of the blocks being swapped (which doesn't even have to be equal to the size of the object in the array).

ot · 2026-01-05T16:38:45 1767631125

The query is incorrect, it will return any posting that contains the words "vision" and "pro", not necessarily consecutive.

It looks like phrasal search is supported, searching "vision pro" in quotes only returns 212 results worldwide

https://jobs.apple.com/en-us/search?search=%22vision+pro%22&...

Spot-checked a few and they all seem to be Vision Pro related.

--

EDIT: Actually even this is not accurate, as it matches postings with sentences like

> Fundamental to the success of iPhone, iPad, Apple Watch, Apple TV, Vision Pro, and Mac ...

but not specific to Vision Pro.

However we can filter on products and services for Vision Pro and visionOS, and it gives 106 results:

https://jobs.apple.com/en-us/search?search=%22vision+pro%22&...

Croftengea · 2026-01-05T16:41:40 1767631300

Even quoted search is not representative because some ads just mention Vision Pro as one of the many products Apple is known for:

> "The Manufacturing Design team enables the mass production of Apple's entire product line from iPhones, iPads and MacBooks to the Mac Pro, AppleTV, Apple Watch and Vision Pro."

ot · 2026-01-05T16:42:48 1767631368

Yeah, was just about to edit the comment :)

coreyh14444 · 2026-01-05T16:40:09 1767631209

212 is still a pretty high number.

g947o · 2026-01-05T16:43:45 1767631425

212 is still a large number compared to Meta's Reality Labs which is downsizing.