
I recently started using Microsoft's mimalloc (via an LD_PRELOAD) to better use huge (1 GB) pages in a memory-intensive program. The performance gains are significant (around 20%). It feels rather strange using an open source MS library for performance on my Linux system.

There needs to be more competition in the malloc space. Between various huge page sizes and transparent huge pages, there are a lot of gains to be had over what you get from a default GNU libc.




We evaluated a few allocators for some of our Linux apps and found (modern) tcmalloc to consistently win in time and space. Our applications are primarily written in Rust and the allocators were linked in statically (except for glibc). Unfortunately I didn't capture much context on the allocation patterns. I think in general the apps allocate and deallocate at a higher rate than most Rust apps (or more than I'd like at least).

Our results from July 2025:

rows are <allocator>: <RSS>, <time spent for allocator operations>

  app1:
  glibc: 215,580 KB, 133 ms
  mimalloc 2.1.7: 144,092 KB, 91 ms
  mimalloc 2.2.4: 173,240 KB, 280 ms
  tcmalloc: 138,496 KB, 96 ms
  jemalloc: 147,408 KB, 92 ms

  app2, bench1:
  glibc: 1,165,000 KB, 1.4 s
  mimalloc 2.1.7: 1,072,000 KB, 5.1 s
  mimalloc 2.2.4:
  tcmalloc: 1,023,000 KB, 530 ms

  app2, bench2:
  glibc: 1,190,224 KB, 1.5 s
  mimalloc 2.1.7: 1,128,328 KB, 5.3 s
  mimalloc 2.2.4: 1,657,600 KB, 3.7 s
  tcmalloc: 1,045,968 KB, 640 ms
  jemalloc: 1,210,000 KB, 1.1 s

  app3:
  glibc: 284,616 KB, 440 ms
  mimalloc 2.1.7: 246,216 KB, 250 ms
  mimalloc 2.2.4: 325,184 KB, 290 ms
  tcmalloc: 178,688 KB, 200 ms
  jemalloc: 264,688 KB, 230 ms

tcmalloc was from github.com/google/tcmalloc/tree/24b3f29.

I don't recall which jemalloc version was tested.


I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).

Multithreaded async systems destroy that locality, so the allocator constantly has to run through the exception case: thread A allocated a buffer, the task went async, the request wakes up on thread B, which frees the buffer and has to synchronize with A to give it back.
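
To make that concrete, here is a minimal C sketch of the pattern (thread roles and sizes are made up): every free happens on a different thread than the matching malloc, which is exactly the case a purely thread-caching allocator handles worst.

    #include <pthread.h>
    #include <stdlib.h>

    #define N 1000000
    static void *bufs[N];                  /* handoff "queue" filled by thread A */

    static void *alloc_thread(void *arg) { /* plays the role of thread A */
        (void)arg;
        for (int i = 0; i < N; i++)
            bufs[i] = malloc(64);          /* lands in A's thread-local cache */
        return NULL;
    }

    static void *free_thread(void *arg) {  /* plays the role of thread B */
        (void)arg;
        for (int i = 0; i < N; i++)
            free(bufs[i]);                 /* cross-thread free: B must hand the
                                              memory back to A's cache or to a
                                              central free list */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, alloc_thread, NULL);
        pthread_join(a, NULL);             /* join first so the handoff is race-free */
        pthread_create(&b, NULL, free_thread, NULL);
        pthread_join(b, NULL);
        return 0;
    }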

Are you using async rust, or sync rust?


Modern tcmalloc uses per-CPU caches via rseq [0]. We use async Rust with multithreaded tokio executors (sometimes multiple in the same application), so relatively high thread counts.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...


How do you control which CPU your task resumes on? If you don't then it's still the same problem described above, no?

On the OS scheduler side, I'd imagine there's some stickiness that keeps tasks from jumping wildly between cores; I'd expect migration to be modelled as a non-zero cost. Complete speculation, though.

On the tokio scheduler side, the executor is thread-per-core, and work stealing of in-progress tasks shouldn't be happening too much.

For all thread pool threads, or threads unaffiliated with the executor, see the earlier speculation on OS scheduler behavior.


Correct. The Linux scheduler has been NUMA-aware and sticky for a while (which is more or less what this reduces to in common scenarios).

> I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

Indeed, it's not the old gperftools version.

Blog: https://abseil.io/blog/20200212-tcmalloc

History / Diffs: https://google.github.io/tcmalloc/gperftools.html


also:

1. tcmalloc is actually the only allocator I tested that was not using thread-local caches (modern tcmalloc uses per-CPU caches instead); even glibc malloc has tcache.

2. Async executors typically shouldn't have tasks jumping willy-nilly between threads. I see the issue you describe more often with the use of thread pools (like rayon or tokio's spawn_blocking). I'd argue that the use of thread pools isn't necessarily an inherent feature of async executors. Certainly tokio relies on its thread pool for fs operations, but io_uring (for example) makes that mostly unnecessary.


That’s a considerable regression for mimalloc between 2.1 and 2.2 – did you track it down or report it upstream?

Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.


nope.

This is similar to what I experienced when I tested mimalloc many years ago. If it was faster, it wasn't faster by much, and had pretty bad worst cases.

If you go into the Dr. Dobb's, The C/C++ Users Journal, and BYTE digital archives, you will find ads from companies whose product was basically a special-cased memory allocator.

Even toolchains like Turbo Pascal for MS-DOS had an API to customise the memory allocator.

One size fits all was never a solution.


One of the best parts about GC languages is that they tend to have much more efficient allocation/freeing, because the cost is lumped together and so shows up more clearly in a profile.

Agreed, however there is also a reason why the best ones also pack multiple GC algorithms, like in Java and .NET, because one approach doesn't fit all workloads.

Then there's Perl, which doesn't free at all.

Perl frees memory. It uses refcounting, so you need to break heap cycles or it will leak.

(99% of the time, I find this less problematic than Java’s approach, fwiw).


Unless this has changed recently, Perl doesn't free memory back to the kernel, only within its own process/VM.

Freedom is overrated... :P

Doesn't Java also?

I heard that was a common complaint for Minecraft.


Minecraft, for somewhat silly reasons, was largely stuck using Java 8 for about a decade longer than it should have been, which meant it was using some fairly outdated GC algorithms.

"silly reasons" being Java breaking backwards compatibility

A decade seems like the usual timescale for that, considering e.g. Python 2->3.


So much software was stuck on Java 8 and for so long that some of the better GC algorithms got backported to it.

What do you mean - whether Java returns memory to the OS? Which one - the Java heap, or the malloc/free done by the JVM?

Java is pretty greedy with the memory it claims. Especially historically it was pretty hard to get the JVM to release memory back to the OS.

To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak.


> Especially historically it was pretty hard to get the JVM to release memory back to the OS.

This feels like a huge understatement. I still have some PTSD from when I did Java professionally, between roughly 2005 and 2014.

The early part of that was particularly horrible.


Java has a quite strict max heap setting; it's very uncommon to let it allocate up to 25% of the system memory (the default). It won't grow past that point, though.

Barring bugs/native leaks, Java has very predictable memory allocation.


We aren't talking about allocation, though.

We are talking about DEallocation.


it's a reply to:

"To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak."

I cut the part about it being possible to make the JVM return heap memory after compaction, but usually that's not done; i.e., if something grew once, it's likely to do it again.


This only really ends up being a problem on Windows. On systems with proper virtual memory setups, the cost of unused memory is very low (since the OS can just page it out).

Unfortunately, the JVM and collectors like the JVM's play really badly with virtual memory. (Actually, G1 might play better. Everything else does not.)

The issue is that through the standard course of a JVM application running, every allocated page will ultimately be touched. The JVM fills up new gen, runs a minor collection, moves old objects to old gen, and continues until old gen gets filled. When old gen is filled, a major collection is triggered and all the live objects get moved around in memory.

This natural action of the JVM means you'll see a sawtooth of used memory in a properly running JVM where the peak of the sawtooth occasionally hits the memory maximum, which in turn causes the used memory to plummet.


Depends on which JVM; PTC and Aicas do all right with their real-time GCs for embedded deployment.

I've never really used anything other than OpenJDK and Azul's.

How do PTC and Aicas do GC? Is it refcounted? I'm guessing they aren't doing moving collectors.


They are real-time GCs; nothing to do with refcounting.

One of the founding members of Aicas is the author of "Hard Realtime Garbage Collection in Modern Object Oriented Programming Languages" book, which was done as part of his PhD.


For video games it is pretty bad, because reading back a page from disk containing "freed" (from the application perspective, but not returned to the OS) junk you don't care about is significantly slower than the OS just handing you a fresh one. A 10-20ms delay is a noticeable stutter and even on an SSD that's only a handful of round-trips.

Games today should be using ZGC.

There are a lot of bad tuning guides for Minecraft that should be completely ignored and thrown in the trash. The only GC setting you need for it is `-XX:+UseZGC`.

For example, a number of the Minecraft golden guides I've seen will suggest things like setting pause targets but also survivor space sizes. The thing is, the pause target is disabled when you start playing with survivor space sizes.


Overall, if Java hits swap, it's a bad case. Windows is like a special beast when it comes to 'swapping', even when you don't truly need it. On Linux, all (server) services run with swap off.

Not used Windows Server that much?

Any extra throughput is far overshadowed by trying to control pauses, and too many heap allocations happen because too much gets put on the heap. For anything interactive, the options are usually fighting the GC or avoiding the GC.

When it works. Many programs in GC languages end up fighting the GC by allocating a large buffer and managing it by hand anyway, because when performance counts you can't have allocation time in there at all. (You see this in C all the time as well.)

That's generally a bad idea. Not always, but generally.

It was a better idea when Java had the old mark-and-sweep collector. However, with the generational collectors (which is all Java collectors now, except for Epsilon) it's more problematic. Reusing buffers and objects in those buffers pretty much guarantees that the buffer ends up in oldgen. That means that to clear it out, the VM has to do more expensive collections.

The actual allocation time for most of Java's collectors is almost zero; it's a capacity check and a pointer bump in most circumstances. Giving the JVM more memory will generally solve issues with memory pressure and GC times. That's (generally) a better solution to performance problems than managing a large buffer.

Now, that said, there certainly have been times where allocation pressure is a major problem and removing the allocation is the solution. In particular, I've found boxing to often be a major cause of performance problems.


If your workload is very regular, you can still do better with an arena allocator. Within the arena, it uses the same pointer-bump allocation as Java normally uses, but then you can free the whole arena by resetting the pointer to its initial value. If you use the arena for servicing a single request, for instance, you reset as soon as you're done with the request, leaving a totally empty arena for the next request. That's more efficient than a GC. But it also requires your algorithm to fall into that pattern, where you KNOW that you can and should throw everything from the request away. If you can't guarantee that, then modern collectors are pretty magical and tunable.
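
A minimal sketch of that per-request arena in C (all names are illustrative, not any particular library's API):

    #include <stdlib.h>

    typedef struct {
        char  *base;   /* one big upfront allocation */
        size_t cap;
        size_t used;
    } arena_t;

    static void *arena_alloc(arena_t *a, size_t n) {
        n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
        if (a->used + n > a->cap) return NULL; /* out of arena space */
        void *p = a->base + a->used;
        a->used += n;                          /* the pointer bump */
        return p;
    }

    static void arena_reset(arena_t *a) {      /* "free" everything at once */
        a->used = 0;
    }

    /* Per-request usage: allocate freely while servicing the request,
       then reset once the request is done. */
    static void handle_request(arena_t *a) {
        char *hdr  = arena_alloc(a, 256);
        char *body = arena_alloc(a, 4096);
        (void)hdr; (void)body;                 /* ... build the response ... */
        arena_reset(a);                        /* next request starts fresh */
    }

    int main(void) {
        arena_t a = { malloc(1 << 20), 1 << 20, 0 };  /* 1 MiB arena */
        handle_request(&a);
        free(a.base);
        return 0;
    }

The reset is the whole trick: deallocation cost is constant regardless of how many objects the request created.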

If people didn't need to do it, they wouldn't generally do it. Not always, but generally.

People do stuff they shouldn't all the time.

For example, some code I had to clean up pretty early on in my career was a dev, for unknown reasons, reinventing the `ArrayList` and then using that invention as a set (doing deduplication by iterating over the elements and checking for duplicates). It was done in the name of performance, but it was never a slow part of the code. I replaced the whole thing with a `HashSet` and saved ~300 loc as a result.

This individual did that sort of stuff all over the code base.


Reinventing data structures poorly is very common.

Heap allocation in Java is trivial and happens constantly. People typically do funky stuff with memory allocation because they have to, because the GC is causing pauses.

People avoid system allocators in C++ too; they just don't have to do it because of uncontrollable pauses.


> People typically do funky stuff with memory allocation because they have to

This same dev did things like putting what he deemed to be large objects (icons) into weak references to save memory. When the references were collected, invariably they had to be reloaded.

That was not the source of memory pressure issues in the app.

I've developed a mistrust of a lot of devs "doing it because we have to" when it comes to performance tweaks. It's not that a buffer is never the right thing to do, but it's not been something I had to reach for to solve GC pressure issues. Oftentimes, far simpler solutions, like pulling an allocation out of the middle of a loop or switching from boxed types to primitives, were all that was needed to relieve memory pressure.

The closest I've come to it is replacing code which would do an expensive and allocation heavy calculation with a field that caches the result of that calculation on the first call.


"This same dev did things like putting what he deemed as being large objects (icons) into weak references to save memory. When the references were collected, invariably they had to be reloaded."

Well actually, this is what the Apple[1] docs instruct devs to do. https://developer.apple.com/library/archive/documentation/Co...

For .NET on iOS, the difference between managed and unmanaged objects is of particular concern. In the example you provide, the Icon Assets are objects from an Apple Framework, not managed by .NET. You might use them in the UIKit views for list items in a UIKit List View.

iOS creates and disposes these list view items independently of .NET managed code. Because the reference counts can't be updated across these contexts, you'll inevitably end up with dangling references. This memory can't be cleared, so inadvertently using strong references will cause a memory leak that grows until your app crashes.

The following is a great explainer in the context of Xamarin for iOS. https://thomasbandt.com/xamarinios-memory-pitfalls

The above still applies with different languages/frameworks, of course; however, the difference is less explicit from a syntax perspective, IMHO.


I'm not sure why your rationale for how to deal with garbage-collected memory is based on a guy who didn't know standard data structures and your own gut feelings.

Any program that cares about performance is going to focus on minimizing memory allocation first. The difference with a GCed language like Java is that the problems manifest as GC pauses that may or may not be predictable. In a language like C++ you can skip the pauses and worry about overall throughput.


Well, let me just circle back to the start of this comment chain.

> Many programs in GC languages end up fighting the GC by allocating a large buffer and managing it by hand

That's the primary thing I'm contending with. This is a strategy for fighting the GC, but it's also generally a bad strategy. One that I think gets pulled more because someone heard of the suggestion and less because it's a good way to make things faster.

That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data. I've observed that a lot of engineers operate that way.

But I've further observed that when it comes to optimizing for the GC, a large number of problems don't need such an extreme measure as building your own memory buffer and managing it directly. In fact, that sort of measure is generally counterproductive in a GC environment, as it makes major collections more costly. It isn't a "never do this" thing, but it's also not something that "many programs" should be doing.

I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand" is a technique that almost any program or library needs to engage in to minimize GCs.


> This is a strategy for fighting the GC, but it's also generally a bad strategy.

Allocating a large buffer is literally what an array or vector is. A heap uses a heap structure and hops around in memory for every allocation and free. It gets worse the more allocations there are. The allocations are fragmented and in different parts of memory.

Allocating a large buffer takes care of all this when it is at all possible. It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.

> That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data.

You need to let this go; that guy has nothing to do with what works when optimizing memory usage and allocation.

> But I've further observed that when it comes to optimizing for the GC, a large number of problems don't need such an extreme measure as building your own memory buffer and managing it directly.

Making an array of contiguous items is not an "extreme strategy"; it's the most efficient and simplest way for a program to run. Other memory allocations can just be an extension of this.

> I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand"

If you need the same amount of memory but need to minimize allocations, how do you think that is done? You make larger allocations and split them up. You keep saying "managing it by hand" as if it has to be tricky or difficult. Using indices into an array is not difficult, and neither is handing out indices or ranges in small sections.


> A heap uses a heap structure and hops around in memory for every allocation and free.

Not in the JVM. And maybe this is ultimately what we are butting up against. After all, the JVM isn't all GCed languages, it's just one of many.

In the JVM, heap allocations are done via bump allocation. When a region is filled, the JVM performs a garbage collection which moves objects in the heap (it compacts the memory). It's not an actual heap structure for the JVM.

> It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.

That is (currently) not possible to do in the JVM, barring primitives. When I create a `new Foo[128]` in the JVM, that creates an array big enough to hold 128 references of Foo, not 128 Foo objects. Those have to be allocated onto the heap separately. This is part of the reason why managing such an object pool is pointless in the JVM. You have to make the allocations anyways and you are paying for the management cost of that pool.

The object pool is also particularly bad in the JVM because it stops the JVM from performing optimizations like scalarization. That's where the JVM can avoid a heap allocation altogether and instead pulls out the internal fields of the allocated object to hand off to a calling function. In order for that optimization to occur, an object can't escape the current scope.

I get why this isn't the same story if you are talking about another language like C# or Go. There are still the negative consequences of needing to manage the buffer, especially if the intent is to track allocations of items in the buffer and to reassign them. But there is a gain in locality that's nice.

> Using indices of an array is not difficult and neither is handing out indices or ranges to in small sections.

Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators. It's the crux of the problem an allocator is trying to solve in the first place "How can I efficiently give a chunk of memory back to the application".

Obviously, it'll matter what your usage pattern is, but if it's at all complex, you'll run into the same problems that the general allocator hits.


> In the JVM, heap allocations are done via bump allocation.

If that were true then they wouldn't be heap allocations.

https://www.digitalocean.com/community/tutorials/java-jvm-me...

https://docs.oracle.com/en/java/javase/21/core/heap-and-heap...

> not possible to do in the JVM, barring primitives

Then you make data structures out of arrays of primitives.

> Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators.

I don't know what this means. Vectors are trivial, and if you hand out ranges of memory in an arena allocator you allocate it once and free it once, which solves the heavy-allocation problem. The allocator parameter in templates doesn't factor into this.


> If that were true then they wouldn't be heap allocations.

"Heap" is a misnomer. It's not called that due to the classic CS "heap" datastructure. It's called that for the same reason it's called a heap allocation in C++. Modern C++ allocators don't use a heap structure either.

How the JVM does allocations for all its collectors is in fact a bump allocator in the heap space. There are some weedsy details (for example, threads in the JVM have their own heap region for doing allocation, to avoid contention) but suffice it to say it ultimately translates into a region check and then a pointer bump. This is why the JVM is so fast at allocation, much faster than C++ can be. [1] [2]

> I don't know what this means.

JVM allocations are typically pointer bumps, adding a number to a register. There's really nothing faster than it. If you are implementing an arena then you've already lost in terms of performance.

[1] https://www.datadoghq.com/blog/understanding-java-gc/#memory...

[2] https://inside.java/2020/06/25/compact-forwarding/


> Modern C++ allocators don't use a heap structure either.

"Yes, malloc uses a heap data structure to allocate memory dynamically for programs. The heap allows for persistent memory allocation that can be managed manually by the programmer."

"How Malloc Works with the Heap

    Heap Data Structure: Malloc uses a heap data structure to manage memory. The heap is a region of a process's memory that is used for dynamic memory allocation.

    Memory Management: When you call malloc, it searches the heap for a suitable block of memory that can accommodate the requested size. If found, it allocates that memory and returns a pointer to it."

> How the JVM does allocations for all its collectors is in fact a bump allocator in the heap space.

This doesn't make sense. It's one or the other. A heap isn't about getting more memory or mapping it into a process space; it is about managing the memory already in the process space and being able to free memory in a different order than you allocated it, then give that memory back out without system calls.

https://www.geeksforgeeks.org/c/dynamic-memory-allocation-in...

https://en.wikipedia.org/wiki/C_dynamic_memory_allocation

> JVM allocations are typically pointer bumps, adding a number to a register.

I think you are mixing up mapping memory into a process (which is a system call not a register addition) and managing the memory once it is in process space.

The allocator frees memory and reuses it within a process. If freeing were as simple as subtracting from a register, then there would be no difference in speed between the stack and the heap, and there would be no GC pauses and no GC complexity. None of these things are true, obviously, since Java has been dealing with these problems for 30 years.

> This is why the JVM is so fast at allocation, much faster than C++ can be

Java is slower than C++ and less predictable because you can't avoid the GC, which is the whole point here.

The original point was that you have to either avoid the GC or fight the GC and a lot of what you have talked about is either not true or explains why someone has to avoid and fight the GC in the first place.


You're wrong for like 6 different reasons.

Java does do bump pointer allocation. The key is that when GC runs, surviving objects get moved. The slow part of GC isn't the allocation (GCs generally have much faster allocators than malloc). The slow part is the barriers that the GC requires and the pauses.


> You're wrong for like 6 different reasons.

If that were true, you could have listed one that made sense in context. This person was saying that allocation was as fast as incrementing a register while continually ignoring the fact that deallocation needs to happen, along with any organization of allocated memory.

Then they were ignoring that large allocations have big speed benefits for a reason.

Conflating Java moving a pointer, mapping memory into a process, sbrk, and arena allocation is going in circles, but the fundamental point, that people need to fight the GC or work around it, remains.

Allocations have a price, and the first step in optimizing any program is avoiding them, but in GC languages you get pauses on top of your slowdowns.


Right? "I had this one contingent experience and I've built my entire world view and set of practices around it."

Premature optimization is the root of all evil.

I remember in the early days of web services, using the apache portable runtime, specifically memory pools.

If you got a web request, you could allocate a memory pool for it, then do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors - you could just free the entire pool.

It was nice and made an impression on me.
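
For the curious, this is roughly what that per-request pool pattern looks like with APR's pool API (handle_request is a made-up stand-in for the real request handler):

    #include <apr_general.h>
    #include <apr_pools.h>

    static void handle_request(apr_pool_t *parent) {
        apr_pool_t *req;
        apr_pool_create(&req, parent);      /* one pool per request */

        char *buf = apr_palloc(req, 4096);  /* every allocation comes from
                                               the request's pool */
        (void)buf;                          /* ... do the work; succeed or
                                               fail with any error ... */

        apr_pool_destroy(req);              /* one call frees it all */
    }

    int main(void) {
        apr_pool_t *root;
        apr_initialize();
        apr_pool_create(&root, NULL);
        handle_request(root);
        apr_pool_destroy(root);
        apr_terminate();
        return 0;
    }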

I think the lowly malloc probably has lots of interesting ways of growing and changing.


This is called “an arena” more generally, and it is in wide use across many forms of servers, compilers, and others.

Look into talloc, used inside Samba (and other FLOSS projects like sssd). Exactly this.

In many cases you can also do better than using malloc, e.g. if you know you need a huge page, map one directly with mmap.
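
A sketch of the direct-mmap route on Linux, assuming 1 GiB hugetlb pages were reserved at boot; the fallback to normal pages is my addition:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    /* Flag values from the Linux uapi headers, in case libc lacks them. */
    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30U << MAP_HUGE_SHIFT)
    #endif

    int main(void) {
        size_t len = 1UL << 30;             /* one 1 GiB page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) {              /* no reserved 1 GiB pages? */
            perror("mmap(MAP_HUGETLB)");
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) return 1;
        }
        /* ... carve p up with your own allocator ... */
        munmap(p, len);
        return 0;
    }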

Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.


I feel like the real thing that needs to change is we need a more expressive allocation interface than just malloc/realloc. I'm sure that memory allocators could do a significantly better job if they had more information about what the program was intending to do.

There are - look no further than the jemalloc API surface itself:

https://jemalloc.net/jemalloc.3.html

One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html
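
For instance, a minimal sketch against the extended API (symbols may carry a je_ prefix depending on how jemalloc was configured):

    #include <jemalloc/jemalloc.h>

    int main(void) {
        /* mallocx: allocation with explicit flags, here 64-byte
           alignment plus zeroed memory. */
        void *p = mallocx(1024, MALLOCX_ALIGN(64) | MALLOCX_ZERO);
        if (!p) return 1;

        /* sdallocx: sized deallocation. Handing the size back lets the
           allocator skip its size-class lookup, the same win C++'s
           sized operator delete enables. */
        sdallocx(p, 1024, 0);
        return 0;
    }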


You can also play tricks with inlining and constant propagation in C (especially on the malloc path, where the ground-truth allocation size is usually statically known).

I think some operating system improvements could get people motivated to use huge pages a lot more. In particular, make them less fragile on Linux and make them not need admin rights on Windows. The biggest factor causing problems there is that neither OS can swap out a 2MB page. So someone needs to care enough to fix that.

I used mimalloc to run zenlisp under OpenBSD, as zenlisp would otherwise clash with the paranoid malloc in base.

Just out of curiosity, are you getting 1GB huge pages on Xeon or some other platform? I always thought this class of page is the hardest to exploit, considering that the machine has, if I recall correctly, only one TLB slot for those.

Modern x86_64 has supported multiple page sizes for a long time. I'm on commodity Zen 5 hardware (9900X) with 128 GiB of RAM. Linux will still use a base page size of 4 KiB but also supports both 2 MiB and 1 GiB huge pages. You can pass something like `default_hugepagesz=2M hugepagesz=1G hugepages=16` to your kernel on boot to use 2 MiB pages by default but reserve 16 1 GiB pages for later use.

The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.

EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).


Right, but on Intel the 1G page size has historically been the odd one. For example, Skylake-X has 1536 shared L2 TLB entries for either 4K or 2M pages, but only 16 entries that can be used for 1G pages. It wasn't unified until Cascade Lake. But Skylake-like Xeons are still incredibly common in the cloud, so it's hard to target the later ones.

So for any process that's using less than 16GB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 zones in rapid succession, will also see a performance boost.

My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple percent might have been allocator change but the boost from forcing huge pages was very significant)


That strikes me as a common hugepages win. People never believe you, though, when you say you can make their thing 20% faster for free.

Then it should be pretty easy to demonstrate that 20% "faster for free", no? But as always, the devil is in the details. I experimented a lot with huge pages, and although in theory you should see the performance boost, the workloads I used to test this hypothesis did not end up with anything statistically significant/measurable. So my conclusion was ... it depends.

Try a big factorio map just as a test case. It's a bit of an outlier on performance, in particular it's very heavy on memory bandwidth.

Of course, it only helps workloads that exhibit high rates of page table walking per instruction. But those are really common.

Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering whether the penalty (which we can quantify as up to 4 extra memory accesses for a 4-level page table walk, amounting to ~20 cycles if the page-table entries are already in L1 cache, or ~60-200 cycles if they are in L2/L3) would be noticeable in workloads which are IO-bound. In other words, would such workloads benefit from switching to huge pages when most of the time the CPU sits waiting for data to arrive from storage anyway?

In a multi-tenant environment, yes. The faster they can get off the CPU and yield to some other tenant, the better it is.

    > commodity
    > zen 5
    > 128GiB
Are you from the future?

I'm not sure what point you're trying to make.

In the middle of last year, a 9900X was around $350 and 128GB of memory was also around $350. That's very easily "commodity" range.


Damn. I feel old and must've missed that boat. Several other boats too, I guess.

Here I was thinking 16GiB is pretty good. I get to compile LibreOffice in an afternoon. QtWebEngine overnight.

Doesn't 128GiB make rowhammer much more feasible? You'd have 32GiB per DIMM.

Oh well


Two 64GiB DIMMs would be the more likely setup. The current CPUs strongly prefer having only one stick of DDR5 per channel.

The effectiveness of rowhammer depends on how well the manufacturer implemented target row refresh. But the internal ECC on DDR5 should help defend against it somewhat.

Personally I've been in the 24-32GiB range since 2013, and that's despite the fact that I'm still on DDR3.


If there is so much performance difference among generic allocators, it means you need semantically optimized allocators (unless performance is actually not that important in the end).

You are not wrong, and this is indeed what Zig is trying to push by making all std functions that allocate take an allocator parameter.

Agreed, mostly. Going from the standard library to something like jemalloc or tcmalloc will give you around 5-10% wins, which can be significant, but the difference between those generic allocators seems small. I just made a slab allocator recently for a custom data type and got a 100% speedup over malloc.
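
As a rough illustration of why a slab wins, here is a minimal fixed-size free-list sketch in C (names made up): alloc and free become O(1) list operations with no size-class lookup at all.

    #include <stdlib.h>

    typedef struct node { struct node *next; } node_t;

    typedef struct {
        void   *block;      /* one big allocation backing every slot */
        node_t *free_list;  /* intrusive list threaded through free slots */
    } slab_t;

    static int slab_init(slab_t *s, size_t obj_size, size_t count) {
        if (obj_size < sizeof(node_t)) obj_size = sizeof(node_t);
        s->block = malloc(obj_size * count);
        if (!s->block) return -1;
        s->free_list = NULL;
        for (size_t i = 0; i < count; i++) {   /* pre-thread the free list */
            node_t *n = (node_t *)((char *)s->block + i * obj_size);
            n->next = s->free_list;
            s->free_list = n;
        }
        return 0;
    }

    static void *slab_alloc(slab_t *s) {       /* O(1) pop; NULL when exhausted */
        node_t *n = s->free_list;
        if (n) s->free_list = n->next;
        return n;
    }

    static void slab_free(slab_t *s, void *p) { /* O(1) push */
        node_t *n = p;
        n->next = s->free_list;
        s->free_list = n;
    }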

Here you go.

I've been using jemalloc for over 10 years and don't really see a need for it to be updated. It always holds up in benchmarks against any new flavor-of-the-month malloc that comes out.

Last time I checked mimalloc, which was admittedly a while ago, probably 5 years, it was noticeably worse, and I saw a lot of people on their GitHub issues agreeing with me, so I just never looked at it again.


Mimalloc v3 just came out (about a month ago) and is a significant improvement over both v2 and v1 (which is what you likely last tested).

Benchmarks age fast. Treating a ten-year-old allocator as done just because it still wins old tests is tempting fate, since distros, glibc, kernel VM behavior, and high-core alloc patterns keep moving and the failures usually show up as weird regressions in production, not as a clean loss on someone's benchmark chart.

It still beat mimalloc when I checked 4-5 years ago.

You really need to benchmark your workloads, ideally with the "big 3" (jemalloc, tcmalloc, mimalloc). They all have their strengths and weaknesses.

Jemalloc can usually keep the smallest memory footprint, followed by tcmalloc.

Mimalloc can really speed things up sometimes.

As usual, YMMV.


I've benchmarked them every few years, they never seem to differ by more than a few percent, and jemalloc seems to fragment and leak the least for processes running for months.

Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing, so I am not inclined to trust it now.


> Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing

That's... ahistorical, at least as far as I remember. It wasn't marketed as either of those; it was marketed as small/simple/consistent, with an opt-in high-security mode, and then its performance bore out as a result of the first set of target features/design goals. It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.


mimalloc definitely made claims that could not be reproduced, or at least not by me. That's why I wrote this doc five years ago. "Irreproducible malloc benchmarks" https://www.dropbox.com/scl/fi/evnn6yoornh9p6l7nq1t9/Irrepro...

> It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.

That is true of basically every single malloc replacement out there, that is not a uniquely defining feature.


Look up the numbers in the other comments above. When it comes to performance, Google's tcmalloc is unconquered.

I tried all three, multiple times, and it depends.

Using the last workload tested as an example, mimalloc just consumed memory like crazy. It was probably leaking, as it was the stock version that comes in Debian, so probably quite old.

Tcmalloc and jemalloc were neck and neck when comparing app metrics (request duration etc. was quite similar), but jemalloc consistently used only about half the RAM compared to tcmalloc.

Both custom allocators used way less RAM than the stock allocator though. Something like 10x (!) less. In the end the workload with jemalloc hovers somewhere around 4% of the memory limit. Not bad for one single package and an additional compile option to enable it.



