Hacker Times | anarazel's comments

> Well also I know Postgres UNIQUE indexes provide additional locking. Like you can do an INSERT... WHERE NOT EXISTS or INSERT... ON CONFLICT that is guaranteed to succeed.

That's true only for the latter (and even then only at an isolation level that's not too strict).


Oh I misremembered, yeah just tested and the second INSERT errors.

There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works you can get very confusing bugs once another file is opened into that fd number...

You may be mixing up fork and exec. Library data state isn't retained over execve(), and O_CLOEXEC does not take effect at fork().
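To make the fork/exec distinction concrete, here is a minimal Python sketch (assuming a Unix system; the function name is made up for illustration). Python creates pipe descriptors with the close-on-exec flag already set, yet a forked child can still write through one, because the flag only matters at exec time.

```python
import os


def cloexec_survives_fork():
    """Fork a child that writes through a close-on-exec descriptor."""
    # os.pipe() returns non-inheritable descriptors, i.e. the
    # close-on-exec flag is set on both ends (PEP 446).
    read_fd, write_fd = os.pipe()
    assert not os.get_inheritable(write_fd)

    pid = os.fork()
    if pid == 0:
        # Child: no exec has happened, so the descriptor is still open.
        try:
            os.close(read_fd)
            os.write(write_fd, b"still open after fork")
        finally:
            os._exit(0)

    # Parent: close our write end so read() reports EOF once the child exits.
    os.close(write_fd)
    data = b""
    while True:
        chunk = os.read(read_fd, 1024)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data
```

Calling cloexec_survives_fork() returns the bytes the child wrote, which it could not have done if the flag had closed the descriptor at fork().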

Indeed. Not enough coffee, apparently.

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.


The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.


CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.

At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.


CoW is probably a good idea regardless, yeah. Overcommit is more questionable. Regardless, both ought to be argued based on their own merits. It's unfortunate that both are necessary as a consequence of fork().

I don't think fork() mandates overcommit. OpenBSD doesn't seem to even allow overcommit or have an OOM killer, memory allocations that exceed available capacity fail immediately even if the memory is not touched.

Let's say you have 1GB RAM. You're running a program that occupies 600 MB. Now this program wants to launch a second small program that occupies 1 MB.

You're doing fork + exec.

If you're overcommitting, fork will not reserve another 600 MB, and exec immediately after fork will cause total system usage to be 601 MB.

If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.
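The arithmetic above can be written down as a toy model. This is only a sketch of the accounting argument, not of how any real kernel decides; the function name and the assumption that strict accounting reserves a full copy of the parent are illustrative.

```python
def fork_allowed(total_mb, parent_mb, overcommit):
    """Toy accounting model for whether fork() would be admitted.

    Assumption: with strict accounting the kernel must reserve a full
    copy of the parent's writable memory for the child; with overcommit
    it reserves nothing up front because pages are shared copy-on-write.
    """
    reserved_for_child = 0 if overcommit else parent_mb
    return parent_mb + reserved_for_child <= total_mb


# The example from the comment: 1 GB of RAM, a 600 MB parent.
print(fork_allowed(1024, 600, overcommit=True))
print(fork_allowed(1024, 600, overcommit=False))
```

With overcommit the fork is admitted; with strict accounting the 1200 MB reservation exceeds the 1 GB available and it is refused.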


I think that on Unixes without overcommit, people allocate massive amounts of swap so that fork never fails.

> Let's say you have 1GB RAM. You're running a program that occupies 600 MB. Now this program wants to launch a second small program that occupies 1 MB.

> You're doing fork + exec.

This is the clear problem: you don't want another process that's a duplicate of the current one, that's just a detail of what you actually want: a 1 MB process. Right now it's a badly leaky detail which you're forced to work around.


> If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.

Does this accounting apply to vfork as well?


As I understand it, the whole purpose of vfork is to avoid that problem. So vfork does not copy memory in any way and you're not allowed to do anything but exec after vfork. But at this point a question arises: why not call it `vfork_and_exec` and get rid of the undefined behaviour? Or choose a better name like `CreateProcessW`, hehe.

In the manpage for vfork, Linux in particular:

> As with fork(2), the child process created by vfork() inherits copies of various of the caller's process attributes (e.g., file descriptors, signal dispositions, and current working directory); the vfork() call differs only in the treatment of the virtual address space, as described above.

so it seems Linux does define the behavior of vfork, but if you rely on it, your code won't be portable to other POSIX systems

https://man7.org/linux/man-pages/man2/vfork.2.html


Correct: it does NOT apply to vfork().

> The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1

It may not be slow, but for the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and does typically unnecessary checks whether refcounts became zero). A combined fork/exec syscall can avoid that work.

On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.

That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.

I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.


Another possible design is instead of forking the current process, you create a new empty process, then the parent calls syscalls to set up the new process, and eventually call exec on the child process. That does mean you either need new syscalls for that, or adapt existing syscalls to take a pidfd as an argument. That also solves some other problems with fork/exec where the default is to inherit a lot of things you probably don't want. With this, you can opt in to inheritance instead of having to opt out.

Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fork), but has its own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on Linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...


> you create a new empty process, then the parent calls syscalls to set up the new process ...

That does seem like a much better design to me. But I wonder if that was considered way back at the dawn of computing and rejected for good reason?

> I think this is technically possible on linux, but there isn't a readily available interface for it.

Yes there is, see `man clone`. POSIX and glibc are quite different from the kernel in this regard. AFAIK under linux there are just threads of execution that might or might not share various namespaces and memory mappings. That said, the kernel does place a few artificial restrictions on what combinations are allowed in order to (as I understand it) guard against the unintended exercise of entirely untested combinations that serve no known practical purpose.

The practical problem is that if you start doing as you please with the various namespaces and mappings you quickly become incompatible with glibc and by extension most likely the majority of the dynamic libraries available on your system.


https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c...

Though I want a posix_spawn-as-a-system-call approach as well / instead of that.


I remember reading about an OS where processes weren’t basic building blocks. Instead it had a syscall to create an address space and to create a thread in an address space.

Create a thread in your own address space, and your process becomes multi-threaded. Create an address space, load some code in it, and create a thread there, and you fork/exec-ed.

In my memory, that OS was MACH, but Google doesn’t confirm that for me.


Syscalls aren’t all that cheap either.

io_uring taught us that if syscalls are expensive, queue them up in a buffer with one syscall to transfer the thread to the os to process it. So, queue up the new process mutations in a buffer with a single syscall to process all of them in a batch. This model should have replaced repetitive syscalls across the kernel years ago.

This is true, but these methods don't increase the number of syscalls you need to make.

In addition to what you said: forking from a process running on multiple cores is slow once you have to mark all pages as read-only and shoot this out to all cores. TLB synchronization is super expensive. Unix originally didn't support threads (want concurrency? just fork!) but with modern multicore that's clearly unsustainable.

With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.

vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.


vfork() helps a LOT. The restrictions on what you can do on the child-side of vfork() are pretty much the same ones as for fork() + you must not do anything to damage the stack frame of the vfork() caller (i.e., you can't return).

> the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions

That's a lot more restrictive. You can't use local variables, or call any functions other than _exit or execve. On linux specifically, I _think_ those restrictions are more relaxed and you can call async-signal-safe functions, however I'm not entirely clear on how relaxed that is, and as far as I understand that isn't portable.


But some of that is nonsense and incorrect. You can very much use local variables, and you'll find tons of vfork()-using code that does that and calls plenty of async-signal-safe functions.

The real restrictions are:

  - you can't damage the function call frame
    of the caller of vfork(), thus you can't
    return from it

  - you may only call async-signal-safe
    functions on the child side of vfork()
That's basically it. Yes, you'll want to call execve(2) or _exit(2) before long, but there is no time limit as to that, it's just that the whole point of calling vfork() is to make it real cheap to spawn a process, which means ultimately calling execve(2), with _exit(2) being what you do if execve(2) fails (e.g., because of ENOENT).

There is a ton of vfork()-using code that adheres to these real restrictions and has been working fine for decades. That includes several posix_spawn() implementations, the C shell, etc.

I demand evidence that this part: "the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork()" is remotely true. That evidence must be of the form of bug reports that were accepted and which stand to scrutiny.

I've never found any such evidence. Have you?

Meanwhile I have a proof by existence that vfork() is safe when used much more liberally than you say it may be used.

> You can't use local variables, or call any functions other than _exit or execve.

There are other async-signal-safe functions, and they get used routinely by posix_spawn() and other code to do child-side setup before execve(2), including: I/O redirection, process group setup, signal handling changes, etc.
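As an illustration of that kind of child-side setup, here is a small Python sketch using os.posix_spawn with file actions to redirect the child's stdout into a pipe. It assumes a Unix system with Python 3.8 or later; the function name and the choice of sys.executable as the program to run are just for the example, and how the C library implements posix_spawn underneath (vfork, clone, or a dedicated syscall) is platform-specific.

```python
import os
import sys


def spawn_with_redirect():
    """Run a child via posix_spawn with its stdout redirected into a pipe."""
    read_fd, write_fd = os.pipe()
    # File actions are applied in the child before the new program starts:
    # duplicate the pipe's write end onto stdout (fd 1), then close the
    # original pipe descriptors so the child keeps only what it needs.
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),
        (os.POSIX_SPAWN_CLOSE, write_fd),
        (os.POSIX_SPAWN_CLOSE, read_fd),
    ]
    argv = [sys.executable, "-c", "print('hello from child')"]
    pid = os.posix_spawn(sys.executable, argv, dict(os.environ),
                         file_actions=file_actions)

    # Parent: drop our copy of the write end, then read until EOF.
    os.close(write_fd)
    output = b""
    while True:
        chunk = os.read(read_fd, 1024)
        if not chunk:
            break
        output += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    return output
```

The redirection is declared up front as data rather than executed as code between fork and exec, which is the main design difference from the vfork()-plus-setup style discussed above.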


Didn't he just say that fork turns out to be comparatively faster than the non-fork samples we get? I.e. Linux spawns processes faster than Microsoft's kernels?

Didn't I just say that "the problem with fork isn't really that it's slow"? It's all the other OS design choices it forces on you if you want it to be fast.

Right, you did. I somehow misread your comment.

We don't have any broadly used non-fork samples. Windows, macOS, and Linux all have fork. So the presence of fork can't be the reason for the performance difference.

(Windows's fork is called ZwCreateProcess)


MacOS has posix_spawn. See https://developer.apple.com/library/archive/documentation/Sy... (yes, that’s an iOS man page. MacOS has the call, too, but I couldn’t find the man page online and it looks identical to me)

I don’t know how they implemented it, though. Under the hood, it could do the equivalent of a fork/exec pair.


XNU is open source; here’s a link into the middle of the implementation, after it’s copied all the necessary attributes of the parent into the new process structure: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...

XNU's posix_spawn implementation is not fork/exec-based. It does roughly what the API suggests it would do.

NtCreateProcess does not implement a forking model. It is analogous to posix_spawn.

If you pass null for the section handle, it shares pages with the calling process, thus implementing a forking model. Or at least the parts of a forking model that some people erroneously believe are responsible for performance differences.

The nice thing about fork+exec is that it's simple and flexible.

To avoid the problems, see roc's comment under the article. Esp use of a zygote process.


One os level thing that is interesting to me is if it would be possible/wise to make an OS based on (concurrent) garbage collection.

How else does consistency work, then?

Only being half facetious here. Maybe you or someone else really has a better take.


What do you mean by consistency here?

Solaris and Windows NT both have fork() and strict accounting by default.

> The problem with fork isn't really that it's slow.

Did someone suggest that it was?


anarazel's comment focuses entirely on performance, indicating that they have an impression that the discussion about why fork is bad is about performance. I'm not entirely sure where this impression came from, as it's not mentioned in rom1v's quote nor a point in the linked paper, "A fork() in the road".

Because that OS's best practice is to use threads.

Traditionally Windows applications that create processes all the time come from UNIX heritage.

Contrary to UNIX, Windows NT was designed with a threads-first mentality, from the get go.

While on UNIX they were added after the fact, and to this day there are gotchas mixing posix threads with signals, fork and exec.


A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher level abstractions like pthreads.

You're right that POSIX semantics get tangled when using threads.


That's actually less accurate, not more. It's a post-hoc revision that conflates Unix with Linux.

The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.

The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.

Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.

Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.


PaulDavisThe1st is saying that the Unix pattern of forking a process (and not calling exec) was an early form of multi-threading (or multi-processing), but unlike threads in NT and later pthreads, they didn't share memory and communication between them required some form of IPC.
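A short Python sketch of that pattern, assuming a Unix system (the names counter and fork_worker are made up for the example): the child is forked without exec, changes a variable in its own copy of the address space, and has to report the result back over a pipe because the parent never sees the change.

```python
import os

counter = 0  # After fork, each process has its own copy of this variable.


def fork_worker():
    """Return (value reported by the child over a pipe, parent's own value)."""
    global counter
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the parent that keeps running the same program.
        try:
            os.close(read_fd)
            counter += 41  # Modifies only the child's copy.
            os.write(write_fd, str(counter).encode())
        finally:
            os._exit(0)

    # Parent: the only way to learn the child's result is some form of IPC.
    os.close(write_fd)
    reported = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        reported += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(reported), counter
```

The child reports 41 while the parent's counter is still 0, which is the "no shared memory, IPC required" behaviour described above.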

Yep, absolutely correct. It was true at the lowest level (the semantics of fork) and it was true at the app/platform design level: in Windows you used threads inside a process, on Unix you used multiple communicating processes.

This obviously changed as pthreads came into being, and at this point, I suspect that the typical use for threads-sharing-memory and threads-not-sharing-memory is the same on most platforms.

A reminder that the task_t data structure describes threads and processes not just in Linux, but earlier Unixen also.


Well, Windows NT isn't the same design as 16-bit Windows, it only shares the name for all practical purposes, and has more influence from OS/2 than from 16-bit Windows.

Which is why I took the effort to explicitly refer to Windows NT on my comment, already expecting some traditional answers from UNIX folks.

Also due to historical reasons POSIX threads are the outcome of every UNIX going their own way implementing threads, finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.


> whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

How are those not simply child processes? I don't understand your use of the word 'threads' here.

Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.


They are child processes.

Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
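You can see the thread group ID from user space. In this Python sketch (assuming Python 3.8 or later; the function name is made up), getpid() reports the same value from every thread, which on Linux is the thread group ID, while threading.get_native_id() reports the kernel's per-thread ID.

```python
import os
import threading


def ids_seen_by_thread():
    """Return (main pid, main native id, worker pid, worker native id)."""
    seen = {}

    def worker():
        # Runs in a second thread of the same process.
        seen["pid"] = os.getpid()
        seen["tid"] = threading.get_native_id()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return os.getpid(), threading.get_native_id(), seen["pid"], seen["tid"]


main_pid, main_tid, worker_pid, worker_tid = ids_seen_by_thread()
print("same pid:", main_pid == worker_pid)
print("same native thread id:", main_tid == worker_tid)
```

Both threads agree on the process ID but have different native thread IDs.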


Actually on Windows a process is a thread with additional information.

The unit of execution is the thread.

On the UNIX world it depends on which UNIX you are talking about.

Linux has a similar model to Windows NT nowadays, hence clone() as key primitive.

Other UNIXes have different approaches.


I worked on the kernel of DEC Ultrix, Mach/BSD and a couple of other early Unixen. The approach in all the ones I worked on was broadly the same.

POSIX threads having problems with signals is, imho, mostly the problem with signals in general. They are pretty poorly designed: https://lwn.net/Articles/414618/

The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

True, but on Windows the approach is then to use COM servers, which have a faster IPC model, and can even serve multiple clients, depending on how the apartment space is configured.

"Faster IPC model" than what? Faster than writing to and reading from a pipe? Faster than POSIX shared memory?

Than UNIX fork/exec model, or calling into Create Process all the time.

Windows has a more rich set of IPC stuff than POSIX, especially since it has a microkernel like design.

If you are going to say it is everything on the same memory space anyway, it isn't.

Optional on Windows 10, and enforced on Windows 11, Hyper-V is always running, and several components including kernel and driver modules are sandboxed into their little worlds.

Several additional sandboxing changes were announced at BUILD.


fork/exec is not an IPC model...

It actually kind of is, hence why you have information about parent/child and get to share memory.

This is how a http server back in the day would share the request context for the child process to reply back.


I would say that pipes and shared memory are the IPC mechanisms? Controlling the state of the exec'd process's file descriptors would count as a way to set up interprocess communication, but once that's done, it's the pipe or SHM that does the actual communication.

The problem with POSIX IPC is that passing file descriptors between processes (other than parent passing to child via fork) is hard. Yes, SCM_RIGHTS can do it, but it is quite error prone and rarely done.
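For reference, this is roughly what SCM_RIGHTS passing looks like through Python's wrappers, socket.send_fds and socket.recv_fds (Python 3.9 or later, Unix only). To keep the sketch short both ends of the socket live in one process, which is an assumption of the example; in real use the receiver would be a different process. The function name is made up.

```python
import os
import socket


def pass_fd_over_socket():
    """Send a pipe's write end over a Unix socket and write through the copy."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    try:
        # send_fds wraps sendmsg() with an SCM_RIGHTS control message.
        # At least one byte of ordinary data must accompany the descriptors.
        socket.send_fds(left, [b"x"], [write_fd])
        _data, fds, _flags, _addr = socket.recv_fds(right, 16, 1)

        # The receiver gets a new descriptor number that refers to the
        # same open pipe as write_fd.
        received_fd = fds[0]
        os.write(received_fd, b"via passed fd")
        os.close(received_fd)
        return os.read(read_fd, 1024)
    finally:
        os.close(read_fd)
        os.close(write_fd)
        left.close()
        right.close()
```

The raw C interface needs manual cmsg buffer sizing and parsing, which is where most of the error-proneness mentioned above comes from; the wrapper hides that part.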

Every single Wayland and GPU-accelerated X11 app does that all the time.

That's like comparing apples and oranges. When tooling is tied to a platform, you're adding in the entire platform to the comparison.

Mozilla implemented an alternative to COM, called XPCOM. XP here means cross platform. Perhaps you could compare against that to take the platform out of the equation.


A one-shot process is easier to build and reason about than an event-driven server (speaking as someone who has written plenty of both).

If you want the isolation features of a separate process, you can’t substitute it with a single multithreaded COM server process.

.NET tried this with app domains, which are now deprecated.


App Domains were in process, which isn't what I am talking about with outproc COM.

Also App Domains are partially back in .NET Core, isolation features aren't there, but code unloading is, via AssemblyLoadContext.


My point is that “just write a COM server” is not an answer to the problem of “I want each work item to be segregated from each other.”

the only difference between a thread and a process on linux is how many structures they share. the function is identical.

Agreed, however not all UNIXes are like Linux.

Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

Windows NT was never designed with pre-386 machines in mind. That was the territory of the old DOS+Windows. Windows NT from the get-go was for machines with page-based virtual memory.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...


WinNT 3.5 was a solid offering.

This is not true. NT never had fork, was always based on the assumption of an MMU and Dave Cutler was a well known fork hater in the 80s long before this paper came out and made it cool to be so. By the time Windows 95 was out, the baseline was 386 with an MMU. CreateThread was initially designed for NT in 1993 though (which didn’t support pre-386 CPUs).

As mentioned elsewhere on this page, Windows NT had fork from the start. Vide NtCreateProcess and what happens if an image file is not explicitly supplied.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...


NtCreateProcess was not a public Windows API. NT was flexible, that’s not what was being discussed, which should have been clear from the context.

NtCreateProcess doesn’t accept an image file parameter.

You haven't read the doco. I did point to some. The image file is supplied (or not) via the section object.

Think it through. Windows NT supported fork from the start in its POSIX subsystem, that subsystem was layered on top of the Native API, and this is the Native API mechanism that the POSIX subsystem employed. Although it took until Gary Nebbett for someone to publicly show how, even though people knew informally back in 1993.


NT performed unnatural acts to implement fork semantics for the POSIX subsystem.

NT was designed to be platform-agnostic, and its original target was the DEC Alpha. Its process model owes nothing to pre-386 CPUs. The WinAPI CreateProcess function is a layer atop NtCreateProcess, so that is where the pre-386 heritage lives. But even the WinAPI process model changed significantly with 32-bit Windows.

No.

https://en.wikipedia.org/wiki/Windows_NT#Development

Windows NT was developed on various different CPUs before the Alpha was a thing. When it was released in 1993, it was released for three CPUs: IA-32, MIPS, and Alpha.


Sorry, I had conflated Windows NT development with development of 64-bit Windows as told by Raymond Chen: https://learn.microsoft.com/en-us/previous-versions/technet-...

Raymond also says elsewhere that most WinNT engineers did development on i386, but doesn’t explicitly say what time period he is describing: https://devblogs.microsoft.com/oldnewthing/20250513-00/?p=11...


Windows NT!

Misread on purpose to make a point?


I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++

That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.

Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.


Okay but people don't claim that copying the instruction pointer (a single machine register) is the reason for any speed difference. They claim it's due to the memory sharing. And that's easily disproven since you can share pages, just like on Linux, simply by passing null for the section handle, yet there's still a performance difference.

Why does it matter which prefix I used? They both point to the same routine so my point applies either way.


It's a completely uncontroversial fact that NT does implement fork(). Turn to page 183 of Helen Custer's "Inside Windows NT" and you will read about it.

> Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare -- otherwise this would have been noticed much sooner!

It's actually pretty easy to get compilers to use those, you mainly need a bunch of narrow accesses to neighboring memory. The oodle post contains a godbolt link to pretty ordinary C code triggering this.

I'd guess that you also need some other conditions (multiple in flight stores, high boost speeds) to trigger this.


> It's not the default (read committed is) and I never saw serializable being set in actual production systems.

It's not the common mode of deployment, but it's definitely in prod use.

> You can do it, but then you have to be able to retry all of your transactions, including read.

Pure read transactions shouldn't need to be retried in postgres due to serialization errors. You need to have read-write dependencies for that.

That's not to say that effectively read only transactions aren't affected by serializable, you do need to record the necessary metadata for the serialization logic to work.

FWIW, if you know your transaction is read only and long running, you can start a transaction with START TRANSACTION READ ONLY DEFERRABLE, which makes the start transaction slower, but then does not need to do any work related to serializable while the transaction is running.


The GOT has to be initially writable regardless of ifunc, even with relro, to apply relocations.


> It is a crime that postgres isn't able to allocate with 1GB huge pages by changing a config parameter in 2026

It is able to? Configure huge_page_size=1GB?

Support for 2MB pages was added in 2014, for larger pages 2020.

Edit: year details.


Didn't know that, thanks. Sorry for the wrong comment.


I stopped drinking a few years back, after some (unrelated) health stuff. I don't miss wine, beer, that stopped - like for the author - after a relatively short amount of time. But interestingly I still really miss the feeling of a good scotch after a long day. Not being buzzed, but the sharpness mixed with interesting tastes.

My sleep has gotten so much better. I really didn't realize that alcohol didn't affect just the night after I had a drink, but even the next one or two nights.


Have you tried loose leaf green tea? Fermented and aged tea like pu'er has very complex flavors. The caffeine content can be quite low so it's safe to drink in the evening. Found it to be even more fun and interesting than tasting e.g. wine.


and for even lower caffeine content one could try hojicha (high temperature roasted bancha or sencha), higher grades have interesting taste.


I’ve stopped drinking several times for longer stretches and I can feel a difference after a couple months in overall energy levels and sleep. It seems to have some pretty long tail effects that are subtle, but noticeable. If anyone hasn’t tried cutting it out for 3+ months, I would recommend running a little experiment.

My bloodwork improved a lot as well. I had actually completely forgotten I stopped drinking when I had it done, so when the doctor asked what I did to see such an improvement, I had no idea. I had gained weight during this period as well, so the alcohol had a bigger negative impact than the weight… though neither is ideal.


100% on sleep. I drink regularly, and very much enjoy it, but I'm always cognizant of the price I'm going to pay in good sleep.

I've started not drinking at home as simple way to curb consumption without giving it up entirely.


This is what I do 3-4 nights a week. I sleep pretty well if I stick to that and don’t drink a ton in one sitting.


The ability to trivially use custom VM images was quite nice. The amount of CI time spent installing dependencies or copying a cache of installed stuff is nontrivial. Particularly for Windows the time difference is often very substantial. But even for plain Linux, there's no point in running apt-get update && apt-get install for the same set of things in every run (when using containers, cirrus could build them on demand too, with little notational overhead).

Defaulting to throw-away-VMs for everything is also the right choice for something where the threat model includes attackers submitting patches/PRs. I'll never understand why folks were ok with just container separation for that (and often have no separation in runners).


Weirdly enough, I loved coffee from the first time I tried it, at maybe 13. Even though, looking back, it must have been terrible coffee: it was at some vaguely model-UN-like thing our entire class went to on an overnight trip. Obviously not enough sleep was had. A vending machine (in the late 90s) provided coffee...


Yes, I also first tried coffee in England when I was 13, and it was like a revelation. I understand that beer and cigarettes are an acquired taste, they tasted terrible, but coffee was love at first sip.


> it must have been terrible coffee

Douglas Adams nailed the quality of tea from a vending machine, "almost, but not quite, entirely unlike tea", and that era of coffee machines weren't much better at coffee.

