> libstd is that much closer to being used in a "bare metal" context. It's still a ways off, but we're getting closer every day!
This is so cool. Most programming language implementations come with a lot of baggage. They assume that the VM can stop the world, or that the VM will be in control of all threading, or that the VM will use global variables, or that the VM can move objects around in memory whenever it feels like it. VMs like this are like demanding houseguests that insist that everything conform to their expectations. VMs like this are not easy to embed and do not "play nicely" with other software.
Doing the work to factor these assumptions out of the core language/VM/implementation and into libraries is not easy work, but you end up with something so much more versatile. Lua is this way, and of course C. It sounds like Rust is going this route too, and that's so cool to see.
On the flip side, though, those assumptions are also what lets a runtime actually do things for you. In the limit, a runtime that assumes nothing is simply machine language. Even C has a runtime, whose assumptions can be violated quite thoroughly by other languages.
The less the language actually specifies, the more likely you are to encounter the situation where two Rust libraries can't work together because they actually work under Rust-Alpha (some twiddling of the knobs) and Rust-Beta (some other incompatible twiddling of the knobs).
I don't know that this is a problem for Rust yet, but it is the first announcement from the Rust team which has actually dropped my level of interest in the language. Is there recognition that this is a tradeoff and not something you can think of as purely a "feature"? Do they realize that this was a huge step towards potential fragmentation of the language? (Not just in this particular decision, either, but the general mindset that leads towards this decision leads to fragmentation.)
> The less the language actually specifies, the more likely you are to encounter the situation where two Rust libraries can't work together because they actually work under Rust-Alpha (some twiddling of the knobs) and Rust-Beta (some other incompatible twiddling of the knobs).
We are explicitly trying to avoid this. Libraries should be able to work in M:N or 1:1 mode, or some combination of the two, without having to rewrite code. This change was carefully designed to allow this.
> Do they realize that this was a huge step towards potential fragmentation of the language?
We are very much intending to minimize fragmentation. Projects like alternative standard libraries are discouraged for this reason.
This strikes me as a pragmatic decision driven by wanting to reduce complexity and be interoperable with 3rd party libraries (mostly C code). If you read my other comment here, I outlined the problems I ran into with the N:M threading model (although mostly in C land). It sounds like Rust ran into the same issues.
I'm beginning to think that you can't do N:M well unless you take the approach that language X itself is the platform and everything needs to be done in language X. Java has taken this approach (yes, I know there's the JNI -- we all know how fun it is to use -- but the Java community is pretty bent on writing almost everything in Java). Go seems to be going that direction by slowly rewriting all of the standard library, runtime and compilers in Go. I don't know Erlang's stance on this.
It looks like Rust is taking the approach of we want to be a systems language AND we want to be interoperable with tons of existing 3rd party code (which implies mostly other C libraries).
P.S.: Nice to see you commenting here, Jeremy. It's been a long time since we talked (CUDA days).
Erlang's stance is probably irrelevant to anyone in the Go/Rust/C/C++ etc community; it does so many things differently that its experiences are unlikely to be useful. I'm pretty sure it's M:N, but, for instance, it also trivially has a NOT-stop-the-world GC, which is something that not many languages can say. Not only that it isn't "stop the world" but that it's also fairly trivially so. It doesn't have synchronization problems because it basically doesn't have synchronization. (In the OS sense. Of course your user code may have various sync concerns, but there's no OS-level race conditions, or memory race conditions, etc.) It pays for these characteristics, but also reaps various benefits.
(It's a pity that I don't know of any great single chunk of documentation on the Erlang VM I can point to; links solicited. It's such a different beast. Someone interested in learning about how it's different without necessarily learning Erlang would probably better just studying the VM itself.)
> Erlang's stance is probably irrelevant to anyone in the Go/Rust/C/C++ etc community; it does so many things differently that its experiences are unlikely to be useful. I'm pretty sure it's M:N, but, for instance, it also trivially has a NOT-stop-the-world GC, which is something that not many languages can say.
I don't think so. If Rust has a single large shared heap for all processes, then the approach is different. Rust can avoid GC, and it can also implement proper incremental GC. But none of those things are as trivial as in Erlang, where you can use stop-the-world GCs and still avoid pause times.
> If Rust has a single large shared heap for all processes, then the approach is different.
Rust does not use one shared heap. Tasks cannot share memory (except through ARCs, which manage their own memory through atomic reference counting and so avoid garbage collection pauses). Garbage collection is completely local to a task, so while one task is collecting garbage other tasks are free to run.
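The atomic reference counting that ARCs rely on can be sketched in a few lines of C (a hypothetical `arc_t` type -- Rust's actual implementation is more involved, but the deterministic, pause-free reclamation works the same way):

```c
#include <stdatomic.h>
#include <stdlib.h>

// Sketch of atomically reference-counted shared data: no tracing GC is
// involved, so dropping a reference never pauses other tasks.
typedef struct {
    atomic_int refs;
    int value;
} arc_t;

arc_t *arc_new(int value) {
    arc_t *a = malloc(sizeof *a);
    atomic_init(&a->refs, 1);
    a->value = value;
    return a;
}

arc_t *arc_clone(arc_t *a) {
    atomic_fetch_add(&a->refs, 1);   // another task now holds a handle
    return a;
}

// Returns 1 if this drop freed the object.
int arc_drop(arc_t *a) {
    if (atomic_fetch_sub(&a->refs, 1) == 1) {
        free(a);                     // last reference: deterministic free
        return 1;
    }
    return 0;
}
```

The point is that reclamation happens at a known moment (the last drop), so no task ever has to stop while another one collects.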
Some of it is just performance; there's an inevitable performance loss on copying. I've had some trouble with the binary-sharing routines still leaking on me, though that got better as the VM improved. Some of it is a bit more theoretical, such as the way Erlang basically has no type system, which I believe is so that everything has an unambiguous serialization used for cross-node communication; but I believe more modern languages have more flexible solutions.
Really, though, it's just that "nothing is free"; it's not as if these decisions are so bad that they are objectively wrong, it's just, nothing is free.
On the contrary, this decision was solely motivated by the desire to reduce fragmentation. The interfaces libstd provides can work under almost any runtime, and (mostly, but definitely in safe code) don't leak implementation details. Needing a different standard library just to have a different runtime model is unacceptable. A lot of work has gone and is going into ensuring that we do not fragment. We know that fragmentation will kill us.
The C runtime consists of stuff like malloc and free. Newer versions of C also require support for thread-local storage, for example. It is not a lot, but there definitely is a small runtime called libc.
Of course, the C standard also defines the freestanding variant, but then there is not even a main() anymore.
A standard library is not the same as a runtime. A library is something you explicitly call out to, and if you do not call it you can avoid linking it at all. A runtime inserts calls to itself in a way that cannot reasonably be avoided, and must therefore be linked/loaded by every program.
That little wrapper is a runtime (after all, what does crt stand for?); depending on the platform, it's certainly not so negligible that you can say it doesn't exist.
It is not a runtime in the ways that matter. After it has transferred control to the user program, it is gone completely; it never inserts itself into the program's execution. It is more like the loader or dynamic linker of the OS than like the runtime of other languages. If you have your own mechanism for transferring control to the C program you don't need the main() wrapper at all.
Half a year ago: "I wouldn't be surprised to see Rust at least abandon M:N soon though once they start really optimizing performance."
With Rust I'm sure they'll drop M:N, if not now then eventually, like essentially everybody else in CS history. It seems great on a superficial level, but when you really start caring about performance and latency and fairness it's the pits.
My question is when Go will drop it. Imagine how easy it would be to have C call Go code if Go didn't have its own threading and segmented stacks; simply push the arguments and call/jmp. Maybe reference a few objects. None of that allocating a stack, locking a goroutine to a thread, stealing work to another thread, etc. You could even embed Go into a C program instead of having to do it the other way around, and you wouldn't have the problem of adapting existing programs or rewriting them in Go.
But I bet Go will continue with M:N and segmented stacks for a decade because it seems like they actually want it to be difficult to interface with anything else. Like Pure Java, they want you to rewrite everything in Go instead of just using some existing library.
Given that you 'forkIO' functions in the IO monad, i.e. impure functions, what do you mean by that? 'forkIO' in Haskell is essentially the same as dispatching a green thread in any other language, e.g. the same as 'go' in Go.
Haskell/GHC's threads are primarily a concurrency construct, not a parallelism one. You 'forkIO' a bunch of impure functions and communicate between them using channels, STM, etc., exactly the same as in other languages. For parallelism in pure functions there are much better tools: par/seq, parallel strategies, Repa, Data Parallel Haskell, etc.
Erlang has been using M:N very successfully and managed to have good latency and fairness while sacrificing performance. The hard part about M:N is doing it without a VM.
EDIT: To clarify I believe you are too harsh about M:N, and that Erlang is excellent proof that M:N has its use cases.
It proves that there are cases where M:N scheduling is the right choice. But it does not prove if it is the right choice for language implementation X (where X in this case is Rust).
I consider 64-bit machines the final nail in the coffin of N:M for the next decade. Since you can spawn millions of kernel threads, the easier-than-callbacks argument in favor of N:M disappears.
The problem with kernel threads is that they require kernel context. This is usually at least 4-8 kilobytes of memory, which is about 8-16 times as much as an Erlang process...
I really want to be able to use Rust, but then I keep seeing changes that seem to go around and around in circles. I do appreciate that it's very very hard to get things right the first time around and I understand that some experimentation is necessary. But when it comes to Rust it seems like one approach is deemed the correct way, then it's built and used, and like most things it has flaws. Then the complete opposite approach is tried but it runs into a different set of flaws, usually the ones that caused the other approach to be chosen in the first place. The Rust home page describes it as a "practical language" but with all of these back and forth changes it is really really difficult to actually use it. I don't know if I can keep waiting for it. When I've got new code to write I'm just going to have to use a language that I know I can depend on for more than a month or two, like Go or Haskell or Ruby.
Rust is pre-alpha; it's as green as it can be, and it is first and foremost (for the moment) a PL research project. You should not be using it in production, or expecting things to not break between releases.
That said, breaking changes are becoming less and less common. The language syntax has been mostly stabilized and most changes are going into corner cases and managing modularity, and much of the possible experimentation has been fenced to the post-1.0 release.
Which back and forth changes are you referring to?
(Also, yeah, Rust is pre-release and is definitely not suitable for people who don't like making minor changes regularly if they're tracking master (or less minor changes occasionally if tracking the 0.x releases).)
I see the changes in Rust as a positive, not a negative thing. I do understand that it may become a burden for a Rust programmer and is a major hindrance if you'd actually want to go to production with it. But even this fairly major change should not break a lot of existing Rust code.
But this allows Rust to do bleeding edge research instead of finalizing on a strict specification before knowing if the design will work in practice. It is also very refreshing to see programmers admit that their choice was wrong and not insisting on staying with a bad choice that was made earlier.
Consider Rust to be a practical research project, at least for now.
I've also come full circle on N:M / green threads. Now I'm back to thinking they aren't worth the effort / pain. My experience comes from C / C++ land and not Go and Rust, but I think the same lessons apply. In my experience I ended up using a few different frameworks, from libcoro to Mordor to raw swapcontext().
I was initially sold on the N:M model as a means of having event-driven programming without the callback hell. You can write code that looks like plain old procedural code, but underneath there's magic that uses userspace task switching whenever something would block. Sounds great. The problem is that we end up solving complexity with more complexity. swapcontext() and family are fairly straightforward; the complexity comes from other unintended places.
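For the curious, the "magic" is small enough to show. Here's a minimal sketch using POSIX ucontext (the same swapcontext() family mentioned above): `run_demo` plays the role of a trivial scheduler, and the task yields back to it at the point where it would otherwise block. Names like `run_demo` are made up for illustration.

```c
#include <ucontext.h>
#include <string.h>

static ucontext_t main_ctx, task_ctx;
static char stack[64 * 1024];       // stack for the green thread
static char log_buf[64];

// A "green thread": straight-line code that yields to the scheduler
// (here, the main context) wherever it would otherwise block.
static void task(void) {
    strcat(log_buf, "task-1 ");
    swapcontext(&task_ctx, &main_ctx);   // simulate "would block": yield
    strcat(log_buf, "task-2 ");
    // falling off the end resumes uc_link (main_ctx)
}

const char *run_demo(void) {
    log_buf[0] = '\0';
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = sizeof stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task, 0);

    strcat(log_buf, "sched-1 ");
    swapcontext(&main_ctx, &task_ctx);   // run the task until it yields
    strcat(log_buf, "sched-2 ");
    swapcontext(&main_ctx, &task_ctx);   // resume it to completion
    strcat(log_buf, "done");
    return log_buf;                      // "sched-1 task-1 sched-2 task-2 done"
}
```

The control flow interleaves exactly like cooperative threads: scheduler, task, scheduler, task, done. Everything hard comes later, when you try to do this across many OS threads.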
All of a sudden you're forced to write a userspace scheduler, and guess what: it's really hard to write a scheduler that's going to do a better job than Linux's scheduler, which has man-years of effort put into it. Now you want your scheduler to map N green threads to M physical threads, so you have to worry about synchronization. Synchronization brings performance problems, so now you're down a new lockless rabbit hole. Building a correct highly concurrent scheduler is no easy task.
A lot of 3rd party code doesn't work great with userspace threads. You end up with very subtle bugs in your code that are hard to track down. In many cases this is due to assumptions about TLS (but this isn't the only reason). In order to make it work you now can't have work stealing between your native threads, and then you end up with performance problems and starvation problems.
Next thing you realize is that you're still spending lots of memory on creating stacks for your green threads. Then you realize pthreads lets you create small stacks for your native threads, and then you realize that 8MB stacks don't matter much due to delayed allocation. I think that both Rust and Go have backtracked on segmented ("spaghetti") stacks, since they are a lot of work, require the compiler to generate extra code, and have some bad worst-case behavior (where you can get into a loop of repeatedly growing and shrinking a stack due to a function call in a loop).
The final nail in the coffin for me was disk IO. The fact is that non-network IO is generally blocking, and no OS has great non-blocking disk IO interfaces (Windows is best but it's still not great). First, it's pretty low level, i.e. difficult to use: you have to do IO on block boundaries. Second, it bypasses the page cache (at least on Linux), which in most cases kills performance right there. And in many cases this non-blocking interface will end up blocking (even on Windows) if the filesystem needs to do certain things under the covers (like extend the file or load metadata). Also, the way these operations are implemented requires a lot of syscalls, and thus context switches, which further negates any perceived performance benefits. The bottom line is that regular blocking IO (better yet, mmapped IO) outperforms what most people are capable of achieving using the non-blocking disk IO facilities.
This is clearly based on my own experiences. It looks like the Rust folks had similar experiences. So my hope is that anybody thinks long and hard before going down the N:M rabbit hole. I ended up studying my mistakes, and history is chock full of people abandoning the N:M model. You can read about the history of NPTL (which is the threading model in Linux 2.6+ / glibc) versus NGPT, which was the proposed N:M threading model. The 1:1 NPTL model was simpler and performed better. FreeBSD and Solaris moved from their N:M threading models to 1:1 models.
I think the N:M model is going to keep rearing its head in academic papers about the performance of highly scalable systems, but in the real world its benefits / performance will keep being elusive. The only counterpoint to this is Go, which seems to be making a run for it with goroutines.
I should have titled this comment "How I learned to stop worrying and love plain old threads."
The Erlang VM is a successful implementation of N:M. It addresses some but not all of the issues you mentioned.
* IIRC each new process consumes 284 bytes so it's not "spending lots of memory on creating stacks for your green threads"
* Synchronization is a non-issue as everything is message-passing.
* Disk I/O is blocking, but the VM has dedicated disk threads.
* Dealing with 3rd party code is a problem, which is why relatively few libraries exist for Erlang. However, there is some hope now that NIFs (native implemented functions, i.e. C code) can interact with the scheduler, i.e. report how much time has been used and return control if necessary.
You are right that "Building a correct highly concurrent scheduler is no easy task." The Erlang scheduler was not an easy task, and has received a ton of work from extremely talented engineers over the course of decades. Definitely worth a look to see how N:M can be made to work well.
C integration is the killer for N:M threading. C libraries do blocking IO, are not wrapped in a VM, are not compiled for segmented stacks, use thread-local storage, etc. You can work around these problems, which increases complexity. For close C integration, you are basically forced to adopt C's view of threading.
This is why the R17 release of Erlang might get "dirty schedulers" for this kind of work. Essentially it gives you background thread pools where such work can be carried out.
Currently, you can call C functions directly, but it is not meant for long-running tasks which can block and so on.
You can create threads with enif_thread_create and build a queue to handle long running tasks (or a pool of threads), and use enif_send to return the result. Not quite as hacky as it sounds, since afaik port drivers rely on message passing to return values to the VM as well. Maybe not ideal, but better than naively bumping reductions. I haven't played with enif_timeslice yet, so I'm not sure how it compares (it's newer than NIF threads).
A lot of the issues surrounding M:N threads are a result of poor operating system support. The central problem is that a syscall blocks a kernel thread even when there are more user level threads to run.
If operating systems supported something like scheduler activations (a 20-year-old technique), then this becomes less of a problem. The gist of it is that every time the kernel thread would block, or gets scheduled, instead of returning back to where it was executing, it upcalls into a user-level scheduler which can then choose to schedule user threads. It's a shame that this technique isn't more common.
Because if I use M:N threading, then logically one of my user threads blocked and a different one should run during my timeslice. The kernel, however, is unaware of how I use the kernel thread and will block it, believing that I cannot proceed.
1:1 threading lacks this problem but operations such as creating threads or deleting them require syscalls and are therefore relatively expensive.
Here's my experience developing a JVM lightweight thread library[1]:
1. For the scheduler, we use the JDK's superb and battle-tested ForkJoinPool (developed by Doug Lea), which is an excellent work-stealing scheduler, and continues to improve with every release.
2. For synchronization, we've adapted java.util.concurrent's constructs (we use the same interfaces so no change to user code) to respect fibers, but users are expected to mostly use Go-like channels or Erlang like actors that are both included.
3. As for disk IO, Java does provide an asynchronous interface on all platforms, so integrating that wasn't a problem.
4. Integrating with existing libraries is easy if they provide an asynchronous (callback-based) API, which is easily turned into fiber-blocking calls. If not, then ForkJoinPool handles infrequent blocking of OS threads gracefully.
All in all, the experience has been very pleasant: callbacks are gone and performance/scalability is great. Things will get even better if Linux will adopt Google's proposal for user-scheduled OS threads, so that all code will be completely oblivious to whether the threads are scheduled by the kernel or in user space.
Regarding performance, Linux does have a very good scheduler (unlike, say, OS X), but while there's little latency involved if the kernel directly wakes up a blocked thread (say, after a sleep or as a response to an IO interrupt), it still adds very significant latency when one thread wakes up another. This is very common in code that uses message passing (CSP/actors), and we've been able to reduce scheduling overhead by at least an order of magnitude over OS threads.
I would summarize this as follows: if your code only blocks on IO, or blocks infrequently on synchronization, then OS threads are quite good; but if you structure your program with CSP/actors, then user-space threads are the only sensible way to go for the time being.
I saw your Quasar library before, and while I haven't had the chance to use it, I'm excited to try it the next time I need to write event-driven code on the JVM.
As far as disk IO is concerned, the Java APIs are only as good as the underlying OS interfaces. And those are not that great.
I think you're spot on with the assertion that mostly network-bound workloads can benefit from N:M scheduling. Many of the apps we build nowadays are exactly that.
That is really interesting. If it's not too much trouble to write out, could you explain what causes the latency difference between kernel wake-up and other thread wake-up?
Paul Turner explained this really well at this year's Linux Plumbers Conference. The whole talk is fantastic, but the explanation of what pron is describing in particular (and how it could be improved) starts around 8:39: https://www.youtube.com/watch?v=KXuZi9aeGTw#t=519
I honestly don't know :) I was simply reporting my results experimenting with this (I'll try to write a blog post about it some time in the near future), so I'll defer to those with a deeper knowledge of the Linux kernel.
I have read that the Linux scheduler exploits some heuristics if it can guess how soon a blocked thread will need to be woken up, so this might have something to do with that.
I'm not too familiar with the details but I think they mention that a thread can specify a callback that will be called if it blocks on IO, and the callback can specify another thread to switch to.
I'm working on my own userspace concurrency framework (early draft intro: http://team.s4y.us/) and having fun with this stuff.
> A lot of 3rd party code doesn't work great with userspace threads.
Agreed, especially since third-party code often calls standard library functions (`connect()`, `read()`, etc.) instead of the magic ones that are aware of your concurrency model. At OkCupid, we have our own database driver and RPC implementations that know how to use events for networking, and it sucks to have them in our codebase.
> The final nail in the coffin for me was disk IO.
Why not make it "somebody else's problem"? My thing uses libuv, which provides a common API and uses a thread pool or new async APIs under the hood. At OkC most filesystem access transparently goes through a separate process — it's just network access from the app's point of view.
I guess what I'm trying to say is that if you can solve the problem once, you can make a nice abstraction on top of it and never think about it again (unlike dealing w/ third-party code and scheduling).
Either way, I'd love your input on what I'm doing so far, you have way more experience.
Looking at your project, did you ever take a look at the Mordor C++ library (https://github.com/mozy/mordor)? It looks like it predates your project by quite a bit.
It looks like you have a fun project on your hands and a lot of learning.
My word of caution about having different schedulers for network / disk IO is that you end up creating latency. Depending on what you're doing, this may or may not be an issue in your application. If you're sensitive to that, a large number of small IO requests will quickly expose it. Then you're stuck trying to optimize this away (trading problems).
If your edge application does filesystem calls via some network abstraction, you might be in a better place. If you're doing a network filesystem, I recommend you check out Ceph. It provides an in-kernel client (shameless plug: I contributed code there), which is nice because you can rely on the OS to cache things for you. It also provides a client library, and the underlying RADOS object store library has non-blocking access. Check it out.
Don't count Go out yet. Its green thread method works really well, and it's moving to contiguous stacks next release. I've never run into any problems with this setup as long as I'm smart about control.
I can imagine that trying to replicate this myself in C would be an absolute nightmare, so I trust these more experienced folks, and they've not let me down yet...
It has certainly had kernel support for a long time, but there's no evidence that it is significantly better or worse. If Oracle uses it then it is probably good. Almost no one else uses the Linux AIO support.
Oracle uses AIO on most operating systems. It implements its own block and buffer cache, so the kernel page cache just gets in the way. Since it does its own block caching, the block size and alignment limitations are not really a problem for it. KVM also uses AIO for similar reasons (the guest OS has its own page cache, so we want to avoid double caching).
AIO on Linux was basically designed for Oracle. Only a few classes of apps can really benefit from it. It also doesn't really guarantee to be non-blocking, as many filesystem metadata operations do block (read the long threads on the Linux mailing lists).
Unless you're in a special class of apps that can live with the AIO limitations, you have to stick to blocking IO. Otherwise you're stuck rewriting the page cache, and it's unlikely that you'll do a better job than the one built into the OS.
I think Solaris has similar limitations to Linux. Windows has a better API with its async overlapped file API, and even there various NTFS metadata operations can block.
I went through the callback-hell problem myself and came to the conclusion that swapcontext is the only way to go. I am starting a large personal project that will be entirely based on the swapcontext model (but using boost::coroutine), so I would like to learn more from your experience before I invest in possibly the wrong model. Since you haven't explained your solution in detail, I have a question:
Did you use multiple N:M schedulers (each owning an isolated group of os-threads) or your entire process uses a single N:M threading library that owns all os-threads of the process?
Here is what I am doing:
My project does a lot of disk and network IO. So, IMO, being able to keep the CPU, disk and network busy is my design goal. Having a single N:M scheduler that handles all threads makes it very difficult to understand/analyze/manage the system's performance, so I made the choice of having multiple N:M schedulers (that communicate through message passing), each dealing with separate resources of the system. For example, the os-threads of an N:M cpu-scheduler will only do compute work, and when disk (or network) IO is necessary, the greenlet/task/whatever will be queued into the disk-io-scheduler (or network-io-scheduler). When the os-threads of the disk-io-scheduler (or network-io-scheduler) finish the IO, they queue the greenlet/task back into the cpu-scheduler.
This, IMO, makes managing/analyzing/tuning the system's performance and complexity a little easier. For example,
> All of a sudden you're forced to write a userspace scheduler, and guess what: it's really hard to write a scheduler that's going to do a better job than Linux's scheduler, which has man-years of effort put into it. Now you want your scheduler to map N green threads to M physical threads, so you have to worry about synchronization. Synchronization brings performance problems, so now you're down a new lockless rabbit hole. Building a correct highly concurrent scheduler is no easy task.
Agreed, writing a good scheduler that deals with mixed workloads is very challenging. In my model, I am hoping that a poor scheduler wouldn't be an issue because the os-threads in the cpu-scheduler are only competing for cpu-bound tasks; similarly, the os-threads of the io-scheduler are only competing for io-bound tasks. The real scheduling between IO and CPU operations is still left to the kernel (because the kernel decides which scheduler's os-thread to run). From the kernel's point of view there are threads that only do IO work and threads that only do CPU work. I am not yet sure whether kernel schedulers are optimized for such workloads or not.
> A lot of 3rd party code doesn't work great with userspace threads. You end up with very subtle bugs in your code that are hard to track down. In many cases this is due to assumptions about TLS (but this isn't the only reason). In order to make it work you now can't have work stealing between your native threads, and then you end up with performance problems and starvation problems.
My design solution to this is to segregate 3rd party code into a fixed set of threads that never participate in the N:M threading library. Would that solve the above issues?
> The final nail in the coffin for me was disk IO. The fact is that non-network IO is generally blocking, and no OS has great non-blocking disk IO interfaces (Windows is best but it's still not great). First, it's pretty low level, i.e. difficult to use: you have to do IO on block boundaries. Second, it bypasses the page cache (at least on Linux), which in most cases kills performance right there. And in many cases this non-blocking interface will end up blocking (even on Windows) if the filesystem needs to do certain things under the covers (like extend the file or load metadata). Also, the way these operations are implemented requires a lot of syscalls, and thus context switches, which further negates any perceived performance benefits. The bottom line is that regular blocking IO (better yet, mmapped IO) outperforms what most people are capable of achieving using the non-blocking disk IO facilities.
Yes, a lot of disk/filesystem operations do not have non-blocking equivalents. This is one of the main reasons for having a separate disk-io-scheduler (with its own thread pool) in my design. In my model, any blocking-only operation is made asynchronous by queuing the operation into a dedicated IO thread pool and letting it inform me when the operation is completed. This gives me flexibility in tuning the IO parallelism to my needs. For example, an SSD can handle more concurrent operations than an HDD, so I can create an ssd-io-scheduler with more threads than an hdd-io-scheduler.
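The "queue the blocking operation to a dedicated IO thread and get notified on completion" handoff can be sketched like this in C (a single hypothetical worker thread standing in for the pool; a real io-scheduler would reuse worker threads and run other green threads instead of waiting):

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

// Sketch: make blocking IO "asynchronous" by handing it to a dedicated
// IO thread. The caller submits a request, the IO thread performs the
// blocking fopen/fread, and the caller is signalled on completion.
typedef struct {
    const char *path;
    char buf[256];
    size_t nread;
    int done;
    pthread_mutex_t mu;
    pthread_cond_t cv;
} io_req_t;

static void *io_worker(void *arg) {
    io_req_t *req = arg;
    FILE *f = fopen(req->path, "rb");   // blocking, but on the IO thread
    size_t n = f ? fread(req->buf, 1, sizeof req->buf - 1, f) : 0;
    if (f) fclose(f);
    req->buf[n] = '\0';
    pthread_mutex_lock(&req->mu);
    req->nread = n;
    req->done = 1;                      // completion notification
    pthread_cond_signal(&req->cv);
    pthread_mutex_unlock(&req->mu);
    return NULL;
}

// Submit a read and wait for completion. A real scheduler would run
// other green threads here instead of blocking on the condvar.
size_t async_read(const char *path, char *out, size_t outsz) {
    io_req_t req = { .path = path, .done = 0,
                     .mu = PTHREAD_MUTEX_INITIALIZER,
                     .cv = PTHREAD_COND_INITIALIZER };
    pthread_t t;
    pthread_create(&t, NULL, io_worker, &req);
    pthread_mutex_lock(&req.mu);
    while (!req.done) pthread_cond_wait(&req.cv, &req.mu);
    pthread_mutex_unlock(&req.mu);
    pthread_join(&t, NULL);
    snprintf(out, outsz, "%s", req.buf);
    return req.nread;
}
```

The compute threads never issue a blocking syscall themselves; only the size of the IO pool limits how much blocking work is in flight, which is exactly the knob you describe tuning per device.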
It would be great if you could share your thoughts on my analysis. Thanks.
I haven't used boost::coroutine, but I have experience working with Mordor (https://github.com/mozy/mordor), with swapcontext directly, and with implementing my own swapcontext. Why implement your own swapcontext? Well, the default one in glibc on Linux can be a bottleneck if you're going to be calling it frequently: it makes the sigprocmask() syscall, and if you can avoid that, it's a substantial speedup. You can read more about this here: http://rethinkdb.com/blog/making-coroutines-fast/
Another thing that will give you a big performance boost is stack pooling (i.e. not calling mmap for every makecontext). More on that in the rethinkdb article I mentioned. If your scheduler targets multiple OS threads, you should also be careful here to avoid synchronization slowdowns: either some kind of lockless list or per-thread pools.
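The idea looks roughly like this (a sketch with a mutex for clarity; the mutex here is exactly the synchronization point you'd replace with a lockless list or per-thread pools):

```cpp
// Sketch: a free list of coroutine stacks, so mmap/munmap is not paid on
// every green-thread spawn. Illustrative only.
#include <cassert>
#include <mutex>
#include <sys/mman.h>
#include <vector>

class StackPool {
public:
    explicit StackPool(size_t stack_size) : stack_size_(stack_size) {}
    ~StackPool() {
        for (void* s : free_) munmap(s, stack_size_);
    }
    void* acquire() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (!free_.empty()) {
                void* s = free_.back();   // reuse a previously mapped stack
                free_.pop_back();
                return s;
            }
        }
        // Slow path: only map a fresh stack when the pool is empty.
        void* s = mmap(nullptr, stack_size_, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return s == MAP_FAILED ? nullptr : s;
    }
    void release(void* s) {
        std::lock_guard<std::mutex> lk(mu_);
        free_.push_back(s);  // keep the mapping warm for the next green thread
    }

private:
    size_t stack_size_;
    std::mutex mu_;
    std::vector<void*> free_;
};
```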
Now let's talk about 3rd party code. If you have a 3rd party library that internally uses TLS and you swap its context onto a different thread, it's bound to misbehave, and when it does it's usually subtle and hard to debug. So if you're using 3rd party libraries, you either have to audit them (and make sure you didn't miss anything), disable context migration (and risk unbalanced workloads), or have a separate scheduler that only runs those tasks. Pick your poison.
It doesn't even have to be 3rd party code that misbehaves when green threads are migrated. I pulled my hair out for a couple of weeks trying to debug an issue with a call to accept(). It was returning -1, but errno was set to 0. What gives? Well, it turns out that on Linux with glibc, errno is a macro that calls a function to get the address of errno for your thread. And that function is marked with the gcc __attribute__((pure)). So once the address of errno has been calculated in the body of a function, the compiler is free to assume it'll always be that address (it's a pure function without side effects). Here's the sequence:
1. accept() == -1
2. errno == EAGAIN
3. errno = 0
4. scheduler_yield()
5. accept() == -1
6. errno == 0 (although it should be something else)
This will happen on Linux with glibc if your scheduler_yield() call returns on a different OS thread than the one it was called on. So even your own innocent code that doesn't use TLS can break in interesting ways.
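The root cause is that errno is thread-local, so its address is different on every OS thread; a cached address from the old thread points at the wrong slot after migration. A small sketch that makes the per-thread addresses visible (illustrative helper name):

```cpp
// Sketch: errno lives in thread-local storage, so each OS thread sees it
// at a different address. A green thread that caches that address (as the
// compiler may, given the pure attribute on glibc's address lookup) and
// then migrates will read the previous thread's errno slot.
#include <cassert>
#include <cerrno>
#include <thread>

// Returns the address of errno as seen from a freshly spawned thread.
void* errno_addr_of_new_thread() {
    void* addr = nullptr;
    std::thread t([&addr] { addr = (void*)&errno; });
    t.join();
    return addr;
}
```

Comparing that address against `&errno` on the calling thread shows two distinct locations, which is exactly why the cached-address read in step 6 above returns the stale 0.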
If you have very small green threads and a naive stealing scheduler with a mutex, you can be sure that you'll be spending significant time on synchronization. You can get fancier with non-blocking queues and atomic instructions to overcome this.
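One classic lock-free building block for handing tasks between two schedulers is a single-producer/single-consumer ring buffer built on atomics (a sketch of the general technique, not from any particular scheduler):

```cpp
// Sketch: lock-free SPSC ring buffer. One producer thread calls push(),
// one consumer thread calls pop(); no mutex is taken on either path.
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
class SpscRing {
public:
    bool push(const T& v) {
        size_t h = head_.load(std::memory_order_relaxed);
        size_t t = tail_.load(std::memory_order_acquire);
        if (h - t == N) return false;            // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(T& out) {
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t h = head_.load(std::memory_order_acquire);
        if (t == h) return false;                // empty
        out = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::atomic<size_t> head_{0}, tail_{0};
    T buf_[N];
};
```

For a work-stealing scheduler with multiple consumers you'd need something stronger (e.g. a Chase-Lev deque), but the acquire/release pattern is the same idea.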
I did have multiple schedulers for both CPU-bound tasks and IO-bound tasks. I would say that if you're doing disk IO and you're just forwarding the data (versus having to process it), you're better off with non-blocking sendfile() or non-blocking vmsplice() (plus mmap) in your event loop. If you're doing lots of disk IO on an SSD array that can push 2 GB/s, you're going to need lots of IO threads, and the latency of the message passing between the two schedulers is going to add up. Again, this may or may not be a problem in your application.
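For reference, the sendfile() forwarding pattern looks roughly like this (a sketch copying between two file descriptors; note that a regular-file destination requires Linux >= 2.6.33, and in an event loop out_fd would be a non-blocking socket):

```cpp
// Sketch: forward data with sendfile(2) so the kernel moves pages
// directly, without a read()/write() round trip through userspace.
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Copy src to dst via sendfile; returns bytes copied, or -1 on error.
ssize_t sendfile_copy(const char* src, const char* dst) {
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;
    struct stat st;
    if (fstat(in, &st) < 0) { close(in); return -1; }
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }
    off_t off = 0;
    ssize_t total = 0;
    while (off < st.st_size) {
        // sendfile advances off itself; loop in case of short transfers.
        ssize_t n = sendfile(out, in, &off, st.st_size - off);
        if (n <= 0) { total = -1; break; }
        total += n;
    }
    close(in);
    close(out);
    return total;
}
```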
Those are some of my own experiences; they may or may not apply to you, but I hope they help.
I think this is a good decision. A language that aims to compete with C/C++ needs to expose the low-level fundamental building blocks that the hardware and OS provide. Anything high level like a mandatory GC or scheduler will scare away systems programmers.