Here's my experience developing a JVM lightweight thread library[1]:
1. For the scheduler, we use the JDK's superb and battle-tested ForkJoinPool (developed by Doug Lea), which is an excellent work-stealing scheduler, and continues to improve with every release.
2. For synchronization, we've adapted java.util.concurrent's constructs (we use the same interfaces, so no change to user code is needed) to respect fibers, but users are expected to mostly use the Go-like channels or Erlang-like actors that are both included.
3. As for disk IO, Java does provide an asynchronous interface on all platforms, so integrating that wasn't a problem.
4. Integrating with existing libraries is easy if they provide an asynchronous (callback-based) API, which is easily turned into fiber-blocking calls. If not, ForkJoinPool handles infrequent blocking of OS threads gracefully.
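To illustrate points 3 and 4 above, here is a minimal plain-JDK sketch (no Quasar, ordinary OS threads) of the bridging pattern: the callback-based AsynchronousFileChannel API is adapted into a future that a fiber library could park on instead of blocking an OS thread. The helper name `read` is mine, not Quasar's actual wrapper:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class AsyncRead {
    // Adapt the callback-based async API into a CompletableFuture.
    // A fiber library would suspend the calling fiber on this future;
    // here we simply block the OS thread with join() to show the shape.
    static CompletableFuture<Integer> read(AsynchronousFileChannel ch,
                                           ByteBuffer buf, long pos) {
        CompletableFuture<Integer> f = new CompletableFuture<>();
        ch.read(buf, pos, null, new CompletionHandler<Integer, Void>() {
            public void completed(Integer n, Void att) { f.complete(n); }
            public void failed(Throwable t, Void att) { f.completeExceptionally(t); }
        });
        return f;
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("demo", ".txt");
        Files.write(p, "hello".getBytes());
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(16);
            int n = read(ch, buf, 0).join(); // fiber-blocking in Quasar; thread-blocking here
            System.out.println("read " + n + " bytes");
        }
        Files.delete(p);
    }
}
```

The same adapter works for any callback-taking API: complete the future in the callback, then block (or fiber-block) on it at the call site.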
All in all, the experience has been very pleasant: callbacks are gone and performance/scalability is great. Things will get even better if Linux adopts Google's proposal for user-scheduled OS threads, so that all code will be completely oblivious to whether threads are scheduled by the kernel or in user space.
Regarding performance, Linux does have a very good scheduler (unlike, say, OS X), but while there's little latency involved if the kernel directly wakes up a blocked thread (say, after a sleep or as a response to an IO interrupt), it still adds very significant latency when one thread wakes up another. This is very common in code that uses message passing (CSP/actors), and we've been able to reduce scheduling overhead by at least an order of magnitude over OS threads.
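The thread-wakes-thread latency described here is easy to observe yourself; below is a small plain-Java sketch (not the author's benchmark) that measures OS-thread handoff latency with a SynchronousQueue, the worst case for message-passing code. Absolute numbers will vary by machine and kernel:

```java
import java.util.concurrent.SynchronousQueue;

public class HandoffLatency {
    public static void main(String[] args) throws Exception {
        final int rounds = 10_000;
        SynchronousQueue<Long> q = new SynchronousQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                long total = 0;
                for (int i = 0; i < rounds; i++) {
                    long sentAt = q.take();                 // blocks until the producer hands off
                    total += System.nanoTime() - sentAt;    // time from send to wake-up
                }
                System.out.println("avg wake-up latency (ns): " + total / rounds);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        for (int i = 0; i < rounds; i++) {
            q.put(System.nanoTime()); // each put wakes the blocked consumer thread
        }
        consumer.join();
    }
}
```

With user-space fibers, the equivalent handoff is a scheduler enqueue plus a continuation switch, with no kernel wake-up path involved, which is where the order-of-magnitude difference comes from.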
I would summarize this as follows: if your code only blocks on IO, or blocks infrequently on synchronization, then OS threads are quite good; but if you structure your program with CSP/actors, then user-space threads are the only sensible way to go for the time being.
I saw your Quasar library before, and while I haven't had the chance to use it, I'm excited to try it the next time I need to write event-driven code on the JVM.
As far as disk IO is concerned, the Java APIs are only as good as the underlying OS interfaces. And those are not that great.
I think you're spot on with the assertion that mostly network-bound workloads can benefit from N:M scheduling. Many of the apps that we build nowadays are exactly that.
That is really interesting. If it's not too much trouble to write out, could you explain what causes the latency difference between kernel wake-up and other thread wake-up?
Paul Turner explained this really well at this year's Linux Plumbers Conference. The whole talk is fantastic, but the explanation of what pron is describing in particular (and how it could be improved) starts around 8:39: https://www.youtube.com/watch?v=KXuZi9aeGTw#t=519
I honestly don't know :) I was simply reporting my results experimenting with this (I'll try to write a blog post about it some time in the near future), so I'll defer to those with a deeper knowledge of the Linux kernel.
I have read that the Linux scheduler exploits some heuristics if it can guess how soon a blocked thread will need to be woken up, so this might have something to do with that.
I'm not too familiar with the details, but I think they mention that a thread can specify a callback that will be invoked if it blocks on IO, and the callback can name another thread to switch to.
[1]: https://github.com/puniverse/quasar