I went through the callback-hell problem myself and came to the conclusion that swapcontext is the only way to go. I am about to start a large personal project that will be based entirely on the swapcontext model (but using boost::coroutine), so I would like to learn from your experience before I invest in possibly the wrong model. Since you haven't explained your solution in detail, I have a question:
Did you use multiple N:M schedulers (each owning an isolated group of OS threads), or does your entire process use a single N:M threading library that owns all OS threads of the process?
Here is what I am doing:
My project does a lot of disk and network IO, so my design goal is to keep the CPU, disk, and network all busy. Having a single N:M scheduler that handles all threads makes it very difficult to understand/analyze/manage system performance, so I made the choice of having multiple N:M schedulers (that communicate through message passing), each dealing with a separate resource of the system. For example, the OS threads of the N:M cpu-scheduler only do compute work; when disk (or network) IO is necessary, the greenlet/task/whatever is queued into the disk-io-scheduler (or network-io-scheduler). When the OS threads of the disk-io-scheduler (or network-io-scheduler) finish the IO, they queue the greenlet/task back into the cpu-scheduler.
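To make the hand-off concrete, here is a minimal sketch of that message-passing arrangement. The `Scheduler` class and its `post` method are my own hypothetical names, not from any library; each scheduler owns its OS threads and a queue of closures, and schedulers talk to each other only by queueing work into one another:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Each N:M scheduler owns its own OS threads and a task queue; the only
// communication between schedulers is posting closures to each other.
class Scheduler {
public:
    explicit Scheduler(int nthreads) {
        for (int i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~Scheduler() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void post(std::function<void()> task) {   // the "message passing"
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;       // shutting down, queue drained
                task = std::move(q_.front());
                q_.pop();
            }
            task();                            // run outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

A cpu task that needs IO would then `post` itself into the disk scheduler, and the disk scheduler would `post` the continuation back when the IO completes.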
This, IMO, makes managing/analyzing/tuning the system's performance and complexity a little easier. For example,
> All of a sudden you're forced to write a userspace scheduler, and guess what, it's really hard to write a scheduler that's going to do a better job than Linux's scheduler, which has man-years of effort put into it. Now you want your scheduler to map N green threads to M physical threads, so you have to worry about synchronization. Synchronization brings performance problems, so now you're down a new lockless rabbit hole. Building a correct highly concurrent scheduler is no easy task.
Agreed, writing a good scheduler that deals with mixed workloads is very challenging. In my model, I am hoping that a poor scheduler won't be an issue because OS threads in the cpu-scheduler compete only for cpu-bound tasks; similarly, OS threads of an io-scheduler compete only for io-bound tasks. The real scheduling between IO and CPU operations is still left to the kernel (because the kernel decides which scheduler's OS thread to run). From the kernel's point of view, there are threads that only do IO work and threads that only do CPU work. I am not yet sure whether kernel schedulers are optimized for such a workload.
> A lot of 3rd party code doesn't work great with userspace threads. You end up with very subtle bugs in your code that are hard to track down. In many cases this is due to assumptions about TLS (but this isn't the only reason). In order to make it work you now can't have work stealing between your native threads, and then you end up with performance problems and starvation problems.
My design solution to this is to segregate 3rd-party code onto a fixed set of threads that never participate in the N:M threading library. Would that solve the above issues?
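The segregation I have in mind looks roughly like this sketch (all names hypothetical): every call into a TLS-dependent 3rd-party library is funneled through one fixed OS thread that never runs migrating green threads, so the library never observes a thread switch:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// One fixed OS thread dedicated to a TLS-using library. The library's
// code only ever runs on this thread, so its TLS assumptions hold.
class PinnedThread {
public:
    PinnedThread() : worker_([this] { run(); }) {}
    ~PinnedThread() {
        post([this] { done_ = true; });   // runs on worker_, so no race
        worker_.join();
    }
    // Run f on the pinned thread and wait for its result.
    template <class F>
    auto call(F f) -> decltype(f()) {
        std::packaged_task<decltype(f())()> task(std::move(f));
        auto fut = task.get_future();
        post([&task] { task(); });
        return fut.get();   // a green-thread version would yield here
    }
private:
    void post(std::function<void()> fn) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(fn)); }
        cv_.notify_one();
    }
    void run() {
        while (!done_) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !q_.empty(); });
                fn = std::move(q_.front());
                q_.pop();
            }
            fn();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;   // declared last: started after the members above
};
```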
> The final nail in the coffin for me was disk IO. The fact is that non network IO is generally blocking and no OS has great non-blocking disk IO interfaces (windows is best but it's still not great). First, it's pretty low level, eg. difficult to use. You have to do IO on block boundaries. Second, it bypasses the page cache (at least on Linux) which in most cases kills performance right there. And in many cases this non-blocking interface will end up blocking (even on windows) if the filesystem needs to do certain things under the covers (like extend the file or load metadata). Also, the way these operations are implement require a lot lot of syscalls thus context switches which further negate any perceived performance benefits. The bottom line is that regular blocking IO (better yet mmaped IO) outperforms what most people are capable of achieving using the non-blocking disk IO facilities.
Yes, a lot of disk/filesystem operations have no non-blocking equivalents. This is one of the main reasons for having a separate disk-io-scheduler (with its own thread pool) in my design. In my model, any blocking-only operation is made asynchronous by queueing it into a dedicated IO thread pool and letting the pool inform me when the operation completes. This gives me flexibility in tuning IO parallelism to my needs. For example, an SSD can handle more concurrent operations than an HDD, so I can create an ssd-io-scheduler with more threads than an hdd-io-scheduler.
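The core of that pattern fits in a few lines. This is only a sketch with a hypothetical name (`submit_blocking`); a real version would enqueue into a fixed-size per-device pool (more threads for the ssd pool than the hdd pool) instead of spawning a thread per operation:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <thread>

// Make a blocking-only call (fsync, metadata ops, ...) "asynchronous":
// run it on an IO thread and deliver a completion instead of blocking
// a cpu-scheduler thread.
void submit_blocking(std::function<int()> blocking_op,
                     std::function<void(int)> on_complete) {
    std::thread([blocking_op, on_complete] {
        int result = blocking_op();  // blocks an IO thread, not a CPU thread
        on_complete(result);         // real code: queue the green thread
                                     // back into the cpu-scheduler here
    }).detach();
}
```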
It would be great if you could share your thoughts on my analysis. Thanks.
I haven't used boost::coroutine, but I have experience working with Mordor (https://github.com/mozy/mordor), with swapcontext directly, and with implementing my own swapcontext. Why implement your own swapcontext? Well, the default one in glibc on Linux can be a bottleneck if you're going to be calling it frequently: it makes the sigprocmask() syscall, and if you can avoid that, it's a substantial speedup. You can read more about this here: http://rethinkdb.com/blog/making-coroutines-fast/
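For readers who haven't used ucontext, here is a minimal swapcontext ping-pong. Every glibc swapcontext() call also saves/restores the signal mask via sigprocmask(), one syscall per switch; register-only switchers (like boost::context's fcontext, or the hand-rolled one in the rethinkdb post) skip it, which is where the speedup comes from:

```cpp
#include <cassert>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int steps = 0;

static void coro_fn() {
    ++steps;                          // runs on the coroutine's own stack
    swapcontext(&co_ctx, &main_ctx);  // yield back to main
}

int run_ping_pong() {
    static char stack[64 * 1024];     // real code: pooled, mmap'd stacks
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;       // where to go if coro_fn returns
    makecontext(&co_ctx, coro_fn, 0);
    swapcontext(&main_ctx, &co_ctx);  // enter the coroutine
    return steps;
}
```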
Another thing that will give you a big performance boost is stack pooling (i.e. not calling mmap for every makecontext). More on that in the rethinkdb article I mentioned. If your scheduler targets multiple OS threads, you should also be careful here to avoid synchronization slowdowns: either some kind of lockless list or per-thread pools.
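The per-thread-pool variant is the simpler of the two; a sketch (malloc standing in for mmap-plus-guard-page, which a real implementation would use):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Stack pooling with per-thread free lists: reusing stacks avoids an
// mmap/munmap pair per coroutine, and thread_local pools avoid any
// synchronization on a shared free list.
constexpr std::size_t kStackSize = 64 * 1024;
thread_local std::vector<void*> stack_pool;

void* acquire_stack() {
    if (!stack_pool.empty()) {
        void* s = stack_pool.back();   // reuse: no syscall at all
        stack_pool.pop_back();
        return s;
    }
    return std::malloc(kStackSize);    // real code: mmap + guard page
}

void release_stack(void* s) {
    stack_pool.push_back(s);           // keep it warm for the next coroutine
}
```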
Now let's talk about 3rd party code. If you have a 3rd party library that internally uses TLS and you swap its context onto a different thread, it's bound to misbehave, and when it does it's usually subtle and hard to debug. So if you're using 3rd party libraries, you either have to audit them (and hope you didn't miss anything), disable context migration (and risk unbalanced workloads), or have a separate scheduler that only runs those tasks. Pick your poison.
It doesn't even have to be 3rd party code that misbehaves when green threads are migrated. I pulled my hair out for a couple of weeks trying to debug an issue with a call to accept(). It was returning -1 but errno was set to 0. What gives? Well, it turns out that on Linux with glibc, errno is a macro that calls a function to get the address of errno for your thread, and that function is marked with gcc's __attribute__((const)). What that means is that once the address of errno is computed in the body of a function, the compiler is free to assume it will always be that address (it's treated as a function without side effects whose result never changes). Here's the sequence:
1. accept() == -1
2. errno == EAGAIN
3. scheduler_yield()
4. errno = 0
5. accept() == -1
6. errno == 0 (although it should be something else)
This will happen on Linux with glibc if your scheduler_yield() call returns on a different thread than the one it was called on. So even your own innocent code that doesn't use TLS can break in interesting ways.
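A defensive pattern that avoids the trap is to copy errno into a local before any yield point and never touch errno across one. Sketch below; `scheduler_yield()` is a hypothetical stand-in for a real scheduler and is stubbed as a no-op here:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>

static void scheduler_yield() {}   // stub: a real one may migrate threads

int accept_retrying(int listen_fd) {
    for (;;) {
        int fd = accept(listen_fd, nullptr, nullptr);
        if (fd >= 0) return fd;
        int err = errno;           // copy BEFORE yielding: after the yield
                                   // we may be on another OS thread and a
                                   // cached errno address would be stale
        if (err != EAGAIN && err != EWOULDBLOCK) return -1;
        scheduler_yield();         // wait until the socket is readable
    }
}
```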
If you have very small green threads and a naive work-stealing scheduler with a mutex, you can be sure you'll spend significant time on synchronization. You can get fancier with non-blocking queues and atomic instructions to overcome this.
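For a flavor of what "fancier" means, here is a sketch of a lock-free (Treiber) run-queue stack built on compare-and-swap instead of a mutex. A production scheduler would also have to solve the ABA problem and node reclamation (hazard pointers, epochs); this sketch sidesteps both by never freeing nodes:

```cpp
#include <atomic>
#include <cassert>

struct Task { int id; Task* next = nullptr; };
std::atomic<Task*> run_queue{nullptr};

void push_task(Task* t) {
    t->next = run_queue.load(std::memory_order_relaxed);
    while (!run_queue.compare_exchange_weak(
        t->next, t, std::memory_order_release, std::memory_order_relaxed)) {
    }   // on failure, t->next was reloaded with the current head; retry
}

Task* pop_task() {
    Task* t = run_queue.load(std::memory_order_acquire);
    while (t && !run_queue.compare_exchange_weak(
        t, t->next, std::memory_order_acquire, std::memory_order_relaxed)) {
    }   // on failure, t was reloaded; retry against the new head
    return t;   // nullptr when the queue is empty
}
```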
I did have multiple schedulers for both CPU-bound tasks and IO-bound tasks. I would say that if you're doing disk IO and just forwarding the data (versus having to process it), you're better off with non-blocking sendfile() or non-blocking vmsplice() (plus mmap) in your event loop. If you're doing lots of disk IO on an SSD array that can push 2GB/s, you're going to need lots of IO threads, and the latency of the message passing between the two schedulers is going to add up. Again, this may or may not be a problem in your application.
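The sendfile() path looks roughly like this sketch: the data goes file to socket inside the kernel without crossing into userspace, and with a non-blocking socket, EAGAIN is where a green thread would yield until the socket is writable again (this sketch just retries):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Forward `size` bytes from a regular file to a socket via sendfile(),
// starting at file offset 0. The kernel advances `off` for us.
bool forward_file(int out_sock, int in_fd, off_t size) {
    off_t off = 0;
    while (off < size) {
        ssize_t n = sendfile(out_sock, in_fd, &off, size - off);
        if (n == 0) return false;            // unexpected EOF
        if (n < 0) {
            if (errno == EAGAIN) continue;   // real code: yield until the
                                             // socket is writable, then retry
            return false;
        }
    }
    return true;
}
```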
Those are some of my own experiences; they may or may not apply to you, but I hope they help.