Unfortunately not much has happened since OLS. Lots of people said they would be willing and able to help but life intervened, I guess. I’m no kernel developer but perhaps I have to start taking this up (this is a threat...).
There is one stream of patches which exist but they are not usable in my opinion. I’m talking about Evgeniy Polyakov’s kevent patches. I send my analysis to lkml but this didn’t result into much (except removing the ring buffer support). There are more fundamental problems with his proposed approach but I don’t get through to him and it appears as if I’m the only one complaining. Andrew keeps saying that this is because nobody else understands the topic. Well, let’s be a bit more specific. I’ll try again to motivate the case, describe how I think the userlevel interface should look like and then deduce something about the kernel interfaces. A bit of the latter Andrew and I did today. There are also network and file AIO aspects to this and then the DMA handling, but I’m restricting myself here to the event handling.
The fundamental premise is that multi-threaded code is absolutely necessary going forward. All processors grow horizontally by having ever and ever larger numbers of execution contexts (cores, hyper-threads, ...). For applications to scale there are two types of approaches:
- Parallelize large chunks of code. This is not a suitable approach for small chunks of code (because the startup, and distribution and collection of results are not free) or for code which is hardly parallelizable. If code can be parallelized people hopefully will start using OpenMP and similar methods. Unfortunately not everything is parallelizable in this form.
- Parallelize individual requests.
A web server is an example where the former approach is not usable but the latter is. Usually each request is easy to handle. Even for dynamically built pages the work is usually small. It makes no sense to parallelize the code. But many requests can be processed in parallel.
So it is logical to achieve parallelism by having many threads (or processes as in Apache). This is what programs do today. But this is not without problems:
- The current network interfaces are not well prepared. If you park all inactive threads in the accept() call to get a new incoming restriction you get a trampling herd effect if there is an incoming connection: all threads get woken up. But only one or a few will find that the accept() succeeded. The same is true for poll() etc calls.
- If the number is requests is high it is a bad approach to increase the number of threads until no connection remains unanswered. The problem is that this adds a lot of administrative overhead in the kernel and processor (context switches). If the processing of the requests uses 100% of the CPU time we end up with a higher total processing time because of the overhead.
- If the CPU time is not 100% we are wasting even more. If a thread does not have 100% utilization (and this is almost always the case) then the threads blocks somewhere, either a system call or a synchronization primitive. But this means there is CPU time available for the thread to do something else like working on another requests. So the total system time increases because of management overhead for no good reason: one thread could do the work needed for two or more requests.
The optimal programming model would therefore use only as many threads as are needed to utilize the system at 100%. This means one or two threads per CPU or core.
To scale with this fixed number of threads and achieve 100% utilization it is necessary to avoid any blocking. If a thread has to perform an operation which might block (such as reading from a socket or waiting on a mutex) the normal call cannot be executed. Instead the thread requests an event to be signaled when the reason for the blocking is gone and while this is going on it starts/continues working on another request.
This is a model of appealing simplicity: scaling is achieved by adding threads into a central loop. Add threads corresponding to the capabilities of the machine not the complexity of the program. All threads are using the same code.
This simplicity in the inner loop does come at a price: code has to be restructured to avoid all blocking calls and the central loop has to be able to wait for events of all types corresponding to operations which can block and are avoided.
This is where my proposal comes in. The new event handling mechanism is the bit in the inner loop. We can define its abilities and properties as this:
- We can wait for all kinds of events corresponding to all the possible interfaces which can block
- Event notification better be fast. Already today with 1Gb/s Ethernet we can create requests faster than they can be processed. This is only to increase with 10Gb/s Ethernet or fabrics.
- Obviously, the interface must be usable in multi-threaded situations. This means multiple threads must be able to wait in the kernel for event. The alternative is to have one waiting and signal a thread in a thread pool that an event is available. This is unnecessarily slow. When multiple threads are used it must be avoided that an unnecessary number gets woken if only one or a few events are available. Ideally we wake one thread per available event.
- Ideally the interface is also usable from multiple processes.
A consequence waking only one thread per available event is that no wakeup must be lost. This is a new concept which does not exist in the POSIX interfaces and the kernel (except the level triggered epoll interface). For this to work we have to have a mechanism to tell the kernel
I cannot handle this event, tell another thread. This is important, for instance, when the system call to wait for the event is canceled (as in thread cancellation). In this case the userlevel function call does not return and the program does not even know the notification is lost. Therefore the system library wrapper the wait-for-event system call needs to tell the kernel about the notification loss.
The requirement for such a system call to request additional notifications has itself a consequence. If the wait-for-event system call returns itself the data for the events instead of just a notification, then this data would have to be pushed back to the kernel as well. This is a high overhead. And more problematic: if there is not sufficient memory to store the event data, what to do then?
It is therefore better to separate the signalling of an event from delivering the data associated with the event. The data can be retrieved using another system call. Or: we can use shared memory. Shared between kernel and userlevel. Or userlevels (plural), see the multi-process requirement above.
If we implement the shared memory mechanism this calls for a ring buffer. Since there can be multiple threads working on the same ring buffer it must be possible to mark a ring buffer entry as worked on and/or at least as finished. The bulk of the ring buffer needs not be modified. In fact, it might be best to not allow the userlevel code to change it at all. This way the kernel care store information in the ring buffer in a form which might cause problems if the values are changed (e.g., pointers of some sort). Not all the fields of a ring buffer entry need to be used or even usable by the userlevel code.
Proposed Kernel Interfaces
This is the current state of my/our thinking. We want to use a ring buffer. The memory must not be required to be pinned down. The memory might only be used among the threads in a process or among several processes. Depending on the needs the application should allocate memory either anonymously or file-backed. The size of the buffer can be chosen by the application.
To create an event queue we pass the address and size of the buffer to the kernel:
int kev_create (void *addr, size_t size, int flags);
We throw in a flags parameter to be on the safe side for future use.
We need some structure in the buffer. At the very least we need to be able to locate the ring buffer and identify the elements of the buffer which are currently active. If we are going to implement CPU ID-based cache line optimization (see next section) we need some sort of mapping from CPU ID to ring buffer entry. And we need to chain the ring buffer entries to reach only the entries for the CPU. Searching for them defeats the purpose.
Ideally we would split the memory area in two pieces. One piece is the ring buffer itself. This part needs not be written to by the userlevel code. See above for the advantages. The writable section is the part where the program(s) provide feedback to the kernel about the entries which have been handled. Here we have two possibilities:
- We use only a front and a tail pointer. The front pointer points to the first unused entry, the tail pointer to the last still-used entry. Everything between tail and front is assumed used even though some entries might actually have been handled already. Everything between front and tail is free and available to receive new events. This approach requires little work from the kernel when queuing new events but it means we might run out of space even if there are empty slots. The latter can be avoided by making it the policy that userlevel code has to evacuate the ring buffer slots ASAP.
- The alternative is to use per-ring buffer-entry accounting. Each entry has a flag associated with it. This means less memory requirement. We can still have front and tail pointers to mark regions about which we have absolute knowledge but the kernel would also try to look for holes and use those entries.
The alternatives do not differ too much. When moving the tail pointer we always have to have knowledge about the individual entries. It's just that in the second case the work is all done by the userlevel code. The kernel has an easier life and we avoid possibly long delays (imagine a huge ring buffer with many thousand entries). Maybe the best argument in favor of the first solution: we only have one pointer is modifiable state (the tail pointer, the kernel maintains the front pointer). This means we potentially can do away with the writable part of the mapping altogether. Just pass a void ** value to kev_create() and maintain this variable in the runtime. This advantage would go away if we need more writable state.
A few words about the memory handling of the ring buffer. I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)). This can be done implicitly in the kernel or explicitly at userlevel. There is therefore no huge memory usage associated with the ring buffers unless the program never frees entries. It is all ordinary anonymous or file-backed memory, subject to existing rlimit restrictions. No need to invent yet more rlimits.
To register delivery of events through an event queued we likely need a number of interfaces. Fortunately one interface already exists: the sigevent structure. I described all this in my paper but some people keep misunderstanding this point. The use of sigevent has nothing whatsoever to do with using signals. It's just a convenient way to request for a whole bunch of existing interfaces that the new method of event delivery should be used. This applies, so far, to POSIX AIO (which might be extended for networking) and POSIX timers.
Other uses will need new interfaces. For instance, signal delivery could be requested through an extension of sigaction(). We define a new flag to be ORed into sa_flags and add int sa_kev to the union already containing sa_handler and sa_sigaction. None of these would be use at the same time.
My biggest worry in the moment is to find a suitable interface for futexes. I'll think about it once the work actually starts.
These are a few notes about implementation details Andrew and I came up with during the discussions. It is not necessarily all usable:
- Ring buffer entries can be annotated with CPU ID (or perhaps even core ID). This would allow the scheduler to chose among the threads which are waiting for events to chose one which is scheduled for the same CPU. At userlevel we have the getcpu() system call and we are therefore able to pick an entry from the ring buffer which matches the CPU. This means no unnecessary cache line transfers.
- To locate the entries for a CPU the ring buffer could have a hash table or something similar which points to the appropriate entry. It would also be possible for the wait-for-event system call to return the index of the ring buffer entry for the current CPU.
- I think the memory for the ring buffer should be either anonymous or backed by tmpfs. Nothing in a persistent file system. We should not in any form or away encourage that people even look at the data in the file. It should be meaningless and should not survive a reboot.