Ingo Molnar [interview] posted a second version of his syslets subsystem patch set, which offers asynchronous system call support [story]. He noted that the effort is a work in progress and that there are still outstanding issues to be fixed, explaining, "the biggest conceptual change in v2 is the ability of cachemiss threads to be turned into user threads. This fixes signal handling, makes them ptrace-eable, etc," and going on to list numerous fixes since the first release. He added that before releasing a third version of the patch set he will add support for multiple completion rings, add logic to share the 'spare thread' between the rings to further reduce startup costs, and remove the reliance on mlock().
Linus Torvalds commented, "I'm still not a huge fan of the user space interface, but at least the core code looks quite clean. No objections on that front." He referred to earlier comments in which he had reacted strongly to the syslets userland interface, saying, "I dislike it intensely, because it's so _close_ to being usable. But the programming interface looks absolutely horrid for any 'casual' use, and while the loops etc look like fun, I think they are likely to be less than useful in practice. Yeah, you can do the 'setup and teardown' just once, but it ends up being 'once per user', and it ends up being a lot of stuff to do for somebody who wants to just do some simple async stuff." He later noted that he was particularly concerned about the "register" functionality, which Ingo then offered to simplify.
From: Ingo Molnar [email blocked]
To: linux-kernel
Subject: [patch 00/14] Syslets, generic asynchronous system call support, v2
Date: Thu, 15 Feb 2007 17:51:51 +0100

this is the v2 release of the syslet subsystem. This is an interim
release, not all known and pending items are fixed/changed yet - the
tree is still work in progress:

   http://redhat.com/~mingo/syslet-patches/

The biggest conceptual change in v2 is the ability of cachemiss threads
to be turned into user threads. This fixes signal handling, makes them
ptrace-eable, etc. (I've updated the sample userspace code at the URL
above to also do user-space cachemiss processing - just Ctrl-Z the
async-test.c run to trigger it action.)

Things not yet done in v2 and planned for v3:

 - multiple completion rings support
 - share the 'spare thread' between multiple rings, to further reduce
   startup costs.
 - remove mlock() reliance

Changes since v1:

 - FPU support fixed: detach FPU state from kernel thread state
   (implemented by Arjan van de Ven)
 - remove superfluous CLONE_VM from create_async_thread() (noticed by
   Jens Axboe)
 - sys_umem_add() does not ignore -EFAULT of __put_user() (noticed by
   Andrew Morton)
 - use VERIFY_READ instead of VERIFY_WRITE in copy_uatom() (noticed by
   Andrew Morton)
 - move schedule() to tail of loop in cachemiss_loop() (noticed by
   Andrew Morton)
 - added move_user_context() arch op
 - added async_syscall() and recursion protection against re-entry of
   sys_async_exec(), sys_fork()/sys_clone(), etc.
 - added sys_async_thread() call - a user-space thread can thus call
   back into the syslet subsystem and continue cachemiss work.
 - further cleanups in the include files
 - race fixes to sys_async_wait()
 - optimized out the kmalloc()/kfree() of the async_head
 - async_thread structure not on the kernel stack anymore, to allow
   async contexts to run user-space.
 - added support for head_stack and head_eip to enable the initial
   thread/task to run a cachemiss user context too, if it gets turned
   into a cachemiss thread.

As always, comments, suggestions, reports are welcome.

	Ingo
From: Linus Torvalds [email blocked]
Subject: Re: [patch 00/14] Syslets, generic asynchronous system call support, v2
Date: Thu, 15 Feb 2007 09:59:58 -0800 (PST)

On Thu, 15 Feb 2007, Ingo Molnar wrote:
>
> this is the v2 release of the syslet subsystem. This is an interim
> release, not all known and pending items are fixed/changed yet - the
> tree is still work in progress:

I'm still not a huge fan of the user space interface, but at least the
core code looks quite clean. No objections on that front.

		Linus
From: Ingo Molnar [email blocked]
Subject: [patch 06/14] syslets: core, documentation
Date: Thu, 15 Feb 2007 17:52:32 +0100

From: Ingo Molnar [email blocked]

Add Documentation/syslet-design.txt with a high-level description
of the syslet concepts.

Signed-off-by: Ingo Molnar [email blocked]
Signed-off-by: Arjan van de Ven [email blocked]
---
 Documentation/syslet-design.txt |  137 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

Index: linux/Documentation/syslet-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/syslet-design.txt
@@ -0,0 +1,137 @@
+Syslets / asynchronous system calls
+===================================
+
+started by Ingo Molnar [email blocked]
+
+Goal:
+-----
+
+The goal of the syslet subsystem is to allow user-space to execute
+arbitrary system calls asynchronously. It does so by allowing user-space
+to execute "syslets" which are small scriptlets that the kernel can execute
+both securely and asynchronously without having to exit to user-space.
+
+the core syslet concepts are:
+
+The Syslet Atom:
+----------------
+
+The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
+user-space memory, which is the basic unit of execution within the syslet
+framework. A syslet represents a single system-call and its arguments.
+In addition it also has condition flags attached to it that allows the
+construction of larger programs (syslets) from these atoms.
+
+Arguments to the system call are implemented via pointers to arguments.
+This not only increases the flexibility of syslet atoms (multiple syslets
+can share the same variable for example), but is also an optimization:
+copy_uatom() will only fetch syscall parameters up until the point it
+meets the first NULL pointer. 50% of all syscalls have 2 or less
+parameters (and 90% of all syscalls have 4 or less parameters).
+
+  [ Note: since the argument array is at the end of the atom, and the
+    kernel will not touch any argument beyond the final NULL one, atoms
+    might be packed more tightly. (the only special case exception to
+    this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+    jump a full syslet_uatom number of bytes.) ]
+
+The Syslet:
+-----------
+
+A syslet is a program, represented by a graph of syslet atoms. The
+syslet atoms are chained to each other either via the atom->next pointer,
+or via the SYSLET_SKIP_TO_NEXT_ON_STOP flag.
+
+Running Syslets:
+----------------
+
+Syslets can be run via the sys_async_exec() system call, which takes
+the first atom of the syslet as an argument. The kernel does not need
+to be told about the other atoms - it will fetch them on the fly as
+execution goes forward.
+
+A syslet might either be executed 'cached', or it might generate a
+'cachemiss'.
+
+'Cached' syslet execution means that the whole syslet was executed
+without blocking. The system-call returns the submitted atom's address
+in this case.
+
+If a syslet blocks while the kernel executes a system-call embedded in
+one of its atoms, the kernel will keep working on that syscall in
+parallel, but it immediately returns to user-space with a NULL pointer,
+so the submitting task can submit other syslets.
+
+Completion of asynchronous syslets:
+-----------------------------------
+
+Completion of asynchronous syslets is done via the 'completion ring',
+which is a ringbuffer of syslet atom pointers user user-space memory,
+provided by user-space in the sys_async_register() syscall. The
+kernel fills in the ringbuffer starting at index 0, and user-space
+must clear out these pointers. Once the kernel reaches the end of
+the ring it wraps back to index 0. The kernel will not overwrite
+non-NULL pointers (but will return an error), user-space has to
+make sure it completes all events it asked for.
+
+Waiting for completions:
+------------------------
+
+Syslet completions can be waited for via the sys_async_wait()
+system call - which takes the number of events it should wait for as
+a parameter. This system call will also return if the number of
+pending events goes down to zero.
+
+Sample Hello World syslet code:
+
+--------------------------->
+/*
+ * Set up a syslet atom:
+ */
+static void
+init_atom(struct syslet_uatom *atom, int nr,
+	  void *arg_ptr0, void *arg_ptr1, void *arg_ptr2,
+	  void *arg_ptr3, void *arg_ptr4, void *arg_ptr5,
+	  void *ret_ptr, unsigned long flags, struct syslet_uatom *next)
+{
+	atom->nr = nr;
+	atom->arg_ptr[0] = arg_ptr0;
+	atom->arg_ptr[1] = arg_ptr1;
+	atom->arg_ptr[2] = arg_ptr2;
+	atom->arg_ptr[3] = arg_ptr3;
+	atom->arg_ptr[4] = arg_ptr4;
+	atom->arg_ptr[5] = arg_ptr5;
+	atom->ret_ptr = ret_ptr;
+	atom->flags = flags;
+	atom->next = next;
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long int fd_out = 1; /* standard output */
+	char *buf = "Hello Syslet World!\n";
+	unsigned long size = strlen(buf);
+	struct syslet_uatom atom, *done;
+
+	async_head_init();
+
+	/*
+	 * Simple syslet consisting of a single atom:
+	 */
+	init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
+		  NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
+	done = sys_async_exec(&atom);
+	if (!done) {
+		sys_async_wait(1);
+		if (completion_ring[curr_ring_idx] == &atom) {
+			completion_ring[curr_ring_idx] = NULL;
+			printf("completed an async syslet atom!\n");
+		}
+	} else {
+		printf("completed an cached syslet atom!\n");
+	}
+
+	async_head_exit();
+
+	return 0;
+}
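The completion-ring rules in the design document above (fill from index 0, wrap at the end, never overwrite a non-NULL slot, user space clears slots as it consumes them) can be modelled entirely in user space. The following is only a sketch of those rules, not code from the patch set: struct atom, struct ring, RING_ENTRIES, ring_complete() and ring_reap() are all hypothetical names invented for the illustration, and no kernel interface is involved.

```c
#include <assert.h>
#include <stddef.h>

#define RING_ENTRIES 4          /* hypothetical ring size for the demo */

struct atom { int id; };        /* stand-in for struct syslet_uatom */

struct ring {
	struct atom *slot[RING_ENTRIES]; /* NULL means "slot is free" */
	size_t kernel_idx;               /* next slot the producer writes */
	size_t user_idx;                 /* next slot the consumer reads */
};

/*
 * Producer side, modelling what the document says the kernel does:
 * store the completed atom at the current index and advance, wrapping
 * back to index 0 at the end. A slot that user space has not cleared
 * yet is never overwritten; an error is returned instead.
 */
static int ring_complete(struct ring *r, struct atom *done)
{
	assert(done != NULL);
	if (r->slot[r->kernel_idx] != NULL)
		return -1;
	r->slot[r->kernel_idx] = done;
	r->kernel_idx = (r->kernel_idx + 1) % RING_ENTRIES;
	return 0;
}

/*
 * Consumer side, modelling user space: pick up the next completion,
 * clear the slot so it can be reused, and advance with the same wrap.
 * Returns NULL when nothing has completed yet.
 */
static struct atom *ring_reap(struct ring *r)
{
	struct atom *done = r->slot[r->user_idx];

	if (done == NULL)
		return NULL;
	r->slot[r->user_idx] = NULL;
	r->user_idx = (r->user_idx + 1) % RING_ENTRIES;
	return done;
}
```

With a zero-initialised struct ring, four calls to ring_complete() fill the ring, a fifth fails until ring_reap() has cleared slot 0, and completions come back in submission order.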
From: Linus Torvalds [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 12:38:16 -0800 (PST)

On Tue, 13 Feb 2007, Ingo Molnar wrote:
>
> the core syslet / async system calls infrastructure code.

Ok, having now looked at it more, I can say:

 - I hate it.

I dislike it intensely, because it's so _close_ to being usable. But the
programming interface looks absolutely horrid for any "casual" use, and
while the loops etc look like fun, I think they are likely to be less
than useful in practice. Yeah, you can do the "setup and teardown" just
once, but it ends up being "once per user", and it ends up being a lot
of stuff to do for somebody who wants to just do some simple async
stuff.

And the whole "lock things down in memory" approach is bad. It's doing
expensive things like mlock(), making the overhead for _single_ system
calls much more expensive. Since I don't actually believe that the
non-single case is even all that interesting, I really don't like it.

I think it's clever and potentially useful to allow user mode to see the
data structures (and even allow user mode to *modify* them) while the
async thing is running, but it really seems to be a case of excessive
cleverness.

For example, how would you use this to emulate the *current* aio_read()
etc interfaces that don't have any user-level component except for the
actual call? And if you can't do that, the whole exercise is pointless.

Or how would you do the trivial example loop that I explained was a good
idea:

	struct one_entry *prev = NULL;
	struct dirent *de;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(..);

		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();
	.. Ta-daa! All done ..

Notice? This also "chains system calls together", but it does it using a
*much* more powerful entity called "user space". That's what user space
is. And yeah, it's a pretty complex sequencer, but happily we have
hardware support for accelerating it to the point that the kernel never
even needs to care.

The above is a *realistic* schenario, where you actually have things
like memory allocation etc going on. In contrast, just chaining system
calls together isn't a realistic schenario at all.

So I think we have one _known_ usage schenario:

 - replacing the _existing_ aio_read() etc system calls (with not just
   existing semantics, but actually binary-compatible)

 - simple code use where people are willing to perhaps do something
   Linux-specific, but because it's so _simple_, they'll do it.

In neither case does the "chaining atoms together" seem to really solve
the problem. It's clever, but it's not what people would actually do.

And yes, you can hide things like that behind an abstraction library,
but once you start doing that, I've got three questions for you:

 - what's the point?
 - we're adding overhead, so how are we getting it back
 - how do we handle independent libraries each doing their own thing and
   version skew between them?

In other words, the "let user space sort out the complexity" is not a
good answer. It just means that the interface is badly designed.

		Linus
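The readdir() loop in the mail above leaves out the allocation size and relies on two primitives, async_stat() and wait_for_async(), that did not exist at the time. To make the shape of the loop concrete, here is a compilable sketch in which both are hypothetical stand-ins: async_stat() simply runs fstatat() synchronously and wait_for_async() does nothing. The extra dir_fd parameter, the stat_ok field, scan_dir() and free_entries() are likewise assumptions added for the illustration, so this shows the calling pattern Linus is asking for, not an asynchronous implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

struct one_entry {
	struct one_entry *next;
	char name[256];          /* NAME_MAX (255) plus the terminating NUL */
	struct stat stat_buf;
	int stat_ok;             /* 1 if the stat lookup succeeded */
};

/* Hypothetical stand-in: a real version would return before completion. */
static int async_stat(int dir_fd, const char *name, struct stat *buf)
{
	return fstatat(dir_fd, name, buf, 0);
}

/* Hypothetical stand-in: nothing is pending, so there is nothing to wait for. */
static void wait_for_async(void)
{
}

/* Build a linked list with one entry per directory entry, newest first. */
static struct one_entry *scan_dir(const char *path)
{
	struct one_entry *prev = NULL;
	struct dirent *de;
	DIR *dir;

	assert(path != NULL);
	dir = opendir(path);
	if (dir == NULL)
		return NULL;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(sizeof(*entry));

		if (entry == NULL)
			break;
		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		snprintf(entry->name, sizeof(entry->name), "%s", de->d_name);

		/* Issue the stat lookup (synchronous in this sketch) */
		entry->stat_ok =
			async_stat(dirfd(dir), de->d_name, &entry->stat_buf) == 0;
	}
	wait_for_async();
	closedir(dir);
	return prev;
}

/* Release a list returned by scan_dir(). */
static void free_entries(struct one_entry *entry)
{
	while (entry != NULL) {
		struct one_entry *next = entry->next;

		free(entry);
		entry = next;
	}
}
```

A caller would walk the list returned by scan_dir(".") and check stat_ok before reading stat_buf. The point of the original example is that everything between the two stand-ins (allocation, list handling, string copies) is ordinary user-space code.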
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 22:02:51 +0100

* Linus Torvalds [email blocked] wrote:

> And the whole "lock things down in memory" approach is bad. It's doing
> expensive things like mlock(), making the overhead for _single_ system
> calls much more expensive. [...]

hm, there must be some misunderstanding here. That mlock is /only/ once
per the lifetime of the whole 'head' - i.e. per sys_async_register().
(And you can even forget i ever did it - it's 5 lines of code to turn
the completion ring into a swappable entity.)

never does any MMU trick ever enter the picture during the whole
operation of this thing, and that's very much intentional.

	Ingo
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 22:12:31 +0100

* Ingo Molnar [email blocked] wrote:

> * Linus Torvalds [email blocked] wrote:
>
> > And the whole "lock things down in memory" approach is bad. It's
> > doing expensive things like mlock(), making the overhead for
> > _single_ system calls much more expensive. [...]
>
> hm, there must be some misunderstanding here. That mlock is /only/
> once per the lifetime of the whole 'head' - i.e. per
> sys_async_register(). (And you can even forget i ever did it - it's 5
> lines of code to turn the completion ring into a swappable entity.)
>
> never does any MMU trick ever enter the picture during the whole
> operation of this thing, and that's very much intentional.

to stress it: never does any mlocking or other lockdown happen of any
syslet atom - it is /only/ the completion ring of syslet pointers that i
made mlocked - but even that can be made generic memory no problem.

It's all about asynchronous system calls, and if you want you can have a
terabyte of syslets in user memory, half of it swapped out. They have
absolutely zero kernel context attached to them in the 'cached case' (be
that locked memory or some other kernel resource).

	Ingo
From: Linus Torvalds [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 13:26:23 -0800 (PST)

On Wed, 14 Feb 2007, Ingo Molnar wrote:
>
> hm, there must be some misunderstanding here. That mlock is /only/ once
> per the lifetime of the whole 'head' - i.e. per sys_async_register().
> (And you can even forget i ever did it - it's 5 lines of code to turn
> the completion ring into a swappable entity.)

But the whole point is that the notion of a "register" is wrong in the
first place. It's wrong because:

 - it assumes we are going to make these complex state machines (which I
   don't believe for a second that a real program will do)

 - it assumes that we're going to make many async system calls that go
   together (which breaks the whole notion of having different libraries
   using this for their own internal reasons - they may not even *know*
   about other libraries that _also_ do async IO for *their* reasons)

 - it fundamentally is based on a broken notion that everything would
   use this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT,
   since current users use "aio_read()" that simply doesn't have that
   and doesn't build up any such data structures.

So please answer my questions. The problem wasn't the mlock(), even
though that was just STUPID. The problem was much deeper. This is not a
"prepare to do a lot of very boutique linked list operations" problem.
This is a "people already use 'aio_read()' and want to extend on it"
problem.

You didn't at all react to that fundamental issue: you have an overly
complex and clever thing that doesn't actually *match* what people do.

		Linus
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 22:35:05 +0100

* Linus Torvalds [email blocked] wrote:

> But the whole point is that the notion of a "register" is wrong in the
> first place. [...]

forget about it then. The thing we "register" is dead-simple:

	struct async_head_user {
		struct syslet_uatom __user **completion_ring;
		unsigned long ring_size_bytes;
		unsigned long max_nr_threads;
	};

this can be passed in to sys_async_exec() as a second pointer, and the
kernel can put the expected-completion pointer (and the user ring idx
pointer) into its struct atom. It's just a few instructions, and only in
the cachemiss case.

that would make completions arbitrarily split-up-able. No registration
whatsoever. A waiter could specify which ring's events it is interested
in. A 'ring' could be a single-entry thing as well, for a single
instance of pending IO.

	Ingo
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 23:09:48 +0100

* Linus Torvalds [email blocked] wrote:

> Or how would you do the trivial example loop that I explained was a
> good idea:
>
>	struct one_entry *prev = NULL;
>	struct dirent *de;
>
>	while ((de = readdir(dir)) != NULL) {
>		struct one_entry *entry = malloc(..);
>
>		/* Add it to the list, fill in the name */
>		entry->next = prev;
>		prev = entry;
>		strcpy(entry->name, de->d_name);
>
>		/* Do the stat lookup async */
>		async_stat(de->d_name, &entry->stat_buf);
>	}
>	wait_for_async();
>	.. Ta-daa! All done ..

i think you are banging on open doors. That async_stat() call is very
much what i'd like to see glibc to provide, not really the raw syslet
interface. Nor do i want to see raw syscalls exposed to applications.
Plus the single-atom thing is what i think will be used mostly
initially, so all my optimizations went into that case.

while i agree with you that state machines are hard, it's all a function
of where the concentration of processing is. If most of the application
complexity happens in user-space, then the logic should live there. But
for infrastructure things (like the async_stat() calls, or aio_read(),
or other, future interfaces) i wouldnt mind at all if they were
implemented using syslets. Likewise, if someone wants to implement the
hottest accept loop in Apache or Samba via syslets, keeping them from
wasting time on writing in-kernel webservers (oops, did i really say
that?), it can be done. If a JVM wants to use syslets, sure - it's an
abstraction machine anyway so application programmers are not exposed to
it.

syslets are just a realization that /if/ the thing we want to do is
mostly on the kernel side, then we might as well put the logic to the
kernel side. It's more of a 'compound interface builder' than the place
for real program logic. It makes our interfaces usable more flexibly,
and it allows the kernel to provide 'atomic' APIs, instead of having to
provide the most common compounded uses as well.

and note that if you actually try to do an async_stat() sanely, you do
get quite close to the point of having syslets. You get basically up to
a one-shot atom concept and 90% of what i have in kernel/async.c. The
remaining 10% of further execution control is easy and still it opens up
these new things that were not possible before: compounding, vectoring,
simple program logic, etc.

The 'cost' of syslets is mostly the atom->next pointer in essence. The
whole async infrastructure only takes up 20 nsecs more in the cached
case. (but with some crazier hacks i got the one-shot atom overhead
[compared to a simple synchronous null syscall] to below 10 nsecs, so
there's room in there for further optimizations. Our current null
syscall latency is around ~150 nsecs.)

	Ingo
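The "compounding" Ingo describes rests on three ideas from the design document: arguments are fetched through pointers up to the first NULL, atoms are chained through atom->next, and a condition flag can stop the chain. The sketch below models that in plain user-space C so the control flow can be followed without the kernel. It is an assumption-laden toy: struct demo_atom, DEMO_STOP_ON_NEGATIVE, demo_fetch_args(), demo_exec() and the two demo "calls" are hypothetical names, function pointers stand in for system call numbers, and the real flag set and atom layout of the patch are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ARGS 6
#define DEMO_STOP_ON_NEGATIVE 0x1UL   /* hypothetical condition flag */

/* Stand-in for a system call: receives the fetched argument values. */
typedef long (*demo_call_t)(const long *args, int nr_args);

struct demo_atom {
	unsigned long flags;
	demo_call_t call;                /* stands in for the syscall number */
	long *ret_ptr;                   /* where to store the result, or NULL */
	struct demo_atom *next;          /* next atom in the chain, or NULL */
	long *arg_ptr[DEMO_MAX_ARGS];    /* pointers to the arguments */
};

/* Fetch arguments up to the first NULL pointer, as copy_uatom() is described. */
static int demo_fetch_args(const struct demo_atom *atom, long *args)
{
	int n = 0;

	while (n < DEMO_MAX_ARGS && atom->arg_ptr[n] != NULL) {
		args[n] = *atom->arg_ptr[n];
		n++;
	}
	return n;
}

/* Run a chain of atoms; returns the last atom that was executed. */
static struct demo_atom *demo_exec(struct demo_atom *atom)
{
	struct demo_atom *last = NULL;

	while (atom != NULL) {
		long args[DEMO_MAX_ARGS] = {0};
		int nr_args = demo_fetch_args(atom, args);
		long ret;

		assert(atom->call != NULL);
		ret = atom->call(args, nr_args);
		if (atom->ret_ptr != NULL)
			*atom->ret_ptr = ret;
		last = atom;
		if ((atom->flags & DEMO_STOP_ON_NEGATIVE) && ret < 0)
			break;               /* condition flag ends the chain */
		atom = atom->next;
	}
	return last;
}

/* Two toy "system calls" for the demo. */
static long demo_add(const long *args, int nr_args)
{
	long sum = 0;

	for (int i = 0; i < nr_args; i++)
		sum += args[i];
	return sum;
}

static long demo_negate(const long *args, int nr_args)
{
	return nr_args > 0 ? -args[0] : 0;
}
```

Because arguments are pointers, one atom's ret_ptr can be another atom's arg_ptr, which is how a result flows down the chain: an add atom writing into a variable, followed by a negate atom reading that variable with the stop flag set, ends the chain at the second atom and leaves any third atom untouched.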
From: Linus Torvalds [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Wed, 14 Feb 2007 15:13:17 -0800 (PST)

On Wed, 14 Feb 2007, Ingo Molnar wrote:
>
> i think you are banging on open doors. That async_stat() call is very
> much what i'd like to see glibc to provide, not really the raw syslet
> interface.

Right. Which is why I wrote (and you removed) the rest of my email.

If the "raw" interfaces aren't actually what you use, and you just
expect glibc to translate things into them, WHY DO WE HAVE THEM AT ALL?

> The 'cost' of syslets is mostly the atom->next pointer in essence.

No. The cost is:

 - indirect interfaces are harder to follow and debug. It's a LOT easier
   to debug things that go wrong when it just does what you ask it for,
   instead of writing to memory and doing something totally obscure.

   I don't know about you, but I use "strace" a lot. That's the kind of
   cost we have.

 - the cost is the extra and totally unnecessary setup for the
   indirection, that nobody reallyis likely to use.

> The whole async infrastructure only takes up 20 nsecs more in the cached
> case. (but with some crazier hacks i got the one-shot atom overhead
> [compared to a simple synchronous null syscall] to below 10 nsecs, so
> there's room in there for further optimizations. Our current null
> syscall latency is around ~150 nsecs.)

You are not counting the whole setup cost there, then, because your
setup cost is going to be at a minimum more expensive than the null
system call.

And yes, for benchmarks, it's going to be done just once, and then the
benchmark will loop a million times. But for other things like
libraries, that don't know whether they get called once, or a million
times, this is a big deal.

This is why I'd like a "async_stat()" to basically be the *same* cost as
a "stat()". To within nanoseconds. WITH ALL THE SETUP! Because
otherwise, a library may not be able to use it without thinking about it
a lot, because it simply doesn't know whether the caller is going to
call it once or many times.

THIS was why I wanted the "synchronous mode". Exactly because it removes
all the questions about "is it worth it". If the cost overhead is
basically zero, you know it's always worth it.

Now, if you make the "async_submit()" _incldue_ the setup itself (as you
alluded to in one of your emails), and the cost of that is basically
negligible, and it still allows people to do things simply and just
submit a single system call without any real overhead, then hey, it's
may be a complex interface, but at least you can _use_ it as a simple
one.

At that point most of my arguments against it go away. It might still be
over-engineered, but if the costs aren't visible, and it's obvious
enough that the over-engineering doesn't result in subtle bugs, THEN
(and only then) is a more complex and generic interface worth it even if
nobody actually ends up using it.

		Linus
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Thu, 15 Feb 2007 00:44:17 +0100

* Linus Torvalds [email blocked] wrote:

> > case. (but with some crazier hacks i got the one-shot atom overhead
> > [compared to a simple synchronous null syscall] to below 10 nsecs,
> > so there's room in there for further optimizations. Our current null
> > syscall latency is around ~150 nsecs.)
>
> You are not counting the whole setup cost there, then, because your
> setup cost is going to be at a minimum more expensive than the null
> system call.

hm, this one-time cost was never on my radar. [ It's really dwarved by
other startup costs (a single fork() takes 100 usecs, an exec() takes
800 usecs.) ]

In any case, we can delay this cost into the first cachemiss, or can
eliminate it by making it a globally pooled thing. It does not seem like
a big issue.

	Ingo
From: Ingo Molnar [email blocked]
Subject: Re: [patch 05/11] syslets: core code
Date: Thu, 15 Feb 2007 01:04:47 +0100

* Ingo Molnar [email blocked] wrote:

> > You are not counting the whole setup cost there, then, because your
> > setup cost is going to be at a minimum more expensive than the null
> > system call.
>
> hm, this one-time cost was never on my radar. [ It's really dwarved by
> other startup costs (a single fork() takes 100 usecs, an exec() takes
> 800 usecs.) ]

i really count this into the category of 'application startup', and thus
it's is another type of 'cachemiss': the cost of having to bootstrap a
new context. (Even though obviously we want this to go as fast as
possible too.) Library startups, linking (even with prelink), etc., is
quite expensive already - goes into the tens of milliseconds.

or if it's a new thread startup - where this cost would indeed be
visible, if the thread exits straight after being startup up, and where
this thread would want to do a single AIO, then shareable async heads
(see my mail to Alan) ought to solve this. (But short-lifetime threads
are not really a good idea in themselves.)

but the moment it's some fork()ed context, or even an exec()ed context,
this cost is very small in comparisno. And who in their right mind
starts up a whole new process just to do a single IO and then exit
without doing any other processing? (so that the async setup cost would
show up)

but, short-lived contexts, where this cost would be visible, are
generally a really bad idea.

	Ingo