"This is a case where you really should be scared, so FUD is completely appropriate."
"Basically, what the gcc developers are saying is that gcc is free to load and store to any memory location, so long as it behaves as if the instructions were executed in sequence," Nick Piggin noted, describing a linked discussion on the GCC development mailing list. He explained his concerns, "for x86, obviously the example above shows it can be miscompiled, but it is probably relatively hard to make it happen for a non trivial sequence. For an ISA with lots of predicated instructions like ia64, it would seem to be much more likely. But of course we don't want even the possibility of failures. The gcc guys seem to be saying to mark everything volatile that could be touched in a critical section. This is insane for Linux." Linus Torvalds reflected:
"Are you surprised? The gcc developers seem to have had a total disregard for what people want or need, and every time some code generation issue comes up, there's a lot of people on the list that do language-lawyering, rather than admit that there might be a problem.
"It's happened before, it will happen again. I don't think it's true of all gcc developers (or even most, I hope), but it's common enough. For some reason, compiler developers seem to be far enough removed from 'real life' that they have a tendency to talk in terms of 'this is what the spec says' rather than 'this is a problem'."
Nick Piggin posted an efficient algorithm for converting a data structure, "this is my 'data structure switch' algorithm that can convert one data structure into another, with just a single unlikely branch in fastpaths and no locking or atomic operations (the branch is only triggered when the data structure is in the process of being converted). A pointer indirection is generally also needed when converting a global data structure." He provided an example of using it to resize a hash table, then noted:
"the algorithm need not only convert between two different sized hashes, but that's just the most obvious and useful thing to do. It _could_ convert between a hash and a tree or something crazy like that too :)"
Nick added that while the conversion algorithm was efficiently implemented, converting it into an API was proving difficult, "the algorithm implementation is basically optimal, but the API means that it requires an extra function call and indirect function call to work, which probably makes it unusable for important implementations. (C preprocessor magic to avoid the indirect function calls and allow some level of templating while retaining the types might be the way to do it)."
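Nick's patches contain the real implementation; the shape of the idea can be sketched roughly as follows. Readers take one pointer indirection and a single unlikely branch, and the branch only fires while a conversion is in flight. Everything below — the names, the trivial open-addressed hash, the conversion flag — is illustrative, not Nick's code:

```c
/* Sketch of a "data structure switch": lookups check an unlikely
 * "converting" flag; only while a conversion is in progress do they
 * also consult the replacement structure.  Illustrative only. */
#include <stddef.h>
#include <stdlib.h>

#define unlikely(x) __builtin_expect(!!(x), 0)  /* kernel-style hint */

struct table {
	int converting;		/* nonzero while items migrate to ->next */
	struct table *next;	/* the replacement structure, if any */
	size_t nbuckets;
	int *buckets;		/* toy open hash; keys > 0, 0 = empty */
};

static struct table *table_new(size_t nbuckets)
{
	struct table *t = calloc(1, sizeof(*t));
	t->nbuckets = nbuckets;
	t->buckets = calloc(nbuckets, sizeof(int));
	return t;
}

static void table_insert(struct table *t, int key)
{
	size_t i = (size_t)key % t->nbuckets;	/* linear probe */
	while (t->buckets[i])
		i = (i + 1) % t->nbuckets;
	t->buckets[i] = key;
}

static int table_lookup_one(struct table *t, int key)
{
	size_t i = (size_t)key % t->nbuckets;
	while (t->buckets[i]) {
		if (t->buckets[i] == key)
			return 1;
		i = (i + 1) % t->nbuckets;
	}
	return 0;
}

/* Fastpath: a single branch, taken only mid-conversion. */
static int table_lookup(struct table *t, int key)
{
	if (unlikely(t->converting))
		if (table_lookup_one(t->next, key))
			return 1;
	return table_lookup_one(t, key);
}
```

During a resize, new insertions would go to the larger table while old entries are migrated across; once migration finishes, the global pointer is switched to the new table and the flag cleared.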
In a June 1992 posting to the linux-activists mailing list, Linus Torvalds described the original Linux scheduler, noting, "the scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes." A year later, Linus posted a more detailed description of the scheduler, noting, "the linux scheduling algorithm is one of the simplest ones possible". Comments in the original 254-line sched.c file read, "'schedule()' is the scheduler function. This is GOOD CODE! There probably won't be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc). The one thing you might take a look at is the signal-handler code here."
Comments in the current 6,709-line sched.c file show the first changes being made in 1996 by Dave Grothe, "to fix bugs in semaphores and make semaphores SMP safe". Two years later, Andrea Arcangeli is credited with implementing "schedule_timeout() and related stuff". It was not until 2002, ten years after Linus' original code was written, that the scheduler received a complete rewrite, "new ultra-scalable O(1) scheduler by Ingo Molnar: hybrid priority-list and round-robin design with an array-switch method of distributing timeslices and per-CPU runqueues." Con Kolivas is credited with "interactivity tuning" in 2003, and Nick Piggin added "scheduler domains" in 2004. A more recent rewrite of the scheduler happened in April, again by Ingo Molnar, this time with his Completely Fair Scheduler.
Nick Piggin used 'git bisect' to track an lmbench regression to the main CFS commit, leading to an interesting discussion between Nick and Ingo Molnar. Ultimately the regression was tracked down to the temporary configurability of the scheduler while it is tuned for optimal performance, "one reason for the extra overhead is the current tunability of CFS, but that is not fundamental, it's caused by the many knobs that CFS has at the moment." The solution, already coded but not yet merged in the mainline kernel, "changes those knobs to constants, allowing the compiler to optimize the math better and reduce code size," and as a result, "CFS can be faster at micro-context-switching than 2.6.22."
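The knobs-to-constants change Ingo describes can be illustrated in miniature. The knob name and values below are made up for the sketch; the point is that once the value is a compile-time constant rather than a variable, the compiler can fold it into the arithmetic and eliminate the memory load, shrinking and speeding the fastpath:

```c
/* "Tunable" form: the knob lives in a variable, as a sysctl would,
 * so every call must load it from memory before dividing.
 * (Hypothetical knob name, not an actual CFS parameter.) */
static unsigned int sched_granularity_var = 4000000;	/* ns */

static unsigned long slice_tunable(unsigned long nr_running)
{
	return sched_granularity_var / nr_running;
}

/* "Constant" form: the same knob cast into a constant, which the
 * compiler can propagate into callers and fold at compile time. */
enum { SCHED_GRANULARITY = 4000000 };			/* ns */

static unsigned long slice_constant(unsigned long nr_running)
{
	return SCHED_GRANULARITY / nr_running;
}
```

Both functions compute the same result; the difference shows up only in the generated code, which is exactly the micro-overhead the lmbench run was measuring.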
Ingo described the lmbench configuration in question as a "micro-benchmark", and noted that with a macro-benchmark the performance improvement was more pronounced, "because with CFS the _quality_ of scheduling decisions has increased. So even if we had increased micro-costs (which we wont have once the current tuning period is over and we cast the CFS parameters into constants), the quality of macro-scheduling can offset that, and not only on the desktop!" He summarized, "that's why our main focus in CFS was on the macro-properties of scheduling _first_, and then the micro-properties are adjusted to the macro-constraints as a second layer."
Updating the pluggable scheduler patches for the 2.6.22 kernel, Peter Williams noted, "probably the last one now that CFS is in the main line". CFS author Ingo Molnar asked, "why is CFS in mainline a problem? The CFS merge should make the life of development/test patches like plugsched conceptually easier." Peter explained, "it means a major rewrite of the plugsched interface and I'm not sure that it's worth it (if CFS works well). However, note that I did say probably not definitely :-). I'll play with it and see what happens."
Ingo went on to suggest, "most of the schedulers in plugsched should be readily adaptable to the modular scheduling-policy scheme of the upstream scheduler. I'm sure there will be some minor issues as isolation of the modules is not enforced right now - and i'd be happy to review (and potentially apply) common-sense patches that improve the framework." Peter wasn't entirely convinced, but added, "I don't feel like converting staircase and nicksched as I have no real interest in them. Perhaps I'll just create the interface and some schedulers based on my own ideas and let others such as Con and Nick add schedulers if they're still that way inclined." In response to Ingo's offer to review the work he noted, "thanks for the offer of support (it may sway my decision)".
Another thread discussed potentially merging the swap prefetch patch into the mainline Linux kernel. Con Kolivas [story] started the thread, saying, "I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please." Replying to an off-list message, Andrew Morton asked users of the patch, "please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc?"
Nick Piggin [interview] noted that he's still interested in better understanding and possibly fixing what's happening with swap and reclaim on the systems reporting a benefit from the swap-prefetch patch. He went on to note, "regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some)." Ingo Molnar followed up to these comments, acking the patch, "I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already [story]) We should include this in v2.6.23."
Posting a series of three patches, Nick Piggin [interview] announced that he was working on a rewrite of the buffer layer which he calls fsblock, "the name is fsblock because it basically ties the fs layer to the block layer." As to just what the buffer layer is, Nick explained, "the buffer layer is a layer between the pagecache and the block device for block based filesystems. It keeps a translation between logical offset and physical block number, as well as meta information such as locks, dirtyness, and IO status of each block. This information is tracked via the buffer_head structure." Before listing the improvements introduced by his rewrite, Nick offered a justification for the effort:
"Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems."
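The state Nick describes the buffer layer tracking — the logical-to-physical translation plus per-block lock, dirty, and IO status — can be pictured with a simplified structure. This is a sketch for illustration, not the kernel's actual buffer_head definition:

```c
/* Simplified model of per-block buffer state: status bits plus the
 * logical-offset-to-physical-block mapping.  Illustrative only. */

enum bh_bits {
	BH_Uptodate = 1 << 0,	/* block contains valid data */
	BH_Dirty    = 1 << 1,	/* block needs writeback */
	BH_Lock     = 1 << 2,	/* serialises IO on this block */
	BH_Mapped   = 1 << 3,	/* b_blocknr is valid */
};

struct buffer {
	unsigned long b_state;		/* the status bits above */
	unsigned long long b_logical;	/* logical block offset in the file */
	unsigned long long b_blocknr;	/* physical block number on disk */
	unsigned int b_size;		/* block size in bytes */
};

/* Record the physical location of a block and mark it dirty, the
 * kind of transition a filesystem makes after allocating on disk. */
static void buffer_map_and_dirty(struct buffer *b,
				 unsigned long long phys)
{
	b->b_blocknr = phys;
	b->b_state |= BH_Mapped | BH_Dirty | BH_Uptodate;
}
```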
Ingo Molnar [interview] reviewed Con Kolivas' [interview] swap-prefetching patches [story], suggesting that they were ready for inclusion in the mainline kernel, "I've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up." He went on to describe swap prefetch, "to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.)"
Nick Piggin [interview] expressed some concern that the impact of the code still wasn't understood well enough, "I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyze them. It is really simple, but I haven't got any since first asking." Ingo noted that the patch has generated a lot of positive feedback from users and it would be best to merge it into the kernel, going on to suggest that it would be good to get more people actively involved, "really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable."
A university student studying operating systems asked about why the Linux kernel uses two chained lists in its LRU (least recently used) page replacement algorithm. Andrea Arcangeli [interview], whose virtual memory subsystem was merged into the 2.4.10 kernel, explained, "back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set." He went on to add, "a page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too."
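The two-list design Andrea describes can be modeled in a few lines: new cache pages enter the inactive list, a second reference promotes a page to the active list, and reclaim evicts from the inactive side first, so streaming pollution is dropped before the working set. This toy model (list membership tracked with a flag and a timestamp rather than real linked lists) is only a sketch of the idea:

```c
/* Toy two-list page replacement: pollution enters INACTIVE and is
 * evicted quickly; a second touch promotes a page to ACTIVE, which
 * protects the working set.  Illustrative only. */

#define NPAGES 64

enum where { FREE, INACTIVE, ACTIVE };

struct page {
	enum where where;
	int referenced;		/* touched since last scan? */
	unsigned long stamp;	/* insertion order; oldest = smallest */
};

static struct page pages[NPAGES];
static unsigned long clock_stamp;

static void touch(int pfn)
{
	struct page *p = &pages[pfn];

	if (p->where == FREE) {		/* fault in: inactive first */
		p->where = INACTIVE;
		p->stamp = ++clock_stamp;
	} else if (p->where == INACTIVE) {
		p->referenced = 1;	/* candidate for activation */
	}
}

/* Promote referenced inactive pages, then evict the oldest
 * unreferenced inactive page; returns the evicted pfn or -1. */
static int reclaim_one(void)
{
	int victim = -1;

	for (int i = 0; i < NPAGES; i++) {
		struct page *p = &pages[i];

		if (p->where != INACTIVE)
			continue;
		if (p->referenced) {		/* second touch: activate */
			p->where = ACTIVE;
			p->referenced = 0;
		} else if (victim < 0 || p->stamp < pages[victim].stamp) {
			victim = i;
		}
	}
	if (victim >= 0)
		pages[victim].where = FREE;
	return victim;
}
```

The real VM adds what this sketch omits: balancing the relative sizes of the two lists, and deactivating active pages when the active list grows too large, as Andrea notes.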
Rik van Riel [interview], author of the reverse mapping virtual memory code that was merged into the 2.5 kernel [story], noted, "since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time." In response to a proposal to split the LRU into two parts, one for the page cache and the other for mapped pages, Nick Piggin [interview] replied, "I actually had patches to do 'split active lists' a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is." Rik offered some ideas on how to tune it further, "for each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced. That way we can determine which list has the largest fraction of 'idle' pages sitting around and consequently which list should be scanned more aggressively."
When the data corruption bug that is fixed as of 2.6.20-rc3 [story] was still being tracked down [story], it was thought that the bug, a race in shared mmap'ed page writeback, might have been in the 2.6 kernel for a very long time. It has since been determined that the bug was introduced much more recently. Nick Piggin [interview] explained, "this bug was only introduced in 2.6.19, due to a change that caused pte dirty bits to be discarded without a subsequent set_page_dirty() (nowhere else in the kernel should have done this)." Linus Torvalds noted that earlier kernels could have been affected by a less serious version of the bug:
"Actually, I think 2.6.18 may have a subtle variation on it. But that much older race would only trigger on SMP (or possibly UP with preempt). And I haven't actually thought about it that much, so I could be full of crap. But I don't see anything that protects against it: we may hold the page lock, but since the code that marks things _dirty_ doesn't necessarily always hold it, that doesn't help us. And we may hold the 'private_lock', but we drop it before we do the dirty bit clearing, and in fact on UP+PREEMPT that very dropping could cause an active preemption to take place, so.. I dunno. For older kernels? If there is a race there, it must be pretty damn hard to hit in practice (and it must have been there for a looong time), so trying to fix it is possibly as likely to cause problems as it might to fix them."
David Miller pointed out that some of the confusion as to when the bug was actually introduced comes from the fact that the original bug was against a 2.6.18 Debian kernel. Andrew Morton [interview] explained, "that was 2.6.18+debian-added-dirty-page-tracking-patches," then went on to caution that the fix still does not address a newly reported and currently unconfirmed BerkeleyDB corruption bug, "I'll assert (and emphasise) that the cause of the alleged BerkeleyDB corruption is not known at this time. The post-2.6.19 'fix' might make it go away. But if it does, we do not know why, and it might still be there, only harder to hit."
Nick Piggin, a college student living in Canberra, Australia, has been working on an anticipatory I/O scheduler for the Linux kernel [story].
When a process reads data from a disk, the default "deadline" I/O scheduler can offer poor performance if a streamed write is happening at the same time. The reason is that many read operations require multiple reads, each reporting a result back before the next can be scheduled. Thus, each of these reads has to wait behind a queue of writes, resulting in the aforementioned performance problem. The anticipatory scheduler solves this problem nicely by pausing for a few milliseconds after each read, "anticipating" the next read request [story].
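The core decision the paragraph above describes can be sketched as a small state machine: after a read completes, hold queued writes back for a short window in case the same process issues a dependent read. The structure and field names below are invented for illustration; the real scheduler weighs per-process statistics to decide whether anticipation is worthwhile:

```c
/* Sketch of the anticipation decision: reads dispatch immediately,
 * and queued writes are held back while the anticipation window for
 * the last reader is still open.  Illustrative names and layout. */

struct as_state {
	int last_read_pid;		/* process whose read just completed, or -1 */
	long window_remaining_us;	/* anticipation budget, e.g. a few ms */
};

enum action { DISPATCH_READ, WAIT, DISPATCH_WRITE };

/* next_read_pid: pid owning the head queued read, or -1 if none */
static enum action as_decide(struct as_state *s, int next_read_pid,
			     int writes_pending)
{
	if (next_read_pid >= 0)
		return DISPATCH_READ;	/* a real read always wins */
	if (s->window_remaining_us > 0 && s->last_read_pid >= 0)
		return WAIT;		/* anticipate a dependent read */
	if (writes_pending)
		return DISPATCH_WRITE;	/* window expired: drain writes */
	return WAIT;			/* nothing to do: idle */
}
```

The trade-off is visible in the sketch: a few milliseconds of deliberate idleness per read buys the scheduler a chance to service a nearby dependent read instead of seeking away to the write stream.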
In this interview, Nick offers much more detail behind the operation of the anticipatory scheduler. His goal is to stabilize and tune [story] the new scheduler, aiming ultimately for inclusion into the 2.5 development kernel tree as the default Linux I/O scheduler [story]. The latest version of Nick's anticipatory scheduler can be found here in Andrew Morton's [interview] -mm kernel branch.