"We are working [on] a new I/O scheduler based on CFQ, aiming at improved predictability and fairness of the service, while maintaining the high throughput it already provides," began Fabio Checconi, announcing the BFQ I/O scheduler. "The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin scheduling policy of time slices into a fair queuing scheduling of sector budgets," he continued, "more precisely, each task is assigned a budget measured in number of sectors instead of amount of time, and budgets are scheduled using a slightly modified version of WF2Q+. The budget assigned to each task varies over time as a function of its behaviour. However, one can set the maximum value of the budget that BFQ can assign to any task." Fabio went on to explain:
"The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, suffers from unfairness problems also towards processes making the best possible use of the disk bandwidth. In fact, even if the same time slice is assigned to two processes, they may get a different throughput each, as a function of the positions on the disk of their requests. On the contrary, BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in number of sectors. Moreover, due to its Round Robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk. On the contrary, given the accurate service distribution of the internal WF2Q+ scheduler, BFQ exhibits O(1) delay."
Jens Axboe reacted favorably, "Fabio, I've merged the scheduler for some testing. Overall the code looks great, you've done a good job!" He noted that the scheduler should soon appear in the -mm tree, and that it was worth considering merging the two I/O schedulers together.
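To make the fairness argument concrete, here is a minimal sketch comparing the two accounting schemes. The service rates, slice length and budget are hypothetical numbers chosen for illustration; they do not come from Fabio's announcement, and the sketch does not model WF2Q+ itself.

```python
# Illustrative only: all numbers below are assumptions, not measurements.
SLICE_MS = 100            # identical time slice given to every task (time-based)
BUDGET_SECTORS = 2000     # identical sector budget given to every task (budget-based)

# Effective service rate in sectors per millisecond; a seeky task gets far
# fewer sectors transferred per unit of disk time than a sequential one.
RATES = {"sequential": 100, "seeky": 10}


def served_per_time_slice(rates, slice_ms):
    """Sectors each task receives when service is allocated in time."""
    return {task: rate * slice_ms for task, rate in rates.items()}


def served_per_budget(rates, budget_sectors):
    """Sectors each task receives when service is allocated in sectors."""
    return {task: budget_sectors for task in rates}


def disk_time_per_budget(rates, budget_sectors):
    """Disk time (ms) each task needs to consume its sector budget."""
    return {task: budget_sectors / rate for task, rate in rates.items()}


by_time = served_per_time_slice(RATES, SLICE_MS)
by_budget = served_per_budget(RATES, BUDGET_SECTORS)

print(by_time)     # equal slices, unequal sector counts
print(by_budget)   # equal sector counts regardless of request placement
print(disk_time_per_budget(RATES, BUDGET_SECTORS))
```

With equal time slices the two tasks end up with different amounts of data transferred, which is the unfairness described in the announcement; with equal sector budgets the transferred amounts match and only the disk time consumed differs.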
"Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit outside your own sandbox."
Adrian Bunk posted a patch to make Linux IO schedulers a non-modular option, which would require one IO scheduler to be selected at compile time. He suggested, "there isn't any big advantage and doesn't seem to be much usage of modular schedulers." He added that removing the option to make IO schedulers modular would save 2kB on each kernel image. Jens Axboe did not like the patch, "big nack, I use it all the time for testing. Just because you don't happen to use it is not a reason to remove it." When Adrian noted that no distros seemed to be making IO schedulers available as modules, Jens suggested that this was a mistake and quipped, "it's been a long time since I considered a distro .config a benchmark/guideline of any sort."
Adrian went on to ask for the technical reasons for continuing to support four different IO schedulers, expressing concern that it could lead to bugs in individual schedulers going unreported. Jens explained that he was aiming for the perfect IO scheduler, but at this time different IO schedulers offer better results for different workloads, "with some hard work and testing, we should be able to get rid of [the anticipatory scheduler]. It still beats cfq for some of the workloads that deadline is good at, so not quite yet." Arjan van de Ven offered, "there is at least one technical reason to need more than one: certain types of storage (both big EMC boxes as well as solid state disks) don't behave like disks and have no seek penalty; any cpu time spent on avoiding seeks is wasted on those, so for these devices one really wants to use a different IO scheduler, one which is much lighter weight". Jens then acknowledged, "there's always a risk with 'duplication', like several drivers for the same hardware. I'm not disputing that."
"I think the SG stuff looks ok now, but I think we have a lot of 'fix up the rough edges' to go!" Linus Torvalds noted regarding some of the fallout from the recent merge of Jens Axboe's SG chaining patchset. During one of the many discussions, Jens explained:
"It's all about the end goal - having maintainable and resilient code. And I think the sg code will be better once we get past the next day or so, and it'll be more robust. That is what matters to me, not the simplicity of the patch itself."
Boaz Harrosh commented, "thanks Jens for doing all this, The performance gain is substantial and we will all enjoy it." Jens replied, "my pleasure, I just wish it could have been a little less painful. But in a day or two, it should all be behind us and we can move forward with making good use of it."
"With latencytop, I noticed that the (in memory) atime updates during a kernel build had latencies of 600 msec or longer; this is obviously not so nice behavior. Other EXT3 journal related operations had similar or even longer latencies," Arjan van de Ven reported, describing a "mass priority inversion" caused by, "an interaction between EXT3 and CFQ in that CFQ tries to be fair to everyone, including kjournald. However, in reality, kjournald is 'special' in that it does a lot of journal work". Finally, he offered a tiny patch to resolve the issue, "the patch below makes kjournald of the IOPRIO_CLASS_RT priority to break this priority inversion behavior. With this patch, the latencies for atime updates (and similar operation) go down by a factor of 3x to 4x !"
Andrew Morton took a cautious stance, "seems a pretty fundamental change which could do with some careful benchmarking, methinks. See, your patch amounts to 'do more seeks to improve one test case'. Surely other testcases will worsen. What are they?" CFQ author Jens Axboe agreed, "It should not be merged as-is, instead I'll provide a function to do this." Ingo Molnar wasn't convinced, "atime update latencies went down by a factor of 3x-4x ... but what bothers me even more is the large picture. Linux's development is still fundamentally skewed towards bandwidth (which goes up with hardware advances anyway), while the focus on latencies is very lacking (which users do care about much more and which usually does _not_ improve with improved hardware), so i cannot see why we shouldnt apply this." He added, "if bandwidth hurts anywhere, it will be pointed out and fixed, we've got like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth problems. But _finally_ we now have desktop latency tools, hard numbers and patches that fix them, but what do we do ... we put up extra roadblocks??" Andrew calmly replied, "I think the situation is that we've asked for some additional what-can-be-hurt-by-this testing. Yes, we could sling it out there and wait for the reports. But often that's a pretty painful process and regressions can be discovered too late for us to do anything about them."
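For readers unfamiliar with I/O priorities, the sketch below shows how a class such as IOPRIO_CLASS_RT is combined with a priority level into a single value. It mirrors, in Python, the encoding used by the kernel's ioprio macros as we understand them (the class stored above a 13-bit shift). The level of 4 is an assumption made for the example, since the quoted text does not say which level Arjan's patch uses, and the helper names are ours.

```python
# Python rendering of the kernel's I/O priority encoding (a sketch, not kernel code).
IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = range(4)


def ioprio_value(prio_class, level):
    """Pack a scheduling class and a level into one integer."""
    return (prio_class << IOPRIO_CLASS_SHIFT) | level


def prio_class_of(value):
    """Extract the scheduling class from a packed value."""
    return value >> IOPRIO_CLASS_SHIFT


def prio_level_of(value):
    """Extract the level (the low 13 bits) from a packed value."""
    return value & ((1 << IOPRIO_CLASS_SHIFT) - 1)


# Assumed level 4 in the realtime class, for illustration only.
kjournald_prio = ioprio_value(IOPRIO_CLASS_RT, 4)
print(kjournald_prio, prio_class_of(kjournald_prio), prio_level_of(kjournald_prio))
```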
"It looks to be about 2.1% increase in time to do the make/mount/unmount operations with the marker patches in place and no blktrace operations," Alan Brunelle summarized some benchmarks testing the overhead of the kernel markers patches. He continued, "with the blktrace operations in place we see about a 3.8% decrease in time to do the same ops." Block layer maintainer Jens Axboe responded favorably, "thanks for running these numbers. I don't think you have to bother with it more. My main concern was a performance regression, increasing the overhead of running blktrace." He added, "I'd say the above is Good Enough for me," acking the kernel marker patches.
Jens went on to muse, "I do wonder about that performance _increase_ with blktrace enabled. I remember that we have seen and discussed something like this before, it's still a puzzle to me..." Mathieu Desnoyers agreed, "interesting question indeed," going on to suggest possible future tests to understand the unexpected performance increase. blktrace is a block layer IO tracing tool for providing detailed information about request queue operations, originally developed by Jens Axboe and merged into the mainline kernel in 2.6.17-rc1.
Jens Axboe detailed the changes in his linux-2.6-block.git tree that he plans to merge into the upcoming 2.6.24 kernel. Among the changes were the necessary updates to enable SG chaining, which is used for large IO commands: "the goal of sg chaining is to allow support for very large sgtables, without requiring that they be allocated from one contiguous piece of memory." Andrew Morton asked for more information, "presumably sg chaining means more overhead on the IO submission paths? If so, has this been quantified?"
Jens explained that there is no overhead for existing logic which doesn't use sg chaining, "just cleanups to drivers to use sg_next() and for_each_sg() and so on." He continued:
"For actually using the sg chaining, there's some overhead of course. Say we support 256 entries without chaining, or 1mb with 4kb pages. A request with 1000 entries would require 4 trips to the allocator to allocate the chainable lists and 4 trips when freeing that list again. We don't loop the sg list on setup or freeing, just jump to the correct locations. So even for chaining, the cost isn't that big. It enables us to support much larger IO commands and potentially speed up some devices quite a lot, so CPU cost is less of a concern. And for small sglists, there isn't a noticeable overhead."
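Jens's arithmetic can be checked with a few lines of code. This is a rough sketch that assumes every table holds exactly 256 entries and ignores any slot a table might give up to hold the chain link; the function names are ours.

```python
import math

ENTRIES_PER_TABLE = 256   # entries supported without chaining, per the quote
PAGE_KB = 4               # 4kb pages, per the quote


def allocator_trips(total_entries, entries_per_table=ENTRIES_PER_TABLE):
    """Number of table allocations needed to hold total_entries entries."""
    return math.ceil(total_entries / entries_per_table)


def unchained_limit_kb(entries_per_table=ENTRIES_PER_TABLE, page_kb=PAGE_KB):
    """Largest IO one table can describe if each entry maps one page."""
    return entries_per_table * page_kb


print(allocator_trips(1000))    # trips for the 1000-entry request in the quote
print(unchained_limit_kb())     # size limit in kb without chaining
```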
Lars Ellenberg started an effort to get DRBD, the Distributed Replicated Block Device, merged into the Linux kernel. When asked for clarification as to what it was, Lars explained, "think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system." Earlier in the thread he described it as "a stacked block device driver".
Much of the initial review focused on the need to comply with kernel coding style guidelines. Kyle Moffett offered a much lengthier review, noting at one point in the code, "how about fixing this to actually use proper workqueues or something instead of this open-coded mess?" Lars replied, "unlikely to happen 'right now'. But it is on our todo list..." Jens Axboe added, "but stuff like that is definitely a merge show stopper, jfyi".
Jens Axboe posted a series of ten patches that add support for large IO commands. He began by defining the problem:
"Some people complain that Linux doesn't support really large IO commands. The main reason why we do not support infinitely sized IO is that we need to allocate a scatterlist to fill these elements into for dma mapping. The Linux scatterlist is an array of scatterlist elements, so we need to allocate a contiguous piece of memory to hold them all. On i386, we can at most fit 256 scatterlist elements into a page, and on x86-64 we are stuck with 128. So that puts us somewhere between 512kb and 1024kb for a single IO."
Jens went on to explain his solution, "to get around that limitation, this patchset introduces an sg chaining concept. The way it works is that the last element of an sg table can point to a new sgtable, thus extending the size of the total IO scatterlist greatly." Regarding the current status he noted, "it works for me, but you can't enable large commands on anything but i386 right now. I still need to go over the x86-64 iommu bits to enable it there as well."
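The chaining idea can be illustrated with a small model. In this sketch, which is not the kernel implementation, each table is a Python list and the last slot of a full table holds a link to the next table. The class and function names (SgEntry, ChainLink, build_chained, iter_sg, max_io_kb) are hypothetical, and the 4kb page size is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SgEntry:
    page: int      # hypothetical page identifier
    length: int    # bytes covered by this entry


@dataclass
class ChainLink:
    next_table: list   # the table that continues the scatterlist


def max_io_kb(elements_per_page, page_kb=4):
    """Unchained limit: one page of elements, each mapping one page of data."""
    return elements_per_page * page_kb


def build_chained(entries, table_size):
    """Split entries into tables of at most table_size slots.

    When more entries remain than fit, the last slot of the current
    table is used for a ChainLink pointing at the next table.
    """
    if table_size < 2:
        raise ValueError("a chainable table needs at least two slots")
    head = []
    table = head
    remaining = list(entries)
    while remaining:
        if len(remaining) <= table_size:
            table.extend(remaining)
            remaining = []
        else:
            table.extend(remaining[:table_size - 1])
            remaining = remaining[table_size - 1:]
            next_table = []
            table.append(ChainLink(next_table))
            table = next_table
    return head


def iter_sg(table):
    """Yield every entry in order, following chain links between tables."""
    while table is not None:
        next_table = None
        for slot in table:
            if isinstance(slot, ChainLink):
                next_table = slot.next_table
            else:
                yield slot
        table = next_table


entries = [SgEntry(page=i, length=4096) for i in range(10)]
head = build_chained(entries, table_size=4)
print(sum(e.length for e in iter_sg(head)))  # total bytes described by the chain
```

Code that walks the list through an iterator such as iter_sg never needs to know where one table ends and the next begins, so the tables no longer have to sit in one contiguous allocation.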
Announcing the third version of his syslets subsystem patches, Ingo Molnar noted that he has implemented many fundamental changes to the code, including the introduction of threadlets: "'threadlets' are basically the user-space equivalent of syslets: small functions of execution that the kernel attempts to execute without scheduling. If the threadlet blocks, the kernel creates a real thread from it, and execution continues in that thread. The 'head' context (the context that never blocks) returns to the original function that called the threadlet." As threadlets are only moved into a separate thread context if they block, Ingo refers to them as 'optional threads'. He also describes them as 'on-demand parallelism', "user-space does not have to worry about setting up, sizing and feeding a thread pool - the kernel will execute the workload in a single-threaded manner as long as it makes sense, but once the context blocks, a parallel context is created. So parallelism inside applications is utilized in a natural way."
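The 'optional thread' behaviour can be loosely imitated in user space. The sketch below is only an analogy and does not use Ingo's API: since Python cannot detect that a function is about to block, the function is written as a generator and uses a bare yield as a stand-in for the point where it would block. The names run_threadlet and cached_read are hypothetical.

```python
import threading


def run_threadlet(gen_func, *args):
    """Run gen_func in the caller's context until it would block.

    If the function finishes without yielding, no thread is created and
    None is returned. On the first yield (our stand-in for blocking),
    the rest of the work moves to a new thread and the caller continues.
    """
    gen = gen_func(*args)
    try:
        next(gen)                 # executes inline up to the first yield
    except StopIteration:
        return None               # never blocked: stayed single-threaded

    def finish():
        for _ in gen:             # drive the remaining steps to completion
            pass

    worker = threading.Thread(target=finish)
    worker.start()
    return worker


results = []


def cached_read(key, cache):
    if key in cache:
        results.append(cache[key])    # fast path, completes without blocking
        return
    yield                             # stand-in for a blocking disk read
    results.append("loaded:" + key)


fast = run_threadlet(cached_read, "a", {"a": "hit"})   # runs inline
slow = run_threadlet(cached_read, "b", {})             # continues in a thread
slow.join()
print(fast, results)
```

The caller never sizes or feeds a thread pool; a thread exists only for the call that actually reached its blocking point.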
Ingo goes on to note that the syslet code and API have been significantly enhanced in this latest release, "the v3 code is ABI-incompatible with v2, due to these fundamental changes." He adds, "syslets (small, kernel-side, scripted 'syscall plugins') are still supported - they are (much...) harder to program than threadlets but they allow the highest performance. Core infrastructure libraries like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already includes support for v2 syslets, and the following patch updates FIO to the v3 API".
Jens Axboe has been involved with Linux since 1993. 30 years old, he lives in Copenhagen, Denmark, and works as a Linux Kernel developer for Oracle. His rewrite of the block layer, which he continues to maintain and improve, launched the 2.5 kernel development branch. Interested in most anything dealing with IO, he has introduced several new IO schedulers to the kernel, including the default CFQ, or Completely Fair Queuing scheduler.
In this interview, Jens talks about how he got interested in Linux, how he became the maintainer of the block layer and other block devices, and what's involved in being a maintainer. He describes his work on IO schedulers, offering an in-depth look at the design and current status of the CFQ scheduler, including a peek at what's in store for the future. He conveys his excitement about the new splice IO model, explaining how it came about and how it works. And he discusses the current 2.6 kernel development process, the impact of git, and why the GPL is important to him.
Nigel Cunningham submitted his suspend2 patches to the lkml for review and inclusion into Andrew Morton's -mm tree. Jens Axboe summarized the current roadblocks to merging suspend2, "now I haven't followed the suspend2 vs swsusp debate very closely, but it seems to me that your biggest problem with getting this merged is getting consensus on where exactly this is going. Nobody wants two different suspend modules in the kernel. So there are two options - suspend2 is deemed the way to go, and it gets merged and replaces swsusp. Or the other way around - people like swsusp more, and you are doomed to maintain suspend2 outside the tree."
Greg KH pointed out that the current focus with swsusp is to move the functionality from the kernel into userspace, called uswsusp, "Pavel and others have a working implementation and are slowly moving toward adding all of the 'bright and shiny' features that is in suspend2 to it (encryption, progress screens, abort by pressing a key, etc.) so that there is no loss of functionality." Nigel countered that only some of swsusp is being moved to userland, adding, "and there _is_ loss of functionality - uswsusp still doesn't support writing a full image of memory, writing to multiple swap devices (partitions or files), or writing to ordinary files. They're getting the low hanging fruit, but when it comes to these parts of the problem, they're going to require either smoke and very good mirrors (eg the swap prefetching trick), or simply refuse to implement them." Pavel Machek, maintainer of swsusp and uswsusp, replied item by item to Nigel's list of suspend2 advantages noting that uswsusp now has or soon will have the same capabilities. It was further noted that the submitted patches will need to be consolidated into logical pieces and resubmitted for proper review.
Andrew Morton offered a list of patches in his -mm tree, summarizing for each his plans as to whether or not they will be pushed to Linus for inclusion in the upcoming 2.6.17 kernel. Comments on the patches range from the simple "will merge" to pushing them to others for review. One of the more entertaining comments followed a set of 33 patches where Andrew noted, "This is Oleg's romp through the core kernel. There's a ton of material here. I'll probably send it all to Linus and ask him to review it. (aka blame-shifting)." Later in the thread he explained, "it's just a whole lot of code in areas which are tricky and in which few people work and in which reviewing resources are slight."
One set of patches that was refused, with the comment "still don't have a compelling argument for this, IMO", was Con Kolivas' swap prefetching effort. The feature was discussed in a couple of follow up threads. In response to some concerns raised by Jens Axboe, Con explained the implementation a little further, "If the system is idle it doesn't cost anything to bring those pages in (laptop mode disables any prefetching if you're thinking about power consumption on laptops). And if the system wants the ram that has been filled with prefetched pages wrongly, the prefetched pages are at the tail end of the inactive LRU list with a copy on backing store so if they're not accessed they'll be the first thing dropped in preference to anything else, without any I/O."
A brief thread on the lkml discussed whether or not Reiser4 would soon be stable enough to be merged into the 2.6 kernel as an 'experimental filesystem'. When it was suggested that this might be overly optimistic, that the filesystem may best go into the 2.7 development kernel first, Hans Reiser disagreed, "I don't think it is vastly optimistic, I hope we can send something in next month". He went on to explain, "we will have something we think is appropriate for inclusion as an experimental feature very soon now. Because our test scripts have become much more sophisticated, it means more when we say we cannot crash it, and it will go from experimental to stable faster than V3 did. I won't predict how fast."
Jens Axboe, maintainer of the block layer and several CD-ROM drivers, suggested that it would be unwise to merge the code so quickly, instead preferring a much lengthier period of user testing. He explained, "I don't doubt you have great testing scripts, but nothing beats real life testing." During a discussion in late August, 2.6 kernel maintainer Andrew Morton indicated that he would be willing to merge Reiser4 into his -mm patchset.