"We are working [on] a new I/O scheduler based on CFQ, aiming at improved predictability and fairness of the service, while maintaining the high throughput it already provides," began Fabio Checconi, announcing the BFQ I/O scheduler. "The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin scheduling policy of time slices into a fair queuing scheduling of sector budgets," he continued, "more precisely, each task is assigned a budget measured in number of sectors instead of amount of time, and budgets are scheduled using a slightly modified version of WF2Q+. The budget assigned to each task varies over time as a function of its behaviour. However, one can set the maximum value of the budget that BFQ can assign to any task." Fabio went on to explain:
"The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, suffers from unfairness problems also towards processes making the best possible use of the disk bandwidth. In fact, even if the same time slice is assigned to two processes, they may get a different throughput each, as a function of the positions on the disk of their requests. On the contrary, BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in number of sectors. Moreover, due to its Round Robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk. On the contrary, given the accurate service distribution of the internal WF2Q+ scheduler, BFQ exhibits O(1) delay."
Jens Axboe reacted favorably: "Fabio, I've merged the scheduler for some testing. Overall the code looks great, you've done a good job!" He noted that the scheduler should soon appear in the -mm tree, and that it was worth considering merging the two I/O schedulers.
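The difference between the two allocation policies Fabio describes can be illustrated with a toy model. The following Python sketch is not BFQ or CFQ code: the function names, the per-task transfer rates, the slice length and the budget are all assumptions chosen for illustration, and the model ignores WF2Q+ ordering entirely. It only shows that equal time slices can yield unequal throughput, while equal sector budgets yield equal throughput at the cost of unequal disk time.

```python
# Toy model only: all names and numbers are hypothetical, not taken from BFQ or CFQ.

def service_by_time_slice(rates, slice_ms, rounds):
    """Sectors each task receives when every task gets the same time slice.

    rates: sectors per millisecond each task achieves, which in practice
    depends on where its requests fall on the disk.
    """
    return [rate * slice_ms * rounds for rate in rates]


def service_by_sector_budget(rates, budget, rounds):
    """Sectors and disk time (ms) when every task gets the same sector budget."""
    sectors = [budget * rounds for _ in rates]
    time_ms = [budget * rounds / rate for rate in rates]
    return sectors, time_ms


# Assumed workload: one mostly sequential task and one seek-heavy task.
rates = [100, 25]

print(service_by_time_slice(rates, slice_ms=100, rounds=10))
# [100000, 25000]

print(service_by_sector_budget(rates, budget=4000, rounds=10))
# ([40000, 40000], [400.0, 1600.0])
```

In this model the seek-heavy task receives a quarter of the sectors under equal time slices, but the same number of sectors under equal budgets, where it instead occupies the disk four times as long.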
Adrian Bunk posted a patch to make Linux IO schedulers a non-modular option, which would require one IO scheduler to be selected at compile time. He suggested, "there isn't any big advantage and doesn't seem to be much usage of modular schedulers." He added that removing the option to make IO schedulers modular would save 2kB on each kernel image. Jens Axboe did not like the patch, "big nack, I use it all the time for testing. Just because you don't happen to use it is not a reason to remove it." When Adrian noted that no distros seemed to be making IO schedulers available as modules, Jens suggested that this was a mistake and quipped, "it's been a long time since I considered a distro .config a benchmark/guideline of any sort."
Adrian went on to ask for the technical reasons for continuing to support four different IO schedulers, expressing concern that it could lead to bugs in individual schedulers going unreported. Jens explained that he was aiming for the perfect IO scheduler, but at this time different IO schedulers offer better results for different workloads, "with some hard work and testing, we should be able to get rid of [the anticipatory scheduler]. It still beats cfq for some of the workloads that deadline is good at, so not quite yet." Arjan van de Ven offered, "there is at least one technical reason to need more than one: certain types of storage (both big EMC boxes as well as solid state disks) don't behave like disks and have no seek penalty; any cpu time spent on avoiding seeks is wasted on those, so for these devices one really wants to use a different IO scheduler, one which is much lighter weight". Jens then acknowledged, "there's always a risk with 'duplication', like several drivers for the same hardware. I'm not disputing that."
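Because several schedulers can coexist in one kernel, the active one can be inspected per block device at run time through sysfs, where the file /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one with square brackets. The Python sketch below parses that format; the helper names and the device name "sda" are assumptions for illustration, and the sample string stands in for what such a file might contain on a kernel of that era.

```python
from pathlib import Path


def scheduler_path(device):
    """Return the sysfs file that lists the I/O schedulers for a block device."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def parse_schedulers(text):
    """Split a sysfs scheduler line into (available names, active name).

    The active scheduler is the entry wrapped in square brackets.
    """
    available = []
    active = None
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


# Sample content, assumed for illustration rather than read from a real system.
sample = "noop anticipatory deadline [cfq]\n"
print(parse_schedulers(sample))
# (['noop', 'anticipatory', 'deadline', 'cfq'], 'cfq')

# On a Linux machine with a device named "sda" (an assumption), one could run:
# print(parse_schedulers(scheduler_path("sda").read_text()))
```

Writing one of the listed names to the same file as root switches the scheduler for that device, which is what makes per-device choices such as the one Arjan describes practical.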
"With latencytop, I noticed that the (in memory) atime updates during a kernel build had latencies of 600 msec or longer; this is obviously not so nice behavior. Other EXT3 journal related operations had similar or even longer latencies," Arjan van de Ven reported, describing a "mass priority inversion" caused by, "an interaction between EXT3 and CFQ in that CFQ tries to be fair to everyone, including kjournald. However, in reality, kjournald is 'special' in that it does a lot of journal work". Finally, he offered a tiny patch to resolve the issue, "the patch below makes kjournald of the IOPRIO_CLASS_RT priority to break this priority inversion behavior. With this patch, the latencies for atime updates (and similar operation) go down by a factor of 3x to 4x !"
Andrew Morton took a cautious stance, "seems a pretty fundamental change which could do with some careful benchmarking, methinks. See, your patch amounts to 'do more seeks to improve one test case'. Surely other testcases will worsen. What are they?" CFQ author Jens Axboe agreed, "It should not be merged as-is, instead I'll provide a function to do this." Ingo Molnar wasn't convinced, "atime update latencies went down by a factor of 3x-4x ... but what bothers me even more is the large picture. Linux's development is still fundamentally skewed towards bandwidth (which goes up with hardware advances anyway), while the focus on latencies is very lacking (which users do care about much more and which usually does _not_ improve with improved hardware), so i cannot see why we shouldnt apply this." He added, "if bandwidth hurts anywhere, it will be pointed out and fixed, we've got like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth problems. But _finally_ we now have desktop latency tools, hard numbers and patches that fix them, but what do we do ... we put up extra roadblocks??" Andrew calmly replied, "I think the situation is that we've asked for some additional what-can-be-hurt-by-this testing. Yes, we could sling it out there and wait for the reports. But often that's a pretty painful process and regressions can be discovered too late for us to do anything about them."
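The IOPRIO_CLASS_RT class that Arjan's patch assigns to kjournald is one of the I/O priority classes defined in the kernel's ioprio header, where a priority value packs the class into the high bits and a level into the low bits. The Python sketch below mirrors that encoding for illustration; the function names are hypothetical, and the level 4 used in the example is an assumption, since the summary does not say which level the patch used.

```python
# Constants mirroring the kernel's I/O priority encoding.
IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1    # real-time: served ahead of the other classes
IOPRIO_CLASS_BE = 2    # best-effort: the usual class for ordinary tasks
IOPRIO_CLASS_IDLE = 3  # idle: served only when the disk is otherwise unused


def pack_ioprio(prio_class, level):
    """Combine a class and a level (0 is highest, 7 is lowest) into one value."""
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return (prio_class << IOPRIO_CLASS_SHIFT) | level


def unpack_class(value):
    """Extract the priority class from a packed value."""
    return value >> IOPRIO_CLASS_SHIFT


def unpack_level(value):
    """Extract the priority level from a packed value."""
    return value & ((1 << IOPRIO_CLASS_SHIFT) - 1)


# Level 4 is an assumed example, not the value used in the patch.
rt = pack_ioprio(IOPRIO_CLASS_RT, 4)
be = pack_ioprio(IOPRIO_CLASS_BE, 4)
print(rt, be)
# 8196 16388
print(unpack_class(rt) == IOPRIO_CLASS_RT, unpack_level(rt))
# True 4
```

Moving a thread from the best-effort class to the real-time class changes only the class bits, which is why a patch of this kind can be very small while still, as Andrew points out, changing how requests are ordered for every workload.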
"Completely fair scheduling is [a] really good thing, but if you want the best performance for certain applications you need to tune up some things," explained Michal Piotrowski in his announcement for the fifth version of his DeskOpt daemon. The daemon is a Python script that helps to automatically tune the I/O scheduler and the process scheduler to offer better performance for certain applications such as games or audio applications. The script supports the default CFS process scheduler and CFQ I/O scheduler, as well as the anticipatory I/O scheduler and the deadline I/O scheduler.
The small script uses an XML configuration file, deskopt.conf, to define scheduler classes, each with its own scheduler tunings. One or more applications can then be added to each scheduler class, and when any of the specified applications starts, the daemon automatically tunes the schedulers according to the settings in that class. As examples, the sample configuration file Michal provides defines a "games" scheduler class containing two games that receive the highest scheduler priority, and an "audio" scheduler class that receives a slightly lower scheduler priority.
Jens Axboe has been involved with Linux since 1993. 30 years old, he lives in Copenhagen, Denmark, and works as a Linux kernel developer for Oracle. His block layer rewrite launched the 2.5 kernel development branch, a layer he continues to maintain and improve. Interested in most anything dealing with IO, he has introduced several new IO schedulers to the kernel, including the default CFQ, or Completely Fair Queuing scheduler.
In this interview, Jens talks about how he got interested in Linux, how he became the maintainer of the block layer and other block devices, and what's involved in being a maintainer. He describes his work on IO schedulers, offering an in-depth look at the design and current status of the CFQ scheduler, including a peek at what's in store for the future. He conveys his excitement about the new splice IO model, explaining how it came about and how it works. And he discusses the current 2.6 kernel development process, the impact of git, and why the GPL is important to him.