"It's been two weeks rather than the usual one, because we've been hunting a really annoying VM regression that not a lot of people seem to have seen, but I didn't want to release an -rc4 with it," began Linux creator Linus Torvalds, announcing the 2.6.34-rc4 Linux kernel. He explained, "we had the choice of either reverting all the anon-vma scalability improvements, or finding out exactly what caused the regression and fixing it. And we got pretty close to the point where I was going to just revert it all." Linus continued:
"Absolutely _huge_ kudos to Borislav Petkov who reported the problem and was able to not just reliably reproduce it, but also test new patches to try to narrow things down at a moments notice. The thing took ten days of emails flying back and forth, and Borislav was there all the time, day and night, through several patches that tried to fix it (several real bugs, but not the one he hit) and lots of patches to just add instrumentation to get us nearer to the cause of the problem. And finally, today, confirmation that we actually nailed the problem. So if anybody has been seeing a oops (or sometimes a GP fault) in page_referenced(), that should be gone now."
As for the rest of the changes, Linus noted, "the bulk of the changes come from drivers - a new network driver (cxgb4), but also updates to the radeon and nouveau drivers. And then there is the random updates everywhere." Read on for the full changelog.
"I realize that getting the POWER people to accept that they have been total morons when it comes to VM for the last three decades is hard, but somebody in the POWER hardware design camp should (a) be told and (b) be really ashamed of themselves."
"Yes, the VM is hard. I agree. It's nasty. But exactly because it's nasty and subtle and horrid, I'm also very anal about it, and I get really nervous when somebody touches it without (a) knowing all the rules intimately and (b) listening to people who do."
Mark Weinem offered a summary of NetBSD's six 2007 Summer of Code development projects. The projects included: the Automated Testing Framework, "the goal of the ATF project was to develop a testing framework to easily define test cases and run them in a completely automated way"; porting ZFS, "the primary goal of this project was to port volume emulation (ZVOL) functionality in order to mount ZFS file systems"; QoS framework for NetBSD's virtual memory system, "for delay sensitive systems such as streaming multimedia servers and back-end database systems, servicing the reader processes in a timely fashion is more important than the servicing the writers"; kernel file systems in userspace, as a result of the project, "almost all NetBSD kernel file systems can be compiled, mounted and run in userspace"; and hardware monitoring, "the aim of this project was to develop a kernel event notification framework to notify userland of hardware changes e.g. a new USB device being added". Mark added:
"NetBSD has been involved in the Google Summer of Code since its conception in 2005. This year we were glad to once again have the opportunity to introduce six students to our operating system, to Open Source software development and get them sponsored by Google to work on projects defined by the NetBSD developers."
"We don't want to introduce pointless delays in throttle_vm_writeout() when the writeback limits are not yet exceeded, do we?" asked Fengguang Wu as the description of his patch to mm/page-writeback.c. Andrew Morton replied, "this is a pretty major bugfix," explaining, "this patch has the potential to significantly alter the dynamics of the VM behaviour under particular workloads. It might turn up other stuff..." He continued:
"I wonder why nobody noticed this happening. Either a) it turns out that kswapd is doing a good job and such callers don't do direct reclaim much or b) nobody is doing any in-depth kernel instrumentation.
"Now, how _would_ one notice this problem? We don't have very good tools, really. Booting with "profile=sleep" and looking at the profile data would be one way. Repeatedly doing sysrq-T is another. Perhaps the new lockstat-via-lockdep code would allow this to be observed in some fashion, dunno."
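The patch description suggests the fix amounts to testing the writeback limit before sleeping rather than after. The following is a toy Python sketch of that ordering difference, not the kernel code: the names `throttle_wait_first`, `throttle_check_first` and `FakeWriteback` are hypothetical, and the "wait first" variant is only an assumption about what the old behaviour looked like.

```python
from typing import Callable


def throttle_wait_first(in_flight: Callable[[], int], limit: int,
                        wait: Callable[[], None]) -> int:
    """Assumed old ordering: sleep once, then look at the limit."""
    waits = 0
    while True:
        wait()
        waits += 1
        if in_flight() <= limit:
            return waits


def throttle_check_first(in_flight: Callable[[], int], limit: int,
                         wait: Callable[[], None]) -> int:
    """Ordering the patch description asks for: only sleep while over the limit."""
    waits = 0
    while in_flight() > limit:
        wait()
        waits += 1
    return waits


class FakeWriteback:
    """Toy model: every wait lets `drain` pages finish writeback."""

    def __init__(self, pages: int, drain: int) -> None:
        self.pages = pages
        self.drain = drain

    def in_flight(self) -> int:
        return self.pages

    def wait(self) -> None:
        self.pages = max(0, self.pages - self.drain)
```

In this model a caller that is already under the limit pays for one wait in the first variant and none in the second, which is the "pointless delay" the description refers to. Once the limit really is exceeded, both variants wait the same number of times.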
Rik van Riel posted some thoughts on the page replacement requirements of the Linux VM, noting that the same kinds of bugs have been fixed and reintroduced over the past few years: "this has convinced me that it is time to take a look at the actual requirements of a page replacement mechanism, so we can try to fix things without reintroducing other bugs. Understanding what is going on should also help us deal better with really large memory systems." He added his thoughts from this email to the linux-mm wiki, which he plans to update as new requirements surface.
The initial requirements shortlist included seven items: "1) must select good pages for eviction; must not submit too much I/O at once. Submitting too much I/O at once can kill latency and even lead to deadlocks when bounce buffers (highmem) are involved. Note that submitting sequential I/O is a good thing; 2) must be able to efficiently evict the pages on which pageout I/O completed; 3) must be able to deal with multiple memory zones efficiently; 4) must always have some pages ready to evict. Scanning 32GB of "recently referenced" memory is not an option when memory gets tight; 5) must be able to process pages in batches, to reduce SMP lock contention; 6) a bad decision should have bounded consequences. The VM needs to be resilient against its own heuristics going bad; 7) low overhead of execution." He continued with a more in-depth discussion of the various requirements.
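The batching requirement (item 5) can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not how the Linux VM is implemented: `PageList`, `take_batch` and `scan` are hypothetical names, pages are plain integers, and a `threading.Lock` stands in for the lock protecting a page list.

```python
import threading
from collections import deque
from typing import Callable, Iterable, List


class PageList:
    """Toy page list guarded by one lock; counts how often the lock is taken."""

    def __init__(self, pages: Iterable[int]) -> None:
        self._lock = threading.Lock()
        self._pages = deque(pages)
        self.lock_acquisitions = 0

    def take_batch(self, batch_size: int) -> List[int]:
        # One lock round-trip detaches up to batch_size pages from the list.
        with self._lock:
            self.lock_acquisitions += 1
            count = min(batch_size, len(self._pages))
            return [self._pages.popleft() for _ in range(count)]


def scan(page_list: PageList, batch_size: int,
         is_evictable: Callable[[int], bool]) -> List[int]:
    """Drain the list in batches and return the pages chosen for eviction.

    Pages that are not evictable are simply dropped from this toy list.
    """
    evicted: List[int] = []
    while True:
        batch = page_list.take_batch(batch_size)
        if not batch:
            break
        # The per-page decision runs outside the lock.
        evicted.extend(page for page in batch if is_evictable(page))
    return evicted
```

Scanning 100 pages with a batch size of 32 takes the lock five times in this model (four non-empty batches plus the final empty one), against 101 times with a batch size of 1. Fewer lock round-trips is the point of the requirement on SMP systems.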
Andrea Arcangeli is well known for having completely rewritten and stabilized the virtual memory subsystem in the 2.4 Linux kernel. Many were surprised when Linus Torvalds merged Andrea's VM into 2.4.10, but the new memory subsystem has long since proved itself. Andrea is a 27-year-old Linux kernel hacker living in Italy and working for SUSE.
A recent lkml thread explored an interesting tangent when Jeff Garzik asked about what was to follow the 2.5 development kernel, "is it definitely to be named 2.6? Maybe it's just my impression from development speed, but it felt more like a 3.0 to me :)". Linux creator Linus Torvalds first suggested that there was no reason to skip from 2.5 to 3.0, qualifying it with, "But hey, it's just a number. I don't feel that strongly either way."
Ingo Molnar reflected on the significant improvements we've seen to the VM and the IO subsystem, going so far as to say, "I think due to these improvements if we dont call the next kernel 3.0 then probably no Linux kernel in the future will deserve a major number. In 2-4 years we'll only jump to 3.0 because there's no better number available after 2.8."
Linus agreed that if the VM is as good as it seems to be, the upcoming release does deserve to be called 3.0. But he also pointed out that there are many silent users who tend not to speak up until there is an official release. He asked, "people who are having VM trouble with the current 2.5.x series, please _complain_, and tell what your workload is. Don't sit silent and make us think we're good to go.. And if Ingo is right, I'll do the 3.0.x thing."