"The following patches have been in the -mm tree for a while, and I plan to push them to Linus when the 2.6.25 merge window opens," began Theodore Ts'o, offering the patches for review before they are merged. He explained that the patches introduce some of the final changes to the ext4 on-disk format, "ext4, shouldn't be deployed to production systems yet, although we do salute those who are willing to be guinea pigs and play with this code!" He continued:
"With this patch series, it is expected that [the] ext4 format should be settling down. We still have delayed allocation and online defrag which aren't quite ready to merge, but those shouldn't affect the on-disk format. I don't expect any other on-disk format changes to show up after this point, but I've been wrong before.... any such changes would have to have a Really Good Reason, though."
"I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of the cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series," announced Theodore Ts'o. He also noted of the current ext4 git tree, "it also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series." In an earlier thread Theodore posted a series of patches specifically intended for inclusion in the upcoming 2.6.24 kernel. Included in the patch series was a patch for improving fsck performance, "in performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count." The patch included an explanation of how the feature works, enabled through a mkfs option:
"With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation."
UBIFS is described as, "a new flash file system which is designed to work on top of UBI." It has replaced the JFFS3 project, a choice explained on the project webpage, "we have realized that creating a scalable flash file system on top of bare flash is a difficult task, just because the flash media is so problematic (wear-leveling, bad eraseblocks). We have tried this way, and it turned out to be that we solved media problems, instead of concentrating on file system issues. So we decided to split one big and complex tasks into 2 sub-tasks: UBI solves the media problems, like bad eraseblocks and wear-leveling, and UBIFS implements the file system on top. And now finally, we may concentrate on file-system issues: implementing write-back caching, multi-headed journal, garbage collector, indexing information management and so on. There are a lot of FS problems to solve - orphaned files, deletions, recoverability after unclean reboots and so on."
In a recent posting to the lkml, Artem Bityutskiy noted that UBIFS has to take into account that there is a small amount of unused block space at the ends of eraseblocks, and the size of pages written to disk are smaller than they are in memory as the filesystem performs compression. "So, if our current liability is X, we do not know exactly how much flash space (Y) it will take. All we can do is to introduce some pessimistic, worst-case function Y = F(X). This pessimistic function assumes that pages won't be compressible, and it assumes worst-case wastage." The calculation is necessary as even though data is not written immdiately to the flash device, it's important to be able to inform the application writing data if there's not enough space left. "So my question is: how can we flush _few_ oldest dirty pages/inodes while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), ->link(), etc)?"
"In [the first pass] of e2fsck, every inode table in the fileystem is scanned and checked, regardless of whether it is in use," Avantika Mathur began. "This is the most time consuming part of the filesystem check. The unintialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes." She went on to explain how it works, "with this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation." Avantika attached a graph illustrating the advantage of the patch which she summarized as follows:
"The patches have been stress tested with fsstress and fsx. In performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count. Since typical ext4 filesystems only use 1-10% of their inodes, this feature can greatly reduce e2fsck time for users. With performance improvement of 2-20 times, depending on how full the filesystem is."
Theodore Ts'o posted an update on the ext4 filesystem [story], "I've respun the ext4 development patchset, with Amit's updated fallocate patches. I've added Dave's patch to add ia64 support to the fallocate system call, but *not* the XFS fallocate support patches. (Probably better for them to live in an xfs tree, where they can more easily tested and updated.) Yes, we haven't reached complete closure on the fallocate system call calling convention, but it's enough for us to get more testing in -mm." Jeff Garzik noted that none of this development was happening in the kernel as originally planned, "why isn't this stuff going upstream rapidly? AFAICT nothing much at all has happened upstream besides a mass renaming? The whole point of having ext4 in the kernel is to do development upstream, in the public view, getting new stuff in ASAP (even if that means changing or pulling some stuff later)."
Theodore acknowledged, "in general, yes, ext4 development has been a little slow; part of the problem is that we have a lot of people, but a number of folks are new and their patches need review before they are ready for upstream acceptance, and a number of other folks who should be doing the review have been overloaded with multiple other projects and have been time-sharing." He went on to note, "but we also get flamed when the patches don't meet various criteria, up to and including breaking on ia64. We are in the process of setting up automated testing to help address that problem, but it's a taken a little while to get that going. I'm also trying to schedule more time so I can do the needed review of the patches so they meet basic upstream standards so we *can* push them. If other folks would like to help with the review process, that would be more than welcome. But yes, we will try to get more of the patches pushed sooner rather than later."
Linus Torvalds announced the release of the 2.6.19 Linux kernel, following the previous stable kernel release by two months [story]. "It's one of those rare 'perfect' kernels," Linus joked, "so if it doesn't happen to compile with your config (or it does compile, but then does unspeakable acts of perversion with your pet dachshund), you can rest easy knowing that it's all your own d*mn fault, and you should just fix your evil ways." He went on to add, "you could send me and the kernel mailing list a note about it anyway, of course. (And perhaps pictures, if your dachshund is involved. Not that we'd be interested, of course. No. Just so that we'd know to avoid it next time)."
The latest kernel source can be downloaded from your nearest Linux Kernel archive mirror [story]. You can browse through all the changes using the gitweb interface. Kernel Newbiews also maintains a useful summary of all the changes that went into this new version of the Linux kernel, including the inclusion of three new filesystems, GFS2, ext4 [story], and eCryptfs.
With the release of the 2.6.19-rc1-mm1 kernel, the ext4 filesystem [story] was merged into Andrew Morton [interview]'s -mm tree for further testing. In the announcement Andrew notes that the new filesystem is compatible with ext3 until you add a file that has extents. He also notes, "when comparing performance with other filesystems, remember that ext3/4 by default offers higher data integrity guarantees than most. So when comparing with a metadata-only journalling filesystem, use `mount -o data=writeback'. (Although this doesn't seem to make much difference with ext3)" The goal is to stabilize the new filesystem within the next six to nine months, and ultimately to replace the ext3 filesystem.
The discussion about why the Reiser4 filesystem has not been merged into the Linux kernel [story] continues on the lkml. Hans Reiser [interview] contrasted the struggles Reiser4 has had trying to get merged versus recent discussion about the up and coming ext4 filesystem [story], "the code isn't even written, benchmarked, or tested yet, and it is going into the kernel already so that its developers don't have to deal with maintaining patches separate from the tree. Wow. Kind of hard to argue that it is not politically differentiated, isn't it?"
Theodore T'so responsed, "it is a development procedure that was developed after discussion and consensus building across LKML and the ext2/3/4 development team. It was not the original plan put forth by the ext2 developers, but after listening to the concerns and suggestions, we did not question the motives of the people making suggestions; we listened." He went on to note that parts of what will be ext4 were written a year ago, and have been heavily tested and reviewed. Others pointed out that the evolution between ext3 and ext4 will be a very public process, with patches being merged gradually, whereas Reiser4 is a completely different code base from Reiser3.
The latest chapter in this ongoing debate tends to be more about clashing personalities than the code in question. How this affects if and when the Reiser4 filesystem will be merged into the mainline Linux kernel is yet to be seen.
Theodore Ts'o offered an insightful summary of issues affecting future development on the ext3 filesystem, "it is clear that many people feel they have a stake in the future development plans of the ext2/ext3 filesystem, as it [is] one of the most popular and commonly used filesystems, particular amongst the kernel development community. For this reason, the stakes are higher than it would be for other filesystems." He listed the three main concerns for future development as stability, compatibility confusion, and code complexity, "unfortunately, these various concerns were sometimes mixed together in the discussion two months ago, and so it was hard to make progress. Linus's concern seems to have been primarily the first point, with perhaps a minor consideration of the 3rd. Others dwelled very heavily on the second point."
Theodore went on to say, "to address these issues, after discussing the matter amongst ourselves, the ext2/3 developers would like to propose the following path forward." He listed a four step plan beginning with the creation of a new ext4 filesystem registered with the kernel temporarily as 'ext3dev', "this will be explicitly marked as an CONFIG_EXPERIMENTAL filesystem, and will in affect be a 'development fork' of ext3. A similar split of the fs/jbd will be made in order to support 64-bit jbd, which will be used by fs/ext4 and future versions of ocfs2." Theodore explained that new features will go into the ext3dev tree, with only bugfixes making their way back to the stable ext3 tree. He noted that it will remain important that the ext4 code base can mount ext3 filesystems, "this is necessary to ensure a future smooth upgrade path from ext3 to ext4 users." Finally, "probably in 6-9 months when we are satisified with the set of features that have been added to fs/ext4, and confident that the filesystem format has stablized, we will submit a patch which causes the fs/ext4 code to register itself as the ext4 filesystem." He further noted that once ext4 is deemed fully stable, it may completely replace ext3 in the source tree.
Continuing the earlier discussion about low latency and Ingo Molnar [interview]'s voluntary kernel preemption patch [story], the conversation moved onto the affect a filesystem can have on latency. Specifically, 2.6 maintainer Andrew Morton [interview] noted that ReiserFS was known to have some latency issues in both the 2.4 and 2.6 Linux kernels, "resierfs: yes, it's a problem. I 'fixed' it multiple times in 2.4, but the fixes ended up breaking the fs in subtle ways and I eventually gave up." However, he did go on to note, "actually, the 2.4 low-latency patch does still have some reiserfs fixes, so it's probably better than reiserfs in 2.6."
When asked if ext3 was a better choice for low latency work, Andrew Morton replied, "ext3 is certainly better than [reiserfs], but still has a couple of potential problem spots. ext2 is probably the best at this time." Data is continuing to be collected and reviewed by a number of kernel developers, so the more noticeable latency issues in the 2.6 kernel will likely be addressed soon.