ZFS support on Linux
Daniel Phillips noted that his new Tux3 versioning filesystem is now operating like a filesystem, "the last burst of checkins has brought Tux3 to the point where it undeniably acts like a filesystem: one can write files, go away, come back later and read those files by name. We can see some of the hoped for attractiveness starting to emerge: Tux3 clearly does scale from the very small to the very big at the same time. We have our Exabyte file with 4K blocksize and we can also create 64 Petabyte files using 256 byte blocks." He went on to discuss some of the remaining features yet to be implemented, including atomic commits, versioning, coalesce on delete, a version of the filesystem written in the kernel, extents, locking, and extended attributes.
Reviewing the above list, Daniel decided he would work next on the coalesce on delete functionality, noting, "without this we can still delete files but we cannot recover file index blocks, only empty them, not so good." He added that at this time he was only going to focus on file truncation, "as soon as file truncation is added to the test mix we will see much more interesting behavior from the bitmap allocator, and we will discover some great ways to generate horrible fragmentation issues. Yummy." Daniel continued to point out that Tux3 is an open source project, and as such is always looking for others to participate, "whoever wants to carve their initials on what is starting to look like a for-real Linux filesystem, now is a great time to take a flyer. The code base is still tiny, builds fast, has lots of interactive feedback and is easy to work on. And you get to put your email address near the beginning of the list, which will naturally write its way into the history of open source. Probably."
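A side note on the size figures Daniel quotes: the two limits are consistent with a fixed per-file budget of 2^48 blocks, in which case the maximum file size scales linearly with block size. That budget is an inference from the quoted numbers, not something Daniel states; the arithmetic checks out as follows:

```c
#include <stdio.h>
#include <stdint.h>

/* Back-of-envelope check: assuming a per-file limit of 2^48 blocks
 * (inferred from the quoted sizes, not stated in the post), maximum
 * file size scales linearly with block size. */
int main(void)
{
    const uint64_t max_blocks = 1ULL << 48;

    /* 2^48 blocks * 4096 bytes = 2^60 bytes, about an exabyte */
    printf("4K blocks:   %llu bytes\n", (unsigned long long)(max_blocks * 4096));

    /* 2^48 blocks * 256 bytes = 2^56 bytes = 64 petabytes */
    printf("256B blocks: %llu bytes\n", (unsigned long long)(max_blocks * 256));
    return 0;
}
```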
"I'd like to get a first round of review on my AXFS filesystem," began Jared Hulbert, describing his new Advanced XIP File System for Linux. XIP stands for eXecute-In-Place. The new filesystem received quite a bit of positive feedback. Jared offered the following description:
"This is a simple read only compressed filesystem like Squashfs and cramfs. AXFS is special because it also allows for execute-in-place of your applications. It is a major improvement over the cramfs XIP patches that have been floating around for ages. The biggest improvement is in the way AXFS allows for each page to be XIP or not. First, a user collects information about which pages are accessed on a compressed image for each mmap()ed region from /proc/axfs/volume0. That 'profile' is used as an input to the image builder. The resulting image has only the relevant pages uncompressed and XIP. The result is smaller memory sizes and faster launches."
"It is about time to take a step back and describe what I have been implementing," began Daniel Phillips, referring to his new Tux3 filesystem. He provided a simple ASCII diagram that detailed the filesystem's hierarchical structure, describing each of the elements. About one he noted, "the volume table is a new addition not central to the goals of Tux3, but a nice feature to have given that it comes nearly for free. One Tux3 volume can have an arbitrary number of separate filesystems tucked inside it, indexed by a simple integer parameter at mount time. People say they like this idea and it imposes no significant complexity, so it goes in." Daniel continued:
"Each volume has a metablock pointing at the forward log chain for the volume, a version table that describes the hierarchical relationship between versions (snapshots), an atime table to take care of that horrid legacy Unix feature, and an inode table containing files and attributes of files. [...] Versioning takes place in three places, versioned pointers in the atime btree, versioned extents in a file data btree and versioned attributes in the inode table. [...] Notice the absence of a journal, the functionality of which is provided by forward log elements that I described in the Hammer thread (and will eventually write a separate post about)."
"Any benchmark is going to be a benchmark of the OS as much as it is going to be a benchmark of the filesystem. It's pretty hard to separate the two. ZFS is best tested on Open Solaris. UFS is best tested on FreeBSD, EXT3 is best tested on Linux, and HAMMER of course is best tested on DragonFly."
"Btrfs v0.16 is available for download," began Chris Mason, announcing the latest release of his new Btrfs filesystem. He noted, "v0.16 has a shiny new disk format, and is not compatible with filesystems created by older Btrfs releases. But, it should be the fastest Btrfs yet, with a wide variety of scalability fixes and new features." Improved scalability and performance improvements include fine grained btree locking, pushing CPU intensive operations such as checksumming into their own background threads, improved
data=ordered mode, and a new cache to reduce IO requirements when cleaning up old transactions. Other new features include support for ACLs, prevention of orphaned inodes so files won't be lost after a crash, and a more robust directory index format. Chris noted:
"There are still more disk format changes planned, but we're making every effort to get them out of the way as quickly as we can. You can see the major features we have planned on the development timeline. [...] the btrfs kernel module now weighs in at 30,000 LOC, which means we're getting very close to the size of ext."
"The big advantage Hammer has over Tux3 is, it is up and running and released in the Dragonfly distro," began Daniel Phillips, offering a comparison between the two filesystem. He continued, "the biggest disadvantage is, it runs on BSD, not Linux, and it so heavily implements functionality that is provided by the VFS and block layer in Linux that a port would be far from trivial. It will likely happen eventually, but probably in about the same timeframe that we can get Tux3 up and stable." This led into a lengthy and interesting technical discussion between Daniel and HAMMER author Matthew Dillon, comparing the design of the two filesystems.
Matthew reviewed the Tux3 notes and replied, "it sounds like Tux3 is using many similar ideas [as HAMMER]. I think you are on the right track. I will add one big note of caution, drawing from my experience implementing HAMMER, because I think you are going to hit a lot of the same issues. I spent 9 months designing HAMMER and 9 months implementing it. During the course of implementing it I wound up throwing away probably 80% of the original design outright." Daniel noted that he's been working on the Tux3 design for around ten years, "and working seriously on the simplifying elements for the last three years or so, either entirely on paper or in related work like ddsnap and LVM3." Matthew cautioned, "I can tell you've been thinking about Tux for a long time. If I had one worry about your proposed implementation it would be in the area of algorithmic complexity. You have to deal with the in-memory cache, the log, the B-Tree, plus secondary indexing for snapshotted elements and a ton of special cases all over the place. Your general lookup code is going to be very, very complex. My original design for HAMMER was a lot more complex (if you can believe it!) than the end result. A good chunk of what I had to do going from concept to reality was deflate a lot of that complexity." The friendly conversation offers a very detailed look at the design choices made in each of these filesystems.
"I have had to apply the reiser4 patches from -mm kernels to vanilla based patchset for over a year now. Reiser4 works fine, what will it take to get it included in vanilla?" began a brief thread on the Linux Kernel mailing list. Theodore Ts'o offered several links detailing the reamining issues with Reiser4, then suggested, "people who really like reiser4 might want to take a look at btrfs; it has a number of the same design ideas that reiser3/4 had --- except (a) the filesystem format has support for some advanced features that are designed to leapfrog ZFS, (b) the maintainer is not a crazy man and works well with other LKML developers (free hint: if your code needs to be reviewed to get in, and reviewers are scarce; don't insult and abuse the volunteer reviewers as Hans did --- Not a good plan!)."
Edward Shishkin noted that Reiser4 development continues, "I am working on the plugin design document. It will be ready approximately in September. I believe that it'll address all the mentioned complaints." He added, "This document [defines] plugins [and] primitives (like conversion of run-time objects) used in reiser4, and describes all reiser4 interfaces, so that it will be clear that VFS functionality is not duplicated, there are [no] VFS layers inside reiser4, etc."
Hans Reiser, the original developer of the Reiser4 filesystem, was convicted of first-degree murder on April 28, 2008. The latest Reiser4 patches currently live on kernel.org, as do the necessary support programs.
"Since everybody seems to be having fun building new filesystems these days, I thought I should join the party, began Daniel Phillips, announcing the Tux3 versioning filesystem. He continued, "Tux3 is a write anywhere, atomic commit, btree based versioning filesystem. As part of this work, the venerable HTree design used in Ext3 and Lustre is getting a rev to better support NFS and possibly become more efficient." Daniel explained:
"The main purpose of Tux3 is to embody my new ideas on storage data versioning. The secondary goal is to provide a more efficient snapshotting and replication method for the Zumastor NAS project, and a tertiary goal is to be better than ZFS."
In his announcement email, Daniel noted that implementation work is underway, "much of the work consists of cutting and pasting bits of code I have developed over the years, for example, bits of HTree and ddsnap. The immediate goal is to produce a working prototype that cuts a lot of corners, for example block pointers instead of extents, allocation bitmap instead of free extent tree, linear search instead of indexed, and no atomic commit at all. Just enough to prove out the versioning algorithms and develop new user interfaces for version control."
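Those corner-cutting choices are simple enough to sketch. An allocation bitmap with a linear first-fit search is about the simplest block allocator there is, and exactly the kind of thing that produces the fragmentation Daniel jokes about; this is an illustration of the approach, not Tux3's code:

```c
#include <stdint.h>

/* Minimal linear-scan bitmap allocator of the sort the prototype
 * describes: one bit per block, first-fit search. Illustrative only. */
#define NBLOCKS 4096
static uint8_t bitmap[NBLOCKS / 8];

static long balloc(void)
{
    for (long blk = 0; blk < NBLOCKS; blk++) {
        if (!(bitmap[blk / 8] & (1 << (blk % 8)))) {
            bitmap[blk / 8] |= 1 << (blk % 8);
            return blk;             /* first free block wins */
        }
    }
    return -1;                      /* volume full */
}

static void bfree(long blk)
{
    bitmap[blk / 8] &= ~(1 << (blk % 8));
}
```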
A recent thread on the lkml discussed a blog entry stating that minimal ZFS support for GRUB was available under the GPL license, "we could now use that code to implement support for ZFS in the Linux kernel." Alan Cox explained, "no we can't. The GPL ZFS bits don't include the various methods that would violate the patent so there is no grant. I've several times asked Sun to simply give permission and they don't even answer. I can only read the Sun motivation one way - they want to look open but know that ZFS is about the only thing that might save Solaris as a product in the data centre so are not truly prepared to let Linus use it." H. Peter Anvin added, "from what I can see, it is an absolutely-minimal read only implementation."
Christoph Hellwig offered, "adding a read-only (for the start) zfs driver for Linux would be useful for various purposes. And adding read-only filesystems to Linux is really easy." Referring to the individual who started the discussion, he added, "if Fred really cares about it I'd be very happy to mentor him implementing it. It should be a very good learning exercise for him." When asked if this offer applied to anyone else, Christoph replied, "yes, this offer is of course up to everyone interested. But it's not purely an integration effort in the traditional sense, the grub filesystem interface is quite different from the Linux one, and the code structure and style is quite different. But if you're willing to learn it should be very interesting."
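To give a sense of how little boilerplate a read-only Linux filesystem needs, here is a skeleton of the registration and mount path. The "zfsro" name is invented for illustration, the kernel interfaces shown have changed across versions, and a real driver would parse an on-disk superblock rather than fabricating an empty root; treat this as a sketch, not buildable code for any particular release:

```c
#include <linux/module.h>
#include <linux/fs.h>

#define ZFSRO_MAGIC 0x2f5a4653              /* arbitrary magic for the sketch */

static const struct super_operations zfsro_sops = {
    .statfs = simple_statfs,
};

static int zfsro_fill_super(struct super_block *sb, void *data, int silent)
{
    struct inode *root;

    sb->s_flags |= SB_RDONLY;               /* refuse read-write mounts */
    sb->s_magic = ZFSRO_MAGIC;
    sb->s_op = &zfsro_sops;

    /* A real driver would read and validate the on-disk superblock
     * here; this sketch just fabricates an empty root directory. */
    root = new_inode(sb);
    if (!root)
        return -ENOMEM;
    root->i_mode = S_IFDIR | 0555;
    root->i_op = &simple_dir_inode_operations;
    root->i_fop = &simple_dir_operations;
    sb->s_root = d_make_root(root);
    return sb->s_root ? 0 : -ENOMEM;
}

static struct dentry *zfsro_mount(struct file_system_type *fs_type,
                                  int flags, const char *dev_name, void *data)
{
    return mount_bdev(fs_type, flags | SB_RDONLY, dev_name, data,
                      zfsro_fill_super);
}

static struct file_system_type zfsro_fs_type = {
    .owner    = THIS_MODULE,
    .name     = "zfsro",
    .mount    = zfsro_mount,
    .kill_sb  = kill_block_super,
    .fs_flags = FS_REQUIRES_DEV,
};

static int __init zfsro_init(void)
{
    return register_filesystem(&zfsro_fs_type);
}

static void __exit zfsro_exit(void)
{
    unregister_filesystem(&zfsro_fs_type);
}

module_init(zfsro_init);
module_exit(zfsro_exit);
MODULE_LICENSE("GPL");
```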
Evgeniy Polyakov announced the latest release of his Parallel Optimized Host Message Exchange Layered File System, POHMELFS. He noted that the big new feature in this release is strong crypto support, "one can specify [an] encryption method (like cbc(aes)), hash or digest, or all of them to be performed on [the] whole data channel (except headers)." In his blog, Evgeniy adds, "Cryptography support is [an] essential addition to the POHMELFS core. It was implemented with performance in mind, so that processing speeds would not drop noticeably even [during] very CPU-hungry operations". He explained, "POHMELFS utilizes [a configurable number of] pools of crypto threads, which perform data crypto processing and submit it either to [the] network or VFS layer." He included results from some performance benchmarks.
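The architecture Evgeniy describes, a configurable pool of threads that encrypt buffers and then pass them on to the network or VFS layer, is a classic producer/consumer arrangement. The userspace pthreads sketch below shows the shape of it; POHMELFS itself does this with kernel threads and the kernel crypto API, and the XOR "cipher" here is only a stand-in for cbc(aes):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace illustration of the POHMELFS idea: N crypto threads pull
 * buffers off a queue, transform them, and hand them downstream. */
struct xfer {
    uint8_t *buf;
    size_t   len;
    struct xfer *next;
};

static struct xfer *pending;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Stand-ins for cbc(aes) and the network/VFS hand-off. */
static void encrypt_buf(uint8_t *buf, size_t len) { while (len--) buf[len] ^= 0xAA; }
static void submit_downstream(struct xfer *x) { (void)x; }

static void *crypto_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (!pending)
            pthread_cond_wait(&qcond, &qlock);
        struct xfer *x = pending;
        pending = x->next;
        pthread_mutex_unlock(&qlock);

        encrypt_buf(x->buf, x->len);    /* the CPU-heavy work happens here... */
        submit_downstream(x);           /* ...then on to the network or VFS */
    }
    return NULL;
}

/* A configurable number of pool threads, as in POHMELFS. */
static void start_pool(int nthreads)
{
    while (nthreads--) {
        pthread_t tid;
        pthread_create(&tid, NULL, crypto_thread, NULL);
    }
}
```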
Evgeniy describes POHMELFS as "a high performance network filesystem with [a] locally coherent cache of data and metadata. Its main goal is distributed parallel processing of data. [The filesystem] supports [a] strong transaction model with failover recovery, allows encryption/hashing [of the entire] data channel, and performs read load balancing and write to multiple servers in parallel." When asked on his blog when he plans to push the new filesystem for mainline kernel inclusion, Evgeniy noted, "I do not know, maybe it's time to push it upstream, but I do not want to bother with Linux kernel politics. We will see soon."
"HP has released AdvFS, a file system that was developed by Digital Equipment Corp and continues to be part of HP's Tru64 operating system," announced Xose Vazquez Perez, offering a link to the re-licensed source code. 2.4 maintainer Willy Tarreau replied favorably, "wow! That's awesome. I discovered it in 1999 and 9 years later, it probably remains the most advanced FS I encountered." HP's Linda Knippers explained:
"In case its not clear, this is a GPLv2 technology release, not an actual port to Linux. We're hoping that the code and documentation will be helpful in the development of new file systems for Linux that will provide similar capabilities, and perhaps used to make tweaks to existing file systems."
Interesting features found in AdvFS include, "simplified file system and storage management; flexible multi-device storage pools shared by multiple file systems, with or without a volume manager; exceptional file system availability (no need to take file systems off-line to expand, shrink or reconfigure; snapshots for consistent backups while applications are on-line; ability to recover deleted files); wide range of performance management tools (fine grain control over file system and file placement within the storage pool; on-line rebalancing of files and free space across the storage pool; on-demand or background file and file system defragmentation); and transaction log management, allowing choices for logging metadata and data asynchronously or synchronously."
Matthew Dillon continues to make significant progress on his HAMMER clustering filesystem for DragonFly BSD. He labeled the latest release 56c, noting that it "represents an additional significant improvement in performance, [also including] bug fixes and most of the final media changes." A significant improvement in write performance was obtained by making the filesystem block size automatically increase from 16K to 64K when a file grows to larger than 1 MB. One remaining media change is required to optimize mtime and atime storage, at which point HAMMER will go into testing and bug fixing mode. Matt noted, "HAMMER's performance is extremely good now, and its system cpu overhead has dropped to roughly the same that we get from UFS", adding, "HAMMER is now able to sustain full disk bandwidth for bulk reads and writes. HAMMER continues to have far superior random-write performance, whether the system caches are blown out or not." Discussing future plans for the filesystem, Matt noted, "I could go on and on, there's so much that can be done with this filesystem :-)" Regarding one of these plans, he offered:
"I am not going to promise it, but there is a slight chance I will be able to get mirroring working by the release. I figured out how to do it, finally. Basically the solution is to add another field to the B-Tree's internal elements... the 'most recent' transaction id, and to propogate it up all the way to the root of the tree. The mirroring code can then optimally scan the B-Tree and pick out all records that have changed relative to some transaction id, allowing it to quickly 'pick up' where it left off and construct a record-level mirror over a fully asynchronous link, without any queueing. You can't get much better then that, frankly. "
"HAMMER makes no modifications to the B-Tree whatsoever on the front-end. When you create, delete, rename, write, etc... when you do those operations HAMMER caches them in a virtualization layer in memory and doesn't make any modifications to its on-media data structures (or their in-memory representations) at all until the meta-data is synced to disk," DragonFly BSD creator Matthew Dillon explained, comparing HAMMER, his clustering filesystem, to a wiki summary of Reiser4's implementations. He continued:
"HAMMER uses a modified B+Tree for its on-disk representation, which is a B-Tree with only keys at internal nodes and only records at the leafs. This was done to reduce structural bloat, allow for a leaf->leaf linking optimization in the future, and for other reasons. [...] HAMMER's internal nodes have a left and right bounding element. A standard B+Tree only has a left bounding element. By adding a right bounding element HAMMER can cache pointers into its B+Tree and 'pick up' searches, insertions, and deletions relative to the cached pointers instead of having to start at the root of the tree. More importantly, it can pickup searches, insertions, and deletions at internal nodes, not just leaf nodes. So I can cache a proximity pointer and if I do a good job I never have to traverse the B+Tree above that point."
"I regularly run and post various benchmarks comparing POHMELFS, NFS, XFS and Ext4, [the] main goal of POHMELFS at this stage is to be essentially as fast as [the] underlying local filesystem. And it is..." explained Evgeniy Polyakov, suggesting that the POHMELFS networking filesystem performs 10% to 300% faster than NFS, depending on the file operation. In particular, he noted that it still suffers from random reads, an area that he's currently focused on fixing. He summarized the new features found in the latest release:
"Read request (data read, directory listing, lookup requests) balancing between multiple servers; write requests are sent to multiple servers and completed only when all of them send an ack; [the] ability to add and/or remove servers from [the] working set at run-time from userspace; documentation (overall view and protocol commands); rename command; several new mount options to control client behaviour instead of hard coded numbers."
Looking forward, Evgeniy noted that this was likely the last non-bugfix release of the kernel client-side implementation, suggesting that the next release would focus on adding server-side features, "needed for distributed parallel data processing (like the ability to add new servers via network commands from another server), so most of the work will be devoted to server code."