Matthew Dillon continues to make significant progress on his HAMMER clustering filesystem for DragonFly BSD. He labeled the latest release 56C, noting that it "represents an additional significant improvement in performance, [also including] bug fixes and most of the final media changes." Write performance improved significantly once the filesystem block size was made to increase automatically from 16K to 64K when a file grows past 1 MB. One remaining media change is required to optimize mtime and atime storage, after which HAMMER will go into testing and bug-fixing mode. Matt noted, "HAMMER's performance is extremely good now, and its system cpu overhead has dropped to roughly the same that we get from UFS", adding, "HAMMER is now able to sustain full disk bandwidth for bulk reads and writes. HAMMER continues to have far superior random-write performance, whether the system caches are blown out or not." Discussing future plans for the filesystem, Matt noted, "I could go on and on, there's so much that can be done with this filesystem :-)" Regarding one of these plans, he offered:
"I am not going to promise it, but there is a slight chance I will be able to get mirroring working by the release. I figured out how to do it, finally. Basically the solution is to add another field to the B-Tree's internal elements... the 'most recent' transaction id, and to propagate it up all the way to the root of the tree. The mirroring code can then optimally scan the B-Tree and pick out all records that have changed relative to some transaction id, allowing it to quickly 'pick up' where it left off and construct a record-level mirror over a fully asynchronous link, without any queueing. You can't get much better than that, frankly."
From: Matthew Dillon <dillon@...>
Subject: HAMMER update 19-June-2008 (56C) (HEADS UP - MEDIA CHANGED)
Date: Jun 20, 2:23 am 2008

56C represents an additional significant improvement in performance, plus bug fixes and most of the final media changes. As with all the commits this week, a kernel and utilities rebuild plus a newfs_hammer is needed to continue testing.

The filesystem block size now increases from 16K to 64K once a file has grown past 1MB. This improves write performance to the point where I don't really need to implement cluster_write(), so I've decided to forgo doing that for the release.

I will be making one final media change on Friday and then HAMMER development will go into testing & bug fixing mode until the release.

This last media change will fix mtime and atime storage. At the moment mtime/atime updates require generating UNDO records and, needless to say, they're expensive. I will consider my options tomorrow but I think I am going to just not include those fields in the CRC so they can be updated asynchronously, without any UNDOs.

-- Stability --

I have really begun pounding the filesystem by running blogbench, buildworld -j 8, and fsx simultaneously on two test boxes. I expect that any remaining bugs will be worked out over the next week or two.

-- Performance --

All performance work except for the atime/mtime issue is now complete. WYSIWYG.

HAMMER's performance is extremely good now, and its system cpu overhead has dropped to roughly the same that we get from UFS (buildworlds run 610-620 seconds of system time for HAMMER, and 610-620 seconds of system time for UFS).

HAMMER is now able to sustain full disk bandwidth for bulk reads and writes.

HAMMER continues to have far superior random-write performance, whether the system caches are blown out or not. Not only that but the performance can potentially improve even more if I redo the deadlock avoidance algorithms.

HAMMER is within 10% of UFS's read performance under light and medium loads.
HAMMER has a somewhat larger system cache footprint than UFS. After extensive testing with blogbench I've determined that HAMMER's read performance figures past blog 250 (where the system caches get blown out on my 1G test box) are actually almost as good as UFS's *IF* HAMMER's write performance were to drop to the same levels as UFS's (poor) write performance past that point. But because HAMMER's write performance doesn't drop, the system cache is never able to settle down into a 95-percentile cached data set.

Basically the only reason UFS has good read performance numbers for blogbench once the system caches are blown out is because UFS's write performance is so poor the data set is no longer growing significantly and no longer eating away at the cache.

HAMMER's random re-writing performance does drop a bit relative to UFS, primarily due to HAMMER's history retention mechanic. It isn't too bad and pruning/reblocking cleans it up so we're gonna have to run with it for the release.

I will be working on the footprint size a bit, but I am very happy with the current state of affairs.

-- Release TODO --

There are many auxiliary items I want to get fully working for the release. There are some minor issues with the reblocker and pruner, some issues with how to recover space after the filesystem has filled up, plus I want to write a recovery program for catastrophic failures (not a fsck, but a way to extract whatever good information can be found from a corrupted HAMMER filesystem).

I will also probably be making other adjustments to the filesystem... nothing I expect to mess up media compatibility past tomorrow, but to help support future features such as mirroring, better low level storage allocation, and so forth.

-- Mirroring --

I am not going to promise it, but there is a slight chance I will be able to get mirroring working by the release. I figured out how to do it, finally. Basically the solution is to add another field to the B-Tree's internal elements...
the 'most recent' transaction id, and to propagate it up all the way to the root of the tree. The mirroring code can then optimally scan the B-Tree and pick out all records that have changed relative to some transaction id, allowing it to quickly 'pick up' where it left off and construct a record-level mirror over a fully asynchronous link, without any queueing. You can't get much better than that, frankly.

I could go on and on, there's so much that can be done with this filesystem :-)

-Matt
Matthew Dillon <dillon@backplane.com>
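The mirroring idea in the message above can be illustrated with a toy tree. This is only a sketch under stated assumptions, not HAMMER's on-media layout or code: the `Node` class, the `max_tid` field, and the helper names are all hypothetical, and a real B-Tree would also keep keys ordered and split nodes. It shows two things: a write pushes its transaction id up to the root, and a scan can skip any subtree whose stored id is not newer than the mirror's last position.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Toy tree node: leaves hold (key, tid) records, internal nodes hold children."""
    children: List["Node"] = field(default_factory=list)
    records: List[Tuple[str, int]] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    max_tid: int = 0  # hypothetical "most recent transaction id" field


def add_child(parent: Node, child: Node) -> None:
    """Attach a child and remember its parent for upward propagation."""
    parent.children.append(child)
    child.parent = parent


def write_record(leaf: Node, key: str, tid: int) -> None:
    """Store a record in a leaf and propagate its transaction id to the root."""
    leaf.records = [r for r in leaf.records if r[0] != key] + [(key, tid)]
    node: Optional[Node] = leaf
    # Stop early once an ancestor already carries an id at least this new.
    while node is not None and node.max_tid < tid:
        node.max_tid = tid
        node = node.parent


def changed_since(node: Node, last_tid: int) -> Iterator[Tuple[str, int]]:
    """Yield records newer than last_tid, skipping subtrees with nothing new."""
    if node.max_tid <= last_tid:
        return
    for rec in node.records:
        if rec[1] > last_tid:
            yield rec
    for child in node.children:
        yield from changed_since(child, last_tid)
```

A mirror that last synchronized at transaction id 2 would call `changed_since(root, 2)` and then save the root's `max_tid` as its new resume point. Nothing is queued on the writer's side, which is the property the message emphasizes.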
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 19-June-2008 (56C) (HEADS UP - MEDIA CHANGED)
Date: Jun 20, 12:57 pm 2008

:
:That's harmless for atime, but you really want mtime to be properly
:synchronised with the last data update (and to stay that way across an
:undo). Ideally, timestamp data records and hold mtime as a reference
:to the last one updated (or something like that).
:
:--
:Bob Bishop +44 (0)118 940 1243

Yah, I agree. Here's a quick summary of the issues:

* UNDO records are used to compartmentalize atomic changes which cover multiple disk blocks. For example, if you 'rm' a file and a crash occurs, you want the state of the filesystem to either show the file and its directory entry both removed, or show the file and its directory entry both still present.

* Updates to the inode_data, which holds the stat/chmod info for a file object, typically require rolling a new inode_data record with the old one still available via the filesystem history. For example, if you append some stuff to an existing file an old version of the inode_data must be present in order to 'see' the previous state of the file (in particular, the previous st_size of the file).

* BUT, having to do any of the above when updating atime and mtime would be really expensive.

  - atime gets updated all the time. We definitely do not want to roll UNDO records *or* new inode_data records.

  - mtime gets updated all the time in certain situations, such as when overwriting a file (e.g. in ways that do not modify the file's size).

  - mtime is often used to uniquely determine whether a file has been modified.

* And, finally, we want mirroring to work properly even if the filesystem is mounted 'nohistory' (told not to roll new inode_data records). Or, for that matter, if individual files are chflagged 'nohistory'.
The bane of HAMMER's design is that we absolutely do not want to roll new inode_data records unless we have to, so here is what I am going to do:

* ATime will be updated asynchronously and will not be CRCd, so the B-Tree element's CRC field does not have to be updated (thus no UNDO records need to be generated either).

* MTime will be updated semi-synchronously and will be CRCd. (It will be fully synchronous from the point of view of anyone using the filesystem, of course). UNDO records will be generated but new inode_data records will not have to be created. The mtime will be updated in-place.

That solves the contemporary-use situations. And I think I have a solution for mirroring too. Mirroring will depend on a serial number field stored along with the B-Tree elements, with the highest serial number in the node propagated upwards towards the B-Tree root. Ultimately the B-Tree root node will wind up with a serial number representing the most recent change made to the filesystem.

As I think about it, the serial number itself can be updated atomically using UNDO records, and the update can occur even if a new inode_data record is not rolled (so the serial number would be updated in-place in the B-Tree element and propagated upwards towards the B-Tree root). That makes it work with 'nohistory' mounts or files and also means serial number generation will be compatible with the MTime update mechanic, allowing us to roll new serial numbers for MTime updates without having to insert new B-Tree elements.

We would not roll new serial numbers for ATime updates though (can you imagine the load that would create?). I think ATime will have to operate independently on the mirrors, at least for now.

This will give the mirroring code the ability to store just one thing...
the serial number of where it 'left off' the last time, and it can use that number to then scan the B-Tree from the root node downward and only go down the branches with serial numbers >= the mirror's saved serial number.

The result will be that the mirroring code can very quickly locate records modified relative to the last time it ran, without needing record queues. It will be possible to do it in batch or semi-real-time. Plus the mirroring will be completely disconnected from the flow of modifications made to the filesystem and thus not affect write performance at all.

I don't think there are any major gotchas with my plan. The only question mark is the I/O load that propagating the serial numbers to the root of the B-Tree will entail, but I think I can optimize that. Since UNDO records are generated I can do massive aggregation of serial number updates. Besides, how big a performance price are people willing to pay to get premium mirroring? Probably pretty big.

--

There's a little side story here, going back to the Backplane Inc Database. When I was doing Backplane Inc, a start-up that sadly fell in the dot-com crash, I had a batchable, restartable, totally disconnected mirroring capability that effectively allowed me to mirror the production databases to a backup box in my home over a not very reliable modem connection. It would always get behind during the day, then catch up overnight. It didn't care about frequent disconnects, it didn't care about the ludicrously low modem bandwidth... it just worked. That's how I want HAMMER's mirroring to work.

Ultimately the serial numbers will serve a second purpose, and that will be as a rendezvous point for clustered filesystem operation, where the machine cluster is accessing multiple mirrors of the same filesystem which might be in various states of catch-up. The cluster protocols will agree on a serial number, and then be able to access the data from any mirror whose record(s) are updated through that serial number.
The Backplane database was also able to do the same thing, using a quorum to agree on the transaction id that represented the desired data, and then pulling that data from any master or slave copy of the database that had that transaction id. That's how I want HAMMER's clustered filesystem access to work.

-Matt
Matthew Dillon <dillon@backplane.com>
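The rendezvous idea at the end of the message above can be sketched in a few lines. The message does not say how the cluster would choose its serial number, so the rule used here (take the highest serial number that at least `quorum` mirrors have already reached) is purely an assumption, as are the function names and the mirror names in the usage note.

```python
from typing import Dict, List, Optional


def agreed_serial(mirror_serials: Dict[str, int], quorum: int) -> Optional[int]:
    """Highest serial number that at least `quorum` mirrors have reached.

    Assumed rule for illustration only; returns None if no quorum is possible.
    """
    ordered = sorted(mirror_serials.values(), reverse=True)
    if quorum < 1 or len(ordered) < quorum:
        return None
    return ordered[quorum - 1]


def usable_mirrors(mirror_serials: Dict[str, int], serial: int) -> List[str]:
    """Names of mirrors whose records are updated through `serial`."""
    return sorted(name for name, s in mirror_serials.items() if s >= serial)
```

With mirrors at serial numbers `{"master": 120, "slave-a": 118, "slave-b": 97}` and a quorum of 2, `agreed_serial` returns 118. `usable_mirrors` then lists `master` and `slave-a`, while `slave-b` is skipped until it catches up.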
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 19-June-2008 (56C) (HEADS UP - MEDIA CHANGED)
Date: Jun 20, 2:20 pm 2008

:Pardon my ignorance if I am missing something, I haven't looked much
:into HAMMER yet.
:
:Will the FS have the same atomic update features that UFS has? Meaning
:fsync(2) returns only when all directory entries are safely on the
:disk (whether it's with softupdate-type ordering or journaling). It's
:important for mail servers and such so they don't lose messages at the
:time of powerfail/crash. If you dig around mailing lists, you'll find
:interesting stories how people who ran their FS mounted async (the
:default Linux EXT2/3 mount) for mail servers (and AFAIK at least on
:Linux in that case fsync returns early - not atomic, so software
:written with BSD behavior in mind wasn't safe to run without patching)
:found some of the messages in lost+found.

Basically yes. HAMMER maintains a dependency hierarchy for directory entries and the related inodes, so if you create a directory structure and then fsync some file deep down in it, it should fsync the directory entries as well.

HAMMER does not have an async mount mode :-). It will never have an async mount mode, in fact, but it doesn't need one. Writing is so well decoupled from the media that an async mode would not actually make things any faster.

HAMMER's fsync might return early too, BTW... not intentionally, it's still an all-or-nothing deal from the point of view of crash recovery, but if the inode is already queued to the flusher it would have to be re-queued to get the rest of the modifications and that might cause fsync() to return early. Doing it properly shouldn't be too difficult but it isn't at the top of my priority list.

:Also will there be a feature to grow and/or shrink the FS live without
:having to unmount? I can do this right now with XFS and LVM on Linux
:(grow, but not shrink), and it's working amazingly well and very
:quickly to boot.
:
:Thanks.
:
:--
:Dan

I haven't written the utility support but growing a HAMMER filesystem is fairly trivial. All one needs to do is add the appropriate entries to the freemap. Not only that, but the same goes for adding volumes to a HAMMER filesystem.

HAMMER's freemap is a two-layer blockmap. It is NOT pre-sized, and there is no block translation. It works more like a sparse file whose size is the maximum possible size of a HAMMER filesystem (uh, that would be, uh, 1 Exabyte I think with the work done last week).

Shrinking is also possible. Not only shrinking, but also removing whole volumes. Again, the feature hasn't been written yet and it would be a bit more time consuming because the reblocker would have to be run to clean out (aka copy out) the areas being removed, but there would be nothing inherently difficult about it and it certainly could be done live.

p.s. if someone wants to make a side-project of it, go for it! Mirroring is at the top of my list for the release.

Frankly, the best way to resize a filesystem is to mirror and cluster, and then simply take the 'old' filesystem offline and completely redo it. Clustering is kinda the holy grail for the project and clearly won't be ready for this release, but it is something to think about.

-Matt
Matthew Dillon <dillon@backplane.com>
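The description of the freemap in the message above (a sparse, two-layer map that is not pre-sized, so that growing means adding entries) can be pictured with a toy model. Everything specific here is an assumption made for illustration: the `Freemap` class, the `BIGBLOCK` allocation unit, and the tiny `L2_PER_L1` fan-out are hypothetical and are not taken from HAMMER's source.

```python
from typing import Dict

BIGBLOCK = 8 * 1024 * 1024  # hypothetical allocation unit, in bytes
L2_PER_L1 = 4               # hypothetical fan-out, kept tiny for demonstration


class Freemap:
    """Sparse two-layer map: layer-1 index -> layer-2 index -> free bytes."""

    def __init__(self) -> None:
        # Nothing is pre-sized; entries appear only for space that exists.
        self.layer1: Dict[int, Dict[int, int]] = {}

    def add_range(self, start: int, length: int) -> None:
        """Grow: mark the blocks covering [start, start + length) as free.

        Assumes start and length are multiples of BIGBLOCK. Offsets map
        directly to indices, so there is no translation table to rebuild.
        """
        for block in range(start // BIGBLOCK, (start + length) // BIGBLOCK):
            l1, l2 = divmod(block, L2_PER_L1)
            self.layer1.setdefault(l1, {})[l2] = BIGBLOCK

    def free_bytes(self) -> int:
        """Total free space currently registered in the map."""
        return sum(sum(l2.values()) for l2 in self.layer1.values())
```

In this model, growing the filesystem (or adding a volume) is a second `add_range` call over the new space. Shrinking would be the harder direction, since the data in the removed range has to be copied out first, which matches the message's remark about running the reblocker.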
