"HAMMER won't be ready for sure (things take however long they take), but the hardest part is working and stable and I'm just down to garbage collection and crash recovery," noted Matthew Dillon, discussing the status of what is ultimately intended to be a highly available clustering filesystem. The upcoming DragonFlyBSD release this month was originally intended to be 2.0 with a beta quality HAMMER, but the decision was recently made to call the release 1.12 while HAMMER continues to stabilize. Matt continued, "HAMMER is really shaping up now. Here's what works now: all filesystem operations; all historical operations; all Pruning features". During the discussion, he was asked how he planned to support multi-master replication, in reply to which he began:
"My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster."
"HAMMER is progressing very well with only 3-4 big-ticket items left to do," noted DragonFlyBSD creator Matthew Dillon regarding the ongoing development of his highly available clustering filesystem, "I'm really happy with the progress I'm making". He listed "on-the-fly recovery, balancing, refactoring of the spike code, and retention policy scan" as the remaining items needing to be implemented. "everything else is now working and reasonably stable. Of the remaining items only the spike coding has any real algorithmic complexity. Recovery and balancing just require brute force and the physical record deletion. The retention policy scan needs is already coded and working (just not the scan itself)."
Matt then defined what he meant by 'spike', "basically, when a cluster (a 64MB block of the disk) fills up a 'spike' needs to be driven into that cluster's B-Tree in order to expand it into a new cluster. The spike basically forwards a portion of the B-Tree's key space to a new cluster." He added, "refactoring the spike code means doing a better job selecting the amount of key space the spike can represent." He noted that balancing refers to the act of balancing the B-Tree representation of the filesystem, "we want to slowly move physical data records from higher level clusters to lower level clusters, eventually winding up with a situation where the higher level clusters contain only spikes and lower level clusters are mostly full." Matt continued:
"Keep in mind that HAMMER is designed to handle very large filesystems... in particular, filesystems that are big enough that you don't actually fill them up under normal operation, or at least do not quickly fill them up and then quickly clean them out. The balancing code is expected to need upwards of a day (or longer) to slowly iron out storage inefficiencies. If a situation comes up where faster action is needed, then faster action can be taken. I intend to take advantage of the fact that most filesystems (and, really, any large filesystem), takes quite a while to actually become full."
"HAMMER work is still progressing well, I hope to have most of it working in a degenerate single-cluster (64MB filesystem) case by the end of next week. (cluster == 64MB block of the disk, not cluster as in clustering)," noted Matthew Dillon on the DragonFlyBSD mailing list. He continued, "gluing the per-cluster B-Tree's together for the multi-cluster case is turning out to be more of a headache and will probably take at least 2 weeks to get working. Some fairly sophisticated heuristics will be needed to avoid unnecessary copying between clusters." Matt went on to note that the next DragonFlyBSD release will likely be delayed a month:
"I may decide to move the 2.0 release to mid-January to give myself some more time. This is similar to what we did for 1.8. Also, I think a January release is better then a Christmas release because people get busy with christmas-like things. I want the filesystem to be at least beta quality as of the release and I don't think its possible to get it there by mid-December."
"I will be continuing to commit bits and pieces of HAMMER, but note that it will probably not even begin to work for quite some time," Matthew Dillon reported on the new clustering filesystem he's developing for DragonFlyBSD. He noted, "I am still on track for it to make it into the end-of-year release." Matt continued:
"My B-Tree implementation also allows HAMMER to cache B-Tree nodes and start lookups from any internal node rather then having to start at the root. You can do this in a standard B-Tree too but it isn't necessarily efficient for certain boundary cases. In my implementation I store boundaries for the left AND right side which means a search starting in the middle of the tree knows exactly where to go and will never have to retrace its steps."
"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.
During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholly independent (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:
"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."
Matthew Dillon created DragonFly BSD in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."
In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.
Matt Dillon [interview] posted the design synopsis of a new highly available clustered filesystem he will soon begin writing for DragonFlyBSD. The feature summary at the beginning of his document included, "on-demand filesystem check and recovery; infinite snapshots; multi-master operation, including the ability to self-heal a corrupted filesystem by accessing replicated data; infinite logless replication, meaning that replication targets can be offline for 'days' without affecting performance or operation; 64 bit file space, 64 bit filesystem space, no space restrictions whatsoever; reliably handles data storage for huge multi-hundred-terabyte filesystems without fear of unrecoverable corruption; cluster operation, provides the ability to commit data to locally replicated store independently of other replication nodes, with access governed by cache coherency protocols; independent index, data is laid out in a highly recoverable fashion, independent of index generation, and indexes can be regenerated from scratch and thus indexes can be updated asynchronously." He then goes into detail on each of these points and many more, explaining how he intends to implement the new filesystem.
The new filesystem is currently unnamed, though Matt noted, "it doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-)" It was suggested that this could mean, "high-availability multi-master extra reliable file system", though Matt was not impressed with this. Another proposed idea that Matt liked was HACFS, or "High-Availability Clustered File System".