"Ceph is a distributed network file system designed to provide excellent performance, reliability, and scalability with POSIX semantics. I periodically see frustration on this list with the lack of a scalable GPL distributed file system with sufficiently robust replication and failure recovery to run on commodity hardware, and would like to think that--with a little love--Ceph could fill that gap," announced Sage Weil on the Linux Kernel mailing list. Originally developed as the subject of his PhD thesis, he went on to list the features of the new filesystem, including POSIX semantics, scalability from a few nodes to thousands of nodes, support for petabytes of data, a highly available design with no signle points of failure, n-way replication of data across multiple nodes, automatic data rebalancing as nodes are added and removed, and a Fuse-based client. He noted that a lightweight kernel client is in progress, as is flexible snapshoting, quotas, and improved security. Sage compared Ceph to other similar filesystems:
"In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely on symmetric access by all clients to shared block devices, Ceph separates data and metadata management into independent server clusters, similar to Lustre. Unlike Lustre, however, metadata and storage nodes run entirely in userspace and require no special kernel support. Storage nodes utilize either a raw block device or large image file to store data objects, or can utilize an existing file system (XFS, etc.) for local object storage (currently with weakened safety semantics). File data is striped across storage nodes in large chunks to distribute workload and facilitate high throughputs. When storage nodes fail, data is re-replicated in a distributed fashion by the storage nodes themselves (with some coordination from a cluster monitor), making the system extremely efficient and scalable."
Evgeniy Polyakov announced a new version of his distributed storage subsystem, "this release includes [a] mirroring algorithm extension, which allows [the subsystem] to store [the] 'age' of the given node on the underlying media." He went on to explain why this was useful:
"In this case, if [a] failed node gets new media, which does not contain [the] correct 'age' (unique id assigned to the whole storage during initialization time), the whole node will be marked as dirty and eventually resynced.
"This allows [it] to have [a] completely transparent failure recovery - [the] failed node can be just turned off, its hardware fixed and then turned on. DST core will detect [the] connection reset and automatically reconnect when [the] node is ready and resync if needed without any special administrator's steps."
"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.
During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholly independent (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:
"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."
"I'm pleased to announce [the] fourth release of the distributed storage subsystem, which allows [you] to form a storage [block device] on top of remote and local nodes, which in turn can be exported to another storage [block device] as a node to form tree-like storage [block devices]," Evgeniy Polyakov stated on the Linux Kernel mailing list. The new release includes a new configuration interface and several bug fixes.
Network device driver and SATA subsystem maintainer Jeff Garzik was not impressed with the concept, "[distributed block devices] are not very useful, because it still relies on a useful filesystem sitting on top of the DBS." He went on to explain the problem, "it devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF)." He proposed instead that time would be better spent developing a POSIX-only distributed filesystem, "in contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster." Jeff went on to caution, "a distributed filesystem is also much more complex, which is why distributed block devices are so appealing :)" When Lustre was pointed out as an existing option, Jeff noted, "Lustre is tilted far too much towards high-priced storage, and needs improvement before it could be considered for mainline."
Matthew Dillon created DragonFly BSD in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."
In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.
Evgeniy Polyakov, the connector and w1 subsystem maintainer, announced the first release of his distributed storage subsystem, "which allows [you] to form storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages." He describes the features of this new block device: "zero additional allocations in the common fast path not counting network allocations; zero-copy sending if supported by device using sendpage(); ability to use any implemented algorithm (linear algo implemented); pluggable mapping algorithms; failover recovery in case of broken link; ability to suspend remote node for maintenance without breaking dataflow to other nodes (if supported by algorithm and block layer) and without turning down main node; initial autoconfiguration (ability to request remote node size and use that dynamic data during array setup time); non-blocking network data processing; support for any kind of network media (not limited to tcp or inet protocols); no need for any special tools for data processing (like special userspace applications) except for configuration; userspace and kernelspace targets."
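The "linear algo" he refers to concatenates the nodes into one flat address space, routing each sector to the node whose range contains it. A minimal sketch, using invented structures rather than DST's real ones:

    /* Sketch of a linear mapping algorithm; not DST's actual code. */
    #include <stdint.h>
    #include <stddef.h>

    struct node_range {
        uint64_t start;  /* first sector covered by this node */
        uint64_t size;   /* number of sectors on this node */
    };

    /* Return the index of the node holding 'sector', or -1;
     * '*node_sector' receives the sector offset on that node. */
    static int linear_map(const struct node_range *nodes, size_t n,
                          uint64_t sector, uint64_t *node_sector)
    {
        for (size_t i = 0; i < n; i++) {
            if (sector >= nodes[i].start &&
                sector < nodes[i].start + nodes[i].size) {
                *node_sector = sector - nodes[i].start;
                return (int)i;
            }
        }
        return -1;  /* sector beyond the end of the array */
    }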
In his blog, Evgeniy noted a similarity to the recently discussed DRBD. In the recent announcement, he compared his solution to iSCSI and NBD, noting the following advantages: "non-blocking processing without busy loops; small, pluggable architecture; failover recovery (reconnect to remote target); autoconfiguration; no additional allocations; very simple; works with different network protocols; and storage can be formed on top of remote nodes and be exported simultaneously".
Lars Ellenberg started an effort to get DRBD, the Distributed Replicated Block Device, merged into the Linux kernel. When asked for clarification as to what it was, Lars explained, "think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system." Earlier in the thread he described it as "a stacked block device driver".
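The "RAID1 over TCP" idea boils down to duplicating every write: once to the local backing device, and once over the network to the peer. This rough userspace sketch, with an invented wire format, conveys the shape of the Primary's write path (real DRBD lives in the kernel and is far more involved):

    /* Conceptual sketch of replicating a write to a Secondary. */
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>

    struct wire_hdr {          /* invented wire format */
        uint64_t offset;       /* byte offset on the backing device */
        uint64_t length;       /* payload length that follows */
    };

    static int replicate_write(int disk_fd, int peer_sock,
                               const void *buf, uint64_t len, uint64_t off)
    {
        struct wire_hdr h = { .offset = off, .length = len };

        /* write to the local backing device first... */
        if (pwrite(disk_fd, buf, len, off) != (ssize_t)len)
            return -1;
        /* ...then mirror the same data to the Secondary */
        if (send(peer_sock, &h, sizeof(h), 0) != sizeof(h))
            return -1;
        if (send(peer_sock, buf, len, 0) != (ssize_t)len)
            return -1;
        return 0;
    }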
Much of the initial review focused on the need to comply with kernel coding style guidelines. Kyle Moffett offered a much lengthier review, noting at one point in the code, "how about fixing this to actually use proper workqueues or something instead of this open-coded mess?" Lars replied, "unlikely to happen 'right now'. But it is on our todo list..." Jens Axboe added, "but stuff like that is definitely a merge show stopper, jfyi".
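For readers unfamiliar with the API Kyle was pointing at, deferring work to a kernel workqueue takes only a small amount of boilerplate. A generic illustration (not DRBD code):

    /* Generic workqueue usage; the handler runs later in process
     * context on one of the kernel's worker threads. */
    #include <linux/workqueue.h>

    static void my_work_handler(struct work_struct *work)
    {
        /* deferred processing happens here */
    }

    static DECLARE_WORK(my_work, my_work_handler);

    static void defer_it(void)
    {
        schedule_work(&my_work);  /* queue on the shared workqueue */
    }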
Matt Dillon posted the design synopsis of a new highly available clustered filesystem he will soon begin writing for DragonFly BSD. The feature summary at the beginning of his document included, "on-demand filesystem check and recovery; infinite snapshots; multi-master operation, including the ability to self-heal a corrupted filesystem by accessing replicated data; infinite logless replication, meaning that replication targets can be offline for 'days' without affecting performance or operation; 64 bit file space, 64 bit filesystem space, no space restrictions whatsoever; reliably handles data storage for huge multi-hundred-terabyte filesystems without fear of unrecoverable corruption; cluster operation, provides the ability to commit data to locally replicated store independently of other replication nodes, with access governed by cache coherency protocols; independent index, data is laid out in a highly recoverable fashion, independent of index generation, and indexes can be regenerated from scratch and thus indexes can be updated asynchronously." He then goes into detail on each of these points and many more, explaining how he intends to implement the new filesystem.
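The "independent index" point is worth unpacking: if every record on disk is self-describing, an index can be rebuilt from scratch by a linear scan of the media. The sketch below illustrates the idea with an invented record layout; it is not HAMMER's actual on-disk format:

    /* Rebuild an index by scanning self-describing records. */
    #include <stdint.h>
    #include <stddef.h>

    struct record_hdr {        /* invented layout for illustration */
        uint64_t object_id;    /* which file/object this belongs to */
        uint64_t offset;       /* logical offset of the data within it */
        uint64_t length;       /* length of the data that follows */
    };

    typedef void (*index_insert_fn)(const struct record_hdr *hdr,
                                    const void *data);

    static void rebuild_index(const uint8_t *area, size_t len,
                              index_insert_fn insert)
    {
        size_t pos = 0;

        while (pos + sizeof(struct record_hdr) <= len) {
            const struct record_hdr *hdr =
                (const struct record_hdr *)(area + pos);

            if (hdr->length == 0 ||
                pos + sizeof(*hdr) + hdr->length > len)
                break;  /* end of valid records */
            insert(hdr, area + pos + sizeof(*hdr));
            pos += sizeof(*hdr) + hdr->length;
        }
    }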
The new filesystem is currently unnamed, though Matt noted, "it doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-)" It was suggested that this could mean, "high-availability multi-master extra reliable file system", though Matt was not impressed with this. Another proposed idea that Matt liked was HACFS, or "High-Availability Clustered File System".