Evgeniy Polyakov, listed as the connector and w1 subsystem maintainer, announced the first release of his distributed storage subsystem, "which allows [you] to form storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages." He describes the features of this new block device: "zero additional allocations in the common fast path not counting network allocations; zero-copy sending if supported by device using sendpage(); ability to use any implemented algorithm (linear algo implemented); pluggable mapping algorithms; failover recovery in case of broken link; ability to suspend remote node for maintenance without breaking dataflow to other nodes (if supported by algorithm and block layer) and without turning down main node; initial autoconfiguration (ability to request remote node size and use that dynamic data during array setup time); non-blocking network data processing; support for any kind of network media (not limited to tcp or inet protocols); no need for any special tools for data processing (like special userspace applications) except for configuration; userspace and kernelspace targets."
In his blog, Evgeniy noted a similarity to the recently discussed DRBD. In the announcement, he compares his solution to iSCSI and NBD, noting the following advantages: "non-blocking processing without busy loops; small, pluggable architecture; failover recovery (reconnect to remote target); autoconfiguration; no additional allocations; very simple; works with different network protocols; and storage can be formed on top of remote nodes and be exported simultaneously".
From: Evgeniy Polyakov [email blocked]
Subject: Distributed storage.
Date: Tue, 31 Jul 2007 21:13:47 +0400
Hi.
I'm pleased to announce the first release of the distributed storage
subsystem, which allows forming storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.
There are a number of main features this device supports:
* zero additional allocations in the common fast path (only one per node if
the network queue is full) not counting network allocations
* zero-copy sending (except header) if supported by device using sendpage()
* ability to use any implemented algorithm (linear algo implemented)
* pluggable mapping algorithms
* failover recovery in case of broken link (reconnection if remote node
is down)
* ability to suspend remote node for maintenance without breaking dataflow
to other nodes (if supported by algorithm and block layer) and
without turning down main node
* initial autoconfiguration (ability to request remote node size and use
that dynamic data during array setup time)
* non-blocking network data processing (except headers, which are
sent/received in blocking mode; this can easily be changed to non-blocking
too by increasing the request size to store state) without busy loops
checking the return value of processing functions. Non-blocking data
processing is based on a ->poll() state machine with only one working
thread per storage.
* support for any kind of network media (not limited to TCP or inet
protocols) above the MAC layer (socket layer); data consistency must be
part of the protocol (i.e. data will be lost with UDP in favour of
performance)
* no need for any special tools for data processing (like special
userspace applications) except for configuration
* userspace and kernelspace targets. Userspace target can work on top of
usual files. (Windows or any other OS userspace target support can be
trivially added on request)
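The "linear algo" and the autoconfiguration step in the list above can be pictured with a short sketch. It assumes that "linear" means plain concatenation of nodes in the order they were added, and that each node's size (as reported during autoconfiguration) is already known. The function names build_linear_map, map_sector and split_request are hypothetical and are not taken from alg_linear.c.

```python
from bisect import bisect_right
from itertools import accumulate


def build_linear_map(node_sizes):
    """Return (starts, total): first logical sector of each node and the
    total storage size. Sizes are in sectors (assumed already reported
    by the nodes during array setup)."""
    if not node_sizes or any(size <= 0 for size in node_sizes):
        raise ValueError("need at least one node, all sizes positive")
    ends = list(accumulate(node_sizes))  # cumulative end sector per node
    starts = [0] + ends[:-1]
    return starts, ends[-1]


def map_sector(starts, total, sector):
    """Map a logical sector to (node_index, sector_within_node)."""
    if not 0 <= sector < total:
        raise IndexError("sector outside the storage")
    node = bisect_right(starts, sector) - 1  # last node starting at or before sector
    return node, sector - starts[node]


def split_request(starts, total, sector, count):
    """Split a request of `count` sectors into (node, offset, length) pieces,
    one piece per node the request touches."""
    if count <= 0 or sector < 0 or sector + count > total:
        raise IndexError("request outside the storage")
    pieces = []
    while count > 0:
        node, offset = map_sector(starts, total, sector)
        node_end = starts[node + 1] if node + 1 < len(starts) else total
        chunk = min(count, node_end - sector)
        pieces.append((node, offset, chunk))
        sector += chunk
        count -= chunk
    return pieces
```

Under these assumptions, three nodes of 100, 50 and 200 sectors form a 350-sector storage; logical sector 100 lands on sector 0 of the second node, and a 70-sector request starting at sector 90 is split into three pieces, one per node.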
Compared to other similar approaches, namely iSCSI and NBD,
there are the following advantages:
* non-blocking processing without busy loops (compared to both above)
* small, pluggable architecture
* failover recovery (reconnect to remote target)
* autoconfiguration (completely absent in NBD and/or device mapper on top of it)
* no additional allocations (not including network part) - at least two in
device mapper for fast path
* very simple - try to compare with iSCSI
* works with different network protocols
* storage can be formed on top of remote nodes and be exported
simultaneously (iSCSI is peer-to-peer only, NBD requires device
mapper and is synchronous)
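The idea of non-blocking processing without busy loops, driven by a poll-style state machine in a single thread, can be illustrated in userspace. The following is only a sketch of the general technique and not the kernel code: it uses Python's selectors module and local socket pairs in place of remote nodes, and the names SendState and run_transfers are hypothetical. Each request keeps just enough state (how many bytes have been sent so far) to resume after the socket would have blocked.

```python
import selectors
import socket


class SendState:
    """Per-request state: the payload and how much of it has been sent."""

    def __init__(self, payload):
        self.payload = memoryview(payload)
        self.sent = 0

    def done(self):
        return self.sent == len(self.payload)


def run_transfers(payloads):
    """Send every payload over its own socket pair using one thread.

    The thread sleeps in select() until some socket is ready, so there is
    no busy loop polling return values."""
    sel = selectors.DefaultSelector()
    received = {}
    open_readers = 0
    for index, payload in enumerate(payloads):
        local, remote = socket.socketpair()  # stands in for a remote node
        local.setblocking(False)
        remote.setblocking(False)
        received[index] = bytearray()
        sel.register(local, selectors.EVENT_WRITE, ("send", index, SendState(payload)))
        sel.register(remote, selectors.EVENT_READ, ("recv", index, None))
        open_readers += 1

    while open_readers:
        for key, _events in sel.select():
            kind, index, state = key.data
            sock = key.fileobj
            if kind == "send":
                try:
                    # May send only part of the data; the state remembers where to resume.
                    state.sent += sock.send(state.payload[state.sent:])
                except BlockingIOError:
                    continue
                if state.done():
                    sel.unregister(sock)
                    sock.close()  # the reader will see end-of-stream
            else:
                try:
                    chunk = sock.recv(65536)
                except BlockingIOError:
                    continue
                if chunk:
                    received[index] += chunk
                else:
                    sel.unregister(sock)
                    sock.close()
                    open_readers -= 1
    sel.close()
    return [bytes(received[i]) for i in range(len(payloads))]
```

A payload larger than the socket buffer is sent in several partial writes, interleaved with the other transfers, while the single thread never spins waiting for a socket to become ready.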
The TODO list currently includes the following main items:
* redundancy algorithm (drop me a request of your own, but it is highly
unlikely that a Reed-Solomon based one will ever be used - it is too slow
for distributed RAID; I am considering WEAVER codes)
* extended autoconfiguration
* move away from ioctl based configuration
The patch, userspace configuration utility and userspace target can be
found on the project homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
Signed-off-by: Evgeniy Polyakov [email blocked]
 drivers/block/Kconfig          |    2 +
 drivers/block/Makefile         |    1 +
 drivers/block/dst/Kconfig      |   12 +
 drivers/block/dst/Makefile     |    5 +
 drivers/block/dst/alg_linear.c |  348 ++++++++++
 drivers/block/dst/dcore.c      | 1222 ++++++++++++++++++++++++++++++++++
 drivers/block/dst/kst.c        | 1437 ++++++++++++++++++++++++++++++++++++++++
 include/linux/dst.h            |  282 ++++++++
 8 files changed, 3309 insertions(+), 0 deletions(-)