Published on KernelTrap (http://kerneltrap.org)

Linux: Distributed Storage Subsystem

By Jeremy
Created Jul 31 2007 - 16:27

Evgeniy Polyakov, listed [1] as the connector and w1 subsystem maintainer, announced the first release of his distributed storage subsystem [2], "which allows [you] to form storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages." He describes the features of this new block device: "zero additional allocations in the common fast path not counting network allocations; zero-copy sending if supported by device using sendpage(); ability to use any implemented algorithm (linear algo implemented); pluggable mapping algorithms; failover recovery in case of broken link; ability to suspend remote node for maintenance without breaking dataflow to another nodes (if supported by algorithm and block layer) and without turning down main node; initial autoconfiguration (ability to request remote node size and use that dynamic data during array setup time); non-blocking network data processing; support for any kind of network media (not limited to tcp or inet protocols); no need for any special tools for data processing (like special userspace applications) except for configuration; userspace and kernelspace targets."

In his blog [3], Evgeniy noted a similarity [4] to the recently discussed DRBD [5]. In the announcement itself he compares his solution to iSCSI [6] and NBD [7], noting the following advantages: "non-blocking processing without busy loops; small, pluggable architecture; failover recovery (reconnect to remote target); autoconfiguration; no additional allocations; very simple; works with different network protocols; and storage can be formed on top of remote nodes and be exported simultaneously".


From:	Evgeniy Polyakov [email blocked]
Subject: Distributed storage.
Date:	Tue, 31 Jul 2007 21:13:47 +0400

Hi.

I'm pleased to announce the first release of the distributed storage
subsystem, which allows you to form storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

There are a number of main features this device supports:
    * zero additional allocations in the common fast path (only one per node if
	network queue is full) not counting network allocations
    * zero-copy sending (except header) if supported by device using sendpage()
    * ability to use any implemented algorithm (linear algo implemented)
    * pluggable mapping algorithms
    * failover recovery in case of broken link (reconnection if remote node 
	is down)
    * ability to suspend remote node for maintenance without breaking dataflow 
	to other nodes (if supported by algorithm and block layer) and 
	without turning down main node
    * initial autoconfiguration (ability to request remote node size and use 
	that dynamic data during array setup time)
    * non-blocking network data processing without busy loops checking the 
	return value of processing functions (except headers, which are 
	sent/received in blocking mode; this can be simply changed to 
	non-blocking too by increasing request size to store state). 
	Non-blocking data processing is based on ->poll() state machine 
	with only one working thread per storage.
    * support for any kind of network media (not limited to tcp or inet 
	protocols) above the MAC layer (socket layer); data consistency must 
	be part of the protocol (i.e. it will lose data with UDP in favour of 
	performance)
    * no need for any special tools for data processing (like special 
	userspace applications) except for configuration
    * userspace and kernelspace targets. Userspace target can work on top of 
	usual files. (Windows or any other OS userspace target support can be 
	trivially added on request)
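
As an illustration of what a "linear" mapping algorithm could look like, here is a minimal Python sketch. It assumes that linear means nodes are concatenated end to end into one address space; the announcement does not spell this out, and the class and method names (LinearMap, locate, split) are hypothetical, not taken from the patch.

```python
from bisect import bisect_right
from itertools import accumulate


class LinearMap:
    """Hypothetical linear mapping: nodes are concatenated end to end."""

    def __init__(self, node_sizes):
        if not node_sizes or any(size <= 0 for size in node_sizes):
            raise ValueError("need at least one node, all sizes positive")
        self.node_sizes = list(node_sizes)
        # ends[i] is the first sector that lies past node i.
        self.ends = list(accumulate(self.node_sizes))

    @property
    def total(self):
        """Size of the whole storage in sectors."""
        return self.ends[-1]

    def locate(self, sector):
        """Map a storage sector to (node index, sector within that node)."""
        if not 0 <= sector < self.total:
            raise IndexError("sector outside of storage")
        idx = bisect_right(self.ends, sector)
        start = self.ends[idx] - self.node_sizes[idx]
        return idx, sector - start

    def split(self, sector, count):
        """Split a request into (node, offset, length) pieces, one per node."""
        pieces = []
        while count > 0:
            idx, offset = self.locate(sector)
            length = min(count, self.node_sizes[idx] - offset)
            pieces.append((idx, offset, length))
            sector += length
            count -= length
        return pieces
```

With node sizes of 100, 50 and 200 sectors, a request that starts near the end of the first node is split into one piece per node it touches. Node sizes are plain constructor arguments here; in the announced design they could come from the autoconfiguration step that queries each remote node for its size.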

Compared to other similar approaches, namely iSCSI and NBD, 
there are the following advantages:
    * non-blocking processing without busy loops (compared to both above)
    * small, pluggable architecture
    * failover recovery (reconnect to remote target)
    * autoconfiguration (full absence in NBD and/or device mapper on top of it)
    * no additional allocations (not including network part) - at least two in 
	device mapper for fast path
    * very simple - try to compare with iSCSI
    * works with different network protocols
    * storage can be formed on top of remote nodes and be exported 
	simultaneously (iSCSI is peer-to-peer only, NBD requires device 
	mapper and is synchronous)
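
The idea of non-blocking processing without busy loops, driven by a poll-style state machine in a single thread, can be sketched in userspace as follows. This is only an analogy built on Python's selectors module and a local socket pair; it is not the kernel ->poll() code from the patch, and the function name pump is hypothetical.

```python
import selectors
import socket


def pump(payload, chunk=4096):
    """Move payload through a socket pair using one thread and no busy loop."""
    tx, rx = socket.socketpair()
    tx.setblocking(False)
    rx.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(tx, selectors.EVENT_WRITE)
    sel.register(rx, selectors.EVENT_READ)
    sent = 0                  # sender state: how much has been queued so far
    received = bytearray()    # receiver state: what has arrived so far
    try:
        while len(received) < len(payload):
            # select() sleeps until at least one socket is ready.
            for key, _events in sel.select():
                if key.fileobj is tx:
                    try:
                        sent += tx.send(payload[sent:sent + chunk])
                    except BlockingIOError:
                        continue  # queue is full, wait for the next wakeup
                    if sent == len(payload):
                        sel.unregister(tx)  # nothing left to send
                else:
                    received += rx.recv(chunk)
    finally:
        sel.close()
        tx.close()
        rx.close()
    return bytes(received)
```

Both directions advance only when the selector reports readiness, so the single thread never spins on a return value and never blocks inside send or receive.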

The TODO list currently includes the following main items:
    * redundancy algorithm (drop me a request of your own, but it is highly 
	unlikely that a Reed-Solomon based one will ever be used - it is too 
	slow for distributed RAID, I consider WEAVER codes)
    * extended autoconfiguration
    * move away from ioctl based configuration

The patch, userspace configuration utility and userspace target can be found
on the project homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst [8]

Signed-off-by: Evgeniy Polyakov [email blocked]

 drivers/block/Kconfig          |    2 +
 drivers/block/Makefile         |    1 +
 drivers/block/dst/Kconfig      |   12 +
 drivers/block/dst/Makefile     |    5 +
 drivers/block/dst/alg_linear.c |  348 ++++++++++
 drivers/block/dst/dcore.c      | 1222 ++++++++++++++++++++++++++++++++++
 drivers/block/dst/kst.c        | 1437 ++++++++++++++++++++++++++++++++++++++++
 include/linux/dst.h            |  282 ++++++++
 8 files changed, 3309 insertions(+), 0 deletions(-)



Source URL:
http://kerneltrap.org/node/14029