Parallel Optimized Host Message Exchange Layered File System

Submitted by Jeremy on May 14, 2008 - 12:45pm.

"I'm pleased to announce [the] POHMEL high performance network filesystem. POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System," began Evgeniy Polyakov, explaining:

"This is a high performance network filesystem with local coherent cache of data and metadata. Its main goal is distributed parallel processing of data. Network filesystem is a client transport. POHMELFS protocol was proven to be superior to NFS in lots (if not all, then it is in a roadmap) operations."

This latest release prompted Jeff Garzik to reply, "this continues to be a neat and interesting project :)" New features include fast transactions, round-robin failover, and metadata performance close to the wire limit. These add to existing features, which include a local coherent data and metadata cache, asynchronous processing of most events, and a fast and scalable multithreaded userspace server. Planned features include a server extension to allow mirroring data across multiple devices, strong authentication, and possible data encryption when transferring data over the network. Evgeniy linked to several benchmarks in his blog.
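
As a rough illustration of the round-robin failover described above, the sketch below retries a transaction against the next server in a list whenever the current one is unreachable. It is not POHMELFS code: `ServerDown`, `send_with_failover`, and the `dead`/`alive` stubs are hypothetical names, and each "server" is just a Python callable standing in for a network connection.

```python
from itertools import cycle


class ServerDown(Exception):
    """Raised by a (hypothetical) server stub that cannot be reached."""


def send_with_failover(transaction, servers, max_attempts=None):
    """Offer the transaction to servers in round-robin order.

    Each entry in `servers` is a callable that returns a reply or raises
    ServerDown. The same transaction is resent until one server accepts it
    or the attempt budget (default: one try per server) is exhausted.
    """
    if not servers:
        raise ValueError("need at least one server")
    attempts = max_attempts if max_attempts is not None else len(servers)
    ring = cycle(servers)
    for _ in range(attempts):
        server = next(ring)
        try:
            return server(transaction)
        except ServerDown:
            continue  # resend the same transaction to the next server
    raise ServerDown("all servers failed")


def dead(_transaction):
    """Stub for a server that has died."""
    raise ServerDown("no route to host")


def alive(transaction):
    """Stub for a healthy server that acknowledges what it received."""
    return ("ack", transaction)


print(send_with_failover(b"write-block-7", [dead, alive]))
```

The caller never sees the first failure; it only gets an error if every server in the list refuses the transaction.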


From: Evgeniy Polyakov
To: <linux-kernel@...>
Subject: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: Tuesday, May 13, 2008 - 12:45 pm

Hi.

I'm pleased to announce POHMEL high performance network filesystem.
POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

Development status can be tracked in filesystem section [1].

This is a high performance network filesystem with local coherent cache of data
and metadata. Its main goal is distributed parallel processing of data. Network 
filesystem is a client transport. POHMELFS protocol was proven to be superior to
NFS in lots (if not all, then it is in a roadmap) operations.

This release brings following features:
 * Fast transactions. System will wrap all writings into transactions, which
 	will be resent to different (or the same) server in case of failure.
	Details in notes [1].
 * Failover. It is now possible to provide a number of servers to be used in
 	round-robin fashion when one of them dies. System will automatically
	reconnect to others and send transactions to them.
 * Performance. Super fast (close to wire limit) metadata operations over
 	the network. By courtesy of writeback cache and transactions the whole
	kernel archive can be untarred in 2-3 seconds (including sync) over
	GigE link (wire limit! Not comparable to NFS).

Basic POHMELFS features:
    * Local coherent (notes [5]) cache for data and metadata.
    * Completely async processing of all events (hard and symlinks are the only 
    	exceptions) including object creation and data reading.
    * Flexible object architecture optimized for network processing. Ability to
    	create long paths to objects and remove arbitrarily huge directories in a
	single network command.
    * High performance is one of the main design goals.
    * Very fast and scalable multithreaded userspace server. Being in userspace
    	it works with any underlying filesystem and still is much faster than
	async in-kernel NFS one.

Roadmap includes:
    * Server extension to allow storing data on multiple devices (like creating mirroring),
    	first by saving data in several local directories (think about server, which mounted
	remote dirs over POHMELFS or NFS, and local dirs).
    * Client/server extension to report lookup and readdir requests not only for local
    	destination, but also to different addresses, so that reading/writing could be
	done from different nodes in parallel.
    * Strong authentication and possible data encryption in network channel.
    * Async writing of the data from receiving kernel thread into
    	userspace pages via copy_to_user() (check development tracking
	blog for results).

One can grab sources from archive or git [2] or check homepage [3].
Benchmark section can be found in the blog [4].

The nearest roadmap (scheduled for the end of the month) includes:
 * Full transaction support for all operations (only writeback is
 	guarded by transactions currently, default network state
	just reconnects to the same server).
 * Data and metadata coherency extensions (in addition to existing
	commented object creation/removal messages). (next week)
 * Server redundancy.

Thank you.

1. POHMELFS development status.
http://tservice.net.ru/~s0mbre/blog/devel/fs/index.html

2. Source archive.
http://tservice.net.ru/~s0mbre/archive/pohmelfs/
Git tree.
http://tservice.net.ru/~s0mbre/archive/pohmelfs/pohmelfs.git/

3. POHMELFS homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=pohmelfs

4. POHMELFS vs NFS benchmark.
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_18.html
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_14.html
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_05_12.html

5. Cache-coherency notes.
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_21.html
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_22.html

Signed-off-by: Evgeniy Polyakov 

 fs/Kconfig               |    2 +
 fs/Makefile              |    1 +
 fs/pohmelfs/Kconfig      |    6 +
 fs/pohmelfs/Makefile     |    3 +
 fs/pohmelfs/config.c     |  148 +++++
 fs/pohmelfs/dir.c        | 1009 ++++++++++++++++++++++++++++++
 fs/pohmelfs/inode.c      | 1543 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/pohmelfs/net.c        |  800 ++++++++++++++++++++++++
 fs/pohmelfs/netfs.h      |  426 +++++++++++++
 fs/pohmelfs/path_entry.c |  278 +++++++++
 fs/pohmelfs/trans.c      |  469 ++++++++++++++
 11 files changed, 4685 insertions(+), 0 deletions(-)

From: Jeff Garzik <jeff@...>
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: May 13, 3:09 pm 2008

This continues to be a neat and interesting project :)

Where is the best place to look at client<->server protocol?

Are you planning to support the case where the server filesystem dataset 
does not fit entirely on one server?

What is your opinion of the Paxos algorithm?

	Jeff



--

From: Evgeniy Polyakov <johnpol@...>
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: May 13, 4:51 pm 2008

Hi.

On Tue, May 13, 2008 at 03:09:06PM -0400, Jeff Garzik (jeff@garzik.org) wrote:
> This continues to be a neat and interesting project :)

Thanks :)

> Where is the best place to look at client<->server protocol?

Hmm, in sources I think, I need to kick myself to write a somewhat good
spec for the next release.

Basically protocol consists of fixed sized header (struct netfs_cmd) and
attached data, which size is embedded into above header. Simple commands
are finished here (essentially all except write/create commands), you
can check them in appropriate address space/inode operations.
Transactions follow netlink (which is very ugly but exceptionally
extendible) protocol: there is main header (above structure), which
holds size of the embedded data, which can be dereferenced as
header/data parts, where each inner header corresponds to any command
(except transaction header). So one can pack (up to 90 pages of data or
different commands on x86, this is limit of the page size devoted to
headers) requested number of commands into single 'frame' and submit it
to system, which will care about atomicity of that request in regards of
being either fully processed by one of the servers or dropped.

> Are you planning to support the case where the server filesystem dataset
> does not fit entirely on one server?

Sure. First by allowing whole object to be placed on different servers
(i.e. one subdir is on server1 and another on server2), probably in the
future there will be added support for the same object being distributed
to different servers (i.e. half of the big file on server1 and another
half on server2).

> What is your opinion of the Paxos algorithm?

It is slow. But it does solve failure cases.

So far POHMELFS does not work as distributed filesystem, so it should
not care about it at all, i.e. at most in the very nearest future it
will just have number of acceptors in paxos terminology (metadata
servers in others) without need for active dynamical reconfiguration, so
protocol will be greatly reduced, with addition of dynamical metadata
cluster extension protocol will have to be extended.

As practice shows, the smaller and simpler initial steps are, the better
results eventually become :)

-- 
	Evgeniy Polyakov

--

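
The framing described in the reply above (a fixed-size header that carries the size of the attached data, with a transaction nesting several header/data parts inside one outer frame) can be sketched roughly as follows. This is an illustration only: the two-field header layout, the command ids, and the function names are assumptions made for the example and do not reflect the real `struct netfs_cmd` in fs/pohmelfs/netfs.h.

```python
import struct

# Hypothetical header: 16-bit command id + 32-bit payload size, network order.
# The real struct netfs_cmd has its own fields; this layout is an assumption.
HEADER = struct.Struct("!HI")

# Made-up command ids, for illustration only.
CMD_TRANS = 0
CMD_WRITE = 1
CMD_CREATE = 2


def pack_cmd(cmd, payload=b""):
    """Fixed-size header followed by the attached data it describes."""
    return HEADER.pack(cmd, len(payload)) + payload


def pack_transaction(commands):
    """Pack several (cmd, payload) pairs into one outer 'frame'.

    The outer header's size field covers all inner header/data parts.
    """
    inner = b"".join(pack_cmd(cmd, payload) for cmd, payload in commands)
    return pack_cmd(CMD_TRANS, inner)


def unpack_transaction(frame):
    """Walk the inner header/data parts of a transaction frame."""
    cmd, size = HEADER.unpack_from(frame, 0)
    if cmd != CMD_TRANS:
        raise ValueError("not a transaction frame")
    body = frame[HEADER.size:HEADER.size + size]
    if len(body) != size:
        raise ValueError("truncated frame")
    commands = []
    offset = 0
    while offset < len(body):
        inner_cmd, inner_size = HEADER.unpack_from(body, offset)
        offset += HEADER.size
        if offset + inner_size > len(body):
            raise ValueError("truncated inner command")
        commands.append((inner_cmd, body[offset:offset + inner_size]))
        offset += inner_size
    return commands


frame = pack_transaction([(CMD_CREATE, b"dir/file"), (CMD_WRITE, b"hello")])
print(len(frame), unpack_transaction(frame))
```

Because the receiver knows the total size from the outer header before it reads any inner command, it can treat the frame as a unit: either every inner command is processed or the whole frame is dropped.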
From: Sage Weil <sage@...>
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: May 14, 9:35 am 2008

> > What is your opinion of the Paxos algorithm?
>
> It is slow. But it does solve failure cases.

For writes, Paxos is actually more or less optimal (in the non-failure
cases, at least). Reads are trickier, but there are ways to keep that
fast as well. FWIW, Ceph extends basic Paxos with a leasing mechanism to
keep reads fast, consistent, and distributed. It's only used for cluster
state, though, not file data.

I think the larger issue with Paxos is that I've yet to meet anyone who
wants their data replicated 3 ways (this despite newfangled 1TB+ disks not
having enough bandwidth to actually _use_ the data they store).
Similarly, if only 1 out of 3 replicas is surviving, most people want to
be able to read their data, while Paxos demands a majority to ensure it is
correct. (This is why Paxos is typically used only for critical cluster
configuration/state, not regular data.)

sage

--

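
The majority requirement mentioned in the message above comes down to simple arithmetic, sketched here with hypothetical helper names. A majority of n replicas is n // 2 + 1, so two replicas tolerate no failures at all and three is the smallest group that survives the loss of one member.

```python
def majority(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas may fail while a majority is still reachable."""
    return replicas - majority(replicas)


def can_decide(replicas, alive):
    """True if the surviving replicas still form a majority."""
    return alive >= majority(replicas)


for n in (2, 3, 5):
    print(n, majority(n), tolerated_failures(n))

# One survivor out of three is not a majority, which is the case
# where a majority-based protocol refuses to answer.
print(can_decide(3, 1))
```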
From: Evgeniy Polyakov <johnpol@...>
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: May 14, 9:52 am 2008

Hi Sage.

On Wed, May 14, 2008 at 06:35:19AM -0700, Sage Weil (sage@newdream.net) wrote:
> > > What is your opinion of the Paxos algorithm?
> >
> > It is slow. But it does solve failure cases.
>
> For writes, Paxos is actually more or less optimal (in the non-failure
> cases, at least). Reads are trickier, but there are ways to keep that
> fast as well. FWIW, Ceph extends basic Paxos with a leasing mechanism to
> keep reads fast, consistent, and distributed. It's only used for cluster
> state, though, not file data.

Well, it depends... If we are talking about single node performance, then
any protocol, which requires to wait for authorization (or any approach,
which waits for acknowledge just after data was sent) is slow.

If we are talking about aggregate parallel performance, then its basic
protocol with 2 messages is (probably) optimal, but still I'm not
convinced, that 2 messages case is a good choice, I want one :)

> I think the larger issue with Paxos is that I've yet to meet anyone who
> wants their data replicated 3 ways (this despite newfangled 1TB+ disks not
> having enough bandwidth to actually _use_ the data they store).
> Similarly, if only 1 out of 3 replicas is surviving, most people want to
> be able to read their data, while Paxos demands a majority to ensure it is
> correct. (This is why Paxos is typically used only for critical cluster
> configuration/state, not regular data.)

I.e. having more than single node to be failed?
Google uses 3-way replication, but I can not see any factor, which will
force people from lowering failure recovering expectations.

-- 
	Evgeniy Polyakov

--

