POHMELFS Performance

Submitted by Jeremy
on June 16, 2008 - 9:56am

"I regularly run and post various benchmarks comparing POHMELFS, NFS, XFS and Ext4, [the] main goal of POHMELFS at this stage is to be essentially as fast as [the] underlying local filesystem. And it is..." explained Evgeniy Polyakov, suggesting that the POHMELFS network filesystem performs 10% to 300% faster than NFS, depending on the workload. He noted that random reads are the one case that has not yet reached local filesystem speed, and that he is currently working on it. He summarized the new features found in the latest release:

"Read request (data read, directory listing, lookup requests) balancing between multiple servers; write requests are sent to multiple servers and completed only when all of them send an ack; [the] ability to add and/or remove servers from [the] working set at run-time from userspace; documentation (overall view and protocol commands); rename command; several new mount options to control client behaviour instead of hard coded numbers."

Looking forward, Evgeniy noted that this was likely the last non-bugfix release of the kernel client side implementation, suggesting that the next release would focus on adding server side features, "needed for distributed parallel data processing (like the ability to add new servers via network commands from another server), so most of the work will be devoted to server code."
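
The read balancing and write-to-all behaviour from the release summary above can be illustrated with a minimal sketch. This is not POHMELFS code: the Server and Client classes are hypothetical in-memory stand-ins, and round-robin is only an assumed balancing policy, since the announcement does not say how reads are distributed.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Hypothetical stand-in for one server: an in-memory store."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, path, data):
        self.store[path] = data
        return True  # the "ack"

    def read(self, path):
        return self.store[path]


class Client:
    def __init__(self, servers):
        self.servers = list(servers)
        # Assumed policy: plain round-robin over the working set.
        self._next = itertools.cycle(self.servers)

    def write(self, path, data):
        # Send the write to every server and complete only when
        # all of them have acknowledged it.
        with ThreadPoolExecutor(max_workers=len(self.servers)) as pool:
            acks = list(pool.map(lambda s: s.write(path, data), self.servers))
        return all(acks)

    def read(self, path):
        # Each read request goes to the next server in turn.
        server = next(self._next)
        return server.name, server.read(path)


servers = [Server("a"), Server("b")]
client = Client(servers)
write_ok = client.write("/file", b"hello")
readers = [client.read("/file")[0] for _ in range(4)]
```

After the single write both servers hold the data, and the four reads alternate between server "a" and server "b".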


From: Evgeniy Polyakov
Subject: [0/3] POHMELFS high performance network filesystem. First steps in parallel processing.
Date: Jun 13, 9:37 am 2008

Hi.

I'm pleased to announce POHMEL high performance network parallel
distributed filesystem.
POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

Development status can be tracked in filesystem section [1].

This is a high performance network filesystem with local coherent cache of data
and metadata. Its main goal is distributed parallel processing of data.

This release brings following features:
 * Read requests (data read, directory listing, lookup requests) balancing
 	between multiple servers.
 * Write requests are sent to multiple servers and completed only
 	when all of them sent an ack.
 * Ability to add and/or remove servers from working set at run-time from
	 userspace (via netlink, so the same command can be processed from
	 real network though, but since server does not support it yet,
	 I dropped network part).
 * Documentation (overall view and protocol commands)!
 * Rename command (oops, forgot it in previous releases :)
 * Several new mount options to control client behaviour instead of
	hardcoded numbers.
 * Bug fixes.

Very likely it is one of the last non-bug-fixing releases of the kernel
client side, next release will incorporate features, needed for distributed
parallel data processing (like ability to add new servers via network
command from another server), so most of the work will be devoted to server
code.


Basic POHMELFS features:
 * Local coherent (notes [5]) cache for data and metadata.
 * Completely async processing of all events (hard and symlinks are the only
    	exceptions) including object creation and data reading/writing.
 * Flexible object architecture optimized for network processing. Ability to
    	create long paths to object and remove arbitrarily huge directories in
	single network command.
 * High performance is one of the main design goals.
 * Very fast and scalable multithreaded userspace server. Being in userspace
    	it works with any underlying filesystem and still is much faster than
	async in-kernel NFS one.
 * Client is able to switch between different servers (if one goes down,
	client automatically reconnects to second and so on).
 * Transactions support. Full failover for all operations. Resending
	transactions to different servers on timeout or error.

Roadmap includes:
 * Server redundancy extensions (ability to store data in multiple locations
	according to regexp rules, like '*.txt' in /root1 and '*.jpg' in /root1
	and /root2).
 * Strong authentication and possible data encryption in network
	channel.
 * Async writing of the data from receiving kernel thread into userspace
	pages via copy_to_user() (check development tracking blog for results).
 * Client dynamical server reconfiguration: ability to add/remove servers
	from working set by server command (as part of development distributed
	server facilities).
 * Start development of the generic parallel distributed server.

One can grab sources from archive or git [2] or check homepage [3].

Thank you.

1. POHMELFS development status.
http://tservice.net.ru/~s0mbre/blog/devel/fs/index.html

2. Source archive.
http://tservice.net.ru/~s0mbre/archive/pohmelfs/
Git tree.
http://tservice.net.ru/~s0mbre/archive/pohmelfs/pohmelfs.git/

3. POHMELFS homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=pohmelfs

4. POHMELFS vs NFS benchmark [iozone results are coming].
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_18.html
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_04_14.html
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_05_12.html

5. Cache-coherency notes.
http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_05_17.html

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

-- 
	Evgeniy Polyakov
--

From: Jeff Garzik
Subject: Re: [0/3] POHMELFS high performance network filesystem. First steps in parallel processing.
Date: Jun 14, 2:52 am 2008

Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce POHMEL high performance network parallel
> distributed filesystem.
> POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
>
> Development status can be tracked in filesystem section [1].
>
> This is a high performance network filesystem with local coherent cache of data
> and metadata. Its main goal is distributed parallel processing of data.
>
> This release brings following features:
>  * Read requests (data read, directory listing, lookup requests) balancing
>  	between multiple servers.
>  * Write requests are sent to multiple servers and completed only
>  	when all of them sent an ack.
>  * Ability to add and/or remove servers from working set at run-time from
> 	userspace (via netlink, so the same command can be processed from
> 	real network though, but since server does not support it yet,
> 	I dropped network part).
>  * Documentation (overall view and protocol commands)!
>  * Rename command (oops, forgot it in previous releases :)
>  * Several new mount options to control client behaviour instead of
> 	hardcoded numbers.
>  * Bug fixes.

Neat :)  Thanks for protocol documentation, too.  Do you plan to add
write-pages in addition to write-page?  Also, write-page does not appear
to be documented.

Is race-across-directories race-free?  That is a sticky area, see
Documentation/filesystems/directory-locking in particular.

With the exception of encryption, do you think the POHMELFS client is
mostly complete, at this point?

	Jeff
--

From: Evgeniy Polyakov
Subject: Re: [0/3] POHMELFS high performance network filesystem. First steps in parallel processing.
Date: Jun 14, 3:10 am 2008

On Sat, Jun 14, 2008 at 05:52:38AM -0400, Jeff Garzik (jeff@garzik.org) wrote:
> Neat :)  Thanks for protocol documentation, too.  Do you plan to add
> write-pages in addition to write-page?  Also, write-page does not appear
> to be documented.

->writepage() is not needed at all (it does not even exist anymore :),
and only ->writepages() is used.

> Is race-across-directories race-free?  That is a sticky area, see
> Documentation/filesystems/directory-locking in particular.

POHMELFS relies on VFS to handle that cases, it does not invent own
stuff here. Although there is kind of a path/name cache, it is very
trivial and small.

> With the exception of encryption, do you think the POHMELFS client is
> mostly complete, at this point?

I think I will extend its command structure to support checksum (i.e.
add 64bit field unused for now), all other protocol changes are supposed
to be on the highest level (like new commands), so it should not hurt
others. I have to think about locking (file locks on server, not
POHMELFS internal locking :) some more, but so far I do not see, how it
can change the picture.

Another task is to move from slab allocation (kmalloc and friends) to
memory pools, like it was done for transaction destinations.

I do not plan serious changes in client (I frankly do not know, what
else I want there :), so, yes, I think that most of the client side is
ready.

-- 
	Evgeniy Polyakov
--

From: Jamie Lokier
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 13, 7:15 pm 2008

> * Fast and scalable multithreaded userspace server. Being in
>   userspace it works with any underlying filesystem and still is
>   much faster than async in-kernel NFS one.

That's interesting :-)

> POHMELFS uses novel asynchronous approach of data
> processing. Courtesy to transactions, it is possible to detach
> reply from request, and if command requires data to be received,
> caller just sleeps waiting for it. Thus it is possible to issue
> multiple read commands to different servers and async threads will
> pick replies in parallel, find appropriate transactions in the
> system and put data where it belongs (like page or inode cache).

That sounds great, but what do you mean by 'novel'?  Don't other
modern network filesystems use asynchronous requests and replies in
some form?  It seems like the obvious thing.

> * Transactions support. Full failover for all operations.
>   Resending transactions to different servers on timeout or error.

By transactions, do you mean an atomic set of writes/changes?
Or do you trace read dependencies too?

> Main feature of the POHMELFS is writeback data and metadata cache.
> [...] Creation and removal of objects, as well as writing, are
> asynchronous and are sent to the server during system writeback.
> When server receives some request for given object in the system
> (like data reading, or file creation or whatever else), it stores
> appropriate client information in own cache, so when subsequent
> request comes from different client, all previous could be notified
> (for example when several clients read data from file, and then new
> client writes there, appropriate pages on clients will be
> invalidated, so subsequent write will force them to read page from
> the server). Because of this feature POHMELFS is extremely fast in
> metadata intensive workloads, and can fully utilize bandwidth to
> servers when doing bulk data transfers.

This is extremely cool, and obviously the right thing to do.  No sane
network filesystem would be without it, one naively hopes :-)

How is it different from NFSv4 leases and SMB oplocks?  Or are they
the same basic idea?

With all those asynchronous requests, are your writeback caches fully
coherent?  Example.  Client A reads file X (data: x0), then writes X
(new data: x1), then reads Y (data: y0), then writes Y (data: y1).
Client B reads Y then reads X.  Is it guaranteed that client B cannot
ever get data y1 and x0?  A fully coherent system (meaning behaves
like a local filesystem) does guarantee that.  If cache requests for
file X and file Y are independent, this is not guaranteed.

-- Jamie
--
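
Jamie's X/Y example can be made concrete with a small simulation. The sketch below is an assumption-laden toy, not the POHMELFS cache: a dict stands in for the server, another for client A's writeback cache, and each file is flushed independently, which is exactly the condition under which Jamie says the ordering guarantee is lost.

```python
server = {"X": "x0", "Y": "y0"}   # state visible to other clients
dirty = {}                        # client A's writeback cache


def write(name, data):
    # The write is cached locally and is not yet on the server.
    dirty[name] = data


def writeback(name):
    # Flush one file, independently of any other dirty file.
    server[name] = dirty.pop(name)


# Client A writes X first, then Y.
write("X", "x1")
write("Y", "y1")

# Writeback is per file, and Y happens to be flushed first.
writeback("Y")

# Client B reads Y and then X between the two flushes.
seen_by_b = (server["Y"], server["X"])

writeback("X")
```

Here client B observes ("y1", "x0"): the later write without the earlier one, which a local filesystem would never show.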

From: Evgeniy Polyakov
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 13, 11:56 pm 2008

On Sat, Jun 14, 2008 at 03:15:47AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> > * Fast and scalable multithreaded userspace server. Being in
> >   userspace it works with any underlying filesystem and still is
> >   much faster than async in-kernel NFS one.
>
> That's interesting :-)

Moreover, that's true :)
I regularly run and post various benchmarks comparing POHMELFS, NFS, XFS
and Ext4, main goal of POHMELFS at this stage is to be essentially as
fast as underlying local filesystem. And it is...
Though there is a single place (random reading, all others reached FS
speed, so it is from 10 to 300% faster than NFS in various loads :), but
I'm working on it, I think it is not server's side though.

> That sounds great, but what do you mean by 'novel'?  Don't other
> modern network filesystems use asynchronous requests and replies in
> some form?  It seems like the obvious thing.

Maybe it was a bit naive though :)
But I checked lots of implementation, all of them use send()/recv()
approach. NFSv4 uses a bit different, but it is a cryptic, and at least
from its names it is not clear:
like nfs_pagein_multi() -> nfs_pageio_complete() -> add_stats. Presumably
we add stats when we have data handy...
CIFS/SMB use synchronous approach.

From those projects, which are not in kernel, like CRFS and CEPH, the
former uses async receiving thread, while the latter is synchronous,
but can select different servers for reading, more like NFSv4.1 leases.

> > * Transactions support. Full failover for all operations.
> >   Resending transactions to different servers on timeout or error.
>
> By transactions, do you mean an atomic set of writes/changes?
> Or do you trace read dependencies too?

It covers all operations, including reading, directory listing, lookups,
attribute changes and so on. Its main goal is to allow transparent
failover, so it has to be done for reading too.

> > Main feature of the POHMELFS is writeback data and metadata cache.
> > [...] Creation and removal of objects, as well as writing, are
> > asynchronous and are sent to the server during system writeback.
> > When server receives some request for given object in the system
> > (like data reading, or file creation or whatever else), it stores
> > appropriate client information in own cache, so when subsequent
> > request comes from different client, all previous could be notified
> > (for example when several clients read data from file, and then new
> > client writes there, appropriate pages on clients will be
> > invalidated, so subsequent write will force them to read page from
> > the server). Because of this feature POHMELFS is extremely fast in
> > metadata intensive workloads, and can fully utilize bandwidth to
> > servers when doing bulk data transfers.
>
> This is extremely cool, and obviously the right thing to do.  No sane
> network filesystem would be without it, one naively hopes :-)
>
> How is it different from NFSv4 leases and SMB oplocks?  Or are they
> the same basic idea?
>
> With all those asynchronous requests, are your writeback caches fully
> coherent?  Example.  Client A reads file X (data: x0), then writes X
> (new data: x1), then reads Y (data: y0), then writes Y (data: y1).
> Client B reads Y then reads X.  Is it guaranteed that client B cannot
> ever get data y1 and x0?  A fully coherent system (meaning behaves
> like a local filesystem) does guarantee that.  If cache requests for
> file X and file Y are independent, this is not guaranteed.

Oplocks and leases are essentially lock on given file, which allows one
client to operate on it. POHMELFS does not have locks now, and they will
be created depending on how distributed server will require them. In the
simplest case it can just lock file for writing and do not allow its
updates from other clients. Lock acquire can be done at write_begin
time.

Without lock and writeback cache in your case writeback for file Y can
happen before writeback for file X, but if client does not only write,
but also sync after its write, then yes, client will see later updates
after more earlier. POHMELFS does not broadcast its interest in the file
content until real writing happens, i.e. at writeback time. Although I
can add a mode, when the same will be done during write_begin() time. In
that case your example will work without sync.

> -- Jamie

-- 
	Evgeniy Polyakov
--

From: Sage Weil
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 14, 9:27 pm 2008

Hi Evgeniy,

On Sat, 14 Jun 2008, Evgeniy Polyakov wrote:
> > That sounds great, but what do you mean by 'novel'?  Don't other
> > modern network filesystems use asynchronous requests and replies in
> > some form?  It seems like the obvious thing.
>
> Maybe it was a bit naive though :)
> But I checked lots of implementation, all of them use send()/recv()
> approach. NFSv4 uses a bit different, but it is a cryptic, and at least
> from its names it is not clear:
> like nfs_pagein_multi() -> nfs_pageio_complete() -> add_stats. Presumably
> we add stats when we have data handy...
> CIFS/SMB use synchronous approach.

By synchronous/asynchronous, are you talking about whether writepages()
blocks until the write is acked by the server?  (Really, any FS that does
writeback is writing asynchronously...)

> From those projects, which are not in kernel, like CRFS and CEPH, the
> former uses async receiving thread, while the latter is synchronous,
> but can select different servers for reading, more like NFSv4.1 leases.

Well... Ceph writes synchronously (i.e. waits for ack in write()) only
when write-sharing on a single file between multiple clients, when it is
needed to preserve proper write ordering semantics.  The rest of the time,
it generates nice big writes via writepages().  The main performance issue
is with small files... the fact that writepages() waits for an ack and is
usually called from only a handful of threads limits overall throughput.
If the writeback path was asynchronous as well that would definitely help
(provided writeback is still appropriately throttled).  Is that what
you're doing in POHMELFS?

> > > * Transactions support. Full failover for all operations.
> > >   Resending transactions to different servers on timeout or error.
> >
> > By transactions, do you mean an atomic set of writes/changes?
> > Or do you trace read dependencies too?
>
> It covers all operations, including reading, directory listing, lookups,
> attribute changes and so on. Its main goal is to allow transparent
> failover, so it has to be done for reading too.

Your meaning of "transaction" confused me as well.  It sounds like you
just mean that the read/write operation is retried (asynchronously), and
may be redirected at another server if need be.  And that writes can be
directed at multiple servers, waiting for an ack from both.  Is that
right?

In my view the writeback metadata cache is definitely the most exciting
part about this project.  Is there a document that describes where the
design ended up?  I seem to remember a string of posts describing your
experiments with client-side inode number assignment and how that is
reconciled with the server.  Keeping things consistent between clients is
definitely the tricky part, although I suspect that even something with
very coarse granularity (e.g., directory/subtree-based locking/leasing)
will capture most of the performance benefits for most workloads.

Cheers-
sage
--

From: Evgeniy Polyakov
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 14, 10:57 pm 2008

Hi Sage.

On Sat, Jun 14, 2008 at 09:27:55PM -0700, Sage Weil (sage@newdream.net) wrote:
> By synchronous/asynchronous, are you talking about whether writepages()
> blocks until the write is acked by the server?  (Really, any FS that does
> writeback is writing asynchronously...)

Yes, not only writepage, but any request - if it sends request and then
receives reply (i.e. doing send/recv sequence without ability to do
something else in between or allow other users to do sends or receives
into the same socket), then it is synchronous. If it only sends, and
someone else receives, it is possible to send multiple requests from
different users who do reads or writes or lookups or whatever and
asynchronously in different thread receive replies not in particular
order, so this approach I call asynchronous.

> Well... Ceph writes synchronously (i.e. waits for ack in write()) only
> when write-sharing on a single file between multiple clients, when it is
> needed to preserve proper write ordering semantics.  The rest of the time,
> it generates nice big writes via writepages().  The main performance issue
> is with small files... the fact that writepages() waits for an ack and is
> usually called from only a handful of threads limits overall throughput.
> If the writeback path was asynchronous as well that would definitely help
> (provided writeback is still appropriately throttled).  Is that what
> you're doing in POHMELFS?

Yes, POHMELFS does writing that way.

> > > > * Transactions support. Full failover for all operations.
> > > >   Resending transactions to different servers on timeout or error.
> > >
> > > By transactions, do you mean an atomic set of writes/changes?
> > > Or do you trace read dependencies too?
> >
> > It covers all operations, including reading, directory listing, lookups,
> > attribute changes and so on. Its main goal is to allow transparent
> > failover, so it has to be done for reading too.
>
> Your meaning of "transaction" confused me as well.  It sounds like you
> just mean that the read/write operation is retried (asynchronously), and
> may be redirected at another server if need be.  And that writes can be
> directed at multiple servers, waiting for an ack from both.  Is that
> right?

Not exactly. Transaction in a nutshell is a wrapper on top of command
(or multiple commands if needed like in writing), which contains all
information needed to perform appropriate action. When user calls read()
or 'ls' or write() or whatever, POHMELFS creates transaction for that
operation and tries to perform it (if operation is not cached, in that
case nothing actually happens). When transaction is submitted, it
becomes part of the failover state machine which will check if data has
to be read from different server or written to new one or dropped.
Original caller may not even know from which server its data will be
received. If request sending failed in the middle, the whole transaction
will be redirected to new one. It is also possible to redo transaction
against different server, if server sent us error (like I'm busy), but
this functionality was dropped in previous release iirc, this can be
resurrected though. Having generic transaction tree callers do not
bother about how to store theirs requests, how to wait for results and
how to complete them - transactions do it for them. It is not rocket
science, but extremely effective and simple way to help rule out
asynchronous machinery.

> In my view the writeback metadata cache is definitely the most exciting
> part about this project.  Is there a document that describes where the
> design ended up?  I seem to remember a string of posts describing your
> experiments with client-side inode number assignment and how that is
> reconciled with the server.  Keeping things consistent between clients is
> definitely the tricky part, although I suspect that even something with
> very coarse granularity (e.g., directory/subtree-based locking/leasing)
> will capture most of the performance benefits for most workloads.

That was somewhat old approach, currently inode numbers and things like
open-by-inode or NFS style open-by-cookie are not used. I tried to
describe caching bits in documentation I sent, although it's a bit rough
and likely incomplete :) Feel free to ask if there are some white areas
there.

-- 
	Evgeniy Polyakov
--

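The transaction mechanism Evgeniy describes (requests detached from replies, a shared table of pending work, and resending on failure) can be sketched roughly as follows. This is a single-threaded toy under stated assumptions: the Transaction and Client classes, the integer ids, the `wire` list standing in for the socket, and the "resend everything pending to the next server" policy are all hypothetical, not taken from the POHMELFS source.

```python
import itertools


class Transaction:
    """Wrapper around a command plus what is needed to complete it."""

    def __init__(self, tid, command, on_complete):
        self.tid = tid
        self.command = command
        self.on_complete = on_complete


class Client:
    def __init__(self, servers):
        self.servers = servers        # list of server names
        self.current = 0              # index of the server in use
        self.pending = {}             # tid -> Transaction
        self._ids = itertools.count(1)
        self.wire = []                # (server, tid, command) actually sent

    def submit(self, command, on_complete):
        # Sender side: register the transaction, send it, return at once.
        trans = Transaction(next(self._ids), command, on_complete)
        self.pending[trans.tid] = trans
        self._send(trans)
        return trans.tid

    def _send(self, trans):
        self.wire.append((self.servers[self.current], trans.tid, trans.command))

    def on_reply(self, tid, data):
        # Receiver side: find the transaction by id and complete it.
        trans = self.pending.pop(tid, None)
        if trans is not None:
            trans.on_complete(data)

    def on_server_failure(self):
        # Failover: switch to the next server and resend what is pending.
        self.current = (self.current + 1) % len(self.servers)
        for trans in list(self.pending.values()):
            self._send(trans)


results = {}
client = Client(["s1", "s2"])
t1 = client.submit("read /a", lambda d: results.setdefault("a", d))
t2 = client.submit("read /b", lambda d: results.setdefault("b", d))
client.on_reply(t2, "B-data")     # replies may arrive in any order
client.on_server_failure()        # s1 goes away before answering t1
client.on_reply(t1, "A-data")     # the answer now comes from s2
```

The callers never wait on the socket themselves; the pending table is what lets the second reply complete first and lets the unanswered read be redirected to "s2".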
From: Sage Weil
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 15, 9:41 am 2008

On Sun, 15 Jun 2008, Evgeniy Polyakov wrote:
> Yes, not only writepage, but any request - if it sends request and then
> receives reply (i.e. doing send/recv sequence without ability to do
> something else in between or allow other users to do sends or receives
> into the same socket), then it is synchronous. If it only sends, and
> someone else receives, it is possible to send multiple requests from
> different users who do reads or writes or lookups or whatever and
> asynchronously in different thread receive replies not in particular
> order, so this approach I call asynchronous.

Oh, so you just mean that the caller doesn't, say, hold a mutex for the
socket for the duration of the send _and_ recv?  I'm kind of shocked that
anyone does that, although I suppose in some cases the protocol
effectively demands it.

> Yes, POHMELFS does writing that way.

Nice.  I will definitely be taking a look at that.

> Not exactly. Transaction in a nutshell is a wrapper on top of command
> (or multiple commands if needed like in writing), which contains all
> information needed to perform appropriate action. When user calls read()
> or 'ls' or write() or whatever, POHMELFS creates transaction for that
> operation and tries to perform it (if operation is not cached, in that
> case nothing actually happens). When transaction is submitted, it
> becomes part of the failover state machine which will check if data has
> to be read from different server or written to new one or dropped.
> Original caller may not even know from which server its data will be
> received. If request sending failed in the middle, the whole transaction
> will be redirected to new one. It is also possible to redo transaction
> against different server, if server sent us error (like I'm busy), but
> this functionality was dropped in previous release iirc, this can be
> resurrected though. Having generic transaction tree callers do not
> bother about how to store theirs requests, how to wait for results and
> how to complete them - transactions do it for them. It is not rocket
> science, but extremely effective and simple way to help rule out
> asynchronous machinery.

Got it.  Tracking pending requests in some generic way is definitely key
to making failure handling sane with multiple servers.

> That was somewhat old approach, currently inode numbers and things like
> open-by-inode or NFS style open-by-cookie are not used. I tried to
> describe caching bits in documentation I sent, although it's a bit rough
> and likely incomplete :) Feel free to ask if there are some white areas
> there.

So what happens if the user creates a new file, and then does a stat() to
expose i_ino.  Does that value change later?  It's not just
open-by-inode/cookie that make ino important.

It looks like the client/server protocol is primarily path-based.  What
happens if you do something like

hosta$ cd foo
hosta$ touch foo.txt
hostb$ mv foo bar
hosta$ rm foo.txt

Will hosta realize it really needs to do "unlink /bar/foo.txt"?

sage
--

From: Evgeniy Polyakov
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 15, 10:50 am 2008

On Sun, Jun 15, 2008 at 09:41:44AM -0700, Sage Weil (sage@newdream.net) wrote:
> Oh, so you just mean that the caller doesn't, say, hold a mutex for the
> socket for the duration of the send _and_ recv?  I'm kind of shocked that
> anyone does that, although I suppose in some cases the protocol
> effectively demands it.

First, socket has own internal lock, which protects against simultaneous
access to its structures, but POHMELFS has own mutex, which guards
network operations for given network state, so if server disconnected,
socket can be released and zeroed if needed, so that subsequent access
could detect it and made appropriate decision like try to reconnect.

I really do not understand your surprise :)

But it is possible to create a scheme, when you do not need to hold a
lock between commands for successful completion. It is even possible not
to _expect_ that something will be received from given socket or
received at all. Courtesy of transactions: system locks only data, which
has to be processed, it does not lock sequence of commands which are
required for that data processing. Ordering is guarded by transactions.

> > That was somewhat old approach, currently inode numbers and things like
> > open-by-inode or NFS style open-by-cookie are not used. I tried to
> > describe caching bits in documentation I sent, although it's a bit rough
> > and likely incomplete :) Feel free to ask if there are some white areas
> > there.
>
> So what happens if the user creates a new file, and then does a stat() to
> expose i_ino.  Does that value change later?  It's not just
> open-by-inode/cookie that make ino important.

Local inode number is returned. Inode number does not change during
lifetime of the inode, so while it is alive always the same number will
be returned.

> It looks like the client/server protocol is primarily path-based.  What
> happens if you do something like
>
> hosta$ cd foo
> hosta$ touch foo.txt
> hostb$ mv foo bar
> hosta$ rm foo.txt
>
> Will hosta realize it really needs to do "unlink /bar/foo.txt"?

No, since it got a reference to object in local cache. But it will fail
to do something interesting with it, since it does not really exist on
server anymore.
When 'hosta' will reread higher directory (it will when needed, since
server will send it cache coherency message, but thanks to your example,
rename really does not send it, only remove :), so I will update server),
it will detect that directory changed its name and later will use it.
After reread system actually can not know if directory was renamed or it
is completely new one with the same files.

You pointed to very interesting behaviour of the path based approach,
which bothers me quite for a while:
since cache coherency messages have own round-trip time, there is always
a window when one client does not know that another one updated object
or removed it and created new one with the same name.

It is trivially possible to extend path cache with storing remote ids,
so that attempt to access old object would not harm new one with the
same name, but I want to think about it some more.
Correct solution is to use locks of course, and I'm not 100% sure it is
worth changing at all without them, but it is interesting...

-- 
	Evgeniy Polyakov
--

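The idea of storing remote ids alongside cached paths can be illustrated with a toy server. Everything here is an assumption made for illustration: the Server class, the integer ids, and the `expected_id` argument to unlink are hypothetical and do not reflect the actual POHMELFS protocol. The sketch replays the hosta/hostb sequence and adds the "new object with the same name" case.

```python
import itertools


class Server:
    def __init__(self):
        self._ids = itertools.count(1)
        self.objects = {}                  # path -> remote id

    def create(self, path):
        self.objects[path] = next(self._ids)
        return self.objects[path]

    def rename(self, old, new):
        # Move the object and everything below it to the new prefix.
        moved = [p for p in self.objects if p == old or p.startswith(old + "/")]
        for path in moved:
            self.objects[new + path[len(old):]] = self.objects.pop(path)

    def unlink(self, path, expected_id):
        # Refuse to touch an object that is not the one the client cached.
        if self.objects.get(path) != expected_id:
            return False
        del self.objects[path]
        return True


server = Server()
server.create("/foo")
cached_id = server.create("/foo/foo.txt")   # hosta: touch foo.txt

server.rename("/foo", "/bar")               # hostb: mv foo bar
server.create("/foo")                       # the old names are reused
new_id = server.create("/foo/foo.txt")

# hosta: rm foo.txt, still using its stale cached path.
removed = server.unlink("/foo/foo.txt", cached_id)
```

The stale unlink is rejected, so the new /foo/foo.txt survives. Note that the renamed original under /bar is not removed either: the id check only prevents the wrong action, it does not find the right object.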
From: Sage Weil
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 15, 8:17 pm 2008

On Sun, 15 Jun 2008, Evgeniy Polyakov wrote:
> On Sun, Jun 15, 2008 at 09:41:44AM -0700, Sage Weil (sage@newdream.net) wrote:
> > Oh, so you just mean that the caller doesn't, say, hold a mutex for the
> > socket for the duration of the send _and_ recv? I'm kind of shocked that
> > anyone does that, although I suppose in some cases the protocol
> > effectively demands it.
>
> First, socket has own internal lock, which protects against simultaneous
> access to its structures, but POHMELFS has own mutex, which guards
> network operations for given network state, so if server disconnected,
> socket can be released and zeroed if needed, so that subsequent access
> could detect it and made appropriate decision like try to reconnect.

Right...

> I really do not understand your surprise :)

Well, I must still be misunderstanding you :(. It sounded like you were
saying other network filesystems take the socket exclusively for the
duration of an entire operation (i.e., only a single RPC call outstanding
with the server at a time). And I'm pretty sure that isn't the case...

Which means I'm still confused as to how POHMELFS's transactions are
fundamentally different here from, say, NFS's use of RPC. In both cases,
multiple requests can be in flight, and the server is free to reply to
requests in any order. And in the case of a timeout, RPC requests are
resent (to the same server.. let's ignore failover for the moment). Am I
missing something? Or giving NFS too much credit here?

> > So what happens if the user creates a new file, and then does a stat() to
> > expose i_ino. Does that value change later? It's not just
> > open-by-inode/cookie that make ino important.
>
> Local inode number is returned. Inode number does not change during
> lifetime of the inode, so while it is alive always the same number will
> be returned.

I see. And if the inode drops out of the client cache, and is later
reopened, the st_ino seen by an application may change? st_ino isn't used
for much, but I wonder if that would impact a large cp or rsync's ability
to preserve hard links.

> > It looks like the client/server protocol is primarily path-based. What
> > happens if you do something like
> >
> > hosta$ cd foo
> > hosta$ touch foo.txt
> > hostb$ mv foo bar
> > hosta$ rm foo.txt
> >
> > Will hosta realize it really needs to do "unlink /bar/foo.txt"?
>
> No, since it got a reference to object in local cache. But it will fail
> to do something interesting with it, since it does not really exist on
> server anymore.
> When 'hosta' will reread higher directory (it will when needed, since
> server will send it cache coherency message, but thanks to your example,
> rename really does not send it, only remove :), so I will update server),
> it will detect that directory changed its name and later will use it.
> After reread system actually can not know if directory was renamed or it
> is completely new one with the same files.
>
> You pointed to very interesting behaviour of the path based approach,
> which bothers me quite for a while:
> since cache coherency messages have own round-trip time, there is always
> a window when one client does not know that another one updated object
> or removed it and created new one with the same name.

Not if the server waits for the cache invalidation to be acked before
applying the update. That is, treat the client's cached copy as a lease
or read lock. I believe this is how NFSv4 delegations behave, and it's
how Ceph metadata leases (dentries, inode contents) and file access
capabilities (which control sync vs async file access) behave. I'm not
all that familiar with samba, but my guess is that its leases are broken
synchronously as well.

> It is trivially possible to extend path cache with storing remote ids,
> so that attempt to access old object would not harm new one with the
> same name, but I want to think about it some more.

That's half of it... ideally, though, the client would have a reference to
the real object as well, so that the original foo.txt would be removed.
I.e. not only avoid doing the wrong thing, but also do the right thing.

I have yet to come up with a satisfying solution there. Doing a d_drop on
dentry lease revocation gets me most of the way there (Ceph's path
generation could stop when it hits an unhashed dentry and make the request
path relative to an inode), but the problem I'm coming up against is that
there is no explicit communication of the CWD between the VFS and fs
(well, that I know of), so the client doesn't know when it needs a real
reference to the directory (and I'm not especially keen on taking
references for _all_ cached directory inodes). And I'm not really sure
how .. is supposed to behave in that context.

Anyway...

sage
--
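
To illustrate the lease idea Sage describes above (the server waits for every cache invalidation to be acknowledged before it applies an update), here is a minimal Python sketch. It is not POHMELFS, Ceph, or NFSv4 code: the Client and Server classes and their methods are hypothetical names invented for this illustration, and network messages are modeled as plain method calls.

```python
class Client:
    """Hypothetical client with a local cache of path -> contents."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def invalidate(self, path):
        # Drop the cached copy, then acknowledge the revocation.
        self.cache.pop(path, None)
        return True  # stands in for the ack message


class Server:
    """Hypothetical server that treats every cached copy as a lease."""

    def __init__(self):
        self.data = {}
        self.leases = {}  # path -> set of clients caching that path

    def read(self, client, path):
        # A read grants a lease: remember who holds a cached copy.
        self.leases.setdefault(path, set()).add(client)
        client.cache[path] = self.data.get(path)
        return client.cache[path]

    def write(self, writer, path, value):
        # Revoke all other leases and require an ack for each one
        # BEFORE the update becomes visible on the server.
        for holder in list(self.leases.get(path, ())):
            if holder is writer:
                continue
            if not holder.invalidate(path):
                raise RuntimeError("lease revocation was not acknowledged")
            self.leases[path].discard(holder)
        self.data[path] = value
        if writer in self.leases.get(path, ()):
            writer.cache[path] = value  # keep the writer's own copy current


path = "/foo/foo.txt"
server = Server()
host_a, host_b = Client("hosta"), Client("hostb")
server.data[path] = "v1"

server.read(host_a, path)            # hosta now caches "v1"
server.write(host_b, path, "v2")     # hosta's lease is revoked first
reread = server.read(host_a, path)   # hosta must go back to the server
print(path in host_a.cache, reread)
```

The cost Evgeniy objects to in his reply is visible in write(): every update blocks on a round trip to each lease holder, so in this model there is no stale-cache window, but writes to shared objects become synchronous.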
From: Evgeniy Polyakov
Subject: Re: [2/3] POHMELFS: Documentation.
Date: Jun 16, 3:20 am 2008

Hi.

On Sun, Jun 15, 2008 at 08:17:46PM -0700, Sage Weil (sage@newdream.net) wrote:
> > I really do not understand your surprise :)
>
> Well, I must still be misunderstanding you :(. It sounded like you were
> saying other network filesystems take the socket exclusively for the
> duration of an entire operation (i.e., only a single RPC call outstanding
> with the server at a time). And I'm pretty sure that isn't the case...
>
> Which means I'm still confused as to how POHMELFS's transactions are
> fundamentally different here from, say, NFS's use of RPC. In both cases,
> multiple requests can be in flight, and the server is free to reply to
> requests in any order. And in the case of a timeout, RPC requests are
> resent (to the same server.. let's ignore failover for the moment). Am I
> missing something? Or giving NFS too much credit here?

Well, RPC is quite similar to what transaction is, at least its approach
to completion callbacks and their async invocation.

> > > So what happens if the user creates a new file, and then does a stat() to
> > > expose i_ino. Does that value change later? It's not just
> > > open-by-inode/cookie that make ino important.
> >
> > Local inode number is returned. Inode number does not change during
> > lifetime of the inode, so while it is alive always the same number will
> > be returned.
>
> I see. And if the inode drops out of the client cache, and is later
> reopened, the st_ino seen by an application may change? st_ino isn't used
> for much, but I wonder if that would impact a large cp or rsync's ability
> to preserve hard links.

There is number of cases when inode number will be preserved, like
parent inode holds its number in own subcache, so when it will lookup
object it will give it the same inode number, but generally if inode was
destroyed and then recreated its number can change.

> > You pointed to very interesting behaviour of the path based approach,
> > which bothers me quite for a while:
> > since cache coherency messages have own round-trip time, there is always
> > a window when one client does not know that another one updated object
> > or removed it and created new one with the same name.
>
> Not if the server waits for the cache invalidation to be acked before
> applying the update. That is, treat the client's cached copy as a lease
> or read lock. I believe this is how NFSv4 delegations behave, and it's
> how Ceph metadata leases (dentries, inode contents) and file access
> capabilities (which control sync vs async file access) behave. I'm not
> all that familiar with samba, but my guess is that its leases are broken
> synchronously as well.

That's why I still did not implement locking in POHMELFS - I do not want
to drop to sync case for essentially all operations, which will end up
broadcasting cache coherency messages. But this may be unavoidable case,
so I will have to implement it that way.

NFS-like delegation is really the simplest and not interesting case,
since it drops parallelism for multiple clients accessing the same data,
but 'creates' it for clients who do access to different datasets.

> > It is trivially possible to extend path cache with storing remote ids,
> > so that attempt to access old object would not harm new one with the
> > same name, but I want to think about it some more.
>
> That's half of it... ideally, though, the client would have a reference to
> the real object as well, so that the original foo.txt would be removed.
> I.e. not only avoid doing the wrong thing, but also do the right thing.
>
> I have yet to come up with a satisfying solution there. Doing a d_drop on
> dentry lease revocation gets me most of the way there (Ceph's path
> generation could stop when it hits an unhashed dentry and make the request
> path relative to an inode), but the problem I'm coming up against is that
> there is no explicit communication of the CWD between the VFS and fs
> (well, that I know of), so the client doesn't know when it needs a real
> reference to the directory (and I'm not especially keen on taking
> references for _all_ cached directory inodes). And I'm not really sure
> how .. is supposed to behave in that context.

Well, the same code was in previous POHMELFS releases and I dropped it.
I'm not sure yet what is exact requirements for locking and cache
coherency expected from such kind of distributed filesystem, so there is
no yet locking.

There will always be some kind of tradeoffs between parallel access and
caching, so drawing that line closer or far from what we have in local
filesystem will anyway have some drawbacks.

--
	Evgeniy Polyakov
--
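
The hard-link concern raised in this exchange can be made concrete. Tools that preserve hard links typically recognize them by finding paths that share the same device and inode number pair, so an st_ino that changes when an inode is evicted and later re-created could make two links to one file look unrelated. The following is a minimal Python sketch, assuming a local filesystem that supports hard links; group_hard_links is a hypothetical helper written for this illustration, not code taken from cp or rsync.

```python
import os
import tempfile


def group_hard_links(paths):
    """Return groups of paths that share one (st_dev, st_ino) pair."""
    groups = {}
    for path in paths:
        st = os.lstat(path)
        # Same device and same inode number means the same underlying file.
        groups.setdefault((st.st_dev, st.st_ino), []).append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "a.txt")
    alias = os.path.join(tmp, "b.txt")
    other = os.path.join(tmp, "c.txt")

    with open(original, "w") as f:
        f.write("data")
    os.link(original, alias)      # second name for the same inode
    with open(other, "w") as f:
        f.write("data")           # same contents, but a different inode

    linked = group_hard_links([original, alias, other])
    link_names = [[os.path.basename(p) for p in group] for group in linked]
    print(link_names)
```

The grouping is only as reliable as the inode numbers reported by stat(). If a filesystem hands out a fresh local number after an inode is re-created, a long-running copy that stat()s the two names at different times may not see them as links to one file.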

.copy !

Nicolas (not verified)
on
June 17, 2008 - 12:39am

When could a .copy operation be added to the VFS?

On NFS, a single "cp" command can take a very long time compared to "rsh servernfs cp", if the r-tools are available.

It would be much more efficient to handle a copy at the filesystem level to avoid network traffic, and you could also use copy-on-write links to speed up copies within the same volume.

POHMELFS

Anonymous (not verified)
on
June 17, 2008 - 4:17am

Parallel Optimized Host Message Exchange Layered File System (POHMELFS)

How much of that good stuff do you have to take to come up with such a name for a file system (or HMOTGSDYHTTTCUWSANFAFS for short)?

Author is Russian, as you

Elwin (not verified)
on
June 17, 2008 - 7:11am

Author is Russian, as you can see.

"Pohmel" means "drink after hangover" in Russian ("hangover" being "pohmelje").

pohmelo

Anonymous (not verified)
on
June 18, 2008 - 5:19am

Aha, that is why it sounded familiar. We Finns have borrowed the word as "pohmelo". Might make it harder to "sell" that FS to the more conservative IT bosses here :-) Sounds like "hangoverfs".

OMGWTFLOLBBQFS

Anonymous (not verified)
on
June 17, 2008 - 8:27am

OMGWTFLOLBBQ

Best. Ever.

OMGOMFGWTFSTFU

Anonymous (not verified)
on
June 18, 2008 - 6:18am

OMGOMFGWTFSTFU
