diff mbox

pnfs: Kick a pnfs_layoutcommit_inode on recall

Message ID 53FC9545.4000800@plexistor.com (mailing list archive)
State New, archived
Headers show

Commit Message

Boaz Harrosh Aug. 26, 2014, 2:10 p.m. UTC
From: Boaz Harrosh <boaz@plexistor.com>

This fixes a dead-lock in the pnfs recall processing

pnfs_layoutcommit_inode() is called through update_inode()
called from VFS. By setting set_inode_dirty during
pnfs write IO.

But the VFS will not schedule another update_inode()
If it is already inside an update_inode() or an sb-writeback

As part of writeback pnfs code might get stuck in LAYOUT_GET
with the server returning ERR_RECALL_CONFLICT because some
operation has caused the server to RECALL all layouts, including
those from our client.

So the RECALL is received, but our client is returning ERR_DELAY
because its write-segments need a LAYOUT_COMMIT, but
pnfs_layoutcommit_inode will never come because it is scheduled
behind the LAYOUT_GET which is stuck waiting for the recall to
finish

Hence the deadlock, client is stuck polling LAYOUT_GET receiving
ERR_RECALL_CONFLICT. Server is stuck polling RECALL receiving
ERR_DELAY.

With pnfs-objects the above condition can easily happen, when
a file grows beyond a group of devices. The pnfs-objects-server
will RECALL all layouts because the file-objects-map will
change and all old layouts will have stale attributes, therefor
the RECALL is initiated as part of a LAYOUT_GET, and this can
be triggered from within a single client operation.

A simple solution is to kick out a pnfs_layoutcommit_inode()
from within the recall, to free any need-to-commit segments
and let the client return success on the RECALL, so streaming
can continue.

This patch Is based on 3.17-rc1. It is completely UNTESTED.
I have tested a version of this patch at around the 3.12 Kernel
at which point the deadlock was resolved but I hit some race
conditions on pnfs state management farther on, so the actual
overall processing was not fixed. But hopefully these were fixed
by Trond and Christoph, and it should work better now.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/nfs/callback_proc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Trond Myklebust Aug. 26, 2014, 2:26 p.m. UTC | #1
On Tue, Aug 26, 2014 at 10:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
>
> This fixes a dead-lock in the pnfs recall processing
>
> pnfs_layoutcommit_inode() is called through update_inode()
> called from VFS. By setting set_inode_dirty during
> pnfs write IO.
>
> But the VFS will not schedule another update_inode()
> If it is already inside an update_inode() or an sb-writeback
>
> As part of writeback pnfs code might get stuck in LAYOUT_GET
> with the server returning ERR_RECALL_CONFLICT because some
> operation has caused the server to RECALL all layouts, including
> those from our client.
>
> So the RECALL is received, but our client is returning ERR_DELAY
> because its write-segments need a LAYOUT_COMMIT, but
> pnfs_layoutcommit_inode will never come because it is scheduled
> behind the LAYOUT_GET which is stuck waiting for the recall to
> finish
>
> Hence the deadlock, client is stuck polling LAYOUT_GET receiving
> ERR_RECALL_CONFLICT. Server is stuck polling RECALL receiving
> ERR_DELAY.
>
> With pnfs-objects the above condition can easily happen, when
> a file grows beyond a group of devices. The pnfs-objects-server
> will RECALL all layouts because the file-objects-map will
> change and all old layouts will have stale attributes, therefor
> the RECALL is initiated as part of a LAYOUT_GET, and this can
> be triggered from within a single client operation.
>
> A simple solution is to kick out a pnfs_layoutcommit_inode()
> from within the recall, to free any need-to-commit segments
> and let the client return success on the RECALL, so streaming
> can continue.
>
> This patch Is based on 3.17-rc1. It is completely UNTESTED.
> I have tested a version of this patch at around the 3.12 Kernel
> at which point the deadlock was resolved but I hit some race
> conditions on pnfs state management farther on, so the actual
> overall processing was not fixed. But hopefully these were fixed
> by Trond and Christoph, and it should work better now.
>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/nfs/callback_proc.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
> index 41db525..8660f96 100644
> --- a/fs/nfs/callback_proc.c
> +++ b/fs/nfs/callback_proc.c
> @@ -171,6 +171,14 @@ static u32 initiate_file_draining(struct nfs_client *clp,
>                 goto out;
>
>         ino = lo->plh_inode;
> +
> +       spin_lock(&ino->i_lock);
> +       pnfs_set_layout_stateid(lo, &args->cbl_stateid, true);
> +       spin_unlock(&ino->i_lock);
> +
> +       /* kick out any segs held by need to commit */
> +       pnfs_layoutcommit_inode(ino, true);

Making this call synchronous could deadlock the entire back channel.
Is there any reason why it can't just be made asynchonous?

> +
>         spin_lock(&ino->i_lock);
>         if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
>             pnfs_mark_matching_lsegs_invalid(lo, &free_me_list,
> @@ -178,7 +186,6 @@ static u32 initiate_file_draining(struct nfs_client *clp,
>                 rv = NFS4ERR_DELAY;
>         else
>                 rv = NFS4ERR_NOMATCHING_LAYOUT;
> -       pnfs_set_layout_stateid(lo, &args->cbl_stateid, true);
>         spin_unlock(&ino->i_lock);
>         pnfs_free_lseg_list(&free_me_list);
>         pnfs_put_layout_hdr(lo);
> --
> 1.9.3
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Aug. 26, 2014, 2:37 p.m. UTC | #2
On 08/26/2014 05:26 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 10:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> +
>> +       /* kick out any segs held by need to commit */
>> +       pnfs_layoutcommit_inode(ino, true);
> 
> Making this call synchronous could deadlock the entire back channel.
> Is there any reason why it can't just be made asynchonous?
> 

We were just talking about that.

So the logic here is that we want to save round trips and make this
as efficient as possible with no extra round trips for the server
recall. A single RECALL => LAYOUT_COMMIT => LAYOUT_COMMIT_REPLAY
REACLL_REPLAY_NO_MATCHING.

Please explain the deadlock you foresee. The worst is that the
mal-behaving server will time-out and after a long time the
RECALL_REPLAY will return with ERR. But why do you say deadlock
how can this deadlock?

Otherwise Christoph's version of this patch does the asynchonous
way which will always cause another poll of the RECALL and more delays
for every RECALL operation, which I was trying to avoid.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Aug. 26, 2014, 2:52 p.m. UTC | #3
On 08/26/2014 05:37 PM, Boaz Harrosh wrote:
> On 08/26/2014 05:26 PM, Trond Myklebust wrote:
>> On Tue, Aug 26, 2014 at 10:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> +
>>> +       /* kick out any segs held by need to commit */
>>> +       pnfs_layoutcommit_inode(ino, true);
>>
>> Making this call synchronous could deadlock the entire back channel.
>> Is there any reason why it can't just be made asynchonous?
>>
> 
> We were just talking about that.
> 
> So the logic here is that we want to save round trips and make this
> as efficient as possible with no extra round trips for the server
> recall. A single RECALL => LAYOUT_COMMIT => LAYOUT_COMMIT_REPLAY
> REACLL_REPLAY_NO_MATCHING.
> 
> Please explain the deadlock you foresee. The worst is that the
> mal-behaving server will time-out and after a long time the
> RECALL_REPLAY will return with ERR. But why do you say deadlock
> how can this deadlock?
> 
> Otherwise Christoph's version of this patch does the asynchonous
> way which will always cause another poll of the RECALL and more delays
> for every RECALL operation, which I was trying to avoid.
> 

Also, for any client there is only a single server on it's backchannel

do you mean that ANY recall from any server can only come on a single
thread? I did not know that I thought it was from a pool of RPC threads.

So are you saying that waiting on this server operation will cause
any other server's recall to some other mounts stall until the
operation is complete for this server?

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 26, 2014, 2:55 p.m. UTC | #4
On Tue, Aug 26, 2014 at 10:37 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/26/2014 05:26 PM, Trond Myklebust wrote:
>> On Tue, Aug 26, 2014 at 10:10 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> +
>>> +       /* kick out any segs held by need to commit */
>>> +       pnfs_layoutcommit_inode(ino, true);
>>
>> Making this call synchronous could deadlock the entire back channel.
>> Is there any reason why it can't just be made asynchonous?
>>
>
> We were just talking about that.
>
> So the logic here is that we want to save round trips and make this
> as efficient as possible with no extra round trips for the server
> recall. A single RECALL => LAYOUT_COMMIT => LAYOUT_COMMIT_REPLAY
> REACLL_REPLAY_NO_MATCHING.
>
> Please explain the deadlock you foresee. The worst is that the
> mal-behaving server will time-out and after a long time the
> RECALL_REPLAY will return with ERR. But why do you say deadlock
> how can this deadlock?
>
> Otherwise Christoph's version of this patch does the asynchonous
> way which will always cause another poll of the RECALL and more delays
> for every RECALL operation, which I was trying to avoid.
>
> Thanks
> Boaz
>

The above can deadlock if there are no session slots available to send
the layoutcommit, in which case the recall won't complete, and the
layoutget won't get a reply (which would free up the slot).
Boaz Harrosh Aug. 26, 2014, 3:02 p.m. UTC | #5
On 08/26/2014 05:55 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 10:37 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> 
> The above can deadlock if there are no session slots available to send
> the layoutcommit, in which case the recall won't complete, and the
> layoutget won't get a reply (which would free up the slot).
> 

What? the back-channel and the fore-channel do not use the same
slots. these are two different slots, No?

Matt, Adam you need to chip in here.

If it is as you say, then yes it must be as Christoph wrote it.

And the Ganesha server must be fixed because it has a slot system per
channel.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matt W. Benjamin Aug. 26, 2014, 3:24 p.m. UTC | #6
IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?

Matt

----- "Boaz Harrosh" <boaz@plexistor.com> wrote:

> On 08/26/2014 05:55 PM, Trond Myklebust wrote:
> > On Tue, Aug 26, 2014 at 10:37 AM, Boaz Harrosh <boaz@plexistor.com>
> wrote:
> > 
> > The above can deadlock if there are no session slots available to
> send
> > the layoutcommit, in which case the recall won't complete, and the
> > layoutget won't get a reply (which would free up the slot).
> > 
> 
> What? the back-channel and the fore-channel do not use the same
> slots. these are two different slots, No?
> 
> Matt, Adam you need to chip in here.
> 
> If it is as you say, then yes it must be as Christoph wrote it.
> 
> And the Ganesha server must be fixed because it has a slot system per
> channel.
> 
> Thanks
> Boaz
Trond Myklebust Aug. 26, 2014, 3:36 p.m. UTC | #7
On Tue, Aug 26, 2014 at 11:24 AM, Matt W. Benjamin <matt@linuxbox.com> wrote:
> IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?

Yep. layoutcommit will be sent on the fore channel, which is why it
can deadlock with the initial layoutget (or whatever operation that
triggered the layout recall).


> Matt
>
> ----- "Boaz Harrosh" <boaz@plexistor.com> wrote:
>
>> On 08/26/2014 05:55 PM, Trond Myklebust wrote:
>> > On Tue, Aug 26, 2014 at 10:37 AM, Boaz Harrosh <boaz@plexistor.com>
>> wrote:
>> >
>> > The above can deadlock if there are no session slots available to
>> send
>> > the layoutcommit, in which case the recall won't complete, and the
>> > layoutget won't get a reply (which would free up the slot).
>> >
>>
>> What? the back-channel and the fore-channel do not use the same
>> slots. these are two different slots, No?
>>
>> Matt, Adam you need to chip in here.
>>
>> If it is as you say, then yes it must be as Christoph wrote it.
>>
>> And the Ganesha server must be fixed because it has a slot system per
>> channel.
>>
>> Thanks
>> Boaz
>
> --
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
>
> http://linuxbox.com
>
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309
Boaz Harrosh Aug. 26, 2014, 4:56 p.m. UTC | #8
On 08/26/2014 06:36 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 11:24 AM, Matt W. Benjamin <matt@linuxbox.com> wrote:
>> IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?


Matt no top post on a Linux mailing list ;-)

> Yep. layoutcommit will be sent on the fore channel, which is why it
> can deadlock with the initial layoutget (or whatever operation that
> triggered the layout recall).

Trond you said below:
> The above can deadlock if there are no session slots available to send
> the layoutcommit, in which case the recall won't complete, and the
> layoutget won't get a reply (which would free up the slot).

Why would the layoutget not-get-a-reply ?
This is how it goes with Both ganesha server and knfsd last I tested.

[1]
The LAYOUT_GET cause LAYOUT_RECALL case: (including the lo_commit)

client		Server                                     comments
~~~~~~          ~~~~~~                                     ~~~~~~~~
LAYOUT_GET ==> 	
           <==   LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
						<--------- fore-channel is free
           <==   RECALL
LAYOUT_COMMIT ==>
	    <==  LAYOUT_COMMIT_REPLAY
						<--------- fore-channel is free
RECALL_REPLY(NO_MATCHING) =>
						<--------- back-channel is free

Note that in this case the server is to send the RECALL only after
the error reply to LAYOUT_GET, specifically it is not aloud to get stuck
inside LAYOUT_GET and wait for the RECALL. (mandated by STD)

[2]
The LAYOUT_GET sent all the while a RECALL is on the wire:
client		Server                                     comments
~~~~~~          ~~~~~~                                     ~~~~~~~~
           <== RECALL
LAYOUT_GET ==> 	
           <== LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
						<--------- fore-channel is free
LAYOUT_COMMIT ==>
		LAYOUT_COMMIT_REPLAY
						<--------- fore-channel is free
RECALL_REPLY(NO_MATCHING) =>
						<--------- back-channel is free


[3]
Or the worst case that lo_commit needs to wait for the channel Similar
to [2] above:

client		Server                                     comments
~~~~~~          ~~~~~~                                     ~~~~~~~~
           <== RECALL
LAYOUT_GET ==> 	
initiate_lo_commit   ==>  slot is taken needs to wait

           <== LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
						<--------- fore-channel is free
LAYOUT_COMMIT ==>                               slot is now free lo_commit goes through
	   <== LAYOUT_COMMIT_REPLAY
						<--------- fore-channel is free
RECALL_REPLY(NO_MATCHING) =>
						<--------- back-channel is free

So the most important is that the server must not get stuck in lo_get
and since there is a slot for each channel the lo_commit can be sent
from within the recall.

What am I missing?

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 26, 2014, 4:59 p.m. UTC | #9
On Tue, Aug 26, 2014 at 12:56 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/26/2014 06:36 PM, Trond Myklebust wrote:
>> On Tue, Aug 26, 2014 at 11:24 AM, Matt W. Benjamin <matt@linuxbox.com> wrote:
>>> IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?
>
>
> Matt no top post on a Linux mailing list ;-)
>
>> Yep. layoutcommit will be sent on the fore channel, which is why it
>> can deadlock with the initial layoutget (or whatever operation that
>> triggered the layout recall).
>
> Trond you said below:
>> The above can deadlock if there are no session slots available to send
>> the layoutcommit, in which case the recall won't complete, and the
>> layoutget won't get a reply (which would free up the slot).
>
> Why would the layoutget not-get-a-reply ?
> This is how it goes with Both ganesha server and knfsd last I tested.
>
> [1]
> The LAYOUT_GET cause LAYOUT_RECALL case: (including the lo_commit)
>
> client          Server                                     comments
> ~~~~~~          ~~~~~~                                     ~~~~~~~~
> LAYOUT_GET ==>
>            <==   LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
>                                                 <--------- fore-channel is free
>            <==   RECALL
> LAYOUT_COMMIT ==>
>             <==  LAYOUT_COMMIT_REPLAY
>                                                 <--------- fore-channel is free

Beep! No free slots, so this hangs.

> RECALL_REPLY(NO_MATCHING) =>
>                                                 <--------- back-channel is free
>
> Note that in this case the server is to send the RECALL only after
> the error reply to LAYOUT_GET, specifically it is not aloud to get stuck
> inside LAYOUT_GET and wait for the RECALL. (mandated by STD)
>
> [2]
> The LAYOUT_GET sent all the while a RECALL is on the wire:
> client          Server                                     comments
> ~~~~~~          ~~~~~~                                     ~~~~~~~~
>            <== RECALL
> LAYOUT_GET ==>
>            <== LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
>                                                 <--------- fore-channel is free
> LAYOUT_COMMIT ==>
>                 LAYOUT_COMMIT_REPLAY
>                                                 <--------- fore-channel is free
> RECALL_REPLY(NO_MATCHING) =>
>                                                 <--------- back-channel is free
>
>
> [3]
> Or the worst case that lo_commit needs to wait for the channel Similar
> to [2] above:
>
> client          Server                                     comments
> ~~~~~~          ~~~~~~                                     ~~~~~~~~
>            <== RECALL
> LAYOUT_GET ==>
> initiate_lo_commit   ==>  slot is taken needs to wait
>
>            <== LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
>                                                 <--------- fore-channel is free
> LAYOUT_COMMIT ==>                               slot is now free lo_commit goes through
>            <== LAYOUT_COMMIT_REPLAY
>                                                 <--------- fore-channel is free
> RECALL_REPLY(NO_MATCHING) =>
>                                                 <--------- back-channel is free
>
> So the most important is that the server must not get stuck in lo_get
> and since there is a slot for each channel the lo_commit can be sent
> from within the recall.
>
> What am I missing?
>
> Thanks
> Boaz
>
Boaz Harrosh Aug. 26, 2014, 5:06 p.m. UTC | #10
On 08/26/2014 07:59 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 12:56 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 08/26/2014 06:36 PM, Trond Myklebust wrote:
>>> On Tue, Aug 26, 2014 at 11:24 AM, Matt W. Benjamin <matt@linuxbox.com> wrote:
>>>> IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?
>>
>>
>> Matt no top post on a Linux mailing list ;-)
>>
>>> Yep. layoutcommit will be sent on the fore channel, which is why it
>>> can deadlock with the initial layoutget (or whatever operation that
>>> triggered the layout recall).
>>
>> Trond you said below:
>>> The above can deadlock if there are no session slots available to send
>>> the layoutcommit, in which case the recall won't complete, and the
>>> layoutget won't get a reply (which would free up the slot).
>>
>> Why would the layoutget not-get-a-reply ?
>> This is how it goes with Both ganesha server and knfsd last I tested.
>>
>> [1]
>> The LAYOUT_GET cause LAYOUT_RECALL case: (including the lo_commit)
>>
>> client          Server                                     comments
>> ~~~~~~          ~~~~~~                                     ~~~~~~~~
>> LAYOUT_GET ==>
>>            <==   LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
>>                                                 <--------- fore-channel is free
>>            <==   RECALL
>> LAYOUT_COMMIT ==>
>>             <==  LAYOUT_COMMIT_REPLAY
>>                                                 <--------- fore-channel is free
> 
> Beep! No free slots, so this hangs.
> 

Beep! does not do a very good of a job to explain. Sorry

What do you mean? which slot? which channel? Just above your text it says
"fore-channel is free" so are you saying it is not free? why not.
Please use more then one line of text to explain. It might be clear to
you but not to me.

>> RECALL_REPLY(NO_MATCHING) =>
>>                                                 <--------- back-channel is free

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 26, 2014, 5:54 p.m. UTC | #11
On Tue, Aug 26, 2014 at 1:06 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/26/2014 07:59 PM, Trond Myklebust wrote:
>> On Tue, Aug 26, 2014 at 12:56 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> On 08/26/2014 06:36 PM, Trond Myklebust wrote:
>>>> On Tue, Aug 26, 2014 at 11:24 AM, Matt W. Benjamin <matt@linuxbox.com> wrote:
>>>>> IIUC, the problem is the forechannel slot count, since the call you want to make synchronously is on the forechannel?
>>>
>>>
>>> Matt no top post on a Linux mailing list ;-)
>>>
>>>> Yep. layoutcommit will be sent on the fore channel, which is why it
>>>> can deadlock with the initial layoutget (or whatever operation that
>>>> triggered the layout recall).
>>>
>>> Trond you said below:
>>>> The above can deadlock if there are no session slots available to send
>>>> the layoutcommit, in which case the recall won't complete, and the
>>>> layoutget won't get a reply (which would free up the slot).
>>>
>>> Why would the layoutget not-get-a-reply ?
>>> This is how it goes with Both ganesha server and knfsd last I tested.
>>>
>>> [1]
>>> The LAYOUT_GET cause LAYOUT_RECALL case: (including the lo_commit)
>>>
>>> client          Server                                     comments
>>> ~~~~~~          ~~~~~~                                     ~~~~~~~~
>>> LAYOUT_GET ==>
>>>            <==   LAYOUT_GET_REPLAY(ERR_RECALL_CONFLICT)
>>>                                                 <--------- fore-channel is free
>>>            <==   RECALL
>>> LAYOUT_COMMIT ==>
>>>             <==  LAYOUT_COMMIT_REPLAY
>>>                                                 <--------- fore-channel is free
>>
>> Beep! No free slots, so this hangs.
>>
>
> Beep! does not do a very good of a job to explain. Sorry
>
> What do you mean? which slot? which channel? Just above your text it says
> "fore-channel is free" so are you saying it is not free? why not.
> Please use more then one line of text to explain. It might be clear to
> you but not to me.

The deadlock occurs _if_ the above layout commit  is unable to get a
slot. You can't guarantee that it will, because the slot table is a
finite resource and it can be exhausted if you allow fore channel
calls to trigger synchronous recalls on the back channel that again
trigger synchronous calls on the fore channel. You're basically saying
that the client needs to guarantee that it can allocate 2 slots before
it is allowed to send a layoutget just in case the server needs to
recall a layout.

If, OTOH, the layoutcommit is asynchronous, then there is no
serialisation and the back channel thread can happily reply to the
layout recall even if there are no free slots in the fore channel.

>>> RECALL_REPLY(NO_MATCHING) =>
>>>                                                 <--------- back-channel is free
>
> Thanks
> Boaz
>
Boaz Harrosh Aug. 26, 2014, 6:19 p.m. UTC | #12
On 08/26/2014 08:54 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 1:06 PM, Boaz Harrosh <boaz@plexistor.com> wrote:

> 
> The deadlock occurs _if_ the above layout commit  is unable to get a
> slot. You can't guarantee that it will, because the slot table is a
> finite resource and it can be exhausted 

Yes all I ever seen is 1 slot in any of the clients/servers I've
seen so I assume 1 slot ever

> if you allow fore channel
> calls to trigger synchronous recalls on the back channel 

Beep! but this is exactly what I'm trying to say. The STD specifically
forbids that. The server is not allowed to wait here, it must return
imitatively, with an error that frees the slot and then later issue the
RECALL.

This is what I said exactly three times in my mail, and what I have
depicted in my flow:
	Server async operation (mandated by the STD)
	Client back-channel can be sync with for channel (Not mentioned by the STD)

> that again trigger synchronous calls on the fore channel. 


> You're basically saying
> that the client needs to guarantee that it can allocate 2 slots before
> it is allowed to send a layoutget just in case the server needs to
> recall a layout.
> 

No I am not saying that, please count. Since the Server is not allowed
sync operation then one slot is enough and the client can do sync lo_commit
while in recall.

> If, OTOH, the layoutcommit is asynchronous, then there is no
> serialisation and the back channel thread can happily reply to the
> layout recall even if there are no free slots in the fore channel.
> 

Sure that will work as well, but not optimally, and for no good reason.

Please go back to my flow with the 3 cases. See how the server never waits
for anything and will always imitatively reply to the layout_get.
Since the server is not allowed a sync operation and is mandated by the
RFC text to not wait, then the client is allowed and can do sync operations
because it is enough that only one do async.

BTW: If what you are saying is true than there is a bug in the slot code
because this patch does work, and everything flows past this situation.
I have a reproducer test that fails 100% of the time without this patch
and only fails much later at some other place, but not at this deadlock,
with this patch applied.

Cheers
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Aug. 26, 2014, 6:34 p.m. UTC | #13
On 08/26/2014 09:19 PM, Boaz Harrosh wrote:
<>
> Beep! but this is exactly what I'm trying to say. The STD specifically
> forbids that. The server is not allowed to wait here, it must return
> imitatively, with an error that frees the slot and then later issue the
> RECALL.
> 
> This is what I said exactly three times in my mail, and what I have
> depicted in my flow:
> 	Server async operation (mandated by the STD)
> 	Client back-channel can be sync with for channel (Not mentioned by the STD)
> 


BTW Both Ganesha and kpnfsd behave the same. A recall issued while in layout_get
the layout-get reply will be sent first and then the recall will be sent on the wire.

If the recall was sent before the receive of the layout_get then an error/success is
returned imitatively without ever waiting for recall_reply to return.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 26, 2014, 6:41 p.m. UTC | #14
On Tue, Aug 26, 2014 at 2:19 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/26/2014 08:54 PM, Trond Myklebust wrote:
>> On Tue, Aug 26, 2014 at 1:06 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
>>
>> The deadlock occurs _if_ the above layout commit  is unable to get a
>> slot. You can't guarantee that it will, because the slot table is a
>> finite resource and it can be exhausted
>
> Yes all I ever seen is 1 slot in any of the clients/servers I've
> seen so I assume 1 slot ever
>
>> if you allow fore channel
>> calls to trigger synchronous recalls on the back channel
>
> Beep! but this is exactly what I'm trying to say. The STD specifically
> forbids that. The server is not allowed to wait here, it must return
> imitatively, with an error that frees the slot and then later issue the
> RECALL.
>
> This is what I said exactly three times in my mail, and what I have
> depicted in my flow:
>         Server async operation (mandated by the STD)
>         Client back-channel can be sync with for channel (Not mentioned by the STD)
>
>> that again trigger synchronous calls on the fore channel.
>
>
>> You're basically saying
>> that the client needs to guarantee that it can allocate 2 slots before
>> it is allowed to send a layoutget just in case the server needs to
>> recall a layout.
>>
>
> No I am not saying that, please count. Since the Server is not allowed
> sync operation then one slot is enough and the client can do sync lo_commit
> while in recall.
>
>> If, OTOH, the layoutcommit is asynchronous, then there is no
>> serialisation and the back channel thread can happily reply to the
>> layout recall even if there are no free slots in the fore channel.
>>
>
> Sure that will work as well, but not optimally, and for no good reason.
>
> Please go back to my flow with the 3 cases. See how the server never waits
> for anything and will always imitatively reply to the layout_get.
> Since the server is not allowed a sync operation and is mandated by the
> RFC text to not wait, then the client is allowed and can do sync operations
> because it is enough that only one do async.
>
> BTW: If what you are saying is true than there is a bug in the slot code
> because this patch does work, and everything flows past this situation.
> I have a reproducer test that fails 100% of the time without this patch
> and only fails much later at some other place, but not at this deadlock,
> with this patch applied.
>
> Cheers
> Boaz
>

Whether or not your particular server allows it or not is irrelevant.
We're not coding the client to a particular implementation. None of
the other callbacks do synchronous RPC calls, and that's very
intentional.
Trond Myklebust Aug. 26, 2014, 7:46 p.m. UTC | #15
On Tue, Aug 26, 2014 at 2:41 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> On Tue, Aug 26, 2014 at 2:19 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 08/26/2014 08:54 PM, Trond Myklebust wrote:
>>> On Tue, Aug 26, 2014 at 1:06 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>>>
>>> The deadlock occurs _if_ the above layout commit  is unable to get a
>>> slot. You can't guarantee that it will, because the slot table is a
>>> finite resource and it can be exhausted
>>
>> Yes all I ever seen is 1 slot in any of the clients/servers I've
>> seen so I assume 1 slot ever
>>
>>> if you allow fore channel
>>> calls to trigger synchronous recalls on the back channel
>>
>> Beep! but this is exactly what I'm trying to say. The STD specifically
>> forbids that. The server is not allowed to wait here, it must return
>> imitatively, with an error that frees the slot and then later issue the
>> RECALL.
>>
>> This is what I said exactly three times in my mail, and what I have
>> depicted in my flow:
>>         Server async operation (mandated by the STD)
>>         Client back-channel can be sync with for channel (Not mentioned by the STD)
>>
>>> that again trigger synchronous calls on the fore channel.
>>
>>
>>> You're basically saying
>>> that the client needs to guarantee that it can allocate 2 slots before
>>> it is allowed to send a layoutget just in case the server needs to
>>> recall a layout.
>>>
>>
>> No I am not saying that, please count. Since the Server is not allowed
>> sync operation then one slot is enough and the client can do sync lo_commit
>> while in recall.
>>
>>> If, OTOH, the layoutcommit is asynchronous, then there is no
>>> serialisation and the back channel thread can happily reply to the
>>> layout recall even if there are no free slots in the fore channel.
>>>
>>
>> Sure that will work as well, but not optimally, and for no good reason.
>>
>> Please go back to my flow with the 3 cases. See how the server never waits
>> for anything and will always imitatively reply to the layout_get.
>> Since the server is not allowed a sync operation and is mandated by the
>> RFC text to not wait, then the client is allowed and can do sync operations
>> because it is enough that only one do async.
>>
>> BTW: If what you are saying is true than there is a bug in the slot code
>> because this patch does work, and everything flows past this situation.
>> I have a reproducer test that fails 100% of the time without this patch
>> and only fails much later at some other place, but not at this deadlock,
>> with this patch applied.
>>
>> Cheers
>> Boaz
>>
>
> Whether or not your particular server allows it or not is irrelevant.
> We're not coding the client to a particular implementation. None of
> the other callbacks do synchronous RPC calls, and that's very
> intentional.
>

So to return to the original question: could we please change the
layoutcommit in your patch so that it is asynchronous?
Boaz Harrosh Aug. 27, 2014, 8:22 a.m. UTC | #16
On 08/26/2014 09:41 PM, Trond Myklebust wrote:
<> 
> Whether or not your particular server allows it or not is irrelevant.
> We're not coding the client to a particular implementation. 

God, No this is not up to any particular server implementation
it is stated clearly in the draft text. The Server is specifically
forbidden to wait in lo_get and device_info/device_list actually
any resource that is recallable.

This is the STD exactly for those reasons you stated.

> None of
> the other callbacks do synchronous RPC calls, and that's very
> intentional.
> 

Because none of the other callbacks have the protection of the STD
that forbids the server of synchronous operations, so the client
have to.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Aug. 27, 2014, 8:50 a.m. UTC | #17
On 08/26/2014 10:46 PM, Trond Myklebust wrote:
<>
> 
> So to return to the original question: could we please change the
> layoutcommit in your patch so that it is asynchronous?
> 

There is nothing change. The original message at the head of
this thread, is Christoph's patch with the asynchronous layoutcommit.

As a reply I posted what I had in my tree, with a sync operation.
Because I wanted the recall to take 500 micro-seconds instead of like
300-1000 mili-seconds.

When the STD supports my model specifically I do not see why not

Don't you guys have recalls on the hot path?

Sigh, another nail in the coffin of what was suppose to be a dam
good STD.

Thinking about it from day one you had a personal agenda against
RECALLS. For some reason, which you never told me, you decided to
crash and burn any proper use of LO_RECALLS. With that dreaded
*synchronous* polling driven, forgetful--model recall.
(One day we'll sit on a Bear and you will tell me why? I really
 want to know)

Now that there is the last ever chance to make an half dissent
citizen out of what was left of recalls, you are all on your
back feet, against it. With religious mantras, when you have
never sat and actually stared at the prints and experienced
recalls for yourself. Your words

Christoph I have a deja-vu its principles against good science
again. Lets revert to your patch.

Reviewed-by: Boaz Harrosh <bharrosh@panasas.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 41db525..8660f96 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -171,6 +171,14 @@  static u32 initiate_file_draining(struct nfs_client *clp,
 		goto out;
 
 	ino = lo->plh_inode;
+
+	spin_lock(&ino->i_lock);
+	pnfs_set_layout_stateid(lo, &args->cbl_stateid, true);
+	spin_unlock(&ino->i_lock);
+
+	/* kick out any segs held by need to commit */
+	pnfs_layoutcommit_inode(ino, true);
+
 	spin_lock(&ino->i_lock);
 	if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
 	    pnfs_mark_matching_lsegs_invalid(lo, &free_me_list,
@@ -178,7 +186,6 @@  static u32 initiate_file_draining(struct nfs_client *clp,
 		rv = NFS4ERR_DELAY;
 	else
 		rv = NFS4ERR_NOMATCHING_LAYOUT;
-	pnfs_set_layout_stateid(lo, &args->cbl_stateid, true);
 	spin_unlock(&ino->i_lock);
 	pnfs_free_lseg_list(&free_me_list);
 	pnfs_put_layout_hdr(lo);