reuse of slot and seq# when RPC was interrupted
diff mbox

Message ID CAN-5tyFQ2PiSHp41mOMHa=DSJ8SmXBo=Nk=v-k1hASPKRbbhzQ@mail.gmail.com
State New
Headers show

Commit Message

Olga Kornievskaia Sept. 23, 2016, 5:40 p.m. UTC
Hi folks,

I'd like to raise an issue with regards to nfs41_sequence_done()
slot->interrupted case. There is a comment there saying the if the RPC
was interrupted then we don't know if the server has processed the
slot or not so mark the slot as interrupted. In that case the sequence
is not bumped. Then later there is logic that if we received
SEQ_MISORDERED and the slot was marked interrupted then bump the
sequence.

The problem comes when the sequence number is not increment the reply
is not necessarily a SEQ_MISORDERED. Instead, the reply is a "cached"
reply of the operation that was interrupted. That leads to the xdr
returning "Remote EIO" (unrecoverable in some cases).

If we bump the sequence number always then we should get the
SEQ_MISORDERED error from which we can recover.

A reproducer to see an operation reuse a seq# and getting cached reply
is as follows:
1. on the shell do "rm <file in nfs>"
2. at the nfs_proxy delay the reply from the server enough to send a
ctrl-c to the shell.
3. do  something else on nfs.

If we instead bump the sequence number in the case of interrupted and do:

  case -NFS4ERR_DELAY:

@@ -748,14 +749,6 @@ int nfs41_sequence_done(struct rpc_task *task,
struct nfs4_sequence_res *res)
  goto retry_nowait;
  case -NFS4ERR_SEQ_MISORDERED:
  /*
- * Was the last operation on this sequence interrupted?
- * If so, retry after bumping the sequence number.
- */
- if (interrupted) {
- ++slot->seq_nr;
- goto retry_nowait;
- }
- /*
  * Could this slot have been previously retired?
  * If so, then the server may be expecting seq_nr = 1!
  */

1. if the server received it, then we bump and next operation has correct number
2. if the server didn't received and we bump, then next operation
received SEQ_MISORDERED, it'll reset the slot/session?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Trond Myklebust Sept. 23, 2016, 5:45 p.m. UTC | #1
> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> If we instead bump the sequence number in the case of interrupted and do:

You have no guarantees that the server has seen and processed the operation.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Sept. 23, 2016, 5:59 p.m. UTC | #2
On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> If we instead bump the sequence number in the case of interrupted and do:
>
> You have no guarantees that the server has seen and processed the operation.

That is correct, i have tested the patch and made server never to
receive the operation and client have an interrupted slot. On the next
operation the server will complain back with SEQ_MISORDERED. Client
can recover from this operation. Client can not recover from "Remote
EIO".
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 23, 2016, 6:08 p.m. UTC | #3
> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:

> 

> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust

> <trondmy@primarydata.com> wrote:

>> 

>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:

>>> 

>>> If we instead bump the sequence number in the case of interrupted and do:

>> 

>> You have no guarantees that the server has seen and processed the operation.

> 

> That is correct, i have tested the patch and made server never to

> receive the operation and client have an interrupted slot. On the next

> operation the server will complain back with SEQ_MISORDERED. Client

> can recover from this operation. Client can not recover from "Remote

> EIO”.

> 


Why not?
Olga Kornievskaia Sept. 23, 2016, 6:25 p.m. UTC | #4
On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>>>
>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>
>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>
>>> You have no guarantees that the server has seen and processed the operation.
>>
>> That is correct, i have tested the patch and made server never to
>> receive the operation and client have an interrupted slot. On the next
>> operation the server will complain back with SEQ_MISORDERED. Client
>> can recover from this operation. Client can not recover from "Remote
>> EIO”.
>>
>
> Why not?

When XDR layer returns EREMOTEIO it's not handled by the NFS error
recovery (are you suggesting we should?)  and returns that to the
application.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 23, 2016, 6:34 p.m. UTC | #5
> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:

> 

> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust

> <trondmy@primarydata.com> wrote:

>> 

>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:

>>> 

>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust

>>> <trondmy@primarydata.com> wrote:

>>>> 

>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>> 

>>>>> If we instead bump the sequence number in the case of interrupted and do:

>>>> 

>>>> You have no guarantees that the server has seen and processed the operation.

>>> 

>>> That is correct, i have tested the patch and made server never to

>>> receive the operation and client have an interrupted slot. On the next

>>> operation the server will complain back with SEQ_MISORDERED. Client

>>> can recover from this operation. Client can not recover from "Remote

>>> EIO”.

>>> 

>> 

>> Why not?

> 

> When XDR layer returns EREMOTEIO it's not handled by the NFS error

> recovery (are you suggesting we should?)  and returns that to the

> application.

> 


I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
Olga Kornievskaia Sept. 23, 2016, 6:41 p.m. UTC | #6
On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>>>
>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>
>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>
>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>
>>>> That is correct, i have tested the patch and made server never to
>>>> receive the operation and client have an interrupted slot. On the next
>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>> can recover from this operation. Client can not recover from "Remote
>>>> EIO”.
>>>>
>>>
>>> Why not?
>>
>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>> recovery (are you suggesting we should?)  and returns that to the
>> application.
>>
>
> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>

I'm confused where your objection lies. Are you ok with bumping the
sequence # when task->tk_status = 1 and saying that we should still
keep the code that I deleted in the 2nd chunk of the patch that bumped
the seqid on getting SEQ_MISORDERED due to a previously interrupted
slot?
Wouldn't that create a difference of 2 slots for the server that has
received the original request?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 23, 2016, 7:07 p.m. UTC | #7
> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:

> 

> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust

> <trondmy@primarydata.com> wrote:

>> 

>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:

>>> 

>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust

>>> <trondmy@primarydata.com> wrote:

>>>> 

>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>> 

>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust

>>>>> <trondmy@primarydata.com> wrote:

>>>>>> 

>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>>>> 

>>>>>>> If we instead bump the sequence number in the case of interrupted and do:

>>>>>> 

>>>>>> You have no guarantees that the server has seen and processed the operation.

>>>>> 

>>>>> That is correct, i have tested the patch and made server never to

>>>>> receive the operation and client have an interrupted slot. On the next

>>>>> operation the server will complain back with SEQ_MISORDERED. Client

>>>>> can recover from this operation. Client can not recover from "Remote

>>>>> EIO”.

>>>>> 

>>>> 

>>>> Why not?

>>> 

>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error

>>> recovery (are you suggesting we should?)  and returns that to the

>>> application.

>>> 

>> 

>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.

>> 

> 

> I'm confused where your objection lies. Are you ok with bumping the

> sequence # when task->tk_status = 1 and saying that we should still

> keep the code that I deleted in the 2nd chunk of the patch that bumped

> the seqid on getting SEQ_MISORDERED due to a previously interrupted

> slot?

> Wouldn't that create a difference of 2 slots for the server that has

> received the original request?

> 


I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
Olga Kornievskaia Sept. 23, 2016, 7:27 p.m. UTC | #8
On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>>>
>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>
>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>
>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>
>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>
>>>>>> That is correct, i have tested the patch and made server never to
>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>> EIO”.
>>>>>>
>>>>>
>>>>> Why not?
>>>>
>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>> recovery (are you suggesting we should?)  and returns that to the
>>>> application.
>>>>
>>>
>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>
>>
>> I'm confused where your objection lies. Are you ok with bumping the
>> sequence # when task->tk_status = 1 and saying that we should still
>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>> slot?
>> Wouldn't that create a difference of 2 slots for the server that has
>> received the original request?
>>
>
> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.

I'm not understand what you are suggestion. I do better with example
so allow me:

REMOVE used slot 0 seq=00000036 received ctrl-c
nfs41_sequence_done() gets called task->tk_status = 1:
slot->interrupted is set to 1. slot is freed.

next operation comes in, in my case it's ACCESS. initialization of the
sequence uses slot 0 seq=00000036
server replies with REMOVE

client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
returns EREMOTEIO. handle error just returns that error.

where do we retry?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 23, 2016, 7:57 p.m. UTC | #9
> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@umich.edu> wrote:

> 

> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust

> <trondmy@primarydata.com> wrote:

>> 

>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:

>>> 

>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust

>>> <trondmy@primarydata.com> wrote:

>>>> 

>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>> 

>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust

>>>>> <trondmy@primarydata.com> wrote:

>>>>>> 

>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>>>> 

>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust

>>>>>>> <trondmy@primarydata.com> wrote:

>>>>>>>> 

>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:

>>>>>>>>> 

>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:

>>>>>>>> 

>>>>>>>> You have no guarantees that the server has seen and processed the operation.

>>>>>>> 

>>>>>>> That is correct, i have tested the patch and made server never to

>>>>>>> receive the operation and client have an interrupted slot. On the next

>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client

>>>>>>> can recover from this operation. Client can not recover from "Remote

>>>>>>> EIO”.

>>>>>>> 

>>>>>> 

>>>>>> Why not?

>>>>> 

>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error

>>>>> recovery (are you suggesting we should?)  and returns that to the

>>>>> application.

>>>>> 

>>>> 

>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.

>>>> 

>>> 

>>> I'm confused where your objection lies. Are you ok with bumping the

>>> sequence # when task->tk_status = 1 and saying that we should still

>>> keep the code that I deleted in the 2nd chunk of the patch that bumped

>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted

>>> slot?

>>> Wouldn't that create a difference of 2 slots for the server that has

>>> received the original request?

>>> 

>> 

>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.

> 

> I'm not understand what you are suggestion. I do better with example

> so allow me:

> 

> REMOVE used slot 0 seq=00000036 received ctrl-c

> nfs41_sequence_done() gets called task->tk_status = 1:

> slot->interrupted is set to 1. slot is freed.

> 

> next operation comes in, in my case it's ACCESS. initialization of the

> sequence uses slot 0 seq=00000036

> server replies with REMOVE

> 

> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()

> returns EREMOTEIO. handle error just returns that error.

> 

> where do we retry?

> 


The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.
Olga Kornievskaia Sept. 23, 2016, 8:07 p.m. UTC | #10
On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>>>
>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>
>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>>>
>>>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>>>
>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>>>> EIO”.
>>>>>>>>
>>>>>>>
>>>>>>> Why not?
>>>>>>
>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>> recovery (are you suggesting we should?)  and returns that to the
>>>>>> application.
>>>>>>
>>>>>
>>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>>>
>>>>
>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>> sequence # when task->tk_status = 1 and saying that we should still
>>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>> slot?
>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>> received the original request?
>>>>
>>>
>>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
>>
>> I'm not understand what you are suggestion. I do better with example
>> so allow me:
>>
>> REMOVE used slot 0 seq=00000036 received ctrl-c
>> nfs41_sequence_done() gets called task->tk_status = 1:
>> slot->interrupted is set to 1. slot is freed.
>>
>> next operation comes in, in my case it's ACCESS. initialization of the
>> sequence uses slot 0 seq=00000036
>> server replies with REMOVE
>>
>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>> returns EREMOTEIO. handle error just returns that error.
>>
>> where do we retry?
>>
>
> The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.

Are you suggestion that REMOVE is retried? Ok I can see that (though
I'm not sure why a killed task suppose to be retried. Wasn't it killed
for a reason?). But if you are saying ACCESS should be retried then I
don't see how it can fit into the code flow.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Sept. 23, 2016, 8:25 p.m. UTC | #11
On Fri, Sep 23, 2016 at 4:07 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
> <trondmy@primarydata.com> wrote:
>>
>>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>
>>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>>> <trondmy@primarydata.com> wrote:
>>>>
>>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>
>>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>>> <trondmy@primarydata.com> wrote:
>>>>>>
>>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>
>>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>>>>
>>>>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>>>>
>>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>>>>> EIO”.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why not?
>>>>>>>
>>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>>> recovery (are you suggesting we should?)  and returns that to the
>>>>>>> application.
>>>>>>>
>>>>>>
>>>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>>>>
>>>>>
>>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>>> sequence # when task->tk_status = 1 and saying that we should still
>>>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>>> slot?
>>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>>> received the original request?
>>>>>
>>>>
>>>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
>>>
>>> I'm not understand what you are suggestion. I do better with example
>>> so allow me:
>>>
>>> REMOVE used slot 0 seq=00000036 received ctrl-c
>>> nfs41_sequence_done() gets called task->tk_status = 1:
>>> slot->interrupted is set to 1. slot is freed.
>>>
>>> next operation comes in, in my case it's ACCESS. initialization of the
>>> sequence uses slot 0 seq=00000036
>>> server replies with REMOVE
>>>
>>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>>> returns EREMOTEIO. handle error just returns that error.
>>>
>>> where do we retry?
>>>
>>
>> The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.
>
> Are you suggestion that REMOVE is retried? Ok I can see that (though
> I'm not sure why a killed task suppose to be retried. Wasn't it killed
> for a reason?). But if you are saying ACCESS should be retried then I
> don't see how it can fit into the code flow.

I'm still hung up on the fact your suggestion of "retry". There is no
retry. You wrote "if we get a SEQ_MISORDERED" we never get
"SEQ_MISORDERED".

I can see if you want to add to error_handling that we check if error
is EREMOTEIO and check that slot->interrupted is set, then we try?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 23, 2016, 8:34 p.m. UTC | #12
DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTY6MjUsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCA0OjA3IFBNLCBP
bGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4gT24gRnJpLCBTZXAg
MjMsIDIwMTYgYXQgMzo1NyBQTSwgVHJvbmQgTXlrbGVidXN0DQo+PiA8dHJvbmRteUBwcmltYXJ5
ZGF0YS5jb20+IHdyb3RlOg0KPj4+IA0KPj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDE1OjI3LCBP
bGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+PiANCj4+Pj4gT24g
RnJpLCBTZXAgMjMsIDIwMTYgYXQgMzowNyBQTSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+IDx0cm9u
ZG15QHByaW1hcnlkYXRhLmNvbT4gd3JvdGU6DQo+Pj4+PiANCj4+Pj4+PiBPbiBTZXAgMjMsIDIw
MTYsIGF0IDE0OjQxLCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0K
Pj4+Pj4+IA0KPj4+Pj4+IE9uIEZyaSwgU2VwIDIzLCAyMDE2IGF0IDI6MzQgUE0sIFRyb25kIE15
a2xlYnVzdA0KPj4+Pj4+IDx0cm9uZG15QHByaW1hcnlkYXRhLmNvbT4gd3JvdGU6DQo+Pj4+Pj4+
IA0KPj4+Pj4+Pj4gT24gU2VwIDIzLCAyMDE2LCBhdCAxNDoyNSwgT2xnYSBLb3JuaWV2c2thaWEg
PGFnbG9AdW1pY2guZWR1PiB3cm90ZToNCj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4gT24gRnJpLCBTZXAg
MjMsIDIwMTYgYXQgMjowOCBQTSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+Pj4+PiA8dHJvbmRteUBw
cmltYXJ5ZGF0YS5jb20+IHdyb3RlOg0KPj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4+PiBPbiBTZXAgMjMs
IDIwMTYsIGF0IDEzOjU5LCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3Rl
Og0KPj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4gT24gRnJpLCBTZXAgMjMsIDIwMTYgYXQgMTo0NSBQ
TSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+Pj4+Pj4+IDx0cm9uZG15QHByaW1hcnlkYXRhLmNvbT4g
d3JvdGU6DQo+Pj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDEz
OjQwLCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+Pj4+Pj4+
Pj4+IA0KPj4+Pj4+Pj4+Pj4+IElmIHdlIGluc3RlYWQgYnVtcCB0aGUgc2VxdWVuY2UgbnVtYmVy
IGluIHRoZSBjYXNlIG9mIGludGVycnVwdGVkIGFuZCBkbzoNCj4+Pj4+Pj4+Pj4+IA0KPj4+Pj4+
Pj4+Pj4gWW91IGhhdmUgbm8gZ3VhcmFudGVlcyB0aGF0IHRoZSBzZXJ2ZXIgaGFzIHNlZW4gYW5k
IHByb2Nlc3NlZCB0aGUgb3BlcmF0aW9uLg0KPj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4gVGhhdCBp
cyBjb3JyZWN0LCBpIGhhdmUgdGVzdGVkIHRoZSBwYXRjaCBhbmQgbWFkZSBzZXJ2ZXIgbmV2ZXIg
dG8NCj4+Pj4+Pj4+Pj4gcmVjZWl2ZSB0aGUgb3BlcmF0aW9uIGFuZCBjbGllbnQgaGF2ZSBhbiBp
bnRlcnJ1cHRlZCBzbG90LiBPbiB0aGUgbmV4dA0KPj4+Pj4+Pj4+PiBvcGVyYXRpb24gdGhlIHNl
cnZlciB3aWxsIGNvbXBsYWluIGJhY2sgd2l0aCBTRVFfTUlTT1JERVJFRC4gQ2xpZW50DQo+Pj4+
Pj4+Pj4+IGNhbiByZWNvdmVyIGZyb20gdGhpcyBvcGVyYXRpb24uIENsaWVudCBjYW4gbm90IHJl
Y292ZXIgZnJvbSAiUmVtb3RlDQo+Pj4+Pj4+Pj4+IEVJT+KAnS4NCj4+Pj4+Pj4+Pj4gDQo+Pj4+
Pj4+Pj4gDQo+Pj4+Pj4+Pj4gV2h5IG5vdD8NCj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4gV2hlbiBYRFIg
bGF5ZXIgcmV0dXJucyBFUkVNT1RFSU8gaXQncyBub3QgaGFuZGxlZCBieSB0aGUgTkZTIGVycm9y
DQo+Pj4+Pj4+PiByZWNvdmVyeSAoYXJlIHlvdSBzdWdnZXN0aW5nIHdlIHNob3VsZD8pICBhbmQg
cmV0dXJucyB0aGF0IHRvIHRoZQ0KPj4+Pj4+Pj4gYXBwbGljYXRpb24uDQo+Pj4+Pj4+PiANCj4+
Pj4+Pj4gDQo+Pj4+Pj4+IEnigJltIHNheWluZyB0aGF0IGlmIHdlIGdldCBhIFNFUV9NSVNPUkRF
UkVEIGR1ZSB0byBhIHByZXZpb3VzIGludGVycnVwdCBvbiB0aGF0IHNsb3QsIHRoZW4gd2Ugc2hv
dWxkIGlnbm9yZSB0aGUgZXJyb3IgaW4gdGFzay0+dGtfc3RhdHVzLCBhbmQganVzdCByZXRyeSBh
ZnRlciBidW1waW5nIHRoZSBzbG90IHNlcWlkLg0KPj4+Pj4+PiANCj4+Pj4+PiANCj4+Pj4+PiBJ
J20gY29uZnVzZWQgd2hlcmUgeW91ciBvYmplY3Rpb24gbGllcy4gQXJlIHlvdSBvayB3aXRoIGJ1
bXBpbmcgdGhlDQo+Pj4+Pj4gc2VxdWVuY2UgIyB3aGVuIHRhc2stPnRrX3N0YXR1cyA9IDEgYW5k
IHNheWluZyB0aGF0IHdlIHNob3VsZCBzdGlsbA0KPj4+Pj4+IGtlZXAgdGhlIGNvZGUgdGhhdCBJ
IGRlbGV0ZWQgaW4gdGhlIDJuZCBjaHVuayBvZiB0aGUgcGF0Y2ggdGhhdCBidW1wZWQNCj4+Pj4+
PiB0aGUgc2VxaWQgb24gZ2V0dGluZyBTRVFfTUlTT1JERVJFRCBkdWUgdG8gYSBwcmV2aW91c2x5
IGludGVycnVwdGVkDQo+Pj4+Pj4gc2xvdD8NCj4+Pj4+PiBXb3VsZG4ndCB0aGF0IGNyZWF0ZSBh
IGRpZmZlcmVuY2Ugb2YgMiBzbG90cyBmb3IgdGhlIHNlcnZlciB0aGF0IGhhcw0KPj4+Pj4+IHJl
Y2VpdmVkIHRoZSBvcmlnaW5hbCByZXF1ZXN0Pw0KPj4+Pj4+IA0KPj4+Pj4gDQo+Pj4+PiBJ4oCZ
bSBzYXlpbmcgSeKAmWQgcHJlZmVyIHRvIGtlZXAgdGhlIGN1cnJlbnQgY29kZSwgYnV0IGZpeCB0
aGUgcmV0cnkgdGhhdCBpcyBhcHBhcmVudGx5IGJyb2tlbi4gSWYgd2XigJlyZSBub3QgaWdub3Jp
bmcgdGhlIHRhc2stPnRrX2Vycm9yIHdoZW4gd2UgZGVjaWRlIHRvIHJldHJ5LCB0aGVuIHRoYXTi
gJlzIGEgYnVnIGluIG15IG9waW5pb24uDQo+Pj4+IA0KPj4+PiBJJ20gbm90IHVuZGVyc3RhbmQg
d2hhdCB5b3UgYXJlIHN1Z2dlc3Rpb24uIEkgZG8gYmV0dGVyIHdpdGggZXhhbXBsZQ0KPj4+PiBz
byBhbGxvdyBtZToNCj4+Pj4gDQo+Pj4+IFJFTU9WRSB1c2VkIHNsb3QgMCBzZXE9MDAwMDAwMzYg
cmVjZWl2ZWQgY3RybC1jDQo+Pj4+IG5mczQxX3NlcXVlbmNlX2RvbmUoKSBnZXRzIGNhbGxlZCB0
YXNrLT50a19zdGF0dXMgPSAxOg0KPj4+PiBzbG90LT5pbnRlcnJ1cHRlZCBpcyBzZXQgdG8gMS4g
c2xvdCBpcyBmcmVlZC4NCj4+Pj4gDQo+Pj4+IG5leHQgb3BlcmF0aW9uIGNvbWVzIGluLCBpbiBt
eSBjYXNlIGl0J3MgQUNDRVNTLiBpbml0aWFsaXphdGlvbiBvZiB0aGUNCj4+Pj4gc2VxdWVuY2Ug
dXNlcyBzbG90IDAgc2VxPTAwMDAwMDM2DQo+Pj4+IHNlcnZlciByZXBsaWVzIHdpdGggUkVNT1ZF
DQo+Pj4+IA0KPj4+PiBjbGllbnQgY29kZSB4ZHIgaW4gZGVjb2RlX29wX2hycygpIHJldHVybnMg
RVJFTU9URUlPLiBkZWNvZGVfYWNjZXNzKCkNCj4+Pj4gcmV0dXJucyBFUkVNT1RFSU8uIGhhbmRs
ZSBlcnJvciBqdXN0IHJldHVybnMgdGhhdCBlcnJvci4NCj4+Pj4gDQo+Pj4+IHdoZXJlIGRvIHdl
IHJldHJ5Pw0KPj4+PiANCj4+PiANCj4+PiBUaGUgcmV0cnkgc2hvdWxkIGJlIGhhcHBlbmluZyB3
aGVuIHdlIGV4aXQgZnJvbSBuZnM0MV9zZXF1ZW5jZV9kb25lKCkgYnkgcmVzdGFydGluZyB0aGUg
UlBDLg0KPj4gDQo+PiBBcmUgeW91IHN1Z2dlc3Rpb24gdGhhdCBSRU1PVkUgaXMgcmV0cmllZD8g
T2sgSSBjYW4gc2VlIHRoYXQgKHRob3VnaA0KPj4gSSdtIG5vdCBzdXJlIHdoeSBhIGtpbGxlZCB0
YXNrIHN1cHBvc2UgdG8gYmUgcmV0cmllZC4gV2Fzbid0IGl0IGtpbGxlZA0KPj4gZm9yIGEgcmVh
c29uPykuIEJ1dCBpZiB5b3UgYXJlIHNheWluZyBBQ0NFU1Mgc2hvdWxkIGJlIHJldHJpZWQgdGhl
biBJDQo+PiBkb24ndCBzZWUgaG93IGl0IGNhbiBmaXQgaW50byB0aGUgY29kZSBmbG93Lg0KPiAN
Cj4gSSdtIHN0aWxsIGh1bmcgdXAgb24gdGhlIGZhY3QgeW91ciBzdWdnZXN0aW9uIG9mICJyZXRy
eSIuIFRoZXJlIGlzIG5vDQo+IHJldHJ5LiBZb3Ugd3JvdGUgImlmIHdlIGdldCBhIFNFUV9NSVNP
UkRFUkVEIiB3ZSBuZXZlciBnZXQNCj4gIlNFUV9NSVNPUkRFUkVEIi4NCj4gDQo+IEkgY2FuIHNl
ZSBpZiB5b3Ugd2FudCB0byBhZGQgdG8gZXJyb3JfaGFuZGxpbmcgdGhhdCB3ZSBjaGVjayBpZiBl
cnJvcg0KPiBpcyBFUkVNT1RFSU8gYW5kIGNoZWNrIHRoYXQgc2xvdC0+aW50ZXJydXB0ZWQgaXMg
c2V0LCB0aGVuIHdlIHRyeT8NCj4gDQoNClllcywgdGhhdOKAmXMgd2hhdCBJ4oCZZCBob3BlIHRv
IHNlZS4NCg0K

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Sept. 23, 2016, 8:38 p.m. UTC | #13
On Fri, Sep 23, 2016 at 4:34 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
>
>> On Sep 23, 2016, at 16:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>
>> On Fri, Sep 23, 2016 at 4:07 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
>>> <trondmy@primarydata.com> wrote:
>>>>
>>>>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>
>>>>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>>>>> <trondmy@primarydata.com> wrote:
>>>>>>
>>>>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>
>>>>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>>>>>>
>>>>>>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>>>>>>
>>>>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>>>>>>> EIO”.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why not?
>>>>>>>>>
>>>>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>>>>> recovery (are you suggesting we should?)  and returns that to the
>>>>>>>>> application.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>>>>>>
>>>>>>>
>>>>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>>>>> sequence # when task->tk_status = 1 and saying that we should still
>>>>>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>>>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>>>>> slot?
>>>>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>>>>> received the original request?
>>>>>>>
>>>>>>
>>>>>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
>>>>>
>>>>> I'm not understand what you are suggestion. I do better with example
>>>>> so allow me:
>>>>>
>>>>> REMOVE used slot 0 seq=00000036 received ctrl-c
>>>>> nfs41_sequence_done() gets called task->tk_status = 1:
>>>>> slot->interrupted is set to 1. slot is freed.
>>>>>
>>>>> next operation comes in, in my case it's ACCESS. initialization of the
>>>>> sequence uses slot 0 seq=00000036
>>>>> server replies with REMOVE
>>>>>
>>>>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>>>>> returns EREMOTEIO. handle error just returns that error.
>>>>>
>>>>> where do we retry?
>>>>>
>>>>
>>>> The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.
>>>
>>> Are you suggestion that REMOVE is retried? Ok I can see that (though
>>> I'm not sure why a killed task suppose to be retried. Wasn't it killed
>>> for a reason?). But if you are saying ACCESS should be retried then I
>>> don't see how it can fit into the code flow.
>>
>> I'm still hung up on the fact your suggestion of "retry". There is no
>> retry. You wrote "if we get a SEQ_MISORDERED" we never get
>> "SEQ_MISORDERED".
>>
>> I can see if you want to add to error_handling that we check if error
>> is EREMOTEIO and check that slot->interrupted is set, then we try?
>>
>
> Yes, that’s what I’d hope to see.

Ok I understand now and would try that.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a1a3b4c..b78dac5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -728,6 +728,7 @@  int nfs41_sequence_done(struct rpc_task *task,
struct nfs4_sequence_res *res)
  * operation..
  * Mark the slot as having hosted an interrupted RPC call.
  */
+ ++slot->seq_nr;
  slot->interrupted = 1;
  goto out;