From patchwork Sat Apr 22 16:26:26 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Olga Kornievskaia <aglo@umich.edu>
X-Patchwork-Id: 9694445
Return-Path: <linux-nfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	556A160383 for <patchwork-linux-nfs@patchwork.kernel.org>;
	Sat, 22 Apr 2017 16:26:33 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3D9FD285E7
	for <patchwork-linux-nfs@patchwork.kernel.org>;
	Sat, 22 Apr 2017 16:26:33 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 31B8428614; Sat, 22 Apr 2017 16:26:33 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.3 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM,
	T_DKIM_INVALID autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 67923285E7
	for <patchwork-linux-nfs@patchwork.kernel.org>;
	Sat, 22 Apr 2017 16:26:31 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1425662AbdDVQ03 (ORCPT
	<rfc822;patchwork-linux-nfs@patchwork.kernel.org>);
	Sat, 22 Apr 2017 12:26:29 -0400
Received: from mail-it0-f45.google.com ([209.85.214.45]:37125 "EHLO
	mail-it0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1425660AbdDVQ02 (ORCPT
	<rfc822; linux-nfs@vger.kernel.org>); Sat, 22 Apr 2017 12:26:28 -0400
Received: by mail-it0-f45.google.com with SMTP id x188so12802792itb.0
	for <linux-nfs@vger.kernel.org>; Sat, 22 Apr 2017 09:26:28 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gmail.com; s=20161025;
	h=mime-version:sender:in-reply-to:references:from:date:message-id
	:subject:to:cc:content-transfer-encoding;
	bh=Y9tLgFnzK9lhsMORTPR1HjJJbYGM0oaGMqWmxnIcuRg=;
	b=CdJqt1Uggd0/3SImB7ooDnX9rAa1TyMnzmGYQGe9c/AY5zWF4O8d6+oc6NXvlq1dvn
	dhQYvmYhMmHQOWo7fCZjhY4xKMKbr/yF48OfZSwaJ15/Q6RbnB5jfWojeAUJtp407XdE
	2A3kU13wKuZLCIKPNLFaayarkR2GTEECLHyr5VvlLWOuMkfbNzhz1AKUe+/RsxUI3DPu
	BhCf1jTXO13RfKJzpQrTIEynjW87h5+1psSUtqUnZHe+RzugxjX7qclczktBMNqrKBVQ
	DSBnn4ov1k9WkmM8/l2C+9QJN103139wIPrMtdMi2rOLKLOzuJrWNrugOLM54QOi13Ib
	jTGw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
	:date:message-id:subject:to:cc:content-transfer-encoding;
	bh=Y9tLgFnzK9lhsMORTPR1HjJJbYGM0oaGMqWmxnIcuRg=;
	b=rfQ/UPa6ak3kfk2RKjoIOuGdkKcgiCiVx+Hiw3PApWJ2GXLu39DfrKx4fEdTp7yRK4
	CIpXA3v7sNpCGmiiVBwyW29ok7NTIl9CKQUVd2M7NaLFSrqoGTDWeWAZejjyCS86fWXV
	Rm9DIUEAGlroQbhWH3Rc2YW65XQf3Z3Rf+8XThNuQ/Or5732mXvf6eTWoldPP59xfzgD
	ZK4BhDUZw7iqXWg37Rw8KeeB22P6kQu96tM0jE9jNIDKq435eKd+NzCS5K8cellHl4+b
	xLn5gEj6bftnew3odnZhrygadCWx62ETyRsRb0ZXncpKXZxjQKcMyblgZCGPqhXZnQpL
	OjeQ==
X-Gm-Message-State: AN3rC/4uoAzxgHuoDRz0StKKPog4cJ9GLhfVAnfgdtPpr3QJsnAlot2M
	2kD5Jo3nS2ZHrSLcaYUL36JeiGkfKw==
X-Received: by 10.36.2.205 with SMTP id 196mr4423534itu.63.1492878387846;
	Sat, 22 Apr 2017 09:26:27 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.107.46.170 with HTTP; Sat, 22 Apr 2017 09:26:26 -0700 (PDT)
In-Reply-To: 
 <CAN-5tyFcE9kgiV0atOQyDV5eqKupLfoG9ACYba=bsXmYzmXyLQ@mail.gmail.com>
References: 
 <CAN-5tyFQ2PiSHp41mOMHa=DSJ8SmXBo=Nk=v-k1hASPKRbbhzQ@mail.gmail.com>
	<AE6F91D9-2D4E-4AA0-A1F3-EB511E53F990@primarydata.com>
	<CAN-5tyF3HrY9+xWO7HDnN_dv13vS9nqcp0vZpA29T=i3cJMSZg@mail.gmail.com>
	<AE31C3DC-40B8-4BBE-A5A1-A62CB656268A@primarydata.com>
	<CAN-5tyHgJ9K7=RChVt=sLHQbjTCRj8MhEmMipreMKWzef4Wscw@mail.gmail.com>
	<B8C98E5E-F9F5-4468-8E52-A3A4A550C1E3@primarydata.com>
	<CAN-5tyGC1MAq54yqQJffrNZoiaUppojmfgsqfQYrNwzhyFhoHg@mail.gmail.com>
	<F5E207D8-B71F-45BB-8114-63FF0F43136B@primarydata.com>
	<CAN-5tyFKk4VEdEb+P7qxGYwc6nnFvjyzCX304Jb4MAGzaJWKXA@mail.gmail.com>
	<9343A1DB-5895-41F4-8A37-504AA710D696@primarydata.com>
	<CAN-5tyEcv_BF60jJtAFRv3q3u3Yrqzvdhe=XWkQ2z3zKHc5C7A@mail.gmail.com>
	<CAN-5tyEoMw79hMsihpuKG=auLzir82DPqUj0fA8EycnUfPSmnw@mail.gmail.com>
	<C81A4BCC-76A8-4732-AAC1-93EF379B3302@primarydata.com>
	<CAN-5tyFcE9kgiV0atOQyDV5eqKupLfoG9ACYba=bsXmYzmXyLQ@mail.gmail.com>
From: Olga Kornievskaia <aglo@umich.edu>
Date: Sat, 22 Apr 2017 12:26:26 -0400
X-Google-Sender-Auth: jCPLKFMsYLixVeRmXCkfMreKxe8
Message-ID: 
 <CAN-5tyFnJH6Y=+brO4x5sTvsXWStH+1k-diR9OLTgXfJZXz6Xw@mail.gmail.com>
Subject: Re: reuse of slot and seq# when RPC was interrupted
To: Trond Myklebust <trondmy@primarydata.com>
Cc: List Linux NFS Mailing <linux-nfs@vger.kernel.org>
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-nfs.vger.kernel.org>
X-Mailing-List: linux-nfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Hi Trond,

I'm still having issues with interrupted slots.

1. Client sent out an operation (X with cacheme=true). It's waiting on
the reply by got a ctrl-c.
2. Client sends out the operation which happens to be CLOSE. It uses
the same slot and sequence id as operation X.
3. Server replies back to CLOSE with ERR_DELAY
4. Server replies to operation X
5. Client does the delay and resends the CLOSE using the same slot and
sequence id as before.
6. Server reply with X (out of the replay cache).
7. Client decoding returns back EREMOTEIO and error handling doesn't
do anything about it.

Now the problem is open state is left on the server.

If the server were to not send an ERR_DELAY first but just reply with
X then the code handles that and retries (commit a865880e20).

From err_delay the operation is retried on the same slot but we
cleared that the slot was interrupted. So now we got EREMOTEIO but
(slot is not interrupted) and we don't retry. Perhaps a solution to
not clear "slot interrupted" if error is err_delay (since we are
retrying the same slot/seqid with that error).

My proposed fix is:


On Fri, Sep 23, 2016 at 4:38 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Fri, Sep 23, 2016 at 4:34 PM, Trond Myklebust
> <trondmy@primarydata.com> wrote:
>>
>>> On Sep 23, 2016, at 16:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>
>>> On Fri, Sep 23, 2016 at 4:07 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>> On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>
>>>>>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>>>>>>>
>>>>>>>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>>>>>>>
>>>>>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>>>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>>>>>>>> EIO”.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Why not?
>>>>>>>>>>
>>>>>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>>>>>> recovery (are you suggesting we should?)  and returns that to the
>>>>>>>>>> application.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>>>>>> sequence # when task->tk_status = 1 and saying that we should still
>>>>>>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>>>>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>>>>>> slot?
>>>>>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>>>>>> received the original request?
>>>>>>>>
>>>>>>>
>>>>>>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
>>>>>>
>>>>>> I'm not understand what you are suggestion. I do better with example
>>>>>> so allow me:
>>>>>>
>>>>>> REMOVE used slot 0 seq=00000036 received ctrl-c
>>>>>> nfs41_sequence_done() gets called task->tk_status = 1:
>>>>>> slot->interrupted is set to 1. slot is freed.
>>>>>>
>>>>>> next operation comes in, in my case it's ACCESS. initialization of the
>>>>>> sequence uses slot 0 seq=00000036
>>>>>> server replies with REMOVE
>>>>>>
>>>>>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>>>>>> returns EREMOTEIO. handle error just returns that error.
>>>>>>
>>>>>> where do we retry?
>>>>>>
>>>>>
>>>>> The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.
>>>>
>>>> Are you suggestion that REMOVE is retried? Ok I can see that (though
>>>> I'm not sure why a killed task suppose to be retried. Wasn't it killed
>>>> for a reason?). But if you are saying ACCESS should be retried then I
>>>> don't see how it can fit into the code flow.
>>>
>>> I'm still hung up on the fact your suggestion of "retry". There is no
>>> retry. You wrote "if we get a SEQ_MISORDERED" we never get
>>> "SEQ_MISORDERED".
>>>
>>> I can see if you want to add to error_handling that we check if error
>>> is EREMOTEIO and check that slot->interrupted is set, then we try?
>>>
>>
>> Yes, that’s what I’d hope to see.
>
> Ok I understand now and would try that.
---
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 0d3347e..a15979c 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -697,7 +697,8 @@ static int nfs41_sequence_process(struct rpc_task *task,
  session = slot->table->session;

  if (slot->interrupted) {
- slot->interrupted = 0;
+ if (task->tk_status != -NFS4ERR_DELAY)
+ slot->interrupted = 0;
  interrupted = true;
  }