nfsd: deal with DELEGRETURN racing with CB_RECALL

Message ID	1441037201-78787-1-git-send-email-aweits@rit.edu (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nfs-owner@kernel.org> From: Andrew Elble <aweits@rit.edu> To: linux-nfs@vger.kernel.org, etmsys@rit.edu Cc: Andrew Elble <aweits@rit.edu> Subject: [PATCH] nfsd: deal with DELEGRETURN racing with CB_RECALL Date: Mon, 31 Aug 2015 12:06:41 -0400 Message-Id: <1441037201-78787-1-git-send-email-aweits@rit.edu> Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk

Andrew W Elble Aug. 31, 2015, 4:06 p.m. UTC

We have observed the server sending recalls for delegation stateids
that have already been successfully returned. Change
nfsd4_cb_recall_done() to return success if the client has returned
the delegation. While this does not completely eliminate the sending
of recalls for delegations that have already been returned, this
does prevent unnecessarily declaring the callback path to be down.

Reported-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
---
 fs/nfsd/nfs4state.c | 3 +++
 1 file changed, 3 insertions(+)

Jeff Layton Sept. 1, 2015, 1:18 p.m. UTC | #1

On Mon, 31 Aug 2015 12:06:41 -0400
Andrew Elble <aweits@rit.edu> wrote:

> We have observed the server sending recalls for delegation stateids
> that have already been successfully returned. Change
> nfsd4_cb_recall_done() to return success if the client has returned
> the delegation. While this does not completely eliminate the sending
> of recalls for delegations that have already been returned, this
> does prevent unnecessarily declaring the callback path to be down.
> 
> Reported-by: Eric Meddaugh <etmsys@rit.edu>
> Signed-off-by: Andrew Elble <aweits@rit.edu>
> ---
>  fs/nfsd/nfs4state.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index de45c2de1668..ac235a7470e3 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3509,6 +3509,9 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
>  {
>  	struct nfs4_delegation *dp = cb_to_delegation(cb);
>  
> +	if (dp->dl_stid.sc_type == NFS4_CLOSED_DELEG_STID)
> +	        return 1;
> +
>  	switch (task->tk_status) {
>  	case 0:
>  		return 1;

Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
race since the former is driven by the server and the latter by the
client.

What error are you usually getting back from the client when you see
this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
this check to that error case?

OTOH, maybe there's no harm in just ignoring the error if the
delegation is already returned...

Acked-by: Jeff Layton <jlayton@poochiereds.net>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrew W Elble Sept. 1, 2015, 1:55 p.m. UTC | #2

> Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
> race since the former is driven by the server and the latter by the
> client.

I believe there are some client differences to account for as well. i.e. when
does the client "unhash" the delegation? Before the DELEGRETURN is sent,
or after the response comes back?

> What error are you usually getting back from the client when you see
> this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
> this check to that error case?

Usually EBADHANDLE. I actually had a version of this patch where it was
confined to those cases. My rationale in changing it was that continuing
to solicit the return of the delegation when we have it is pointless.

i.e. RFC5661 20.2.4:

"...The recall is not complete until the delegation is returned using a
 DELEGRETURN operation."

Thanks,

Andy

Andrew W Elble Sept. 1, 2015, 2:48 p.m. UTC | #3

I should probably state what I'm currently chasing:

1.) Somehow, delegations wind up on the cl_revoked list at the server.
2.) The server continually asserts SEQ4_STATUS_RECALLABLE_STATE_REVOKED.
3.) The client for some reason doesn't have anything to give the server
    to satisfy it.

I speculate that this patch will assist in making this go away - I'm
just not 100% sure of all the conditions that result in making #1 possible.
(Enough recalls backed up on the callback slot on the same filehandle
stacking timeouts/retries and resultant path down errors to exceed lease
time?). Unfortunately this problem seems to always occur at ~1AM - and
obtaining a network capture of the problem in action has been elusive.

Thanks,

Andy

J. Bruce Fields Sept. 1, 2015, 6:04 p.m. UTC | #4

On Tue, Sep 01, 2015 at 09:18:42AM -0400, Jeff Layton wrote:
> On Mon, 31 Aug 2015 12:06:41 -0400
> Andrew Elble <aweits@rit.edu> wrote:
> 
> > We have observed the server sending recalls for delegation stateids
> > that have already been successfully returned. Change
> > nfsd4_cb_recall_done() to return success if the client has returned
> > the delegation. While this does not completely eliminate the sending
> > of recalls for delegations that have already been returned, this
> > does prevent unnecessarily declaring the callback path to be down.
> > 
> > Reported-by: Eric Meddaugh <etmsys@rit.edu>
> > Signed-off-by: Andrew Elble <aweits@rit.edu>
> > ---
> >  fs/nfsd/nfs4state.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index de45c2de1668..ac235a7470e3 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -3509,6 +3509,9 @@ static int nfsd4_cb_recall_done(struct nfsd4_callback *cb,
> >  {
> >  	struct nfs4_delegation *dp = cb_to_delegation(cb);
> >  
> > +	if (dp->dl_stid.sc_type == NFS4_CLOSED_DELEG_STID)
> > +	        return 1;
> > +
> >  	switch (task->tk_status) {
> >  	case 0:
> >  		return 1;
> 
> Yeah, there will always be some potential for a CB_RECALL/DELEGRETURN
> race since the former is driven by the server and the latter by the
> client.
> 
> What error are you usually getting back from the client when you see
> this happen? NFS4ERR_BAD_STATEID? If so, then maybe we should confine
> this check to that error case?
> 
> OTOH, maybe there's no harm in just ignoring the error if the
> delegation is already returned...
> 
> Acked-by: Jeff Layton <jlayton@poochiereds.net>

Agreed, applying--thanks.--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

J. Bruce Fields Sept. 1, 2015, 6:08 p.m. UTC | #5

On Tue, Sep 01, 2015 at 10:48:27AM -0400, Andrew W Elble wrote:
> 
> I should probably state what I'm currently chasing:
> 
> 1.) Somehow, delegations wind up on the cl_revoked list at the server.
> 2.) The server continually asserts SEQ4_STATUS_RECALLABLE_STATE_REVOKED.
> 3.) The client for some reason doesn't have anything to give the server
>     to satisfy it.
> 
> I speculate that this patch will assist in making this go away - I'm
> just not 100% sure of all the conditions that result in making #1 possible.
> (Enough recalls backed up on the callback slot on the same filehandle
> stacking timeouts/retries and resultant path down errors to exceed lease
> time?).

Even in the worst case where the client never gets the recall, in theory
it should take the RECALLABLE_STATE_REVOKED as a sign that it's missed
one and return its delegations.  Is the client attempting any such
recovery when it sees that flag?

> Unfortunately this problem seems to always occur at ~1AM - and
> obtaining a network capture of the problem in action has been elusive.

Anyway, sounds interesting, let us know what you find....

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

nfsd: deal with DELEGRETURN racing with CB_RECALL

Commit Message

Comments

Patch