[v2] nfsd: hold a lighter-weight client reference over CB_RECALL_ANY

Message ID 20240405-rhel-31513-v2-1-b0f6c10be929@kernel.org (mailing list archive)
State New

Commit Message

Jeffrey Layton April 5, 2024, 5:56 p.m. UTC
Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
client. While a callback job is technically an RPC, that counter is
really more for client-driven RPCs, and this has the effect of
preventing the client from being unhashed until the callback completes.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
can end up in a situation where the callback can't complete on the (now
dead) callback channel, but the new client can't connect because the old
client can't be unhashed. This usually manifests as an NFS4ERR_DELAY
return on the CREATE_SESSION operation.

The job is only holding a reference to the client so it can clear a flag
after the RPC completes. Fix this by having CB_RECALL_ANY instead
hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that
sort of reference when dealing with the nfsdfs info files, but it should
work appropriately here to ensure that the nfs4_client doesn't
disappear.

Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
Reported-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v2:
- Clean up the changelog
- Add Fixes: tag
- Use kref_get instead of kref_get_unless_zero
---
 fs/nfsd/nfs4state.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)


---
base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861
change-id: 20240405-rhel-31513-028ab6f14252

Best regards,

Comments

Chuck Lever III April 5, 2024, 6:07 p.m. UTC | #1
On Fri, Apr 05, 2024 at 01:56:18PM -0400, Jeff Layton wrote:
> Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
> client. While a callback job is technically an RPC that counter is
> really more for client-driven RPCs, and this has the effect of
> preventing the client from being unhashed until the callback completes.
> 
> If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
> can end up in a situation where the callback can't complete on the (now
> dead) callback channel, but the new client can't connect because the old
> client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
> return on the CREATE_SESSION operation.
> 
> The job is only holding a reference to the client so it can clear a flag
> in the after the RPC completes. Fix this by having CB_RECALL_ANY instead
> hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that
> sort of reference when dealing with the nfsdfs info files, but it should
> work appropriately here to ensure that the nfs4_client doesn't
> disappear.
> 
> Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
> Reported-by: Vladimir Benes <vbenes@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Applied to nfsd-fixes while waiting for review and testing. Thanks!


> ---
> Changes in v2:
> - Clean up the changelog
> - Add Fixes: tag
> - Use kref_get instead of kref_get_unless_zero
> ---
>  fs/nfsd/nfs4state.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 5fcd93f7cb8c..3cef81e196c6 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3042,12 +3042,9 @@ static void
>  nfsd4_cb_recall_any_release(struct nfsd4_callback *cb)
>  {
>  	struct nfs4_client *clp = cb->cb_clp;
> -	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
>  
> -	spin_lock(&nn->client_lock);
>  	clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
> -	put_client_renew_locked(clp);
> -	spin_unlock(&nn->client_lock);
> +	drop_client(clp);
>  }
>  
>  static int
> @@ -6616,7 +6613,7 @@ deleg_reaper(struct nfsd_net *nn)
>  		list_add(&clp->cl_ra_cblist, &cblist);
>  
>  		/* release in nfsd4_cb_recall_any_release */
> -		atomic_inc(&clp->cl_rpc_users);
> +		kref_get(&clp->cl_nfsdfs.cl_ref);
>  		set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
>  		clp->cl_ra_time = ktime_get_boottime_seconds();
>  	}
> 
> ---
> base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861
> change-id: 20240405-rhel-31513-028ab6f14252
> 
> Best regards,
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
>
vbenes@redhat.com April 5, 2024, 8:11 p.m. UTC | #2
On Fri, 2024-04-05 at 14:07 -0400, Chuck Lever wrote:
> On Fri, Apr 05, 2024 at 01:56:18PM -0400, Jeff Layton wrote:
> > Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to
> > the
> > client. While a callback job is technically an RPC that counter is
> > really more for client-driven RPCs, and this has the effect of
> > preventing the client from being unhashed until the callback
> > completes.
> > 
> > If nfsd decides to send a CB_RECALL_ANY just as the client reboots,
> > we
> > can end up in a situation where the callback can't complete on the
> > (now
> > dead) callback channel, but the new client can't connect because
> > the old
> > client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
> > return on the CREATE_SESSION operation.
> > 
> > The job is only holding a reference to the client so it can clear a
> > flag
> > in the after the RPC completes. Fix this by having CB_RECALL_ANY
> > instead
> > hold a reference to the cl_nfsdfs.cl_ref. Typically we only take
> > that
> > sort of reference when dealing with the nfsdfs info files, but it
> > should
> > work appropriately here to ensure that the nfs4_client doesn't
> > disappear.
> > 
> > > Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
> > Reported-by: Vladimir Benes <vbenes@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Applied to nfsd-fixes while waiting for review and testing. Thanks!
> 
> 
> > ---
> > Changes in v2:
> > - Clean up the changelog
> > - Add Fixes: tag
> > - Use kref_get instead of kref_get_unless_zero
> > ---
> >  fs/nfsd/nfs4state.c | 7 ++-----
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 5fcd93f7cb8c..3cef81e196c6 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -3042,12 +3042,9 @@ static void
> >  nfsd4_cb_recall_any_release(struct nfsd4_callback *cb)
> >  {
> >  	struct nfs4_client *clp = cb->cb_clp;
> > -	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> >  
> > -	spin_lock(&nn->client_lock);
> >  	clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
> > -	put_client_renew_locked(clp);
> > -	spin_unlock(&nn->client_lock);
> > +	drop_client(clp);
> >  }
> >  
> >  static int
> > @@ -6616,7 +6613,7 @@ deleg_reaper(struct nfsd_net *nn)
> >  		list_add(&clp->cl_ra_cblist, &cblist);
> >  
> >  		/* release in nfsd4_cb_recall_any_release */
> > -		atomic_inc(&clp->cl_rpc_users);
> > +		kref_get(&clp->cl_nfsdfs.cl_ref);
> > >  		set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
> >  		clp->cl_ra_time = ktime_get_boottime_seconds();
> >  	}
> > 
> > ---
> > base-commit: 05258a0a69b3c5d2c003f818702c0a52b6fea861
> > change-id: 20240405-rhel-31513-028ab6f14252
> > 
> > Best regards,
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> > 
> > 
> 
Hi, 
I've just finished the testing of the new patch on the same HW
configuration and the dracut test suite is stable again.

Thank you for your patches!
Vladimir Benes

Tested-by: Vladimir Benes <vbenes@redhat.com>
Cedric Blancher April 6, 2024, 6:07 a.m. UTC | #3
On Fri, 5 Apr 2024 at 20:07, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Fri, Apr 05, 2024 at 01:56:18PM -0400, Jeff Layton wrote:
> > Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
> > client. While a callback job is technically an RPC that counter is
> > really more for client-driven RPCs, and this has the effect of
> > preventing the client from being unhashed until the callback completes.
> >
> > If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
> > can end up in a situation where the callback can't complete on the (now
> > dead) callback channel, but the new client can't connect because the old
> > client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
> > return on the CREATE_SESSION operation.
> >
> > The job is only holding a reference to the client so it can clear a flag
> > in the after the RPC completes. Fix this by having CB_RECALL_ANY instead
> > hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that
> > sort of reference when dealing with the nfsdfs info files, but it should
> > work appropriately here to ensure that the nfs4_client doesn't
> > disappear.
> >
> > Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
> > Reported-by: Vladimir Benes <vbenes@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> Applied to nfsd-fixes while waiting for review and testing. Thanks!

Please add this to the 6.6 LTS branch, too.

Ced
Chuck Lever III April 9, 2024, 5:32 p.m. UTC | #4
On Tue, Apr 09, 2024 at 07:24:17PM +0200, Rik Theys wrote:
> Hi,
> 
> On 4/5/24 20:07, Chuck Lever wrote:
> > On Fri, Apr 05, 2024 at 01:56:18PM -0400, Jeff Layton wrote:
> > > Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the
> > > client. While a callback job is technically an RPC that counter is
> > > really more for client-driven RPCs, and this has the effect of
> > > preventing the client from being unhashed until the callback completes.
> > > 
> > > If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we
> > > can end up in a situation where the callback can't complete on the (now
> > > dead) callback channel, but the new client can't connect because the old
> > > client can't be unhashed. This usually manifests as a NFS4ERR_DELAY
> > > return on the CREATE_SESSION operation.
> > > 
> > > The job is only holding a reference to the client so it can clear a flag
> > > in the after the RPC completes. Fix this by having CB_RECALL_ANY instead
> > > hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that
> > > sort of reference when dealing with the nfsdfs info files, but it should
> > > work appropriately here to ensure that the nfs4_client doesn't
> > > disappear.
> > > 
> > > Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition")
> > > > Reported-by: Vladimir Benes <vbenes@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > Applied to nfsd-fixes while waiting for review and testing. Thanks!
> > 
> Can this fix also be included in the 6.1.x LTS kernel? Given that "NFSD: add
> delegation reaper to react to low memory condition" was added to 6.1.81, it
> would be nice to have this fix in the 6.1 series.
> 
> This way it will also be picked up by Debian at some point (it seems they
> are upgrading to 6.1.82 for their next stable point release).

Thanks to the Fixes: tag in the commit's description, it will
probably appear in a later release of linux-6.1.y automatically.

If we don't see it in the next few weeks, I will ping the stable
maintainers.
Patch

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5fcd93f7cb8c..3cef81e196c6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3042,12 +3042,9 @@  static void
 nfsd4_cb_recall_any_release(struct nfsd4_callback *cb)
 {
 	struct nfs4_client *clp = cb->cb_clp;
-	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
 
-	spin_lock(&nn->client_lock);
 	clear_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
-	put_client_renew_locked(clp);
-	spin_unlock(&nn->client_lock);
+	drop_client(clp);
 }
 
 static int
@@ -6616,7 +6613,7 @@  deleg_reaper(struct nfsd_net *nn)
 		list_add(&clp->cl_ra_cblist, &cblist);
 
 		/* release in nfsd4_cb_recall_any_release */
-		atomic_inc(&clp->cl_rpc_users);
+		kref_get(&clp->cl_nfsdfs.cl_ref);
 		set_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags);
 		clp->cl_ra_time = ktime_get_boottime_seconds();
 	}
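
To illustrate why swapping the counter matters, here is a minimal userspace sketch of the two reference types described in the commit message. It is not kernel code: struct toy_client and every toy_* helper are hypothetical names invented for this example, and the field comments only loosely mirror cl_rpc_users and cl_nfsdfs.cl_ref. The assumption being modelled is the one stated in the changelog: a nonzero cl_rpc_users count prevents unhashing, while a cl_ref reference only keeps the structure from being freed.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy stand-in for struct nfs4_client (hypothetical, heavily simplified). */
struct toy_client {
	int  rpc_users;          /* models cl_rpc_users: nonzero blocks unhash */
	int  ref;                /* models cl_nfsdfs.cl_ref: keeps memory alive */
	bool hashed;             /* still findable by a reconnecting client */
	bool freed;              /* set once the last ref is dropped */
	bool recall_any_pending; /* models NFSD4_CLIENT_CB_RECALL_ANY */
};

static void toy_init(struct toy_client *c)
{
	c->rpc_users = 0;
	c->ref = 1;              /* base reference held by the client table */
	c->hashed = true;
	c->freed = false;
	c->recall_any_pending = false;
}

/* Drop one lifetime reference; the last put "frees" the client. */
static void toy_put_ref(struct toy_client *c)
{
	if (--c->ref == 0)
		c->freed = true;
}

/* Queue the callback: the old code pins rpc_users, the patched code pins ref. */
static void toy_queue_recall_any(struct toy_client *c, bool patched)
{
	if (patched)
		c->ref++;
	else
		c->rpc_users++;
	c->recall_any_pending = true;
}

/* Callback release: clear the flag, then drop whichever reference was taken. */
static void toy_release_recall_any(struct toy_client *c, bool patched)
{
	c->recall_any_pending = false;
	if (patched)
		toy_put_ref(c);
	else
		c->rpc_users--;
}

/* Returns false where the real server would answer NFS4ERR_DELAY. */
static bool toy_try_unhash(struct toy_client *c)
{
	if (c->rpc_users > 0)
		return false;
	c->hashed = false;
	return true;
}

/* Reboot scenario: callback queued but not yet released, then unhash attempt. */
static bool toy_unhash_during_callback(struct toy_client *c, bool patched)
{
	toy_init(c);
	toy_queue_recall_any(c, patched);
	return toy_try_unhash(c);
}
```

Calling toy_unhash_during_callback() with patched set to false leaves the client hashed until the release function runs, which corresponds to the stuck CREATE_SESSION case. With patched set to true the unhash succeeds immediately, and the structure is only marked freed after both the table's reference and the callback's reference have been dropped.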