
[v2,1/3] NFSD: mark cl_cb_state as NFSD4_CB_DOWN if cl_cb_client is NULL

Message ID 1713841953-19594-2-git-send-email-dai.ngo@oracle.com (mailing list archive)
State New
Series [v2,1/3] NFSD: mark cl_cb_state as NFSD4_CB_DOWN if cl_cb_client is NULL

Commit Message

Dai Ngo April 23, 2024, 3:12 a.m. UTC
In nfsd4_run_cb_work, if the rpc_clnt for the back channel no longer
exists, the callback state in nfs4_client should be marked as NFSD4_CB_DOWN
so the server can notify the client to establish a new back channel
connection.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/nfsd/nfs4callback.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Chuck Lever April 23, 2024, 1:41 p.m. UTC | #1
On Mon, Apr 22, 2024 at 08:12:31PM -0700, Dai Ngo wrote:
> In nfsd4_run_cb_work, if the rpc_clnt for the back channel no longer
> exists, the callback state in nfs4_client should be marked as NFSD4_CB_DOWN
> so the server can notify the client to establish a new back channel
> connection.
> 
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4callback.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index cf87ace7a1b0..f8bb5ff2e9ac 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1491,9 +1491,14 @@ nfsd4_run_cb_work(struct work_struct *work)
>  
>  	clnt = clp->cl_cb_client;
>  	if (!clnt) {
> -		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
> +		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>  			nfsd41_destroy_cb(cb);
> -		else {
> +			clear_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
> +
> +			/* let client knows BC is down when it reconnects */
> +			clear_bit(NFSD4_CLIENT_CB_UPDATE, &clp->cl_flags);
> +			nfsd4_mark_cb_down(clp);
> +		} else {
>  			/*
>  			 * XXX: Ideally, we could wait for the client to
>  			 *	reconnect, but I haven't figured out how

NFSD4_CLIENT_CB_KILL is for when the lease is getting expunged. It's
not supposed to be used when only the transport is closed. Thus,
shouldn't you mark_cb_down in this arm, instead? Even so, isn't the
backchannel already marked down when we get here?
Dai Ngo April 23, 2024, 5:49 p.m. UTC | #2
On 4/23/24 6:41 AM, Chuck Lever wrote:
> On Mon, Apr 22, 2024 at 08:12:31PM -0700, Dai Ngo wrote:
>> In nfsd4_run_cb_work, if the rpc_clnt for the back channel no longer
>> exists, the callback state in nfs4_client should be marked as NFSD4_CB_DOWN
>> so the server can notify the client to establish a new back channel
>> connection.
>>
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfsd/nfs4callback.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
>> index cf87ace7a1b0..f8bb5ff2e9ac 100644
>> --- a/fs/nfsd/nfs4callback.c
>> +++ b/fs/nfsd/nfs4callback.c
>> @@ -1491,9 +1491,14 @@ nfsd4_run_cb_work(struct work_struct *work)
>>   
>>   	clnt = clp->cl_cb_client;
>>   	if (!clnt) {
>> -		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
>> +		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>>   			nfsd41_destroy_cb(cb);
>> -		else {
>> +			clear_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
>> +
>> +			/* let client knows BC is down when it reconnects */
>> +			clear_bit(NFSD4_CLIENT_CB_UPDATE, &clp->cl_flags);
>> +			nfsd4_mark_cb_down(clp);
>> +		} else {
>>   			/*
>>   			 * XXX: Ideally, we could wait for the client to
>>   			 *	reconnect, but I haven't figured out how
> NFSD4_CLIENT_CB_KILL is for when the lease is getting expunged. It's
> not supposed to be used when only the transport is closed.

The reason NFSD4_CLIENT_CB_KILL needs to be set when the transport is
closed is because of commit c1ccfcf1a9bf3.

When the transport is closed, nfsd4_conn_lost is called, which then calls
nfsd4_probe_callback to set NFSD4_CLIENT_CB_UPDATE and schedule the
cl_cb_null work to activate the callback worker (nfsd4_run_cb_work) to do
the update.

The callback worker calls nfsd4_process_cb_update, which does
rpc_shutdown_client and then clears cl_cb_client.

When nfsd4_process_cb_update returns to nfsd4_run_cb_work, if cl_cb_client
is NULL and NFSD4_CLIENT_CB_KILL is not set, the worker re-queues the
callback, causing an infinite loop.


>   Thus,
> shouldn't you mark_cb_down in this arm, instead?

I'm not clear on what you mean here; the callback worker calls
nfsd4_mark_cb_down after destroying the callback.

>   Even so, isn't the
> backchannel already marked down when we get here?

No, according to my testing. Without marking the back channel down, the
client does not re-establish the back channel when it reconnects.

-Dai

>
>
Chuck Lever April 23, 2024, 6:08 p.m. UTC | #3
On Tue, Apr 23, 2024 at 10:49:25AM -0700, Dai Ngo wrote:
> 
> On 4/23/24 6:41 AM, Chuck Lever wrote:
> > On Mon, Apr 22, 2024 at 08:12:31PM -0700, Dai Ngo wrote:
> > > In nfsd4_run_cb_work, if the rpc_clnt for the back channel no longer
> > > exists, the callback state in nfs4_client should be marked as NFSD4_CB_DOWN
> > > so the server can notify the client to establish a new back channel
> > > connection.
> > > 
> > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> > > ---
> > >   fs/nfsd/nfs4callback.c | 9 +++++++--
> > >   1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> > > index cf87ace7a1b0..f8bb5ff2e9ac 100644
> > > --- a/fs/nfsd/nfs4callback.c
> > > +++ b/fs/nfsd/nfs4callback.c
> > > @@ -1491,9 +1491,14 @@ nfsd4_run_cb_work(struct work_struct *work)
> > >   	clnt = clp->cl_cb_client;
> > >   	if (!clnt) {
> > > -		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
> > > +		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
> > >   			nfsd41_destroy_cb(cb);
> > > -		else {
> > > +			clear_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
> > > +
> > > +			/* let client knows BC is down when it reconnects */
> > > +			clear_bit(NFSD4_CLIENT_CB_UPDATE, &clp->cl_flags);
> > > +			nfsd4_mark_cb_down(clp);
> > > +		} else {
> > >   			/*
> > >   			 * XXX: Ideally, we could wait for the client to
> > >   			 *	reconnect, but I haven't figured out how
> > NFSD4_CLIENT_CB_KILL is for when the lease is getting expunged. It's
> > not supposed to be used when only the transport is closed.
> 
> The reason NFSD4_CLIENT_CB_KILL needs to be set when the transport is
> closed is because of commit c1ccfcf1a9bf3.
> 
> When the transport is closed, nfsd4_conn_lost is called which then calls
> nfsd4_probe_callback to set NFSD4_CLIENT_CB_UPDATE and schedule cl_cb_null
> work to activate the callback worker (nfsd4_run_cb_work) to do the update.
> 
> Callback worker calls nfsd4_process_cb_update to do rpc_shutdown_client
> then clear cl_cb_client.
> 
> When nfsd4_process_cb_update returns to nfsd4_run_cb_work, if cl_cb_client
> is NULL and NFSD4_CLIENT_CB_KILL not set then it re-queues the callback,
> causing an infinite loop.

That's the way it is supposed to work today. The callback is
re-queued until the client reconnects, at which point the loop is
broken.


> > Thus, shouldn't you mark_cb_down in this arm, instead?
> 
> I'm not clear what you mean here, the callback worker calls
> nfsd4_mark_cb_down after destroying the callback.

No, I mean in the re-queue case.


> > Even so, isn't the
> > backchannel already marked down when we get here?
> 
> No, according to my testing. Without marking the back channel down the
> client does not re-establish the back channel when it reconnects.

I didn't expect that closing the transport on the server side would
need any changes in fs/nfsd/nfs4callback.c. Let me get the
backchannel retransmit behavior sorted first. I'm still working on
setting up a test rig here.
Dai Ngo April 23, 2024, 8:13 p.m. UTC | #4
On 4/23/24 11:08 AM, Chuck Lever wrote:
> On Tue, Apr 23, 2024 at 10:49:25AM -0700, Dai Ngo wrote:
>> On 4/23/24 6:41 AM, Chuck Lever wrote:
>>> On Mon, Apr 22, 2024 at 08:12:31PM -0700, Dai Ngo wrote:
>>>> In nfsd4_run_cb_work, if the rpc_clnt for the back channel no longer
>>>> exists, the callback state in nfs4_client should be marked as NFSD4_CB_DOWN
>>>> so the server can notify the client to establish a new back channel
>>>> connection.
>>>>
>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>> ---
>>>>    fs/nfsd/nfs4callback.c | 9 +++++++--
>>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
>>>> index cf87ace7a1b0..f8bb5ff2e9ac 100644
>>>> --- a/fs/nfsd/nfs4callback.c
>>>> +++ b/fs/nfsd/nfs4callback.c
>>>> @@ -1491,9 +1491,14 @@ nfsd4_run_cb_work(struct work_struct *work)
>>>>    	clnt = clp->cl_cb_client;
>>>>    	if (!clnt) {
>>>> -		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
>>>> +		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>>>>    			nfsd41_destroy_cb(cb);
>>>> -		else {
>>>> +			clear_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
>>>> +
>>>> +			/* let client knows BC is down when it reconnects */
>>>> +			clear_bit(NFSD4_CLIENT_CB_UPDATE, &clp->cl_flags);
>>>> +			nfsd4_mark_cb_down(clp);
>>>> +		} else {
>>>>    			/*
>>>>    			 * XXX: Ideally, we could wait for the client to
>>>>    			 *	reconnect, but I haven't figured out how
>>> NFSD4_CLIENT_CB_KILL is for when the lease is getting expunged. It's
>>> not supposed to be used when only the transport is closed.
>> The reason NFSD4_CLIENT_CB_KILL needs to be set when the transport is
>> closed is because of commit c1ccfcf1a9bf3.
>>
>> When the transport is closed, nfsd4_conn_lost is called which then calls
>> nfsd4_probe_callback to set NFSD4_CLIENT_CB_UPDATE and schedule cl_cb_null
>> work to activate the callback worker (nfsd4_run_cb_work) to do the update.
>>
>> Callback worker calls nfsd4_process_cb_update to do rpc_shutdown_client
>> then clear cl_cb_client.
>>
>> When nfsd4_process_cb_update returns to nfsd4_run_cb_work, if cl_cb_client
>> is NULL and NFSD4_CLIENT_CB_KILL not set then it re-queues the callback,
>> causing an infinite loop.
> That's the way it is supposed to work today. The callback is
> re-queued until the client reconnects, at which point the loop is
> broken.

As you mentioned below, this needs to be reworked.

What if the client never comes back, e.g. it was decommissioned, or a
student hibernates the laptop and only opens it up a few days later? Even
when the client does come back, it might have been rebooted, so the
callback does not mean anything to it.

>
>
>>> Thus, shouldn't you mark_cb_down in this arm, instead?
>> I'm not clear what you mean here, the callback worker calls
>> nfsd4_mark_cb_down after destroying the callback.
> No, I mean in the re-queue case.

In the case of re-queue, the back channel is already marked as NFSD4_CB_DOWN
and cl_flags is NFSD4_CLIENT_STABLE|NFSD4_CLIENT_RECLAIM_COMPLETE|NFSD4_CLIENT_CONFIRMED:

Apr 23 08:07:23 nfsvmc14 kernel: nfsd4_run_cb_work: NULL cl_cb_client REQUEUE CB cb[ffff888126e8a728] clp[ffff888126e8a430] cl_cb_state[2] cl_flags[0x1c]

but that does not stop the loop.

>
>>> Even so, isn't the
>>> backchannel already marked down when we get here?
>> No, according to my testing. Without marking the back channel down the
>> client does not re-establish the back channel when it reconnects.
> I didn't expect that closing the transport on the server side would
> need any changes in fs/nfsd/nfs4callback.c. Let me get the
> backchannel retransmit behavior sorted first. I'm still working on
> setting up a test rig here.

Thanks, I will wait until you sort this out.

-Dai

>
>

Patch

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index cf87ace7a1b0..f8bb5ff2e9ac 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1491,9 +1491,14 @@  nfsd4_run_cb_work(struct work_struct *work)
 
 	clnt = clp->cl_cb_client;
 	if (!clnt) {
-		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
+		if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
 			nfsd41_destroy_cb(cb);
-		else {
+			clear_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags);
+
+			/* let client knows BC is down when it reconnects */
+			clear_bit(NFSD4_CLIENT_CB_UPDATE, &clp->cl_flags);
+			nfsd4_mark_cb_down(clp);
+		} else {
 			/*
 			 * XXX: Ideally, we could wait for the client to
 			 *	reconnect, but I haven't figured out how