Message ID | 20250123-nfsd-6-14-v1-2-c1137a4fa2ae@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Chuck Lever |
Headers | show |
Series | nfsd: CB_SEQUENCE error handling fixes and cleanups | expand |
On 1/23/25 3:25 PM, Jeff Layton wrote: > The current error handling has some problems: > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > recommendation about doing that in the spec, and it seems wrong. Random thought: You might use the Linux NFS client's forechannel session implementation as a code reference. > Handle all three errors the same way: release the slot, but then handle > it just like we would as if we hadn't gotten a reply; mark the session > as faulty, and retry the call. Some questions: Why does it matter whether NFSD keeps the slot if both sides plan to destroy the session? Also, AFAICT marking CB_FAULT does not destroy the session, it simply tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code should actively destroy the session and let the client drive a fresh CREATE_SESSION to recover? > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > 1 file changed, 11 insertions(+), 16 deletions(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > nfsd4_mark_cb_fault(cb->cb_clp); > ret = false; > break; > + case -NFS4ERR_BADSESSION: > + case -NFS4ERR_BADSLOT: > + case -NFS4ERR_SEQ_MISORDERED: > + /* > + * These errors indicate that something has gone wrong > + * with the server and client's synchronization. Release > + * the slot, but handle it as if we hadn't gotten a reply. > + */ > + nfsd41_cb_release_slot(cb); > + fallthrough; > case 1: > /* > * cb_seq_status remains 1 if an RPC Reply was never > * received. NFSD can't know if the client processed > * the CB_SEQUENCE operation. Ask the client to send a > - * DESTROY_SESSION to recover. > + * DESTROY_SESSION to recover, but keep the slot. > */ > - fallthrough; > - case -NFS4ERR_BADSESSION: > nfsd4_mark_cb_fault(cb->cb_clp); > - ret = false; > goto need_restart; > case -NFS4ERR_DELAY: > cb->cb_seq_status = 1; > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > rpc_delay(task, 2 * HZ); > return false; > - case -NFS4ERR_BADSLOT: > - goto retry_nowait; > - case -NFS4ERR_SEQ_MISORDERED: > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > - goto retry_nowait; > - } > - break; > default: > nfsd4_mark_cb_fault(cb->cb_clp); > } > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > nfsd41_cb_release_slot(cb); > out: > return ret; > -retry_nowait: > - if (rpc_restart_call_prepare(task)) > - ret = false; > - goto out; > need_restart: > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > trace_nfsd_cb_restart(clp, cb); >
On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: > On 1/23/25 3:25 PM, Jeff Layton wrote: > > The current error handling has some problems: > > > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > > recommendation about doing that in the spec, and it seems wrong. > > Random thought: You might use the Linux NFS client's forechannel session > implementation as a code reference. > > > > Handle all three errors the same way: release the slot, but then handle > > it just like we would as if we hadn't gotten a reply; mark the session > > as faulty, and retry the call. > > Some questions: > > Why does it matter whether NFSD keeps the slot if both sides plan to > destroy the session? > It may not be required, but there is no reason to hold onto the slot in these cases. Also, at this point, only nfsd has declared that it needs a new session (see below). > Also, AFAICT marking CB_FAULT does not destroy the session, it simply > tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code > should actively destroy the session and let the client drive a fresh > CREATE_SESSION to recover? > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. Then, on the next SEQUENCE call, that makes nfsd set SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate the session. Obviously, there is some delay involved there since we might have to wait for the client to do a lease renewal before this happens. > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > > 1 file changed, 11 insertions(+), 16 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > nfsd4_mark_cb_fault(cb->cb_clp); > > ret = false; > > break; > > + case -NFS4ERR_BADSESSION: > > + case -NFS4ERR_BADSLOT: > > + case -NFS4ERR_SEQ_MISORDERED: > > + /* > > + * These errors indicate that something has gone wrong > > + * with the server and client's synchronization. Release > > + * the slot, but handle it as if we hadn't gotten a reply. > > + */ > > + nfsd41_cb_release_slot(cb); > > + fallthrough; > > case 1: > > /* > > * cb_seq_status remains 1 if an RPC Reply was never > > * received. NFSD can't know if the client processed > > * the CB_SEQUENCE operation. Ask the client to send a > > - * DESTROY_SESSION to recover. > > + * DESTROY_SESSION to recover, but keep the slot. > > */ > > - fallthrough; > > - case -NFS4ERR_BADSESSION: > > nfsd4_mark_cb_fault(cb->cb_clp); > > - ret = false; > > goto need_restart; > > case -NFS4ERR_DELAY: > > cb->cb_seq_status = 1; > > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > rpc_delay(task, 2 * HZ); > > return false; > > - case -NFS4ERR_BADSLOT: > > - goto retry_nowait; > > - case -NFS4ERR_SEQ_MISORDERED: > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > - goto retry_nowait; > > - } > > - break; > > default: > > nfsd4_mark_cb_fault(cb->cb_clp); > > } > > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > nfsd41_cb_release_slot(cb); > > out: > > return ret; > > -retry_nowait: > > - if (rpc_restart_call_prepare(task)) > > - ret = false; > > - goto out; > > need_restart: > > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > > trace_nfsd_cb_restart(clp, cb); > > > >
On 1/24/25 9:46 AM, Jeff Layton wrote: > On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: >> On 1/23/25 3:25 PM, Jeff Layton wrote: >>> The current error handling has some problems: >>> >>> BADSLOT and BADSESSION: don't release the slot before retrying the call >>> >>> SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any >>> recommendation about doing that in the spec, and it seems wrong. >> >> Random thought: You might use the Linux NFS client's forechannel session >> implementation as a code reference. >> >> >>> Handle all three errors the same way: release the slot, but then handle >>> it just like we would as if we hadn't gotten a reply; mark the session >>> as faulty, and retry the call. >> >> Some questions: >> >> Why does it matter whether NFSD keeps the slot if both sides plan to >> destroy the session? >> > > It may not be required, but there is no reason to hold onto the slot in > these cases. In the BADSLOT case, if the slot is released, then another session consumer on the NFS server can use it and will encounter the same error. Best to keep it in the penalty box, IMO. If there are other slots, they are likely still usable. An implementation can choose to continue using the session rather than scuttling it immediately. In the past, with a single backchannel slot, NFSD had no choice but to replace the session. But now it can be more conservative. > Also, at this point, only nfsd has declared that it needs > a new session (see below). If the client's backchannel service has returned BADSESSION, then the client already knows this session is unusable. >> Also, AFAICT marking CB_FAULT does not destroy the session, it simply >> tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code >> should actively destroy the session and let the client drive a fresh >> CREATE_SESSION to recover? >> > > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. > Then, on the next SEQUENCE call, that makes nfsd set > SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate > the session. Obviously, there is some delay involved there since we > might have to wait for the client to do a lease renewal before this > happens. > >> >>> Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") >>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>> --- >>> fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- >>> 1 file changed, 11 insertions(+), 16 deletions(-) >>> >>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>> index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 >>> --- a/fs/nfsd/nfs4callback.c >>> +++ b/fs/nfsd/nfs4callback.c >>> @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>> nfsd4_mark_cb_fault(cb->cb_clp); >>> ret = false; >>> break; >>> + case -NFS4ERR_BADSESSION: >>> + case -NFS4ERR_BADSLOT: >>> + case -NFS4ERR_SEQ_MISORDERED: >>> + /* >>> + * These errors indicate that something has gone wrong >>> + * with the server and client's synchronization. Release >>> + * the slot, but handle it as if we hadn't gotten a reply. >>> + */ >>> + nfsd41_cb_release_slot(cb); >>> + fallthrough; >>> case 1: >>> /* >>> * cb_seq_status remains 1 if an RPC Reply was never >>> * received. NFSD can't know if the client processed >>> * the CB_SEQUENCE operation. Ask the client to send a >>> - * DESTROY_SESSION to recover. >>> + * DESTROY_SESSION to recover, but keep the slot. >>> */ >>> - fallthrough; >>> - case -NFS4ERR_BADSESSION: >>> nfsd4_mark_cb_fault(cb->cb_clp); >>> - ret = false; >>> goto need_restart; >>> case -NFS4ERR_DELAY: >>> cb->cb_seq_status = 1; >>> @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>> >>> rpc_delay(task, 2 * HZ); >>> return false; >>> - case -NFS4ERR_BADSLOT: >>> - goto retry_nowait; >>> - case -NFS4ERR_SEQ_MISORDERED: >>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>> - goto retry_nowait; >>> - } >>> - break; >>> default: >>> nfsd4_mark_cb_fault(cb->cb_clp); >>> } >>> @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>> nfsd41_cb_release_slot(cb); >>> out: >>> return ret; >>> -retry_nowait: >>> - if (rpc_restart_call_prepare(task)) >>> - ret = false; >>> - goto out; >>> need_restart: >>> if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { >>> trace_nfsd_cb_restart(clp, cb); >>> >> >> >
On Fri, 2025-01-24 at 10:31 -0500, Chuck Lever wrote: > On 1/24/25 9:46 AM, Jeff Layton wrote: > > On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: > > > On 1/23/25 3:25 PM, Jeff Layton wrote: > > > > The current error handling has some problems: > > > > > > > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > > > > > > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > > > > recommendation about doing that in the spec, and it seems wrong. > > > > > > Random thought: You might use the Linux NFS client's forechannel session > > > implementation as a code reference. > > > > > > > > > > Handle all three errors the same way: release the slot, but then handle > > > > it just like we would as if we hadn't gotten a reply; mark the session > > > > as faulty, and retry the call. > > > > > > Some questions: > > > > > > Why does it matter whether NFSD keeps the slot if both sides plan to > > > destroy the session? > > > > > > > It may not be required, but there is no reason to hold onto the slot in > > these cases. > > In the BADSLOT case, if the slot is released, then another session > consumer on the NFS server can use it and will encounter the same error. > Best to keep it in the penalty box, IMO. > > If there are other slots, they are likely still usable. An > implementation can choose to continue using the session rather than > scuttling it immediately. In the past, with a single backchannel slot, > NFSD had no choice but to replace the session. But now it can be more > conservative. > It may still be usable, but if we got BADSLOT, then it's still indicative of a faulty session. Best to mark it as such, IMO. It's worth noting that these things should really _never_ happen, so being somewhat heavy-handed in this situation seems reasonable. > > > Also, at this point, only nfsd has declared that it needs > > a new session (see below). > > If the client's backchannel service has returned BADSESSION, then the > client already knows this session is unusable. > Fair point. I still think it's reasonable to mark it CB_FAULT in that case. That status is also displayed via the client_info file, so that's helpful info. > > > > Also, AFAICT marking CB_FAULT does not destroy the session, it simply > > > tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code > > > should actively destroy the session and let the client drive a fresh > > > CREATE_SESSION to recover? > > > > > > > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. > > Then, on the next SEQUENCE call, that makes nfsd set > > SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate > > the session. Obviously, there is some delay involved there since we > > might have to wait for the client to do a lease renewal before this > > happens. > > > > > > > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > > > > 1 file changed, 11 insertions(+), 16 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > ret = false; > > > > break; > > > > + case -NFS4ERR_BADSESSION: > > > > + case -NFS4ERR_BADSLOT: > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > + /* > > > > + * These errors indicate that something has gone wrong > > > > + * with the server and client's synchronization. Release > > > > + * the slot, but handle it as if we hadn't gotten a reply. > > > > + */ > > > > + nfsd41_cb_release_slot(cb); > > > > + fallthrough; > > > > case 1: > > > > /* > > > > * cb_seq_status remains 1 if an RPC Reply was never > > > > * received. NFSD can't know if the client processed > > > > * the CB_SEQUENCE operation. Ask the client to send a > > > > - * DESTROY_SESSION to recover. > > > > + * DESTROY_SESSION to recover, but keep the slot. > > > > */ > > > > - fallthrough; > > > > - case -NFS4ERR_BADSESSION: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > - ret = false; > > > > goto need_restart; > > > > case -NFS4ERR_DELAY: > > > > cb->cb_seq_status = 1; > > > > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > rpc_delay(task, 2 * HZ); > > > > return false; > > > > - case -NFS4ERR_BADSLOT: > > > > - goto retry_nowait; > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > - goto retry_nowait; > > > > - } > > > > - break; > > > > default: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > } > > > > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd41_cb_release_slot(cb); > > > > out: > > > > return ret; > > > > -retry_nowait: > > > > - if (rpc_restart_call_prepare(task)) > > > > - ret = false; > > > > - goto out; > > > > need_restart: > > > > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > > > > trace_nfsd_cb_restart(clp, cb); > > > > > > > > > > > > > >
On Fri, 2025-01-24 at 10:31 -0500, Chuck Lever wrote: > On 1/24/25 9:46 AM, Jeff Layton wrote: > > On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote: > > > On 1/23/25 3:25 PM, Jeff Layton wrote: > > > > The current error handling has some problems: > > > > > > > > BADSLOT and BADSESSION: don't release the slot before retrying the call > > > > > > > > SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any > > > > recommendation about doing that in the spec, and it seems wrong. > > > > > > Random thought: You might use the Linux NFS client's forechannel session > > > implementation as a code reference. > > > > > > > > > > Handle all three errors the same way: release the slot, but then handle > > > > it just like we would as if we hadn't gotten a reply; mark the session > > > > as faulty, and retry the call. > > > > > > Some questions: > > > > > > Why does it matter whether NFSD keeps the slot if both sides plan to > > > destroy the session? > > > > > > > It may not be required, but there is no reason to hold onto the slot in > > these cases. > > In the BADSLOT case, if the slot is released, then another session > consumer on the NFS server can use it and will encounter the same error. > Best to keep it in the penalty box, IMO. > There is another problem here too. Once the session is reconstituted, there is no guarantee that the slot that the call is sitting on will still be valid. The new CB slot table may be smaller than before. I think we do need to release the slot in these cases for that reason alone. > If there are other slots, they are likely still usable. An > implementation can choose to continue using the session rather than > scuttling it immediately. In the past, with a single backchannel slot, > NFSD had no choice but to replace the session. But now it can be more > conservative. > > > > Also, at this point, only nfsd has declared that it needs > > a new session (see below). > > If the client's backchannel service has returned BADSESSION, then the > client already knows this session is unusable. > > > > > Also, AFAICT marking CB_FAULT does not destroy the session, it simply > > > tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code > > > should actively destroy the session and let the client drive a fresh > > > CREATE_SESSION to recover? > > > > > > > Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT. > > Then, on the next SEQUENCE call, that makes nfsd set > > SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate > > the session. Obviously, there is some delay involved there since we > > might have to wait for the client to do a lease renewal before this > > happens. > > > > > > > > > Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- > > > > 1 file changed, 11 insertions(+), 16 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > ret = false; > > > > break; > > > > + case -NFS4ERR_BADSESSION: > > > > + case -NFS4ERR_BADSLOT: > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > + /* > > > > + * These errors indicate that something has gone wrong > > > > + * with the server and client's synchronization. Release > > > > + * the slot, but handle it as if we hadn't gotten a reply. > > > > + */ > > > > + nfsd41_cb_release_slot(cb); > > > > + fallthrough; > > > > case 1: > > > > /* > > > > * cb_seq_status remains 1 if an RPC Reply was never > > > > * received. NFSD can't know if the client processed > > > > * the CB_SEQUENCE operation. Ask the client to send a > > > > - * DESTROY_SESSION to recover. > > > > + * DESTROY_SESSION to recover, but keep the slot. > > > > */ > > > > - fallthrough; > > > > - case -NFS4ERR_BADSESSION: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > - ret = false; > > > > goto need_restart; > > > > case -NFS4ERR_DELAY: > > > > cb->cb_seq_status = 1; > > > > @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > rpc_delay(task, 2 * HZ); > > > > return false; > > > > - case -NFS4ERR_BADSLOT: > > > > - goto retry_nowait; > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > - goto retry_nowait; > > > > - } > > > > - break; > > > > default: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > } > > > > @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd41_cb_release_slot(cb); > > > > out: > > > > return ret; > > > > -retry_nowait: > > > > - if (rpc_restart_call_prepare(task)) > > > > - ret = false; > > > > - goto out; > > > > need_restart: > > > > if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { > > > > trace_nfsd_cb_restart(clp, cb); > > > > > > > > > > > > > >
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644 --- a/fs/nfsd/nfs4callback.c +++ b/fs/nfsd/nfs4callback.c @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback nfsd4_mark_cb_fault(cb->cb_clp); ret = false; break; + case -NFS4ERR_BADSESSION: + case -NFS4ERR_BADSLOT: + case -NFS4ERR_SEQ_MISORDERED: + /* + * These errors indicate that something has gone wrong + * with the server and client's synchronization. Release + * the slot, but handle it as if we hadn't gotten a reply. + */ + nfsd41_cb_release_slot(cb); + fallthrough; case 1: /* * cb_seq_status remains 1 if an RPC Reply was never * received. NFSD can't know if the client processed * the CB_SEQUENCE operation. Ask the client to send a - * DESTROY_SESSION to recover. + * DESTROY_SESSION to recover, but keep the slot. */ - fallthrough; - case -NFS4ERR_BADSESSION: nfsd4_mark_cb_fault(cb->cb_clp); - ret = false; goto need_restart; case -NFS4ERR_DELAY: cb->cb_seq_status = 1; @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback rpc_delay(task, 2 * HZ); return false; - case -NFS4ERR_BADSLOT: - goto retry_nowait; - case -NFS4ERR_SEQ_MISORDERED: - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { - session->se_cb_seq_nr[cb->cb_held_slot] = 1; - goto retry_nowait; - } - break; default: nfsd4_mark_cb_fault(cb->cb_clp); } @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback nfsd41_cb_release_slot(cb); out: return ret; -retry_nowait: - if (rpc_restart_call_prepare(task)) - ret = false; - goto out; need_restart: if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) { trace_nfsd_cb_restart(clp, cb);
The current error handling has some problems: BADSLOT and BADSESSION: don't release the slot before retrying the call SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any recommendation about doing that in the spec, and it seems wrong. Handle all three errors the same way: release the slot, but then handle it just like we would as if we hadn't gotten a reply; mark the session as faulty, and retry the call. Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/nfsd/nfs4callback.c | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-)