Message ID | 20250207-nfsd-6-14-v5-6-f3b54fb60dc0@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Chuck Lever |
Headers | show |
Series | nfsd: CB_SEQUENCE error handling fixes and cleanups | expand |
On 2/7/25 4:53 PM, Jeff Layton wrote: > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > fall back to treating it like a BADSLOT if that fails. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > 1 file changed, 10 insertions(+), 6 deletions(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > goto requeue; > rpc_delay(task, 2 * HZ); > return false; > + case -NFS4ERR_SEQ_MISORDERED: > + /* > + * Reattempt once with seq_nr 1. If that fails, treat this > + * like BADSLOT. > + */ Nit: this comment says exactly what the code says. If it were me, I'd remove it. Is there a "why" statement that could be made here? Like, why retry with a seq_nr of 1 instead of just failing immediately? > + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > + session->se_cb_seq_nr[cb->cb_held_slot] = 1; > + goto retry_nowait; > + } > + fallthrough; > case -NFS4ERR_BADSLOT: > /* > * BADSLOT means that the client and server are out of sync > @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > nfsd4_mark_cb_fault(cb->cb_clp); > cb->cb_held_slot = -1; > goto retry_nowait; > - case -NFS4ERR_SEQ_MISORDERED: > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > - goto retry_nowait; > - } > - break; > default: > nfsd4_mark_cb_fault(cb->cb_clp); > } >
On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > On 2/7/25 4:53 PM, Jeff Layton wrote: > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > fall back to treating it like a BADSLOT if that fails. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > goto requeue; > > rpc_delay(task, 2 * HZ); > > return false; > > + case -NFS4ERR_SEQ_MISORDERED: > > + /* > > + * Reattempt once with seq_nr 1. If that fails, treat this > > + * like BADSLOT. > > + */ > > Nit: this comment says exactly what the code says. If it were me, I'd > remove it. Is there a "why" statement that could be made here? Like, > why retry with a seq_nr of 1 instead of just failing immediately? > There isn't one that I know of. It looks like Kinglong Mee added it in 7ba6cad6c88f, but there is no real mention of that in the changelog. TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 when we got this error, and we then retry with a seq_nr of 1? Does the server then treat that as a retransmission? We might be best off dropping this and just always treating it like BADSLOT. Thoughts? > > > + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > + session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > + goto retry_nowait; > > + } > > + fallthrough; > > case -NFS4ERR_BADSLOT: > > /* > > * BADSLOT means that the client and server are out of sync > > @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > nfsd4_mark_cb_fault(cb->cb_clp); > > cb->cb_held_slot = -1; > > goto retry_nowait; > > - case -NFS4ERR_SEQ_MISORDERED: > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > - goto retry_nowait; > > - } > > - break; > > default: > > nfsd4_mark_cb_fault(cb->cb_clp); > > } > > > >
On 2/8/2025 10:02 AM, Jeff Layton wrote: > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >> On 2/7/25 4:53 PM, Jeff Layton wrote: >>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>> fall back to treating it like a BADSLOT if that fails. >>> >>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>> --- >>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>> 1 file changed, 10 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>> --- a/fs/nfsd/nfs4callback.c >>> +++ b/fs/nfsd/nfs4callback.c >>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>> goto requeue; >>> rpc_delay(task, 2 * HZ); >>> return false; >>> + case -NFS4ERR_SEQ_MISORDERED: >>> + /* >>> + * Reattempt once with seq_nr 1. If that fails, treat this >>> + * like BADSLOT. >>> + */ >> >> Nit: this comment says exactly what the code says. If it were me, I'd >> remove it. Is there a "why" statement that could be made here? Like, >> why retry with a seq_nr of 1 instead of just failing immediately? >> > > There isn't one that I know of. It looks like Kinglong Mee added it in > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > when we got this error, and we then retry with a seq_nr of 1? Does the > server then treat that as a retransmission? So I assume you mean the requester sent seq_nr 1, saw a reply and sent a subsequent seq_nr 2, to which it gets SEQ_MISORDERED. If so, yes definitely backing up the seq_nr to 1 will result in the peer considering it to be a retransmission, which will be bad. > We might be best off > dropping this and just always treating it like BADSLOT. But, why would this happen? Usually I'd think the peer sent seq_nr X before it received a reply to seq_nr X-1, which would be a peer bug. OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, how does the requester know the difference? If treating it as BADSLOT completely resets the sequence, then sure, but either a) the request is still in-progress, or b) if a bug is causing the situation, well it's not going to converge on a functional session. Not sure I have a solid suggestion right now. Whatever the fix, it should capture any subtlety in a comment. Tom. > > Thoughts? > >> >>> + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>> + session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>> + goto retry_nowait; >>> + } >>> + fallthrough; >>> case -NFS4ERR_BADSLOT: >>> /* >>> * BADSLOT means that the client and server are out of sync >>> @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>> nfsd4_mark_cb_fault(cb->cb_clp); >>> cb->cb_held_slot = -1; >>> goto retry_nowait; >>> - case -NFS4ERR_SEQ_MISORDERED: >>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>> - goto retry_nowait; >>> - } >>> - break; >>> default: >>> nfsd4_mark_cb_fault(cb->cb_clp); >>> } >>> >> >> >
On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: > On 2/8/2025 10:02 AM, Jeff Layton wrote: > > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > > > On 2/7/25 4:53 PM, Jeff Layton wrote: > > > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > > > fall back to treating it like a BADSLOT if that fails. > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > goto requeue; > > > > rpc_delay(task, 2 * HZ); > > > > return false; > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > + /* > > > > + * Reattempt once with seq_nr 1. If that fails, treat this > > > > + * like BADSLOT. > > > > + */ > > > > > > Nit: this comment says exactly what the code says. If it were me, I'd > > > remove it. Is there a "why" statement that could be made here? Like, > > > why retry with a seq_nr of 1 instead of just failing immediately? > > > > > > > There isn't one that I know of. It looks like Kinglong Mee added it in > > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > > when we got this error, and we then retry with a seq_nr of 1? Does the > > server then treat that as a retransmission? > > So I assume you mean the requester sent seq_nr 1, saw a reply and sent a > subsequent seq_nr 2, to which it gets SEQ_MISORDERED. > > If so, yes definitely backing up the seq_nr to 1 will result in the > peer considering it to be a retransmission, which will be bad. > Yes, that's what I meant. > > We might be best off > > dropping this and just always treating it like BADSLOT. > > But, why would this happen? Usually I'd think the peer sent seq_nr X > before it received a reply to seq_nr X-1, which would be a peer bug. > > OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, > how does the requester know the difference? > > If treating it as BADSLOT completely resets the sequence, then sure, > but either a) the request is still in-progress, or b) if a bug is > causing the situation, well it's not going to converge on a functional > session. > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT in the next forechannel SEQUENCE on the session. That should cause the client to (eventually) send a DESTROY_SESSION and create a new one. Unfortunately, in the meantime, because of the way the callback channel update works, the server can end up trying to send the callback again on the same session (and maybe more than once). I'm not sure that that's a real problem per-se, but it's less than ideal. > Not sure I have a solid suggestion right now. Whatever the fix, it > should capture any subtlety in a comment. > At this point, I'm leaning toward just treating it like BADSLOT. Basically, mark the backchannel faulty, and leak the slot so that nothing else uses it. That allows us to send backchannel requests on the other slots until the session gets recreated. > > > > Thoughts? > > > > > > > > > + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > + session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > + goto retry_nowait; > > > > + } > > > > + fallthrough; > > > > case -NFS4ERR_BADSLOT: > > > > /* > > > > * BADSLOT means that the client and server are out of sync > > > > @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > cb->cb_held_slot = -1; > > > > goto retry_nowait; > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > - goto retry_nowait; > > > > - } > > > > - break; > > > > default: > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > } > > > > > > > > > > > > >
On 2/8/2025 11:08 AM, Jeff Layton wrote: > On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>> fall back to treating it like a BADSLOT if that fails. >>>>> >>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>> --- >>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>> --- a/fs/nfsd/nfs4callback.c >>>>> +++ b/fs/nfsd/nfs4callback.c >>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>> goto requeue; >>>>> rpc_delay(task, 2 * HZ); >>>>> return false; >>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>> + /* >>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>> + * like BADSLOT. >>>>> + */ >>>> >>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>> remove it. Is there a "why" statement that could be made here? Like, >>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>> >>> >>> There isn't one that I know of. It looks like Kinglong Mee added it in >>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>> >>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>> when we got this error, and we then retry with a seq_nr of 1? Does the >>> server then treat that as a retransmission? >> >> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >> >> If so, yes definitely backing up the seq_nr to 1 will result in the >> peer considering it to be a retransmission, which will be bad. >> > > Yes, that's what I meant. > >>> We might be best off >>> dropping this and just always treating it like BADSLOT. >> >> But, why would this happen? Usually I'd think the peer sent seq_nr X >> before it received a reply to seq_nr X-1, which would be a peer bug. >> >> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >> how does the requester know the difference? >> >> If treating it as BADSLOT completely resets the sequence, then sure, >> but either a) the request is still in-progress, or b) if a bug is >> causing the situation, well it's not going to converge on a functional >> session. >> > > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT > in the next forechannel SEQUENCE on the session. That should cause the > client to (eventually) send a DESTROY_SESSION and create a new one. > > Unfortunately, in the meantime, because of the way the callback channel > update works, the server can end up trying to send the callback again > on the same session (and maybe more than once). I'm not sure that > that's a real problem per-se, but it's less than ideal. > >> Not sure I have a solid suggestion right now. Whatever the fix, it >> should capture any subtlety in a comment. >> > > At this point, I'm leaning toward just treating it like BADSLOT. > Basically, mark the backchannel faulty, and leak the slot so that > nothing else uses it. That allows us to send backchannel requests on > the other slots until the session gets recreated. Hmm, leaking the slot is a workable approach, as long as it doesn't cascade more than a time or two. Some sort of trigger should be armed to prevent runaway retries. It's maybe worth considering what state the peer might be in when this happens. It too may effectively leak a slot, and if is retaining some bogus state either as a result of or because of the previous exchange(s) then this may lead to future hangs/failures. Not pretty, and maybe not worth trying to guess. Tom. > >>> >>> Thoughts? >>> >>>> >>>>> + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>> + session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>> + goto retry_nowait; >>>>> + } >>>>> + fallthrough; >>>>> case -NFS4ERR_BADSLOT: >>>>> /* >>>>> * BADSLOT means that the client and server are out of sync >>>>> @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>> cb->cb_held_slot = -1; >>>>> goto retry_nowait; >>>>> - case -NFS4ERR_SEQ_MISORDERED: >>>>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>> - goto retry_nowait; >>>>> - } >>>>> - break; >>>>> default: >>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>> } >>>>> >>>> >>>> >>> >> >
On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: > On 2/8/2025 11:08 AM, Jeff Layton wrote: > > On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: > > > On 2/8/2025 10:02 AM, Jeff Layton wrote: > > > > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > > > > > On 2/7/25 4:53 PM, Jeff Layton wrote: > > > > > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > > > > > fall back to treating it like a BADSLOT if that fails. > > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > --- > > > > > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > > > > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > goto requeue; > > > > > > rpc_delay(task, 2 * HZ); > > > > > > return false; > > > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > > > + /* > > > > > > + * Reattempt once with seq_nr 1. If that fails, treat this > > > > > > + * like BADSLOT. > > > > > > + */ > > > > > > > > > > Nit: this comment says exactly what the code says. If it were me, I'd > > > > > remove it. Is there a "why" statement that could be made here? Like, > > > > > why retry with a seq_nr of 1 instead of just failing immediately? > > > > > > > > > > > > > There isn't one that I know of. It looks like Kinglong Mee added it in > > > > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > > > > > > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > > > > when we got this error, and we then retry with a seq_nr of 1? Does the > > > > server then treat that as a retransmission? > > > > > > So I assume you mean the requester sent seq_nr 1, saw a reply and sent a > > > subsequent seq_nr 2, to which it gets SEQ_MISORDERED. > > > > > > If so, yes definitely backing up the seq_nr to 1 will result in the > > > peer considering it to be a retransmission, which will be bad. > > > > > > > Yes, that's what I meant. > > > > > > We might be best off > > > > dropping this and just always treating it like BADSLOT. > > > > > > But, why would this happen? Usually I'd think the peer sent seq_nr X > > > before it received a reply to seq_nr X-1, which would be a peer bug. > > > > > > OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, > > > how does the requester know the difference? > > > > > > If treating it as BADSLOT completely resets the sequence, then sure, > > > but either a) the request is still in-progress, or b) if a bug is > > > causing the situation, well it's not going to converge on a functional > > > session. > > > > > > > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT > > in the next forechannel SEQUENCE on the session. That should cause the > > client to (eventually) send a DESTROY_SESSION and create a new one. > > > > Unfortunately, in the meantime, because of the way the callback channel > > update works, the server can end up trying to send the callback again > > on the same session (and maybe more than once). I'm not sure that > > that's a real problem per-se, but it's less than ideal. > > > > > Not sure I have a solid suggestion right now. Whatever the fix, it > > > should capture any subtlety in a comment. > > > > > > > At this point, I'm leaning toward just treating it like BADSLOT. > > Basically, mark the backchannel faulty, and leak the slot so that > > nothing else uses it. That allows us to send backchannel requests on > > the other slots until the session gets recreated. > > Hmm, leaking the slot is a workable approach, as long as it doesn't > cascade more than a time or two. Some sort of trigger should be armed > to prevent runaway retries. > > It's maybe worth considering what state the peer might be in when this > happens. It too may effectively leak a slot, and if is retaining some > bogus state either as a result of or because of the previous exchange(s) > then this may lead to future hangs/failures. Not pretty, and maybe not > worth trying to guess. > > Tom. > The idea here is that eventually the client should figure out that something is wrong and reestablish the session. Currently we don't limit the number of retries on a callback. Maybe they should time out after a while? If we've retried a callback for more than two lease periods, give up and log something? Either way, I'd consider that to be follow-on work to this set. > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > > + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > > > + session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > > > + goto retry_nowait; > > > > > > + } > > > > > > + fallthrough; > > > > > > case -NFS4ERR_BADSLOT: > > > > > > /* > > > > > > * BADSLOT means that the client and server are out of sync > > > > > > @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > > > cb->cb_held_slot = -1; > > > > > > goto retry_nowait; > > > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > > > - goto retry_nowait; > > > > > > - } > > > > > > - break; > > > > > > default: > > > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > >
On 2/8/25 3:45 PM, Jeff Layton wrote: > On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: >> On 2/8/2025 11:08 AM, Jeff Layton wrote: >>> On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >>>> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>>>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>>>> fall back to treating it like a BADSLOT if that fails. >>>>>>> >>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>>>> --- >>>>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>> goto requeue; >>>>>>> rpc_delay(task, 2 * HZ); >>>>>>> return false; >>>>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>>>> + /* >>>>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>>>> + * like BADSLOT. >>>>>>> + */ >>>>>> >>>>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>>>> remove it. Is there a "why" statement that could be made here? Like, >>>>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>>>> >>>>> >>>>> There isn't one that I know of. It looks like Kinglong Mee added it in >>>>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>>>> >>>>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>>>> when we got this error, and we then retry with a seq_nr of 1? Does the >>>>> server then treat that as a retransmission? >>>> >>>> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >>>> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >>>> >>>> If so, yes definitely backing up the seq_nr to 1 will result in the >>>> peer considering it to be a retransmission, which will be bad. >>>> >>> >>> Yes, that's what I meant. >>> >>>>> We might be best off >>>>> dropping this and just always treating it like BADSLOT. >>>> >>>> But, why would this happen? Usually I'd think the peer sent seq_nr X >>>> before it received a reply to seq_nr X-1, which would be a peer bug. >>>> >>>> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >>>> how does the requester know the difference? >>>> >>>> If treating it as BADSLOT completely resets the sequence, then sure, >>>> but either a) the request is still in-progress, or b) if a bug is >>>> causing the situation, well it's not going to converge on a functional >>>> session. >>>> >>> >>> With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT >>> in the next forechannel SEQUENCE on the session. That should cause the >>> client to (eventually) send a DESTROY_SESSION and create a new one. >>> >>> Unfortunately, in the meantime, because of the way the callback channel >>> update works, the server can end up trying to send the callback again >>> on the same session (and maybe more than once). I'm not sure that >>> that's a real problem per-se, but it's less than ideal. >>> >>>> Not sure I have a solid suggestion right now. Whatever the fix, it >>>> should capture any subtlety in a comment. >>>> >>> >>> At this point, I'm leaning toward just treating it like BADSLOT. >>> Basically, mark the backchannel faulty, and leak the slot so that >>> nothing else uses it. That allows us to send backchannel requests on >>> the other slots until the session gets recreated. >> >> Hmm, leaking the slot is a workable approach, as long as it doesn't >> cascade more than a time or two. Some sort of trigger should be armed >> to prevent runaway retries. >> >> It's maybe worth considering what state the peer might be in when this >> happens. It too may effectively leak a slot, and if is retaining some >> bogus state either as a result of or because of the previous exchange(s) >> then this may lead to future hangs/failures. Not pretty, and maybe not >> worth trying to guess. >> >> Tom. >> > > > The idea here is that eventually the client should figure out that > something is wrong and reestablish the session. Currently we don't > limit the number of retries on a callback. > > Maybe they should time out after a while? If we've retried a callback > for more than two lease periods, give up and log something? > > Either way, I'd consider that to be follow-on work to this set. As a general comment, I think making a heroic effort to recover in any of these cases is probably not worth the additional complexity. Where it is required or where we believe it is worth the trouble, that's where we want a detailed comment. What we want to do is ensure forward progress. I'm guessing that error conditions are going to be rare, so leaking the slot until a certain portion of them are gone, and then indicating a session fault to force the client to start over from scratch, is probably the most straightforward approach. So, is there a good reason to retry? There doesn't appear to be any reasoning mentioned in the commit log or in nearby comments. >>>>> Thoughts? >>>>> >>>>>> >>>>>>> + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>> + session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>> + goto retry_nowait; >>>>>>> + } >>>>>>> + fallthrough; >>>>>>> case -NFS4ERR_BADSLOT: >>>>>>> /* >>>>>>> * BADSLOT means that the client and server are out of sync >>>>>>> @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>> cb->cb_held_slot = -1; >>>>>>> goto retry_nowait; >>>>>>> - case -NFS4ERR_SEQ_MISORDERED: >>>>>>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>> - goto retry_nowait; >>>>>>> - } >>>>>>> - break; >>>>>>> default: >>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>> } >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >
On 2/8/2025 4:07 PM, Chuck Lever wrote: > On 2/8/25 3:45 PM, Jeff Layton wrote: >> On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: >>> On 2/8/2025 11:08 AM, Jeff Layton wrote: >>>> On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >>>>> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>>>>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>>>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>>>>> fall back to treating it like a BADSLOT if that fails. >>>>>>>> >>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>>>>> --- >>>>>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>>>>> >>>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>> goto requeue; >>>>>>>> rpc_delay(task, 2 * HZ); >>>>>>>> return false; >>>>>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>>>>> + /* >>>>>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>>>>> + * like BADSLOT. >>>>>>>> + */ >>>>>>> >>>>>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>>>>> remove it. Is there a "why" statement that could be made here? Like, >>>>>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>>>>> >>>>>> >>>>>> There isn't one that I know of. It looks like Kinglong Mee added it in >>>>>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>>>>> >>>>>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>>>>> when we got this error, and we then retry with a seq_nr of 1? Does the >>>>>> server then treat that as a retransmission? >>>>> >>>>> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >>>>> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >>>>> >>>>> If so, yes definitely backing up the seq_nr to 1 will result in the >>>>> peer considering it to be a retransmission, which will be bad. >>>>> >>>> >>>> Yes, that's what I meant. >>>> >>>>>> We might be best off >>>>>> dropping this and just always treating it like BADSLOT. >>>>> >>>>> But, why would this happen? Usually I'd think the peer sent seq_nr X >>>>> before it received a reply to seq_nr X-1, which would be a peer bug. >>>>> >>>>> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >>>>> how does the requester know the difference? >>>>> >>>>> If treating it as BADSLOT completely resets the sequence, then sure, >>>>> but either a) the request is still in-progress, or b) if a bug is >>>>> causing the situation, well it's not going to converge on a functional >>>>> session. >>>>> >>>> >>>> With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT >>>> in the next forechannel SEQUENCE on the session. That should cause the >>>> client to (eventually) send a DESTROY_SESSION and create a new one. >>>> >>>> Unfortunately, in the meantime, because of the way the callback channel >>>> update works, the server can end up trying to send the callback again >>>> on the same session (and maybe more than once). I'm not sure that >>>> that's a real problem per-se, but it's less than ideal. >>>> >>>>> Not sure I have a solid suggestion right now. Whatever the fix, it >>>>> should capture any subtlety in a comment. >>>>> >>>> >>>> At this point, I'm leaning toward just treating it like BADSLOT. >>>> Basically, mark the backchannel faulty, and leak the slot so that >>>> nothing else uses it. That allows us to send backchannel requests on >>>> the other slots until the session gets recreated. >>> >>> Hmm, leaking the slot is a workable approach, as long as it doesn't >>> cascade more than a time or two. Some sort of trigger should be armed >>> to prevent runaway retries. >>> >>> It's maybe worth considering what state the peer might be in when this >>> happens. It too may effectively leak a slot, and if is retaining some >>> bogus state either as a result of or because of the previous exchange(s) >>> then this may lead to future hangs/failures. Not pretty, and maybe not >>> worth trying to guess. >>> >>> Tom. >>> >> >> >> The idea here is that eventually the client should figure out that >> something is wrong and reestablish the session. Currently we don't >> limit the number of retries on a callback. >> >> Maybe they should time out after a while? If we've retried a callback >> for more than two lease periods, give up and log something? >> >> Either way, I'd consider that to be follow-on work to this set. > > As a general comment, I think making a heroic effort to recover in any > of these cases is probably not worth the additional complexity. Where it > is required or where we believe it is worth the trouble, that's where we > want a detailed comment. > > What we want to do is ensure forward progress. I'm guessing that error > conditions are going to be rare, so leaking the slot until a certain > portion of them are gone, and then indicating a session fault to force > the client to start over from scratch, is probably the most > straightforward approach. > > So, is there a good reason to retry? There doesn't appear to be any > reasoning mentioned in the commit log or in nearby comments. Agreed on the general comment. As for the "any reason to retry" - maybe. If it's a transient error we don't want to give up early. Unfortunately that appears to be an ambiguous situation, because SEQ_MISORDERED is allowed in place of ERR_DELAY. I don't have any great suggestion however. Jeff, to your point that the "client should figure out something is wrong", I'm not sure how you think that will happen. If the server is making a delegation recall and the client receive code chooses to reject it at the sequence check, how would that eventually cause the client to reestablish the session (on the forechannel)? Tom. > > >>>>>> Thoughts? >>>>>> >>>>>>> >>>>>>>> + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>>> + session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>>> + goto retry_nowait; >>>>>>>> + } >>>>>>>> + fallthrough; >>>>>>>> case -NFS4ERR_BADSLOT: >>>>>>>> /* >>>>>>>> * BADSLOT means that the client and server are out of sync >>>>>>>> @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>>> cb->cb_held_slot = -1; >>>>>>>> goto retry_nowait; >>>>>>>> - case -NFS4ERR_SEQ_MISORDERED: >>>>>>>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>>> - goto retry_nowait; >>>>>>>> - } >>>>>>>> - break; >>>>>>>> default: >>>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>>> } >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > >
On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: > On 2/8/2025 4:07 PM, Chuck Lever wrote: > > On 2/8/25 3:45 PM, Jeff Layton wrote: > > > On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: > > > > On 2/8/2025 11:08 AM, Jeff Layton wrote: > > > > > On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: > > > > > > On 2/8/2025 10:02 AM, Jeff Layton wrote: > > > > > > > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > > > > > > > > On 2/7/25 4:53 PM, Jeff Layton wrote: > > > > > > > > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > > > > > > > > fall back to treating it like a BADSLOT if that fails. > > > > > > > > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > --- > > > > > > > > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > > > > > > > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > > > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > > > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > > > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > > goto requeue; > > > > > > > > > rpc_delay(task, 2 * HZ); > > > > > > > > > return false; > > > > > > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > > > > > > + /* > > > > > > > > > + * Reattempt once with seq_nr 1. If that fails, treat this > > > > > > > > > + * like BADSLOT. > > > > > > > > > + */ > > > > > > > > > > > > > > > > Nit: this comment says exactly what the code says. If it were me, I'd > > > > > > > > remove it. Is there a "why" statement that could be made here? Like, > > > > > > > > why retry with a seq_nr of 1 instead of just failing immediately? > > > > > > > > > > > > > > > > > > > > > > There isn't one that I know of. It looks like Kinglong Mee added it in > > > > > > > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > > > > > > > > > > > > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > > > > > > > when we got this error, and we then retry with a seq_nr of 1? Does the > > > > > > > server then treat that as a retransmission? > > > > > > > > > > > > So I assume you mean the requester sent seq_nr 1, saw a reply and sent a > > > > > > subsequent seq_nr 2, to which it gets SEQ_MISORDERED. > > > > > > > > > > > > If so, yes definitely backing up the seq_nr to 1 will result in the > > > > > > peer considering it to be a retransmission, which will be bad. > > > > > > > > > > > > > > > > Yes, that's what I meant. > > > > > > > > > > > > We might be best off > > > > > > > dropping this and just always treating it like BADSLOT. > > > > > > > > > > > > But, why would this happen? Usually I'd think the peer sent seq_nr X > > > > > > before it received a reply to seq_nr X-1, which would be a peer bug. > > > > > > > > > > > > OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, > > > > > > how does the requester know the difference? > > > > > > > > > > > > If treating it as BADSLOT completely resets the sequence, then sure, > > > > > > but either a) the request is still in-progress, or b) if a bug is > > > > > > causing the situation, well it's not going to converge on a functional > > > > > > session. > > > > > > > > > > > > > > > > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT > > > > > in the next forechannel SEQUENCE on the session. That should cause the > > > > > client to (eventually) send a DESTROY_SESSION and create a new one. > > > > > > > > > > Unfortunately, in the meantime, because of the way the callback channel > > > > > update works, the server can end up trying to send the callback again > > > > > on the same session (and maybe more than once). I'm not sure that > > > > > that's a real problem per-se, but it's less than ideal. > > > > > > > > > > > Not sure I have a solid suggestion right now. Whatever the fix, it > > > > > > should capture any subtlety in a comment. > > > > > > > > > > > > > > > > At this point, I'm leaning toward just treating it like BADSLOT. > > > > > Basically, mark the backchannel faulty, and leak the slot so that > > > > > nothing else uses it. That allows us to send backchannel requests on > > > > > the other slots until the session gets recreated. > > > > > > > > Hmm, leaking the slot is a workable approach, as long as it doesn't > > > > cascade more than a time or two. Some sort of trigger should be armed > > > > to prevent runaway retries. > > > > > > > > It's maybe worth considering what state the peer might be in when this > > > > happens. It too may effectively leak a slot, and if is retaining some > > > > bogus state either as a result of or because of the previous exchange(s) > > > > then this may lead to future hangs/failures. Not pretty, and maybe not > > > > worth trying to guess. > > > > > > > > Tom. > > > > > > > > > > > > > The idea here is that eventually the client should figure out that > > > something is wrong and reestablish the session. Currently we don't > > > limit the number of retries on a callback. > > > > > > Maybe they should time out after a while? If we've retried a callback > > > for more than two lease periods, give up and log something? > > > > > > Either way, I'd consider that to be follow-on work to this set. > > > > As a general comment, I think making a heroic effort to recover in any > > of these cases is probably not worth the additional complexity. Where it > > is required or where we believe it is worth the trouble, that's where we > > want a detailed comment. > > > > What we want to do is ensure forward progress. I'm guessing that error > > conditions are going to be rare, so leaking the slot until a certain > > portion of them are gone, and then indicating a session fault to force > > the client to start over from scratch, is probably the most > > straightforward approach. > > > > So, is there a good reason to retry? There doesn't appear to be any > > reasoning mentioned in the commit log or in nearby comments. > > Agreed on the general comment. > > As for the "any reason to retry" - maybe. If it's a transient error we > don't want to give up early. Unfortunately that appears to be an > ambiguous situation, because SEQ_MISORDERED is allowed in place of > ERR_DELAY. I don't have any great suggestion however. > IMO, we should retry callbacks (basically) indefinitely, unless the NFSv4 client is being torn down (i.e. lease expires or an unmount happened, etc). > Jeff, to your point that the "client should figure out something is > wrong", I'm not sure how you think that will happen. If the server is > making a delegation recall and the client receive code chooses to reject > it at the sequence check, how would that eventually cause the client to > reestablish the session (on the forechannel)? > > In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which sets a flag in the client that makes it set SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. The client should take that as an indication that there is a problem and reestablish a new session (and maybe a new connection). Granted, it might take up to the next lease renewal, but there's not much else we can do if the client won't talk to us. That's why I was suggesting that we might time out the backchannel calls after two lease periods. OTOH, maybe it's sufficient to not queue any callbacks for courtesy clients? > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > > > > > > > > > > > + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > > > > > > + session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > > > > > > + goto retry_nowait; > > > > > > > > > + } > > > > > > > > > + fallthrough; > > > > > > > > > case -NFS4ERR_BADSLOT: > > > > > > > > > /* > > > > > > > > > * BADSLOT means that the client and server are out of sync > > > > > > > > > @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > > > > > > cb->cb_held_slot = -1; > > > > > > > > > goto retry_nowait; > > > > > > > > > - case -NFS4ERR_SEQ_MISORDERED: > > > > > > > > > - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { > > > > > > > > > - session->se_cb_seq_nr[cb->cb_held_slot] = 1; > > > > > > > > > - goto retry_nowait; > > > > > > > > > - } > > > > > > > > > - break; > > > > > > > > > default: > > > > > > > > > nfsd4_mark_cb_fault(cb->cb_clp); > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On 2/8/2025 9:14 PM, Jeff Layton wrote: > On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: >> On 2/8/2025 4:07 PM, Chuck Lever wrote: >>> On 2/8/25 3:45 PM, Jeff Layton wrote: >>>> On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: >>>>> On 2/8/2025 11:08 AM, Jeff Layton wrote: >>>>>> On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >>>>>>> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>>>>>>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>>>>>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>>>>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>>>>>>> fall back to treating it like a BADSLOT if that fails. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>>>>>>> --- >>>>>>>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>>>>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>>>> goto requeue; >>>>>>>>>> rpc_delay(task, 2 * HZ); >>>>>>>>>> return false; >>>>>>>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>>>>>>> + /* >>>>>>>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>>>>>>> + * like BADSLOT. >>>>>>>>>> + */ >>>>>>>>> >>>>>>>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>>>>>>> remove it. Is there a "why" statement that could be made here? Like, >>>>>>>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>>>>>>> >>>>>>>> >>>>>>>> There isn't one that I know of. It looks like Kinglong Mee added it in >>>>>>>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>>>>>>> >>>>>>>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>>>>>>> when we got this error, and we then retry with a seq_nr of 1? Does the >>>>>>>> server then treat that as a retransmission? >>>>>>> >>>>>>> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >>>>>>> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >>>>>>> >>>>>>> If so, yes definitely backing up the seq_nr to 1 will result in the >>>>>>> peer considering it to be a retransmission, which will be bad. >>>>>>> >>>>>> >>>>>> Yes, that's what I meant. >>>>>> >>>>>>>> We might be best off >>>>>>>> dropping this and just always treating it like BADSLOT. >>>>>>> >>>>>>> But, why would this happen? Usually I'd think the peer sent seq_nr X >>>>>>> before it received a reply to seq_nr X-1, which would be a peer bug. >>>>>>> >>>>>>> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >>>>>>> how does the requester know the difference? >>>>>>> >>>>>>> If treating it as BADSLOT completely resets the sequence, then sure, >>>>>>> but either a) the request is still in-progress, or b) if a bug is >>>>>>> causing the situation, well it's not going to converge on a functional >>>>>>> session. >>>>>>> >>>>>> >>>>>> With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT >>>>>> in the next forechannel SEQUENCE on the session. That should cause the >>>>>> client to (eventually) send a DESTROY_SESSION and create a new one. >>>>>> >>>>>> Unfortunately, in the meantime, because of the way the callback channel >>>>>> update works, the server can end up trying to send the callback again >>>>>> on the same session (and maybe more than once). I'm not sure that >>>>>> that's a real problem per-se, but it's less than ideal. >>>>>> >>>>>>> Not sure I have a solid suggestion right now. Whatever the fix, it >>>>>>> should capture any subtlety in a comment. >>>>>>> >>>>>> >>>>>> At this point, I'm leaning toward just treating it like BADSLOT. >>>>>> Basically, mark the backchannel faulty, and leak the slot so that >>>>>> nothing else uses it. That allows us to send backchannel requests on >>>>>> the other slots until the session gets recreated. >>>>> >>>>> Hmm, leaking the slot is a workable approach, as long as it doesn't >>>>> cascade more than a time or two. Some sort of trigger should be armed >>>>> to prevent runaway retries. >>>>> >>>>> It's maybe worth considering what state the peer might be in when this >>>>> happens. It too may effectively leak a slot, and if is retaining some >>>>> bogus state either as a result of or because of the previous exchange(s) >>>>> then this may lead to future hangs/failures. Not pretty, and maybe not >>>>> worth trying to guess. >>>>> >>>>> Tom. >>>>> >>>> >>>> >>>> The idea here is that eventually the client should figure out that >>>> something is wrong and reestablish the session. Currently we don't >>>> limit the number of retries on a callback. >>>> >>>> Maybe they should time out after a while? If we've retried a callback >>>> for more than two lease periods, give up and log something? >>>> >>>> Either way, I'd consider that to be follow-on work to this set. >>> >>> As a general comment, I think making a heroic effort to recover in any >>> of these cases is probably not worth the additional complexity. Where it >>> is required or where we believe it is worth the trouble, that's where we >>> want a detailed comment. >>> >>> What we want to do is ensure forward progress. I'm guessing that error >>> conditions are going to be rare, so leaking the slot until a certain >>> portion of them are gone, and then indicating a session fault to force >>> the client to start over from scratch, is probably the most >>> straightforward approach. >>> >>> So, is there a good reason to retry? There doesn't appear to be any >>> reasoning mentioned in the commit log or in nearby comments. >> >> Agreed on the general comment. >> >> As for the "any reason to retry" - maybe. If it's a transient error we >> don't want to give up early. Unfortunately that appears to be an >> ambiguous situation, because SEQ_MISORDERED is allowed in place of >> ERR_DELAY. I don't have any great suggestion however. >> > > IMO, we should retry callbacks (basically) indefinitely, unless the > NFSv4 client is being torn down (i.e. lease expires or an unmount > happened, etc). > >> Jeff, to your point that the "client should figure out something is >> wrong", I'm not sure how you think that will happen. If the server is >> making a delegation recall and the client receive code chooses to reject >> it at the sequence check, how would that eventually cause the client to >> reestablish the session (on the forechannel)? >> >> > > In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which > sets a flag in the client that makes it set > SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. Aha, that's good. RFC8881 only mentions it twice, but it's normative: SEQ4_STATUS_BACKCHANNEL_FAULT The server has encountered an unrecoverable fault with the backchannel (e.g., it has lost track of the sequence ID for a slot in the backchannel). The client MUST stop sending more requests on the session's fore channel, wait for all outstanding requests to complete on the fore and back channel, and then destroy the session. I guess my question is, what if the client ignores it anyway? What server code actually forces the recovery? Tom. > > The client should take that as an indication that there is a problem > and reestablish a new session (and maybe a new connection). Granted, it > might take up to the next lease renewal, but there's not much else we > can do if the client won't talk to us. > > That's why I was suggesting that we might time out the backchannel > calls after two lease periods. OTOH, maybe it's sufficient to not queue > any callbacks for courtesy clients? > >>> >>> >>>>>>>> Thoughts? >>>>>>>> >>>>>>>>> >>>>>>>>>> + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>>>>> + session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>>>>> + goto retry_nowait; >>>>>>>>>> + } >>>>>>>>>> + fallthrough; >>>>>>>>>> case -NFS4ERR_BADSLOT: >>>>>>>>>> /* >>>>>>>>>> * BADSLOT means that the client and server are out of sync >>>>>>>>>> @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>>>>> cb->cb_held_slot = -1; >>>>>>>>>> goto retry_nowait; >>>>>>>>>> - case -NFS4ERR_SEQ_MISORDERED: >>>>>>>>>> - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { >>>>>>>>>> - session->se_cb_seq_nr[cb->cb_held_slot] = 1; >>>>>>>>>> - goto retry_nowait; >>>>>>>>>> - } >>>>>>>>>> - break; >>>>>>>>>> default: >>>>>>>>>> nfsd4_mark_cb_fault(cb->cb_clp); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> >> >
On Sun, 2025-02-09 at 11:26 -0500, Tom Talpey wrote: > On 2/8/2025 9:14 PM, Jeff Layton wrote: > > On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: > > > On 2/8/2025 4:07 PM, Chuck Lever wrote: > > > > On 2/8/25 3:45 PM, Jeff Layton wrote: > > > > > On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: > > > > > > On 2/8/2025 11:08 AM, Jeff Layton wrote: > > > > > > > On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: > > > > > > > > On 2/8/2025 10:02 AM, Jeff Layton wrote: > > > > > > > > > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > > > > > > > > > > On 2/7/25 4:53 PM, Jeff Layton wrote: > > > > > > > > > > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > > > > > > > > > > fall back to treating it like a BADSLOT if that fails. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > --- > > > > > > > > > > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > > > > > > > > > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > > > > > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > > > > > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > > > > > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > > > > goto requeue; > > > > > > > > > > > rpc_delay(task, 2 * HZ); > > > > > > > > > > > return false; > > > > > > > > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > > > > > > > > + /* > > > > > > > > > > > + * Reattempt once with seq_nr 1. If that fails, treat this > > > > > > > > > > > + * like BADSLOT. > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > > > > > Nit: this comment says exactly what the code says. If it were me, I'd > > > > > > > > > > remove it. Is there a "why" statement that could be made here? Like, > > > > > > > > > > why retry with a seq_nr of 1 instead of just failing immediately? > > > > > > > > > > > > > > > > > > > > > > > > > > > > There isn't one that I know of. It looks like Kinglong Mee added it in > > > > > > > > > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > > > > > > > > > > > > > > > > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > > > > > > > > > when we got this error, and we then retry with a seq_nr of 1? Does the > > > > > > > > > server then treat that as a retransmission? > > > > > > > > > > > > > > > > So I assume you mean the requester sent seq_nr 1, saw a reply and sent a > > > > > > > > subsequent seq_nr 2, to which it gets SEQ_MISORDERED. > > > > > > > > > > > > > > > > If so, yes definitely backing up the seq_nr to 1 will result in the > > > > > > > > peer considering it to be a retransmission, which will be bad. > > > > > > > > > > > > > > > > > > > > > > Yes, that's what I meant. > > > > > > > > > > > > > > > > We might be best off > > > > > > > > > dropping this and just always treating it like BADSLOT. > > > > > > > > > > > > > > > > But, why would this happen? Usually I'd think the peer sent seq_nr X > > > > > > > > before it received a reply to seq_nr X-1, which would be a peer bug. > > > > > > > > > > > > > > > > OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, > > > > > > > > how does the requester know the difference? > > > > > > > > > > > > > > > > If treating it as BADSLOT completely resets the sequence, then sure, > > > > > > > > but either a) the request is still in-progress, or b) if a bug is > > > > > > > > causing the situation, well it's not going to converge on a functional > > > > > > > > session. > > > > > > > > > > > > > > > > > > > > > > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT > > > > > > > in the next forechannel SEQUENCE on the session. That should cause the > > > > > > > client to (eventually) send a DESTROY_SESSION and create a new one. > > > > > > > > > > > > > > Unfortunately, in the meantime, because of the way the callback channel > > > > > > > update works, the server can end up trying to send the callback again > > > > > > > on the same session (and maybe more than once). I'm not sure that > > > > > > > that's a real problem per-se, but it's less than ideal. > > > > > > > > > > > > > > > Not sure I have a solid suggestion right now. Whatever the fix, it > > > > > > > > should capture any subtlety in a comment. > > > > > > > > > > > > > > > > > > > > > > At this point, I'm leaning toward just treating it like BADSLOT. > > > > > > > Basically, mark the backchannel faulty, and leak the slot so that > > > > > > > nothing else uses it. That allows us to send backchannel requests on > > > > > > > the other slots until the session gets recreated. > > > > > > > > > > > > Hmm, leaking the slot is a workable approach, as long as it doesn't > > > > > > cascade more than a time or two. Some sort of trigger should be armed > > > > > > to prevent runaway retries. > > > > > > > > > > > > It's maybe worth considering what state the peer might be in when this > > > > > > happens. It too may effectively leak a slot, and if is retaining some > > > > > > bogus state either as a result of or because of the previous exchange(s) > > > > > > then this may lead to future hangs/failures. Not pretty, and maybe not > > > > > > worth trying to guess. > > > > > > > > > > > > Tom. > > > > > > > > > > > > > > > > > > > > > The idea here is that eventually the client should figure out that > > > > > something is wrong and reestablish the session. Currently we don't > > > > > limit the number of retries on a callback. > > > > > > > > > > Maybe they should time out after a while? If we've retried a callback > > > > > for more than two lease periods, give up and log something? > > > > > > > > > > Either way, I'd consider that to be follow-on work to this set. > > > > > > > > As a general comment, I think making a heroic effort to recover in any > > > > of these cases is probably not worth the additional complexity. Where it > > > > is required or where we believe it is worth the trouble, that's where we > > > > want a detailed comment. > > > > > > > > What we want to do is ensure forward progress. I'm guessing that error > > > > conditions are going to be rare, so leaking the slot until a certain > > > > portion of them are gone, and then indicating a session fault to force > > > > the client to start over from scratch, is probably the most > > > > straightforward approach. > > > > > > > > So, is there a good reason to retry? There doesn't appear to be any > > > > reasoning mentioned in the commit log or in nearby comments. > > > > > > Agreed on the general comment. > > > > > > As for the "any reason to retry" - maybe. If it's a transient error we > > > don't want to give up early. Unfortunately that appears to be an > > > ambiguous situation, because SEQ_MISORDERED is allowed in place of > > > ERR_DELAY. I don't have any great suggestion however. > > > > > > > IMO, we should retry callbacks (basically) indefinitely, unless the > > NFSv4 client is being torn down (i.e. lease expires or an unmount > > happened, etc). > > > > > Jeff, to your point that the "client should figure out something is > > > wrong", I'm not sure how you think that will happen. If the server is > > > making a delegation recall and the client receive code chooses to reject > > > it at the sequence check, how would that eventually cause the client to > > > reestablish the session (on the forechannel)? > > > > > > > > > > In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which > > sets a flag in the client that makes it set > > SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. > > Aha, that's good. RFC8881 only mentions it twice, but it's normative: > > SEQ4_STATUS_BACKCHANNEL_FAULT > The server has encountered an unrecoverable fault with the > backchannel (e.g., it has lost track of the sequence ID for a slot > in the backchannel). The client MUST stop sending more requests on > the session's fore channel, wait for all outstanding requests to > complete on the fore and back channel, and then destroy the session. > > I guess my question is, what if the client ignores it anyway? What > server code actually forces the recovery? > > Tom. > I don't think there is anything that does this right now. Does the RFC mention what the server should do if that happens? I suppose the server could just unilaterally destroy the session at some point, and force the client to reestablish it.
On 2/9/2025 11:51 AM, Jeff Layton wrote: > On Sun, 2025-02-09 at 11:26 -0500, Tom Talpey wrote: >> On 2/8/2025 9:14 PM, Jeff Layton wrote: >>> On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: >>>> On 2/8/2025 4:07 PM, Chuck Lever wrote: >>>>> On 2/8/25 3:45 PM, Jeff Layton wrote: >>>>>> On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: >>>>>>> On 2/8/2025 11:08 AM, Jeff Layton wrote: >>>>>>>> On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >>>>>>>>> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>>>>>>>>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>>>>>>>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>>>>>>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>>>>>>>>> fall back to treating it like a BADSLOT if that fails. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>>>>>>>>> --- >>>>>>>>>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>>>>>>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>>>>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>>>>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>>>>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>>>>>> goto requeue; >>>>>>>>>>>> rpc_delay(task, 2 * HZ); >>>>>>>>>>>> return false; >>>>>>>>>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>>>>>>>>> + /* >>>>>>>>>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>>>>>>>>> + * like BADSLOT. >>>>>>>>>>>> + */ >>>>>>>>>>> >>>>>>>>>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>>>>>>>>> remove it. Is there a "why" statement that could be made here? Like, >>>>>>>>>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> There isn't one that I know of. It looks like Kinglong Mee added it in >>>>>>>>>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>>>>>>>>> >>>>>>>>>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>>>>>>>>> when we got this error, and we then retry with a seq_nr of 1? Does the >>>>>>>>>> server then treat that as a retransmission? >>>>>>>>> >>>>>>>>> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >>>>>>>>> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >>>>>>>>> >>>>>>>>> If so, yes definitely backing up the seq_nr to 1 will result in the >>>>>>>>> peer considering it to be a retransmission, which will be bad. >>>>>>>>> >>>>>>>> >>>>>>>> Yes, that's what I meant. >>>>>>>> >>>>>>>>>> We might be best off >>>>>>>>>> dropping this and just always treating it like BADSLOT. >>>>>>>>> >>>>>>>>> But, why would this happen? Usually I'd think the peer sent seq_nr X >>>>>>>>> before it received a reply to seq_nr X-1, which would be a peer bug. >>>>>>>>> >>>>>>>>> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >>>>>>>>> how does the requester know the difference? >>>>>>>>> >>>>>>>>> If treating it as BADSLOT completely resets the sequence, then sure, >>>>>>>>> but either a) the request is still in-progress, or b) if a bug is >>>>>>>>> causing the situation, well it's not going to converge on a functional >>>>>>>>> session. >>>>>>>>> >>>>>>>> >>>>>>>> With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT >>>>>>>> in the next forechannel SEQUENCE on the session. That should cause the >>>>>>>> client to (eventually) send a DESTROY_SESSION and create a new one. >>>>>>>> >>>>>>>> Unfortunately, in the meantime, because of the way the callback channel >>>>>>>> update works, the server can end up trying to send the callback again >>>>>>>> on the same session (and maybe more than once). I'm not sure that >>>>>>>> that's a real problem per-se, but it's less than ideal. >>>>>>>> >>>>>>>>> Not sure I have a solid suggestion right now. Whatever the fix, it >>>>>>>>> should capture any subtlety in a comment. >>>>>>>>> >>>>>>>> >>>>>>>> At this point, I'm leaning toward just treating it like BADSLOT. >>>>>>>> Basically, mark the backchannel faulty, and leak the slot so that >>>>>>>> nothing else uses it. That allows us to send backchannel requests on >>>>>>>> the other slots until the session gets recreated. >>>>>>> >>>>>>> Hmm, leaking the slot is a workable approach, as long as it doesn't >>>>>>> cascade more than a time or two. Some sort of trigger should be armed >>>>>>> to prevent runaway retries. >>>>>>> >>>>>>> It's maybe worth considering what state the peer might be in when this >>>>>>> happens. It too may effectively leak a slot, and if is retaining some >>>>>>> bogus state either as a result of or because of the previous exchange(s) >>>>>>> then this may lead to future hangs/failures. Not pretty, and maybe not >>>>>>> worth trying to guess. >>>>>>> >>>>>>> Tom. >>>>>>> >>>>>> >>>>>> >>>>>> The idea here is that eventually the client should figure out that >>>>>> something is wrong and reestablish the session. Currently we don't >>>>>> limit the number of retries on a callback. >>>>>> >>>>>> Maybe they should time out after a while? If we've retried a callback >>>>>> for more than two lease periods, give up and log something? >>>>>> >>>>>> Either way, I'd consider that to be follow-on work to this set. >>>>> >>>>> As a general comment, I think making a heroic effort to recover in any >>>>> of these cases is probably not worth the additional complexity. Where it >>>>> is required or where we believe it is worth the trouble, that's where we >>>>> want a detailed comment. >>>>> >>>>> What we want to do is ensure forward progress. I'm guessing that error >>>>> conditions are going to be rare, so leaking the slot until a certain >>>>> portion of them are gone, and then indicating a session fault to force >>>>> the client to start over from scratch, is probably the most >>>>> straightforward approach. >>>>> >>>>> So, is there a good reason to retry? There doesn't appear to be any >>>>> reasoning mentioned in the commit log or in nearby comments. >>>> >>>> Agreed on the general comment. >>>> >>>> As for the "any reason to retry" - maybe. If it's a transient error we >>>> don't want to give up early. Unfortunately that appears to be an >>>> ambiguous situation, because SEQ_MISORDERED is allowed in place of >>>> ERR_DELAY. I don't have any great suggestion however. >>>> >>> >>> IMO, we should retry callbacks (basically) indefinitely, unless the >>> NFSv4 client is being torn down (i.e. lease expires or an unmount >>> happened, etc). >>> >>>> Jeff, to your point that the "client should figure out something is >>>> wrong", I'm not sure how you think that will happen. If the server is >>>> making a delegation recall and the client receive code chooses to reject >>>> it at the sequence check, how would that eventually cause the client to >>>> reestablish the session (on the forechannel)? >>>> >>>> >>> >>> In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which >>> sets a flag in the client that makes it set >>> SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. >> >> Aha, that's good. RFC8881 only mentions it twice, but it's normative: >> >> SEQ4_STATUS_BACKCHANNEL_FAULT >> The server has encountered an unrecoverable fault with the >> backchannel (e.g., it has lost track of the sequence ID for a slot >> in the backchannel). The client MUST stop sending more requests on >> the session's fore channel, wait for all outstanding requests to >> complete on the fore and back channel, and then destroy the session. >> >> I guess my question is, what if the client ignores it anyway? What >> server code actually forces the recovery? >> >> Tom. >> > > I don't think there is anything that does this right now. Does the RFC > mention what the server should do if that happens? I suppose the server > could just unilaterally destroy the session at some point, and force > the client to reestablish it. Nope. :( Like many other requirements, it's unenforced normatively. Perhaps we could consider this as an erratum, but it's more of an omission. Because of that, it may need IETF discussion ("at some point" needs a MUST). I'll volunteer to open an issue, if you agree to discuss! Tom.
On Sun, 2025-02-09 at 11:58 -0500, Tom Talpey wrote: > On 2/9/2025 11:51 AM, Jeff Layton wrote: > > On Sun, 2025-02-09 at 11:26 -0500, Tom Talpey wrote: > > > On 2/8/2025 9:14 PM, Jeff Layton wrote: > > > > On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: > > > > > On 2/8/2025 4:07 PM, Chuck Lever wrote: > > > > > > On 2/8/25 3:45 PM, Jeff Layton wrote: > > > > > > > On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: > > > > > > > > On 2/8/2025 11:08 AM, Jeff Layton wrote: > > > > > > > > > On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: > > > > > > > > > > On 2/8/2025 10:02 AM, Jeff Layton wrote: > > > > > > > > > > > On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: > > > > > > > > > > > > On 2/7/25 4:53 PM, Jeff Layton wrote: > > > > > > > > > > > > > For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then > > > > > > > > > > > > > fall back to treating it like a BADSLOT if that fails. > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > --- > > > > > > > > > > > > > fs/nfsd/nfs4callback.c | 16 ++++++++++------ > > > > > > > > > > > > > 1 file changed, 10 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > > > > > > > > index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 > > > > > > > > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > > > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > > > > > > > > @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback > > > > > > > > > > > > > goto requeue; > > > > > > > > > > > > > rpc_delay(task, 2 * HZ); > > > > > > > > > > > > > return false; > > > > > > > > > > > > > + case -NFS4ERR_SEQ_MISORDERED: > > > > > > > > > > > > > + /* > > > > > > > > > > > > > + * Reattempt once with seq_nr 1. If that fails, treat this > > > > > > > > > > > > > + * like BADSLOT. > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > > > > > > > > > Nit: this comment says exactly what the code says. If it were me, I'd > > > > > > > > > > > > remove it. Is there a "why" statement that could be made here? Like, > > > > > > > > > > > > why retry with a seq_nr of 1 instead of just failing immediately? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There isn't one that I know of. It looks like Kinglong Mee added it in > > > > > > > > > > > 7ba6cad6c88f, but there is no real mention of that in the changelog. > > > > > > > > > > > > > > > > > > > > > > TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 > > > > > > > > > > > when we got this error, and we then retry with a seq_nr of 1? Does the > > > > > > > > > > > server then treat that as a retransmission? > > > > > > > > > > > > > > > > > > > > So I assume you mean the requester sent seq_nr 1, saw a reply and sent a > > > > > > > > > > subsequent seq_nr 2, to which it gets SEQ_MISORDERED. > > > > > > > > > > > > > > > > > > > > If so, yes definitely backing up the seq_nr to 1 will result in the > > > > > > > > > > peer considering it to be a retransmission, which will be bad. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, that's what I meant. > > > > > > > > > > > > > > > > > > > > We might be best off > > > > > > > > > > > dropping this and just always treating it like BADSLOT. > > > > > > > > > > > > > > > > > > > > But, why would this happen? Usually I'd think the peer sent seq_nr X > > > > > > > > > > before it received a reply to seq_nr X-1, which would be a peer bug. > > > > > > > > > > > > > > > > > > > > OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, > > > > > > > > > > how does the requester know the difference? > > > > > > > > > > > > > > > > > > > > If treating it as BADSLOT completely resets the sequence, then sure, > > > > > > > > > > but either a) the request is still in-progress, or b) if a bug is > > > > > > > > > > causing the situation, well it's not going to converge on a functional > > > > > > > > > > session. > > > > > > > > > > > > > > > > > > > > > > > > > > > > With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT > > > > > > > > > in the next forechannel SEQUENCE on the session. That should cause the > > > > > > > > > client to (eventually) send a DESTROY_SESSION and create a new one. > > > > > > > > > > > > > > > > > > Unfortunately, in the meantime, because of the way the callback channel > > > > > > > > > update works, the server can end up trying to send the callback again > > > > > > > > > on the same session (and maybe more than once). I'm not sure that > > > > > > > > > that's a real problem per-se, but it's less than ideal. > > > > > > > > > > > > > > > > > > > Not sure I have a solid suggestion right now. Whatever the fix, it > > > > > > > > > > should capture any subtlety in a comment. > > > > > > > > > > > > > > > > > > > > > > > > > > > > At this point, I'm leaning toward just treating it like BADSLOT. > > > > > > > > > Basically, mark the backchannel faulty, and leak the slot so that > > > > > > > > > nothing else uses it. That allows us to send backchannel requests on > > > > > > > > > the other slots until the session gets recreated. > > > > > > > > > > > > > > > > Hmm, leaking the slot is a workable approach, as long as it doesn't > > > > > > > > cascade more than a time or two. Some sort of trigger should be armed > > > > > > > > to prevent runaway retries. > > > > > > > > > > > > > > > > It's maybe worth considering what state the peer might be in when this > > > > > > > > happens. It too may effectively leak a slot, and if is retaining some > > > > > > > > bogus state either as a result of or because of the previous exchange(s) > > > > > > > > then this may lead to future hangs/failures. Not pretty, and maybe not > > > > > > > > worth trying to guess. > > > > > > > > > > > > > > > > Tom. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The idea here is that eventually the client should figure out that > > > > > > > something is wrong and reestablish the session. Currently we don't > > > > > > > limit the number of retries on a callback. > > > > > > > > > > > > > > Maybe they should time out after a while? If we've retried a callback > > > > > > > for more than two lease periods, give up and log something? > > > > > > > > > > > > > > Either way, I'd consider that to be follow-on work to this set. > > > > > > > > > > > > As a general comment, I think making a heroic effort to recover in any > > > > > > of these cases is probably not worth the additional complexity. Where it > > > > > > is required or where we believe it is worth the trouble, that's where we > > > > > > want a detailed comment. > > > > > > > > > > > > What we want to do is ensure forward progress. I'm guessing that error > > > > > > conditions are going to be rare, so leaking the slot until a certain > > > > > > portion of them are gone, and then indicating a session fault to force > > > > > > the client to start over from scratch, is probably the most > > > > > > straightforward approach. > > > > > > > > > > > > So, is there a good reason to retry? There doesn't appear to be any > > > > > > reasoning mentioned in the commit log or in nearby comments. > > > > > > > > > > Agreed on the general comment. > > > > > > > > > > As for the "any reason to retry" - maybe. If it's a transient error we > > > > > don't want to give up early. Unfortunately that appears to be an > > > > > ambiguous situation, because SEQ_MISORDERED is allowed in place of > > > > > ERR_DELAY. I don't have any great suggestion however. > > > > > > > > > > > > > IMO, we should retry callbacks (basically) indefinitely, unless the > > > > NFSv4 client is being torn down (i.e. lease expires or an unmount > > > > happened, etc). > > > > > > > > > Jeff, to your point that the "client should figure out something is > > > > > wrong", I'm not sure how you think that will happen. If the server is > > > > > making a delegation recall and the client receive code chooses to reject > > > > > it at the sequence check, how would that eventually cause the client to > > > > > reestablish the session (on the forechannel)? > > > > > > > > > > > > > > > > > > In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which > > > > sets a flag in the client that makes it set > > > > SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. > > > > > > Aha, that's good. RFC8881 only mentions it twice, but it's normative: > > > > > > SEQ4_STATUS_BACKCHANNEL_FAULT > > > The server has encountered an unrecoverable fault with the > > > backchannel (e.g., it has lost track of the sequence ID for a slot > > > in the backchannel). The client MUST stop sending more requests on > > > the session's fore channel, wait for all outstanding requests to > > > complete on the fore and back channel, and then destroy the session. > > > > > > I guess my question is, what if the client ignores it anyway? What > > > server code actually forces the recovery? > > > > > > Tom. > > > > > > > I don't think there is anything that does this right now. Does the RFC > > mention what the server should do if that happens? I suppose the server > > could just unilaterally destroy the session at some point, and force > > the client to reestablish it. > > Nope. :( Like many other requirements, it's unenforced normatively. > > Perhaps we could consider this as an erratum, but it's more of an > omission. Because of that, it may need IETF discussion ("at some point" > needs a MUST). I'll volunteer to open an issue, if you agree to discuss! Sure. I don't have a whole lot to add, but my proposal would be: If the server sends a SEQUENCE reply with SEQ4_STATUS_BACKCHANNEL_FAULT set, then the client has another lease period in which to reestablish the session. After that, the server may unilaterally drop the session and start returning NFS4ERR_BADSESSION to attempts to use it.
On 2/9/2025 12:05 PM, Jeff Layton wrote: > On Sun, 2025-02-09 at 11:58 -0500, Tom Talpey wrote: >> On 2/9/2025 11:51 AM, Jeff Layton wrote: >>> On Sun, 2025-02-09 at 11:26 -0500, Tom Talpey wrote: >>>> On 2/8/2025 9:14 PM, Jeff Layton wrote: >>>>> On Sat, 2025-02-08 at 20:24 -0500, Tom Talpey wrote: >>>>>> On 2/8/2025 4:07 PM, Chuck Lever wrote: >>>>>>> On 2/8/25 3:45 PM, Jeff Layton wrote: >>>>>>>> On Sat, 2025-02-08 at 14:18 -0500, Tom Talpey wrote: >>>>>>>>> On 2/8/2025 11:08 AM, Jeff Layton wrote: >>>>>>>>>> On Sat, 2025-02-08 at 13:40 -0500, Tom Talpey wrote: >>>>>>>>>>> On 2/8/2025 10:02 AM, Jeff Layton wrote: >>>>>>>>>>>> On Sat, 2025-02-08 at 12:01 -0500, Chuck Lever wrote: >>>>>>>>>>>>> On 2/7/25 4:53 PM, Jeff Layton wrote: >>>>>>>>>>>>>> For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then >>>>>>>>>>>>>> fall back to treating it like a BADSLOT if that fails. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org> >>>>>>>>>>>>>> --- >>>>>>>>>>>>>> fs/nfsd/nfs4callback.c | 16 ++++++++++------ >>>>>>>>>>>>>> 1 file changed, 10 insertions(+), 6 deletions(-) >>>>>>>>>>>>>> >>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>>>>>>>>> index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 >>>>>>>>>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>>>>>>>>> @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback >>>>>>>>>>>>>> goto requeue; >>>>>>>>>>>>>> rpc_delay(task, 2 * HZ); >>>>>>>>>>>>>> return false; >>>>>>>>>>>>>> + case -NFS4ERR_SEQ_MISORDERED: >>>>>>>>>>>>>> + /* >>>>>>>>>>>>>> + * Reattempt once with seq_nr 1. If that fails, treat this >>>>>>>>>>>>>> + * like BADSLOT. >>>>>>>>>>>>>> + */ >>>>>>>>>>>>> >>>>>>>>>>>>> Nit: this comment says exactly what the code says. If it were me, I'd >>>>>>>>>>>>> remove it. Is there a "why" statement that could be made here? Like, >>>>>>>>>>>>> why retry with a seq_nr of 1 instead of just failing immediately? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> There isn't one that I know of. It looks like Kinglong Mee added it in >>>>>>>>>>>> 7ba6cad6c88f, but there is no real mention of that in the changelog. >>>>>>>>>>>> >>>>>>>>>>>> TBH, I'm not enamored with this remedy either. What if the seq_nr was 2 >>>>>>>>>>>> when we got this error, and we then retry with a seq_nr of 1? Does the >>>>>>>>>>>> server then treat that as a retransmission? >>>>>>>>>>> >>>>>>>>>>> So I assume you mean the requester sent seq_nr 1, saw a reply and sent a >>>>>>>>>>> subsequent seq_nr 2, to which it gets SEQ_MISORDERED. >>>>>>>>>>> >>>>>>>>>>> If so, yes definitely backing up the seq_nr to 1 will result in the >>>>>>>>>>> peer considering it to be a retransmission, which will be bad. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes, that's what I meant. >>>>>>>>>> >>>>>>>>>>>> We might be best off >>>>>>>>>>>> dropping this and just always treating it like BADSLOT. >>>>>>>>>>> >>>>>>>>>>> But, why would this happen? Usually I'd think the peer sent seq_nr X >>>>>>>>>>> before it received a reply to seq_nr X-1, which would be a peer bug. >>>>>>>>>>> >>>>>>>>>>> OTOH, SEQ_MISORDERED is a valid response to an in-progress retry. So, >>>>>>>>>>> how does the requester know the difference? >>>>>>>>>>> >>>>>>>>>>> If treating it as BADSLOT completely resets the sequence, then sure, >>>>>>>>>>> but either a) the request is still in-progress, or b) if a bug is >>>>>>>>>>> causing the situation, well it's not going to converge on a functional >>>>>>>>>>> session. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> With this patchset, on BADSLOT, we'll set SEQ4_STATUS_BACKCHANNEL_FAULT >>>>>>>>>> in the next forechannel SEQUENCE on the session. That should cause the >>>>>>>>>> client to (eventually) send a DESTROY_SESSION and create a new one. >>>>>>>>>> >>>>>>>>>> Unfortunately, in the meantime, because of the way the callback channel >>>>>>>>>> update works, the server can end up trying to send the callback again >>>>>>>>>> on the same session (and maybe more than once). I'm not sure that >>>>>>>>>> that's a real problem per-se, but it's less than ideal. >>>>>>>>>> >>>>>>>>>>> Not sure I have a solid suggestion right now. Whatever the fix, it >>>>>>>>>>> should capture any subtlety in a comment. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> At this point, I'm leaning toward just treating it like BADSLOT. >>>>>>>>>> Basically, mark the backchannel faulty, and leak the slot so that >>>>>>>>>> nothing else uses it. That allows us to send backchannel requests on >>>>>>>>>> the other slots until the session gets recreated. >>>>>>>>> >>>>>>>>> Hmm, leaking the slot is a workable approach, as long as it doesn't >>>>>>>>> cascade more than a time or two. Some sort of trigger should be armed >>>>>>>>> to prevent runaway retries. >>>>>>>>> >>>>>>>>> It's maybe worth considering what state the peer might be in when this >>>>>>>>> happens. It too may effectively leak a slot, and if is retaining some >>>>>>>>> bogus state either as a result of or because of the previous exchange(s) >>>>>>>>> then this may lead to future hangs/failures. Not pretty, and maybe not >>>>>>>>> worth trying to guess. >>>>>>>>> >>>>>>>>> Tom. >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The idea here is that eventually the client should figure out that >>>>>>>> something is wrong and reestablish the session. Currently we don't >>>>>>>> limit the number of retries on a callback. >>>>>>>> >>>>>>>> Maybe they should time out after a while? If we've retried a callback >>>>>>>> for more than two lease periods, give up and log something? >>>>>>>> >>>>>>>> Either way, I'd consider that to be follow-on work to this set. >>>>>>> >>>>>>> As a general comment, I think making a heroic effort to recover in any >>>>>>> of these cases is probably not worth the additional complexity. Where it >>>>>>> is required or where we believe it is worth the trouble, that's where we >>>>>>> want a detailed comment. >>>>>>> >>>>>>> What we want to do is ensure forward progress. I'm guessing that error >>>>>>> conditions are going to be rare, so leaking the slot until a certain >>>>>>> portion of them are gone, and then indicating a session fault to force >>>>>>> the client to start over from scratch, is probably the most >>>>>>> straightforward approach. >>>>>>> >>>>>>> So, is there a good reason to retry? There doesn't appear to be any >>>>>>> reasoning mentioned in the commit log or in nearby comments. >>>>>> >>>>>> Agreed on the general comment. >>>>>> >>>>>> As for the "any reason to retry" - maybe. If it's a transient error we >>>>>> don't want to give up early. Unfortunately that appears to be an >>>>>> ambiguous situation, because SEQ_MISORDERED is allowed in place of >>>>>> ERR_DELAY. I don't have any great suggestion however. >>>>>> >>>>> >>>>> IMO, we should retry callbacks (basically) indefinitely, unless the >>>>> NFSv4 client is being torn down (i.e. lease expires or an unmount >>>>> happened, etc). >>>>> >>>>>> Jeff, to your point that the "client should figure out something is >>>>>> wrong", I'm not sure how you think that will happen. If the server is >>>>>> making a delegation recall and the client receive code chooses to reject >>>>>> it at the sequence check, how would that eventually cause the client to >>>>>> reestablish the session (on the forechannel)? >>>>>> >>>>>> >>>>> >>>>> In the BADSLOT case, it calls nfsd4_mark_cb_fault(cb->cb_clp), which >>>>> sets a flag in the client that makes it set >>>>> SEQ4_STATUS_BACKCHANNEL_FAULT in the next SEQUENCE call. >>>> >>>> Aha, that's good. RFC8881 only mentions it twice, but it's normative: >>>> >>>> SEQ4_STATUS_BACKCHANNEL_FAULT >>>> The server has encountered an unrecoverable fault with the >>>> backchannel (e.g., it has lost track of the sequence ID for a slot >>>> in the backchannel). The client MUST stop sending more requests on >>>> the session's fore channel, wait for all outstanding requests to >>>> complete on the fore and back channel, and then destroy the session. >>>> >>>> I guess my question is, what if the client ignores it anyway? What >>>> server code actually forces the recovery? >>>> >>>> Tom. >>>> >>> >>> I don't think there is anything that does this right now. Does the RFC >>> mention what the server should do if that happens? I suppose the server >>> could just unilaterally destroy the session at some point, and force >>> the client to reestablish it. >> >> Nope. :( Like many other requirements, it's unenforced normatively. >> >> Perhaps we could consider this as an erratum, but it's more of an >> omission. Because of that, it may need IETF discussion ("at some point" >> needs a MUST). I'll volunteer to open an issue, if you agree to discuss! > > Sure. I don't have a whole lot to add, but my proposal would be: > > If the server sends a SEQUENCE reply with SEQ4_STATUS_BACKCHANNEL_FAULT > set, then the client has another lease period in which to reestablish > the session. After that, the server may unilaterally drop the session > and start returning NFS4ERR_BADSESSION to attempts to use it. Oh, I don't think it needs to wait one lease period, but that's an interesting approach. The language needs to be normative and a bit more clear. I'll take it to the IETF this week. Tom.
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c index 10067a34db3afff8d4e4383854ab9abd9767c2d6..d6e3e8bb2efabadda9f922318880e12e1cb2c23f 100644 --- a/fs/nfsd/nfs4callback.c +++ b/fs/nfsd/nfs4callback.c @@ -1393,6 +1393,16 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback goto requeue; rpc_delay(task, 2 * HZ); return false; + case -NFS4ERR_SEQ_MISORDERED: + /* + * Reattempt once with seq_nr 1. If that fails, treat this + * like BADSLOT. + */ + if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { + session->se_cb_seq_nr[cb->cb_held_slot] = 1; + goto retry_nowait; + } + fallthrough; case -NFS4ERR_BADSLOT: /* * BADSLOT means that the client and server are out of sync @@ -1403,12 +1413,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback nfsd4_mark_cb_fault(cb->cb_clp); cb->cb_held_slot = -1; goto retry_nowait; - case -NFS4ERR_SEQ_MISORDERED: - if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) { - session->se_cb_seq_nr[cb->cb_held_slot] = 1; - goto retry_nowait; - } - break; default: nfsd4_mark_cb_fault(cb->cb_clp); }
For NFS4ERR_SEQ_MISORDERED, do one attempt with a seqid of 1, and then fall back to treating it like a BADSLOT if that fails. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/nfsd/nfs4callback.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-)