Message ID: ea44b46c7546579386d8d9e1a2b62c152534b6cc.1742941932.git.trond.myklebust@hammerspace.com (mailing list archive)
State: New
Series: Ensure that ENETUNREACH terminates state recovery
On 25 Mar 2025, at 18:35, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> If a containerised process is killed and causes an ENETUNREACH or
> ENETDOWN error to be propagated to the state manager, then mark the
> nfs_client as being dead so that we don't loop in functions that are
> expecting recovery to succeed.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/nfs4state.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index 272d2ebdae0f..7612e977e80b 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -2739,7 +2739,15 @@ static void nfs4_state_manager(struct nfs_client *clp)
>  		pr_warn_ratelimited("NFS: state manager%s%s failed on NFSv4 server %s"
>  				" with error %d\n", section_sep, section,
>  				clp->cl_hostname, -status);
> -		ssleep(1);
> +		switch (status) {
> +		case -ENETDOWN:
> +		case -ENETUNREACH:
> +			nfs_mark_client_ready(clp, -EIO);
> +			break;
> +		default:
> +			ssleep(1);
> +			break;
> +		}
>  out_drain:
>  	memalloc_nofs_restore(memflags);
>  	nfs4_end_drain_session(clp);
> --
> 2.49.0

Doesn't this have the same bug as the sysfs shutdown - in that a mount with
fatal_neterrors=ENETDOWN:ENETUNREACH can take down the state manager for a
mount without it? I think the same consideration applies as shutdown so far:
in practical use, you're not going to care.

Another thought - it's pretty subtle that the only way those errors
might/should reach us here is if that mount option is in play.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>

Ben
On Wed, 2025-03-26 at 06:39 -0400, Benjamin Coddington wrote:
> On 25 Mar 2025, at 18:35, trondmy@kernel.org wrote:
>
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > If a containerised process is killed and causes an ENETUNREACH or
> > ENETDOWN error to be propagated to the state manager, then mark the
> > nfs_client as being dead so that we don't loop in functions that are
> > expecting recovery to succeed.
> >
> > [patch quoted in full above]
>
> Doesn't this have the same bug as the sysfs shutdown - in that a mount
> with fatal_neterrors=ENETDOWN:ENETUNREACH can take down the state
> manager for a mount without it? I think the same consideration applies
> as shutdown so far: in practical use, you're not going to care.

True. In principle we probably ought to avoid sharing superblocks
between mounts with different fatal_errors= options. In practice, I
agree that no one will care since this means that the server is borked.

> Another thought - it's pretty subtle that the only way those errors
> might/should reach us here is if that mount option is in play.
>
> Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
>
> Ben
On Wed, 2025-03-26 at 07:18 -0400, Jeff Layton wrote:
> On Wed, 2025-03-26 at 06:39 -0400, Benjamin Coddington wrote:
> > On 25 Mar 2025, at 18:35, trondmy@kernel.org wrote:
> >
> > > [patch quoted in full above]
> >
> > Doesn't this have the same bug as the sysfs shutdown - in that a
> > mount with fatal_neterrors=ENETDOWN:ENETUNREACH can take down the
> > state manager for a mount without it? I think the same consideration
> > applies as shutdown so far: in practical use, you're not going to
> > care.
>
> True. In principle we probably ought to avoid sharing superblocks
> between mounts with different fatal_errors= options. In practice, I
> agree that no one will care since this means that the server is
> borked.
>
> > Another thought - it's pretty subtle that the only way those errors
> > might/should reach us here is if that mount option is in play.
> >
> > Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
> >
> > Ben

The difference here is that the fatal_neterrors option, as it stands
today, only looks at whether or not your request is routable. It is not
intended to function as a mechanism for dealing with servers being down.
It only works as a last resort for when the host's orchestrator software
destroys a container's local virtual network devices without first
ensuring that its mounts have been safely shut down.

IOW: yes, there are some subtleties here, which is why we need a mount
option in the first place to allow override of the default behaviours.
However, those default settings as they stand are hopefully sufficiently
conservative that admins should only rarely need to override them.
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 272d2ebdae0f..7612e977e80b 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -2739,7 +2739,15 @@ static void nfs4_state_manager(struct nfs_client *clp)
 		pr_warn_ratelimited("NFS: state manager%s%s failed on NFSv4 server %s"
 				" with error %d\n", section_sep, section,
 				clp->cl_hostname, -status);
-		ssleep(1);
+		switch (status) {
+		case -ENETDOWN:
+		case -ENETUNREACH:
+			nfs_mark_client_ready(clp, -EIO);
+			break;
+		default:
+			ssleep(1);
+			break;
+		}
 out_drain:
 	memalloc_nofs_restore(memflags);
 	nfs4_end_drain_session(clp);