
[RFC,0/4] Containerised NFS clients and teardown

Message ID: cover.1742490771.git.trond.myklebust@hammerspace.com

Message

Trond Myklebust March 20, 2025, 5:44 p.m. UTC
From: Trond Myklebust <trond.myklebust@hammerspace.com>

When an NFS client is started from inside a container, it is often not
possible to ensure a safe shutdown and flush of the data before the
container orchestrator steps in to tear down the network. Typically,
what can happen is that the orchestrator triggers a lazy umount of the
mounted filesystems, then proceeds to delete virtual network device
links, bridges, NAT configurations, etc.

Once that happens, it may be impossible to reach into the container to
perform any further shutdown actions on the NFS client.

This patchset proposes to allow the client to deal with these situations
by treating the two errors ENETDOWN and ENETUNREACH as fatal.
The intention is to then allow the I/O queue to drain, and any remaining
RPC calls to error out, so that the lazy umounts can complete the
shutdown process.
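Conceptually, the change boils down to classifying these two networking
errnos as terminal rather than retryable. The sketch below is purely
illustrative (the function name is invented; the real logic is spread
across net/sunrpc/clnt.c and fs/nfs in the patches):

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Illustrative sketch, not the actual patch: once the container's
 * network has been torn down, retrying after ENETDOWN/ENETUNREACH
 * cannot succeed, so treat them as terminal for the RPC call.
 */
static bool nfs_net_error_is_fatal(int err)
{
	return err == -ENETDOWN || err == -ENETUNREACH;
}
```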

In order to do so, a new mount option "fatal_errors" is introduced,
which can take the values "default", "none" and "enetdown:enetunreach".
The value "none" forces the existing behaviour, whereby hard mounts are
unaffected by the ENETDOWN and ENETUNREACH errors.
The value "enetdown:enetunreach" forces ENETDOWN and ENETUNREACH errors
to always be fatal.
If the user does not specify the "fatal_errors" option, or uses the
value "default", then ENETDOWN and ENETUNREACH will be fatal if the
mount was started from inside a network namespace other than
"init_net", and non-fatal otherwise.
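The resolution of the three option values can be sketched as follows.
This is an illustrative model only; the enum and function names are
invented here, and the real parsing lives in fs/nfs/fs_context.c:

```c
#include <stdbool.h>

/* Invented names for illustration; not the identifiers in the patches. */
enum fatal_errors_opt {
	FATAL_ERRORS_DEFAULT,			/* decide from the netns */
	FATAL_ERRORS_NONE,			/* keep existing hard-mount behaviour */
	FATAL_ERRORS_ENETDOWN_ENETUNREACH,	/* always fatal */
};

/*
 * in_init_net: whether the mount was created in the initial network
 * namespace, i.e. not inside a container's private netns.
 */
static bool netdown_is_fatal(enum fatal_errors_opt opt, bool in_init_net)
{
	switch (opt) {
	case FATAL_ERRORS_NONE:
		return false;
	case FATAL_ERRORS_ENETDOWN_ENETUNREACH:
		return true;
	default:				/* FATAL_ERRORS_DEFAULT */
		return !in_init_net;
	}
}
```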

The expectation is that users will normally not need to set this option
unless they are running inside a container and want to prevent ENETDOWN
and ENETUNREACH from being fatal, by setting "-ofatal_errors=none".

Trond Myklebust (4):
  NFS: Add a mount option to make ENETUNREACH errors fatal
  NFS: Treat ENETUNREACH errors as fatal in containers
  pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers
  pNFS/flexfiles: Report ENETDOWN as a connection error

 fs/nfs/client.c                        |  5 ++++
 fs/nfs/flexfilelayout/flexfilelayout.c | 24 ++++++++++++++--
 fs/nfs/fs_context.c                    | 38 ++++++++++++++++++++++++++
 fs/nfs/nfs3client.c                    |  2 ++
 fs/nfs/nfs4client.c                    |  5 ++++
 fs/nfs/nfs4proc.c                      |  3 ++
 fs/nfs/super.c                         |  2 ++
 include/linux/nfs4.h                   |  1 +
 include/linux/nfs_fs_sb.h              |  2 ++
 include/linux/sunrpc/clnt.h            |  5 +++-
 include/linux/sunrpc/sched.h           |  1 +
 net/sunrpc/clnt.c                      | 30 ++++++++++++++------
 12 files changed, 107 insertions(+), 11 deletions(-)

Comments

Jeff Layton March 20, 2025, 7:32 p.m. UTC | #1
On Thu, 2025-03-20 at 13:44 -0400, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> [...]
> 

I like the concept, but unfortunately it doesn't help with the
reproducer I have. The rpc_tasks remain stuck. Here's the contents of
the rpc_tasks file:

  252 c825      0 0x3 0xd2147cd2     2147 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  251 c825      0 0x3 0xd3147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  241 c825      0 0x3 0xd4147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  531 c825      0 0x3 0xd5147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  640 c825      0 0x3 0xd6147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  634 c825      0 0x3 0xd7147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  564 c825      0 0x3 0xd8147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  567 c825      0 0x3 0xd9147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  258 c825      0 0x3 0xda147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  259 c825      0 0x3 0xdb147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1159 c825      0 0x3 0xdc147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  246 c825      0 0x3 0xdd147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  536 c825      0 0x3 0xde147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  645 c825      0 0x3 0xdf147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  637 c825      0 0x3 0xe0147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  572 c825      0 0x3 0xe1147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  568 c825      0 0x3 0xe2147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  263 c825      0 0x3 0xe3147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1163 c825      0 0x3 0xe4147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  262 c825      0 0x3 0xe5147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1162 c825      0 0x3 0xe6147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  250 c825      0 0x3 0xe7147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  537 c825      0 0x3 0xe8147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  646 c825      0 0x3 0xe9147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  642 c825      0 0x3 0xea147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1165 c825      0 0x3 0xeb147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  579 c825      0 0x3 0xec147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  574 c825      0 0x3 0xed147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  269 c825      0 0x3 0xee147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  265 c825      0 0x3 0xef147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq

I turned up a bunch of tracepoints, and collected some output for a
while waiting for the tasks to die. It's attached.

I see some ENETUNREACH (-101) errors in there, but the rpc_tasks didn't
die off. It looks sort of like the rpc_task flag didn't get set
properly? I'll plan to take a closer look tomorrow unless you figure it
out.
Trond Myklebust March 20, 2025, 8:40 p.m. UTC | #2
On Thu, 2025-03-20 at 15:32 -0400, Jeff Layton wrote:
> On Thu, 2025-03-20 at 13:44 -0400, trondmy@kernel.org wrote:
> > [...]
> 
> I like the concept, but unfortunately it doesn't help with the
> reproducer I have. The rpc_tasks remain stuck.
> 
> [...]
> 
> I see some ENETUNREACH (-101) errors in there, but the rpc_tasks didn't
> die off. It looks sort of like the rpc_task flag didn't get set
> properly? I'll plan to take a closer look tomorrow unless you figure it
> out.


Ah, crap... The client clp->cl_flag gets initialised differently in
NFSv4, so the mount flag wasn't getting propagated.

A v2 is forthcoming with the fix.