Message ID | cover.1742490771.git.trond.myklebust@hammerspace.com (mailing list archive)
---|---
Series | Containerised NFS clients and teardown
On Thu, 2025-03-20 at 13:44 -0400, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When a NFS client is started from inside a container, it is often not
> possible to ensure a safe shutdown and flush of the data before the
> container orchestrator steps in to tear down the network. Typically,
> what can happen is that the orchestrator triggers a lazy umount of the
> mounted filesystems, then proceeds to delete virtual network device
> links, bridges, NAT configurations, etc.
>
> Once that happens, it may be impossible to reach into the container to
> perform any further shutdown actions on the NFS client.
>
> This patchset proposes to allow the client to deal with these situations
> by treating the two errors ENETDOWN and ENETUNREACH as being fatal.
> The intention is to then allow the I/O queue to drain, and any remaining
> RPC calls to error out, so that the lazy umounts can complete the
> shutdown process.
>
> In order to do so, a new mount option "fatal_errors" is introduced,
> which can take the values "default", "none" and "enetdown:enetunreach".
> The value "none" forces the existing behaviour, whereby hard mounts are
> unaffected by the ENETDOWN and ENETUNREACH errors.
> The value "enetdown:enetunreach" forces ENETDOWN and ENETUNREACH errors
> to always be fatal.
> If the user does not specify the "fatal_errors" option, or uses the
> value "default", then ENETDOWN and ENETUNREACH will be fatal if the
> mount was started from inside a network namespace that is not
> "init_net", and otherwise not.
>
> The expectation is that users will normally not need to set this option,
> unless they are running inside a container, and want to prevent ENETDOWN
> and ENETUNREACH from being fatal by setting "-ofatal_errors=none".
>
> Trond Myklebust (4):
>   NFS: Add a mount option to make ENETUNREACH errors fatal
>   NFS: Treat ENETUNREACH errors as fatal in containers
>   pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers
>   pNFS/flexfiles: Report ENETDOWN as a connection error
>
>  fs/nfs/client.c                        |  5 ++++
>  fs/nfs/flexfilelayout/flexfilelayout.c | 24 ++++++++++++++--
>  fs/nfs/fs_context.c                    | 38 ++++++++++++++++++++++++++
>  fs/nfs/nfs3client.c                    |  2 ++
>  fs/nfs/nfs4client.c                    |  5 ++++
>  fs/nfs/nfs4proc.c                      |  3 ++
>  fs/nfs/super.c                         |  2 ++
>  include/linux/nfs4.h                   |  1 +
>  include/linux/nfs_fs_sb.h              |  2 ++
>  include/linux/sunrpc/clnt.h            |  5 +++-
>  include/linux/sunrpc/sched.h           |  1 +
>  net/sunrpc/clnt.c                      | 30 ++++++++++++++------
>  12 files changed, 107 insertions(+), 11 deletions(-)
>

I like the concept, but unfortunately it doesn't help with the
reproducer I have. The rpc_tasks remain stuck.
Here's the contents of the rpc_tasks file:

  252 c825      0 0x3 0xd2147cd2     2147 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  251 c825      0 0x3 0xd3147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  241 c825      0 0x3 0xd4147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  531 c825      0 0x3 0xd5147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  640 c825      0 0x3 0xd6147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  634 c825      0 0x3 0xd7147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  564 c825      0 0x3 0xd8147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  567 c825      0 0x3 0xd9147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  258 c825      0 0x3 0xda147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  259 c825      0 0x3 0xdb147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1159 c825      0 0x3 0xdc147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  246 c825      0 0x3 0xdd147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  536 c825      0 0x3 0xde147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  645 c825      0 0x3 0xdf147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  637 c825      0 0x3 0xe0147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  572 c825      0 0x3 0xe1147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  568 c825      0 0x3 0xe2147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  263 c825      0 0x3 0xe3147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1163 c825      0 0x3 0xe4147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  262 c825      0 0x3 0xe5147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1162 c825      0 0x3 0xe6147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  250 c825      0 0x3 0xe7147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  537 c825      0 0x3 0xe8147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  646 c825      0 0x3 0xe9147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 READ a:call_bind [sunrpc] q:delayq
  642 c825      0 0x3 0xea147cd2     2146 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
 1165 c825      0 0x3 0xeb147cd2     2146 nfs_commit_ops [nfs] nfsv4 COMMIT a:call_bind [sunrpc] q:delayq
  579 c825      0 0x3 0xec147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  574 c825      0 0x3 0xed147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  269 c825      0 0x3 0xee147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq
  265 c825      0 0x3 0xef147cd2     2145 nfs_pgio_common_ops [nfs] nfsv4 WRITE a:call_bind [sunrpc] q:delayq

I turned up a bunch of tracepoints, and collected some output for a
while waiting for the tasks to die. It's attached.

I see some ENETUNREACH (-101) errors in there, but the rpc_tasks didn't
die off. It looks sort of like the rpc_task flag didn't get set
properly? I'll plan to take a closer look tomorrow unless you figure it
out.
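For reference, the task listing above is the per-client "tasks" file exposed by the SunRPC debugfs interface, and the trace output comes from the sunrpc tracepoints. A minimal sketch of how such data is typically collected, assuming debugfs and tracefs are mounted in their usual locations:

```sh
# Hedged example: inspect stuck RPC task state and enable the sunrpc
# tracepoints. Paths assume the conventional debugfs mount point; the
# per-client directory name (the rpc_clnt ID) varies per mount.

# Dump the task lists for all RPC clients:
cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks

# Turn on all sunrpc tracepoints and watch the trace buffer:
echo 1 > /sys/kernel/debug/tracing/events/sunrpc/enable
cat /sys/kernel/debug/tracing/trace_pipe
```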
On Thu, 2025-03-20 at 15:32 -0400, Jeff Layton wrote:
> On Thu, 2025-03-20 at 13:44 -0400, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > [... cover letter snipped; quoted in full earlier in the thread ...]
>
> I like the concept, but unfortunately it doesn't help with the
> reproducer I have. The rpc_tasks remain stuck.
>
> Here's the contents of the rpc_tasks file:
>
> [... rpc_tasks listing snipped; quoted in full earlier in the thread ...]
>
> I turned up a bunch of tracepoints, and collected some output for a
> while waiting for the tasks to die. It's attached.
>
> I see some ENETUNREACH (-101) errors in there, but the rpc_tasks
> didn't die off. It looks sort of like the rpc_task flag didn't get
> set properly? I'll plan to take a closer look tomorrow unless you
> figure it out.

Ah, crap...
The client clp->cl_flag gets initialised differently in NFSv4, so the mount flag wasn't getting propagated. A v2 is forthcoming with the fix.
From: Trond Myklebust <trond.myklebust@hammerspace.com>

When a NFS client is started from inside a container, it is often not
possible to ensure a safe shutdown and flush of the data before the
container orchestrator steps in to tear down the network. Typically,
what can happen is that the orchestrator triggers a lazy umount of the
mounted filesystems, then proceeds to delete virtual network device
links, bridges, NAT configurations, etc.

Once that happens, it may be impossible to reach into the container to
perform any further shutdown actions on the NFS client.

This patchset proposes to allow the client to deal with these situations
by treating the two errors ENETDOWN and ENETUNREACH as being fatal.
The intention is to then allow the I/O queue to drain, and any remaining
RPC calls to error out, so that the lazy umounts can complete the
shutdown process.

In order to do so, a new mount option "fatal_errors" is introduced,
which can take the values "default", "none" and "enetdown:enetunreach".
The value "none" forces the existing behaviour, whereby hard mounts are
unaffected by the ENETDOWN and ENETUNREACH errors.
The value "enetdown:enetunreach" forces ENETDOWN and ENETUNREACH errors
to always be fatal.
If the user does not specify the "fatal_errors" option, or uses the
value "default", then ENETDOWN and ENETUNREACH will be fatal if the
mount was started from inside a network namespace that is not
"init_net", and otherwise not.

The expectation is that users will normally not need to set this option,
unless they are running inside a container, and want to prevent ENETDOWN
and ENETUNREACH from being fatal by setting "-ofatal_errors=none".

Trond Myklebust (4):
  NFS: Add a mount option to make ENETUNREACH errors fatal
  NFS: Treat ENETUNREACH errors as fatal in containers
  pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers
  pNFS/flexfiles: Report ENETDOWN as a connection error

 fs/nfs/client.c                        |  5 ++++
 fs/nfs/flexfilelayout/flexfilelayout.c | 24 ++++++++++++++--
 fs/nfs/fs_context.c                    | 38 ++++++++++++++++++++++++++
 fs/nfs/nfs3client.c                    |  2 ++
 fs/nfs/nfs4client.c                    |  5 ++++
 fs/nfs/nfs4proc.c                      |  3 ++
 fs/nfs/super.c                         |  2 ++
 include/linux/nfs4.h                   |  1 +
 include/linux/nfs_fs_sb.h              |  2 ++
 include/linux/sunrpc/clnt.h            |  5 +++-
 include/linux/sunrpc/sched.h           |  1 +
 net/sunrpc/clnt.c                      | 30 ++++++++++++++------
 12 files changed, 107 insertions(+), 11 deletions(-)
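For illustration, usage of the proposed option as described in the cover letter might look roughly as follows. Note that "fatal_errors" only exists with this series applied, and the server name and export path below are placeholders:

```sh
# Rough sketch based on the cover letter's description of the proposed
# "fatal_errors" mount option; server:/export and /mnt are placeholders.

# Force ENETDOWN/ENETUNREACH to be fatal regardless of network namespace:
mount -t nfs -o vers=4.2,hard,fatal_errors=enetdown:enetunreach server:/export /mnt

# Keep the existing hard-mount behaviour even when mounting inside a container:
mount -t nfs -o vers=4.2,hard,fatal_errors=none server:/export /mnt
```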