[RFC,0/9] nfs/sunrpc: stop holding netns references in client-side NFS and RPC objects

Message ID 20250317-rpc-shutdown-v1-0-85ba8e20b75d@kernel.org (mailing list archive)

Message

Jeff Layton March 17, 2025, 8:59 p.m. UTC
We have a long-standing problem with containers that have NFS mounts in
them. Best practice is to unmount gracefully, of course, but sometimes
containers just spontaneously die (e.g. SIGSEGV in the init task in the
container). When that happens the orchestrator will see that all of the
tasks are dead, and will detach the mount namespace and kill off the
network connection.

If there are RPCs in flight at the time, the rpc_clnt will try to
retransmit them indefinitely, but there is no hope of them ever
contacting the server since nothing in userland can reach the netns
at that point to fix anything.

This patchset takes the approach of changing various NFS client and
sunrpc objects to not hold a netns reference. Instead, when an nfs_net or
sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt objects
associated with it are shut down, and the pre_exit functions block
until they are gone.
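
To make the intended lifetime rule concrete, here is a small userspace
model of it. It is only a sketch: struct model_net, struct model_clnt and
every function below are hypothetical stand-ins invented for illustration,
not the kernel's sunrpc_net, rpc_clnt or pernet_operations code. Clients
register with the namespace without pinning it, and the pre_exit step
shuts down every client and only returns once none are left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLIENTS 8

/* Hypothetical stand-in for an rpc_clnt; not the kernel structure. */
struct model_clnt {
	bool in_use;       /* slot is occupied */
	bool shutdown;     /* once set, nothing is retransmitted */
	int  pending_rpcs; /* RPCs still in flight */
};

/* Hypothetical stand-in for per-netns state such as sunrpc_net. */
struct model_net {
	struct model_clnt clients[MAX_CLIENTS];
	int nr_clients;
};

/* Register a client with the namespace WITHOUT taking a namespace reference. */
struct model_clnt *model_clnt_create(struct model_net *net, int pending)
{
	for (size_t i = 0; i < MAX_CLIENTS; i++) {
		struct model_clnt *c = &net->clients[i];

		if (!c->in_use) {
			c->in_use = true;
			c->shutdown = false;
			c->pending_rpcs = pending;
			net->nr_clients++;
			return c;
		}
	}
	return NULL; /* table full */
}

/* Fail all in-flight RPCs instead of retransmitting them forever. */
void model_clnt_shutdown(struct model_clnt *c)
{
	c->shutdown = true;
	c->pending_rpcs = 0;
}

/* Drop a fully drained client from the namespace's table. */
void model_clnt_release(struct model_net *net, struct model_clnt *c)
{
	assert(c->in_use && c->pending_rpcs == 0);
	c->in_use = false;
	net->nr_clients--;
}

/*
 * Model of the pre_exit step: shut down every client tied to the
 * namespace and do not return until the table is empty.
 * Returns the number of clients that were torn down.
 */
int model_net_pre_exit(struct model_net *net)
{
	int torn_down = 0;

	for (size_t i = 0; i < MAX_CLIENTS; i++) {
		struct model_clnt *c = &net->clients[i];

		if (c->in_use) {
			model_clnt_shutdown(c);
			model_clnt_release(net, c);
			torn_down++;
		}
	}
	assert(net->nr_clients == 0);
	return torn_down;
}
```

In this model the namespace can always be torn down, because nothing
inside it holds a reference that keeps it alive. The real patches have to
deal with concurrency and with other holders of these objects, which the
model ignores.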

With this approach, when the last userland task in the container exits,
the NFS and RPC clients get cleaned up automatically. As a bonus, this
fixes another bug with the gssproxy RPC client that causes net namespace
leaks in any container where it runs (details in the patch
descriptions).

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (9):
      sunrpc: transplant shutdown_client() to sunrpc module
      lockd: add a helper to shut down rpc_clnt in nlm_host
      lockd: don't #include debug.h from lockd.h
      nfs: transplant nfs_server shutdown into a helper function
      nfs: don't hold a reference to struct net in struct nfs_client
      auth_gss: shut down gssproxy rpc_clnt in net pre_exit
      auth_gss: don't hold a net reference in gss_auth
      sunrpc: don't hold a struct net reference in rpc_xprt
      sunrpc: don't upgrade passive net reference in xs_create_sock

 fs/lockd/clnt4xdr.c                |  1 +
 fs/lockd/clntlock.c                |  1 +
 fs/lockd/clntproc.c                |  1 +
 fs/lockd/clntxdr.c                 |  1 +
 fs/lockd/host.c                    |  8 ++++++++
 fs/lockd/mon.c                     |  1 +
 fs/lockd/svc.c                     |  1 +
 fs/lockd/svc4proc.c                |  1 +
 fs/lockd/svclock.c                 |  1 +
 fs/lockd/svcproc.c                 |  1 +
 fs/lockd/svcsubs.c                 |  1 +
 fs/nfs/client.c                    |  6 ++++--
 fs/nfs/inode.c                     | 28 ++++++++++++++++++++++++++++
 fs/nfs/internal.h                  |  1 +
 fs/nfs/super.c                     | 18 ++++++++++++++++++
 fs/nfs/sysfs.c                     | 27 ++-------------------------
 include/linux/lockd/lockd.h        |  2 +-
 include/linux/sunrpc/sched.h       |  1 +
 include/linux/sunrpc/svcauth_gss.h |  1 +
 include/linux/sunrpc/xprt.h        |  1 -
 net/sunrpc/auth_gss/auth_gss.c     | 15 ++++++++-------
 net/sunrpc/auth_gss/svcauth_gss.c  |  7 ++++++-
 net/sunrpc/clnt.c                  | 14 ++++++++++++++
 net/sunrpc/sunrpc_syms.c           | 29 +++++++++++++++++++++++++++++
 net/sunrpc/xprt.c                  |  3 +--
 net/sunrpc/xprtsock.c              |  3 ---
 26 files changed, 132 insertions(+), 42 deletions(-)
---
base-commit: 80e54e84911a923c40d7bee33a34c1b4be148d7a
change-id: 20250317-rpc-shutdown-1519aacd1db3

Best regards,

Comments

Trond Myklebust March 17, 2025, 9:35 p.m. UTC | #1
On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> We have a long-standing problem with containers that have NFS mounts
> in
> them. Best practice is to unmount gracefully, of course, but
> sometimes
> containers just spontaneously die (e.g. SIGSEGV in the init task in
> the
> container). When that happens the orchestrator will see that all of
> the
> tasks are dead, and will detach the mount namespace and kill off the
> network connection.
> 
> If there are RPCs in flight at the time, the rpc_clnt will try to
> retransmit them indefinitely, but there is no hope of them ever
> contacting the server since nothing in userland can reach the netns
> at that point to fix anything.
> 
> This patchset takes the approach of changing various nfs client and
> sunrpc objects to not hold a netns reference. Instead, when a nfs_net
> or
> sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> objects
> associated with it are shut down, and the pre_exit functions block
> until they are gone.
> 
> With this approach, when the last userland task in the container
> exits,
> the NFS and RPC clients get cleaned up automatically. As a bonus,
> this
> fixes another bug with the gssproxy RPC client that causes net
> namespace
> leaks in any container where it runs (details in the patch
> descriptions).
> 

So with this approach, what happens if the NFS mount was created in a
container, but got bind mounted somewhere else?

Jeff Layton March 17, 2025, 9:57 p.m. UTC | #2
On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote:
> On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> > We have a long-standing problem with containers that have NFS mounts
> > in
> > them. Best practice is to unmount gracefully, of course, but
> > sometimes
> > containers just spontaneously die (e.g. SIGSEGV in the init task in
> > the
> > container). When that happens the orchestrator will see that all of
> > the
> > tasks are dead, and will detach the mount namespace and kill off the
> > network connection.
> > 
> > If there are RPCs in flight at the time, the rpc_clnt will try to
> > retransmit them indefinitely, but there is no hope of them ever
> > contacting the server since nothing in userland can reach the netns
> > at that point to fix anything.
> > 
> > This patchset takes the approach of changing various nfs client and
> > sunrpc objects to not hold a netns reference. Instead, when a nfs_net
> > or
> > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> > objects
> > associated with it are shut down, and the pre_exit functions block
> > until they are gone.
> > 
> > With this approach, when the last userland task in the container
> > exits,
> > the NFS and RPC clients get cleaned up automatically. As a bonus,
> > this
> > fixes another bug with the gssproxy RPC client that causes net
> > namespace
> > leaks in any container where it runs (details in the patch
> > descriptions).
> > 
> 
> So with this approach, what happens if the NFS mount was created in a
> container, but got bind mounted somewhere else?
> 

The lifetime of these objects is tied to the net namespace. If it gets
bind-mounted into a different mount namespace, while the tasks are
setns()'ed into the correct net namespace, then I expect the mount
would end up shut down at that point and be unusable, just like if you
echo 1 into the shutdown file in sysfs.

Hopefully no one is doing anything that silly. You wouldn't be able to
upcall, for one thing, since there wouldn't be any more userland
processes attached to the netns.

I'll test that scenario and get back to you though. I do want to make
sure that that's not going to lead to a crash or anything.

Trond Myklebust March 17, 2025, 10:11 p.m. UTC | #3
On Mon, 2025-03-17 at 17:57 -0400, Jeff Layton wrote:
> On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote:
> > On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> > > We have a long-standing problem with containers that have NFS
> > > mounts
> > > in
> > > them. Best practice is to unmount gracefully, of course, but
> > > sometimes
> > > containers just spontaneously die (e.g. SIGSEGV in the init task
> > > in
> > > the
> > > container). When that happens the orchestrator will see that all
> > > of
> > > the
> > > tasks are dead, and will detach the mount namespace and kill off
> > > the
> > > network connection.
> > > 
> > > If there are RPCs in flight at the time, the rpc_clnt will try to
> > > retransmit them indefinitely, but there is no hope of them ever
> > > contacting the server since nothing in userland can reach the
> > > netns
> > > at that point to fix anything.
> > > 
> > > This patchset takes the approach of changing various nfs client
> > > and
> > > sunrpc objects to not hold a netns reference. Instead, when a
> > > nfs_net
> > > or
> > > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> > > objects
> > > associated with it are shut down, and the pre_exit functions
> > > block
> > > until they are gone.
> > > 
> > > With this approach, when the last userland task in the container
> > > exits,
> > > the NFS and RPC clients get cleaned up automatically. As a bonus,
> > > this
> > > fixes another bug with the gssproxy RPC client that causes net
> > > namespace
> > > leaks in any container where it runs (details in the patch
> > > descriptions).
> > > 
> > 
> > So with this approach, what happens if the NFS mount was created in
> > a
> > container, but got bind mounted somewhere else?
> > 
> 
> The lifetime of these objects are tied to the net namespace. If it
> gets
> bind-mounted into a different mount namespace, while the tasks are
> setns()'ed into the correct net namespace, then I expect the mount
> would end up shut down at that point and be unusable, just like if
> you
> echo 1 into the shutdown file in sysfs.
> 
> Hopefully no one is doing anything that silly. You wouldn't be able
> to
> upcall, for one thing, since there wouldn't be any more userland
> processes attached to the netns.
> 
> I'll test that scenario and get back to you though. I do want to make
> sure that that's not going to lead to a crash or anything.

I agree with you that it's not a sane scenario, and that there is no
need to try to make it work. However the user space tools are there to
allow it to happen, so we need to ensure that the kernel won't panic or
cause any new exotic hangs.

Jeff Layton March 18, 2025, 11:30 a.m. UTC | #4
On Mon, 2025-03-17 at 22:11 +0000, Trond Myklebust wrote:
> On Mon, 2025-03-17 at 17:57 -0400, Jeff Layton wrote:
> > On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote:
> > > On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> > > > We have a long-standing problem with containers that have NFS
> > > > mounts
> > > > in
> > > > them. Best practice is to unmount gracefully, of course, but
> > > > sometimes
> > > > containers just spontaneously die (e.g. SIGSEGV in the init task
> > > > in
> > > > the
> > > > container). When that happens the orchestrator will see that all
> > > > of
> > > > the
> > > > tasks are dead, and will detach the mount namespace and kill off
> > > > the
> > > > network connection.
> > > > 
> > > > If there are RPCs in flight at the time, the rpc_clnt will try to
> > > > retransmit them indefinitely, but there is no hope of them ever
> > > > contacting the server since nothing in userland can reach the
> > > > netns
> > > > at that point to fix anything.
> > > > 
> > > > This patchset takes the approach of changing various nfs client
> > > > and
> > > > sunrpc objects to not hold a netns reference. Instead, when a
> > > > nfs_net
> > > > or
> > > > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> > > > objects
> > > > associated with it are shut down, and the pre_exit functions
> > > > block
> > > > until they are gone.
> > > > 
> > > > With this approach, when the last userland task in the container
> > > > exits,
> > > > the NFS and RPC clients get cleaned up automatically. As a bonus,
> > > > this
> > > > fixes another bug with the gssproxy RPC client that causes net
> > > > namespace
> > > > leaks in any container where it runs (details in the patch
> > > > descriptions).
> > > > 
> > > 
> > > So with this approach, what happens if the NFS mount was created in
> > > a
> > > container, but got bind mounted somewhere else?
> > > 
> > 
> > The lifetime of these objects are tied to the net namespace. If it
> > gets
> > bind-mounted into a different mount namespace, while the tasks are
> > setns()'ed into the correct net namespace, then I expect the mount
> > would end up shut down at that point and be unusable, just like if
> > you
> > echo 1 into the shutdown file in sysfs.
> > 
> > Hopefully no one is doing anything that silly. You wouldn't be able
> > to
> > upcall, for one thing, since there wouldn't be any more userland
> > processes attached to the netns.
> > 
> > I'll test that scenario and get back to you though. I do want to make
> > sure that that's not going to lead to a crash or anything.
> 
> I agree with you that it's not a sane scenario, and that there is no
> need to try to make it work. However the user space tools are there to
> allow it to happen, so we need to ensure that the kernel won't panic or
> cause any new exotic hangs.

Unfortunately, this does create a hang.

Bind-mounting it will cause the superblock's refcount to increase,
which keeps the nfs_server struct active. That holds a reference to the
nfs_client, which prevents everything from coming down properly in
pre_exit.
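
The reference chain behind the hang can be modeled in a few lines of
userspace C. This is a sketch under stated assumptions: the bm_* types
and functions are hypothetical names made up for illustration, and the
counters only approximate the roles of the superblock's active count and
the nfs_client refcount. The point it shows is that after the container's
own mount is gone, the bind mount still pins the superblock, so the
server and its client reference never go away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical models; not the kernel's nfs_client/nfs_server/super_block. */
struct bm_client { int refcount; };
struct bm_server { struct bm_client *client; bool alive; };
struct bm_super  { int active; struct bm_server server; };

/* First mount: creates the server, which takes a reference on the client. */
void bm_mount(struct bm_super *sb, struct bm_client *clp)
{
	sb->active = 1;
	sb->server.client = clp;
	sb->server.alive = true;
	clp->refcount++;
}

/* A bind mount reuses the same superblock and takes one more reference. */
void bm_bind_mount(struct bm_super *sb)
{
	sb->active++;
}

/*
 * Unmount drops one superblock reference. Only the last one tears down
 * the server and releases its reference on the client.
 */
void bm_umount(struct bm_super *sb)
{
	assert(sb->active > 0);
	if (--sb->active == 0) {
		sb->server.alive = false;
		sb->server.client->refcount--;
	}
}

/* pre_exit can only finish once nothing pins the client any more. */
bool bm_pre_exit_would_block(const struct bm_client *clp)
{
	return clp->refcount > 0;
}
```

With one mount plus one bind mount, a single bm_umount() (the container's
mount namespace going away) leaves bm_pre_exit_would_block() returning
true. It only returns false after the bind mount is released as well.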

I'll have to think about how we can solve that. Let me know if you have
ideas.