[RFC,iproute2-next,0/5] Persisting of mount namespaces along with network namespaces

Message ID	20231009182753.851551-1-toke@redhat.com (mailing list archive)
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1AF573AC1A for <netdev@vger.kernel.org>; Mon, 9 Oct 2023 18:28:05 +0000 (UTC) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= <toke@redhat.com> To: David Ahern <dsahern@gmail.com>, Stephen Hemminger <stephen@networkplumber.org> Cc: netdev@vger.kernel.org, =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= <toke@redhat.com>, Nicolas Dichtel <nicolas.dichtel@6wind.com>, Christian Brauner <brauner@kernel.org>, "Eric W . Biederman" <ebiederm@xmission.com>, David Laight <David.Laight@ACULAB.COM> Subject: [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces Date: Mon, 9 Oct 2023 20:27:48 +0200 Message-ID: <20231009182753.851551-1-toke@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	Persisting of mount namespaces along with network namespaces \| expand [RFC,iproute2-next,0/5] Persisting of mount namespaces along with network namespaces [RFC,iproute2-next,1/5] ip: Mount netns in child process instead of from inside the new namespace [RFC,iproute2-next,2/5] ip: Split out code creating namespace mount dir so it can be reused [RFC,iproute2-next,3/5] lib/namespace: Factor out code for reuse [RFC,iproute2-next,4/5] ip: Also create and persist mount namespace when creating netns [RFC,iproute2-next,5/5] lib/namespace: Also mount a bpffs instance inside new mount namespaces

Toke Høiland-Jørgensen Oct. 9, 2023, 6:27 p.m. UTC

The 'ip netns' command is used for setting up network namespaces with persistent
named references, and is integrated into various other commands of iproute2 via
the -n switch.

This is useful both for testing setups and for simple script-based namespacing
but has one drawback: the lack of persistent mounts inside the spawned
namespace. This is particularly apparent when working with BPF programs that use
pinning to bpffs: by default no bpffs is available inside a namespace, and
even if mounting one, that fs disappears as soon as the calling command exits.

The underlying cause for this is that iproute2 will create a new mount namespace
every time it switches into a network namespace. This is needed to be able to
mount a /sys filesystem that shows the correct network device information, but
has the unfortunate side effect of making mounts entirely transient for any 'ip
netns' invocation.

This series is an attempt to fix this situation, by persisting a mount namespace
alongside the persistent network namespace (in a separate directory,
/run/netns-mnt). Doing this allows us to still have a consistent /sys inside
the namespace, but with persistence so any mounts survive.

This mode does come with some caveats. I'm sending this as RFC to get feedback
on whether this is the right thing to do, especially considering backwards
compatibility. On balance, I think that the approach taken here of
unconditionally persisting the mount namespace, and using that persistent
reference whenever it exists, is better than the current behaviour, and that
while it does represent a change in behaviour it is backwards compatible in a
way that won't cause issues. But please do comment on this; see the patch
description of patch 4 for details.

The first three patches are just moving code around and should not represent any
functional changes. The fourth patch introduces the mount namespace persistence,
and the fifth patch adds mounting of a bpffs instance to the mount namespace
preparation logic.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@ACULAB.COM>

Toke Høiland-Jørgensen (5):
  ip: Mount netns in child process instead of from inside the new
    namespace
  ip: Split out code creating namespace mount dir so it can be reused
  lib/namespace: Factor out code for reuse
  ip: Also create and persist mount namespace when creating netns
  lib/namespace: Also mount a bpffs instance inside new mount namespaces

 Makefile            |   2 +
 include/namespace.h |   1 +
 ip/ipnetns.c        | 357 ++++++++++++++++++++++++++++++--------------
 lib/namespace.c     |  82 +++++++---
 4 files changed, 312 insertions(+), 130 deletions(-)

Eric W. Biederman Oct. 9, 2023, 8:32 p.m. UTC | #1

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> The 'ip netns' command is used for setting up network namespaces with persistent
> named references, and is integrated into various other commands of iproute2 via
> the -n switch.
>
> This is useful both for testing setups and for simple script-based namespacing
> but has one drawback: the lack of persistent mounts inside the spawned
> namespace. This is particularly apparent when working with BPF programs that use
> pinning to bpffs: by default no bpffs is available inside a namespace, and
> even if mounting one, that fs disappears as soon as the calling
> command exits.

It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
original mount namespace into the temporary mount namespace used by
"ip netns".

I would call it a bug that "ip netns" doesn't do that already.

I suspect that "ip netns" does copy the mounts from the old sysfs onto
the new sysfs is your entire problem.

Or is their a reason that bpffs should be per network namespace? 

> The underlying cause for this is that iproute2 will create a new mount namespace
> every time it switches into a network namespace. This is needed to be able to
> mount a /sys filesystem that shows the correct network device information, but
> has the unfortunate side effect of making mounts entirely transient for any 'ip
> netns' invocation.

Mount propagation can be made to work if necessary, that would solve the
transient problem. 

> This series is an attempt to fix this situation, by persisting a mount namespace
> alongside the persistent network namespace (in a separate directory,
> /run/netns-mnt). Doing this allows us to still have a consistent /sys inside
> the namespace, but with persistence so any mounts survive.

I really don't like that direction.

"ip netns" was designed and really should continue to be a command that
makes the world look like it has a single network namespace, for
compatibility with old code.  Part of that old code "ip netns" supports
is "ip" itself.

I think you are making bpffs unnecessarily per network namespace.

> This mode does come with some caveats. I'm sending this as RFC to get feedback
> on whether this is the right thing to do, especially considering backwards
> compatibility. On balance, I think that the approach taken here of
> unconditionally persisting the mount namespace, and using that persistent
> reference whenever it exists, is better than the current behaviour, and that
> while it does represent a change in behaviour it is backwards compatible in a
> way that won't cause issues. But please do comment on this; see the patch
> description of patch 4 for details.

As I understand it this will cause a problem for any application that
is network namespace aware and does not use "ip netns" to wrap itself.

I am fairly certain that pinning the mount namespace will result in
never seeing an update of /etc/resolve.conf.  At least if you
are on a system that has /etc/netns/NAME/resolve.conf

Unless I am missing something I think you are trying to solve the wrong
problem.  I think all it will take is for the new mount of /sys to have
the same mounts on it as the previous mount of /sys.

Eric

Toke Høiland-Jørgensen Oct. 9, 2023, 10:03 p.m. UTC | #2

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> The 'ip netns' command is used for setting up network namespaces with persistent
>> named references, and is integrated into various other commands of iproute2 via
>> the -n switch.
>>
>> This is useful both for testing setups and for simple script-based namespacing
>> but has one drawback: the lack of persistent mounts inside the spawned
>> namespace. This is particularly apparent when working with BPF programs that use
>> pinning to bpffs: by default no bpffs is available inside a namespace, and
>> even if mounting one, that fs disappears as soon as the calling
>> command exits.
>
> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
> original mount namespace into the temporary mount namespace used by
> "ip netns".
>
> I would call it a bug that "ip netns" doesn't do that already.
>
> I suspect that "ip netns" does copy the mounts from the old sysfs onto
> the new sysfs is your entire problem.

How would it do that? Walk mtab and remount everything identically after
remounting /sys? Or is there a smarter way to go about this?

> Or is their a reason that bpffs should be per network namespace?

Well, I first ran into this issue because of a bug report to
xdp-tools/libxdp about things not working correctly in network
namespaces:

https://github.com/xdp-project/xdp-tools/issues/364

And libxdp does assume that there's a separate bpffs per network
namespace: it persists things into the bpffs that is tied to the network
devices in the current namespace. So if the bpffs is shared, an
application running inside the network namespace could access XDP
programs loaded in the root namespace. I don't know, but suspect, that
such assumptions would be relatively common in networking BPF programs
that use pinning (the pinning support in libbpf and iproute2 itself at
least have the same leaking problem if the bpffs is shared).

>> The underlying cause for this is that iproute2 will create a new mount namespace
>> every time it switches into a network namespace. This is needed to be able to
>> mount a /sys filesystem that shows the correct network device information, but
>> has the unfortunate side effect of making mounts entirely transient for any 'ip
>> netns' invocation.
>
> Mount propagation can be made to work if necessary, that would solve the
> transient problem.

Is mount propagation different from the remount thing you mentioned
above, or is this something different?

(Sorry for being hopelessly naive about this, as you probably guessed
from my previous email asking about this, I'm only now learning about
all the intricacies fs mounts).

>> This series is an attempt to fix this situation, by persisting a mount namespace
>> alongside the persistent network namespace (in a separate directory,
>> /run/netns-mnt). Doing this allows us to still have a consistent /sys inside
>> the namespace, but with persistence so any mounts survive.
>
> I really don't like that direction.
>
> "ip netns" was designed and really should continue to be a command that
> makes the world look like it has a single network namespace, for
> compatibility with old code.  Part of that old code "ip netns" supports
> is "ip" itself.

Well my idea with this change was to keep the functionality as close to
what 'ip' currently does, but just have mounts persist across
invocations.

> I think you are making bpffs unnecessarily per network namespace.

See above. 

>> This mode does come with some caveats. I'm sending this as RFC to get feedback
>> on whether this is the right thing to do, especially considering backwards
>> compatibility. On balance, I think that the approach taken here of
>> unconditionally persisting the mount namespace, and using that persistent
>> reference whenever it exists, is better than the current behaviour, and that
>> while it does represent a change in behaviour it is backwards compatible in a
>> way that won't cause issues. But please do comment on this; see the patch
>> description of patch 4 for details.
>
> As I understand it this will cause a problem for any application that
> is network namespace aware and does not use "ip netns" to wrap itself.
>
> I am fairly certain that pinning the mount namespace will result in
> never seeing an update of /etc/resolve.conf.  At least if you
> are on a system that has /etc/netns/NAME/resolve.conf

I was actually wondering about that /etc bind mounting support while I
was looking at this code. Could you please elaborate a bit on what that
is used for, exactly? :)

Also, if staleness of the /etc bind mounts is an issue, those could be
redone on every entry, couldn't they?

-Toke

Eric W. Biederman Oct. 10, 2023, 12:14 a.m. UTC | #3

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>
>>> The 'ip netns' command is used for setting up network namespaces with persistent
>>> named references, and is integrated into various other commands of iproute2 via
>>> the -n switch.
>>>
>>> This is useful both for testing setups and for simple script-based namespacing
>>> but has one drawback: the lack of persistent mounts inside the spawned
>>> namespace. This is particularly apparent when working with BPF programs that use
>>> pinning to bpffs: by default no bpffs is available inside a namespace, and
>>> even if mounting one, that fs disappears as soon as the calling
>>> command exits.
>>
>> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
>> original mount namespace into the temporary mount namespace used by
>> "ip netns".
>>
>> I would call it a bug that "ip netns" doesn't do that already.
>>
>> I suspect that "ip netns" does copy the mounts from the old sysfs onto
>> the new sysfs is your entire problem.
>
> How would it do that? Walk mtab and remount everything identically after
> remounting /sys? Or is there a smarter way to go about this?

There are not many places to look so something like this is probably sufficient:

# stat all of the possible/probable mount points and see if there is
# something mounted there.  If so recursive bind whatever is there onto
# the new /sys

for dir in /old/sys/fs/* /old/sys/kernel/*; do
	if [ $(stat --format '%d' "$dir") = $(stat --format '%d' "$dir/") ; then
		newdir = $(echo $dir | sed -e s/old/new/)
		mount --rbind $dir/ $newdir/
	fi  
done

If the concern is being robust for the future the mount points can also
be enumerated by looking in one of /proc/self/mounts,
/proc/self/mountinfo, or /proc/self/mountstats.

I am not certain which is less work parsing a file with lots of fields,
or reading a directory and stating the returned files from readdir.

>> Or is their a reason that bpffs should be per network namespace?
>
> Well, I first ran into this issue because of a bug report to
> xdp-tools/libxdp about things not working correctly in network
> namespaces:
>
> https://github.com/xdp-project/xdp-tools/issues/364
>
> And libxdp does assume that there's a separate bpffs per network
> namespace: it persists things into the bpffs that is tied to the network
> devices in the current namespace. So if the bpffs is shared, an
> application running inside the network namespace could access XDP
> programs loaded in the root namespace. I don't know, but suspect, that
> such assumptions would be relatively common in networking BPF programs
> that use pinning (the pinning support in libbpf and iproute2 itself at
> least have the same leaking problem if the bpffs is shared).

Are the names of the values truly network namespace specific?

I did not see any mention of the things that are persisted in the ticket
you pointed me at, and unfortunately I am not familiar with xdp.

Last I looked until all of the cpu side channels are closed it is
unfortunately unsafe to load ebpf programs with anything less than
CAP_SYS_ADMIN (aka with permission to see and administer the entire
system).  So from a system point of view I really don't see a
fundamental danger from having a global /sys/fs/bpf.

If there are name conflicts in /sys/fs/bpf because of duplicate names in
different network namespaces I can see that being a problem.

At that point the name conflicts either need to be fixed or we
fundamentally need to have multiple mount points for bpffs.
Probably under something like /run/netns-mounts/NAME/.

With ip netns updated to mount the appropriate filesystem.

>>> The underlying cause for this is that iproute2 will create a new mount namespace
>>> every time it switches into a network namespace. This is needed to be able to
>>> mount a /sys filesystem that shows the correct network device information, but
>>> has the unfortunate side effect of making mounts entirely transient for any 'ip
>>> netns' invocation.
>>
>> Mount propagation can be made to work if necessary, that would solve the
>> transient problem.
>
> Is mount propagation different from the remount thing you mentioned
> above, or is this something different?
>
> (Sorry for being hopelessly naive about this, as you probably guessed
> from my previous email asking about this, I'm only now learning about
> all the intricacies fs mounts).

Mount propagation is a way to configure a mount namespace (before
creating a new one) that will cause mounts created in the first mount
namespace to be created in it's children, and cause mounts created in
the children to be created in the parent (depending on how things are
configured).

It is not my favorite feature (it makes locking of mount namespaces
terrible) and it is probably too clever by half, unfortunately systemd
started enabling mount propagation by default, so we are stuck with it.

>>> This series is an attempt to fix this situation, by persisting a mount namespace
>>> alongside the persistent network namespace (in a separate directory,
>>> /run/netns-mnt). Doing this allows us to still have a consistent /sys inside
>>> the namespace, but with persistence so any mounts survive.
>>
>> I really don't like that direction.
>>
>> "ip netns" was designed and really should continue to be a command that
>> makes the world look like it has a single network namespace, for
>> compatibility with old code.  Part of that old code "ip netns" supports
>> is "ip" itself.
>
> Well my idea with this change was to keep the functionality as close to
> what 'ip' currently does, but just have mounts persist across
> invocations.
>
>> I think you are making bpffs unnecessarily per network namespace.
>
> See above. 
>
>>> This mode does come with some caveats. I'm sending this as RFC to get feedback
>>> on whether this is the right thing to do, especially considering backwards
>>> compatibility. On balance, I think that the approach taken here of
>>> unconditionally persisting the mount namespace, and using that persistent
>>> reference whenever it exists, is better than the current behaviour, and that
>>> while it does represent a change in behaviour it is backwards compatible in a
>>> way that won't cause issues. But please do comment on this; see the patch
>>> description of patch 4 for details.
>>
>> As I understand it this will cause a problem for any application that
>> is network namespace aware and does not use "ip netns" to wrap itself.
>>
>> I am fairly certain that pinning the mount namespace will result in
>> never seeing an update of /etc/resolve.conf.  At least if you
>> are on a system that has /etc/netns/NAME/resolve.conf
>
> I was actually wondering about that /etc bind mounting support while I
> was looking at this code. Could you please elaborate a bit on what that
> is used for, exactly? :)

The idea is that you can have separate static configuration depending
upon your network namespace.

A real world case is vpning into several company internal networks.
Each company network uses overlapping portions of the 192.168.x.x
network.
Each company network will want it's own dns servers and possibly other
network configuration as well.

For it to make sense you really only need one company network and a home
network.  One of which you could stash in a network namespace to prevent
conflicts.

I don't know if supporting that ever caught on very strongly, but
I tried to build a template that would work for that and similar cases.

> Also, if staleness of the /etc bind mounts is an issue, those could be
> redone on every entry, couldn't they?

They already are ;)

Eric

David Laight Oct. 10, 2023, 8:42 a.m. UTC | #4

From: Eric W. Biederman
> Sent: 09 October 2023 21:33
> 
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
> 
> > The 'ip netns' command is used for setting up network namespaces with persistent
> > named references, and is integrated into various other commands of iproute2 via
> > the -n switch.
> >
> > This is useful both for testing setups and for simple script-based namespacing
> > but has one drawback: the lack of persistent mounts inside the spawned
> > namespace. This is particularly apparent when working with BPF programs that use
> > pinning to bpffs: by default no bpffs is available inside a namespace, and
> > even if mounting one, that fs disappears as soon as the calling
> > command exits.
> 
> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
> original mount namespace into the temporary mount namespace used by
> "ip netns".
> 
> I would call it a bug that "ip netns" doesn't do that already.
> 
> I suspect that "ip netns" does copy the mounts from the old sysfs onto
> the new sysfs is your entire problem.

When I was getting a program to run in multiple network namespaces
(has sockets in 2 namespaces) I rather expected that netns(net_ns_fd,0)
would 'magically' change /proc/net to refer to the new namespace.
I think that could be done in the code that follows the /proc/net
mountpoint - IIRC something similar is done for /proc/self.

However that would need flags to both setns() and 'ip netns exec'
since programs will rely on the existing behaviour.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Toke Høiland-Jørgensen Oct. 10, 2023, 1:38 p.m. UTC | #5

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> The 'ip netns' command is used for setting up network namespaces with persistent
>>>> named references, and is integrated into various other commands of iproute2 via
>>>> the -n switch.
>>>>
>>>> This is useful both for testing setups and for simple script-based namespacing
>>>> but has one drawback: the lack of persistent mounts inside the spawned
>>>> namespace. This is particularly apparent when working with BPF programs that use
>>>> pinning to bpffs: by default no bpffs is available inside a namespace, and
>>>> even if mounting one, that fs disappears as soon as the calling
>>>> command exits.
>>>
>>> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
>>> original mount namespace into the temporary mount namespace used by
>>> "ip netns".
>>>
>>> I would call it a bug that "ip netns" doesn't do that already.
>>>
>>> I suspect that "ip netns" does copy the mounts from the old sysfs onto
>>> the new sysfs is your entire problem.
>>
>> How would it do that? Walk mtab and remount everything identically after
>> remounting /sys? Or is there a smarter way to go about this?
>
> There are not many places to look so something like this is probably sufficient:
>
> # stat all of the possible/probable mount points and see if there is
> # something mounted there.  If so recursive bind whatever is there onto
> # the new /sys
>
> for dir in /old/sys/fs/* /old/sys/kernel/*; do
> 	if [ $(stat --format '%d' "$dir") = $(stat --format '%d' "$dir/") ; then

What is this comparison supposed to do? I couldn't find any directories
in /sys/fs/* where this was *not* true, regardless of whether there's a
mount there or not.

> 		newdir = $(echo $dir | sed -e s/old/new/)
> 		mount --rbind $dir/ $newdir/
> 	fi  
> done
>
> If the concern is being robust for the future the mount points can also
> be enumerated by looking in one of /proc/self/mounts,
> /proc/self/mountinfo, or /proc/self/mountstats.
>
> I am not certain which is less work parsing a file with lots of fields,
> or reading a directory and stating the returned files from readdir.

Right, fair point.

>>> Or is their a reason that bpffs should be per network namespace?
>>
>> Well, I first ran into this issue because of a bug report to
>> xdp-tools/libxdp about things not working correctly in network
>> namespaces:
>>
>> https://github.com/xdp-project/xdp-tools/issues/364
>>
>> And libxdp does assume that there's a separate bpffs per network
>> namespace: it persists things into the bpffs that is tied to the network
>> devices in the current namespace. So if the bpffs is shared, an
>> application running inside the network namespace could access XDP
>> programs loaded in the root namespace. I don't know, but suspect, that
>> such assumptions would be relatively common in networking BPF programs
>> that use pinning (the pinning support in libbpf and iproute2 itself at
>> least have the same leaking problem if the bpffs is shared).
>
> Are the names of the values truly network namespace specific?
>
> I did not see any mention of the things that are persisted in the ticket
> you pointed me at, and unfortunately I am not familiar with xdp.
>
> Last I looked until all of the cpu side channels are closed it is
> unfortunately unsafe to load ebpf programs with anything less than
> CAP_SYS_ADMIN (aka with permission to see and administer the entire
> system).  So from a system point of view I really don't see a
> fundamental danger from having a global /sys/fs/bpf.
>
> If there are name conflicts in /sys/fs/bpf because of duplicate names in
> different network namespaces I can see that being a problem.

Yeah, you're right that someone loading a BPF program generally has
permissions enough that they can break out of any containment if they
want, but applications do make assumptions about the contents of the
pinning directory that can lead to conflicts.

A couple of examples:

- libxdp will persist files in /sys/fs/bpf/dispatch-$ifindex-$prog_id

- If someone sets the 'pinning' attribute on a map definition in a BPF
  file, libbpf will pin those files in /sys/fs/bpf/$map_name

The first one leads to obvious conflicts if shared across network
namespaces because of ifindex collisions. The second one leads to
potential false sharing of state across what are supposed to be
independent networking domains (e.g., if the bpffs is shared, loading
xdp-filter inside a namespace will share the state with another instance
loaded in another namespace, which would no doubt be surprising).

> At that point the name conflicts either need to be fixed or we
> fundamentally need to have multiple mount points for bpffs.
> Probably under something like /run/netns-mounts/NAME/.
>
> With ip netns updated to mount the appropriate filesystem.

I don't think it's feasible to fix the conflicts; they've been around
for a while and are part of application API in some cases. Plus, we
don't know of all BPF-using applications.

We could have 'ip' manage separate bpffs mounts per namespace and
bind-mount them into each netns (I think that's what you're suggesting),
but that would basically achieve the same thing as the mountns
persisting I am proposing in this series, but only as a special case for
bpffs. So why not do the more flexible thing and persist the whole
mountns (so applications inside the namespace can actually mount
additional things and have them stick around)? The current behaviour
seems very surprising...

>>>> The underlying cause for this is that iproute2 will create a new mount namespace
>>>> every time it switches into a network namespace. This is needed to be able to
>>>> mount a /sys filesystem that shows the correct network device information, but
>>>> has the unfortunate side effect of making mounts entirely transient for any 'ip
>>>> netns' invocation.
>>>
>>> Mount propagation can be made to work if necessary, that would solve the
>>> transient problem.
>>
>> Is mount propagation different from the remount thing you mentioned
>> above, or is this something different?
>>
>> (Sorry for being hopelessly naive about this, as you probably guessed
>> from my previous email asking about this, I'm only now learning about
>> all the intricacies fs mounts).
>
> Mount propagation is a way to configure a mount namespace (before
> creating a new one) that will cause mounts created in the first mount
> namespace to be created in it's children, and cause mounts created in
> the children to be created in the parent (depending on how things are
> configured).
>
> It is not my favorite feature (it makes locking of mount namespaces
> terrible) and it is probably too clever by half, unfortunately systemd
> started enabling mount propagation by default, so we are stuck with it.

Right. AFAICT the current iproute2 code explicitly tries to avoid that
when creating a mountns (it does a 'mount --make-rslave /'); so you're
saying we should change that?

>>>> This series is an attempt to fix this situation, by persisting a mount namespace
>>>> alongside the persistent network namespace (in a separate directory,
>>>> /run/netns-mnt). Doing this allows us to still have a consistent /sys inside
>>>> the namespace, but with persistence so any mounts survive.
>>>
>>> I really don't like that direction.
>>>
>>> "ip netns" was designed and really should continue to be a command that
>>> makes the world look like it has a single network namespace, for
>>> compatibility with old code.  Part of that old code "ip netns" supports
>>> is "ip" itself.
>>
>> Well my idea with this change was to keep the functionality as close to
>> what 'ip' currently does, but just have mounts persist across
>> invocations.
>>
>>> I think you are making bpffs unnecessarily per network namespace.
>>
>> See above. 
>>
>>>> This mode does come with some caveats. I'm sending this as RFC to get feedback
>>>> on whether this is the right thing to do, especially considering backwards
>>>> compatibility. On balance, I think that the approach taken here of
>>>> unconditionally persisting the mount namespace, and using that persistent
>>>> reference whenever it exists, is better than the current behaviour, and that
>>>> while it does represent a change in behaviour it is backwards compatible in a
>>>> way that won't cause issues. But please do comment on this; see the patch
>>>> description of patch 4 for details.
>>>
>>> As I understand it this will cause a problem for any application that
>>> is network namespace aware and does not use "ip netns" to wrap itself.
>>>
>>> I am fairly certain that pinning the mount namespace will result in
>>> never seeing an update of /etc/resolve.conf.  At least if you
>>> are on a system that has /etc/netns/NAME/resolve.conf
>>
>> I was actually wondering about that /etc bind mounting support while I
>> was looking at this code. Could you please elaborate a bit on what that
>> is used for, exactly? :)
>
> The idea is that you can have separate static configuration depending
> upon your network namespace.
>
> A real world case is vpning into several company internal networks.
> Each company network uses overlapping portions of the 192.168.x.x
> network.
> Each company network will want it's own dns servers and possibly other
> network configuration as well.
>
> For it to make sense you really only need one company network and a home
> network.  One of which you could stash in a network namespace to prevent
> conflicts.
>
> I don't know if supporting that ever caught on very strongly, but
> I tried to build a template that would work for that and similar cases.

Hmm, I actually use a network namespace for something like that myself,
but I'm not using that functionality because I had no idea it existed
until now... :)

>> Also, if staleness of the /etc bind mounts is an issue, those could be
>> redone on every entry, couldn't they?
>
> They already are ;)

Right, by the transient mount namespaces, what I meant was that this
could be preserved even with a persistent mount namespace.

-Toke

Eric W. Biederman Oct. 10, 2023, 7:19 p.m. UTC | #6

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>
>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>
>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>
>> There are not many places to look so something like this is probably sufficient:
>>
>> # stat all of the possible/probable mount points and see if there is
>> # something mounted there.  If so recursive bind whatever is there onto
>> # the new /sys
>>
>> for dir in /old/sys/fs/* /old/sys/kernel/*; do
>> 	if [ $(stat --format '%d' "$dir") = $(stat --format '%d' "$dir/") ; then
>
> What is this comparison supposed to do? I couldn't find any directories
> in /sys/fs/* where this was *not* true, regardless of whether there's a
> mount there or not.

Bah.  I think I got my logic scrambled.  I can only get it to work
by comparing the filesystems device on /sys/fs to the device on
/sys/fs/cgroup etc.

The idea is that st_dev changes between filesystems.  So you can detect
a filesystem change based on st_dev.

I thought the $dir vs $dir/ would have allowed stating the underlying
directory verses the mount, but apparently my memory go that one wrong.

Which makes my command actually something like:

	sys_dev=$(stat --format='%d' /sys)

	for dir in /old/sys/fs/* /old/sys/kernel/*; do
		if [ $(stat --format '%d' "$dir") -ne $sys_dev ] ; then
                	echo $dir is a mount point
                fi
	done


>>>> Or is their a reason that bpffs should be per network namespace?
>>>
>>> Well, I first ran into this issue because of a bug report to
>>> xdp-tools/libxdp about things not working correctly in network
>>> namespaces:
>>>
>>> https://github.com/xdp-project/xdp-tools/issues/364
>>>
>>> And libxdp does assume that there's a separate bpffs per network
>>> namespace: it persists things into the bpffs that is tied to the network
>>> devices in the current namespace. So if the bpffs is shared, an
>>> application running inside the network namespace could access XDP
>>> programs loaded in the root namespace. I don't know, but suspect, that
>>> such assumptions would be relatively common in networking BPF programs
>>> that use pinning (the pinning support in libbpf and iproute2 itself at
>>> least have the same leaking problem if the bpffs is shared).
>>
>> Are the names of the values truly network namespace specific?
>>
>> I did not see any mention of the things that are persisted in the ticket
>> you pointed me at, and unfortunately I am not familiar with xdp.
>>
>> Last I looked until all of the cpu side channels are closed it is
>> unfortunately unsafe to load ebpf programs with anything less than
>> CAP_SYS_ADMIN (aka with permission to see and administer the entire
>> system).  So from a system point of view I really don't see a
>> fundamental danger from having a global /sys/fs/bpf.
>>
>> If there are name conflicts in /sys/fs/bpf because of duplicate names in
>> different network namespaces I can see that being a problem.
>
> Yeah, you're right that someone loading a BPF program generally has
> permissions enough that they can break out of any containment if they
> want, but applications do make assumptions about the contents of the
> pinning directory that can lead to conflicts.
>
> A couple of examples:
>
> - libxdp will persist files in /sys/fs/bpf/dispatch-$ifindex-$prog_id
>
> - If someone sets the 'pinning' attribute on a map definition in a BPF
>   file, libbpf will pin those files in /sys/fs/bpf/$map_name
>
> The first one leads to obvious conflicts if shared across network
> namespaces because of ifindex collisions. The second one leads to
> potential false sharing of state across what are supposed to be
> independent networking domains (e.g., if the bpffs is shared, loading
> xdp-filter inside a namespace will share the state with another instance
> loaded in another namespace, which would no doubt be surprising).

Sigh.  So non-default network namespaces can't use /sys/fs/bpf,
because of silly userspace assumptions.  So the entries need to be
namespaced to prevent conflicts.

>> At that point the name conflicts either need to be fixed or we
>> fundamentally need to have multiple mount points for bpffs.
>> Probably under something like /run/netns-mounts/NAME/.
>>
>> With ip netns updated to mount the appropriate filesystem.
>
> I don't think it's feasible to fix the conflicts; they've been around
> for a while and are part of application API in some cases. Plus, we
> don't know of all BPF-using applications.
>
> We could have 'ip' manage separate bpffs mounts per namespace and
> bind-mount them into each netns (I think that's what you're suggesting),
> but that would basically achieve the same thing as the mountns
> persisting I am proposing in this series, but only as a special case for
> bpffs. So why not do the more flexible thing and persist the whole
> mountns (so applications inside the namespace can actually mount
> additional things and have them stick around)? The current behaviour
> seems very surprising...

I don't like persisting the entire mount namespace because it is hard
for a system administrator to see, it is difficult for something outside
of that mount namespace to access, and it is as easy to persist a
mistake as it is to persist something deliberate.

My proposal:

On "ip netns add NAME"
- create the network namespace and mount it at /run/netns/NAME
- mount the appropriate sysfs at /run/netns-mounts/NAME/sys
- mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf

On "ip netns delete NAME"
- umount --recursive /run/netns-mounts/NAME
- unlink /run/netns-mounts/NAME
- cleanup /run/netns/NAME as we do today.

On "ip netns exec NAME"
- Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/
  and perform bind mounts.

That way everything that needs to persist really and truly persists.

Applications that don't use "ip netns exec" can continue to use the
network namespace and everything that goes along with it without
problems.

>> Mount propagation is a way to configure a mount namespace (before
>> creating a new one) that will cause mounts created in the first mount
>> namespace to be created in it's children, and cause mounts created in
>> the children to be created in the parent (depending on how things are
>> configured).
>>
>> It is not my favorite feature (it makes locking of mount namespaces
>> terrible) and it is probably too clever by half, unfortunately systemd
>> started enabling mount propagation by default, so we are stuck with it.
>
> Right. AFAICT the current iproute2 code explicitly tries to avoid that
> when creating a mountns (it does a 'mount --make-rslave /'); so you're
> saying we should change that?

If it makes sense.

I believe I added the 'mount --make-rslave /' because otherwise all
mount activity was propagating back, and making a mess.  Especially when
I was unmounting /sys.

I am not a huge fan of mount propagation it has lots of surprising
little details that need to be set just right, to not cause problems.

With my proposal above I think we could in some carefully chosen
places enable mount propagation without problem.  But I would really
like to see an application that is performing mounts inside of
"ip netns exec" to see how it matters.

Code without concrete real world test use cases tends to get things
wrong.

Eric

Eric W. Biederman Oct. 10, 2023, 7:32 p.m. UTC | #7

David Laight <David.Laight@ACULAB.COM> writes:

> From: Eric W. Biederman
>> Sent: 09 October 2023 21:33
>> 
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>> 
>> > The 'ip netns' command is used for setting up network namespaces with persistent
>> > named references, and is integrated into various other commands of iproute2 via
>> > the -n switch.
>> >
>> > This is useful both for testing setups and for simple script-based namespacing
>> > but has one drawback: the lack of persistent mounts inside the spawned
>> > namespace. This is particularly apparent when working with BPF programs that use
>> > pinning to bpffs: by default no bpffs is available inside a namespace, and
>> > even if mounting one, that fs disappears as soon as the calling
>> > command exits.
>> 
>> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
>> original mount namespace into the temporary mount namespace used by
>> "ip netns".
>> 
>> I would call it a bug that "ip netns" doesn't do that already.
>> 
>> I suspect that "ip netns" does copy the mounts from the old sysfs onto
>> the new sysfs is your entire problem.
>
> When I was getting a program to run in multiple network namespaces
> (has sockets in 2 namespaces) I rather expected that netns(net_ns_fd,0)
> would 'magically' change /proc/net to refer to the new namespace.
> I think that could be done in the code that follows the /proc/net
> mountpoint - IIRC something similar is done for /proc/self.

/proc/self/net does follow your current network namespace last I looked.

Of course if you are threaded you may need to look at
/proc/thread-self/net as your network namespace is per thread.

It is also quite evil.  The problem is that having different entries
cached under the same name is a major mess.  Ever since I made that
mistake I have been aiming at designs that don't fight the dcache.

Even in that case I think I limited it to just a entry where
ugliness happens.

> However that would need flags to both setns() and 'ip netns exec'
> since programs will rely on the existing behaviour.

You might want to look again.  

Eric

David Laight Oct. 10, 2023, 9:51 p.m. UTC | #8

From: Eric W. Biederman
> Sent: 10 October 2023 20:33
> 
> David Laight <David.Laight@ACULAB.COM> writes:
> 
> > From: Eric W. Biederman
> >> Sent: 09 October 2023 21:33
> >>
...
> > When I was getting a program to run in multiple network namespaces
> > (has sockets in 2 namespaces) I rather expected that netns(net_ns_fd,0)
> > would 'magically' change /proc/net to refer to the new namespace.
> > I think that could be done in the code that follows the /proc/net
> > mountpoint - IIRC something similar is done for /proc/self.
> 
> /proc/self/net does follow your current network namespace last I looked.
> 
> Of course if you are threaded you may need to look at
> /proc/thread-self/net as your network namespace is per thread.

Yes, I remember that now, and /proc/net is the wrong symlink.


> It is also quite evil.  The problem is that having different entries
> cached under the same name is a major mess.  Ever since I made that
> mistake I have been aiming at designs that don't fight the dcache.
> 
> Even in that case I think I limited it to just a entry where
> ugliness happens.

It is nice from a user point of view...

I'd guess a 'magic symlink' that points off somewhere fixed
would be a little cleaner.

> > However that would need flags to both setns() and 'ip netns exec'
> > since programs will rely on the existing behaviour.
> 
> You might want to look again.

The problem was with /sys/class/net

I ended up doing:
	ip netns exec fubar program args 3</sys/class/net

So that open("/sys/class/net/xxx") was inside the fubar namespace
and openat(3, "xxx") was in the default namespace.

But I think:
> On "ip netns add NAME"
> - create the network namespace and mount it at /run/netns/NAME
> - mount the appropriate sysfs at /run/netns-mounts/NAME/sys
> - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf

would make it possible for a program to read (eg)
/sys/class/net/interface/speed for interfaces in multiple
network namespaces.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Toke Høiland-Jørgensen Oct. 11, 2023, 1:49 p.m. UTC | #9

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>
>>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>> There are not many places to look so something like this is probably sufficient:
>>>
>>> # stat all of the possible/probable mount points and see if there is
>>> # something mounted there.  If so recursive bind whatever is there onto
>>> # the new /sys
>>>
>>> for dir in /old/sys/fs/* /old/sys/kernel/*; do
>>> 	if [ $(stat --format '%d' "$dir") = $(stat --format '%d' "$dir/") ; then
>>
>> What is this comparison supposed to do? I couldn't find any directories
>> in /sys/fs/* where this was *not* true, regardless of whether there's a
>> mount there or not.
>
> Bah.  I think I got my logic scrambled.  I can only get it to work
> by comparing the filesystems device on /sys/fs to the device on
> /sys/fs/cgroup etc.
>
> The idea is that st_dev changes between filesystems.  So you can detect
> a filesystem change based on st_dev.
>
> I thought the $dir vs $dir/ would have allowed stating the underlying
> directory verses the mount, but apparently my memory go that one wrong.
>
> Which makes my command actually something like:
>
> 	sys_dev=$(stat --format='%d' /sys)
>
> 	for dir in /old/sys/fs/* /old/sys/kernel/*; do
> 		if [ $(stat --format '%d' "$dir") -ne $sys_dev ] ; then
>                 	echo $dir is a mount point
>                 fi
> 	done

Ah, right that makes sense! I thought I was missing something when I
couldn't get your other example to work...

>>>>> Or is their a reason that bpffs should be per network namespace?
>>>>
>>>> Well, I first ran into this issue because of a bug report to
>>>> xdp-tools/libxdp about things not working correctly in network
>>>> namespaces:
>>>>
>>>> https://github.com/xdp-project/xdp-tools/issues/364
>>>>
>>>> And libxdp does assume that there's a separate bpffs per network
>>>> namespace: it persists things into the bpffs that is tied to the network
>>>> devices in the current namespace. So if the bpffs is shared, an
>>>> application running inside the network namespace could access XDP
>>>> programs loaded in the root namespace. I don't know, but suspect, that
>>>> such assumptions would be relatively common in networking BPF programs
>>>> that use pinning (the pinning support in libbpf and iproute2 itself at
>>>> least have the same leaking problem if the bpffs is shared).
>>>
>>> Are the names of the values truly network namespace specific?
>>>
>>> I did not see any mention of the things that are persisted in the ticket
>>> you pointed me at, and unfortunately I am not familiar with xdp.
>>>
>>> Last I looked until all of the cpu side channels are closed it is
>>> unfortunately unsafe to load ebpf programs with anything less than
>>> CAP_SYS_ADMIN (aka with permission to see and administer the entire
>>> system).  So from a system point of view I really don't see a
>>> fundamental danger from having a global /sys/fs/bpf.
>>>
>>> If there are name conflicts in /sys/fs/bpf because of duplicate names in
>>> different network namespaces I can see that being a problem.
>>
>> Yeah, you're right that someone loading a BPF program generally has
>> permissions enough that they can break out of any containment if they
>> want, but applications do make assumptions about the contents of the
>> pinning directory that can lead to conflicts.
>>
>> A couple of examples:
>>
>> - libxdp will persist files in /sys/fs/bpf/dispatch-$ifindex-$prog_id
>>
>> - If someone sets the 'pinning' attribute on a map definition in a BPF
>>   file, libbpf will pin those files in /sys/fs/bpf/$map_name
>>
>> The first one leads to obvious conflicts if shared across network
>> namespaces because of ifindex collisions. The second one leads to
>> potential false sharing of state across what are supposed to be
>> independent networking domains (e.g., if the bpffs is shared, loading
>> xdp-filter inside a namespace will share the state with another instance
>> loaded in another namespace, which would no doubt be surprising).
>
> Sigh.  So non-default network namespaces can't use /sys/fs/bpf,
> because of silly userspace assumptions.  So the entries need to be
> namespaced to prevent conflicts.

Yup, basically.

>>> At that point the name conflicts either need to be fixed or we
>>> fundamentally need to have multiple mount points for bpffs.
>>> Probably under something like /run/netns-mounts/NAME/.
>>>
>>> With ip netns updated to mount the appropriate filesystem.
>>
>> I don't think it's feasible to fix the conflicts; they've been around
>> for a while and are part of application API in some cases. Plus, we
>> don't know of all BPF-using applications.
>>
>> We could have 'ip' manage separate bpffs mounts per namespace and
>> bind-mount them into each netns (I think that's what you're suggesting),
>> but that would basically achieve the same thing as the mountns
>> persisting I am proposing in this series, but only as a special case for
>> bpffs. So why not do the more flexible thing and persist the whole
>> mountns (so applications inside the namespace can actually mount
>> additional things and have them stick around)? The current behaviour
>> seems very surprising...
>
> I don't like persisting the entire mount namespace because it is hard
> for a system administrator to see, it is difficult for something outside
> of that mount namespace to access, and it is as easy to persist a
> mistake as it is to persist something deliberate.
>
> My proposal:
>
> On "ip netns add NAME"
> - create the network namespace and mount it at /run/netns/NAME
> - mount the appropriate sysfs at /run/netns-mounts/NAME/sys
> - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf
>
> On "ip netns delete NAME"
> - umount --recursive /run/netns-mounts/NAME
> - unlink /run/netns-mounts/NAME
> - cleanup /run/netns/NAME as we do today.
>
> On "ip netns exec NAME"
> - Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/
>   and perform bind mounts.

If we setup the full /sys hierarchy in /run/netns-mounts/NAME this
basically becomes a single recursive bind mount, doesn't it?

What about if we also include bind mounts from the host namespace into
that separate /sys instance? Will those be included into a recursive
bind into /sys inside the mount-ns, or will we have to walk the tree and
do separate bind mounts for each directory?

Anyway, this scheme sounds like it'll solve the issue I was trying to
address so I don't mind doing it this way. I'll try it out and respin
the patch series.

>>> Mount propagation is a way to configure a mount namespace (before
>>> creating a new one) that will cause mounts created in the first mount
>>> namespace to be created in it's children, and cause mounts created in
>>> the children to be created in the parent (depending on how things are
>>> configured).
>>>
>>> It is not my favorite feature (it makes locking of mount namespaces
>>> terrible) and it is probably too clever by half, unfortunately systemd
>>> started enabling mount propagation by default, so we are stuck with it.
>>
>> Right. AFAICT the current iproute2 code explicitly tries to avoid that
>> when creating a mountns (it does a 'mount --make-rslave /'); so you're
>> saying we should change that?
>
> If it makes sense.
>
> I believe I added the 'mount --make-rslave /' because otherwise all
> mount activity was propagating back, and making a mess.  Especially when
> I was unmounting /sys.
>
> I am not a huge fan of mount propagation it has lots of surprising
> little details that need to be set just right, to not cause problems.

Ah, you were talking about propagation from inside the mountns to
outside? Didn't catch that at first...

> With my proposal above I think we could in some carefully chosen
> places enable mount propagation without problem.

One thing that comes to mind would be that if we create persistent /sys
instances in /run/netns-mounts per the above, it would make sense for
any modifications done inside the netns to be propagated back to the
mount in /run; is this possible with a bind mount? Not sure I quite
understand how propagation would work in this case (since it would be a
separate (bind) mount point inside the namespace).

> But I would really like to see an application that is performing
> mounts inside of "ip netns exec" to see how it matters.

Two examples come to mind:

- I believe there are some applications that will mount a private bpffs
  instance for their own use case. Not sure if those applications switch
  in and out of namespaces, though, and if they do whether they are
  namespace-aware themselves

- Interactive use ('ip netns exec $SHELL'), which I sometimes use for
  testing various things. I've mostly had issues with bpffs in this
  setting, though, so if we solve that as per the above, maybe that's
  not needed.

> Code without concrete real world test use cases tends to get things
> wrong.

Heh, amen to that :)

-Toke

Eric W. Biederman Oct. 11, 2023, 2:55 p.m. UTC | #10

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>
>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>
>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>>
>>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>>
>>>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:

>> My proposal:
>>
>> On "ip netns add NAME"
>> - create the network namespace and mount it at /run/netns/NAME
>> - mount the appropriate sysfs at /run/netns-mounts/NAME/sys
>> - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf
>>
>> On "ip netns delete NAME"
>> - umount --recursive /run/netns-mounts/NAME
>> - unlink /run/netns-mounts/NAME
>> - cleanup /run/netns/NAME as we do today.
>>
>> On "ip netns exec NAME"
>> - Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/
>>   and perform bind mounts.
>
> If we setup the full /sys hierarchy in /run/netns-mounts/NAME this
> basically becomes a single recursive bind mount, doesn't it?

Yes.

> What about if we also include bind mounts from the host namespace into
> that separate /sys instance? Will those be included into a recursive
> bind into /sys inside the mount-ns, or will we have to walk the tree and
> do separate bind mounts for each directory?

if /run/netns-mounts/NAME/sys has everything you want.

mount --rbind /run/netns-mounts/NAME/sys /sys

Will result in a /sys that has everything you want.

> Anyway, this scheme sounds like it'll solve the issue I was trying to
> address so I don't mind doing it this way. I'll try it out and respin
> the patch series.

Thanks that sounds like a way forward.


>>>> Mount propagation is a way to configure a mount namespace (before
>>>> creating a new one) that will cause mounts created in the first mount
>>>> namespace to be created in it's children, and cause mounts created in
>>>> the children to be created in the parent (depending on how things are
>>>> configured).
>>>>
>>>> It is not my favorite feature (it makes locking of mount namespaces
>>>> terrible) and it is probably too clever by half, unfortunately systemd
>>>> started enabling mount propagation by default, so we are stuck with it.
>>>
>>> Right. AFAICT the current iproute2 code explicitly tries to avoid that
>>> when creating a mountns (it does a 'mount --make-rslave /'); so you're
>>> saying we should change that?
>>
>> If it makes sense.
>>
>> I believe I added the 'mount --make-rslave /' because otherwise all
>> mount activity was propagating back, and making a mess.  Especially when
>> I was unmounting /sys.
>>
>> I am not a huge fan of mount propagation it has lots of surprising
>> little details that need to be set just right, to not cause problems.
>
> Ah, you were talking about propagation from inside the mountns to
> outside? Didn't catch that at first...
>
>> With my proposal above I think we could in some carefully chosen
>> places enable mount propagation without problem.
>
> One thing that comes to mind would be that if we create persistent /sys
> instances in /run/netns-mounts per the above, it would make sense for
> any modifications done inside the netns to be propagated back to the
> mount in /run; is this possible with a bind mount? Not sure I quite
> understand how propagation would work in this case (since it would be a
> separate (bind) mount point inside the namespace).

Basically yes, but the challenge is in the details.

If the initial propagation is setup properly it will work.  The
weirdness is how propagation works.  There is a weird detail that
it needs to be setup on the parent and not on the mount point.

I think the formula is something like:

mount --bind /run/netns-mounts/NAME/sys/ /run/netns-mounts/NAME/sys/
mount --make-rshared /run/netns-mounts/NAME/sys/
mount -t sysfs /run/netns-mounts/NAME/sys

My memory is that systemd by default does

mount --make-rshared /

So the challenge may be to simply limit what is propagated to a
controlled subset.

Eric

Toke Høiland-Jørgensen Oct. 11, 2023, 3:03 p.m. UTC | #11

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>
>>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>>>
>>>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>>>
>>>>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>>> My proposal:
>>>
>>> On "ip netns add NAME"
>>> - create the network namespace and mount it at /run/netns/NAME
>>> - mount the appropriate sysfs at /run/netns-mounts/NAME/sys
>>> - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf
>>>
>>> On "ip netns delete NAME"
>>> - umount --recursive /run/netns-mounts/NAME
>>> - unlink /run/netns-mounts/NAME
>>> - cleanup /run/netns/NAME as we do today.
>>>
>>> On "ip netns exec NAME"
>>> - Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/
>>>   and perform bind mounts.
>>
>> If we setup the full /sys hierarchy in /run/netns-mounts/NAME this
>> basically becomes a single recursive bind mount, doesn't it?
>
> Yes.
>
>> What about if we also include bind mounts from the host namespace into
>> that separate /sys instance? Will those be included into a recursive
>> bind into /sys inside the mount-ns, or will we have to walk the tree and
>> do separate bind mounts for each directory?
>
> if /run/netns-mounts/NAME/sys has everything you want.
>
> mount --rbind /run/netns-mounts/NAME/sys /sys
>
> Will result in a /sys that has everything you want.
>
>> Anyway, this scheme sounds like it'll solve the issue I was trying to
>> address so I don't mind doing it this way. I'll try it out and respin
>> the patch series.
>
> Thanks that sounds like a way forward.
>
>
>>>>> Mount propagation is a way to configure a mount namespace (before
>>>>> creating a new one) that will cause mounts created in the first mount
>>>>> namespace to be created in it's children, and cause mounts created in
>>>>> the children to be created in the parent (depending on how things are
>>>>> configured).
>>>>>
>>>>> It is not my favorite feature (it makes locking of mount namespaces
>>>>> terrible) and it is probably too clever by half, unfortunately systemd
>>>>> started enabling mount propagation by default, so we are stuck with it.
>>>>
>>>> Right. AFAICT the current iproute2 code explicitly tries to avoid that
>>>> when creating a mountns (it does a 'mount --make-rslave /'); so you're
>>>> saying we should change that?
>>>
>>> If it makes sense.
>>>
>>> I believe I added the 'mount --make-rslave /' because otherwise all
>>> mount activity was propagating back, and making a mess.  Especially when
>>> I was unmounting /sys.
>>>
>>> I am not a huge fan of mount propagation it has lots of surprising
>>> little details that need to be set just right, to not cause problems.
>>
>> Ah, you were talking about propagation from inside the mountns to
>> outside? Didn't catch that at first...
>>
>>> With my proposal above I think we could in some carefully chosen
>>> places enable mount propagation without problem.
>>
>> One thing that comes to mind would be that if we create persistent /sys
>> instances in /run/netns-mounts per the above, it would make sense for
>> any modifications done inside the netns to be propagated back to the
>> mount in /run; is this possible with a bind mount? Not sure I quite
>> understand how propagation would work in this case (since it would be a
>> separate (bind) mount point inside the namespace).
>
> Basically yes, but the challenge is in the details.
>
> If the initial propagation is setup properly it will work.  The
> weirdness is how propagation works.  There is a weird detail that
> it needs to be setup on the parent and not on the mount point.
>
> I think the formula is something like:
>
> mount --bind /run/netns-mounts/NAME/sys/ /run/netns-mounts/NAME/sys/
> mount --make-rshared /run/netns-mounts/NAME/sys/
> mount -t sysfs /run/netns-mounts/NAME/sys
>
> My memory is that systemd by default does
>
> mount --make-rshared /
>
> So the challenge may be to simply limit what is propagated to a
> controlled subset.

Alright, I'll play around with it and bug you some more if I can't get
it to work properly; thanks for the pointers! :)

-Toke

[RFC,iproute2-next,0/5] Persisting of mount namespaces along with network namespaces

Message

Comments