diff mbox series

net: sysctl: fix edge case wrt. sysctl write access

Message ID 20231210111033.1823491-1-maze@google.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net: sysctl: fix edge case wrt. sysctl write access | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Maciej Żenczykowski Dec. 10, 2023, 11:10 a.m. UTC
The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
grants write access to networking sysctls.

However, it turns out there is an edge case where this is insufficient:
inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
which can return -EACCES and thus block *all* write access.

Note: AFAICT this check is wrt. the uid/gid mapping that was
active at the time the filesystem (ie. proc) was mounted.

In order for this check to not fail, we need net_ctl_set_ownership()
to set valid uid/gid.  It is not immediately clear what value
to use, nor what values are guaranteed to work.
It does make sense that /proc/sys/net appear to be owned by root
from within the netns owning userns.  As such we only modify
what happens if the code fails to map uid/gid 0.
Currently the code just fails to do anything, which in practice
results in using the zeroes of freshly allocated memory,
and we thus end up with global root.
With this change we instead use the uid/gid of the owning userns.
While it is probably (?) theoretically possible for this to *also*
be unmapped from the /proc filesystem's point of view, this seems
much less likely to happen in practice.

The old code is observed to fail in a relatively complex setup,
within a global root created user namespace with selectively
mapped uid/gids (not including global root) and /proc mounted
afterwards (so this /proc mount does not have global root mapped).
Within this user namespace another non privileged task creates
a new user namespace, maps it's own uid/gid (but not uid/gid 0),
and then creates a network namespace.  It cannot write to networking
sysctls even though it does have CAP_NET_ADMIN.

This is because net_ctl_set_ownership fails to map uid/gid 0
(because uid/gid 0 are *not* mapped in the owning 2nd level user_ns),
and falls back to global root.
But global root is not mapped in the 1st level user_ns,
which was inherited by the /proc mount, and thus fails...

Note: the uid/gid of networking sysctls is of purely superficial
importance, outside of this UNMAPPED check, it does not actually
affect access, and only affects display.

Access is always based on whether you are *global* root uid
(or have CAP_NET_ADMIN over the netns) for user write access bits
(or are in *global* root gid for group write access bits).

Cc: Flavio Crisciani <fcrisciani@google.com>
Cc: "Theodore Y. Ts'o" <tytso@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Fixes: e79c6a4fc923 ("net: make net namespace sysctls belong to container's owner")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 net/sysctl_net.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

Comments

Paolo Abeni Dec. 14, 2023, 9:37 a.m. UTC | #1
On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> grants write access to networking sysctls.
> 
> However, it turns out there is an edge case where this is insufficient:
> inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> which can return -EACCES and thus block *all* write access.
> 
> Note: AFAICT this check is wrt. the uid/gid mapping that was
> active at the time the filesystem (ie. proc) was mounted.
> 
> In order for this check to not fail, we need net_ctl_set_ownership()
> to set valid uid/gid.  It is not immediately clear what value
> to use, nor what values are guaranteed to work.
> It does make sense that /proc/sys/net appear to be owned by root
> from within the netns owning userns.  As such we only modify
> what happens if the code fails to map uid/gid 0.
> Currently the code just fails to do anything, which in practice
> results in using the zeroes of freshly allocated memory,
> and we thus end up with global root.
> With this change we instead use the uid/gid of the owning userns.
> While it is probably (?) theoretically possible for this to *also*
> be unmapped from the /proc filesystem's point of view, this seems
> much less likely to happen in practice.
> 
> The old code is observed to fail in a relatively complex setup,
> within a global root created user namespace with selectively
> mapped uid/gids (not including global root) and /proc mounted
> afterwards (so this /proc mount does not have global root mapped).
> Within this user namespace another non privileged task creates
> a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> and then creates a network namespace.  It cannot write to networking
> sysctls even though it does have CAP_NET_ADMIN.

I'm wondering if this specific scenario should be considered a setup 
issue, and should be solved with a different configuration? I would
love to hear others opinions!

> This is because net_ctl_set_ownership fails to map uid/gid 0
> (because uid/gid 0 are *not* mapped in the owning 2nd level user_ns),
> and falls back to global root.
> But global root is not mapped in the 1st level user_ns,
> which was inherited by the /proc mount, and thus fails...
> 
> Note: the uid/gid of networking sysctls is of purely superficial
> importance, outside of this UNMAPPED check, it does not actually
> affect access, and only affects display.
> 
> Access is always based on whether you are *global* root uid
> (or have CAP_NET_ADMIN over the netns) for user write access bits
> (or are in *global* root gid for group write access bits).
> 
> Cc: Flavio Crisciani <fcrisciani@google.com>
> Cc: "Theodore Y. Ts'o" <tytso@google.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> Fixes: e79c6a4fc923 ("net: make net namespace sysctls belong to container's owner")
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> ---
>  net/sysctl_net.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> index 051ed5f6fc93..ded399f380d9 100644
> --- a/net/sysctl_net.c
> +++ b/net/sysctl_net.c
> @@ -58,16 +58,11 @@ static void net_ctl_set_ownership(struct ctl_table_header *head,
>  				  kuid_t *uid, kgid_t *gid)
>  {
>  	struct net *net = container_of(head->set, struct net, sysctls);
> -	kuid_t ns_root_uid;
> -	kgid_t ns_root_gid;
> +	kuid_t ns_root_uid = make_kuid(net->user_ns, 0);
> +	kgid_t ns_root_gid = make_kgid(net->user_ns, 0);
>  
> -	ns_root_uid = make_kuid(net->user_ns, 0);

As a fix I would prefer you would keep it minimal. e.g. just replace
the if with the ternary operator or just add an 'else' branch.

Cheers,

Paolo
Maciej Żenczykowski Dec. 14, 2023, 1:17 p.m. UTC | #2
On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> > grants write access to networking sysctls.
> >
> > However, it turns out there is an edge case where this is insufficient:
> > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> > which can return -EACCES and thus block *all* write access.
> >
> > Note: AFAICT this check is wrt. the uid/gid mapping that was
> > active at the time the filesystem (ie. proc) was mounted.
> >
> > In order for this check to not fail, we need net_ctl_set_ownership()
> > to set valid uid/gid.  It is not immediately clear what value
> > to use, nor what values are guaranteed to work.
> > It does make sense that /proc/sys/net appear to be owned by root
> > from within the netns owning userns.  As such we only modify
> > what happens if the code fails to map uid/gid 0.
> > Currently the code just fails to do anything, which in practice
> > results in using the zeroes of freshly allocated memory,
> > and we thus end up with global root.
> > With this change we instead use the uid/gid of the owning userns.
> > While it is probably (?) theoretically possible for this to *also*
> > be unmapped from the /proc filesystem's point of view, this seems
> > much less likely to happen in practice.
> >
> > The old code is observed to fail in a relatively complex setup,
> > within a global root created user namespace with selectively
> > mapped uid/gids (not including global root) and /proc mounted
> > afterwards (so this /proc mount does not have global root mapped).
> > Within this user namespace another non privileged task creates
> > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> > and then creates a network namespace.  It cannot write to networking
> > sysctls even though it does have CAP_NET_ADMIN.
>
> I'm wondering if this specific scenario should be considered a setup
> issue, and should be solved with a different configuration? I would
> love to hear others opinions!

While it could be fixed in userspace.  I don't think it should:

The global root uid/gid are very intentionally not mapped in (as a
security feature).
So that part isn't changeable (it's also a system daemon and not under
user control).

The user namespace very intentionally maps uid->uid and not 0->uid.
Here there's theoretically more leeway... because it is at least under
user control.
However here this is done for good reason as well.
There's plenty of code that special cases uid=0, both in the kernel
(for example capability handling across exec) and in various userspace
libraries.  It's unrealistic to fix them all.
Additionally it's nice to have semi-transparent user namespaces,
which are security barriers but don't remap uids - remapping causes confusion.
(ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)

As for why?  Because uids as visible to userspace may leak across user
namespace boundaries,
either when talking to other system daemons or when talking across machines.
It's pretty easy (and common) to have uids that are globally unique
and meaningful in a cluster of machines.
Again, this is *theoretically* fixable in userspace, but not actually
a realistic expectation.

btw. even outside of clusters of machines, I also run some
user/uts/net namespace using
code on my personal desktop (this does require some minor hacks to
unshare/mount binaries),
and again I intentionally map uid->uid and 0->uid, because this makes
my username show up as 'maze' and not 'root'.

This is *clearly* a kernel bug that this doesn't just work.
(side note: there's a very similar issue in proc_net.c which I haven't
gotten around to fixing yet, because it looks to be more complex to
convince oneself it's safe to do)

> > This is because net_ctl_set_ownership fails to map uid/gid 0
> > (because uid/gid 0 are *not* mapped in the owning 2nd level user_ns),
> > and falls back to global root.
> > But global root is not mapped in the 1st level user_ns,
> > which was inherited by the /proc mount, and thus fails...
> >
> > Note: the uid/gid of networking sysctls is of purely superficial
> > importance, outside of this UNMAPPED check, it does not actually
> > affect access, and only affects display.
> >
> > Access is always based on whether you are *global* root uid
> > (or have CAP_NET_ADMIN over the netns) for user write access bits
> > (or are in *global* root gid for group write access bits).
> >
> > Cc: Flavio Crisciani <fcrisciani@google.com>
> > Cc: "Theodore Y. Ts'o" <tytso@google.com>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> > Fixes: e79c6a4fc923 ("net: make net namespace sysctls belong to container's owner")
> > Signed-off-by: Maciej Żenczykowski <maze@google.com>
> > ---
> >  net/sysctl_net.c | 13 ++++---------
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> > index 051ed5f6fc93..ded399f380d9 100644
> > --- a/net/sysctl_net.c
> > +++ b/net/sysctl_net.c
> > @@ -58,16 +58,11 @@ static void net_ctl_set_ownership(struct ctl_table_header *head,
> >                                 kuid_t *uid, kgid_t *gid)
> >  {
> >       struct net *net = container_of(head->set, struct net, sysctls);
> > -     kuid_t ns_root_uid;
> > -     kgid_t ns_root_gid;
> > +     kuid_t ns_root_uid = make_kuid(net->user_ns, 0);
> > +     kgid_t ns_root_gid = make_kgid(net->user_ns, 0);
> >
> > -     ns_root_uid = make_kuid(net->user_ns, 0);
>
> As a fix I would prefer you would keep it minimal. e.g. just replace
> the if with the ternary operator or just add an 'else' branch.

If you think that's better, I'll resend a v2.

>
> Cheers,
>
> Paolo
>

--
Maciej Żenczykowski, Kernel Networking Developer @ Google
Paolo Abeni Dec. 14, 2023, 5:16 p.m. UTC | #3
On Thu, 2023-12-14 at 14:17 +0100, Maciej Żenczykowski wrote:
> On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> > > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> > > grants write access to networking sysctls.
> > > 
> > > However, it turns out there is an edge case where this is insufficient:
> > > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> > > which can return -EACCES and thus block *all* write access.
> > > 
> > > Note: AFAICT this check is wrt. the uid/gid mapping that was
> > > active at the time the filesystem (ie. proc) was mounted.
> > > 
> > > In order for this check to not fail, we need net_ctl_set_ownership()
> > > to set valid uid/gid.  It is not immediately clear what value
> > > to use, nor what values are guaranteed to work.
> > > It does make sense that /proc/sys/net appear to be owned by root
> > > from within the netns owning userns.  As such we only modify
> > > what happens if the code fails to map uid/gid 0.
> > > Currently the code just fails to do anything, which in practice
> > > results in using the zeroes of freshly allocated memory,
> > > and we thus end up with global root.
> > > With this change we instead use the uid/gid of the owning userns.
> > > While it is probably (?) theoretically possible for this to *also*
> > > be unmapped from the /proc filesystem's point of view, this seems
> > > much less likely to happen in practice.
> > > 
> > > The old code is observed to fail in a relatively complex setup,
> > > within a global root created user namespace with selectively
> > > mapped uid/gids (not including global root) and /proc mounted
> > > afterwards (so this /proc mount does not have global root mapped).
> > > Within this user namespace another non privileged task creates
> > > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> > > and then creates a network namespace.  It cannot write to networking
> > > sysctls even though it does have CAP_NET_ADMIN.
> > 
> > I'm wondering if this specific scenario should be considered a setup
> > issue, and should be solved with a different configuration? I would
> > love to hear others opinions!
> 
> While it could be fixed in userspace.  I don't think it should:
> 
> The global root uid/gid are very intentionally not mapped in (as a
> security feature).
> So that part isn't changeable (it's also a system daemon and not under
> user control).
> 
> The user namespace very intentionally maps uid->uid and not 0->uid.
> Here there's theoretically more leeway... because it is at least under
> user control.
> However here this is done for good reason as well.
> There's plenty of code that special cases uid=0, both in the kernel
> (for example capability handling across exec) and in various userspace
> libraries.  It's unrealistic to fix them all.
> Additionally it's nice to have semi-transparent user namespaces,
> which are security barriers but don't remap uids - remapping causes confusion.
> (ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)
> 
> As for why?  Because uids as visible to userspace may leak across user
> namespace boundaries,
> either when talking to other system daemons or when talking across machines.
> It's pretty easy (and common) to have uids that are globally unique
> and meaningful in a cluster of machines.
> Again, this is *theoretically* fixable in userspace, but not actually
> a realistic expectation.
> 
> btw. even outside of clusters of machines, I also run some
> user/uts/net namespace using
> code on my personal desktop (this does require some minor hacks to
> unshare/mount binaries),
> and again I intentionally map uid->uid and 0->uid, because this makes
> my username show up as 'maze' and not 'root'.

I see, thanks for all the details.

> This is *clearly* a kernel bug that this doesn't just work.
> (side note: there's a very similar issue in proc_net.c which I haven't
> gotten around to fixing yet, because it looks to be more complex to
> convince oneself it's safe to do)

Indeed the potential security related issue is the root cause of my
concerns here. I could not identify any such problem, but I must admit
the uid mapping is not the kernel I know better.

I definitely could use another pair of eyeballs here ;)

Cheers,

Paolo
Maciej Żenczykowski Dec. 14, 2023, 5:52 p.m. UTC | #4
On Thu, Dec 14, 2023 at 6:16 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2023-12-14 at 14:17 +0100, Maciej Żenczykowski wrote:
> > On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> > > > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> > > > grants write access to networking sysctls.
> > > >
> > > > However, it turns out there is an edge case where this is insufficient:
> > > > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> > > > which can return -EACCES and thus block *all* write access.
> > > >
> > > > Note: AFAICT this check is wrt. the uid/gid mapping that was
> > > > active at the time the filesystem (ie. proc) was mounted.
> > > >
> > > > In order for this check to not fail, we need net_ctl_set_ownership()
> > > > to set valid uid/gid.  It is not immediately clear what value
> > > > to use, nor what values are guaranteed to work.
> > > > It does make sense that /proc/sys/net appear to be owned by root
> > > > from within the netns owning userns.  As such we only modify
> > > > what happens if the code fails to map uid/gid 0.
> > > > Currently the code just fails to do anything, which in practice
> > > > results in using the zeroes of freshly allocated memory,
> > > > and we thus end up with global root.
> > > > With this change we instead use the uid/gid of the owning userns.
> > > > While it is probably (?) theoretically possible for this to *also*
> > > > be unmapped from the /proc filesystem's point of view, this seems
> > > > much less likely to happen in practice.
> > > >
> > > > The old code is observed to fail in a relatively complex setup,
> > > > within a global root created user namespace with selectively
> > > > mapped uid/gids (not including global root) and /proc mounted
> > > > afterwards (so this /proc mount does not have global root mapped).
> > > > Within this user namespace another non privileged task creates
> > > > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> > > > and then creates a network namespace.  It cannot write to networking
> > > > sysctls even though it does have CAP_NET_ADMIN.
> > >
> > > I'm wondering if this specific scenario should be considered a setup
> > > issue, and should be solved with a different configuration? I would
> > > love to hear others opinions!
> >
> > While it could be fixed in userspace.  I don't think it should:
> >
> > The global root uid/gid are very intentionally not mapped in (as a
> > security feature).
> > So that part isn't changeable (it's also a system daemon and not under
> > user control).
> >
> > The user namespace very intentionally maps uid->uid and not 0->uid.
> > Here there's theoretically more leeway... because it is at least under
> > user control.
> > However here this is done for good reason as well.
> > There's plenty of code that special cases uid=0, both in the kernel
> > (for example capability handling across exec) and in various userspace
> > libraries.  It's unrealistic to fix them all.
> > Additionally it's nice to have semi-transparent user namespaces,
> > which are security barriers but don't remap uids - remapping causes confusion.
> > (ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)
> >
> > As for why?  Because uids as visible to userspace may leak across user
> > namespace boundaries,
> > either when talking to other system daemons or when talking across machines.
> > It's pretty easy (and common) to have uids that are globally unique
> > and meaningful in a cluster of machines.
> > Again, this is *theoretically* fixable in userspace, but not actually
> > a realistic expectation.
> >
> > btw. even outside of clusters of machines, I also run some
> > user/uts/net namespace using
> > code on my personal desktop (this does require some minor hacks to
> > unshare/mount binaries),
> > and again I intentionally map uid->uid and 0->uid, because this makes

obviously this was meant to say "and not 0->uid"

> > my username show up as 'maze' and not 'root'.
>
> I see, thanks for all the details.
>
> > This is *clearly* a kernel bug that this doesn't just work.
> > (side note: there's a very similar issue in proc_net.c which I haven't
> > gotten around to fixing yet, because it looks to be more complex to
> > convince oneself it's safe to do)
>
> Indeed the potential security related issue is the root cause of my
> concerns here. I could not identify any such problem, but I must admit
> the uid mapping is not the kernel I know better.

Oh, I'm not surprised: It took Flavio (as the affected user) and I
~two weeks or so (and dozens of back and forths with test binaries
with more and more logging and different approaches to doing things)
to figure out what was going wrong.  Ultimately we only succeeded once
I managed to get the test case running on a semi-isolated dev machine
with the full userspace cluster stack and a printk instrumented
kernel.  It was certainly a head-scratcher...

> I definitely could use another pair of eyeballs here ;)
>
> Cheers,
>
> Paolo
>

--
Maciej Żenczykowski, Kernel Networking Developer @ Google
Eric W. Biederman Dec. 15, 2023, 6:23 a.m. UTC | #5
Maciej Żenczykowski <maze@google.com> writes:

> On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
>> > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
>> > grants write access to networking sysctls.
>> >
>> > However, it turns out there is an edge case where this is insufficient:
>> > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
>> > which can return -EACCES and thus block *all* write access.
>> >
>> > Note: AFAICT this check is wrt. the uid/gid mapping that was
>> > active at the time the filesystem (ie. proc) was mounted.
>> >
>> > In order for this check to not fail, we need net_ctl_set_ownership()
>> > to set valid uid/gid.  It is not immediately clear what value
>> > to use, nor what values are guaranteed to work.
>> > It does make sense that /proc/sys/net appear to be owned by root
>> > from within the netns owning userns.  As such we only modify
>> > what happens if the code fails to map uid/gid 0.
>> > Currently the code just fails to do anything, which in practice
>> > results in using the zeroes of freshly allocated memory,
>> > and we thus end up with global root.
>> > With this change we instead use the uid/gid of the owning userns.
>> > While it is probably (?) theoretically possible for this to *also*
>> > be unmapped from the /proc filesystem's point of view, this seems
>> > much less likely to happen in practice.
>> >
>> > The old code is observed to fail in a relatively complex setup,
>> > within a global root created user namespace with selectively
>> > mapped uid/gids (not including global root) and /proc mounted
>> > afterwards (so this /proc mount does not have global root mapped).
>> > Within this user namespace another non privileged task creates
>> > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
>> > and then creates a network namespace.  It cannot write to networking
>> > sysctls even though it does have CAP_NET_ADMIN.
>>
>> I'm wondering if this specific scenario should be considered a setup
>> issue, and should be solved with a different configuration? I would
>> love to hear others opinions!

It feels like a setup issue to me.  A different mount of proc can
always be used to set the sysctls if really needed.

>
> While it could be fixed in userspace.  I don't think it should:
>
> The global root uid/gid are very intentionally not mapped in (as a
> security feature).
> So that part isn't changeable (it's also a system daemon and not under
> user control).

Likewise the default for all sysctls is global uid 0 and global gid 0.

> The user namespace very intentionally maps uid->uid and not 0->uid.
> Here there's theoretically more leeway... because it is at least under
> user control.
> However here this is done for good reason as well.
> There's plenty of code that special cases uid=0, both in the kernel
> (for example capability handling across exec) and in various userspace
> libraries.  It's unrealistic to fix them all.

Ish.  Frankly in the kernel I have fixed them all a long time ago.
At least when the uids don't map straight through.

At least in the case when they don't make to the global uid 0 and global
gid 0.

> Additionally it's nice to have semi-transparent user namespaces,
> which are security barriers but don't remap uids - remapping causes confusion.
> (ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)

Ah.  So you are deliberately creating a this setup.

> As for why?  Because uids as visible to userspace may leak across user
> namespace boundaries,
> either when talking to other system daemons or when talking across machines.
> It's pretty easy (and common) to have uids that are globally unique
> and meaningful in a cluster of machines.
> Again, this is *theoretically* fixable in userspace, but not actually
> a realistic expectation.
>
> btw. even outside of clusters of machines, I also run some
> user/uts/net namespace using
> code on my personal desktop (this does require some minor hacks to
> unshare/mount binaries),
> and again I intentionally map uid->uid and 0->uid, because this makes
> my username show up as 'maze' and not 'root'.
>
> This is *clearly* a kernel bug that this doesn't just work.

No it is not *clearly* bug.

If the owning uid/gid is not mapped it is in general not safe to write
to an inode.  It is a nonsense scenario.

You have deliberately created a scenario where there is no uid 0 or
gid 0 to deliberately break things as a security feature and then you
are asking why things break?

As far as I can tell this is like locking your keys in the car, to make
certain no one can steal your car.  It works but it also makes it
difficult to get into your car and drive it.

> (side note: there's a very similar issue in proc_net.c which I haven't
> gotten around to fixing yet, because it looks to be more complex to
> convince oneself it's safe to do)

It is not a some much similar as the same.

Among other things there is a possibility that someone else has
deliberately used the inability to write to those sysctls without
the 0 uid and gid mapped.  If this is the case you are busy breaking
someone else's security.

As I recall the classic approach on nfs is to map uid 0 to a useless uid
like nobody.  I don't understand why something like that is not being
used in your case.

I don't see how deliberately crippling yourself by leaving 0 completely
unmapped gains anything of any value.  As such I don't understand
why you would like the kernel to have a special case to support your
use case.

Eric
diff mbox series

Patch

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 051ed5f6fc93..ded399f380d9 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -58,16 +58,11 @@  static void net_ctl_set_ownership(struct ctl_table_header *head,
 				  kuid_t *uid, kgid_t *gid)
 {
 	struct net *net = container_of(head->set, struct net, sysctls);
-	kuid_t ns_root_uid;
-	kgid_t ns_root_gid;
+	kuid_t ns_root_uid = make_kuid(net->user_ns, 0);
+	kgid_t ns_root_gid = make_kgid(net->user_ns, 0);
 
-	ns_root_uid = make_kuid(net->user_ns, 0);
-	if (uid_valid(ns_root_uid))
-		*uid = ns_root_uid;
-
-	ns_root_gid = make_kgid(net->user_ns, 0);
-	if (gid_valid(ns_root_gid))
-		*gid = ns_root_gid;
+	*uid = uid_valid(ns_root_uid) ? ns_root_uid : net->user_ns->owner;
+	*gid = gid_valid(ns_root_gid) ? ns_root_gid : net->user_ns->group;
 }
 
 static struct ctl_table_root net_sysctl_root = {