mbox series

[00/24] user_namespace: introduce fsid mappings

Message ID 20200211165753.356508-1-christian.brauner@ubuntu.com (mailing list archive)
Headers show
Series user_namespace: introduce fsid mappings | expand

Message

Christian Brauner Feb. 11, 2020, 4:57 p.m. UTC
Hey everyone,

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the container's microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
we all were quite busy with other work and couldn't really sit down and
implement it. But I took a few days last week to do this work, including
demos and performance testing.
This implementation does not require us to touch the vfs substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case for shiftfs for us is in allowing shared writable
storage to multiple containers using non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will lookup fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.

In the usual case of running an unprivileged container we will have
setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works, because whenever the kernel needs to do a filesystem
access it will lookup the corresponding uid and gid in the idmapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to lookup a mapping for 300000 in the id mapping
tables of the user namespace but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now lookup the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.

A note on proc (and sys), the proc filesystem is special in sofar as it
only has a single superblock that is (currently but might be about to
change) visible in all user namespaces (same goes for sys). This means
it has special semantics in many ways, including how file ownership and
access works. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to lookup
filesystem access in id mapping tables.

When Writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit of fs id mappings is
the same as for id mappings, i.e. 340 lines.

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files to fstat()ing them and calculated the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
	     |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
	     |   	    |                     | and unset for all  | and set for all    |
	     |   	    |    		  | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (24):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  proc: task_state(): use from_kfs{g,u}id_munged
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: chown_common(): handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap:cap_bprm_set_creds(): handle fsid mappings
  sys: __sys_setfsuid(): handle fsid mappings
  sys: __sys_setfsgid(): handle fsid mappings
  sys:__sys_setuid(): handle fsid mappings
  sys:__sys_setgid(): handle fsid mappings
  sys:__sys_setreuid(): handle fsid mappings
  sys:__sys_setregid(): handle fsid mappings
  sys:__sys_setresuid(): handle fsid mappings
  sys:__sys_setresgid(): handle fsid mappings
  devpts: handle fsid mappings

 fs/attr.c                      |  23 ++-
 fs/devpts/inode.c              |   7 +-
 fs/inode.c                     |   7 +-
 fs/namei.c                     |  21 ++-
 fs/open.c                      |  10 +-
 fs/posix_acl.c                 |  21 +--
 fs/proc/array.c                |   5 +-
 fs/proc/base.c                 |  34 ++++
 fs/stat.c                      |  48 ++++--
 include/linux/fs.h             |   5 +
 include/linux/fsuidgid.h       |  70 ++++++++
 include/linux/stat.h           |   1 +
 include/linux/user_namespace.h |  10 ++
 init/Kconfig                   |  11 ++
 kernel/capability.c            |  13 +-
 kernel/sys.c                   |  83 ++++++---
 kernel/user.c                  |  22 +++
 kernel/user_namespace.c        | 303 ++++++++++++++++++++++++++++++++-
 security/commoncap.c           |  19 ++-
 19 files changed, 638 insertions(+), 75 deletions(-)
 create mode 100644 include/linux/fsuidgid.h


base-commit: d5226fa6dbae0569ee43ecfc08bdcd6770fc4755

Comments

Jann Horn Feb. 11, 2020, 8:55 p.m. UTC | #1
On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the container's microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> we all were quite busy with other work and couldn't really sit down and
> implement it. But I took a few days last week to do this work, including
> demos and performance testing.
> This implementation does not require us to touch the vfs substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
>
> For anyone wanting to play with this the branch can be pulled from:
> https://github.com/brauner/linux/tree/fsid_mappings
> https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
>
> The main use case for shiftfs for us is in allowing shared writable
> storage to multiple containers using non-overlapping id mappings.
> In such a scenario you want the fsids to be valid and identical in both
> containers for the shared mount. A demo for this exists in [3].
> If you don't want to read on, go straight to the other demos below in
> [1] and [2].

I guess essentially this means that you want to have UID separation
between containers to prevent the containers - or their owners - from
interfering between each other, but for filesystem access, you don't
want to isolate them from each other using DAC controls on the files
and folders inside the containers' directory hierarchies, instead
relying on mode-0700 parent directories to restrict access to the
container owner? Or would you still have separate UIDs for e.g. the
container's UID range 0-65535, and then map the shared UID range at
100000, or something like that?

> People not as familiar with user namespaces might not be aware that fsid
> mappings already exist. Right now, fsid mappings are always identical to
> id mappings. Specifically, the kernel will lookup fsuids in the uid
> mappings and fsgids in the gid mappings of the relevant user namespace.

That's a bit like saying that a kernel without CONFIG_USER_NS still
has user ID mappings, they just happen to be identity mappings. :P

> With this patch series we simply introduce the ability to create fsid
> mappings that are different from the id mappings of a user namespace.
>
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear as
> 0:0 inside the user namespace will be chowned to 100000:100000 on the
> host. This works, because whenever the kernel needs to do a filesystem
> access it will lookup the corresponding uid and gid in the idmapping
> tables of the container.
> Now think about the case where we want to have an id mapping of 0 100000
> 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> share a single on-disk mapping with multiple containers that all have
> different id mappings.
> This will be problematic. Whenever a filesystem access is requested, the
> kernel will now try to lookup a mapping for 300000 in the id mapping
> tables of the user namespace but since there is none the files will
> appear to be owned by the overflow id, i.e. usually 65534:65534 or
> nobody:nogroup.
>
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

Sorry to bring up something as disgusting as setuid execution, but:
What happens when there's a setuid root file with ->i_uid==300000? I
guess the only way to make that work inside the containers would be
something like make_kuid(current_user_ns(),
from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
path?

> A note on proc (and sys), the proc filesystem is special in sofar as it
> only has a single superblock that is (currently but might be about to
> change) visible in all user namespaces (same goes for sys). This means
> it has special semantics in many ways, including how file ownership and
> access works. The fsid mapping implementation does not alter how proc
> (and sys) ownership works. proc and sys will both continue to lookup
> filesystem access in id mapping tables.

In your example, a process with namespaced UID set (0, 0, 0, 0) will
have kernel UIDs (100000, 100000, 100000, 300000), right? And then if
I want to open /proc/$pid/personality of another process with the same
UIDs, may_open() will call inode_permission() -> do_inode_permission()
-> generic_permission() -> acl_permission_check(), which will compare
current_fsuid() (which is 300000) against inode->i_uid. But
inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
which set inode->i_uid to 100000, right?

Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
with another task's real/effective/saved UID.

[...]
> # Demos
> [1]: Create a container with different id and fsid mappings.
>      https://asciinema.org/a/300233
> [2]: Create a container with id mappings but without fsid mappings.
>      https://asciinema.org/a/300234
> [3]: Share storage between multiple containers with non-overlapping id
>      mappings.
>      https://asciinema.org/a/300235

(I really dislike this asciinema thing; if you want to quickly glance
through the output instead of reading at the same speed as it was
typed, a simple pastebin works much better unless you absolutely have
to show things that use stuff like ncurses UI.)
Christian Brauner Feb. 12, 2020, 2:51 p.m. UTC | #2
On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote:
> On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > This is the implementation of shiftfs which was cooked up during lunch at
> > Linux Plumbers 2019 the day after the container's microconference. The
> > idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> > we all were quite busy with other work and couldn't really sit down and
> > implement it. But I took a few days last week to do this work, including
> > demos and performance testing.
> > This implementation does not require us to touch the vfs substantially
> > at all. Instead, we implement shiftfs via fsid mappings.
> > With this patch, it took me 20 mins to port both LXD and LXC to support
> > shiftfs via fsid mappings.
> >
> > For anyone wanting to play with this the branch can be pulled from:
> > https://github.com/brauner/linux/tree/fsid_mappings
> > https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
> >
> > The main use case for shiftfs for us is in allowing shared writable
> > storage to multiple containers using non-overlapping id mappings.
> > In such a scenario you want the fsids to be valid and identical in both
> > containers for the shared mount. A demo for this exists in [3].
> > If you don't want to read on, go straight to the other demos below in
> > [1] and [2].
> 
> I guess essentially this means that you want to have UID separation
> between containers to prevent the containers - or their owners - from
> interfering between each other, but for filesystem access, you don't
> want to isolate them from each other using DAC controls on the files
> and folders inside the containers' directory hierarchies, instead
> relying on mode-0700 parent directories to restrict access to the
> container owner? Or would you still have separate UIDs for e.g. the
> container's UID range 0-65535, and then map the shared UID range at
> 100000, or something like that?

Yes.
So if you look at the permissions right now for the directory under
which the rootfs for the container and other stuff resides we have
root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers
> perms *
d--x------ 100 alp1
d--x------ 100 f1
d--x------ 100 f2

We don't really share the rootfs between containers right now since we
treat them as standalone systems but with fsid mappings that's possible
too. Layer-sharing-centric runtimes very much will want something like
that.

> 
> > People not as familiar with user namespaces might not be aware that fsid
> > mappings already exist. Right now, fsid mappings are always identical to
> > id mappings. Specifically, the kernel will lookup fsuids in the uid
> > mappings and fsgids in the gid mappings of the relevant user namespace.
> 
> That's a bit like saying that a kernel without CONFIG_USER_NS still
> has user ID mappings, they just happen to be identity mappings. :P

If you have CONFIG_USER_NS=n then you have (as you're well aware)
[<0, 0>, <1,1>, ..., <n,n>] so yeah that's true and analyzing it like
that makes sense. :P

> 
> > With this patch series we simply introduce the ability to create fsid
> > mappings that are different from the id mappings of a user namespace.
> >
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear as
> > 0:0 inside the user namespace will be chowned to 100000:100000 on the
> > host. This works, because whenever the kernel needs to do a filesystem
> > access it will lookup the corresponding uid and gid in the idmapping
> > tables of the container.
> > Now think about the case where we want to have an id mapping of 0 100000
> > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> > share a single on-disk mapping with multiple containers that all have
> > different id mappings.
> > This will be problematic. Whenever a filesystem access is requested, the
> > kernel will now try to lookup a mapping for 300000 in the id mapping
> > tables of the user namespace but since there is none the files will
> > appear to be owned by the overflow id, i.e. usually 65534:65534 or
> > nobody:nogroup.
> >
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
> 
> Sorry to bring up something as disgusting as setuid execution, but:

No that's exactly what this needs. :)

> What happens when there's a setuid root file with ->i_uid==300000? I
> guess the only way to make that work inside the containers would be
> something like make_kuid(current_user_ns(),
> from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
> path?

What's the specific callpath you're thinking about?

So if you look at patch
https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@ubuntu.com/
it does
-	new->suid = new->fsuid = new->euid;
-	new->sgid = new->fsgid = new->egid;
+	fsuid = from_kuid_munged(new->user_ns, new->euid);
+	kfsuid = make_kfsuid(new->user_ns, fsuid);
+	new->suid = new->euid;
+	new->fsuid = kfsuid;
+
+	fsgid = from_kgid_munged(new->user_ns, new->egid);
+	kfsgid = make_kfsgid(new->user_ns, fsgid);
+	new->sgid = new->egid;
+	new->fsgid = kfsgid;

One thing I definitely missed though in the setuid path is to adapt
fs/exec.c:bprm_fill_uid():

diff --git a/fs/exec.c b/fs/exec.c
index 74d88dab98dd..ad839934fdff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
        inode_unlock(inode);

        /* We ignore suid/sgid if there are no mappings for them in the ns */
-       if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
-                !kgid_has_mapping(bprm->cred->user_ns, gid))
+       if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
+                !kfsgid_has_mapping(bprm->cred->user_ns, gid))
                return;

        if (mode & S_ISUID) {

> 
> > A note on proc (and sys), the proc filesystem is special in sofar as it
> > only has a single superblock that is (currently but might be about to
> > change) visible in all user namespaces (same goes for sys). This means
> > it has special semantics in many ways, including how file ownership and
> > access works. The fsid mapping implementation does not alter how proc
> > (and sys) ownership works. proc and sys will both continue to lookup
> > filesystem access in id mapping tables.
> 
> In your example, a process with namespaced UID set (0, 0, 0, 0) will
> have kernel UIDs (100000, 100000, 100000, 300000), right? And then if

Yes.

> I want to open /proc/$pid/personality of another process with the same
> UIDs, may_open() will call inode_permission() -> do_inode_permission()
> -> generic_permission() -> acl_permission_check(), which will compare
> current_fsuid() (which is 300000) against inode->i_uid. But
> inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
> which set inode->i_uid to 100000, right?

Yes. That should be fixable by something like below, I think. (And we can
probably shortcut this by adding a helper that does tell us whether there's
been any fsid mapping setup or not for this user namespace.)
 static int acl_permission_check(struct inode *inode, int mask)
 {
+       kuid_t kuid;
        unsigned int mode = inode->i_mode;

-       if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+       if (!is_userns_visible(inode->i_sb->s_iflags)) {
+               kuid = inode->i_uid;
+       } else {
+               kuid = make_kuid(current_user_ns(),
+                                from_kfsuid(current_user_ns(), inode->i_uid));
+       }
+
+       if (likely(uid_eq(current_fsuid(), kuid)))
                mode >>= 6;
        else {&& (mode & S_IRWXG)) {

> 
> Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
> with another task's real/effective/saved UID.

Right, you even introduced this check in 2015 iirc.
Both of your points make me think that it'd be easiest to introduce
cred->{kfsuid,kfsgid} and whenever an access decision on a
is_userns_visible() filesystem has to be made those will be used. This avoids
having to do on-the fly translations and ptrace_may_access() can just grow a
flag indicating what fscreds it's supposed to look at?

> 
> [...]
> > # Demos
> > [1]: Create a container with different id and fsid mappings.
> >      https://asciinema.org/a/300233
> > [2]: Create a container with id mappings but without fsid mappings.
> >      https://asciinema.org/a/300234
> > [3]: Share storage between multiple containers with non-overlapping id
> >      mappings.
> >      https://asciinema.org/a/300235
> 
> (I really dislike this asciinema thing; if you want to quickly glance
> through the output instead of reading at the same speed as it was
> typed, a simple pastebin works much better unless you absolutely have
> to show things that use stuff like ncurses UI.)

Hmkay, I went through the trouble of converting the asciinema output to
basic shell for all tree demos. :) I made them available as github gists.
So:
demo1: https://gist.github.com/brauner/8e1117720b3f9fab22e44c17f12184bf
demo2: https://gist.github.com/brauner/41a36026a9a1496af0095dce1545548e
demo3: https://gist.github.com/brauner/4586d6bc680a018bc8e1dd114a45592a
Jann Horn Feb. 12, 2020, 6:53 p.m. UTC | #3
On Wed, Feb 12, 2020 at 3:51 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote:
> > On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > This is the implementation of shiftfs which was cooked up during lunch at
> > > Linux Plumbers 2019 the day after the container's microconference. The
> > > idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> > > we all were quite busy with other work and couldn't really sit down and
> > > implement it. But I took a few days last week to do this work, including
> > > demos and performance testing.
> > > This implementation does not require us to touch the vfs substantially
> > > at all. Instead, we implement shiftfs via fsid mappings.
> > > With this patch, it took me 20 mins to port both LXD and LXC to support
> > > shiftfs via fsid mappings.
> > >
> > > For anyone wanting to play with this the branch can be pulled from:
> > > https://github.com/brauner/linux/tree/fsid_mappings
> > > https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> > > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
> > >
> > > The main use case for shiftfs for us is in allowing shared writable
> > > storage to multiple containers using non-overlapping id mappings.
> > > In such a scenario you want the fsids to be valid and identical in both
> > > containers for the shared mount. A demo for this exists in [3].
> > > If you don't want to read on, go straight to the other demos below in
> > > [1] and [2].
> >
> > I guess essentially this means that you want to have UID separation
> > between containers to prevent the containers - or their owners - from
> > interfering between each other, but for filesystem access, you don't
> > want to isolate them from each other using DAC controls on the files
> > and folders inside the containers' directory hierarchies, instead
> > relying on mode-0700 parent directories to restrict access to the
> > container owner? Or would you still have separate UIDs for e.g. the
> > container's UID range 0-65535, and then map the shared UID range at
> > 100000, or something like that?
>
> Yes.
> So if you look at the permissions right now for the directory under
> which the rootfs for the container and other stuff resides we have
> root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers
> > perms *
> d--x------ 100 alp1
> d--x------ 100 f1
> d--x------ 100 f2
>
> We don't really share the rootfs between containers right now since we
> treat them as standalone systems but with fsid mappings that's possible
> too. Layer-sharing-centric runtimes very much will want something like
> that.
[...]
> > > With this patch series we simply introduce the ability to create fsid
> > > mappings that are different from the id mappings of a user namespace.
> > >
> > > In the usual case of running an unprivileged container we will have
> > > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > > correspond to this id mapping, i.e. all files which we want to appear as
> > > 0:0 inside the user namespace will be chowned to 100000:100000 on the
> > > host. This works, because whenever the kernel needs to do a filesystem
> > > access it will lookup the corresponding uid and gid in the idmapping
> > > tables of the container.
> > > Now think about the case where we want to have an id mapping of 0 100000
> > > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> > > share a single on-disk mapping with multiple containers that all have
> > > different id mappings.
> > > This will be problematic. Whenever a filesystem access is requested, the
> > > kernel will now try to lookup a mapping for 300000 in the id mapping
> > > tables of the user namespace but since there is none the files will
> > > appear to be owned by the overflow id, i.e. usually 65534:65534 or
> > > nobody:nogroup.
> > >
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> >
> > Sorry to bring up something as disgusting as setuid execution, but:
>
> No that's exactly what this needs. :)
>
> > What happens when there's a setuid root file with ->i_uid==300000? I
> > guess the only way to make that work inside the containers would be
> > something like make_kuid(current_user_ns(),
> > from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
> > path?
>
> What's the specific callpath you're thinking about?
>
> So if you look at patch
> https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@ubuntu.com/
> it does
> -       new->suid = new->fsuid = new->euid;
> -       new->sgid = new->fsgid = new->egid;
> +       fsuid = from_kuid_munged(new->user_ns, new->euid);
> +       kfsuid = make_kfsuid(new->user_ns, fsuid);
> +       new->suid = new->euid;
> +       new->fsuid = kfsuid;
> +
> +       fsgid = from_kgid_munged(new->user_ns, new->egid);
> +       kfsgid = make_kfsgid(new->user_ns, fsgid);
> +       new->sgid = new->egid;
> +       new->fsgid = kfsgid;

Aaah, okay, I missed that.

> One thing I definitely missed though in the setuid path is to adapt
> fs/exec.c:bprm_fill_uid():
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 74d88dab98dd..ad839934fdff 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>         inode_unlock(inode);
>
>         /* We ignore suid/sgid if there are no mappings for them in the ns */
> -       if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
> -                !kgid_has_mapping(bprm->cred->user_ns, gid))
> +       if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
> +                !kfsgid_has_mapping(bprm->cred->user_ns, gid))
>                 return;
>
>         if (mode & S_ISUID) {
[...]
> > I want to open /proc/$pid/personality of another process with the same
> > UIDs, may_open() will call inode_permission() -> do_inode_permission()
> > -> generic_permission() -> acl_permission_check(), which will compare
> > current_fsuid() (which is 300000) against inode->i_uid. But
> > inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
> > which set inode->i_uid to 100000, right?
>
> Yes. That should be fixable by something like below, I think. (And we can
> probably shortcut this by adding a helper that does tell us whether there's
> been any fsid mapping setup or not for this user namespace.)
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
> +       kuid_t kuid;
>         unsigned int mode = inode->i_mode;
>
> -       if (likely(uid_eq(current_fsuid(), inode->i_uid)))
> +       if (!is_userns_visible(inode->i_sb->s_iflags)) {
> +               kuid = inode->i_uid;
> +       } else {
> +               kuid = make_kuid(current_user_ns(),
> +                                from_kfsuid(current_user_ns(), inode->i_uid));
> +       }
> +
> +       if (likely(uid_eq(current_fsuid(), kuid)))
>                 mode >>= 6;
>         else {&& (mode & S_IRWXG)) {
>
> >
> > Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
> > with another task's real/effective/saved UID.
>
> Right, you even introduced this check in 2015 iirc.
> Both of your points make me think that it'd be easiest to introduce
> cred->{kfsuid,kfsgid} and whenever an access decision on a
> is_userns_visible() filesystem has to be made those will be used. This avoids
> having to do on-the fly translations

I guess that might be less ugly.

> and ptrace_may_access() can just grow a
> flag indicating what fscreds it's supposed to look at?

Wouldn't you always end up using the "real" fsuid there?