mbox series

[00/34] fs: idmapped mounts

Message ID 20201029003252.2128653-1-christian.brauner@ubuntu.com (mailing list archive)
Headers show
Series fs: idmapped mounts | expand

Message

Christian Brauner Oct. 29, 2020, 12:32 a.m. UTC
Hey everyone,

I vanished for a little while to focus on this work here so sorry for
not being available by mail for a while.

Since quite a long time we have issues with sharing mounts between
multiple unprivileged containers with different id mappings, sharing a
rootfs between multiple containers with different id mappings, and also
sharing regular directories and filesystems between users with different
uids and gids. The latter use-cases have become even more important with
the availability and adoption of systemd-homed (cf. [1]) to implement
portable home directories.

The solutions we have tried and proposed so far include the introduction
of fsid mappings, a tiny overlay based filesystem, and an approach to
call override creds in the vfs. None of these solutions have covered all
of the above use-cases.

The solution proposed here has it's origins in multiple discussions
during Linux Plumbers 2017 during and after the end of the containers
microconference.
To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
James, and myself. A variant of the solution proposed here has also been
discussed, again to the best of my knowledge, after a Linux conference
in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
after Linux Plumbers.
I've taken the time to finally implement a working version of this
solution over the last weeks to the best of my abilities. Tycho has
signed up for this sligthly crazy endeavour as well and he has helped
with the conversion of the xattr codepaths.

The core idea is to make idmappings a property of struct vfsmount
instead of tying it to a process being inside of a user namespace which
has been the case for all other proposed approaches.
It means that idmappings become a property of bind-mounts, i.e. each
bind-mount can have a separate idmapping. This has the obvious advantage
that idmapped mounts can be created inside of the initial user
namespace, i.e. on the host itself instead of requiring the caller to be
located inside of a user namespace. This enables such use-cases as e.g.
making a usb stick available in multiple locations with different
idmappings (see the vfat port that is part of this patch series).

The vfsmount struct gains a new struct user_namespace member. The
idmapping of the user namespace becomes the idmapping of the mount. A
caller that is either privileged with respect to the user namespace of
the superblock of the underlying filesystem or a caller that is
privileged with respect to the user namespace a mount has been idmapped
with can create a new bind-mount and mark it with a user namespace. The
user namespace the mount will be marked with can be specified by passing
a file descriptor refering to the user namespace as an argument to the
new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes should be observed. All mapping
operations are nops for the initial user namespace.

When a file/inode is accessed through an idmapped mount the i_uid and
i_gid of the inode will be remapped according to the user namespace the
mount has been marked with. When a new object is created based on the
fsuid and fsgid of the caller they will similarly be remapped according
to the user namespace of the mount they care created from.

This means the user namespace of the mount needs to be passed down into
a few relevant inode_operations. This mostly includes inode operations
that create filesystem objects or change file attributes. Some of them
such as ->getattr() don't even need to change since they pass down a
struct path and thus the struct vfsmount is already available. Other
inode operations need to be adapted to pass down the user namespace the
vfsmount has been marked with. Al was nice enough to point out that he
will not tolerate struct vfsmount being passed to filesystems and that I
should pass down the user namespace directly; which is what I did.
The inode struct itself is never altered whenever the i_uid and i_gid
need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
the check. An inode once initialized (during lookup or object creation)
is never altered when accessed through an idmapped mount.

To limit the amount of noise in this first iteration we have not changed
the existing inode operations but rather introduced a few new struct
inode operation methods such as ->mkdir_mapped which pass down the user
namespace of the mount they have been called from. Should this solution
be worth pursuing we have no problem adapting the existing inode
operations instead.

In order to support idmapped mounts, filesystems need to be changed and
mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
iteration I tried to illustrate this by changing three different
filesystem with different levels of complexity. Of course with some bias
towards urgent use-cases and filesystems I was at least a little more
familiar with. However, Tycho and I (and others) have no problem
converting each filesystem one-by-one. This first iteration includes fat
(msdos and vfat), ext4, and overlayfs (both with idmapped lower and
upper directories and idmapped merged directories). I'm sure I haven't
gotten everything right for all three of them in the first version of
this patch.

I have written a simple tool that allows to create idmapped mounts so
people can play with this patch series. Here are a few illustrations:

1. Create a simple idmapped mount of another user's home directory

u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--

2. Create mapping of the whole ext4 rootfs without a mapping for uid and gid 0

ubuntu@f2-vm:~$ sudo /mount-idmapped --map-mount b:1:1:65536 / /mnt/
ubuntu@f2-vm:~$ findmnt | grep mnt
└─/mnt                                /dev/sda2  ext4       rw,relatime
  └─/mnt/mnt                          /dev/sda2  ext4       rw,relatime
ubuntu@f2-vm:~$ sudo mkdir /AS-ROOT-CAN-CREATE
ubuntu@f2-vm:~$ sudo mkdir /mnt/AS-ROOT-CANT-CREATE
mkdir: cannot create directory ‘/mnt/AS-ROOT-CANT-CREATE’: Value too large for defined data type
ubuntu@f2-vm:~$ mkdir /mnt/home/ubuntu/AS-USER-1000-CAN-CREATE

3. Create a vfat usb mount and expose to user 1001 and 5000

ubuntu@f2-vm:/$ sudo mount /dev/sdb /mnt
ubuntu@f2-vm:/$ findmnt  | grep mnt
└─/mnt                                /dev/sdb vfat       rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
ubuntu@f2-vm:/$ ls -al /mnt
total 12
drwxr-xr-x  2 root root 4096 Jan  1  1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x  1 root root    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 root root    0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:1001:1 /mnt /mnt-1001/
ubuntu@f2-vm:/$ ls -al /mnt-1001/
total 12
drwxr-xr-x  2 u1001 u1001 4096 Jan  1  1970 .
drwxr-xr-x 34 root  root  4096 Oct 28 22:24 ..
-rwxr-xr-x  1 u1001 u1001    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 u1001 u1001    0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:5000:1 /mnt /mnt-5000/
ubuntu@f2-vm:/$ ls -al /mnt-5000/
total 12
drwxr-xr-x  2 5000 5000 4096 Jan  1  1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x  1 5000 5000    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 5000 5000    0 Oct 28 01:09 bbb

4. Create an idmapped rootfs mount for a container

root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/
total 68
drwxr-xr-x 17 20000 20000 4096 Sep 24 07:48 .
drwxrwx---  3 20000 20000 4096 Oct 16 19:26 ..
lrwxrwxrwx  1 20000 20000    7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 boot
drwxr-xr-x  3 20000 20000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 20000 20000 4096 Oct 16 19:26 etc
drwxr-xr-x  3 20000 20000 4096 Sep 24 07:45 home
lrwxrwxrwx  1 20000 20000    7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx  1 20000 20000    9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx  1 20000 20000    9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx  1 20000 20000   10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 media
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 mnt
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 opt
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 proc
drwx------  2 20000 20000 4096 Sep 24 07:43 root
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:45 run
lrwxrwxrwx  1 20000 20000    8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 srv
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 sys
drwxrwxrwt  2 20000 20000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 20000 20000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 20000 20000 4096 Sep 24 07:44 var
root@f2-vm:~# /mount-idmapped --map-mount b:20000:10000:100000 /var/lib/lxc/f2/rootfs/ /mnt
root@f2-vm:~# ls -al /mnt
total 68
drwxr-xr-x 17 10000 10000 4096 Sep 24 07:48 .
drwxr-xr-x 34 root  root  4096 Oct 28 22:24 ..
lrwxrwxrwx  1 10000 10000    7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 boot
drwxr-xr-x  3 10000 10000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 10000 10000 4096 Oct 16 19:26 etc
drwxr-xr-x  3 10000 10000 4096 Sep 24 07:45 home
lrwxrwxrwx  1 10000 10000    7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx  1 10000 10000    9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx  1 10000 10000    9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx  1 10000 10000   10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 media
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 mnt
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 opt
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 proc
drwx------  2 10000 10000 4096 Sep 24 07:43 root
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:45 run
lrwxrwxrwx  1 10000 10000    8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 srv
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 sys
drwxrwxrwt  2 10000 10000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 10000 10000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 10000 10000 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-start f2 # uses /mnt as rootfs
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/uid_map
         0      10000      10000
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/gid_map
         0      10000      10000
root@f2-vm:~# lxc-attach f2 -- ls -al /
total 52
drwxr-xr-x  17 root   root    4096 Sep 24 07:48 .
drwxr-xr-x  17 root   root    4096 Sep 24 07:48 ..
lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
drwxr-xr-x   5 root   root     500 Oct 28 23:39 dev
drwxr-xr-x  61 root   root    4096 Oct 28 23:39 etc
drwxr-xr-x   3 root   root    4096 Sep 24 07:45 home
lrwxrwxrwx   1 root   root       7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx   1 root   root       9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx   1 root   root       9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx   1 root   root      10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 media
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 mnt
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 opt
dr-xr-xr-x 232 nobody nogroup    0 Oct 28 23:39 proc
drwx------   2 root   root    4096 Oct 28 23:41 root
drwxr-xr-x  12 root   root     360 Oct 28 23:39 run
lrwxrwxrwx   1 root   root       8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 srv
dr-xr-xr-x  13 nobody nogroup    0 Oct 28 23:39 sys
drwxrwxrwt  11 root   root    4096 Oct 28 23:40 tmp
drwxr-xr-x  13 root   root    4096 Sep 24 07:43 usr
drwxr-xr-x  12 root   root    4096 Sep 24 07:44 var
root@f2-vm:~# lxc-attach f2 -- ls -al /my-file
-rw-r--r-- 1 root root 0 Oct 28 23:43 /my-file
root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/my-file
-rw-r--r-- 1 20000 20000 0 Oct 28 23:43 /var/lib/lxc/f2/rootfs/my-file

[1]: https://systemd.io/HOME_DIRECTORY/
     "If the UID assigned to a user does not match the owner of the home
      directory in the file system, the home directory is automatically
      and recursively chown()ed to the correct UID."
      This has huge performance impact and is also problematic since it
      chowns all files independent of ownership.
[2]: https://github.com/brauner/mount-idmapped

In no particular order I'd like to say thanks to:
Al for pointing me into the direction to avoid inode alias issues during
lookup. David for various discussions around this. Tycho for helping
with this series and on future patches if this is in any shape or form
acceptable. Alban Crequy for pointing out more application container
use-cases. Stéphane for various valuable input on various use-cases and
letting me work on this. Amir for explaining and discussing aspects of
overlayfs with me.
I'd like to especially thank Seth Forshee because he provided a lot of
good analysis, suggestions, and participated in short-notice discussions
in both chat and video.

This series can be found and pulled in three locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=idmapped_mounts
https://github.com/brauner/linux/tree/idmapped_mounts
https://gitlab.com/brauner/linux/-/commits/idmapped_mounts

Thanks!
Christian

Christian Brauner (32):
  namespace: take lock_mount_hash() directly when changing flags
  namespace: only take read lock in do_reconfigure_mnt()
  fs: add mount_setattr()
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add id translation helpers
  capability: handle idmapped mounts
  namei: add idmapped mount aware permission helpers
  inode: add idmapped mount aware init and permission helpers
  attr: handle idmapped mounts
  acl: handle idmapped mounts
  commoncap: handle idmapped mounts
  stat: add mapped_generic_fillattr()
  namei: handle idmapped mounts in may_*() helpers
  namei: introduce struct renamedata
  namei: prepare for idmapped mounts
  namei: add lookup helpers with idmapped mounts aware permission
    checking
  open: handle idmapped mounts in do_truncate()
  open: handle idmapped mounts
  af_unix: handle idmapped mounts
  utimes: handle idmapped mounts
  would_dump: handle idmapped mounts
  exec: handle idmapped mounts
  fs: add helpers for idmap mounts
  apparmor: handle idmapped mounts
  audit: handle idmapped mounts
  ima: handle idmapped mounts
  ext4: support idmapped mounts
  expfs: handle idmapped mounts
  overlayfs: handle idmapped lower directories
  overlayfs: handle idmapped merged mounts
  fat: handle idmapped mounts

Tycho Andersen (2):
  xattr: handle idmapped mounts
  selftests: add idmapped mounts xattr selftest

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/Kconfig                                    |   6 +
 fs/attr.c                                     | 142 ++-
 fs/coredump.c                                 |  12 +-
 fs/exec.c                                     |  12 +-
 fs/exportfs/expfs.c                           |   4 +-
 fs/ext4/acl.c                                 |  11 +-
 fs/ext4/acl.h                                 |   3 +
 fs/ext4/ext4.h                                |  14 +-
 fs/ext4/file.c                                |   4 +
 fs/ext4/ialloc.c                              |   7 +-
 fs/ext4/inode.c                               |  27 +-
 fs/ext4/ioctl.c                               |  18 +-
 fs/ext4/namei.c                               | 145 ++-
 fs/ext4/super.c                               |   4 +
 fs/ext4/symlink.c                             |   9 +
 fs/ext4/xattr_hurd.c                          |  22 +-
 fs/ext4/xattr_security.c                      |  18 +-
 fs/ext4/xattr_trusted.c                       |  18 +-
 fs/fat/fat.h                                  |   2 +
 fs/fat/file.c                                 |  27 +-
 fs/fat/namei_msdos.c                          |   7 +
 fs/fat/namei_vfat.c                           |   7 +
 fs/inode.c                                    |  66 +-
 fs/internal.h                                 |   9 +
 fs/namei.c                                    | 597 ++++++++----
 fs/namespace.c                                | 446 ++++++++-
 fs/open.c                                     |  52 +-
 fs/overlayfs/copy_up.c                        | 104 +-
 fs/overlayfs/dir.c                            | 219 +++--
 fs/overlayfs/export.c                         |   3 +-
 fs/overlayfs/file.c                           |  23 +-
 fs/overlayfs/inode.c                          | 121 ++-
 fs/overlayfs/namei.c                          |  64 +-
 fs/overlayfs/overlayfs.h                      | 158 +++-
 fs/overlayfs/ovl_entry.h                      |   1 +
 fs/overlayfs/readdir.c                        |  34 +-
 fs/overlayfs/super.c                          | 109 ++-
 fs/overlayfs/util.c                           |  38 +-
 fs/posix_acl.c                                | 130 ++-
 fs/stat.c                                     |  18 +-
 fs/utimes.c                                   |   4 +-
 fs/xattr.c                                    | 264 ++++--
 include/linux/audit.h                         |  10 +-
 include/linux/capability.h                    |  12 +-
 include/linux/fs.h                            | 254 ++++-
 include/linux/ima.h                           |  15 +-
 include/linux/lsm_hook_defs.h                 |  10 +-
 include/linux/lsm_hooks.h                     |   1 +
 include/linux/mount.h                         |  20 +-
 include/linux/namei.h                         |   6 +
 include/linux/posix_acl.h                     |  14 +-
 include/linux/posix_acl_xattr.h               |  12 +-
 include/linux/security.h                      |  36 +-
 include/linux/syscalls.h                      |   3 +
 include/linux/xattr.h                         |  29 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/mount.h                    |  26 +
 ipc/mqueue.c                                  |   8 +-
 kernel/auditsc.c                              |  29 +-
 kernel/capability.c                           |  22 +-
 net/unix/af_unix.c                            |   2 +-
 security/apparmor/domain.c                    |   9 +-
 security/apparmor/file.c                      |   5 +-
 security/apparmor/lsm.c                       |  12 +-
 security/commoncap.c                          |  50 +-
 security/integrity/ima/ima.h                  |  19 +-
 security/integrity/ima/ima_api.c              |  10 +-
 security/integrity/ima/ima_appraise.c         |  14 +-
 security/integrity/ima/ima_asymmetric_keys.c  |   2 +-
 security/integrity/ima/ima_main.c             |  28 +-
 security/integrity/ima/ima_policy.c           |  17 +-
 security/integrity/ima/ima_queue_keys.c       |   2 +-
 security/security.c                           |  18 +-
 security/selinux/hooks.c                      |  13 +-
 security/smack/smack_lsm.c                    |  11 +-
 tools/include/uapi/asm-generic/unistd.h       |   4 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/idmap_mounts/.gitignore |   1 +
 tools/testing/selftests/idmap_mounts/Makefile |   8 +
 tools/testing/selftests/idmap_mounts/config   |   1 +
 tools/testing/selftests/idmap_mounts/xattr.c  | 389 ++++++++
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 888 ++++++++++++++++++
 102 files changed, 4109 insertions(+), 912 deletions(-)
 create mode 100644 tools/testing/selftests/idmap_mounts/.gitignore
 create mode 100644 tools/testing/selftests/idmap_mounts/Makefile
 create mode 100644 tools/testing/selftests/idmap_mounts/config
 create mode 100644 tools/testing/selftests/idmap_mounts/xattr.c
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: 3650b228f83adda7e5ee532e2b90429c03f7b9ec

Comments

Dave Chinner Oct. 29, 2020, 2:27 a.m. UTC | #1
On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has it's origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this sligthly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor refering to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
> 
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they care created from.
> 
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes.

That's really quite ... messy.

Maybe I'm missing something, but if you have the user_ns to be used
for the VFS operation we are about to execute then why can't we use
the same model as current_fsuid/current_fsgid() for passing the
filesystem credentials down to the filesystem operations?  i.e.
attach it to the current->cred->fs_userns, and then the filesystem
code that actually needs to know the current userns can call
current_fs_user_ns() instead of current_user_ns().  i.e.

#define current_fs_user_ns()	\
	(current->cred->fs_userns ? current->cred->fs_userns \
				  : current->cred->userns)

At this point, the filesystem will now always have the correct
userns it is supposed to use for mapping the uid/gid, right?

Also, if we are passing work off to worker threads, duplicating
the current creds will capture this information and won't leave
random landmines where stuff doesn't work as it should because the
worker thread is unaware of the userns that it is supposed to be
doing filesytsem operations under...

Cheers,

Dave.
Sargun Dhillon Oct. 29, 2020, 7:20 a.m. UTC | #2
On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has it's origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this sligthly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor refering to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
> 
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they care created from.
> 
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes. Some of them
> such as ->getattr() don't even need to change since they pass down a
> struct path and thus the struct vfsmount is already available. Other
> inode operations need to be adapted to pass down the user namespace the
> vfsmount has been marked with. Al was nice enough to point out that he
> will not tolerate struct vfsmount being passed to filesystems and that I
> should pass down the user namespace directly; which is what I did.
> The inode struct itself is never altered whenever the i_uid and i_gid
> need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
> the check. An inode once initialized (during lookup or object creation)
> is never altered when accessed through an idmapped mount.
> 
> To limit the amount of noise in this first iteration we have not changed
> the existing inode operations but rather introduced a few new struct
> inode operation methods such as ->mkdir_mapped which pass down the user
> namespace of the mount they have been called from. Should this solution
> be worth pursuing we have no problem adapting the existing inode
> operations instead.
> 
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
> iteration I tried to illustrate this by changing three different
> filesystem with different levels of complexity. Of course with some bias
> towards urgent use-cases and filesystems I was at least a little more
> familiar with. However, Tycho and I (and others) have no problem
> converting each filesystem one-by-one. This first iteration includes fat
> (msdos and vfat), ext4, and overlayfs (both with idmapped lower and
> upper directories and idmapped merged directories). I'm sure I haven't
> gotten everything right for all three of them in the first version of
> this patch.
> 

Thanks for this patchset. It's been a long-time coming.

I'm curious as to for the most cases, how much the new fs mount APIs help, and 
if focusing on those could solve the problem for everything other than bind 
mounts? Specifically, the idea of doing fsopen (creation of fs_context) under 
the user namespace of question, and relying on a user with CAP_SYS_ADMIN to call 
fsmount[1]. I think this is actually especially valuable for places like 
overlayfs that use the entire cred object, as opposed to just the uid / gid. I 
imagine that soon, most filesystems will support the new mount APIs, and not set 
the global flag if they don't need to.

How popular is the "vfsmount (bind mounts) needs different uid mappings" use 
case?

The other thing I worry about is the "What UID are you really?" game that's been 
a thing recently. For example, you can have a different user namespace UID 
mapping for your network namespace that netfilter checks[2], and a different one 
for your mount namespace, and a different one that the process is actually in.
This proliferation of different mappings makes auditing, and doing things like
writing perf toolings more difficult (since I think bpf_get_current_uid_gid
use the initial user namespace still [3]).

[1]: https://lore.kernel.org/linux-nfs/20201016123745.9510-4-sargun@sargun.me/T/#u
[2]: https://elixir.bootlin.com/linux/v5.9.1/source/net/netfilter/xt_owner.c#L37
[3]: https://elixir.bootlin.com/linux/v5.9.1/source/kernel/bpf/helpers.c#L196
Eric W. Biederman Oct. 29, 2020, 3:47 p.m. UTC | #3
Christian Brauner <christian.brauner@ubuntu.com> writes:

> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.

Can you walk us through the motivating use case?

As of this year's LPC I had the distinct impression that the primary use
case for such a feature was due to the RLIMIT_NPROC problem where two
containers with the same users still wanted different uid mappings to
the disk because the users were conflicting with each other because of
the per user rlimits.

Fixing rlimits is straight forward to implement, and easier to manage
for implementations and administrators.



Reading up on systemd-homed it appears to be a way to have encrypted
home directories.  Those home directories can either be encrypted at the
fs or at the block level.  Those home directories appear to have the
goal of being luggable between systems.  If the systems in question
don't have common administration of uids and gids after lugging your
encrypted home directory to another system chowning the files is
required.

Is that the use case you are looking at removing the need for
systemd-homed to avoid chowning after lugging encrypted home directories
from one system to another?  Why would it be desirable to avoid the
chown?


If the goal is to solve fragmented administration of uid assignment I
suggest that it might be better to solve the administration problem so
that all of the uids of interest get assigned the same way on all of the
systems of interest.  To the extent it is possible to solve the uid
assignment problem that would seem to have fewer long term
administrative problems.

Eric
Aleksa Sarai Oct. 29, 2020, 3:51 p.m. UTC | #4
On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Hey everyone,
> >
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> >
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> 
> Can you walk us through the motivating use case?
> 
> As of this year's LPC I had the distinct impression that the primary use
> case for such a feature was due to the RLIMIT_NPROC problem where two
> containers with the same users still wanted different uid mappings to
> the disk because the users were conflicting with each other because of
> the per user rlimits.
> 
> Fixing rlimits is straight forward to implement, and easier to manage
> for implementations and administrators.

This is separate to the question of "isolated user namespaces" and
managing different mappings between containers. This patchset is solving
the same problem that shiftfs solved -- sharing a single directory tree
between containers that have different ID mappings. rlimits (nor any of
the other proposals we discussed at LPC) will help with this problem.
Lennart Poettering Oct. 29, 2020, 4:05 p.m. UTC | #5
On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:

> Is that the use case you are looking at removing the need for
> systemd-homed to avoid chowning after lugging encrypted home directories
> from one system to another?  Why would it be desirable to avoid the
> chown?

Yes, I am very interested in seeing Christian's work succeed, for the
usecase in systemd-homed. In systemd-homed each user gets their own
private file system, and these fs shall be owned by the user's local
UID, regardless in which system it is used. The UID should be an
artifact of the local, individual system in this model, and thus
the UID on of the same user/home on system A might be picked as 1010
and on another as 1543, and on a third as 1323, and it shouldn't
matter. This way, home directories become migratable without having to
universially sync UID assignments: it doesn't matter anymore what the
local UID is.

Right now we do a recursive chown() at login time to ensure the home
dir is properly owned. This has two disadvantages:

1. It's slow. In particular on large home dirs, it takes a while to go
   through the whole user's homedir tree and chown/adjust ACLs for
   everything.

2. Because it is so slow we take a shortcut right now: if the
   top-level home dir inode itself is owned by the correct user, we
   skip the recursive chowning. This means in the typical case where a
   user uses the same system most of the time, and thus the UID is
   stable we can avoid the slowness. But this comes at a drawback: if
   the user for some reason ends up with files in their homedir owned
   by an unrelated user, then we'll never notice or readjust.

> If the goal is to solve fragmented administration of uid assignment I
> suggest that it might be better to solve the administration problem so
> that all of the uids of interest get assigned the same way on all of the
> systems of interest.

Well, the goal is to make things simple and be able to use the home
dir everywhere without any prior preparation, without central UID
assignment authority.

The goal is to have a scheme that requires no administration, by
making the UID management problem go away. Hence, if you suggest
solving this by having a central administrative authority: this is
exactly what the model wants to get away from.

Or to say this differently: just because I personally use three
different computers, I certainly don't want to set up LDAP or sync
UIDs manually.

Lennart

--
Lennart Poettering, Berlin
Christian Brauner Oct. 29, 2020, 4:19 p.m. UTC | #6
On Thu, Oct 29, 2020 at 01:27:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> > Hey everyone,
> > 
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> > 
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> > 
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> > 
> > The solution proposed here has it's origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this sligthly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> > 
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> > 
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace. The
> > user namespace the mount will be marked with can be specified by passing
> > a file descriptor refering to the user namespace as an argument to the
> > new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> > By default vfsmounts are marked with the initial user namespace and no
> > behavioral or performance changes should be observed. All mapping
> > operations are nops for the initial user namespace.
> > 
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they care created from.
> > 
> > This means the user namespace of the mount needs to be passed down into
> > a few relevant inode_operations. This mostly includes inode operations
> > that create filesystem objects or change file attributes.
> 
> That's really quite ... messy.

I don't agree. It's changes all across the vfs but it's not hacky in any
way since it cleanly passes down an additional argument (I'm biased of
course.). 

> 
> Maybe I'm missing something, but if you have the user_ns to be used
> for the VFS operation we are about to execute then why can't we use
> the same model as current_fsuid/current_fsgid() for passing the
> filesystem credentials down to the filesystem operations?  i.e.
> attach it to the current->cred->fs_userns, and then the filesystem
> code that actually needs to know the current userns can call
> current_fs_user_ns() instead of current_user_ns().  i.e.
> 
> #define current_fs_user_ns()	\
> 	(current->cred->fs_userns ? current->cred->fs_userns \
> 				  : current->cred->userns)
> 
> At this point, the filesystem will now always have the correct
> userns it is supposed to use for mapping the uid/gid, right?

Thanks for this interesting idea! I have some troubles with it though.

This approach (always) seemed conceptually wrong to me. Like Tycho said
somewhere else this basically would act like a global variable which
isn't great.

There's also a substantial difference between in that the current fsuid
and fsgid are an actual property of the callers creds so to have them in
there makes perfect sense. But the user namespace of the vfsmount is a
property of the mount and as such glueing it to the callers creds when
calling into the vfs is just weird and I would very much like to avoid
this. If inode's wouldn't have an i_sb member we wouldn't suddenly start
to pass down the s_user_ns via the callers creds to the filesystems.

I'm also not fond of having to call prepare_creds() and override_creds()
all across the vfs. It's messy and prepare_creds() is especially
problematic during RCU pathwalk where we can't call it. We could
in path_init() at the start of every every lookup operation call
prepare_creds() and then override them when we need to switch the
fs_userns global variable and then put_creds() at the end of every path
walk in terminate_walk(). But this means penalizing every lookup
operations with an additional prepare_creds() which needs to be called
at least once, I think. Then during lookup we would need to
override/change this new global fs_userns variable potentially at each
mountpoint crossing to switch back to the correct fs_userns for idmapped
and non-idmapped mounts. We'd also need to rearrange a bunch of
terminate_walk() calls and we would like end up with a comparable amount
of changes only that they would indeed be more messy since we're
strapping fs_userns to the caller's creds.

I an alternative might be to have a combined approach where you pass the
user namespace around in the vfs but when calling into the filesystem
use the fs_userns global variable approach but I would very much prefer
to avoid this and instead cleanly pass down the user namespace
correctly. That's more work, it'll take longer but I and others are
around to see these changes through.

> 
> Also, if we are passing work off to worker threads, duplicating
> the current creds will capture this information and won't leave
> random landmines where stuff doesn't work as it should because the
> worker thread is unaware of the userns that it is supposed to be
> doing filesytsem operations under...

That seems like a problem that can be handled by simply keeping the
userns around similar to how we need to already keep creds around.

Christian
Serge E. Hallyn Oct. 29, 2020, 4:23 p.m. UTC | #7
On Thu, Oct 29, 2020 at 10:12:31AM -0600, Tycho Andersen wrote:
> Hi Eric,
> 
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > 
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > 
> > Can you walk us through the motivating use case?
> > 
> > As of this year's LPC I had the distinct impression that the primary use
> > case for such a feature was due to the RLIMIT_NPROC problem where two
> > containers with the same users still wanted different uid mappings to
> > the disk because the users were conflicting with each other because of
> > the per user rlimits.
> > 
> > Fixing rlimits is straight forward to implement, and easier to manage
> > for implementations and administrators.
> 
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.
> 
> > Reading up on systemd-homed it appears to be a way to have encrypted
> > home directories.  Those home directories can either be encrypted at the
> > fs or at the block level.  Those home directories appear to have the
> > goal of being luggable between systems.  If the systems in question
> > don't have common administration of uids and gids after lugging your
> > encrypted home directory to another system chowning the files is
> > required.
> > 
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another?  Why would it be desirable to avoid the
> > chown?
> 
> Not just systemd-homed, but LXD has to do this, as does our
> application at Cisco, and presumably others.
> 
> Several reasons:
> 
> * the chown is slow
> * the chown requires somewhere to write the delta in metadata (e.g. an
>   overlay workdir, or an LV or something), and there are N copies of
>   this delta, one for each container.
> * it means we need to have a +w filesystem at some point during
>   execution.
> * it's ugly :). Conceptually, the kernel solves the uid shifting
>   problem for us for most other kernel subsystems (including in a
>   limited way fscaps) by configuring a user namespace. It feels like
>   we should be able to do the same with the VFS.

And chown prevents the same inode from being shared by different
containers through different id mappings.  You can overlay, but then
they can't actually share updates.

-serge
Sargun Dhillon Oct. 29, 2020, 4:36 p.m. UTC | #8
On Thu, Oct 29, 2020 at 05:05:02PM +0100, Lennart Poettering wrote:
> On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
> 
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another?  Why would it be desirable to avoid the
> > chown?
> 
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID on of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universially sync UID assignments: it doesn't matter anymore what the
> local UID is.
> 
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
> 
> 1. It's slow. In particular on large home dirs, it takes a while to go
>    through the whole user's homedir tree and chown/adjust ACLs for
>    everything.
> 
> 2. Because it is so slow we take a shortcut right now: if the
>    top-level home dir inode itself is owned by the correct user, we
>    skip the recursive chowning. This means in the typical case where a
>    user uses the same system most of the time, and thus the UID is
>    stable we can avoid the slowness. But this comes at a drawback: if
>    the user for some reason ends up with files in their homedir owned
>    by an unrelated user, then we'll never notice or readjust.
> 
> > If the goal is to solve fragmented administration of uid assignment I
> > suggest that it might be better to solve the administration problem so
> > that all of the uids of interest get assigned the same way on all of the
> > systems of interest.
> 
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
> 
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.
> 
> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.
> 
> Lennart
> 
> --
> Lennart Poettering, Berlin

Can you help me understand systemd-homed a little bit?

In the man page it says:

systemd-homed is a system service that may be used to create, remove, change or 
inspect home areas (directories and network mounts and real or loopback block 
devices with a filesystem, optionally encrypted).

It seems that the "underlay?" (If you'll call it that, maybe there is a better 
term) can either be a standalone block device (this sounds close to systemd 
machined?), a btrfs subvolume (which receives its own superblock (IIRC?, I might 
be wrong. It's been a while since I've used btrfs), or just be a directory 
that's mapped?

What decides whether it's just a directory and bind-mounted (or a similar 
vfsmount), or an actual superblock?

How is the mapping of "real UIDs" to "namespace UIDs" works when it's just a 
bind mount? From the perspective of multiple user namespaces, are all 
"underlying" UIDs mapped through, or if I try to look at another user's
home directory will they not show up?

Is there a reason you can't / don't / wont use overlayfs instead of bind mounts?
Eric W. Biederman Oct. 29, 2020, 4:37 p.m. UTC | #9
Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> This is separate to the question of "isolated user namespaces" and
> managing different mappings between containers. This patchset is solving
> the same problem that shiftfs solved -- sharing a single directory tree
> between containers that have different ID mappings. rlimits (nor any of
> the other proposals we discussed at LPC) will help with this problem.

First and foremost: A uid shift on write to a filesystem is a security
bug waiting to happen.  This is especially in the context of facilities
like iouring, that play very agressive games with how process context
makes it to  system calls.

The only reason containers were not immediately exploitable when iouring
was introduced is because the mechanisms are built so that even if
something escapes containment the security properties still apply.
Changes to the uid when writing to the filesystem does not have that
property.  The tiniest slip in containment will be a security issue.

This is not even the least bit theoretical.  I have seem reports of how
shitfs+overlayfs created a situation where anyone could read
/etc/shadow.

If you are going to write using the same uid to disk from different
containers the question becomes why can't those containers configure
those users to use the same kuid?

What fixing rlimits does is it fixes one of the reasons that different
containers could not share the same kuid for users that want to write to
disk with the same uid.


I humbly suggest that it will be more secure, and easier to maintain for
both developers and users if we fix the reasons people want different
containers to have the same user running with different kuids.

If not what are the reasons we fundamentally need the same on-disk user
using multiple kuids in the kernel?

Eric
Eric W. Biederman Oct. 29, 2020, 4:44 p.m. UTC | #10
Tycho Andersen <tycho@tycho.pizza> writes:

> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.

Why do the you have disjoint ID mappings for the users that are writing
to disk with the same ID?

>> Reading up on systemd-homed it appears to be a way to have encrypted
>> home directories.  Those home directories can either be encrypted at the
>> fs or at the block level.  Those home directories appear to have the
>> goal of being luggable between systems.  If the systems in question
>> don't have common administration of uids and gids after lugging your
>> encrypted home directory to another system chowning the files is
>> required.
>> 
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Not just systemd-homed, but LXD has to do this,

I asked why the same disk users are assigned different kuids and the
only reason I have heard that LXD does this is the RLIMIT_NPROC problem.

Perhaps there is another reason.

In part this is why I am eager to hear peoples use case, and why I was
trying very hard to make certain we get the requirements.

I want the real requirements though and some thought, not just we did
this and it hurts.  Changning the uids on write is a very hard problem,
and not just in implementating it but also in maintaining and
understanding what is going on.

Eric
Eric W. Biederman Oct. 29, 2020, 4:54 p.m. UTC | #11
Lennart Poettering <lennart@poettering.net> writes:

> On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
>
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID on of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universially sync UID assignments: it doesn't matter anymore what the
> local UID is.
>
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
>
> 1. It's slow. In particular on large home dirs, it takes a while to go
>    through the whole user's homedir tree and chown/adjust ACLs for
>    everything.
>
> 2. Because it is so slow we take a shortcut right now: if the
>    top-level home dir inode itself is owned by the correct user, we
>    skip the recursive chowning. This means in the typical case where a
>    user uses the same system most of the time, and thus the UID is
>    stable we can avoid the slowness. But this comes at a drawback: if
>    the user for some reason ends up with files in their homedir owned
>    by an unrelated user, then we'll never notice or readjust.


The classic solution to this problem for removable media are
uid=XXX and gid=XXX mount options.

I suspect a similar solution can apply here.

I don't think you need a solution that requires different kuids
to be able to write to the same filesystem uid.

>> If the goal is to solve fragmented administration of uid assignment I
>> suggest that it might be better to solve the administration problem so
>> that all of the uids of interest get assigned the same way on all of the
>> systems of interest.
>
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
>
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.

For a files that can be accessed by more than a single user this is
fundamentally necessary.  Otherwise group permissions and acls can not
work.  They wind up as meaningless garbage, because without some kind of
synchronization those other users and groups simply can not be
represented.

> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.

If they are single users systems why should you need to?

But if permissions on files are going to be at all meaningful it is
a fundamentally a requirement that there be no confusion about which
party the other parties are talking about.  To the best of my knowledge
syncing uids/usernames between machines is as simple as it can get.

Eric
Stéphane Graber Oct. 29, 2020, 6:04 p.m. UTC | #12
On Thu, Oct 29, 2020 at 12:45 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Tycho Andersen <tycho@tycho.pizza> writes:
>
> > Hi Eric,
> >
> > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > Our use case is to have the same directory exposed to several
> > different containers which each have disjoint ID mappings.
>
> Why do the you have disjoint ID mappings for the users that are writing
> to disk with the same ID?
>
> >> Reading up on systemd-homed it appears to be a way to have encrypted
> >> home directories.  Those home directories can either be encrypted at the
> >> fs or at the block level.  Those home directories appear to have the
> >> goal of being luggable between systems.  If the systems in question
> >> don't have common administration of uids and gids after lugging your
> >> encrypted home directory to another system chowning the files is
> >> required.
> >>
> >> Is that the use case you are looking at removing the need for
> >> systemd-homed to avoid chowning after lugging encrypted home directories
> >> from one system to another?  Why would it be desirable to avoid the
> >> chown?
> >
> > Not just systemd-homed, but LXD has to do this,
>
> I asked why the same disk users are assigned different kuids and the
> only reason I have heard that LXD does this is the RLIMIT_NPROC problem.
>
> Perhaps there is another reason.
>
> In part this is why I am eager to hear peoples use case, and why I was
> trying very hard to make certain we get the requirements.
>
> I want the real requirements though and some thought, not just we did
> this and it hurts.  Changning the uids on write is a very hard problem,
> and not just in implementating it but also in maintaining and
> understanding what is going on.

The most common cases where shiftfs is used or where folks would like
to use it today are (by importance):
 - Fast container creation (by not having to uid/gid shift all files
in the downloaded image)
 - Sharing data between the host system and a container (some paths
under /home being the most common)
 - Sharing data between unprivileged containers with a disjointed map
 - Sharing data between multiple containers, some privileged, some unprivileged

Fixing the ulimit issue only takes care of one of those (3rd item), it
does not solve any of the other cases.

The first item on there alone can be quite significant. Creation and
startup of a regular Debian container on my system takes around 500ms
when shiftfs is used (btrfs/lvm/zfs copy-on-write clone of the image,
setup shiftfs, start container) compared to 2-3s when running without
it (same clone, followed by rewrite of all uid/gid present on the fs,
including acls and capabilities, then start container). And that's on
a fast system with an NVME SSD and a small rootfs. We have had reports
of a few users running on slow spinning rust with large containers
where shifting can take several minutes.

The second item can technically be worked around without shifted
bind-mounts by doing userns map hole punching, mapping the user's
uid/gid from the host straight into the container. The downside to
this is that another shifting pass becomes needed for any file outside
of the bind-mounted path (or it would become owned by -1/-1) and it's
very much not dynamic, requiring the container be stopped, config
updated by the user, /etc/subuid and subgid maps being updated and
container started back up. If you need another user/group be exposed,
start all over again...
This is far more complex, slow and disruptive than the shifted
approach where we just need to do:
   lxc config device add MY-CONTAINER home disk source=/home
path=/home shift=true
To inject a new mount of /home from the host into the container with a
shifting layer in place, no need to reconfig subuid/subgid, no need to
re-create the userns to update the mapping and no need to go through
the container's rootfs for any file which may now need remapping
because of the map change.

Stéphane

> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Andy Lutomirski Oct. 29, 2020, 9:58 p.m. UTC | #13
> On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> 
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has it's origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this sligthly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace.

So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.

For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.

Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.

Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.

So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.

Hmm.  What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns?  Is that too limiting?

I hope that whatever solution gets used is straightforward enough to wrap one’s head around.

> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they care created from.

By “mapped according to”, I presume you mean that the on-disk uid/gid is the gid as seen in the user namespace in question.
Serge E. Hallyn Oct. 30, 2020, 2:18 a.m. UTC | #14
On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> Aleksa Sarai <cyphar@cyphar.com> writes:
> 
> > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> 
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >> 
> >> Can you walk us through the motivating use case?
> >> 
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >> 
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. rlimits (nor any of
> > the other proposals we discussed at LPC) will help with this problem.
> 
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially in the context of facilities
> like iouring, that play very agressive games with how process context
> makes it to  system calls.
> 
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property.  The tiniest slip in containment will be a security issue.
> 
> This is not even the least bit theoretical.  I have seem reports of how
> shitfs+overlayfs created a situation where anyone could read
> /etc/shadow.
> 
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?

Because if user 'myapp' in two otherwise isolated containers both have
the same kuid, so that they can write to a shared directory, then root
in container 1 has privilege over all files owned by 'myapp' in
container 2.

Whereas if they can each have distinct kuids, but when writing to the
shared fs have a shared uid not otherwise belonging to either container,
their rootfs's can remain completely off limits to each other.

> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
> 
> 
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
> 
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?
> 
> Eric
Christian Brauner Oct. 30, 2020, 12:01 p.m. UTC | #15
On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > 
> > Hey everyone,
> > 
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> > 
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> > 
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> > 
> > The solution proposed here has it's origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this sligthly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> > 
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> > 
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace.
> 
> So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> 
> For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> 
> Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> 
> Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.
> 
> So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.

With this series, in order to create an idmapped mount the user must
either be cap_sys_admin in the superblock of the underlying filesystem
or if the mount is already idmapped and they want to create another
idmapped mount from it they must have cap_sys_admin in the userns that
the mount is currrently marked with. It is also not possible to change
an idmapped mount once it has been idmapped, i.e. the user must create a
new detached bind-mount first.

> 
> Hmm.  What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns?  Is that too limiting?
> 
> I hope that whatever solution gets used is straightforward enough to wrap one’s head around.
> 
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they care created from.
> 
> By “mapped according to”, I presume you mean that the on-disk uid/gid is the gid as seen in the user namespace in question.

If I understand you correctly, then yes.
Seth Forshee Oct. 30, 2020, 3:07 p.m. UTC | #16
On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially in the context of facilities
> like iouring, that play very agressive games with how process context
> makes it to  system calls.
> 
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property.  The tiniest slip in containment will be a security issue.
> 
> This is not even the least bit theoretical.  I have seem reports of how
> shitfs+overlayfs created a situation where anyone could read
> /etc/shadow.

This bug was the result of a complex interaction with several
contributing factors. It's fair to say that one component was overlayfs
writing through an id-shifted mount, but the primary cause was related
to how copy-up was done coupled with allowing unprivileged overlayfs
mounts in a user ns. Checks that the mounter had access to the lower fs
file were not done before copying data up, and so the file was copied up
temporarily to the id shifted upperdir. Even though it was immediately
removed, other factors made it possible for the user to get the file
contents from the upperdir.

Regardless, I do think you raise a good point. We need to be wary of any
place the kernel could open files through a shifted mount, especially
when the open could be influenced by userspace.

Perhaps kernel file opens through shifted mounts should to be opt-in.
I.e. unless a flag is passed, or a different open interface used, the
open will fail if the dentry being opened is subject to id shifting.
This way any kernel writes which would be subject to id shifting will
only happen through code which as been written to take it into account.

Seth
Serge E. Hallyn Oct. 30, 2020, 4:03 p.m. UTC | #17
On Fri, Oct 30, 2020 at 10:07:48AM -0500, Seth Forshee wrote:
> On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> > First and foremost: A uid shift on write to a filesystem is a security
> > bug waiting to happen.  This is especially in the context of facilities
> > like iouring, that play very agressive games with how process context
> > makes it to  system calls.
> > 
> > The only reason containers were not immediately exploitable when iouring
> > was introduced is because the mechanisms are built so that even if
> > something escapes containment the security properties still apply.
> > Changes to the uid when writing to the filesystem does not have that
> > property.  The tiniest slip in containment will be a security issue.
> > 
> > This is not even the least bit theoretical.  I have seem reports of how
> > shitfs+overlayfs created a situation where anyone could read
> > /etc/shadow.
> 
> This bug was the result of a complex interaction with several
> contributing factors. It's fair to say that one component was overlayfs
> writing through an id-shifted mount, but the primary cause was related
> to how copy-up was done coupled with allowing unprivileged overlayfs
> mounts in a user ns. Checks that the mounter had access to the lower fs
> file were not done before copying data up, and so the file was copied up
> temporarily to the id shifted upperdir. Even though it was immediately
> removed, other factors made it possible for the user to get the file
> contents from the upperdir.
> 
> Regardless, I do think you raise a good point. We need to be wary of any
> place the kernel could open files through a shifted mount, especially
> when the open could be influenced by userspace.
> 
> Perhaps kernel file opens through shifted mounts should to be opt-in.
> I.e. unless a flag is passed, or a different open interface used, the
> open will fail if the dentry being opened is subject to id shifting.
> This way any kernel writes which would be subject to id shifting will
> only happen through code which as been written to take it into account.

For my use cases, it would be fine to require opt-in at original fs
mount time by init_user_ns admin.  I.e.
    mount -o allow_idmap /dev/mapper/whoozit /whatzit
I'm quite certain I would always be sharing a separate LV or loopback or
tmpfs.

-serge
Serge E. Hallyn Oct. 30, 2020, 4:17 p.m. UTC | #18
On Fri, Oct 30, 2020 at 01:01:57PM +0100, Christian Brauner wrote:
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> > 
> > 
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > > 
> > > Hey everyone,
> > > 
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > > 
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > > 
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > > 
> > > The solution proposed here has it's origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this sligthly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > > 
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > > 
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> > 
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> > 
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> > 
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> > 
> > Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.
> > 
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.
> 
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currrently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

Yeah I spent quite some time last night trying to figure out the scenario
you were presenting, but I failed.  Andy, could you either rephrase or
give a more concrete end to end attack scenario?
Andy Lutomirski Oct. 31, 2020, 5:43 p.m. UTC | #19
On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has it's origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this sligthly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> >
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.
> >
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.
>
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currrently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

I think my attack might not work, but I also think I didn't explain it
very well.  Let me try again.  I'll also try to lay out what I
understand the rules of idmaps to be so that you can correct me when
I'm inevitable wrong :)

First, background: there are a bunch of user namespaces around.  Every
superblock has one, every idmapped mount has one, and every vfsmnt
also (indirectly) has one: mnt->mnt_ns->user_ns.  So, if you're
looking at a given vfsmnt, you have three user namespaces that are
relevant, in addition to whatever namespaces are active for the task
(or kernel thread) accessing that mount.  I'm wondering whether
mnt_user_ns() should possibly have a name that makes it clear that it
refers to the idmap namespace and not mnt->mnt_ns->user_ns.

So here's the attack.  An attacker with uid=1000 creates a userns N
(so the attacker owns the ns and 1000 outside maps to 0 inside).  N is
a child of init_user_ns.  Now the attacker creates a mount namespace M
inside the userns and, potentially with the help of a container
management tool, creates an idmapped filesystem mount F inside M.  So,
playing fast and loose with my ampersands:

F->mnt_ns == M
F->mnt_ns->user_ns == N
mnt_user_ns(F) == N

I expect that this wouldn't be a particularly uncommon setup.  Now the
user has the ability to create files with inode->uid == 0 and the SUID
bit set on their filesystem.  This isn't terribly different from FUSE,
except that the mount won't have nosuid set, whereas at least many
uses of unprivileged FUSE would have nosuid set.  So the thing that
makes me a little bit nervous.  But it actually seems likely that I
was wrong and this is okay.  Specifically, to exploit this using
kernel mechanisms, one would need to pass a mnt_may_suid() check,
which means that one would need to acquire a mount of F in one's
current mount namespace, and one would need one's current user
namespace to be init_ns (or something else sensitive).  But you
already need to own the namespace to create mounts, unless you have a
way to confuse some existing user tooling.  You would also need to be
in F's superblock's user_ns (second line of mnt_may_suid()), which
totally kills this type of attack if F's superblock is in the
container's user_ns, but I wouldn't count on that.

So maybe this is all fine.  I'll continue to try to poke holes in it,
but perhaps there aren't any holes to poke.  I'll also continue to try
to see if I can state the security properties of idmap in a way that
is clear and obviously has nice properties.

Why are you allowing the creation of a new idmapped mount if you have
cap_sys_admin over an existing idmap userns but not over the
superblock's userns?  I assume this is for a nested container use
case, but can you spell out a specific example usage?

--Andy
Alban Crequy Nov. 3, 2020, 2:10 p.m. UTC | #20
On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Aleksa Sarai <cyphar@cyphar.com> writes:
>
> > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. rlimits (nor any of
> > the other proposals we discussed at LPC) will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially in the context of facilities
> like iouring, that play very agressive games with how process context
> makes it to  system calls.
>
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property.  The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical.  I have seem reports of how
> shitfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs. We cannot
necessarily extract the tarball directly with the right uids because
we might use the same container image for privileged containers (with
the host userns) and unprivileged containers (with a new userns), so
we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs
helps to make the recursive chown faster, it can still take time with
large container images with lots of files. I’d like to use this patch
set to set up the root fs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even
if admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares. This could take
time on large filesystems and realistically, we want to support
rolling updates where some nodes use the previous version without user
namespaces and new nodes are progressively migrated to the new userns
with the new id mapping. If both sets of nodes use the same NFS share,
that can’t work.

Alban