
[RFC,4/4] fs: allow mknod in non-initial userns using cgroup device guard

Message ID 20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de (mailing list archive)
State RFC
Series bpf: cgroup device guard for non-initial user namespace

Checks

Context Check Description
netdev/tree_selection success Not a local patch, async
bpf/vmtest-bpf-next-PR pending PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 pending Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 pending Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 pending Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat

Commit Message

Michael Weiß Aug. 14, 2023, 2:26 p.m. UTC
If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, which we also rely on in GyroidOS, is running
virsh for VMs inside an unprivileged container. virsh creates device
nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which
currently fails in a non-initial userns, even if a cgroup device
white list with the corresponding major, minor of /dev/null exists.
Thus, in this case the usual bind mounts or pre-populated device
nodes under /dev are not sufficient.

To circumvent this limitation, we allow mknod() in fs/namei.c if a
bpf cgroup device guard is enabled for the current task, as reported
by devcgroup_task_is_guarded(). In that case CAP_MKNOD is checked
against the current user namespace with ns_capable() instead of
globally with capable().

To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
flag for cgroup device guarded tasks.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 fs/namei.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
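
For illustration, the kind of call that the virsh use case boils down to can be sketched in user space as follows. This is a minimal sketch, not code from virsh or from this series: the helper names null_dev() and make_null_node() are hypothetical, and the only facts assumed are that the Linux null device is the character device with major 1, minor 3, and that mknod(2) reports failure through errno.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Device number of the Linux null device: char device, major 1, minor 3. */
dev_t null_dev(void)
{
	return makedev(1, 3);
}

/*
 * Hypothetical helper: try to create a /dev/null-compatible node at
 * an arbitrary path. Returns 0 on success or a negative errno value.
 * Without this series, a task in a non-initial user namespace is
 * expected to get -EPERM from the CAP_MKNOD check in vfs_mknod().
 */
int make_null_node(const char *path)
{
	assert(path != NULL);

	if (mknod(path, S_IFCHR | 0666, null_dev()) != 0)
		return -errno;
	return 0;
}
```

A container payload would call make_null_node() with a path such as the libvirt one quoted above and treat -EPERM as "mknod is not permitted in this user namespace".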

Comments

Alexander Mikhalitsyn Aug. 14, 2023, 3:24 p.m. UTC | #1
+CC Stéphane Graber <stgraber@ubuntu.com>


On Mon, Aug 14, 2023 at 4:26 PM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, which we also rely on in GyroidOS, is running
> virsh for VMs inside an unprivileged container. virsh creates device
> nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which
> currently fails in a non-initial userns, even if a cgroup device
> white list with the corresponding major, minor of /dev/null exists.
> Thus, in this case the usual bind mounts or pre-populated device
> nodes under /dev are not sufficient.
>
> To circumvent this limitation, we allow mknod() in fs/namei.c if a
> bpf cgroup device guard is enabled for the current task, as reported
> by devcgroup_task_is_guarded(). In that case CAP_MKNOD is checked
> against the current user namespace with ns_capable() instead of
> globally with capable().
>
> To avoid unusable device nodes on file systems mounted in a
> non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
> flag for cgroup device guarded tasks.
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> ---
>  fs/namei.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e56ff39a79bc..ef4f22b9575c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
>
>  bool may_open_dev(const struct path *path)
>  {
> +       if (devcgroup_task_is_guarded(current))
> +               return !(path->mnt->mnt_flags & MNT_NODEV);
> +
>         return !(path->mnt->mnt_flags & MNT_NODEV) &&
>                 !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
>  }
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>         if (error)
>                 return error;
>
> -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> -           !capable(CAP_MKNOD))
> -               return -EPERM;
> +       /*
> +        * If the task is restricted by a device cgroup, allow mknod in
> +        * its user namespace. Otherwise check the global capability,
> +        * which keeps mknod disabled for user namespaces other than
> +        * the initial one.
> +        */
> +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> +               if (devcgroup_task_is_guarded(current)) {
> +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> +                               return -EPERM;
> +               } else if (!capable(CAP_MKNOD))
> +                       return -EPERM;
> +       }
>
>         if (!dir->i_op->mknod)
>                 return -EPERM;
>
> --
> 2.30.2
>
kernel test robot Aug. 15, 2023, 7:18 a.m. UTC | #2
Hello,

kernel test robot noticed "WARNING:suspicious_RCU_usage" on:

commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard

in testcase: boot

compiler: gcc-12
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com



[   14.468719][  T139]
[   14.468999][  T139] =============================
[   14.469545][  T139] WARNING: suspicious RCU usage
[   14.469968][  T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
[   14.470520][  T139] -----------------------------
[   14.470940][  T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
[   14.471703][  T139]
[   14.471703][  T139] other info that might help us debug this:
[   14.471703][  T139]
[   14.472692][  T139]
[   14.472692][  T139] rcu_scheduler_active = 2, debug_locks = 1
[   14.473469][  T139] no locks held by (journald)/139.
[   14.473935][  T139]
[   14.473935][  T139] stack backtrace:
[   14.474454][  T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
[   14.475296][  T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   14.476298][  T139] Call Trace:
[   14.476608][  T139]  dump_stack_lvl+0x78/0x8c
[   14.477055][  T139]  dump_stack+0x12/0x18
[   14.477420][  T139]  lockdep_rcu_suspicious+0x153/0x1a4
[   14.477928][  T139]  cgroup_bpf_device_guard_enabled+0x14f/0x168
[   14.478476][  T139]  devcgroup_task_is_guarded+0x10/0x20
[   14.478973][  T139]  may_open_dev+0x11/0x44
[   14.479367][  T139]  may_open+0x115/0x13c
[   14.479727][  T139]  do_open+0xa1/0x378
[   14.480113][  T139]  path_openat+0xdc/0x1bc
[   14.480512][  T139]  do_filp_open+0x91/0x124
[   14.480911][  T139]  ? lock_release+0x62/0x118
[   14.481329][  T139]  ? _raw_spin_unlock+0x18/0x34
[   14.481797][  T139]  ? alloc_fd+0x112/0x1c4
[   14.482183][  T139]  do_sys_openat2+0x7a/0xa0
[   14.482592][  T139]  __ia32_sys_openat+0x66/0x9c
[   14.483065][  T139]  do_int80_syscall_32+0x27/0x48
[   14.483502][  T139]  entry_INT80_32+0x10d/0x10d
[   14.483962][  T139] EIP: 0xa7f39092
[   14.484267][  T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[   14.485995][  T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
[   14.486622][  T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
[   14.487225][  T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
Alexander Mikhalitsyn Aug. 15, 2023, 7:49 a.m. UTC | #3
On Tue, Aug 15, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
>
> commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
> url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
> patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
> patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> [   14.468719][  T139]
> [   14.468999][  T139] =============================
> [   14.469545][  T139] WARNING: suspicious RCU usage
> [   14.469968][  T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
> [   14.470520][  T139] -----------------------------
> [   14.470940][  T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!

Most likely this is because, in the cgroup_bpf_device_guard_enabled()
function, the call

struct cgroup *cgrp = task_dfl_cgroup(task);

should be made under rcu_read_lock() (or cgroup_mutex). If we get rid
of cgroup_mutex and make cgroup_bpf_device_guard_enabled() specific to
the "current" task, we will solve this issue too.

> [   14.471703][  T139]
> [   14.471703][  T139] other info that might help us debug this:
> [   14.471703][  T139]
> [   14.472692][  T139]
> [   14.472692][  T139] rcu_scheduler_active = 2, debug_locks = 1
> [   14.473469][  T139] no locks held by (journald)/139.
> [   14.473935][  T139]
> [   14.473935][  T139] stack backtrace:
> [   14.474454][  T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
> [   14.475296][  T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   14.476298][  T139] Call Trace:
> [   14.476608][  T139]  dump_stack_lvl+0x78/0x8c
> [   14.477055][  T139]  dump_stack+0x12/0x18
> [   14.477420][  T139]  lockdep_rcu_suspicious+0x153/0x1a4
> [   14.477928][  T139]  cgroup_bpf_device_guard_enabled+0x14f/0x168
> [   14.478476][  T139]  devcgroup_task_is_guarded+0x10/0x20
> [   14.478973][  T139]  may_open_dev+0x11/0x44
> [   14.479367][  T139]  may_open+0x115/0x13c
> [   14.479727][  T139]  do_open+0xa1/0x378
> [   14.480113][  T139]  path_openat+0xdc/0x1bc
> [   14.480512][  T139]  do_filp_open+0x91/0x124
> [   14.480911][  T139]  ? lock_release+0x62/0x118
> [   14.481329][  T139]  ? _raw_spin_unlock+0x18/0x34
> [   14.481797][  T139]  ? alloc_fd+0x112/0x1c4
> [   14.482183][  T139]  do_sys_openat2+0x7a/0xa0
> [   14.482592][  T139]  __ia32_sys_openat+0x66/0x9c
> [   14.483065][  T139]  do_int80_syscall_32+0x27/0x48
> [   14.483502][  T139]  entry_INT80_32+0x10d/0x10d
> [   14.483962][  T139] EIP: 0xa7f39092
> [   14.484267][  T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [   14.485995][  T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
> [   14.486622][  T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
> [   14.487225][  T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
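
The locking point raised in comment #3 can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the fix that was applied to the series: rcu_read_lock(), rcu_read_unlock() and task_dfl_cgroup() are real kernel names, but here they are stand-in stubs that merely count nesting depth, and the struct layouts, the device_guard field and the two guard_enabled_*() helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-ins for kernel primitives; only the names are real. */
static int rcu_depth;    /* nesting depth of RCU read-side sections */
static int rcu_warnings; /* counts what lockdep would have reported */

static void rcu_read_lock(void)
{
	rcu_depth++;
}

static void rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

struct cgroup { bool device_guard; };     /* hypothetical flag */
struct task { struct cgroup *dfl_cgrp; }; /* simplified task_struct */

/* Model of task_dfl_cgroup(): warns when called with no read lock held. */
static struct cgroup *task_dfl_cgroup(const struct task *task)
{
	if (rcu_depth == 0)
		rcu_warnings++;
	return task->dfl_cgrp;
}

/* Shape of the reported problem: dereference without any lock. */
static bool guard_enabled_unlocked(const struct task *task)
{
	return task_dfl_cgroup(task)->device_guard;
}

/* Suggested shape: read the flag inside an RCU read-side section. */
static bool guard_enabled_rcu(const struct task *task)
{
	bool guarded;

	rcu_read_lock();
	guarded = task_dfl_cgroup(task)->device_guard;
	rcu_read_unlock();
	return guarded;
}
```

In this model only guard_enabled_unlocked() increments rcu_warnings, which corresponds to the "suspicious rcu_dereference_check() usage" splat in the report. The flag is copied into a local before the unlock so that nothing protected by RCU is touched afterwards.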

Patch

diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@  EXPORT_SYMBOL(vfs_mkobj);
 
 bool may_open_dev(const struct path *path)
 {
+	if (devcgroup_task_is_guarded(current))
+		return !(path->mnt->mnt_flags & MNT_NODEV);
+
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
@@ -3976,9 +3979,19 @@  int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
-		return -EPERM;
+	/*
+	 * If the task is restricted by a device cgroup, allow mknod in
+	 * its user namespace. Otherwise check the global capability,
+	 * which keeps mknod disabled for user namespaces other than
+	 * the initial one.
+	 */
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+		if (devcgroup_task_is_guarded(current)) {
+			if (!ns_capable(current_user_ns(), CAP_MKNOD))
+				return -EPERM;
+		} else if (!capable(CAP_MKNOD))
+			return -EPERM;
+	}
 
 	if (!dir->i_op->mknod)
 		return -EPERM;
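
The permission logic of the two hunks above can be restated as a pure user-space model, which makes the resulting truth table easy to check. This is a sketch for illustration only: struct mknod_ctx and both function names are hypothetical, and each boolean stands for the result of the kernel expression named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inputs to the capability check in vfs_mknod(), reduced to booleans. */
struct mknod_ctx {
	bool is_dev;      /* S_ISCHR(mode) || S_ISBLK(mode) */
	bool is_whiteout; /* whiteout device, exempt from the check */
	bool guarded;     /* devcgroup_task_is_guarded(current) */
	bool cap_userns;  /* ns_capable(current_user_ns(), CAP_MKNOD) */
	bool cap_init;    /* capable(CAP_MKNOD) */
};

/* Returns true where the patched vfs_mknod() would not return -EPERM. */
static bool mknod_cap_ok(const struct mknod_ctx *c)
{
	assert(c != NULL);

	if (!c->is_dev || c->is_whiteout)
		return true;
	/* Guarded tasks are checked against their own user namespace. */
	return c->guarded ? c->cap_userns : c->cap_init;
}

/* Model of the patched may_open_dev(): SB_I_NODEV is skipped if guarded. */
static bool may_open_dev_model(bool guarded, bool mnt_nodev, bool sb_i_nodev)
{
	if (mnt_nodev)
		return false;
	return guarded || !sb_i_nodev;
}
```

The model shows that MNT_NODEV still blocks opening a device node for guarded tasks, while SB_I_NODEV on a superblock mounted in a non-initial user namespace no longer does.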