Message ID | 20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de (mailing list archive) |
---|---|
State | RFC |
Series | bpf: cgroup device guard for non-initial user namespace |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch, async |
bpf/vmtest-bpf-next-PR | pending | PR summary |
bpf/vmtest-bpf-next-VM_Test-1 | success | Logs for ShellCheck |
bpf/vmtest-bpf-next-VM_Test-2 | success | Logs for build for aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-4 | success | Logs for build for x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-5 | success | Logs for build for x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-6 | success | Logs for set-matrix |
bpf/vmtest-bpf-next-VM_Test-3 | success | Logs for build for s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-7 | success | Logs for test_maps on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-8 | pending | Logs for test_maps on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-9 | success | Logs for test_maps on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-10 | success | Logs for test_maps on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-11 | success | Logs for test_progs on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-12 | pending | Logs for test_progs on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-13 | success | Logs for test_progs on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-14 | success | Logs for test_progs on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-15 | success | Logs for test_progs_no_alu32 on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-16 | pending | Logs for test_progs_no_alu32 on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-17 | success | Logs for test_progs_no_alu32 on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-18 | success | Logs for test_progs_no_alu32 on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-19 | success | Logs for test_progs_no_alu32_parallel on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-20 | success | Logs for test_progs_no_alu32_parallel on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-21 | success | Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-22 | success | Logs for test_progs_parallel on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-23 | success | Logs for test_progs_parallel on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-24 | success | Logs for test_progs_parallel on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-25 | success | Logs for test_verifier on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-26 | success | Logs for test_verifier on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-27 | success | Logs for test_verifier on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-28 | success | Logs for test_verifier on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-29 | success | Logs for veristat |
+CC Stéphane Graber <stgraber@ubuntu.com>

On Mon, Aug 14, 2023 at 4:26 PM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, we allow mknod() in fs/namei.c if a
> bpf cgroup device guard is enabled for the current task using
> devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
> namespace by ns_capable() instead of the global CAP_MKNOD.
>
> To avoid unusable device nodes on file systems mounted in
> non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
> for cgroup device guarded tasks.
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> ---
>  fs/namei.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e56ff39a79bc..ef4f22b9575c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
>
>  bool may_open_dev(const struct path *path)
>  {
> +	if (devcgroup_task_is_guarded(current))
> +		return !(path->mnt->mnt_flags & MNT_NODEV);
> +
>  	return !(path->mnt->mnt_flags & MNT_NODEV) &&
>  		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
>  }
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> -	    !capable(CAP_MKNOD))
> -		return -EPERM;
> +	/*
> +	 * In case of a device cgroup restriction allow mknod in user
> +	 * namespace. Otherwise just check global capability; thus,
> +	 * mknod is also disabled for user namespace other than the
> +	 * initial one.
> +	 */
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> +		if (devcgroup_task_is_guarded(current)) {
> +			if (!ns_capable(current_user_ns(), CAP_MKNOD))
> +				return -EPERM;
> +		} else if (!capable(CAP_MKNOD))
> +			return -EPERM;
> +	}
>
>  	if (!dir->i_op->mknod)
>  		return -EPERM;
>
> --
> 2.30.2
>
Hello,

kernel test robot noticed "WARNING:suspicious_RCU_usage" on:

commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard

in testcase: boot

compiler: gcc-12
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com

[ 14.468719][ T139]
[ 14.468999][ T139] =============================
[ 14.469545][ T139] WARNING: suspicious RCU usage
[ 14.469968][ T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
[ 14.470520][ T139] -----------------------------
[ 14.470940][ T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
[ 14.471703][ T139]
[ 14.471703][ T139] other info that might help us debug this:
[ 14.471703][ T139]
[ 14.472692][ T139]
[ 14.472692][ T139] rcu_scheduler_active = 2, debug_locks = 1
[ 14.473469][ T139] no locks held by (journald)/139.
[ 14.473935][ T139]
[ 14.473935][ T139] stack backtrace:
[ 14.474454][ T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
[ 14.475296][ T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 14.476298][ T139] Call Trace:
[ 14.476608][ T139] dump_stack_lvl+0x78/0x8c
[ 14.477055][ T139] dump_stack+0x12/0x18
[ 14.477420][ T139] lockdep_rcu_suspicious+0x153/0x1a4
[ 14.477928][ T139] cgroup_bpf_device_guard_enabled+0x14f/0x168
[ 14.478476][ T139] devcgroup_task_is_guarded+0x10/0x20
[ 14.478973][ T139] may_open_dev+0x11/0x44
[ 14.479367][ T139] may_open+0x115/0x13c
[ 14.479727][ T139] do_open+0xa1/0x378
[ 14.480113][ T139] path_openat+0xdc/0x1bc
[ 14.480512][ T139] do_filp_open+0x91/0x124
[ 14.480911][ T139] ? lock_release+0x62/0x118
[ 14.481329][ T139] ? _raw_spin_unlock+0x18/0x34
[ 14.481797][ T139] ? alloc_fd+0x112/0x1c4
[ 14.482183][ T139] do_sys_openat2+0x7a/0xa0
[ 14.482592][ T139] __ia32_sys_openat+0x66/0x9c
[ 14.483065][ T139] do_int80_syscall_32+0x27/0x48
[ 14.483502][ T139] entry_INT80_32+0x10d/0x10d
[ 14.483962][ T139] EIP: 0xa7f39092
[ 14.484267][ T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 14.485995][ T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
[ 14.486622][ T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
[ 14.487225][ T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
On Tue, Aug 15, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
>
> commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
> url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
> patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
> patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> [ 14.468719][ T139]
> [ 14.468999][ T139] =============================
> [ 14.469545][ T139] WARNING: suspicious RCU usage
> [ 14.469968][ T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
> [ 14.470520][ T139] -----------------------------
> [ 14.470940][ T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!

Most likely it's because in "cgroup_bpf_device_guard_enabled" function:

struct cgroup *cgrp = task_dfl_cgroup(task);

should be under rcu_read_lock (or cgroup_mutex).

If we get rid of cgroup_mutex and make cgroup_bpf_device_guard_enabled
function specific to "current" task we will solve this issue too.

> [ 14.471703][ T139]
> [ 14.471703][ T139] other info that might help us debug this:
> [ 14.471703][ T139]
> [ 14.472692][ T139]
> [ 14.472692][ T139] rcu_scheduler_active = 2, debug_locks = 1
> [ 14.473469][ T139] no locks held by (journald)/139.
> [ 14.473935][ T139]
> [ 14.473935][ T139] stack backtrace:
> [ 14.474454][ T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
> [ 14.475296][ T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 14.476298][ T139] Call Trace:
> [ 14.476608][ T139] dump_stack_lvl+0x78/0x8c
> [ 14.477055][ T139] dump_stack+0x12/0x18
> [ 14.477420][ T139] lockdep_rcu_suspicious+0x153/0x1a4
> [ 14.477928][ T139] cgroup_bpf_device_guard_enabled+0x14f/0x168
> [ 14.478476][ T139] devcgroup_task_is_guarded+0x10/0x20
> [ 14.478973][ T139] may_open_dev+0x11/0x44
> [ 14.479367][ T139] may_open+0x115/0x13c
> [ 14.479727][ T139] do_open+0xa1/0x378
> [ 14.480113][ T139] path_openat+0xdc/0x1bc
> [ 14.480512][ T139] do_filp_open+0x91/0x124
> [ 14.480911][ T139] ? lock_release+0x62/0x118
> [ 14.481329][ T139] ? _raw_spin_unlock+0x18/0x34
> [ 14.481797][ T139] ? alloc_fd+0x112/0x1c4
> [ 14.482183][ T139] do_sys_openat2+0x7a/0xa0
> [ 14.482592][ T139] __ia32_sys_openat+0x66/0x9c
> [ 14.483065][ T139] do_int80_syscall_32+0x27/0x48
> [ 14.483502][ T139] entry_INT80_32+0x10d/0x10d
> [ 14.483962][ T139] EIP: 0xa7f39092
> [ 14.484267][ T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4
> 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [ 14.485995][ T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
> [ 14.486622][ T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
> [ 14.487225][ T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
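The locking point raised in the reply above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the "lockdep" check is reduced to a nesting counter for the RCU read-side section, and the real kernel check behind `task_dfl_cgroup()` accepts more conditions than this model does. It shows why looking up the cgroup without holding the read lock triggers a report, and how wrapping the lookup in a lock/unlock pair avoids it.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these are kernel symbols. */
static int model_rcu_depth;      /* RCU read-side nesting depth */
static int model_lockdep_splats; /* number of "suspicious RCU usage" reports */

struct model_cgroup { bool device_guard; };
struct model_task { struct model_cgroup *dfl_cgrp; };

void model_rcu_read_lock(void)   { model_rcu_depth++; }
void model_rcu_read_unlock(void) { model_rcu_depth--; }

/* Stands in for task_dfl_cgroup(): complains if no read lock is held. */
struct model_cgroup *model_task_dfl_cgroup(const struct model_task *task)
{
	if (model_rcu_depth == 0)
		model_lockdep_splats++;
	return task->dfl_cgrp;
}

/* Shape of the lookup that produced the warning: no lock taken. */
bool model_guard_enabled_unlocked(const struct model_task *task)
{
	return model_task_dfl_cgroup(task)->device_guard;
}

/* Shape of the suggested fix: the lookup sits inside a read-side section. */
bool model_guard_enabled_locked(const struct model_task *task)
{
	bool guarded;

	model_rcu_read_lock();
	guarded = model_task_dfl_cgroup(task)->device_guard;
	model_rcu_read_unlock();
	return guarded;
}
```

Both variants return the same answer in this model; only the unlocked one increments the report counter, which mirrors the boot-time warning in the trace.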
diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
 
 bool may_open_dev(const struct path *path)
 {
+	if (devcgroup_task_is_guarded(current))
+		return !(path->mnt->mnt_flags & MNT_NODEV);
+
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
-		return -EPERM;
+	/*
+	 * In case of a device cgroup restriction allow mknod in user
+	 * namespace. Otherwise just check global capability; thus,
+	 * mknod is also disabled for user namespace other than the
+	 * initial one.
+	 */
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+		if (devcgroup_task_is_guarded(current)) {
+			if (!ns_capable(current_user_ns(), CAP_MKNOD))
+				return -EPERM;
+		} else if (!capable(CAP_MKNOD))
+			return -EPERM;
+	}
 
 	if (!dir->i_op->mknod)
 		return -EPERM;
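
The two decisions changed by the diff above can be summarized as a userspace truth-table model. This is a sketch, not kernel code: the function names and boolean parameters are hypothetical stand-ins for `devcgroup_task_is_guarded(current)`, `capable(CAP_MKNOD)`, `ns_capable(current_user_ns(), CAP_MKNOD)`, and the `MNT_NODEV` / `SB_I_NODEV` flags.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the capability check for character/block device
 * nodes in vfs_mknod() after the patch:
 *   guarded    ~ devcgroup_task_is_guarded(current)
 *   cap_init   ~ capable(CAP_MKNOD)
 *   cap_userns ~ ns_capable(current_user_ns(), CAP_MKNOD)
 */
bool model_mknod_dev_permitted(bool guarded, bool cap_init, bool cap_userns)
{
	if (guarded)
		return cap_userns; /* the namespaced capability is enough */
	return cap_init;           /* otherwise the global capability is required */
}

/*
 * Hypothetical model of may_open_dev() after the patch: a guarded task
 * still honours MNT_NODEV but ignores SB_I_NODEV.
 */
bool model_may_open_dev(bool guarded, bool mnt_nodev, bool sb_i_nodev)
{
	if (guarded)
		return !mnt_nodev;
	return !mnt_nodev && !sb_i_nodev;
}
```

For example, `model_mknod_dev_permitted(true, false, true)` corresponds to a guarded task that holds CAP_MKNOD only in its own user namespace, which is the case the patch newly permits.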
If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, we allow mknod() in fs/namei.c if a
bpf cgroup device guard is enabled for the current task using
devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
namespace by ns_capable() instead of the global CAP_MKNOD.

To avoid unusable device nodes on file systems mounted in
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
for cgroup device guarded tasks.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 fs/namei.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
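
The operation described in the commit message above, creating a second node for /dev/null outside /dev, can be tried from user space with a small helper. This is a sketch with assumptions: the helper name `try_mknod_null` is hypothetical, the device numbers 1 and 3 are the usual Linux major and minor for /dev/null, and the result depends on the kernel and on the privileges of the calling process.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try to create a character device node for
 * /dev/null (major 1, minor 3) at the given path.
 * Returns 0 on success or a negative errno value on failure.
 */
int try_mknod_null(const char *path)
{
	if (mknod(path, S_IFCHR | 0666, makedev(1, 3)) == 0)
		return 0;
	return -errno;
}
```

Per the commit message, such a call currently fails inside a non-initial user namespace; with `capable(CAP_MKNOD)` as the gate, the expected return value there would be `-EPERM`. A caller can compare the result on a patched and an unpatched kernel to see whether the device guard took effect.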