Message ID | 20231025094224.72858-1-michael.weiss@aisec.fraunhofer.de (mailing list archive) |
---|---|
Headers | show |
Series | device_cgroup: guard mknod for non-initial user namespace | expand |
On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß <michael.weiss@aisec.fraunhofer.de> wrote: > > Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type > BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod > in non-initial user namespaces. > > If a container manager restricts its unprivileged (user namespaced) > children by a device cgroup, it is not necessary to deny mknod() > anymore. Thus, user space applications may map devices on different > locations in the file system by using mknod() inside the container. > > A use case for this, we also use in GyroidOS, is to run virsh for > VMs inside an unprivileged container. virsh creates device nodes, > e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails > in a non-initial userns, even if a cgroup device white list with the > corresponding major, minor of /dev/null exists. Thus, in this case > the usual bind mounts or pre populated device nodes under /dev are > not sufficient. > > To circumvent this limitation, allow mknod() by checking CAP_MKNOD > in the userns by implementing the security_inode_mknod_nscap(). The > hook implementation checks if the corresponding permission flag > BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program. > To avoid to create unusable inodes in user space the hook also > checks SB_I_NODEV on the corresponding super block. > > Further, the security_sb_alloc_userns() hook is implemented using > cgroup_bpf_current_enabled() to allow usage of device nodes on super > blocks mounted by a guarded task. > > Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM > > Patch 4 to 8 rework explicit calls to devcgroup_check_permission > also as LSM hooks and finalize the conversion of the device_cgroup > subsystem to a LSM. > > Patch 9 and 10 introduce new generic security hooks to be used > for the actual mknod device guard implementation. > > Patch 11 wires up the security hooks in the vfs > > Patch 12 and 13 provide helper functions in the bpf cgroup > subsystem. > > Patch 14 finally implement the LSM hooks to grand access > > Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> > --- > Changes in v2: > - Integrate this as LSM (Christian, Paul) > - Switched to a device cgroup specific flag instead of a generic > bpf program flag (Christian) > - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in > sb_alloc_super in fs/super.c > - Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de > > Michael Weiß (14): > device_cgroup: Implement devcgroup hooks as lsm security hooks > vfs: Remove explicit devcgroup_inode calls > device_cgroup: Remove explicit devcgroup_inode hooks > lsm: Add security_dev_permission() hook > device_cgroup: Implement dev_permission() hook > block: Switch from devcgroup_check_permission to security hook > drm/amdkfd: Switch from devcgroup_check_permission to security hook > device_cgroup: Hide devcgroup functionality completely in lsm > lsm: Add security_inode_mknod_nscap() hook > lsm: Add security_sb_alloc_userns() hook > vfs: Wire up security hooks for lsm-based device guard in userns > bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access > bpf: cgroup: Introduce helper cgroup_bpf_current_enabled() > device_cgroup: Allow mknod in non-initial userns if guarded > > block/bdev.c | 9 +- > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +- > fs/namei.c | 24 ++-- > fs/super.c | 6 +- > include/linux/bpf-cgroup.h | 2 + > include/linux/device_cgroup.h | 67 ----------- > include/linux/lsm_hook_defs.h | 4 + > include/linux/security.h | 18 +++ > include/uapi/linux/bpf.h | 1 + > init/Kconfig | 4 + > kernel/bpf/cgroup.c | 14 +++ > security/Kconfig | 1 + > security/Makefile | 2 +- > security/device_cgroup/Kconfig | 7 ++ > security/device_cgroup/Makefile | 4 + > security/{ => device_cgroup}/device_cgroup.c | 3 +- > security/device_cgroup/device_cgroup.h | 20 ++++ > security/device_cgroup/lsm.c | 114 +++++++++++++++++++ > security/security.c | 75 ++++++++++++ > 19 files changed, 294 insertions(+), 88 deletions(-) > delete mode 100644 include/linux/device_cgroup.h > create mode 100644 security/device_cgroup/Kconfig > create mode 100644 security/device_cgroup/Makefile > rename security/{ => device_cgroup}/device_cgroup.c (99%) > create mode 100644 security/device_cgroup/device_cgroup.h > create mode 100644 security/device_cgroup/lsm.c Hi Michael, I think this was lost because it wasn't CC'd to the LSM list (see below). I've CC'd the list on my reply, but future patch submissions that involve the LSM must be posted to the LSM list if you would like them to be considered. http://vger.kernel.org/vger-lists.html#linux-security-module
On 25.10.23 15:17, Paul Moore wrote: > On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß > <michael.weiss@aisec.fraunhofer.de> wrote: >> >> Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type >> BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod >> in non-initial user namespaces. >> >> If a container manager restricts its unprivileged (user namespaced) >> children by a device cgroup, it is not necessary to deny mknod() >> anymore. Thus, user space applications may map devices on different >> locations in the file system by using mknod() inside the container. >> >> A use case for this, we also use in GyroidOS, is to run virsh for >> VMs inside an unprivileged container. virsh creates device nodes, >> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails >> in a non-initial userns, even if a cgroup device white list with the >> corresponding major, minor of /dev/null exists. Thus, in this case >> the usual bind mounts or pre populated device nodes under /dev are >> not sufficient. >> >> To circumvent this limitation, allow mknod() by checking CAP_MKNOD >> in the userns by implementing the security_inode_mknod_nscap(). The >> hook implementation checks if the corresponding permission flag >> BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program. >> To avoid to create unusable inodes in user space the hook also >> checks SB_I_NODEV on the corresponding super block. >> >> Further, the security_sb_alloc_userns() hook is implemented using >> cgroup_bpf_current_enabled() to allow usage of device nodes on super >> blocks mounted by a guarded task. >> >> Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM >> >> Patch 4 to 8 rework explicit calls to devcgroup_check_permission >> also as LSM hooks and finalize the conversion of the device_cgroup >> subsystem to a LSM. >> >> Patch 9 and 10 introduce new generic security hooks to be used >> for the actual mknod device guard implementation. >> >> Patch 11 wires up the security hooks in the vfs >> >> Patch 12 and 13 provide helper functions in the bpf cgroup >> subsystem. >> >> Patch 14 finally implement the LSM hooks to grand access >> >> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> >> --- >> Changes in v2: >> - Integrate this as LSM (Christian, Paul) >> - Switched to a device cgroup specific flag instead of a generic >> bpf program flag (Christian) >> - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in >> sb_alloc_super in fs/super.c >> - Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de >> >> Michael Weiß (14): >> device_cgroup: Implement devcgroup hooks as lsm security hooks >> vfs: Remove explicit devcgroup_inode calls >> device_cgroup: Remove explicit devcgroup_inode hooks >> lsm: Add security_dev_permission() hook >> device_cgroup: Implement dev_permission() hook >> block: Switch from devcgroup_check_permission to security hook >> drm/amdkfd: Switch from devcgroup_check_permission to security hook >> device_cgroup: Hide devcgroup functionality completely in lsm >> lsm: Add security_inode_mknod_nscap() hook >> lsm: Add security_sb_alloc_userns() hook >> vfs: Wire up security hooks for lsm-based device guard in userns >> bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access >> bpf: cgroup: Introduce helper cgroup_bpf_current_enabled() >> device_cgroup: Allow mknod in non-initial userns if guarded >> >> block/bdev.c | 9 +- >> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +- >> fs/namei.c | 24 ++-- >> fs/super.c | 6 +- >> include/linux/bpf-cgroup.h | 2 + >> include/linux/device_cgroup.h | 67 ----------- >> include/linux/lsm_hook_defs.h | 4 + >> include/linux/security.h | 18 +++ >> include/uapi/linux/bpf.h | 1 + >> init/Kconfig | 4 + >> kernel/bpf/cgroup.c | 14 +++ >> security/Kconfig | 1 + >> security/Makefile | 2 +- >> security/device_cgroup/Kconfig | 7 ++ >> security/device_cgroup/Makefile | 4 + >> security/{ => device_cgroup}/device_cgroup.c | 3 +- >> security/device_cgroup/device_cgroup.h | 20 ++++ >> security/device_cgroup/lsm.c | 114 +++++++++++++++++++ >> security/security.c | 75 ++++++++++++ >> 19 files changed, 294 insertions(+), 88 deletions(-) >> delete mode 100644 include/linux/device_cgroup.h >> create mode 100644 security/device_cgroup/Kconfig >> create mode 100644 security/device_cgroup/Makefile >> rename security/{ => device_cgroup}/device_cgroup.c (99%) >> create mode 100644 security/device_cgroup/device_cgroup.h >> create mode 100644 security/device_cgroup/lsm.c > > Hi Michael, > > I think this was lost because it wasn't CC'd to the LSM list (see > below). I've CC'd the list on my reply, but future patch submissions > that involve the LSM must be posted to the LSM list if you would like > them to be considered. > > http://vger.kernel.org/vger-lists.html#linux-security-module > Hi Paul, thanks, I'll keep this in mind for the next submissions. I have also resend because, I realized that some spam filters my have swallowed the last submission as I used my private smtp server from another domain in the gitconfig. Sorry for that. I hope now every one received it. Thanks, Michael
> - Integrate this as LSM (Christian, Paul)
Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
thing...
So I dislike this less than the initial version that just worked around
SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:
(1) current device access management
I summarized the current places where that's done very very briefly in
https://lore.kernel.org/all/20230815-feigling-kopfsache-56c2d31275bd@brauner
* inode_permission()
-> devcgroup_inode_permission()
* vfs_mknod()
-> devcgroup_inode_mknod()
* blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
-> devcgroup_check_permission()
* drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
-> devcgroup_check_permission()
but that's not enough. What we need is a summary of how device node
creation and device node opening currently interact.
Because it is subtle. Currently you cannot even create device nodes
without capable(CAP_SYS_ADMIN) and you can't open any existing ones
if you lack capable(CAP_SYS_ADMIN).
Years ago we tried that insane model where we enabled userspace to
create device nodes but not open them. IOW, the capability check
for device node creation was lifted but the SB_I_NODEV limitation
wasn't lifted. That broke the whole world and had to be reverted.
(2) LSM device access management
I really want to be able to see how you envision the permission
checking to work in the new model. Specifically:
* How do device node creation and device node opening interact.
* The consequences of allowing to remove the SB_I_NODEV restriction.
* Permission checking for users without and without a bpf guarded
profile.
If you write this down we'll add it to documentation as well or to the
commit messages. It won't be lost. It doesn't have to be some really
long thing. I just want to better understand what you think this is
going to do and what the code does.
On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote: > > - Integrate this as LSM (Christian, Paul) > > Huh, my rant made you write an LSM. I'm not sure if that's a good or bad > thing... > > So I dislike this less than the initial version that just worked around Hm, I wonder if we're being to timid or too complex in how we want to solve this problem. The device cgroup management logic is hacked into multiple layers and is frankly pretty appalling. What I think device access management wants to look like is that you can implement a policy in an LSM - be it bpf or regular selinux - and have this guarded by the main hooks: security_file_open() security_inode_mknod() So, look at: vfs_get_tree() -> security_sb_set_mnt_opts() -> bpf_sb_set_mnt_opts() A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags today via bpf_sb_set_mnt_opts() without any kernel changes at all. I assume that a bpf LSM can also keep state in sb->s_security just like selinux et al? If so then a device access management program or whatever can be stored in sb->s_security. That device access management program would then be run on each call to: security_file_open() -> bpf_file_open() and security_inode_mknod() -> bpf_sb_set_mnt_opts() and take access decisions. This obviously makes device access management something that's tied completely to a filesystem. So, you could have the same device node on two tmpfs filesystems both mounted in the same userns. The first tmpfs has SB_I_NODEV and doesn't allow you to open that device. The second tmpfs has a bpf LSM program attached to it that has stripped SB_I_NODEV and manages device access and allows callers to open that device. I guess it's even possible to restrict this on a caller basis by marking them with a "container id" when the container is started. That can be done with that task storage thing also via a bpf LSM hook. And then you can further restrict device access to only those tasks that have a specific container id in some range or some token or something. I might just be fantasizing abilities into bpf that it doesn't have so anyone with the knowledge please speak up. If this is feasible then the only thing we need to figure out is what to do with the legacy cgroup access management and specifically the capable(CAP_SYS_ADMIN) check that's more of a hack than anything else. So, we could introduce a sysctl that makes it possible to turn this check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to SB_I_NODEV it is inherently safe to do that. It's just that a lot of container runtimes need to have time to adapt to a world where you may be able to create a device but not be able to then open it. This isn't rocket science but it will take time. But in the end this will mean we get away with minimal kernel changes and using a lot of existing infrastructure. Thoughts?
On 24.11.23 19:08, Christian Brauner wrote: > On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote: >>> - Integrate this as LSM (Christian, Paul) >> >> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad >> thing... >> >> So I dislike this less than the initial version that just worked around >> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see the rules written down: Since we have some new Ideas, I also will try to better describe the vision and current device node interaction when submitting the next version. > > Hm, I wonder if we're being to timid or too complex in how we want to > solve this problem. > > The device cgroup management logic is hacked into multiple layers and is > frankly pretty appalling. > > What I think device access management wants to look like is that you can > implement a policy in an LSM - be it bpf or regular selinux - and have > this guarded by the main hooks: > > security_file_open() > security_inode_mknod() > > So, look at: > > vfs_get_tree() > -> security_sb_set_mnt_opts() > -> bpf_sb_set_mnt_opts() > > A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags > today via bpf_sb_set_mnt_opts() without any kernel changes at all. > > I assume that a bpf LSM can also keep state in sb->s_security just like > selinux et al? If so then a device access management program or whatever > can be stored in sb->s_security. > > That device access management program would then be run on each call to: > > security_file_open() > -> bpf_file_open() > > and > > security_inode_mknod() > -> bpf_sb_set_mnt_opts() > > and take access security_sb_set_mnt_optsdecisions. > > This obviously makes device access management something that's tied > completely to a filesystem. So, you could have the same device node on > two tmpfs filesystems both mounted in the same userns. > > The first tmpfs has SB_I_NODEV and doesn't allow you to open that > device. The second tmpfs has a bpf LSM program attached to it that has > stripped SB_I_NODEV and manages device access and allows callers to open > that device. I like the approach to clear SB_I_NODEV in security_sb_set_mnt_opts(). I still have to sort this out but I think that was the missing piece in the current patch set. > > I guess it's even possible to restrict this on a caller basis by marking > them with a "container id" when the container is started. That can be > done with that task storage thing also via a bpf LSM hook. And then > you can further restrict device access to only those tasks that have a > specific container id in some range or some token or something. > > I might just be fantasizing abilities into bpf that it doesn't have so > anyone with the knowledge please speak up. > > If this is feasible then the only thing we need to figure out is what to > do with the legacy cgroup access management and specifically the > capable(CAP_SYS_ADMIN) check that's more of a hack than anything else. First to make this clear, we speak about CAP_SYS_MKNOD. My approach was to restructure the cgroup_device in an own cgroup_device lsm not in the current bpf lsm, to also be able to handle the legacy calls. Especially, the remaining direct calls to devcgroup_check_permission() are transformed to a new security_hook security_dev_permission() which is similar to security_inode_permission() but could be called in such places where only the dev_t representation is available. However, if we implement it that way you sketched up above, we can just leave the devcgroup_check_permission() in its current place and only implement the devcgroup_inode_permission() and devcgroup_mknode and let the blk and amd/gpu drivers as is for now, or just leave all the devcgroup_*() hooks as is. > > So, we could introduce a sysctl that makes it possible to turn this > check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to > SB_I_NODEV it is inherently safe to do that. It's just that a lot of > container runtimes need to have time to adapt to a world where you may > be able to create a device but not be able to then open it. This isn't > rocket science but it will take time. True. I think a sysctl would be a good option. > > But in the end this will mean we get away with minimal kernel changes > and using a lot of existing infrastructure. > > Thoughts? For the whole bpf lsm part I'm not confident enough to make any proposition, yet. But I think an own simple devnode lsm would be feasible with the above described security_sb_set_mnt_opts() handling to get this idea realized. Maybe we go that way to implement a simple lsm without any changes to the device_cgroup and keep the devcgroup hooks in place. To implement it as bpf lsm with all full blown conatiner_id could then be done orthogonally. So from a simple container runtime perspective one could just use the simple lsm and the existing bpf device cgroup program without any change only having to activate the sysctl. A more complex cloud setup Kubernetes what so ever, could then use bpf lsm approach.
Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod in non-initial user namespaces. If a container manager restricts its unprivileged (user namespaced) children by a device cgroup, it is not necessary to deny mknod() anymore. Thus, user space applications may map devices on different locations in the file system by using mknod() inside the container. A use case for this, we also use in GyroidOS, is to run virsh for VMs inside an unprivileged container. virsh creates device nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails in a non-initial userns, even if a cgroup device white list with the corresponding major, minor of /dev/null exists. Thus, in this case the usual bind mounts or pre populated device nodes under /dev are not sufficient. To circumvent this limitation, allow mknod() by checking CAP_MKNOD in the userns by implementing the security_inode_mknod_nscap(). The hook implementation checks if the corresponding permission flag BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program. To avoid to create unusable inodes in user space the hook also checks SB_I_NODEV on the corresponding super block. Further, the security_sb_alloc_userns() hook is implemented using cgroup_bpf_current_enabled() to allow usage of device nodes on super blocks mounted by a guarded task. Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM Patch 4 to 8 rework explicit calls to devcgroup_check_permission also as LSM hooks and finalize the conversion of the device_cgroup subsystem to a LSM. Patch 9 and 10 introduce new generic security hooks to be used for the actual mknod device guard implementation. Patch 11 wires up the security hooks in the vfs Patch 12 and 13 provide helper functions in the bpf cgroup subsystem. Patch 14 finally implement the LSM hooks to grand access Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> --- Changes in v2: - Integrate this as LSM (Christian, Paul) - Switched to a device cgroup specific flag instead of a generic bpf program flag (Christian) - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in sb_alloc_super in fs/super.c - Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de Michael Weiß (14): device_cgroup: Implement devcgroup hooks as lsm security hooks vfs: Remove explicit devcgroup_inode calls device_cgroup: Remove explicit devcgroup_inode hooks lsm: Add security_dev_permission() hook device_cgroup: Implement dev_permission() hook block: Switch from devcgroup_check_permission to security hook drm/amdkfd: Switch from devcgroup_check_permission to security hook device_cgroup: Hide devcgroup functionality completely in lsm lsm: Add security_inode_mknod_nscap() hook lsm: Add security_sb_alloc_userns() hook vfs: Wire up security hooks for lsm-based device guard in userns bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access bpf: cgroup: Introduce helper cgroup_bpf_current_enabled() device_cgroup: Allow mknod in non-initial userns if guarded block/bdev.c | 9 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +- fs/namei.c | 24 ++-- fs/super.c | 6 +- include/linux/bpf-cgroup.h | 2 + include/linux/device_cgroup.h | 67 ----------- include/linux/lsm_hook_defs.h | 4 + include/linux/security.h | 18 +++ include/uapi/linux/bpf.h | 1 + init/Kconfig | 4 + kernel/bpf/cgroup.c | 14 +++ security/Kconfig | 1 + security/Makefile | 2 +- security/device_cgroup/Kconfig | 7 ++ security/device_cgroup/Makefile | 4 + security/{ => device_cgroup}/device_cgroup.c | 3 +- security/device_cgroup/device_cgroup.h | 20 ++++ security/device_cgroup/lsm.c | 114 +++++++++++++++++++ security/security.c | 75 ++++++++++++ 19 files changed, 294 insertions(+), 88 deletions(-) delete mode 100644 include/linux/device_cgroup.h create mode 100644 security/device_cgroup/Kconfig create mode 100644 security/device_cgroup/Makefile rename security/{ => device_cgroup}/device_cgroup.c (99%) create mode 100644 security/device_cgroup/device_cgroup.h create mode 100644 security/device_cgroup/lsm.c base-commit: 58720809f52779dc0f08e53e54b014209d13eebb