[RESEND,RFC,v2,00/14] device_cgroup: guard mknod for non-initial user namespace

Message ID	20231025094224.72858-1-michael.weiss@aisec.fraunhofer.de (mailing list archive)
From: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
To: Alexander Mikhalitsyn <alexander@mihalicyn.com>,
    Christian Brauner <brauner@kernel.org>,
    Alexei Starovoitov <ast@kernel.org>,
    Paul Moore <paul@paul-moore.com>
CC: Daniel Borkmann <daniel@iogearbox.net>,
    Andrii Nakryiko <andrii@kernel.org>,
    Martin KaFai Lau <martin.lau@linux.dev>,
    Song Liu <song@kernel.org>,
    Yonghong Song <yhs@fb.com>,
    John Fastabend <john.fastabend@gmail.com>,
    KP Singh <kpsingh@kernel.org>,
    Stanislav Fomichev <sdf@google.com>,
    Hao Luo <haoluo@google.com>,
    Jiri Olsa <jolsa@kernel.org>,
    Quentin Monnet <quentin@isovalent.com>,
    Alexander Viro <viro@zeniv.linux.org.uk>,
    Miklos Szeredi <miklos@szeredi.hu>,
    Amir Goldstein <amir73il@gmail.com>,
    "Serge E. Hallyn" <serge@hallyn.com>,
    <bpf@vger.kernel.org>,
    <linux-kernel@vger.kernel.org>,
    <linux-fsdevel@vger.kernel.org>,
    <gyroidos@aisec.fraunhofer.de>,
    Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Subject: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace
Date: Wed, 25 Oct 2023 11:42:10 +0200

Michael Weiß Oct. 25, 2023, 9:42 a.m. UTC

Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
BPF_PROG_TYPE_CGROUP_DEVICE, which allows guarding access to mknod
in non-initial user namespaces.

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is no longer necessary to deny mknod().
Thus, user space applications may map devices at different
locations in the file system by using mknod() inside the container.

One use case for this, which we also have in GyroidOS, is running virsh
for VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major and minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre-populated device nodes under /dev are
not sufficient.

To circumvent this limitation, allow mknod() by checking CAP_MKNOD
in the userns through a new hook, security_inode_mknod_nscap(). The
hook implementation checks whether the corresponding permission flag
BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
To avoid creating unusable inodes in user space, the hook also
checks SB_I_NODEV on the corresponding super block.
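
The decision described above can be sketched as a small user-space model.
This is not code from the series: the rule table, the function names and
the bit value chosen for the new flag are assumptions made purely for
illustration, and the model only mirrors the three conditions named in
the text (CAP_MKNOD in the userns, no SB_I_NODEV, guard flag granted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror BPF_DEVCG_ACC_{MKNOD,READ,WRITE} in include/uapi/linux/bpf.h. */
#define ACC_MKNOD      (1u << 0)
#define ACC_READ       (1u << 1)
#define ACC_WRITE      (1u << 2)
/* Assumption: the bit used for the new flag is illustrative only. */
#define ACC_MKNOD_UNS  (1u << 3)

/* One allow-list entry of the modelled device cgroup policy (hypothetical). */
struct dev_rule {
	char type;            /* 'c' for char, 'b' for block */
	unsigned int major;
	unsigned int minor;
	unsigned int access;  /* bitwise OR of ACC_* */
};

/* Return true if some rule grants every requested access bit for the device. */
static bool policy_allows(const struct dev_rule *rules, size_t n, char type,
			  unsigned int major, unsigned int minor,
			  unsigned int access)
{
	assert(rules != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct dev_rule *r = &rules[i];

		if (r->type == type && r->major == major &&
		    r->minor == minor && (r->access & access) == access)
			return true;
	}
	return false;
}

/*
 * Model of the check described for security_inode_mknod_nscap():
 * CAP_MKNOD in the user namespace, no SB_I_NODEV on the super block,
 * and the guard flag granted by the policy for this device.
 */
static bool mknod_in_userns_allowed(bool has_cap_mknod_in_userns,
				    bool sb_has_nodev,
				    const struct dev_rule *rules, size_t n,
				    char type, unsigned int major,
				    unsigned int minor)
{
	if (!has_cap_mknod_in_userns)
		return false;
	if (sb_has_nodev)
		return false;  /* the new node could not be opened anyway */
	return policy_allows(rules, n, type, major, minor, ACC_MKNOD_UNS);
}
```

With a single rule for /dev/null (char 1:3) that includes the guard bit,
the model permits mknod of 1:3 on a super block without SB_I_NODEV and
refuses it for any other device or when either precondition is missing.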

Further, the security_sb_alloc_userns() hook is implemented using
cgroup_bpf_current_enabled() to allow usage of device nodes on super
blocks mounted by a guarded task.

Patches 1 to 3 rework the current devcgroup_inode hooks as an LSM.

Patches 4 to 8 rework explicit calls to devcgroup_check_permission
as LSM hooks as well and finalize the conversion of the device_cgroup
subsystem to an LSM.

Patches 9 and 10 introduce new generic security hooks to be used
for the actual mknod device guard implementation.

Patch 11 wires up the security hooks in the vfs.

Patches 12 and 13 provide helper functions in the bpf cgroup
subsystem.

Patch 14 finally implements the LSM hooks to grant access.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
Changes in v2:
- Integrate this as LSM (Christian, Paul)
- Switched to a device cgroup specific flag instead of a generic
  bpf program flag (Christian)
- do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
  sb_alloc_super in fs/super.c
- Link to v1: https://lore.kernel.org/r/20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de

Michael Weiß (14):
  device_cgroup: Implement devcgroup hooks as lsm security hooks
  vfs: Remove explicit devcgroup_inode calls
  device_cgroup: Remove explicit devcgroup_inode hooks
  lsm: Add security_dev_permission() hook
  device_cgroup: Implement dev_permission() hook
  block: Switch from devcgroup_check_permission to security hook
  drm/amdkfd: Switch from devcgroup_check_permission to security hook
  device_cgroup: Hide devcgroup functionality completely in lsm
  lsm: Add security_inode_mknod_nscap() hook
  lsm: Add security_sb_alloc_userns() hook
  vfs: Wire up security hooks for lsm-based device guard in userns
  bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
  bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
  device_cgroup: Allow mknod in non-initial userns if guarded

 block/bdev.c                                 |   9 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h        |   7 +-
 fs/namei.c                                   |  24 ++--
 fs/super.c                                   |   6 +-
 include/linux/bpf-cgroup.h                   |   2 +
 include/linux/device_cgroup.h                |  67 -----------
 include/linux/lsm_hook_defs.h                |   4 +
 include/linux/security.h                     |  18 +++
 include/uapi/linux/bpf.h                     |   1 +
 init/Kconfig                                 |   4 +
 kernel/bpf/cgroup.c                          |  14 +++
 security/Kconfig                             |   1 +
 security/Makefile                            |   2 +-
 security/device_cgroup/Kconfig               |   7 ++
 security/device_cgroup/Makefile              |   4 +
 security/{ => device_cgroup}/device_cgroup.c |   3 +-
 security/device_cgroup/device_cgroup.h       |  20 ++++
 security/device_cgroup/lsm.c                 | 114 +++++++++++++++++++
 security/security.c                          |  75 ++++++++++++
 19 files changed, 294 insertions(+), 88 deletions(-)
 delete mode 100644 include/linux/device_cgroup.h
 create mode 100644 security/device_cgroup/Kconfig
 create mode 100644 security/device_cgroup/Makefile
 rename security/{ => device_cgroup}/device_cgroup.c (99%)
 create mode 100644 security/device_cgroup/device_cgroup.h
 create mode 100644 security/device_cgroup/lsm.c


base-commit: 58720809f52779dc0f08e53e54b014209d13eebb

Paul Moore Oct. 25, 2023, 1:17 p.m. UTC | #1

Hi Michael,

I think this was lost because it wasn't CC'd to the LSM list (see
below).  I've CC'd the list on my reply, but future patch submissions
that involve the LSM must be posted to the LSM list if you would like
them to be considered.

http://vger.kernel.org/vger-lists.html#linux-security-module

Michael Weiß Oct. 25, 2023, 6:11 p.m. UTC | #2

On 25.10.23 15:17, Paul Moore wrote:
> 
> Hi Michael,
> 
> I think this was lost because it wasn't CC'd to the LSM list (see
> below).  I've CC'd the list on my reply, but future patch submissions
> that involve the LSM must be posted to the LSM list if you would like
> them to be considered.
> 
> http://vger.kernel.org/vger-lists.html#linux-security-module
> 
Hi Paul,

thanks, I'll keep this in mind for the next submissions.

I have also resent this because I realized that some spam filters may
have swallowed the last submission, as I used my private SMTP server
from another domain in the gitconfig. Sorry for that. I hope everyone
has received it now.

Thanks,
Michael

Christian Brauner Nov. 24, 2023, 4:47 p.m. UTC | #3

> - Integrate this as LSM (Christian, Paul)

Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
thing...

So I dislike this less than the initial version that just worked around
SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

(1) current device access management
    I summarized the current places where that's done very very briefly in
    https://lore.kernel.org/all/20230815-feigling-kopfsache-56c2d31275bd@brauner

    * inode_permission()
      -> devcgroup_inode_permission()

    * vfs_mknod()
      -> devcgroup_inode_mknod()

    * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
      -> devcgroup_check_permission()

    * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
      -> devcgroup_check_permission()

    but that's not enough. What we need is a summary of how device node
    creation and device node opening currently interact.

    Because it is subtle. Currently you cannot even create device nodes
    without capable(CAP_SYS_ADMIN) and you can't open any existing ones
    if you lack capable(CAP_SYS_ADMIN).

    Years ago we tried that insane model where we enabled userspace to
    create device nodes but not open them. IOW, the capability check
    for device node creation was lifted but the SB_I_NODEV limitation
    wasn't lifted. That broke the whole world and had to be reverted.

(2) LSM device access management

    I really want to be able to see how you envision the permission
    checking to work in the new model. Specifically:

    * How device node creation and device node opening interact.
    * The consequences of allowing the SB_I_NODEV restriction to be removed.
    * Permission checking for users with and without a bpf guarded
      profile.

If you write this down we'll add it to documentation as well or to the
commit messages. It won't be lost. It doesn't have to be some really
long thing. I just want to better understand what you think this is
going to do and what the code does.

Christian Brauner Nov. 24, 2023, 6:08 p.m. UTC | #4

On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
> > - Integrate this as LSM (Christian, Paul)
> 
> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
> thing...
> 
> So I dislike this less than the initial version that just worked around

Hm, I wonder if we're being too timid or too complex in how we want to
solve this problem.

The device cgroup management logic is hacked into multiple layers and is
frankly pretty appalling.

What I think device access management wants to look like is that you can
implement a policy in an LSM - be it bpf or regular selinux - and have
this guarded by the main hooks:

security_file_open()
security_inode_mknod()

So, look at:

vfs_get_tree()
-> security_sb_set_mnt_opts()
   -> bpf_sb_set_mnt_opts()

A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
today via bpf_sb_set_mnt_opts() without any kernel changes at all.

I assume that a bpf LSM can also keep state in sb->s_security just like
selinux et al? If so then a device access management program or whatever
can be stored in sb->s_security.

That device access management program would then be run on each call to:

security_file_open()
-> bpf_file_open()

and

security_inode_mknod()
-> bpf_inode_mknod()

and take access decisions.

This obviously makes device access management something that's tied
completely to a filesystem. So, you could have the same device node on
two tmpfs filesystems both mounted in the same userns.

The first tmpfs has SB_I_NODEV and doesn't allow you to open that
device. The second tmpfs has a bpf LSM program attached to it that has
stripped SB_I_NODEV and manages device access and allows callers to open
that device.
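
The per-filesystem idea above can be pictured with a small user-space
model. It is only a sketch under stated assumptions: the struct layout,
the policy callback and the flag value are hypothetical stand-ins, and
it does not use any real bpf LSM interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SB_I_NODEV 0x1u  /* stand-in for the kernel's SB_I_NODEV bit */

struct model_dev {
	char type;             /* 'c' for char, 'b' for block */
	unsigned int major;
	unsigned int minor;
};

/* Hypothetical per-superblock policy, standing in for a bpf LSM program. */
typedef bool (*dev_policy_fn)(const struct model_dev *dev);

struct model_sb {
	unsigned int s_iflags;
	dev_policy_fn s_security;  /* NULL: no policy attached */
};

/* Example policy: only /dev/null (char 1:3) may be opened. */
static bool allow_dev_null_only(const struct model_dev *dev)
{
	return dev->type == 'c' && dev->major == 1 && dev->minor == 3;
}

/* Model of mount time: attaching a policy strips the NODEV flag. */
static void attach_policy(struct model_sb *sb, dev_policy_fn policy)
{
	assert(sb != NULL && policy != NULL);
	sb->s_security = policy;
	sb->s_iflags &= ~MODEL_SB_I_NODEV;
}

/* Model of the open-time decision for a device node on this superblock. */
static bool may_open_device(const struct model_sb *sb,
			    const struct model_dev *dev)
{
	if (sb->s_iflags & MODEL_SB_I_NODEV)
		return false;
	if (sb->s_security == NULL)
		return true;   /* no policy: the model adds no restriction */
	return sb->s_security(dev);
}
```

Two model_sb instances that both start with MODEL_SB_I_NODEV then behave
like the two tmpfs mounts: the one without a policy refuses every device
open, while the one passed to attach_policy() defers to its policy.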

I guess it's even possible to restrict this on a caller basis by marking
them with a "container id" when the container is started. That can be
done with that task storage thing also via a bpf LSM hook. And then
you can further restrict device access to only those tasks that have a
specific container id in some range or some token or something.

I might just be fantasizing abilities into bpf that it doesn't have so
anyone with the knowledge please speak up.

If this is feasible then the only thing we need to figure out is what to
do with the legacy cgroup access management and specifically the
capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

So, we could introduce a sysctl that makes it possible to turn this
check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
SB_I_NODEV it is inherently safe to do that. It's just that a lot of
container runtimes need to have time to adapt to a world where you may
be able to create a device but not be able to then open it. This isn't
rocket science but it will take time.
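
To make the difference between the two checks concrete, here is a
simplified user-space model. It is an illustration only: the structs,
the function names and the sysctl_relaxed switch are hypothetical, and
the namespace walk is a reduced picture of how a capability held in one
user namespace extends to its descendants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal user namespace model: each namespace knows only its parent. */
struct model_userns {
	const struct model_userns *parent;  /* NULL for the initial namespace */
};

struct model_task {
	/* Namespace in which the task holds the capability, or NULL. */
	const struct model_userns *cap_ns;
};

/* True if the task holds the capability in target or in one of its ancestors. */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_userns *target)
{
	assert(t != NULL && target != NULL);
	for (const struct model_userns *ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->cap_ns)
			return true;
	return false;
}

/* Model of capable(): the capability must be held in the initial namespace. */
static bool model_capable(const struct model_task *t,
			  const struct model_userns *init_ns)
{
	return t->cap_ns == init_ns;
}

/*
 * Hypothetical sysctl-gated check: strict mode behaves like capable(),
 * relaxed mode like ns_capable() against the superblock's user namespace.
 */
static bool mknod_cap_ok(const struct model_task *t,
			 const struct model_userns *init_ns,
			 const struct model_userns *sb_userns,
			 bool sysctl_relaxed)
{
	if (sysctl_relaxed)
		return model_ns_capable(t, sb_userns);
	return model_capable(t, init_ns);
}
```

In this model a container's root passes the relaxed check only for
superblocks owned by its own namespace or a descendant of it, while host
root passes both variants.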

But in the end this will mean we get away with minimal kernel changes
and using a lot of existing infrastructure.

Thoughts?

Michael Weiß Nov. 28, 2023, 8:54 p.m. UTC | #5

On 24.11.23 19:08, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
>>> - Integrate this as LSM (Christian, Paul)
>>
>> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
>> thing...
>>
>> So I dislike this less than the initial version that just worked around
>> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
>> the rules written down:

Since we have some new ideas, I will also try to better describe
the vision and the current device node interaction when submitting the
next version.

> 
> Hm, I wonder if we're being too timid or too complex in how we want to
> solve this problem.
> 
> The device cgroup management logic is hacked into multiple layers and is
> frankly pretty appalling.
> 
> What I think device access management wants to look like is that you can
> implement a policy in an LSM - be it bpf or regular selinux - and have
> this guarded by the main hooks:
> 
> security_file_open()
> security_inode_mknod()
> 
> So, look at:
> 
> vfs_get_tree()
> -> security_sb_set_mnt_opts()
>    -> bpf_sb_set_mnt_opts()
> 
> A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
> today via bpf_sb_set_mnt_opts() without any kernel changes at all.
> 
> I assume that a bpf LSM can also keep state in sb->s_security just like
> selinux et al? If so then a device access management program or whatever
> can be stored in sb->s_security.
> 
> That device access management program would then be run on each call to:
> 
> security_file_open()
> -> bpf_file_open()
> 
> and
> 
> security_inode_mknod()
> -> bpf_inode_mknod()
> 
> and take access decisions.
> 
> This obviously makes device access management something that's tied
> completely to a filesystem. So, you could have the same device node on
> two tmpfs filesystems both mounted in the same userns.
> 
> The first tmpfs has SB_I_NODEV and doesn't allow you to open that
> device. The second tmpfs has a bpf LSM program attached to it that has
> stripped SB_I_NODEV and manages device access and allows callers to open
> that device.

I like the approach to clear SB_I_NODEV in security_sb_set_mnt_opts().
I still have to sort this out but I think that was the missing piece in
the current patch set.

> 
> I guess it's even possible to restrict this on a caller basis by marking
> them with a "container id" when the container is started. That can be
> done with that task storage thing also via a bpf LSM hook. And then
> you can further restrict device access to only those tasks that have a
> specific container id in some range or some token or something.
> 
> I might just be fantasizing abilities into bpf that it doesn't have so
> anyone with the knowledge please speak up.
> 
> If this is feasible then the only thing we need to figure out is what to
> do with the legacy cgroup access management and specifically the
> capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

First, to make this clear: we are talking about CAP_MKNOD.

My approach was to restructure the device cgroup into its own
device_cgroup LSM, not into the current bpf LSM, so that the legacy calls
can be handled as well. In particular, the remaining direct calls to
devcgroup_check_permission() are transformed into a new security hook,
security_dev_permission(), which is similar to security_inode_permission()
but can be called in places where only the dev_t representation is
available. However, if we implement it the way you sketched above, we can
just leave devcgroup_check_permission() in its current place, implement
only devcgroup_inode_permission() and devcgroup_inode_mknod() as hooks,
and leave the blk and amd/gpu drivers as they are for now. Alternatively,
we leave all the devcgroup_*() hooks as they are.

> 
> So, we could introduce a sysctl that makes it possible to turn this
> check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
> SB_I_NODEV it is inherently safe to do that. It's just that a lot of
> container runtimes need to have time to adapt to a world where you may
> be able to create a device but not be able to then open it. This isn't
> rocket science but it will take time.

True. I think a sysctl would be a good option.

> 
> But in the end this will mean we get away with minimal kernel changes
> and using a lot of existing infrastructure.
> 
> Thoughts?

For the whole bpf LSM part, I'm not confident enough to make any
proposal yet.
But I think a simple, separate devnode LSM would be feasible, using the
security_sb_set_mnt_opts() handling described above to realize this idea.
Maybe we go that way and implement a simple LSM without any changes to
the device_cgroup, keeping the devcgroup hooks in place. Implementing
it as a bpf LSM with a full-blown container_id could then be done
orthogonally.
So from a simple container runtime perspective, one could just use the
simple LSM and the existing bpf device cgroup program without any change,
only having to activate the sysctl. A more complex cloud setup, Kubernetes
or whatever, could then use the bpf LSM approach.
