[RFC,WIP] namespace.c: Allow some unprivileged proc mounts when not fully visible

Since Linux v4.2 with commit 1b852bceb0d1 ("mnt: Refactor the logic for
mounting sysfs and proc in a user namespace"), new mounts of proc or
sysfs in non init userns are only allowed when there is at least one
fully-visible proc or sysfs mount.

This is to enforce that proc/sysfs files masked by a mount are still
masked in a new mount in a unprivileged userns. The locked mount logic
for bind mounts (has_locked_children()) was not enough in the case of
proc/sysfs new mounts because some files in proc (/proc/kcore) exist as
a singleton rather than being owned by a specific proc mount.

Unfortunately, this blocks me from using userns from within a Docker
container because Docker containers mask entries such as /proc/kcore. My
use case is to build container images with arbitrary commands (such as
using "RUN" commands in Dockerfiles) without privileges and from within
a Docker container. Those arbitrary commands could be shell scripts that
require /proc.

The following commands show my problem:

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
mount: permission denied (are you root?)

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
ok

This patch is a WIP attempt to ease new proc mounts in a user namespace
even when the proc mount in the parent container has masked entries.
However, to preserve the security guarantee of mount_too_revealing(),
the same masked entries in the old proc mount must be masked in the new
proc mount.

It cannot be masked with mounts covering the entries because it's not
possible to use MS_REC for new proc mount and add covering submounts at
the same time. Instead, it introduces new options in proc to disable
some proc entries (TBD). A proc entry will be disabled when all other
proc mounts have the same entry disabled, or when all other proc mounts
have the same entry masked by a submount.

The granularity does not need to be per proc entry. It is simpler to
define categories of entries that can be hidden. In practice, only a few
entries need to support disablement and what matters is that the new
proc mount is more masked than the other proc mounts. Granularity can be
improved later if use cases exist.

The flag IOP_USERNS_HIDEABLE is added on some proc inodes that are
singletons such as /proc/kcore. This flag is used in
mnt_already_visible() to signal that, as an exception to the general
rule, the file can be masked by a mount without blocking the new proc
mount. The hideable category is computed (WIP) and returned (WIP) in
order to configure the new proc mount before attaching it to the mount
tree.

For my use case, I will need to support at least the following entries:

$ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

This patch can be tested in the following way:

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/cmdline && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
mount: /proc: permission denied.
(this patch does not support /proc/cmdline as hideable)

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/kcore && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
ok
(this patch marks /proc/kcore as hideable: the new mounts works fine,
whereas it didn't work on vanilla kernels)

Signed-off-by: Alban Crequy <alban@kinvolk.io>
---
 fs/namespace.c     | 26 +++++++++++++++++++++-----
 fs/proc/generic.c  |  5 +++++
 fs/proc/inode.c    |  2 ++
 fs/proc/internal.h |  1 +
 include/linux/fs.h |  1 +
 5 files changed, 30 insertions(+), 5 deletions(-)

[RFC,WIP] namespace.c: Allow some unprivileged proc mounts when not fully visible

Commit Message

Comments

Patch