From patchwork Wed Apr 4 11:53:11 2018
X-Patchwork-Submitter: Alban Crequy
X-Patchwork-Id: 10322483
From: Alban Crequy
To: Alban Crequy
Cc: Dongsu Park, Iago Lopez Galeiras, Stephen J Day, Michael Crosby,
    Jess Frazelle, Akihiro Suda, Aleksa Sarai, Daniel J Walsh,
    "Eric W. Biederman", Alexander Viro, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, containers@lists.linux-foundation.org
Subject: [PATCH] [RFC][WIP] namespace.c: Allow some unprivileged proc mounts when not fully visible
Date: Wed, 4 Apr 2018 13:53:11 +0200
Message-Id: <20180404115311.725-1-alban@kinvolk.io>
X-Mailer: git-send-email 2.14.3
X-Mailing-List: linux-fsdevel@vger.kernel.org

Since Linux v4.2 with commit 1b852bceb0d1 ("mnt: Refactor the logic for
mounting sysfs and proc in a user namespace"), new mounts of proc or
sysfs in a non-init userns are only allowed when there is at least one
fully-visible proc or sysfs mount. This is to enforce that proc/sysfs
files masked by a mount are still masked in a new mount in an
unprivileged userns.

The locked mount logic for bind mounts (has_locked_children()) was not
enough in the case of new proc/sysfs mounts because some files in proc
(/proc/kcore) exist as a singleton rather than being owned by a specific
proc mount.

Unfortunately, this blocks me from using userns from within a Docker
container because Docker containers mask entries such as /proc/kcore.

My use case is to build container images with arbitrary commands (such
as using "RUN" commands in Dockerfiles) without privileges and from
within a Docker container. Those arbitrary commands could be shell
scripts that require /proc.

The following commands show my problem:

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
mount: permission denied (are you root?)

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
ok

This patch is a WIP attempt to ease new proc mounts in a user namespace
even when the proc mount in the parent container has masked entries.
However, to preserve the security guarantee of mount_too_revealing(),
the entries masked in the old proc mount must also be masked in the new
proc mount. They cannot be masked with mounts covering the entries
because it is not possible to use MS_REC for a new proc mount and add
covering submounts at the same time. Instead, this patch introduces new
options in proc to disable some proc entries (TBD).

A proc entry will be disabled when all other proc mounts have the same
entry disabled, or when all other proc mounts have the same entry masked
by a submount.

The granularity does not need to be per proc entry. It is simpler to
define categories of entries that can be hidden. In practice, only a few
entries need to support disablement, and what matters is that the new
proc mount is more masked than the other proc mounts. Granularity can be
improved later if use cases exist.

The flag IOP_USERNS_HIDEABLE is added on some proc inodes that are
singletons, such as /proc/kcore. This flag is used in
mnt_already_visible() to signal that, as an exception to the general
rule, the file can be masked by a mount without blocking the new proc
mount. The hideable category is computed (WIP) and returned (WIP) in
order to configure the new proc mount before attaching it to the mount
tree.
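
To make the rule above concrete, here is a minimal user-space sketch of
how the categories for a new proc mount could be derived. It is not
kernel code and not what this patch implements: struct proc_mount_model,
compute_hidden_categories() and the CAT_* bits are hypothetical names
used only for illustration, and taking the intersection over all usable
mounts is one possible reading of "disabled when all other proc mounts
have the same entry disabled or masked".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical category bits; the patch itself only hardcodes 0x01. */
#define CAT_KCORE	0x01u
#define CAT_SYSRQ	0x02u

/* Simplified, hypothetical stand-in for one existing proc mount. */
struct proc_mount_model {
	/* Categories disabled in this mount or masked by a locked submount. */
	unsigned int hidden_categories;
	/* True if a locked submount masks an entry that is not hideable. */
	bool has_other_locked_mask;
};

/*
 * Return true if a new proc mount may be created. On success, *out holds
 * the categories the new mount must hide: those that are hidden in every
 * usable existing mount.
 */
bool compute_hidden_categories(const struct proc_mount_model *mounts,
			       size_t n, unsigned int *out)
{
	unsigned int hidden = ~0u;
	bool usable = false;
	size_t i;

	assert(out != NULL);

	for (i = 0; i < n; i++) {
		/* A mount masking non-hideable entries cannot justify a new mount. */
		if (mounts[i].has_other_locked_mask)
			continue;
		/* Keep only the categories that this mount hides as well. */
		hidden &= mounts[i].hidden_categories;
		usable = true;
	}

	if (!usable)
		return false;

	*out = hidden;
	return true;
}
```

For example, with one mount hiding CAT_KCORE | CAT_SYSRQ, one hiding
only CAT_KCORE, and one masking a non-hideable entry, the function
returns true and reports CAT_KCORE. The patch as posted is simpler: it
stops at the first suitable mount and ORs in the categories gathered
from that mount.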

For my use case, I will need to support at least the following entries:

$ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

This patch can be tested in the following way:

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/cmdline && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
mount: /proc: permission denied.
(this patch does not support /proc/cmdline as hideable)

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/kcore && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
ok
(this patch marks /proc/kcore as hideable: the new mount works fine,
whereas it did not work on vanilla kernels)

Signed-off-by: Alban Crequy
---
 fs/namespace.c     | 26 +++++++++++++++++++++-----
 fs/proc/generic.c  |  5 +++++
 fs/proc/inode.c    |  2 ++
 fs/proc/internal.h |  1 +
 include/linux/fs.h |  1 +
 5 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d1374ab6e06..0d466885c181 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2489,7 +2489,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	return err;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags, int *hideable_categories);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2500,6 +2500,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 {
 	struct file_system_type *type;
 	struct vfsmount *mnt;
+	int hideable_categories = 0;
 	int err;
 
 	if (!fstype)
@@ -2518,11 +2519,15 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
-	if (mount_too_revealing(mnt, &mnt_flags)) {
+	if (mount_too_revealing(mnt, &mnt_flags, &hideable_categories)) {
 		mntput(mnt);
 		return -EPERM;
 	}
 
+	if (hideable_categories != 0) {
+		/* TODO: configure the mount to hide the categories of files */
+	}
+
 	err = do_add_mount(real_mount(mnt), path, mnt_flags);
 	if (err)
 		mntput(mnt);
@@ -3342,7 +3347,7 @@ bool current_chrooted(void)
 }
 
 static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
-				int *new_mnt_flags)
+				int *new_mnt_flags, int *hideable_categories)
 {
 	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
@@ -3352,6 +3357,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 	list_for_each_entry(mnt, &ns->list, mnt_list) {
 		struct mount *child;
 		int mnt_flags;
+		int local_hideable_categories = 0;
 
 		if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
 			continue;
@@ -3388,6 +3394,12 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 			/* Only worry about locked mounts */
 			if (!(child->mnt.mnt_flags & MNT_LOCKED))
 				continue;
+			/* Hideable inodes might be ok but gather categories */
+			if (inode->i_opflags & IOP_USERNS_HIDEABLE) {
+				/* TODO: get proc_dir_entry->userns_hideable_categories */
+				local_hideable_categories |= 0x01;
+				continue;
+			}
 			/* Is the directory permanetly empty? */
 			if (!is_empty_dir_inode(inode))
 				goto next;
@@ -3395,6 +3407,10 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
 					       MNT_LOCK_ATIME);
+		/* Preserve hidden categories */
+		*hideable_categories |= local_hideable_categories;
+		/* TODO: for nested containers */
+		*hideable_categories |= 0; /* proc_sb(mnt->mnt.mnt_sb)->hideable_categories */
 		visible = true;
 		goto found;
 	next:	;
@@ -3404,7 +3420,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 	return visible;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags, int *hideable_categories)
 {
 	const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
@@ -3424,7 +3440,7 @@ static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
 		return true;
 	}
 
-	return !mnt_already_visible(ns, mnt, new_mnt_flags);
+	return !mnt_already_visible(ns, mnt, new_mnt_flags, hideable_categories);
 }
 
 bool mnt_may_suid(struct vfsmount *mnt)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 5d709fa8f3a2..96537a0f751e 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -491,6 +491,11 @@ struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 	pde->proc_fops = proc_fops;
 	pde->data = data;
 	pde->proc_iops = &proc_file_inode_operations;
+
+	// TODO: add parameters to proc_create() instead of hardcoding
+	if (strcmp(name, "kcore") == 0)
+		pde->userns_hideable = true;
+
 	if (proc_register(parent, pde) < 0)
 		goto out_free;
 	return pde;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 6e8724958116..dbf8f2dfe85e 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -455,6 +455,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 		set_nlink(inode, de->nlink);
 		WARN_ON(!de->proc_iops);
 		inode->i_op = de->proc_iops;
+		if (de->userns_hideable)
+			inode->i_opflags |= IOP_USERNS_HIDEABLE;
 		if (de->proc_fops) {
 			if (S_ISREG(inode->i_mode)) {
 #ifdef CONFIG_COMPAT
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d697c8ab0a14..7176fbff3660 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -52,6 +52,7 @@ struct proc_dir_entry {
 	struct proc_dir_entry *parent;
 	struct rb_root_cached subdir;
 	struct rb_node subdir_node;
+	bool userns_hideable;
 	umode_t mode;
 	u8 namelen;
 	char name[];
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf767619e..4203c4d3330f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -559,6 +559,7 @@ is_uncached_acl(struct posix_acl *acl)
 #define IOP_NOFOLLOW	0x0004
 #define IOP_XATTR	0x0008
 #define IOP_DEFAULT_READLINK	0x0010
+#define IOP_USERNS_HIDEABLE	0x0020
 
 struct fsnotify_mark_connector;