[RFC,WIP] namespace.c: Allow some unprivileged proc mounts when not fully visible

Message ID 20180404115311.725-1-alban@kinvolk.io (mailing list archive)
State New, archived

Commit Message

Alban Crequy April 4, 2018, 11:53 a.m. UTC
Since Linux v4.2 with commit 1b852bceb0d1 ("mnt: Refactor the logic for
mounting sysfs and proc in a user namespace"), new mounts of proc or
sysfs in a non-init userns are only allowed when there is at least one
fully-visible proc or sysfs mount.

This is to enforce that proc/sysfs files masked by a mount are still
masked in a new mount in an unprivileged userns. The locked mount logic
for bind mounts (has_locked_children()) was not enough in the case of
proc/sysfs new mounts because some files in proc (/proc/kcore) exist as
a singleton rather than being owned by a specific proc mount.

Unfortunately, this blocks me from using userns from within a Docker
container because Docker containers mask entries such as /proc/kcore. My
use case is to build container images with arbitrary commands (such as
using "RUN" commands in Dockerfiles) without privileges and from within
a Docker container. Those arbitrary commands could be shell scripts that
require /proc.

The following commands show my problem:

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
mount: permission denied (are you root?)

$ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
ok

This patch is a WIP attempt to ease new proc mounts in a user namespace
even when the proc mount in the parent container has masked entries.
However, to preserve the security guarantee of mount_too_revealing(),
the same masked entries in the old proc mount must be masked in the new
proc mount.

The entries cannot be masked with covering mounts because it's not
possible to use MS_REC for a new proc mount and add covering submounts at
the same time. Instead, this patch introduces new options in proc to
disable some proc entries (TBD). A proc entry will be disabled when all
other proc mounts have the same entry disabled, or when all other proc
mounts have the same entry masked by a submount.
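
One way to read the rule above, as a userspace sketch rather than kernel
code: treat each category of entries as one bit, and disable in the new
mount only the categories that are hidden in every existing proc mount.
The names struct proc_mount_view, disabled, masked and
categories_to_disable() are hypothetical and do not appear in the patch.
The sketch also treats "disabled by option" and "masked by a submount" as
interchangeable within a mount, which is a looser reading than the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mount summary; not a structure from the patch. */
struct proc_mount_view {
	unsigned int disabled; /* categories disabled via proc mount options */
	unsigned int masked;   /* categories covered by a locked submount */
};

/*
 * Return the categories that are hidden in every existing proc mount.
 * Only those may be disabled in the new mount without revealing anything.
 */
static unsigned int categories_to_disable(const struct proc_mount_view *mounts,
					  size_t n)
{
	unsigned int common = ~0u;
	size_t i;

	assert(mounts != NULL || n == 0);

	if (n == 0)
		return 0; /* no existing mount: nothing to inherit */

	/* Intersect the hidden sets: one visible copy keeps the bit clear. */
	for (i = 0; i < n; i++)
		common &= mounts[i].disabled | mounts[i].masked;

	return common;
}
```

The intersection is what keeps the new mount at least as masked as the
others: a category that is visible in any one existing mount is never
forced off in the new one.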

The granularity does not need to be per proc entry. It is simpler to
define categories of entries that can be hidden. In practice, only a few
entries need to support disablement and what matters is that the new
proc mount is more masked than the other proc mounts. Granularity can be
improved later if use cases exist.

The flag IOP_USERNS_HIDEABLE is added on some proc inodes that are
singletons such as /proc/kcore. This flag is used in
mnt_already_visible() to signal that, as an exception to the general
rule, the file can be masked by a mount without blocking the new proc
mount. The hideable category is computed (WIP) and returned (WIP) in
order to configure the new proc mount before attaching it to the mount
tree.
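
The check described above can be modelled in userspace roughly as
follows. This is a simplified sketch, not the kernel loop: struct
covered_entry, its fields and mount_allows_new_proc() are hypothetical
stand-ins, the per-entry category bit is an assumption (the patch
hardcodes 0x01), and the other checks done by mnt_already_visible() are
omitted. Only the IOP_USERNS_HIDEABLE value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IOP_USERNS_HIDEABLE 0x0020 /* value taken from the patch */

/* Hypothetical stand-in for a child mount and the inode it covers. */
struct covered_entry {
	bool locked;            /* models MNT_LOCKED on the child mount */
	unsigned int i_opflags; /* models inode->i_opflags of the mountpoint */
	bool empty_dir;         /* models is_empty_dir_inode() */
	int category;           /* hypothetical category bit of the entry */
};

/*
 * Decide whether one existing proc mount lets the new mount proceed.
 * On success, the masked hideable categories are OR-ed into *categories.
 */
static bool mount_allows_new_proc(const struct covered_entry *children,
				  size_t n, int *categories)
{
	int local = 0;
	size_t i;

	assert(categories != NULL);

	for (i = 0; i < n; i++) {
		if (!children[i].locked)
			continue; /* only locked mounts matter */
		if (children[i].i_opflags & IOP_USERNS_HIDEABLE) {
			local |= children[i].category; /* remember, don't block */
			continue;
		}
		if (!children[i].empty_dir)
			return false; /* something else is hidden: refuse */
	}

	*categories |= local;
	return true;
}
```

With this model, a locked mount over an entry flagged hideable (such as
/proc/kcore in the patch) is accepted and its category recorded, while a
locked mount over an unflagged, non-empty entry still refuses the mount.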

For my use case, I will need to support at least the following entries:

$ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

This patch can be tested in the following way:

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/cmdline && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
mount: /proc: permission denied.
(this patch does not support /proc/cmdline as hideable)

$ sudo unshare -p -f -m sh -c "mount --bind /dev/null /proc/kcore && unshare -U -r -p -m -f mount -t proc proc /proc && echo ok"
ok
(this patch marks /proc/kcore as hideable: the new mount works fine,
whereas it didn't work on vanilla kernels)

Signed-off-by: Alban Crequy <alban@kinvolk.io>
---
 fs/namespace.c     | 26 +++++++++++++++++++++-----
 fs/proc/generic.c  |  5 +++++
 fs/proc/inode.c    |  2 ++
 fs/proc/internal.h |  1 +
 include/linux/fs.h |  1 +
 5 files changed, 30 insertions(+), 5 deletions(-)

Comments

Eric W. Biederman April 4, 2018, 2:45 p.m. UTC | #1
Alban Crequy <alban.crequy@gmail.com> writes:

> Since Linux v4.2 with commit 1b852bceb0d1 ("mnt: Refactor the logic for
> mounting sysfs and proc in a user namespace"), new mounts of proc or
> sysfs in a non-init userns are only allowed when there is at least one
> fully-visible proc or sysfs mount.
>
> This is to enforce that proc/sysfs files masked by a mount are still
> masked in a new mount in an unprivileged userns. The locked mount logic
> for bind mounts (has_locked_children()) was not enough in the case of
> proc/sysfs new mounts because some files in proc (/proc/kcore) exist as
> a singleton rather than being owned by a specific proc mount.
>
> Unfortunately, this blocks me from using userns from within a Docker
> container because Docker containers mask entries such as /proc/kcore. My
> use case is to build container images with arbitrary commands (such as
> using "RUN" commands in Dockerfiles) without privileges and from within
> a Docker container. Those arbitrary commands could be shell scripts that
> require /proc.

This is an understandable problem.  /proc/kcore is a file that policy
has a very reasonable right to make inaccessible.  Allowing unprivileged
users to bypass the policy set up by root is not ok; preventing that is
the whole point of the restrictions.


I need to hear why you can't fix Docker, and why your subcommand needs to
mount proc in the first place.  Neither has been mentioned.  So far
this looks like ``my sysadmin told me no, can I have a kernel patch to
get around that''.  Not something I support at all.

Before we get a kernel change for something like this there needs to be
clear evidence this rises to the point of something that is really
going to be used and will have multiple users, and that the proposal
will be simple and maintainable.

Hiding files in /proc simply because they were mounted over in the
parent proc does not qualify as simple or maintainable by any means.
Way too much mixing of the layers.  Needing to read from the parent
proc to find which files were already hidden makes this doubly complex.

Files like /proc/kcore can not be hidden always and automatically
because their attributes can change so they may reasonably be made
available to users who are not the global root.

The only option I have seen proposed that might qualify as something
general purpose and simple is a new filesystem that is just the process
directories of proc.  As there would in essence be no files that would
need restrictions it would be safe to allow anyone to mount without
restriction.

> The following commands show my problem:
>
> $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> mount: permission denied (are you root?)
>
> $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> ok

Actually this does not show your problem because it does not reveal why
you need to mount proc.

That is a ``Doctor it hurts when I do this'' example where the Doctor
will reasonably tell you ``Don't do that then''.


> For my use case, I will need to support at least the following entries:
>
> $ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
> proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
> proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
> proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
> proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
> proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
> proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
> tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
> tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
> tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
> tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
> tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

It looks like a cruft free cousin of proc that is just processes would
be applicable to your usecase.

Eric

Aleksa Sarai April 4, 2018, 3:34 p.m. UTC | #2
On 2018-04-04, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > The following commands show my problem:
> >
> > $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> > mount: permission denied (are you root?)
> >
> > $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> > ok
> 
> Actually this does not show your problem because it does not reveal why
> you need to mount proc.
> 
> That is a ``Doctor it hurts when I do this'' example where the Doctor
> will reasonably tell you ``Don't do that then''.

The context is that people want to run unprivileged runc inside a Docker
container, and mounting proc is part of setting up a container. But
Docker (and runc) have masks for /proc to stop containers from being
able to touch things like /proc/scsi and so on. The other possibility is
to give people an escape-hatch when setting up a container which
basically says "make this container slightly less secure so that I can
run containers inside it".

However, I share your concern about the layer mixing that comes with
inheriting the masks for the new procfs mount (for example, if you have
a mount over a particular process, the mask now covers something
completely different, among a bunch of other possible problems).

> > For my use case, I will need to support at least the following entries:
> >
> > $ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
> > proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
> > tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
> 
> It looks like a cruft free cousin of proc that is just processes would
> be applicable to your usecase.

I think a procfs that only has processes would be a massive improvement
for a bunch of other reasons too. :D

Serge E. Hallyn April 4, 2018, 6:42 p.m. UTC | #3
Quoting Eric W. Biederman (ebiederm@xmission.com):
> It looks like a cruft free cousin of proc that is just processes would
> be applicable to your usecase.

Just to check - is that something you're working on?

-serge

Eric W. Biederman April 4, 2018, 10:02 p.m. UTC | #4
"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> It looks like a cruft free cousin of proc that is just processes would
>> be applicable to your usecase.
>
> Just to check - is that something you're working on?

Only to the point of reviewing code, and I don't have a version to
review right now.

Eric

Christian Brauner April 5, 2018, 2:19 p.m. UTC | #5
On Wed, Apr 04, 2018 at 09:45:43AM -0500, Eric W. Biederman wrote:
> Alban Crequy <alban.crequy@gmail.com> writes:
> 
> > Since Linux v4.2 with commit 1b852bceb0d1 ("mnt: Refactor the logic for
> > mounting sysfs and proc in a user namespace"), new mounts of proc or
> > sysfs in a non-init userns are only allowed when there is at least one
> > fully-visible proc or sysfs mount.
> >
> > This is to enforce that proc/sysfs files masked by a mount are still
> > masked in a new mount in an unprivileged userns. The locked mount logic
> > for bind mounts (has_locked_children()) was not enough in the case of
> > proc/sysfs new mounts because some files in proc (/proc/kcore) exist as
> > a singleton rather than being owned by a specific proc mount.
> >
> > Unfortunately, this blocks me from using userns from within a Docker
> > container because Docker containers mask entries such as /proc/kcore. My

I honestly wonder what the benefit of this is supposed to be. If the
container retains CAP_SYS_ADMIN (privileged or unprivileged) these
mounts can all be unmounted. If the container drops CAP_SYS_ADMIN
you won't be able to {u}mount anymore but you also won't be able to use
any CLONE_* flags anymore. So clone(), setns(), unshare() are useless
too. If you're unprivileged, overmounting e.g. /proc/kcore won't give
you any additional security benefits since you can't read it anyway. So
this only seems useful when the container is privileged and some form of
LSM is protecting those mount points. But for these cases the only thing
I have to say is: it's 2018, which is 5 years past CLONE_NEWUSER, so
don't run privileged containers and pretend that it can be done securely
in any way.

But I might be missing cases where this would be really really useful
involving unprivileged containers too.

This is really not directed at you, Alban. I'm just wondering about this
in general and I seem to be having a polemic day. :)

Christian

> > use case is to build container images with arbitrary commands (such as
> > using "RUN" commands in Dockerfiles) without privileges and from within
> > a Docker container. Those arbitrary commands could be shell scripts that
> > require /proc.
> 
> This is an understandable problem.  /proc/kcore is a file that policy
> has a very reasonable right to make inaccessible.  Allowing unprivileged
> users to bypass the policy set up by root is not ok; preventing that is
> the whole point of the restrictions.
> 
> 
> I need to hear why you can't fix Docker, and why your subcommand needs to
> mount proc in the first place.  Neither has been mentioned.  So far
> this looks like ``my sysadmin told me no, can I have a kernel patch to
> get around that''.  Not something I support at all.
>
> Before we get a kernel change for something like this there needs to be
> clear evidence this rises to the point of something that is really
> going to be used and will have multiple users, and that the proposal
> will be simple and maintainable.
> 
> Hiding files in /proc simply because they were mounted over in the
> parent proc does not qualify as simple or maintainable by any means.
> Way too much mixing of the layers.  Needing to read from the parent
> proc to find which files were already hidden makes this doubly complex.
> 
> Files like /proc/kcore can not be hidden always and automatically
> because their attributes can change so they may reasonably be made
> available to users who are not the global root.
> 
> The only option I have seen proposed that might qualify as something
> general purpose and simple is a new filesystem that is just the process
> directories of proc.  As there would in essence be no files that would
> need restrictions it would be safe to allow anyone to mount without
> restriction.
> 
> > The following commands show my problem:
> >
> > $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> > mount: permission denied (are you root?)
> >
> > $ sudo docker run -ti --rm --cap-add=SYS_ADMIN busybox sh -c 'mkdir -p /unmasked-proc && mount -t proc proc /unmasked-proc && unshare -U -r -p -m -f mount -t proc proc /home && echo ok'
> > ok
> 
> Actually this does not show your problem because it does not reveal why
> you need to mount proc.
> 
> That is a ``Doctor it hurts when I do this'' example where the Doctor
> will reasonably tell you ``Don't do that then''.
> 
> 
> > For my use case, I will need to support at least the following entries:
> >
> > $ sudo docker run -ti --rm busybox sh -c 'mount|grep /proc/'
> > proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
> > proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
> > tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
> > tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
> 
> It looks like a cruft free cousin of proc that is just processes would
> be applicable to your usecase.
> 
> Eric

Djalal Harouni April 13, 2018, 10:41 p.m. UTC | #6
On Wed, Apr 4, 2018 at 4:45 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
[...]
>
> The only option I have seen proposed that might qualify as something
> general purpose and simple is a new filesystem that is just the process
> directories of proc.  As there would in essence be no files that would
> need restrictions it would be safe to allow anyone to mount without
> restriction.
>
Eric, there is a series for this:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1533642.html

patch on top for pids:
https://github.com/legionus/linux/commit/993a2a5b9af95b0ac901ff41d32124b72ed676e3

It was reviewed, and suggestions from Andy's and Al Viro's feedback were
integrated, thanks. It works on Debian, Ubuntu and others, but not on
Fedora due to a bug with dracut+systemd.

I do not have time to work on it now; anyone can just pick them up.

Thanks!

Alexey Gladkov April 16, 2018, 2:16 p.m. UTC | #7
On Sat, Apr 14, 2018 at 12:41:31AM +0200, Djalal Harouni wrote:
> On Wed, Apr 4, 2018 at 4:45 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> [...]
> >
> > The only option I have seen proposed that might qualify as something
> > general purpose and simple is a new filesystem that is just the process
> > directories of proc.  As there would in essence be no files that would
> > need restrictions it would be safe to allow anyone to mount without
> > restriction.
> >
> Eric, there is a series for this:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1533642.html
> 
> patch on top for pids:
> https://github.com/legionus/linux/commit/993a2a5b9af95b0ac901ff41d32124b72ed676e3
> 
> It was reviewed, and suggestions from Andy's and Al Viro's feedback were
> integrated, thanks. It works on Debian, Ubuntu and others, but not on
> Fedora due to a bug with dracut+systemd.
> 
> I do not have time to work on it now; anyone can just pick them up.

I am still working on this, and am now trying to deal with the problem on
Fedora. I hope to return soon with the results.

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d1374ab6e06..0d466885c181 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2489,7 +2489,7 @@  static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	return err;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags, int *hideable_categories);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2500,6 +2500,7 @@  static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 {
 	struct file_system_type *type;
 	struct vfsmount *mnt;
+	int hideable_categories = 0;
 	int err;
 
 	if (!fstype)
@@ -2518,11 +2519,15 @@  static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
-	if (mount_too_revealing(mnt, &mnt_flags)) {
+	if (mount_too_revealing(mnt, &mnt_flags, &hideable_categories)) {
 		mntput(mnt);
 		return -EPERM;
 	}
 
+	if (hideable_categories != 0) {
+		/* TODO: configure the mount to hide the categories of files */
+	}
+
 	err = do_add_mount(real_mount(mnt), path, mnt_flags);
 	if (err)
 		mntput(mnt);
@@ -3342,7 +3347,7 @@  bool current_chrooted(void)
 }
 
 static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
-				int *new_mnt_flags)
+				int *new_mnt_flags, int *hideable_categories)
 {
 	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
@@ -3352,6 +3357,7 @@  static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 	list_for_each_entry(mnt, &ns->list, mnt_list) {
 		struct mount *child;
 		int mnt_flags;
+		int local_hideable_categories = 0;
 
 		if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
 			continue;
@@ -3388,6 +3394,12 @@  static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 			/* Only worry about locked mounts */
 			if (!(child->mnt.mnt_flags & MNT_LOCKED))
 				continue;
+			/* Hideable inodes might be ok but gather categories */
+			if (inode->i_opflags & IOP_USERNS_HIDEABLE) {
+				/* TODO: get proc_dir_entry->userns_hideable_categories */
+				local_hideable_categories |= 0x01;
+				continue;
+			}
 			/* Is the directory permanetly empty? */
 			if (!is_empty_dir_inode(inode))
 				goto next;
@@ -3395,6 +3407,10 @@  static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
 					       MNT_LOCK_ATIME);
+		/* Preserve hidden categories */
+		*hideable_categories |= local_hideable_categories;
+		/* TODO: for nested containers */
+		*hideable_categories |= 0; /* proc_sb(mnt->mnt.mnt_sb)->hideable_categories */
 		visible = true;
 		goto found;
 	next:	;
@@ -3404,7 +3420,7 @@  static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
 	return visible;
 }
 
-static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags, int *hideable_categories)
 {
 	const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
@@ -3424,7 +3440,7 @@  static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
 		return true;
 	}
 
-	return !mnt_already_visible(ns, mnt, new_mnt_flags);
+	return !mnt_already_visible(ns, mnt, new_mnt_flags, hideable_categories);
 }
 
 bool mnt_may_suid(struct vfsmount *mnt)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 5d709fa8f3a2..96537a0f751e 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -491,6 +491,11 @@  struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 	pde->proc_fops = proc_fops;
 	pde->data = data;
 	pde->proc_iops = &proc_file_inode_operations;
+
+	/* TODO: add parameters to proc_create() instead of hardcoding */
+	if (strcmp(name, "kcore") == 0)
+		pde->userns_hideable = true;
+
 	if (proc_register(parent, pde) < 0)
 		goto out_free;
 	return pde;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 6e8724958116..dbf8f2dfe85e 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -455,6 +455,8 @@  struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 			set_nlink(inode, de->nlink);
 		WARN_ON(!de->proc_iops);
 		inode->i_op = de->proc_iops;
+		if (de->userns_hideable)
+			inode->i_opflags |= IOP_USERNS_HIDEABLE;
 		if (de->proc_fops) {
 			if (S_ISREG(inode->i_mode)) {
 #ifdef CONFIG_COMPAT
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d697c8ab0a14..7176fbff3660 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -52,6 +52,7 @@  struct proc_dir_entry {
 	struct proc_dir_entry *parent;
 	struct rb_root_cached subdir;
 	struct rb_node subdir_node;
+	bool userns_hideable;
 	umode_t mode;
 	u8 namelen;
 	char name[];
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf767619e..4203c4d3330f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -559,6 +559,7 @@  is_uncached_acl(struct posix_acl *acl)
 #define IOP_NOFOLLOW	0x0004
 #define IOP_XATTR	0x0008
 #define IOP_DEFAULT_READLINK	0x0010
+#define IOP_USERNS_HIDEABLE	0x0020
 
 struct fsnotify_mark_connector;