[1/2] fs: introduce uid/gid shifting bind mount

Message ID	1575335700.24227.27.camel@HansenPartnership.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=cHVt=ZZ=vger.kernel.org=linux-fsdevel-owner@kernel.org> Message-ID: <1575335700.24227.27.camel@HansenPartnership.com> Subject: [PATCH 1/2] fs: introduce uid/gid shifting bind mount From: James Bottomley <James.Bottomley@HansenPartnership.com> To: linux-fsdevel@vger.kernel.org Cc: dhowells@redhat.com, Al Viro <viro@ZenIV.linux.org.uk>, Miklos Szeredi <miklos@szeredi.hu>, Seth Forshee <seth.forshee@canonical.com>, Amir Goldstein <amir73il@gmail.com> Date: Mon, 02 Dec 2019 17:15:00 -0800 In-Reply-To: <1575335637.24227.26.camel@HansenPartnership.com> References: <1575335637.24227.26.camel@HansenPartnership.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	shiftfs reworked as a uid/gid shifting bind mount \| expand [0/2] shiftfs reworked as a uid/gid shifting bind mount [1/2] fs: introduce uid/gid shifting bind mount [2/2] fs: expose shifting bind mount to userspace

James Bottomley Dec. 3, 2019, 1:15 a.m. UTC

This implementation reverse shifts according to the user_ns belonging
to the mnt_ns.  So if the vfsmount has the newly introduced flag
MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
then we shift back using the user_ns before committing to the
underlying filesystem.

For example, if a user_ns is created where interior (fake root, uid 0)
is mapped to kernel uid 100000 then writes from interior root normally
go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
they will be shifted back to write at uid 0, meaning we can bind mount
real image filesystems to user_ns protected faker root.

In essence there are several things which have to be done for this to
occur safely.  Firstly for all operations on the filesystem, new
credentials have to be installed where fsuid and fsgid are set to the
*interior* values. Next all inodes used from the filesystem have to
have i_uid and i_gid shifted back to the kernel values and attributes
set from user space have to have ia_uid and ia_gid shifted from the
kernel values to the interior values.  The capability checks have to
be done using ns_capable against the kernel values, but the inode
capability checks have to be done against the shifted ids.

Since creating a new credential is a reasonably expensive proposition
and we have to shift and unshift many times during path walking, a
cached copy of the shifted credential is saved to a newly created
place in the task structure.  This serves the dual purpose of allowing
us to use a pre-prepared copy of the shifted credentials and also
allows us to recognise whenever the shift is actually in effect (the
cached shifted credential pointer being equal to the current_cred()
pointer).

To get this all to work, we have a check for the vfsmount flag and the
user_ns gating a shifting of the credentials over all user space
entries to filesystem functions.  In theory the path has to be present
everywhere we do this, so we can check the vfsmount flags.  However,
for lower level functions we can cheat this path check of vfsmount
simply to check whether a shifted credential is in effect or not to
gate things like the inode permission check, which means the path
doesn't have to be threaded all the way through the permission
checking functions.  if the credential is shifted check passes, we can
also be sure that the current user_ns is the same as the mnt->user_ns,
so we can use it and thus have no need of the struct mount at the
point of the shift.

Although the shift can be effected simply by executing
do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
contains the shifting mechanisms.  The follow on patch wires up the
user visible API for turning the flag on.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

Note this patch depends on the rethreading notify_change to use a path
instead of a dentry patch
---
 fs/attr.c             |  87 ++++++++++++++++++++++++++++----------
 fs/exec.c             |   7 +++-
 fs/inode.c            |   9 ++--
 fs/internal.h         |   2 +
 fs/namei.c            | 114 +++++++++++++++++++++++++++++++++++++++++---------
 fs/open.c             |  25 ++++++++++-
 fs/posix_acl.c        |   4 +-
 fs/stat.c             |  31 ++++++++++++--
 include/linux/cred.h  |  10 +++++
 include/linux/mount.h |   4 +-
 include/linux/sched.h |   5 +++
 kernel/capability.c   |  14 ++++++-
 kernel/cred.c         |  20 +++++++++
 kernel/groups.c       |   7 ++++
 14 files changed, 283 insertions(+), 56 deletions(-)

Amir Goldstein Dec. 3, 2019, 4:51 a.m. UTC | #1

[cc: ebiederman]

On Tue, Dec 3, 2019 at 3:15 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns.  So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns before committing to the
> underlying filesystem.
>
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.
>
> In essence there are several things which have to be done for this to
> occur safely.  Firstly for all operations on the filesystem, new
> credentials have to be installed where fsuid and fsgid are set to the
> *interior* values. Next all inodes used from the filesystem have to
> have i_uid and i_gid shifted back to the kernel values and attributes
> set from user space have to have ia_uid and ia_gid shifted from the
> kernel values to the interior values.  The capability checks have to
> be done using ns_capable against the kernel values, but the inode
> capability checks have to be done against the shifted ids.
>
> Since creating a new credential is a reasonably expensive proposition
> and we have to shift and unshift many times during path walking, a
> cached copy of the shifted credential is saved to a newly created
> place in the task structure.  This serves the dual purpose of allowing
> us to use a pre-prepared copy of the shifted credentials and also
> allows us to recognise whenever the shift is actually in effect (the
> cached shifted credential pointer being equal to the current_cred()
> pointer).
>
> To get this all to work, we have a check for the vfsmount flag and the
> user_ns gating a shifting of the credentials over all user space
> entries to filesystem functions.  In theory the path has to be present
> everywhere we do this, so we can check the vfsmount flags.  However,
> for lower level functions we can cheat this path check of vfsmount
> simply to check whether a shifted credential is in effect or not to
> gate things like the inode permission check, which means the path
> doesn't have to be threaded all the way through the permission
> checking functions.  if the credential is shifted check passes, we can
> also be sure that the current user_ns is the same as the mnt->user_ns,
> so we can use it and thus have no need of the struct mount at the
> point of the shift.
>

1. Smart
2. Needs serious vetting by Eric (cc'ed)
3. A lot of people have been asking me for filtering of "dirent" fsnotify events
(i.e. create/delete) by path, which is not available in those vfs
functions, so if
the concept of current->mnt flies, fsnotify is going to want to use it as well.
4. This is currently not overlayfs (stacked fs) nor nfsd friendly.
Those modules do not call the path based vfs APIs, but they do have the
mnt stored internally.
I suppose you do want to be able to mount overlays and export nfs out
of those shifted mounts, as they are merely the foundation for unprivileged
container storage stack. right?
For overlayfs, you should at least look at ovl_override_creds() for
incorporating shift mount logic - or more likely at the creation of
ofs->creator_cred.

Thanks,
Amir.

> Although the shift can be effected simply by executing
> do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
> contains the shifting mechanisms.  The follow on patch wires up the
> user visible API for turning the flag on.
>
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
>
> ---
>
> Note this patch depends on the rethreading notify_change to use a path
> instead of a dentry patch
> ---
>  fs/attr.c             |  87 ++++++++++++++++++++++++++++----------
>  fs/exec.c             |   7 +++-
>  fs/inode.c            |   9 ++--
>  fs/internal.h         |   2 +
>  fs/namei.c            | 114 +++++++++++++++++++++++++++++++++++++++++---------
>  fs/open.c             |  25 ++++++++++-
>  fs/posix_acl.c        |   4 +-
>  fs/stat.c             |  31 ++++++++++++--
>  include/linux/cred.h  |  10 +++++
>  include/linux/mount.h |   4 +-
>  include/linux/sched.h |   5 +++
>  kernel/capability.c   |  14 ++++++-
>  kernel/cred.c         |  20 +++++++++
>  kernel/groups.c       |   7 ++++
>  14 files changed, 283 insertions(+), 56 deletions(-)
>
> diff --git a/fs/attr.c b/fs/attr.c
> index 370b18807f05..3efb2dc67896 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,14 +18,22 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>
> +#include "internal.h"
> +#include "mount.h"
> +
>  static bool chown_ok(const struct inode *inode, kuid_t uid)
>  {
> +       kuid_t i_uid = inode->i_uid;
> +
> +       if (cred_is_shifted())
> +               i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid));
> +
>         if (uid_eq(current_fsuid(), inode->i_uid) &&
> -           uid_eq(uid, inode->i_uid))
> +           uid_eq(uid, i_uid))
>                 return true;
>         if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>                 return true;
> -       if (uid_eq(inode->i_uid, INVALID_UID) &&
> +       if (uid_eq(i_uid, INVALID_UID) &&
>             ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>                 return true;
>         return false;
> @@ -33,12 +41,21 @@ static bool chown_ok(const struct inode *inode, kuid_t uid)
>
>  static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>  {
> +       kgid_t i_gid = inode->i_gid;
> +       kuid_t i_uid = inode->i_uid;
> +
> +       if (cred_is_shifted()) {
> +               struct user_namespace *ns = current_user_ns();
> +
> +               i_uid = make_kuid(ns, __kuid_val(i_uid));
> +               i_gid = make_kgid(ns, __kgid_val(i_gid));
> +       }
>         if (uid_eq(current_fsuid(), inode->i_uid) &&
> -           (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +           (in_group_p(gid) || gid_eq(gid, i_gid)))
>                 return true;
>         if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>                 return true;
> -       if (gid_eq(inode->i_gid, INVALID_GID) &&
> +       if (gid_eq(i_gid, INVALID_GID) &&
>             ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>                 return true;
>         return false;
> @@ -89,9 +106,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>         if (ia_valid & ATTR_MODE) {
>                 if (!inode_owner_or_capable(inode))
>                         return -EPERM;
> +
>                 /* Also check the setgid bit! */
> -               if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
> -                               inode->i_gid) &&
> +               if (!in_group_p_shifted((ia_valid & ATTR_GID) ? attr->ia_gid :
> +                                       inode->i_gid) &&
>                     !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>                         attr->ia_mode &= ~S_ISGID;
>         }
> @@ -198,7 +216,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
>         if (ia_valid & ATTR_MODE) {
>                 umode_t mode = attr->ia_mode;
>
> -               if (!in_group_p(inode->i_gid) &&
> +               if (!in_group_p_shifted(inode->i_gid) &&
>                     !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>                         mode &= ~S_ISGID;
>                 inode->i_mode = mode;
> @@ -235,6 +253,9 @@ int notify_change(const struct path *path, struct iattr * attr,
>         int error;
>         struct timespec64 now;
>         unsigned int ia_valid = attr->ia_valid;
> +       const struct cred *cred;
> +       kuid_t i_uid = inode->i_uid;
> +       kgid_t i_gid = inode->i_gid;
>
>         WARN_ON_ONCE(!inode_is_locked(inode));
>
> @@ -243,18 +264,28 @@ int notify_change(const struct path *path, struct iattr * attr,
>                         return -EPERM;
>         }
>
> +       cred = change_userns_creds(path);
> +       if (cred) {
> +               struct mount *m = real_mount(path->mnt);
> +
> +               attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid));
> +               attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid));
> +       }
> +
>         /*
>          * If utimes(2) and friends are called with times == NULL (or both
>          * times are UTIME_NOW), then we need to check for write permission
>          */
>         if (ia_valid & ATTR_TOUCH) {
> -               if (IS_IMMUTABLE(inode))
> -                       return -EPERM;
> +               if (IS_IMMUTABLE(inode)) {
> +                       error = -EPERM;
> +                       goto err;
> +               }
>
>                 if (!inode_owner_or_capable(inode)) {
>                         error = inode_permission(inode, MAY_WRITE);
>                         if (error)
> -                               return error;
> +                               goto err;
>                 }
>         }
>
> @@ -275,7 +306,7 @@ int notify_change(const struct path *path, struct iattr * attr,
>         if (ia_valid & ATTR_KILL_PRIV) {
>                 error = security_inode_need_killpriv(dentry);
>                 if (error < 0)
> -                       return error;
> +                       goto err;
>                 if (error == 0)
>                         ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV;
>         }
> @@ -306,34 +337,46 @@ int notify_change(const struct path *path, struct iattr * attr,
>                         attr->ia_mode &= ~S_ISGID;
>                 }
>         }
> -       if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
> -               return 0;
> +       if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) {
> +               error = 0;
> +               goto err;
> +       }
>
>         /*
>          * Verify that uid/gid changes are valid in the target
>          * namespace of the superblock.
>          */
> +       error = -EOVERFLOW;
>         if (ia_valid & ATTR_UID &&
>             !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
> -               return -EOVERFLOW;
> +               goto err;
> +
>         if (ia_valid & ATTR_GID &&
>             !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
> -               return -EOVERFLOW;
> +               goto err;
>
>         /* Don't allow modifications of files with invalid uids or
>          * gids unless those uids & gids are being made valid.
>          */
> -       if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
> -               return -EOVERFLOW;
> -       if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
> -               return -EOVERFLOW;
> +       if (cred_is_shifted()) {
> +               struct user_namespace *ns = current_user_ns();
> +
> +               i_uid = make_kuid(ns, __kuid_val(i_uid));
> +               i_gid = make_kgid(ns, __kgid_val(i_gid));
> +       }
> +
> +       if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid))
> +               goto err;
> +
> +       if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid))
> +               goto err;
>
>         error = security_inode_setattr(dentry, attr);
>         if (error)
> -               return error;
> +               goto err;
>         error = try_break_deleg(inode, delegated_inode);
>         if (error)
> -               return error;
> +               goto err;
>
>         if (inode->i_op->setattr)
>                 error = inode->i_op->setattr(dentry, attr);
> @@ -346,6 +389,8 @@ int notify_change(const struct path *path, struct iattr * attr,
>                 evm_inode_post_setattr(dentry, ia_valid);
>         }
>
> + err:
> +       revert_userns_creds(cred);
>         return error;
>  }
>  EXPORT_SYMBOL(notify_change);
> diff --git a/fs/exec.c b/fs/exec.c
> index 555e93c7dec8..5f3a6ac7cd8f 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1540,13 +1540,18 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>
>         /* Be careful if suid/sgid is set */
>         inode_lock(inode);
> -
>         /* reload atomically mode/uid/gid now that lock held */
>         mode = inode->i_mode;
>         uid = inode->i_uid;
>         gid = inode->i_gid;
>         inode_unlock(inode);
>
> +       if (cred_is_shifted()) {
> +               struct user_namespace *ns = current_user_ns();
> +
> +               uid = make_kuid(ns, __kuid_val(uid));
> +               gid = make_kgid(ns, __kgid_val(gid));
> +       }
>         /* We ignore suid/sgid if there are no mappings for them in the ns */
>         if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
>                  !kgid_has_mapping(bprm->cred->user_ns, gid))
> diff --git a/fs/inode.c b/fs/inode.c
> index f2cc96ebede4..747bf7ca650d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2053,7 +2053,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
>                 if (S_ISDIR(mode))
>                         mode |= S_ISGID;
>                 else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> -                        !in_group_p(inode->i_gid) &&
> +                        !in_group_p_shifted(inode->i_gid) &&
>                          !capable_wrt_inode_uidgid(dir, CAP_FSETID))
>                         mode &= ~S_ISGID;
>         } else
> @@ -2072,12 +2072,15 @@ EXPORT_SYMBOL(inode_init_owner);
>  bool inode_owner_or_capable(const struct inode *inode)
>  {
>         struct user_namespace *ns;
> +       kuid_t uid = inode->i_uid;
>
> -       if (uid_eq(current_fsuid(), inode->i_uid))
> +       if (uid_eq(current_fsuid(), uid))
>                 return true;
>
>         ns = current_user_ns();
> -       if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
> +       if (cred_is_shifted())
> +               uid = make_kuid(ns, __kuid_val(uid));
> +       if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER))
>                 return true;
>         return false;
>  }
> diff --git a/fs/internal.h b/fs/internal.h
> index f27136058d7d..99a1bbf2113e 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>                   const char __user *newname);
>  int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>               const char __user *newname, int flags);
> +const struct cred *change_userns_creds(const struct path *p);
> +void revert_userns_creds(const struct cred *cred);
>
>  /*
>   * namespace.c
> diff --git a/fs/namei.c b/fs/namei.c
> index 900c826161ef..e4ca7cf393e9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -124,6 +124,38 @@
>
>  #define EMBEDDED_NAME_MAX      (PATH_MAX - offsetof(struct filename, iname))
>
> +const struct cred *change_userns_creds(const struct path *p)
> +{
> +       struct mount *m = real_mount(p->mnt);
> +
> +       if ((p->mnt->mnt_flags & MNT_SHIFT) == 0)
> +               return NULL;
> +
> +       if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
> +               return NULL;
> +
> +       if (current->mnt != p->mnt) {
> +               struct cred *cred;
> +               struct user_namespace *user_ns = m->mnt_ns->user_ns;
> +
> +               if (current->mnt_cred)
> +                       put_cred(current->mnt_cred);
> +               cred = prepare_creds();
> +               cred->fsuid = KUIDT_INIT(from_kuid(user_ns, current->cred->fsuid));
> +               cred->fsgid = KGIDT_INIT(from_kgid(user_ns, current->cred->fsgid));
> +               current->mnt = p->mnt; /* no reference needed */
> +               current->mnt_cred = cred;
> +       }
> +       return override_creds(current->mnt_cred);
> +}
> +
> +void revert_userns_creds(const struct cred *cred)
> +{
> +       if (!cred)
> +               return;
> +       revert_creds(cred);
> +}
> +
>  struct filename *
>  getname_flags(const char __user *filename, int flags, int *empty)
>  {
> @@ -303,7 +335,7 @@ static int acl_permission_check(struct inode *inode, int mask)
>                                 return error;
>                 }
>
> -               if (in_group_p(inode->i_gid))
> +               if (in_group_p_shifted(inode->i_gid))
>                         mode >>= 3;
>         }
>
> @@ -366,7 +398,6 @@ int generic_permission(struct inode *inode, int mask)
>         if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
>                 if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
>                         return 0;
> -
>         return -EACCES;
>  }
>  EXPORT_SYMBOL(generic_permission);
> @@ -1782,6 +1813,7 @@ static int walk_component(struct nameidata *nd, int flags)
>         struct inode *inode;
>         unsigned seq;
>         int err;
> +       const struct cred *cred;
>         /*
>          * "." and ".." are special - ".." especially so because it has
>          * to be able to know about the current root directory and
> @@ -1793,30 +1825,37 @@ static int walk_component(struct nameidata *nd, int flags)
>                         put_link(nd);
>                 return err;
>         }
> +       cred = change_userns_creds(&nd->path);
>         err = lookup_fast(nd, &path, &inode, &seq);
>         if (unlikely(err <= 0)) {
>                 if (err < 0)
> -                       return err;
> +                       goto out;
>                 path.dentry = lookup_slow(&nd->last, nd->path.dentry,
>                                           nd->flags);
> -               if (IS_ERR(path.dentry))
> -                       return PTR_ERR(path.dentry);
> +               if (IS_ERR(path.dentry)) {
> +                       err = PTR_ERR(path.dentry);
> +                       goto out;
> +               }
>
>                 path.mnt = nd->path.mnt;
>                 err = follow_managed(&path, nd);
>                 if (unlikely(err < 0))
> -                       return err;
> +                       goto out;
>
>                 if (unlikely(d_is_negative(path.dentry))) {
>                         path_to_nameidata(&path, nd);
> -                       return -ENOENT;
> +                       err = -ENOENT;
> +                       goto out;
>                 }
>
>                 seq = 0;        /* we are already out of RCU mode */
>                 inode = d_backing_inode(path.dentry);
>         }
>
> -       return step_into(nd, &path, flags, inode, seq);
> +       err = step_into(nd, &path, flags, inode, seq);
> + out:
> +       revert_userns_creds(cred);
> +       return err;
>  }
>
>  /*
> @@ -2070,8 +2109,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
>         for(;;) {
>                 u64 hash_len;
>                 int type;
> +               const struct cred *cred = change_userns_creds(&nd->path);
>
>                 err = may_lookup(nd);
> +               revert_userns_creds(cred);
>                 if (err)
>                         return err;
>
> @@ -2245,12 +2286,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  static const char *trailing_symlink(struct nameidata *nd)
>  {
>         const char *s;
> +       const struct cred *cred = change_userns_creds(&nd->path);
>         int error = may_follow_link(nd);
> -       if (unlikely(error))
> -               return ERR_PTR(error);
> +       if (unlikely(error)) {
> +               s = ERR_PTR(error);
> +               goto out;
> +       }
>         nd->flags |= LOOKUP_PARENT;
>         nd->stack[0].name = NULL;
>         s = get_link(nd);
> + out:
> +       revert_userns_creds(cred);
>         return s ? s : "";
>  }
>
> @@ -3256,6 +3302,7 @@ static int do_last(struct nameidata *nd,
>         struct inode *inode;
>         struct path path;
>         int error;
> +       const struct cred *cred = change_userns_creds(&nd->path);
>
>         nd->flags &= ~LOOKUP_PARENT;
>         nd->flags |= op->intent;
> @@ -3263,7 +3310,7 @@ static int do_last(struct nameidata *nd,
>         if (nd->last_type != LAST_NORM) {
>                 error = handle_dots(nd, nd->last_type);
>                 if (unlikely(error))
> -                       return error;
> +                       goto err;
>                 goto finish_open;
>         }
>
> @@ -3276,7 +3323,7 @@ static int do_last(struct nameidata *nd,
>                         goto finish_lookup;
>
>                 if (error < 0)
> -                       return error;
> +                       goto err;
>
>                 BUG_ON(nd->inode != dir->d_inode);
>                 BUG_ON(nd->flags & LOOKUP_RCU);
> @@ -3289,12 +3336,14 @@ static int do_last(struct nameidata *nd,
>                  */
>                 error = complete_walk(nd);
>                 if (error)
> -                       return error;
> +                       goto err;
>
>                 audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
>                 /* trailing slashes? */
> -               if (unlikely(nd->last.name[nd->last.len]))
> -                       return -EISDIR;
> +               if (unlikely(nd->last.name[nd->last.len])) {
> +                       error = -EISDIR;
> +                       goto err;
> +               }
>         }
>
>         if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
> @@ -3350,11 +3399,12 @@ static int do_last(struct nameidata *nd,
>
>         error = follow_managed(&path, nd);
>         if (unlikely(error < 0))
> -               return error;
> +               goto err;
>
>         if (unlikely(d_is_negative(path.dentry))) {
>                 path_to_nameidata(&path, nd);
> -               return -ENOENT;
> +               error = -ENOENT;
> +               goto err;
>         }
>
>         /*
> @@ -3364,7 +3414,8 @@ static int do_last(struct nameidata *nd,
>
>         if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
>                 path_to_nameidata(&path, nd);
> -               return -EEXIST;
> +               error = -EEXIST;
> +               goto err;
>         }
>
>         seq = 0;        /* out of RCU mode, so the value doesn't matter */
> @@ -3372,12 +3423,12 @@ static int do_last(struct nameidata *nd,
>  finish_lookup:
>         error = step_into(nd, &path, 0, inode, seq);
>         if (unlikely(error))
> -               return error;
> +               goto err;
>  finish_open:
>         /* Why this, you ask?  _Now_ we might have grown LOOKUP_JUMPED... */
>         error = complete_walk(nd);
>         if (error)
> -               return error;
> +               goto err;
>         audit_inode(nd->name, nd->path.dentry, 0);
>         if (open_flag & O_CREAT) {
>                 error = -EISDIR;
> @@ -3419,6 +3470,8 @@ static int do_last(struct nameidata *nd,
>         }
>         if (got_write)
>                 mnt_drop_write(nd->path.mnt);
> + err:
> +       revert_userns_creds(cred);
>         return error;
>  }
>
> @@ -3737,6 +3790,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>         struct path path;
>         int error;
>         unsigned int lookup_flags = 0;
> +       const struct cred *cred;
>
>         error = may_mknod(mode);
>         if (error)
> @@ -3746,6 +3800,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>         if (IS_ERR(dentry))
>                 return PTR_ERR(dentry);
>
> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mknod(&path, dentry, mode, dev);
> @@ -3767,6 +3822,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>         }
>  out:
>         done_path_create(&path, dentry);
> +       revert_userns_creds(cred);
>         if (retry_estale(error, lookup_flags)) {
>                 lookup_flags |= LOOKUP_REVAL;
>                 goto retry;
> @@ -3817,18 +3873,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>         struct path path;
>         int error;
>         unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +       const struct cred *cred;
>
>  retry:
>         dentry = user_path_create(dfd, pathname, &path, lookup_flags);
>         if (IS_ERR(dentry))
>                 return PTR_ERR(dentry);
>
> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mkdir(&path, dentry, mode);
>         if (!error)
>                 error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
>         done_path_create(&path, dentry);
> +       revert_userns_creds(cred);
>         if (retry_estale(error, lookup_flags)) {
>                 lookup_flags |= LOOKUP_REVAL;
>                 goto retry;
> @@ -3895,12 +3954,14 @@ long do_rmdir(int dfd, const char __user *pathname)
>         struct qstr last;
>         int type;
>         unsigned int lookup_flags = 0;
> +       const struct cred *cred;
>  retry:
>         name = filename_parentat(dfd, getname(pathname), lookup_flags,
>                                 &path, &last, &type);
>         if (IS_ERR(name))
>                 return PTR_ERR(name);
>
> +       cred = change_userns_creds(&path);
>         switch (type) {
>         case LAST_DOTDOT:
>                 error = -ENOTEMPTY;
> @@ -3936,6 +3997,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>         inode_unlock(path.dentry->d_inode);
>         mnt_drop_write(path.mnt);
>  exit1:
> +       revert_userns_creds(cred);
>         path_put(&path);
>         putname(name);
>         if (retry_estale(error, lookup_flags)) {
> @@ -4025,11 +4087,13 @@ long do_unlinkat(int dfd, struct filename *name)
>         struct inode *inode = NULL;
>         struct inode *delegated_inode = NULL;
>         unsigned int lookup_flags = 0;
> +       const struct cred *cred;
>  retry:
>         name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
>         if (IS_ERR(name))
>                 return PTR_ERR(name);
>
> +       cred = change_userns_creds(&path);
>         error = -EISDIR;
>         if (type != LAST_NORM)
>                 goto exit1;
> @@ -4067,6 +4131,7 @@ long do_unlinkat(int dfd, struct filename *name)
>         }
>         mnt_drop_write(path.mnt);
>  exit1:
> +       revert_userns_creds(cred);
>         path_put(&path);
>         if (retry_estale(error, lookup_flags)) {
>                 lookup_flags |= LOOKUP_REVAL;
> @@ -4131,6 +4196,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>         struct dentry *dentry;
>         struct path path;
>         unsigned int lookup_flags = 0;
> +       const struct cred *cred;
>
>         from = getname(oldname);
>         if (IS_ERR(from))
> @@ -4141,6 +4207,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>         if (IS_ERR(dentry))
>                 goto out_putname;
>
> +       cred = change_userns_creds(&path);
>         error = security_path_symlink(&path, dentry, from->name);
>         if (!error)
>                 error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
> @@ -4149,6 +4216,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>                 lookup_flags |= LOOKUP_REVAL;
>                 goto retry;
>         }
> +       revert_userns_creds(cred);
>  out_putname:
>         putname(from);
>         return error;
> @@ -4262,6 +4330,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>         struct inode *delegated_inode = NULL;
>         int how = 0;
>         int error;
> +       const struct cred *cred;
>
>         if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
>                 return -EINVAL;
> @@ -4289,6 +4358,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>         if (IS_ERR(new_dentry))
>                 goto out;
>
> +       cred = change_userns_creds(&new_path);
>         error = -EXDEV;
>         if (old_path.mnt != new_path.mnt)
>                 goto out_dput;
> @@ -4300,6 +4370,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>                 goto out_dput;
>         error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
>  out_dput:
> +       revert_userns_creds(cred);
>         done_path_create(&new_path, new_dentry);
>         if (delegated_inode) {
>                 error = break_deleg_wait(&delegated_inode);
> @@ -4519,6 +4590,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>         unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
>         bool should_retry = false;
>         int error;
> +       const struct cred *cred;
>
>         if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
>                 return -EINVAL;
> @@ -4548,6 +4620,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>                 goto exit1;
>         }
>
> +       cred = change_userns_creds(&new_path);
>         error = -EXDEV;
>         if (old_path.mnt != new_path.mnt)
>                 goto exit2;
> @@ -4632,6 +4705,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>         }
>         mnt_drop_write(old_path.mnt);
>  exit2:
> +       revert_userns_creds(cred);
>         if (retry_estale(error, lookup_flags))
>                 should_retry = true;
>         path_put(&new_path);
> diff --git a/fs/open.c b/fs/open.c
> index 033e2112fbda..7cad2b723925 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
>         struct path path;
>         int error;
>         unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +       const struct cred *cred;
>  retry:
>         error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
>         if (error)
>                 goto out;
>
> +       cred = change_userns_creds(&path);
>         error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
>         if (error)
>                 goto dput_and_out;
> @@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
>         set_fs_pwd(current->fs, &path);
>
>  dput_and_out:
> +       revert_userns_creds(cred);
>         path_put(&path);
>         if (retry_estale(error, lookup_flags)) {
>                 lookup_flags |= LOOKUP_REVAL;
> @@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
>  {
>         struct fd f = fdget_raw(fd);
>         int error;
> +       const struct cred *cred;
>
>         error = -EBADF;
>         if (!f.file)
>                 goto out;
>
> +       cred = change_userns_creds(&f.file->f_path);
>         error = -ENOTDIR;
>         if (!d_can_lookup(f.file->f_path.dentry))
>                 goto out_putf;
> @@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
>         if (!error)
>                 set_fs_pwd(current->fs, &f.file->f_path);
>  out_putf:
> +       revert_userns_creds(cred);
>         fdput(f);
>  out:
>         return error;
> @@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
>         struct inode *inode = path->dentry->d_inode;
>         struct inode *delegated_inode = NULL;
>         struct iattr newattrs;
> +       const struct cred *cred;
>         int error;
>
> +       cred = change_userns_creds(path);
>         error = mnt_want_write(path->mnt);
>         if (error)
> -               return error;
> +               goto out;
>  retry_deleg:
>         inode_lock(inode);
>         error = security_path_chmod(path, mode);
> @@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
>                         goto retry_deleg;
>         }
>         mnt_drop_write(path->mnt);
> + out:
> +       revert_userns_creds(cred);
>         return error;
>  }
>
> @@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
>         struct path path;
>         int error = -EINVAL;
>         int lookup_flags;
> +       const struct cred *cred;
>
>         if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
>                 goto out;
> @@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
>         error = user_path_at(dfd, filename, lookup_flags, &path);
>         if (error)
>                 goto out;
> +       cred = change_userns_creds(&path);
>         error = mnt_want_write(path.mnt);
>         if (error)
>                 goto out_release;
>         error = chown_common(&path, user, group);
>         mnt_drop_write(path.mnt);
>  out_release:
> +       revert_userns_creds(cred);
>         path_put(&path);
>         if (retry_estale(error, lookup_flags)) {
>                 lookup_flags |= LOOKUP_REVAL;
> @@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
>  {
>         struct fd f = fdget(fd);
>         int error = -EBADF;
> +       const struct cred *cred;
>
>         if (!f.file)
>                 goto out;
>
> +       cred = change_userns_creds(&f.file->f_path);
>         error = mnt_want_write_file(f.file);
>         if (error)
>                 goto out_fput;
> @@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
>         error = chown_common(&f.file->f_path, user, group);
>         mnt_drop_write_file(f.file);
>  out_fput:
> +       revert_userns_creds(cred);
>         fdput(f);
>  out:
>         return error;
> @@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
>   */
>  int vfs_open(const struct path *path, struct file *file)
>  {
> +       int ret;
> +       const struct cred *cred = change_userns_creds(path);
> +
>         file->f_path = *path;
> -       return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
> +       ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
> +       revert_userns_creds(cred);
> +       return ret;
>  }
>
>  struct file *dentry_open(const struct path *path, int flags,
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 84ad1c90d535..b5aa36261964 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
>                                          goto mask;
>                                 break;
>                          case ACL_GROUP_OBJ:
> -                                if (in_group_p(inode->i_gid)) {
> +                               if (in_group_p_shifted(inode->i_gid)) {
>                                         found = 1;
>                                         if ((pa->e_perm & want) == want)
>                                                 goto mask;
> @@ -652,7 +652,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
>                 return error;
>         if (error == 0)
>                 *acl = NULL;
> -       if (!in_group_p(inode->i_gid) &&
> +       if (!in_group_p_shifted(inode->i_gid) &&
>             !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>                 mode &= ~S_ISGID;
>         *mode_p = mode;
> diff --git a/fs/stat.c b/fs/stat.c
> index c38e4c2e1221..0018b168d7a7 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -21,6 +21,8 @@
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
>
> +#include "mount.h"
> +
>  /**
>   * generic_fillattr - Fill in the basic attributes from the inode struct
>   * @inode: Inode to use as the source
> @@ -48,6 +50,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
>  }
>  EXPORT_SYMBOL(generic_fillattr);
>
> +static void shift_check(struct vfsmount *mnt, struct kstat *stat)
> +{
> +       struct mount *m = real_mount(mnt);
> +       struct user_namespace *user_ns = m->mnt_ns->user_ns;
> +
> +       if ((mnt->mnt_flags & MNT_SHIFT) == 0)
> +               return;
> +
> +       if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
> +               return;
> +
> +       stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
> +       stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
> +}
> +
>  /**
>   * vfs_getattr_nosec - getattr without security checks
>   * @path: file to get attributes from
> @@ -65,6 +82,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>                       u32 request_mask, unsigned int query_flags)
>  {
>         struct inode *inode = d_backing_inode(path->dentry);
> +       int ret;
>
>         memset(stat, 0, sizeof(*stat));
>         stat->result_mask |= STATX_BASIC_STATS;
> @@ -77,12 +95,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>         if (IS_AUTOMOUNT(inode))
>                 stat->attributes |= STATX_ATTR_AUTOMOUNT;
>
> +       ret = 0;
>         if (inode->i_op->getattr)
> -               return inode->i_op->getattr(path, stat, request_mask,
> -                                           query_flags);
> +               ret = inode->i_op->getattr(path, stat, request_mask,
> +                                          query_flags);
> +       else
> +               generic_fillattr(inode, stat);
>
> -       generic_fillattr(inode, stat);
> -       return 0;
> +       if (!ret)
> +               shift_check(path->mnt, stat);
> +
> +       return ret;
>  }
>  EXPORT_SYMBOL(vfs_getattr_nosec);
>
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 18639c069263..8a5f2c9b613a 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
>  extern void groups_free(struct group_info *);
>
>  extern int in_group_p(kgid_t);
> +extern int in_group_p_shifted(kgid_t);
>  extern int in_egroup_p(kgid_t);
>  extern int groups_search(const struct group_info *, kgid_t);
>
> @@ -75,6 +76,10 @@ static inline int in_group_p(kgid_t grp)
>  {
>          return 1;
>  }
> +static inline int in_group_p_shifted(kgid_t grp)
> +{
> +       return 1;
> +}
>  static inline int in_egroup_p(kgid_t grp)
>  {
>          return 1;
> @@ -422,4 +427,9 @@ do {                                                \
>         *(_fsgid) = __cred->fsgid;              \
>  } while(0)
>
> +static inline bool cred_is_shifted(void)
> +{
> +       return current_cred() == current->mnt_cred;
> +}
> +
>  #endif /* _LINUX_CRED_H */
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index bf8cc4108b8f..cdc5d981d594 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -46,7 +46,7 @@ struct fs_context;
>  #define MNT_SHARED_MASK        (MNT_UNBINDABLE)
>  #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
>                                  | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
> -                                | MNT_READONLY)
> +                                | MNT_READONLY | MNT_SHIFT)
>  #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
>
>  #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
> @@ -65,6 +65,8 @@ struct fs_context;
>  #define MNT_MARKED             0x4000000
>  #define MNT_UMOUNT             0x8000000
>
> +#define MNT_SHIFT              0x10000000
> +
>  struct vfsmount {
>         struct dentry *mnt_root;        /* root of the mounted tree */
>         struct super_block *mnt_sb;     /* pointer to superblock */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 67a1d86981a9..e1fe495b7614 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -19,6 +19,7 @@
>  #include <linux/plist.h>
>  #include <linux/hrtimer.h>
>  #include <linux/seccomp.h>
> +#include <linux/mount.h>
>  #include <linux/nodemask.h>
>  #include <linux/rcupdate.h>
>  #include <linux/refcount.h>
> @@ -877,6 +878,10 @@ struct task_struct {
>         /* Effective (overridable) subjective task credentials (COW): */
>         const struct cred __rcu         *cred;
>
> +       /* cache for uid/gid shifted cred tied to mnt */
> +       struct cred                     *mnt_cred;
> +       struct vfsmount                 *mnt;
> +
>  #ifdef CONFIG_KEYS
>         /* Cached requested key. */
>         struct key                      *cached_requested_key;
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 1444f3954d75..3273e85a644c 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -486,8 +486,18 @@ EXPORT_SYMBOL(file_ns_capable);
>   */
>  bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
>  {
> -       return kuid_has_mapping(ns, inode->i_uid) &&
> -               kgid_has_mapping(ns, inode->i_gid);
> +       kuid_t i_uid = inode->i_uid;
> +       kgid_t i_gid = inode->i_gid;
> +
> +       if (cred_is_shifted()) {
> +               struct user_namespace *cns = current_user_ns();
> +
> +               i_uid = make_kuid(cns, __kuid_val(i_uid));
> +               i_gid = make_kgid(cns, __kgid_val(i_gid));
> +       }
> +
> +       return kuid_has_mapping(ns, i_uid) &&
> +               kgid_has_mapping(ns, i_gid);
>  }
>
>  /**
> diff --git a/kernel/cred.c b/kernel/cred.c
> index c0a4c12d38b2..bbe0e2e64081 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -167,6 +167,8 @@ void exit_creds(struct task_struct *tsk)
>         validate_creds(cred);
>         alter_cred_subscribers(cred, -1);
>         put_cred(cred);
> +       if (tsk->mnt_cred)
> +               put_cred(tsk->mnt_cred);
>
>         cred = (struct cred *) tsk->cred;
>         tsk->cred = NULL;
> @@ -318,6 +320,17 @@ struct cred *prepare_exec_creds(void)
>         return new;
>  }
>
> +static void flush_mnt_cred(struct task_struct *t)
> +{
> +       if (t->mnt_cred == t->cred)
> +               return;
> +       if (t->mnt_cred)
> +               put_cred(t->mnt_cred);
> +       t->mnt_cred = NULL;
> +       /* mnt is only used for comparison, so it has no reference */
> +       t->mnt = NULL;
> +}
> +
>  /*
>   * Copy credentials for the new process created by fork()
>   *
> @@ -344,6 +357,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>             ) {
>                 p->real_cred = get_cred(p->cred);
>                 get_cred(p->cred);
> +               p->mnt = NULL;
> +               p->mnt_cred = NULL;
>                 alter_cred_subscribers(p->cred, 2);
>                 kdebug("share_creds(%p{%d,%d})",
>                        p->cred, atomic_read(&p->cred->usage),
> @@ -383,6 +398,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>
>         atomic_inc(&new->user->processes);
>         p->cred = p->real_cred = get_cred(new);
> +       p->mnt = NULL;
> +       p->mnt_cred = NULL;
>         alter_cred_subscribers(new, 2);
>         validate_creds(new);
>         return 0;
> @@ -506,6 +523,7 @@ int commit_creds(struct cred *new)
>         /* release the old obj and subj refs both */
>         put_cred(old);
>         put_cred(old);
> +       flush_mnt_cred(task);
>         return 0;
>  }
>  EXPORT_SYMBOL(commit_creds);
> @@ -564,6 +582,7 @@ const struct cred *override_creds(const struct cred *new)
>         alter_cred_subscribers(new, 1);
>         rcu_assign_pointer(current->cred, new);
>         alter_cred_subscribers(old, -1);
> +       flush_mnt_cred(current);
>
>         kdebug("override_creds() = %p{%d,%d}", old,
>                atomic_read(&old->usage),
> @@ -589,6 +608,7 @@ void revert_creds(const struct cred *old)
>
>         validate_creds(old);
>         validate_creds(override);
> +       flush_mnt_cred(current);
>         alter_cred_subscribers(old, 1);
>         rcu_assign_pointer(current->cred, old);
>         alter_cred_subscribers(override, -1);
> diff --git a/kernel/groups.c b/kernel/groups.c
> index daae2f2dc6d4..772b49a367b0 100644
> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -228,6 +228,13 @@ int in_group_p(kgid_t grp)
>
>  EXPORT_SYMBOL(in_group_p);
>
> +int in_group_p_shifted(kgid_t grp)
> +{
> +       if (cred_is_shifted())
> +               grp = make_kgid(current_user_ns(), __kgid_val(grp));
> +       return in_group_p(grp);
> +}
> +
>  int in_egroup_p(kgid_t grp)
>  {
>         const struct cred *cred = current_cred();
> --
> 2.16.4
>

James Bottomley Dec. 3, 2019, 5:12 a.m. UTC | #2

On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> [cc: ebiederman]
> 
> On Tue, Dec 3, 2019 at 3:15 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > This implementation reverse shifts according to the user_ns
> > belonging
> > to the mnt_ns.  So if the vfsmount has the newly introduced flag
> > MNT_SHIFT and the current user_ns is the same as the mount_ns-
> > >user_ns
> > then we shift back using the user_ns before committing to the
> > underlying filesystem.
> > 
> > For example, if a user_ns is created where interior (fake root, uid
> > 0)
> > is mapped to kernel uid 100000 then writes from interior root
> > normally
> > go to the filesystem at the kernel uid.  However, if MNT_SHIFT is
> > set,
> > they will be shifted back to write at uid 0, meaning we can bind
> > mount
> > real image filesystems to user_ns protected faker root.
> > 
> > In essence there are several things which have to be done for this
> > to
> > occur safely.  Firstly for all operations on the filesystem, new
> > credentials have to be installed where fsuid and fsgid are set to
> > the
> > *interior* values. Next all inodes used from the filesystem have to
> > have i_uid and i_gid shifted back to the kernel values and
> > attributes
> > set from user space have to have ia_uid and ia_gid shifted from the
> > kernel values to the interior values.  The capability checks have
> > to
> > be done using ns_capable against the kernel values, but the inode
> > capability checks have to be done against the shifted ids.
> > 
> > Since creating a new credential is a reasonably expensive
> > proposition
> > and we have to shift and unshift many times during path walking, a
> > cached copy of the shifted credential is saved to a newly created
> > place in the task structure.  This serves the dual purpose of
> > allowing
> > us to use a pre-prepared copy of the shifted credentials and also
> > allows us to recognise whenever the shift is actually in effect
> > (the
> > cached shifted credential pointer being equal to the current_cred()
> > pointer).
> > 
> > To get this all to work, we have a check for the vfsmount flag and
> > the
> > user_ns gating a shifting of the credentials over all user space
> > entries to filesystem functions.  In theory the path has to be
> > present
> > everywhere we do this, so we can check the vfsmount
> > flags.  However,
> > for lower level functions we can cheat this path check of vfsmount
> > simply to check whether a shifted credential is in effect or not to
> > gate things like the inode permission check, which means the path
> > doesn't have to be threaded all the way through the permission
> > checking functions.  if the credential is shifted check passes, we
> > can
> > also be sure that the current user_ns is the same as the mnt-
> > >user_ns,
> > so we can use it and thus have no need of the struct mount at the
> > point of the shift.
> > 
> 
> 1. Smart

Heh, thanks.

> 2. Needs serious vetting by Eric (cc'ed)
> 3. A lot of people have been asking me for filtering of "dirent"
> fsnotify events (i.e. create/delete) by path, which is not available
> in those vfs functions, so ifthe concept of current->mnt flies,
> fsnotify is going to want to use it as well.

Just a caveat: current->mnt is used in this patch simply as a tag,
which means it doesn't need to be refcounted.  I think I can prove that
it is absolutely valid if the cred is shifted because the reference is
held by the code that shifted the cred, but it's definitely not valid
except for a tag comparison outside of that.  Thus, if it is useful for
fsnotify, more thought will have to be given to refcounting it.

> 4. This is currently not overlayfs (stacked fs) nor nfsd friendly.
> Those modules do not call the path based vfs APIs, but they do have
> the mnt stored internally.

OK, so I've got to confess that I've only tested it with my container
use case, which doesn't involve overlay or nfs.  However, as long as we
thread path down to the API that nfds and overlayfs use, it should
easily be made compatible with them ... do we have any documentation of
what API this is?

> I suppose you do want to be able to mount overlays and export nfs out
> of those shifted mounts, as they are merely the foundation for
> unprivileged container storage stack. right?

If the plan of doing this as a bind mount holds, then certainly because
any underlying filesystem has to work with it.

> For overlayfs, you should at least look at ovl_override_creds() for
> incorporating shift mount logic - or more likely at the creation of
> ofs->creator_cred.

Well, we had this discussion when I proposed shiftfs as a superblock
based stackable filesytem, I think: the way the shift needs to use
creds is fundamentally different from the way overlayfs uses them.  The
ovl_override_creds is overriding with the creator's creds but the
shifting bind mound needs to backshift through the user namespace
currently in effect.  Since uid shifts can stack, we can make them work
together, but they are fundamentally different things.

James

Amir Goldstein Dec. 3, 2019, 6:55 a.m. UTC | #3

On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > [cc: ebiederman]
> >
> > On Tue, Dec 3, 2019 at 3:15 AM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > This implementation reverse shifts according to the user_ns
> > > belonging
> > > to the mnt_ns.  So if the vfsmount has the newly introduced flag
> > > MNT_SHIFT and the current user_ns is the same as the mount_ns-
> > > >user_ns
> > > then we shift back using the user_ns before committing to the
> > > underlying filesystem.
> > >
> > > For example, if a user_ns is created where interior (fake root, uid
> > > 0)
> > > is mapped to kernel uid 100000 then writes from interior root
> > > normally
> > > go to the filesystem at the kernel uid.  However, if MNT_SHIFT is
> > > set,
> > > they will be shifted back to write at uid 0, meaning we can bind
> > > mount
> > > real image filesystems to user_ns protected faker root.
> > >
> > > In essence there are several things which have to be done for this
> > > to
> > > occur safely.  Firstly for all operations on the filesystem, new
> > > credentials have to be installed where fsuid and fsgid are set to
> > > the
> > > *interior* values. Next all inodes used from the filesystem have to
> > > have i_uid and i_gid shifted back to the kernel values and
> > > attributes
> > > set from user space have to have ia_uid and ia_gid shifted from the
> > > kernel values to the interior values.  The capability checks have
> > > to
> > > be done using ns_capable against the kernel values, but the inode
> > > capability checks have to be done against the shifted ids.
> > >
> > > Since creating a new credential is a reasonably expensive
> > > proposition
> > > and we have to shift and unshift many times during path walking, a
> > > cached copy of the shifted credential is saved to a newly created
> > > place in the task structure.  This serves the dual purpose of
> > > allowing
> > > us to use a pre-prepared copy of the shifted credentials and also
> > > allows us to recognise whenever the shift is actually in effect
> > > (the
> > > cached shifted credential pointer being equal to the current_cred()
> > > pointer).
> > >
> > > To get this all to work, we have a check for the vfsmount flag and
> > > the
> > > user_ns gating a shifting of the credentials over all user space
> > > entries to filesystem functions.  In theory the path has to be
> > > present
> > > everywhere we do this, so we can check the vfsmount
> > > flags.  However,
> > > for lower level functions we can cheat this path check of vfsmount
> > > simply to check whether a shifted credential is in effect or not to
> > > gate things like the inode permission check, which means the path
> > > doesn't have to be threaded all the way through the permission
> > > checking functions.  if the credential is shifted check passes, we
> > > can
> > > also be sure that the current user_ns is the same as the mnt-
> > > >user_ns,
> > > so we can use it and thus have no need of the struct mount at the
> > > point of the shift.
> > >
> >
> > 1. Smart
>
> Heh, thanks.
>
> > 2. Needs serious vetting by Eric (cc'ed)
> > 3. A lot of people have been asking me for filtering of "dirent"
> > fsnotify events (i.e. create/delete) by path, which is not available
> > in those vfs functions, so ifthe concept of current->mnt flies,
> > fsnotify is going to want to use it as well.
>
> Just a caveat: current->mnt is used in this patch simply as a tag,
> which means it doesn't need to be refcounted.  I think I can prove that
> it is absolutely valid if the cred is shifted because the reference is
> held by the code that shifted the cred, but it's definitely not valid
> except for a tag comparison outside of that.  Thus, if it is useful for
> fsnotify, more thought will have to be given to refcounting it.
>

Yes. Is there anything preventing us from taking refcount on
current->mnt?

> > 4. This is currently not overlayfs (stacked fs) nor nfsd friendly.
> > Those modules do not call the path based vfs APIs, but they do have
> > the mnt stored internally.
>
> OK, so I've got to confess that I've only tested it with my container
> use case, which doesn't involve overlay or nfs.  However, as long as we
> thread path down to the API that nfds and overlayfs use, it should
> easily be made compatible with them ... do we have any documentation of
> what API this is?

No proper doc AFAIK, but please take a look at:
https://lore.kernel.org/linux-fsdevel/20191025112917.22518-2-mszeredi@redhat.com/
It is part of a series to make overlayfs an FS_USERNS_MOUNT.

The simplest case goes typically something like this:
rmdir -> do_rmdir -(change_userns_creds)-> vfs_rmdir ->
    ovl_rmdir -(ovl_override_creds)-> vfs_rmdir -> ext4_rmdir

So if you shift mounted the overlayfs mount, you won't end up
using shifted creds in ext4 operations.
And if you shift mounted ext4 *before* creating the overlay, then
still, overlay doesn't go through do_rmdir, so your method won't
work either.

Similar situation with nfsd, although I have no idea if there are plans
to make nfsd userns aware.

>
> > I suppose you do want to be able to mount overlays and export nfs out
> > of those shifted mounts, as they are merely the foundation for
> > unprivileged container storage stack. right?
>
> If the plan of doing this as a bind mount holds, then certainly because
> any underlying filesystem has to work with it.
>

I am talking above, not under.
You shift mount an ext4 fs and hand it over to container fake root
(or mark it and let fake root shit mount).
The container fake root should be able to (after overlayfs unpriv changes)
create an overlay from inside container.
IOW, try to mount an overlay over your shifted fs and see how it
behaves.

> > For overlayfs, you should at least look at ovl_override_creds() for
> > incorporating shift mount logic - or more likely at the creation of
> > ofs->creator_cred.
>
> Well, we had this discussion when I proposed shiftfs as a superblock
> based stackable filesytem, I think: the way the shift needs to use
> creds is fundamentally different from the way overlayfs uses them.  The
> ovl_override_creds is overriding with the creator's creds but the
> shifting bind mound needs to backshift through the user namespace
> currently in effect.  Since uid shifts can stack, we can make them work
> together, but they are fundamentally different things.
>

Right.
Please take a look at the override_cred code in ovl_create_or_link().
This code has some fsuid dance that you need to check for shift friendliness.

The entire security model of overlayfs needs to be reexamined in the face
of shift mount, but as I wrote, I don't think its going to be too hard to
make ovl_override_creds() shift mount aware.
Overlayfs mimics vfs behavior in many cases.

Unless you shift mount both overlayfs and underlying (say) ext4, then
you still have only one mnt_cred to cache in any given call stack.

Thanks,
Amir.

James Bottomley Dec. 3, 2019, 2:10 p.m. UTC | #4

[splitting topics for ease of threading]
On Tue, 2019-12-03 at 08:55 +0200, Amir Goldstein wrote:
> On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > > [cc: ebiederman]
[...]
> > > 2. Needs serious vetting by Eric (cc'ed)
> > > 3. A lot of people have been asking me for filtering of "dirent"
> > > fsnotify events (i.e. create/delete) by path, which is not
> > > available in those vfs functions, so ifthe concept of current-
> > > >mnt flies, fsnotify is going to want to use it as well.
> > 
> > Just a caveat: current->mnt is used in this patch simply as a tag,
> > which means it doesn't need to be refcounted.  I think I can prove
> > that it is absolutely valid if the cred is shifted because the
> > reference is held by the code that shifted the cred, but it's
> > definitely not valid except for a tag comparison outside of
> > that.  Thus, if it is useful for fsnotify, more thought will have
> > to be given to refcounting it.
> > 
> 
> Yes. Is there anything preventing us from taking refcount on
> current->mnt?

We could, but what would it usefully mean?  It would just be the last
mnt that had its credentials shifted.  I think stashing a refcounted
mnt in the task structure is reasonably easy:  The creds are
refcounted, so you simply follow all the task mnt_cred logic I added
for releasing the ref in the correct places, so if you want to do that,
I can simply rename this tag to something less generic ... unless you
have some idea about using the last shift mnt?

James

Amir Goldstein Dec. 3, 2019, 2:33 p.m. UTC | #5

On Tue, Dec 3, 2019 at 4:10 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> [splitting topics for ease of threading]
> On Tue, 2019-12-03 at 08:55 +0200, Amir Goldstein wrote:
> > On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > > > [cc: ebiederman]
> [...]
> > > > 2. Needs serious vetting by Eric (cc'ed)
> > > > 3. A lot of people have been asking me for filtering of "dirent"
> > > > fsnotify events (i.e. create/delete) by path, which is not
> > > > available in those vfs functions, so ifthe concept of current-
> > > > >mnt flies, fsnotify is going to want to use it as well.
> > >
> > > Just a caveat: current->mnt is used in this patch simply as a tag,
> > > which means it doesn't need to be refcounted.  I think I can prove
> > > that it is absolutely valid if the cred is shifted because the
> > > reference is held by the code that shifted the cred, but it's
> > > definitely not valid except for a tag comparison outside of
> > > that.  Thus, if it is useful for fsnotify, more thought will have
> > > to be given to refcounting it.
> > >
> >
> > Yes. Is there anything preventing us from taking refcount on
> > current->mnt?
>
> We could, but what would it usefully mean?  It would just be the last
> mnt that had its credentials shifted.  I think stashing a refcounted
> mnt in the task structure is reasonably easy:  The creds are
> refcounted, so you simply follow all the task mnt_cred logic I added
> for releasing the ref in the correct places, so if you want to do that,
> I can simply rename this tag to something less generic ... unless you
> have some idea about using the last shift mnt?
>

Nevermind. Didn't want to derail the thread.
I am still not sure what the semantics of generic current->mnt should be.
operations like copy_file_range() with two path arguments can
get confusing and handling nesting (e.g. overlayfs can be confusing as well).

Thanks,
Amir.

James Bottomley Dec. 3, 2019, 2:40 p.m. UTC | #6

On Tue, 2019-12-03 at 08:55 +0200, Amir Goldstein wrote:
> On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > > [cc: ebiederman]
[...]
> > > 4. This is currently not overlayfs (stacked fs) nor nfsd
> > > friendly. Those modules do not call the path based vfs APIs, but
> > > they do have the mnt stored internally.
> > 
> > OK, so I've got to confess that I've only tested it with my
> > container use case, which doesn't involve overlay or nfs.  However,
> > as long as we thread path down to the API that nfds and overlayfs
> > use, it should easily be made compatible with them ... do we have
> > any documentation of what API this is?
> 
> No proper doc AFAIK, but please take a look at:
> https://lore.kernel.org/linux-fsdevel/20191025112917.22518-2-mszeredi
> @redhat.com/
> It is part of a series to make overlayfs an FS_USERNS_MOUNT.
> 
> The simplest case goes typically something like this:
> rmdir -> do_rmdir -(change_userns_creds)-> vfs_rmdir ->
>     ovl_rmdir -(ovl_override_creds)-> vfs_rmdir -> ext4_rmdir

Yes, I figured it would mostly be the vfs_ functions.

> So if you shift mounted the overlayfs mount, you won't end up
> using shifted creds in ext4 operations.
> And if you shift mounted ext4 *before* creating the overlay, then
> still, overlay doesn't go through do_rmdir, so your method won't
> work either.

So I think the upper use case (shift above overlay) is fairly easily
solvable: it involves making ovl_override_creds shift aware, so that
when it does the override it keeps the shift.  This might involve
stashing the overlay creds where the shift ones are in the task
structure so cred_is_shifted() still works.

The lower use case is more problematic because that would involve
changing most of the vfs_ API.  I think we can take a phased approach:

   1. Get agreement for the approach using the unstacked case (current
      patch effectively)
   2. Make the upper case work because it's the low hanging fruit; I can
      start looking at this (although I'll have to figure out how to get
      overlayfs working first).
   3. Investigate the lower case if there's an actual use.

> Similar situation with nfsd, although I have no idea if there are
> plans to make nfsd userns aware.

It's a similar upper and lower issue, although upper just involves
playing nicely with the name remapping.

> > > I suppose you do want to be able to mount overlays and export nfs
> > > out of those shifted mounts, as they are merely the foundation
> > > for unprivileged container storage stack. right?
> > 
> > If the plan of doing this as a bind mount holds, then certainly
> > because any underlying filesystem has to work with it.
> > 
> 
> I am talking above, not under.

Hopefully I addressed that above.  I think above is easier and should
be the first target, but to make this works completely eventually needs
the under case as well.

> You shift mount an ext4 fs and hand it over to container fake root
> (or mark it and let fake root shit mount).
> The container fake root should be able to (after overlayfs unpriv
> changes) create an overlay from inside container.
> IOW, try to mount an overlay over your shifted fs and see how it
> behaves.
> 
> > > For overlayfs, you should at least look at ovl_override_creds()
> > > for incorporating shift mount logic - or more likely at the
> > > creation of ofs->creator_cred.
> > 
> > Well, we had this discussion when I proposed shiftfs as a
> > superblock based stackable filesytem, I think: the way the shift
> > needs to use creds is fundamentally different from the way
> > overlayfs uses them.  The ovl_override_creds is overriding with the
> > creator's creds but the shifting bind mound needs to backshift
> > through the user namespace currently in effect.  Since uid shifts
> > can stack, we can make them work together, but they are
> > fundamentally different things.
> > 
> 
> Right.
> Please take a look at the override_cred code in ovl_create_or_link().
> This code has some fsuid dance that you need to check for shift
> friendliness.

Certainly, I've added it to my todo list.

> The entire security model of overlayfs needs to be reexamined in the
> face of shift mount, but as I wrote, I don't think its going to be
> too hard to make ovl_override_creds() shift mount aware.
> Overlayfs mimics vfs behavior in many cases.

Agreed.

> Unless you shift mount both overlayfs and underlying (say) ext4, then
> you still have only one mnt_cred to cache in any given call stack.

Heh well the double shift case will be the stress test of getting 2.
and 3. working right.

James

James Bottomley Dec. 3, 2019, 2:58 p.m. UTC | #7

On Tue, 2019-12-03 at 16:33 +0200, Amir Goldstein wrote:
> On Tue, Dec 3, 2019 at 4:10 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > [splitting topics for ease of threading]
> > On Tue, 2019-12-03 at 08:55 +0200, Amir Goldstein wrote:
> > > On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > 
> > > > On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > > > > [cc: ebiederman]
> > 
> > [...]
> > > > > 2. Needs serious vetting by Eric (cc'ed)
> > > > > 3. A lot of people have been asking me for filtering of
> > > > > "dirent" fsnotify events (i.e. create/delete) by path, which
> > > > > is not available in those vfs functions, so if the concept of
> > > > > current-mnt flies, fsnotify is going to want to use it as
> > > > > well.
> > > > 
> > > > Just a caveat: current->mnt is used in this patch simply as a
> > > > tag, which means it doesn't need to be refcounted.  I think I
> > > > can prove that it is absolutely valid if the cred is shifted
> > > > because the reference is held by the code that shifted the
> > > > cred, but it's definitely not valid except for a tag comparison
> > > > outside of that.  Thus, if it is useful for fsnotify, more
> > > > thought will have to be given to refcounting it.
> > > > 
> > > 
> > > Yes. Is there anything preventing us from taking refcount on
> > > current->mnt?
> > 
> > We could, but what would it usefully mean?  It would just be the
> > last mnt that had its credentials shifted.  I think stashing a
> > refcounted mnt in the task structure is reasonably easy:  The creds
> > are refcounted, so you simply follow all the task mnt_cred logic I
> > added for releasing the ref in the correct places, so if you want
> > to do that, I can simply rename this tag to something less generic
> > ... unless you have some idea about using the last shift mnt?
> > 
> 
> Nevermind. Didn't want to derail the thread.

Don't worry, that's why I separated the issues.

> I am still not sure what the semantics of generic current->mnt should
> be.

OK, so that's easy: it's not designed in the current patch set ever to
be used as a mnt.  It's simply a tag to tell you if the cached shifted
credentials in mnt_cred are valid for reuse.  To the only use I put it
to is in change_userns_cred() where we look at the task cached
mnt+mnt_cred and if mnt matches the mnt change_user_ns_cred was called
for we know that if mnt_cred is not NULL then it's usable as the pre-
prepared credentials.

> operations like copy_file_range() with two path arguments can
> get confusing and handling nesting (e.g. overlayfs can be confusing
> as well).

So I think the concept of using the task struct to carry information
you don't want to thread all the way up and down the stack, like how
I've done for knowledge of the shift being in effect, is potentially a
useful one.  It is a bit of a hack though to work around the fact that
our API is missing stuff.

James

James Bottomley Dec. 13, 2019, 5:49 p.m. UTC | #8

On Tue, 2019-12-03 at 06:40 -0800, James Bottomley wrote:
> On Tue, 2019-12-03 at 08:55 +0200, Amir Goldstein wrote:
> > On Tue, Dec 3, 2019 at 7:12 AM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > > 
> > > On Tue, 2019-12-03 at 06:51 +0200, Amir Goldstein wrote:
> > > > [cc: ebiederman]
> 
> [...]
> > > > 4. This is currently not overlayfs (stacked fs) nor nfsd
> > > > friendly. Those modules do not call the path based vfs APIs,
> > > > but
> > > > they do have the mnt stored internally.
> > > 
> > > OK, so I've got to confess that I've only tested it with my
> > > container use case, which doesn't involve overlay or
> > > nfs.  However,
> > > as long as we thread path down to the API that nfds and overlayfs
> > > use, it should easily be made compatible with them ... do we have
> > > any documentation of what API this is?
> > 
> > No proper doc AFAIK, but please take a look at:
> > https://lore.kernel.org/linux-fsdevel/20191025112917.22518-2-mszere
> > di
> > @redhat.com/
> > It is part of a series to make overlayfs an FS_USERNS_MOUNT.
> > 
> > The simplest case goes typically something like this:
> > rmdir -> do_rmdir -(change_userns_creds)-> vfs_rmdir ->
> >     ovl_rmdir -(ovl_override_creds)-> vfs_rmdir -> ext4_rmdir
> 
> Yes, I figured it would mostly be the vfs_ functions.
> 
> > So if you shift mounted the overlayfs mount, you won't end up
> > using shifted creds in ext4 operations.
> > And if you shift mounted ext4 *before* creating the overlay, then
> > still, overlay doesn't go through do_rmdir, so your method won't
> > work either.
> 
> So I think the upper use case (shift above overlay) is fairly easily
> solvable: it involves making ovl_override_creds shift aware, so that
> when it does the override it keeps the shift.  This might involve
> stashing the overlay creds where the shift ones are in the task
> structure so cred_is_shifted() still works.
> 
> The lower use case is more problematic because that would involve
> changing most of the vfs_ API.  I think we can take a phased
> approach:
> 
>    1. Get agreement for the approach using the unstacked case
> (current
>       patch effectively)
>    2. Make the upper case work because it's the low hanging fruit; I
> can
>       start looking at this (although I'll have to figure out how to
> get
>       overlayfs working first).
>    3. Investigate the lower case if there's an actual use.
> 
> > Similar situation with nfsd, although I have no idea if there are
> > plans to make nfsd userns aware.
> 
> It's a similar upper and lower issue, although upper just involves
> playing nicely with the name remapping.
> 
> > > > I suppose you do want to be able to mount overlays and export
> > > > nfs
> > > > out of those shifted mounts, as they are merely the foundation
> > > > for unprivileged container storage stack. right?
> > > 
> > > If the plan of doing this as a bind mount holds, then certainly
> > > because any underlying filesystem has to work with it.
> > > 
> > 
> > I am talking above, not under.
> 
> Hopefully I addressed that above.  I think above is easier and should
> be the first target, but to make this works completely eventually
> needs
> the under case as well.
> 
> > You shift mount an ext4 fs and hand it over to container fake root
> > (or mark it and let fake root shit mount).
> > The container fake root should be able to (after overlayfs unpriv
> > changes) create an overlay from inside container.
> > IOW, try to mount an overlay over your shifted fs and see how it
> > behaves.
> > 
> > > > For overlayfs, you should at least look at ovl_override_creds()
> > > > for incorporating shift mount logic - or more likely at the
> > > > creation of ofs->creator_cred.
> > > 
> > > Well, we had this discussion when I proposed shiftfs as a
> > > superblock based stackable filesytem, I think: the way the shift
> > > needs to use creds is fundamentally different from the way
> > > overlayfs uses them.  The ovl_override_creds is overriding with
> > > the
> > > creator's creds but the shifting bind mound needs to backshift
> > > through the user namespace currently in effect.  Since uid shifts
> > > can stack, we can make them work together, but they are
> > > fundamentally different things.
> > > 
> > 
> > Right.
> > Please take a look at the override_cred code in
> > ovl_create_or_link().
> > This code has some fsuid dance that you need to check for shift
> > friendliness.
> 
> Certainly, I've added it to my todo list.

OK, I've now read through the code and think I know how it will work in
the unprivileged case.  I'm still only looking at the shifting bind
above ... below is still a problem.

> > The entire security model of overlayfs needs to be reexamined in
> > the face of shift mount, but as I wrote, I don't think its going to
> > be too hard to make ovl_override_creds() shift mount aware.
> > Overlayfs mimics vfs behavior in many cases.
> 
> Agreed.

So in the unprivileged case, the lower mount will have its own mounter
override creds for doing the pull up operations.  These will override
any shifted creds we provide, but I think that's OK ... we don't want
the potentially more privileged creds to be active here.

When operating on the upper, we will be using the shifted creds, but
that too is correct because the upper r/w layer is precisely the one we
want the read at fake root, write at real root to work for.

In the usual (not unprivileged) case, of course, the overlayfs mount
creds are privileged, so they likely look the same as the shifted ones
anyway.

James

[1/2] fs: introduce uid/gid shifting bind mount

Commit Message

Comments

Patch