[v2,2/3] fs: introduce uid/gid shifting bind mount

Message ID	20200104203946.27914-3-James.Bottomley@HansenPartnership.com
State	New, archived
Headers
Return-Path: <SRS0=rKJc=2Z=vger.kernel.org=linux-fsdevel-owner@kernel.org>
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: linux-fsdevel@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>,
    Christian Brauner <christian@brauner.io>,
    Al Viro <viro@ZenIV.linux.org.uk>,
    Miklos Szeredi <miklos@szeredi.hu>,
    Seth Forshee <seth.forshee@canonical.com>,
    linux-unionfs@vger.kernel.org,
    Amir Goldstein <amir73il@gmail.com>,
    Stéphane Graber <stgraber@ubuntu.com>,
    Eric Biederman <ebiederm@xmission.com>,
    Aleksa Sarai <cyphar@cyphar.com>,
    containers@lists.linux-foundation.org
Subject: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
Date: Sat, 4 Jan 2020 12:39:45 -0800
Message-Id: <20200104203946.27914-3-James.Bottomley@HansenPartnership.com>
In-Reply-To: <20200104203946.27914-1-James.Bottomley@HansenPartnership.com>
References: <20200104203946.27914-1-James.Bottomley@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
Series	introduce a uid/gid shifting bind mount
    [v2,0/3] introduce a uid/gid shifting bind mount
    [v2,1/3] fs: rethread notify_change to take a path instead of a dentry
    [v2,2/3] fs: introduce uid/gid shifting bind mount
    [v2,3/3] fs: expose shifting bind mount to userspace

On Fri, 2020-01-17 at 09:44 -0600, Serge E. Hallyn wrote:
> On Thu, Jan 16, 2020 at 08:29:33AM -0800, James Bottomley wrote:
> > On Thu, 2020-01-16 at 00:44 -0600, Serge E. Hallyn wrote:
> > > On Wed, Jan 15, 2020 at 10:19:20AM -0800, James Bottomley wrote:
> > > > On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> > > > > On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley
> > > > > wrote:
> > > > > > This implementation reverse shifts according to the user_ns
> > > > > > belonging to the mnt_ns. So if the vfsmount has the newly
> > > > > > introduced flag MNT_SHIFT and the current user_ns is the
> > > > > > same as the mount_ns->user_ns then we shift back using the
> > > > > > user_ns before committing to the underlying filesystem.
> > > > > >
> > > > > > For example, if a user_ns is created where interior (fake
> > > > > > root,uid 0) is mapped to kernel uid 100000 then writes from
> > > > > > interior root normally go to the filesystem at the kernel
> > > > > > uid. However, if MNT_SHIFT is set, they will be shifted
> > > > > > back to write at uid 0, meaning we can bind mount real
> > > > > > image filesystems to user_ns protected faker root.
> > > > >
> > > > > Thanks, James, I definitely would like to see shifting in the
> > > > > VFS api.
> > > > >
> > > > > I have a few practical concerns about this implementation,
> > > > > but my biggest concern is more fundamental: this again by
> > > > > design leaves littered about the filesystem uid-0 owned files
> > > > > which were written by an untrusted user.
> > > >
> > > > Well, I think that's a consequence of my use case: using
> > > > unmodified container images with the user namespace. We're
> > > > starting to do IMA/EVM signatures in our images, so shifted UID
> > > > images aren't an option for us. Therefore I have to figure out
> > > > a way of allowing an untrusted user to write safely at UID
> > > > zero. For me that safety comes from strictly corralling where
> > > > they can write and making sure the container orchestration
> > > > system sets it up correctly.
> > >
> > > Isn't that a matter of convention? You could ship, store, and
> > > measure the files already shifted. An OCI annotation could show
> > > the offset, say 100000.
> >
> > We could, but it's the wrong way to look at it to tell a customer
> > that if they want us to run the image safely they have to modify it
> > at the build stage. As a cloud service provider I want to make the
> > statement that I can run any customer image safely as long as it
> > was built to whatever standards the registry supports. That has to
> > include integrity protected images. And I have to be able to
> > attest to a
>
> And does the customer measure the files, or do you?

Well, the customer signs the image, which is why we can't alter it.
The idea would be that the CSP does the measurement and the provision
of the running dashboard but that the customer could also do it
themselves if they didn't trust the CSP ... or just wanted to verify
the CSP integrity tool was working correctly.

> > customer that I'm running their image as part of the customer
> > integrity verification.
>
> Makes sense. And in your environment, you can easily partition off a
> place (or an otherwise unused namespace) in which to mount these
> images. So using a null mapping for the 'origin' would make sense
> there.
>
> But in cases where what you want is a single directory shared by
> several containers with disjoint uid mappings, where this is the only
> directory they share - be it for logs, or data, etc, and be it by
> infrastructure containers in the course of running a cluster or a set
> of students manipulating shared data with their otherwise completely
> unprivileged containers - we can make the shared directory a lot less
> of a minefield.

Sure ... I didn't say you didn't have a use case. I was just pointing
out that I have a write at uid 0 one. I'm happy to implement a scheme
that covers both.

> > > Now if any admin runs across this device no one will be tricked by
> > > the root owned files.
> >
> > Perhaps you could go into what tricks you think will happen? This
> > is
>
> I don't like to use my own underactive imagination to decide what an
> attacker - or accidental fool - might be likely to do. But simply
> writing a setuid-root shell script called 'ls' will probably hit
> *someone* who against all advice has . at the front of their path.

The suid/sgid is already on my list of potential threats. It's
mitigated by never letting the tenant get access to where the
unshifted image is unpacked.

> (Don't look at me like that - it's 2020 and we still have flashy
> respectable-looking websites encouraging people to wget | sudo
> /bin/sh)

Person with root access being stupid is definitely a problem, but it's
not really in my threat model.

> > clearly the threat model of using unmodified images you have which
> > might be different from the one I have. My mitigation is basically
> > that as long as no tenant or unprivileged user can get at the
> > unshifted image, we're fine.
>
> Are you sure? What if $package accidentally ships a broken cronjob
> that tries to run ./bin/sh -c "logger $(date)" ?

You mean sends a message to the systemd log socket? That's usually
intercepted to go into the per-container Kafka receiver (or whatever
else the CSP uses for logging).

> > > Mount could conceivably look like:
> > >
> > >     mount --bind --origin-uid 100000 --shift /proc/50/ns/user /src
> > >     /dest
> > >
> > > (the --shift idea coming from Tycho).
> >
> > Just so we're clear --origin-uid <uid> means map back along the
> > --shift user_ns but add this <uid> to whatever interior id the shift
> > produces?
>
> If by interior id you mean the kuid, then yes :)

Oh, no ... we need to get the terminology straight. The kuid, as in
the uid the kernel sees, is what I think of as the exterior uid. The
uid the container tenant thinks they see is the interior one.

So let's say the user_ns maps interior 0 to exterior 20,000 but that
the image begins at 100,000. You have an --origin-uid of 100000 in the
above, I believe. How we get there is the user as sudo root writes a
file "f". "f" has interior owner 0. However, the exterior owner, which
is what usually gets written, is 20,000. Doing the shift I'd take the
20,000; shift back along the tenant user_ns to get 0 and then add the
100,000 offset to end up writing to the image at 100,000. That would
mean a bunch of different user_ns could be set up all shifting back to
the same 100,000 and thus share the image.

> > I think that's fairly easy to parametrise and store in the bind
> > mount, yes.
> >
> > > I'd prefer --origin to be another user namespace fd, which I
> > > suppose some tool could easily set up, for instance:
> > >
> > >     pid1=`setup-userns-fd -m b:0:100000:65536`
> > >     pid2=$(prepare a container userns)
> > >     mount --bind --shift-origin=/proc/$pid1/ns/user \
> > >         --shift-target=/proc/$pid2/ns/user /src /dest
> > >
> > > You could presumably always skip the shift-origin to achieve what
> > > you're doing now.
> >
> > Yes, if you're happy to have --shift-origin <uid> default to 0
>
> Yeah I think that's fine. I'd expect any distro which tries to
> configure this for easy consumption to allocate a 65k subuid range
> for 'images', and set a default shift-origin under /etc which 'mount'
> would consult, or something like that. The kernel almost certainly
> would default to 0.
>
> > I have to ask in the above, what is the point of the pid1
> > user_ns? Do you ever use pid1 for anything else?
>
> Probably not.
>
> > It looks like you were merely creating it for the object of having
> > it passed into the bind. If there's never any use for the
> > --shift-origin <ns_fd> then I think I agree that a bare number is a
> > better abstraction. Or are you thinking we'll have use cases where a
> > simple numeric addition won't serve and our only user mechanism for
> > complex parametrisation of the shift map is a user_ns?
>
> I don't think so. People can have some pretty convoluted uid
> mappings right now, but presumably the images we are talking about
> would be the result of an rsync or tar *in* such a namespace. Though
> again, limited imagination and all that. There *may* be very good
> use cases for a more complicated mapping.

Would it be OK to implement the simple now and add the complex later
if it ever materializes? Provided we can agree on a way of passing
extra arguments to bind, we have the freedom to add stuff later.

> > The other slight problem is that now the bind mount does need to
> > understand complex arguments, which it definitely doesn't
> > today. I'm happy with extending fsconfig to bind, so it can do
> > complex arguments like this, but it sounds like others are dubious
> > so doing the above also depends on agreeing whatever extension we
> > do to bind.
> >
> > I suppose bind reconfigure could be yet another system call in the
> > open_tree/move_mount pantheon, which would also solve the remount
> > with different bind parameters problem with the new API.
> >
> > The other thing I worry about is that is separating the
> > shift_user_ns from the mount_ns->user_ns a potential security
> > hole? For the unprivileged operation of this, I like the idea of
> > enforcing them to be the same so the tenant can only shift back
> > along a user_ns they're operating in. The problem being that the
> > kernel has no way of validating that the passed in <ns_fd> is
> > within the subuid/subgid range of the unprivileged user, so we're
> > trusting that the user can't get access to the ns_fd of a user_ns
> > outside that range.
>
> I guess I figured we would have privileged task in the owning
> namespace (presumably init_user_ns) mark a bind mount as shiftable

Yes, that's what I've got today in the prototype. It mirrors the
original shiftfs mechanism. However, I have also heard people say they
want a permanent mark, like an xattr for this.

> (maybe specifying who is allowed to bind mount it using the mapped
> root uid, analogous to how the namespaced file capabilities are
> identified) and then the ns_fd of the task doing the "mount --bind
> --shift" (which is privileged inside the ns_fd userns) would be used,
> unmodified (or even modified, since whatever uid args the task would
> pass would have to be valid inside the mounting userns)
>
> So something like:
>
> 1. On the host:
>
>     mount --bind --mark-shiftable-by 200000 --origin-uid 100000
>     /data/group1
>
> 2. In the container which has its root mapped to host uid 200000
>
>     mount --bind --shift /data/group1 /data/group1

We can certainly do that, but it does mean one mark (i.e. one mount
point, so /data/group<n>) per user_ns at different uids. A simpler
alternative may be to do the mark as

    mount --bind --mark-shiftable --origin-uid 100000 /data/group

And regulate access to /data/group by the usual filesystem ACL. Then
anyone who can get at /data/group can do

    mount --bind --shift /data/group1 /my/place/for/the/image

I tend to think that ACL security is sufficient, since it's what
everyone is used to but I don't object to having the additional origin
check as well.

> > > > > I would feel much better if you institutionalized having the
> > > > > origin shifted. For instance, take a squashfs for a
> > > > > container fs, shift it so that fsuid 0 == hostuid
> > > > > 100000. Mount that, with a marker saying how it is shifted,
> > > > > then set 'shiftable'. Now use that as a base for allowing an
> > > > > unpriv user to shift. If that user has subuid 200000 as
> > > > > container uid 0, then its root will write files as uid 100000
> > > > > in the fs. This isn't perfect, but I think something along
> > > > > these lines would be far safer.
> > > >
> > > > OK, so I fully agree that if you're not doing integrity in the
> > > > container, then this is an option for you and whatever API gets
> > > > upstreamed should cope with that case.
> > > >
> > > > So to push on the API a bit, what do you want? The reverse
> > > > along the user_ns one I implemented is easy: a single flag
> > > > tells you to map back or not. However, the implementation is
> > > > phrased in terms of shifted credentials, so as long as we know
> > > > how to map, it can work for both our use cases. I think in
> > > > plumbers you expressed interest in simply passing the map to
> > > > the mount rather than doing it via a user_ns; is that still the
> > > > case?
> > >
> > > Oh I think I'm fine either way - I can always create a user_ns to
> > > match the map I want.
> >
> > I think it comes down to whether there's an actual use for the
> > user_ns you create. It seems a bit wasteful merely to create a
> > user_ns for the purpose of passing something that can also be
> > simply parametrised if there's no further use for that user_ns.
>
> Oh - I consider the detail of whether we pass a userid or userns nsfd
> as more of an implementation detail which we can hash out after the
> more general shift-mount api is decided upon. Anyway, passing nsfds
> just has a cool factor :)

Well, yes, won't argue on the cool factor-ness.

James

> -serge
>
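
The shift-with-origin arithmetic described in the reply above (tenant
kuid 20,000 maps back to interior 0, then the 100,000 origin offset is
added) can be modelled in plain userspace C. This is only a sketch
under stated assumptions: struct uid_extent, to_interior(),
shift_to_image() and UID_UNMAPPED are hypothetical names used for
illustration, not part of the patch or of any kernel API, and a real
user_ns may carry several map extents rather than the single one
modelled here.

```c
#include <stdint.h>

/*
 * Hypothetical model of ONE user_ns uid map extent: interior ids
 * [interior_first, interior_first + count) correspond to exterior
 * (kernel) ids [exterior_first, exterior_first + count).
 */
struct uid_extent {
	uint32_t interior_first;
	uint32_t exterior_first;
	uint32_t count;
};

/* Sentinel for "no mapping", chosen for this sketch only. */
#define UID_UNMAPPED ((uint32_t)-1)

/* Exterior (kernel) uid -> interior uid, or UID_UNMAPPED. */
static uint32_t to_interior(const struct uid_extent *m, uint32_t exterior)
{
	if (exterior < m->exterior_first ||
	    exterior - m->exterior_first >= m->count)
		return UID_UNMAPPED;
	return m->interior_first + (exterior - m->exterior_first);
}

/*
 * Shift back along the tenant map, then add the image origin offset.
 * Returns the uid that would be written to the image, or UID_UNMAPPED
 * if the exterior uid is outside the map or the sum would overflow.
 */
static uint32_t shift_to_image(const struct uid_extent *m, uint32_t origin,
			       uint32_t exterior)
{
	uint32_t interior = to_interior(m, exterior);

	if (interior == UID_UNMAPPED || interior >= UID_UNMAPPED - origin)
		return UID_UNMAPPED;
	return origin + interior;
}
```

With the extent {0, 20000, 65536} and an origin of 100000,
shift_to_image() turns kuid 20000 into 100000. A second namespace
mapped at {0, 200000, 65536} lands on the same 100000, which is the
property that would let several tenants share one image. An origin of
0 corresponds to the write-at-uid-0 case discussed in the thread.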

diff --git a/fs/attr.c b/fs/attr.c index 370b18807f05..3efb2dc67896 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -18,14 +18,22 @@ #include <linux/evm.h> #include <linux/ima.h> +#include "internal.h" +#include "mount.h" + static bool chown_ok(const struct inode *inode, kuid_t uid) { + kuid_t i_uid = inode->i_uid; + + if (cred_is_shifted()) + i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid)); + if (uid_eq(current_fsuid(), inode->i_uid) && - uid_eq(uid, inode->i_uid)) + uid_eq(uid, i_uid)) return true; if (capable_wrt_inode_uidgid(inode, CAP_CHOWN)) return true; - if (uid_eq(inode->i_uid, INVALID_UID) && + if (uid_eq(i_uid, INVALID_UID) && ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN)) return true; return false; @@ -33,12 +41,21 @@ static bool chown_ok(const struct inode *inode, kuid_t uid) static bool chgrp_ok(const struct inode *inode, kgid_t gid) { + kgid_t i_gid = inode->i_gid; + kuid_t i_uid = inode->i_uid; + + if (cred_is_shifted()) { + struct user_namespace *ns = current_user_ns(); + + i_uid = make_kuid(ns, __kuid_val(i_uid)); + i_gid = make_kgid(ns, __kgid_val(i_gid)); + } if (uid_eq(current_fsuid(), inode->i_uid) && - (in_group_p(gid) || gid_eq(gid, inode->i_gid))) + (in_group_p(gid) || gid_eq(gid, i_gid))) return true; if (capable_wrt_inode_uidgid(inode, CAP_CHOWN)) return true; - if (gid_eq(inode->i_gid, INVALID_GID) && + if (gid_eq(i_gid, INVALID_GID) && ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN)) return true; return false; @@ -89,9 +106,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr) if (ia_valid & ATTR_MODE) { if (!inode_owner_or_capable(inode)) return -EPERM; + /* Also check the setgid bit! */ - if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid : - inode->i_gid) && + if (!in_group_p_shifted((ia_valid & ATTR_GID) ? 
attr->ia_gid : + inode->i_gid) && !capable_wrt_inode_uidgid(inode, CAP_FSETID)) attr->ia_mode &= ~S_ISGID; } @@ -198,7 +216,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr) if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; - if (!in_group_p(inode->i_gid) && + if (!in_group_p_shifted(inode->i_gid) && !capable_wrt_inode_uidgid(inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; @@ -235,6 +253,9 @@ int notify_change(const struct path *path, struct iattr * attr, int error; struct timespec64 now; unsigned int ia_valid = attr->ia_valid; + const struct cred *cred; + kuid_t i_uid = inode->i_uid; + kgid_t i_gid = inode->i_gid; WARN_ON_ONCE(!inode_is_locked(inode)); @@ -243,18 +264,28 @@ int notify_change(const struct path *path, struct iattr * attr, return -EPERM; } + cred = change_userns_creds(path); + if (cred) { + struct mount *m = real_mount(path->mnt); + + attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid)); + attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid)); + } + /* * If utimes(2) and friends are called with times == NULL (or both * times are UTIME_NOW), then we need to check for write permission */ if (ia_valid & ATTR_TOUCH) { - if (IS_IMMUTABLE(inode)) - return -EPERM; + if (IS_IMMUTABLE(inode)) { + error = -EPERM; + goto err; + } if (!inode_owner_or_capable(inode)) { error = inode_permission(inode, MAY_WRITE); if (error) - return error; + goto err; } } @@ -275,7 +306,7 @@ int notify_change(const struct path *path, struct iattr * attr, if (ia_valid & ATTR_KILL_PRIV) { error = security_inode_need_killpriv(dentry); if (error < 0) - return error; + goto err; if (error == 0) ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV; } @@ -306,34 +337,46 @@ int notify_change(const struct path *path, struct iattr * attr, attr->ia_mode &= ~S_ISGID; } } - if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) - return 0; + if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) { + error = 0; + goto 
err; + } /* * Verify that uid/gid changes are valid in the target * namespace of the superblock. */ + error = -EOVERFLOW; if (ia_valid & ATTR_UID && !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid)) - return -EOVERFLOW; + goto err; + if (ia_valid & ATTR_GID && !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid)) - return -EOVERFLOW; + goto err; /* Don't allow modifications of files with invalid uids or * gids unless those uids & gids are being made valid. */ - if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid)) - return -EOVERFLOW; - if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid)) - return -EOVERFLOW; + if (cred_is_shifted()) { + struct user_namespace *ns = current_user_ns(); + + i_uid = make_kuid(ns, __kuid_val(i_uid)); + i_gid = make_kgid(ns, __kgid_val(i_gid)); + } + + if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid)) + goto err; + + if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid)) + goto err; error = security_inode_setattr(dentry, attr); if (error) - return error; + goto err; error = try_break_deleg(inode, delegated_inode); if (error) - return error; + goto err; if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); @@ -346,6 +389,8 @@ int notify_change(const struct path *path, struct iattr * attr, evm_inode_post_setattr(dentry, ia_valid); } + err: + revert_userns_creds(cred); return error; } EXPORT_SYMBOL(notify_change); diff --git a/fs/exec.c b/fs/exec.c index 74d88dab98dd..4baf91391689 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1539,13 +1539,18 @@ static void bprm_fill_uid(struct linux_binprm *bprm) /* Be careful if suid/sgid is set */ inode_lock(inode); - /* reload atomically mode/uid/gid now that lock held */ mode = inode->i_mode; uid = inode->i_uid; gid = inode->i_gid; inode_unlock(inode); + if (cred_is_shifted()) { + struct user_namespace *ns = current_user_ns(); + + uid = make_kuid(ns, __kuid_val(uid)); + gid = make_kgid(ns, __kgid_val(gid)); + } /* We ignore suid/sgid if there are no mappings for them in the ns */ 
if (!kuid_has_mapping(bprm->cred->user_ns, uid) || !kgid_has_mapping(bprm->cred->user_ns, gid)) diff --git a/fs/inode.c b/fs/inode.c index 18ff3081bda0..f5f7f7cbd374 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2060,7 +2060,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir, if (S_ISDIR(mode)) mode |= S_ISGID; else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) && - !in_group_p(inode->i_gid) && + !in_group_p_shifted(inode->i_gid) && !capable_wrt_inode_uidgid(dir, CAP_FSETID)) mode &= ~S_ISGID; } else @@ -2079,12 +2079,15 @@ EXPORT_SYMBOL(inode_init_owner); bool inode_owner_or_capable(const struct inode *inode) { struct user_namespace *ns; + kuid_t uid = inode->i_uid; - if (uid_eq(current_fsuid(), inode->i_uid)) + if (uid_eq(current_fsuid(), uid)) return true; ns = current_user_ns(); - if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER)) + if (cred_is_shifted()) + uid = make_kuid(ns, __kuid_val(uid)); + if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER)) return true; return false; } diff --git a/fs/internal.h b/fs/internal.h index 9cbf6097c77f..47ac2f295f70 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd, const char __user *newname); int do_linkat(int olddfd, const char __user *oldname, int newdfd, const char __user *newname, int flags); +const struct cred *change_userns_creds(const struct path *p); +void revert_userns_creds(const struct cred *cred); /* * namespace.c diff --git a/fs/namei.c b/fs/namei.c index 7bb4b1dcf3cc..0f36f21e6964 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -124,6 +124,38 @@ #define EMBEDDED_NAME_MAX (PATH_MAX - offsetof(struct filename, iname)) +const struct cred *change_userns_creds(const struct path *p) +{ + struct mount *m = real_mount(p->mnt); + + if ((p->mnt->mnt_flags & MNT_SHIFT) == 0) + return NULL; + + if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns) + return NULL; + + if (current->mnt != 
p->mnt) { + struct cred *cred; + struct user_namespace *user_ns = m->mnt_ns->user_ns; + + if (current->mnt_cred) + put_cred(current->mnt_cred); + cred = prepare_creds(); + cred->fsuid = KUIDT_INIT(from_kuid(user_ns, current->cred->fsuid)); + cred->fsgid = KGIDT_INIT(from_kgid(user_ns, current->cred->fsgid)); + current->mnt = p->mnt; /* no reference needed */ + current->mnt_cred = cred; + } + return override_creds(current->mnt_cred); +} + +void revert_userns_creds(const struct cred *cred) +{ + if (!cred) + return; + revert_creds(cred); +} + struct filename * getname_flags(const char __user *filename, int flags, int *empty) { @@ -303,7 +335,7 @@ static int acl_permission_check(struct inode *inode, int mask) return error; } - if (in_group_p(inode->i_gid)) + if (in_group_p_shifted(inode->i_gid)) mode >>= 3; } @@ -366,7 +398,6 @@ int generic_permission(struct inode *inode, int mask) if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO)) if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE)) return 0; - return -EACCES; } EXPORT_SYMBOL(generic_permission); @@ -1784,6 +1815,7 @@ static int walk_component(struct nameidata *nd, int flags) struct inode *inode; unsigned seq; int err; + const struct cred *cred; /* * "." and ".." are special - ".." 
especially so because it has * to be able to know about the current root directory and @@ -1795,25 +1827,31 @@ static int walk_component(struct nameidata *nd, int flags) put_link(nd); return err; } + cred = change_userns_creds(&nd->path); err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) - return err; + goto out; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); - if (IS_ERR(path.dentry)) - return PTR_ERR(path.dentry); + if (IS_ERR(path.dentry)) { + err = PTR_ERR(path.dentry); + goto out; + } path.mnt = nd->path.mnt; err = follow_managed(&path, nd); if (unlikely(err < 0)) - return err; + goto out; seq = 0; /* we are already out of RCU mode */ inode = d_backing_inode(path.dentry); } - return step_into(nd, &path, flags, inode, seq); + err = step_into(nd, &path, flags, inode, seq); + out: + revert_userns_creds(cred); + return err; } /* @@ -2067,8 +2105,10 @@ static int link_path_walk(const char *name, struct nameidata *nd) for(;;) { u64 hash_len; int type; + const struct cred *cred = change_userns_creds(&nd->path); err = may_lookup(nd); + revert_userns_creds(cred); if (err) return err; @@ -2242,12 +2282,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags) static const char *trailing_symlink(struct nameidata *nd) { const char *s; + const struct cred *cred = change_userns_creds(&nd->path); int error = may_follow_link(nd); - if (unlikely(error)) - return ERR_PTR(error); + if (unlikely(error)) { + s = ERR_PTR(error); + goto out; + } nd->flags |= LOOKUP_PARENT; nd->stack[0].name = NULL; s = get_link(nd); + out: + revert_userns_creds(cred); return s ? 
s : ""; } @@ -3273,6 +3318,7 @@ static int do_last(struct nameidata *nd, struct inode *inode; struct path path; int error; + const struct cred *cred = change_userns_creds(&nd->path); nd->flags &= ~LOOKUP_PARENT; nd->flags |= op->intent; @@ -3280,7 +3326,7 @@ static int do_last(struct nameidata *nd, if (nd->last_type != LAST_NORM) { error = handle_dots(nd, nd->last_type); if (unlikely(error)) - return error; + goto err; goto finish_open; } @@ -3293,7 +3339,7 @@ static int do_last(struct nameidata *nd, goto finish_lookup; if (error < 0) - return error; + goto err; BUG_ON(nd->inode != dir->d_inode); BUG_ON(nd->flags & LOOKUP_RCU); @@ -3306,12 +3352,14 @@ static int do_last(struct nameidata *nd, */ error = complete_walk(nd); if (error) - return error; + goto err; audit_inode(nd->name, dir, AUDIT_INODE_PARENT); /* trailing slashes? */ - if (unlikely(nd->last.name[nd->last.len])) - return -EISDIR; + if (unlikely(nd->last.name[nd->last.len])) { + error = -EISDIR; + goto err; + } } if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { @@ -3367,7 +3415,7 @@ static int do_last(struct nameidata *nd, error = follow_managed(&path, nd); if (unlikely(error < 0)) - return error; + goto err; /* * create/update audit record if it already exists. @@ -3376,7 +3424,8 @@ static int do_last(struct nameidata *nd, if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { path_to_nameidata(&path, nd); - return -EEXIST; + error = -EEXIST; + goto err; } seq = 0; /* out of RCU mode, so the value doesn't matter */ @@ -3384,12 +3433,12 @@ static int do_last(struct nameidata *nd, finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) - return error; + goto err; finish_open: /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... 
*/ error = complete_walk(nd); if (error) - return error; + goto err; audit_inode(nd->name, nd->path.dentry, 0); if (open_flag & O_CREAT) { error = -EISDIR; @@ -3431,6 +3480,8 @@ static int do_last(struct nameidata *nd, } if (got_write) mnt_drop_write(nd->path.mnt); + err: + revert_userns_creds(cred); return error; } @@ -3749,6 +3800,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode, struct path path; int error; unsigned int lookup_flags = 0; + const struct cred *cred; error = may_mknod(mode); if (error) @@ -3758,6 +3810,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode, if (IS_ERR(dentry)) return PTR_ERR(dentry); + cred = change_userns_creds(&path); if (!IS_POSIXACL(path.dentry->d_inode)) mode &= ~current_umask(); error = security_path_mknod(&path, dentry, mode, dev); @@ -3779,6 +3832,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode, } out: done_path_create(&path, dentry); + revert_userns_creds(cred); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; @@ -3829,18 +3883,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode) struct path path; int error; unsigned int lookup_flags = LOOKUP_DIRECTORY; + const struct cred *cred; retry: dentry = user_path_create(dfd, pathname, &path, lookup_flags); if (IS_ERR(dentry)) return PTR_ERR(dentry); + cred = change_userns_creds(&path); if (!IS_POSIXACL(path.dentry->d_inode)) mode &= ~current_umask(); error = security_path_mkdir(&path, dentry, mode); if (!error) error = vfs_mkdir(path.dentry->d_inode, dentry, mode); done_path_create(&path, dentry); + revert_userns_creds(cred); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; @@ -3907,12 +3964,14 @@ long do_rmdir(int dfd, const char __user *pathname) struct qstr last; int type; unsigned int lookup_flags = 0; + const struct cred *cred; retry: name = filename_parentat(dfd, getname(pathname), lookup_flags, &path, &last, &type); if 
(IS_ERR(name))
 		return PTR_ERR(name);

+	cred = change_userns_creds(&path);
 	switch (type) {
 	case LAST_DOTDOT:
 		error = -ENOTEMPTY;
@@ -3948,6 +4007,7 @@ long do_rmdir(int dfd, const char __user *pathname)
 	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	putname(name);
 	if (retry_estale(error, lookup_flags)) {
@@ -4037,11 +4097,13 @@ long do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);

+	cred = change_userns_creds(&path);
 	error = -EISDIR;
 	if (type != LAST_NORM)
 		goto exit1;
@@ -4079,6 +4141,7 @@ long do_unlinkat(int dfd, struct filename *name)
 	}
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -4143,6 +4206,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;

 	from = getname(oldname);
 	if (IS_ERR(from))
@@ -4153,6 +4217,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	if (IS_ERR(dentry))
 		goto out_putname;

+	cred = change_userns_creds(&path);
 	error = security_path_symlink(&path, dentry, from->name);
 	if (!error)
 		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
@@ -4161,6 +4226,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
+	revert_userns_creds(cred);
 out_putname:
 	putname(from);
 	return error;
@@ -4274,6 +4340,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	const struct cred *cred;

 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
 		return -EINVAL;
@@ -4301,6 +4368,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	if (IS_ERR(new_dentry))
 		goto out;

+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
@@ -4312,6 +4380,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
 out_dput:
+	revert_userns_creds(cred);
 	done_path_create(&new_path, new_dentry);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
@@ -4531,6 +4600,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
 	bool should_retry = false;
 	int error;
+	const struct cred *cred;

 	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
 		return -EINVAL;
@@ -4560,6 +4630,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 		goto exit1;
 	}

+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto exit2;
@@ -4644,6 +4715,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	}
 	mnt_drop_write(old_path.mnt);
 exit2:
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags))
 		should_retry = true;
 	path_put(&new_path);
diff --git a/fs/open.c b/fs/open.c
index 033e2112fbda..7cad2b723925 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+	const struct cred *cred;
 retry:
 	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
 	if (error)
 		goto out;

+	cred = change_userns_creds(&path);
 	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
@@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
 	set_fs_pwd(current->fs, &path);

 dput_and_out:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
 	struct fd f = fdget_raw(fd);
 	int error;
+	const struct cred *cred;

 	error = -EBADF;
 	if (!f.file)
 		goto out;

+	cred = change_userns_creds(&f.file->f_path);
 	error = -ENOTDIR;
 	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
@@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	struct iattr newattrs;
+	const struct cred *cred;
 	int error;

+	cred = change_userns_creds(path);
 	error = mnt_want_write(path->mnt);
 	if (error)
-		return error;
+		goto out;
 retry_deleg:
 	inode_lock(inode);
 	error = security_path_chmod(path, mode);
@@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
 			goto retry_deleg;
 	}
 	mnt_drop_write(path->mnt);
+ out:
+	revert_userns_creds(cred);
 	return error;
 }

@@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	struct path path;
 	int error = -EINVAL;
 	int lookup_flags;
+	const struct cred *cred;

 	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
 		goto out;
@@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (error)
 		goto out;
+	cred = change_userns_creds(&path);
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto out_release;
 	error = chown_common(&path, user, group);
 	mnt_drop_write(path.mnt);
 out_release:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 {
 	struct fd f = fdget(fd);
 	int error = -EBADF;
+	const struct cred *cred;

 	if (!f.file)
 		goto out;

+	cred = change_userns_creds(&f.file->f_path);
 	error = mnt_want_write_file(f.file);
 	if (error)
 		goto out_fput;
@@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 	error = chown_common(&f.file->f_path, user, group);
 	mnt_drop_write_file(f.file);
 out_fput:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
  */
 int vfs_open(const struct path *path, struct file *file)
 {
+	int ret;
+	const struct cred *cred = change_userns_creds(path);
+
 	file->f_path = *path;
-	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	revert_userns_creds(cred);
+	return ret;
 }

 struct file *dentry_open(const struct path *path, int flags,
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 84ad1c90d535..b5aa36261964 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
 					goto mask;
 				break;
 			case ACL_GROUP_OBJ:
-				if (in_group_p(inode->i_gid)) {
+				if (in_group_p_shifted(inode->i_gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
@@ -652,7 +652,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
 		return error;
 	if (error == 0)
 		*acl = NULL;
-	if (!in_group_p(inode->i_gid) &&
+	if (!in_group_p_shifted(inode->i_gid) &&
 	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 		mode &= ~S_ISGID;
 	*mode_p = mode;
diff --git a/fs/stat.c b/fs/stat.c
index c38e4c2e1221..0018b168d7a7 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -21,6 +21,8 @@
 #include <linux/uaccess.h>
 #include <asm/unistd.h>

+#include "mount.h"
+
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
  * @inode: Inode to use as the source
@@ -48,6 +50,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 }
 EXPORT_SYMBOL(generic_fillattr);

+static void shift_check(struct vfsmount *mnt, struct kstat *stat)
+{
+	struct mount *m = real_mount(mnt);
+	struct user_namespace *user_ns = m->mnt_ns->user_ns;
+
+	if ((mnt->mnt_flags & MNT_SHIFT) == 0)
+		return;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return;
+
+	stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
+	stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
+}
+
 /**
  * vfs_getattr_nosec - getattr without security checks
  * @path: file to get attributes from
@@ -65,6 +82,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 		      u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
+	int ret;

 	memset(stat, 0, sizeof(*stat));
 	stat->result_mask |= STATX_BASIC_STATS;
@@ -77,12 +95,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;

+	ret = 0;
 	if (inode->i_op->getattr)
-		return inode->i_op->getattr(path, stat, request_mask,
-					    query_flags);
+		ret = inode->i_op->getattr(path, stat, request_mask,
+					   query_flags);
+	else
+		generic_fillattr(inode, stat);

-	generic_fillattr(inode, stat);
-	return 0;
+	if (!ret)
+		shift_check(path->mnt, stat);
+
+	return ret;
 }
 EXPORT_SYMBOL(vfs_getattr_nosec);

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..8a5f2c9b613a 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
 extern void groups_free(struct group_info *);

 extern int in_group_p(kgid_t);
+extern int in_group_p_shifted(kgid_t);
 extern int in_egroup_p(kgid_t);
 extern int groups_search(const struct group_info *, kgid_t);

@@ -75,6 +76,10 @@ static inline int in_group_p(kgid_t grp)
 {
 	return 1;
 }
+static inline int in_group_p_shifted(kgid_t grp)
+{
+	return 1;
+}
 static inline int in_egroup_p(kgid_t grp)
 {
 	return 1;
@@ -422,4 +427,9 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)

+static inline bool cred_is_shifted(void)
+{
+	return current_cred() == current->mnt_cred;
+}
+
 #endif /* _LINUX_CRED_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..cdc5d981d594 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -46,7 +46,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_SHIFT)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )

 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
@@ -65,6 +65,8 @@ struct fs_context;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000

+#define MNT_SHIFT		0x10000000
+
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..d376dc7bcf76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -19,6 +19,7 @@
 #include <linux/plist.h>
 #include <linux/hrtimer.h>
 #include <linux/seccomp.h>
+#include <linux/mount.h>
 #include <linux/nodemask.h>
 #include <linux/rcupdate.h>
 #include <linux/refcount.h>
@@ -882,6 +883,10 @@ struct task_struct {
 	/* Effective (overridable) subjective task credentials (COW): */
 	const struct cred __rcu		*cred;

+	/* cache for uid/gid shifted cred tied to mnt */
+	struct cred			*mnt_cred;
+	struct vfsmount			*mnt;
+
 #ifdef CONFIG_KEYS
 	/* Cached requested key. */
 	struct key			*cached_requested_key;
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..3273e85a644c 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -486,8 +486,18 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
+
+	if (cred_is_shifted()) {
+		struct user_namespace *cns = current_user_ns();
+
+		i_uid = make_kuid(cns, __kuid_val(i_uid));
+		i_gid = make_kgid(cns, __kgid_val(i_gid));
+	}
+
+	return kuid_has_mapping(ns, i_uid) &&
+		kgid_has_mapping(ns, i_gid);
 }

 /**
diff --git a/kernel/cred.c b/kernel/cred.c
index c0a4c12d38b2..bbe0e2e64081 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -167,6 +167,8 @@ void exit_creds(struct task_struct *tsk)
 	validate_creds(cred);
 	alter_cred_subscribers(cred, -1);
 	put_cred(cred);
+	if (tsk->mnt_cred)
+		put_cred(tsk->mnt_cred);

 	cred = (struct cred *) tsk->cred;
 	tsk->cred = NULL;
@@ -318,6 +320,17 @@ struct cred *prepare_exec_creds(void)
 	return new;
 }

+static void flush_mnt_cred(struct task_struct *t)
+{
+	if (t->mnt_cred == t->cred)
+		return;
+	if (t->mnt_cred)
+		put_cred(t->mnt_cred);
+	t->mnt_cred = NULL;
+	/* mnt is only used for comparison, so it has no reference */
+	t->mnt = NULL;
+}
+
 /*
  * Copy credentials for the new process created by fork()
  *
@@ -344,6 +357,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 		) {
 		p->real_cred = get_cred(p->cred);
 		get_cred(p->cred);
+		p->mnt = NULL;
+		p->mnt_cred = NULL;
 		alter_cred_subscribers(p->cred, 2);
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
@@ -383,6 +398,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)

 	atomic_inc(&new->user->processes);
 	p->cred = p->real_cred = get_cred(new);
+	p->mnt = NULL;
+	p->mnt_cred = NULL;
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
@@ -506,6 +523,7 @@ int commit_creds(struct cred *new)
 	/* release the old obj and subj refs both */
 	put_cred(old);
 	put_cred(old);
+	flush_mnt_cred(task);
 	return 0;
 }
 EXPORT_SYMBOL(commit_creds);
@@ -564,6 +582,7 @@ const struct cred *override_creds(const struct cred *new)
 	alter_cred_subscribers(new, 1);
 	rcu_assign_pointer(current->cred, new);
 	alter_cred_subscribers(old, -1);
+	flush_mnt_cred(current);

 	kdebug("override_creds() = %p{%d,%d}", old,
 	       atomic_read(&old->usage),
@@ -589,6 +608,7 @@ void revert_creds(const struct cred *old)

 	validate_creds(old);
 	validate_creds(override);
+	flush_mnt_cred(current);
 	alter_cred_subscribers(old, 1);
 	rcu_assign_pointer(current->cred, old);
 	alter_cred_subscribers(override, -1);
diff --git a/kernel/groups.c b/kernel/groups.c
index daae2f2dc6d4..772b49a367b0 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -228,6 +228,13 @@ int in_group_p(kgid_t grp)

 EXPORT_SYMBOL(in_group_p);

+int in_group_p_shifted(kgid_t grp)
+{
+	if (cred_is_shifted())
+		grp = make_kgid(current_user_ns(), __kgid_val(grp));
+	return in_group_p(grp);
+}
+
 int in_egroup_p(kgid_t grp)
 {
 	const struct cred *cred = current_cred();
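
For readers following the id arithmetic in shift_check() and in_group_p_shifted(), here is a minimal userspace sketch of the same transformation. It is not kernel code. It assumes a user_ns with a single mapping extent (interior 0 mapped to kernel 100000, as in the commit message example; the count of 65536 is an assumption). struct uid_extent and the model_*() helpers are hypothetical names invented for this illustration; only the idea of treating the raw on-disk value as a namespace-relative id and passing it through make_kuid() comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_UID ((uint32_t)-1)

/* Hypothetical one-extent model of a user namespace uid map. */
struct uid_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first id as seen by the kernel (kuid) */
	uint32_t count;       /* number of ids covered by the extent */
};

/* Model of make_kuid(): namespace-relative id -> kernel id. */
static uint32_t model_make_kuid(const struct uid_extent *map, uint32_t uid)
{
	assert(map->count > 0);
	if (uid < map->first || uid - map->first >= map->count)
		return MODEL_INVALID_UID;
	return map->lower_first + (uid - map->first);
}

/* Model of from_kuid(): kernel id -> namespace-relative id. */
static uint32_t model_from_kuid(const struct uid_extent *map, uint32_t kuid)
{
	assert(map->count > 0);
	if (kuid < map->lower_first || kuid - map->lower_first >= map->count)
		return MODEL_INVALID_UID;
	return map->first + (kuid - map->lower_first);
}

/*
 * Model of the uid half of shift_check(): when the mount carries
 * MNT_SHIFT and the caller lives in the mount namespace's user_ns,
 * the raw on-disk value is reinterpreted as an id inside that
 * namespace; otherwise it is passed through untouched.
 */
static uint32_t model_shift_stat_uid(const struct uid_extent *map,
				     int mnt_shift, int same_user_ns,
				     uint32_t disk_kuid)
{
	if (!mnt_shift || !same_user_ns)
		return disk_kuid;
	return model_make_kuid(map, disk_kuid);
}
```

With the extent { 0, 100000, 65536 }, a file owned by uid 0 on disk comes out of model_shift_stat_uid() as kuid 100000, which model_from_kuid() then presents to the container as uid 0 again. An on-disk id outside the extent comes out as the invalid id, which in this model mirrors what make_kuid() does for an unmapped value.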

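
The cache handling in the kernel/cred.c hunks can be modelled in userspace as well. The sketch below is an illustration only: struct model_cred, struct model_task and the model_*() functions are hypothetical stand-ins with a plain integer refcount, not the kernel's struct cred or put_cred(). It shows the rule flush_mnt_cred() appears to follow: leave the cache alone while the cached shifted cred is the task's active cred, otherwise drop the cached reference and forget the mount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cred with a plain refcount. */
struct model_cred {
	int usage;
};

/* Hypothetical stand-in for the three task_struct fields involved. */
struct model_task {
	const struct model_cred *cred; /* active credentials */
	struct model_cred *mnt_cred;   /* cached shifted cred, owns one ref */
	const void *mnt;               /* mount the cache belongs to, no ref */
};

/* Model of put_cred(): drop one reference. */
static void model_put_cred(struct model_cred *c)
{
	assert(c->usage > 0);
	c->usage--;
}

/* Model of flush_mnt_cred() from the patch. */
static void model_flush_mnt_cred(struct model_task *t)
{
	/* The shifted cred is currently in use: keep the cache. */
	if (t->mnt_cred == t->cred)
		return;
	if (t->mnt_cred)
		model_put_cred(t->mnt_cred);
	t->mnt_cred = NULL;
	/* mnt is only compared, never dereferenced, so nothing to put */
	t->mnt = NULL;
}
```

In this model, flushing a task whose active cred differs from the cached one drops the cache's reference and clears both fields, while flushing during a shifted override is a no-op.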