Message ID | 20180705155120.22102-1-christian@brauner.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Your description is usesless.
It needs to detail exactly what breaks, what regressions and why.
All I see below is hand waving.
We need to know why this does not work so someone does not come in and try
this again. Or so that someone can fix this and then try again.
You do not include that kind of information in your commit log.
Calling mknod to create device nodes can not be widespread. There are
not that many privileged processes and calling mknod outside of being
a specialed process like udev is broken.
Therefore I refute your assertion that this is a widespread issue.
I expect somewhere there is a reasonable argument for reverting this
change on the basis that it causes a regression. You have not made it.
Until that time I am going to oppose this revert because your
justfication for the revert is lacking.
It has never been the case that mknod on a device node will guarantee
that you even can open the device node. The applications that regress
are broken. It doesn't mean we shouldn't be bug compatible, but we darn
well should document very clearly the bugs we are being bug compatible
with.
Eric
On Thu, Jul 05, 2018 at 11:48:11AM -0500, Eric W. Biederman wrote: > > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > Your description is usesless. > > It needs to detail exactly what breaks, what regressions and why. > All I see below is hand waving. > > We need to know why this does not work so someone does not come in and try > this again. Or so that someone can fix this and then try again. > > You do not include that kind of information in your commit log. My commit log explicitly states that if you run systemd services in a user namespace with PrivateDevices=true it will fail as soon as anything tries to open such a device node. Before that change this worked just fine. The commit log also leads to the related thread. I can come up with a list of services that fail to start if that helps. > > Calling mknod to create device nodes can not be widespread. There are Well, there are a few. There's the container runtimes (aka systemd-nspawn, rkt, runC, LXC, LXD), udev, systemd, openrc to name just a few. Recently we even worked on udev being useable in user namespaces. Please also note, that a lot of applications were also switched to fallback to bind-mounts on mknod() permission failures since this was the easiest and least costly way to deal with all of the LSMs, user namespaces, capability dropping, and seccomp. They all would need way more complex logic to decide whether to fallback to a bind-mount or not. > not that many privileged processes and calling mknod outside of being > a specialed process like udev is broken. > > Therefore I refute your assertion that this is a widespread issue. > > > I expect somewhere there is a reasonable argument for reverting this > change on the basis that it causes a regression. You have not made it. Fair enough, I can rewrite the commit message and focs on the container workload and container runtime regressions as a clear example if that seems a sufficient argument to you. > > Until that time I am going to oppose this revert because your > justfication for the revert is lacking. I sympathize with you wanting a proper and thorough justification and I'm sorry if I apparently did not provided exhaustive details. However, I think (see above) that I've provided at least a sufficient argument in my commit log to start a reasonable discussion about this that doesn't end with "You're saying the kernel is broken. I'm saying userspace is broken.". > > > It has never been the case that mknod on a device node will guarantee > that you even can open the device node. The applications that regress > are broken. It doesn't mean we shouldn't be bug compatible, but we darn It seems a fair assumption to me that an object you created (with the _right permissions_) you can also interact with. Also if I may retort, I see no good argument why the applications are broken. > well should document very clearly the bugs we are being bug compatible > with. > > Eric > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers
ebiederm@xmission.com (Eric W. Biederman) writes: > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > Your description is usesless. > > It needs to detail exactly what breaks, what regressions and why. > All I see below is hand waving. > > We need to know why this does not work so someone does not come in and try > this again. Or so that someone can fix this and then try again. > > You do not include that kind of information in your commit log. > > Calling mknod to create device nodes can not be widespread. There are > not that many privileged processes and calling mknod outside of being > a specialed process like udev is broken. > > Therefore I refute your assertion that this is a widespread issue. > > > I expect somewhere there is a reasonable argument for reverting this > change on the basis that it causes a regression. You have not made it. > > Until that time I am going to oppose this revert because your > justfication for the revert is lacking. > > > It has never been the case that mknod on a device node will guarantee > that you even can open the device node. The applications that regress > are broken. It doesn't mean we shouldn't be bug compatible, but we darn > well should document very clearly the bugs we are being bug compatible > with. > Further from what I have seen of this issue, there is a compelling case that what the applications that are broken what what is enabled by allowing mknod to succeed. So we absolutely need a good description of what is going on, because at best a revert to fix today's breaking is temporary until userspace gets their bugs fixed. Eric
diff --git a/fs/namei.c b/fs/namei.c index 734cef54fdf8..389e48e93542 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3711,8 +3711,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) if (error) return error; - if ((S_ISCHR(mode) || S_ISBLK(mode)) && - !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD)) + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD)) return -EPERM; if (!dir->i_op->mknod)
This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd. commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: https://github.com/systemd/systemd/pull/9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> --- fs/namei.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)