Message ID | 20180823223145.GK6515@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [git,pull] new mount API | expand |
On Thu, Aug 23, 2018 at 3:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > new mount API series from Dave Howells > > To quote his cover letter, > Here are a set of patches to create a filesystem context prior to setting > up a new mount, populating it with the parsed options/binary data, creating > the superblock and then effecting the mount. This is also used for remount > since much of the parsing stuff is common in many filesystems. > > This allows namespaces and other information to be conveyed through the > mount procedure. > > This also allows Miklós Szeredi's idea of doing: > > fd = fsopen("nfs"); > fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0); > fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); > mfd = fsmount(fd, MS_NODEV); > move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); > > that he presented at LSF-2017 to be implemented (see the relevant patches > in the series). > > I didn't use netlink as that would make the core kernel depend on > CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing > issues. > > I've implemented filesystem context handling for procfs, nfs, mqueue, > cpuset, kernfs, sysfs, cgroup and afs filesystems. > > Unconverted filesystems are handled by a legacy filesystem wrapper. > > One trivial conflict in fs/file_table.c:__fput(); resolved as > if (unlikely(mode & FMODE_NEED_UNMOUNT)) > dissolve_on_fput(mnt); > dput(dentry); > mntput(mnt); > out: > file_free(file); > Has anything been done to ensure that the behavior when doing FSCONFIG_CMD_CREATE against an already-mounted block device is reasonable?
Andy Lutomirski <luto@amacapital.net> wrote: > Has anything been done to ensure that the behavior when doing > FSCONFIG_CMD_CREATE against an already-mounted block device is > reasonable? For the moment, I've left it as the same behaviour as for mount(2) since mount(2) now uses the same mechanism internally and we aren't permitted to break userspace. I would like to add at least one flag to stipulate that, in the case of an incompatible collision, you can get a failure - but defining what is meant by incompatible isn't necessarily trivial, and would vary by filesystem *and* the LSM. However, I don't want to start reengineering everything this close to the merge window and we don't really need it immediately. David
On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote: > Andy Lutomirski <luto@amacapital.net> wrote: > >> Has anything been done to ensure that the behavior when doing >> FSCONFIG_CMD_CREATE against an already-mounted block device is >> reasonable? > > For the moment, I've left it as the same behaviour as for mount(2) since > mount(2) now uses the same mechanism internally and we aren't permitted to > break userspace. > > I would like to add at least one flag to stipulate that, in the case of an > incompatible collision, you can get a failure - but defining what is meant by > incompatible isn't necessarily trivial, and would vary by filesystem *and* the > LSM. > > However, I don't want to start reengineering everything this close to the > merge window and we don't really need it immediately. > The problem is that, once this ends up merged, then we're kind of stuck with it, too. It would be a bit sad if your better proposal for handling nfs and instantiation of filesystems in general were added in the next release and then we end up with the current FSCONFIG_CMD_CREATE as a permanently supported but non-preferred option.
On Thu, Aug 23, 2018 at 05:16:53PM -0700, Andy Lutomirski wrote: > On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote: > > Andy Lutomirski <luto@amacapital.net> wrote: > > > >> Has anything been done to ensure that the behavior when doing > >> FSCONFIG_CMD_CREATE against an already-mounted block device is > >> reasonable? > > > > For the moment, I've left it as the same behaviour as for mount(2) since > > mount(2) now uses the same mechanism internally and we aren't permitted to > > break userspace. > > > > I would like to add at least one flag to stipulate that, in the case of an > > incompatible collision, you can get a failure - but defining what is meant by > > incompatible isn't necessarily trivial, and would vary by filesystem *and* the > > LSM. > > > > However, I don't want to start reengineering everything this close to the > > merge window and we don't really need it immediately. > > > > The problem is that, once this ends up merged, then we're kind of > stuck with it, too. It would be a bit sad if your better proposal for > handling nfs and instantiation of filesystems in general were added in > the next release and then we end up with the current > FSCONFIG_CMD_CREATE as a permanently supported but non-preferred > option. For fuck sake, mount(2) is a permanently supported option! Folks, get over it - you are mixing entirely different issues. You know, I know and everyone even remotely sane knows that mount(2) *IS* *NOT* *GOING* *AWAY*. And it's not changing semantics either. So bemoaning the "permanently supported non-preferred options" is utter lunacy. It's there and it will remain there, period. We do not break userland. And "their local scripts would break terribly inside userns container" does *NOT* render those second-class in any sense. So you can curse the current behaviour of mount(2) and I even agree with some of that, but we are not going to be able to remove that. Yes, it would've been nice if, etc., but it's not going to happen. Not now, not for many years.
> On Aug 23, 2018, at 5:31 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote: > >> On Thu, Aug 23, 2018 at 05:16:53PM -0700, Andy Lutomirski wrote: >>> On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote: >>> Andy Lutomirski <luto@amacapital.net> wrote: >>> >>>> Has anything been done to ensure that the behavior when doing >>>> FSCONFIG_CMD_CREATE against an already-mounted block device is >>>> reasonable? >>> >>> For the moment, I've left it as the same behaviour as for mount(2) since >>> mount(2) now uses the same mechanism internally and we aren't permitted to >>> break userspace. >>> >>> I would like to add at least one flag to stipulate that, in the case of an >>> incompatible collision, you can get a failure - but defining what is meant by >>> incompatible isn't necessarily trivial, and would vary by filesystem *and* the >>> LSM. >>> >>> However, I don't want to start reengineering everything this close to the >>> merge window and we don't really need it immediately. >>> >> >> The problem is that, once this ends up merged, then we're kind of >> stuck with it, too. It would be a bit sad if your better proposal for >> handling nfs and instantiation of filesystems in general were added in >> the next release and then we end up with the current >> FSCONFIG_CMD_CREATE as a permanently supported but non-preferred >> option. > > For fuck sake, mount(2) is a permanently supported option! Exactly. mount(2) has broken semantics and it’s permanently supported. If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics.
On Thu, Aug 23, 2018 at 07:36:15PM -0700, Andy Lutomirski wrote: > > For fuck sake, mount(2) is a permanently supported option! > > Exactly. mount(2) has broken semantics and it’s permanently supported. > > If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics. Oh no - mount(2) behaviour can be expressed that way! The horror... Andy, this is bullshit. You are saying that dealing with mount(2) mess of ABI (badly unorthogonal set of operations, overloading from hell, etc.) must be tied with massive rework of fs drivers. Why? It's a simple enough question and pardon me, but your "it's broken" doesn't inspire any confidence that you even understand what the current behaviour is and what its problems (they are real enough) are. I have seen a lot of handwaving from you in these threads, combined with "surely, it must work thus", followed with "it's a special case/class/whatnot" or "surely, it must be broken" every time you find something that doesn't "work thus". Bugger it; explain why we must combine untangling the existing ABI (and we *will* have to keep the existing semantics possible to express for sys_mount() sake) with this, or with anything else, for that matter.
On Thu, Aug 23, 2018 at 8:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Aug 23, 2018 at 07:36:15PM -0700, Andy Lutomirski wrote: > >> > For fuck sake, mount(2) is a permanently supported option! >> >> Exactly. mount(2) has broken semantics and it’s permanently supported. >> >> If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics. > > Oh no - mount(2) behaviour can be expressed that way! The horror... > > Andy, this is bullshit. You are saying that dealing with mount(2) > mess of ABI (badly unorthogonal set of operations, overloading from > hell, etc.) must be tied with massive rework of fs drivers. Why? > When this was reviewed earlier, a problem was identified. I asked if it had been addressed. I did *not* say that it was mandatory to address it, nor did I say anything about reworking fs drivers. A reasonable answer might have been "avoiding this pitfall in the new API would involve a large amount of reworking of existing filesystem drivers. I think that the new API, as is, has enough benefits that it makes sense to merge it even with this pitfall, and, if needed, we can introduce an improved version down the road." I had at least hoped for a better answer than "bugger it." --Andy
On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote: > When this was reviewed earlier, a problem was identified. I asked if > it had been addressed. I did *not* say that it was mandatory to > address it, nor did I say anything about reworking fs drivers. > > A reasonable answer might have been "avoiding this pitfall in the new > API would involve a large amount of reworking of existing filesystem > drivers. I think that the new API, as is, has enough benefits that it > makes sense to merge it even with this pitfall, and, if needed, we can > introduce an improved version down the road." It's also not clear that an API that you think is "cleaner" would actually be more usable. In fact, I believe it's going to be a sh*t show for userspace, because it won't be obvious what will work, and what will cause an error of the form, "sorry we can't do this cleaner thing that some people think is better". Which means a huge amount of special casing in the program, or a lot of very surprising failures that will then get exposed to the system administrator, many of whom haven't really had much of a problem with the existing mount(8) user interface. It may very well be that the "cleaner" approach will be hellacious from a human usability perspective. Figuring all of this out could take months and months (stalling the new mount API for a long time), and we may never know for sure until we implement a full prototype of kernel changes, massive rewrite of the file systems, and userspace programs --- and then see what is the best way to expose the radically different semantics depending on what individual file systems can and can not do to the poor human system administrator who is just trying to mount the d*mned file system. - Ted
On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote: > On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote: >> When this was reviewed earlier, a problem was identified. I asked if >> it had been addressed. I did *not* say that it was mandatory to >> address it, nor did I say anything about reworking fs drivers. >> >> A reasonable answer might have been "avoiding this pitfall in the new >> API would involve a large amount of reworking of existing filesystem >> drivers. I think that the new API, as is, has enough benefits that it >> makes sense to merge it even with this pitfall, and, if needed, we can >> introduce an improved version down the road." > > It's also not clear that an API that you think is "cleaner" would > actually be more usable. In fact, I believe it's going to be a sh*t > show for userspace, because it won't be obvious what will work, and > what will cause an error of the form, "sorry we can't do this cleaner > thing that some people think is better". Which means a huge amount of > special casing in the program, or a lot of very surprising failures > that will then get exposed to the system administrator, many of whom > haven't really had much of a problem with the existing mount(8) user > interface. In what way is the kernel better suited to read the mind of the poor sysadmin, than a userspace helper program? No, I don't think this argument is about the mount(8) interface, which is what the sysadmin is using and can continue to use for the foreseeable future, with basically the same feature set It's about new uses of the mount(2) API that *would* be possible if it was a saner interface. And so doing the new mount API radically different than what current mount(2) is doing is quite all right in my opinion. > It may very well be that the "cleaner" approach will be hellacious > from a human usability perspective. Figuring all of this out could > take months and months (stalling the new mount API for a long time), > and we may never know for sure until we implement a full prototype of > kernel changes, massive rewrite of the file systems, and userspace > programs --- and then see what is the best way to expose the radically > different semantics depending on what individual file systems can and > can not do to the poor human system administrator who is just trying > to mount the d*mned file system. I'm not against merging this patchset (have nits about resuing the MS_* constants for the new API that I've complained about, but that's really easy to fix before -final), but it may make sense to differentiate the legacy behavior from a saner one from the very start. I.e. rename FSCONFIG_CMD_CREATE to something implying it's actually *not* a create in exclusive sense that one would first imply (and yeah, we have creat() and O_CREAT, which don't imply exclusivity, yet they at least have clear semantics that current super block creation does definitely not). Also, we don't have a massive conversion of filesystems yet to the new internal API, so we could require anything converted to support the exclusive create semantics. None of this requires months of figuring out, just a bit of tweaking and acknowledging that the current semantics are weird, to say the least. Thanks, Miklos
On Fri, Aug 24, 2018 at 10:38 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote: >> On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote: >>> When this was reviewed earlier, a problem was identified. I asked if >>> it had been addressed. I did *not* say that it was mandatory to >>> address it, nor did I say anything about reworking fs drivers. >>> >>> A reasonable answer might have been "avoiding this pitfall in the new >>> API would involve a large amount of reworking of existing filesystem >>> drivers. I think that the new API, as is, has enough benefits that it >>> makes sense to merge it even with this pitfall, and, if needed, we can >>> introduce an improved version down the road." >> >> It's also not clear that an API that you think is "cleaner" would >> actually be more usable. In fact, I believe it's going to be a sh*t >> show for userspace, because it won't be obvious what will work, and >> what will cause an error of the form, "sorry we can't do this cleaner >> thing that some people think is better". Which means a huge amount of >> special casing in the program, or a lot of very surprising failures >> that will then get exposed to the system administrator, many of whom >> haven't really had much of a problem with the existing mount(8) user >> interface. > > In what way is the kernel better suited to read the mind of the poor > sysadmin, than a userspace helper program? I have a concrete example: mount -oloop. You can leave it to mount(8) to automagically find an existing loop device or setup a new one, or you can do the low level thing and set up your own loop device and mount it. Same story as NFS and friends, except it's not the kernel that does the magic "same source -> same sb" policy but the mount(8) utility. Ever notice the difference? See? And yeah, there are races involved, and userspace is perfectly suited to deal with them. Thanks, Miklos
On Fri, Aug 24, 2018 at 10:56 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > On Fri, Aug 24, 2018 at 10:38 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: >> On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote: >>> On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote: >>>> When this was reviewed earlier, a problem was identified. I asked if >>>> it had been addressed. I did *not* say that it was mandatory to >>>> address it, nor did I say anything about reworking fs drivers. >>>> >>>> A reasonable answer might have been "avoiding this pitfall in the new >>>> API would involve a large amount of reworking of existing filesystem >>>> drivers. I think that the new API, as is, has enough benefits that it >>>> makes sense to merge it even with this pitfall, and, if needed, we can >>>> introduce an improved version down the road." >>> >>> It's also not clear that an API that you think is "cleaner" would >>> actually be more usable. In fact, I believe it's going to be a sh*t >>> show for userspace, because it won't be obvious what will work, and >>> what will cause an error of the form, "sorry we can't do this cleaner >>> thing that some people think is better". Which means a huge amount of >>> special casing in the program, or a lot of very surprising failures >>> that will then get exposed to the system administrator, many of whom >>> haven't really had much of a problem with the existing mount(8) user >>> interface. >> >> In what way is the kernel better suited to read the mind of the poor >> sysadmin, than a userspace helper program? > > I have a concrete example: mount -oloop. You can leave it to > mount(8) to automagically find an existing loop device or setup a new > one, or you can do the low level thing and set up your own loop device > and mount it. Same story as NFS and friends, except it's not the > kernel that does the magic "same source -> same sb" policy but the > mount(8) utility. Ever notice the difference? See? And yeah, there > are races involved, and userspace is perfectly suited to deal with > them. And to continue from that thought, a namespace for filesystem instances could, for example, live under /dev/fs/$FSTYPE/$INSTANCE Naming the $INSTANCE could be done by the one creating that instance, or in case of legacy mounts could have a fixed prefix + sequence number, or something similar. All this could come later, let's just not exclude the possibility of using the new API in this way. Thanks, Miklos
Miklos Szeredi <miklos@szeredi.hu> wrote: > I'm not against merging this patchset (have nits about resuing the > MS_* constants for the new API that I've complained about, but that's > really easy to fix before -final), Can you send me a patch that does what you want here? They're only used by fsmount() for creating a vfsmount and not used for the superblock creation or reconfiguration. > but it may make sense to differentiate the legacy behavior from a saner one > from the very start. I.e. rename FSCONFIG_CMD_CREATE I would suggest leaving it as-is and add an FSCONFIG_CMD_CREATE_EXCLUSIVE. > to something implying it's actually *not* a create in exclusive sense that > one would first imply (and yeah, we have creat() and O_CREAT, which don't > imply exclusivity, yet they at least have clear semantics that current super > block creation does definitely not). The problem is that "exclusivity" isn't necessarily an easy thing to define. Take nfs4 and btrfs for example. They creating a backing superblock that the actual node is derived from (though in different ways). How do you define what "exclusive" means in their case? David
On Fri, Aug 24, 2018 at 11:45 AM, David Howells <dhowells@redhat.com> wrote: > Miklos Szeredi <miklos@szeredi.hu> wrote: > >> I'm not against merging this patchset (have nits about resuing the >> MS_* constants for the new API that I've complained about, but that's >> really easy to fix before -final), > > Can you send me a patch that does what you want here? They're only used by > fsmount() for creating a vfsmount and not used for the superblock creation or > reconfiguration. > >> but it may make sense to differentiate the legacy behavior from a saner one >> from the very start. I.e. rename FSCONFIG_CMD_CREATE > > I would suggest leaving it as-is and add an FSCONFIG_CMD_CREATE_EXCLUSIVE. > >> to something implying it's actually *not* a create in exclusive sense that >> one would first imply (and yeah, we have creat() and O_CREAT, which don't >> imply exclusivity, yet they at least have clear semantics that current super >> block creation does definitely not). > > The problem is that "exclusivity" isn't necessarily an easy thing to define. > Take nfs4 and btrfs for example. They creating a backing superblock that the > actual node is derived from (though in different ways). How do you define > what "exclusive" means in their case? Exclusive creation of a filesystem instance is just that: we promise that no one else is using that filesystem instance, and we can set options as we like, guaranteed that no bother to any other uses. Now, with block filesystems that can only be done once, obviously, so if someone already created a btrfs instance using /dev/foobar then trying to create another instance will fail with EEXIST, EBUSY or whaterver. For NFS it's a different story, it *is* possible to create several instances of the filesystem even if connecting to the same server, except then they will not share caches, will not have local coherency, etc.. Regardless, I think a "filesystem instance" is a pretty clear concept, that we can base a future API on and it's not just an internal implementation detail that the current mount(2) API makes look like. It *is* in fact the only concept that makes FSCONFIG_CMD_RECONFIGURE have sane meaning (i.e. I want to change *that* instance, and am fully aware what affect that will have on mounts of that instance). Compare that to "mount -o remount path"; that one would imply we are changing something for a mount that is at "path", but if there are more than one users of that instance (which is based on magic algorithm inside the filesystem) than that might have unwanted side effects. It's what Eric's complaint is about, and it's I think the root of the issue at hand. Thanks, Miklos
On Fri, Aug 24, 2018 at 10:45:31AM +0100, David Howells wrote: > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > I'm not against merging this patchset (have nits about resuing the > > MS_* constants for the new API that I've complained about, but that's > > really easy to fix before -final), > > Can you send me a patch that does what you want here? They're only used by > fsmount() for creating a vfsmount and not used for the superblock creation or > reconfiguration. I'm bothered by the fact that we use the same MS_ prefix in the old mount(2) api and the new fsmount(2) api. What happens if we introduce new flags for fsmount(2) and are already out of flags for mount(2)? I see a big mess that way. Also notice, how the old code just totally ignored MS_RELATIME? Bit of rot in the new interface already. So here's a patch. The MNT_ prefix is already used by libmount, and we don't want any confusion arising from that. So how about M_*? Short and sweet, just like O_* for open(2). Reusing the same values as MNT_ isn't important, but it does reduce the code size a bit. Especially if we'd switch to MNT_STRICTATIME internally as well. Thanks, Miklos --- diff --git a/fs/namespace.c b/fs/namespace.c index e34e3fd064b0..a4b8503f80b8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3351,7 +3351,7 @@ EXPORT_SYMBOL_GPL(kern_mount); * Create a kernel mount representation for a new, prepared superblock * (specified by fs_fd) and attach to an open_tree-like file descriptor. */ -SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags) +SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, m_flags) { struct fs_context *fc; struct file *file; @@ -3366,27 +3366,21 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags if ((flags & ~(FSMOUNT_CLOEXEC)) != 0) return -EINVAL; - if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC | - MS_NOATIME | MS_NODIRATIME | MS_RELATIME | - MS_STRICTATIME)) + if (m_flags & ~(M_NOSUID | M_NODEV | M_NOEXEC | M_NOATIME | + M_NODIRATIME | M_STRICTATIME | M_RDONLY)) return -EINVAL; - if (ms_flags & MS_RDONLY) - mnt_flags |= MNT_READONLY; - if (ms_flags & MS_NOSUID) - mnt_flags |= MNT_NOSUID; - if (ms_flags & MS_NODEV) - mnt_flags |= MNT_NODEV; - if (ms_flags & MS_NOEXEC) - mnt_flags |= MNT_NOEXEC; - if (ms_flags & MS_NODIRATIME) - mnt_flags |= MNT_NODIRATIME; + BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV || + M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME || + M_NODIRATIME != MNT_NODIRATIME || + M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY); + + /* Relatime is the default, but internally that's what we flag */ + mnt_flags = m_flags & ~M_STRICTATIME; - if (ms_flags & MS_STRICTATIME) { - if (ms_flags & MS_NOATIME) + if (m_flags & M_STRICTATIME) { + if (m_flags & M_NOATIME) return -EINVAL; - } else if (ms_flags & MS_NOATIME) { - mnt_flags |= MNT_NOATIME; } else { mnt_flags |= MNT_RELATIME; } diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h index 3634e065836c..331e69f5a7bf 100644 --- a/include/uapi/linux/mount.h +++ b/include/uapi/linux/mount.h @@ -1,6 +1,15 @@ #ifndef _UAPI_LINUX_MOUNT_H #define _UAPI_LINUX_MOUNT_H +/* Mount flags passed to fsmount(2) */ +#define M_NOSUID 0x01 +#define M_NODEV 0x02 +#define M_NOEXEC 0x04 +#define M_NOATIME 0x08 +#define M_NODIRATIME 0x10 +#define M_STRICTATIME 0x20 +#define M_RDONLY 0x40 + /* * These are the fs-independent mount-flags: up to 32 flags are supported *
Miklos Szeredi <miklos@szeredi.hu> wrote: > +/* Mount flags passed to fsmount(2) */ > +#define M_NOSUID 0x01 > +#define M_NODEV 0x02 > +#define M_NOEXEC 0x04 > +#define M_NOATIME 0x08 > +#define M_NODIRATIME 0x10 > +#define M_STRICTATIME 0x20 > +#define M_RDONLY 0x40 If we're going to do this, I would suggest a longer prefix than just 'M' and renumber them to put *_RDONLY first. > + BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV || > + M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME || > + M_NODIRATIME != MNT_NODIRATIME || > + M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY); Please don't, please do: if (ms_flags & M_RDONLY) mnt_flags |= MNT_READONLY; Yes, and at some point I'd also like to compress the numbering on the SB_* constants and break the identity as Christoph suggested. David
On Fri, Aug 24, 2018 at 04:18:10PM +0200, Miklos Szeredi wrote: > I'm bothered by the fact that we use the same MS_ prefix in the old mount(2) api > and the new fsmount(2) api. What happens if we introduce new flags for > fsmount(2) and are already out of flags for mount(2)? I see a big mess that > way. > > Also notice, how the old code just totally ignored MS_RELATIME? Bit of rot in > the new interface already. > > So here's a patch. The MNT_ prefix is already used by libmount, and we don't > want any confusion arising from that. So how about M_*? Short and sweet, just > like O_* for open(2). +1 Please.
On Fri, Aug 24, 2018 at 4:26 PM, David Howells <dhowells@redhat.com> wrote: > Miklos Szeredi <miklos@szeredi.hu> wrote: > >> +/* Mount flags passed to fsmount(2) */ >> +#define M_NOSUID 0x01 >> +#define M_NODEV 0x02 >> +#define M_NOEXEC 0x04 >> +#define M_NOATIME 0x08 >> +#define M_NODIRATIME 0x10 >> +#define M_STRICTATIME 0x20 >> +#define M_RDONLY 0x40 > > If we're going to do this, I would suggest a longer prefix than just 'M' Any suggestions? We could do "MOUNT_", but that sounds tad too long. But hey, that doesn't matter much either. Let me know your preference. > and > renumber them to put *_RDONLY first. > >> + BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV || >> + M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME || >> + M_NODIRATIME != MNT_NODIRATIME || >> + M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY); > > Please don't, please do: > > if (ms_flags & M_RDONLY) > mnt_flags |= MNT_READONLY; As I said, I don't really care either way. Thanks, Miklos
> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote: > > > The problem is that "exclusivity" isn't necessarily an easy thing to define. > Take nfs4 and btrfs for example. They creating a backing superblock that the > actual node is derived from (though in different ways). How do you define > what "exclusive" means in their case? > I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.” This means, at least, that all options need to be respected as specified (or fail if they can’t be) and future reconfigure/remount actions only affect *this* instance and not others. NFS might already meet these requirements, although I’m not sure about the latter one. But it might be that adding CREATE_EXCLUSIVE is not all that useful and it would be better to have an API like you sketched where you first explicitly instantiate the driver and then you create however many trees (a la fsmount()) you want, where each tree has its own set of per-tree options like “subvol” for btrfs.
On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > >> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote: > >> >> >> The problem is that "exclusivity" isn't necessarily an easy thing to define. >> Take nfs4 and btrfs for example. They creating a backing superblock that the >> actual node is derived from (though in different ways). How do you define >> what "exclusive" means in their case? >> > > I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.” The only way to get semantically equivalent instance is to create a fully independent instance. See reconfigure (nee remount). Thanks, Miklos
> On Aug 24, 2018, at 8:02 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > >> On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> >> >>>> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote: >>> >>> >>> >>> The problem is that "exclusivity" isn't necessarily an easy thing to define. >>> Take nfs4 and btrfs for example. They creating a backing superblock that the >>> actual node is derived from (though in different ways). How do you define >>> what "exclusive" means in their case? >>> >> >> I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.” > > The only way to get semantically equivalent instance is to create a > fully independent instance. See reconfigure (nee remount). > > Hmm. Is it that case in the current patchset that you can do CMD_CREATE and reconfigure the result and some *other* existing mount will change? If so, that’s rather unfriendly to users.
On Fri, Aug 24, 2018 at 5:09 PM, Andy Lutomirski <luto@amacapital.net> wrote: > >> On Aug 24, 2018, at 8:02 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: >> >>> On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>> >>> >>>>> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote: >>>> >>>> >>>> >>>> The problem is that "exclusivity" isn't necessarily an easy thing to define. >>>> Take nfs4 and btrfs for example. They creating a backing superblock that the >>>> actual node is derived from (though in different ways). How do you define >>>> what "exclusive" means in their case? >>>> >>> >>> I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.” >> >> The only way to get semantically equivalent instance is to create a >> fully independent instance. See reconfigure (nee remount). >> >> > > Hmm. Is it that case in the current patchset that you can do CMD_CREATE and reconfigure the result and some *other* existing mount will change? If so, that’s rather unfriendly to users. Exactly. Thanks, Miklos
Andy Lutomirski <luto@amacapital.net> wrote: > Hmm. Is it that case in the current patchset that you can do CMD_CREATE and > reconfigure the result and some *other* existing mount will change? If so, > that’s rather unfriendly to users. The default behaviour has to be the same as mount(2). David
On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote: > Andy Lutomirski <luto@amacapital.net> wrote: > >> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and >> reconfigure the result and some *other* existing mount will change? If so, >> that’s rather unfriendly to users. > > The default behaviour has to be the same as mount(2). > Why? Obviously mount(2) needs to keep working the way it does now. Most likely mount(8) also needs to retain its current behavior. But I don't see why the new API needs be bug-compatible with mount(2). --Andy
On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote: >> Andy Lutomirski <luto@amacapital.net> wrote: >> >>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and >>> reconfigure the result and some *other* existing mount will change? If so, >>> that’s rather unfriendly to users. >> >> The default behaviour has to be the same as mount(2). This is rubbish. Anyone wanting the mount(2) behavior can use mount(2). About exclusive create: can't we just look at the active reference count of the superblock returned by ->get_tree() (if it's one, we are the only users, i.e. the create was exclusive)? Thanks, Miklos
On Fri, Aug 24, 2018 at 09:25:58PM +0200, Miklos Szeredi wrote: > On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote: > >> Andy Lutomirski <luto@amacapital.net> wrote: > >> > >>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and > >>> reconfigure the result and some *other* existing mount will change? If so, > >>> that’s rather unfriendly to users. > >> > >> The default behaviour has to be the same as mount(2). > > This is rubbish. Anyone wanting the mount(2) behavior can use mount(2). > > About exclusive create: can't we just look at the active reference > count of the superblock returned by ->get_tree() (if it's one, we are > the only users, i.e. the create was exclusive)? Let me get it straight - your default behaviour would routinely refuse NFS mount simply because somebody has already mounted from the same server? The main reason for keeping existing semantics is very, very simple: nobody has offered a sane replacement. For all warts (and I admit that policy re sharing turned out to be rather bad for any kind of situation with non-cooperative admins - in partial defense, back in 2000/2001 when it was done anybody talking about something like userns would've gotten laughed at, for a lot of good reasons) mount(2) semantics is defined *and* needs to be supported anyway. All suggested "better replacements" were bloody problematic. Sure, we'll need to sort that out, but again, why the hell tie that to untangling the sodding mount(2) ABI?
Sigh. I just noticed that this tree reintroduces bugs that were fixed with cfb2f6f6e0ba ("Revert "mqueue: switch to on-demand creation of internal mount"") So in addition to ongoing discussion about what the new mount api should look like we have bugs in the implementation. This really does not look like this branch is ready to be merged yet. Eric
Eric W. Biederman <ebiederm@xmission.com> wrote: > I just noticed that this tree reintroduces bugs that were fixed with > cfb2f6f6e0ba ("Revert "mqueue: switch to on-demand creation of internal > mount"") > > So in addition to ongoing discussion about what the new mount api should > look like ... Which don't need to be solved *before* the current patches go in. No strong reason has been given why the desired changes cannot wait one cycle. The core needs to be upstream before we can start porting most of the filesystems to it. > ... we have bugs in the implementation. Allegedly. It would be useful if you could've pointed out what it was that you think you see. I don't see offhand what it reintroduces. And how come this hasn't been noticed since the patches have been sat in linux-next for a while? > This really does not look like this branch is ready to be merged yet. I disagree (probably obviously). David
David Howells <dhowells@redhat.com> wrote: > > ... we have bugs in the implementation. > > Allegedly. It would be useful if you could've pointed out what it was that > you think you see. I don't see offhand what it reintroduces. This? - ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0); - if (IS_ERR(ns->mq_mnt)) { - int err = PTR_ERR(ns->mq_mnt); - ns->mq_mnt = NULL; - return err; - } + m = mq_create_mount(&init_ipc_ns); + if (IS_ERR(m)) + return PTR_ERR(m); + ns->mq_mnt = m; Should be passing ns into mq_create_mount() rather than init_ipc_ns maybe? David
On Fri, Aug 24, 2018 at 9:51 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Aug 24, 2018 at 09:25:58PM +0200, Miklos Szeredi wrote: >> On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote: >> > On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote: >> >> Andy Lutomirski <luto@amacapital.net> wrote: >> >> >> >>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and >> >>> reconfigure the result and some *other* existing mount will change? If so, >> >>> that’s rather unfriendly to users. >> >> >> >> The default behaviour has to be the same as mount(2). >> >> This is rubbish. Anyone wanting the mount(2) behavior can use mount(2). >> >> About exclusive create: can't we just look at the active reference >> count of the superblock returned by ->get_tree() (if it's one, we are >> the only users, i.e. the create was exclusive)? > > Let me get it straight - your default behaviour would routinely refuse NFS mount > simply because somebody has already mounted from the same server? Yeah, bad idea. > The main reason for keeping existing semantics is very, very simple: nobody > has offered a sane replacement. For all warts (and I admit that policy > re sharing turned out to be rather bad for any kind of situation with > non-cooperative admins - in partial defense, back in 2000/2001 when it was > done anybody talking about something like userns would've gotten laughed at, > for a lot of good reasons) mount(2) semantics is defined *and* needs to be > supported anyway. > > All suggested "better replacements" were bloody problematic. Sure, we'll need > to sort that out, but again, why the hell tie that to untangling the sodding > mount(2) ABI? What I was primarily suggesting is pretty simple, yet nobody seems to get it: name the current behavior what it is: legacy. Call this op e.g. CMD_LEGACY_FIND_OR_CREATE. It means "you're using the shiny new API, but be warned: we'll still use the old and stinky way to get the super block". We'll hopefully untangle it, sooner rather than later, and then have ops with less scary names. Thanks, Miklos