mbox series

[git,pull] new mount API

Message ID 20180823223145.GK6515@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show
Series [git,pull] new mount API | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.mount

Message

Al Viro Aug. 23, 2018, 10:31 p.m. UTC
new mount API series from Dave Howells

To quote his cover letter,
	Here are a set of patches to create a filesystem context prior to setting
	up a new mount, populating it with the parsed options/binary data, creating
	the superblock and then effecting the mount.  This is also used for remount
	since much of the parsing stuff is common in many filesystems.

	This allows namespaces and other information to be conveyed through the
	mount procedure.

	This also allows Miklós Szeredi's idea of doing:

		fd = fsopen("nfs");
		fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
		fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
		mfd = fsmount(fd, MS_NODEV);
		move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

	that he presented at LSF-2017 to be implemented (see the relevant patches
	in the series).

	I didn't use netlink as that would make the core kernel depend on
	CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
	issues.

	I've implemented filesystem context handling for procfs, nfs, mqueue,
	cpuset, kernfs, sysfs, cgroup and afs filesystems.

	Unconverted filesystems are handled by a legacy filesystem wrapper.

One trivial conflict in fs/file_table.c:__fput(); resolved as
	if (unlikely(mode & FMODE_NEED_UNMOUNT))
		dissolve_on_fput(mnt);
	dput(dentry);
	mntput(mnt);
out:
	file_free(file);

The following changes since commit 32206ab36553be8714d9253ce33f4085681d369c:

  x86/intel_rdt: Provide pseudo-locking hooks within rdt_mount (2018-06-23 12:53:19 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.mount

for you to fetch changes up to 294fb3407bf29abf653fbe169af4bbb38a146db5:

  proc: Set correct userns for new proc super created by a new pid_namespace (2018-08-21 09:28:43 +0100)

----------------------------------------------------------------
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

Andrei Vagin (1):
      proc: Set correct userns for new proc super created by a new pid_namespace

David Howells (42):
      vfs: Require specification of size of mount data for internal mounts
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: Add a sample program for the new mount API
      vfs: syscall: Add fsinfo() to query filesystem information
      afs: Add fsinfo support
      vfs: Allow fsinfo() to query what's in an fs_context
      vfs: Allow fsinfo() to be used to query an fs parameter description
      vfs: Implement parameter value retrieval with fsinfo()
      vfs: Fix vfs_dup_fs_context()
      vfs: Fix fs_context logging when there's no log
      afs: Move the source fs parameter to the first position
      vfs: Pass path info fsinfo and rename get_fsinfo sb op to fsinfo
      vfs: Adjust fsinfo sample output

 Documentation/filesystems/mount_api.txt   | 706 ++++++++++++++++++++++++
 arch/arc/kernel/setup.c                   |   1 +
 arch/arm/kernel/atags_parse.c             |   1 +
 arch/arm/kvm/Kconfig                      |   1 -
 arch/arm64/kvm/Kconfig                    |   1 -
 arch/ia64/kernel/perfmon.c                |   3 +-
 arch/mips/kvm/Kconfig                     |   1 -
 arch/powerpc/kvm/Kconfig                  |   1 -
 arch/powerpc/platforms/cell/spufs/inode.c |   6 +-
 arch/s390/hypfs/inode.c                   |   7 +-
 arch/s390/kvm/Kconfig                     |   1 -
 arch/sh/kernel/setup.c                    |   1 +
 arch/sparc/kernel/setup_32.c              |   1 +
 arch/sparc/kernel/setup_64.c              |   1 +
 arch/x86/Kconfig                          |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl    |   7 +
 arch/x86/entry/syscalls/syscall_64.tbl    |   7 +
 arch/x86/kernel/cpu/intel_rdt.h           |  15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  | 184 ++++---
 arch/x86/kernel/setup.c                   |   1 +
 arch/x86/kvm/Kconfig                      |   1 -
 drivers/base/Kconfig                      |   1 -
 drivers/base/devtmpfs.c                   |   7 +-
 drivers/char/tpm/Kconfig                  |   1 -
 drivers/dax/super.c                       |   2 +-
 drivers/dma-buf/Kconfig                   |   1 -
 drivers/gpio/Kconfig                      |   1 -
 drivers/gpu/drm/drm_drv.c                 |   3 +-
 drivers/gpu/drm/i915/i915_gemfs.c         |   2 +-
 drivers/iio/Kconfig                       |   1 -
 drivers/infiniband/Kconfig                |   1 -
 drivers/infiniband/hw/qib/qib_fs.c        |   7 +-
 drivers/misc/cxl/api.c                    |   3 +-
 drivers/misc/ibmasm/ibmasmfs.c            |  11 +-
 drivers/mtd/mtdsuper.c                    |  26 +-
 drivers/oprofile/oprofilefs.c             |   8 +-
 drivers/scsi/cxlflash/ocxl_hw.c           |   2 +-
 drivers/usb/gadget/function/f_fs.c        |   7 +-
 drivers/usb/gadget/legacy/inode.c         |   7 +-
 drivers/vfio/Kconfig                      |   1 -
 drivers/virtio/virtio_balloon.c           |   2 +-
 drivers/xen/xenfs/super.c                 |   7 +-
 fs/9p/vfs_super.c                         |   2 +-
 fs/Kconfig                                |   7 +
 fs/Makefile                               |   5 +-
 fs/adfs/super.c                           |   9 +-
 fs/affs/super.c                           |  13 +-
 fs/afs/internal.h                         |  10 +-
 fs/afs/mntpt.c                            | 147 ++---
 fs/afs/super.c                            | 634 +++++++++++++--------
 fs/afs/volume.c                           |   4 +-
 fs/aio.c                                  |   3 +-
 fs/anon_inodes.c                          |   3 +-
 fs/autofs/autofs_i.h                      |   2 +-
 fs/autofs/init.c                          |   4 +-
 fs/autofs/inode.c                         |   3 +-
 fs/befs/linuxvfs.c                        |  11 +-
 fs/bfs/inode.c                            |   8 +-
 fs/binfmt_misc.c                          |   7 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/super.c                          |  30 +-
 fs/btrfs/tests/btrfs-tests.c              |   2 +-
 fs/ceph/super.c                           |   3 +-
 fs/cifs/cifs_dfs_ref.c                    |   3 +-
 fs/cifs/cifsfs.c                          |  18 +-
 fs/coda/inode.c                           |  11 +-
 fs/configfs/mount.c                       |   7 +-
 fs/cramfs/inode.c                         |  17 +-
 fs/debugfs/inode.c                        |  14 +-
 fs/devpts/inode.c                         |  10 +-
 fs/ecryptfs/main.c                        |   2 +-
 fs/efivarfs/super.c                       |   9 +-
 fs/efs/super.c                            |  14 +-
 fs/exofs/super.c                          |   7 +-
 fs/ext2/super.c                           |  14 +-
 fs/ext4/super.c                           |  16 +-
 fs/f2fs/super.c                           |  13 +-
 fs/fat/inode.c                            |   3 +-
 fs/fat/namei_msdos.c                      |   8 +-
 fs/fat/namei_vfat.c                       |   8 +-
 fs/file_table.c                           |   9 +-
 fs/filesystems.c                          |   4 +
 fs/freevxfs/vxfs_super.c                  |  12 +-
 fs/fs_context.c                           | 779 ++++++++++++++++++++++++++
 fs/fs_parser.c                            | 483 ++++++++++++++++
 fs/fsopen.c                               | 491 +++++++++++++++++
 fs/fuse/control.c                         |   9 +-
 fs/fuse/inode.c                           |  16 +-
 fs/gfs2/ops_fstype.c                      |   6 +-
 fs/gfs2/super.c                           |   4 +-
 fs/hfs/super.c                            |  12 +-
 fs/hfsplus/super.c                        |  12 +-
 fs/hostfs/hostfs_kern.c                   |   7 +-
 fs/hpfs/super.c                           |  11 +-
 fs/hugetlbfs/inode.c                      | 455 ++++++++++------
 fs/internal.h                             |  13 +-
 fs/isofs/inode.c                          |  11 +-
 fs/jffs2/super.c                          |  10 +-
 fs/jfs/super.c                            |  11 +-
 fs/kernfs/mount.c                         | 102 ++--
 fs/libfs.c                                |  19 +-
 fs/minix/inode.c                          |  14 +-
 fs/namei.c                                |   4 +-
 fs/namespace.c                            | 879 ++++++++++++++++++++++--------
 fs/nfs/internal.h                         |   4 +-
 fs/nfs/namespace.c                        |   3 +-
 fs/nfs/nfs4namespace.c                    |   3 +-
 fs/nfs/nfs4super.c                        |  27 +-
 fs/nfs/super.c                            |  22 +-
 fs/nfsd/nfsctl.c                          |   8 +-
 fs/nilfs2/super.c                         |  10 +-
 fs/notify/fanotify/Kconfig                |   1 -
 fs/notify/inotify/Kconfig                 |   1 -
 fs/nsfs.c                                 |   3 +-
 fs/ntfs/super.c                           |  13 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |   5 +-
 fs/ocfs2/super.c                          |  14 +-
 fs/omfs/inode.c                           |   9 +-
 fs/openpromfs/inode.c                     |  11 +-
 fs/orangefs/orangefs-kernel.h             |   2 +-
 fs/orangefs/super.c                       |   5 +-
 fs/overlayfs/super.c                      |  11 +-
 fs/pipe.c                                 |   3 +-
 fs/pnode.c                                |   1 +
 fs/proc/inode.c                           |  50 +-
 fs/proc/internal.h                        |   6 +-
 fs/proc/root.c                            | 247 ++++++---
 fs/pstore/inode.c                         |  10 +-
 fs/qnx4/inode.c                           |  14 +-
 fs/qnx6/inode.c                           |  14 +-
 fs/ramfs/inode.c                          |   6 +-
 fs/reiserfs/super.c                       |  14 +-
 fs/romfs/super.c                          |  13 +-
 fs/squashfs/super.c                       |  12 +-
 fs/statfs.c                               | 571 +++++++++++++++++++
 fs/super.c                                | 397 +++++++++++---
 fs/sysfs/mount.c                          |  67 ++-
 fs/sysv/inode.c                           |   3 +-
 fs/sysv/super.c                           |  16 +-
 fs/tracefs/inode.c                        |  10 +-
 fs/ubifs/super.c                          |   5 +-
 fs/udf/super.c                            |  16 +-
 fs/ufs/super.c                            |  11 +-
 fs/xfs/xfs_super.c                        |  10 +-
 include/linux/cgroup.h                    |   3 +-
 include/linux/debugfs.h                   |   8 +-
 include/linux/fs.h                        |  50 +-
 include/linux/fs_context.h                | 208 +++++++
 include/linux/fs_parser.h                 | 117 ++++
 include/linux/fsinfo.h                    |  41 ++
 include/linux/kernfs.h                    |  41 +-
 include/linux/lsm_hooks.h                 |  79 ++-
 include/linux/module.h                    |   6 +
 include/linux/mount.h                     |  10 +-
 include/linux/mtd/super.h                 |   4 +-
 include/linux/ramfs.h                     |   4 +-
 include/linux/security.h                  |  72 ++-
 include/linux/shmem_fs.h                  |   3 +-
 include/linux/syscalls.h                  |  13 +
 include/uapi/linux/fcntl.h                |   2 +
 include/uapi/linux/fs.h                   |  82 ++-
 include/uapi/linux/fsinfo.h               | 302 ++++++++++
 include/uapi/linux/mount.h                |  75 +++
 init/Kconfig                              |  10 -
 init/do_mounts.c                          |   5 +-
 init/do_mounts_initrd.c                   |   1 +
 ipc/mqueue.c                              | 120 +++-
 kernel/bpf/inode.c                        |   7 +-
 kernel/cgroup/cgroup-internal.h           |  50 +-
 kernel/cgroup/cgroup-v1.c                 | 415 ++++++++------
 kernel/cgroup/cgroup.c                    | 285 +++++++---
 kernel/cgroup/cpuset.c                    |  67 ++-
 kernel/trace/trace.c                      |   7 +-
 mm/shmem.c                                |  10 +-
 mm/zsmalloc.c                             |   3 +-
 net/socket.c                              |   3 +-
 net/sunrpc/rpc_pipe.c                     |   7 +-
 samples/Kconfig                           |   6 +
 samples/Makefile                          |   2 +-
 samples/mount_api/Makefile                |   7 +
 samples/mount_api/test-fsmount.c          | 118 ++++
 samples/statx/Makefile                    |   7 +-
 samples/statx/test-fs-query.c             | 137 +++++
 samples/statx/test-fsinfo.c               | 584 ++++++++++++++++++++
 security/apparmor/apparmorfs.c            |   8 +-
 security/apparmor/include/mount.h         |  11 +-
 security/apparmor/lsm.c                   | 111 +++-
 security/apparmor/mount.c                 |  47 ++
 security/inode.c                          |   7 +-
 security/security.c                       |  65 ++-
 security/selinux/hooks.c                  | 320 ++++++++++-
 security/selinux/selinuxfs.c              |   8 +-
 security/smack/smack.h                    |  11 +-
 security/smack/smack_lsm.c                | 373 +++++++++++--
 security/smack/smackfs.c                  |   9 +-
 security/tomoyo/common.h                  |   3 +
 security/tomoyo/mount.c                   |  46 ++
 security/tomoyo/tomoyo.c                  |  19 +-
 198 files changed, 9288 insertions(+), 1874 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/mount.h
 create mode 100644 samples/mount_api/Makefile
 create mode 100644 samples/mount_api/test-fsmount.c
 create mode 100644 samples/statx/test-fs-query.c
 create mode 100644 samples/statx/test-fsinfo.c

Comments

Andy Lutomirski Aug. 23, 2018, 11:24 p.m. UTC | #1
On Thu, Aug 23, 2018 at 3:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> new mount API series from Dave Howells
>
> To quote his cover letter,
>         Here are a set of patches to create a filesystem context prior to setting
>         up a new mount, populating it with the parsed options/binary data, creating
>         the superblock and then effecting the mount.  This is also used for remount
>         since much of the parsing stuff is common in many filesystems.
>
>         This allows namespaces and other information to be conveyed through the
>         mount procedure.
>
>         This also allows Miklós Szeredi's idea of doing:
>
>                 fd = fsopen("nfs");
>                 fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
>                 fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>                 mfd = fsmount(fd, MS_NODEV);
>                 move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
>
>         that he presented at LSF-2017 to be implemented (see the relevant patches
>         in the series).
>
>         I didn't use netlink as that would make the core kernel depend on
>         CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
>         issues.
>
>         I've implemented filesystem context handling for procfs, nfs, mqueue,
>         cpuset, kernfs, sysfs, cgroup and afs filesystems.
>
>         Unconverted filesystems are handled by a legacy filesystem wrapper.
>
> One trivial conflict in fs/file_table.c:__fput(); resolved as
>         if (unlikely(mode & FMODE_NEED_UNMOUNT))
>                 dissolve_on_fput(mnt);
>         dput(dentry);
>         mntput(mnt);
> out:
>         file_free(file);
>

Has anything been done to ensure that the behavior when doing
FSCONFIG_CMD_CREATE against an already-mounted block device is
reasonable?
David Howells Aug. 24, 2018, 12:08 a.m. UTC | #2
Andy Lutomirski <luto@amacapital.net> wrote:

> Has anything been done to ensure that the behavior when doing
> FSCONFIG_CMD_CREATE against an already-mounted block device is
> reasonable?

For the moment, I've left it as the same behaviour as for mount(2) since
mount(2) now uses the same mechanism internally and we aren't permitted to
break userspace.

I would like to add at least one flag to stipulate that, in the case of an
incompatible collision, you can get a failure - but defining what is meant by
incompatible isn't necessarily trivial, and would vary by filesystem *and* the
LSM.

However, I don't want to start reengineering everything this close to the
merge window and we don't really need it immediately.

David
Andy Lutomirski Aug. 24, 2018, 12:16 a.m. UTC | #3
On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> Has anything been done to ensure that the behavior when doing
>> FSCONFIG_CMD_CREATE against an already-mounted block device is
>> reasonable?
>
> For the moment, I've left it as the same behaviour as for mount(2) since
> mount(2) now uses the same mechanism internally and we aren't permitted to
> break userspace.
>
> I would like to add at least one flag to stipulate that, in the case of an
> incompatible collision, you can get a failure - but defining what is meant by
> incompatible isn't necessarily trivial, and would vary by filesystem *and* the
> LSM.
>
> However, I don't want to start reengineering everything this close to the
> merge window and we don't really need it immediately.
>

The problem is that, once this ends up merged, then we're kind of
stuck with it, too.  It would be a bit sad if your better proposal for
handling nfs and instantiation of filesystems in general were added in
the next release and then we end up with the current
FSCONFIG_CMD_CREATE as a permanently supported but non-preferred
option.
Al Viro Aug. 24, 2018, 12:31 a.m. UTC | #4
On Thu, Aug 23, 2018 at 05:16:53PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote:
> > Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> Has anything been done to ensure that the behavior when doing
> >> FSCONFIG_CMD_CREATE against an already-mounted block device is
> >> reasonable?
> >
> > For the moment, I've left it as the same behaviour as for mount(2) since
> > mount(2) now uses the same mechanism internally and we aren't permitted to
> > break userspace.
> >
> > I would like to add at least one flag to stipulate that, in the case of an
> > incompatible collision, you can get a failure - but defining what is meant by
> > incompatible isn't necessarily trivial, and would vary by filesystem *and* the
> > LSM.
> >
> > However, I don't want to start reengineering everything this close to the
> > merge window and we don't really need it immediately.
> >
> 
> The problem is that, once this ends up merged, then we're kind of
> stuck with it, too.  It would be a bit sad if your better proposal for
> handling nfs and instantiation of filesystems in general were added in
> the next release and then we end up with the current
> FSCONFIG_CMD_CREATE as a permanently supported but non-preferred
> option.

For fuck sake, mount(2) is a permanently supported option!  Folks, get over it -
you are mixing entirely different issues.  You know, I know and everyone even
remotely sane knows that mount(2) *IS* *NOT* *GOING* *AWAY*.  And it's not
changing semantics either.  So bemoaning the "permanently supported non-preferred
options" is utter lunacy.  It's there and it will remain there, period.

We do not break userland.  And "their local scripts would break terribly inside
userns container" does *NOT* render those second-class in any sense.  So you
can curse the current behaviour of mount(2) and I even agree with some of that,
but we are not going to be able to remove that.  Yes, it would've been nice if,
etc., but it's not going to happen.  Not now, not for many years.
Andy Lutomirski Aug. 24, 2018, 2:36 a.m. UTC | #5
> On Aug 23, 2018, at 5:31 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
>> On Thu, Aug 23, 2018 at 05:16:53PM -0700, Andy Lutomirski wrote:
>>> On Thu, Aug 23, 2018 at 5:08 PM, David Howells <dhowells@redhat.com> wrote:
>>> Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>>> Has anything been done to ensure that the behavior when doing
>>>> FSCONFIG_CMD_CREATE against an already-mounted block device is
>>>> reasonable?
>>> 
>>> For the moment, I've left it as the same behaviour as for mount(2) since
>>> mount(2) now uses the same mechanism internally and we aren't permitted to
>>> break userspace.
>>> 
>>> I would like to add at least one flag to stipulate that, in the case of an
>>> incompatible collision, you can get a failure - but defining what is meant by
>>> incompatible isn't necessarily trivial, and would vary by filesystem *and* the
>>> LSM.
>>> 
>>> However, I don't want to start reengineering everything this close to the
>>> merge window and we don't really need it immediately.
>>> 
>> 
>> The problem is that, once this ends up merged, then we're kind of
>> stuck with it, too.  It would be a bit sad if your better proposal for
>> handling nfs and instantiation of filesystems in general were added in
>> the next release and then we end up with the current
>> FSCONFIG_CMD_CREATE as a permanently supported but non-preferred
>> option.
> 
> For fuck sake, mount(2) is a permanently supported option!  

Exactly. mount(2) has broken semantics and it’s permanently supported.

If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics.
Al Viro Aug. 24, 2018, 3:13 a.m. UTC | #6
On Thu, Aug 23, 2018 at 07:36:15PM -0700, Andy Lutomirski wrote:

> > For fuck sake, mount(2) is a permanently supported option!  
> 
> Exactly. mount(2) has broken semantics and it’s permanently supported.
> 
> If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics.

Oh no - mount(2) behaviour can be expressed that way!  The horror...

Andy, this is bullshit.  You are saying that dealing with mount(2)
mess of ABI (badly unorthogonal set of operations, overloading from
hell, etc.) must be tied with massive rework of fs drivers.  Why?

It's a simple enough question and pardon me, but your "it's broken"
doesn't inspire any confidence that you even understand what the
current behaviour is and what its problems (they are real enough)
are.  I have seen a lot of handwaving from you in these threads,
combined with "surely, it must work thus", followed with "it's
a special case/class/whatnot" or "surely, it must be broken" every
time you find something that doesn't "work thus".

Bugger it; explain why we must combine untangling the existing ABI
(and we *will* have to keep the existing semantics possible to
express for sys_mount() sake) with this, or with anything else,
for that matter.
Andy Lutomirski Aug. 24, 2018, 4:51 a.m. UTC | #7
On Thu, Aug 23, 2018 at 8:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Aug 23, 2018 at 07:36:15PM -0700, Andy Lutomirski wrote:
>
>> > For fuck sake, mount(2) is a permanently supported option!
>>
>> Exactly. mount(2) has broken semantics and it’s permanently supported.
>>
>> If this merge request gets pulled, then FSCONFIG_CMD_CREATE will *also* be a permanently supported API with broken semantics.
>
> Oh no - mount(2) behaviour can be expressed that way!  The horror...
>
> Andy, this is bullshit.  You are saying that dealing with mount(2)
> mess of ABI (badly unorthogonal set of operations, overloading from
> hell, etc.) must be tied with massive rework of fs drivers.  Why?
>

When this was reviewed earlier, a problem was identified.  I asked if
it had been addressed.  I did *not* say that it was mandatory to
address it, nor did I say anything about reworking fs drivers.

A reasonable answer might have been "avoiding this pitfall in the new
API would involve a large amount of reworking of existing filesystem
drivers.  I think that the new API, as is, has enough benefits that it
makes sense to merge it even with this pitfall, and, if needed, we can
introduce an improved version down the road."

I had at least hoped for a better answer than "bugger it."

--Andy
Theodore Ts'o Aug. 24, 2018, 6:05 a.m. UTC | #8
On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote:
> When this was reviewed earlier, a problem was identified.  I asked if
> it had been addressed.  I did *not* say that it was mandatory to
> address it, nor did I say anything about reworking fs drivers.
> 
> A reasonable answer might have been "avoiding this pitfall in the new
> API would involve a large amount of reworking of existing filesystem
> drivers.  I think that the new API, as is, has enough benefits that it
> makes sense to merge it even with this pitfall, and, if needed, we can
> introduce an improved version down the road."

It's also not clear that an API that you think is "cleaner" would
actually be more usable.  In fact, I believe it's going to be a sh*t
show for userspace, because it won't be obvious what will work, and
what will cause an error of the form, "sorry we can't do this cleaner
thing that some people think is better".  Which means a huge amount of
special casing in the program, or a lot of very surprising failures
that will then get exposed to the system administrator, many of whom
haven't really had much of a problem with the existing mount(8) user
interface.

It may very well be that the "cleaner" approach will be hellacious
from a human usability perspective.  Figuring all of this out could
take months and months (stalling the new mount API for a long time),
and we may never know for sure until we implement a full prototype of
kernel changes, massive rewrite of the file systems, and userspace
programs --- and then see what is the best way to expose the radically
different semantics depending on what individual file systems can and
can not do to the poor human system administrator who is just trying
to mount the d*mned file system.

	      	     	  	    - Ted
Miklos Szeredi Aug. 24, 2018, 8:38 a.m. UTC | #9
On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote:
>> When this was reviewed earlier, a problem was identified.  I asked if
>> it had been addressed.  I did *not* say that it was mandatory to
>> address it, nor did I say anything about reworking fs drivers.
>>
>> A reasonable answer might have been "avoiding this pitfall in the new
>> API would involve a large amount of reworking of existing filesystem
>> drivers.  I think that the new API, as is, has enough benefits that it
>> makes sense to merge it even with this pitfall, and, if needed, we can
>> introduce an improved version down the road."
>
> It's also not clear that an API that you think is "cleaner" would
> actually be more usable.  In fact, I believe it's going to be a sh*t
> show for userspace, because it won't be obvious what will work, and
> what will cause an error of the form, "sorry we can't do this cleaner
> thing that some people think is better".  Which means a huge amount of
> special casing in the program, or a lot of very surprising failures
> that will then get exposed to the system administrator, many of whom
> haven't really had much of a problem with the existing mount(8) user
> interface.

In what way is the kernel better suited to read the mind of the poor
sysadmin, than a userspace helper program?

No, I don't think this argument is about the mount(8) interface, which
is what the sysadmin is using and can continue to use for the
foreseeable future, with basically the same feature set

It's about new uses of the mount(2) API  that *would* be possible if
it was a saner interface.   And so doing the new mount API radically
different than what current mount(2) is doing is quite all right in my
opinion.

> It may very well be that the "cleaner" approach will be hellacious
> from a human usability perspective.  Figuring all of this out could
> take months and months (stalling the new mount API for a long time),
> and we may never know for sure until we implement a full prototype of
> kernel changes, massive rewrite of the file systems, and userspace
> programs --- and then see what is the best way to expose the radically
> different semantics depending on what individual file systems can and
> can not do to the poor human system administrator who is just trying
> to mount the d*mned file system.

I'm not against merging this patchset (have nits about resuing the
MS_* constants for the new API that I've complained about, but that's
really easy to fix before -final), but it may make sense to
differentiate the legacy behavior from a saner one from the very
start.  I.e. rename FSCONFIG_CMD_CREATE to something implying it's
actually *not* a create in exclusive sense that one would first imply
(and yeah, we have creat() and O_CREAT, which don't imply exclusivity,
yet they at least have clear semantics that current super block
creation does definitely not).

Also, we don't have a massive conversion of filesystems yet to the new
internal API, so we could require anything converted to support the
exclusive create semantics.

None of this requires months of figuring out, just a bit of tweaking
and acknowledging that the current semantics are weird, to say the
least.

Thanks,
Miklos
Miklos Szeredi Aug. 24, 2018, 8:56 a.m. UTC | #10
On Fri, Aug 24, 2018 at 10:38 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>> On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote:
>>> When this was reviewed earlier, a problem was identified.  I asked if
>>> it had been addressed.  I did *not* say that it was mandatory to
>>> address it, nor did I say anything about reworking fs drivers.
>>>
>>> A reasonable answer might have been "avoiding this pitfall in the new
>>> API would involve a large amount of reworking of existing filesystem
>>> drivers.  I think that the new API, as is, has enough benefits that it
>>> makes sense to merge it even with this pitfall, and, if needed, we can
>>> introduce an improved version down the road."
>>
>> It's also not clear that an API that you think is "cleaner" would
>> actually be more usable.  In fact, I believe it's going to be a sh*t
>> show for userspace, because it won't be obvious what will work, and
>> what will cause an error of the form, "sorry we can't do this cleaner
>> thing that some people think is better".  Which means a huge amount of
>> special casing in the program, or a lot of very surprising failures
>> that will then get exposed to the system administrator, many of whom
>> haven't really had much of a problem with the existing mount(8) user
>> interface.
>
> In what way is the kernel better suited to read the mind of the poor
> sysadmin, than a userspace helper program?

I have a concrete example:  mount -oloop.  You can leave it to
mount(8) to automagically find an existing loop device or setup a new
one, or you can do the low level thing and set up your own loop device
and mount it.  Same story as NFS and friends, except it's not the
kernel that does the magic "same source -> same sb" policy but the
mount(8) utility.  Ever notice the difference?  See?   And yeah, there
are races involved, and userspace is perfectly suited to deal with
them.

Thanks,
Miklos
Miklos Szeredi Aug. 24, 2018, 9:29 a.m. UTC | #11
On Fri, Aug 24, 2018 at 10:56 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Aug 24, 2018 at 10:38 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Aug 24, 2018 at 8:05 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>>> On Thu, Aug 23, 2018 at 09:51:00PM -0700, Andy Lutomirski wrote:
>>>> When this was reviewed earlier, a problem was identified.  I asked if
>>>> it had been addressed.  I did *not* say that it was mandatory to
>>>> address it, nor did I say anything about reworking fs drivers.
>>>>
>>>> A reasonable answer might have been "avoiding this pitfall in the new
>>>> API would involve a large amount of reworking of existing filesystem
>>>> drivers.  I think that the new API, as is, has enough benefits that it
>>>> makes sense to merge it even with this pitfall, and, if needed, we can
>>>> introduce an improved version down the road."
>>>
>>> It's also not clear that an API that you think is "cleaner" would
>>> actually be more usable.  In fact, I believe it's going to be a sh*t
>>> show for userspace, because it won't be obvious what will work, and
>>> what will cause an error of the form, "sorry we can't do this cleaner
>>> thing that some people think is better".  Which means a huge amount of
>>> special casing in the program, or a lot of very surprising failures
>>> that will then get exposed to the system administrator, many of whom
>>> haven't really had much of a problem with the existing mount(8) user
>>> interface.
>>
>> In what way is the kernel better suited to read the mind of the poor
>> sysadmin, than a userspace helper program?
>
> I have a concrete example:  mount -oloop.  You can leave it to
> mount(8) to automagically find an existing loop device or setup a new
> one, or you can do the low level thing and set up your own loop device
> and mount it.  Same story as NFS and friends, except it's not the
> kernel that does the magic "same source -> same sb" policy but the
> mount(8) utility.  Ever notice the difference?  See?   And yeah, there
> are races involved, and userspace is perfectly suited to deal with
> them.

And to continue from that thought, a namespace for filesystem
instances could, for example, live under /dev/fs/$FSTYPE/$INSTANCE

Naming the $INSTANCE could be done by the one creating that instance,
or in case of legacy mounts could have a fixed prefix + sequence
number, or something similar.

All this could come later, let's just not exclude the possibility of
using the new API in this way.

Thanks,
Miklos
David Howells Aug. 24, 2018, 9:45 a.m. UTC | #12
Miklos Szeredi <miklos@szeredi.hu> wrote:

> I'm not against merging this patchset (have nits about resuing the
> MS_* constants for the new API that I've complained about, but that's
> really easy to fix before -final),

Can you send me a patch that does what you want here?  They're only used by
fsmount() for creating a vfsmount and not used for the superblock creation or
reconfiguration.

> but it may make sense to differentiate the legacy behavior from a saner one
> from the very start.  I.e. rename FSCONFIG_CMD_CREATE

I would suggest leaving it as-is and add an FSCONFIG_CMD_CREATE_EXCLUSIVE.

> to something implying it's actually *not* a create in exclusive sense that
> one would first imply (and yeah, we have creat() and O_CREAT, which don't
> imply exclusivity, yet they at least have clear semantics that current super
> block creation does definitely not).

The problem is that "exclusivity" isn't necessarily an easy thing to define.
Take nfs4 and btrfs for example.  They creating a backing superblock that the
actual node is derived from (though in different ways).  How do you define
what "exclusive" means in their case?

David
Miklos Szeredi Aug. 24, 2018, 10:06 a.m. UTC | #13
On Fri, Aug 24, 2018 at 11:45 AM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> I'm not against merging this patchset (have nits about resuing the
>> MS_* constants for the new API that I've complained about, but that's
>> really easy to fix before -final),
>
> Can you send me a patch that does what you want here?  They're only used by
> fsmount() for creating a vfsmount and not used for the superblock creation or
> reconfiguration.
>
>> but it may make sense to differentiate the legacy behavior from a saner one
>> from the very start.  I.e. rename FSCONFIG_CMD_CREATE
>
> I would suggest leaving it as-is and add an FSCONFIG_CMD_CREATE_EXCLUSIVE.
>
>> to something implying it's actually *not* a create in exclusive sense that
>> one would first imply (and yeah, we have creat() and O_CREAT, which don't
>> imply exclusivity, yet they at least have clear semantics that current super
>> block creation does definitely not).
>
> The problem is that "exclusivity" isn't necessarily an easy thing to define.
> Take nfs4 and btrfs for example.  They creating a backing superblock that the
> actual node is derived from (though in different ways).  How do you define
> what "exclusive" means in their case?

Exclusive creation of a filesystem instance is just that:  we promise
that no one else is using that filesystem instance, and we can set
options as we like, guaranteed that no bother to any other uses.

Now, with block filesystems that can only be done once, obviously, so
if someone already created a btrfs instance using /dev/foobar then
trying to create another instance will fail with EEXIST, EBUSY or
whaterver.

For NFS it's a different story, it *is* possible to create several
instances of the filesystem even if connecting to the same server,
except then they will not share caches, will not have local coherency,
etc..

Regardless, I think a "filesystem instance" is a pretty clear concept,
that we can base a future API on and it's not just an internal
implementation detail that the current mount(2) API makes look like.
It *is* in fact the only concept that makes FSCONFIG_CMD_RECONFIGURE
have sane meaning (i.e. I want to change *that* instance, and am fully
aware what affect that will have on mounts of that instance).

Compare that to "mount -o remount path"; that one would imply we are
changing something for a mount that is at "path", but if there are
more than one users of that instance (which is based on magic
algorithm inside the filesystem) than that might have unwanted side
effects.  It's what Eric's complaint is about, and it's I think the
root of the issue at hand.

Thanks,
Miklos
Miklos Szeredi Aug. 24, 2018, 2:18 p.m. UTC | #14
On Fri, Aug 24, 2018 at 10:45:31AM +0100, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > I'm not against merging this patchset (have nits about resuing the
> > MS_* constants for the new API that I've complained about, but that's
> > really easy to fix before -final),
> 
> Can you send me a patch that does what you want here?  They're only used by
> fsmount() for creating a vfsmount and not used for the superblock creation or
> reconfiguration.

I'm bothered by the fact that we use the same MS_ prefix in the old mount(2) api
and the new fsmount(2) api.  What happens if we introduce new flags for
fsmount(2) and are already out of flags for mount(2)?  I see a big mess that
way.

Also notice, how the old code just totally ignored MS_RELATIME?  Bit of rot in
the new interface already.

So here's a patch. The MNT_ prefix is already used by libmount, and we don't
want any confusion arising from that.  So how about M_*?  Short and sweet, just
like O_* for open(2).

Reusing the same values as MNT_ isn't important, but it does reduce the code
size a bit.  Especially if we'd switch to MNT_STRICTATIME internally as well.


Thanks,
Miklos
---

diff --git a/fs/namespace.c b/fs/namespace.c
index e34e3fd064b0..a4b8503f80b8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3351,7 +3351,7 @@ EXPORT_SYMBOL_GPL(kern_mount);
  * Create a kernel mount representation for a new, prepared superblock
  * (specified by fs_fd) and attach to an open_tree-like file descriptor.
  */
-SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, m_flags)
 {
 	struct fs_context *fc;
 	struct file *file;
@@ -3366,27 +3366,21 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags
 	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
 		return -EINVAL;
 
-	if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
-			 MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
-			 MS_STRICTATIME))
+	if (m_flags & ~(M_NOSUID | M_NODEV | M_NOEXEC |	M_NOATIME |
+			M_NODIRATIME | M_STRICTATIME | M_RDONLY))
 		return -EINVAL;
 
-	if (ms_flags & MS_RDONLY)
-		mnt_flags |= MNT_READONLY;
-	if (ms_flags & MS_NOSUID)
-		mnt_flags |= MNT_NOSUID;
-	if (ms_flags & MS_NODEV)
-		mnt_flags |= MNT_NODEV;
-	if (ms_flags & MS_NOEXEC)
-		mnt_flags |= MNT_NOEXEC;
-	if (ms_flags & MS_NODIRATIME)
-		mnt_flags |= MNT_NODIRATIME;
+	BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV ||
+		     M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME ||
+		     M_NODIRATIME != MNT_NODIRATIME ||
+		     M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY);
+
+	/* Relatime is the default, but internally that's what we flag */
+	mnt_flags = m_flags & ~M_STRICTATIME;
 
-	if (ms_flags & MS_STRICTATIME) {
-		if (ms_flags & MS_NOATIME)
+	if (m_flags & M_STRICTATIME) {
+		if (m_flags & M_NOATIME)
 			return -EINVAL;
-	} else if (ms_flags & MS_NOATIME) {
-		mnt_flags |= MNT_NOATIME;
 	} else {
 		mnt_flags |= MNT_RELATIME;
 	}
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 3634e065836c..331e69f5a7bf 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -1,6 +1,15 @@
 #ifndef _UAPI_LINUX_MOUNT_H
 #define _UAPI_LINUX_MOUNT_H
 
+/* Mount flags passed to fsmount(2) */
+#define M_NOSUID	0x01
+#define M_NODEV		0x02
+#define M_NOEXEC	0x04
+#define M_NOATIME	0x08
+#define M_NODIRATIME	0x10
+#define M_STRICTATIME	0x20
+#define M_RDONLY	0x40
+
 /*
  * These are the fs-independent mount-flags: up to 32 flags are supported
  *
David Howells Aug. 24, 2018, 2:26 p.m. UTC | #15
Miklos Szeredi <miklos@szeredi.hu> wrote:

> +/* Mount flags passed to fsmount(2) */
> +#define M_NOSUID	0x01
> +#define M_NODEV		0x02
> +#define M_NOEXEC	0x04
> +#define M_NOATIME	0x08
> +#define M_NODIRATIME	0x10
> +#define M_STRICTATIME	0x20
> +#define M_RDONLY	0x40

If we're going to do this, I would suggest a longer prefix than just 'M' and
renumber them to put *_RDONLY first.

> +	BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV ||
> +		     M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME ||
> +		     M_NODIRATIME != MNT_NODIRATIME ||
> +		     M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY);

Please don't, please do:

	if (ms_flags & M_RDONLY)
		mnt_flags |= MNT_READONLY;

Yes, and at some point I'd also like to compress the numbering on the SB_*
constants and break the identity as Christoph suggested.

David
Karel Zak Aug. 24, 2018, 2:26 p.m. UTC | #16
On Fri, Aug 24, 2018 at 04:18:10PM +0200, Miklos Szeredi wrote:
> I'm bothered by the fact that we use the same MS_ prefix in the old mount(2) api
> and the new fsmount(2) api.  What happens if we introduce new flags for
> fsmount(2) and are already out of flags for mount(2)?  I see a big mess that
> way.
> 
> Also notice, how the old code just totally ignored MS_RELATIME?  Bit of rot in
> the new interface already.
> 
> So here's a patch. The MNT_ prefix is already used by libmount, and we don't
> want any confusion arising from that.  So how about M_*?  Short and sweet, just
> like O_* for open(2).

+1 Please.
Miklos Szeredi Aug. 24, 2018, 2:30 p.m. UTC | #17
On Fri, Aug 24, 2018 at 4:26 PM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> +/* Mount flags passed to fsmount(2) */
>> +#define M_NOSUID     0x01
>> +#define M_NODEV              0x02
>> +#define M_NOEXEC     0x04
>> +#define M_NOATIME    0x08
>> +#define M_NODIRATIME 0x10
>> +#define M_STRICTATIME        0x20
>> +#define M_RDONLY     0x40
>
> If we're going to do this, I would suggest a longer prefix than just 'M'

Any suggestions?

We could do "MOUNT_", but that sounds tad too long.  But hey, that
doesn't matter much either.   Let me know your preference.

>  and
> renumber them to put *_RDONLY first.
>
>> +     BUILD_BUG_ON(M_NOSUID != MNT_NOSUID || M_NODEV != MNT_NODEV ||
>> +                  M_NOEXEC != MNT_NOEXEC || M_NOATIME != MNT_NOATIME ||
>> +                  M_NODIRATIME != MNT_NODIRATIME ||
>> +                  M_STRICTATIME != MNT_RELATIME || M_RDONLY != MNT_READONLY);
>
> Please don't, please do:
>
>         if (ms_flags & M_RDONLY)
>                 mnt_flags |= MNT_READONLY;

As I said, I don't really care either way.

Thanks,
Miklos
Andy Lutomirski Aug. 24, 2018, 2:49 p.m. UTC | #18
> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote:

> 
> 
> The problem is that "exclusivity" isn't necessarily an easy thing to define.
> Take nfs4 and btrfs for example.  They creating a backing superblock that the
> actual node is derived from (though in different ways).  How do you define
> what "exclusive" means in their case?
> 

I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.”  This means, at least, that all options need to be respected as specified (or fail if they can’t be) and future reconfigure/remount actions only affect *this* instance and not others.

NFS might already meet these requirements, although I’m not sure about the latter one.

But it might be that adding CREATE_EXCLUSIVE is not all that useful and it would be better to have an API like you sketched where you first explicitly instantiate the driver and then you create however many trees (a la fsmount()) you want, where each tree has its own set of per-tree options like “subvol” for btrfs.
Miklos Szeredi Aug. 24, 2018, 3:02 p.m. UTC | #19
On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote:
>
>>
>>
>> The problem is that "exclusivity" isn't necessarily an easy thing to define.
>> Take nfs4 and btrfs for example.  They creating a backing superblock that the
>> actual node is derived from (though in different ways).  How do you define
>> what "exclusive" means in their case?
>>
>
> I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.”

The only way to get semantically equivalent instance is to create a
fully independent instance.  See reconfigure (nee remount).

Thanks,
Miklos
Andy Lutomirski Aug. 24, 2018, 3:09 p.m. UTC | #20
> On Aug 24, 2018, at 8:02 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
>> On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> 
>>>> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote:
>>> 
>>> 
>>> 
>>> The problem is that "exclusivity" isn't necessarily an easy thing to define.
>>> Take nfs4 and btrfs for example.  They creating a backing superblock that the
>>> actual node is derived from (though in different ways).  How do you define
>>> what "exclusive" means in their case?
>>> 
>> 
>> I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.”
> 
> The only way to get semantically equivalent instance is to create a
> fully independent instance.  See reconfigure (nee remount).
> 
> 

Hmm. Is it that case in the current patchset that you can do CMD_CREATE and reconfigure the result and some *other* existing mount will change?  If so, that’s rather unfriendly to users.
Miklos Szeredi Aug. 24, 2018, 5:08 p.m. UTC | #21
On Fri, Aug 24, 2018 at 5:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Aug 24, 2018, at 8:02 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>>> On Fri, Aug 24, 2018 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>
>>>>> On Aug 24, 2018, at 2:45 AM, David Howells <dhowells@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> The problem is that "exclusivity" isn't necessarily an easy thing to define.
>>>> Take nfs4 and btrfs for example.  They creating a backing superblock that the
>>>> actual node is derived from (though in different ways).  How do you define
>>>> what "exclusive" means in their case?
>>>>
>>>
>>> I would argue that “exclusive” means “semantically equivalent to getting a fully independent instance.”
>>
>> The only way to get semantically equivalent instance is to create a
>> fully independent instance.  See reconfigure (nee remount).
>>
>>
>
> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and reconfigure the result and some *other* existing mount will change?  If so, that’s rather unfriendly to users.

Exactly.

Thanks,
Miklos
David Howells Aug. 24, 2018, 5:10 p.m. UTC | #22
Andy Lutomirski <luto@amacapital.net> wrote:

> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and
> reconfigure the result and some *other* existing mount will change?  If so,
> that’s rather unfriendly to users.

The default behaviour has to be the same as mount(2).

David
Andy Lutomirski Aug. 24, 2018, 5:43 p.m. UTC | #23
On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and
>> reconfigure the result and some *other* existing mount will change?  If so,
>> that’s rather unfriendly to users.
>
> The default behaviour has to be the same as mount(2).
>

Why?

Obviously mount(2) needs to keep working the way it does now.  Most
likely mount(8) also needs to retain its current behavior.  But I
don't see why the new API needs be bug-compatible with mount(2).

--Andy
Miklos Szeredi Aug. 24, 2018, 7:25 p.m. UTC | #24
On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and
>>> reconfigure the result and some *other* existing mount will change?  If so,
>>> that’s rather unfriendly to users.
>>
>> The default behaviour has to be the same as mount(2).

This is rubbish.  Anyone wanting the mount(2) behavior can use mount(2).

About exclusive create: can't we just look at the active reference
count of the superblock returned by ->get_tree() (if it's one, we are
the only users, i.e. the create was exclusive)?

Thanks,
Miklos
Al Viro Aug. 24, 2018, 7:51 p.m. UTC | #25
On Fri, Aug 24, 2018 at 09:25:58PM +0200, Miklos Szeredi wrote:
> On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote:
> >> Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and
> >>> reconfigure the result and some *other* existing mount will change?  If so,
> >>> that’s rather unfriendly to users.
> >>
> >> The default behaviour has to be the same as mount(2).
> 
> This is rubbish.  Anyone wanting the mount(2) behavior can use mount(2).
> 
> About exclusive create: can't we just look at the active reference
> count of the superblock returned by ->get_tree() (if it's one, we are
> the only users, i.e. the create was exclusive)?

Let me get it straight - your default behaviour would routinely refuse NFS mount
simply because somebody has already mounted from the same server?

The main reason for keeping existing semantics is very, very simple: nobody
has offered a sane replacement.  For all warts (and I admit that policy
re sharing turned out to be rather bad for any kind of situation with
non-cooperative admins - in partial defense, back in 2000/2001 when it was
done anybody talking about something like userns would've gotten laughed at,
for a lot of good reasons) mount(2) semantics is defined *and* needs to be
supported anyway.

All suggested "better replacements" were bloody problematic.  Sure, we'll need
to sort that out, but again, why the hell tie that to untangling the sodding
mount(2) ABI?
Eric W. Biederman Aug. 26, 2018, 3:22 a.m. UTC | #26
Sigh.

I just noticed that this tree reintroduces bugs that were fixed with
cfb2f6f6e0ba ("Revert "mqueue: switch to on-demand creation of internal mount"")

So in addition to ongoing discussion about what the new mount api should
look like we have bugs in the implementation.  This really does not look
like this branch is ready to be merged yet.

Eric
David Howells Aug. 26, 2018, 8:42 p.m. UTC | #27
Eric W. Biederman <ebiederm@xmission.com> wrote:

> I just noticed that this tree reintroduces bugs that were fixed with
> cfb2f6f6e0ba ("Revert "mqueue: switch to on-demand creation of internal
> mount"")
> 
> So in addition to ongoing discussion about what the new mount api should
> look like ...

Which don't need to be solved *before* the current patches go in.  No strong
reason has been given why the desired changes cannot wait one cycle.

The core needs to be upstream before we can start porting most of the
filesystems to it.

> ... we have bugs in the implementation.

Allegedly.  It would be useful if you could've pointed out what it was that
you think you see.  I don't see offhand what it reintroduces.

And how come this hasn't been noticed since the patches have been sat in
linux-next for a while?

> This really does not look like this branch is ready to be merged yet.

I disagree (probably obviously).

David
David Howells Aug. 26, 2018, 8:46 p.m. UTC | #28
David Howells <dhowells@redhat.com> wrote:

> > ... we have bugs in the implementation.
> 
> Allegedly.  It would be useful if you could've pointed out what it was that
> you think you see.  I don't see offhand what it reintroduces.

This?

-	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns, 0);
-	if (IS_ERR(ns->mq_mnt)) {
-		int err = PTR_ERR(ns->mq_mnt);
-		ns->mq_mnt = NULL;
-		return err;
-	}
+	m = mq_create_mount(&init_ipc_ns);
+	if (IS_ERR(m))
+		return PTR_ERR(m);
+	ns->mq_mnt = m;

Should be passing ns into mq_create_mount() rather than init_ipc_ns maybe?

David
Miklos Szeredi Aug. 29, 2018, 12:32 p.m. UTC | #29
On Fri, Aug 24, 2018 at 9:51 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Aug 24, 2018 at 09:25:58PM +0200, Miklos Szeredi wrote:
>> On Fri, Aug 24, 2018 at 7:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> > On Fri, Aug 24, 2018 at 10:10 AM, David Howells <dhowells@redhat.com> wrote:
>> >> Andy Lutomirski <luto@amacapital.net> wrote:
>> >>
>> >>> Hmm. Is it that case in the current patchset that you can do CMD_CREATE and
>> >>> reconfigure the result and some *other* existing mount will change?  If so,
>> >>> that’s rather unfriendly to users.
>> >>
>> >> The default behaviour has to be the same as mount(2).
>>
>> This is rubbish.  Anyone wanting the mount(2) behavior can use mount(2).
>>
>> About exclusive create: can't we just look at the active reference
>> count of the superblock returned by ->get_tree() (if it's one, we are
>> the only users, i.e. the create was exclusive)?
>
> Let me get it straight - your default behaviour would routinely refuse NFS mount
> simply because somebody has already mounted from the same server?

Yeah, bad idea.

> The main reason for keeping existing semantics is very, very simple: nobody
> has offered a sane replacement.  For all warts (and I admit that policy
> re sharing turned out to be rather bad for any kind of situation with
> non-cooperative admins - in partial defense, back in 2000/2001 when it was
> done anybody talking about something like userns would've gotten laughed at,
> for a lot of good reasons) mount(2) semantics is defined *and* needs to be
> supported anyway.
>
> All suggested "better replacements" were bloody problematic.  Sure, we'll need
> to sort that out, but again, why the hell tie that to untangling the sodding
> mount(2) ABI?

What I was primarily suggesting is pretty simple, yet nobody seems to
get it: name the current behavior what it is: legacy. Call this op
e.g. CMD_LEGACY_FIND_OR_CREATE.  It means "you're using the shiny new
API, but be warned: we'll still use the old and stinky way to get the
super block".

We'll hopefully untangle it, sooner rather than later, and then have
ops with less scary names.

Thanks,
Miklos