mbox series

[00/14] VFS: Filesystem information [ver #18]

Message ID 158376244589.344135.12925590041630631412.stgit@warthog.procyon.org.uk (mailing list archive)
Headers show
Series VFS: Filesystem information [ver #18] | expand

Message

David Howells March 9, 2020, 2 p.m. UTC
Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.

The patchset is based on top of the notifications patchset and allows event
counters implemented in the latter to be retrieved to allow overruns to be
efficiently managed.

Included are a couple of sample programs plus limited example code for NFS
and Ext4.  The example code is not intended to go upstream as-is.


=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, then it may return to userspace
     briefly and then continue reading some way into the file.  But changes
     can occur in the interval that may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through
     /proc/mounts and searching by path or mount ID - but we might have an
     fd we want to query.

 (5) Mount topology is not explicit.  One must derive it manually by
     comparing entries.

 (6) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what or whether you can even see the
     change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     may be server-dependent.  Capabilities might even vary at file
     granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query state of a particular file.
     This could also have application for other filesystems, such as NFS,
     CIFS and ext4.

 [*] In a lot of cases these are probably fixed and can be memcpy'd from
     static data.

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.


============================
WHY NOT USE PROCFS OR SYSFS?
============================

Why is it better to go with a new system call rather than adding more magic
stuff to /proc or /sysfs for each superblock object and each mount object?

 (1) It can be targetted.  It makes it easy to query directly by path or
     fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
     cannot do three of these things easily.

 (2) Easier to provide LSM oversight.  Is the accessing process allowed to
     query information pertinent to a particular file?

 (3) It's more efficient as we can return specific binary data rather than
     making huge text dumps.  Granted, sysfs and procfs could present the
     same data, though as lots of little files which have to be
     individually opened, read, closed and parsed.

 (4) We wouldn't have the overhead of open and close (even adding a
     self-contained readfile() syscall has to do that internally).

 (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
     file accessed.  We can use an integer attribute ID instead (yes, this
     is similar to ioctl) - but could also use a string ID if that is
     preferred.

 (6) Can query cross-namespace if, say, a container manager process is
     given an fs_context that hasn't yet been mounted into a namespace - or
     hasn't even been fully created yet.

 (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
     mount happens or is removed - and since systemd makes much use of
     mount namespaces and mount propagation, this will create a lot of
     nodes.


================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  They type is fixed for a particular attribute.

 (4) For structure types, the length is also a version.  New fields can be
     tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     None of this is seen by the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.


========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.

Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible, so integer attribute
ID rather than string and the core does all the buffer allocation and
expansion and all the extensibility support work rather than leaving that
to the filesystems.  This means that userspace pointers are not exposed to
the filesystem.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  A helper function is provided
     to scan a list of attributes and a filesystem can have multiple such
     lists.

 (2) Version length-dependent structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME;
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
		     &server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION;
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core


===================
SIGNIFICANT CHANGES
===================

 ver #18:

 (*) Moved the mount and superblock notification patches into a different
     branch.

 (*) Made superblock configuration (->show_opts), bindmount path
     (->show_path) and filesystem statistics (->show_stats) available as
     the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.

 (*) Made mountpoint device name available, filtered through the superblock
     (->show_devname), as the SOURCE attribute.

 (*) Made the mountpoint available as a full path as well as a relative
     one.

 (*) Added more event counters to MOUNT_INFO, including a subtree
     notification counter, to make it easier to clean up after a
     notification overrun.

 (*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
     five event counters.

 (*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
     also so that mount ID reuse can be detected.

 (*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
     avoid duplicate information.

 (*) Switched to using the RESOLVE_* flags rather than AT_* flags for
     pathwalk control.  Added more RESOLVE_* flags.

 (*) Used a lock instead of RCU to enumerate children for the
     MOUNT_CHILDREN attribute for safety.  This is probably worth
     revisiting at a later date, however.


 ver #17:

 (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.

 (*) Rearranged the order in which fsinfo() does things so that the
     superblock operations table can have a function pointer rather than a
     table pointer.  The ->fsinfo() op is now called at least twice, once
     to determine the size of buffer needed and then to retrieve the data.
     If the retrieval step indicates yet more space is needed, the buffer
     will be expanded and that step repeated.

 (*) Merge the element size into the size in the fsinfo_attribute def and
     don't set size for strings or opaques.  Let a helper work that out.
     This means that strings can actually get larger then 4K.

 (*) A helper is provided to scan a list of attributes and call the
     appropriate get function.  This can be called from a filesystem's
     ->fsinfo() method multiple times.  It also handles attribute
     enumeration and info querying.

 (*) Rearranged the patches to put all the notification patches first.
     This allowed some of the bits to be squashed together.  At some point,
     I'll move the notification patches into a different branch.

 ver #16:

 (*) Split the features bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.

 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Add sample code for Ext4 and NFS.

David
---
David Howells (14):
      VFS: Add additional RESOLVE_* flags
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow the mount topology propogation flags to be retrieved
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Example support for Ext4
      fsinfo: Example support for NFS


 Documentation/filesystems/fsinfo.rst        |  564 +++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |    7 
 fs/Makefile                                 |    1 
 fs/afs/internal.h                           |    1 
 fs/afs/super.c                              |  218 +++++++
 fs/d_path.c                                 |    2 
 fs/ext4/Makefile                            |    1 
 fs/ext4/ext4.h                              |    6 
 fs/ext4/fsinfo.c                            |   45 +
 fs/ext4/super.c                             |    3 
 fs/fsinfo.c                                 |  720 ++++++++++++++++++++++
 fs/internal.h                               |   13 
 fs/mount.h                                  |    3 
 fs/namespace.c                              |  362 +++++++++++
 fs/nfs/Makefile                             |    1 
 fs/nfs/fsinfo.c                             |  230 +++++++
 fs/nfs/internal.h                           |    6 
 fs/nfs/nfs4super.c                          |    3 
 fs/nfs/super.c                              |    3 
 fs/open.c                                   |    8 
 include/linux/fcntl.h                       |    3 
 include/linux/fs.h                          |    4 
 include/linux/fsinfo.h                      |  111 +++
 include/linux/syscalls.h                    |    4 
 include/uapi/asm-generic/unistd.h           |    4 
 include/uapi/linux/fsinfo.h                 |  360 +++++++++++
 include/uapi/linux/mount.h                  |   10 
 include/uapi/linux/openat2.h                |    8 
 include/uapi/linux/windows.h                |   35 +
 kernel/sys_ni.c                             |    1 
 samples/vfs/Makefile                        |    7 
 samples/vfs/test-fsinfo.c                   |  880 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  |  277 ++++++++
 50 files changed, 3905 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/windows.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c

Comments

Jeff Layton March 9, 2020, 5:50 p.m. UTC | #1
On Mon, 2020-03-09 at 14:00 +0000, David Howells wrote:
> Here's a set of patches that adds a system call, fsinfo(), that allows
> information about the VFS, mount topology, superblock and files to be
> retrieved.
> 
> The patchset is based on top of the notifications patchset and allows event
> counters implemented in the latter to be retrieved to allow overruns to be
> efficiently managed.
> 
> Included are a couple of sample programs plus limited example code for NFS
> and Ext4.  The example code is not intended to go upstream as-is.
> 
> 
> =======
> THE WHY
> =======
> 
> Why do we want this?
> 
> Using /proc/mounts (or similar) has problems:
> 
>  (1) Reading from it holds a global lock (namespace_sem) that prevents
>      mounting and unmounting.  Lots of data is encoded and mangled into
>      text whilst the lock is held, including superblock option strings and
>      mount point paths.  This causes performance problems when there are a
>      lot of mount objects in a system.
> 
>  (2) Even though namespace_sem is held during a read, reading the whole
>      file isn't necessarily atomic with respect to mount-type operations.
>      If a read isn't satisfied in one go, then it may return to userspace
>      briefly and then continue reading some way into the file.  But changes
>      can occur in the interval that may then go unseen.
> 
>  (3) Determining what has changed means parsing and comparing consecutive
>      outputs of /proc/mounts.
> 
>  (4) Querying a specific mount or superblock means searching through
>      /proc/mounts and searching by path or mount ID - but we might have an
>      fd we want to query.
> 
>  (5) Mount topology is not explicit.  One must derive it manually by
>      comparing entries.
> 
>  (6) Whilst you can poll() it for events, it only tells you that something
>      changed in the namespace, not what or whether you can even see the
>      change.
> 
> To fix the notification issues, the preceding notifications patchset added
> mount watch notifications whereby you can watch for notifications in a
> specific mount subtree.  The notification messages include the ID(s) of the
> affected mounts.
> 
> To support notifications, however, we need to be able to handle overruns in
> the notification queue.  I added a number of event counters to struct
> super_block and struct mount to allow you to pin down the changes, but
> there needs to be a way to retrieve them.  Exposing them through /proc
> would require adding yet another /proc/mounts-type file.  We could add
> per-mount directories full of attributes in sysfs, but that has issues also
> (see below).
> 
> Adding an extensible system call interface for retrieving filesystem
> information also allows other things to be exposed:
> 
>  (1) Jeff Layton's error handling changes need a way to allow error event
>      information to be retrieved.
> 
>  (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
>      actually 3-state { Set, Unset, Not supported }.  It could be useful to
>      provide a way to expose information like this[*].
> 
>  (3) Limits of the numerical metadata values in a filesystem[*].
> 
>  (4) Filesystem capability information[*].  Filesystems don't all have the
>      same capabilities, and even different instances may have different
>      capabilities, particularly with network filesystems where the set of
>      may be server-dependent.  Capabilities might even vary at file
>      granularity - though possibly such information should be conveyed
>      through statx() instead.
> 
>  (5) ID mapping/shifting tables in use for a superblock.
> 
>  (6) Filesystem-specific information.  I need something for AFS so that I
>      can do pioctl()-emulation, thereby allowing me to implement certain of
>      the AFS command line utilities that query state of a particular file.
>      This could also have application for other filesystems, such as NFS,
>      CIFS and ext4.
> 
>  [*] In a lot of cases these are probably fixed and can be memcpy'd from
>      static data.
> 
> There's a further consideration: I want to make it possible to have
> fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
> such that the manager can supervise a mount attempted inside the container.
> The manager would be given an fd pointing to the fs_context struct and
> would then need some way to query it (fsinfo()) and modify it (fsconfig()).
> This could also be used to arbitrate user-requested mounts when containers
> are not in play.
> 
> 
> ============================
> WHY NOT USE PROCFS OR SYSFS?
> ============================
> 
> Why is it better to go with a new system call rather than adding more magic
> stuff to /proc or /sysfs for each superblock object and each mount object?
> 
>  (1) It can be targetted.  It makes it easy to query directly by path or
>      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
>      cannot do three of these things easily.
> 
>  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
>      query information pertinent to a particular file?
> 
>  (3) It's more efficient as we can return specific binary data rather than
>      making huge text dumps.  Granted, sysfs and procfs could present the
>      same data, though as lots of little files which have to be
>      individually opened, read, closed and parsed.
> 
>  (4) We wouldn't have the overhead of open and close (even adding a
>      self-contained readfile() syscall has to do that internally).
> 
>  (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
>      file accessed.  We can use an integer attribute ID instead (yes, this
>      is similar to ioctl) - but could also use a string ID if that is
>      preferred.
> 
>  (6) Can query cross-namespace if, say, a container manager process is
>      given an fs_context that hasn't yet been mounted into a namespace - or
>      hasn't even been fully created yet.
> 
>  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
>      mount happens or is removed - and since systemd makes much use of
>      mount namespaces and mount propagation, this will create a lot of
>      nodes.
> 
> 
> ================
> DESIGN DECISIONS
> ================
> 
>  (1) Information is partitioned into sets of attributes.
> 
>  (2) Attribute IDs are integers as they're fast to compare.
> 
>  (3) Attribute values are typed (struct, list of structs, string, opaque
>      blob).  They type is fixed for a particular attribute.
> 
>  (4) For structure types, the length is also a version.  New fields can be
>      tacked onto the end.
> 
>  (5) When copying a versioned struct to userspace, the core handles a
>      version mismatch by truncating or zero-padding the data as necessary.
>      None of this is seen by the filesystem.
> 
>  (6) The core handles all the buffering and buffer resizing.
> 
>  (7) The filesystem never gets any access to the userspace parameter buffer
>      or result buffer.
> 
>  (8) "Meta" attributes can describe other attributes.
> 
> 
> ========
> OVERVIEW
> ========
> 
> fsinfo() is a system call that allows information about the filesystem at a
> particular path point to be queried as a set of attributes.
> 
> Attribute values are of four basic types:
> 
>  (1) Structure with version-dependent length (the length is the version).
> 
>  (2) Variable-length string.
> 
>  (3) List of structures (all the same length).
> 
>  (4) Opaque blob.
> 
> Attributes can have multiple values either as a sequence of values or a
> sequence-of-sequences of values and all the values of a particular
> attribute must be of the same type.  Values can be up to INT_MAX size,
> subject to memory availability.
> 
> Note that the values of an attribute *are* allowed to vary between dentries
> within a single superblock, depending on the specific dentry that you're
> looking at, but the values still have to be of the type for that attribute.
> 
> I've tried to make the interface as light as possible, so integer attribute
> ID rather than string and the core does all the buffer allocation and
> expansion and all the extensibility support work rather than leaving that
> to the filesystems.  This means that userspace pointers are not exposed to
> the filesystem.
> 
> 
> fsinfo() allows a variety of information to be retrieved about a filesystem
> and the mount topology:
> 
>  (1) General superblock attributes:
> 
>      - Filesystem identifiers (UUID, volume label, device numbers, ...)
>      - The limits on a filesystem's capabilities
>      - Information on supported statx fields and attributes and IOC flags.
>      - A variety single-bit flags indicating supported capabilities.
>      - Timestamp resolution and range.
>      - The amount of space/free space in a filesystem (as statfs()).
>      - Superblock notification counter.
> 
>  (2) Filesystem-specific superblock attributes:
> 
>      - Superblock-level timestamps.
>      - Cell name, workgroup or other netfs grouping concept.
>      - Server names and addresses.
> 
>  (3) VFS information:
> 
>      - Mount topology information.
>      - Mount attributes.
>      - Mount notification counter.
>      - Mount point path.
> 
>  (4) Information about what the fsinfo() syscall itself supports, including
>      the type and struct size of attributes.
> 
> The system is extensible:
> 
>  (1) New attributes can be added.  There is no requirement that a
>      filesystem implement every attribute.  A helper function is provided
>      to scan a list of attributes and a filesystem can have multiple such
>      lists.
> 
>  (2) Version length-dependent structure attributes can be made larger and
>      have additional information tacked on the end, provided it keeps the
>      layout of the existing fields.  If an older process asks for a shorter
>      structure, it will only be given the bits it asks for.  If a newer
>      process asks for a longer structure on an older kernel, the extra
>      space will be set to 0.  In all cases, the size of the data actually
>      available is returned.
> 
>      In essence, the size of a structure is that structure's version: a
>      smaller size is an earlier version and a later version includes
>      everything that the earlier version did.
> 
>  (3) New single-bit capability flags can be added.  This is a structure-typed
>      attribute and, as such, (2) applies.  Any bits you wanted but the kernel
>      doesn't support are automatically set to 0.
> 
> fsinfo() may be called like the following, for example:
> 
> 	struct fsinfo_params params = {
> 		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> 		.Nth		= 2,
> 	};
> 	struct fsinfo_server_address address;
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &address, sizeof(address));
> 
> The above example would query an AFS filesystem to retrieve the address
> list for the 3rd server, and:
> 
> 	struct fsinfo_params params = {
> 		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_NFS_SERVER_NAME;
> 	};
> 	char server_name[256];
> 	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
> 		     &server_name, sizeof(server_name));
> 
> would retrieve the name of the NFS server as a string.
> 
> In future, I want to make fsinfo() capable of querying a context created by
> fsopen() or fspick(), e.g.:
> 
> 	fd = fsopen("ext4", 0);
> 	struct fsinfo_params params = {
> 		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
> 		.request	= FSINFO_ATTR_CONFIGURATION;
> 	};
> 	char buffer[65536];
> 	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
> 
> even if that context doesn't currently have a superblock attached.
> 
> The patches can be found here also:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 
> on branch:
> 
> 	fsinfo-core
> 
> 
> ===================
> SIGNIFICANT CHANGES
> ===================
> 
>  ver #18:
> 
>  (*) Moved the mount and superblock notification patches into a different
>      branch.
> 
>  (*) Made superblock configuration (->show_opts), bindmount path
>      (->show_path) and filesystem statistics (->show_stats) available as
>      the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.
> 
>  (*) Made mountpoint device name available, filtered through the superblock
>      (->show_devname), as the SOURCE attribute.
> 
>  (*) Made the mountpoint available as a full path as well as a relative
>      one.
> 
>  (*) Added more event counters to MOUNT_INFO, including a subtree
>      notification counter, to make it easier to clean up after a
>      notification overrun.
> 
>  (*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
>      five event counters.
> 
>  (*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
>      also so that mount ID reuse can be detected.
> 
>  (*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
>      avoid duplicate information.
> 
>  (*) Switched to using the RESOLVE_* flags rather than AT_* flags for
>      pathwalk control.  Added more RESOLVE_* flags.
> 
>  (*) Used a lock instead of RCU to enumerate children for the
>      MOUNT_CHILDREN attribute for safety.  This is probably worth
>      revisiting at a later date, however.
> 
> 
>  ver #17:
> 
>  (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.
> 
>  (*) Rearranged the order in which fsinfo() does things so that the
>      superblock operations table can have a function pointer rather than a
>      table pointer.  The ->fsinfo() op is now called at least twice, once
>      to determine the size of buffer needed and then to retrieve the data.
>      If the retrieval step indicates yet more space is needed, the buffer
>      will be expanded and that step repeated.
> 
>  (*) Merge the element size into the size in the fsinfo_attribute def and
>      don't set size for strings or opaques.  Let a helper work that out.
>      This means that strings can actually get larger then 4K.
> 
>  (*) A helper is provided to scan a list of attributes and call the
>      appropriate get function.  This can be called from a filesystem's
>      ->fsinfo() method multiple times.  It also handles attribute
>      enumeration and info querying.
> 
>  (*) Rearranged the patches to put all the notification patches first.
>      This allowed some of the bits to be squashed together.  At some point,
>      I'll move the notification patches into a different branch.
> 
>  ver #16:
> 
>  (*) Split the features bits out of the fsinfo() core into their own patch
>      and got rid of the name encoding attributes.
> 
>  (*) Renamed the 'array' type to 'list' and made AFS use it for returning
>      server address lists.
> 
>  (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
>      where each attribute has a ->get() method to deal with it.  These
>      tables can then be returned with an fsinfo meta attribute.
> 
>  (*) Dropped the fscontext query and parameter/description retrieval
>      attributes for now.
> 
>  (*) Picked the mount topology attributes into this branch.
> 
>  (*) Picked the mount notifications into this branch and rebased on top of
>      notifications-pipe-core.
> 
>  (*) Picked the superblock notifications into this branch.
> 
>  (*) Add sample code for Ext4 and NFS.
> 
> David
> ---
> David Howells (14):
>       VFS: Add additional RESOLVE_* flags
>       fsinfo: Add fsinfo() syscall to query filesystem information
>       fsinfo: Provide a bitmap of supported features
>       fsinfo: Allow retrieval of superblock devname, options and stats
>       fsinfo: Allow fsinfo() to look up a mount object by ID
>       fsinfo: Add a uniquifier ID to struct mount
>       fsinfo: Allow mount information to be queried
>       fsinfo: Allow the mount topology propogation flags to be retrieved
>       fsinfo: Provide notification overrun handling support
>       fsinfo: sample: Mount listing program
>       fsinfo: Add API documentation
>       fsinfo: Add support for AFS
>       fsinfo: Example support for Ext4
>       fsinfo: Example support for NFS
> 
> 
>  Documentation/filesystems/fsinfo.rst        |  564 +++++++++++++++++
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 
>  arch/arm/tools/syscall.tbl                  |    1 
>  arch/arm64/include/asm/unistd.h             |    2 
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
>  fs/Kconfig                                  |    7 
>  fs/Makefile                                 |    1 
>  fs/afs/internal.h                           |    1 
>  fs/afs/super.c                              |  218 +++++++
>  fs/d_path.c                                 |    2 
>  fs/ext4/Makefile                            |    1 
>  fs/ext4/ext4.h                              |    6 
>  fs/ext4/fsinfo.c                            |   45 +
>  fs/ext4/super.c                             |    3 
>  fs/fsinfo.c                                 |  720 ++++++++++++++++++++++
>  fs/internal.h                               |   13 
>  fs/mount.h                                  |    3 
>  fs/namespace.c                              |  362 +++++++++++
>  fs/nfs/Makefile                             |    1 
>  fs/nfs/fsinfo.c                             |  230 +++++++
>  fs/nfs/internal.h                           |    6 
>  fs/nfs/nfs4super.c                          |    3 
>  fs/nfs/super.c                              |    3 
>  fs/open.c                                   |    8 
>  include/linux/fcntl.h                       |    3 
>  include/linux/fs.h                          |    4 
>  include/linux/fsinfo.h                      |  111 +++
>  include/linux/syscalls.h                    |    4 
>  include/uapi/asm-generic/unistd.h           |    4 
>  include/uapi/linux/fsinfo.h                 |  360 +++++++++++
>  include/uapi/linux/mount.h                  |   10 
>  include/uapi/linux/openat2.h                |    8 
>  include/uapi/linux/windows.h                |   35 +
>  kernel/sys_ni.c                             |    1 
>  samples/vfs/Makefile                        |    7 
>  samples/vfs/test-fsinfo.c                   |  880 +++++++++++++++++++++++++++
>  samples/vfs/test-mntinfo.c                  |  277 ++++++++
>  50 files changed, 3905 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/filesystems/fsinfo.rst
>  create mode 100644 fs/ext4/fsinfo.c
>  create mode 100644 fs/fsinfo.c
>  create mode 100644 fs/nfs/fsinfo.c
>  create mode 100644 include/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/windows.h
>  create mode 100644 samples/vfs/test-fsinfo.c
>  create mode 100644 samples/vfs/test-mntinfo.c
> 
> 
The PostgreSQL devs asked a while back for some way to tell whether
there have been any writeback errors on a superblock w/o having to do
any sort of flush -- just "have there been any so far".

I sent a patch a few weeks ago to make syncfs() return errors when there
have been writeback errors on the superblock. It's not merged yet, but
once we have something like that in place, we could expose info from the
errseq_t to userland using this interface.

Something like this patch would do it (which depends on a few others in
my tree, nothing very large though):

---------------------------8<-----------------------

[PATCH] vfs: allow fsinfo to fetch the current state of s_wb_err

Add a new "error_state" struct to fsinfo, and teach the kernel to fill
that out from sb->s_wb_info. There are two fields:

wb_error_last: the most recently recorded errno for the filesystem

wb_error_cookie: this value will change vs. the previously fetched
                 value if a new error was recorded since it was last
		 checked. Callers should treat this as an opaque value
		 that can be compared to earlier fetched values.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/fsinfo.c                 | 11 +++++++++++
 include/uapi/linux/fsinfo.h | 13 +++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 6d2bc03998e4..3bbe6d7b1a79 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -275,6 +275,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_SOURCE,		fsinfo_generic_mount_source),
 	FSINFO_STRING	(FSINFO_ATTR_CONFIGURATION,	fsinfo_generic_seq_read),
 	FSINFO_STRING	(FSINFO_ATTR_FS_STATISTICS,	fsinfo_generic_seq_read),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_ERROR_STATE,	fsinfo_generic_error_state),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
@@ -376,6 +377,16 @@ static int fsinfo_get_attribute_info(struct path *path,
 	return -EOPNOTSUPP; /* We want to go through all the lists */
 }
 
+static int fsinfo_generic_error_state(struct path *path,
+				      struct fsinfo_context *ctx)
+{
+	struct fsinfo_error_state *es = ctx->buffer;
+
+	es->wb_error_cookie = errseq_scrape(&path->dentry->d_sb->s_wb_err);
+	es->wb_error_last = es->wb_error_cookie & MAX_ERRNO;
+	return sizeof(*es);
+}
+
 /**
  * fsinfo_get_attribute - Look up and handle an attribute
  * @path: The object to query
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 346cf0cf42cb..3d33744c2320 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -27,6 +27,7 @@
 #define FSINFO_ATTR_SOURCE		0x09	/* Superblock source/device name (string) */
 #define FSINFO_ATTR_CONFIGURATION	0x0a	/* Superblock configuration/options (string) */
 #define FSINFO_ATTR_FS_STATISTICS	0x0b	/* Superblock filesystem statistics (string) */
+#define FSINFO_ATTR_ERROR_STATE	0x0c	/* errseq_t state */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
@@ -357,4 +358,16 @@ struct fsinfo_nfs_server_address {
 
 #define FSINFO_ATTR_NFS_SERVER_ADDRESSES__STRUCT struct fsinfo_nfs_server_address
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_ERROR_STATE).
+ *
+ * Retrieve the error state for a filesystem.
+ */
+struct fsinfo_error_state {
+	__u32		wb_error_cookie;	/* writeback error cookie */
+	__u32		wb_error_last;		/* latest writeback error */
+};
+
+#define FSINFO_ATTR_ERROR_STATE__STRUCT struct fsinfo_error_state
+
 #endif /* _UAPI_LINUX_FSINFO_H */
Andres Freund March 9, 2020, 7:22 p.m. UTC | #2
Hi,

On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> The PostgreSQL devs asked a while back for some way to tell whether
> there have been any writeback errors on a superblock w/o having to do
> any sort of flush -- just "have there been any so far".

Indeed.


> I sent a patch a few weeks ago to make syncfs() return errors when there
> have been writeback errors on the superblock. It's not merged yet, but
> once we have something like that in place, we could expose info from the
> errseq_t to userland using this interface.

I'm still a bit worried about the details of errseq_t being exposed to
userland. Partially because it seems to restrict further evolution of
errseq_t, and partially because it will likely up with userland trying
to understand it (it's e.g. just too attractive to report a count of
errors etc).

Is there a reason to not instead report a 64bit counter instead of the
cookie? In contrast to the struct file case we'd only have the space
overhead once per superblock, rather than once per #files * #fd. And it
seems that the maintenance of that counter could be done without
widespread changes, e.g. instead/in addition to your change:

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..897439475315 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -51,7 +51,10 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
>  		return;
>
>  	/* Record in wb_err for checkers using errseq_t based tracking */
> -	filemap_set_wb_err(mapping, error);
> +	__filemap_set_wb_err(mapping, error);
> +
> +	/* Record it in superblock */
> +	errseq_set(&mapping->host->i_sb->s_wb_err, error);
>
>  	/* Record it in flags for now, for legacy callers */
>  	if (error == -ENOSPC)

Btw, seems like mapping_set_error() should have a non-inline cold path?

Greetings,

Andres Freund
Miklos Szeredi March 9, 2020, 8:02 p.m. UTC | #3
On Mon, Mar 09, 2020 at 02:00:46PM +0000, David Howells wrote:
> ============================
> WHY NOT USE PROCFS OR SYSFS?
> ============================

And here's the updated patch (hopefully addressed all of Al's concerns)
that uses procfs and a new mountfs.

Get mountinfo from open file:

  cat /proc/$PID/fdmount/$FD/*

Get mountinfo by mount ID:

  mount -t mountfs mountfs /mountfs
  cat /mountfs/$MNT_ID/*

> Why is it better to go with a new system call rather than adding more magic
> stuff to /proc or /sysfs for each superblock object and each mount object?
> 
>  (1) It can be targetted.  It makes it easy to query directly by path or
>      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
>      cannot do three of these things easily.

See above: with the addition of open(path, O_PATH) it can do all of these.

> 
>  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
>      query information pertinent to a particular file?

Not quite sure why this would be easier for a new ad-hoc interface than for
the well established filesystem API.

> 
>  (3) It's more efficient as we can return specific binary data rather than
>      making huge text dumps.  Granted, sysfs and procfs could present the
>      same data, though as lots of little files which have to be
>      individually opened, read, closed and parsed.
> 
>  (4) We wouldn't have the overhead of open and close (even adding a
>      self-contained readfile() syscall has to do that internally).
> 
>  (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
>      file accessed.  We can use an integer attribute ID instead (yes, this
>      is similar to ioctl) - but could also use a string ID if that is
>      preferred.

Is that super-high performance really warranted?  What would be the
application of that?

> 
>  (6) Can query cross-namespace if, say, a container manager process is
>      given an fs_context that hasn't yet been mounted into a namespace - or
>      hasn't even been fully created yet.

This patch can do that too.

> 
>  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
>      mount happens or is removed - and since systemd makes much use of
>      mount namespaces and mount propagation, this will create a lot of
>      nodes.

This patch creates a single struct mountfs_entry per mount, which is 48bytes.

Now onto the advantages of a filesystem based API:

 - immediately usable from all programming languages, including scripts

 - same goes for future extensions: no need to update libc, utils, language
   bindings, strace, etc...

Thanks,
Miklos

---
 fs/Makefile              |    1 
 fs/mount.h               |    8 
 fs/mountfs/Makefile      |    1 
 fs/mountfs/super.c       |  502 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c           |   31 ++
 fs/proc/base.c           |    2 
 fs/proc/fd.c             |   82 +++++++
 fs/proc/fd.h             |    3 
 fs/proc_namespace.c      |   22 --
 fs/seq_file.c            |   23 ++
 include/linux/seq_file.h |    1 
 11 files changed, 654 insertions(+), 22 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -135,3 +135,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +154,10 @@ static inline bool is_anon_ns(struct mnt
 {
 	return ns->seq == 0;
 }
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt;
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int tmp, err = -ENODEV;
+
+	mnt_namespace_lock_read();
+
+	mnt = entry->mnt;
+	if (!mnt || !mnt->mnt_ns)
+		goto out;
+
+	err = 0;
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+		path_put(&root);
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		seq_printf(sf, "%i\n", tmp);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		struct mount *child;
+		bool first = true;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			if (!first)
+				seq_putc(sf, ',');
+			else
+				first = false;
+			seq_printf(sf, "%i", child->mnt_id);
+		}
+		if (!first)
+			seq_putc(sf, '\n');
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			seq_printf(sf, "%i\n", tmp);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			tmp = get_dominating_id(mnt, &root);
+			if (tmp)
+				seq_printf(sf, "%i\n", tmp);
+		}
+	} else {
+		WARN_ON(1);
+		err = -EIO;
+	}
+out:
+	mnt_namespace_unlock_read();
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+	spin_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	spin_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	snprintf(buf, sizeof(buf), "%lu", mnt_id);
+	if (strcmp(buf, name) != 0)
+		return NULL;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	spin_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOMEM);
+		i++;
+		kref_get(&entry->kref);
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invlisible in this namespace: valid but deny entry*/
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len, pos, id;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	pos = ctx->pos - 2;
+	do {
+		spin_lock(&mountfs_lock);
+		mountfs_find_node(pos, &node);
+		pos = 1U + INT_MAX;
+		do {
+			if (!node) {
+				spin_unlock(&mountfs_lock);
+				goto out;
+			}
+			entry = mountfs_node_to_entry(node);
+			node = rb_next(node);
+		} while (!mountfs_entry_visible(entry));
+		if (node)
+			pos = mountfs_node_to_entry(node)->id;
+		id = entry->id;
+		spin_unlock(&mountfs_lock);
+
+		len = snprintf(name, sizeof(name), "%i", id);
+		ctx->pos = id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+			return 0;
+	} while (pos <= INT_MAX);
+out:
+	ctx->pos =  pos + 2;
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	dentry = d_hash_and_lookup(root, &this);
+	if (dentry && dentry->d_inode->i_private != entry) {
+		d_invalidate(dentry);
+		dput(dentry);
+		dentry = NULL;
+	}
+	if (!dentry) {
+		dentry = d_alloc(root, &this);
+		if (!dentry)
+			return -ENOMEM;
+
+		kref_get(&entry->kref);
+		old = mountfs_lookup_entry(dentry, entry, 0);
+		if (old) {
+			dput(dentry);
+			if (IS_ERR(old))
+				return PTR_ERR(old);
+			dentry = old;
+		}
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -943,6 +943,8 @@ struct vfsmount *vfs_create_mount(struct
 
 	if (fc->sb_flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
+	else
+		mountfs_create(mnt);
 
 	atomic_inc(&fc->root->d_sb->s_active);
 	mnt->mnt.mnt_sb		= fc->root->d_sb;
@@ -1013,7 +1015,7 @@ vfs_submount(const struct dentry *mountp
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt.mnt_sb;
@@ -1079,6 +1081,17 @@ static struct mount *clone_mnt(struct mo
 	return ERR_PTR(err);
 }
 
+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+			       int flag)
+{
+	struct mount *mnt = clone_mnt_common(old, root, flag);
+
+	if (!IS_ERR(mnt))
+		mountfs_create(mnt);
+
+	return mnt;
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	struct hlist_node *p;
@@ -1091,6 +1104,7 @@ static void cleanup_mnt(struct mount *mn
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1171,6 +1185,8 @@ static void mntput_no_expire(struct moun
 	unlock_mount_hash();
 	shrink_dentry_list(&list);
 
+	mountfs_remove(mnt);
+
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
 		if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1237,13 +1253,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
 struct vfsmount *mnt_clone_internal(const struct path *path)
 {
 	struct mount *p;
-	p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+	p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
 	if (IS_ERR(p))
 		return ERR_CAST(p);
 	p->mnt.mnt_flags |= MNT_INTERNAL;
 	return &p->mnt;
 }
 
+
 #ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
 static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1385,6 +1402,16 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void mnt_namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations pro
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations pro
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_fil
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struc
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, str
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
Jeff Layton March 9, 2020, 10:49 p.m. UTC | #4
On Mon, 2020-03-09 at 12:22 -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> > The PostgreSQL devs asked a while back for some way to tell whether
> > there have been any writeback errors on a superblock w/o having to do
> > any sort of flush -- just "have there been any so far".
> 
> Indeed.
> 
> 
> > I sent a patch a few weeks ago to make syncfs() return errors when there
> > have been writeback errors on the superblock. It's not merged yet, but
> > once we have something like that in place, we could expose info from the
> > errseq_t to userland using this interface.
> 
> I'm still a bit worried about the details of errseq_t being exposed to
> userland. Partially because it seems to restrict further evolution of
> errseq_t, and partially because it will likely up with userland trying
> to understand it (it's e.g. just too attractive to report a count of
> errors etc).

Trying to interpret the counter field won't really tell you anything.
The counter is not incremented unless someone has queried the value
since it was last checked. A single increment could represent a single
writeback error or 10000 identical ones.

There _is_ a flag that tells you whether someone has queried it, but
that gets masked off before copying the cookie to userland.

> Is there a reason to not instead report a 64bit counter instead of the
> cookie? In contrast to the struct file case we'd only have the space
> overhead once per superblock, rather than once per #files * #fd. And it
> seems that the maintenance of that counter could be done without
> widespread changes, e.g. instead/in addition to your change:
> 

What problem would moving to a 64-bit counter solve? I get the concern
about people trying to get a counter out of the cookie field, but giving
people an explicit 64-bit counter seems even more open to
misinterpretation.

All that said, is an opaque cookie still something you'd find useful?

> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index ccb14b6a16b5..897439475315 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -51,7 +51,10 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
> >  		return;
> > 
> >  	/* Record in wb_err for checkers using errseq_t based tracking */
> > -	filemap_set_wb_err(mapping, error);
> > +	__filemap_set_wb_err(mapping, error);
> > +
> > +	/* Record it in superblock */
> > +	errseq_set(&mapping->host->i_sb->s_wb_err, error);
> > 
> >  	/* Record it in flags for now, for legacy callers */
> >  	if (error == -ENOSPC)
> 
> Btw, seems like mapping_set_error() should have a non-inline cold path?

Good point. I'll do that in the next iteration.
David Howells March 9, 2020, 10:52 p.m. UTC | #5
Miklos Szeredi <miklos@szeredi.hu> wrote:

> >  (1) It can be targetted.  It makes it easy to query directly by path or
> >      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
> >      cannot do three of these things easily.
> 
> See above: with the addition of open(path, O_PATH) it can do all of these.

That's a horrible interface.  To query a file by path, you have to do:

	fd = open(path, O_PATH);
	sprintf(procpath, "/proc/self/fdmount/%u/<attr>");
	fd2 = open(procpath, O_RDONLY);
	read(fd2, ...);
	close(fd2);
	close(fd);

See point (3) about efficiency also.  You're having to open *two* files.

> >  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
> >      query information pertinent to a particular file?
> 
> Not quite sure why this would be easier for a new ad-hoc interface than for
> the well established filesystem API.

You're right.  That's why fsinfo() uses standard pathwalk where possible,
e.g.:

	fsinfo(AT_FDCWD, "/path/to/file", ...);

or a fairly standard fd-querying interface:

	fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH },  ...);

to query an open file descriptor.  These are well-established filesystem APIs.

Where I vary from this is allowing direct specification of a mount ID also,
with a special flag to say that's what I'm doing:

	fsinfo(AT_FDCWD, "23", { flags = FSINFO_QUERY_FLAGS_MOUNT },  ...);

> >  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
> >      mount happens or is removed - and since systemd makes much use of
> >      mount namespaces and mount propagation, this will create a lot of
> >      nodes.
> 
> This patch creates a single struct mountfs_entry per mount, which is 48bytes.

fsinfo() doesn't create any.  Furthermore, it seems that mounts get multiplied
8-10 times by systemd - though, as you say, it's not necessarily a great deal
of memory.

> Now onto the advantages of a filesystem based API:
> 
>  - immediately usable from all programming languages, including scripts

This is not true.  You can't open O_PATH from shell scripts, so you can't
query things by path that you can't or shouldn't open (dev file paths, for
example; symlinks).

I imagine you're thinking of something like:

	{
		id=`cat /proc/self/fdmount/5/parent_mount`
	} 5</my/path/to/my/file

but what if /my/path/to/my/file is actually /dev/foobar?

I've had a grep through the bash sources, but can't seem to find anywhere that
uses O_PATH.

>  - same goes for future extensions: no need to update libc, utils, language
>    bindings, strace, etc...

Applications and libraries using these attributes would have to change anyway
to make use of additional information.

But it's not a good argument since you now have to have text parsers that
change over time.

David
Andres Freund March 10, 2020, 12:18 a.m. UTC | #6
Hi,

On 2020-03-09 18:49:31 -0400, Jeff Layton wrote:
> On Mon, 2020-03-09 at 12:22 -0700, Andres Freund wrote:
> > On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> > > I sent a patch a few weeks ago to make syncfs() return errors when there
> > > have been writeback errors on the superblock. It's not merged yet, but
> > > once we have something like that in place, we could expose info from the
> > > errseq_t to userland using this interface.
> >
> > I'm still a bit worried about the details of errseq_t being exposed to
> > userland. Partially because it seems to restrict further evolution of
> > errseq_t, and partially because it will likely up with userland trying
> > to understand it (it's e.g. just too attractive to report a count of
> > errors etc).
>
> Trying to interpret the counter field won't really tell you anything.
> The counter is not incremented unless someone has queried the value
> since it was last checked. A single increment could represent a single
> writeback error or 10000 identical ones.

Oh, right.  A zero errseq would still indicate something, but that's
probably fine.


> > Is there a reason to not instead report a 64bit counter instead of the
> > cookie? In contrast to the struct file case we'd only have the space
> > overhead once per superblock, rather than once per #files * #fd. And it
> > seems that the maintenance of that counter could be done without
> > widespread changes, e.g. instead/in addition to your change:

> What problem would moving to a 64-bit counter solve? I get the concern
> about people trying to get a counter out of the cookie field, but giving
> people an explicit 64-bit counter seems even more open to
> misinterpretation.

Well, you could get an actual error count out of it? I was thinking that
that value would get incremented every time mapping_set_error() is
called, which should make it a meaningful count?

Greetings,

Andres Freund
Miklos Szeredi March 10, 2020, 9:18 a.m. UTC | #7
On Mon, Mar 9, 2020 at 11:53 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > >  (1) It can be targetted.  It makes it easy to query directly by path or
> > >      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
> > >      cannot do three of these things easily.
> >
> > See above: with the addition of open(path, O_PATH) it can do all of these.
>
> That's a horrible interface.  To query a file by path, you have to do:
>
>         fd = open(path, O_PATH);
>         sprintf(procpath, "/proc/self/fdmount/%u/<attr>");
>         fd2 = open(procpath, O_RDONLY);
>         read(fd2, ...);
>         close(fd2);
>         close(fd);
>
> See point (3) about efficiency also.  You're having to open *two* files.

I completely agree, opening two files is surely going to kill
performance of application needing to retrieve a billion mount
attributes per second.</sarcasm>

> > >  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
> > >      query information pertinent to a particular file?
> >
> > Not quite sure why this would be easier for a new ad-hoc interface than for
> > the well established filesystem API.
>
> You're right.  That's why fsinfo() uses standard pathwalk where possible,
> e.g.:
>
>         fsinfo(AT_FDCWD, "/path/to/file", ...);
>
> or a fairly standard fd-querying interface:
>
>         fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH },  ...);
>
> to query an open file descriptor.  These are well-established filesystem APIs.

Yes.  The problem is with the "..." part where you pass random
structures to a function.   That's useful sometimes, but at the very
least it breaks type safety, and not what I would call a "clean" API.

> > Now onto the advantages of a filesystem based API:
> >
> >  - immediately usable from all programming languages, including scripts
>
> This is not true.  You can't open O_PATH from shell scripts, so you can't
> query things by path that you can't or shouldn't open (dev file paths, for
> example; symlinks).

Yes.  However, you just wrote the core of a utility that could do this
(in 6 lines, no less).  Now try that feat with fsinfo(2)!

Thanks,
Miklos