[00/19] VFS: Filesystem information and notifications [ver #16]

Message ID 158204549488.3299825.3783690177353088425.stgit@warthog.procyon.org.uk (mailing list archive)

Message

David Howells Feb. 18, 2020, 5:04 p.m. UTC
Here is a set of patches that adds system calls that (a) allow
information about the VFS, mount topology, superblock and files to be
retrieved and (b) allow for notifications of mount topology rearrangement
events, mount and superblock attribute changes and other superblock events,
such as errors.

============================
FILESYSTEM INFORMATION QUERY
============================

The first system call, fsinfo(), allows information about the filesystem at
a particular path point to be queried as a set of attributes, some of which
may have more than one value.

Attribute values are of four basic types:

 (1) Version dependent-length structure (size defined by type).

 (2) Variable-length string (up to 4096, including NUL).

 (3) List of structures (up to INT_MAX size).

 (4) Opaque blob (up to INT_MAX size).

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at; only the type of the values is fixed.

I've tried to make the interface as light as possible: the attribute
selector is an integer/enum rather than a string, and the core does all the
allocation and extensibility support work rather than leaving that to the
filesystems.
That means that for the first two attribute types, the filesystem will
always see a sufficiently-sized buffer allocated.  Further, this removes
the possibility of the filesystem gaining access to the userspace buffer.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name.
     - Server names and addresses.
     - Filesystem-specific information.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct/element size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  Note that the core VFS keeps a
     table of types and sizes so it can handle future extensibility rather
     than delegating this to the filesystems.

 (2) Version dependent-length structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.
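
The size-as-version rule in (2) can be modelled in userspace.  What follows
is a minimal sketch, not the kernel code: fsinfo_copy_versioned() and the two
example structures are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical v1 and v2 layouts of one attribute; v2 only appends a field. */
struct example_attr_v1 {
	unsigned int	a;
};

struct example_attr_v2 {
	unsigned int	a;
	unsigned int	b;	/* Added in the later version */
};

/*
 * Model of the copy-out rule: give the caller as much of the kernel's
 * structure as fits in its buffer, zero any buffer space the kernel has no
 * data for and return the size of the data actually available.
 */
static size_t fsinfo_copy_versioned(void *user_buf, size_t user_size,
				    const void *kern_buf, size_t kern_size)
{
	size_t n = user_size < kern_size ? user_size : kern_size;

	assert(user_buf && kern_buf);
	memcpy(user_buf, kern_buf, n);
	if (user_size > n)
		memset((char *)user_buf + n, 0, user_size - n);
	return kern_size;
}
```

A caller that gets back a size larger than its own structure knows the kernel
has a newer version; a smaller size means the trailing fields were zeroed.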

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_CELL_NAME,
	};
	char cell_name[256];
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     cell_name, sizeof(cell_name));

would retrieve the name of an AFS cell as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_PARAMETERS,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.  I would
prefer this to contain length-prefixed strings so that there's no need to
insert escaping, especially as any character, including '\', can be used as
the separator in cifs and so that binary parameters can be returned (though
that is a lesser issue).
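
To illustrate why length-prefixed strings avoid escaping, here is a rough
sketch of one possible encoding.  The 16-bit little-endian prefix and the
lp_append()/lp_next() helpers are assumptions made for this example only; the
actual format has not been decided.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one length-prefixed string at offset pos.  Returns the new offset,
 * or 0 if the string does not fit.
 */
static size_t lp_append(unsigned char *buf, size_t size, size_t pos,
			const void *str, size_t len)
{
	assert(buf && str);
	if (len > UINT16_MAX || pos > size || size - pos < 2 + len)
		return 0;
	buf[pos] = len & 0xff;		/* Low byte of the length */
	buf[pos + 1] = len >> 8;	/* High byte of the length */
	memcpy(buf + pos + 2, str, len);
	return pos + 2 + len;
}

/*
 * Fetch the string at *pos and advance *pos past it.  Returns 1 on success
 * and 0 at the end of the buffer or on a truncated record.
 */
static int lp_next(const unsigned char *buf, size_t size, size_t *pos,
		   const unsigned char **str, size_t *len)
{
	size_t n;

	if (*pos > size || size - *pos < 2)
		return 0;
	n = buf[*pos] | ((size_t)buf[*pos + 1] << 8);
	if (size - *pos - 2 < n)
		return 0;
	*str = buf + *pos + 2;
	*len = n;
	*pos += 2 + n;
	return 1;
}
```

Because the reader never scans for a delimiter, a value such as "a,b\c" or a
binary blob containing NUL bytes round-trips unchanged.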


========================
FILESYSTEM NOTIFICATIONS
========================

The second system call, watch_mount(), places a watch on a point in the
mount topology specified by the dirfd, path and at_flags parameters.  All
mount topology change and mount attribute change notifications in the
subtree rooted at that point can be intercepted by the watch.  Watches are
ducted through pipes:

	int fd[2];
	pipe2(fd, O_NOTIFICATION_PIPE);
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	watch_mount(AT_FDCWD, "/", 0, fd[0], 0x02);

Events include:

 - New mount made
 - Mount unmounted
 - Mount expired
 - R/O state changed
 - Other attribute changed
 - Mount moved from
 - Mount moved to

Using filtering, this may be limited in various ways (single mount watch vs
subtree watch, recursive vs non-recursive changes, to-R/O vs to-R/W, mount
vs submount).

Each mount now has a change counter.  Whenever a mount is changed, this
gets incremented.  It can be queried by fsinfo() using either
FSINFO_ATTR_MOUNT_INFO or FSINFO_ATTR_MOUNT_CHILDREN.  The ID of the mount
on which the notification is generated is placed into the notification
message (triggered_on).  If the event involves a second mount as well, such
as creation of a new mount, that gets returned too (changed_mount).
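
One use for the change counter, assuming a watcher that caches the mount
tree, is to work out which mounts changed while notifications were missed.
The sketch below only models the comparison: struct cached_mount, its field
widths and mark_stale_mounts() are hypothetical, and the fsinfo() query that
would supply the current values is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace cache entry for one mount. */
struct cached_mount {
	uint32_t	mnt_id;		/* Mount ID as seen in notifications */
	uint64_t	change_counter;	/* Counter value when last scanned */
	int		stale;		/* Set if the mount must be re-read */
};

/*
 * Mark every cached mount whose counter differs from the freshly queried
 * value and remember the new value.  current[i] stands in for whatever
 * fsinfo() would return for cache[i].  Returns the number of mounts marked
 * stale.
 */
static size_t mark_stale_mounts(struct cached_mount *cache,
				const uint64_t *current, size_t nr)
{
	size_t i, stale = 0;

	assert(nr == 0 || (cache && current));
	for (i = 0; i < nr; i++) {
		if (cache[i].change_counter != current[i]) {
			cache[i].change_counter = current[i];
			cache[i].stale = 1;
			stale++;
		}
	}
	return stale;
}
```

Only the mounts marked stale then need their attributes and children read
again.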


The third system call, watch_sb(), places a watch on the superblock
specified by the dirfd, path and at_flags parameters.  This allows various
superblock events to be monitored for, such as:

 - Transition between R/W and R/O
 - Filesystem errors
 - Quota overrun
 - Network status changes

Each superblock now gets a 64-bit unique superblock identifier and a
notification counter.  The counter is incremented each time one of these
notifications would be generated.  These attributes can be queried using
fsinfo() with FSINFO_ATTR_SB_NOTIFICATIONS.  The identifier is placed into
notification messages.


Two sample programs are provided, one to query filesystem attributes and
the other to display a mount subtree.  Both of them can be given a path or
a mount ID to start at.  Further, the watch_test sample program now watches
for mount events under "/" and for superblock events on whatever superblock
is backing "/mnt" when the program is started.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core


===================
SIGNIFICANT CHANGES
===================

 ver #16:

 (*) Split the features bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.

 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Added sample code for Ext4 and NFS.

David
---
David Howells (19):
      vfs: syscall: Add fsinfo() to query filesystem information
      fsinfo: Add syscalls to other arches
      fsinfo: Provide a bitmap of supported features
      vfs: Add mount change counter
      vfs: Introduce a non-repeating system-unique superblock ID
      vfs: Allow fsinfo() to look up a mount object by ID
      vfs: Allow mount information to be queried by fsinfo()
      vfs: fsinfo sample: Mount listing program
      fsinfo: Allow the mount topology propogation flags to be retrieved
      fsinfo: Add API documentation
      afs: Support fsinfo()
      security: Add hooks to rule on setting a superblock or mount watch
      vfs: Add a mount-notification facility
      notifications: sample: Display mount tree change notifications
      vfs: Add superblock notifications
      fsinfo: Provide superblock notification counter
      notifications: sample: Display superblock notifications
      ext4: Add example fsinfo information
      nfs: Add example filesystem information


 Documentation/filesystems/fsinfo.rst        |  490 +++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    3 
 arch/arm/tools/syscall.tbl                  |    3 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/arm64/include/asm/unistd32.h           |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    3 
 arch/m68k/kernel/syscalls/syscall.tbl       |    4 
 arch/microblaze/kernel/syscalls/syscall.tbl |    3 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    3 
 arch/parisc/kernel/syscalls/syscall.tbl     |    3 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    3 
 arch/s390/kernel/syscalls/syscall.tbl       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    3 
 arch/sparc/kernel/syscalls/syscall.tbl      |    3 
 arch/x86/entry/syscalls/syscall_32.tbl      |    3 
 arch/x86/entry/syscalls/syscall_64.tbl      |    3 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    3 
 fs/Kconfig                                  |   28 +
 fs/Makefile                                 |    2 
 fs/afs/internal.h                           |    1 
 fs/afs/super.c                              |  229 +++++++
 fs/d_path.c                                 |    2 
 fs/ext4/Makefile                            |    1 
 fs/ext4/ext4.h                              |    9 
 fs/ext4/fsinfo.c                            |   40 +
 fs/ext4/super.c                             |    1 
 fs/fsinfo.c                                 |  635 ++++++++++++++++++++
 fs/internal.h                               |   12 
 fs/mount.h                                  |   31 +
 fs/mount_notify.c                           |  188 ++++++
 fs/namespace.c                              |  323 ++++++++++
 fs/nfs/Makefile                             |    1 
 fs/nfs/internal.h                           |    8 
 fs/nfs/nfs4super.c                          |    1 
 fs/nfs/super.c                              |    1 
 fs/super.c                                  |  149 +++++
 include/linux/dcache.h                      |    1 
 include/linux/fs.h                          |   89 +++
 include/linux/fsinfo.h                      |  102 +++
 include/linux/lsm_hooks.h                   |   24 +
 include/linux/security.h                    |   16 +
 include/linux/syscalls.h                    |    8 
 include/uapi/asm-generic/unistd.h           |    8 
 include/uapi/linux/fsinfo.h                 |  371 ++++++++++++
 include/uapi/linux/mount.h                  |   10 
 include/uapi/linux/watch_queue.h            |   61 ++
 kernel/sys_ni.c                             |    3 
 samples/vfs/Makefile                        |    7 
 samples/vfs/test-fsinfo.c                   |  858 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  |  243 ++++++++
 samples/watch_queue/watch_test.c            |   76 ++
 security/security.c                         |   14 
 54 files changed, 4081 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c

Comments

Stefan Metzmacher Feb. 19, 2020, 10:23 a.m. UTC | #1
Hi David,

I have a few generic remarks for new syscalls...

>  (3) New single-bit capability flags can be added.  This is a structure-typed
>      attribute and, as such, (2) applies.  Any bits you wanted but the kernel
>      doesn't support are automatically set to 0.
> 
> fsinfo() may be called like the following, for example:
> 
> 	struct fsinfo_params params = {
> 		.at_flags	= AT_SYMLINK_NOFOLLOW,

Shouldn't all new syscalls be able to provide the RESOLVE_ flags
supported in openat2?

> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> 		.Nth		= 2,
> 	};
> 	struct fsinfo_server_address address;
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &address, sizeof(address));

Also passing sizeof(params) would allow future updates of fsinfo_params,
also similar to openat2(), clone3()...

> ========================
> FILESYSTEM NOTIFICATIONS
> ========================
> 
> The second system call, watch_mount(), places a watch on a point in the
> mount topology specified by the dirfd, path and at_flags parameters.  All
> mount topology change and mount attribute change notifications in the
> subtree rooted at that point can be intercepted by the watch.  Watches are
> ducted through pipes:
> 
> 	int fd[2];
> 	pipe2(fd, O_NOTIFICATION_PIPE);
> 	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> 	watch_mount(AT_FDCWD, "/", 0, fd[0], 0x02);

I guess similar things apply here.

Does that make sense to you?

metze
Christian Brauner Feb. 19, 2020, 2:46 p.m. UTC | #2
On Tue, Feb 18, 2020 at 05:04:55PM +0000, David Howells wrote:
> 
> Here are a set of patches that adds system calls, that (a) allow
> information about the VFS, mount topology, superblock and files to be
> retrieved and (b) allow for notifications of mount topology rearrangement
> events, mount and superblock attribute changes and other superblock events,
> such as errors.
> 
> ============================
> FILESYSTEM INFORMATION QUERY
> ============================
> 
> The first system call, fsinfo(), allows information about the filesystem at
> a particular path point to be queried as a set of attributes, some of which
> may have more than one value.
> 
> Attribute values are of four basic types:
> 
>  (1) Version dependent-length structure (size defined by type).
> 
>  (2) Variable-length string (up to 4096, including NUL).
> 
>  (3) List of structures (up to INT_MAX size).
> 
>  (4) Opaque blob (up to INT_MAX size).

I mainly have an organizational question. :) This is a huge patchset
with lots and lots of (good) features. Wouldn't it make sense to make
the fsinfo() syscall a completely separate patchset from the
watch_mount() and watch_sb() syscalls? It seems that they don't need to
depend on each other at all. This would make reviewing this so much
nicer and likely would mean that fsinfo() could proceed a little faster.

Christian
Darrick J. Wong Feb. 19, 2020, 3:50 p.m. UTC | #3
On Wed, Feb 19, 2020 at 03:46:13PM +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 05:04:55PM +0000, David Howells wrote:
> > 
> > Here are a set of patches that adds system calls, that (a) allow
> > information about the VFS, mount topology, superblock and files to be
> > retrieved and (b) allow for notifications of mount topology rearrangement
> > events, mount and superblock attribute changes and other superblock events,
> > such as errors.
> > 
> > ============================
> > FILESYSTEM INFORMATION QUERY
> > ============================
> > 
> > The first system call, fsinfo(), allows information about the filesystem at
> > a particular path point to be queried as a set of attributes, some of which
> > may have more than one value.
> > 
> > Attribute values are of four basic types:
> > 
> >  (1) Version dependent-length structure (size defined by type).
> > 
> >  (2) Variable-length string (up to 4096, including NUL).
> > 
> >  (3) List of structures (up to INT_MAX size).
> > 
> >  (4) Opaque blob (up to INT_MAX size).
> 
> I mainly have an organizational question. :) This is a huge patchset
> with lots and lots of (good) features. Wouldn't it make sense to make
> the fsinfo() syscall a completely separate patchset from the
> watch_mount() and watch_sb() syscalls? It seems that they don't need to
> depend on each other at all. This would make reviewing this so much
> nicer and likely would mean that fsinfo() could proceed a little faster.

Agreed; I was also wondering why it was necessary to have three new
features in the same large(ish) patchset.

--D

> Christian
David Howells Feb. 19, 2020, 4:16 p.m. UTC | #4
Christian Brauner <christian.brauner@ubuntu.com> wrote:

> I mainly have an organizational question. :) This is a huge patchset
> with lots and lots of (good) features. Wouldn't it make sense to make
> the fsinfo() syscall a completely separate patchset from the
> watch_mount() and watch_sb() syscalls? It seems that they don't need to
> depend on each other at all. This would make reviewing this so much
> nicer and likely would mean that fsinfo() could proceed a little faster.

I can split it up again, but it's not quite as independent as it seems.

There's a notification counter added to both the mount struct and the
super_block struct.  This is bumped by notifications and retrieved by
fsinfo().  You need this in the event that there's an overrun and you have to
rescan the whole tree.

So to actually make use of the mount/sb notification facilities, you need
fsinfo() as well.

David
Ian Kent Feb. 20, 2020, 4:42 a.m. UTC | #5
On Wed, 2020-02-19 at 15:46 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 05:04:55PM +0000, David Howells wrote:
> > Here are a set of patches that adds system calls, that (a) allow
> > information about the VFS, mount topology, superblock and files to
> > be
> > retrieved and (b) allow for notifications of mount topology
> > rearrangement
> > events, mount and superblock attribute changes and other superblock
> > events,
> > such as errors.
> > 
> > ============================
> > FILESYSTEM INFORMATION QUERY
> > ============================
> > 
> > The first system call, fsinfo(), allows information about the
> > filesystem at
> > a particular path point to be queried as a set of attributes, some
> > of which
> > may have more than one value.
> > 
> > Attribute values are of four basic types:
> > 
> >  (1) Version dependent-length structure (size defined by type).
> > 
> >  (2) Variable-length string (up to 4096, including NUL).
> > 
> >  (3) List of structures (up to INT_MAX size).
> > 
> >  (4) Opaque blob (up to INT_MAX size).
> 
> I mainly have an organizational question. :) This is a huge patchset
> with lots and lots of (good) features. Wouldn't it make sense to make
> the fsinfo() syscall a completely separate patchset from the
> watch_mount() and watch_sb() syscalls? It seems that they don't need
> to
> depend on each other at all. This would make reviewing this so much
> nicer and likely would mean that fsinfo() could proceed a little
> faster.

The remainder of the fsinfo() series would need to remain useful
if this was done.

For context, I want to work on improving the handling of large mount
tables.

Ultimately I expect to solve a very long standing autofs problem
of using large direct mount maps without prohibitive performance
overhead (and there are a lot of rather challenging autofs changes to
do for this too) and I believe the fsinfo() system call, and
related bits, is the way to do this.

But improving the handling of large mount tables for autofs
will have the side effect of improvements for other mount table
users, even in the early stages of this work.

For example I want to use this for mount table handling improvements
in libmount. Clearly that ultimately needs mount change notification
in the end but ...

There's a bunch of things that need to be done along the way
to even get started.

One thing that's needed is the ability to call fsinfo() to get
information on a mount to avoid constant reading of the proc based
mount table, which happens a lot (since the mount info. needs
to be up to date) so systemd (and others) would see an improvement
with the fsinfo() system call alone able to be used in libmount.

But for the fsinfo() system call to be used for this the file
system specific mount options need to also be obtained when
using fsinfo(). That means the super block operation fsinfo uses
to provide this must be implemented for at least most file systems.

So separating out the notifications part, leaving whatever is needed
to still be able to do this, should be fine and the system call
would be immediately useful once the super operation is implemented
for the needed file systems.

Whether the implementation of the super operation should be done
as part of this series is another question but would certainly
be a challenge and make the series more complicated. But it is needed
for the change to be useful in my case.

Ian
Christian Brauner Feb. 20, 2020, 9:09 a.m. UTC | #6
On Thu, Feb 20, 2020 at 12:42:15PM +0800, Ian Kent wrote:
> On Wed, 2020-02-19 at 15:46 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 05:04:55PM +0000, David Howells wrote:
> > > Here are a set of patches that adds system calls, that (a) allow
> > > information about the VFS, mount topology, superblock and files to
> > > be
> > > retrieved and (b) allow for notifications of mount topology
> > > rearrangement
> > > events, mount and superblock attribute changes and other superblock
> > > events,
> > > such as errors.
> > > 
> > > ============================
> > > FILESYSTEM INFORMATION QUERY
> > > ============================
> > > 
> > > The first system call, fsinfo(), allows information about the
> > > filesystem at
> > > a particular path point to be queried as a set of attributes, some
> > > of which
> > > may have more than one value.
> > > 
> > > Attribute values are of four basic types:
> > > 
> > >  (1) Version dependent-length structure (size defined by type).
> > > 
> > >  (2) Variable-length string (up to 4096, including NUL).
> > > 
> > >  (3) List of structures (up to INT_MAX size).
> > > 
> > >  (4) Opaque blob (up to INT_MAX size).
> > 
> > I mainly have an organizational question. :) This is a huge patchset
> > with lots and lots of (good) features. Wouldn't it make sense to make
> > the fsinfo() syscall a completely separate patchset from the
> > watch_mount() and watch_sb() syscalls? It seems that they don't need
> > to
> > depend on each other at all. This would make reviewing this so much
> > nicer and likely would mean that fsinfo() could proceed a little
> > faster.
> 
> The remainder of the fsinfo() series would need to remain useful
> if this was done.
> 
> For context I want work on improving handling of large mount
> tables.

Yeah, I've talked to David about this; polling on a large mountinfo file
is not great, I agree.

> 
> Ultimately I expect to solve a very long standing autofs problem
> of using large direct mount maps without prohibitive performance
> overhead (and there a lot of rather challenging autofs changes to
> do for this too) and I believe the fsinfo() system call, and
> related bits, is the way to do this.
> 
> But improving the handling of large mount tables for autofs
> will have the side effect of improvements for other mount table
> users, even in the early stages of this work.
> 
> For example I want to use this for mount table handling improvements
> in libmount. Clearly that ultimately needs mount change notification
> in the end but ...
> 
> There's a bunch of things that need to be done alone the way
> to even get started.
> 
> One thing that's needed is the ability to call fsinfo() to get
> information on a mount to avoid constant reading of the proc based
> mount table, which happens a lot (since the mount info. needs
> to be up to date) so systemd (and others) would see an improvement
> with the fsinfo() system call alone able to be used in libmount.
> 
> But for the fsinfo() system call to be used for this the file
> system specific mount options need to also be obtained when
> using fsinfo(). That means the super block operation fsinfo uses
> to provide this must be implemented for at least most file systems.
> 
> So separating out the notifications part, leaving whatever is needed
> to still be able to do this, should be fine and the system call
> would be immediately useful once the super operation is implemented
> for the needed file systems.
> 
> Whether the implementation of the super operation should be done
> as part of this series is another question but would certainly
> be a challenge and make the series more complicated. But is needed
> for the change to be useful in my case.

I think what might work - and what David had already brought up
briefly - is to either base the fsinfo branch on top of the mount
notification branch or break the notification counters pieces into a
separate patch and base both mount notifications and fsinfo on top of
it.

Christian
Ian Kent Feb. 20, 2020, 11:30 a.m. UTC | #7
On Thu, 2020-02-20 at 10:09 +0100, Christian Brauner wrote:
> On Thu, Feb 20, 2020 at 12:42:15PM +0800, Ian Kent wrote:
> > On Wed, 2020-02-19 at 15:46 +0100, Christian Brauner wrote:
> > > On Tue, Feb 18, 2020 at 05:04:55PM +0000, David Howells wrote:
> > > > Here are a set of patches that adds system calls, that (a)
> > > > allow
> > > > information about the VFS, mount topology, superblock and files
> > > > to
> > > > be
> > > > retrieved and (b) allow for notifications of mount topology
> > > > rearrangement
> > > > events, mount and superblock attribute changes and other
> > > > superblock
> > > > events,
> > > > such as errors.
> > > > 
> > > > ============================
> > > > FILESYSTEM INFORMATION QUERY
> > > > ============================
> > > > 
> > > > The first system call, fsinfo(), allows information about the
> > > > filesystem at
> > > > a particular path point to be queried as a set of attributes,
> > > > some
> > > > of which
> > > > may have more than one value.
> > > > 
> > > > Attribute values are of four basic types:
> > > > 
> > > >  (1) Version dependent-length structure (size defined by type).
> > > > 
> > > >  (2) Variable-length string (up to 4096, including NUL).
> > > > 
> > > >  (3) List of structures (up to INT_MAX size).
> > > > 
> > > >  (4) Opaque blob (up to INT_MAX size).
> > > 
> > > I mainly have an organizational question. :) This is a huge
> > > patchset
> > > with lots and lots of (good) features. Wouldn't it make sense to
> > > make
> > > the fsinfo() syscall a completely separate patchset from the
> > > watch_mount() and watch_sb() syscalls? It seems that they don't
> > > need
> > > to
> > > depend on each other at all. This would make reviewing this so
> > > much
> > > nicer and likely would mean that fsinfo() could proceed a little
> > > faster.
> > 
> > The remainder of the fsinfo() series would need to remain useful
> > if this was done.
> > 
> > For context I want work on improving handling of large mount
> > tables.
> 
> Yeah, I've talked to David about this; polling on a large mountinfo
> file
> is not great, I agree.
> 
> > Ultimately I expect to solve a very long standing autofs problem
> > of using large direct mount maps without prohibitive performance
> > overhead (and there a lot of rather challenging autofs changes to
> > do for this too) and I believe the fsinfo() system call, and
> > related bits, is the way to do this.
> > 
> > But improving the handling of large mount tables for autofs
> > will have the side effect of improvements for other mount table
> > users, even in the early stages of this work.
> > 
> > For example I want to use this for mount table handling
> > improvements
> > in libmount. Clearly that ultimately needs mount change
> > notification
> > in the end but ...
> > 
> > There's a bunch of things that need to be done alone the way
> > to even get started.
> > 
> > One thing that's needed is the ability to call fsinfo() to get
> > information on a mount to avoid constant reading of the proc based
> > mount table, which happens a lot (since the mount info. needs
> > to be up to date) so systemd (and others) would see an improvement
> > with the fsinfo() system call alone able to be used in libmount.
> > 
> > But for the fsinfo() system call to be used for this the file
> > system specific mount options need to also be obtained when
> > using fsinfo(). That means the super block operation fsinfo uses
> > to provide this must be implemented for at least most file systems.
> > 
> > So separating out the notifications part, leaving whatever is
> > needed
> > to still be able to do this, should be fine and the system call
> > would be immediately useful once the super operation is implemented
> > for the needed file systems.
> > 
> > Whether the implementation of the super operation should be done
> > as part of this series is another question but would certainly
> > be a challenge and make the series more complicated. But is needed
> > for the change to be useful in my case.
> 
> I think what would might work - and what David had already brought up
> briefly - is to either base the fsinfo branch on top of the mount
> notificaiton branch or break the notification counters pieces into a
> separate patch and base both mount notifications and fsinfo on top of
> it.

Possibly, but I'm pretty sure David has more fsinfo patches.

So I suspect there will be a right time to post patches for the
fsinfo super block operation that David doesn't already have. I'm
going to have to find time for that ...

The post was more to let David know what my first goal is and what
I need for it, and to let others know there is someone wanting to
use this for user space improvements and give some initial insight
into my longer term goals.

Ian
David Howells Feb. 21, 2020, 12:57 p.m. UTC | #8
Stefan Metzmacher <metze@samba.org> wrote:

> > fsinfo() may be called like the following, for example:
> > 
> > 	struct fsinfo_params params = {
> > 		.at_flags	= AT_SYMLINK_NOFOLLOW,
> 
> Shouldn't all new syscalls be able to provide the RESOLVE_ flags
> supported in openat2?

If that's the rule, then fine.  I presume these are a replacement for AT_*.
But the set of RESOLVE_* flags does not appear to be complete - and why's it
not in linux/fs.h if it's meant to be used by everything?

Anyway, it lacks a RESOLVE_NO_AUTOMOUNT flag.  This is not quite the same as
the documented behaviour of RESOLVE_NO_XDEV.

> > 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> > 		     &address, sizeof(address));
> 
> Also passing sizeof(params) would allow future updates of fsinfo_params,
> also similar to openat2(), clone3()...

I can put that at the beginning of the params block or put dirfd in there.  If
I remember rightly, 6-arg syscalls are discouraged because they may need
special handling on some arches.
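
As a rough illustration of a parameter block that carries its own size, the
following userspace sketch models the zero-extension rule used by openat2()
and clone3().  struct example_params and example_get_params() are
hypothetical and are not the proposed fsinfo_params layout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter block with its own size as the first field. */
struct example_params {
	unsigned int	size;		/* sizeof() as known to the caller */
	unsigned int	at_flags;
	unsigned int	request;
};

/*
 * Accept a shorter (older) block and zero-fill the missing fields; accept a
 * longer (newer) block only if the part this side does not understand is
 * all zeroes.  Returns 0 on success or a negative errno value.
 */
static int example_get_params(struct example_params *dst,
			      const void *src, size_t src_size)
{
	const unsigned char *p = src;
	size_t known = sizeof(*dst), i;

	assert(dst && src);
	if (src_size < sizeof(dst->size))
		return -EINVAL;
	for (i = known; i < src_size; i++)
		if (p[i])
			return -E2BIG;
	memset(dst, 0, known);
	memcpy(dst, src, src_size < known ? src_size : known);
	return 0;
}
```

With the size inside the block, the argument count of the syscall stays the
same when the block grows.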

David