[00/17] VFS: Filesystem information and notifications [ver #17]

Message ID: 158230810644.2185128.16726948836367716086.stgit@warthog.procyon.org.uk
Series: VFS: Filesystem information and notifications [ver #17]

Message

David Howells Feb. 21, 2020, 6:01 p.m. UTC
Here is a set of patches that adds system calls which (a) allow
information about the VFS, mount topology, superblock and files to be
retrieved, and (b) allow for notifications of mount topology rearrangement
events, mount and superblock attribute changes and other superblock events,
such as errors.


========================
FILESYSTEM NOTIFICATIONS
========================

The watch_mount() system call places a watch on a point in the mount
topology specified by the dirfd, path and at_flags parameters.  All mount
topology change and mount attribute change notifications in the subtree
rooted at that point can be intercepted by the watch.  Notifications are
delivered through pipes:

	int fd[2];
	pipe2(fd, O_NOTIFICATION_PIPE);			/* create a notification pipe */
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE); /* size its message buffer */
	watch_mount(AT_FDCWD, "/", 0, fd[0], 0x02);	/* 0x02 is the watch ID */

Events include:

 - New mount made
 - Mount unmounted
 - Mount expired
 - R/O state changed
 - Other attribute changed
 - Mount moved from
 - Mount moved to

Using filtering, the notifications may be limited in various ways (single
mount watch vs subtree watch, recursive vs non-recursive changes, to-R/O vs
to-R/W, mount vs submount).
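
For illustration, a watcher might drain events from the pipe like this (a
minimal sketch: struct watch_notification and WATCH_INFO_LENGTH come from
the general notification-queue UAPI, WATCH_TYPE_MOUNT_NOTIFY is assumed
from this series, and handle_mount_event() is a hypothetical handler):

	unsigned char buf[BUF_SIZE], *p, *end;
	ssize_t len;

	/* A read from a notification pipe returns only whole records. */
	len = read(fd[0], buf, sizeof(buf));
	for (p = buf, end = buf + len; p < end; ) {
		struct watch_notification *n = (struct watch_notification *)p;

		if (n->type == WATCH_TYPE_MOUNT_NOTIFY)
			handle_mount_event(n);	/* hypothetical handler */

		/* The length of this record in bytes is encoded in n->info. */
		p += n->info & WATCH_INFO_LENGTH;
	}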

Each mount now has a change counter.  Whenever a mount is changed, this
gets incremented.  It can be queried by fsinfo() using either
FSINFO_ATTR_MOUNT_INFO or FSINFO_ATTR_MOUNT_CHILDREN.  The ID of the mount
on which the notification is generated is placed into the notification
message (triggered_on).  If the event involves a second mount as well, such
as creation of a new mount, that gets returned too (changed_mount).
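
For example, the change counter might be read back like this (a sketch:
FSINFO_FLAGS_QUERY_MOUNT and the fsinfo_mount_info field names follow my
reading of the series and should be treated as illustrative):

	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
		.request	= FSINFO_ATTR_MOUNT_INFO,
	};
	struct fsinfo_mount_info info;
	char mnt_id[16];

	/* With FSINFO_FLAGS_QUERY_MOUNT, the "path" is a mount ID string. */
	snprintf(mnt_id, sizeof(mnt_id), "%u", 21);
	fsinfo(AT_FDCWD, mnt_id, &params, &info, sizeof(info));
	printf("change counter: %u\n", info.change_counter);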


The watch_sb() system call places a watch on the superblock specified by
the dirfd, path and at_flags parameters.  This allows various superblock
events to be monitored for, such as:

 - Transition between R/W and R/O
 - Filesystem errors
 - Quota overrun
 - Network status changes

Each superblock now gets a 64-bit unique superblock identifier and a
notification counter.  The counter is incremented each time one of these
notifications would be generated.  These attributes can be queried using
fsinfo() with FSINFO_ATTR_SB_NOTIFICATIONS.  The identifier is placed into
notification messages.
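
Setting a superblock watch looks much like the mount case (a sketch,
assuming watch_sb() takes the same argument shape as watch_mount()):

	int fd[2];
	pipe2(fd, O_NOTIFICATION_PIPE);
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	watch_sb(AT_FDCWD, "/mnt", 0, fd[0], 0x03);	/* 0x03 is the watch ID */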


============================
FILESYSTEM INFORMATION QUERY
============================

The fsinfo() system call allows information about the filesystem at a
particular path point to be queried as a set of attributes, some of which
may have more than one value.

Attribute values are of four basic types:

 (1) Version dependent-length structure (size defined by type).

 (2) Variable-length string (up to 4096, including NUL).

 (3) List of structures (up to INT_MAX size).

 (4) Opaque blob (up to INT_MAX size).

Attributes can have multiple values, either as a sequence of values or a
sequence-of-sequences of values, and all the values of a particular
attribute must be of the same type.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but all the values of an attribute have to be of the same type.

I've tried to make the interface as light as possible: an integer/enum
attribute selector is used rather than a string, and the core does all the
allocation and extensibility support work rather than leaving that to the
filesystems.
That means that for the first two attribute types, the filesystem will
always see a sufficiently-sized buffer allocated.  Further, this removes
the possibility of the filesystem gaining access to the userspace buffer.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name.
     - Server names and addresses.
     - Filesystem-specific information.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct/element size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  Note that the core VFS keeps a
     table of types and sizes so it can handle future extensibility rather
     than delegating this to the filesystems.

 (2) Version length-dependent structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.
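
To illustrate the size-as-version rule with a hypothetical structure
attribute (all names here are invented for the example):

	/* v1 of the attribute's structure */
	struct fsinfo_example {
		__u32	flags;
		__u32	reserved;
	};

	/* The same attribute as a later kernel would define it: fields are
	 * only ever appended, so the v1 layout survives as a prefix. */
	struct fsinfo_example__v2 {
		__u32	flags;
		__u32	reserved;
		__u64	extra;
	};

A caller passing an 8-byte buffer to the later kernel gets just the first
two fields; a caller passing the larger size to the earlier kernel sees
extra set to 0; and in both cases the return value gives the size of the
data actually available.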

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_CELL_NAME,
	};
	char cell_name[256];
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &cell_name, sizeof(cell_name));

would retrieve the name of an AFS cell as a string.
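
Because the return value always reports the size of the data actually
available, a caller that guessed too small can reallocate and retry (a
minimal sketch, reusing params and path from the examples above):

	size_t size = 256;
	char *buf = malloc(size);
	long len;

	len = fsinfo(AT_FDCWD, path, &params, buf, size);
	if (len > (long)size) {
		/* Too small: len is the full size of the value. */
		size = len;
		buf = realloc(buf, size);
		len = fsinfo(AT_FDCWD, path, &params, buf, size);
	}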

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_PARAMETERS,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.  I would
prefer the returned parameters to be length-prefixed strings, so that there's
no need to insert escaping (especially as any character, including '\', can
be used as the separator in cifs) and so that binary parameters can be
returned (though that is a lesser issue).


Two sample programs are provided, one to query filesystem attributes and
the other to display a mount subtree.  Both of them can be given a path or
a mount ID to start at.  Further, the watch_test sample program now watches
for mount events under "/" and for superblock events on whatever superblock
is backing "/mnt" when the program is started.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core


===================
SIGNIFICANT CHANGES
===================

 ver #17:

 (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.

 (*) Rearranged the order in which fsinfo() does things so that the
     superblock operations table can have a function pointer rather than a
     table pointer.  The ->fsinfo() op is now called at least twice, once
     to determine the size of buffer needed and then to retrieve the data.
     If the retrieval step indicates yet more space is needed, the buffer
     will be expanded and that step repeated.

 (*) Merged the element size into the size in the fsinfo_attribute def and
     stopped setting the size for strings and opaques; a helper works that
     out instead.  This means that strings can actually get larger than 4K.

 (*) A helper is provided to scan a list of attributes and call the
     appropriate get function.  This can be called from a filesystem's
     ->fsinfo() method multiple times.  It also handles attribute
     enumeration and info querying.

 (*) Rearranged the patches to put all the notification patches first.
     This allowed some of the bits to be squashed together.  At some point,
     I'll move the notification patches into a different branch.

 ver #16:

 (*) Split the features bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.

 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Added sample code for Ext4 and NFS.

David
---
David Howells (17):
      watch_queue: Add security hooks to rule on setting mount and sb watches
      watch_queue: Implement mount topology and attribute change notifications
      watch_queue: sample: Display mount tree change notifications
      watch_queue: Introduce a non-repeating system-unique superblock ID
      watch_queue: Add superblock notifications
      watch_queue: sample: Display superblock notifications
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Allow mount information to be queried
      fsinfo: sample: Mount listing program
      fsinfo: Allow the mount topology propagation flags to be retrieved
      fsinfo: Query superblock unique ID and notification counter
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Add example support for Ext4
      fsinfo: Add example support for NFS


 Documentation/filesystems/fsinfo.rst        |  491 ++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    3 
 arch/arm/tools/syscall.tbl                  |    3 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    3 
 arch/m68k/kernel/syscalls/syscall.tbl       |    3 
 arch/microblaze/kernel/syscalls/syscall.tbl |    3 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    3 
 arch/parisc/kernel/syscalls/syscall.tbl     |    3 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    3 
 arch/s390/kernel/syscalls/syscall.tbl       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    3 
 arch/sparc/kernel/syscalls/syscall.tbl      |    3 
 arch/x86/entry/syscalls/syscall_32.tbl      |    3 
 arch/x86/entry/syscalls/syscall_64.tbl      |    3 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    3 
 fs/Kconfig                                  |   28 +
 fs/Makefile                                 |    2 
 fs/afs/internal.h                           |    1 
 fs/afs/super.c                              |  218 +++++++
 fs/d_path.c                                 |    2 
 fs/ext4/Makefile                            |    1 
 fs/ext4/ext4.h                              |    6 
 fs/ext4/fsinfo.c                            |   45 +
 fs/ext4/super.c                             |    3 
 fs/fsinfo.c                                 |  665 +++++++++++++++++++++
 fs/internal.h                               |   12 
 fs/mount.h                                  |   30 +
 fs/mount_notify.c                           |  185 ++++++
 fs/namespace.c                              |  323 ++++++++++
 fs/nfs/Makefile                             |    1 
 fs/nfs/fsinfo.c                             |  230 +++++++
 fs/nfs/internal.h                           |    6 
 fs/nfs/nfs4super.c                          |    3 
 fs/nfs/super.c                              |    3 
 fs/super.c                                  |  156 +++++
 include/linux/dcache.h                      |    1 
 include/linux/fs.h                          |   87 +++
 include/linux/fsinfo.h                      |  110 ++++
 include/linux/lsm_hooks.h                   |   24 +
 include/linux/security.h                    |   16 +
 include/linux/syscalls.h                    |    8 
 include/uapi/asm-generic/unistd.h           |    8 
 include/uapi/linux/fsinfo.h                 |  361 ++++++++++++
 include/uapi/linux/mount.h                  |   10 
 include/uapi/linux/watch_queue.h            |   61 ++
 include/uapi/linux/windows.h                |   35 +
 kernel/sys_ni.c                             |    7 
 samples/vfs/Makefile                        |    7 
 samples/vfs/test-fsinfo.c                   |  847 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  |  243 ++++++++
 samples/watch_queue/watch_test.c            |   76 ++
 security/security.c                         |   14 
 55 files changed, 4365 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/windows.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c

Comments

James Bottomley Feb. 21, 2020, 8:21 p.m. UTC | #1
On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
[...]
> ============================
> FILESYSTEM INFORMATION QUERY
> ============================
> 
> The fsinfo() system call allows information about the filesystem at a
> particular path point to be queried as a set of attributes, some of
> which may have more than one value.
> 
> Attribute values are of four basic types:
> 
>  (1) Version dependent-length structure (size defined by type).
> 
>  (2) Variable-length string (up to 4096, including NUL).
> 
>  (3) List of structures (up to INT_MAX size).
> 
>  (4) Opaque blob (up to INT_MAX size).
> 
> Attributes can have multiple values either as a sequence of values or
> a sequence-of-sequences of values and all the values of a particular
> attribute must be of the same type.
> 
> Note that the values of an attribute *are* allowed to vary between
> dentries within a single superblock, depending on the specific dentry
> that you're looking at, but all the values of an attribute have to be
> of the same type.
> 
> I've tried to make the interface as light as possible, so
> integer/enum attribute selector rather than string and the core does
> all the allocation and extensibility support work rather than leaving
> that to the filesystems. That means that for the first two attribute
> types, the filesystem will always see a sufficiently-sized buffer
> allocated.  Further, this removes the possibility of the filesystem
> gaining access to the userspace buffer.
> 
> 
> fsinfo() allows a variety of information to be retrieved about a
> filesystem and the mount topology:
> 
>  (1) General superblock attributes:
> 
>      - Filesystem identifiers (UUID, volume label, device numbers,
> ...)
>      - The limits on a filesystem's capabilities
>      - Information on supported statx fields and attributes and IOC
> flags.
>      - A variety of single-bit flags indicating supported capabilities.
>      - Timestamp resolution and range.
>      - The amount of space/free space in a filesystem (as statfs()).
>      - Superblock notification counter.
> 
>  (2) Filesystem-specific superblock attributes:
> 
>      - Superblock-level timestamps.
>      - Cell name.
>      - Server names and addresses.
>      - Filesystem-specific information.
> 
>  (3) VFS information:
> 
>      - Mount topology information.
>      - Mount attributes.
>      - Mount notification counter.
> 
>  (4) Information about what the fsinfo() syscall itself supports,
> including
>      the type and struct/element size of attributes.
> 
> The system is extensible:
> 
>  (1) New attributes can be added.  There is no requirement that a
>      filesystem implement every attribute.  Note that the core VFS
> keeps a
>      table of types and sizes so it can handle future extensibility
> rather
>      than delegating this to the filesystems.
> 
>  (2) Version length-dependent structure attributes can be made larger
> and
>      have additional information tacked on the end, provided it keeps
> the
>      layout of the existing fields.  If an older process asks for a
> shorter
>      structure, it will only be given the bits it asks for.  If a
> newer
>      process asks for a longer structure on an older kernel, the
> extra
>      space will be set to 0.  In all cases, the size of the data
> actually
>      available is returned.
> 
>      In essence, the size of a structure is that structure's version:
> a
>      smaller size is an earlier version and a later version includes
>      everything that the earlier version did.
> 
>  (3) New single-bit capability flags can be added.  This is a
> structure-typed
>      attribute and, as such, (2) applies.  Any bits you wanted but
> the kernel
>      doesn't support are automatically set to 0.
> 
> fsinfo() may be called like the following, for example:
> 
> 	struct fsinfo_params params = {
> 		.at_flags	= AT_SYMLINK_NOFOLLOW,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> 		.Nth		= 2,
> 	};
> 	struct fsinfo_server_address address;
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &address, sizeof(address));
> 
> The above example would query an AFS filesystem to retrieve the
> address
> list for the 3rd server, and:
> 
> 	struct fsinfo_params params = {
> 		.at_flags	= AT_SYMLINK_NOFOLLOW,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_CELL_NAME,
> 	};
> 	char cell_name[256];
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &cell_name, sizeof(cell_name));
> 
> would retrieve the name of an AFS cell as a string.
> 
> In future, I want to make fsinfo() capable of querying a context
> created by
> fsopen() or fspick(), e.g.:
> 
> 	fd = fsopen("ext4", 0);
> 	struct fsinfo_params params = {
> 		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
> 		.request	= FSINFO_ATTR_PARAMETERS,
> 	};
> 	char buffer[65536];
> 	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
> 
> even if that context doesn't currently have a superblock attached.  I
> would prefer this to contain length-prefixed strings so that there's
> no need to insert escaping, especially as any character, including
> '\', can be used as the separator in cifs and so that binary
> parameters can be returned (though that is a lesser issue).

Could I make a suggestion about how this should be done in a way that
doesn't actually require the fsinfo syscall at all: it could just be
done with fsconfig.  The idea is based on something I've wanted to do
for configfd but couldn't because otherwise it wouldn't substitute for
fsconfig, but Christian made me think it was actually essential to the
ability of the seccomp and other verifier tools in the critique of
configfd and I believe the same critique applies here.

Instead of making fsconfig functionally configure ... as in you pass
the attribute name, type and parameters down into the fs specific
handler and the handler does a string match and then verifies the
parameters and then acts on them, make it table configured, so what
each fstype does is register a table of attributes which can be got and
optionally set (with each attribute having a get and optional set
function).  We'd have multiple tables per fstype, so the generic VFS
can register a table of attributes it understands for every fstype
(things like name, uuid and the like) and then each fs type would
register a table of fs specific attributes following the same pattern. 
The system would examine the fs specific table before the generic one,
allowing overrides.  fsconfig would have the ability to both get and
set attributes, permitting retrieval as well as setting (which is how I
get rid of the fsinfo syscall), we'd have a global parameter, which
would retrieve the entire table by name and type so the whole thing is
introspectable because the upper layer knows a priori all the
attributes which can be set for a given fs type and what type they are
(so we can make more of the parsing generic).  Any attribute which
doesn't have a set routine would be read only and all attributes would
have to have a get routine meaning everything is queryable.
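
As a sketch, the sort of per-fstype table described above might look like
this (the type and member names are entirely hypothetical):

	struct fs_attribute {
		const char *name;	/* e.g. "uuid" */
		enum fs_attr_type type;	/* string, u32, flag, ... */
		int (*get)(struct fs_context *fc, struct fs_attr_value *val);
		int (*set)(struct fs_context *fc,
			   const struct fs_attr_value *val); /* NULL: read-only */
	};

	/* Each fstype registers its own table; the generic VFS registers
	 * one for every fstype.  Lookup consults the fs-specific table
	 * first, allowing overrides. */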

I think I know how to code this up in a way that would be fully
transparent to the existing syscalls.

James
Miklos Szeredi Feb. 24, 2020, 10:24 a.m. UTC | #2
On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
> [...]
> > [...]
>
> Could I make a suggestion about how this should be done in a way that
> doesn't actually require the fsinfo syscall at all: it could just be
> done with fsconfig.  The idea is based on something I've wanted to do
> for configfd but couldn't because otherwise it wouldn't substitute for
> fsconfig, but Christian made me think it was actually essential to the
> ability of the seccomp and other verifier tools in the critique of
> configfd and I believe the same critique applies here.
>
> Instead of making fsconfig functionally configure ... as in you pass
> the attribute name, type and parameters down into the fs specific
> handler and the handler does a string match and then verifies the
> parameters and then acts on them, make it table configured, so what
> each fstype does is register a table of attributes which can be got and
> optionally set (with each attribute having a get and optional set
> function).  We'd have multiple tables per fstype, so the generic VFS
> can register a table of attributes it understands for every fstype
> (things like name, uuid and the like) and then each fs type would
> register a table of fs specific attributes following the same pattern.
> The system would examine the fs specific table before the generic one,
> allowing overrides.  fsconfig would have the ability to both get and
> set attributes, permitting retrieval as well as setting (which is how I
> get rid of the fsinfo syscall), we'd have a global parameter, which
> would retrieve the entire table by name and type so the whole thing is
> introspectable because the upper layer knows a priori all the
> attributes which can be set for a given fs type and what type they are
> (so we can make more of the parsing generic).  Any attribute which
> doesn't have a set routine would be read only and all attributes would
> have to have a get routine meaning everything is queryable.

And that makes me wonder: would a
"/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible for
this?

Thanks,
Miklos
James Bottomley Feb. 24, 2020, 2:55 p.m. UTC | #3
On Mon, 2020-02-24 at 11:24 +0100, Miklos Szeredi wrote:
> On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
[...]
> > Could I make a suggestion about how this should be done in a way
> > that doesn't actually require the fsinfo syscall at all: it could
> > just be done with fsconfig.  The idea is based on something I've
> > wanted to do for configfd but couldn't because otherwise it
> > wouldn't substitute for fsconfig, but Christian made me think it
> > was actually essential to the ability of the seccomp and other
> > verifier tools in the critique of configfd and I belive the same
> > critique applies here.
> > 
> > Instead of making fsconfig functionally configure ... as in you
> > pass the attribute name, type and parameters down into the fs
> > specific handler and the handler does a string match and then
> > verifies the parameters and then acts on them, make it table
> > configured, so what each fstype does is register a table of
> > attributes which can be got and optionally set (with each attribute
> > having a get and optional set function).  We'd have multiple tables
> > per fstype, so the generic VFS can register a table of attributes
> > it understands for every fstype (things like name, uuid and the
> > like) and then each fs type would register a table of fs specific
> > attributes following the same pattern. The system would examine the
> > fs specific table before the generic one, allowing
> > overrides.  fsconfig would have the ability to both get and
> > set attributes, permitting retrieval as well as setting (which is
> > how I get rid of the fsinfo syscall), we'd have a global parameter,
> > which would retrieve the entire table by name and type so the whole
> > thing is introspectable because the upper layer knows a priori all
> > the attributes which can be set for a given fs type and what type
> > they are (so we can make more of the parsing generic).  Any
> > attribute which doesn't have a set routine would be read only and
> > all attributes would have to have a get routine meaning everything
> > is queryable.
> 
> And that makes me wonder: would a
> "/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible
> for this?

Once it's table driven, certainly a sysfs directory becomes possible. 
The problem with ST_DEV is filesystems like btrfs and xfs that may have
multiple devices.  The current fsinfo takes a fspick'd directory fd so
the input to the query is a path, which gets messy in sysfs, although I
could see something like /sys/class/fs/mount/<path>/$OPTION working.

James
Miklos Szeredi Feb. 24, 2020, 3:28 p.m. UTC | #4
On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> Once it's table driven, certainly a sysfs directory becomes possible.
> The problem with ST_DEV is filesystems like btrfs and xfs that may have
> multiple devices.

For XFS there's always  a single sb->s_dev though, that's what st_dev
will be set to on all files.

Btrfs subvolume is sort of a lightweight superblock, so basically all
such st_dev's are aliases of the same master superblock.  So lookup of
all subvolume st_dev's could result in referencing the same underlying
struct super_block (just like /proc/$PID will reference the same
underlying task group regardless of which of the task group member's
PID is used).

Having this info in sysfs would spare us a number of issues that a set
of new syscalls would bring.  The question is, would that be enough,
or is there a reason that sysfs can't be used to present the various
filesystem related information that fsinfo is supposed to present?

Thanks,
Miklos
Steven Whitehouse Feb. 25, 2020, 12:13 p.m. UTC | #5
Hi,

On 24/02/2020 15:28, Miklos Szeredi wrote:
> On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>
>> Once it's table driven, certainly a sysfs directory becomes possible.
>> The problem with ST_DEV is filesystems like btrfs and xfs that may have
>> multiple devices.
> For XFS there's always  a single sb->s_dev though, that's what st_dev
> will be set to on all files.
>
> Btrfs subvolume is sort of a lightweight superblock, so basically all
> such st_dev's are aliases of the same master superblock.  So lookup of
> all subvolume st_dev's could result in referencing the same underlying
> struct super_block (just like /proc/$PID will reference the same
> underlying task group regardless of which of the task group member's
> PID is used).
>
> Having this info in sysfs would spare us a number of issues that a set
> of new syscalls would bring.  The question is, would that be enough,
> or is there a reason that sysfs can't be used to present the various
> filesystem related information that fsinfo is supposed to present?
>
> Thanks,
> Miklos
>
We need a unique id for superblocks anyway. I had wondered about using 
s_dev some time back, but for the reasons mentioned earlier in this 
thread I think it might just land up being confusing and difficult to 
manage. While fake s_devs are created for sbs that don't have a device, 
I can't help thinking that something closer to ifindex, but for 
superblocks, is needed here. That would avoid the issue of which device 
number to use.

In fact we need that anyway for the notifications, since without that 
there is a race that can lead to missing remounts of the same device, in 
case a umount/mount pair is missed due to an overrun, and then fsinfo 
returns the same device as before, with potentially the same mount 
options too. So I think a unique id for a superblock is a generically 
useful feature, which would also allow for sensible sysfs directory 
naming, if required,

Steve.
James Bottomley Feb. 25, 2020, 3:28 p.m. UTC | #6
On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
> Hi,
> 
> On 24/02/2020 15:28, Miklos Szeredi wrote:
> > On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > > Once it's table driven, certainly a sysfs directory becomes
> > > possible. The problem with ST_DEV is filesystems like btrfs and
> > > xfs that may have multiple devices.
> > 
> > For XFS there's always  a single sb->s_dev though, that's what
> > st_dev will be set to on all files.
> > 
> > Btrfs subvolume is sort of a lightweight superblock, so basically
> > all such st_dev's are aliases of the same master superblock.  So
> > lookup of all subvolume st_dev's could result in referencing the
> > same underlying struct super_block (just like /proc/$PID will
> > reference the same underlying task group regardless of which of the
> > task group member's PID is used).
> > 
> > Having this info in sysfs would spare us a number of issues that a
> > set of new syscalls would bring.  The question is, would that be
> > enough, or is there a reason that sysfs can't be used to present
> > the various filesystem related information that fsinfo is supposed
> > to present?
> > 
> > Thanks,
> > Miklos
> > 
> 
> We need a unique id for superblocks anyway. I had wondered about
> using s_dev some time back, but for the reasons mentioned earlier in
> this thread I think it might just land up being confusing and
> difficult to manage. While fake s_devs are created for sbs that don't
> have a device, I can't help thinking that something closer to
> ifindex, but for superblocks, is needed here. That would avoid the
> issue of which device number to use.
> 
> In fact we need that anyway for the notifications, since without
> that  there is a race that can lead to missing remounts of the same
> device, in  case a umount/mount pair is missed due to an overrun, and
> then fsinfo returns the same device as before, with potentially the
> same mount options too. So I think a unique id for a superblock is a
> generically useful feature, which would also allow for sensible sysfs
> directory naming, if required,

But would this be informative and useful for the user?  I'm sure we can
find a persistent id for a persistent superblock, but what about tmpfs
... that's going to have to change with every reboot.  It's going to be
remarkably inconvenient if I want to get fsinfo on /run to have to keep
finding what the id is.

The other thing a file descriptor does that sysfs doesn't is that it
solves the information leak: if I'm in a mount namespace that has no
access to certain mounts, I can't fspick them and thus I can't see the
information.  By default, with sysfs I can.

James
Steven Whitehouse Feb. 25, 2020, 3:47 p.m. UTC | #7
Hi,

On 25/02/2020 15:28, James Bottomley wrote:
> On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
>> Hi,
>>
>> On 24/02/2020 15:28, Miklos Szeredi wrote:
>>> On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>
>>>> Once it's table driven, certainly a sysfs directory becomes
>>>> possible. The problem with ST_DEV is filesystems like btrfs and
>>>> xfs that may have multiple devices.
>>> For XFS there's always  a single sb->s_dev though, that's what
>>> st_dev will be set to on all files.
>>>
>>> Btrfs subvolume is sort of a lightweight superblock, so basically
>>> all such st_dev's are aliases of the same master superblock.  So
>>> lookup of all subvolume st_dev's could result in referencing the
>>> same underlying struct super_block (just like /proc/$PID will
>>> reference the same underlying task group regardless of which of the
>>> task group member's PID is used).
>>>
>>> Having this info in sysfs would spare us a number of issues that a
>>> set of new syscalls would bring.  The question is, would that be
>>> enough, or is there a reason that sysfs can't be used to present
>>> the various filesystem related information that fsinfo is supposed
>>> to present?
>>>
>>> Thanks,
>>> Miklos
>>>
>> We need a unique id for superblocks anyway. I had wondered about
>> using s_dev some time back, but for the reasons mentioned earlier in
>> this thread I think it might just land up being confusing and
>> difficult to manage. While fake s_devs are created for sbs that don't
>> have a device, I can't help thinking that something closer to
>> ifindex, but for superblocks, is needed here. That would avoid the
>> issue of which device number to use.
>>
>> In fact we need that anyway for the notifications, since without
>> that  there is a race that can lead to missing remounts of the same
>> device, in  case a umount/mount pair is missed due to an overrun, and
>> then fsinfo returns the same device as before, with potentially the
>> same mount options too. So I think a unique id for a superblock is a
>> generically useful feature, which would also allow for sensible sysfs
>> directory naming, if required,
> But would this be informative and useful for the user?  I'm sure we can
> find a persistent id for a persistent superblock, but what about tmpfs
> ... that's going to have to change with every reboot.  It's going to be
> remarkably inconvenient if I want to get fsinfo on /run to have to keep
> finding what the id is.

That is a different question though, or at least it might be... the idea 
of the superblock id is to uniquely identify a particular superblock. 
The mount notification should give you the association between that 
superblock and any devices (assuming those are applicable), or you can 
use fsinfo if you were not listening to the notifications at the time of 
the mount to get the same information.

If someone unmounts /run and remounts it, then the superblock id would 
change, but otherwise it would stay the same, so you know that it is the 
same mount that is being described in future notifications. One of the 
main aims here being to combine the fsinfo information with the 
notifications in a race free manner.

There are a number of ways one might want to specify a filesystem: by 
device, by uuid, by volume label and so forth but we can't use any of 
those very easily as a unique id. Someone might remove a drive and 
replace it with a different one (so same device, but different content) 
or they might have two filesystems with the same uuid if they've just 
done a dd copy to a new device. For the mount notifications we need 
something that doesn't suffer from these issues, but which can also be 
very easily associated with what in most cases are more convenient ways 
to specify a particular filesystem.


>
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.
>
> James
>
Yes, thats true, and I wasn't advocating for the sysfs method over 
fspick here, just pointing out that a unique superblock id would be a 
generically useful thing to have,

Steve.
Miklos Szeredi Feb. 26, 2020, 9:11 a.m. UTC | #8
On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.

That's true, but procfs/sysfs has to deal with various namespacing
issues anyway.  If this is just about hiding a number of entries, then
I don't think that's going to be a big deal.

The syscall API is efficient: single syscall per query instead of
several, no parsing necessary.

However, it is difficult to extend, because the ABI must be updated,
possibly libc and util-linux also, so that scripts can also consume
the new parameter.  With the sysfs approach only the kernel needs to
be updated, and possibly only the filesystem code, not even the VFS.

So I think the question comes down to:  do we need a highly efficient
way to query the superblock parameters all at once, or not?

Thanks,
Miklos
Steven Whitehouse Feb. 26, 2020, 10:51 a.m. UTC | #9
Hi,

On 26/02/2020 09:11, Miklos Szeredi wrote:
> On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>
>> The other thing a file descriptor does that sysfs doesn't is that it
>> solves the information leak: if I'm in a mount namespace that has no
>> access to certain mounts, I can't fspick them and thus I can't see the
>> information.  By default, with sysfs I can.
> That's true, but procfs/sysfs has to deal with various namespacing
> issues anyway.  If this is just about hiding a number of entries, then
> I don't think that's going to be a big deal.
>
> The syscall API is efficient: single syscall per query instead of
> several, no parsing necessary.
>
> However, it is difficult to extend, because the ABI must be updated,
> possibly libc and util-linux also, so that scripts can also consume
> the new parameter.  With the sysfs approach only the kernel needs to
> be updated, and possibly only the filesystem code, not even the VFS.
>
> So I think the question comes down to:  do we need a highly efficient
> way to query the superblock parameters all at once, or not?
>
> Thanks,
> Miklos
>

That is Ian's use case for autofs I think, and it will also be what is 
needed at start up of most applications using the fs notifications, as 
well as at resync time if there has been an overrun leading to lost fs 
notification messages. We do need a solution that can scale to large 
numbers of mounts efficiently. Being able to extend it is also an 
important consideration too, so hopefully David has a solution to that,

Steve.
Ian Kent Feb. 27, 2020, 5:06 a.m. UTC | #10
On Wed, 2020-02-26 at 10:11 +0100, Miklos Szeredi wrote:
> On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> 
> > The other thing a file descriptor does that sysfs doesn't is that
> > it
> > solves the information leak: if I'm in a mount namespace that has
> > no
> > access to certain mounts, I can't fspick them and thus I can't see
> > the
> > information.  By default, with sysfs I can.
> 
> That's true, but procfs/sysfs has to deal with various namespacing
> issues anyway.  If this is just about hiding a number of entries,
> then
> I don't think that's going to be a big deal.

I didn't see name space considerations in sysfs when I was looking at
it recently. Obeying name space requirements is likely a lot of work
in sysfs.

> 
> The syscall API is efficient: single syscall per query instead of
> several, no parsing necessary.
> 
> However, it is difficult to extend, because the ABI must be updated,
> possibly libc and util-linux also, so that scripts can also consume
> the new parameter.  With the sysfs approach only the kernel needs to
> be updated, and possibly only the filesystem code, not even the VFS.
> 
> So I think the question comes down to:  do we need a highly efficient
> way to query the superblock parameters all at once, or not?

Or a similar question could be, how could a sysfs interface work
to provide mount information.

Getting information about all mounts might not be too bad but the
sysfs directory structure that would be needed to represent all
system mounts (without considering name spaces) would likely
result in somewhat busy user space code.

For example, given a path, and the path is all I know, how do I
get mount information?

Ignoring possible multiple mounts on a mount point, call fsinfo()
with the path and get the id (the path walk is low overhead) to
use with fsinfo() to get all the info I need ... done.

Again, ignoring possible multiple mounts on a mount point, and
assuming there is a sysfs tree enumerating all the system mounts,
I could open <sysfs base> + mount point path followed by opening
and reading the individual attribute files ... a bit more busy,
that one ... particularly if I need to do it for several thousand
mounts.

Then there's the code that would need to be added to maintain the
various views in the sysfs tree, which can't be restricted only to
the VFS because there's file system specific info needed too (the
maintain a table idea), and that's before considering name space
handling changes to sysfs.

At the least the question of "do we need a highly efficient way
to query the superblock parameters all at once" needs to be
extended to include mount table enumeration as well as getting
the info.

But this is just me thinking about mount table handling and the
quite significant problem we now have with user space scanning
the proc mount tables to get this information.

Ian
Miklos Szeredi Feb. 27, 2020, 9:36 a.m. UTC | #11
On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:

> At the least the question of "do we need a highly efficient way
> to query the superblock parameters all at once" needs to be
> extended to include mount table enumeration as well as getting
> the info.
>
> But this is just me thinking about mount table handling and the
> quite significant problem we now have with user space scanning
> the proc mount tables to get this information.

Right.

So the problem is that currently autofs needs to rescan the proc mount
table on every change.   The solution to that is to

 - add a notification mechanism
 - and a way to selectively query mount/superblock information

right?

For the notification we have uevents in sysfs, which also supplies the
changed parameters.  Taking aside namespace issues and addressing
mounts would this work for autofs?

Thanks,
Miklos
Ian Kent Feb. 27, 2020, 11:34 a.m. UTC | #12
On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> 
> > At the least the question of "do we need a highly efficient way
> > to query the superblock parameters all at once" needs to be
> > extended to include mount table enumeration as well as getting
> > the info.
> > 
> > But this is just me thinking about mount table handling and the
> > quite significant problem we now have with user space scanning
> > the proc mount tables to get this information.
> 
> Right.
> 
> So the problem is that currently autofs needs to rescan the proc
> mount
> table on every change.   The solution to that is to

Actually no, that's not quite the problem I see.

autofs handles large mount tables fairly well (necessarily) and
in time I plan to remove the need to read the proc tables at all
(that's proven very difficult but I'll get back to that).

This has to be done to resolve the age old problem of autofs not
being able to handle large direct mount maps. But, because of
the large number of mounts associated with large direct mount
maps, other system processes are badly affected too.

So the problem I want to see fixed is the effect of very large
mount tables on other user space applications, particularly the
effect when a large number of mounts or umounts are performed.

Clearly large mount tables not only result from autofs and the
problems caused by them are slightly different to the mount and
umount problem I describe. But they are a problem nevertheless
in the sense that frequent notifications that lead to reading
a large proc mount table have significant overhead that can't be
avoided because the table may have changed since the last time
it was read.

It's easy to cause several system processes to peg a fair number
of CPU's when a large number of mounts/umounts are being performed,
namely systemd, udisks2 and a some others. Also I've seen couple
of application processes badly affected purely by the presence of
a large number of mounts in the proc tables, that's not quite so
bad though.

> 
>  - add a notification mechanism   - lookup a mount based on path
>  - and a way to selectively query mount/superblock information
based on path ...
> 
> right?
> 
> For the notification we have uevents in sysfs, which also supplies
> the
> changed parameters.  Taking aside namespace issues and addressing
> mounts would this work for autofs?

The parameters supplied by the notification mechanism are important.

The place this is needed will be libmount since it catches a broad
number of user space applications, including those I mentioned above
(well at least systemd, I think also udisks2, very probably others).

So that means mount table info needs to be maintained, whether that
can be achieved using sysfs I don't know. Creating and maintaining
the sysfs tree would be a big challenge I think.

But before trying to work out how to use a notification mechanism
just having a way to get the info provided by the proc tables using
a path alone should give initial immediate improvement in libmount.

Ian
Miklos Szeredi Feb. 27, 2020, 1:45 p.m. UTC | #13
On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> >
> > > At the least the question of "do we need a highly efficient way
> > > to query the superblock parameters all at once" needs to be
> > > extended to include mount table enumeration as well as getting
> > > the info.
> > >
> > > But this is just me thinking about mount table handling and the
> > > quite significant problem we now have with user space scanning
> > > the proc mount tables to get this information.
> >
> > Right.
> >
> > So the problem is that currently autofs needs to rescan the proc
> > mount
> > table on every change.   The solution to that is to
>
> Actually no, that's not quite the problem I see.
>
> autofs handles large mount tables fairly well (necessarily) and
> in time I plan to remove the need to read the proc tables at all
> (that's proven very difficult but I'll get back to that).
>
> This has to be done to resolve the age old problem of autofs not
> being able to handle large direct mount maps. But, because of
> the large number of mounts associated with large direct mount
> maps, other system processes are badly affected too.
>
> So the problem I want to see fixed is the effect of very large
> mount tables on other user space applications, particularly the
> effect when a large number of mounts or umounts are performed.
>
> Clearly large mount tables not only result from autofs and the
> problems caused by them are slightly different to the mount and
> umount problem I describe. But they are a problem nevertheless
> in the sense that frequent notifications that lead to reading
> a large proc mount table has significant overhead that can't be
> avoided because the table may have changed since the last time
> it was read.
>
> It's easy to cause several system processes to peg a fair number
> of CPU's when a large number of mounts/umounts are being performed,
> namely systemd, udisks2 and a some others. Also I've seen couple
> of application processes badly affected purely by the presence of
> a large number of mounts in the proc tables, that's not quite so
> bad though.
>
> >
> >  - add a notification mechanism   - lookup a mount based on path
> >  - and a way to selectively query mount/superblock information
> based on path ...
> >
> > right?
> >
> > For the notification we have uevents in sysfs, which also supplies
> > the
> > changed parameters.  Taking aside namespace issues and addressing
> > mounts would this work for autofs?
>
> The parameters supplied by the notification mechanism are important.
>
> The place this is needed will be libmount since it catches a broad
> number of user space applications, including those I mentioned above
> (well at least systemd, I think also udisks2, very probably others).
>
> So that means mount table info. needs to be maintained, whether that
> can be achieved using sysfs I don't know. Creating and maintaining
> the sysfs tree would be a big challenge I think.
>
> But before trying to work out how to use a notification mechanism
> just having a way to get the info provided by the proc tables using
> a path alone should give initial immediate improvement in libmount.

Adding Karel, Lennart, Zbigniew and util-linux@vger...

At a quick glance at libmount and systemd code, it appears that just
switching out the implementation in libmount will not be enough:
systemd is calling functions like mnt_table_parse_*() when it receives
a notification that the mount table changed.

What is the end purpose of parsing the mount tables?  Can systemd guys
comment on that?

Thanks,
Miklos
Karel Zak Feb. 27, 2020, 3:14 p.m. UTC | #14
On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.

Yes, now you have to generate (in kernel) and parse (in
userspace) the whole mount table to get information about just
one mount table entry. This is typical for umount or systemd.

> > >  - add a notification mechanism   - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > based on path ...

For umount-like use-cases we need mountpoint/ to mount entry
conversion; I guess something like open(mountpoint/) + fsinfo() 
should be good enough.
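
For illustration, a minimal sketch of that flow, assuming the fsinfo(2)
calling convention used elsewhere in this thread (there's no glibc
wrapper, so a raw syscall() is used; __NR_fsinfo, struct fsinfo_params
and FSINFO_ATTR_MOUNT_INFO would come from the proposed uapi headers,
and fields other than .request are left at their defaults):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	static int query_one_mount(const char *mountpoint)
	{
		struct fsinfo_params params = {
			.request = FSINFO_ATTR_MOUNT_INFO, /* one entry only */
		};
		char buf[4096];
		ssize_t n;
		int fd;

		fd = open(mountpoint, O_PATH);	/* no read access needed */
		if (fd == -1)
			return -1;

		n = syscall(__NR_fsinfo, fd, NULL, &params, sizeof(params),
			    buf, sizeof(buf));
		close(fd);
		if (n == -1)
			return -1;

		/* buf now holds the attribute for just this mount; no
		 * whole-table generation or parsing involved */
		printf("%zd bytes of mount info for %s\n", n, mountpoint);
		return 0;
	}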

For systemd we need the same, but triggered by notification. The ideal
solution is to get a mount entry ID or FD from the notification and
later use this ID or FD to ask for details about the mount entry
(probably again fsinfo()). The notification has to be usable in an
epoll() set.

This solves 99% of our performance issues I guess.
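
A rough sketch of that notification-driven flow (the fd here is the
notification pipe the watch was attached to; parsing of the packed
notification records is elided, since it depends on the series'
message format):

	#include <unistd.h>
	#include <sys/epoll.h>

	static void watch_mount_events(int notify_fd)
	{
		struct epoll_event ev = { .events = EPOLLIN };
		struct epoll_event out;
		char buf[4096];
		int epfd = epoll_create1(0);

		if (epfd == -1)
			return;
		ev.data.fd = notify_fd;
		epoll_ctl(epfd, EPOLL_CTL_ADD, notify_fd, &ev);

		for (;;) {
			if (epoll_wait(epfd, &out, 1, -1) < 1)
				continue;
			/* drain the packed notification records */
			if (read(out.data.fd, buf, sizeof(buf)) <= 0)
				continue;
			/* extract the triggered-on mount ID from each
			 * record and query just that entry with fsinfo()
			 * instead of rescanning the whole mount table */
		}
	}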

> > So that means mount table info. needs to be maintained, whether that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.

It will still be necessary to get the complete mount table sometimes,
but not in performance-sensitive scenarios.

I'm not sure about sysfs/; you need to somehow resolve namespaces, the
order of the mount entries (which one is the last one), etc. IMHO
translating a mountpoint path to a sysfs/ path will be complicated.

> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it receives
> a notification that the mount table changed.

We're ready to change this stuff in systemd if there is something
better (something per-mount-entry).

My plan is to add a new API to libmount to query information about one
mount entry (but I have had no time to play with fsinfo yet).

> What is the end purpose of parsing the mount tables?  Can systemd guys
> comment on that?

If a mount/umount is triggered by systemd then it needs verification
of success and the final version of the mount options. It also reads
information from libmount to get userspace mount options (e.g.
_netdev -- libmount uses the mount source, target and fsroot to join
the kernel and userspace stuff).

And don't forget that mount units are part of systemd dependencies, so
umount/mount is an important event for systemd and it needs details
about the changes (what, where, etc.).

    Karel
Ian Kent Feb. 28, 2020, 12:12 a.m. UTC | #15
On Thu, 2020-02-27 at 14:45 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
> > On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net>
> > > wrote:
> > > 
> > > > At the least the question of "do we need a highly efficient way
> > > > to query the superblock parameters all at once" needs to be
> > > > extended to include mount table enumeration as well as getting
> > > > the info.
> > > > 
> > > > But this is just me thinking about mount table handling and the
> > > > quite significant problem we now have with user space scanning
> > > > the proc mount tables to get this information.
> > > 
> > > Right.
> > > 
> > > So the problem is that currently autofs needs to rescan the proc
> > > mount
> > > table on every change.   The solution to that is to
> > 
> > Actually no, that's not quite the problem I see.
> > 
> > autofs handles large mount tables fairly well (necessarily) and
> > in time I plan to remove the need to read the proc tables at all
> > (that's proven very difficult but I'll get back to that).
> > 
> > This has to be done to resolve the age old problem of autofs not
> > being able to handle large direct mount maps. But, because of
> > the large number of mounts associated with large direct mount
> > maps, other system processes are badly affected too.
> > 
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.
> > 
> > Clearly large mount tables not only result from autofs and the
> > problems caused by them are slightly different to the mount and
> > umount problem I describe. But they are a problem nevertheless
> > in the sense that frequent notifications that lead to reading
> > a large proc mount table has significant overhead that can't be
> > avoided because the table may have changed since the last time
> > it was read.
> > 
> > It's easy to cause several system processes to peg a fair number
> > of CPUs when a large number of mounts/umounts are being performed,
> > namely systemd, udisks2 and some others. Also I've seen a couple
> > of application processes badly affected purely by the presence of
> > a large number of mounts in the proc tables, though that's not
> > quite so bad.
> > 
> > >  - add a notification mechanism
> > >  - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > >    based on path ...
> > > right?
> > > 
> > > For the notification we have uevents in sysfs, which also
> > > supplies the changed parameters.  Taking aside namespace issues
> > > and addressing mounts would this work for autofs?
> > 
> > The parameters supplied by the notification mechanism are
> > important.
> > 
> > The place this is needed will be libmount since it catches a broad
> > number of user space applications, including those I mentioned
> > above (well at least systemd, I think also udisks2, very probably
> > others).
> > 
> > So that means mount table info. needs to be maintained, whether that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.
> > 
> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it
> receives
> a notification that the mount table changed.

Maybe I wasn't clear, my bad, sorry about that.

There's no question that change notification handling is needed too.

I'm claiming that an initial change to something that can get the
mount information without using the proc tables will, on its own, give
an "initial immediate improvement".

The work needed to implement mount table change notification handling
will take much more time, and exactly what changes that will bring is
not yet clear. I do plan to work on that too, together with Karel.

Ian
Ian Kent Feb. 28, 2020, 12:43 a.m. UTC | #16
On Thu, 2020-02-27 at 16:14 +0100, Karel Zak wrote:
> On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > > So the problem I want to see fixed is the effect of very large
> > > mount tables on other user space applications, particularly the
> > > effect when a large number of mounts or umounts are performed.
> 
> Yes, right now you have to generate (in kernel) and parse (in
> userspace) the whole mount table just to get information about one
> mount table entry. This is typical for umount or systemd.
> 
> > > >  - add a notification mechanism
> > > >  - lookup a mount based on path
> > > >  - and a way to selectively query mount/superblock information
> > > >    based on path ...
> 
> For umount-like use-cases we need mountpoint/ to mount entry
> conversion; I guess something like open(mountpoint/) + fsinfo() 
> should be good enough.
> 
> For systemd we need the same, but triggered by notification. The ideal
> solution is to get a mount entry ID or FD from the notification and
> later use this ID or FD to ask for details about the mount entry
> (probably again fsinfo()). The notification has to be usable in an
> epoll() set.
> 
> This solves 99% of our performance issues I guess.
> 
> > > So that means mount table info. needs to be maintained, whether
> > > that can be achieved using sysfs I don't know. Creating and
> > > maintaining the sysfs tree would be a big challenge I think.
> 
> It will still be necessary to get the complete mount table sometimes,
> but not in performance-sensitive scenarios.

That was my understanding too.

Mount table enumeration is possible with fsinfo(), but you still have
to handle each and every mount, so the improvement there won't be as
great as in cases where the proc mount table would otherwise need to
be scanned just for an individual mount. It will be somewhat more
straightforward without the need to dissect text records, though.

> 
> I'm not sure about sysfs/; you need to somehow resolve namespaces, the
> order of the mount entries (which one is the last one), etc. IMHO
> translating a mountpoint path to a sysfs/ path will be complicated.

I wonder about that too; after all, sysfs contains a tree of nodes
from which the view is created, unlike proc, which translates kernel
information directly based on what the process should see.

We'll need to wait a bit and see what Miklos has in mind for mount
table enumeration, and nothing has been said about namespaces yet.

While fsinfo() is not similar to proc, it does handle namespaces in a
sensible way via file handles, a bit similar to procfs, and ordering
is catered for in the fsinfo() enumeration in a natural way. Not sure
how that would be handled using sysfs ...

Ian
Miklos Szeredi Feb. 28, 2020, 8:35 a.m. UTC | #17
On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:

> > I'm not sure about sysfs/; you need to somehow resolve namespaces, the
> > order of the mount entries (which one is the last one), etc. IMHO
> > translating a mountpoint path to a sysfs/ path will be complicated.
>
> I wonder about that too, after all sysfs contains a tree of nodes
> from which the view is created unlike proc which translates kernel
> information directly based on what the process should see.
>
> We'll need to wait a bit and see what Miklos has in mind for mount
> table enumeration and nothing has been said about name spaces yet.

Adding Greg for sysfs knowledge.

As far as I understand the sysfs model is, basically:

  - list of devices sorted by class and address
  - with each class having a given set of attributes

Superblocks and mounts could get enumerated by a unique identifier.
mnt_id seems to be good for mounts, s_dev may or may not be good for
superblock, but  s_id (as introduced in this patchset) could be used
instead.

As for namespaces, that's "just" an access control issue, AFAICS.
For example a task with a non-initial mount namespace should not have
access to attributes of mounts outside of its namespace.  Checking
access to superblock attributes would be similar: scan the list of
mounts and only allow access if at least one mount would get access.

> While fsinfo() is not similar to proc, it does handle namespaces in a
> sensible way via file handles, a bit similar to procfs, and ordering
> is catered for in the fsinfo() enumeration in a natural way. Not sure
> how that would be handled using sysfs ...

I agree that the access control is much more straightforward with
fsinfo(2) and this may be the single biggest reason to introduce a new
syscall.

Let's see what others think.

Thanks,
Miklos
Greg Kroah-Hartman Feb. 28, 2020, 12:27 p.m. UTC | #18
On Fri, Feb 28, 2020 at 09:35:17AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/; you need to somehow resolve namespaces, the
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translating a mountpoint path to a sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes

Close enough :)

> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.

So what would the sysfs tree look like with this?

> As for namespaces, that's "just" an access control issue, AFAICS.
> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

sysfs does handle namespaces, look at how networking does this.  But,
it's not exactly the simplest thing to do so, so be careful with that as
this is going to be essential for this type of work.

thanks,

greg k-h
James Bottomley Feb. 28, 2020, 3:08 p.m. UTC | #19
On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/; you need to somehow resolve namespaces,
> > > the order of the mount entries (which one is the last one), etc.
> > > IMHO translating a mountpoint path to a sysfs/ path will be
> > > complicated.
> > 
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> > 
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes
> 
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.
> 
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check:  it can be
made so for label-based namespaces like the network, but the mount
namespace is shared/cloned-tree based.  Assessing whether a given
superblock is within your current namespace root can become a large
search exercise.  You can see how much of one in fs/proc_namespace.c,
which controls how /proc/self/mounts appears in your current namespace.

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive as I explained above.  That's really why I
think this is a bad idea.  Sysfs itself is nicely currently restricted
to system information that most containers don't need to know, so a lot
of the sysfs issues with containers can be solved by not mounting it. 
If you suddenly make it required for filesystem information and
notifications, that security measure gets blown out of the water.

> > While fsinfo() is not similar to proc, it does handle namespaces in
> > a sensible way via file handles, a bit similar to procfs, and
> > ordering is catered for in the fsinfo() enumeration in a natural
> > way. Not sure how that would be handled using sysfs ...
> 
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
> 
> Let's see what others thing.

Containers are file-based entities, so file descriptors are their most
natural thing and they have full ACL protection within the container
(can't open the file, can't then get the fd).  The other reason
container people like file descriptors (all the *at() system calls
that have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in an fd the interior of the container
wouldn't have access to.

James
Miklos Szeredi Feb. 28, 2020, 3:40 p.m. UTC | #20
On Fri, Feb 28, 2020 at 4:09 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> Containers are file-based entities, so file descriptors are their most
> natural thing and they have full ACL protection within the container
> (can't open the file, can't then get the fd).  The other reason
> container people like file descriptors (all the *at() system calls
> that have been introduced) is that if we do actually need to break the
> boundaries or privileges of the container, we can do so by getting the
> orchestration system to pass in an fd the interior of the container
> wouldn't have access to.

Yeah, agreed about the simplicity of fd-based access.   Then again, a
filesystem interface would allow immediate access from all scripts,
languages, etc.  That, I think, is a huge bonus compared to the
ioctl-like mess that the current proposal is, which would require
library, utility and language-binding updates on every change.  Ugh.

One way to resolve that is to have the mount information
magic-symlinked from /proc/PID/fdmount/FD directly to the mountinfo
dir, which would then have a link into the sbinfo dir.  With other
access denied to all except sysadmin.

Would that work?

Thanks,
Miklos
Christian Brauner Feb. 28, 2020, 3:52 p.m. UTC | #21
On Tue, Feb 25, 2020 at 07:28:55AM -0800, James Bottomley wrote:
> On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
> > Hi,
> > 
> > On 24/02/2020 15:28, Miklos Szeredi wrote:
> > > On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > 
> > > > Once it's table driven, certainly a sysfs directory becomes
> > > > possible. The problem with ST_DEV is filesystems like btrfs and
> > > > xfs that may have multiple devices.
> > > 
> > > For XFS there's always  a single sb->s_dev though, that's what
> > > st_dev will be set to on all files.
> > > 
> > > Btrfs subvolume is sort of a lightweight superblock, so basically
> > > all such st_dev's are aliases of the same master superblock.  So
> > > lookup of all subvolume st_dev's could result in referencing the
> > > same underlying struct super_block (just like /proc/$PID will
> > > reference the same underlying task group regardless of which of the
> > > task group member's PID is used).
> > > 
> > > Having this info in sysfs would spare us a number of issues that a
> > > set of new syscalls would bring.  The question is, would that be
> > > enough, or is there a reason that sysfs can't be used to present
> > > the various filesystem related information that fsinfo is supposed
> > > to present?
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > 
> > We need a unique id for superblocks anyway. I had wondered about
> > using s_dev some time back, but for the reasons mentioned earlier in
> > this thread I think it might just land up being confusing and
> > difficult to manage. While fake s_devs are created for sbs that don't
> > have a device, I can't help thinking that something closer to
> > ifindex, but for superblocks, is needed here. That would avoid the
> > issue of which device number to use.
> > 
> > In fact we need that anyway for the notifications, since without
> > that  there is a race that can lead to missing remounts of the same
> > device, in  case a umount/mount pair is missed due to an overrun, and
> > then fsinfo returns the same device as before, with potentially the
> > same mount options too. So I think a unique id for a superblock is a
> > generically useful feature, which would also allow for sensible sysfs
> > directory naming, if required,
> 
> But would this be informative and useful for the user?  I'm sure we can
> find a persistent id for a persistent superblock, but what about tmpfs
> ... that's going to have to change with every reboot.  It's going to be
> remarkably inconvenient if I want to get fsinfo on /run to have to keep
> finding what the id is.
> 
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.

Difficult to figure out which part of the thread to reply too. :)

sysfs strikes me as fundamentally misguided for this task.

Init systems or any large-scale daemon will hate parsing things;
that's part of the reason why mountinfo sucks: it means parsing a
potentially enormous file. Exposing information in sysfs would require
parsing again one way or the other. I've been discussing these
bottlenecks with Lennart quite a bit, and reliable and performant
mount notifications without needing to parse stuff are very high on
the issue list. But even if that isn't an issue for some reason, the
namespace aspect is definitely something I'd consider a no-go.
James has been poking at this a little already and I agree. More
specifically, sysfs and proc already are a security nightmare for
namespace-aware workloads and require special care. Not leaking
information in any way is a difficult task. I mean, over the last two
years I sent quite a lot of patches to the network-namespace-aware
part of sysfs alone, either fixing information leaks or making other
parts namespace-aware that weren't and were causing issues (there's
another large-ish series sitting in Dave's tree right now). And tbh,
network namespacing in sysfs is imho trivial compared to what we would
need to do to handle mount namespacing and especially mount
propagation. fsinfo() is a way cleaner and ultimately simpler
approach. We very much want it file-descriptor based. The mount API
opens up the road to secure and _delegatable_ querying of filesystem
information.

Christian
Miklos Szeredi Feb. 28, 2020, 4:24 p.m. UTC | #22
On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> > Superblocks and mounts could get enumerated by a unique identifier.
> > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > superblock, but  s_id (as introduced in this patchset) could be used
> > instead.
>
> So what would the sysfs tree look like with this?

For a start something like this:

mounts/$MOUNT_ID/
  parent -> ../$PARENT_ID
  super -> ../../supers/$SUPER_ID
  root: path from mount root to fs root (could be optional as
        usually they are the same)
  mountpoint -> $MOUNTPOINT
  flags: mount flags
  propagation: mount propagation
  children/$CHILD_ID -> ../../$CHILD_ID

supers/$SUPER_ID/
  type: fstype
  source: mount source (devname)
  options: csv of mount options
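
For illustration only, a userspace reader of such a hypothetical
layout might look like this (the /sys/devices/virtual/mounts path
matches the one suggested later in this thread; none of these files
exist in any kernel):

	#include <stdio.h>

	static int read_mount_attr(unsigned int mnt_id, const char *attr,
				   char *buf, size_t bufsize)
	{
		char path[256];
		size_t n;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/virtual/mounts/%u/%s", mnt_id, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		n = fread(buf, 1, bufsize - 1, f);
		buf[n] = '\0';
		fclose(f);
		return 0;
	}

	/* usage: read_mount_attr(21, "flags", buf, sizeof(buf)) */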

Thanks,
Miklos
David Howells Feb. 28, 2020, 4:36 p.m. UTC | #23
sysfs also has some other disadvantages for this:

 (1) There's a potential chicken-and-egg problem in that you have to create a
     bunch of files and dirs in sysfs for every created mount and superblock
     (possibly excluding special ones like the socket mount) - but this
     includes sysfs itself.  This might work - provided you create sysfs
     first.

 (2) sysfs is memory intensive.  The directory structure has to be backed by
     dentries and inodes that linger as long as the referenced object does
     (procfs is more efficient in this regard for files that aren't being
     accessed).

 (3) It gives people extra, indirect ways to pin mount objects and
     superblocks.

For the moment, fsinfo() gives you three ways of referring to a filesystem
object:

 (a) Directly by path.

 (b) By path associated with an fd.

 (c) By mount ID (perm checked by working back up the tree).

but will need to add:

 (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the superblock
     may not even exist yet.

David
David Howells Feb. 28, 2020, 4:42 p.m. UTC | #24
Miklos Szeredi <miklos@szeredi.hu> wrote:

>   children/$CHILD_ID -> ../../$CHILD_ID

This would really suck.  This bit would particularly affect rescanning time.

You also really want to read the entire child set atomically and, ideally,
include notification counters.

>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

There's a lot more to fsinfo() than just this lot - and there's the
possibility that some of the values may change depending on exactly which file
you're looking at.

David
Al Viro Feb. 28, 2020, 5:15 p.m. UTC | #25
On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as
>         usually they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

Oh, wonderful.  So let me see if I got it right - any namespace operation
can create/destroy/move around an arbitrary amount of sysfs objects.
Better yet, we suddenly have to express the lifetime rules for struct mount
and struct superblock in terms of struct device garbage.

I'm less than thrilled by the entire fsinfo circus, but this really takes
the cake.

In case it needs to be spelled out: NAK.
Miklos Szeredi March 2, 2020, 8:43 a.m. UTC | #26
On Fri, Feb 28, 2020 at 6:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> > On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> >
> > > > Superblocks and mounts could get enumerated by a unique identifier.
> > > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > > superblock, but  s_id (as introduced in this patchset) could be used
> > > > instead.
> > >
> > > So what would the sysfs tree look like with this?
> >
> > For a start something like this:
> >
> > mounts/$MOUNT_ID/
> >   parent -> ../$PARENT_ID
> >   super -> ../../supers/$SUPER_ID
> >   root: path from mount root to fs root (could be optional as
> >         usually they are the same)
> >   mountpoint -> $MOUNTPOINT
> >   flags: mount flags
> >   propagation: mount propagation
> >   children/$CHILD_ID -> ../../$CHILD_ID
> >
> >  supers/$SUPER_ID/
> >    type: fstype
> >    source: mount source (devname)
> >    options: csv of mount options
>
> Oh, wonderful.  So let me see if I got it right - any namespace operation
> can create/destroy/move around an arbitrary amount of sysfs objects.

Parent/children symlinks may be excessive...

> Better yet, we suddenly have to express the lifetime rules for struct mount
> and struct superblock in terms of struct device garbage.

How so?   struct mount and struct superblock would hold a ref on
struct device, not the other way round.

In any case, I'm not insistent on the use of sysfs device classes for
this; struct device (488B) does seem too heavy for struct mount
(328B).

What I'm pretty sure about is that a read(2) based interface would be
way more useful than the syscall multiplexer that the current proposal
is.

Thanks,
Miklos
Miklos Szeredi March 2, 2020, 9:09 a.m. UTC | #27
On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
>
> sysfs also has some other disadvantages for this:
>
>  (1) There's a potential chicken-and-egg problem in that you have to create a
>      bunch of files and dirs in sysfs for every created mount and superblock
>      (possibly excluding special ones like the socket mount) - but this
>      includes sysfs itself.  This might work - provided you create sysfs
>      first.

Sysfs architecture looks something like this (I hope Greg will correct
me if I'm wrong):

device driver -> kobj tree <- sysfs tree

The kobj tree is created by the device driver, and the dentry tree is
created on demand from the kobj tree.   Lifetime of kobjs is bound to
both the sysfs objects and the device but not the other way round.
I.e. device can go away while the sysfs object is still being
referenced, and sysfs can be freely mounted and unmounted
independently of device initialization.

So there's no ordering requirement between sysfs mounts and other
mounts.   I might be wrong on the details, since mounts are created
very early in the boot process...

>
>  (2) sysfs is memory intensive.  The directory structure has to be backed by
>      dentries and inodes that linger as long as the referenced object does
>      (procfs is more efficient in this regard for files that aren't being
>      accessed)

See above: I don't think dentries and inodes are pinned, only kobjs
and their associated cruft.  Which may be too heavy, depending on the
details of the kobj tree.

>  (3) It gives people extra, indirect ways to pin mount objects and
>      superblocks.

See above.

> For the moment, fsinfo() gives you three ways of referring to a filesystem
> object:
>
>  (a) Directly by path.

A path is always representable by an O_PATH descriptor.

>
>  (b) By path associated with an fd.

See my proposal about linking from /proc/$PID/fdmount/$FD ->
/sys/devices/virtual/mounts/$MOUNT_ID.

>
>  (c) By mount ID (perm checked by working back up the tree).

Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
The proc symlink would bypass the lookup check by directly jumping to
the mountinfo dir.

> but will need to add:
>
>  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the superblock
>      may not even exist yet.

Proc symlink would work for that too.

If sysfs is too heavy, this could be proc or a completely new
filesystem.  The implementation is much less relevant at this stage of
the discussion than the interface.

Thanks,
Miklos
Greg Kroah-Hartman March 2, 2020, 9:38 a.m. UTC | #28
On Mon, Mar 02, 2020 at 10:09:51AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
> >
> > sysfs also has some other disadvantages for this:
> >
> >  (1) There's a potential chicken-and-egg problem in that you have to create a
> >      bunch of files and dirs in sysfs for every created mount and superblock
> >      (possibly excluding special ones like the socket mount) - but this
> >      includes sysfs itself.  This might work - provided you create sysfs
> >      first.
> 
> Sysfs architecture looks something like this (I hope Greg will correct
> me if I'm wrong):
> 
> device driver -> kobj tree <- sysfs tree
> 
> The kobj tree is created by the device driver, and the dentry tree is
> created on demand from the kobj tree.   Lifetime of kobjs is bound to
> both the sysfs objects and the device but not the other way round.
> I.e. device can go away while the sysfs object is still being
> referenced, and sysfs can be freely mounted and unmounted
> independently of device initialization.
> 
> So there's no ordering requirement between sysfs mounts and other
> mounts.   I might be wrong on the details, since mounts are created
> very early in the boot process...
> 
> >
> >  (2) sysfs is memory intensive.  The directory structure has to be backed by
> >      dentries and inodes that linger as long as the referenced object does
> >      (procfs is more efficient in this regard for files that aren't being
> >      accessed)
> 
> See above: I don't think dentries and inodes are pinned, only kobjs
> and their associated cruft.  Which may be too heavy, depending on the
> details of the kobj tree.

That is correct, they should not be pinned, that is what kernfs handles
and why we can handle 30k virtual block devices on a 31bit s390 instance
:)

So you shouldn't have to worry about memory for sysfs.

There are loads of other reasons probably not to use sysfs for this
instead :)

thanks,

greg k-h
Karel Zak March 2, 2020, 10:34 a.m. UTC | #29
On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as
>         usually they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

What about use-cases where I have no ID, but I have a mountpoint path
(e.g. "umount /foo")?  In this case I have to use open() + fsinfo(),
and then sysfs does not make sense for me, right?

    Karel
Ian Kent March 3, 2020, 5:27 a.m. UTC | #30
On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com>
> wrote:
> > sysfs also has some other disadvantages for this:
> > 
> >  (1) There's a potential chicken-and-egg problem in that you have
> > to create a
> >      bunch of files and dirs in sysfs for every created mount and
> > superblock
> >      (possibly excluding special ones like the socket mount) - but
> > this
> >      includes sysfs itself.  This might work - provided you create
> > sysfs
> >      first.
> 
> Sysfs architecture looks something like this (I hope Greg will
> correct
> me if I'm wrong):
> 
> device driver -> kobj tree <- sysfs tree
> 
> The kobj tree is created by the device driver, and the dentry tree is
> created on demand from the kobj tree.   Lifetime of kobjs is bound to
> both the sysfs objects and the device but not the other way round.
> I.e. device can go away while the sysfs object is still being
> referenced, and sysfs can be freely mounted and unmounted
> independently of device initialization.
> 
> So there's no ordering requirement between sysfs mounts and other
> mounts.   I might be wrong on the details, since mounts are created
> very early in the boot process...
> 
> >  (2) sysfs is memory intensive.  The directory structure has to be
> > backed by
> >      dentries and inodes that linger as long as the referenced
> > object does
> >      (procfs is more efficient in this regard for files that aren't
> > being
> >      accessed)
> 
> See above: I don't think dentries and inodes are pinned, only kobjs
> and their associated cruft.  Which may be too heavy, depending on the
> details of the kobj tree.
> 
> >  (3) It gives people extra, indirect ways to pin mount objects and
> >      superblocks.
> 
> See above.
> 
> > For the moment, fsinfo() gives you three ways of referring to a
> > filesystem
> > object:
> > 
> >  (a) Directly by path.
> 
> A path is always representable by an O_PATH descriptor.
> 
> >  (b) By path associated with an fd.
> 
> See my proposal about linking from /proc/$PID/fdmount/$FD ->
> /sys/devices/virtual/mounts/$MOUNT_ID.
> 
> >  (c) By mount ID (perm checked by working back up the tree).
> 
> Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
> The proc symlink would bypass the lookup check by directly jumping to
> the mountinfo dir.
> 
> > but will need to add:
> > 
> >  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the
> > superblock
> >      may not even exist yet.
> 
> Proc symlink would work for that too.

There's mount enumeration too; ordering is required to identify the
top (or bottom, depending on terminology) mount when there's more than
one mount on a mount point.

> 
> If sysfs is too heavy, this could be proc or a completely new
> filesystem.  The implementation is much less relevant at this stage
> of
> the discussion than the interface.

Ha, proc with the seq file interface, that's already proved to not
work properly and looks difficult to fix.

Ian
Miklos Szeredi March 3, 2020, 7:46 a.m. UTC | #31
On Tue, Mar 3, 2020 at 6:28 AM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
> > On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com>
> > wrote:
> > > sysfs also has some other disadvantages for this:
> > >
> > >  (1) There's a potential chicken-and-egg problem in that you have
> > > to create a
> > >      bunch of files and dirs in sysfs for every created mount and
> > > superblock
> > >      (possibly excluding special ones like the socket mount) - but
> > > this
> > >      includes sysfs itself.  This might work - provided you create
> > > sysfs
> > >      first.
> >
> > Sysfs architecture looks something like this (I hope Greg will
> > correct
> > me if I'm wrong):
> >
> > device driver -> kobj tree <- sysfs tree
> >
> > The kobj tree is created by the device driver, and the dentry tree is
> > created on demand from the kobj tree.   Lifetime of kobjs is bound to
> > both the sysfs objects and the device but not the other way round.
> > I.e. device can go away while the sysfs object is still being
> > referenced, and sysfs can be freely mounted and unmounted
> > independently of device initialization.
> >
> > So there's no ordering requirement between sysfs mounts and other
> > mounts.   I might be wrong on the details, since mounts are created
> > very early in the boot process...
> >
> > >  (2) sysfs is memory intensive.  The directory structure has to be
> > > backed by
> > >      dentries and inodes that linger as long as the referenced
> > > object does
> > >      (procfs is more efficient in this regard for files that aren't
> > > being
> > >      accessed)
> >
> > See above: I don't think dentries and inodes are pinned, only kobjs
> > and their associated cruft.  Which may be too heavy, depending on the
> > details of the kobj tree.
> >
> > >  (3) It gives people extra, indirect ways to pin mount objects and
> > >      superblocks.
> >
> > See above.
> >
> > > For the moment, fsinfo() gives you three ways of referring to a
> > > filesystem
> > > object:
> > >
> > >  (a) Directly by path.
> >
> > A path is always representable by an O_PATH descriptor.
> >
> > >  (b) By path associated with an fd.
> >
> > See my proposal about linking from /proc/$PID/fdmount/$FD ->
> > /sys/devices/virtual/mounts/$MOUNT_ID.
> >
> > >  (c) By mount ID (perm checked by working back up the tree).
> >
> > Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
> > The proc symlink would bypass the lookup check by directly jumping to
> > the mountinfo dir.
> >
> > > but will need to add:
> > >
> > >  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the
> > > superblock
> > >      may not even exist yet.
> >
> > Proc symlink would work for that too.
>
> There's mounts enumeration too, ordering is required to identify the
> top (or bottom depending on terminology) with more than one mount on
> a mount point.
>
> >
> > If sysfs is too heavy, this could be proc or a completely new
> > filesystem.  The implementation is much less relevant at this stage
> > of
> > the discussion than the interface.
>
> Ha, proc with the seq file interface, that's already proved to not
> work properly and looks difficult to fix.

I'm doing a patch.   Let's see how it fares in the face of all these
preconceptions.

Thanks,
Miklos
David Howells March 3, 2020, 9:12 a.m. UTC | #32
Miklos Szeredi <miklos@szeredi.hu> wrote:

> I'm doing a patch.   Let's see how it fares in the face of all these
> preconceptions.

Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
that scanning /proc/mounts when there are a lot of mounts in the system is
slow (not to mention the global lock that is held during the read).

Now, going with sysfs files on top of procfs links might avoid the global
lock, and you can avoid rereading the options string if you export a change
notification, but you're going to end up injecting a whole lot of pathwalk
latency into the system.

On top of that, it isn't going to help with the case that I'm working towards
implementing where a container manager can monitor for mounts taking place
inside the container and supervise them.  What I'm proposing is that during
the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
referring to the context under construction to the manager, which would then
be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
permit it.  Something like:

	fd = receive_context_to_supervise();
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_SB_OPTIONS,
	};
	fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
	supervise_parameters(buffer);
	fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
	params.request	= FSINFO_ATTR_SB_NOTIFICATIONS;
	struct fsinfo_sb_notifications sbnotify;
	fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
	watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
	close(fd);

However, the supervised mount may be happening in a completely different set
of namespaces, in which case the supervisor presumably wouldn't be able to see
the links in procfs and the relevant portions of sysfs.

David
Miklos Szeredi March 3, 2020, 9:26 a.m. UTC | #33
On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
>
> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> that scanning /proc/mounts when there are a lot of mounts in the system is
> slow (not to mention the global lock that is held during the read).
>
> Now, going with sysfs files on top of procfs links might avoid the global
> lock, and you can avoid rereading the options string if you export a change
> notification, but you're going to end up injecting a whole lot of pathwalk
> latency into the system.

Completely irrelevant.  Cached lookup is so much optimized, that you
won't be able to see any of it.

No, I don't think this is going to be a performance issue at all, but
if anything we could introduce a syscall

  ssize_t readfile(int dfd, const char *path, char *buf,
                   size_t bufsize, int flags);

that is basically the equivalent of open + read + close, or even a
vectored variant that reads multiple files.  But that's off topic
again, since I don't think there's going to be any performance issue
even with plain I/O syscalls.
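
For reference, the sequence such a readfile() would collapse into a
single call looks like this today (plain POSIX, nothing assumed):

	#include <fcntl.h>
	#include <unistd.h>

	static ssize_t readfile_today(int dfd, const char *path,
				      char *buf, size_t bufsize)
	{
		ssize_t n;
		int fd;

		fd = openat(dfd, path, O_RDONLY);
		if (fd == -1)
			return -1;
		n = read(fd, buf, bufsize);	/* one read; fine for
						 * small proc/sysfs-style
						 * files */
		close(fd);
		return n;
	}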

>
> On top of that, it isn't going to help with the case that I'm working towards
> implementing where a container manager can monitor for mounts taking place
> inside the container and supervise them.  What I'm proposing is that during
> the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> referring to the context under construction to the manager, which would then
> be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> permit it.  Something like:
>
>         fd = receive_context_to_supervise();
>         struct fsinfo_params params = {
>                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
>                 .request        = FSINFO_ATTR_SB_OPTIONS,
>         };
>         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
>         supervise_parameters(buffer);
>         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
>         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
>         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
>         params.request  = FSINFO_ATTR_SB_NOTIFICATIONS;
>         struct fsinfo_sb_notifications sbnotify;
>         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
>         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
>         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
>         close(fd);
>
> However, the supervised mount may be happening in a completely different set
> of namespaces, in which case the supervisor presumably wouldn't be able to see
> the links in procfs and the relevant portions of sysfs.

It would be a "jump" link to the otherwise invisible directory.

Thanks,
Miklos
Miklos Szeredi March 3, 2020, 9:48 a.m. UTC | #34
On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > that scanning /proc/mounts when there are a lot of mounts in the system is
> > slow (not to mention the global lock that is held during the read).

BTW, I do feel that there's room for improvement in userspace code as
well.  Even a quite big mount table could be scanned for *changes* very
efficiently.  I.e. cache the previous contents of /proc/self/mountinfo
and compare with the new contents, line-by-line.  Only the
changed/added/removed lines then need to be parsed.
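
A naive sketch of that idea (a linear search stands in for a real hash
set, and splitting the snapshots into line arrays is elided):

	#include <stdbool.h>
	#include <stdio.h>
	#include <string.h>

	static bool line_present(char *const *lines, size_t n,
				 const char *line)
	{
		size_t i;

		for (i = 0; i < n; i++)
			if (strcmp(lines[i], line) == 0)
				return true;
		return false;
	}

	/* Report lines in the new snapshot that weren't in the old one;
	 * call with the arguments swapped to find removed lines. */
	static void report_changed(char *const *old, size_t n_old,
				   char *const *new, size_t n_new)
	{
		size_t i;

		for (i = 0; i < n_new; i++)
			if (!line_present(old, n_old, new[i]))
				printf("re-parse: %s\n", new[i]);
	}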

Also it would be pretty easy to throttle the number of updates so
systemd et al. wouldn't hog the system with unnecessary processing.

Thanks,
Miklos
Christian Brauner March 3, 2020, 10 a.m. UTC | #35
On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > that scanning /proc/mounts when there are a lot of mounts in the system is
> > slow (not to mention the global lock that is held during the read).
> >
> > Now, going with sysfs files on top of procfs links might avoid the global
> > lock, and you can avoid rereading the options string if you export a change
> > notification, but you're going to end up injecting a whole lot of pathwalk
> > latency into the system.
> 
> Completely irrelevant.  Cached lookup is so much optimized, that you
> won't be able to see any of it.
> 
> No, I don't think this is going to be a performance issue at all, but
> if anything we could introduce a syscall
> 
>   ssize_t readfile(int dfd, const char *path, char *buf,
>                    size_t bufsize, int flags);
> 
> that is basically the equivalent of open + read + close, or even a
> vectored variant that reads multiple files.  But that's off topic
> again, since I don't think there's going to be any performance issue
> even with plain I/O syscalls.
> 
> >
> > On top of that, it isn't going to help with the case that I'm working towards
> > implementing where a container manager can monitor for mounts taking place
> > inside the container and supervise them.  What I'm proposing is that during
> > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > referring to the context under construction to the manager, which would then
> > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > permit it.  Something like:
> >
> >         fd = receive_context_to_supervise();
> >         struct fsinfo_params params = {
> >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> >         };
> >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> >         supervise_parameters(buffer);
> >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> >         params.request  = FSINFO_ATTR_SB_NOTIFICATIONS;
> >         struct fsinfo_sb_notifications sbnotify;
> >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> >         close(fd);
> >
> > However, the supervised mount may be happening in a completely different set
> > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > the links in procfs and the relevant portions of sysfs.
> 
> It would be a "jump" link to the otherwise invisible directory.

More magic links to beam you around sounds like a bad idea. We had a
bunch of CVEs around them in containers and they were one of the major
reasons behind us pushing for openat2(). That's why it has a
RESOLVE_NO_MAGICLINKS flag.

Christian
Miklos Szeredi March 3, 2020, 10:13 a.m. UTC | #36
On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> > >
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > preconceptions.
> > >
> > > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > > that scanning /proc/mounts when there are a lot of mounts in the system is
> > > slow (not to mention the global lock that is held during the read).
> > >
> > > Now, going with sysfs files on top of procfs links might avoid the global
> > > lock, and you can avoid rereading the options string if you export a change
> > > notification, but you're going to end up injecting a whole lot of pathwalk
> > > latency into the system.
> >
> > Completely irrelevant.  Cached lookup is so much optimized, that you
> > won't be able to see any of it.
> >
> > No, I don't think this is going to be a performance issue at all, but
> > if anything we could introduce a syscall
> >
> >   ssize_t readfile(int dfd, const char *path, char *buf,
> >                    size_t bufsize, int flags);
> >
> > that is basically the equivalent of open + read + close, or even a
> > vectored variant that reads multiple files.  But that's off topic
> > again, since I don't think there's going to be any performance issue
> > even with plain I/O syscalls.
> >
> > >
> > > On top of that, it isn't going to help with the case that I'm working towards
> > > implementing where a container manager can monitor for mounts taking place
> > > inside the container and supervise them.  What I'm proposing is that during
> > > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > > referring to the context under construction to the manager, which would then
> > > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > > permit it.  Something like:
> > >
> > >         fd = receive_context_to_supervise();
> > >         struct fsinfo_params params = {
> > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> > >         };
> > >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> > >         supervise_parameters(buffer);
> > >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> > >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> > >         params.request  = FSINFO_ATTR_SB_NOTIFICATIONS;
> > >         struct fsinfo_sb_notifications sbnotify;
> > >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> > >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> > >         close(fd);
> > >
> > > However, the supervised mount may be happening in a completely different set
> > > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > > the links in procfs and the relevant portions of sysfs.
> >
> > It would be a "jump" link to the otherwise invisible directory.
>
> More magic links to beam you around sounds like a bad idea. We had a
> bunch of CVEs around them in containers and they were one of the major
> reasons behind us pushing for openat2(). That's why it has a
> RESOLVE_NO_MAGICLINKS flag.

No, that link wouldn't beam you around at all; it would end up in an
internally mounted instance of a mountfs, a safe place where no
dangerous CVEs roam.

Thanks,
Miklos
Steven Whitehouse March 3, 2020, 10:21 a.m. UTC | #37
Hi,

On 03/03/2020 09:48, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
>>> Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>>> I'm doing a patch.   Let's see how it fares in the face of all these
>>>> preconceptions.
>>> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
>>> that scanning /proc/mounts when there are a lot of mounts in the system is
>>> slow (not to mention the global lock that is held during the read).
> BTW, I do feel that there's room for improvement in userspace code as
> well.  Even quite big mount table could be scanned for *changes* very
> efficiently.  l.e. cache previous contents of /proc/self/mountinfo and
> compare with new contents, line-by-line.  Only need to parse the
> changed/added/removed lines.
>
> Also it would be pretty easy to throttle the number of updates so
> systemd et al. wouldn't hog the system with unnecessary processing.
>
> Thanks,
> Miklos
>

At least having patches to compare would allow us to look at the 
performance here and gain some numbers, which would be helpful to frame 
the discussions. However I'm not seeing how it would be easy to throttle 
updates... they occur at whatever rate they are generated and this can 
be fairly high. Also I'm not sure that I follow how the notifications 
and the dumping of the whole table are synchronized in this case, either.

Al has pointed out before that a single mount operation on a subtree
can generate a large number of changes on that subtree. That kind of
scenario will need to be dealt with efficiently so that we don't miss
things; we can also minimize the possibility of overruns, and the
additional overhead on the mount changes themselves, by keeping the
notification messages small.

We should also look at what the likely worst case might be. I seem to 
remember from what Ian has said in the past that there can be tens of 
thousands of autofs mounts on some large systems. I assume that worst 
case might be something like that, but multiplied by however many 
containers might be on a system. Can anybody think of a situation which 
might require even more mounts?

The network subsystem had a similar problem... they use rtnetlink for
the routing information, and just like the proposal here it contains a
dump mechanism and a way to listen to events (add/remove routes) which
is synchronized with that dump. Ian did start looking at netlink some
time ago; it also has some issues (it lives in the network namespace
rather than the fs namespace, and it has accumulated various things
over the years that we don't need for filesystems), but it was part of
the original inspiration for the fs notifications.

There is also, of course, /proc/net/route which can be useful in many 
circumstances, but for efficiency and synchronization reasons if is not 
the interface of choice for routing protocols. David's proposal has a 
number of the important attributes of an rtnetlink-like (in a conceptual 
sense) solution, and I remain skeptical that a sysfs or similar 
interface would be an efficient solution to the original problem, even 
if it might perhaps make a useful addition.

There is also the chicken-and-egg issue, in the sense that if the 
interface is via a filesystem (sysfs, proc or whatever), how does one 
receive a notification for that filesystem itself being mounted until 
after it has been mounted? Maybe that is not a particular problem, but I 
think a cleaner solution would not require a mount in order to watch for 
other mounts.

Steve.
Christian Brauner March 3, 2020, 10:25 a.m. UTC | #38
On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> > > >
> > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > > > that scanning /proc/mounts when there are a lot of mounts in the system is
> > > > slow (not to mention the global lock that is held during the read).
> > > >
> > > > Now, going with sysfs files on top of procfs links might avoid the global
> > > > lock, and you can avoid rereading the options string if you export a change
> > > > notification, but you're going to end up injecting a whole lot of pathwalk
> > > > latency into the system.
> > >
> > > Completely irrelevant.  Cached lookup is so heavily optimized that you
> > > won't be able to see any of it.
> > >
> > > No, I don't think this is going to be a performance issue at all, but
> > > if anything we could introduce a syscall
> > >
> > >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > > bufsize, int flags);
> > >
> > > that is basically the equivalent of open + read + close, or even a
> > > vectored variant that reads multiple files.  But that's off topic
> > > again, since I don't think there's going to be any performance issue
> > > even with plain I/O syscalls.
> > >
> > > >
> > > > On top of that, it isn't going to help with the case that I'm working towards
> > > > implementing where a container manager can monitor for mounts taking place
> > > > inside the container and supervise them.  What I'm proposing is that during
> > > > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > > > referring to the context under construction to the manager, which would then
> > > > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > > > permit it.  Something like:
> > > >
> > > >         fd = receive_context_to_supervise();
> > > >         struct fsinfo_params params = {
> > > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > > >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> > > >         };
> > > >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> > > >         supervise_parameters(buffer);
> > > >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> > > >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> > > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> > > >         struct fsinfo_params params = {
> > > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > > >                 .request        = FSINFO_ATTR_SB_NOTIFICATIONS,
> > > >         };
> > > >         struct fsinfo_sb_notifications sbnotify;
> > > >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> > > >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> > > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> > > >         close(fd);
> > > >
> > > > However, the supervised mount may be happening in a completely different set
> > > > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > > > the links in procfs and the relevant portions of sysfs.
> > >
> > > It would be a "jump" link to the otherwise invisible directory.
> >
> > More magic links to beam you around sounds like a bad idea. We had a
> > bunch of CVEs around them in containers and they were one of the major
> > reasons behind us pushing for openat2(). That's why it has a
> > RESOLVE_NO_MAGICLINKS flag.
> 
> No, that link wouldn't beam you around at all, it would end up in an
> internally mounted instance of a mountfs, a safe place where no

Even if it is a magic link to a safe place, it's a magic link. They
aren't a great solution to this problem. fsinfo() is cleaner and
simpler as it creates a context for a supervised mount, which gives the
managing application fine-grained control and makes it easily
extendable.
Also, we're apparently at the point where it seems we're suggesting
another (pseudo)filesystem to get information about filesystems.

Christian
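
For reference, the RESOLVE_NO_MAGICLINKS flag mentioned above is passed
through openat2(2)'s struct open_how.  A minimal sketch, assuming
5.6-era kernel headers and calling the syscall directly since glibc has
no wrapper:

#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/openat2.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437		/* x86-64 number, per the syscall table */
#endif

/* Open a path while refusing to traverse any magic links. */
static int open_no_magiclinks(int dfd, const char *path)
{
	struct open_how how;

	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY;
	how.resolve = RESOLVE_NO_MAGICLINKS;

	return syscall(SYS_openat2, dfd, path, &how, sizeof(how));
}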
Miklos Szeredi March 3, 2020, 10:32 a.m. UTC | #39
On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> Hi,
>
> On 03/03/2020 09:48, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >>> Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>
> >>>> I'm doing a patch.   Let's see how it fares in the face of all these
> >>>> preconceptions.
> >>> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> >>> that scanning /proc/mounts when there are a lot of mounts in the system is
> >>> slow (not to mention the global lock that is held during the read).
> > BTW, I do feel that there's room for improvement in userspace code as
> > well.  Even a quite big mount table could be scanned for *changes* very
> > efficiently, i.e. cache previous contents of /proc/self/mountinfo and
> > compare with new contents, line-by-line.  Only need to parse the
> > changed/added/removed lines.
> >
> > Also it would be pretty easy to throttle the number of updates so
> > systemd et al. wouldn't hog the system with unnecessary processing.
> >
> > Thanks,
> > Miklos
> >
>
> At least having patches to compare would allow us to look at the
> performance here and gain some numbers, which would be helpful to frame
> the discussions. However I'm not seeing how it would be easy to throttle
> updates... they occur at whatever rate they are generated and this can
> be fairly high. Also I'm not sure that I follow how the notifications
> and the dumping of the whole table are synchronized in this case, either.

What I meant is optimizing current userspace without additional kernel
infrastructure.  Since currently there's only the monolithic
/proc/self/mountinfo, it's reasonable, if the rate of change is very
high, not to re-read this table on every change but only within a
reasonable time limit (e.g. 1s), which still provides timely updates.
Re-reading the table on every change would (does?) slow down the
system so much that the actual updates would arrive even more slowly,
so throttling very much makes sense in this case.

Once we have per-mount information from the kernel, throttling updates
probably does not make sense.

Thanks,
Miklos
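
A rough userspace sketch of the scheme Miklos describes: wait for a
change event on /proc/self/mountinfo (proc signals changes as
POLLERR|POLLPRI), rescan at most once per second, and re-parse only the
lines that differ from a cached copy.  The naive positional compare
below stands in for a real line diff, and the buffer sizes are
arbitrary:

#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static char prev[1 << 20], cur[1 << 20];

/* Walk both copies line by line; only differing lines get re-parsed. */
static void diff_lines(const char *old, const char *new)
{
	while (*old || *new) {
		const char *eo = strchrnul(old, '\n');
		const char *en = strchrnul(new, '\n');

		if (eo - old != en - new || memcmp(old, new, en - new))
			printf("reparse: %.*s\n", (int)(en - new), new);
		old = *eo ? eo + 1 : eo;
		new = *en ? en + 1 : en;
	}
}

int main(void)
{
	int fd = open("/proc/self/mountinfo", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	for (;;) {
		poll(&pfd, 1, -1);	/* block until something changes */
		sleep(1);		/* throttle: coalesce bursts of events */

		ssize_t n = pread(fd, cur, sizeof(cur) - 1, 0);
		if (n < 0)
			break;
		cur[n] = '\0';
		diff_lines(prev, cur);
		memcpy(prev, cur, (size_t)n + 1);
	}
	return 1;
}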
Ian Kent March 3, 2020, 11:09 a.m. UTC | #40
On Tue, 2020-03-03 at 11:32 +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > Hi,
> > 
> > On 03/03/2020 09:48, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu
> > > > wrote:
> > > > On Tue, Mar 3, 2020 at 10:13 AM David Howells <
> > > > dhowells@redhat.com> wrote:
> > > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > 
> > > > > > I'm doing a patch.   Let's see how it fares in the face of
> > > > > > all these
> > > > > > preconceptions.
> > > > > Don't forget the efficiency criterion.  One reason for going
> > > > > with fsinfo(2) is
> > > > > that scanning /proc/mounts when there are a lot of mounts in
> > > > > the system is
> > > > > slow (not to mention the global lock that is held during the
> > > > > read).
> > > BTW, I do feel that there's room for improvement in userspace
> > > code as
> > > well.  Even a quite big mount table could be scanned for *changes*
> > > very
> > > efficiently, i.e. cache previous contents of
> > > /proc/self/mountinfo and
> > > compare with new contents, line-by-line.  Only need to parse the
> > > changed/added/removed lines.
> > > 
> > > Also it would be pretty easy to throttle the number of updates so
> > > systemd et al. wouldn't hog the system with unnecessary
> > > processing.
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > 
> > At least having patches to compare would allow us to look at the
> > performance here and gain some numbers, which would be helpful to
> > frame
> > the discussions. However I'm not seeing how it would be easy to
> > throttle
> > updates... they occur at whatever rate they are generated and this
> > can
> > be fairly high. Also I'm not sure that I follow how the
> > notifications
> > and the dumping of the whole table are synchronized in this case,
> > either.
> 
> What I meant is optimizing current userspace without additional
> kernel
> infrastructure.   Since currently there's only the monolithic
> /proc/self/mountinfo, it's reasonable that if the rate of change is
> very high, then we don't re-read this table on every change, only
> within a reasonable time limit (e.g. 1s) to provide timely updates.
> Re-reading the table on every change would (does?) slow down the
> system so that the actual updates would even be slower, so throttling
> in this case very much  makes sense.

Optimizing user space is a huge task.

For example, consider this (which is related to a recent upstream
discussion I had):
https://blog.janestreet.com/troubleshooting-systemd-with-systemtap/

Working on improving libmount is really useful, but it can't help
with approaches that are inherently inefficient at keeping information
current, which is something that is actually needed at times.

> 
> Once we have per-mount information from the kernel, throttling
> updates
> probably does not make sense.

And it can easily lead to application problems. Throttling will
lead to an inability to have up-to-date information upon which
application decisions are made.

I don't think it's a viable solution to the separate problem
of a large number of notifications either.

Ian
Miklos Szeredi March 3, 2020, 11:33 a.m. UTC | #41
On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:

> > > More magic links to beam you around sounds like a bad idea. We had a
> > > bunch of CVEs around them in containers and they were one of the major
> > > reasons behind us pushing for openat2(). That's why it has a
> > > RESOLVE_NO_MAGICLINKS flag.
> >
> > No, that link wouldn't beam you around at all, it would end up in an
> > internally mounted instance of a mountfs, a safe place where no
>
> Even if it is a magic link to a safe place, it's a magic link. They
> aren't a great solution to this problem. fsinfo() is cleaner and
> simpler as it creates a context for a supervised mount, which gives the
> managing application fine-grained control and makes it easily
> extendable.

Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure,
fsinfo() is way better than ioctl(), but at its core it's still the
same syscall-multiplexer, do-everything hack.

> Also, we're apparently at the point where it seems we're suggesting
> another (pseudo)filesystem to get information about filesystems.

Implementation detail.  Why would you care?

Thanks,
Miklos
Karel Zak March 3, 2020, 11:38 a.m. UTC | #42
On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> No, I don't think this is going to be a performance issue at all, but
> if anything we could introduce a syscall
> 
>   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> bufsize, int flags);

off-topic, but I'll buy you many many beers if you implement it ;-),
because open + read + close is pretty common for /sys and /proc in
many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
etc. is all about it.

    Karel
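
The pattern Karel is referring to is this little dance, repeated per
file and per sampling interval, in tools of that sort (a sketch, with
error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

/* Three syscalls where readfile() would need only one. */
static ssize_t read_small_file(const char *path, char *buf, size_t bufsize)
{
	int fd = open(path, O_RDONLY | O_CLOEXEC);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, bufsize);
	close(fd);
	return n;
}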
Christian Brauner March 3, 2020, 11:56 a.m. UTC | #43
On Tue, Mar 03, 2020 at 12:33:48PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> 
> > > > More magic links to beam you around sounds like a bad idea. We had a
> > > > bunch of CVEs around them in containers and they were one of the major
> > > > reasons behind us pushing for openat2(). That's why it has a
> > > > RESOLVE_NO_MAGICLINKS flag.
> > >
> > > No, that link wouldn't beam you around at all, it would end up in an
> > > internally mounted instance of a mountfs, a safe place where no
> >
> > Even if it is a magic link to a safe place, it's a magic link. They
> > aren't a great solution to this problem. fsinfo() is cleaner and
> > simpler as it creates a context for a supervised mount, which gives the
> > managing application fine-grained control and makes it easily
> > extendable.
> 
> Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure,
> fsinfo() is way better than ioctl(), but at its core it's still the
> same syscall-multiplexer, do-everything hack.

In contrast to a generic ioctl(), it's a domain-specific separate
syscall. You can't suddenly set kvm options through fsinfo(), I would
hope. I find it at least debatable that a new filesystem is preferable.
And - feel free to simply dismiss the concerns I expressed - so far
there has not been a lot of excitement about this idea.

> 
> > Also, we're apparently at the point where it seems we're suggesting
> > another (pseudo)filesystem to get information about filesystems.
> 
> Implementation detail.  Why would you care?

I wouldn't call this an implementation detail. That's quite a big
design choice; it's a separate filesystem. In addition, implementation
details need to be maintained.

Christian
Greg Kroah-Hartman March 3, 2020, 1:03 p.m. UTC | #44
On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > No, I don't think this is going to be a performance issue at all, but
> > if anything we could introduce a syscall
> > 
> >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > bufsize, int flags);
> 
> off-topic, but I'll buy you many many beers if you implement it ;-),
> because open + read + close is pretty common for /sys and /proc in
> many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
> etc. is all about it.

Unlimited beers for a 21-line kernel patch?  Sign me up!

Totally untested, barely compiled patch below.

Actually, I like this idea (the syscall, not just the unlimited beers).
Maybe this could make a lot of sense; I'll write some actual tests for
it now that syscalls are getting "heavy" again due to CPU vendors
finally paying the price for their madness...

thanks,

greg k-h
-------------------


diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..178cd45340e2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+439	common	readfile		__x86_sys_readfile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..1a830fada750 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+		char __user *, buffer, size_t, bufsize, int, flags)
+{
+	int retval;
+	int fd;
+
+	if (force_o_largefile())
+		flags |= O_LARGEFILE;
+
+	fd = do_sys_open(dfd, filename, flags, O_RDONLY);
+	if (fd <= 0)
+		return fd;
+
+	retval = ksys_read(fd, buffer, bufsize);
+
+	__close_fd(current->files, fd);
+
+	return retval;
+}
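
From userspace, the proposed call might be exercised like this; a
sketch assuming the x86-64 number 439 from the patch above, which is
only a proposal, not an allocated number:

#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_readfile 439	/* from the patch above; illustrative only */

static ssize_t readfile(int dfd, const char *path, char *buf,
			size_t bufsize, int flags)
{
	return syscall(__NR_readfile, dfd, path, buf, bufsize, flags);
}

int main(void)
{
	char buf[4096];
	ssize_t n = readfile(AT_FDCWD, "/proc/self/comm",
			     buf, sizeof(buf), 0);

	return n < 0;
}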
Greg Kroah-Hartman March 3, 2020, 1:14 p.m. UTC | #45
On Tue, Mar 03, 2020 at 02:03:47PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > No, I don't think this is going to be a performance issue at all, but
> > > if anything we could introduce a syscall
> > > 
> > >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > > bufsize, int flags);
> > 
> > off-topic, but I'll buy you many many beers if you implement it ;-),
> > because open + read + close is pretty common for /sys and /proc in
> > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
> > etc. is all about it.
> 
> Unlimited beers for a 21-line kernel patch?  Sign me up!
> 
> Totally untested, barely compiled patch below.

Ok, that didn't even build, let me try this for real now...
Miklos Szeredi March 3, 2020, 1:34 p.m. UTC | #46
On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> > Unlimited beers for a 21-line kernel patch?  Sign me up!
> >
> > Totally untested, barely compiled patch below.
>
> Ok, that didn't even build, let me try this for real now...

Some comments on the interface:

O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.

Use the openat2 style arguments; limit the accepted flags to sane ones
(e.g. don't let this syscall create a file).

If buffer is too small to fit the whole file, return error.

Verify that the number of bytes read matches the file size, otherwise
return error (may need to loop?).

Thanks,
Miklos
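
The last two points boil down to the usual short-read rule: read(2) may
return less than requested even well short of EOF, so either the caller
or the syscall itself has to accumulate.  A userspace sketch of such a
loop:

#include <unistd.h>

/* Keep reading until EOF or the buffer is full; short reads are legal. */
static ssize_t read_full(int fd, char *buf, size_t bufsize)
{
	size_t total = 0;

	while (total < bufsize) {
		ssize_t n = read(fd, buf + total, bufsize - total);

		if (n < 0)
			return total ? (ssize_t)total : -1;
		if (n == 0)
			break;			/* EOF */
		total += n;
	}
	return total;
}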
Greg Kroah-Hartman March 3, 2020, 1:43 p.m. UTC | #47
On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > >
> > > Totally untested, barely compiled patch below.
> >
> > Ok, that didn't even build, let me try this for real now...
> 
> Some comments on the interface:

Ok, hey, let's do this proper :)

> O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.

Good point.

> Use the openat2 style arguments; limit the accepted flags to sane ones
> (e.g. don't let this syscall create a file).

Yeah, I just added that check to my local version:
	/* Mask off all O_ flags as we only want to read from the file */
	flags &= ~(VALID_OPEN_FLAGS);
	flags |= O_RDONLY | O_LARGEFILE;

> If buffer is too small to fit the whole file, return error.

Why?  What's wrong with just returning the bytes asked for?  If someone
only wants 5 bytes from the front of a file, it should be fine to give
that to them, right?

> Verify that the number of bytes read matches the file size, otherwise
> return error (may need to loop?).

No, we can't "match file size" as sysfs files do not really have a sane
"size".  So I don't want to loop at all here, one-shot, that's all you
get :)

Let me actually do this and try it out for real.

/me has no idea what he is getting himself into...

thanks,

greg k-h
David Howells March 3, 2020, 2:09 p.m. UTC | #48
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> Actually, I like this idea (the syscall,

It might mesh well with atomic_open in some way.

David
Greg Kroah-Hartman March 3, 2020, 2:10 p.m. UTC | #49
On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > 
> > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > >
> > > > Totally untested, barely compiled patch below.
> > >
> > > Ok, that didn't even build, let me try this for real now...
> > 
> > Some comments on the interface:
> 
> Ok, hey, let's do this proper :)

Alright, how about this patch.

Actually tested with some simple sysfs files.

If people don't strongly object, I'll add "real" tests to it, hook it up
to all arches, write a manpage, and all the fun fluff a new syscall
deserves and submit it "for real".

It feels like I'm doing something wrong in that the actual syscall
logic is just so small.  Maybe I'll benchmark this thing to see if it
makes any real difference...

thanks,

greg k-h

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: [PATCH] readfile: implement readfile syscall

It's a tiny syscall, meant to allow a user to do a single "open this
file, read into this buffer, and close the file" all in a single shot.

Should be good for reading "tiny" files like sysfs, procfs, and other
"small" files.

There is no restarting the syscall, am trying to keep it simple.  At
least for now.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/open.c                              | 21 +++++++++++++++++++++
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  4 +++-
 5 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c17cb77eb150..a79cd025e72b 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
+439	i386	readfile		sys_readfile			__ia32_sys_readfile
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..4f518f4e0e30 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+439	common	readfile		__x64_sys_readfile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..109bad47d542 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1340,3 +1340,24 @@ int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+		char __user *, buffer, size_t, bufsize, int, flags)
+{
+	int retval;
+	int fd;
+
+	/* Mask off all O_ flags as we only want to read from the file */
+	flags &= ~(VALID_OPEN_FLAGS);
+	flags |= O_RDONLY | O_LARGEFILE;
+
+	fd = do_sys_open(dfd, filename, flags, 0000);
+	if (fd <= 0)
+		return fd;
+
+	retval = ksys_read(fd, buffer, bufsize);
+
+	__close_fd(current->files, fd);
+
+	return retval;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..3a636a913437 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1003,6 +1003,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_readfile(int dfd, const char __user *filename,
+			     char __user *buffer, size_t bufsize, int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..31f84500915d 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -855,9 +855,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_openat2, sys_openat2)
 #define __NR_pidfd_getfd 438
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+#define __NR_readfile 439
+__SYSCALL(__NR_readfile, sys_readfile)
 
 #undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440
 
 /*
  * 32 bit systems traditionally used different
Miklos Szeredi March 3, 2020, 2:10 p.m. UTC | #50
On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:

> > If buffer is too small to fit the whole file, return error.
>
> Why?  What's wrong with just returning the bytes asked for?  If someone
> only wants 5 bytes from the front of a file, it should be fine to give
> that to them, right?

I think we need to signal in some way to the caller that the result
was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
caller might be surprised.

>
> > Verify that the number of bytes read matches the file size, otherwise
> > return error (may need to loop?).
>
> No, we can't "match file size" as sysfs files do not really have a sane
> "size".  So I don't want to loop at all here, one-shot, that's all you
> get :)

Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
until read returns zero) might be a good idea regardless, because short
reads are allowed.

Thanks,
Miklos
Jann Horn March 3, 2020, 2:13 p.m. UTC | #51
On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > >
> > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > >
> > > > > Totally untested, barely compiled patch below.
> > > >
> > > > Ok, that didn't even build, let me try this for real now...
> > >
> > > Some comments on the interface:
> >
> > Ok, hey, let's do this proper :)
>
> Alright, how about this patch.
>
> Actually tested with some simple sysfs files.
>
> If people don't strongly object, I'll add "real" tests to it, hook it up
> to all arches, write a manpage, and all the fun fluff a new syscall
> deserves and submit it "for real".

Just FYI, io_uring is moving towards the same kind of thing... IIRC
you can already use it to batch a bunch of open() calls, then batch a
bunch of read() calls on all the new fds and close them at the same
time. And I think they're planning to add support for doing
open()+read()+close() all in one go, too, except that it's a bit
complicated because passing forward the file descriptor in a generic
way is a bit complicated.

> It feels like I'm doing something wrong in that the actual syscall
> logic is just so small.  Maybe I'll benchmark this thing to see if it
> makes any real difference...
>
> thanks,
>
> greg k-h
>
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Subject: [PATCH] readfile: implement readfile syscall
>
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files like sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall, am trying to keep it simple.  At
> least for now.
>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[...]
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +               char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +       int retval;
> +       int fd;
> +
> +       /* Mask off all O_ flags as we only want to read from the file */
> +       flags &= ~(VALID_OPEN_FLAGS);
> +       flags |= O_RDONLY | O_LARGEFILE;
> +
> +       fd = do_sys_open(dfd, filename, flags, 0000);
> +       if (fd <= 0)
> +               return fd;
> +
> +       retval = ksys_read(fd, buffer, bufsize);
> +
> +       __close_fd(current->files, fd);
> +
> +       return retval;
> +}

If you're gonna do something like that, wouldn't you want to also
elide the use of the file descriptor table completely? do_sys_open()
will have to do atomic operations in the fd table and stuff, which is
probably moderately bad in terms of cacheline bouncing if this is used
in a multithreaded context; and as a side effect, the fd would be
inherited by anyone who calls fork() concurrently. You'll probably
want to use APIs like do_filp_open() and filp_close(), or something
like that, instead.
David Howells March 3, 2020, 2:19 p.m. UTC | #52
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> +	fd = do_sys_open(dfd, filename, flags, 0000);
> +	if (fd <= 0)
> +		return fd;
> +
> +	retval = ksys_read(fd, buffer, bufsize);
> +
> +	__close_fd(current->files, fd);

If you can use dentry_open() and vfs_read(), you might be able to avoid
dealing with file descriptors entirely.  That might make it worth a syscall.

You're going to be asked for writefile() you know ;-)

David
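
A rough sketch of the descriptor-free variant Jann and David are
suggesting, built from existing VFS helpers (getname(),
file_open_name(), vfs_read(), filp_close()); it ignores dfd, resolving
relative to the cwd only, and is untested, so take it as a shape rather
than an implementation:

SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
		char __user *, buffer, size_t, bufsize, int, flags)
{
	struct filename *name;
	struct file *file;
	loff_t pos = 0;
	ssize_t ret;

	name = getname(filename);
	if (IS_ERR(name))
		return PTR_ERR(name);

	/* No fd is ever installed, so nothing leaks across fork(). */
	file = file_open_name(name, O_RDONLY | O_LARGEFILE, 0);
	putname(name);
	if (IS_ERR(file))
		return PTR_ERR(file);

	ret = vfs_read(file, buffer, bufsize, &pos);
	filp_close(file, NULL);
	return ret;
}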
Christian Brauner March 3, 2020, 2:23 p.m. UTC | #53
On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > >
> > > Totally untested, barely compiled patch below.
> >
> > Ok, that didn't even build, let me try this for real now...
> 
> Some comments on the interface:
> 
> O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.
> 
> Use the openat2 style arguments; limit the accepted flags to sane ones
> (e.g. don't let this syscall create a file).

If we think this is worth it, might even good to either have it support
struct open_how or have it accept two flag arguments. We sure want
openat2()s RESOLVE_* flags in there.

Christian
Greg Kroah-Hartman March 3, 2020, 2:24 p.m. UTC | #54
On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > >
> > > > > > Totally untested, barely compiled patch below.
> > > > >
> > > > > Ok, that didn't even build, let me try this for real now...
> > > >
> > > > Some comments on the interface:
> > >
> > > Ok, hey, let's do this proper :)
> >
> > Alright, how about this patch.
> >
> > Actually tested with some simple sysfs files.
> >
> > If people don't strongly object, I'll add "real" tests to it, hook it up
> > to all arches, write a manpage, and all the fun fluff a new syscall
> > deserves and submit it "for real".
> 
> Just FYI, io_uring is moving towards the same kind of thing... IIRC
> you can already use it to batch a bunch of open() calls, then batch a
> bunch of read() calls on all the new fds and close them at the same
> time. And I think they're planning to add support for doing
> open()+read()+close() all in one go, too, except that it's a bit
> complicated because passing forward the file descriptor in a generic
> way is a bit complicated.

It is complicated; I wouldn't recommend using io_uring for reading a
bunch of procfs or sysfs files, as that feels like a ton of overkill with
too much setup/teardown to make it worthwhile.

But maybe not, will have to watch and see how it goes.

> > It feels like I'm doing something wrong in that the actual syscall
> > logic is just so small.  Maybe I'll benchmark this thing to see if it
> > makes any real difference...
> >
> > thanks,
> >
> > greg k-h
> >
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Subject: [PATCH] readfile: implement readfile syscall
> >
> > It's a tiny syscall, meant to allow a user to do a single "open this
> > file, read into this buffer, and close the file" all in a single shot.
> >
> > Should be good for reading "tiny" files like sysfs, procfs, and other
> > "small" files.
> >
> > There is no restarting the syscall, am trying to keep it simple.  At
> > least for now.
> >
> > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> [...]
> > +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> > +               char __user *, buffer, size_t, bufsize, int, flags)
> > +{
> > +       int retval;
> > +       int fd;
> > +
> > +       /* Mask off all O_ flags as we only want to read from the file */
> > +       flags &= ~(VALID_OPEN_FLAGS);
> > +       flags |= O_RDONLY | O_LARGEFILE;
> > +
> > +       fd = do_sys_open(dfd, filename, flags, 0000);
> > +       if (fd <= 0)
> > +               return fd;
> > +
> > +       retval = ksys_read(fd, buffer, bufsize);
> > +
> > +       __close_fd(current->files, fd);
> > +
> > +       return retval;
> > +}
> 
> If you're gonna do something like that, wouldn't you want to also
> elide the use of the file descriptor table completely? do_sys_open()
> will have to do atomic operations in the fd table and stuff, which is
> probably moderately bad in terms of cacheline bouncing if this is used
> in a multithreaded context; and as a side effect, the fd would be
> inherited by anyone who calls fork() concurrently. You'll probably
> want to use APIs like do_filp_open() and filp_close(), or something
> like that, instead.

Ah, nice, that does make more sense.  I'll play around with that, and
benchmarking this thing later tonight.  Have to go get some stable
kernels out first...

thanks for the quick review, much appreciated.

greg k-h
Greg Kroah-Hartman March 3, 2020, 2:29 p.m. UTC | #55
On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> 
> > > If buffer is too small to fit the whole file, return error.
> >
> > Why?  What's wrong with just returning the bytes asked for?  If someone
> > only wants 5 bytes from the front of a file, it should be fine to give
> > that to them, right?
> 
> I think we need to signal in some way to the caller that the result
> was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> caller might be surprised.

But that's not the way a "normal" read works.  Short reads are fine, if
the file isn't big enough.  That's how char device nodes work all the
time as well, and this kind of is like that, or some kind of "stream" to
read from.

If you think the file is bigger, then you, as the caller, can just pass
in a bigger buffer if you want to (i.e. you can stat the thing and
determine the size beforehand.)

Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
buffer.  That way userspace "knows" it will always read all of the data
it can from the file; we don't have to do any seeking or determining
real file size, or anything else like that.

We return the number of bytes read as well, so we "know" if we did a
short read; and also, you could infer that if the number of bytes read is
the exact same as the number of bytes in the buffer, maybe the file is
either that exact size, or bigger.

This should be "simple", let's not make it complex if we can help it :)

> > > Verify that the number of bytes read matches the file size, otherwise
> > > return error (may need to loop?).
> >
> > No, we can't "match file size" as sysfs files do not really have a sane
> > "size".  So I don't want to loop at all here, one-shot, that's all you
> > get :)
> 
> Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> until read returns zero) might be a good idea regardless, because short
> reads are allowed.

If you want to loop, then do a userspace open/read-loop/close cycle.
That's not what this syscall should be for.

Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)

thanks,

greg k-h
David Howells March 3, 2020, 2:40 p.m. UTC | #56
Would you permit readfile() to access FIFOs, chardevs and blockdevs?
Certainly allowing use with blockdevs seems reasonable.

David
Jann Horn March 3, 2020, 2:40 p.m. UTC | #57
On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> >
> > > > If buffer is too small to fit the whole file, return error.
> > >
> > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > only wants 5 bytes from the front of a file, it should be fine to give
> > > that to them, right?
> >
> > I think we need to signal in some way to the caller that the result
> > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > caller might be surprised.
>
> But that's not the way a "normal" read works.  Short reads are fine, if
> the file isn't big enough.  That's how char device nodes work all the
> time as well, and this kind of is like that, or some kind of "stream" to
> read from.
>
> If you think the file is bigger, then you, as the caller, can just pass
> in a bigger buffer if you want to (i.e. you can stat the thing and
> determine the size beforehand.)
>
> Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> buffer.  That way userspace "knows" it will always read all of the data
> it can from the file; we don't have to do any seeking or determining
> real file size, or anything else like that.
>
> We return the number of bytes read as well, so we "know" if we did a
> short read; and also, you could infer that if the number of bytes read is
> the exact same as the number of bytes in the buffer, maybe the file is
> either that exact size, or bigger.
>
> This should be "simple", let's not make it complex if we can help it :)
>
> > > > Verify that the number of bytes read matches the file size, otherwise
> > > > return error (may need to loop?).
> > >
> > > No, we can't "match file size" as sysfs files do not really have a sane
> > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > get :)
> >
> > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > until read returns zero) might be a good idea regardless, because short
> > reads are allowed.
>
> If you want to loop, then do a userspace open/read-loop/close cycle.
> That's not what this syscall should be for.
>
> Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)

So how is this supposed to work in e.g. the following case?

========================================
$ cat map_lots_and_read_maps.c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
  for (int i=0; i<1000; i++) {
    mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  }
  int maps = open("/proc/self/maps", O_RDONLY);
  static char buf[0x100000];
  int res;
  do {
    res = read(maps, buf, sizeof(buf));
  } while (res > 0);
}
$ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
$ strace -e trace='!mmap' ./map_lots_and_read_maps
execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"],
0x7ffebd297ac0 /* 51 vars */) = 0
brk(NULL)                               = 0x563a1184f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
mprotect(0x563a103e4000, 4096, PROT_READ) = 0
mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
munmap(0x7fb5c2eb7000, 208479)          = 0
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
read(3, "", 1048576)                    = 0
exit_group(0)                           = ?
+++ exited with 0 +++
$
========================================

The kernel is randomly returning short reads *with different lengths*
that are vaguely around PAGE_SIZE, no matter how big the buffer
supplied by userspace is. And while repeated read() calls will return
consistent state thanks to the seqfile magic, repeated readfile()
calls will probably return garbage with half-complete lines.
Greg Kroah-Hartman March 3, 2020, 3:23 p.m. UTC | #58
On Tue, Mar 03, 2020 at 03:23:51PM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > 
> > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > >
> > > > Totally untested, barely compiled patch below.
> > >
> > > Ok, that didn't even build, let me try this for real now...
> > 
> > Some comments on the interface:
> > 
> > O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.
> > 
> > Use the openat2 style arguments; limit the accepted flags to sane ones
> > (e.g. don't let this syscall create a file).
> 
> If we think this is worth it, might even good to either have it support
> struct open_how or have it accept two flag arguments. We sure want
> openat2()s RESOLVE_* flags in there.

If you look at the patch I posted in this thread, I think it properly
supports open_how and RESOLVE_* flags.  But remember it's opening a file
that is already present, in RO mode, no creation allowed, so most of the
open_how interactions are limited.

thanks,

greg k-h
Jens Axboe March 3, 2020, 3:44 p.m. UTC | #59
On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>>>
>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>
>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>
>>>>>>> Totally untested, barely compiled patch below.
>>>>>>
>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>
>>>>> Some comments on the interface:
>>>>
>>>> Ok, hey, let's do this proper :)
>>>
>>> Alright, how about this patch.
>>>
>>> Actually tested with some simple sysfs files.
>>>
>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>> deserves and submit it "for real".
>>
>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>> you can already use it to batch a bunch of open() calls, then batch a
>> bunch of read() calls on all the new fds and close them at the same
>> time. And I think they're planning to add support for doing
>> open()+read()+close() all in one go, too, except that it's a bit
>> complicated because passing forward the file descriptor in a generic
>> way is a bit complicated.
> 
> It is complicated; I wouldn't recommend using io_uring for reading a
> bunch of procfs or sysfs files, as that feels like a ton of overkill with
> too much setup/teardown to make it worthwhile.
> 
> But maybe not, will have to watch and see how it goes.

It really isn't, and I too think it makes more sense than having a
system call just for the explicit purpose of open/read/close. As Jann
said, you can't currently do a linked sequence of open/read/close,
because the fd passing between them isn't done. But that will come in
the future. If the use case is "a bunch of files", then you could
trivially do "open bunch", "read bunch", "close bunch" in three separate
steps.

Curious what the use case is for this that warrants a special system
call?
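
For context, the three-phase batching Jens describes looks roughly like
this with liburing, assuming a 5.6-era kernel that supports
IORING_OP_OPENAT and IORING_OP_CLOSE (a sketch; open completions are
matched back to their slots via user_data, and error handling is
trimmed):

#include <fcntl.h>
#include <liburing.h>

#define NFILES 3

static int read_bunch(const char *paths[NFILES], char bufs[NFILES][4096])
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int fds[NFILES], i;

	io_uring_queue_init(NFILES, &ring, 0);

	/* phase 1: batch the opens */
	for (i = 0; i < NFILES; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_openat(sqe, AT_FDCWD, paths[i], O_RDONLY, 0);
		io_uring_sqe_set_data(sqe, (void *)(long)i);
	}
	io_uring_submit(&ring);
	for (i = 0; i < NFILES; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		fds[(long)io_uring_cqe_get_data(cqe)] = cqe->res;
		io_uring_cqe_seen(&ring, cqe);
	}

	/* phase 2: batch the reads, each into its own buffer */
	for (i = 0; i < NFILES; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, fds[i], bufs[i], 4096, 0);
	}
	io_uring_submit(&ring);
	for (i = 0; i < NFILES; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}

	/* phase 3: batch the closes */
	for (i = 0; i < NFILES; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_close(sqe, fds[i]);
	}
	io_uring_submit(&ring);
	for (i = 0; i < NFILES; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}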
David Howells March 3, 2020, 3:53 p.m. UTC | #60
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> If you look at the patch I posted in this thread, I think it properly
> supports open_how and RESOLVE_* flags.  But remember it's opening a file
> that is already present, in RO mode, no creation allowed, so most of the
> open_how interactions are limited.

Something we should consider adding to openat2() at some point is the ability
to lock on open/create.  Various network filesystems support it.

David
Greg Kroah-Hartman March 3, 2020, 4:37 p.m. UTC | #61
On Tue, Mar 03, 2020 at 08:44:43AM -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> >> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> >> <gregkh@linuxfoundation.org> wrote:
> >>>
> >>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> >>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> >>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> >>>>> <gregkh@linuxfoundation.org> wrote:
> >>>>>
> >>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
> >>>>>>>
> >>>>>>> Totally untested, barely compiled patch below.
> >>>>>>
> >>>>>> Ok, that didn't even build, let me try this for real now...
> >>>>>
> >>>>> Some comments on the interface:
> >>>>
> >>>> Ok, hey, let's do this proper :)
> >>>
> >>> Alright, how about this patch.
> >>>
> >>> Actually tested with some simple sysfs files.
> >>>
> >>> If people don't strongly object, I'll add "real" tests to it, hook it up
> >>> to all arches, write a manpage, and all the fun fluff a new syscall
> >>> deserves and submit it "for real".
> >>
> >> Just FYI, io_uring is moving towards the same kind of thing... IIRC
> >> you can already use it to batch a bunch of open() calls, then batch a
> >> bunch of read() calls on all the new fds and close them at the same
> >> time. And I think they're planning to add support for doing
> >> open()+read()+close() all in one go, too, except that it's a bit
> >> complicated because passing forward the file descriptor in a generic
> >> way is a bit complicated.
> > 
> > It is complicated; I wouldn't recommend using io_uring for reading a
> > bunch of procfs or sysfs files, as that feels like a ton of overkill with
> > too much setup/teardown to make it worthwhile.
> > 
> > But maybe not, will have to watch and see how it goes.
> 
> It really isn't, and I too think it makes more sense than having a
> system call just for the explicit purpose of open/read/close. As Jann
> said, you can't currently do a linked sequence of open/read/close,
> because the fd passing between them isn't done. But that will come in
> the future. If the use case is "a bunch of files", then you could
> trivially do "open bunch", "read bunch", "close bunch" in three separate
> steps.
> 
> Curious what the use case is for this that warrants a special system
> call?

All of the open/read/close cycles for sysfs and procfs files were the
one reported use case; it seems we have lots of utilities that do that
all the time (top and other monitoring tools).

thanks,

greg k-h
Jeffrey Layton March 3, 2020, 4:51 p.m. UTC | #62
On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> > > On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > 
> > > > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > > > > 
> > > > > > > > Totally untested, barely compiled patch below.
> > > > > > > 
> > > > > > > Ok, that didn't even build, let me try this for real now...
> > > > > > 
> > > > > > Some comments on the interface:
> > > > > 
> > > > > Ok, hey, let's do this proper :)
> > > > 
> > > > Alright, how about this patch.
> > > > 
> > > > Actually tested with some simple sysfs files.
> > > > 
> > > > If people don't strongly object, I'll add "real" tests to it, hook it up
> > > > to all arches, write a manpage, and all the fun fluff a new syscall
> > > > deserves and submit it "for real".
> > > 
> > > Just FYI, io_uring is moving towards the same kind of thing... IIRC
> > > you can already use it to batch a bunch of open() calls, then batch a
> > > bunch of read() calls on all the new fds and close them at the same
> > > time. And I think they're planning to add support for doing
> > > open()+read()+close() all in one go, too, except that it's a bit
> > > complicated because passing forward the file descriptor in a generic
> > > way is a bit complicated.
> > 
> > It is complicated; I wouldn't recommend using io_uring for reading a
> > bunch of procfs or sysfs files, as that feels like a ton of overkill with
> > too much setup/teardown to make it worthwhile.
> > 
> > But maybe not, will have to watch and see how it goes.
> 
> It really isn't, and I too think it makes more sense than having a
> system call just for the explicit purpose of open/read/close. As Jann
> said, you can't currently do a linked sequence of open/read/close,
> because the fd passing between them isn't done. But that will come in
> the future. If the use case is "a bunch of files", then you could
> trivially do "open bunch", "read bunch", "close bunch" in three separate
> steps.
> 
> Curious what the use case is for this that warrants a special system
> call?
> 

Agreed. I'd really rather see something more general-purpose than the
proposed readfile(). At least with NFS and SMB, you can compound
together fairly arbitrary sorts of operations, and it'd be nice to be
able to pattern calls into the kernel for those sorts of uses.

So, NFSv4 has the concept of a current_stateid that is maintained by the
server. So basically you can do all this (e.g.) in a single compound:

open <some filehandle, get a stateid>
write <using that stateid>
close <same stateid>

It'd be nice to be able to do something similar with io_uring. Make it
so that when you do an open, you set the "current fd" inside the
kernel's context, and then you'd be able to issue io_uring requests
that specify a magic "fd" value in order to use it.

That would be a really useful pattern.
Greg Kroah-Hartman March 3, 2020, 4:51 p.m. UTC | #63
On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > >
> > > > > If buffer is too small to fit the whole file, return error.
> > > >
> > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > that to them, right?
> > >
> > > I think we need to signal in some way to the caller that the result
> > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > caller might be surprised.
> >
> > But that's not the way a "normal" read works.  Short reads are fine, if
> > the file isn't big enough.  That's how char device nodes work all the
> > time as well, and this kind of is like that, or some kind of "stream" to
> > read from.
> >
> > If you think the file is bigger, then you, as the caller, can just pass
> > in a bigger buffer if you want to (i.e. you can stat the thing and
> > determine the size beforehand.)
> >
> > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > buffer.  That way userspace "knows" it will always read all of the data
> > it can from the file; we don't have to do any seeking or determining
> > real file size, or anything else like that.
> >
> > We return the number of bytes read as well, so we "know" if we did a
> > short read; and also, you could infer that if the number of bytes read is
> > the exact same as the number of bytes in the buffer, maybe the file is
> > either that exact size, or bigger.
> >
> > This should be "simple", let's not make it complex if we can help it :)
> >
> > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > return error (may need to loop?).
> > > >
> > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > get :)
> > >
> > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > until read returns zero) might be a good idea regardless, because short
> > > reads are allowed.
> >
> > If you want to loop, then do a userspace open/read-loop/close cycle.
> > That's not what this syscall should be for.
> >
> > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> 
> So how is this supposed to work in e.g. the following case?
> 
> ========================================
> $ cat map_lots_and_read_maps.c
> #include <sys/mman.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> int main(void) {
>   for (int i=0; i<1000; i++) {
>     mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   }
>   int maps = open("/proc/self/maps", O_RDONLY);
>   static char buf[0x100000];
>   int res;
>   do {
>     res = read(maps, buf, sizeof(buf));
>   } while (res > 0);
> }
> $ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
> $ strace -e trace='!mmap' ./map_lots_and_read_maps
> execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"],
> 0x7ffebd297ac0 /* 51 vars */) = 0
> brk(NULL)                               = 0x563a1184f000
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
> close(3)                                = 0
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
> mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
> close(3)                                = 0
> arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
> mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
> mprotect(0x563a103e4000, 4096, PROT_READ) = 0
> mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
> munmap(0x7fb5c2eb7000, 208479)          = 0
> openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
> read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
> read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
> read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
> read(3, "", 1048576)                    = 0
> exit_group(0)                           = ?
> +++ exited with 0 +++
> $
> ========================================
> 
> The kernel is randomly returning short reads *with different lengths*
> that are vaguely around PAGE_SIZE, no matter how big the buffer
> supplied by userspace is. And while repeated read() calls will return
> consistent state thanks to the seqfile magic, repeated readfile()
> calls will probably return garbage with half-complete lines.

Ah crap, I forgot about seqfile, I was only considering the "simple"
cases that sysfs provides.

Ok, Miklos, you were totally right, I'll loop and read until the end of
file or buffer, whichever comes first.

thanks,

greg k-h
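
For context, the userspace pattern that readfile() is meant to collapse
into one call is the classic open/read-loop/close; a minimal sketch,
error handling trimmed (read_whole() is an illustrative name, not part
of the proposal):

	#include <fcntl.h>
	#include <unistd.h>

	/* read up to bufsize bytes of path, looping over short reads */
	static ssize_t read_whole(const char *path, char *buf, size_t bufsize)
	{
		ssize_t n = 0, total = 0;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return -1;
		while (total < (ssize_t)bufsize &&
		       (n = read(fd, buf + total, bufsize - total)) > 0)
			total += n;		/* a short read just continues the loop */
		close(fd);
		return n < 0 ? -1 : total;	/* n == 0 means EOF was reached */
	}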
Jens Axboe March 3, 2020, 4:55 p.m. UTC | #64
On 3/3/20 9:51 AM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
>> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
>>> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>>>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>>>> <gregkh@linuxfoundation.org> wrote:
>>>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>
>>>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>>>
>>>>>>>>> Totally untested, barely compiled patch below.
>>>>>>>>
>>>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>>>
>>>>>>> Some comments on the interface:
>>>>>>
>>>>>> Ok, hey, let's do this proper :)
>>>>>
>>>>> Alright, how about this patch.
>>>>>
>>>>> Actually tested with some simple sysfs files.
>>>>>
>>>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>>>> deserves and submit it "for real".
>>>>
>>>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>>>> you can already use it to batch a bunch of open() calls, then batch a
>>>> bunch of read() calls on all the new fds and close them at the same
>>>> time. And I think they're planning to add support for doing
>>>> open()+read()+close() all in one go, too, except that passing forward
>>>> the file descriptor in a generic way is a bit complicated.
>>>
>>> It is complicated, and I wouldn't recommend using io_uring for reading a
>>> bunch of procfs or sysfs files; that feels like a ton of overkill with
>>> too much setup/teardown to make it worthwhile.
>>>
>>> But maybe not, will have to watch and see how it goes.
>>
>> It really isn't, and I too think it makes more sense than having a
>> system call just for the explicit purpose of open/read/close. As Jann
>> said, you can't currently do a linked sequence of open/read/close,
>> because the fd passing between them isn't done. But that will come in
>> the future. If the use case is "a bunch of files", then you could
>> trivially do "open bunch", "read bunch", "close bunch" in three separate
>> steps.
>>
>> Curious what the use case is for this that warrants a special system
>> call?
>>
> 
> Agreed. I'd really rather see something more general-purpose than the
> proposed readfile(). At least with NFS and SMB, you can compound
> together fairly arbitrary sorts of operations, and it'd be nice to be
> able to pattern calls into the kernel for those sorts of uses.
> 
> So, NFSv4 has the concept of a current_stateid that is maintained by the
> server. So basically you can do all this (e.g.) in a single compound:
> 
> open <some filehandle get a stateid>
> write <using that stateid>
> close <same stateid>
> 
> It'd be nice to be able to do something similar with io_uring. Make it
> so that when you do an open, you set the "current fd" inside the
> kernel's context, and then be able to issue io_uring requests that
> specify a magic "fd" value that uses it.
> 
> That would be a really useful pattern.

For io_uring, you can link requests that you submit into a chain. Each
link in the chain is done in sequence. Which means that you could do:

<open some file><read from that file><close that file>

in a single sequence. The only thing that is missing right now is a way
to have the return of that open propagated to the 'fd' of the read and
close, and it's actually one of the topics to discuss at LSFMM next
month.

One approach would be to use BPF to handle this passing, another
suggestion has been to have the read/close specify some magic 'fd' value
that just means "inherit fd from result of previous". The latter sounds
very close to the stateid you mention above, and the upside here is that
it wouldn't explode the necessary toolchain to need to include BPF.

In other words, this is really close to being reality and practically
feasible.
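
To make the shape of that concrete: with liburing, such a chain could
look roughly like the sketch below, assuming an initialized ring and a
buffer buf, and using the prep helpers for the then-new openat/close
opcodes. MAGIC_FD stands in for the not-yet-existing fd inheritance
mechanism discussed above; it is purely hypothetical here:

	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "/proc/self/maps", O_RDONLY, 0);
	sqe->flags |= IOSQE_IO_LINK;		/* link to the next request */

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, MAGIC_FD, buf, sizeof(buf), 0);	/* hypothetical */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_close(sqe, MAGIC_FD);	/* hypothetical */

	io_uring_submit(&ring);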
Jann Horn March 3, 2020, 4:57 p.m. UTC | #65
On Tue, Mar 3, 2020 at 5:51 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > >
> > > > > > If buffer is too small to fit the whole file, return error.
> > > > >
> > > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > > that to them, right?
> > > >
> > > > I think we need to signal in some way to the caller that the result
> > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > > caller might be surprised.
> > >
> > > But that's not the way a "normal" read works.  Short reads are fine, if
> > > the file isn't big enough.  That's how char device nodes work all the
> > > time as well, and this kind of is like that, or some kind of "stream" to
> > > read from.
> > >
> > > If you think the file is bigger, then you, as the caller, can just pass
> > > in a bigger buffer if you want to (i.e. you can stat the thing and
> > > determine the size beforehand.)
> > >
> > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > > buffer.  That way userspace "knows" it will always read all of the data
> > > it can from the file, we don't have to do any seeking or determining
> > > real file size, or anything else like that.
> > >
> > > We return the number of bytes read as well, so we "know" if we did a
> > > short read, and also, you could infer that if the number of bytes read
> > > is the exact same as the number of bytes of the buffer, the file is
> > > either that exact size, or bigger.
> > >
> > > This should be "simple", let's not make it complex if we can help it :)
> > >
> > > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > > return error (may need to loop?).
> > > > >
> > > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > > get :)
> > > >
> > > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > > until read returns zero) might be a good idea regardless, because short
> > > > reads are allowed.
> > >
> > > If you want to loop, then do a userspace open/read-loop/close cycle.
> > > That's not what this syscall should be for.
> > >
> > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> >
> > So how is this supposed to work in e.g. the following case?
[...]
> >   int maps = open("/proc/self/maps", O_RDONLY);
> >   static char buf[0x100000];
> >   int res;
> >   do {
> >     res = read(maps, buf, sizeof(buf));
> >   } while (res > 0);
> > }
[...]
> >
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is. And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
>
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
>
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, whichever comes first.

I wonder what we should do when one of the later reads returns an
error code. As in, we start the first read, get a short read (maybe
because a signal arrived), try a second read, get -EINTR. Do we just
return the error code? That'd probably work fine for most use cases -
e.g. if "top" is reading stuff from procfs, and that gets interrupted
by SIGWINCH or so, it doesn't matter that we've already started the
first read; the only thing "top" really needs to know is that the read
was a short read and it has to retry.
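
Sketching the in-kernel loop makes the question concrete; assuming the
ksys_read()-based implementation from Greg's patch elsewhere in the
thread, the late-error policy is the commented branch (an illustration,
not the actual patch):

	ssize_t total = 0, n;

	while (total < bufsize) {
		n = ksys_read(fd, buffer + total, bufsize - total);
		if (n == 0)
			break;				/* EOF */
		if (n < 0)				/* e.g. -EINTR on a later read */
			return total ? total : n;	/* or always return the error? */
		total += n;
	}
	return total;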
Greg Kroah-Hartman March 3, 2020, 4:59 p.m. UTC | #66
On Tue, Mar 03, 2020 at 02:19:58PM +0000, David Howells wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> 
> > +	fd = do_sys_open(dfd, filename, flags, 0000);
> > +	if (fd <= 0)
> > +		return fd;
> > +
> > +	retval = ksys_read(fd, buffer, bufsize);
> > +
> > +	__close_fd(current->files, fd);
> 
> If you can use dentry_open() and vfs_read(), you might be able to avoid
> dealing with file descriptors entirely.  That might make it worth a syscall.

Will poke at that...

> You're going to be asked for writefile() you know ;-)

Yup, that just got asked on this thread already :)

greg k-h
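
A descriptor-less variant might look roughly like the sketch below,
using filp_open()/vfs_read() rather than dentry_open(), and glossing
over the getname() handling of the userspace filename; names and error
paths are simplified, and this is not an actual follow-up patch:

	struct file *file;
	loff_t pos = 0;
	ssize_t ret;

	file = filp_open(filename, flags | O_RDONLY, 0);
	if (IS_ERR(file))
		return PTR_ERR(file);

	ret = vfs_read(file, buffer, bufsize, &pos);	/* buffer is a user pointer */

	filp_close(file, NULL);
	return ret;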
Jeffrey Layton March 3, 2020, 7:02 p.m. UTC | #67
On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
> On 3/3/20 9:51 AM, Jeff Layton wrote:
> > On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
> > > On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > > > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> > > > > On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > > > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > > > 
> > > > > > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > > > > > > 
> > > > > > > > > > Totally untested, barely compiled patch below.
> > > > > > > > > 
> > > > > > > > > Ok, that didn't even build, let me try this for real now...
> > > > > > > > 
> > > > > > > > Some comments on the interface:
> > > > > > > 
> > > > > > > Ok, hey, let's do this proper :)
> > > > > > 
> > > > > > Alright, how about this patch.
> > > > > > 
> > > > > > Actually tested with some simple sysfs files.
> > > > > > 
> > > > > > If people don't strongly object, I'll add "real" tests to it, hook it up
> > > > > > to all arches, write a manpage, and all the fun fluff a new syscall
> > > > > > deserves and submit it "for real".
> > > > > 
> > > > > Just FYI, io_uring is moving towards the same kind of thing... IIRC
> > > > > you can already use it to batch a bunch of open() calls, then batch a
> > > > > bunch of read() calls on all the new fds and close them at the same
> > > > > time. And I think they're planning to add support for doing
> > > > > open()+read()+close() all in one go, too, except that passing forward
> > > > > the file descriptor in a generic way is a bit complicated.
> > > > 
> > > > It is complicated, and I wouldn't recommend using io_uring for reading a
> > > > bunch of procfs or sysfs files; that feels like a ton of overkill with
> > > > too much setup/teardown to make it worthwhile.
> > > > 
> > > > But maybe not, will have to watch and see how it goes.
> > > 
> > > It really isn't, and I too think it makes more sense than having a
> > > system call just for the explicit purpose of open/read/close. As Jann
> > > said, you can't currently do a linked sequence of open/read/close,
> > > because the fd passing between them isn't done. But that will come in
> > > the future. If the use case is "a bunch of files", then you could
> > > trivially do "open bunch", "read bunch", "close bunch" in three separate
> > > steps.
> > > 
> > > Curious what the use case is for this that warrants a special system
> > > call?
> > > 
> > 
> > Agreed. I'd really rather see something more general-purpose than the
> > proposed readfile(). At least with NFS and SMB, you can compound
> > together fairly arbitrary sorts of operations, and it'd be nice to be
> > able to pattern calls into the kernel for those sorts of uses.
> > 
> > So, NFSv4 has the concept of a current_stateid that is maintained by the
> > server. So basically you can do all this (e.g.) in a single compound:
> > 
> > open <some filehandle get a stateid>
> > write <using that stateid>
> > close <same stateid>
> > 
> > It'd be nice to be able to do something similar with io_uring. Make it
> > so that when you do an open, you set the "current fd" inside the
> > kernel's context, and then be able to issue io_uring requests that
> > specify a magic "fd" value that uses it.
> > 
> > That would be a really useful pattern.
> 
> For io_uring, you can link requests that you submit into a chain. Each
> link in the chain is done in sequence. Which means that you could do:
> 
> <open some file><read from that file><close that file>
> 
> in a single sequence. The only thing that is missing right now is a way
> to have the return of that open propagated to the 'fd' of the read and
> close, and it's actually one of the topics to discuss at LSFMM next
> month.
> 
> One approach would be to use BPF to handle this passing, another
> suggestion has been to have the read/close specify some magic 'fd' value
> that just means "inherit fd from result of previous". The latter sounds
> very close to the stateid you mention above, and the upside here is that
> it wouldn't explode the necessary toolchain to need to include BPF.
> 
> In other words, this is really close to being reality and practically
> feasible.
> 

Excellent.

Yes, the latter is exactly what I had in mind for this. I suspect that
that would cover a large fraction of the potential use-cases for this.

Basically, all you'd need to do is keep a pointer to struct file in the
internal state for the chain. Then, allow userland to specify some magic
fd value for subsequent chained operations that says to use that instead
of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

That would cover the smb or nfs server sort of use cases, I think. For
the sysfs cases, I guess you'd need to dispatch several chains, but that
doesn't sound _too_ onerous.

In fact, with that you should even be able to emulate the proposed
readfile() syscall in a userland library.
Jens Axboe March 3, 2020, 7:07 p.m. UTC | #68
On 3/3/20 12:02 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
>> On 3/3/20 9:51 AM, Jeff Layton wrote:
>>> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
>>>> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>>>>>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>>>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>
>>>>>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>>>>>
>>>>>>>>>>> Totally untested, barely compiled patch below.
>>>>>>>>>>
>>>>>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>>>>>
>>>>>>>>> Some comments on the interface:
>>>>>>>>
>>>>>>>> Ok, hey, let's do this proper :)
>>>>>>>
>>>>>>> Alright, how about this patch.
>>>>>>>
>>>>>>> Actually tested with some simple sysfs files.
>>>>>>>
>>>>>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>>>>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>>>>>> deserves and submit it "for real".
>>>>>>
>>>>>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>>>>>> you can already use it to batch a bunch of open() calls, then batch a
>>>>>> bunch of read() calls on all the new fds and close them at the same
>>>>>> time. And I think they're planning to add support for doing
>>>>>> open()+read()+close() all in one go, too, except that passing forward
>>>>>> the file descriptor in a generic way is a bit complicated.
>>>>>
>>>>> It is complicated, and I wouldn't recommend using io_uring for reading a
>>>>> bunch of procfs or sysfs files; that feels like a ton of overkill with
>>>>> too much setup/teardown to make it worthwhile.
>>>>>
>>>>> But maybe not, will have to watch and see how it goes.
>>>>
>>>> It really isn't, and I too think it makes more sense than having a
>>>> system call just for the explicit purpose of open/read/close. As Jann
>>>> said, you can't currently do a linked sequence of open/read/close,
>>>> because the fd passing between them isn't done. But that will come in
>>>> the future. If the use case is "a bunch of files", then you could
>>>> trivially do "open bunch", "read bunch", "close bunch" in three separate
>>>> steps.
>>>>
>>>> Curious what the use case is for this that warrants a special system
>>>> call?
>>>>
>>>
>>> Agreed. I'd really rather see something more general-purpose than the
>>> proposed readfile(). At least with NFS and SMB, you can compound
>>> together fairly arbitrary sorts of operations, and it'd be nice to be
>>> able to pattern calls into the kernel for those sorts of uses.
>>>
>>> So, NFSv4 has the concept of a current_stateid that is maintained by the
>>> server. So basically you can do all this (e.g.) in a single compound:
>>>
>>> open <some filehandle get a stateid>
>>> write <using that stateid>
>>> close <same stateid>
>>>
>>> It'd be nice to be able to do something similar with io_uring. Make it
>>> so that when you do an open, you set the "current fd" inside the
>>> kernel's context, and then be able to issue io_uring requests that
>>> specify a magic "fd" value that uses it.
>>>
>>> That would be a really useful pattern.
>>
>> For io_uring, you can link requests that you submit into a chain. Each
>> link in the chain is done in sequence. Which means that you could do:
>>
>> <open some file><read from that file><close that file>
>>
>> in a single sequence. The only thing that is missing right now is a way
>> to have the return of that open propagated to the 'fd' of the read and
>> close, and it's actually one of the topics to discuss at LSFMM next
>> month.
>>
>> One approach would be to use BPF to handle this passing, another
>> suggestion has been to have the read/close specify some magic 'fd' value
>> that just means "inherit fd from result of previous". The latter sounds
>> very close to the stateid you mention above, and the upside here is that
>> it wouldn't explode the necessary toolchain to need to include BPF.
>>
>> In other words, this is really close to being reality and practically
>> feasible.
>>
> 
> Excellent.
> 
> Yes, the latter is exactly what I had in mind for this. I suspect that
> that would cover a large fraction of the potential use-cases for this.
> 
> Basically, all you'd need to do is keep a pointer to struct file in the
> internal state for the chain. Then, allow userland to specify some magic
> fd value for subsequent chained operations that says to use that instead
> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

Yeah I think that'd be a suitable way to signal that.

> That would cover the smb or nfs server sort of use cases, I think. For
> the sysfs cases, I guess you'd need to dispatch several chains, but that
> doesn't sound _too_ onerous.

The magic fd would be per-chain, so doing multiple chains wouldn't
really matter at all.

Let me try and hack this up, should be pretty trivial.

> In fact, with that you should even be able to emulate the proposed
> readfile() syscall in a userland library.

Exactly
Jens Axboe March 3, 2020, 7:23 p.m. UTC | #69
On 3/3/20 12:02 PM, Jeff Layton wrote:
> Basically, all you'd need to do is keep a pointer to struct file in the
> internal state for the chain. Then, allow userland to specify some magic
> fd value for subsequent chained operations that says to use that instead
> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

BTW, I think we need two magics here. One that says "result from
previous is fd for next", and one that says "fd from previous is fd for
next". The former allows inheritance from open -> read, the latter from
read -> write.
Jeffrey Layton March 3, 2020, 7:43 p.m. UTC | #70
On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
> On 3/3/20 12:02 PM, Jeff Layton wrote:
> > Basically, all you'd need to do is keep a pointer to struct file in the
> > internal state for the chain. Then, allow userland to specify some magic
> > fd value for subsequent chained operations that says to use that instead
> > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
> 
> BTW, I think we need two magics here. One that says "result from
> previous is fd for next", and one that says "fd from previous is fd for
> next". The former allows inheritance from open -> read, the latter from
> read -> write.
> 

Do we? I suspect that in almost all of the cases, all we'd care about is
the last open. Also if you have unrelated operations in there you still
have to chain the fd through somehow to the next op which is a bit hard
to do with that scheme.

I'd just have a single magic carveout that means "use the result of last
open call done in this chain". If you do a second open (or pipe, or...),
then that would put the old struct file pointer and drop a new one in
there.

If we really do want to enable multiple opens in a single chain though,
then we might want to rethink this and consider some sort of slot table
for storing open fds.
Greg Kroah-Hartman March 3, 2020, 8:15 p.m. UTC | #71
On Tue, Mar 03, 2020 at 05:51:03PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > >
> > > > > > If buffer is too small to fit the whole file, return error.
> > > > >
> > > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > > that to them, right?
> > > >
> > > > I think we need to signal in some way to the caller that the result
> > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > > caller might be surprised.
> > >
> > > But that's not the way a "normal" read works.  Short reads are fine, if
> > > the file isn't big enough.  That's how char device nodes work all the
> > > time as well, and this kind of is like that, or some kind of "stream" to
> > > read from.
> > >
> > > If you think the file is bigger, then you, as the caller, can just pass
> > > in a bigger buffer if you want to (i.e. you can stat the thing and
> > > determine the size beforehand.)
> > >
> > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > > buffer.  That way userspace "knows" it will always read all of the data
> > > it can from the file, we don't have to do any seeking or determining
> > > real file size, or anything else like that.
> > >
> > > We return the number of bytes read as well, so we "know" if we did a
> > > short read, and also, you could infer that if the number of bytes read
> > > is the exact same as the number of bytes of the buffer, the file is
> > > either that exact size, or bigger.
> > >
> > > This should be "simple", let's not make it complex if we can help it :)
> > >
> > > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > > return error (may need to loop?).
> > > > >
> > > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > > get :)
> > > >
> > > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > > until read returns zero) might be a good idea regardless, because short
> > > > reads are allowed.
> > >
> > > If you want to loop, then do a userspace open/read-loop/close cycle.
> > > That's not what this syscall should be for.
> > >
> > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> > 
> > So how is this supposed to work in e.g. the following case?
> > 
> > ========================================
> > $ cat map_lots_and_read_maps.c
> > #include <sys/mman.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> > 
> > int main(void) {
> >   for (int i=0; i<1000; i++) {
> >     mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
> >          MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >   }
> >   int maps = open("/proc/self/maps", O_RDONLY);
> >   static char buf[0x100000];
> >   int res;
> >   do {
> >     res = read(maps, buf, sizeof(buf));
> >   } while (res > 0);
> > }
> > $ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
> > $ strace -e trace='!mmap' ./map_lots_and_read_maps
> > execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"],
> > 0x7ffebd297ac0 /* 51 vars */) = 0
> > brk(NULL)                               = 0x563a1184f000
> > access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> > fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
> > close(3)                                = 0
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> > read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"...,
> > 832) = 832
> > fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
> > mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
> > close(3)                                = 0
> > arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
> > mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
> > mprotect(0x563a103e4000, 4096, PROT_READ) = 0
> > mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
> > munmap(0x7fb5c2eb7000, 208479)          = 0
> > openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
> > read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
> > read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
> > read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
> > read(3, "", 1048576)                    = 0
> > exit_group(0)                           = ?
> > +++ exited with 0 +++
> > $
> > ========================================
> > 
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is. And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
> 
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
> 
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, whichever comes first.

Hm, nope, this works just fine with the single "read" call.  I can read
/proc/self/maps with a single buffer, also larger files like
/sys/kernel/debug/usb/devices work just fine.

So maybe it is all sane without a loop.

I'll try to get rid of the fd now, and despite the interest in io_uring,
this might be a lot more "simple" overall.

thanks,

greg k-h
Jens Axboe March 3, 2020, 8:33 p.m. UTC | #72
On 3/3/20 12:43 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
>> On 3/3/20 12:02 PM, Jeff Layton wrote:
>>> Basically, all you'd need to do is keep a pointer to struct file in the
>>> internal state for the chain. Then, allow userland to specify some magic
>>> fd value for subsequent chained operations that says to use that instead
>>> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
>>
>> BTW, I think we need two magics here. One that says "result from
>> previous is fd for next", and one that says "fd from previous is fd for
>> next". The former allows inheritance from open -> read, the latter from
>> read -> write.
>>
> 
> Do we? I suspect that in almost all of the cases, all we'd care about is
> the last open. Also if you have unrelated operations in there you still
> have to chain the fd through somehow to the next op which is a bit hard
> to do with that scheme.
> 
> I'd just have a single magic carveout that means "use the result of last
> open call done in this chain". If you do a second open (or pipe, or...),
> then that would put the old struct file pointer and drop a new one in
> there.
> 
> If we really do want to enable multiple opens in a single chain though,
> then we might want to rethink this and consider some sort of slot table
> for storing open fds.

I think the one magic can work, you just have to define your chain
appropriately for the case where you have multiple opens. That's true
for the two magic approach as well, of course, I don't want a stack of
open fds, just "last open" should suffice.

I don't like the implicit close, if your op opens an fd, something
should close it again. You pass it back to the application in any case
for io_uring, so the app can just close it. Which means that your chain
should just include a close for whatever fd you open, unless you plan on
using it in the application afterwards.
Jeffrey Layton March 3, 2020, 9:03 p.m. UTC | #73
On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote:
> On 3/3/20 12:43 PM, Jeff Layton wrote:
> > On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
> > > On 3/3/20 12:02 PM, Jeff Layton wrote:
> > > > Basically, all you'd need to do is keep a pointer to struct file in the
> > > > internal state for the chain. Then, allow userland to specify some magic
> > > > fd value for subsequent chained operations that says to use that instead
> > > > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
> > > 
> > > BTW, I think we need two magics here. One that says "result from
> > > previous is fd for next", and one that says "fd from previous is fd for
> > > next". The former allows inheritance from open -> read, the latter from
> > > read -> write.
> > > 
> > 
> > Do we? I suspect that in almost all of the cases, all we'd care about is
> > the last open. Also if you have unrelated operations in there you still
> > have to chain the fd through somehow to the next op which is a bit hard
> > to do with that scheme.
> > 
> > I'd just have a single magic carveout that means "use the result of last
> > open call done in this chain". If you do a second open (or pipe, or...),
> > then that would put the old struct file pointer and drop a new one in
> > there.
> > 
> > If we really do want to enable multiple opens in a single chain though,
> > then we might want to rethink this and consider some sort of slot table
> > for storing open fds.
> 
> I think the one magic can work, you just have to define your chain
> appropriately for the case where you have multiple opens. That's true
> for the two magic approach as well, of course, I don't want a stack of
> open fds, just "last open" should suffice.
> 

Yep.

> I don't like the implicit close, if your op opens an fd, something
> should close it again. You pass it back to the application in any case
> for io_uring, so the app can just close it. Which means that your chain
> should just include a close for whatever fd you open, unless you plan on
> using it in the application afterwards.
> 

Yeah sorry, I didn't word that correctly. Let me try again:

My thinking was that you would still return the result of the open to
userland, but also stash a struct file pointer in the internal chain
representation. Then you just refer to that when you get the "magic" fd.

You'd still need to explicitly close the file though if you didn't want
to use it past the end of the current chain. So, I guess you _do_ need
the actual fd to properly close the file in that case.

On another note, what happens if you do open+write+close and the write
fails? Does the close still happen, or would you have to issue one
separately after getting the result?
Jens Axboe March 3, 2020, 9:20 p.m. UTC | #74
On 3/3/20 2:03 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote:
>> On 3/3/20 12:43 PM, Jeff Layton wrote:
>>> On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
>>>> On 3/3/20 12:02 PM, Jeff Layton wrote:
>>>>> Basically, all you'd need to do is keep a pointer to struct file in the
>>>>> internal state for the chain. Then, allow userland to specify some magic
>>>>> fd value for subsequent chained operations that says to use that instead
>>>>> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
>>>>
>>>> BTW, I think we need two magics here. One that says "result from
>>>> previous is fd for next", and one that says "fd from previous is fd for
>>>> next". The former allows inheritance from open -> read, the latter from
>>>> read -> write.
>>>>
>>>
>>> Do we? I suspect that in almost all of the cases, all we'd care about is
>>> the last open. Also if you have unrelated operations in there you still
>>> have to chain the fd through somehow to the next op which is a bit hard
>>> to do with that scheme.
>>>
>>> I'd just have a single magic carveout that means "use the result of last
>>> open call done in this chain". If you do a second open (or pipe, or...),
>>> then that would put the old struct file pointer and drop a new one in
>>> there.
>>>
>>> If we really do want to enable multiple opens in a single chain though,
>>> then we might want to rethink this and consider some sort of slot table
>>> for storing open fds.
>>
>> I think the one magic can work, you just have to define your chain
>> appropriately for the case where you have multiple opens. That's true
>> for the two magic approach as well, of course, I don't want a stack of
>> open fds, just "last open" should suffice.
>>
> 
> Yep.
> 
>> I don't like the implicit close, if your op opens an fd, something
>> should close it again. You pass it back to the application in any case
>> for io_uring, so the app can just close it. Which means that your chain
>> should just include a close for whatever fd you open, unless you plan on
>> using it in the application afterwards.
>>
> 
> Yeah sorry, I didn't word that correctly. Let me try again:
> 
> My thinking was that you would still return the result of the open to
> userland, but also stash a struct file pointer in the internal chain
> representation. Then you just refer to that when you get the "magic" fd.
> 
> You'd still need to explicitly close the file though if you didn't want
> to use it past the end of the current chain. So, I guess you _do_ need
> the actual fd to properly close the file in that case.

Right, I'm caching both the fd and the file; we'll need both. See below,
quick hack. It needs a few prep patches: we always prepare the file upfront
and the prep handlers expect it to be there, so we'd have to do things a
bit differently for that. Also attached is a small test app that does
open+read+close.

> On another note, what happens if you do open+write+close and the write
> fails? Does the close still happen, or would you have to issue one
> separately after getting the result?

For any io_uring chain, if any link in the chain fails, then the rest of
the chain is errored. So if your open fails, you'd get -ECANCELED for
your read+close. If your read fails, just the close is errored. So yes,
you'd have to close the fd again if the chain doesn't fully execute.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e2065614ace..bbaea6b3e16a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -488,6 +488,10 @@ enum {
 	REQ_F_NEED_CLEANUP_BIT,
 	REQ_F_OVERFLOW_BIT,
 	REQ_F_POLLED_BIT,
+	REQ_F_OPEN_FD_BIT,
+
+	/* not a real bit, just to check we're not overflowing the space */
+	__REQ_F_LAST_BIT,
 };
 
 enum {
@@ -532,6 +536,8 @@ enum {
 	REQ_F_OVERFLOW		= BIT(REQ_F_OVERFLOW_BIT),
 	/* already went through poll handler */
 	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+	/* use chain previous open fd */
+	REQ_F_OPEN_FD		= BIT(REQ_F_OPEN_FD_BIT),
 };
 
 struct async_poll {
@@ -593,6 +599,8 @@ struct io_kiocb {
 			struct callback_head	task_work;
 			struct hlist_node	hash_node;
 			struct async_poll	*apoll;
+			struct file		*last_open_file;
+			int			last_open_fd;
 		};
 		struct io_wq_work	work;
 	};
@@ -1292,7 +1300,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file,
 {
 	if (fixed)
 		percpu_ref_put(&req->ctx->file_data->refs);
-	else
+	else if (!(req->flags & REQ_F_OPEN_FD))
 		fput(file);
 }
 
@@ -1435,6 +1443,12 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
 		list_del_init(&req->link_list);
 		if (!list_empty(&nxt->link_list))
 			nxt->flags |= REQ_F_LINK;
+		if (nxt->flags & REQ_F_OPEN_FD) {
+			WARN_ON_ONCE(nxt->file);
+			nxt->last_open_file = req->last_open_file;
+			nxt->last_open_fd = req->last_open_fd;
+			nxt->file = req->last_open_file;
+		}
 		*nxtptr = nxt;
 		break;
 	}
@@ -1957,37 +1971,20 @@ static bool io_file_supports_async(struct file *file)
 	return false;
 }
 
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      bool force_nonblock)
+static int __io_prep_rw(struct io_kiocb *req, bool force_nonblock)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	unsigned ioprio;
-	int ret;
 
 	if (S_ISREG(file_inode(req->file)->i_mode))
 		req->flags |= REQ_F_ISREG;
 
-	kiocb->ki_pos = READ_ONCE(sqe->off);
 	if (kiocb->ki_pos == -1 && !(req->file->f_mode & FMODE_STREAM)) {
 		req->flags |= REQ_F_CUR_POS;
 		kiocb->ki_pos = req->file->f_pos;
 	}
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
-	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
-	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
-	if (unlikely(ret))
-		return ret;
-
-	ioprio = READ_ONCE(sqe->ioprio);
-	if (ioprio) {
-		ret = ioprio_check_cap(ioprio);
-		if (ret)
-			return ret;
-
-		kiocb->ki_ioprio = ioprio;
-	} else
-		kiocb->ki_ioprio = get_current_ioprio();
+	kiocb->ki_flags |= iocb_flags(kiocb->ki_filp);
 
 	/* don't allow async punt if RWF_NOWAIT was requested */
 	if ((kiocb->ki_flags & IOCB_NOWAIT) ||
@@ -2011,6 +2008,31 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	return 0;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct kiocb *kiocb = &req->rw.kiocb;
+	unsigned ioprio;
+	int ret;
+
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		return ret;
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			return ret;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
 	req->rw.addr = READ_ONCE(sqe->addr);
 	req->rw.len = READ_ONCE(sqe->len);
 	/* we own ->private, reuse it for the buffer index */
@@ -2273,13 +2295,10 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct iov_iter iter;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe);
 	if (ret)
 		return ret;
 
-	if (unlikely(!(req->file->f_mode & FMODE_READ)))
-		return -EBADF;
-
 	/* either don't need iovec imported or already have it */
 	if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
 		return 0;
@@ -2304,6 +2323,13 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	size_t iov_count;
 	ssize_t io_size, ret;
 
+	ret = __io_prep_rw(req, force_nonblock);
+	if (ret)
+		return ret;
+
+	if (unlikely(!(req->file->f_mode & FMODE_READ)))
+		return -EBADF;
+
 	ret = io_import_iovec(READ, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
@@ -2362,13 +2388,10 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct iov_iter iter;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe);
 	if (ret)
 		return ret;
 
-	if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
-		return -EBADF;
-
 	/* either don't need iovec imported or already have it */
 	if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
 		return 0;
@@ -2393,6 +2416,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
 	size_t iov_count;
 	ssize_t ret, io_size;
 
+	ret = __io_prep_rw(req, force_nonblock);
+	if (ret)
+		return ret;
+
+	if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
+		return -EBADF;
+
 	ret = io_import_iovec(WRITE, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
@@ -2737,8 +2767,8 @@ static int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 {
+	struct file *file = NULL;
 	struct open_flags op;
-	struct file *file;
 	int ret;
 
 	if (force_nonblock)
@@ -2763,8 +2793,12 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 err:
 	putname(req->open.filename);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	if (ret < 0)
+	if (ret < 0) {
 		req_set_fail_links(req);
+	} else if (req->flags & REQ_F_LINK) {
+		req->last_open_file = file;
+		req->last_open_fd = ret;
+	}
 	io_cqring_add_event(req, ret);
 	io_put_req(req);
 	return 0;
@@ -2980,10 +3014,6 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;
 
 	req->close.fd = READ_ONCE(sqe->fd);
-	if (req->file->f_op == &io_uring_fops ||
-	    req->close.fd == req->ctx->ring_fd)
-		return -EBADF;
-
 	return 0;
 }
 
@@ -3013,6 +3043,18 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
 {
 	int ret;
 
+	if (req->flags & REQ_F_OPEN_FD) {
+		if (req->close.fd != IOSQE_FD_LAST_OPEN)
+			return -EBADF;
+		req->close.fd = req->last_open_fd;
+		req->last_open_file = NULL;
+		req->last_open_fd = -1;
+	}
+
+	if (req->file->f_op == &io_uring_fops ||
+	    req->close.fd == req->ctx->ring_fd)
+		return -EBADF;
+
 	req->close.put_file = NULL;
 	ret = __close_fd_get_file(req->close.fd, &req->close.put_file);
 	if (ret < 0)
@@ -3437,8 +3479,14 @@ static int __io_accept(struct io_kiocb *req, bool force_nonblock)
 		return -EAGAIN;
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
-	if (ret < 0)
+	if (ret < 0) {
 		req_set_fail_links(req);
+	} else if (req->flags & REQ_F_LINK) {
+		rcu_read_lock();
+		req->last_open_file = fcheck_files(current->files, ret);
+		rcu_read_unlock();
+		req->last_open_fd = ret;
+	}
 	io_cqring_add_event(req, ret);
 	io_put_req(req);
 	return 0;
@@ -4779,6 +4827,14 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 		return 0;
 
 	fixed = (flags & IOSQE_FIXED_FILE);
+	if (fd == IOSQE_FD_LAST_OPEN) {
+		if (fixed)
+			return -EBADF;
+		req->flags |= REQ_F_OPEN_FD;
+		req->file = NULL;
+		return 0;
+	}
+
 	if (unlikely(!fixed && req->needs_fixed_file))
 		return -EBADF;
 
@@ -7448,6 +7504,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
 
 	BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
+	BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
 	return 0;
 };
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 53b36311cdac..3ccf74efe381 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -77,6 +77,12 @@ enum {
 /* always go async */
 #define IOSQE_ASYNC		(1U << IOSQE_ASYNC_BIT)
 
+/*
+ * 'magic' ->fd values don't point to a real fd, but rather define how fds
+ * can be inherited through links in a chain
+ */
+#define IOSQE_FD_LAST_OPEN	(-4096)	/* previous result is fd */
+
 /*
  * io_uring_setup() flags
  */
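
The test app itself isn't reproduced in the archive; one plausible shape
for it, driving the IOSQE_FD_LAST_OPEN magic from the patch above via
liburing (a sketch of what such a test might look like, not the actual
attachment):

	#include <liburing.h>
	#include <fcntl.h>
	#include <stdio.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		static char buf[4096];

		io_uring_queue_init(8, &ring, 0);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_openat(sqe, AT_FDCWD, "/proc/self/maps", O_RDONLY, 0);
		sqe->flags |= IOSQE_IO_LINK;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, IOSQE_FD_LAST_OPEN, buf, sizeof(buf), 0);
		sqe->flags |= IOSQE_IO_LINK;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_close(sqe, IOSQE_FD_LAST_OPEN);

		io_uring_submit(&ring);

		/* one completion per link; a failed link errors the rest with -ECANCELED */
		for (int i = 0; i < 3; i++) {
			io_uring_wait_cqe(&ring, &cqe);
			printf("res=%d\n", cqe->res);
			io_uring_cqe_seen(&ring, cqe);
		}
		return 0;
	}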
Ian Kent March 4, 2020, 2:01 a.m. UTC | #75
On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > No, I don't think this is going to be a performance issue at all, but
> > > if anything we could introduce a syscall
> > > 
> > >   ssize_t readfile(int dfd, const char *path, char *buf,
> > >                    size_t bufsize, int flags);
> > 
> > off-topic, but I'll buy you many many beers if you implement it ;-),
> > because open + read + close is pretty common for /sys and /proc in
> > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd,
> > etc. are all about it.
> 
> Unlimited beers for a 21-line kernel patch?  Sign me up!
> 
> Totally untested, barely compiled patch below.
> 
> Actually, I like this idea (the syscall, not just the unlimited beers).
> Maybe this could make a lot of sense, I'll write some actual tests for
> it now that syscalls are getting "heavy" again due to CPU vendors
> finally paying the price for their madness...

The problem isn't with open->read->close but with the mount info
changing between reads (i.e. the seq file read takes and drops the
needed lock between reads at least once).

The problem is you don't know the buffer size needed to get this
in one hit; how is this different from read(2)?

> 
> thanks,
> 
> greg k-h
> -------------------
> 
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 44d510bc9b78..178cd45340e2 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -359,6 +359,7 @@
>  435	common	clone3			__x64_sys_clone3/ptregs
>  437	common	openat2			__x64_sys_openat2
>  438	common	pidfd_getfd		__x64_sys_pidfd_getfd
> +439	common	readfile		__x64_sys_readfile
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/open.c b/fs/open.c
> index 0788b3715731..1a830fada750 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct file *filp)
>  }
>  
>  EXPORT_SYMBOL(stream_open);
> +
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +		char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +	int retval;
> +	int fd;
> +
> +	if (force_o_largefile())
> +		flags |= O_LARGEFILE;
> +
> +	fd = do_sys_open(dfd, filename, flags, O_RDONLY);
> +	if (fd <= 0)
> +		return fd;
> +
> +	retval = ksys_read(fd, buffer, bufsize);
> +
> +	__close_fd(current->files, fd);
> +
> +	return retval;
> +}
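
With no libc wrapper, exercising the patch means calling it raw; a
sketch, assuming the syscall number 439 from the table above (not an
upstream number):

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <fcntl.h>
	#include <stdio.h>

	#define __NR_readfile 439	/* from the patch above; not upstream */

	int main(void)
	{
		char buf[4096];
		long n = syscall(__NR_readfile, AT_FDCWD, "/proc/version",
				 buf, sizeof(buf), 0);

		if (n < 0)
			return 1;
		fwrite(buf, 1, n, stdout);
		return 0;
	}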
Ian Kent March 4, 2020, 4:20 a.m. UTC | #76
On Tue, 2020-03-03 at 15:10 +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > If buffer is too small to fit the whole file, return error.
> > 
> > Why?  What's wrong with just returning the bytes asked for?  If someone
> > only wants 5 bytes from the front of a file, it should be fine to give
> > that to them, right?
> 
> I think we need to signal in some way to the caller that the result
> was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> caller might be surprised.
> 
> > > Verify that the number of bytes read matches the file size, otherwise
> > > return error (may need to loop?).
> > 
> > No, we can't "match file size" as sysfs files do not really have a sane
> > "size".  So I don't want to loop at all here, one-shot, that's all you
> > get :)
> 
> Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> until read returns zero) might be a good idea regardless, because short
> reads are allowed.

Surely a short read equates to an error.

That has to be the definition of readfile() because you can do the
looping thing with read(2) and get the entire file anyway.

If you think about it, don't you arrive at the conclusion that this can
be done with read(2) alone anyway? You have to loop to get the entire
file; otherwise there's no point to the syscall!

Ian
Karel Zak March 4, 2020, 3:22 p.m. UTC | #77
On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> > Actually, I like this idea (the syscall, not just the unlimited beers).
> > Maybe this could make a lot of sense, I'll write some actual tests for
> > it now that syscalls are getting "heavy" again due to CPU vendors
> > finally paying the price for their madness...
> 
> The problem isn't with open->read->close but with the mount info
> changing between reads (i.e. the seq file read takes and drops the
> needed lock between reads at least once).

readfile() is not a reaction to mountinfo.

The motivation is that we have many places with trivial
open->read->close for very small text files due to /sys and /proc. The
current way the kernel delivers these small strings to userspace seems
pretty inefficient when we could do the same with one syscall.

    Karel

$ strace -e openat,read,close -c ps aux
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.32    0.004190           4       987           read
 31.42    0.003039           3       844         4 openat
 25.26    0.002443           2       842           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.009672                  2673         4 total

$ strace -e openat,read,close -c lsns
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.95    0.001567           2       593           openat
 30.93    0.001213           2       597           close
 29.12    0.001142           3       365           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.003922                  1555           total


$ strace -e openat,read,close -c lscpu
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.67    0.001480           7       189        52 openat
 34.77    0.001152           6       180           read
 20.56    0.000681           4       140           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.003313                   509        52 total
Greg Kroah-Hartman March 4, 2020, 4:49 p.m. UTC | #78
On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
> On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> > On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> > > Actually, I like this idea (the syscall, not just the unlimited beers).
> > > Maybe this could make a lot of sense, I'll write some actual tests for
> > > it now that syscalls are getting "heavy" again due to CPU vendors
> > > finally paying the price for their madness...
> > 
> > The problem isn't with open->read->close but with the mount info
> > changing between reads (i.e. the seq file read takes and drops the
> > needed lock between reads at least once).
> 
> readfile() is not a reaction to mountinfo.
> 
> The motivation is that we have many places with trivial
> open->read->close for very small text files due to /sys and /proc. The
> current way the kernel delivers these small strings to userspace seems
> pretty inefficient when we could do the same with one syscall.
> 
>     Karel
> 
> $ strace -e openat,read,close -c ps aux
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  43.32    0.004190           4       987           read
>  31.42    0.003039           3       844         4 openat
>  25.26    0.002443           2       842           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.009672                  2673         4 total
> 
> $ strace -e openat,read,close -c lsns
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  39.95    0.001567           2       593           openat
>  30.93    0.001213           2       597           close
>  29.12    0.001142           3       365           read
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.003922                  1555           total
> 
> 
> $ strace -e openat,read,close -c lscpu
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  44.67    0.001480           7       189        52 openat
>  34.77    0.001152           6       180           read
>  20.56    0.000681           4       140           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.003313                   509        52 total

As a "real-world" test, would you recommend me converting one of the
above tools to my implementation of readfile to see how/if it actually
makes sense, or do you have some other tool you would rather see me try?

thanks,

greg k-h
Karel Zak March 4, 2020, 5:55 p.m. UTC | #79
On Wed, Mar 04, 2020 at 05:49:13PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
> > $ strace -e openat,read,close -c ps aux
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  43.32    0.004190           4       987           read
> >  31.42    0.003039           3       844         4 openat
> >  25.26    0.002443           2       842           close
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.009672                  2673         4 total
> > 
> > $ strace -e openat,read,close -c lsns
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  39.95    0.001567           2       593           openat
> >  30.93    0.001213           2       597           close
> >  29.12    0.001142           3       365           read
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.003922                  1555           total
> > 
> > 
> > $ strace -e openat,read,close -c lscpu
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  44.67    0.001480           7       189        52 openat
> >  34.77    0.001152           6       180           read
> >  20.56    0.000681           4       140           close
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.003313                   509        52 total
> 
> As a "real-world" test, would you recommend me converting one of the
> above tools to my implementation of readfile to see how/if it actually
> makes sense, or do you have some other tool you would rather see me try?

See lib/path.c and lib/sysfs.c in util-linux (https://github.com/karelzak/util-linux). 
For example ul_path_read() and ul_path_scanf(). 

We use it for lsblk, lsmem, lscpu, etc.

 $ git grep -c ul_path_read misc-utils/lsblk.c sys-utils/lscpu.c
 misc-utils/lsblk.c:30
 sys-utils/lscpu.c:31

We're probably a little bit off-topic here; no problem to continue on
util-linux@vger.kernel.org or by private mail.  Thanks!

    Karel
Miklos Szeredi March 6, 2020, 4:25 p.m. UTC | #80
On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> 
> I'm doing a patch.   Let's see how it fares in the face of all these
> preconceptions.

Here's a first cut.  Doesn't yet have superblock info, just mount info.
Probably has rough edges, but appears to work.

I started with sysfs, then kernfs, then went with a custom filesystem, because
neither could do what I wanted.

Anyway, this is more for review of the concept than for a code review, but
obviously if you see a fatal flaw in the design, please let me know.

get mountinfo from open file:

  cat /proc/$PID/fdmount/$FD/*

get mountinfo by mount ID:

  mount -t mountfs mountfs /mountfs
  cat /mountfs/$MNT_ID/*

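For illustration, a rough C equivalent of the mount-ID case, assuming
mountfs is mounted on /mountfs and using the attribute names defined in
the patch below (this helper is hypothetical, not part of the patch):

	#include <stdio.h>

	/* Print one attribute of a mount, e.g.
	 * mountfs_read_attr(21, "options"). */
	static int mountfs_read_attr(int mnt_id, const char *attr)
	{
		char path[64], buf[4096];
		FILE *f;

		snprintf(path, sizeof(path), "/mountfs/%i/%s", mnt_id, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		while (fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);
		fclose(f);
		return 0;
	}
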

Thanks,
Miklos

---
 fs/Makefile              |    1 
 fs/mount.h               |   11 +
 fs/mountfs/Makefile      |    1 
 fs/mountfs/super.c       |  497 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c           |   60 +++++
 fs/proc/base.c           |    2 
 fs/proc/fd.c             |   82 +++++++
 fs/proc/fd.h             |    3 
 fs/proc_namespace.c      |   22 --
 fs/seq_file.c            |   23 ++
 include/linux/seq_file.h |    1 
 11 files changed, 682 insertions(+), 21 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -135,3 +135,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +154,13 @@ static inline bool is_anon_ns(struct mnt
 {
 	return ns->seq == 0;
 }
+
+extern struct mount *get_mount(struct mount *mnt);
+extern void mntput_no_expire(struct mount *mnt);
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns);
+extern void mountfs_remove(struct mount *mnt);
+void seq_mount_children(struct seq_file *sf, struct mount *mnt);
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,497 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_MUTEX(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static struct mount *mountfs_get_mount(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+
+	rcu_read_lock();
+	mnt = get_mount(rcu_dereference(entry->mnt));
+	rcu_read_unlock();
+
+	return mnt;
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt = mountfs_get_mount(entry);
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int err = 0;
+
+	if (!mnt)
+		return -ENODEV;
+
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		path_put(&root);
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		int parent;
+
+		rcu_read_lock();
+		parent = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		rcu_read_unlock();
+
+		seq_printf(sf, "%i\n", parent);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		seq_mount_children(sf, mnt);
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			int master;
+
+			rcu_read_lock();
+			master = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			rcu_read_unlock();
+			seq_printf(sf, "%i\n", master);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			seq_mount_propagate_from(sf, mnt, &root);
+			path_put(&root);
+		}
+	}
+	mntput_no_expire(mnt);
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	if (mnt->mnt.mnt_flags & MNT_INTERNAL)
+		return;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	mutex_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+
+	mutex_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	mutex_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	if (WARN_ON(snprintf(buf, sizeof(buf), "%lu", mnt_id) >= sizeof(buf)) ||
+	    strcmp(buf, name) != 0)
+		return NULL;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	mutex_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOENT);
+		i++;
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	mutex_lock(&mountfs_lock);
+	mountfs_find_node(ctx->pos - 2, &node);
+	for (; node; node = rb_next(node)) {
+		entry = mountfs_node_to_entry(node);
+		len = snprintf(name, sizeof(name), "%i", entry->id);
+		if (WARN_ON(len >= sizeof(name)))
+			goto out_unlock;
+		if (!mountfs_entry_visible(entry))
+			continue;
+		ctx->pos = (loff_t) entry->id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(entry->id), DT_DIR))
+			goto out_unlock;
+	}
+	ctx->pos = (loff_t) INT_MAX + 3;
+out_unlock:
+	mutex_unlock(&mountfs_lock);
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	if (WARN_ON(this.len >= sizeof(name)))
+		return -EIO;
+
+	dentry = d_hash_and_lookup(root, &this);
+	if (!dentry) {
+		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+
+		dentry = d_alloc_parallel(root, &this, &wq);
+		if (!IS_ERR(dentry) && d_in_lookup(dentry)) {
+			kref_get(&entry->kref);
+			old = mountfs_lookup_entry(dentry, entry, 0);
+			d_lookup_done(dentry);
+			if (unlikely(old)) {
+				dput(dentry);
+				dentry = old;
+			}
+		}
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -172,6 +172,24 @@ unsigned int mnt_get_count(struct mount
 #endif
 }
 
+struct mount *get_mount(struct mount *mnt)
+{
+	if (mnt) {
+		/* see comment in mntput_no_expire() */
+		if (likely(READ_ONCE(mnt->mnt_ns))) {
+			mnt_add_count(mnt, 1);
+		} else {
+			lock_mount_hash();
+			if (mnt->mnt.mnt_flags & MNT_DOOMED)
+				mnt = NULL;
+			else
+				mnt_add_count(mnt, 1);
+			unlock_mount_hash();
+		}
+	}
+	return mnt;
+}
+
 static struct mount *alloc_vfsmnt(const char *name)
 {
 	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -1091,6 +1109,9 @@ static void cleanup_mnt(struct mount *mn
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
+	mountfs_remove(mnt);
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1120,7 +1141,7 @@ static void delayed_mntput(struct work_s
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
-static void mntput_no_expire(struct mount *mnt)
+void mntput_no_expire(struct mount *mnt)
 {
 	LIST_HEAD(list);
 
@@ -1296,6 +1317,37 @@ const struct seq_operations mounts_op =
 };
 #endif  /* CONFIG_PROC_FS */
 
+void seq_mount_children(struct seq_file *sf, struct mount *mnt)
+{
+	struct mount *child;
+	bool first = true;
+
+	down_read(&namespace_sem);
+	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+		if (!first)
+			seq_putc(sf, ',');
+		else
+			first = false;
+		seq_printf(sf, "%i", child->mnt_id);
+	}
+	up_read(&namespace_sem);
+	if (!first)
+		seq_putc(sf, '\n');
+}
+
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root)
+{
+	int dom;
+
+	down_read(&namespace_sem);
+	dom = get_dominating_id(mnt, root);
+	up_read(&namespace_sem);
+
+	if (dom)
+		seq_printf(sf, "%i\n", dom);
+}
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
@@ -2062,6 +2114,9 @@ static int attach_recursive_mnt(struct m
 		err = count_mounts(ns, source_mnt);
 		if (err)
 			goto out;
+
+		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
+			mountfs_create(p, ns);
 	}
 
 	if (IS_MNT_SHARED(dest_mnt)) {
@@ -3224,6 +3279,7 @@ struct mnt_namespace *copy_mnt_ns(unsign
 	p = old;
 	q = new;
 	while (p) {
+		mountfs_create(q, new_ns);
 		q->mnt_ns = new_ns;
 		new_ns->mounts++;
 		if (new_fs) {
@@ -3686,6 +3742,8 @@ static void __init init_mount_tree(void)
 	if (IS_ERR(ns))
 		panic("Can't allocate initial namespace");
 	m = real_mount(mnt);
+
+	mountfs_create(m, ns);
 	m->mnt_ns = ns;
 	ns->root = m;
 	ns->mounts = 1;
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations pro
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations pro
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_fil
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struc
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, str
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
Al Viro March 6, 2020, 7:43 p.m. UTC | #81
On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > 
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
> 
> Here's a first cut.  Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.

For starters, you have just made namespace_sem held over copy_to_user().
This is not going to fly.
Miklos Szeredi March 6, 2020, 7:54 p.m. UTC | #82
On Fri, Mar 6, 2020 at 8:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
>
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

Where?

Thanks,
Miklos
Al Viro March 6, 2020, 7:58 p.m. UTC | #83
On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > 
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> > 
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
> 
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

In case the above is too terse: you grab your mutex while under
namespace_sem (see attach_recursive_mnt()); the same mutex is held
while calling dir_emit().  Which can (and normally does) copy data
to userland-supplied buffer.

NAK for that reason alone, and to be honest I had been too busy
suppressing the gag reflex to read and comment any deeper.

I really hate that approach, in case it's not clear from the above.
To the degree that I don't trust myself to filter out the obscenities
if I try to comment on it right now.

The only blocking thing we can afford under namespace_sem is GFP_KERNEL
allocation.
Al Viro March 6, 2020, 8:05 p.m. UTC | #84
On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > 
> > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > preconceptions.
> > > 
> > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > Probably has rough edges, but appears to work.
> > 
> > For starters, you have just made namespace_sem held over copy_to_user().
> > This is not going to fly.
> 
> In case the above is too terse: you grab your mutex while under
> namespace_sem (see attach_recursive_mnt()); the same mutex is held
> while calling dir_emit().  Which can (and normally does) copy data
> to userland-supplied buffer.
> 
> NAK for that reason alone, and to be honest I had been too busy
> suppressing the gag reflex to read and comment any deeper.
> 
> I really hate that approach, in case it's not clear from the above.
> To the degree that I don't trust myself to filter out the obscenities
> if I try to comment on it right now.
> 
> The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> allocation.

Incidentally, attach_recursive_mnt() only gets you the root(s) of
attached tree(s); try mount --rbind and see how much you've missed.
Miklos Szeredi March 6, 2020, 8:11 p.m. UTC | #85
On Fri, Mar 6, 2020 at 9:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > >
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > >
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> >
> > In case the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit().  Which can (and normally does) copy data
> > to userland-supplied buffer.
> >
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> >
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> >
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
>
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

Okay.  Both trivially fixable:

 - the dir_emit() can be taken out from under the mutex and the rb
tree search repeated for every entry; possibly not as efficient, but I
guess at this point that's irrelevant

 - addition of the mountfs entry moved to the right places

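For the first point, a sketch of what the reworked loop could look like
(root-directory case only, untested, just to make the idea concrete):
the rbtree is rescanned for every entry, and dir_emit(), and thus
copy_to_user(), runs with no locks held:

	static int mountfs_readdir(struct file *file, struct dir_context *ctx)
	{
		struct mountfs_entry *entry;
		struct rb_node *node;
		char name[32];
		unsigned int len;
		bool emitted;

		if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
			return 0;

		for (;;) {
			/* find the first visible entry with
			 * id >= ctx->pos - 2 while holding the lock */
			mutex_lock(&mountfs_lock);
			entry = NULL;
			for (node = rb_first(&mountfs_entries); node;
			     node = rb_next(node)) {
				entry = mountfs_node_to_entry(node);
				if (entry->id >= ctx->pos - 2 &&
				    mountfs_entry_visible(entry))
					break;
				entry = NULL;
			}
			if (entry)
				kref_get(&entry->kref);
			mutex_unlock(&mountfs_lock);

			if (!entry)
				break;

			/* no locks held across the copy to userspace */
			len = snprintf(name, sizeof(name), "%i", entry->id);
			ctx->pos = (loff_t) entry->id + 2;
			emitted = dir_emit(ctx, name, len,
					   MOUNTFS_INO(entry->id), DT_DIR);
			mountfs_entry_put(entry);
			if (!emitted)
				return 0;
			ctx->pos++;
		}
		ctx->pos = (loff_t) INT_MAX + 3;
		return 0;
	}
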
Thanks,
Miklos
Al Viro March 6, 2020, 8:37 p.m. UTC | #86
On Fri, Mar 06, 2020 at 08:05:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > > 
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > > 
> > > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > > 
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> > 
> > In case the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit().  Which can (and normally does) copy data
> > to userland-supplied buffer.
> > 
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> > 
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> > 
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
> 
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

You are misreading mntput_no_expire(), BTW - your get_mount() can
bloody well race with umount(2), hitting the moment when we are done
figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
is not mounted anywhere else, they are not supposed to see the sucker
return 0 until the filesystem is shut down.  You break that.
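
A race-free variant would presumably have to take mount_lock
unconditionally, so that the increment is serialized against the busy
check and the ->mnt_ns/MNT_DOOMED updates in the umount path.  A
sketch, not a tested fix:

	struct mount *get_mount(struct mount *mnt)
	{
		if (!mnt)
			return NULL;

		lock_mount_hash();
		/* umount_tree() clears ->mnt_ns and sets MNT_DOOMED under
		 * mount_lock, and the umount busy check runs under it too,
		 * so no increment can slip into that window */
		if (!mnt->mnt_ns || (mnt->mnt.mnt_flags & MNT_DOOMED))
			mnt = NULL;
		else
			mnt_add_count(mnt, 1);
		unlock_mount_hash();

		return mnt;
	}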
Al Viro March 6, 2020, 8:38 p.m. UTC | #87
On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:

> You are misreading mntput_no_expire(), BTW - your get_mount() can
> bloody well race with umount(2), hitting the moment when we are done
> figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> is not mounted anywhere else, they are not supposed to see the sucker
> return 0 until the filesystem is shut down.  You break that.

While we are at it, d_alloc_parallel() requires i_rwsem on parent held
at least shared.
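
That one looks mechanical to fix: mountfs_lookup_internal() can take
the root directory's i_rwsem shared around the lookup, along these
lines (a sketch against the patch above):

	int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
	{
		char name[32];
		struct qstr this = { .name = name };
		struct mount *mnt = real_mount(m);
		struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
		struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
		struct inode *dir = d_inode(root);

		this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);

		/* d_alloc_parallel() requires the parent's i_rwsem held
		 * at least shared */
		inode_lock_shared(dir);
		dentry = d_hash_and_lookup(root, &this);
		if (!dentry) {
			DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

			dentry = d_alloc_parallel(root, &this, &wq);
			if (!IS_ERR(dentry) && d_in_lookup(dentry)) {
				kref_get(&entry->kref);
				old = mountfs_lookup_entry(dentry, entry, 0);
				d_lookup_done(dentry);
				if (unlikely(old)) {
					dput(dentry);
					dentry = old;
				}
			}
		}
		inode_unlock_shared(dir);
		if (IS_ERR(dentry))
			return PTR_ERR(dentry);

		*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
		return 0;
	}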
Al Viro March 6, 2020, 8:45 p.m. UTC | #88
On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> 
> > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > bloody well race with umount(2), hitting the moment when we are done
> > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > is not mounted anywhere else, they are not supposed to see the sucker
> > return 0 until the filesystem is shut down.  You break that.
> 
> While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> at least shared.

Egads...  Let me see if I got it right - you are providing procfs symlinks
to objects on the internal mount of that thing.  And those objects happen
to be directories, so one can get to their parent that way.  Or am I misreading
that thing?
Al Viro March 6, 2020, 8:49 p.m. UTC | #89
On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > 
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down.  You break that.
> > 
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.
> 
> Egads...  Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing.  And those objects happen
> to be directories, so one can get to their parent that way.  Or am I misreading
> that thing?

IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
stat(2) on "00" spew into syslog?  Confused...
Miklos Szeredi March 6, 2020, 8:51 p.m. UTC | #90
On Fri, Mar 6, 2020 at 9:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> >
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down.  You break that.

Ah, good point.

> >
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.

Okay.

> Egads...  Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing.  And those objects happen
> to be directories, so one can get to their parent that way.  Or am I misreading
> that thing?

Yes.

Thanks,
Miklos
Miklos Szeredi March 6, 2020, 8:51 p.m. UTC | #91
On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > >
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down.  You break that.
> > >
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> >
> > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing.  And those objects happen
> > to be directories, so one can get to their parent that way.  Or am I misreading
> > that thing?
>
> IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> stat(2) on "00" spew into syslog?  Confused...

The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

Thanks,
Miklos
Al Viro March 6, 2020, 8:56 p.m. UTC | #92
On Fri, Mar 06, 2020 at 08:49:26PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > > 
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down.  You break that.
> > > 
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> > 
> > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing.  And those objects happen
> > to be directories, so one can get to their parent that way.  Or am I misreading
> > that thing?
> 
> IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> stat(2) on "00" spew into syslog?  Confused...

AFAICS, refcounting in there cannot be right:
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+                                    unsigned int flags)
+{
+       struct mountfs_entry *entry = dir->i_private;
+       int i = 0;
+               
+       if (entry) {
+               for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+                       if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+                               break;
+               if (i == ARRAY_SIZE(mountfs_attrs))
+                       return ERR_PTR(-ENOENT);
+               i++;
+       } else {
+               entry = mountfs_get_entry(dentry->d_name.name);
+               if (!entry)
+                       return ERR_PTR(-ENOENT);
+       }
+                          
+       return mountfs_lookup_entry(dentry, entry, i);
+}
ends up consuming a reference in mountfs_lookup_entry() (at the very least,
drops it in case of inode allocation hitting OOM) and nothing in that
loop in mountfs_lookup() appears to do a matching reference grab.
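
In other words, the attribute branch would need to take its own
reference before handing the entry to mountfs_lookup_entry(), roughly
(a sketch):

	if (entry) {
		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
				break;
		if (i == ARRAY_SIZE(mountfs_attrs))
			return ERR_PTR(-ENOENT);
		i++;
		/* mountfs_lookup_entry() consumes a reference; the
		 * directory's own reference stays with dir->i_private */
		kref_get(&entry->kref);
	} else {
		entry = mountfs_get_entry(dentry->d_name.name);
		if (!entry)
			return ERR_PTR(-ENOENT);
	}

	return mountfs_lookup_entry(dentry, entry, i);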
Al Viro March 6, 2020, 9:28 p.m. UTC | #93
On Fri, Mar 06, 2020 at 09:51:50PM +0100, Miklos Szeredi wrote:
> On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > > >
> > > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > > bloody well race with umount(2), hitting the moment when we are done
> > > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > > return 0 until the filesystem is shut down.  You break that.
> > > >
> > > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > > at least shared.
> > >
> > > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > > to objects on the internal mount of that thing.  And those objects happen
> > > to be directories, so one can get to their parent that way.  Or am I misreading
> > > that thing?
> >
> > IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> > by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> > stat(2) on "00" spew into syslog?  Confused...
> 
> The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

That makes even less sense - a buffer overrun on snprintf of an int into
a 32-character array?  That's what, future-proofing it for the time we
manage to issue 10^31 syscalls since the (much closer) moment when we
get a 128-bit int?
Greg Kroah-Hartman March 7, 2020, 9:48 a.m. UTC | #94
On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > 
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
> 
> Here's a first cut.  Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.
> 
> I started with sysfs, then kernfs, then went with a custom filesystem, because
> neither could do what I wanted.

Hm, what is wrong with kernfs that prevented you from using it here?
Just complexity or something else?

thanks,

greg k-h
Miklos Szeredi March 7, 2020, 8:48 p.m. UTC | #95
On Sat, Mar 7, 2020 at 10:48 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
> >
> > I started with sysfs, then kernfs, then went with a custom filesystem, because
> > neither could do what I wanted.
>
> Hm, what is wrong with kernfs that prevented you from using it here?
> Just complexity or something else?

I wanted to have a single instance covering all the namespaces, with
just a filtered view depending on the namespace of the task looking
at it.

Having a kernfs_node for each attribute is also rather heavy compared
to the size of struct mount.

Thanks,
Miklos