[v4,0/3] fs: allow statmount to fetch the fs_subtype and sb_source

Message ID	20241111-statmount-v4-0-2eaf35d07a80@kernel.org (mailing list archive)
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 654B425777; Mon, 11 Nov 2024 15:10:21 +0000 (UTC) From: Jeff Layton <jlayton@kernel.org> Subject: [PATCH v4 0/3] fs: allow statmount to fetch the fs_subtype and sb_source Date: Mon, 11 Nov 2024 10:09:54 -0500 Message-Id: <20241111-statmount-v4-0-2eaf35d07a80@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit To: Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu>, Ian Kent <raven@themaw.net>, Josef Bacik <josef@toxicpanda.com>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Jeff Layton <jlayton@kernel.org>
Series	fs: allow statmount to fetch the fs_subtype and sb_source \| expand [v4,0/3] fs: allow statmount to fetch the fs_subtype and sb_source [v4,1/3] fs: don't let statmount return empty strings [v4,2/3] fs: add the ability for statmount() to report the fs_subtype [v4,3/3] fs: add the ability for statmount() to report the sb_source

Jeff Layton Nov. 11, 2024, 3:09 p.m. UTC

Meta has some internal logging that scrapes /proc/self/mountinfo today.
I'd like to convert it to use listmount()/statmount(), so we can do a
better job of monitoring with containers. We're missing some fields
though. This patchset adds them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v4:
- Rename mnt_devname to sb_source
- Break statmount_string() behavior changes out into separate patch
- Link to v3: https://lore.kernel.org/r/20241107-statmount-v3-0-da5b9744c121@kernel.org

Changes in v3:
- Unescape the result of ->show_devname
- Move handling of nothing being emitted out of the switch statement
- Link to v2: https://lore.kernel.org/r/20241106-statmount-v2-0-93ba2aad38d1@kernel.org

Changes in v2:
- make statmount_fs_subtype
- return fast if no subtype is emitted
- new patch to allow statmount() to return devicename
- Link to v1: https://lore.kernel.org/r/20241106-statmount-v1-1-b93bafd97621@kernel.org

---
Jeff Layton (3):
      fs: don't let statmount return empty strings
      fs: add the ability for statmount() to report the fs_subtype
      fs: add the ability for statmount() to report the sb_source

 fs/namespace.c             | 68 ++++++++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/mount.h |  6 +++-
 2 files changed, 67 insertions(+), 7 deletions(-)
---
base-commit: 26213e1a6caa5a7f508b919059b0122b451f4dfe
change-id: 20241106-statmount-3f91a7ed75fa

Best regards,

Miklos Szeredi Nov. 12, 2024, 10:32 a.m. UTC | #1

On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote:
>
> Meta has some internal logging that scrapes /proc/self/mountinfo today.
> I'd like to convert it to use listmount()/statmount(), so we can do a
> better job of monitoring with containers. We're missing some fields
> though. This patchset adds them.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

When thinking of the naming for the unescaped options, I realized that
maybe "source" is better than "sb_soure" which just adds redundant
info.   Just a thought, don't bother if you don't agree.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>

Thanks,
Miklos

Jeff Layton Nov. 12, 2024, 11:12 a.m. UTC | #2

On Tue, 2024-11-12 at 11:32 +0100, Miklos Szeredi wrote:
> On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > I'd like to convert it to use listmount()/statmount(), so we can do a
> > better job of monitoring with containers. We're missing some fields
> > though. This patchset adds them.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> When thinking of the naming for the unescaped options, I realized that
> maybe "source" is better than "sb_soure" which just adds redundant
> info.   Just a thought, don't bother if you don't agree.
> 
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> 

I think sb_source fits better with the convention that the other fields
in the struct use. I say we stick with that.

Christian Brauner Nov. 12, 2024, 1:39 p.m. UTC | #3

On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> Meta has some internal logging that scrapes /proc/self/mountinfo today.
> I'd like to convert it to use listmount()/statmount(), so we can do a
> better job of monitoring with containers. We're missing some fields
> though. This patchset adds them.
> 
> 

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/3] fs: don't let statmount return empty strings
      https://git.kernel.org/vfs/vfs/c/75ead69a7173
[2/3] fs: add the ability for statmount() to report the fs_subtype
      https://git.kernel.org/vfs/vfs/c/ed9d95f691c2
[3/3] fs: add the ability for statmount() to report the sb_source
      https://git.kernel.org/vfs/vfs/c/d31cea0ab403

Karel Zak Nov. 13, 2024, 11:27 a.m. UTC | #4

On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > I'd like to convert it to use listmount()/statmount(), so we can do a
> > better job of monitoring with containers. We're missing some fields
> > though. This patchset adds them.
> > 
> > 
> 
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.

Jeff, thank you for this!

I have already implemented support for statmount() and listmount() in
libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
only remaining issue was the mount source and incomplete file system
type.

Currently, the library does not use these syscalls when mounting (it
still uses /proc/#/mountinfo). However, there is already a library API
to fetch mount information from the kernel using listmount+statmount,
and it can be accessed from the command line using:

   findmnt --kernel=listmount

Next on the wish list is a notification (a file descriptor that can be
used in epoll) that returns a 64-bit ID when there is a change in the
mount node. This will enable us to enhance systemd so that it does not
have to read the entire mount table after every change.

    Karel

Miklos Szeredi Nov. 13, 2024, 1:08 p.m. UTC | #5

On Wed, 13 Nov 2024 at 12:28, Karel Zak <kzak@redhat.com> wrote:

> Next on the wish list is a notification (a file descriptor that can be
> used in epoll) that returns a 64-bit ID when there is a change in the
> mount node. This will enable us to enhance systemd so that it does not
> have to read the entire mount table after every change.

IIRC Amir said he'd take this up (this was at LSFMM 2023).  I'm also
happy to take part in the design and implementation.

Thanks,
Miklos

Jeff Layton Nov. 13, 2024, 1:45 p.m. UTC | #6

On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > better job of monitoring with containers. We're missing some fields
> > > though. This patchset adds them.
> > > 
> > > 
> > 
> > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > Patches in the vfs.misc branch should appear in linux-next soon.
> 
> Jeff, thank you for this!
> 
> I have already implemented support for statmount() and listmount() in
> libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> only remaining issue was the mount source and incomplete file system
> type.
> 

Unfortunately, I think we might be missing something else:

The mountinfo (and "mounts") file generator calls show_sb_opts() which
generates some strings from the sb->s_flags field and then calls
security_sb_show_options(). statmount() will give you the s_flags field
(or an equivalent), but it doesn't give you the security options
string. So, those aren't currently visible from statmount().

How should we expose those? Should we create a new statmount string
field and populate it, or is it better to just tack them onto the end
of the statmount.mnt_opts string?

> Currently, the library does not use these syscalls when mounting (it
> still uses /proc/#/mountinfo). However, there is already a library API
> to fetch mount information from the kernel using listmount+statmount,
> and it can be accessed from the command line using:
> 
>    findmnt --kernel=listmount
> 

Very cool. I need to play with findmnt more.

> Next on the wish list is a notification (a file descriptor that can be
> used in epoll) that returns a 64-bit ID when there is a change in the
> mount node. This will enable us to enhance systemd so that it does not
> have to read the entire mount table after every change.
> 

New fanotify events for mount table changes, perhaps?

Jan Kara Nov. 13, 2024, 3:18 p.m. UTC | #7

On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > Next on the wish list is a notification (a file descriptor that can be
> > used in epoll) that returns a 64-bit ID when there is a change in the
> > mount node. This will enable us to enhance systemd so that it does not
> > have to read the entire mount table after every change.
> > 
> 
> New fanotify events for mount table changes, perhaps?

Now that I'm looking at it I'm not sure fanotify is a great fit for this
usecase. A lot of fanotify functionality does not really work for virtual
filesystems such as proc and hence we generally try to discourage use of
fanotify for them. So just supporting one type of event (like FAN_MODIFY)
on one file inside proc looks as rather inconsistent interface. But I
vaguely remember we were discussing some kind of mount event, weren't we?
Or was that for something else?

								Honza

Miklos Szeredi Nov. 13, 2024, 3:49 p.m. UTC | #8

On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > Next on the wish list is a notification (a file descriptor that can be
> > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > mount node. This will enable us to enhance systemd so that it does not
> > > have to read the entire mount table after every change.
> > >
> >
> > New fanotify events for mount table changes, perhaps?
>
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

Yeah, if memory serves right what we agreed on was that placing a
watch on a mount would result in events being generated for
mount/umount/move_mount directly under that mount.  So this would not
be monitoring the entire namespace as poll on /proc/$$/mountinfo does.
IIRC Lennart said that this is okay and even desirable for systemd,
since it's only interested in a particular set of mounts.

Thanks,
Miklos

Jan Kara Nov. 13, 2024, 4 p.m. UTC | #9

On Wed 13-11-24 16:49:52, Miklos Szeredi wrote:
> On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > Next on the wish list is a notification (a file descriptor that can be
> > > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > > mount node. This will enable us to enhance systemd so that it does not
> > > > have to read the entire mount table after every change.
> > > >
> > >
> > > New fanotify events for mount table changes, perhaps?
> >
> > Now that I'm looking at it I'm not sure fanotify is a great fit for this
> > usecase. A lot of fanotify functionality does not really work for virtual
> > filesystems such as proc and hence we generally try to discourage use of
> > fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> > on one file inside proc looks as rather inconsistent interface. But I
> > vaguely remember we were discussing some kind of mount event, weren't we?
> > Or was that for something else?
> 
> Yeah, if memory serves right what we agreed on was that placing a
> watch on a mount would result in events being generated for
> mount/umount/move_mount directly under that mount.  So this would not
> be monitoring the entire namespace as poll on /proc/$$/mountinfo does.
> IIRC Lennart said that this is okay and even desirable for systemd,
> since it's only interested in a particular set of mounts.

Oh, right. Thanks for reminding me. And yes, this would fit within what
fanotify supports quite nicely.

								Honza

Ian Kent Nov. 14, 2024, 1:45 a.m. UTC | #10

On 13/11/24 23:18, Jan Kara wrote:
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
>> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
>>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
>>> Next on the wish list is a notification (a file descriptor that can be
>>> used in epoll) that returns a 64-bit ID when there is a change in the
>>> mount node. This will enable us to enhance systemd so that it does not
>>> have to read the entire mount table after every change.
>>>
>> New fanotify events for mount table changes, perhaps?
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

I still need to have a look at the existing notifications sub-systems but,

tbh, I also don't think they offer the needed functionality.


The thing that was most useful with David's notifications when I was trying

to improve the mounts handling was the queuing interface. It allowed me to

batch notifications up to around a couple of hundred and grab them in one go

for processing. This significantly lowered the overhead of rapid fire event

processing. The ability to go directly to an individual mount and get it's

information only got about half the improvement I saw, the rest come 
from the

notifications improvement.


Ian

Ian Kent Nov. 14, 2024, 1:51 a.m. UTC | #11

On 13/11/24 23:18, Jan Kara wrote:
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
>> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
>>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
>>> Next on the wish list is a notification (a file descriptor that can be
>>> used in epoll) that returns a 64-bit ID when there is a change in the
>>> mount node. This will enable us to enhance systemd so that it does not
>>> have to read the entire mount table after every change.
>>>
>> New fanotify events for mount table changes, perhaps?
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

Well, yes, the idea was to be able to avoid the overhead of scanning the

proc mount table(s) pretty much entirely, particularly bad for rapid fire

event handling such as a largish number of rapid mounta or umounts.


Ian

Christian Brauner Nov. 14, 2024, 11:29 a.m. UTC | #12

On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote:
> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > > better job of monitoring with containers. We're missing some fields
> > > > though. This patchset adds them.
> > > > 
> > > > 
> > > 
> > > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > > Patches in the vfs.misc branch should appear in linux-next soon.
> > 
> > Jeff, thank you for this!
> > 
> > I have already implemented support for statmount() and listmount() in
> > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> > only remaining issue was the mount source and incomplete file system
> > type.
> > 
> 
> Unfortunately, I think we might be missing something else:
> 
> The mountinfo (and "mounts") file generator calls show_sb_opts() which
> generates some strings from the sb->s_flags field and then calls
> security_sb_show_options(). statmount() will give you the s_flags field
> (or an equivalent), but it doesn't give you the security options
> string. So, those aren't currently visible from statmount().
> 
> How should we expose those? Should we create a new statmount string
> field and populate it, or is it better to just tack them onto the end
> of the statmount.mnt_opts string?

I'm leaning towards using a separate field because mnt_opts/opts_array
is about filesystem specific mount options whereas the security mount
options are somewhat generic. So it's easy to tell them apart.

Jan Kara Nov. 14, 2024, 11:56 a.m. UTC | #13

On Thu 14-11-24 09:45:23, Ian Kent wrote:
> On 13/11/24 23:18, Jan Kara wrote:
> > On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > Next on the wish list is a notification (a file descriptor that can be
> > > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > > mount node. This will enable us to enhance systemd so that it does not
> > > > have to read the entire mount table after every change.
> > > > 
> > > New fanotify events for mount table changes, perhaps?
> > Now that I'm looking at it I'm not sure fanotify is a great fit for this
> > usecase. A lot of fanotify functionality does not really work for virtual
> > filesystems such as proc and hence we generally try to discourage use of
> > fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> > on one file inside proc looks as rather inconsistent interface. But I
> > vaguely remember we were discussing some kind of mount event, weren't we?
> > Or was that for something else?
> 
> I still need to have a look at the existing notifications sub-systems but,
> tbh, I also don't think they offer the needed functionality.
> 
> The thing that was most useful with David's notifications when I was trying
> to improve the mounts handling was the queuing interface. It allowed me to
> batch notifications up to around a couple of hundred and grab them in one go
> for processing. This significantly lowered the overhead of rapid fire event
> processing. The ability to go directly to an individual mount and get it's
> information only got about half the improvement I saw, the rest come from
> the notifications improvement.

Well, if we implemented the mount notification events in fanotify, then the
mount events get queued in the notification group queue and you can process
the whole batch of events in one go if you want. So I don't see batching as
an issue. What I'm more worried about is that watching the whole system
for new mounts is going to be somewhat cumbersome when all you can do is to
watch new mounts attached under an existing mount / filesystem.

								Honza

Jeff Layton Nov. 14, 2024, 12:29 p.m. UTC | #14

On Thu, 2024-11-14 at 12:29 +0100, Christian Brauner wrote:
> On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote:
> > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > > > better job of monitoring with containers. We're missing some fields
> > > > > though. This patchset adds them.
> > > > > 
> > > > > 
> > > > 
> > > > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > > > Patches in the vfs.misc branch should appear in linux-next soon.
> > > 
> > > Jeff, thank you for this!
> > > 
> > > I have already implemented support for statmount() and listmount() in
> > > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> > > only remaining issue was the mount source and incomplete file system
> > > type.
> > > 
> > 
> > Unfortunately, I think we might be missing something else:
> > 
> > The mountinfo (and "mounts") file generator calls show_sb_opts() which
> > generates some strings from the sb->s_flags field and then calls
> > security_sb_show_options(). statmount() will give you the s_flags field
> > (or an equivalent), but it doesn't give you the security options
> > string. So, those aren't currently visible from statmount().
> > 
> > How should we expose those? Should we create a new statmount string
> > field and populate it, or is it better to just tack them onto the end
> > of the statmount.mnt_opts string?
> 
> I'm leaning towards using a separate field because mnt_opts/opts_array
> is about filesystem specific mount options whereas the security mount
> options are somewhat generic. So it's easy to tell them apart.

Ordinarily, I might agree, but we're now growing a new mount option
field that has them separated by NULs. Will we need two extra fields
for this? One comma-separated, and one NUL separated?

/proc/#/mountinfo and mounts prepend these to the output of 
->show_options, so the simple solution would be to just prepend those
there instead of adding a new field. FWIW, only SELinux has any extra
mount options to show here.

Tough call -- anyone else have opinions?

Miklos Szeredi Nov. 14, 2024, 1:16 p.m. UTC | #15

On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote:

> Ordinarily, I might agree, but we're now growing a new mount option
> field that has them separated by NULs. Will we need two extra fields
> for this? One comma-separated, and one NUL separated?
>
> /proc/#/mountinfo and mounts prepend these to the output of
> ->show_options, so the simple solution would be to just prepend those
> there instead of adding a new field. FWIW, only SELinux has any extra
> mount options to show here.

Compromise: tack them onto the end of the comma separated list, but
add a new field for the nul separated security options.

I think this would be logical, since the comma separated list is more
useful for having a /proc/$$/mountinfo compatible string than for
actually interpreting what's in there.

Thanks,
Miklos

Miklos Szeredi Nov. 14, 2024, 1:20 p.m. UTC | #16

On Thu, 14 Nov 2024 at 12:56, Jan Kara <jack@suse.cz> wrote:

>   What I'm more worried about is that watching the whole system
> for new mounts is going to be somewhat cumbersome when all you can do is to
> watch new mounts attached under an existing mount / filesystem.

We don't even know if there's a use case for that.  I think it would
make sense to think about it when/if such a use case emerges.  The
inode notification interfaces went through that evolution too, no?

Thanks,
Miklos

Christian Brauner Nov. 14, 2024, 2:48 p.m. UTC | #17

On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote:
> On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote:
> 
> > Ordinarily, I might agree, but we're now growing a new mount option
> > field that has them separated by NULs. Will we need two extra fields
> > for this? One comma-separated, and one NUL separated?
> >
> > /proc/#/mountinfo and mounts prepend these to the output of
> > ->show_options, so the simple solution would be to just prepend those
> > there instead of adding a new field. FWIW, only SELinux has any extra
> > mount options to show here.
> 
> Compromise: tack them onto the end of the comma separated list, but
> add a new field for the nul separated security options.
> 
> I think this would be logical, since the comma separated list is more
> useful for having a /proc/$$/mountinfo compatible string than for
> actually interpreting what's in there.

Fair. Here's an incremental for the array of security options.

diff --git a/fs/namespace.c b/fs/namespace.c
index 4f39c4aba85d..a9065a9ab971 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq)
 	return 0;
 }
 
+static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start)
+{
+	char *buf_end, *opt_start, *opt_end;
+	int count = 0;
+
+	buf_end = seq->buf + seq->count;
+	*buf_end = '\0';
+	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
+		opt_end = strchrnul(opt_start, ',');
+		*opt_end = '\0';
+		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
+		if (WARN_ON_ONCE(++count == INT_MAX))
+			return -EOVERFLOW;
+	}
+	seq->count = buf_start - 1 - seq->buf;
+	return count;
+}
+
 static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
 {
 	struct vfsmount *mnt = s->mnt;
 	struct super_block *sb = mnt->mnt_sb;
 	size_t start = seq->count;
-	char *buf_start, *buf_end, *opt_start, *opt_end;
-	u32 count = 0;
+	char *buf_start;
 	int err;
 
 	if (!sb->s_op->show_options)
@@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
 	if (seq->count == start)
 		return 0;
 
-	buf_end = seq->buf + seq->count;
-	*buf_end = '\0';
-	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
-		opt_end = strchrnul(opt_start, ',');
-		*opt_end = '\0';
-		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
-		if (WARN_ON_ONCE(++count == 0))
-			return -EOVERFLOW;
-	}
-	seq->count = buf_start - 1 - seq->buf;
-	s->sm.opt_num = count;
+	err = statmount_opt_unescape(seq, buf_start);
+	if (err < 0)
+		return err;
+
+	s->sm.opt_num = err;
+	return 0;
+}
+
+static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq)
+{
+	struct vfsmount *mnt = s->mnt;
+	struct super_block *sb = mnt->mnt_sb;
+	size_t start = seq->count;
+	char *buf_start;
+	int err;
+
+	buf_start = seq->buf + start;
+
+	err = security_sb_show_options(seq, sb);
+	if (!err)
+		return err;
+
+	if (unlikely(seq_has_overflowed(seq)))
+		return -EAGAIN;
+
+	if (seq->count == start)
+		return 0;
+
+	err = statmount_opt_unescape(seq, buf_start);
+	if (err < 0)
+		return err;
+
+	s->sm.opt_sec_num = err;
 	return 0;
 }
 
@@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag)
 		sm->opt_array = start;
 		ret = statmount_opt_array(s, seq);
 		break;
+	case STATMOUNT_OPT_SEC_ARRAY:
+		sm->opt_sec_array = start;
+		ret = statmount_opt_sec_array(s, seq);
+		break;
 	case STATMOUNT_FS_SUBTYPE:
 		sm->fs_subtype = start;
 		statmount_fs_subtype(s, seq);
@@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
 	if (!err && s->mask & STATMOUNT_OPT_ARRAY)
 		err = statmount_string(s, STATMOUNT_OPT_ARRAY);
 
+	if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY)
+		err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY);
+
 	if (!err && s->mask & STATMOUNT_FS_SUBTYPE)
 		err = statmount_string(s, STATMOUNT_FS_SUBTYPE);
 
@@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size)
 #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \
 			      STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \
 			      STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \
-			      STATMOUNT_OPT_ARRAY)
+			      STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY)
 
 static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq,
 			      struct statmount __user *buf, size_t bufsize,
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index c0fda4604187..569d938a5757 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -177,7 +177,9 @@ struct statmount {
 	__u32 sb_source;	/* [str] Source string of the mount */
 	__u32 opt_num;		/* Number of fs options */
 	__u32 opt_array;	/* [str] Array of nul terminated fs options */
-	__u64 __spare2[47];
+	__u32 opt_sec_num;	/* Number of security options */
+	__u32 opt_sec_array;	/* [str] Array of nul terminated security options */
+	__u64 __spare2[45];
 	char str[];		/* Variable size part containing strings */
 };
 
@@ -214,6 +216,7 @@ struct mnt_id_req {
 #define STATMOUNT_FS_SUBTYPE		0x00000100U	/* Want/got fs_subtype */
 #define STATMOUNT_SB_SOURCE		0x00000200U	/* Want/got sb_source */
 #define STATMOUNT_OPT_ARRAY		0x00000400U	/* Want/got opt_... */
+#define STATMOUNT_OPT_SEC_ARRAY		0x00000800U	/* Want/got opt_sec... */
 
 /*
  * Special @mnt_id values that can be passed to listmount

Jeff Layton Nov. 14, 2024, 2:51 p.m. UTC | #18

On Thu, 2024-11-14 at 15:48 +0100, Christian Brauner wrote:
> On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote:
> > On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > > Ordinarily, I might agree, but we're now growing a new mount option
> > > field that has them separated by NULs. Will we need two extra fields
> > > for this? One comma-separated, and one NUL separated?
> > > 
> > > /proc/#/mountinfo and mounts prepend these to the output of
> > > ->show_options, so the simple solution would be to just prepend those
> > > there instead of adding a new field. FWIW, only SELinux has any extra
> > > mount options to show here.
> > 
> > Compromise: tack them onto the end of the comma separated list, but
> > add a new field for the nul separated security options.
> > 
> > I think this would be logical, since the comma separated list is more
> > useful for having a /proc/$$/mountinfo compatible string than for
> > actually interpreting what's in there.
> 
> Fair. Here's an incremental for the array of security options.
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4f39c4aba85d..a9065a9ab971 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq)
>  	return 0;
>  }
>  
> +static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start)
> +{
> +	char *buf_end, *opt_start, *opt_end;
> +	int count = 0;
> +
> +	buf_end = seq->buf + seq->count;
> +	*buf_end = '\0';
> +	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
> +		opt_end = strchrnul(opt_start, ',');
> +		*opt_end = '\0';
> +		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
> +		if (WARN_ON_ONCE(++count == INT_MAX))
> +			return -EOVERFLOW;
> +	}
> +	seq->count = buf_start - 1 - seq->buf;
> +	return count;
> +}
> +
>  static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
>  {
>  	struct vfsmount *mnt = s->mnt;
>  	struct super_block *sb = mnt->mnt_sb;
>  	size_t start = seq->count;
> -	char *buf_start, *buf_end, *opt_start, *opt_end;
> -	u32 count = 0;
> +	char *buf_start;
>  	int err;
>  
>  	if (!sb->s_op->show_options)
> @@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
>  	if (seq->count == start)
>  		return 0;
>  
> -	buf_end = seq->buf + seq->count;
> -	*buf_end = '\0';
> -	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
> -		opt_end = strchrnul(opt_start, ',');
> -		*opt_end = '\0';
> -		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
> -		if (WARN_ON_ONCE(++count == 0))
> -			return -EOVERFLOW;
> -	}
> -	seq->count = buf_start - 1 - seq->buf;
> -	s->sm.opt_num = count;
> +	err = statmount_opt_unescape(seq, buf_start);
> +	if (err < 0)
> +		return err;
> +
> +	s->sm.opt_num = err;
> +	return 0;
> +}
> +
> +static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq)
> +{
> +	struct vfsmount *mnt = s->mnt;
> +	struct super_block *sb = mnt->mnt_sb;
> +	size_t start = seq->count;
> +	char *buf_start;
> +	int err;
> +
> +	buf_start = seq->buf + start;
> +
> +	err = security_sb_show_options(seq, sb);
> +	if (!err)
> +		return err;
> +
> +	if (unlikely(seq_has_overflowed(seq)))
> +		return -EAGAIN;
> +
> +	if (seq->count == start)
> +		return 0;
> +
> +	err = statmount_opt_unescape(seq, buf_start);
> +	if (err < 0)
> +		return err;
> +
> +	s->sm.opt_sec_num = err;
>  	return 0;
>  }
>  
> @@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag)
>  		sm->opt_array = start;
>  		ret = statmount_opt_array(s, seq);
>  		break;
> +	case STATMOUNT_OPT_SEC_ARRAY:
> +		sm->opt_sec_array = start;
> +		ret = statmount_opt_sec_array(s, seq);
> +		break;
>  	case STATMOUNT_FS_SUBTYPE:
>  		sm->fs_subtype = start;
>  		statmount_fs_subtype(s, seq);
> @@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
>  	if (!err && s->mask & STATMOUNT_OPT_ARRAY)
>  		err = statmount_string(s, STATMOUNT_OPT_ARRAY);
>  
> +	if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY)
> +		err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY);
> +
>  	if (!err && s->mask & STATMOUNT_FS_SUBTYPE)
>  		err = statmount_string(s, STATMOUNT_FS_SUBTYPE);
>  
> @@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size)
>  #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \
>  			      STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \
>  			      STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \
> -			      STATMOUNT_OPT_ARRAY)
> +			      STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY)
>  
>  static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq,
>  			      struct statmount __user *buf, size_t bufsize,
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index c0fda4604187..569d938a5757 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -177,7 +177,9 @@ struct statmount {
>  	__u32 sb_source;	/* [str] Source string of the mount */
>  	__u32 opt_num;		/* Number of fs options */
>  	__u32 opt_array;	/* [str] Array of nul terminated fs options */
> -	__u64 __spare2[47];
> +	__u32 opt_sec_num;	/* Number of security options */
> +	__u32 opt_sec_array;	/* [str] Array of nul terminated security options */
> +	__u64 __spare2[45];

shouldn't that be 46 ?

>  	char str[];		/* Variable size part containing strings */
>  };
>  
> @@ -214,6 +216,7 @@ struct mnt_id_req {
>  #define STATMOUNT_FS_SUBTYPE		0x00000100U	/* Want/got fs_subtype */
>  #define STATMOUNT_SB_SOURCE		0x00000200U	/* Want/got sb_source */
>  #define STATMOUNT_OPT_ARRAY		0x00000400U	/* Want/got opt_... */
> +#define STATMOUNT_OPT_SEC_ARRAY		0x00000800U	/* Want/got opt_sec... */
>  
>  /*
>   * Special @mnt_id values that can be passed to listmount

The rest looks good to me though!

Christian Brauner Nov. 14, 2024, 3:09 p.m. UTC | #19

On Thu, Nov 14, 2024 at 09:51:42AM -0500, Jeff Layton wrote:
> On Thu, 2024-11-14 at 15:48 +0100, Christian Brauner wrote:
> > On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote:
> > > On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > > Ordinarily, I might agree, but we're now growing a new mount option
> > > > field that has them separated by NULs. Will we need two extra fields
> > > > for this? One comma-separated, and one NUL separated?
> > > > 
> > > > /proc/#/mountinfo and mounts prepend these to the output of
> > > > ->show_options, so the simple solution would be to just prepend those
> > > > there instead of adding a new field. FWIW, only SELinux has any extra
> > > > mount options to show here.
> > > 
> > > Compromise: tack them onto the end of the comma separated list, but
> > > add a new field for the nul separated security options.
> > > 
> > > I think this would be logical, since the comma separated list is more
> > > useful for having a /proc/$$/mountinfo compatible string than for
> > > actually interpreting what's in there.
> > 
> > Fair. Here's an incremental for the array of security options.
> > 
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 4f39c4aba85d..a9065a9ab971 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq)
> >  	return 0;
> >  }
> >  
> > +static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start)
> > +{
> > +	char *buf_end, *opt_start, *opt_end;
> > +	int count = 0;
> > +
> > +	buf_end = seq->buf + seq->count;
> > +	*buf_end = '\0';
> > +	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
> > +		opt_end = strchrnul(opt_start, ',');
> > +		*opt_end = '\0';
> > +		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
> > +		if (WARN_ON_ONCE(++count == INT_MAX))
> > +			return -EOVERFLOW;
> > +	}
> > +	seq->count = buf_start - 1 - seq->buf;
> > +	return count;
> > +}
> > +
> >  static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
> >  {
> >  	struct vfsmount *mnt = s->mnt;
> >  	struct super_block *sb = mnt->mnt_sb;
> >  	size_t start = seq->count;
> > -	char *buf_start, *buf_end, *opt_start, *opt_end;
> > -	u32 count = 0;
> > +	char *buf_start;
> >  	int err;
> >  
> >  	if (!sb->s_op->show_options)
> > @@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq)
> >  	if (seq->count == start)
> >  		return 0;
> >  
> > -	buf_end = seq->buf + seq->count;
> > -	*buf_end = '\0';
> > -	for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) {
> > -		opt_end = strchrnul(opt_start, ',');
> > -		*opt_end = '\0';
> > -		buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1;
> > -		if (WARN_ON_ONCE(++count == 0))
> > -			return -EOVERFLOW;
> > -	}
> > -	seq->count = buf_start - 1 - seq->buf;
> > -	s->sm.opt_num = count;
> > +	err = statmount_opt_unescape(seq, buf_start);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	s->sm.opt_num = err;
> > +	return 0;
> > +}
> > +
> > +static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq)
> > +{
> > +	struct vfsmount *mnt = s->mnt;
> > +	struct super_block *sb = mnt->mnt_sb;
> > +	size_t start = seq->count;
> > +	char *buf_start;
> > +	int err;
> > +
> > +	buf_start = seq->buf + start;
> > +
> > +	err = security_sb_show_options(seq, sb);
> > +	if (!err)
> > +		return err;
> > +
> > +	if (unlikely(seq_has_overflowed(seq)))
> > +		return -EAGAIN;
> > +
> > +	if (seq->count == start)
> > +		return 0;
> > +
> > +	err = statmount_opt_unescape(seq, buf_start);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	s->sm.opt_sec_num = err;
> >  	return 0;
> >  }
> >  
> > @@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag)
> >  		sm->opt_array = start;
> >  		ret = statmount_opt_array(s, seq);
> >  		break;
> > +	case STATMOUNT_OPT_SEC_ARRAY:
> > +		sm->opt_sec_array = start;
> > +		ret = statmount_opt_sec_array(s, seq);
> > +		break;
> >  	case STATMOUNT_FS_SUBTYPE:
> >  		sm->fs_subtype = start;
> >  		statmount_fs_subtype(s, seq);
> > @@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
> >  	if (!err && s->mask & STATMOUNT_OPT_ARRAY)
> >  		err = statmount_string(s, STATMOUNT_OPT_ARRAY);
> >  
> > +	if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY)
> > +		err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY);
> > +
> >  	if (!err && s->mask & STATMOUNT_FS_SUBTYPE)
> >  		err = statmount_string(s, STATMOUNT_FS_SUBTYPE);
> >  
> > @@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size)
> >  #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \
> >  			      STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \
> >  			      STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \
> > -			      STATMOUNT_OPT_ARRAY)
> > +			      STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY)
> >  
> >  static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq,
> >  			      struct statmount __user *buf, size_t bufsize,
> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > index c0fda4604187..569d938a5757 100644
> > --- a/include/uapi/linux/mount.h
> > +++ b/include/uapi/linux/mount.h
> > @@ -177,7 +177,9 @@ struct statmount {
> >  	__u32 sb_source;	/* [str] Source string of the mount */
> >  	__u32 opt_num;		/* Number of fs options */
> >  	__u32 opt_array;	/* [str] Array of nul terminated fs options */
> > -	__u64 __spare2[47];
> > +	__u32 opt_sec_num;	/* Number of security options */
> > +	__u32 opt_sec_array;	/* [str] Array of nul terminated security options */
> > +	__u64 __spare2[45];
> 
> shouldn't that be 46 ?

Yes, apparently I can't count. Thanks for noticing!

> 
> >  	char str[];		/* Variable size part containing strings */
> >  };
> >  
> > @@ -214,6 +216,7 @@ struct mnt_id_req {
> >  #define STATMOUNT_FS_SUBTYPE		0x00000100U	/* Want/got fs_subtype */
> >  #define STATMOUNT_SB_SOURCE		0x00000200U	/* Want/got sb_source */
> >  #define STATMOUNT_OPT_ARRAY		0x00000400U	/* Want/got opt_... */
> > +#define STATMOUNT_OPT_SEC_ARRAY		0x00000800U	/* Want/got opt_sec... */
> >  
> >  /*
> >   * Special @mnt_id values that can be passed to listmount
> 
> The rest looks good to me though!
> -- 
> Jeff Layton <jlayton@kernel.org>

[v4,0/3] fs: allow statmount to fetch the fs_subtype and sb_source

Message

Comments