mbox series

[v4,0/3] fs: allow statmount to fetch the fs_subtype and sb_source

Message ID 20241111-statmount-v4-0-2eaf35d07a80@kernel.org (mailing list archive)
Headers show
Series fs: allow statmount to fetch the fs_subtype and sb_source | expand

Message

Jeff Layton Nov. 11, 2024, 3:09 p.m. UTC
Meta has some internal logging that scrapes /proc/self/mountinfo today.
I'd like to convert it to use listmount()/statmount(), so we can do a
better job of monitoring with containers. We're missing some fields
though. This patchset adds them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v4:
- Rename mnt_devname to sb_source
- Break statmount_string() behavior changes out into separate patch
- Link to v3: https://lore.kernel.org/r/20241107-statmount-v3-0-da5b9744c121@kernel.org

Changes in v3:
- Unescape the result of ->show_devname
- Move handling of nothing being emitted out of the switch statement
- Link to v2: https://lore.kernel.org/r/20241106-statmount-v2-0-93ba2aad38d1@kernel.org

Changes in v2:
- make statmount_fs_subtype
- return fast if no subtype is emitted
- new patch to allow statmount() to return devicename
- Link to v1: https://lore.kernel.org/r/20241106-statmount-v1-1-b93bafd97621@kernel.org

---
Jeff Layton (3):
      fs: don't let statmount return empty strings
      fs: add the ability for statmount() to report the fs_subtype
      fs: add the ability for statmount() to report the sb_source

 fs/namespace.c             | 68 ++++++++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/mount.h |  6 +++-
 2 files changed, 67 insertions(+), 7 deletions(-)
---
base-commit: 26213e1a6caa5a7f508b919059b0122b451f4dfe
change-id: 20241106-statmount-3f91a7ed75fa

Best regards,

Comments

Miklos Szeredi Nov. 12, 2024, 10:32 a.m. UTC | #1
On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote:
>
> Meta has some internal logging that scrapes /proc/self/mountinfo today.
> I'd like to convert it to use listmount()/statmount(), so we can do a
> better job of monitoring with containers. We're missing some fields
> though. This patchset adds them.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

When thinking of the naming for the unescaped options, I realized that
maybe "source" is better than "sb_soure" which just adds redundant
info.   Just a thought, don't bother if you don't agree.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>

Thanks,
Miklos
Jeff Layton Nov. 12, 2024, 11:12 a.m. UTC | #2
On Tue, 2024-11-12 at 11:32 +0100, Miklos Szeredi wrote:
> On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > I'd like to convert it to use listmount()/statmount(), so we can do a
> > better job of monitoring with containers. We're missing some fields
> > though. This patchset adds them.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> When thinking of the naming for the unescaped options, I realized that
> maybe "source" is better than "sb_soure" which just adds redundant
> info.   Just a thought, don't bother if you don't agree.
> 
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> 

I think sb_source fits better with the convention that the other fields
in the struct use. I say we stick with that.
Christian Brauner Nov. 12, 2024, 1:39 p.m. UTC | #3
On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> Meta has some internal logging that scrapes /proc/self/mountinfo today.
> I'd like to convert it to use listmount()/statmount(), so we can do a
> better job of monitoring with containers. We're missing some fields
> though. This patchset adds them.
> 
> 

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/3] fs: don't let statmount return empty strings
      https://git.kernel.org/vfs/vfs/c/75ead69a7173
[2/3] fs: add the ability for statmount() to report the fs_subtype
      https://git.kernel.org/vfs/vfs/c/ed9d95f691c2
[3/3] fs: add the ability for statmount() to report the sb_source
      https://git.kernel.org/vfs/vfs/c/d31cea0ab403
Karel Zak Nov. 13, 2024, 11:27 a.m. UTC | #4
On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > I'd like to convert it to use listmount()/statmount(), so we can do a
> > better job of monitoring with containers. We're missing some fields
> > though. This patchset adds them.
> > 
> > 
> 
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.

Jeff, thank you for this!

I have already implemented support for statmount() and listmount() in
libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
only remaining issue was the mount source and incomplete file system
type.

Currently, the library does not use these syscalls when mounting (it
still uses /proc/#/mountinfo). However, there is already a library API
to fetch mount information from the kernel using listmount+statmount,
and it can be accessed from the command line using:

   findmnt --kernel=listmount

Next on the wish list is a notification (a file descriptor that can be
used in epoll) that returns a 64-bit ID when there is a change in the
mount node. This will enable us to enhance systemd so that it does not
have to read the entire mount table after every change.


    Karel
Miklos Szeredi Nov. 13, 2024, 1:08 p.m. UTC | #5
On Wed, 13 Nov 2024 at 12:28, Karel Zak <kzak@redhat.com> wrote:

> Next on the wish list is a notification (a file descriptor that can be
> used in epoll) that returns a 64-bit ID when there is a change in the
> mount node. This will enable us to enhance systemd so that it does not
> have to read the entire mount table after every change.

IIRC Amir said he'd take this up (this was at LSFMM 2023).  I'm also
happy to take part in the design and implementation.

Thanks,
Miklos
Jeff Layton Nov. 13, 2024, 1:45 p.m. UTC | #6
On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > better job of monitoring with containers. We're missing some fields
> > > though. This patchset adds them.
> > > 
> > > 
> > 
> > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > Patches in the vfs.misc branch should appear in linux-next soon.
> 
> Jeff, thank you for this!
> 
> I have already implemented support for statmount() and listmount() in
> libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> only remaining issue was the mount source and incomplete file system
> type.
> 

Unfortunately, I think we might be missing something else:

The mountinfo (and "mounts") file generator calls show_sb_opts() which
generates some strings from the sb->s_flags field and then calls
security_sb_show_options(). statmount() will give you the s_flags field
(or an equivalent), but it doesn't give you the security options
string. So, those aren't currently visible from statmount().

How should we expose those? Should we create a new statmount string
field and populate it, or is it better to just tack them onto the end
of the statmount.mnt_opts string?

> Currently, the library does not use these syscalls when mounting (it
> still uses /proc/#/mountinfo). However, there is already a library API
> to fetch mount information from the kernel using listmount+statmount,
> and it can be accessed from the command line using:
> 
>    findmnt --kernel=listmount
> 

Very cool. I need to play with findmnt more.

> Next on the wish list is a notification (a file descriptor that can be
> used in epoll) that returns a 64-bit ID when there is a change in the
> mount node. This will enable us to enhance systemd so that it does not
> have to read the entire mount table after every change.
> 

New fanotify events for mount table changes, perhaps?
Jan Kara Nov. 13, 2024, 3:18 p.m. UTC | #7
On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > Next on the wish list is a notification (a file descriptor that can be
> > used in epoll) that returns a 64-bit ID when there is a change in the
> > mount node. This will enable us to enhance systemd so that it does not
> > have to read the entire mount table after every change.
> > 
> 
> New fanotify events for mount table changes, perhaps?

Now that I'm looking at it I'm not sure fanotify is a great fit for this
usecase. A lot of fanotify functionality does not really work for virtual
filesystems such as proc and hence we generally try to discourage use of
fanotify for them. So just supporting one type of event (like FAN_MODIFY)
on one file inside proc looks as rather inconsistent interface. But I
vaguely remember we were discussing some kind of mount event, weren't we?
Or was that for something else?

								Honza
Miklos Szeredi Nov. 13, 2024, 3:49 p.m. UTC | #8
On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > Next on the wish list is a notification (a file descriptor that can be
> > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > mount node. This will enable us to enhance systemd so that it does not
> > > have to read the entire mount table after every change.
> > >
> >
> > New fanotify events for mount table changes, perhaps?
>
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

Yeah, if memory serves right what we agreed on was that placing a
watch on a mount would result in events being generated for
mount/umount/move_mount directly under that mount.  So this would not
be monitoring the entire namespace as poll on /proc/$$/mountinfo does.
IIRC Lennart said that this is okay and even desirable for systemd,
since it's only interested in a particular set of mounts.

Thanks,
Miklos
Jan Kara Nov. 13, 2024, 4 p.m. UTC | #9
On Wed 13-11-24 16:49:52, Miklos Szeredi wrote:
> On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > Next on the wish list is a notification (a file descriptor that can be
> > > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > > mount node. This will enable us to enhance systemd so that it does not
> > > > have to read the entire mount table after every change.
> > > >
> > >
> > > New fanotify events for mount table changes, perhaps?
> >
> > Now that I'm looking at it I'm not sure fanotify is a great fit for this
> > usecase. A lot of fanotify functionality does not really work for virtual
> > filesystems such as proc and hence we generally try to discourage use of
> > fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> > on one file inside proc looks as rather inconsistent interface. But I
> > vaguely remember we were discussing some kind of mount event, weren't we?
> > Or was that for something else?
> 
> Yeah, if memory serves right what we agreed on was that placing a
> watch on a mount would result in events being generated for
> mount/umount/move_mount directly under that mount.  So this would not
> be monitoring the entire namespace as poll on /proc/$$/mountinfo does.
> IIRC Lennart said that this is okay and even desirable for systemd,
> since it's only interested in a particular set of mounts.

Oh, right. Thanks for reminding me. And yes, this would fit within what
fanotify supports quite nicely.

								Honza
Ian Kent Nov. 14, 2024, 1:45 a.m. UTC | #10
On 13/11/24 23:18, Jan Kara wrote:
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
>> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
>>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
>>> Next on the wish list is a notification (a file descriptor that can be
>>> used in epoll) that returns a 64-bit ID when there is a change in the
>>> mount node. This will enable us to enhance systemd so that it does not
>>> have to read the entire mount table after every change.
>>>
>> New fanotify events for mount table changes, perhaps?
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

I still need to have a look at the existing notifications sub-systems but,

tbh, I also don't think they offer the needed functionality.


The thing that was most useful with David's notifications when I was trying

to improve the mounts handling was the queuing interface. It allowed me to

batch notifications up to around a couple of hundred and grab them in one go

for processing. This significantly lowered the overhead of rapid fire event

processing. The ability to go directly to an individual mount and get it's

information only got about half the improvement I saw, the rest come 
from the

notifications improvement.


Ian
Ian Kent Nov. 14, 2024, 1:51 a.m. UTC | #11
On 13/11/24 23:18, Jan Kara wrote:
> On Wed 13-11-24 08:45:06, Jeff Layton wrote:
>> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
>>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
>>> Next on the wish list is a notification (a file descriptor that can be
>>> used in epoll) that returns a 64-bit ID when there is a change in the
>>> mount node. This will enable us to enhance systemd so that it does not
>>> have to read the entire mount table after every change.
>>>
>> New fanotify events for mount table changes, perhaps?
> Now that I'm looking at it I'm not sure fanotify is a great fit for this
> usecase. A lot of fanotify functionality does not really work for virtual
> filesystems such as proc and hence we generally try to discourage use of
> fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> on one file inside proc looks as rather inconsistent interface. But I
> vaguely remember we were discussing some kind of mount event, weren't we?
> Or was that for something else?

Well, yes, the idea was to be able to avoid the overhead of scanning the

proc mount table(s) pretty much entirely, particularly bad for rapid fire

event handling such as a largish number of rapid mounta or umounts.


Ian
Christian Brauner Nov. 14, 2024, 11:29 a.m. UTC | #12
On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote:
> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > > better job of monitoring with containers. We're missing some fields
> > > > though. This patchset adds them.
> > > > 
> > > > 
> > > 
> > > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > > Patches in the vfs.misc branch should appear in linux-next soon.
> > 
> > Jeff, thank you for this!
> > 
> > I have already implemented support for statmount() and listmount() in
> > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> > only remaining issue was the mount source and incomplete file system
> > type.
> > 
> 
> Unfortunately, I think we might be missing something else:
> 
> The mountinfo (and "mounts") file generator calls show_sb_opts() which
> generates some strings from the sb->s_flags field and then calls
> security_sb_show_options(). statmount() will give you the s_flags field
> (or an equivalent), but it doesn't give you the security options
> string. So, those aren't currently visible from statmount().
> 
> How should we expose those? Should we create a new statmount string
> field and populate it, or is it better to just tack them onto the end
> of the statmount.mnt_opts string?

I'm leaning towards using a separate field because mnt_opts/opts_array
is about filesystem specific mount options whereas the security mount
options are somewhat generic. So it's easy to tell them apart.
Jan Kara Nov. 14, 2024, 11:56 a.m. UTC | #13
On Thu 14-11-24 09:45:23, Ian Kent wrote:
> On 13/11/24 23:18, Jan Kara wrote:
> > On Wed 13-11-24 08:45:06, Jeff Layton wrote:
> > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > Next on the wish list is a notification (a file descriptor that can be
> > > > used in epoll) that returns a 64-bit ID when there is a change in the
> > > > mount node. This will enable us to enhance systemd so that it does not
> > > > have to read the entire mount table after every change.
> > > > 
> > > New fanotify events for mount table changes, perhaps?
> > Now that I'm looking at it I'm not sure fanotify is a great fit for this
> > usecase. A lot of fanotify functionality does not really work for virtual
> > filesystems such as proc and hence we generally try to discourage use of
> > fanotify for them. So just supporting one type of event (like FAN_MODIFY)
> > on one file inside proc looks as rather inconsistent interface. But I
> > vaguely remember we were discussing some kind of mount event, weren't we?
> > Or was that for something else?
> 
> I still need to have a look at the existing notifications sub-systems but,
> tbh, I also don't think they offer the needed functionality.
> 
> The thing that was most useful with David's notifications when I was trying
> to improve the mounts handling was the queuing interface. It allowed me to
> batch notifications up to around a couple of hundred and grab them in one go
> for processing. This significantly lowered the overhead of rapid fire event
> processing. The ability to go directly to an individual mount and get it's
> information only got about half the improvement I saw, the rest come from
> the notifications improvement.

Well, if we implemented the mount notification events in fanotify, then the
mount events get queued in the notification group queue and you can process
the whole batch of events in one go if you want. So I don't see batching as
an issue. What I'm more worried about is that watching the whole system
for new mounts is going to be somewhat cumbersome when all you can do is to
watch new mounts attached under an existing mount / filesystem.

								Honza
Jeff Layton Nov. 14, 2024, 12:29 p.m. UTC | #14
On Thu, 2024-11-14 at 12:29 +0100, Christian Brauner wrote:
> On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote:
> > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote:
> > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote:
> > > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote:
> > > > > Meta has some internal logging that scrapes /proc/self/mountinfo today.
> > > > > I'd like to convert it to use listmount()/statmount(), so we can do a
> > > > > better job of monitoring with containers. We're missing some fields
> > > > > though. This patchset adds them.
> > > > > 
> > > > > 
> > > > 
> > > > Applied to the vfs.misc branch of the vfs/vfs.git tree.
> > > > Patches in the vfs.misc branch should appear in linux-next soon.
> > > 
> > > Jeff, thank you for this!
> > > 
> > > I have already implemented support for statmount() and listmount() in
> > > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The
> > > only remaining issue was the mount source and incomplete file system
> > > type.
> > > 
> > 
> > Unfortunately, I think we might be missing something else:
> > 
> > The mountinfo (and "mounts") file generator calls show_sb_opts() which
> > generates some strings from the sb->s_flags field and then calls
> > security_sb_show_options(). statmount() will give you the s_flags field
> > (or an equivalent), but it doesn't give you the security options
> > string. So, those aren't currently visible from statmount().
> > 
> > How should we expose those? Should we create a new statmount string
> > field and populate it, or is it better to just tack them onto the end
> > of the statmount.mnt_opts string?
> 
> I'm leaning towards using a separate field because mnt_opts/opts_array
> is about filesystem specific mount options whereas the security mount
> options are somewhat generic. So it's easy to tell them apart.

Ordinarily, I might agree, but we're now growing a new mount option
field that has them separated by NULs. Will we need two extra fields
for this? One comma-separated, and one NUL separated?

/proc/#/mountinfo and mounts prepend these to the output of 
->show_options, so the simple solution would be to just prepend those
there instead of adding a new field. FWIW, only SELinux has any extra
mount options to show here.

Tough call -- anyone else have opinions?
Miklos Szeredi Nov. 14, 2024, 1:16 p.m. UTC | #15
On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote:

> Ordinarily, I might agree, but we're now growing a new mount option
> field that has them separated by NULs. Will we need two extra fields
> for this? One comma-separated, and one NUL separated?
>
> /proc/#/mountinfo and mounts prepend these to the output of
> ->show_options, so the simple solution would be to just prepend those
> there instead of adding a new field. FWIW, only SELinux has any extra
> mount options to show here.

Compromise: tack them onto the end of the comma separated list, but
add a new field for the nul separated security options.

I think this would be logical, since the comma separated list is more
useful for having a /proc/$$/mountinfo compatible string than for
actually interpreting what's in there.

Thanks,
Miklos
Miklos Szeredi Nov. 14, 2024, 1:20 p.m. UTC | #16
On Thu, 14 Nov 2024 at 12:56, Jan Kara <jack@suse.cz> wrote:

>   What I'm more worried about is that watching the whole system
> for new mounts is going to be somewhat cumbersome when all you can do is to
> watch new mounts attached under an existing mount / filesystem.

We don't even know if there's a use case for that.  I think it would
make sense to think about it when/if such a use case emerges.  The
inode notification interfaces went through that evolution too, no?

Thanks,
Miklos