[3/4] inotify_user: add system call inotify_add_watch_at()

Message ID 20230918123217.932179-3-max.kellermann@ionos.com (mailing list archive)
State New, archived
Series [1/4] inotify_user: pass directory fd to inotify_find_inode()

Commit Message

Max Kellermann Sept. 18, 2023, 12:32 p.m. UTC
This implements a missing piece in the inotify API: referring to a
file by a directory file descriptor and a path name.  This can be
solved in userspace currently only by doing something similar to:

  int old = open(".");
  fchdir(dfd);
  inotify_add_watch(....);
  fchdir(old);
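
For illustration, a fuller sketch of that workaround in C. The helper name
inotify_add_watch_via_fchdir() is made up for this example and is not part of
the patch. Note that fchdir() changes the working directory of the whole
process, so this is only safe if no other thread resolves relative paths at
the same time.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: add an inotify watch for `pathname`, resolved
 * relative to the directory `dfd`, by temporarily changing the cwd.
 * Returns the watch descriptor, or -1 with errno set.
 */
int inotify_add_watch_via_fchdir(int ifd, int dfd, const char *pathname,
                                 uint32_t mask)
{
    int old, wd, saved_errno;

    assert(pathname != NULL);

    /* Remember the current working directory so it can be restored. */
    old = open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (old < 0)
        return -1;

    if (fchdir(dfd) < 0) {
        saved_errno = errno;
        close(old);
        errno = saved_errno;
        return -1;
    }

    wd = inotify_add_watch(ifd, pathname, mask);
    saved_errno = errno;

    /* Switch back; if that fails, undo the watch and report the error. */
    if (fchdir(old) < 0) {
        saved_errno = errno;
        if (wd >= 0)
            inotify_rm_watch(ifd, wd);
        wd = -1;
    }

    close(old);
    errno = saved_errno;
    return wd;
}
```

The proposed system call would replace the open()/fchdir()/fchdir()/close()
sequence with a single call that takes dfd directly.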

Support for LOOKUP_EMPTY is still missing.  We could add another IN_*
flag for that (which would clutter the IN_* flags list further) or
add a "flags" parameter to the new system call (which would however
duplicate features already present via special IN_* flags).

To: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
To: linux-fsdevel@vger.kernel.org
To: linux-kernel@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/notify/inotify/inotify_user.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Jan Kara Sept. 18, 2023, 12:40 p.m. UTC | #1
On Mon 18-09-23 14:32:16, Max Kellermann wrote:
> This implements a missing piece in the inotify API: referring to a
> file by a directory file descriptor and a path name.  This can be
> solved in userspace currently only by doing something similar to:
> 
>   int old = open(".");
>   fchdir(dfd);
>   inotify_add_watch(....);
>   fchdir(old);
> 
> Support for LOOKUP_EMPTY is still missing.  We could add another IN_*
> flag for that (which would clutter the IN_* flags list further) or
> add a "flags" parameter to the new system call (which would however
> duplicate features already present via special IN_* flags).
> 
> To: Jan Kara <jack@suse.cz>
> Cc: Amir Goldstein <amir73il@gmail.com>
> To: linux-fsdevel@vger.kernel.org
> To: linux-kernel@vger.kernel.org
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>

Thanks for the patches! But generally we don't add new functionality to the
inotify API and rather steer users towards fanotify. In this particular
case fanotify_mark(2) already has the support for dirfd + name. Is there
any problem with using fanotify for you? Note that since kernel 5.13 you
don't need CAP_SYS_ADMIN capability for fanotify functionality that is
more-or-less equivalent to what inotify provides.
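
A minimal sketch of what the dirfd + name form could look like from
userspace, assuming Linux 5.13 or later and a filesystem that fanotify
supports. The two wrapper names are made up for this example; fanotify_init()
and fanotify_mark() are the real calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: create a notification group that identifies
 * objects by file handle (FAN_REPORT_FID), which is the mode available
 * to processes without CAP_SYS_ADMIN. Returns the group fd or -1.
 */
int fanotify_open_unprivileged(void)
{
    return fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID | FAN_CLOEXEC,
                         O_RDONLY);
}

/*
 * Hypothetical wrapper: mark the object `name`, resolved relative to
 * `dirfd`, for the events in `mask`. Returns 0 or -1 with errno set.
 */
int fanotify_watch_at(int fan_fd, int dirfd, const char *name, uint64_t mask)
{
    assert(fan_fd >= 0);
    return fanotify_mark(fan_fd, FAN_MARK_ADD, mask, dirfd, name);
}
```

A caller would use it roughly as
fanotify_watch_at(fan_fd, dfd, "memory.events.local", FAN_MODIFY); whether
that succeeds depends on the kernel version and the filesystem.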

								Honza

> ---
>  fs/notify/inotify/inotify_user.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index b6e6f6ab21f8..8a9096c5ebb1 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -797,6 +797,12 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
>  	return do_inotify_add_watch(fd, AT_FDCWD, pathname, mask);
>  }
>  
> +SYSCALL_DEFINE4(inotify_add_watch_at, int, fd, int, dfd, const char __user *, pathname,
> +		u32, mask)
> +{
> +	return do_inotify_add_watch(fd, dfd, pathname, mask);
> +}
> +
>  SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
>  {
>  	struct fsnotify_group *group;
> -- 
> 2.39.2
>
Max Kellermann Sept. 18, 2023, 1:57 p.m. UTC | #2
On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> Note that since kernel 5.13 you
> don't need CAP_SYS_ADMIN capability for fanotify functionality that is
> more-or-less equivalent to what inotify provides.

Oh, I missed that change - I remember fanotify as being inaccessible
for unprivileged processes, and fanotify being designed for things
like virus scanners. Indeed I should migrate my code to fanotify.

If fanotify has now become the designated successor of inotify, that
should be hinted in the inotify manpage, and if inotify is effectively
feature-frozen, maybe that should be an extra status in the
MAINTAINERS file?

Max
Jan Kara Sept. 18, 2023, 2:23 p.m. UTC | #3
On Mon 18-09-23 15:57:43, Max Kellermann wrote:
> On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> > Note that since kernel 5.13 you
> > don't need CAP_SYS_ADMIN capability for fanotify functionality that is
> > more-or-less equivalent to what inotify provides.
> 
> Oh, I missed that change - I remember fanotify as being inaccessible
> for unprivileged processes, and fanotify being designed for things
> like virus scanners. Indeed I should migrate my code to fanotify.
> 
> If fanotify has now become the designated successor of inotify, that
> should be hinted in the inotify manpage, and if inotify is effectively
> feature-frozen, maybe that should be an extra status in the
> MAINTAINERS file?

The manpage update is a good idea. I'm not sure about the MAINTAINERS
status - we do have 'Obsolete' but I'm reluctant to mark inotify as
obsolete as it's perfectly fine for existing users, we fully maintain it
and support it but we just don't want to extend the API anymore. Amir, what
are your thoughts on this?

								Honza
Amir Goldstein Sept. 18, 2023, 3:28 p.m. UTC | #4
On Mon, Sep 18, 2023 at 5:23 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 18-09-23 15:57:43, Max Kellermann wrote:
> > On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> > > Note that since kernel 5.13 you
> > > don't need CAP_SYS_ADMIN capability for fanotify functionality that is
> > > more-or-less equivalent to what inotify provides.
> >
> > Oh, I missed that change - I remember fanotify as being inaccessible
> > for unprivileged processes, and fanotify being designed for things
> > like virus scanners. Indeed I should migrate my code to fanotify.
> >
> > If fanotify has now become the designated successor of inotify, that
> > should be hinted in the inotify manpage, and if inotify is effectively
> > feature-frozen, maybe that should be an extra status in the
> > MAINTAINERS file?
>
> The manpage update is a good idea. I'm not sure about the MAINTAINERS
> status - we do have 'Obsolete' but I'm reluctant to mark inotify as
> obsolete as it's perfectly fine for existing users, we fully maintain it
> and support it but we just don't want to extend the API anymore. Amir, what
> are your thoughts on this?

I think that the mention of inotify vs. fanotify features in fanotify.7 man page
is decent - if anyone wants to improve it I won't mind.
A mention of fanotify as successor in inotify.7 man page is not a bad idea -
patches welcome.

As to MAINTAINERS, I think that 'Maintained' feels right.
We may consider 'Odd Fixes' for inotify and certainly for 'dnotify',
but that sounds a bit too harsh for the level of maintenance they get.

I'd like to point out that IMO, man-page is mainly aimed for the UAPI
users and MAINTAINERS is mostly aimed at bug reporters and drive-by
developers who submit small fixes.

When developers wish to add a feature/improvement to a subsystem,
they are advised to send an RFC with their intentions to the subsystem
maintainers/list to get feedback on their design before starting to implement.
Otherwise, the feature could be NACKed for several reasons other than
"we would rather invest in the newer API".

Bottom line - I don't see a strong reason to change anything, but I also do
not object to improving man page nor to switching to 'Odd Fixes' status.

Thanks,
Amir.
Amir Goldstein Sept. 18, 2023, 6:05 p.m. UTC | #5
[Forked from https://lore.kernel.org/linux-fsdevel/20230918123217.932179-1-max.kellermann@ionos.com/]

On Mon, Sep 18, 2023 at 6:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 18, 2023 at 5:23 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 18-09-23 15:57:43, Max Kellermann wrote:
> > > On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> > > > Note that since kernel 5.13 you
> > > > don't need CAP_SYS_ADMIN capability for fanotify functionality that is
> > > > more-or-less equivalent to what inotify provides.
> > >
> > > Oh, I missed that change - I remember fanotify as being inaccessible
> > > for unprivileged processes, and fanotify being designed for things
> > > like virus scanners. Indeed I should migrate my code to fanotify.
> > >
> > > If fanotify has now become the designated successor of inotify, that
> > > should be hinted in the inotify manpage, and if inotify is effectively
> > > feature-frozen, maybe that should be an extra status in the
> > > MAINTAINERS file?
> >
> > The manpage update is a good idea. I'm not sure about the MAINTAINERS
> > status - we do have 'Obsolete' but I'm reluctant to mark inotify as
> > obsolete as it's perfectly fine for existing users, we fully maintain it
> > and support it but we just don't want to extend the API anymore. Amir, what
> > are your thoughts on this?
>
> I think that the mention of inotify vs. fanotify features in fanotify.7 man page
> is decent - if anyone wants to improve it I won't mind.
> A mention of fanotify as successor in inotify.7 man page is not a bad idea -
> patches welcome.
>
> As to MAINTAINERS, I think that 'Maintained' feels right.
> We may consider 'Odd Fixes' for inotify and certainly for 'dnotify',
> but that sounds a bit too harsh for the level of maintenance they get.
>
> I'd like to point out that IMO, man-page is mainly aimed for the UAPI
> users and MAINTAINERS is mostly aimed at bug reporters and drive-by
> developers who submit small fixes.
>
> When developers wish to add a feature/improvement to a subsystem,
> they are advised to send an RFC with their intentions to the subsystem
> maintainers/list to get feedback on their design before starting to implement.
> Otherwise, the feature could be NACKed for several reasons other than
> "we would rather invest in the newer API".
>
> Bottom line - I don't see a strong reason to change anything, but I also do
> not object to improving man page nor to switching to 'Odd Fixes' status.
>

BTW, before we can really mark inotify as Obsolete and document that
inotify was superseded by fanotify, there are at least two items on the
TODO list [1]:
1. UNMOUNT/IGNORED events
2. Filesystems without fid support

MOUNT/UNMOUNT fanotify events have already been discussed
and the feature has known users.

Christian has also mentioned [2] the IN_UNMOUNT use case for
waiting for sb shutdown several times and I will not be surprised
to see systemd starting to use inotify for that use case before too long...

Regarding the second item on the TODO list, we have had this discussion
before - if we are going to continue the current policy of opting-in to fanotify
(i.e. tmpfs, fuse, overlayfs, kernfs), we will always have odd filesystems that
only support inotify and we will need to keep supporting inotify only for the
users that use inotify on those odd filesystems.

OTOH, if we implement FAN_REPORT_DINO_NAME, we could
have fanotify inode mark support for any filesystem, where the
pinned marked inode ino is the object id.

Thanks,
Amir.

[1] https://github.com/amir73il/fsnotify-utils/wiki/fsnotify-TODO
[2] https://lore.kernel.org/linux-fsdevel/20230908-verflachen-neudefinition-4da649d673a9@brauner/
Max Kellermann Sept. 18, 2023, 7:45 p.m. UTC | #6
On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> Is there any problem with using fanotify for you?

Turns out fanotify is unusable for me, unfortunately.
I have been using inotify to get notifications of cgroup events, but
the cgroup filesystem appears to be unsupported by fanotify: all
attempts to use fanotify_mark() on cgroup event files fail with
ENODEV. I think that comes from fanotify_test_fsid(). Filesystems
without a fsid work just fine with inotify, but fail with fanotify.

Since fanotify lacks important features, is it really a good idea to
feature-freeze inotify?

(By the way, what was not documented is that fanotify_init() can only
be used by unprivileged processes if the FAN_REPORT_FID flag was
specified. I had to read the kernel sources to figure that out - I
have no idea why this limitation exists - the code comment in the
kernel source doesn't explain it.)
Amir Goldstein Sept. 19, 2023, 7:16 a.m. UTC | #7
On Mon, Sep 18, 2023 at 9:05 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [Forked from https://lore.kernel.org/linux-fsdevel/20230918123217.932179-1-max.kellermann@ionos.com/]
>
> On Mon, Sep 18, 2023 at 6:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Sep 18, 2023 at 5:23 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Mon 18-09-23 15:57:43, Max Kellermann wrote:
> > > > On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> > > > > Note that since kernel 5.13 you
> > > > > don't need CAP_SYS_ADMIN capability for fanotify functionality that is
> > > > > more-or-less equivalent to what inotify provides.
> > > >
> > > > Oh, I missed that change - I remember fanotify as being inaccessible
> > > > for unprivileged processes, and fanotify being designed for things
> > > > like virus scanners. Indeed I should migrate my code to fanotify.
> > > >
> > > > If fanotify has now become the designated successor of inotify, that
> > > > should be hinted in the inotify manpage, and if inotify is effectively
> > > > feature-frozen, maybe that should be an extra status in the
> > > > MAINTAINERS file?
> > >
> > > The manpage update is a good idea. I'm not sure about the MAINTAINERS
> > > status - we do have 'Obsolete' but I'm reluctant to mark inotify as
> > > obsolete as it's perfectly fine for existing users, we fully maintain it
> > > and support it but we just don't want to extend the API anymore. Amir, what
> > > are your thoughts on this?
> >
> > I think that the mention of inotify vs. fanotify features in fanotify.7 man page
> > is decent - if anyone wants to improve it I won't mind.
> > A mention of fanotify as successor in inotify.7 man page is not a bad idea -
> > patches welcome.
> >
> > As to MAINTAINERS, I think that 'Maintained' feels right.
> > We may consider 'Odd Fixes' for inotify and certainly for 'dnotify',
> > but that sounds a bit too harsh for the level of maintenance they get.
> >
> > I'd like to point out that IMO, man-page is mainly aimed for the UAPI
> > users and MAINTAINERS is mostly aimed at bug reporters and drive-by
> > developers who submit small fixes.
> >
> > When developers wish to add a feature/improvement to a subsystem,
> > they are advised to send an RFC with their intentions to the subsystem
> > maintainers/list to get feedback on their design before starting to implement.
> > Otherwise, the feature could be NACKed for several reasons other than
> > "we would rather invest in the newer API".
> >
> > Bottom line - I don't see a strong reason to change anything, but I also do
> > not object to improving man page nor to switching to 'Odd Fixes' status.
> >
>
> BTW, before we can really mark inotify as Obsolete and document that
> inotify was superseded by fanotify, there are at least two items on the
> TODO list [1]:
> 1. UNMOUNT/IGNORED events
> 2. Filesystems without fid support
>
> MOUNT/UNMOUNT fanotify events have already been discussed
> and the feature has known users.
>
> Christian has also mentioned [2] the IN_UNMOUNT use case for
> waiting for sb shutdown several times and I will not be surprised
> to see systemd starting to use inotify for that use case before too long...
>
> Regarding the second item on the TODO list, we have had this discussion
> before - if we are going to continue the current policy of opting-in to fanotify
> (i.e. tmpfs, fuse, overlayfs, kernfs), we will always have odd filesystems that
> only support inotify and we will need to keep supporting inotify only for the
> users that use inotify on those odd filesystems.
>
> OTOH, if we implement FAN_REPORT_DINO_NAME, we could
> have fanotify inode mark support for any filesystem, where the
> pinned marked inode ino is the object id.
>

Hi Max,

Not sure if you have seen my email before asking your question
on the original patch review thread.
I prefer to answer it here in the wider context of inotify maintenance,
because it touches directly on the topic I raised above.

On Mon, Sep 18, 2023 at 10:45 PM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Mon, Sep 18, 2023 at 2:40 PM Jan Kara <jack@suse.cz> wrote:
> > Is there any problem with using fanotify for you?
>
> Turns out fanotify is unusable for me, unfortunately.
> I have been using inotify to get notifications of cgroup events, but
> the cgroup filesystem appears to be unsupported by fanotify: all
> attempts to use fanotify_mark() on cgroup event files fail with
> ENODEV. I think that comes from fanotify_test_fsid(). Filesystems
> without a fsid work just fine with inotify, but fail with fanotify.
>

This was just fixed by Ivan in commit:
0ce7c12e88cf ("kernfs: attach uuid for every kernfs and report it in fsid")

> Since fanotify lacks important features, is it really a good idea to
> feature-freeze inotify?

As my summary above states, it is correct that fanotify does not
yet fully supersede inotify, but there is a plan to go in this direction,
hence, inotify is "being phased out"; it is not Obsolete nor Deprecated.

However, the question to be asked is different IMO:
When both inotify and fanotify do not support the use case at hand
(as in your case), which is better? to fix/improve inotify or to fix/improve
fanotify?

For me, there should be a very strong reason to choose improving
inotify over improving fanotify.

With the case at hand, you can see that the patch to improve fanotify
to support your use case was far simpler (in LOC at least) than your
patches, not to mention, not having to add a new syscall and new
documentation for an old phased out API.

But there may be exceptions, for example, in 4.19, inotify gained
a new feature:

4d97f7d53da7 ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")

I am not sure if this patch would have been accepted nowadays, but
we need to judge every case.

>
> (By the way, what was not documented is that fanotify_init() can only
> be used by unprivileged processes if the FAN_REPORT_FID flag was

fanotify_init(2):
       Prior to Linux 5.13, calling fanotify_init() required the
       CAP_SYS_ADMIN capability.  Since Linux 5.13, users may call
       fanotify_init() without the CAP_SYS_ADMIN capability to create
       and initialize an fanotify group with limited functionality.

       The limitations imposed on an event listener created by a user
       without the CAP_SYS_ADMIN capability are as follows:
...
              •  The user is required to create a group that identifies
                 filesystem objects by file handles, for example, by
                 providing the FAN_REPORT_FID flag.

I find this documentation that was written by Matthew very good,
but writing documentation is not my strong side and if you feel that
any part of the documentation should be improved I highly appreciate
the feedback and would appreciate man page patches even more.

When we get to the point that the missing functionality gaps between
inotify and fanotify have been closed, I will surely follow your advice
to mention that in the inotify man page and possibly also in MAINTAINERS.

> specified. I had to read the kernel sources to figure that out - I
> have no idea why this limitation exists - the code comment in the
> kernel source doesn't explain it.)

The legacy fanotify events open and report an event->fd as a way
to identify the object - that is not a safe practice for unprivileged listeners
for several reasons.

FAN_REPORT_FID is designed in a way to be almost a drop in replacement
for inotify watch descriptors as an opaque identifier of the object, except that
fsid+fhandle provide much stronger functionality than wd did.

The limitation that FAN_REPORT_FID requires that fs has fsid+fhandle is
a technicality.  It can be solved by either providing fsid and trivial
encode_fh() (*) to the filesystem in question (as was done in 6.6-rc1 for
overlayfs and kernfs) or by introducing a new mode FAN_REPORT_INO which
reports inode number instead of fsid+fhandle and is enough for listeners
that watch directories and files on a single fs.

Thanks,
Amir.

(*) the ability for fs to support only encode_fh() was added in kernel v6.5
96b2b072ee62 ("exportfs: allow exporting non-decodeable file handles to userspace")
and a man page patch was already posted [3].

>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/fsnotify-TODO
> [2] https://lore.kernel.org/linux-fsdevel/20230908-verflachen-neudefinition-4da649d673a9@brauner/
[3] https://lore.kernel.org/linux-fsdevel/20230903120433.2605027-1-amir73il@gmail.com/
Max Kellermann Sept. 19, 2023, 9:08 a.m. UTC | #8
On Tue, Sep 19, 2023 at 9:17 AM Amir Goldstein <amir73il@gmail.com> wrote:
> This was just fixed by Ivan in commit:
> 0ce7c12e88cf ("kernfs: attach uuid for every kernfs and report it in fsid")

Indeed, nice to see this will soon be fixed.

> As my summary above states, it is correct that fanotify does not
> yet fully supersede inotify, but there is a plan to go in this direction,
> hence, inotify is "being phased out"; it is not Obsolete nor Deprecated.

I agree that if inotify is to be phased out, we should concentrate on fanotify.

I'm however somewhat disappointed with the complexity of the fanotify
API. I'm not entirely convinced that fanotify is a good successor for
inotify, or that inotify should really be replaced. The additional
features that fanotify has could have been added to inotify instead; I
don't get why this needs an entirely new API. Of course, I'm late to
complain, having just learned about (the unprivileged availability of)
fanotify many years after it has been invented.

System calls needed for one inotify event:
- read()

System calls needed for one fanotify event:
- read()
- (do some magic to look up the fsid -
https://github.com/martinpitt/fatrace/blob/master/fatrace.c implements
a lookup table, yet more complexity that doesn't exist with inotify)
- open() to get a file descriptor for the fsid
- open_by_handle_at(fsid_fd, fid.handle)
- readlink("/proc/self/fd/%d") (which adds a dependency on /proc being mounted)
- close(fd)
- close(fsid_fd)

I should mention that this workflow still needs a privileged process -
yes, I can use fanotify in an unprivileged process, but
open_by_handle_at() is a privileged system call - it requires
CAP_DAC_READ_SEARCH. Without it, I cannot obtain information on which
file was modified, can I?
There is FAN_REPORT_NAME, but it gives me only the name of the
directory entry; but since I'm watching a lot of files and all of them
are called "memory.events.local", that's of no use.
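
For reference, the step right after read() needs no further system call and
no privilege: with FAN_REPORT_FID the fsid and file handle arrive inside the
event buffer. A minimal sketch, assuming the FID record is the first info
record of the event; the helper name event_fid() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/fanotify.h>

/*
 * Hypothetical helper: return the FID info record attached to one event
 * read from a FAN_REPORT_FID group, or NULL if the event carries none.
 * The record holds the fsid followed by a struct file_handle.
 */
const struct fanotify_event_info_fid *
event_fid(const struct fanotify_event_metadata *md)
{
    const struct fanotify_event_info_header *hdr;

    assert(md != NULL);

    if (md->vers != FANOTIFY_METADATA_VERSION)
        return NULL;

    /* Info records, if any, start right after the fixed-size metadata. */
    if (md->event_len <
        md->metadata_len + sizeof(struct fanotify_event_info_fid))
        return NULL;

    hdr = (const struct fanotify_event_info_header *)
          ((const char *)md + md->metadata_len);
    if (hdr->info_type != FAN_EVENT_INFO_TYPE_FID)
        return NULL;

    return (const struct fanotify_event_info_fid *)hdr;
}
```

Turning that fsid + handle back into a path is the part that needs either
open_by_handle_at() or a lookup table kept by the application.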

Or am I supposed to use name_to_handle_at() on all watched files to
roll my own lookup? (The name_to_handle_at() manpage doesn't make me
confident it's a reliable system call; it sounds like it needs
explicit support from filesystems.)

> > (By the way, what was not documented is that fanotify_init() can only
> > be used by unprivileged processes if the FAN_REPORT_FID flag was
[...]
> I find this documentation that was written by Matthew very good,

Indeed! That's my mistake, I missed this section.

> FAN_REPORT_FID is designed in a way to be almost a drop in replacement
> for inotify watch descriptors as an opaque identifier of the object, except that
> fsid+fhandle provide much stronger functionality than wd did.

How is it stronger?
Christian Brauner Sept. 19, 2023, 9:26 a.m. UTC | #9
> Christian has also mentioned [2] the IN_UNMOUNT use case for
> waiting for sb shutdown several times and I will not be surprised
> to see systemd starting to use inotify for that use case before too long...

I think that having a version of IN_UNMOUNT for fanotify would be great.
I've said so a couple of times indeed. It is a really good feature to
monitor superblock deactivation.
Jan Kara Sept. 19, 2023, 9:48 a.m. UTC | #10
On Mon 18-09-23 21:05:11, Amir Goldstein wrote:
> [Forked from https://lore.kernel.org/linux-fsdevel/20230918123217.932179-1-max.kellermann@ionos.com/]
...
> BTW, before we can really mark inotify as Obsolete and document that
> inotify was superseded by fanotify, there are at least two items on the
> TODO list [1]:

Yeah, as I wrote in the original thread, I don't feel like inotify should
be marked as obsolete (at least for some more time) so we are on the same
page here I think.

> 1. UNMOUNT/IGNORED events
> 2. Filesystems without fid support
> 
> MOUNT/UNMOUNT fanotify events have already been discussed
> and the feature has known users.
> 
> Christian has also mentioned [2] the IN_UNMOUNT use case for
> waiting for sb shutdown several times and I will not be surprised
> to see systemd starting to use inotify for that use case before too long...

Yup, both FAN_UNMOUNT and FAN_IGNORED should be easy. Unlike inotify, I'd
just make these explicit events you can opt into and not something you
always get.

> Regarding the second item on the TODO list, we have had this discussion
> before - if we are going to continue the current policy of opting-in to
> fanotify (i.e. tmpfs, fuse, overlayfs, kernfs), we will always have odd
> filesystems that only support inotify and we will need to keep supporting
> inotify only for the users that use inotify on those odd filesystems.
> 
> OTOH, if we implement FAN_REPORT_DINO_NAME, we could
> have fanotify inode mark support for any filesystem, where the
> pinned marked inode ino is the object id.

Is it a real problem after your work to allow filehandles that are not
necessarily usable for NFS export or open_by_handle()? As far as I remember
fanotify should be now able to handle anything that provides f_fsid in its
statfs(2) call. And as I'm checking filesystems not setting fsid currently are:

afs, coda, nfs - networking filesystems where inotify and fanotify have
  dubious value anyway

configfs, debugfs, devpts, efivarfs, hugetlbfs, openpromfs, proc, pstore,
ramfs, sysfs, tracefs - virtual filesystems where fsnotify functionality is
  quite limited. But some special cases could be useful. Adding fsid support
  is the same amount of trouble as for kernfs - a few LOC. In fact, we
  could perhaps add a fstype flag to indicate that this is a filesystem
  without persistent identification and so uuid should be autogenerated on
  mount (likely in alloc_super()) and f_fsid generated from sb->s_uuid.
  This way we could handle all these filesystems with trivial amount of
  effort.

freevxfs - the only real filesystem without f_fsid. Trivial to handle one
  way or the other.

So I don't think we need new uAPI additions to finish off this TODO item.
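
For anyone who wants to check a given mount from userspace, a minimal sketch
of the test is to look at f_fsid as reported by statfs(2). The helper name
fs_has_fsid() is made up for this example, and a non-zero fsid is only one of
the conditions fanotify checks.

```c
#include <assert.h>
#include <string.h>
#include <sys/vfs.h>

/*
 * Hypothetical helper: report whether the filesystem containing `path`
 * returns a non-zero f_fsid from statfs(2).
 * Returns 1 if it does, 0 if the fsid is all zero, -1 if statfs() fails.
 */
int fs_has_fsid(const char *path)
{
    struct statfs sfs;
    int fsid[2];

    assert(path != NULL);

    if (statfs(path, &sfs) < 0)
        return -1;

    /* fsid_t member names differ between libcs, so copy the raw words. */
    _Static_assert(sizeof(fsid) == sizeof(sfs.f_fsid), "unexpected fsid_t size");
    memcpy(fsid, &sfs.f_fsid, sizeof(fsid));

    return fsid[0] != 0 || fsid[1] != 0;
}
```

Calling it on a mount point from each of the filesystem groups above shows
which ones currently report a zero fsid on the running kernel.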

								Honza
Jan Kara Sept. 19, 2023, 10:01 a.m. UTC | #11
On Tue 19-09-23 11:08:21, Max Kellermann wrote:
> On Tue, Sep 19, 2023 at 9:17 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > As my summary above states, it is correct that fanotify does not
> > yet fully supersede inotify, but there is a plan to go in this direction,
> > hence, inotify is "being phased out"; it is not Obsolete nor Deprecated.
> 
> I agree that if inotify is to be phased out, we should concentrate on fanotify.
> 
> I'm however somewhat disappointed with the complexity of the fanotify
> API. I'm not entirely convinced that fanotify is a good successor for
> inotify, or that inotify should really be replaced. The additional
> features that fanotify has could have been added to inotify instead; I
> don't get why this needs an entirely new API. Of course, I'm late to
> complain, having just learned about (the unprivileged availability of)
> fanotify many years after it has been invented.
> 
> System calls needed for one inotify event:
> - read()
> 
> System calls needed for one fanotify event:
> - read()
> - (do some magic to look up the fsid -
> https://github.com/martinpitt/fatrace/blob/master/fatrace.c implements
> a lookup table, yet more complexity that doesn't exist with inotify)
> - open() to get a file descriptor for the fsid
> - open_by_handle_at(fsid_fd, fid.handle)
> - readlink("/proc/self/fd/%d") (which adds a dependency on /proc being mounted)
> - close(fd)
> - close(fsid_fd)
> 
> I should mention that this workflow still needs a privileged process -
> yes, I can use fanotify in an unprivileged process, but
> open_by_handle_at() is a privileged system call - it requires
> CAP_DAC_READ_SEARCH. Without it, I cannot obtain information on which
> file was modified, can I?
> There is FAN_REPORT_NAME, but it gives me only the name of the
> directory entry; but since I'm watching a lot of files and all of them
> are called "memory.events.local", that's of no use.
> 
> Or am I supposed to use name_to_handle_at() on all watched files to
> roll my own lookup? (The name_to_handle_at() manpage doesn't make me
> confident it's a reliable system call; it sounds like it needs
> explicit support from filesystems.)

So with inotify event, you get back 'wd' and 'name' to identify the object
where the event happened. How is this (for your usecase) different from
getting back 'fsid + handle' and 'name' back from fanotify? In inotify case
you had to somehow track wd -> path linkage, with fanotify you need to
track 'fsid + handle' -> path linkage.
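
A minimal sketch of such a 'fsid + handle' -> path table, assuming the
filesystem supports name_to_handle_at(). All names here (fid_key, fid_table
and the fid_* functions) and the fixed sizes are made up for this example; a
real listener would build one key per watched file when adding the mark and
compare it with the fsid and handle reported in each event.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/vfs.h>

#define FID_HANDLE_MAX 128 /* assumed bound, same value as MAX_HANDLE_SZ */
#define FID_TABLE_MAX  64  /* arbitrary capacity for this sketch */

struct fid_key {
    int fsid[2];
    int handle_type;
    unsigned int handle_bytes;
    unsigned char handle[FID_HANDLE_MAX];
};

struct fid_table {
    struct { struct fid_key key; char path[256]; } entries[FID_TABLE_MAX];
    size_t count;
};

/* Build the key for `path`. Returns 0 on success, -1 on failure. */
int fid_key_from_path(const char *path, struct fid_key *key)
{
    struct statfs sfs;
    struct file_handle *fh;
    int mount_id, ret = -1;

    fh = calloc(1, sizeof(*fh) + FID_HANDLE_MAX);
    if (fh == NULL)
        return -1;
    fh->handle_bytes = FID_HANDLE_MAX;

    if (statfs(path, &sfs) == 0 &&
        name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0) {
        memset(key, 0, sizeof(*key));
        memcpy(key->fsid, &sfs.f_fsid, sizeof(key->fsid));
        key->handle_type = fh->handle_type;
        key->handle_bytes = fh->handle_bytes;
        memcpy(key->handle, fh->f_handle, fh->handle_bytes);
        ret = 0;
    }
    free(fh);
    return ret;
}

int fid_key_equal(const struct fid_key *a, const struct fid_key *b)
{
    return a->fsid[0] == b->fsid[0] && a->fsid[1] == b->fsid[1] &&
           a->handle_type == b->handle_type &&
           a->handle_bytes == b->handle_bytes &&
           memcmp(a->handle, b->handle, a->handle_bytes) == 0;
}

/* Remember that `key` refers to `path`. Returns 0, or -1 if full. */
int fid_table_add(struct fid_table *t, const struct fid_key *key,
                  const char *path)
{
    assert(key->handle_bytes <= FID_HANDLE_MAX);
    if (t->count == FID_TABLE_MAX)
        return -1;
    t->entries[t->count].key = *key;
    snprintf(t->entries[t->count].path, sizeof(t->entries[t->count].path),
             "%s", path);
    t->count++;
    return 0;
}

/* Linear search; returns the stored path or NULL if the key is unknown. */
const char *fid_table_lookup(const struct fid_table *t,
                             const struct fid_key *key)
{
    for (size_t i = 0; i < t->count; i++)
        if (fid_key_equal(&t->entries[i].key, key))
            return t->entries[i].path;
    return NULL;
}
```

This is the same bookkeeping as a wd -> path array, only with a wider key.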

> > FAN_REPORT_FID is designed in a way to be almost a drop in replacement
> > for inotify watch descriptors as an opaque identifier of the object, except that
> > fsid+fhandle provide much stronger functionality than wd did.
> 
> How is it stronger?

For your particular usecase I don't think there's any advantage of
fsid+fhandle over plain wd. But if you want to monitor multiple filesystems
or if you have privileged process that can open by handle, or a standard
filesystem where handles are actually persistent, then there are benefits.

								Honza
Amir Goldstein Sept. 19, 2023, 10:42 a.m. UTC | #12
On Tue, Sep 19, 2023 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 19-09-23 11:08:21, Max Kellermann wrote:
> > On Tue, Sep 19, 2023 at 9:17 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > As my summary above states, it is correct that fanotify does not
> > > yet fully supersede inotify, but there is a plan to go in this direction,
> > > hence, inotify is "being phased out"; it is not Obsolete nor Deprecated.
> >
> > I agree that if inotify is to be phased out, we should concentrate on fanotify.
> >
> > I'm however somewhat disappointed with the complexity of the fanotify
> > API. I'm not entirely convinced that fanotify is a good successor for
> > inotify, or that inotify should really be replaced. The additional
> > features that fanotify has could have been added to inotify instead; I
> > don't get why this needs an entirely new API. Of course, I'm late to
> > complain, having just learned about (the unprivileged availability of)
> > fanotify many years after it has been invented.
> >
> > System calls needed for one inotify event:
> > - read()
> >
> > System calls needed for one fanotify event:
> > - read()
> > - (do some magic to look up the fsid -
> > https://github.com/martinpitt/fatrace/blob/master/fatrace.c implements
> > a lookup table, yet more complexity that doesn't exist with inotify)
> > - open() to get a file descriptor for the fsid
> > - open_by_handle_at(fsid_fd, fid.handle)
> > - readlink("/proc/self/fd/%d") (which adds a dependency on /proc being mounted)
> > - close(fd)
> > - close(fsid_fd)
> >
> > I should mention that this workflow still needs a privileged process -
> > yes, I can use fanotify in an unprivileged process, but
> > open_by_handle_at() is a privileged system call - it requires
> > CAP_DAC_READ_SEARCH. Without it, I cannot obtain information on which
> > file was modified, can I?
> > There is FAN_REPORT_NAME, but it gives me only the name of the
> > directory entry; but since I'm watching a lot of files and all of them
> > are called "memory.events.local", that's of no use.
> >
> > Or am I supposed to use name_to_handle_at() on all watched files to
> > roll my own lookup? (The name_to_handle_at() manpage doesn't make me
> > confident it's a reliable system call; it sounds like it needs
> > explicit support from filesystems.)
>
> So with inotify event, you get back 'wd' and 'name' to identify the object
> where the event happened. How is this (for your usecase) different from
> getting back 'fsid + handle' and 'name' back from fanotify? In inotify case
> you had to somehow track wd -> path linkage, with fanotify you need to
> track 'fsid + handle' -> path linkage.
>

And if you want to see an implementation of a drop-in replacement
of inotify/fanotify, you can take a look at:

https://github.com/inotify-tools/inotify-tools/pull/134

And specifically the first commit
41b2ec4 ("Index watches by fanotify fid") to understand why
fid is a drop-in replacement for wd.

> > > FAN_REPORT_FID is designed in a way to be almost a drop in replacement
> > > for inotify watch descriptors as an opaque identifier of the object, except that
> > > fsid+fhandle provide much stronger functionality than wd did.
> >
> > How is it stronger?
>
> For your particular usecase I don't think there's any advantage of
> fsid+fhandle over plain wd. But if you want to monitor multiple filesystems
> or if you have a privileged process that can open by handle, or a standard
> filesystem where handles are actually persistent, then there are benefits.
>

Those cases are demonstrated in the --filesystem functionality of the
pull request above, which handles "dynamic watches" instead of
having to set up watches recursively on all subdirs.

Thanks,
Amir.
Max Kellermann Sept. 19, 2023, 10:42 a.m. UTC | #13
On Tue, Sep 19, 2023 at 12:01 PM Jan Kara <jack@suse.cz> wrote:
> So with inotify event, you get back 'wd' and 'name' to identify the object
> where the event happened. How is this (for your usecase) different from
> getting back 'fsid + handle' and 'name' back from fanotify? In inotify case
> you had to somehow track wd -> path linkage, with fanotify you need to
> track 'fsid + handle' -> path linkage.

The wd is a simple "int" which is the return value of the system call,
and it's part of "struct inotify_event". One system call for
registering it, one system call for reading it.

From fanotify, I read a "struct fanotify_event_metadata", and then
check variable-length follow-up structs, iterate over those follow-up
structs, find the one with "info_type==FAN_EVENT_INFO_TYPE_FID", now I
have a "fsid" of type "__kernel_fsid_t" (a struct containing two
32-bit integers) and a "file_handle" (a variable-length opaque BLOB).
What do I do with these?
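
For illustration, the walk over the follow-up records described above
might look roughly like this. It is only a sketch: the helper name
find_fid_record() is made up here, and it assumes kernel headers recent
enough to define struct fanotify_event_info_fid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/fanotify.h> /* pulls in <linux/fanotify.h> */

/*
 * Find the first FID info record attached to one fanotify event.
 * "ev" points at an event inside a buffer filled by read() on a
 * fanotify fd. Returns NULL if no FID record follows the metadata.
 * (find_fid_record is a hypothetical helper, not a kernel or libc API.)
 */
const struct fanotify_event_info_fid *
find_fid_record(const struct fanotify_event_metadata *ev)
{
    const char *p, *end;

    assert(ev != NULL);
    p = (const char *)ev + ev->metadata_len;
    end = (const char *)ev + ev->event_len;

    while ((size_t)(end - p) >= sizeof(struct fanotify_event_info_header)) {
        struct fanotify_event_info_header hdr;

        memcpy(&hdr, p, sizeof(hdr));
        if (hdr.len < sizeof(hdr) || hdr.len > (size_t)(end - p))
            break; /* malformed or truncated record, stop walking */
        if (hdr.info_type == FAN_EVENT_INFO_TYPE_FID)
            return (const struct fanotify_event_info_fid *)p;
        p += hdr.len; /* records are variable-length */
    }
    return NULL;
}
```

The returned record carries the fsid plus the opaque file_handle bytes,
which is the lookup key the rest of this mail is about.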

The answer appears to be: when I registered, I should have obtained
the fsid (via statfs()) and the file_handle (via name_to_handle_at()).
That's three extra system calls. One statfs(), and twice
name_to_handle_at(), because the first one is needed to get the length
of the buffer I need to allocate for the file_handle (and hope my
filesystem supports file_handles, because apparently that's an
optional feature). Just look at the name_to_handle_at() manpage for
some horrors of its complexity.
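
A minimal sketch of that registration step, assuming the watched object
is named by a plain path. struct fid_key and fid_key_for_path() are
hypothetical names invented for this example; the system calls are the
ones mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/vfs.h>

/* Lookup key to remember per watched object (hypothetical type). */
struct fid_key {
    fsid_t fsid;
    struct file_handle *handle; /* malloc()ed, variable length */
};

/* Returns 0 on success, -1 with errno set on failure. */
int fid_key_for_path(const char *path, struct fid_key *key)
{
    struct statfs sfs;
    struct file_handle probe = { .handle_bytes = 0 };
    struct file_handle *h;
    int mount_id, saved;

    assert(path != NULL && key != NULL);

    /* Extra syscall 1: the fsid as reported by statfs(). */
    if (statfs(path, &sfs) < 0)
        return -1;

    /* Extra syscall 2: size probe; EOVERFLOW is the expected result. */
    if (name_to_handle_at(AT_FDCWD, path, &probe, &mount_id, 0) == 0) {
        errno = EINVAL; /* a zero-length handle would be useless */
        return -1;
    }
    if (errno != EOVERFLOW)
        return -1; /* e.g. EOPNOTSUPP: no handle support here */

    h = malloc(sizeof(*h) + probe.handle_bytes);
    if (h == NULL)
        return -1;
    h->handle_bytes = probe.handle_bytes;

    /* Extra syscall 3: fetch the actual handle. */
    if (name_to_handle_at(AT_FDCWD, path, h, &mount_id, 0) < 0) {
        saved = errno;
        free(h);
        errno = saved;
        return -1;
    }

    key->fsid = sfs.f_fsid;
    key->handle = h;
    return 0;
}
```

The caller owns key->handle and has to free() it when the watch goes
away.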

Imagine how much more complex the data structure for looking up the
modified file is: inotify has an int as the lookup key, and fanotify
has two integers plus a variable-length BLOB.

> But if you want to monitor multiple filesystems

I can monitor multiple filesystems with inotify.

> or if you have a privileged process that can open by handle

Getting an already-opened file descriptor, or just the file_handle, is
certainly an interesting fanotify feature. But that could have easily
been added to inotify with a new "mask" flag for the
inotify_add_watch() function.

> or a standard filesystem where handles are actually persistent, then there are benefits.

Same here: that could have been an (optional) inotify feature, instead
of making the whole complexity mandatory for everybody.

Max
Max Kellermann Sept. 19, 2023, 10:48 a.m. UTC | #14
On Tue, Sep 19, 2023 at 12:42 PM Amir Goldstein <amir73il@gmail.com> wrote:
> Those cases are demonstrated in the --filesystem functionality of the
> pull request above, which handles "dynamic watches" instead of
> having to set up watches recursively on all subdirs.

The whole-filesystem mode is certainly the most interesting fanotify
feature, and the lack of it the greatest weakness of inotify - but
API-wise, it could have been implemented as a new "mask" flag in
inotify_add_watch() with no extra API complexity. That still doesn't
justify the huge complexity of fanotify.
Amir Goldstein Sept. 19, 2023, 10:58 a.m. UTC | #15
On Tue, Sep 19, 2023 at 1:43 PM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Sep 19, 2023 at 12:01 PM Jan Kara <jack@suse.cz> wrote:
> > So with inotify event, you get back 'wd' and 'name' to identify the object
> > where the event happened. How is this (for your usecase) different from
> > getting back 'fsid + handle' and 'name' back from fanotify? In inotify case
> > you had to somehow track wd -> path linkage, with fanotify you need to
> > track 'fsid + handle' -> path linkage.
>
> The wd is a simple "int" which is the return value of the system call,
> and it's part of "struct inotify_event". One system call for
> registering it, one system call for reading it.
>
> From fanotify, I read a "struct fanotify_event_metadata", and then
> check variable-length follow-up structs, iterate over those follow-up
> structs, find the one with "info_type==FAN_EVENT_INFO_TYPE_FID", now I
> have a "fsid" of type "__kernel_fsid_t" (a struct containing two
> 32-bit integers) and a "file_handle" (a variable-length opaque BLOB).
> What do I do with these?
>
> The answer appears to be: when I registered, I should have obtained
> the fsid (via statfs()) and the file_handle (via name_to_handle_at()).
> That's three extra system calls. One statfs(), and twice
> name_to_handle_at(), because the first one is needed to get the length
> of the buffer I need to allocate for the file_handle (and hope my
> filesystem supports file_handles, because apparently that's an
> optional feature). Just look at the name_to_handle_at() manpage for
> some horrors of its complexity.
>
> Imagine how much more complex the data structure for looking up the
> modified file is: inotify has an int as the lookup key, and fanotify
> has two integers plus a variable-length BLOB.
>

You are not describing a technical problem.
Any API complexity can be hidden from users with userspace
libraries. You can use the inotify-tools lib if you prefer.

I assure you that the added complexity to the API was not
done to make your life harder.
inotify API has several design flaws and fanotify API extensions
were designed to address those flaws.

> > But if you want to monitor multiple filesystems
>
> I can monitor multiple filesystems with inotify.
>
> > or if you have a privileged process that can open by handle
>
> Getting an already-opened file descriptor, or just the file_handle, is
> certainly an interesting fanotify feature. But that could have easily
> been added to inotify with a new "mask" flag for the
> inotify_add_watch() function.
>

"could have easily been added" is not a statement that I am willing
to accept.

You are saying that because you do not understand the complexity
involved and that is fine - you can ask.

The things that you are complaining about in the API are the exact
things that were needed to make the advanced features work.

Beyond that, it is a matter of API consolidation -
We prefer to maintain a single unified API that can cover all
the use cases over maintaining several overlapping APIs.

The complexity added to the API for simple use cases can
be mitigated with user libraries - it is not a good reason IMO
to keep maintaining an old limited API in parallel to a new
improved one.

Thanks,
Amir.
Max Kellermann Sept. 19, 2023, 11:21 a.m. UTC | #16
On Tue, Sep 19, 2023 at 12:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> Any API complexity can be hidden from users with userspace
> libraries. You can use the inotify-tools lib if you prefer.

That doesn't convince me at all, but that's a question of taste. We'll
just keep using inotify (with a patched kernel, which we have anyway).

> > Getting an already-opened file descriptor, or just the file_handle, is
> > certainly an interesting fanotify feature. But that could have easily
> > been added to inotify with a new "mask" flag for the
> > inotify_add_watch() function.
> >
>
> "could have easily been added" is not a statement that I am willing
> to accept.

Are you willing to take a bet? I come up with a patch for implementing
this for inotify, let's say within a week, and you agree to merge it?

(I'm not interested in this feature, I won't ever use it - all I
wanted is dfd support for inotify_add_watch()).

> The things that you are complaining about in the API are the exact
> things that were needed to make the advanced features work.

Not exactly - I complain that fanotify makes the complexity mandatory,
the complexity is the baseline of the API. It would have been possible
to design an API that is simple for 99% of all users, as simple as
inotify; and only those who need the advanced features get the
complexity as an option.

I don't agree with your point that unnecessary complexity should be
mitigated by throwing more (library) code at it. That's just adding
more complexity and more overhead, the opposite of what I want.

Max
Amir Goldstein Sept. 19, 2023, 12:21 p.m. UTC | #17
On Tue, Sep 19, 2023 at 2:22 PM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Sep 19, 2023 at 12:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > Any API complexity can be hidden from users with userspace
> > libraries. You can use the inotify-tools lib if you prefer.
>
> That doesn't convince me at all, but that's a question of taste. We'll
> just keep using inotify (with a patched kernel, which we have anyway).
>

ok.

> > > Getting an already-opened file descriptor, or just the file_handle, is
> > > certainly an interesting fanotify feature. But that could have easily
> > > been added to inotify with a new "mask" flag for the
> > > inotify_add_watch() function.
> > >
> >
> > "could have easily been added" is not a statement that I am willing
> > to accept.
>
> Are you willing to take a bet? I come up with a patch for implementing
> this for inotify, let's say within a week, and you agree to merge it?
>
> (I'm not interested in this feature, I won't ever use it - all I
> wanted is dfd support for inotify_add_watch()).
>

I am not into ego fights. I have no desire to win an argument.
If you have an improvement that you want to make, you can
submit it and it will be judged technically.

If you want to improve inotify you can argue your case
and it will be judged technically.

But if you do that, I strongly advise you to involve the community
early in the design review of the new feature/API.
It can save you time.

> > The things that you are complaining about in the API are the exact
> > things that were needed to make the advanced features work.
>
> Not exactly - I complain that fanotify makes the complexity mandatory,
> the complexity is the baseline of the API. It would have been possible
> to design an API that is simple for 99% of all users, as simple as
> inotify; and only those who need the advanced features get the
> complexity as an option.
>
> I don't agree with your point that unnecessary complexity should be
> mitigated by throwing more (library) code at it. That's just adding
> more complexity and more overhead, the opposite of what I want.
>

Sorry, "what I want" is not a technical argument :)
"what many users want" with proof could be a start of a technical
argument.

I agree that simplicity of the kernel UAPI vs. delegating
simplicity to user libraries is a matter of taste and different subsystem
maintainers have different opinions in that regard.

And also, it is a bit late to discuss design preferences of an API
that was merged 4 years ago.
Design flaws and problems, sure, but for complexity it's a bit late.

Regarding inotify improvements, as I wrote, they will each be judged
technically, but the trend is towards phasing it out.

Thanks,
Amir.
Max Kellermann Sept. 19, 2023, 12:51 p.m. UTC | #18
On Tue, Sep 19, 2023 at 2:21 PM Amir Goldstein <amir73il@gmail.com> wrote:
> Regarding inotify improvements, as I wrote, they will each be judged
> technically, but the trend is towards phasing it out.

Then please reconsider merging inotify_add_watch_at(). It is a rather
trivial patch set, only exposing a user_path_at() parameter to user
space, like many other new system calls did with other old-style
system calls. Only the last patch, the one which adds the new system
call to all arch-specific tables, is an ugly one, but that's not a
property of the new feature but a general property of how system calls
are wired in Linux.

My proposed system call adds real value to all those who are currently
using inotify, allowing them to use inotify with a modern and safe and
race-free syscall interface, eliminating the unsafe fchdir() dance to
emulate it in userspace.
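
For reference, the userspace emulation looks roughly like the sketch
below (the function name add_watch_at_emulated() is made up for this
example). The current working directory is shared by all threads of a
process, so every other thread sees the temporary directory change
while this runs; that is the race.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Emulate a dfd-relative inotify_add_watch() with the fchdir() dance.
 * Returns the watch descriptor, or -1 with errno set.
 * NOT thread-safe: the cwd is switched for the whole process.
 */
int add_watch_at_emulated(int ifd, int dfd, const char *path, uint32_t mask)
{
    int old, wd, saved;

    assert(path != NULL);

    /* Remember where we are so we can come back. */
    old = open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (old < 0)
        return -1;

    if (fchdir(dfd) < 0) {
        saved = errno;
        close(old);
        errno = saved;
        return -1;
    }

    /* "path" is now resolved relative to dfd. */
    wd = inotify_add_watch(ifd, path, mask);
    saved = errno;

    if (fchdir(old) < 0) {
        /* Best effort only: the cwd stays at dfd if this fails. */
    }
    close(old);
    errno = saved;
    return wd;
}
```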

The inotify interface is widely used and will be for a long time to
come, while it is hard to find code which already uses fanotify.
GitHub code search finds 438 occurrences of fanotify_init() calls, 4.6k
inotify_init1() calls and 6.9k inotify_init() calls. Given the added
complexity of fanotify and the uselessness of most of fanotify's
features for most software (except for dfd support), it is extremely
unlikely that a noticeable fraction of those thousands of projects will
ever migrate to fanotify. Even if inotify is considered a legacy API,
it should be allowed to modernize it; and adding dfd support to system
calls is really important.

Max
Amir Goldstein Sept. 19, 2023, 1:01 p.m. UTC | #19
On Tue, Sep 19, 2023 at 3:51 PM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Sep 19, 2023 at 2:21 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > Regarding inotify improvements, as I wrote, they will each be judged
> > technically, but the trend is towards phasing it out.
>
> Then please reconsider merging inotify_add_watch_at(). It is a rather
> trivial patch set, only exposing a user_path_at() parameter to user
> space, like many other new system calls did with other old-style
> system calls. Only the last patch, the one which adds the new system
> call to all arch-specific tables, is an ugly one, but that's not a
> property of the new feature but a general property of how system calls
> are wired in Linux.
>
> My proposed system call adds real value to all those who are currently
> using inotify, allowing them to use inotify with a modern and safe and
> race-free syscall interface, eliminating the unsafe fchdir() dance to
> emulate it in userspace.
>
> The inotify interface is widely used and will be for a long time to
> come, while it is hard to find code which already uses fanotify.
> GitHub code search finds 438 occurrences of fanotify_init() calls, 4.6k
> inotify_init1() calls and 6.9k inotify_init() calls. Given the added
> complexity of fanotify and the uselessness of most of fanotify's
> features for most software (except for dfd support), it is extremely
> unlikely that a noticeable fraction of those thousands of projects will
> ever migrate to fanotify. Even if inotify is considered a legacy API,
> it should be allowed to modernize it; and adding dfd support to system
> calls is really important.
>

Both Jan and I already gave an answer to this specific patch.
The answer was no.

We do not add new system calls for doing something that is already
possible with existing system calls to make the life of a programmer
easier - this has never been a valid argument for adding a new syscall.

Thanks,
Amir.
Max Kellermann Sept. 19, 2023, 1:11 p.m. UTC | #20
On Tue, Sep 19, 2023 at 3:01 PM Amir Goldstein <amir73il@gmail.com> wrote:
> We do not add new system calls for doing something that is already
> possible with existing system calls to make the life of a programmer
> easier - this has never been a valid argument for adding a new syscall.

- it's not possible to safely add an inotify watch; this isn't about
making something easier, but about making something
safe/reliable/race-free in a way that wasn't possible before
- there are many precedents of new system calls just to add dfd
support (fchmodat, execveat, linkat, mkdirat, ....)
- there are also a few new system calls that were added to make the
life of a programmer easier even though the same was already possible
with existing system calls (close_range, process_madvise, pidfd_getfd,
mount_setattr, ...)
Amir Goldstein Sept. 19, 2023, 1:22 p.m. UTC | #21
On Tue, Sep 19, 2023 at 4:11 PM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Sep 19, 2023 at 3:01 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > We do not add new system calls for doing something that is already
> > possible with existing system calls to make the life of a programmer
> > easier - this has never been a valid argument for adding a new syscall.
>
> - it's not possible to safely add an inotify watch; this isn't about
> making something easier, but about making something
> safe/reliable/race-free in a way that wasn't possible before

Yes, I meant it is possible to get very similar functionality in
a race-free way using fanotify.
If fanotify does not meet your requirements please let us know
in what way and perhaps fanotify could be improved.
Using "inotify and not fanotify" is not a legit technical requirement.

> - there are many precedents of new system calls just to add dfd
> support (fchmodat, execveat, linkat, mkdirat, ....)
> - there are also a few new system calls that were added to make the
> life of a programmer easier even though the same was already possible
> with existing system calls (close_range, process_madvise, pidfd_getfd,
> mount_setattr, ...)

All those new syscalls add new functionality/security/performance.
If you think they were added to make the life of the programmer easier
you did not understand them.

Anyway, I've said my opinion about inotify_add_watch_at().
The final call is up to Jan.

Thanks,
Amir.

P.S. you may be able to provide magic /proc/self/fd/$fd symlinks
as path argument to inotify_add_watch() after opening them
with O_PATH to get what you want - I didn't try.
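
A rough sketch of what that P.S. suggests, under the assumptions that
/proc is mounted and that the mask does not contain IN_DONT_FOLLOW
(otherwise the watch would land on the symlink in /proc). The helper
name add_watch_via_procfd() is invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Add an inotify watch on "path" relative to "dfd" without changing
 * the cwd. Returns the watch descriptor, or -1 with errno set.
 */
int add_watch_via_procfd(int ifd, int dfd, const char *path, uint32_t mask)
{
    char link[64];
    int pfd, wd, saved;

    assert(path != NULL);

    /* Resolve the name relative to dfd; O_PATH needs no read access. */
    pfd = openat(dfd, path, O_PATH | O_CLOEXEC);
    if (pfd < 0)
        return -1;

    /* The magic link leads back to the object pinned by pfd. */
    snprintf(link, sizeof(link), "/proc/self/fd/%d", pfd);
    wd = inotify_add_watch(ifd, link, mask);
    saved = errno;

    /* The watch keeps its own reference; the O_PATH fd can go. */
    close(pfd);
    errno = saved;
    return wd;
}
```

Unlike the fchdir() approach, nothing process-wide is modified here, at
the cost of one extra openat()/close() pair and the /proc dependency.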
Jan Kara Sept. 19, 2023, 1:28 p.m. UTC | #22
On Tue 19-09-23 13:21:56, Max Kellermann wrote:
> On Tue, Sep 19, 2023 at 12:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > Getting an already-opened file descriptor, or just the file_handle, is
> > > certainly an interesting fanotify feature. But that could have easily
> > > been added to inotify with a new "mask" flag for the
> > > inotify_add_watch() function.
> > >
> >
> > "could have easily been added" is not a statement that I am willing
> > to accept.
> 
> Are you willing to take a bet? I come up with a patch for implementing
> this for inotify, let's say within a week, and you agree to merge it?

I guess there is no point in you wasting time on this. But if you tried, you'd
really find out it isn't so easy. Inotify event is fixed length so
fsid+fhandle is completely out of the realm of "easy extension". If you
wanted to return fd instead of wd, that would be doable with some kind of a
flag in the mark mask, although it would be a bit inconsistent with the
rest of the inotify API.

> > The things that you are complaining about in the API are the exact
> > things that were needed to make the advanced features work.
> 
> Not exactly - I complain that fanotify makes the complexity mandatory,
> the complexity is the baseline of the API. It would have been possible
> to design an API that is simple for 99% of all users, as simple as
> inotify; and only those who need the advanced features get the
> complexity as an option.

Well yes, fanotify could have been designed to make basic usage easier. But
the design (some 15 years ago) was focusing more on filling in the
functional gaps inotify had for usecases such as anti-virus monitors etc.
and kind of left thinking about simple usecases for sometime later.
So we have what we have.

								Honza
Max Kellermann Sept. 19, 2023, 1:41 p.m. UTC | #23
On Tue, Sep 19, 2023 at 3:22 PM Amir Goldstein <amir73il@gmail.com> wrote:
> Yes, I meant it is possible to get very similar functionality in
> a race-free way using fanotify.

That's not the same. We already agreed that fanotify still misses
features that have been available in inotify since forever. Going
fanotify requires a rewrite of large chunks of code. Rejecting trivial
inotify improvements because people should be using fanotify doesn't
make real-world users happy.

> If fanotify does not meet your requirements please let us know
> in what way and perhaps fanotify could be improved.

- return a watch descriptor (like inotify) as a fixed-size lookup key
- add an option so returned events contain the watch descriptor and
path relative to it (like inotify), not just the directory entry name
- allow unprivileged processes to use this new option instead of FAN_REPORT_FID

Supporting this simplified API still makes fanotify harder to use than
inotify, but retains fanotify's full power while minimizing its API
churn for the 99% of users who were already happy with inotify's
feature set.

> > - there are many precedents of new system calls just to add dfd
> > support (fchmodat, execveat, linkat, mkdirat, ....)
> > - there are also a few new system calls that were added to make the
> > life of a programmer easier even though the same was already possible
> > with existing system calls (close_range, process_madvise, pidfd_getfd,
> > mount_setattr, ...)
>
> All those new syscalls add new functionality/security/performance.

So does inotify_add_watch_at().

On the other hand, fanotify reduces performance by adding complexity
and overhead - more system calls necessary, increased lookup overhead
due to variable-length keys instead of 32-bit integers.

> If you think they were added to make the life of the programmer easier
> you did not understand them.

Oh please. Don't be so arrogant.
Max Kellermann Sept. 19, 2023, 1:48 p.m. UTC | #24
On Tue, Sep 19, 2023 at 3:28 PM Jan Kara <jack@suse.cz> wrote:
> Inotify event is fixed length so
> fsid+fhandle is completely out of the realm of "easy extension".

(Not quite true, it's variable-length. But that's not relevant if
we're talking about adding an optional feature.)
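
To illustrate the variable-length point: each struct inotify_event is
followed by "len" bytes of name, so a reader already has to step
through the buffer record by record. A small sketch (the function name
count_inotify_events() is made up):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Count the complete events in a buffer filled by read() on an
 * inotify fd. Each record is sizeof(struct inotify_event) bytes of
 * header plus ev.len bytes of (padded) name.
 */
size_t count_inotify_events(const char *buf, size_t n)
{
    size_t off = 0, count = 0;

    assert(buf != NULL || n == 0);

    while (n - off >= sizeof(struct inotify_event)) {
        struct inotify_event ev;

        /* Copy only the fixed header; the name stays in the buffer. */
        memcpy(&ev, buf + off, sizeof(ev));
        if (ev.len > n - off - sizeof(ev))
            break; /* truncated record */
        off += sizeof(ev) + ev.len;
        count++;
    }
    return count;
}
```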

If I were to implement this, I'd add a mask bit called, say,
IN_FILE_HANDLE. If that bit is enabled on a watch, the
(variable-length) inotify_event would be followed by another
(variable-length) struct that looks just like fanotify_event_info_fid,
containing fsid and file_handle (of course, only if the same bit is
also set in inotify_event.mask, which would give the kernel a way to
omit it if it's not available for a certain file).

This is a backwards-compatible extension because it's opt-in. Only
applications which support it will set the IN_FILE_HANDLE bit. Old
applications don't set this bit and never see this trailing struct.

Max
Amir Goldstein Sept. 19, 2023, 1:56 p.m. UTC | #25
> > > - there are many precedents of new system calls just to add dfd
> > > support (fchmodat, execveat, linkat, mkdirat, ....)
> > > - there are also a few new system calls that were added to make the
> > > life of a programmer easier even though the same was already possible
> > > with existing system calls (close_range, process_madvise, pidfd_getfd,
> > > mount_setattr, ...)
> >
> > All those new syscalls add new functionality/security/performance.
>
> So does inotify_add_watch_at().
>
> On the other hand, fanotify reduces performance by adding complexity
> and overhead - more system calls necessary, increased lookup overhead
> due to variable-length keys instead of 32-bit integers.
>

Technical arguments of performance need to be backed up by
performance numbers from real life workloads.
I am not inventing this stuff as I go.
This is how kernel development works.

> > If you think they were added to make the life of the programmer easier
> > you did not understand them.
>
> Oh please. Don't be so arrogant.

I will try. Please try as well to accept a different POV.

Thanks,
Amir.
Amir Goldstein Sept. 19, 2023, 2:55 p.m. UTC | #26
On Tue, Sep 19, 2023 at 12:48 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 18-09-23 21:05:11, Amir Goldstein wrote:
> > [Forked from https://lore.kernel.org/linux-fsdevel/20230918123217.932179-1-max.kellermann@ionos.com/]
> ...
> > BTW, before we can really mark inotify as Obsolete and document that
> > inotify was superseded by fanotify, there are at least two items on the
> > TODO list [1]:
>
> Yeah, as I wrote in the original thread, I don't feel like inotify should
> be marked as obsolete (at least for some more time) so we are on the same
> page here I think.
>
> > 1. UNMOUNT/IGNORED events
> > 2. Filesystems without fid support
> >
> > MOUNT/UNMOUNT fanotify events have already been discussed
> > and the feature has known users.
> >
> > Christian has also mentioned [1] the IN_UNMOUNT use case for
> > waiting for sb shutdown several times and I will not be surprised
> > to see systemd starting to use inotify for that use case before too long...
>
> Yup, both FAN_UNMOUNT and FAN_IGNORED should be easy. Unlike inotify, I'd
> just make these explicit events you can opt into and not something you
> always get.
>
> > Regarding the second item on the TODO list, we have had this discussion
> > before - if we are going to continue the current policy of opting-in to
> > fanotify (i.e. tmpfs, fuse, overlayfs, kernfs), we will always have odd
> > filesystems that only support inotify and we will need to keep supporting
> > inotify only for the users that use inotify on those odd filesystems.
> >
> > OTOH, if we implement FAN_REPORT_DINO_NAME, we could
> > have fanotify inode mark support for any filesystem, where the
> > pinned marked inode ino is the object id.
>
> Is it a real problem after your work to allow filehandles that are not
> necessarily usable for NFS export or open_by_handle()? As far as I remember
> fanotify should be now able to handle anything that provides f_fsid in its

Not exactly. We still have a requirement for non-empty
dentry->d_sb->s_export_op in fanotify_test_fid(), to align with
the same requirement for AT_HANDLE_FID support.

> statfs(2) call. And as I'm checking filesystems not setting fsid currently are:
>
> afs, coda, nfs - networking filesystems where inotify and fanotify have
>   dubious value anyway
>
> configfs, debugfs, devpts, efivarfs, hugetlbfs, openpromfs, proc, pstore,
> ramfs, sysfs, tracefs - virtual filesystems where fsnotify functionality is
>   quite limited. But some special cases could be useful. Adding fsid support
>   is the same amount of trouble as for kernfs - a few LOC. In fact, we
>   could perhaps add a fstype flag to indicate that this is a filesystem
>   without persistent identification and so uuid should be autogenerated on
>   mount (likely in alloc_super()) and f_fsid generated from sb->s_uuid.
>   This way we could handle all these filesystems with trivial amount of
>   effort.

This sounds good to me.
I have a vague memory of suggesting the same and I think
Christian had objections, but I may be remembering wrong.

Possibly, the same opt-in fstype flag could allow also trivial
AT_HANDLE_FID support with the default export_encode_fh()?

>
> freevxfs - the only real filesystem without f_fsid. Trivial to handle one
>   way or the other.
>
> So I don't think we need new uAPI additions to finish off this TODO item.

Yes, I'd love that.
I can try to post something if there are no objections.

Thanks,
Amir.
Patch

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b6e6f6ab21f8..8a9096c5ebb1 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -797,6 +797,12 @@  SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
 	return do_inotify_add_watch(fd, AT_FDCWD, pathname, mask);
 }
 
+SYSCALL_DEFINE4(inotify_add_watch_at, int, fd, int, dfd, const char __user *, pathname,
+		u32, mask)
+{
+	return do_inotify_add_watch(fd, dfd, pathname, mask);
+}
+
 SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
 {
 	struct fsnotify_group *group;