diff mbox series

fs: Do not check if there is a fsnotify watcher on pseudo inodes

Message ID 20200612092603.GB3183@techsingularity.net (mailing list archive)
State New, archived
Headers show
Series fs: Do not check if there is a fsnotify watcher on pseudo inodes | expand

Commit Message

Mel Gorman June 12, 2020, 9:26 a.m. UTC
The kernel uses internal mounts for a number of purposes including pipes.
On every vfs_write regardless of filesystem, fsnotify_modify() is called
to notify of any changes which incurs a small amount of overhead in fsnotify
even when there are no watchers.

A patch is pending that reduces, but does not eliminte, the overhead
of fsnotify but for the internal mounts, even the small overhead is
unnecessary. The user API is based on the pathname and a dirfd and proc
is the only visible path for inodes on an internal mount. Proc does not
have the same pathname as the internal entry so even if fatrace is used
on /proc, no events trigger for the /proc/X/fd/ files.

This patch changes alloc_file_pseudo() to set the internal-only
FMODE_NONOTIFY flag on f_flags so that no check is made for fsnotify
watchers on internal mounts. When fsnotify is updated, it may be that
this patch becomes redundant but it is more robust against any future
changes that may reintroduce overhead for fsnotify on inodes with no
watchers. The test motivating this was "perf bench sched messaging
--pipe". On a single-socket machine using threads the difference of the
patch was as follows.

                              5.7.0                  5.7.0
                            vanilla        nofsnotify-v1r1
Amean     1       1.3837 (   0.00%)      1.3547 (   2.10%)
Amean     3       3.7360 (   0.00%)      3.6543 (   2.19%)
Amean     5       5.8130 (   0.00%)      5.7233 *   1.54%*
Amean     7       8.1490 (   0.00%)      7.9730 *   2.16%*
Amean     12     14.6843 (   0.00%)     14.1820 (   3.42%)
Amean     18     21.8840 (   0.00%)     21.7460 (   0.63%)
Amean     24     28.8697 (   0.00%)     29.1680 (  -1.03%)
Amean     30     36.0787 (   0.00%)     35.2640 *   2.26%*
Amean     32     38.0527 (   0.00%)     38.1223 (  -0.18%)

The difference is small but in some cases it's outside the noise so
while marginal, there is still a small benefit to ignoring fsnotify
for internal mounts.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 fs/file_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Amir Goldstein June 12, 2020, 9:52 a.m. UTC | #1
On Fri, Jun 12, 2020 at 12:26 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The kernel uses internal mounts for a number of purposes including pipes.
> On every vfs_write regardless of filesystem, fsnotify_modify() is called
> to notify of any changes which incurs a small amount of overhead in fsnotify
> even when there are no watchers.
>
> A patch is pending that reduces, but does not eliminte, the overhead
> of fsnotify but for the internal mounts, even the small overhead is
> unnecessary. The user API is based on the pathname and a dirfd and proc
> is the only visible path for inodes on an internal mount. Proc does not
> have the same pathname as the internal entry so even if fatrace is used
> on /proc, no events trigger for the /proc/X/fd/ files.
>

This looks like a good direction and I was going to suggest that as well.
However, I am confused by the use of terminology "internal mount".
The patch does not do anything dealing with "internal mount".
If alloc_file_pseudo() is only called for filesystems mounted as
internal mounts,
please include this analysis in commit message.
In any case, not every file of internal mount is allocated with
alloc_file_pseudo(),
right? So maybe it would be better to list all users of alloc_file_pseudo()
and say that they all should be opted out of fsnotify, without mentioning
"internal mount"?

Thanks,
Amir.
Mel Gorman June 12, 2020, 1:18 p.m. UTC | #2
On Fri, Jun 12, 2020 at 12:52:28PM +0300, Amir Goldstein wrote:
> On Fri, Jun 12, 2020 at 12:26 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The kernel uses internal mounts for a number of purposes including pipes.
> > On every vfs_write regardless of filesystem, fsnotify_modify() is called
> > to notify of any changes which incurs a small amount of overhead in fsnotify
> > even when there are no watchers.
> >
> > A patch is pending that reduces, but does not eliminte, the overhead
> > of fsnotify but for the internal mounts, even the small overhead is
> > unnecessary. The user API is based on the pathname and a dirfd and proc
> > is the only visible path for inodes on an internal mount. Proc does not
> > have the same pathname as the internal entry so even if fatrace is used
> > on /proc, no events trigger for the /proc/X/fd/ files.
> >
> 
> This looks like a good direction and I was going to suggest that as well.
> However, I am confused by the use of terminology "internal mount".
> The patch does not do anything dealing with "internal mount".

I was referring to users of kern_mount.

> If alloc_file_pseudo() is only called for filesystems mounted as
> internal mounts,

I believe this is the case and I did not find a counter-example.  The
changelog that introduced the helper is not explicit but it was created
in the context of converting a number of internal mounts like pipes,
anon inodes to a common helper. If I'm wrong, Al will likely point it out.

> please include this analysis in commit message.
> In any case, not every file of internal mount is allocated with
> alloc_file_pseudo(),
> right?

Correct. It is not required and there is at least one counter example
in arch/ia64/kernel/perfmon.c but I don't think that is particularly
important, I don't think anyone is kept awake at night worrying about
small performance overhead on Itanium.

> So maybe it would be better to list all users of alloc_file_pseudo()
> and say that they all should be opted out of fsnotify, without mentioning
> "internal mount"?
> 

The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
pipes, shmem and sockets although not all of them necessary end up using
a VFS operation that triggers fsnotify.  Either way, I don't think it
makes sense (or even possible) to watch any of those with fanotify so
setting the flag seems reasonable.

I updated the changelog and maybe this is clearer.

---8<---
fs: Do not check if there is a fsnotify watcher on pseudo inodes

The kernel can create invisible internal mounts for a number of purposes
including pipes via kern_mount. For pipes, every vfs_write regardless of
filesystem, fsnotify_modify() is called to notify of any changes which
incurs a small amount of overhead in fsnotify even when there are no
watchers. It can also trigger for reads and readv and writev, it was
simply vfs_write() that was noticed first.

A patch is pending that reduces, but does not eliminte, the overhead
of fsnotify but for the internal mounts, even the small overhead is
unnecessary. The user API for fanotify is based on the pathname and a dirfd
and proc are the only visible representation of an internal mount. Proc
does not have the same pathname as the internal entry and the proc inode
is not the same as the internal inode so even if fatrace is used on /proc,
no events trigger for the /proc/X/fd/ files.

This patch changes alloc_file_pseudo() automatically opts out of fsnotify
by setting FMODE_NONOTIFY flag so that no check is made for fsnotify
watchers on internal mounts. It is not mandated that mounts created
with kern_mount use alloc_file_pseudo but a number of important ones
do including aio, anon inodes, hugetlbfs, anonymous pipes, shmem and
sockets. There does not appear to be any way to register watchers on such
inodes or a case where it would even make sense so opting out by default
seems reasonable.

The test motivating this was "perf bench sched messaging --pipe". On
a single-socket machine using threads the difference of the patch was
as follows.

                              5.7.0                  5.7.0
                            vanilla        nofsnotify-v1r1
Amean     1       1.3837 (   0.00%)      1.3547 (   2.10%)
Amean     3       3.7360 (   0.00%)      3.6543 (   2.19%)
Amean     5       5.8130 (   0.00%)      5.7233 *   1.54%*
Amean     7       8.1490 (   0.00%)      7.9730 *   2.16%*
Amean     12     14.6843 (   0.00%)     14.1820 (   3.42%)
Amean     18     21.8840 (   0.00%)     21.7460 (   0.63%)
Amean     24     28.8697 (   0.00%)     29.1680 (  -1.03%)
Amean     30     36.0787 (   0.00%)     35.2640 *   2.26%*
Amean     32     38.0527 (   0.00%)     38.1223 (  -0.18%)

The difference is small but in some cases it's outside the noise so
while marginal, there is still some small benefit to ignoring fsnotify
for internal mounts in some cases.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Amir Goldstein June 12, 2020, 2:52 p.m. UTC | #3
On Fri, Jun 12, 2020 at 4:18 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jun 12, 2020 at 12:52:28PM +0300, Amir Goldstein wrote:
> > On Fri, Jun 12, 2020 at 12:26 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > The kernel uses internal mounts for a number of purposes including pipes.
> > > On every vfs_write regardless of filesystem, fsnotify_modify() is called
> > > to notify of any changes which incurs a small amount of overhead in fsnotify
> > > even when there are no watchers.
> > >
> > > A patch is pending that reduces, but does not eliminte, the overhead
> > > of fsnotify but for the internal mounts, even the small overhead is
> > > unnecessary. The user API is based on the pathname and a dirfd and proc
> > > is the only visible path for inodes on an internal mount. Proc does not
> > > have the same pathname as the internal entry so even if fatrace is used
> > > on /proc, no events trigger for the /proc/X/fd/ files.
> > >
> >
> > This looks like a good direction and I was going to suggest that as well.
> > However, I am confused by the use of terminology "internal mount".
> > The patch does not do anything dealing with "internal mount".
>
> I was referring to users of kern_mount.

I see. I am not sure if all kern_mount hand out only anonymous inodes,
but FYI, now there a MNT_NS_INTERNAL that is not SB_KERNMOUNT:
df820f8de4e4 ovl: make private mounts longterm

>
> > If alloc_file_pseudo() is only called for filesystems mounted as
> > internal mounts,
>
> I believe this is the case and I did not find a counter-example.  The
> changelog that introduced the helper is not explicit but it was created
> in the context of converting a number of internal mounts like pipes,
> anon inodes to a common helper. If I'm wrong, Al will likely point it out.
>
> > please include this analysis in commit message.
> > In any case, not every file of internal mount is allocated with
> > alloc_file_pseudo(),
> > right?
>
> Correct. It is not required and there is at least one counter example
> in arch/ia64/kernel/perfmon.c but I don't think that is particularly
> important, I don't think anyone is kept awake at night worrying about
> small performance overhead on Itanium.
>
> > So maybe it would be better to list all users of alloc_file_pseudo()
> > and say that they all should be opted out of fsnotify, without mentioning
> > "internal mount"?
> >
>
> The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> pipes, shmem and sockets although not all of them necessary end up using
> a VFS operation that triggers fsnotify.  Either way, I don't think it
> makes sense (or even possible) to watch any of those with fanotify so
> setting the flag seems reasonable.
>

I also think this seems reasonable, but the more accurate reason IMO
is found in the comment for d_alloc_pseudo():
"allocate a dentry (for lookup-less filesystems)..."

> I updated the changelog and maybe this is clearer.

I still find the use of "internal mount" terminology too vague.
"lookup-less filesystems" would have been more accurate,
because as you correctly point out, the user API to set a watch
requires that the marked object is looked up in the filesystem.

There are also some kernel internal users that set watches
like audit and nfsd, but I think they are also only interested in
inodes that have a path at the time that the mark is setup.

Thanks,
Amir.
Amir Goldstein June 12, 2020, 8:34 p.m. UTC | #4
> > > So maybe it would be better to list all users of alloc_file_pseudo()
> > > and say that they all should be opted out of fsnotify, without mentioning
> > > "internal mount"?
> > >
> >
> > The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> > pipes, shmem and sockets although not all of them necessary end up using
> > a VFS operation that triggers fsnotify.  Either way, I don't think it
> > makes sense (or even possible) to watch any of those with fanotify so
> > setting the flag seems reasonable.
> >
>
> I also think this seems reasonable, but the more accurate reason IMO
> is found in the comment for d_alloc_pseudo():
> "allocate a dentry (for lookup-less filesystems)..."
>
> > I updated the changelog and maybe this is clearer.
>
> I still find the use of "internal mount" terminology too vague.
> "lookup-less filesystems" would have been more accurate,

Only it is not really accurate for shmfs anf hugetlbfs, which are
not lookup-less, they just hand out un-lookable inodes.

> because as you correctly point out, the user API to set a watch
> requires that the marked object is looked up in the filesystem.
>
> There are also some kernel internal users that set watches
> like audit and nfsd, but I think they are also only interested in
> inodes that have a path at the time that the mark is setup.
>

FWIW I verified that watches can be set on anonymous pipes
via /proc/XX/fd, so if we are going to apply this patch, I think it
should be accompanied with a complimentary patch that forbids
setting up a mark on these sort of inodes. If someone out there
is doing this, at least they would get a loud message that something
has changed instead of silently dropping fsnotify events.

So now the question is how do we identify/classify "these sort of
inodes"? If they are no common well defining characteristics, we
may need to blacklist pipes sockets and anon inodes explicitly
with S_NONOTIFY.

Thanks,
Amir.
Jan Kara June 15, 2020, 8:12 a.m. UTC | #5
On Fri 12-06-20 23:34:16, Amir Goldstein wrote:
> > > > So maybe it would be better to list all users of alloc_file_pseudo()
> > > > and say that they all should be opted out of fsnotify, without mentioning
> > > > "internal mount"?
> > > >
> > >
> > > The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> > > pipes, shmem and sockets although not all of them necessary end up using
> > > a VFS operation that triggers fsnotify.  Either way, I don't think it
> > > makes sense (or even possible) to watch any of those with fanotify so
> > > setting the flag seems reasonable.
> > >
> >
> > I also think this seems reasonable, but the more accurate reason IMO
> > is found in the comment for d_alloc_pseudo():
> > "allocate a dentry (for lookup-less filesystems)..."
> >
> > > I updated the changelog and maybe this is clearer.
> >
> > I still find the use of "internal mount" terminology too vague.
> > "lookup-less filesystems" would have been more accurate,
> 
> Only it is not really accurate for shmfs anf hugetlbfs, which are
> not lookup-less, they just hand out un-lookable inodes.

OK, but I still think we are safe setting FMODE_NONOTIFY in
alloc_file_pseudo() and that covers all the cases we care about. Or did I
misunderstand something in the discussion? I can see e.g.
__shmem_file_setup() uses alloc_file_pseudo() but again that seems to be
used only for inodes without a path and the comment before d_alloc_pseudo()
pretty clearly states this should be the case.

So is the dispute here really only about how to call files using
d_alloc_pseudo()?

> > because as you correctly point out, the user API to set a watch
> > requires that the marked object is looked up in the filesystem.
> >
> > There are also some kernel internal users that set watches
> > like audit and nfsd, but I think they are also only interested in
> > inodes that have a path at the time that the mark is setup.
> >
> 
> FWIW I verified that watches can be set on anonymous pipes
> via /proc/XX/fd, so if we are going to apply this patch, I think it
> should be accompanied with a complimentary patch that forbids
> setting up a mark on these sort of inodes. If someone out there
> is doing this, at least they would get a loud message that something
> has changed instead of silently dropping fsnotify events.
> 
> So now the question is how do we identify/classify "these sort of
> inodes"? If they are no common well defining characteristics, we
> may need to blacklist pipes sockets and anon inodes explicitly
> with S_NONOTIFY.

We already do have FS_DISALLOW_NOTIFY_PERM in file_system_type->fs_flags so
adding FS_DISALLOW_NOTIFY would be natural if there is a need for this.

I don't think using fsnotify on pipe inodes is sane in any way. You'd
possibly only get the MODIFY or ACCESS events and even those would not be
quite reliable because with pipes stuff like splicing etc. is much more
common and that currently completely bypasses fsnotify subsystem. So
overall I'm fine with completely ignoring fsnotify on such inodes.

								Honza
Amir Goldstein June 15, 2020, 11:07 a.m. UTC | #6
On Mon, Jun 15, 2020 at 11:12 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 12-06-20 23:34:16, Amir Goldstein wrote:
> > > > > So maybe it would be better to list all users of alloc_file_pseudo()
> > > > > and say that they all should be opted out of fsnotify, without mentioning
> > > > > "internal mount"?
> > > > >
> > > >
> > > > The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> > > > pipes, shmem and sockets although not all of them necessary end up using
> > > > a VFS operation that triggers fsnotify.  Either way, I don't think it
> > > > makes sense (or even possible) to watch any of those with fanotify so
> > > > setting the flag seems reasonable.
> > > >
> > >
> > > I also think this seems reasonable, but the more accurate reason IMO
> > > is found in the comment for d_alloc_pseudo():
> > > "allocate a dentry (for lookup-less filesystems)..."
> > >
> > > > I updated the changelog and maybe this is clearer.
> > >
> > > I still find the use of "internal mount" terminology too vague.
> > > "lookup-less filesystems" would have been more accurate,
> >
> > Only it is not really accurate for shmfs anf hugetlbfs, which are
> > not lookup-less, they just hand out un-lookable inodes.
>
> OK, but I still think we are safe setting FMODE_NONOTIFY in
> alloc_file_pseudo() and that covers all the cases we care about. Or did I
> misunderstand something in the discussion? I can see e.g.
> __shmem_file_setup() uses alloc_file_pseudo() but again that seems to be
> used only for inodes without a path and the comment before d_alloc_pseudo()
> pretty clearly states this should be the case.
>
> So is the dispute here really only about how to call files using
> d_alloc_pseudo()?
>

Yes, semantics, no technical dispute on the patch.

> > > because as you correctly point out, the user API to set a watch
> > > requires that the marked object is looked up in the filesystem.
> > >
> > > There are also some kernel internal users that set watches
> > > like audit and nfsd, but I think they are also only interested in
> > > inodes that have a path at the time that the mark is setup.
> > >
> >
> > FWIW I verified that watches can be set on anonymous pipes
> > via /proc/XX/fd, so if we are going to apply this patch, I think it
> > should be accompanied with a complimentary patch that forbids
> > setting up a mark on these sort of inodes. If someone out there
> > is doing this, at least they would get a loud message that something
> > has changed instead of silently dropping fsnotify events.
> >
> > So now the question is how do we identify/classify "these sort of
> > inodes"? If they are no common well defining characteristics, we
> > may need to blacklist pipes sockets and anon inodes explicitly
> > with S_NONOTIFY.
>
> We already do have FS_DISALLOW_NOTIFY_PERM in file_system_type->fs_flags so
> adding FS_DISALLOW_NOTIFY would be natural if there is a need for this.

Yes, it is possible, but for the specified use case, it is probably easier
to classify by inode type (and maybe IS_ROOT()) than by filesystem type.
Also, in the case of shmem, the same file_system_type is used for user
mountable tmpfs and the kernel internal shm_mnt instance - only the
latter is used for handing out anonymous shmem files.

>
> I don't think using fsnotify on pipe inodes is sane in any way. You'd
> possibly only get the MODIFY or ACCESS events and even those would not be
> quite reliable because with pipes stuff like splicing etc. is much more
> common and that currently completely bypasses fsnotify subsystem. So
> overall I'm fine with completely ignoring fsnotify on such inodes.
>

Agreed for MODIFY ACCESS. Not so sure about other events.
I see that nfsd filecache backend only marks regular files, so that's fine.
I *think* audit only marks directories and exe files, but completely unsure.

Maybe there is no need to optimize out special inodes from all events
and only exclude them from MODIFY/ACCESS, which are the only
events where performance may be a concern?
Or maybe you did not mean to skip events on special inodes in general?

I am not sure how important OPEN events are on special inodes, but
it is scary to stop sending OPEN_PERM events.

Do you agree that we should also actively disallow setting a mark
on special disconnected inodes? instead of silently dropping events
that current kernel does deliver?

We could disallow setting a mark on a disconnected inode
(one that user is trying to configure by using a /proc/$pid/fd/X path).
We can enforce this restriction for all backends in the common helper
fsnotify_add_mark_locked().

Thanks,
Amir.
Mel Gorman June 15, 2020, 12:02 p.m. UTC | #7
On Fri, Jun 12, 2020 at 11:34:16PM +0300, Amir Goldstein wrote:
> > > > So maybe it would be better to list all users of alloc_file_pseudo()
> > > > and say that they all should be opted out of fsnotify, without mentioning
> > > > "internal mount"?
> > > >
> > >
> > > The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> > > pipes, shmem and sockets although not all of them necessary end up using
> > > a VFS operation that triggers fsnotify.  Either way, I don't think it
> > > makes sense (or even possible) to watch any of those with fanotify so
> > > setting the flag seems reasonable.
> > >
> >
> > I also think this seems reasonable, but the more accurate reason IMO
> > is found in the comment for d_alloc_pseudo():
> > "allocate a dentry (for lookup-less filesystems)..."
> >
> > > I updated the changelog and maybe this is clearer.
> >
> > I still find the use of "internal mount" terminology too vague.
> > "lookup-less filesystems" would have been more accurate,
> 
> Only it is not really accurate for shmfs anf hugetlbfs, which are
> not lookup-less, they just hand out un-lookable inodes.
> 

Yes.

> > because as you correctly point out, the user API to set a watch
> > requires that the marked object is looked up in the filesystem.
> >
> > There are also some kernel internal users that set watches
> > like audit and nfsd, but I think they are also only interested in
> > inodes that have a path at the time that the mark is setup.
> >
> 
> FWIW I verified that watches can be set on anonymous pipes
> via /proc/XX/fd, so if we are going to apply this patch, I think it
> should be accompanied with a complimentary patch that forbids
> setting up a mark on these sort of inodes. If someone out there
> is doing this, at least they would get a loud message that something
> has changed instead of silently dropping fsnotify events.
> 

I'm not entirely convinced that an error should be forced. I accept that
you can set a watcher on /proc/XX/fd but do you actually receive any
notifications of activity on those inodes? When I tested, I found that any
watchers a pipe for example were not notified. This didn't surprise me as
such given that the path and inode itself were just a representation of
the underlying "real" inode and that the notifications did not propogate
from a pipe fd to the proc fd. However, I could have made a mistake in
my test case. Maybe they *could* be propagated but it does not appear
that anyone cares.

> So now the question is how do we identify/classify "these sort of
> inodes"? If they are no common well defining characteristics, we
> may need to blacklist pipes sockets and anon inodes explicitly
> with S_NONOTIFY.
> 

I'm not sure we need to go that far either. It appears that some proc
files can receive notifications that may or may not have a useful meaning
to userspace so blocking them all may be problematic. If I'm right in that
fd inodes already have no meaningful notifications, it does not hurt to
ignore fsnotify for pseudo inodes as userspace cannot tell the difference.
diff mbox series

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index 30d55c9a1744..0076ccf67a7d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -229,7 +229,7 @@  struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
 		d_set_d_op(path.dentry, &anon_ops);
 	path.mnt = mntget(mnt);
 	d_instantiate(path.dentry, inode);
-	file = alloc_file(&path, flags, fops);
+	file = alloc_file(&path, flags | FMODE_NONOTIFY, fops);
 	if (IS_ERR(file)) {
 		ihold(inode);
 		path_put(&path);