diff mbox series

[RFC] fsnotify: optimize the case of no access event watchers

Message ID 20240109194818.91465-1-amir73il@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] fsnotify: optimize the case of no access event watchers | expand

Commit Message

Amir Goldstein Jan. 9, 2024, 7:48 p.m. UTC
Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
optimized the case where there are no fsnotify watchers on any of the
filesystem's objects.

It is quite common for a system to have a single local filesystem and
it is quite common for the system to have some inotify watches on some
config files or directories, so the optimization of no marks at all is
often not in effect.

Access event watchers are far less common, so optimizing the case of
no marks with access events could improve performance for more systems,
especially for the performance sensitive hot io path.

Maintain a per-sb counter of objects that have marks with access
events in their mask and use that counter to optimize out the call to
fsnotify() in fsnotify access hooks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Jens,

You may want to try if this patch improves performance for your workload
with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.

I am not at all sure that access events are the right thing to optimize.
I am pretty sure that access event watchers are rare in the wild, unlike
FS_MODIFY watchers that are common (e.g. tail -f).

Let me know how this patch works for you and we can continue the
discussion with Jan from there.

Thanks,
Amir.

 fs/notify/mark.c                 | 27 ++++++++++++++++++++++++++-
 include/linux/fs.h               |  1 +
 include/linux/fsnotify.h         | 14 +++++++++++---
 include/linux/fsnotify_backend.h |  2 ++
 4 files changed, 40 insertions(+), 4 deletions(-)

Comments

Jens Axboe Jan. 9, 2024, 8:12 p.m. UTC | #1
On 1/9/24 12:48 PM, Amir Goldstein wrote:
> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> optimized the case where there are no fsnotify watchers on any of the
> filesystem's objects.
> 
> It is quite common for a system to have a single local filesystem and
> it is quite common for the system to have some inotify watches on some
> config files or directories, so the optimization of no marks at all is
> often not in effect.
> 
> Access event watchers are far less common, so optimizing the case of
> no marks with access events could improve performance for more systems,
> especially for the performance sensitive hot io path.
> 
> Maintain a per-sb counter of objects that have marks with access
> events in their mask and use that counter to optimize out the call to
> fsnotify() in fsnotify access hooks.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Jens,
> 
> You may want to try if this patch improves performance for your workload
> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.

Ran the usual test, and this effectively removes fsnotify from the
profiles, which (as per other email) is between 5-6% of CPU time. So I'd
say it looks mighty appealing!
Jens Axboe Jan. 9, 2024, 8:24 p.m. UTC | #2
On 1/9/24 1:12 PM, Jens Axboe wrote:
> On 1/9/24 12:48 PM, Amir Goldstein wrote:
>> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
>> optimized the case where there are no fsnotify watchers on any of the
>> filesystem's objects.
>>
>> It is quite common for a system to have a single local filesystem and
>> it is quite common for the system to have some inotify watches on some
>> config files or directories, so the optimization of no marks at all is
>> often not in effect.
>>
>> Access event watchers are far less common, so optimizing the case of
>> no marks with access events could improve performance for more systems,
>> especially for the performance sensitive hot io path.
>>
>> Maintain a per-sb counter of objects that have marks with access
>> events in their mask and use that counter to optimize out the call to
>> fsnotify() in fsnotify access hooks.
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>
>> Jens,
>>
>> You may want to try if this patch improves performance for your workload
>> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.
> 
> Ran the usual test, and this effectively removes fsnotify from the
> profiles, which (as per other email) is between 5-6% of CPU time. So I'd
> say it looks mighty appealing!

Tried with an IRQ based workload as well, as those are always impacted
more by the fsnotify slowness. This patch removes ~8% of useless
overhead in that case, so even bigger win there.
Amir Goldstein Jan. 10, 2024, 9:08 a.m. UTC | #3
On Tue, Jan 9, 2024 at 10:24 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/9/24 1:12 PM, Jens Axboe wrote:
> > On 1/9/24 12:48 PM, Amir Goldstein wrote:
> >> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> >> optimized the case where there are no fsnotify watchers on any of the
> >> filesystem's objects.
> >>
> >> It is quite common for a system to have a single local filesystem and
> >> it is quite common for the system to have some inotify watches on some
> >> config files or directories, so the optimization of no marks at all is
> >> often not in effect.
> >>
> >> Access event watchers are far less common, so optimizing the case of
> >> no marks with access events could improve performance for more systems,
> >> especially for the performance sensitive hot io path.
> >>
> >> Maintain a per-sb counter of objects that have marks with access
> >> events in their mask and use that counter to optimize out the call to
> >> fsnotify() in fsnotify access hooks.
> >>
> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> ---
> >>
> >> Jens,
> >>
> >> You may want to try if this patch improves performance for your workload
> >> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.
> >
> > Ran the usual test, and this effectively removes fsnotify from the
> > profiles, which (as per other email) is between 5-6% of CPU time. So I'd
> > say it looks mighty appealing!
>
> Tried with an IRQ based workload as well, as those are always impacted
> more by the fsnotify slowness. This patch removes ~8% of useless
> overhead in that case, so even bigger win there.
>

Do the IRQ based workloads always go through io_req_io_end()?
Meaning that unlike the polled io workloads, they also incur the
overhead of the fsnotify_{modify,access}() hooks?

I remember I asked you once (cannot find where) whether
io_complete_rw_iopoll() needs the fsnotify hooks and you said that
it is a highly specialized code path for fast io, whose users will not
want those access/modify hooks.

Considering the fact that fsnotify_{modify,access}() could just as well
be bypassed by mmap() read/write, I fully support this reasoning.

Anyway, that explains (to me) why compiling-out the fsnotify_perm()
hooks took away all the regression that you observed in upstream,
because I was wondering where the overhead of fsnotify_access() was.

Jan,

What are your thoughts about this optimization patch?

My thoughts are that the optimization is clearly a win, but do we
really want to waste a full long in super_block for counting access
event watchers that may never exist?

Should we perhaps instead use a flag to say that "access watchers
existed"?

We could put s_fsnotify_access_watchers inside a struct
fsnotify_sb_mark_connector and special case alloc/free of
FSNOTIFY_OBJ_TYPE_SB connector.

Using a specialized fsnotify_sb_mark_connector will make room for
similar optimizations in the future and could be a placeholder for
the "default permission mask" that we discussed.

Thoughts?

Thanks,
Amir.
Matthew Wilcox Jan. 10, 2024, 12:46 p.m. UTC | #4
On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> My thoughts are that the optimization is clearly a win, but do we
> really want to waste a full long in super_block for counting access
> event watchers that may never exist?

Would it make more sense for it to be global, and perhaps even use the
static key infrastructure to enable/disable fsnotify?
Amir Goldstein Jan. 10, 2024, 12:55 p.m. UTC | #5
On Wed, Jan 10, 2024 at 2:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> > My thoughts are that the optimization is clearly a win, but do we
> > really want to waste a full long in super_block for counting access
> > event watchers that may never exist?
>
> Would it make more sense for it to be global,

Ironically, we once tried to disabled fsnotify hooks on pipefs and
found out (the hard way) that some people are using IN_ACCESS
event to get notified on pipe reads or something like that, so the
global option may result in less predictable performance.

> and perhaps even use the
> static key infrastructure to enable/disable fsnotify?

We are talking about disabling specific fsnotify hooks fsnotify_access()
and fsnotify_perm().

I doubt that static key infrastructure is required, because with this patch,
Jens did not observe any regression compared to the code being compiled out.

Thanks,
Amir.
Jan Kara Jan. 10, 2024, 1:56 p.m. UTC | #6
On Tue 09-01-24 21:48:18, Amir Goldstein wrote:
> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> optimized the case where there are no fsnotify watchers on any of the
> filesystem's objects.
> 
> It is quite common for a system to have a single local filesystem and
> it is quite common for the system to have some inotify watches on some
> config files or directories, so the optimization of no marks at all is
> often not in effect.

I agree.

> Access event watchers are far less common, so optimizing the case of
> no marks with access events could improve performance for more systems,
> especially for the performance sensitive hot io path.
> 
> Maintain a per-sb counter of objects that have marks with access
> events in their mask and use that counter to optimize out the call to
> fsnotify() in fsnotify access hooks.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I'm not saying no to this but let's discuss first before hacking in some
partial solution :). AFAIU what Jens sees is something similar as was
reported for example here [1]. In these cases we go through
fsnotify_parent() down to fsnotify() until the check:

        if (!(test_mask & marks_mask))
                return 0;

there. And this is all relatively cheap (pure fetches like
sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks,
sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask) except
for parent handling in __fsnotify_parent(). That requires multiple atomic
operations - take_dentry_name_snapshot() is lock & unlock of d_lock, dget()
is cmpxchg on d_lockref, dput() is another cmpxchg on d_lockref - and this
gets really expensive, even more so if multiple threads race for the same
parent dentry.

So I think what we ideally need is a way to avoid this expensive "stabilize
parent & its name" game unless we are pretty sure we are going to generate
the event. There's no way to avoid fetching the parent early if
dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we can still postpone the
take_dentry_name_snapshot() cost though). However for the case where
dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and this is the case
here AFAICT) we need the parent only after the check of 'test_mask &
marks_mask'. So in that case we could completely avoid the extra cost of
parent dentry handling.

So in principle I believe we could make fsnotify noticeably lighter for the
case where no event is generated. Just the question is how to refactor the
current set of functions to achieve this without creating an unmaintainable
mess. I suspect if we lifted creation & some prefilling of iter_info into
__fsnotify_parent() and then fill in the parent in case we need it for
reporting only in fsnotify(), the code could be reasonably readable. We'd
need to always propagate the dentry down to fsnotify() though, currently we
often propagate only the inode because the dentry may be (still in case of
create or already in case of unlink) negative. Similarly we'd somehow need
to communicate down into fsnotify() whether the passed name needs
snapshotting or not...

What do you think?

								Honza

[1] https://lore.kernel.org/all/SJ0PR08MB6494F5A32B7C60A5AD8B33C2AB09A@SJ0PR08MB6494.namprd08.prod.outlook.com
Amir Goldstein Jan. 10, 2024, 2:41 p.m. UTC | #7
On Wed, Jan 10, 2024 at 3:56 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 09-01-24 21:48:18, Amir Goldstein wrote:
> > Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> > optimized the case where there are no fsnotify watchers on any of the
> > filesystem's objects.
> >
> > It is quite common for a system to have a single local filesystem and
> > it is quite common for the system to have some inotify watches on some
> > config files or directories, so the optimization of no marks at all is
> > often not in effect.
>
> I agree.
>
> > Access event watchers are far less common, so optimizing the case of
> > no marks with access events could improve performance for more systems,
> > especially for the performance sensitive hot io path.
> >
> > Maintain a per-sb counter of objects that have marks with access
> > events in their mask and use that counter to optimize out the call to
> > fsnotify() in fsnotify access hooks.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I'm not saying no to this but let's discuss first before hacking in some
> partial solution :). AFAIU what Jens sees is something similar as was
> reported for example here [1]. In these cases we go through
> fsnotify_parent() down to fsnotify() until the check:
>
>         if (!(test_mask & marks_mask))
>                 return 0;
>
> there. And this is all relatively cheap (pure fetches like
> sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks,
> sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask) except
> for parent handling in __fsnotify_parent(). That requires multiple atomic
> operations - take_dentry_name_snapshot() is lock & unlock of d_lock, dget()
> is cmpxchg on d_lockref, dput() is another cmpxchg on d_lockref - and this
> gets really expensive, even more so if multiple threads race for the same
> parent dentry.

Sorry, I forgot to link the Jens regression report [1] which included this
partial perf report:

     3.36%             [kernel.vmlinux]  [k] fsnotify
     2.32%             [kernel.vmlinux]  [k] __fsnotify_paren

Your analysis about __fsnotify_parent() may be correct, but what would be
the explanation to time spent in fsnotify() in this report?

Jens,

Can you provide an expanded via of the perf function called by
fsnotify() and __fsnotify_parent()?

In general, I think that previous optimization work as this commit by Mel:
71d734103edf ("fsnotify: Rearrange fast path to minimise overhead when
there is no watcher")
showed improvements when checks were inlined.

[1] https://lore.kernel.org/linux-fsdevel/53682ece-f0e7-48de-9a1c-879ee34b0449@kernel.dk/

>
> So I think what we ideally need is a way to avoid this expensive "stabilize
> parent & its name" game unless we are pretty sure we are going to generate
> the event. There's no way to avoid fetching the parent early if
> dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we can still postpone the
> take_dentry_name_snapshot() cost though). However for the case where
> dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and this is the case
> here AFAICT) we need the parent only after the check of 'test_mask &
> marks_mask'. So in that case we could completely avoid the extra cost of
> parent dentry handling.
>
> So in principle I believe we could make fsnotify noticeably lighter for the
> case where no event is generated. Just the question is how to refactor the
> current set of functions to achieve this without creating an unmaintainable
> mess. I suspect if we lifted creation & some prefilling of iter_info into
> __fsnotify_parent() and then fill in the parent in case we need it for
> reporting only in fsnotify(), the code could be reasonably readable. We'd
> need to always propagate the dentry down to fsnotify() though, currently we
> often propagate only the inode because the dentry may be (still in case of
> create or already in case of unlink) negative. Similarly we'd somehow need
> to communicate down into fsnotify() whether the passed name needs
> snapshotting or not...
>
> What do you think?

I am not saying no ;)
but it sound a bit complicated so if the goal is to reduce the overhead
of fsnotify_access() and fsnotify_perm(), which I don't think any application
cares about, then I'd rather go with a much simpler solution even if it
does not cover all the corner cases.

Thanks,
Amir.
Jan Kara Jan. 10, 2024, 4:58 p.m. UTC | #8
On Wed 10-01-24 16:41:46, Amir Goldstein wrote:
> On Wed, Jan 10, 2024 at 3:56 PM Jan Kara <jack@suse.cz> wrote:
> > On Tue 09-01-24 21:48:18, Amir Goldstein wrote:
> > > Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> > > optimized the case where there are no fsnotify watchers on any of the
> > > filesystem's objects.
> > >
> > > It is quite common for a system to have a single local filesystem and
> > > it is quite common for the system to have some inotify watches on some
> > > config files or directories, so the optimization of no marks at all is
> > > often not in effect.
> >
> > I agree.
> >
> > > Access event watchers are far less common, so optimizing the case of
> > > no marks with access events could improve performance for more systems,
> > > especially for the performance sensitive hot io path.
> > >
> > > Maintain a per-sb counter of objects that have marks with access
> > > events in their mask and use that counter to optimize out the call to
> > > fsnotify() in fsnotify access hooks.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I'm not saying no to this but let's discuss first before hacking in some
> > partial solution :). AFAIU what Jens sees is something similar as was
> > reported for example here [1]. In these cases we go through
> > fsnotify_parent() down to fsnotify() until the check:
> >
> >         if (!(test_mask & marks_mask))
> >                 return 0;
> >
> > there. And this is all relatively cheap (pure fetches like
> > sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks,
> > sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask) except
> > for parent handling in __fsnotify_parent(). That requires multiple atomic
> > operations - take_dentry_name_snapshot() is lock & unlock of d_lock, dget()
> > is cmpxchg on d_lockref, dput() is another cmpxchg on d_lockref - and this
> > gets really expensive, even more so if multiple threads race for the same
> > parent dentry.
> 
> Sorry, I forgot to link the Jens regression report [1] which included this
> partial perf report:
> 
>      3.36%             [kernel.vmlinux]  [k] fsnotify
>      2.32%             [kernel.vmlinux]  [k] __fsnotify_paren
> 
> Your analysis about __fsnotify_parent() may be correct, but what would be
> the explanation to time spent in fsnotify() in this report?

OK, I don't have a good explanation for why fsnotify() itself is so high in
Jens' profile. Maybe cacheline fetches caused by inode->i_fsnotify_mask and
inode->i_fsnotify_marks checks are causing the overhead but it would be
good to have it confirmed.

> In general, I think that previous optimization work as this commit by Mel:
> 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead when
> there is no watcher") showed improvements when checks were inlined.

Well, I believe that commit helped exactly because the check for
DCACHE_FSNOTIFY_PARENT_WATCHED was moved ahead of all the work now in
__fsnotify_parent().

> [1] https://lore.kernel.org/linux-fsdevel/53682ece-f0e7-48de-9a1c-879ee34b0449@kernel.dk/
> 
> > So I think what we ideally need is a way to avoid this expensive "stabilize
> > parent & its name" game unless we are pretty sure we are going to generate
> > the event. There's no way to avoid fetching the parent early if
> > dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we can still postpone the
> > take_dentry_name_snapshot() cost though). However for the case where
> > dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and this is the case
> > here AFAICT) we need the parent only after the check of 'test_mask &
> > marks_mask'. So in that case we could completely avoid the extra cost of
> > parent dentry handling.
> >
> > So in principle I believe we could make fsnotify noticeably lighter for the
> > case where no event is generated. Just the question is how to refactor the
> > current set of functions to achieve this without creating an unmaintainable
> > mess. I suspect if we lifted creation & some prefilling of iter_info into
> > __fsnotify_parent() and then fill in the parent in case we need it for
> > reporting only in fsnotify(), the code could be reasonably readable. We'd
> > need to always propagate the dentry down to fsnotify() though, currently we
> > often propagate only the inode because the dentry may be (still in case of
> > create or already in case of unlink) negative. Similarly we'd somehow need
> > to communicate down into fsnotify() whether the passed name needs
> > snapshotting or not...
> >
> > What do you think?
> 
> I am not saying no ;)
> but it sound a bit complicated so if the goal is to reduce the overhead
> of fsnotify_access() and fsnotify_perm(), which I don't think any application
> cares about, then I'd rather go with a much simpler solution even if it
> does not cover all the corner cases.

OK, let's figure out what exactly causes slowdown in Jens' case first. I
agree your solution helps mitigate the cost of fsnotify_access() for reads
but I forsee people complaining about fsnotify_modify() cost for writes in
short order :) and there it is not so simple to solve as there's likely
some watch for FS_MODIFY event somewhere.

								Honza
Amir Goldstein Jan. 11, 2024, 6:27 a.m. UTC | #9
> > I am not saying no ;)
> > but it sound a bit complicated so if the goal is to reduce the overhead
> > of fsnotify_access() and fsnotify_perm(), which I don't think any application
> > cares about, then I'd rather go with a much simpler solution even if it
> > does not cover all the corner cases.
>
> OK, let's figure out what exactly causes slowdown in Jens' case first. I
> agree your solution helps mitigate the cost of fsnotify_access() for reads
> but I forsee people complaining about fsnotify_modify() cost for writes in
> short order :) and there it is not so simple to solve as there's likely
> some watch for FS_MODIFY event somewhere.
>

Actually, I think we may be able to eat the cake and leave it whole.
As I've written to you once in a different context, I think fsnotify
has two mostly non-overlapping use cases:
1. Watch FS_ACCESS/FS_MODIFY on selective inodes (a.k.a tail)
2. Watch sb/mount/recursive subtree

Anyone that will try watching FS_ACCESS/FS_MODIFY on a large
data set, would find that to be way too noisy to be useful.

For example, we have a filesystem monitor application which needs to
know about changes to any file in the filesystem, but we cannot use
FS_MODIFY for that monitor because it is too noisy, so we use
FS_CLOSE_WRITE as second best.

I guess we cannot rule out that the use case of watching direct
children for FS_ACCESS/FS_MODIFY may exist in the wild.

IOW, I think we should be able to optimize most of the access/modify
hooks by checking those events specifically in the inline helpers for:
1. inode mask
2. parent watching children mask
3. sb->s_iflags & SB_I_FSNOTIFY_ACCESS_MONITOR

The use of such an access monitor is currently unlikely, so
I think tainting the sb is good enough for now.

Note that the upcoming pre-content events (a.k.a HSM) are
certainly going to fall under this "unlikely" category -
If you'd want to use HSM (on-demand file content filling) on
a filesystem, you will pay the performance penalty for some intensive
io workloads unless you'd use an fd with FMODE_NONOTIFY.
I think that is to be expected.

HSM is anyway expected to incur extra performance penalty for
sb_write_barrier() (for the pre-modify events) and in my wip
sb_write_barrier branch [1], there is a similar optimization that does
activate_sb_write_barrier() on the first pre-modify event watch
and then leaves it activated on that sb forever.

I will try to write this patch.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/sb_write_barrier
Christian Brauner Jan. 11, 2024, 10:05 a.m. UTC | #10
On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> On Tue, Jan 9, 2024 at 10:24 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 1/9/24 1:12 PM, Jens Axboe wrote:
> > > On 1/9/24 12:48 PM, Amir Goldstein wrote:
> > >> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> > >> optimized the case where there are no fsnotify watchers on any of the
> > >> filesystem's objects.
> > >>
> > >> It is quite common for a system to have a single local filesystem and
> > >> it is quite common for the system to have some inotify watches on some
> > >> config files or directories, so the optimization of no marks at all is
> > >> often not in effect.
> > >>
> > >> Access event watchers are far less common, so optimizing the case of
> > >> no marks with access events could improve performance for more systems,
> > >> especially for the performance sensitive hot io path.
> > >>
> > >> Maintain a per-sb counter of objects that have marks with access
> > >> events in their mask and use that counter to optimize out the call to
> > >> fsnotify() in fsnotify access hooks.
> > >>
> > >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >> ---
> > >>
> > >> Jens,
> > >>
> > >> You may want to try if this patch improves performance for your workload
> > >> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.
> > >
> > > Ran the usual test, and this effectively removes fsnotify from the
> > > profiles, which (as per other email) is between 5-6% of CPU time. So I'd
> > > say it looks mighty appealing!
> >
> > Tried with an IRQ based workload as well, as those are always impacted
> > more by the fsnotify slowness. This patch removes ~8% of useless
> > overhead in that case, so even bigger win there.
> >
> 
> Do the IRQ based workloads always go through io_req_io_end()?
> Meaning that unlike the polled io workloads, they also incur the
> overhead of the fsnotify_{modify,access}() hooks?
> 
> I remember I asked you once (cannot find where) whether
> io_complete_rw_iopoll() needs the fsnotify hooks and you said that
> it is a highly specialized code path for fast io, whose users will not
> want those access/modify hooks.
> 
> Considering the fact that fsnotify_{modify,access}() could just as well
> be bypassed by mmap() read/write, I fully support this reasoning.
> 
> Anyway, that explains (to me) why compiling-out the fsnotify_perm()
> hooks took away all the regression that you observed in upstream,
> because I was wondering where the overhead of fsnotify_access() was.
> 
> Jan,
> 
> What are your thoughts about this optimization patch?
> 
> My thoughts are that the optimization is clearly a win, but do we
> really want to waste a full long in super_block for counting access
> event watchers that may never exist?

Meh, not too excited about it. Couldn't you use a flag in s_fsnotify_mask?
Maybe that's what you mean below.

> 
> Should we perhaps instead use a flag to say that "access watchers
> existed"?
> 
> We could put s_fsnotify_access_watchers inside a struct
> fsnotify_sb_mark_connector and special case alloc/free of
> FSNOTIFY_OBJ_TYPE_SB connector.

Would it make sense without incurring performance impact to move
atomic_long_t s_fsnotify_connectors and your new atomic_long_t into a
struct fsnotify_data that gets allocated when an sb is created? Then we
don't waste space in struct super_block. This is basically a copy-pasta
of the LSM sb->s_security approach.
Amir Goldstein Jan. 11, 2024, 10:33 a.m. UTC | #11
> > Jan,
> >
> > What are your thoughts about this optimization patch?
> >
> > My thoughts are that the optimization is clearly a win, but do we
> > really want to waste a full long in super_block for counting access
> > event watchers that may never exist?
>
> Meh, not too excited about it. Couldn't you use a flag in s_fsnotify_mask?
> Maybe that's what you mean below.
>

Yes, for my v2 I will use a "taint" flag in s_fsnotify_mask instead
of keeping count of content monitors.

> >
> > Should we perhaps instead use a flag to say that "access watchers
> > existed"?
> >
> > We could put s_fsnotify_access_watchers inside a struct
> > fsnotify_sb_mark_connector and special case alloc/free of
> > FSNOTIFY_OBJ_TYPE_SB connector.
>
> Would it make sense without incurring performance impact to move
> atomic_long_t s_fsnotify_connectors and your new atomic_long_t into a
> struct fsnotify_data that gets allocated when an sb is created? Then we
> don't waste space in struct super_block. This is basically a copy-paste
> of the LSM sb->s_security approach.

Not as trivial to move s_fsnotify_connectors, but I think it should be possible,
because basically, I don't think we need to ever detach an sb connector
once it is attached.

I did already look at this when writing the above suggestion, but it was
not straightforward so left it for later.

This change would be relevant for another feature that I need for HSM
called "default event mask" (sort of like the default iptables rule), so
I will look into making those changes when I get to this feature.

Thanks,
Amir.
diff mbox series

Patch

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index d6944ff86ffa..14dc581df414 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -153,9 +153,29 @@  static struct inode *fsnotify_update_iref(struct fsnotify_mark_connector *conn,
 	return inode;
 }
 
+/*
+ * To avoid the performance penalty of rare access event watchers in the hot
+ * io path, keep track of whether any such watchers exists on the filesystem.
+ */
+static void fsnotify_update_sb_watchers(struct fsnotify_mark_connector *conn,
+					u32 old_mask, u32 new_mask)
+{
+	struct super_block *sb = fsnotify_connector_sb(conn);
+	bool old_watchers = old_mask & ALL_FSNOTIFY_ACCESS_EVENTS;
+	bool new_watchers = new_mask & ALL_FSNOTIFY_ACCESS_EVENTS;
+
+	if (!sb || old_watchers == new_watchers)
+		return;
+
+	if (new_watchers)
+		atomic_long_inc(&sb->s_fsnotify_access_watchers);
+	else
+		atomic_long_dec(&sb->s_fsnotify_access_watchers);
+}
+
 static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 {
-	u32 new_mask = 0;
+	u32 old_mask, new_mask = 0;
 	bool want_iref = false;
 	struct fsnotify_mark *mark;
 
@@ -163,6 +183,7 @@  static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 	/* We can get detached connector here when inode is getting unlinked. */
 	if (!fsnotify_valid_obj_type(conn->type))
 		return NULL;
+	old_mask = fsnotify_conn_mask(conn);
 	hlist_for_each_entry(mark, &conn->list, obj_list) {
 		if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED))
 			continue;
@@ -172,6 +193,7 @@  static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 			want_iref = true;
 	}
 	*fsnotify_conn_mask_p(conn) = new_mask;
+	fsnotify_update_sb_watchers(conn, old_mask, new_mask);
 
 	return fsnotify_update_iref(conn, want_iref);
 }
@@ -243,11 +265,13 @@  static void *fsnotify_detach_connector_from_object(
 					unsigned int *type)
 {
 	struct inode *inode = NULL;
+	u32 old_mask;
 
 	*type = conn->type;
 	if (conn->type == FSNOTIFY_OBJ_TYPE_DETACHED)
 		return NULL;
 
+	old_mask = fsnotify_conn_mask(conn);
 	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
 		inode = fsnotify_conn_inode(conn);
 		inode->i_fsnotify_mask = 0;
@@ -261,6 +285,7 @@  static void *fsnotify_detach_connector_from_object(
 		fsnotify_conn_sb(conn)->s_fsnotify_mask = 0;
 	}
 
+	fsnotify_update_sb_watchers(conn, old_mask, 0);
 	fsnotify_put_sb_connectors(conn);
 	rcu_assign_pointer(*(conn->obj), NULL);
 	conn->obj = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42298095b7a5..643d2aeb037d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1282,6 +1282,7 @@  struct super_block {
 	 * inodes objects are currently double-accounted.
 	 */
 	atomic_long_t s_fsnotify_connectors;
+	atomic_long_t s_fsnotify_access_watchers;
 
 	/* Read-only state of the superblock is being changed */
 	int s_readonly_remount;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 8300a5286988..9dba2e038017 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -17,6 +17,14 @@ 
 #include <linux/slab.h>
 #include <linux/bug.h>
 
+/* Are there any inode/mount/sb objects that are being watched? */
+static inline int fsnotify_sb_has_watchers(struct super_block *sb, __u32 mask)
+{
+	if (mask & ALL_FSNOTIFY_ACCESS_EVENTS)
+		return atomic_long_read(&sb->s_fsnotify_access_watchers);
+	return atomic_long_read(&sb->s_fsnotify_connectors);
+}
+
 /*
  * Notify this @dir inode about a change in a child directory entry.
  * The directory entry may have turned positive or negative or its inode may
@@ -30,7 +38,7 @@  static inline int fsnotify_name(__u32 mask, const void *data, int data_type,
 				struct inode *dir, const struct qstr *name,
 				u32 cookie)
 {
-	if (atomic_long_read(&dir->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(dir->i_sb, mask))
 		return 0;
 
 	return fsnotify(mask, data, data_type, dir, name, NULL, cookie);
@@ -44,7 +52,7 @@  static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
 
 static inline void fsnotify_inode(struct inode *inode, __u32 mask)
 {
-	if (atomic_long_read(&inode->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(inode->i_sb, mask))
 		return;
 
 	if (S_ISDIR(inode->i_mode))
@@ -59,7 +67,7 @@  static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
 {
 	struct inode *inode = d_inode(dentry);
 
-	if (atomic_long_read(&inode->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(inode->i_sb, mask))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode)) {
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 7f63be5ca0f1..5e0e76cbd7ee 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -77,6 +77,8 @@ 
  */
 #define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
 
+#define ALL_FSNOTIFY_ACCESS_EVENTS (FS_ACCESS | FS_ACCESS_PERM)
+
 #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
 				  FS_OPEN_EXEC_PERM)