
[v2,08/16] fanotify: merge duplicate events on parent and child

Message ID 20200217131455.31107-9-amir73il@gmail.com (mailing list archive)
State New, archived
Series: Fanotify event with name info

Commit Message

Amir Goldstein Feb. 17, 2020, 1:14 p.m. UTC
With inotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, once with wd of the parent watch
and once with wd of the child watch without the filename.

With fanotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, but it has the exact same
information - either an open file descriptor of the child or an encoded
fid of the child.

The reason that the two identical events are not merged is because the
tag used for merging events in the queue is the child inode in one event
and parent inode in the other.

For events with path or dentry data, use the dentry instead of inode as
the tag for event merging, so that the event reported on parent will be
merged with the event reported on the child.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Jan Kara Feb. 26, 2020, 9:18 a.m. UTC | #1
On Mon 17-02-20 15:14:47, Amir Goldstein wrote:
> With inotify, when a watch is set on a directory and on its child, an
> event on the child is reported twice, once with wd of the parent watch
> and once with wd of the child watch without the filename.
> 
> With fanotify, when a watch is set on a directory and on its child, an
> event on the child is reported twice, but it has the exact same
> information - either an open file descriptor of the child or an encoded
> fid of the child.
> 
> The reason that the two identical events are not merged is because the
> tag used for merging events in the queue is the child inode in one event
> and parent inode in the other.
> 
> For events with path or dentry data, use the dentry instead of inode as
> the tag for event merging, so that the event reported on parent will be
> merged with the event reported on the child.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I agree that reporting identical event twice seems wasteful but ...

> @@ -312,7 +313,12 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>  	if (!event)
>  		goto out;
>  init: __maybe_unused
> -	fsnotify_init_event(&event->fse, inode);
> +	/*
> +	 * Use the dentry instead of inode as tag for event queue, so event
> +	 * reported on parent is merged with event reported on child when both
> +	 * directory and child watches exist.
> +	 */
> +	fsnotify_init_event(&event->fse, (void *)dentry ?: inode);

... this seems quite ugly and also previously we could merge 'inode' events
with others and now we cannot because some will carry "dentry where event
happened" and other ones "inode with watch" as object identifier. So if you
want to do this, I'd use "inode where event happened" as object identifier
for fanotify.

Hum, now thinking about this, maybe we could clean this up even a bit more.
event->inode is currently used only by inotify and fanotify for merging
purposes. Now inotify could use its 'wd' instead of inode with exactly the
same results, fanotify path or fid check is at least as strong as the inode
check. So only for the case of pure "inode" events, we need to store inode
identifier in struct fanotify_event - and we can do that in the union with
struct path and completely remove the 'inode' member from fsnotify_event.
Am I missing something?
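
Just to illustrate what I mean (this is a sketch, not a patch - it assumes the
existing inotify event_compare() structure and merely swaps the objectid check
for a wd check):

	static bool event_compare(struct fsnotify_event *old_fsn,
				  struct fsnotify_event *new_fsn)
	{
		struct inotify_event_info *old = INOTIFY_E(old_fsn);
		struct inotify_event_info *new = INOTIFY_E(new_fsn);

		/* An ignored event on the object stops merging */
		if (old->mask & FS_IN_IGNORED)
			return false;
		/* wd identifies the watch, so it can stand in for the inode */
		return old->wd == new->wd && old->mask == new->mask &&
		       old->name_len == new->name_len &&
		       (!old->name_len || !strcmp(old->name, new->name));
	}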

								Honza
Amir Goldstein Feb. 26, 2020, 12:14 p.m. UTC | #2
On Wed, Feb 26, 2020 at 11:18 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 17-02-20 15:14:47, Amir Goldstein wrote:
> > With inotify, when a watch is set on a directory and on its child, an
> > event on the child is reported twice, once with wd of the parent watch
> > and once with wd of the child watch without the filename.
> >
> > With fanotify, when a watch is set on a directory and on its child, an
> > event on the child is reported twice, but it has the exact same
> > information - either an open file descriptor of the child or an encoded
> > fid of the child.
> >
> > The reason that the two identical events are not merged is because the
> > tag used for merging events in the queue is the child inode in one event
> > and parent inode in the other.
> >
> > For events with path or dentry data, use the dentry instead of inode as
> > the tag for event merging, so that the event reported on parent will be
> > merged with the event reported on the child.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I agree that reporting identical event twice seems wasteful but ...
>
> > @@ -312,7 +313,12 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> >       if (!event)
> >               goto out;
> >  init: __maybe_unused
> > -     fsnotify_init_event(&event->fse, inode);
> > +     /*
> > +      * Use the dentry instead of inode as tag for event queue, so event
> > +      * reported on parent is merged with event reported on child when both
> > +      * directory and child watches exist.
> > +      */
> > +     fsnotify_init_event(&event->fse, (void *)dentry ?: inode);
>
> ... this seems quite ugly and also previously we could merge 'inode' events
> with others and now we cannot because some will carry "dentry where event
> happened" and other ones "inode with watch" as object identifier. So if you
> want to do this, I'd use "inode where event happened" as object identifier
> for fanotify.

<scratch head> Why didn't I think of that?...

I suppose you mean to just use:

     fsnotify_init_event(&event->fse, id);
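
i.e. with 'id' being the inode where the event happened (the child inode when
the event is reported on a parent watch). Roughly something like this, where
the variable name is just for illustration:

	struct inode *id = dentry ? d_inode(dentry) : inode;
	...
	fsnotify_init_event(&event->fse, id);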


>
> Hum, now thinking about this, maybe we could clean this up even a bit more.
> event->inode is currently used only by inotify and fanotify for merging
> purposes. Now inotify could use its 'wd' instead of inode with exactly the
> same results, fanotify path or fid check is at least as strong as the inode
> check. So only for the case of pure "inode" events, we need to store inode
> identifier in struct fanotify_event - and we can do that in the union with
> struct path and completely remove the 'inode' member from fsnotify_event.
> Am I missing something?

That generally sounds good, and I did notice it is strange that wd is not
being compared.
However, I think I was worried that comparing fid+name (in the following
patches) would be more expensive than comparing the dentry (or object inode)
as a "rule out first" check in merge, so I preferred to keep the tag/dentry/id
comparison for the fanotify_fid case.

Given this analysis (and assuming it is correct), would you like me to just
go ahead with the change suggested above, or do anything beyond that?

Thanks,
Amir.
Jan Kara Feb. 26, 2020, 2:38 p.m. UTC | #3
On Wed 26-02-20 14:14:50, Amir Goldstein wrote:
> On Wed, Feb 26, 2020 at 11:18 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 17-02-20 15:14:47, Amir Goldstein wrote:
> > > With inotify, when a watch is set on a directory and on its child, an
> > > event on the child is reported twice, once with wd of the parent watch
> > > and once with wd of the child watch without the filename.
> > >
> > > With fanotify, when a watch is set on a directory and on its child, an
> > > event on the child is reported twice, but it has the exact same
> > > information - either an open file descriptor of the child or an encoded
> > > fid of the child.
> > >
> > > The reason that the two identical events are not merged is because the
> > > tag used for merging events in the queue is the child inode in one event
> > > and parent inode in the other.
> > >
> > > For events with path or dentry data, use the dentry instead of inode as
> > > the tag for event merging, so that the event reported on parent will be
> > > merged with the event reported on the child.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I agree that reporting identical event twice seems wasteful but ...
> >
> > > @@ -312,7 +313,12 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> > >       if (!event)
> > >               goto out;
> > >  init: __maybe_unused
> > > -     fsnotify_init_event(&event->fse, inode);
> > > +     /*
> > > +      * Use the dentry instead of inode as tag for event queue, so event
> > > +      * reported on parent is merged with event reported on child when both
> > > +      * directory and child watches exist.
> > > +      */
> > > +     fsnotify_init_event(&event->fse, (void *)dentry ?: inode);
> >
> > ... this seems quite ugly and also previously we could merge 'inode' events
> > with others and now we cannot because some will carry "dentry where event
> > happened" and other ones "inode with watch" as object identifier. So if you
> > want to do this, I'd use "inode where event happened" as object identifier
> > for fanotify.
> 
> <scratch head> Why didn't I think of that?...
> 
> I suppose you mean to just use:
> 
>      fsnotify_init_event(&event->fse, id);

Yes.

> > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > event->inode is currently used only by inotify and fanotify for merging
> > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > same results, fanotify path or fid check is at least as strong as the inode
> > check. So only for the case of pure "inode" events, we need to store inode
> > identifier in struct fanotify_event - and we can do that in the union with
> > struct path and completely remove the 'inode' member from fsnotify_event.
> > Am I missing something?
> 
> That generally sounds good and I did notice it is strange that wd is not
> being compared.  However, I think I was worried that comparing fid+name
> (in following patches) would be more expensive than comparing dentry (or
> object inode) as a "rule out first" in merge, so I preferred to keep the
> tag/dentry/id comparison for fanotify_fid case.

Yes, that could be a concern.
 
> Given this analysis (and assuming it is correct), would you like me to
> just go a head with the change suggested above? or anything beyond that?

Let's go just with the change suggested above for now. We can work on this
later (probably with optimizing of the fanotify merging code).

								Honza
Amir Goldstein Jan. 22, 2021, 1:59 p.m. UTC | #4
> > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > event->inode is currently used only by inotify and fanotify for merging
> > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > same results, fanotify path or fid check is at least as strong as the inode
> > > check. So only for the case of pure "inode" events, we need to store inode
> > > identifier in struct fanotify_event - and we can do that in the union with
> > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > Am I missing something?
> >
> > That generally sounds good and I did notice it is strange that wd is not
> > being compared.  However, I think I was worried that comparing fid+name
> > (in following patches) would be more expensive than comparing dentry (or
> > object inode) as a "rule out first" in merge, so I preferred to keep the
> > tag/dentry/id comparison for fanotify_fid case.
>
> Yes, that could be a concern.
>
> > Given this analysis (and assuming it is correct), would you like me to
> > just go a head with the change suggested above? or anything beyond that?
>
> Let's go just with the change suggested above for now. We can work on this
> later (probably with optimizing of the fanotify merging code).
>

Hi Jan,

Recap:
- fanotify_merge() is very inefficient and consumes excessive CPU if the queue
  contains many events, so it is rather easy for a poorly written listener to
  cripple the system
- You had an idea to store in event->objectid a hash of all the compared
  fields (e.g. fid+name)
- I think you also had an idea to keep a hash table of events in the queue to
  find merge candidates faster
- For internal use, I carry a patch that limits the linear search to the last
  128 events, which is enough to relieve the CPU overuse in case of unattended
  long queues

I tried looking into implementing the hash table idea, assuming I understood
you correctly, but I struggled to choose appropriate table sizes. It seemed to
make sense to use a global hash table for all the groups, like the inode/dentry
caches, but that would add complexity to the locking rules of queue/dequeue and
group cleanup.

A simpler solution I considered, similar to my 128 events limit patch, is to
limit the linear search to events queued in the last X seconds.
The rationale is that event merging is not supposed to be long term at all.
If a listener fails to read from the queue, it is not fsnotify's job to try
and keep the queue compact. I think the event merging mechanism was mainly
meant to merge short bursts of events on objects, which are quite common and
can surely happen concurrently on several objects.

My intuition is that turning event->objectid into event->hash, in addition to
limiting the age of events eligible for merging, would address real-life
workloads.
One question, if we do choose this approach, is what the age limit should be.
Should it be configurable? Default to infinity and let distros cap the age, or
have the kernel provide a sane default while slightly changing behavior (yes
please)?
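
Just to be concrete about the hash part, I mean something along these lines
(the helper name and signature are made up here, only the idea matters;
jhash()/jhash2() are from <linux/jhash.h>):

	/* Fold fsid + file handle + optional name into one 32-bit key */
	static u32 fanotify_hash_key(__kernel_fsid_t *fsid, const void *fh_buf,
				     unsigned int fh_len, const struct qstr *name)
	{
		u32 hash;

		hash = jhash2((u32 *)fsid, sizeof(*fsid) / sizeof(u32), 0);
		hash = jhash(fh_buf, fh_len, hash);
		if (name)
			hash = jhash(name->name, name->len, hash);
		return hash;
	}

The key would then be stored in event->objectid (or event->hash), and only
events with equal keys would be fully compared in fanotify_merge().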

What are your thoughts about this?
Do you have a better idea maybe?

Thanks,
Amir.
Amir Goldstein Jan. 23, 2021, 1:30 p.m. UTC | #5
On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > event->inode is currently used only by inotify and fanotify for merging
> > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > Am I missing something?
> > >
> > > That generally sounds good and I did notice it is strange that wd is not
> > > being compared.  However, I think I was worried that comparing fid+name
> > > (in following patches) would be more expensive than comparing dentry (or
> > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > tag/dentry/id comparison for fanotify_fid case.
> >
> > Yes, that could be a concern.
> >
> > > Given this analysis (and assuming it is correct), would you like me to
> > > just go a head with the change suggested above? or anything beyond that?
> >
> > Let's go just with the change suggested above for now. We can work on this
> > later (probably with optimizing of the fanotify merging code).
> >
>
> Hi Jan,
>
> Recap:
> - fanotify_merge is very inefficient and uses extensive CPU if queue contains
>   many events, so it is rather easy for a poorly written listener to
> cripple the system
> - You had an idea to store in event->objectid a hash of all the compared
>   fields (e.g. fid+name)
> - I think you had an idea to keep a hash table of events in the queue
> to find the
>   merge candidates faster
> - For internal uses, I carry a patch that limits the linear search for
> last 128 events
>   which is enough to relieve the CPU overuse in case of unattended long queues
>
> I tried looking into implementing the hash table idea, assuming I understood you
> correctly and I struggled to choose appropriate table sizes. It seemed to make
> sense to use a global hash table, such as inode/dentry cache for all the groups
> but that would add complexity to locking rules of queue/dequeue and
> group cleanup.
>
> A simpler solution I considered, similar to my 128 events limit patch,
> is to limit
> the linear search to events queued in the last X seconds.
> The rationale is that event merging is not supposed to be long term at all.
> If a listener fails to perform read from the queue, it is not fsnotify's job to
> try and keep the queue compact. I think merging events mechanism was
> mainly meant to merge short bursts of events on objects, which are quite
> common and surely can happen concurrently on several objects.
>
> My intuition is that making event->objectid into event->hash in addition
> to limiting the age of events to merge would address the real life workloads.
> One question if we do choose this approach is what should the age limit be?
> Should it be configurable? Default to infinity and let distro cap the age or
> provide a sane default by kernel while slightly changing behavior (yes please).
>
> What are your thoughts about this?

Aha! found it:
https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
You suggested a small hash table per group (128 slots).

My intuition is that this will not be good enough for the worst case, which is
not that hard to hit in real life:
1. Listener sets FAN_UNLIMITED_QUEUE
2. Listener adds a FAN_MARK_FILESYSTEM watch
3. Many thousands of events are queued
4. Listener lingers (due to bad implementation?) in reading events
5. Every single event now incurs a huge fanotify_merge() cost

Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the problem.

> Do you have a better idea maybe?

Thanks,
Amir.
Jan Kara Jan. 25, 2021, 1:01 p.m. UTC | #6
On Sat 23-01-21 15:30:59, Amir Goldstein wrote:
> On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > > event->inode is currently used only by inotify and fanotify for merging
> > > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > > Am I missing something?
> > > >
> > > > That generally sounds good and I did notice it is strange that wd is not
> > > > being compared.  However, I think I was worried that comparing fid+name
> > > > (in following patches) would be more expensive than comparing dentry (or
> > > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > > tag/dentry/id comparison for fanotify_fid case.
> > >
> > > Yes, that could be a concern.
> > >
> > > > Given this analysis (and assuming it is correct), would you like me to
> > > > just go a head with the change suggested above? or anything beyond that?
> > >
> > > Let's go just with the change suggested above for now. We can work on this
> > > later (probably with optimizing of the fanotify merging code).
> > >
> >
> > Hi Jan,
> >
> > Recap:
> > - fanotify_merge is very inefficient and uses extensive CPU if queue contains
> >   many events, so it is rather easy for a poorly written listener to
> > cripple the system
> > - You had an idea to store in event->objectid a hash of all the compared
> >   fields (e.g. fid+name)
> > - I think you had an idea to keep a hash table of events in the queue
> > to find the
> >   merge candidates faster
> > - For internal uses, I carry a patch that limits the linear search for
> > last 128 events
> >   which is enough to relieve the CPU overuse in case of unattended long queues
> >
> > I tried looking into implementing the hash table idea, assuming I understood you
> > correctly and I struggled to choose appropriate table sizes. It seemed to make
> > sense to use a global hash table, such as inode/dentry cache for all the groups
> > but that would add complexity to locking rules of queue/dequeue and
> > group cleanup.
> >
> > A simpler solution I considered, similar to my 128 events limit patch,
> > is to limit
> > the linear search to events queued in the last X seconds.
> > The rationale is that event merging is not supposed to be long term at all.
> > If a listener fails to perform read from the queue, it is not fsnotify's job to
> > try and keep the queue compact. I think merging events mechanism was
> > mainly meant to merge short bursts of events on objects, which are quite
> > common and surely can happen concurrently on several objects.
> >
> > My intuition is that making event->objectid into event->hash in addition
> > to limiting the age of events to merge would address the real life workloads.
> > One question if we do choose this approach is what should the age limit be?
> > Should it be configurable? Default to infinity and let distro cap the age or
> > provide a sane default by kernel while slightly changing behavior (yes please).
> >
> > What are your thoughts about this?
> 
> Aha! found it:
> https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
> You suggested a small hash table per group (128 slots).
> 
> My intuition is that this will not be good enough for the worst case, which is
> not that hard to hit is real life:
> 1. Listener sets FAN_UNLIMITED_QUEUE
> 2. Listener adds a FAN_MARK_FILESYSTEM watch
> 3. Many thousands of events are queued
> 4. Listener lingers (due to bad implementation?) in reading events
> 5. Every single event now incurs a huge fanotify_merge() cost
> 
> Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the
> problem.

So my thought was that indeed reducing the overhead of merging by a factor
of 128 should be enough for any practical case, even though I agree that in
principle the computational complexity remains the same. And I've picked a
per-group hash table to avoid interference among notification groups and
to keep locking simple. That being said, I'm not opposed to combining this
with a limit on the number of elements traversed in a hash chain (e.g.
those 128 you use yourself) - it will be naturally ordered by queue order
if we are a bit careful. This will provide efficient and effective merging
for ~8k queued events, which seems enough to me. I find time-based limits
not really worth it. Yes, they provide more predictable behavior but less
predictable runtime, and overall I don't find the complexity worth the
benefit.
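
Just so we are talking about the same thing, roughly (all the struct members
and the helper are invented here just to show the shape, and the chain cap is
the same 128 you already use):

	#define FANOTIFY_HTABLE_BITS	7	/* 128 buckets per group */
	#define FANOTIFY_MAX_MERGE_SCAN	128	/* cap the walk per bucket */

	/* Look for a merge candidate in the bucket of the new event */
	static struct fanotify_event *fanotify_merge_candidate(
						struct fsnotify_group *group,
						struct fanotify_event *new)
	{
		struct hlist_head *bucket =
			&group->merge_hash[hash_long(new->fse.objectid,
						     FANOTIFY_HTABLE_BITS)];
		struct fanotify_event *old;
		int scanned = 0;

		hlist_for_each_entry(old, bucket, merge_list) {
			if (++scanned > FANOTIFY_MAX_MERGE_SCAN)
				break;
			if (should_merge(&old->fse, &new->fse))
				return old;
		}
		return NULL;
	}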

								Honza
Amir Goldstein Jan. 26, 2021, 4:21 p.m. UTC | #7
On Mon, Jan 25, 2021 at 3:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 23-01-21 15:30:59, Amir Goldstein wrote:
> > On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > > > event->inode is currently used only by inotify and fanotify for merging
> > > > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > > > Am I missing something?
> > > > >
> > > > > That generally sounds good and I did notice it is strange that wd is not
> > > > > being compared.  However, I think I was worried that comparing fid+name
> > > > > (in following patches) would be more expensive than comparing dentry (or
> > > > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > > > tag/dentry/id comparison for fanotify_fid case.
> > > >
> > > > Yes, that could be a concern.
> > > >
> > > > > Given this analysis (and assuming it is correct), would you like me to
> > > > > just go a head with the change suggested above? or anything beyond that?
> > > >
> > > > Let's go just with the change suggested above for now. We can work on this
> > > > later (probably with optimizing of the fanotify merging code).
> > > >
> > >
> > > Hi Jan,
> > >
> > > Recap:
> > > - fanotify_merge is very inefficient and uses extensive CPU if queue contains
> > >   many events, so it is rather easy for a poorly written listener to
> > > cripple the system
> > > - You had an idea to store in event->objectid a hash of all the compared
> > >   fields (e.g. fid+name)
> > > - I think you had an idea to keep a hash table of events in the queue
> > > to find the
> > >   merge candidates faster
> > > - For internal uses, I carry a patch that limits the linear search for
> > > last 128 events
> > >   which is enough to relieve the CPU overuse in case of unattended long queues
> > >
> > > I tried looking into implementing the hash table idea, assuming I understood you
> > > correctly and I struggled to choose appropriate table sizes. It seemed to make
> > > sense to use a global hash table, such as inode/dentry cache for all the groups
> > > but that would add complexity to locking rules of queue/dequeue and
> > > group cleanup.
> > >
> > > A simpler solution I considered, similar to my 128 events limit patch,
> > > is to limit
> > > the linear search to events queued in the last X seconds.
> > > The rationale is that event merging is not supposed to be long term at all.
> > > If a listener fails to perform read from the queue, it is not fsnotify's job to
> > > try and keep the queue compact. I think merging events mechanism was
> > > mainly meant to merge short bursts of events on objects, which are quite
> > > common and surely can happen concurrently on several objects.
> > >
> > > My intuition is that making event->objectid into event->hash in addition
> > > to limiting the age of events to merge would address the real life workloads.
> > > One question if we do choose this approach is what should the age limit be?
> > > Should it be configurable? Default to infinity and let distro cap the age or
> > > provide a sane default by kernel while slightly changing behavior (yes please).
> > >
> > > What are your thoughts about this?
> >
> > Aha! found it:
> > https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
> > You suggested a small hash table per group (128 slots).
> >
> > My intuition is that this will not be good enough for the worst case, which is
> > not that hard to hit is real life:
> > 1. Listener sets FAN_UNLIMITED_QUEUE
> > 2. Listener adds a FAN_MARK_FILESYSTEM watch
> > 3. Many thousands of events are queued
> > 4. Listener lingers (due to bad implementation?) in reading events
> > 5. Every single event now incurs a huge fanotify_merge() cost
> >
> > Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the
> > problem.
>
> So my thought was that indeed reducing the overhead of merging by a factor
> of 128 should be enough for any practical case as much as I agree that in
> principle the computational complexity remains the same. And I've picked
> per-group hash table to avoid interferences among notification groups and
> to keep locking simple. That being said I'm not opposed to combining this
> with a limit on the number of elements traversed in a hash chain (e.g.
> those 128 you use yourself) - it will be naturally ordered by queue order
> if we are a bit careful. This will provide efficient and effective merging
> for ~8k queued events which seems enough to me. I find time based limits
> not really worth it. Yes, they provide more predictable behavior but less
> predictable runtime and overall I don't find the complexity worth the
> benefit.
>

Sounds reasonable.
If you have time, please take a look at this WIP branch:
https://github.com/amir73il/linux/commits/fanotify_merge
and let me know if you like the direction it is taking.

This branch is only compile-tested, but I am asking with respect to the chosen
data structures.
So far it is just an array of queues selected by the (yet unmodified) objectid,
and reading is simply from any available queue.

My goal was to avoid having to hang the event on multiple lists/hlists, and
the idea is to implement reading events in insert order as follows:

- With the multi queue, the high bits of objectid will be masked out for the
  merge compare.
- Instead, they will be used to store the next_qid to read from.

For example:
- event #1 is added to queue 6
- set group->last_qid = 6
- set group->next_qid = 6 (because group->num_events == 1)
- event #2 is added to queue 13
- the next_qid bits of the last event in last_qid (6) queue are set to 13
- set group->last_qid = 13

- read() checks value of group->next_qid and reads the first event
from queue 6 (event #1)
- event #1 has 13 stored in next_qid bits so set group->next_qid = 13
- read() reads first event from queue 13 (event #2)

Permission events require special care, but that is the idea: a simple singly
linked list using qids for reading events in insert order and merging by hashed
queue.

The advantage of this method is that most of the generic code remains unaware
of the multi queue changes (see WIP) and I think it gets the job done without
complicating the code too much.
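
In pseudo-C, the bookkeeping I have in mind looks roughly like this (this is
not what the WIP branch does today, just the idea; the member names and the
event_bucket() helper are made up):

	/* Insert: append to the hashed bucket and chain the insert order */
	static void fsnotify_multiqueue_add(struct fsnotify_group *group,
					    struct fsnotify_event *event)
	{
		unsigned int q = event_bucket(event);	/* from hashed objectid */

		if (group->num_events++) {
			/* Tell the previously last event which bucket comes next */
			struct fsnotify_event *last =
				list_last_entry(&group->queues[group->last_qid],
						struct fsnotify_event, list);
			last->next_qid = q;
		} else {
			group->next_qid = q;		/* queue was empty */
		}
		list_add_tail(&event->list, &group->queues[q]);
		group->last_qid = q;
	}

	/* Read: take the head of the next_qid bucket, then follow its next_qid */
	static struct fsnotify_event *fsnotify_multiqueue_pop(struct fsnotify_group *group)
	{
		struct fsnotify_event *event;

		event = list_first_entry_or_null(&group->queues[group->next_qid],
						 struct fsnotify_event, list);
		if (event) {
			list_del_init(&event->list);
			group->num_events--;
			group->next_qid = event->next_qid;
		}
		return event;
	}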

Thoughts?

Thanks,
Amir.
Jan Kara Jan. 27, 2021, 11:24 a.m. UTC | #8
On Tue 26-01-21 18:21:26, Amir Goldstein wrote:
> On Mon, Jan 25, 2021 at 3:01 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 23-01-21 15:30:59, Amir Goldstein wrote:
> > > On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > > > > event->inode is currently used only by inotify and fanotify for merging
> > > > > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > > > > Am I missing something?
> > > > > >
> > > > > > That generally sounds good and I did notice it is strange that wd is not
> > > > > > being compared.  However, I think I was worried that comparing fid+name
> > > > > > (in following patches) would be more expensive than comparing dentry (or
> > > > > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > > > > tag/dentry/id comparison for fanotify_fid case.
> > > > >
> > > > > Yes, that could be a concern.
> > > > >
> > > > > > Given this analysis (and assuming it is correct), would you like me to
> > > > > > just go a head with the change suggested above? or anything beyond that?
> > > > >
> > > > > Let's go just with the change suggested above for now. We can work on this
> > > > > later (probably with optimizing of the fanotify merging code).
> > > > >
> > > >
> > > > Hi Jan,
> > > >
> > > > Recap:
> > > > - fanotify_merge is very inefficient and uses extensive CPU if queue contains
> > > >   many events, so it is rather easy for a poorly written listener to
> > > > cripple the system
> > > > - You had an idea to store in event->objectid a hash of all the compared
> > > >   fields (e.g. fid+name)
> > > > - I think you had an idea to keep a hash table of events in the queue
> > > > to find the
> > > >   merge candidates faster
> > > > - For internal uses, I carry a patch that limits the linear search for
> > > > last 128 events
> > > >   which is enough to relieve the CPU overuse in case of unattended long queues
> > > >
> > > > I tried looking into implementing the hash table idea, assuming I understood you
> > > > correctly and I struggled to choose appropriate table sizes. It seemed to make
> > > > sense to use a global hash table, such as inode/dentry cache for all the groups
> > > > but that would add complexity to locking rules of queue/dequeue and
> > > > group cleanup.
> > > >
> > > > A simpler solution I considered, similar to my 128 events limit patch,
> > > > is to limit
> > > > the linear search to events queued in the last X seconds.
> > > > The rationale is that event merging is not supposed to be long term at all.
> > > > If a listener fails to perform read from the queue, it is not fsnotify's job to
> > > > try and keep the queue compact. I think merging events mechanism was
> > > > mainly meant to merge short bursts of events on objects, which are quite
> > > > common and surely can happen concurrently on several objects.
> > > >
> > > > My intuition is that making event->objectid into event->hash in addition
> > > > to limiting the age of events to merge would address the real life workloads.
> > > > One question if we do choose this approach is what should the age limit be?
> > > > Should it be configurable? Default to infinity and let distro cap the age or
> > > > provide a sane default by kernel while slightly changing behavior (yes please).
> > > >
> > > > What are your thoughts about this?
> > >
> > > Aha! found it:
> > > https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
> > > You suggested a small hash table per group (128 slots).
> > >
> > > My intuition is that this will not be good enough for the worst case, which is
> > > not that hard to hit is real life:
> > > 1. Listener sets FAN_UNLIMITED_QUEUE
> > > 2. Listener adds a FAN_MARK_FILESYSTEM watch
> > > 3. Many thousands of events are queued
> > > 4. Listener lingers (due to bad implementation?) in reading events
> > > 5. Every single event now incurs a huge fanotify_merge() cost
> > >
> > > Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the
> > > problem.
> >
> > So my thought was that indeed reducing the overhead of merging by a factor
> > of 128 should be enough for any practical case as much as I agree that in
> > principle the computational complexity remains the same. And I've picked
> > per-group hash table to avoid interferences among notification groups and
> > to keep locking simple. That being said I'm not opposed to combining this
> > with a limit on the number of elements traversed in a hash chain (e.g.
> > those 128 you use yourself) - it will be naturally ordered by queue order
> > if we are a bit careful. This will provide efficient and effective merging
> > for ~8k queued events which seems enough to me. I find time based limits
> > not really worth it. Yes, they provide more predictable behavior but less
> > predictable runtime and overall I don't find the complexity worth the
> > benefit.
> >
> 
> Sounds reasonable.
> If you have time, please take a look at this WIP branch:
> https://github.com/amir73il/linux/commits/fanotify_merge
> and let me know if you like the direction it is taking.
> 
> This branch is only compile tested, but I am asking w.r.t to the chosen
> data structures.  So far it is just an array of queues selected by (yet
> unmodified) objectid.  Reading is just from any available queue.
> 
> My goal was to avoid having to hang the event on multiple list/hlist and
> the idea is to implement read by order of events as follows:

As a side note, since we use notification_list as a strict queue, we could
actually use a singly linked list for linking all the events (implemented
in include/linux/llist.h). That way we can save one pointer in
fsnotify_event if we wish without too much complication AFAICT. But I'm not
sure we really care.

> - With multi queue, high bit of obejctid will be masked for merge compare.
> - Instead, they will be used to store the next_qid to read from
> 
> For example:
> - event #1 is added to queue 6
> - set group->last_qid = 6
> - set group->next_qid = 6 (because group->num_events == 1)
> - event #2 is added to queue 13
> - the next_qid bits of the last event in last_qid (6) queue are set to 13
> - set group->last_qid = 13
>
> - read() checks value of group->next_qid and reads the first event
> from queue 6 (event #1)
> - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> - read() reads first event from queue 13 (event #2)

That's an interesting idea. I like it and I think it would work. Just
instead of masking, I'd use bitfields. Or we could just restrict objectid
to 32 bits and use the remaining 32 bits for the next_qid pointer. I know it
will waste some bits, but 32 bits of objectid should provide us with enough
space to avoid doing a full event comparison in most cases. BTW, regarding
naming, I find 'qid' somewhat confusing. Can we call it, say, 'next_bucket' or
something like that?
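
I.e. something like this in fsnotify_event (just a sketch of the split; the
field names are up for debate):

	struct fsnotify_event {
		struct list_head list;
		u32 key;		/* hash of all merge-compared fields */
		u32 next_bucket;	/* bucket of the next event in insert order */
	};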

> Permission events require special care, but that is the idea of a simple
> singly linked list using qid's for reading events by insert order and
> merging by hashed queue.

Why are permission events special in this regard?

								Honza
Amir Goldstein Jan. 27, 2021, 12:57 p.m. UTC | #9
On Wed, Jan 27, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 26-01-21 18:21:26, Amir Goldstein wrote:
> > On Mon, Jan 25, 2021 at 3:01 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sat 23-01-21 15:30:59, Amir Goldstein wrote:
> > > > On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > > > > > event->inode is currently used only by inotify and fanotify for merging
> > > > > > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > > > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > > > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > > > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > > > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > > > > > Am I missing something?
> > > > > > >
> > > > > > > That generally sounds good and I did notice it is strange that wd is not
> > > > > > > being compared.  However, I think I was worried that comparing fid+name
> > > > > > > (in following patches) would be more expensive than comparing dentry (or
> > > > > > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > > > > > tag/dentry/id comparison for fanotify_fid case.
> > > > > >
> > > > > > Yes, that could be a concern.
> > > > > >
> > > > > > > Given this analysis (and assuming it is correct), would you like me to
> > > > > > > just go a head with the change suggested above? or anything beyond that?
> > > > > >
> > > > > > Let's go just with the change suggested above for now. We can work on this
> > > > > > later (probably with optimizing of the fanotify merging code).
> > > > > >
> > > > >
> > > > > Hi Jan,
> > > > >
> > > > > Recap:
> > > > > - fanotify_merge is very inefficient and uses extensive CPU if queue contains
> > > > >   many events, so it is rather easy for a poorly written listener to
> > > > > cripple the system
> > > > > - You had an idea to store in event->objectid a hash of all the compared
> > > > >   fields (e.g. fid+name)
> > > > > - I think you had an idea to keep a hash table of events in the queue
> > > > > to find the
> > > > >   merge candidates faster
> > > > > - For internal uses, I carry a patch that limits the linear search for
> > > > > last 128 events
> > > > >   which is enough to relieve the CPU overuse in case of unattended long queues
> > > > >
> > > > > I tried looking into implementing the hash table idea, assuming I understood you
> > > > > correctly and I struggled to choose appropriate table sizes. It seemed to make
> > > > > sense to use a global hash table, such as inode/dentry cache for all the groups
> > > > > but that would add complexity to locking rules of queue/dequeue and
> > > > > group cleanup.
> > > > >
> > > > > A simpler solution I considered, similar to my 128 events limit patch,
> > > > > is to limit
> > > > > the linear search to events queued in the last X seconds.
> > > > > The rationale is that event merging is not supposed to be long term at all.
> > > > > If a listener fails to perform read from the queue, it is not fsnotify's job to
> > > > > try and keep the queue compact. I think merging events mechanism was
> > > > > mainly meant to merge short bursts of events on objects, which are quite
> > > > > common and surely can happen concurrently on several objects.
> > > > >
> > > > > My intuition is that making event->objectid into event->hash in addition
> > > > > to limiting the age of events to merge would address the real life workloads.
> > > > > One question if we do choose this approach is what should the age limit be?
> > > > > Should it be configurable? Default to infinity and let distro cap the age or
> > > > > provide a sane default by kernel while slightly changing behavior (yes please).
> > > > >
> > > > > What are your thoughts about this?
> > > >
> > > > Aha! found it:
> > > > https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
> > > > You suggested a small hash table per group (128 slots).
> > > >
> > > > My intuition is that this will not be good enough for the worst case, which is
> > > > not that hard to hit is real life:
> > > > 1. Listener sets FAN_UNLIMITED_QUEUE
> > > > 2. Listener adds a FAN_MARK_FILESYSTEM watch
> > > > 3. Many thousands of events are queued
> > > > 4. Listener lingers (due to bad implementation?) in reading events
> > > > 5. Every single event now incurs a huge fanotify_merge() cost
> > > >
> > > > Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the
> > > > problem.
> > >
> > > So my thought was that indeed reducing the overhead of merging by a factor
> > > of 128 should be enough for any practical case as much as I agree that in
> > > principle the computational complexity remains the same. And I've picked
> > > per-group hash table to avoid interferences among notification groups and
> > > to keep locking simple. That being said I'm not opposed to combining this
> > > with a limit on the number of elements traversed in a hash chain (e.g.
> > > those 128 you use yourself) - it will be naturally ordered by queue order
> > > if we are a bit careful. This will provide efficient and effective merging
> > > for ~8k queued events which seems enough to me. I find time based limits
> > > not really worth it. Yes, they provide more predictable behavior but less
> > > predictable runtime and overall I don't find the complexity worth the
> > > benefit.
> > >
> >
> > Sounds reasonable.
> > If you have time, please take a look at this WIP branch:
> > https://github.com/amir73il/linux/commits/fanotify_merge
> > and let me know if you like the direction it is taking.
> >
> > This branch is only compile tested, but I am asking w.r.t to the chosen
> > data structures.  So far it is just an array of queues selected by (yet
> > unmodified) objectid.  Reading is just from any available queue.
> >
> > My goal was to avoid having to hang the event on multiple list/hlist and
> > the idea is to implement read by order of events as follows:
>
> As a side note, since we use notification_list as a strict queue, we could
> actually use a singly linked list for linking all the events (implemented
> in include/linux/llist.h). That way we can save one pointer in
> fsnotify_event if we wish without too much complication AFAICT. But I'm not
> sure we really care.
>

Handling of the overflow event is going to be a bit subtle, and permission
events do not follow the strict FIFO.
Anyway, I'd rather not get into that change.

> > - With multi queue, high bit of obejctid will be masked for merge compare.
> > - Instead, they will be used to store the next_qid to read from
> >
> > For example:
> > - event #1 is added to queue 6
> > - set group->last_qid = 6
> > - set group->next_qid = 6 (because group->num_events == 1)
> > - event #2 is added to queue 13
> > - the next_qid bits of the last event in last_qid (6) queue are set to 13
> > - set group->last_qid = 13
> >
> > - read() checks value of group->next_qid and reads the first event
> > from queue 6 (event #1)
> > - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> > - read() reads first event from queue 13 (event #2)
>
> That's an interesting idea. I like it and I think it would work. Just
> instead of masking, I'd use bitfields. Or we could just restrict objectid
> to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
> will waste some bits but 32-bits of objectid should provide us with enough
> space to avoid doing full event comparison in most cases

Certainly.
The entire set of objects to compare is going to be limited to 128*128,
so 32 bits should be plenty of hash bits.
Simplicity is preferred.

>  - BTW WRT naming I
> find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
> something like that?
>

Sure. If it's going to be 32 bits, I can just call it next_key for simplicity
and store the next event's key instead of the next event's bucket.

> > Permission events require special care, but that is the idea of a simple
> > singly linked list using qid's for reading events by insert order and
> > merging by hashed queue.
>
> Why are permission events special in this regard?
>

They are not necessarily removed from the head of the queue, so a middle
event's next_key may need to be updated when one of them is removed.

I guess since permission events are not merged, they could
use their own queue. If we do not care about ordering of
permission events and non-permission events, we can treat this
as a priority queue and it will simplify things considerably.
Boosting priority of blocking hooks seems like the right thing to do.
I wonder if we could make that change?

Thanks,
Amir.
Jan Kara Jan. 27, 2021, 3:15 p.m. UTC | #10
On Wed 27-01-21 14:57:56, Amir Goldstein wrote:
> On Wed, Jan 27, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > - With multi queue, high bit of obejctid will be masked for merge compare.
> > > - Instead, they will be used to store the next_qid to read from
> > >
> > > For example:
> > > - event #1 is added to queue 6
> > > - set group->last_qid = 6
> > > - set group->next_qid = 6 (because group->num_events == 1)
> > > - event #2 is added to queue 13
> > > - the next_qid bits of the last event in last_qid (6) queue are set to 13
> > > - set group->last_qid = 13
> > >
> > > - read() checks value of group->next_qid and reads the first event
> > > from queue 6 (event #1)
> > > - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> > > - read() reads first event from queue 13 (event #2)
> >
> > That's an interesting idea. I like it and I think it would work. Just
> > instead of masking, I'd use bitfields. Or we could just restrict objectid
> > to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
> > will waste some bits but 32-bits of objectid should provide us with enough
> > space to avoid doing full event comparison in most cases
> 
> Certainly.
> The entire set of objects to compare is going to be limited to 128*128,
> so 32bit should be plenty of hash bits.
> Simplicity is preferred.
> 
> >  - BTW WRT naming I
> > find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
> > something like that?
> >
> 
> Sure. If its going to be 32bit, I can just call it next_key for simplicity
> and store the next event key instead of the next event bucket.
> 
> > > Permission events require special care, but that is the idea of a simple
> > > singly linked list using qid's for reading events by insert order and
> > > merging by hashed queue.
> >
> > Why are permission events special in this regard?
> >
> 
> They are not removed from the head of the queue, so
> middle event next_key may need to be updated when they
> are removed.

Oh, you mean the special case when we receive a signal and thus remove a
permission event from the notification queue? I forgot about that one and
yes, it needs special handling...

> I guess since permission events are not merged, they could
> use their own queue. If we do not care about ordering of
> permission events and non-permission events, we can treat this
> as a priority queue and it will simplify things considerably.
> Boosting priority of blocking hooks seems like the right thing to do.
> I wonder if we could make that change?

Yes, permission events are not merged and I'm not aware of any users
actually mixing permission and other events in a notification group. OTOH
I'm somewhat reluctant to reorder events that much. It could break
someone, it could starve notification events, etc. AFAIU the pain with
permission events is updating the ->next_key field in case we want to remove
an unreported permission event. Finding the previous entry with this scheme is
indeed somewhat painful (we'd have to walk the queue, which requires
maintaining a 'cur' pointer for every queue). So maybe growing fsnotify_event
by one pointer to hold a singly linked list for a hash chain would be
simplest in the end? Then removing from the hash chain in the corner case of
tearing a permission event out is simple enough...
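
Removing it would then be something like this (the next_hash member, the
merge_hash table and the mask are hypothetical, of course):

	/* Unlink an event from its (singly linked) hash chain */
	static void fsnotify_unhash_event(struct fsnotify_group *group,
					  struct fsnotify_event *event)
	{
		struct fsnotify_event **p =
			&group->merge_hash[event->key & FSNOTIFY_HASH_MASK];

		while (*p && *p != event)
			p = &(*p)->next_hash;
		if (*p)
			*p = event->next_hash;
	}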

								Honza
Amir Goldstein Jan. 27, 2021, 6:03 p.m. UTC | #11
On Wed, Jan 27, 2021 at 5:15 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 27-01-21 14:57:56, Amir Goldstein wrote:
> > On Wed, Jan 27, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > - With multi queue, high bit of obejctid will be masked for merge compare.
> > > > - Instead, they will be used to store the next_qid to read from
> > > >
> > > > For example:
> > > > - event #1 is added to queue 6
> > > > - set group->last_qid = 6
> > > > - set group->next_qid = 6 (because group->num_events == 1)
> > > > - event #2 is added to queue 13
> > > > - the next_qid bits of the last event in last_qid (6) queue are set to 13
> > > > - set group->last_qid = 13
> > > >
> > > > - read() checks value of group->next_qid and reads the first event
> > > > from queue 6 (event #1)
> > > > - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> > > > - read() reads first event from queue 13 (event #2)
> > >
> > > That's an interesting idea. I like it and I think it would work. Just
> > > instead of masking, I'd use bitfields. Or we could just restrict objectid
> > > to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
> > > will waste some bits but 32-bits of objectid should provide us with enough
> > > space to avoid doing full event comparison in most cases
> >
> > Certainly.
> > The entire set of objects to compare is going to be limited to 128*128,
> > so 32bit should be plenty of hash bits.
> > Simplicity is preferred.
> >
> > >  - BTW WRT naming I
> > > find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
> > > something like that?
> > >
> >
> > Sure. If its going to be 32bit, I can just call it next_key for simplicity
> > and store the next event key instead of the next event bucket.
> >
> > > > Permission events require special care, but that is the idea of a simple
> > > > singly linked list using qid's for reading events by insert order and
> > > > merging by hashed queue.
> > >
> > > Why are permission events special in this regard?
> > >
> >
> > They are not removed from the head of the queue, so
> > middle event next_key may need to be updated when they
> > are removed.
>
> Oh, you mean the special case when we receive a signal and thus remove
> permission event from a notification queue? I forgot about that one and
> yes, it needs a special handling...
>
> > I guess since permission events are not merged, they could
> > use their own queue. If we do not care about ordering of
> > permission events and non-permission events, we can treat this
> > as a priority queue and it will simplify things considerably.
> > Boosting priority of blocking hooks seems like the right thing to do.
> > I wonder if we could make that change?
>
> Yes, permission events are not merged and I'm not aware of any users
> actually mixing permission and other events in a notification group. OTOH
> I'm somewhat reluctant to reorder events that much. It could break
> someone, it could starve notification events, etc. AFAIU the pain with
> permission events is updating the ->next_key field in case we want to remove
> unreported permission event. Finding previous entry with this scheme is
> indeed somewhat painful (we'd have to walk the queue which requires
> maintaining 'cur' pointer for every queue). So maybe growing fsnotify_event
> by one pointer to contain single linked list for a hash chain would be
> simplest in the end? Then removing from the hash chain in the corner case of
> tearing permission event out is simple enough...
>

Better to disable the multi queue for the very uninteresting corner case
(mixing permission and non-permission events). The simplest thing to do is to
enable the multi queue only for FAN_CLASS_NOTIF. I suppose users do not use
high priority classes for the non-permission event use case, and if they do,
they will get fewer merged events - no big deal.

The important things we get are:
1. Code remains simple
2. Deterministic CPU usage (the linear search is capped at 128 events)
3. In the most common use case of an async change listener, we can merge
   events on up to 16K unique objects (128 buckets * 128 events), which should
   be sufficient

I'll try to write this up.

Thanks,
Amir.
Jan Kara Jan. 28, 2021, 10:27 a.m. UTC | #12
On Wed 27-01-21 20:03:23, Amir Goldstein wrote:
> On Wed, Jan 27, 2021 at 5:15 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 27-01-21 14:57:56, Amir Goldstein wrote:
> > > On Wed, Jan 27, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > > - With multi queue, high bit of obejctid will be masked for merge compare.
> > > > > - Instead, they will be used to store the next_qid to read from
> > > > >
> > > > > For example:
> > > > > - event #1 is added to queue 6
> > > > > - set group->last_qid = 6
> > > > > - set group->next_qid = 6 (because group->num_events == 1)
> > > > > - event #2 is added to queue 13
> > > > > - the next_qid bits of the last event in last_qid (6) queue are set to 13
> > > > > - set group->last_qid = 13
> > > > >
> > > > > - read() checks value of group->next_qid and reads the first event
> > > > > from queue 6 (event #1)
> > > > > - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> > > > > - read() reads first event from queue 13 (event #2)
> > > >
> > > > That's an interesting idea. I like it and I think it would work. Just
> > > > instead of masking, I'd use bitfields. Or we could just restrict objectid
> > > > to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
> > > > will waste some bits but 32-bits of objectid should provide us with enough
> > > > space to avoid doing full event comparison in most cases
> > >
> > > Certainly.
> > > The entire set of objects to compare is going to be limited to 128*128,
> > > so 32bit should be plenty of hash bits.
> > > Simplicity is preferred.
> > >
> > > >  - BTW WRT naming I
> > > > find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
> > > > something like that?
> > > >
> > >
> > > Sure. If its going to be 32bit, I can just call it next_key for simplicity
> > > and store the next event key instead of the next event bucket.
> > >
> > > > > Permission events require special care, but that is the idea of a simple
> > > > > singly linked list using qid's for reading events by insert order and
> > > > > merging by hashed queue.
> > > >
> > > > Why are permission events special in this regard?
> > > >
> > >
> > > They are not removed from the head of the queue, so
> > > middle event next_key may need to be updated when they
> > > are removed.
> >
> > Oh, you mean the special case when we receive a signal and thus remove
> > permission event from a notification queue? I forgot about that one and
> > yes, it needs a special handling...
> >
> > > I guess since permission events are not merged, they could
> > > use their own queue. If we do not care about ordering of
> > > permission events and non-permission events, we can treat this
> > > as a priority queue and it will simplify things considerably.
> > > Boosting priority of blocking hooks seems like the right thing to do.
> > > I wonder if we could make that change?
> >
> > Yes, permission events are not merged and I'm not aware of any users
> > actually mixing permission and other events in a notification group. OTOH
> > I'm somewhat reluctant to reorder events that much. It could break
> > someone, it could starve notification events, etc. AFAIU the pain with
> > permission events is updating the ->next_key field in case we want to remove
> > an unreported permission event. Finding the previous entry with this scheme is
> > indeed somewhat painful (we'd have to walk the queue, which requires
> > maintaining a 'cur' pointer for every queue). So maybe growing fsnotify_event
> > by one pointer to hold a singly linked list for a hash chain would be
> > simplest in the end? Then removing from the hash chain in the corner case of
> > tearing a permission event out is simple enough...
> >
> 
> Better to disable the multi queue for the very uninteresting corner case (mixing
> permission and non-permission events). The simplest thing to do is to enable multi
> queue only for FAN_CLASS_NOTIF. I suppose users do not use high priority
> classes for the non-permission event use case, and if they do, they will get fewer
> merged events - no big deal.
> 
> The important things we get are:
> 1. Code remains simple
> 2. Deterministic CPU usage (linear search is capped to 128 events)
> 3. In the most common use case of an async change listener we can merge
>     events on up to 16K unique objects, which should be sufficient
> 
> I'll try to write this up.

OK, sounds fine to me. We can always reconsider if some users come back
complaining...

								Honza
Amir Goldstein Jan. 28, 2021, 6:50 p.m. UTC | #13
On Thu, Jan 28, 2021 at 12:27 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 27-01-21 20:03:23, Amir Goldstein wrote:
> > On Wed, Jan 27, 2021 at 5:15 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Wed 27-01-21 14:57:56, Amir Goldstein wrote:
> > > > On Wed, Jan 27, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > - With multi queue, high bits of objectid will be masked for merge compare.
> > > > > > - Instead, they will be used to store the next_qid to read from
> > > > > >
> > > > > > For example:
> > > > > > - event #1 is added to queue 6
> > > > > > - set group->last_qid = 6
> > > > > > - set group->next_qid = 6 (because group->num_events == 1)
> > > > > > - event #2 is added to queue 13
> > > > > > - the next_qid bits of the last event in last_qid (6) queue are set to 13
> > > > > > - set group->last_qid = 13
> > > > > >
> > > > > > - read() checks value of group->next_qid and reads the first event
> > > > > > from queue 6 (event #1)
> > > > > > - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> > > > > > - read() reads first event from queue 13 (event #2)
> > > > >
> > > > > That's an interesting idea. I like it and I think it would work. Just
> > > > > instead of masking, I'd use bitfields. Or we could just restrict objectid
> > > > > to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
> > > > > will waste some bits but 32-bits of objectid should provide us with enough
> > > > > space to avoid doing full event comparison in most cases
> > > >
> > > > Certainly.
> > > > The entire set of objects to compare is going to be limited to 128*128,
> > > > so 32 bits should be plenty of hash bits.
> > > > Simplicity is preferred.
> > > >
> > > > >  - BTW WRT naming I
> > > > > find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
> > > > > something like that?
> > > > >
> > > >
> > > > Sure. If it's going to be 32-bit, I can just call it next_key for simplicity
> > > > and store the next event key instead of the next event bucket.
> > > >
> > > > > > Permission events require special care, but that is the idea of a simple
> > > > > > singly linked list using qid's for reading events by insert order and
> > > > > > merging by hashed queue.
> > > > >
> > > > > Why are permission events special in this regard?
> > > > >
> > > >
> > > > They are not necessarily removed from the head of the queue, so the
> > > > next_key of an event in the middle of the queue may need to be
> > > > updated when they are removed.
> > >
> > > Oh, you mean the special case when we receive a signal and thus remove
> > > permission event from a notification queue? I forgot about that one and
> > > yes, it needs a special handling...
> > >
> > > > I guess since permission events are not merged, they could
> > > > use their own queue. If we do not care about ordering of
> > > > permission events and non-permission events, we can treat this
> > > > as a priority queue and it will simplify things considerably.
> > > > Boosting priority of blocking hooks seems like the right thing to do.
> > > > I wonder if we could make that change?
> > >
> > > Yes, permission events are not merged and I'm not aware of any users
> > > actually mixing permission and other events in a notification group. OTOH
> > > I'm somewhat reluctant to reorder events that much. It could break
> > > someone, it could starve notification events, etc. AFAIU the pain with
> > > permission events is updating the ->next_key field in case we want to remove
> > > an unreported permission event. Finding the previous entry with this scheme is
> > > indeed somewhat painful (we'd have to walk the queue, which requires
> > > maintaining a 'cur' pointer for every queue). So maybe growing fsnotify_event
> > > by one pointer to hold a singly linked list for a hash chain would be
> > > simplest in the end? Then removing from the hash chain in the corner case of
> > > tearing a permission event out is simple enough...
> > >
> >
> > Better to disable the multi queue for the very uninteresting corner case (mixing
> > permission and non-permission events). The simplest thing to do is to enable multi
> > queue only for FAN_CLASS_NOTIF. I suppose users do not use high priority
> > classes for the non-permission event use case, and if they do, they will get fewer
> > merged events - no big deal.
> >
> > The important things we get are:
> > 1. Code remains simple
> > 2. Deterministic CPU usage (linear search is capped to 128 events)
> > 3. In the most common use case of an async change listener we can merge
> >     events on up to 16K unique objects, which should be sufficient
> >
> > I'll try to write this up.
>
> OK, sounds fine to me. We can always reconsider if some users come back
> complaining...
>

Pushed that to the fanotify_merge branch.
It's even working ;-)

Please let me know if this looks fine so far, and I will complete the rest,
test performance, and post the patches.

Remaining:
- Let event->key be a hash of all merge-compared fields
- Limit the linear search per chain

Thanks,
Amir.
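
[Editor's note: to make the scheme discussed above a bit more concrete, below
is a minimal userspace model of the hashed merge queues: a capped linear
search per bucket for merging, plus one extra per-event pointer that preserves
global insertion order for read(), roughly along the lines of the singly
linked hash chain suggested earlier in the thread. All names (mq_group,
mq_event, NUM_BUCKETS, MAX_CHAIN_LEN) are made up for illustration and this
is not the code on the fanotify_merge branch.]

/*
 * Toy model of the hashed notification queues: events are merged by a
 * capped linear search within a small per-key bucket, while a second
 * per-event pointer keeps global insertion order for read().
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_BUCKETS	128
#define MAX_CHAIN_LEN	128	/* cap the linear merge search per bucket */

struct mq_event {
	uint32_t key;			/* hash of the merge-compared fields */
	uint32_t mask;			/* event types, merged by OR-ing */
	struct mq_event *next_in_bucket;/* hash chain used for merging */
	struct mq_event *next_in_order;	/* global insertion order for read() */
};

struct mq_group {
	struct mq_event *buckets[NUM_BUCKETS];
	struct mq_event *first;		/* oldest queued event */
	struct mq_event *last;		/* newest queued event */
};

/*
 * Try to merge @mask into an already queued event with the same key.
 * A real implementation would fall back to comparing the full event
 * (fid, path, ...) on a key match; the bare key check stands in for it.
 */
static int mq_try_merge(struct mq_group *g, uint32_t key, uint32_t mask)
{
	struct mq_event *e = g->buckets[key % NUM_BUCKETS];
	int walked = 0;

	for (; e && walked < MAX_CHAIN_LEN; e = e->next_in_bucket, walked++) {
		if (e->key == key) {
			e->mask |= mask;
			return 1;
		}
	}
	return 0;
}

/* Queue a new event unless it can be merged into an existing one. */
static void mq_add(struct mq_group *g, uint32_t key, uint32_t mask)
{
	struct mq_event *e;

	if (mq_try_merge(g, key, mask))
		return;

	e = calloc(1, sizeof(*e));
	if (!e)
		return;		/* a real queue would account a lost event */
	e->key = key;
	e->mask = mask;

	/* Head insertion into the hash bucket is enough for merging. */
	e->next_in_bucket = g->buckets[key % NUM_BUCKETS];
	g->buckets[key % NUM_BUCKETS] = e;

	/* Tail insertion keeps the order in which events were queued. */
	if (g->last)
		g->last->next_in_order = e;
	else
		g->first = e;
	g->last = e;
}

int main(void)
{
	struct mq_group g = { 0 };

	mq_add(&g, 6, 0x1);	/* event #1 lands in bucket 6 */
	mq_add(&g, 13, 0x2);	/* event #2 lands in bucket 13 */
	mq_add(&g, 6, 0x4);	/* same key: merged into event #1 */

	/* "read()" still reports events in insertion order. */
	for (struct mq_event *e = g.first; e; e = e->next_in_order)
		printf("key=%u mask=0x%x\n", e->key, e->mask);
	return 0;
}

With the extra pointer, tearing an unreported permission event out of the
middle of the queue needs only the usual list removal plus an unlink from its
short hash chain, instead of fixing up next_key values stored in other events.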
diff mbox series

Patch

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 98c3cbf29003..dab7e9895e02 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -282,6 +282,7 @@  struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 	struct inode *id = fanotify_fid_inode(inode, mask, data, data_type);
 	const struct path *path = fsnotify_data_path(data, data_type);
+	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
 
 	/*
 	 * For queues with unlimited length lost events are not expected and
@@ -312,7 +313,12 @@  struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	if (!event)
 		goto out;
 init: __maybe_unused
-	fsnotify_init_event(&event->fse, inode);
+	/*
+	 * Use the dentry instead of inode as tag for event queue, so event
+	 * reported on parent is merged with event reported on child when both
+	 * directory and child watches exist.
+	 */
+	fsnotify_init_event(&event->fse, (void *)dentry ?: inode);
 	event->mask = mask;
 	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
 		event->pid = get_pid(task_pid(current));
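
[Editor's note: as a footnote to the patch above, here is a tiny hedged
illustration of why tagging events with the dentry (when one is available)
lets the duplicate parent/child reports merge: both marks end up queueing an
event with the same object pointer, so a pointer-equality check in the merge
path can collapse them. The struct and function names below are hypothetical
and this is not the fanotify merge code itself.]

#include <stdio.h>

struct tagged_event {
	const void *tag;	/* dentry if available, otherwise inode */
	unsigned int mask;
};

/* Events with the same tag and mask are candidates for merging. */
static int can_merge(const struct tagged_event *a,
		     const struct tagged_event *b)
{
	return a->tag == b->tag && a->mask == b->mask;
}

int main(void)
{
	int child_dentry;	/* stand-in for the child's dentry */

	/* Event queued because of the mark on the parent directory ... */
	struct tagged_event on_parent = { &child_dentry, 0x1 };
	/* ... and because of the mark on the child itself: same tag. */
	struct tagged_event on_child = { &child_dentry, 0x1 };

	printf("mergeable: %d\n", can_merge(&on_parent, &on_child));
	return 0;
}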