
[39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

Message ID 20230109205336.3665937-40-surenb@google.com (mailing list archive)
State New
Series Per-VMA locks

Commit Message

Suren Baghdasaryan Jan. 9, 2023, 8:53 p.m. UTC
call_rcu() can take a long time when callback offloading is enabled.
Its use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 19 +++++++++--
 kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
 mm/init-mm.c             |  3 ++
 mm/mmap.c                |  1 +
 5 files changed, 82 insertions(+), 10 deletions(-)

Comments

Michal Hocko Jan. 17, 2023, 3:57 p.m. UTC | #1
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.

What kind of regressions.

> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

Please add some data to justify this additional complexity.
Suren Baghdasaryan Jan. 18, 2023, 1:19 a.m. UTC | #2
On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions.
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.

Sorry, should have done that in the first place. A 4.3% regression was
noticed when running the execl test from the unixbench suite. The spawn
test also showed a 1.6% regression. Profiling revealed that vma freeing
was taking longer due to call_rcu(), which is slow when RCU callback
offloading is
enabled. I asked Paul McKenney and he explained to me that because the
callbacks are offloaded to some other kthread, possibly running on
some other CPU, it is necessary to use explicit locking.  Locking on a
per-call_rcu() basis would result in excessive contention during
callback flooding. So, by batching call_rcu() work we cut that
overhead and reduce this lock contention.
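
For illustration, a minimal sketch of that batching; the per-mm list
fields, VM_AREA_FREE_LIST_MAX and drain_free_vmas() follow the names used
later in this thread, but this is an approximation rather than the exact
patch:

void vm_area_free(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;
        bool drain;

        free_anon_vma_name(vma);

        /* Queue the VMA on a per-mm list instead of a per-VMA call_rcu(). */
        spin_lock(&mm->vma_free_list.lock);
        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
        mm->vma_free_list.size++;
        drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
        spin_unlock(&mm->vma_free_list.lock);

        /* drain_free_vmas() detaches the group and issues one call_rcu(). */
        if (drain)
                drain_free_vmas(mm);
}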


> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 18, 2023, 9:49 a.m. UTC | #3
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed.
> >
> > What kind of regressions.
> >
> > > To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > Please add some data to justify this additional complexity.
> 
> Sorry, should have done that in the first place. A 4.3% regression was
> noticed when running execl test from unixbench suite. spawn test also
> showed 1.6% regression. Profiling revealed that vma freeing was taking
> longer due to call_rcu() which is slow when RCU callback offloading is
> enabled.

Could you be more specific? vma freeing is async with the RCU so how
come this has resulted in a regression? Is there any heavy
rcu_synchronize in the exec path? That would be an interesting
information.
Suren Baghdasaryan Jan. 18, 2023, 6:04 p.m. UTC | #4
On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed.
> > >
> > > What kind of regressions.
> > >
> > > > To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > Please add some data to justify this additional complexity.
> >
> > Sorry, should have done that in the first place. A 4.3% regression was
> > noticed when running execl test from unixbench suite. spawn test also
> > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > longer due to call_rcu() which is slow when RCU callback offloading is
> > enabled.
>
> Could you be more specific? vma freeing is async with the RCU so how
> come this has resulted in a regression? Is there any heavy
> rcu_synchronize in the exec path? That would be an interesting
> information.

No, there is no heavy rcu_synchronize() or any other additional
synchronous load in the exit path. It's the call_rcu() which can block
the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
other call_rcu()'s going on in parallel. Note that call_rcu() calls
rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
revealed that this function was taking multiple ms (don't recall the
actual number, sorry). Paul's explanation implied that this happens
due to contention on the locks taken in this function. For more
in-depth details I'll have to ask Paul for help :) This code is quite
complex and I don't know all the details of RCU implementation.
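
As a purely illustrative aside, this kind of slowdown could be observed by
timing a burst of call_rcu() invocations on an offloaded CPU; the sketch
below is not from this thread and only shows the idea:

/* Test-only sketch: measure how long it takes to queue a burst of callbacks. */
static void noop_rcu_cb(struct rcu_head *head)
{
}

static void time_call_rcu_burst(struct rcu_head *heads, int n)
{
        u64 start = ktime_get_ns();
        int i;

        for (i = 0; i < n; i++)
                call_rcu(&heads[i], noop_rcu_cb);

        pr_info("queued %d callbacks in %llu ns\n", n,
                (unsigned long long)(ktime_get_ns() - start));
}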


> --
> Michal Hocko
> SUSE Labs
Paul E. McKenney Jan. 18, 2023, 6:34 p.m. UTC | #5
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was
> > > noticed when running execl test from unixbench suite. spawn test also
> > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how
> > come this has resulted in a regression? Is there any heavy
> > rcu_synchronize in the exec path? That would be an interesting
> > information.
> 
> No, there is no heavy rcu_synchronize() or any other additional
> synchronous load in the exit path. It's the call_rcu() which can block
> the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> other call_rcu()'s going on in parallel. Note that call_rcu() calls
> rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> revealed that this function was taking multiple ms (don't recall the
> actual number, sorry). Paul's explanation implied that this happens
> due to contention on the locks taken in this function. For more
> in-depth details I'll have to ask Paul for help :) This code is quite
> complex and I don't know all the details of RCU implementation.

There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_LAZY_RCU=y in
order to save power by batching RCU work over multiple call_rcu()
invocations.  If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead.  Doing this gets back
to the old-school non-laziness, but can of course consume more power.
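
As a sketch only (the vm_rcu field, the callback and the slab cache name
here are assumptions rather than code from this series), such a call site
could opt out of the laziness like this:

static void vm_area_free_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct, vm_rcu);

        kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
        free_anon_vma_name(vma);
        /* Like call_rcu(), but not subject to lazy batching. */
        call_rcu_hurry(&vma->vm_rcu, vm_area_free_rcu_cb);
}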

Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay.  The reason for this delay is the use of a separate
->nocb_bypass list.  As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist.  This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.

Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

Or have I lost track of some other slow case?

							Thanx, Paul
Suren Baghdasaryan Jan. 18, 2023, 7:01 p.m. UTC | #6
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed.
> > > > >
> > > > > What kind of regressions.
> > > > >
> > > > > > To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > Please add some data to justify this additional complexity.
> > > >
> > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > noticed when running execl test from unixbench suite. spawn test also
> > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > enabled.
> > >
> > > Could you be more specific? vma freeing is async with the RCU so how
> > > come this has resulted in a regression? Is there any heavy
> > > rcu_synchronize in the exec path? That would be an interesting
> > > information.
> >
> > No, there is no heavy rcu_synchronize() or any other additional
> > synchronous load in the exit path. It's the call_rcu() which can block
> > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > revealed that this function was taking multiple ms (don't recall the
> > actual number, sorry). Paul's explanation implied that this happens
> > due to contention on the locks taken in this function. For more
> > in-depth details I'll have to ask Paul for help :) This code is quite
> > complex and I don't know all the details of RCU implementation.
>
> There are a couple of possibilities here.
>
> First, if I am remembering correctly, the time between the call_rcu()
> and invocation of the corresponding callback was taking multiple seconds,
> but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> order to save power by batching RCU work over multiple call_rcu()
> invocations.  If this is causing a problem for a given call site, the
> shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> to the old-school non-laziness, but can of course consume more power.

That would not be the case because CONFIG_LAZY_RCU was not an option
at the time I was profiling this issue.
Lazy RCU would be a great option to replace this patch but
unfortunately it's not the default behavior, so I would still have to
implement this batching in case lazy RCU is not enabled.

>
> Second, there is a much shorter one-jiffy delay between the call_rcu()
> and the invocation of the corresponding callback in kernels built with
> either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> of this delay is to avoid lock contention, and so this delay is incurred
> only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> invoking call_rcu() at least 16 times within a given jiffy will incur
> the added delay.  The reason for this delay is the use of a separate
> ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> lock contention on the main ->cblist.  This is not needed in old-school
> kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> (including most datacenter kernels) because in that case the callbacks
> enqueued by call_rcu() are touched only by the corresponding CPU, so
> that there is no need for locks.

I believe this is the reason in my profiled case.

>
> Third, if you are instead seeing multiple milliseconds of CPU consumed by
> call_rcu() in the common case (for example, without the aid of interrupts,
> NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

I don't think I've seen such a case.
Thanks for clarifications, Paul!

>
> Or have I lost track of some other slow case?
>
>                                                         Thanx, Paul
Paul E. McKenney Jan. 18, 2023, 8:20 p.m. UTC | #7
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed.
> > > > > >
> > > > > > What kind of regressions.
> > > > > >
> > > > > > > To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > Please add some data to justify this additional complexity.
> > > > >
> > > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > > noticed when running execl test from unixbench suite. spawn test also
> > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > > enabled.
> > > >
> > > > Could you be more specific? vma freeing is async with the RCU so how
> > > > come this has resulted in a regression? Is there any heavy
> > > > rcu_synchronize in the exec path? That would be an interesting
> > > > information.
> > >
> > > No, there is no heavy rcu_synchronize() or any other additional
> > > synchronous load in the exit path. It's the call_rcu() which can block
> > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > > revealed that this function was taking multiple ms (don't recall the
> > > actual number, sorry). Paul's explanation implied that this happens
> > > due to contention on the locks taken in this function. For more
> > > in-depth details I'll have to ask Paul for help :) This code is quite
> > > complex and I don't know all the details of RCU implementation.
> >
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.

Whew!!!  ;-)

> Thanks for clarifications, Paul!

No problem!

							Thanx, Paul
Michal Hocko Jan. 19, 2023, 12:52 p.m. UTC | #8
On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> >
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.
> Thanks for clarifications, Paul!

Thanks for the explanation, Paul. I have to say this has caught me by
surprise. There are just not enough details about the benchmark to
understand what is going on, but I find it rather surprising that
call_rcu can induce a higher overhead than the actual kmem_cache_free
which is the callback. My naive understanding has been that call_rcu is
a really fast way to defer the execution to an RCU-safe context to do
the final cleanup.
Michal Hocko Jan. 19, 2023, 12:59 p.m. UTC | #9
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

After some more clarification I can understand how call_rcu might not be
super happy about thousands of callbacks to be invoked and I do agree
that this is not really optimal.

On the other hand I do not like this solution much either.
VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
much with processes with a huge number of vmas either. It would still be
thousands of callbacks to be scheduled without a good reason.

Instead, are there any other cases than remove_vma that need this
batching? We could easily just link all the vmas into linked list and
use a single call_rcu instead, no? This would both simplify the
implementation, remove the scaling issue as well and we do not have to
argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
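
A rough sketch of that single-call_rcu() idea could look like the
following; the vm_free_list/vm_rcu fields, vm_area_cachep and the helper
names are assumptions for illustration, not code from the series:

static void vma_list_free_rcu(struct rcu_head *head)
{
        struct vm_area_struct *first = container_of(head, struct vm_area_struct, vm_rcu);
        struct vm_area_struct *vma, *next;

        /* Free the rest of the batch, then the VMA that carried the rcu_head. */
        list_for_each_entry_safe(vma, next, &first->vm_free_list, vm_free_list)
                kmem_cache_free(vm_area_cachep, vma);
        kmem_cache_free(vm_area_cachep, first);
}

/* Called once per exit_mmap()/remove_mt() with all removed VMAs linked together. */
static void vma_list_free(struct list_head *to_free)
{
        struct vm_area_struct *first;

        if (list_empty(to_free))
                return;

        first = list_first_entry(to_free, struct vm_area_struct, vm_free_list);
        list_del_init(&first->vm_free_list);
        /* 'first' carries the rcu_head; the remaining VMAs hang off its list. */
        list_splice_init(to_free, &first->vm_free_list);
        call_rcu(&first->vm_rcu, vma_list_free_rcu);
}
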
Suren Baghdasaryan Jan. 19, 2023, 6:52 p.m. UTC | #10
On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> After some more clarification I can understand how call_rcu might not be
> super happy about thousands of callbacks to be invoked and I do agree
> that this is not really optimal.
>
> On the other hand I do not like this solution much either.
> VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> much with processes with a huge number of vmas either. It would still be
> thousands of callbacks to be scheduled without a good reason.
>
> Instead, are there any other cases than remove_vma that need this
> batching? We could easily just link all the vmas into linked list and
> use a single call_rcu instead, no? This would both simplify the
> implementation, remove the scaling issue as well and we do not have to
> argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.

Yes, I agree the solution is not stellar. I wanted something simple
but this is probably too simple. OTOH keeping all dead vm_area_structs
on the list without hooking up a shrinker (additional complexity) does
not sound too appealing either. WDYT about time domain throttling to
limit draining the list to say once per second like this:

void vm_area_free(struct vm_area_struct *vma)
{
       struct mm_struct *mm = vma->vm_mm;
       bool drain;

       free_anon_vma_name(vma);

       spin_lock(&mm->vma_free_list.lock);
       list_add(&vma->vm_free_list, &mm->vma_free_list.head);
       mm->vma_free_list.size++;
-       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+       drain = jiffies > mm->last_drain_tm + HZ;

       spin_unlock(&mm->vma_free_list.lock);

-       if (drain)
+       if (drain) {
              drain_free_vmas(mm);
+             mm->last_drain_tm = jiffies;
+       }
}

Ultimately we want to prevent very frequent call_rcu() calls, so
throttling in the time domain seems appropriate. That's the simplest
way I can think of to address your concern about a quick spike in VMA
freeing. It does not place any restriction on the list size and we
might have excessive dead vm_area_structs if after a large spike there
are no vm_area_free() calls but I don't know if that's a real problem,
so not sure we should be addressing it at this time. WDYT?
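
One hedged side note on the sketch above: raw jiffies comparisons are
usually written with time_after() so they stay correct across jiffies
wraparound, and last_drain_tm would want to be updated under the same lock
so that two racing frees do not both drain, e.g. (same assumed fields as
above):

        spin_lock(&mm->vma_free_list.lock);
        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
        mm->vma_free_list.size++;
        drain = time_after(jiffies, mm->last_drain_tm + HZ);
        if (drain)
                mm->last_drain_tm = jiffies;    /* claim the drain under the lock */
        spin_unlock(&mm->vma_free_list.lock);

        if (drain)
                drain_free_vmas(mm);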


>
> --
> Michal Hocko
> SUSE Labs
Paul E. McKenney Jan. 19, 2023, 7:17 p.m. UTC | #11
On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> [...]
> > > There are a couple of possibilities here.
> > >
> > > First, if I am remembering correctly, the time between the call_rcu()
> > > and invocation of the corresponding callback was taking multiple seconds,
> > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > order to save power by batching RCU work over multiple call_rcu()
> > > invocations.  If this is causing a problem for a given call site, the
> > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > to the old-school non-laziness, but can of course consume more power.
> > 
> > That would not be the case because CONFIG_LAZY_RCU was not an option
> > at the time I was profiling this issue.
> > Lazy RCU would be a great option to replace this patch but
> > unfortunately it's not the default behavior, so I would still have to
> > implement this batching in case lazy RCU is not enabled.
> > 
> > >
> > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > and the invocation of the corresponding callback in kernels built with
> > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > of this delay is to avoid lock contention, and so this delay is incurred
> > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > the added delay.  The reason for this delay is the use of a separate
> > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > lock contention on the main ->cblist.  This is not needed in old-school
> > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > (including most datacenter kernels) because in that case the callbacks
> > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > that there is no need for locks.
> > 
> > I believe this is the reason in my profiled case.
> > 
> > >
> > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > 
> > I don't think I've seen such a case.
> > Thanks for clarifications, Paul!
> 
> Thanks for the explanation Paul. I have to say this has caught me as a
> surprise. There are just not enough details about the benchmark to
> understand what is going on but I find it rather surprising that
> call_rcu can induce a higher overhead than the actual kmem_cache_free
> which is the callback. My naive understanding has been that call_rcu is
> really fast way to defer the execution to the RCU safe context to do the
> final cleanup.

If I am following along correctly (ha!), then your "induce a higher
overhead" should be something like "induce a higher to-kfree() latency".

Of course, there already is a higher latency-to-kfree via call_rcu()
than via a direct call to kfree(), and callback-offload CPUs that are
being flooded with callbacks raise that latency a jiffy or so more in
order to avoid lock contention.

If this becomes a problem, the callback-offloading code can be a bit
smarter about avoiding lock contention, but need to see a real problem
before I make that change.  But if there is a real problem I will of
course fix it.

Or did I miss a turn in this discussion?

							Thanx, Paul
Paul E. McKenney Jan. 19, 2023, 7:20 p.m. UTC | #12
On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > thousands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either. WDYT about time domain throttling to
> limit draining the list to say once per second like this:
> 
> void vm_area_free(struct vm_area_struct *vma)
> {
>        struct mm_struct *mm = vma->vm_mm;
>        bool drain;
> 
>        free_anon_vma_name(vma);
> 
>        spin_lock(&mm->vma_free_list.lock);
>        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
>        mm->vma_free_list.size++;
> -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> +       drain = jiffies > mm->last_drain_tm + HZ;
> 
>        spin_unlock(&mm->vma_free_list.lock);
> 
> -       if (drain)
> +       if (drain) {
>               drain_free_vmas(mm);
> +             mm->last_drain_tm = jiffies;
> +       }
> }
> 
> Ultimately we want to prevent very frequent call_rcu() calls, so
> throttling in the time domain seems appropriate. That's the simplest
> way I can think of to address your concern about a quick spike in VMA
> freeing. It does not place any restriction on the list size and we
> might have excessive dead vm_area_structs if after a large spike there
> are no vm_area_free() calls but I don't know if that's a real problem,
> so not sure we should be addressing it at this time. WDYT?

Just to double-check, we really did try the very frequent call_rcu()
invocations and we really did see a problem, correct?

Although it is not perfect, call_rcu() is designed to take a fair amount
of abuse.  So if we didn't see a real problem, the frequent call_rcu()
invocations might be a bit simpler.

							Thanx, Paul
Suren Baghdasaryan Jan. 19, 2023, 7:47 p.m. UTC | #13
On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > thousands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either. WDYT about time domain throttling to
> > limit draining the list to say once per second like this:
> >
> > void vm_area_free(struct vm_area_struct *vma)
> > {
> >        struct mm_struct *mm = vma->vm_mm;
> >        bool drain;
> >
> >        free_anon_vma_name(vma);
> >
> >        spin_lock(&mm->vma_free_list.lock);
> >        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> >        mm->vma_free_list.size++;
> > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > +       drain = jiffies > mm->last_drain_tm + HZ;
> >
> >        spin_unlock(&mm->vma_free_list.lock);
> >
> > -       if (drain)
> > +       if (drain) {
> >               drain_free_vmas(mm);
> > +             mm->last_drain_tm = jiffies;
> > +       }
> > }
> >
> > Ultimately we want to prevent very frequent call_rcu() calls, so
> > throttling in the time domain seems appropriate. That's the simplest
> > way I can think of to address your concern about a quick spike in VMA
> > freeing. It does not place any restriction on the list size and we
> > might have excessive dead vm_area_structs if after a large spike there
> > are no vm_area_free() calls but I don't know if that's a real problem,
> > so not sure we should be addressing it at this time. WDYT?
>
> Just to double-check, we really did try the very frequent call_rcu()
> invocations and we really did see a problem, correct?

Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
regressions when a process exits and all its VMAs get destroyed,
causing a flood of call_rcu()'s.

>
> Although it is not perfect, call_rcu() is designed to take a fair amount
> of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> invocations might be a bit simpler.
>
>                                                         Thanx, Paul
Paul E. McKenney Jan. 19, 2023, 7:55 p.m. UTC | #14
On Thu, Jan 19, 2023 at 11:47:36AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either. WDYT about time domain throttling to
> > > limit draining the list to say once per second like this:
> > >
> > > void vm_area_free(struct vm_area_struct *vma)
> > > {
> > >        struct mm_struct *mm = vma->vm_mm;
> > >        bool drain;
> > >
> > >        free_anon_vma_name(vma);
> > >
> > >        spin_lock(&mm->vma_free_list.lock);
> > >        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> > >        mm->vma_free_list.size++;
> > > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > > +       drain = jiffies > mm->last_drain_tm + HZ;
> > >
> > >        spin_unlock(&mm->vma_free_list.lock);
> > >
> > > -       if (drain)
> > > +       if (drain) {
> > >               drain_free_vmas(mm);
> > > +             mm->last_drain_tm = jiffies;
> > > +       }
> > > }
> > >
> > > Ultimately we want to prevent very frequent call_rcu() calls, so
> > > throttling in the time domain seems appropriate. That's the simplest
> > > way I can think of to address your concern about a quick spike in VMA
> > > freeing. It does not place any restriction on the list size and we
> > > might have excessive dead vm_area_structs if after a large spike there
> > > are no vm_area_free() calls but I don't know if that's a real problem,
> > > so not sure we should be addressing it at this time. WDYT?
> >
> > Just to double-check, we really did try the very frequent call_rcu()
> > invocations and we really did see a problem, correct?
> 
> Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
> regressions when a process exits and all its VMAs get destroyed,
> causing a flood of call_rcu()'s.

Thank you for the reminder, real problem needs solution.  ;-)

							Thanx, Paul

> > Although it is not perfect, call_rcu() is designed to take a fair amount
> > of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> > invocations might be a bit simpler.
Michal Hocko Jan. 20, 2023, 8:52 a.m. UTC | #15
On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in housands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either.

I suspect you have missed my idea. I do not really want to keep the list
around or add any shrinker. It is dead simple. Collect all vmas in
remove_vma and then call_rcu the whole list at once after the whole
operation has finished (be it from exit_mmap or remove_mt). See?
Michal Hocko Jan. 20, 2023, 8:57 a.m. UTC | #16
On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [...]
> > > > There are a couple of possibilities here.
> > > >
> > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > order to save power by batching RCU work over multiple call_rcu()
> > > > invocations.  If this is causing a problem for a given call site, the
> > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > to the old-school non-laziness, but can of course consume more power.
> > > 
> > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > at the time I was profiling this issue.
> > > Lazy RCU would be a great option to replace this patch but
> > > unfortunately it's not the default behavior, so I would still have to
> > > implement this batching in case lazy RCU is not enabled.
> > > 
> > > >
> > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > and the invocation of the corresponding callback in kernels built with
> > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > the added delay.  The reason for this delay is the use of a separate
> > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > (including most datacenter kernels) because in that case the callbacks
> > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > that there is no need for locks.
> > > 
> > > I believe this is the reason in my profiled case.
> > > 
> > > >
> > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > 
> > > I don't think I've seen such a case.
> > > Thanks for clarifications, Paul!
> > 
> > Thanks for the explanation Paul. I have to say this has caught me as a
> > surprise. There are just not enough details about the benchmark to
> > understand what is going on but I find it rather surprising that
> > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > which is the callback. My naive understanding has been that call_rcu is
> > really fast way to defer the execution to the RCU safe context to do the
> > final cleanup.
> 
> If I am following along correctly (ha!), then your "induce a higher
> overhead" should be something like "induce a higher to-kfree() latency".

Yes, this is expected.

> Of course, there already is a higher latency-to-kfree via call_rcu()
> than via a direct call to kfree(), and callback-offload CPUs that are
> being flooded with callbacks raise that latency a jiffy or so more in
> order to avoid lock contention.
> 
> If this becomes a problem, the callback-offloading code can be a bit
> smarter about avoiding lock contention, but need to see a real problem
> before I make that change.  But if there is a real problem I will of
> course fix it.

I believe that Suren claims that the call_rcu overhead is really visible
in the exit_mmap case. The time to free the actual vmas shouldn't really
be material for that path. If that happens much later, there could be
some side effects from increased memory consumption, but that should be
marginal. How fast exit_mmap really is should only depend on the direct
calls from that path.

But I guess we need some specific numbers from Suren to be sure what is
going on here.

Thanks!
Paul E. McKenney Jan. 20, 2023, 4:08 p.m. UTC | #17
On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
> On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > [...]
> > > > > There are a couple of possibilities here.
> > > > >
> > > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > > order to save power by batching RCU work over multiple call_rcu()
> > > > > invocations.  If this is causing a problem for a given call site, the
> > > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > > to the old-school non-laziness, but can of course consume more power.
> > > > 
> > > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > > at the time I was profiling this issue.
> > > > Lazy RCU would be a great option to replace this patch but
> > > > unfortunately it's not the default behavior, so I would still have to
> > > > implement this batching in case lazy RCU is not enabled.
> > > > 
> > > > >
> > > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > > and the invocation of the corresponding callback in kernels built with
> > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > > the added delay.  The reason for this delay is the use of a separate
> > > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > > (including most datacenter kernels) because in that case the callbacks
> > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > > that there is no need for locks.
> > > > 
> > > > I believe this is the reason in my profiled case.
> > > > 
> > > > >
> > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > > 
> > > > I don't think I've seen such a case.
> > > > Thanks for clarifications, Paul!
> > > 
> > > Thanks for the explanation Paul. I have to say this has caught me as a
> > > surprise. There are just not enough details about the benchmark to
> > > understand what is going on but I find it rather surprising that
> > > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > > which is the callback. My naive understanding has been that call_rcu is
> > > really fast way to defer the execution to the RCU safe context to do the
> > > final cleanup.
> > 
> > If I am following along correctly (ha!), then your "induce a higher
> > overhead" should be something like "induce a higher to-kfree() latency".
> 
> Yes, this is expected.
> 
> > Of course, there already is a higher latency-to-kfree via call_rcu()
> > than via a direct call to kfree(), and callback-offload CPUs that are
> > being flooded with callbacks raise that latency a jiffy or so more in
> > order to avoid lock contention.
> > 
> > If this becomes a problem, the callback-offloading code can be a bit
> > smarter about avoiding lock contention, but need to see a real problem
> > before I make that change.  But if there is a real problem I will of
> > course fix it.
> 
> I believe that Suren claims that the call_rcu is really visible in the
> exit_mmap case. Time-to-free actual vmas shouldn't really be material
> for that path. If that happens much more later on there could be some
> side effects by an increased memory consumption but that should be
> marginal. How fast exit_mmap really is should only depend on direct
> calls from that path.
> 
> But I guess we need some specific numbers from Suren to be sure what is
> going on here.

Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful.  :-/

							Thanx, Paul
Suren Baghdasaryan Jan. 20, 2023, 4:20 p.m. UTC | #18
On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > thousands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either.
>
> I suspect you have missed my idea. I do not really want to keep the list
> around or any shrinker. It is dead simple. Collect all vmas in
> remove_vma and then call_rcu the whole list at once after the whole list
> (be it from exit_mmap or remove_mt). See?

Yes, I understood your idea but keeping dead objects until the process
exits even when the system is low on memory (no shrinkers attached)
seems too wasteful. If we do this I would advocate for attaching a
shrinker.

>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Jan. 20, 2023, 4:45 p.m. UTC | #19
On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
>
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.

Maybe even simpler: since we are hit with this VMA-freeing flood
during exit_mmap() (when all VMAs are destroyed), we could pass a hint
to vm_area_free() to batch the destruction there, and have all other
cases call call_rcu()? I don't think there will be other cases of VMA
destruction floods.
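
For concreteness, a minimal sketch of what such a hint could look like.
The two-argument vm_area_free() signature, the callback name and the
vm_rcu rcu_head member are assumptions made here for illustration, and
the hint is simplified to "free immediately" rather than full batching:

static void vm_area_free_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);

        kmem_cache_free(vm_area_cachep, vma);
}

/* Sketch only: callers hint whether RCU deferral is needed at all. */
void vm_area_free(struct vm_area_struct *vma, bool defer_rcu)
{
        if (defer_rcu)
                call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb); /* munmap() etc. */
        else
                kmem_cache_free(vm_area_cachep, vma); /* exit_mmap() teardown */
}

exit_mmap() would pass defer_rcu=false once it holds mmap_lock for
write, while munmap() and the other paths keep the per-VMA call_rcu().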

>
> >
> > --
> > Michal Hocko
> > SUSE Labs
Matthew Wilcox Jan. 20, 2023, 4:49 p.m. UTC | #20
On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > in housands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
> 
> Maybe even simpler, since we are hit with this VMA freeing flood
> during exit_mmap (when all VMAs are destroyed), we pass a hint to
> vm_area_free to batch the destruction and all other cases call
> call_rcu()? I don't think there will be other cases of VMA destruction
> floods.

... or have two different call_rcu functions; one for munmap() and
one for exit.  It'd be nice to use kmem_cache_free_bulk().
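
As a rough illustration of the exit-side callback (the structure, its
name and the 32-entry batch size are invented here; kmem_cache_free_bulk()
and the vm_area_struct slab cache are the only real pieces):

struct vma_bulk_free {
        struct rcu_head rcu;
        unsigned int nr;
        void *vmas[32];                 /* arbitrary batch size */
};

/* One rcu_head per batch; the whole batch goes back to the slab in bulk. */
static void vma_bulk_free_rcu(struct rcu_head *head)
{
        struct vma_bulk_free *b = container_of(head, struct vma_bulk_free, rcu);

        kmem_cache_free_bulk(vm_area_cachep, b->nr, b->vmas);
        kfree(b);
}

munmap() would keep a per-VMA callback, while exit_mmap() would fill
such batches and issue one call_rcu(&b->rcu, vma_bulk_free_rcu) per
batch.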
Liam R. Howlett Jan. 20, 2023, 5:08 p.m. UTC | #21
* Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> > 
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
> 
> ... or have two different call_rcu functions; one for munmap() and
> one for exit.  It'd be nice to use kmem_cache_free_bulk().

Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
Once we have obtained the write lock again, I think it's safe to say we
can just go ahead and free the VMAs directly.
Suren Baghdasaryan Jan. 20, 2023, 5:17 p.m. UTC | #22
On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit.  It'd be nice to use kmem_cache_free_bulk().
>
> Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> Once we have obtained the write lock again, I think it's safe to say we
> can just go ahead and free the VMAs directly.

I think that would still be racy if the page fault handler found that
VMA under RCU read protection but did not lock it yet (no locks are
held yet). If it's preempted, the VMA can be freed and destroyed from
under it without an RCU grace period.

Paul E. McKenney Jan. 20, 2023, 5:21 p.m. UTC | #23
On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> > 
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
> 
> ... or have two different call_rcu functions; one for munmap() and
> one for exit.  It'd be nice to use kmem_cache_free_bulk().

Good point. kfree_rcu(p, r), where "r" is the name of the rcu_head
field in the structure, is much more cache-efficient.

The penalty is that there is no callback function to do any cleanup.
There is just a kfree()/kvfree() (the bulk version where applicable),
nothing else.

							Thanx, Paul
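
For reference, the pattern being described is roughly the following;
struct foo and drop_foo() are generic stand-ins rather than VMA code,
and the object must have come from kmalloc() for kfree_rcu() to apply:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        struct list_head list;
        struct rcu_head rcu;            /* the "r" in kfree_rcu(p, r) */
        /* ... payload ... */
};

static void drop_foo(struct foo *p)
{
        /*
         * No callback runs: after a grace period the object is simply
         * kfree()d (batched internally where possible), so any cleanup
         * of the payload has to happen before this point.
         */
        kfree_rcu(p, rcu);
}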
Matthew Wilcox Jan. 20, 2023, 5:32 p.m. UTC | #24
On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > >
> > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > that this is not really optimal.
> > > > > > > >
> > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > >
> > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > >
> > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > not sound too appealing either.
> > > > > >
> > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > (be it from exit_mmap or remove_mt). See?
> > > > >
> > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > shrinker.
> > > >
> > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > vm_area_free to batch the destruction and all other cases call
> > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > floods.
> > >
> > > ... or have two different call_rcu functions; one for munmap() and
> > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> >
> > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > Once we have obtained the write lock again, I think it's safe to say we
> > can just go ahead and free the VMAs directly.
> 
> I think that would be still racy if the page fault handler found that
> VMA under read-RCU protection but did not lock it yet (no locks are
> held yet). If it's preempted, the VMA can be freed and destroyed from
> under it without RCU grace period.

The page fault handler (or whatever other reader -- ptrace, proc, etc)
should have a refcount on the mm_struct, so we can't be in this path
trying to free VMAs.  Right?
Suren Baghdasaryan Jan. 20, 2023, 5:50 p.m. UTC | #25
On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > >
> > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > that this is not really optimal.
> > > > > > > > >
> > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > > >
> > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > >
> > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > not sound too appealing either.
> > > > > > >
> > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > >
> > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > shrinker.
> > > > >
> > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > vm_area_free to batch the destruction and all other cases call
> > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > floods.
> > > >
> > > > ... or have two different call_rcu functions; one for munmap() and
> > > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> > >
> > > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > Once we have obtained the write lock again, I think it's safe to say we
> > > can just go ahead and free the VMAs directly.
> >
> > I think that would be still racy if the page fault handler found that
> > VMA under read-RCU protection but did not lock it yet (no locks are
> > held yet). If it's preempted, the VMA can be freed and destroyed from
> > under it without RCU grace period.
>
> The page fault handler (or whatever other reader -- ptrace, proc, etc)
> should have a refcount on the mm_struct, so we can't be in this path
> trying to free VMAs.  Right?

Hmm. That sounds right. I checked process_mrelease() as well, which
operates on the mm with only mmgrab()+mmap_read_lock(), but it only unmaps
VMAs without freeing them, so we are still good. Michal, do you agree
this is ok?

lock_vma_under_rcu() receives mm as a parameter, so I guess it's
implied that the caller should either mmget() it or operate on
current->mm, so no need to document this requirement?
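
To make the implied contract concrete, a caller that is not operating
on current->mm would look roughly like the sketch below. probe_address()
and vma_end_read() are placeholder names (the latter standing in for
whatever the series calls its per-VMA read-unlock helper):

static int probe_address(struct mm_struct *mm, unsigned long address)
{
        struct vm_area_struct *vma;

        if (!mmget_not_zero(mm))   /* pin mm_users so exit_mmap() cannot run */
                return -ESRCH;

        vma = lock_vma_under_rcu(mm, address);
        if (vma) {
                /* ... inspect the vma without mmap_lock ... */
                vma_end_read(vma); /* stand-in for the read-unlock helper */
        }

        mmput(mm);
        return vma ? 0 : -EFAULT;
}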
Suren Baghdasaryan Jan. 20, 2023, 6:42 p.m. UTC | #26
On Fri, Jan 20, 2023 at 9:21 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit.  It'd be nice to use kmem_cache_free_bulk().
>
> Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head
> structure's field, is much more cache-efficient.
>
> The penalty is that there is no callback function to do any cleanup.
> There is just a kfree()/kvfree (bulk version where applicable),
> nothing else.

If Liam's suggestion works then we won't need anything additional. We
will free the vm_area_structs directly on process exit and use
call_rcu() in all other cases. Let's see if Michal knows of any case
which still needs an RCU grace period during exit_mmap.

>
>                                                         Thanx, Paul
>
Liam R. Howlett Jan. 20, 2023, 7:23 p.m. UTC | #27
* Suren Baghdasaryan <surenb@google.com> [230120 12:50]:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > > >
> > > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > > that this is not really optimal.
> > > > > > > > > >
> > > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > > > >
> > > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > > >
> > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > > not sound too appealing either.
> > > > > > > >
> > > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > > >
> > > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > > shrinker.
> > > > > >
> > > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > > vm_area_free to batch the destruction and all other cases call
> > > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > > floods.
> > > > >
> > > > > ... or have two different call_rcu functions; one for munmap() and
> > > > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> > > >
> > > > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > > Once we have obtained the write lock again, I think it's safe to say we
> > > > can just go ahead and free the VMAs directly.
> > >
> > > I think that would be still racy if the page fault handler found that
> > > VMA under read-RCU protection but did not lock it yet (no locks are
> > > held yet). If it's preempted, the VMA can be freed and destroyed from
> > > under it without RCU grace period.
> >
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?
> 
> lock_vma_under_rcu() receives mm as a parameter, so I guess it's
> implied that the caller should either mmget() it or operate on
> current->mm, so no need to document this requirement?

It is also implied by the vma->vm_mm link.  Otherwise any RCU holder of
the VMA could have an unsafe pointer.  In fact, if this isn't true then
we need to change the callers to take the ref count to avoid just this
scenario.
Michal Hocko Jan. 23, 2023, 9:56 a.m. UTC | #28
On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
[...]
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?

Don't we need RCU protections for the vma lifetime assurance? Jann has
already shown how rwsem is not safe wrt unlock and free without RCU.
Michal Hocko Jan. 23, 2023, 9:59 a.m. UTC | #29
On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in housands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
> 
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.

I am still not sure we are on the same page here. No, vmas shouldn't lie
around until the process exits. I am really suggesting queuing only for
remove_vma paths. You can have a different rcu callback than the one
used for trivial single vma removal paths.
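
A minimal sketch of that shape, with invented names: it assumes
vm_area_struct gains a list node (called vm_free_link here) and that the
small batch head is allocated by the caller.

struct vma_free_batch {
        struct list_head head;          /* VMAs linked via vma->vm_free_link */
        struct rcu_head rcu;
};

/* The "different rcu callback": free a whole queued list in one go. */
static void vma_free_batch_rcu(struct rcu_head *head)
{
        struct vma_free_batch *batch = container_of(head, struct vma_free_batch,
                                                    rcu);
        struct vm_area_struct *vma, *next;

        list_for_each_entry_safe(vma, next, &batch->head, vm_free_link)
                kmem_cache_free(vm_area_cachep, vma);
        kfree(batch);
}

remove_vma() would then only list_add_tail() the dying VMA, remove_mt()
and exit_mmap() would issue a single call_rcu(&batch->rcu,
vma_free_batch_rcu) once the whole range is gone, and the trivial
single-VMA removal paths would keep the existing per-VMA callback.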
Suren Baghdasaryan Jan. 23, 2023, 4:22 p.m. UTC | #30
On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> [...]
> > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > should have a refcount on the mm_struct, so we can't be in this path
> > > trying to free VMAs.  Right?
> >
> > Hmm. That sounds right. I checked process_mrelease() as well, which
> > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > VMAs without freeing them, so we are still good. Michal, do you agree
> > this is ok?
>
> Don't we need RCU procetions for the vma life time assurance? Jann has
> already shown how rwsem is not safe wrt to unlock and free without RCU.

Jann's case requires a thread freeing the VMA to be blocked on the vma
write lock, waiting for the vma read lock to be released by a page
fault handler. However, exit_mmap() means mm->mm_users==0, which in
turn suggests that there are no racing page fault handlers and no new
page fault handlers will appear. Is that a correct assumption? If so,
then races with page fault handlers can't happen while in exit_mmap().
Any other path (other than page fault handlers) accesses vma->lock
under protection of mmap_lock (for read or write, does not matter).
One exception is when we operate on an isolated VMA; then we don't
need mmap_lock protection, but exit_mmap() does not deal with isolated
VMAs, so that is out of scope here. exit_mmap() frees vm_area_structs
under protection of mmap_lock in write mode, so races with anything
other than the page fault handler should be safe, as they are today.

That said, possible future users of lock_vma_under_rcu() that use a VMA
without mmap_lock protection will have to ensure the mm's stability
while they are using the obtained VMA. IOW they should elevate the mm's
refcount and keep it elevated as long as they are using that VMA, and
not drop it before vma->lock is released. I guess it would be a good
idea to document that requirement in the lock_vma_under_rcu() comments
if we decide to take this route.
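
One possible wording for such a comment, purely as an illustration and
phrased against the lock_vma_under_rcu() signature discussed above:

/*
 * lock_vma_under_rcu() - look up and read-lock a VMA without mmap_lock.
 * @mm:      target mm.  The caller must keep the mm_struct stable, either
 *           by operating on current->mm or by holding a reference taken
 *           with mmget()/mmget_not_zero(), and must not drop that
 *           reference until the returned VMA has been unlocked.
 * @address: user address to look up.
 *
 * Return: the read-locked VMA, or NULL if it could not be found or locked
 * (in which case the caller falls back to taking mmap_lock).
 */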

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 23, 2023, 4:55 p.m. UTC | #31
On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > [...]
> > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > trying to free VMAs.  Right?
> > >
> > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > this is ok?
> >
> > Don't we need RCU procetions for the vma life time assurance? Jann has
> > already shown how rwsem is not safe wrt to unlock and free without RCU.
> 
> Jann's case requires a thread freeing the VMA to be blocked on vma
> write lock waiting for the vma real lock to be released by a page
> fault handler. However exit_mmap() means mm->mm_users==0, which in
> turn suggests that there are no racing page fault handlers and no new
> page fault handlers will appear. Is that a correct assumption? If so,
> then races with page fault handlers can't happen while in exit_mmap().
> Any other path (other than page fault handlers), accesses vma->lock
> under protection of mmap_lock (for read or write, does not matter).
> One exception is when we operate on an isolated VMA, then we don't
> need mmap_lock protection, but exit_mmap() does not deal with isolated
> VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> protection of mmap_lock in write mode, so races with anything other
> than page fault handler should be safe as they are today.

I do not see you talking about #PF (RCU + vma read lock protected) with
munmap. It is my understanding that the latter will synchronize over the
per-vma lock (along with mmap_lock exclusive locking). But then we are
back to the lifetime guarantees, or am I missing anything?
Suren Baghdasaryan Jan. 23, 2023, 5:07 p.m. UTC | #32
On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > [...]
> > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > trying to free VMAs.  Right?
> > > >
> > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > this is ok?
> > >
> > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> >
> > Jann's case requires a thread freeing the VMA to be blocked on vma
> > write lock waiting for the vma real lock to be released by a page
> > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > turn suggests that there are no racing page fault handlers and no new
> > page fault handlers will appear. Is that a correct assumption? If so,
> > then races with page fault handlers can't happen while in exit_mmap().
> > Any other path (other than page fault handlers), accesses vma->lock
> > under protection of mmap_lock (for read or write, does not matter).
> > One exception is when we operate on an isolated VMA, then we don't
> > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > protection of mmap_lock in write mode, so races with anything other
> > than page fault handler should be safe as they are today.
>
> I do not see you talking about #PF (RCU + vma read lock protected) with
> munmap. It is my understanding that the latter will synchronize over per
> vma lock (along with mmap_lock exclusive locking). But then we are back
> to the lifetime guarantees, or do I miss anything.

munmap() or any VMA-freeing operation other than exit_mmap() will free
using call_rcu(), as implemented today. The suggestion is to free VMAs
directly, without an RCU grace period, only when done from exit_mmap().
That's because the VMA-freeing flood has been seen so far only in the
case of exit_mmap(), and we assume the other cases are not heavy enough
to cause a call_rcu() flood that would cause regressions. That
assumption might prove false but we can deal with that once we know it
needs fixing.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 23, 2023, 5:16 p.m. UTC | #33
On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > [...]
> > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > trying to free VMAs.  Right?
> > > > >
> > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > this is ok?
> > > >
> > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > >
> > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > write lock waiting for the vma real lock to be released by a page
> > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > turn suggests that there are no racing page fault handlers and no new
> > > page fault handlers will appear. Is that a correct assumption? If so,
> > > then races with page fault handlers can't happen while in exit_mmap().
> > > Any other path (other than page fault handlers), accesses vma->lock
> > > under protection of mmap_lock (for read or write, does not matter).
> > > One exception is when we operate on an isolated VMA, then we don't
> > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > protection of mmap_lock in write mode, so races with anything other
> > > than page fault handler should be safe as they are today.
> >
> > I do not see you talking about #PF (RCU + vma read lock protected) with
> > munmap. It is my understanding that the latter will synchronize over per
> > vma lock (along with mmap_lock exclusive locking). But then we are back
> > to the lifetime guarantees, or do I miss anything.
> 
> munmap() or any VMA-freeing operation other than exit_mmap() will free
> using call_rcu(), as implemented today. The suggestion is to free VMAs
> directly, without RCU grace period only when done from exit_mmap().

OK, I have clearly missed that. This makes more sense, but it also adds
some more complexity and assumptions - harder-to-maintain code in the
end. Whoever wants to touch this scheme in the future would have to
re-evaluate all of them. So, I would just avoid that special casing if
that is feasible.

The flood of call_rcu() calls during exit_mmap() is a trivial thing to
deal with, as proposed elsewhere (just batch all of them in a single
run). This will surely add some more code, but at least the locking
would be consistent.
Suren Baghdasaryan Jan. 23, 2023, 5:43 p.m. UTC | #34
On Mon, Jan 23, 2023 at 1:59 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > in housands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
>
> I am still not sure we are on the same page here. No, vmas shouldn't lay
> around un ntil the process exit. I am really suggesting queuing only for
> remove_vma paths. You can have a different rcu callback than the one
> used for trivial single vma removal paths.

Oh, I also missed the remove_mt() part and thought you wanted to drain
the list only in exit_mmap(). I think that's a good option!

>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Jan. 23, 2023, 5:46 p.m. UTC | #35
On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > [...]
> > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > trying to free VMAs.  Right?
> > > > > >
> > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > this is ok?
> > > > >
> > > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > > >
> > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > write lock waiting for the vma real lock to be released by a page
> > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > turn suggests that there are no racing page fault handlers and no new
> > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > under protection of mmap_lock (for read or write, does not matter).
> > > > One exception is when we operate on an isolated VMA, then we don't
> > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > protection of mmap_lock in write mode, so races with anything other
> > > > than page fault handler should be safe as they are today.
> > >
> > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > munmap. It is my understanding that the latter will synchronize over per
> > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > to the lifetime guarantees, or do I miss anything.
> >
> > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > directly, without RCU grace period only when done from exit_mmap().
>
> OK, I have clearly missed that. This makes more sense but it also adds
> some more complexity and assumptions - a harder to maintain code in the
> end. Whoever wants to touch this scheme in future would have to
> re-evaluate all of them. So, I would just avoid that special casing if
> that is feasible.

Ok, I understand your point.

>
> Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> to deal with as proposed elsewhere (just batch all of them in a single
> run). This will surely add some more code but at least the locking would
> consistent.

Yes, batching the vmas into a list and draining it in remove_mt() and
exit_mmap() as you suggested makes sense to me and is quite simple.
Let's do that if nobody has objections.
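
The drain side of that scheme could then stay small; the sketch below
reuses the hypothetical vma_free_batch/vma_free_batch_rcu and the
per-VMA vm_free_link/vm_rcu members from the earlier sketches, with the
allocation-failure path falling back to today's per-VMA callback:

static void drain_free_vmas(struct list_head *to_free)
{
        struct vma_free_batch *batch;

        if (list_empty(to_free))
                return;

        batch = kmalloc(sizeof(*batch), GFP_KERNEL);
        if (!batch) {
                struct vm_area_struct *vma, *next;

                /* Fall back to the existing per-VMA RCU deferral. */
                list_for_each_entry_safe(vma, next, to_free, vm_free_link)
                        call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
                return;
        }

        INIT_LIST_HEAD(&batch->head);
        list_splice_init(to_free, &batch->head);
        call_rcu(&batch->rcu, vma_free_batch_rcu);
}

remove_mt() and exit_mmap() would collect the dying VMAs on a local
to_free list while tearing down the tree and call this once at the end.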

> --
> Michal Hocko
> SUSE Labs
Matthew Wilcox Jan. 23, 2023, 6:23 p.m. UTC | #36
On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > [...]
> > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > trying to free VMAs.  Right?
> > > > > > >
> > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > this is ok?
> > > > > >
> > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > > > >
> > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > write lock waiting for the vma real lock to be released by a page
> > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > than page fault handler should be safe as they are today.
> > > >
> > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > munmap. It is my understanding that the latter will synchronize over per
> > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > to the lifetime guarantees, or do I miss anything.
> > >
> > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > directly, without RCU grace period only when done from exit_mmap().
> >
> > OK, I have clearly missed that. This makes more sense but it also adds
> > some more complexity and assumptions - a harder to maintain code in the
> > end. Whoever wants to touch this scheme in future would have to
> > re-evaluate all of them. So, I would just avoid that special casing if
> > that is feasible.
> 
> Ok, I understand your point.
> 
> >
> > Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> > to deal with as proposed elsewhere (just batch all of them in a single
> > run). This will surely add some more code but at least the locking would
> > consistent.
> 
> Yes, batching the vmas into a list and draining it in remove_mt() and
> exit_mmap() as you suggested makes sense to me and is quite simple.
> Let's do that if nobody has objections.

I object.  We *know* nobody has a reference to any of the VMAs because
you have to have a refcount on the mm before you can get a reference
to a VMA.  If Michal is saying that somebody could do:

	mmget(mm);
	vma = find_vma(mm);
	lock_vma(vma);
	mmput(mm);
	vma->a = b;
	unlock_vma(mm, vma);

then that's something we'd catch in review -- you obviously can't use
the mm after you've dropped your reference to it.

Having all this extra code to solve two problems badly is a very poor
choice.  We have two distinct problems, each of which has a simple,
efficient solution.
Suren Baghdasaryan Jan. 23, 2023, 6:47 p.m. UTC | #37
On Mon, Jan 23, 2023 at 10:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > [...]
> > > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > > trying to free VMAs.  Right?
> > > > > > > >
> > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > > this is ok?
> > > > > > >
> > > > > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > > > > >
> > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > > write lock waiting for the vma real lock to be released by a page
> > > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > > Any other path (other than page fault handlers) accesses vma->lock
> > > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > > than page fault handler should be safe as they are today.
> > > > >
> > > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > > munmap. It is my understanding that the latter will synchronize over per
> > > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > > to the lifetime guarantees, or do I miss anything.
> > > >
> > > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > > directly, without RCU grace period only when done from exit_mmap().
> > >
> > > OK, I have clearly missed that. This makes more sense but it also adds
> > > some more complexity and assumptions - harder-to-maintain code in the
> > > end. Whoever wants to touch this scheme in future would have to
> > > re-evaluate all of them. So, I would just avoid that special casing if
> > > that is feasible.
> >
> > Ok, I understand your point.
> >
> > >
> > > Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> > > to deal with as proposed elsewhere (just batch all of them in a single
> > > run). This will surely add some more code but at least the locking would
> > > be consistent.
> >
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
>
> I object.  We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA.  If Michal is saying that somebody could do:
>
>         mmget(mm);
>         vma = find_vma(mm);
>         lock_vma(vma);
>         mmput(mm);
>         vma->a = b;
>         unlock_vma(mm, vma);

More precisely, it's:
         mmget(mm);
         vma = lock_vma_under_rcu(mm, addr); -> calls vma_read_trylock(vma)
         mmput(mm);
         vma->a = b;
         vma_read_unlock(vma);

To be fair, vma_read_unlock() does not take mm as a parameter, so one
could get the impression that the mm doesn't need to be pinned at the
time of the call.


>
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.
>
> Having all this extra code to solve two problems badly is a very poor
> choice.  We have two distinct problems, each of which has a simple,
> efficient solution.
>
Michal Hocko Jan. 23, 2023, 7:18 p.m. UTC | #38
On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
[...]
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
> 
> I object.  We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA.  If Michal is saying that somebody could do:
> 
> 	mmget(mm);
> 	vma = find_vma(mm);
> 	lock_vma(vma);
> 	mmput(mm);
> 	vma->a = b;
> 	unlock_vma(mm, vma);
> 
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.

I am not claiming this is possible now. I do not think we want to have
something like that in the future either but that is really hard to
envision. I am claiming that it is subtle and potentially error prone to
have two different ways of mass vma freeing wrt. locking. Also, don't we
have a very similar situation during last munmaps?
Matthew Wilcox Jan. 23, 2023, 7:30 p.m. UTC | #39
On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> [...]
> > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > Let's do that if nobody has objections.
> > 
> > I object.  We *know* nobody has a reference to any of the VMAs because
> > you have to have a refcount on the mm before you can get a reference
> > to a VMA.  If Michal is saying that somebody could do:
> > 
> > 	mmget(mm);
> > 	vma = find_vma(mm);
> > 	lock_vma(vma);
> > 	mmput(mm);
> > 	vma->a = b;
> > 	unlock_vma(mm, vma);
> > 
> > then that's something we'd catch in review -- you obviously can't use
> > the mm after you've dropped your reference to it.
> 
> I am not claiming this is possible now. I do not think we want to have
> something like that in the future either but that is really hard to
> envision. I am claiming that it is subtle and potentially error prone to
> have two different ways of mass vma freeing wrt. locking. Also, don't we
> have a very similar situation during last munmaps?

We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
There are two cases; there's munmap(), which typically frees a single
VMA (yes, theoretically, you can free hundreds of VMAs with a single
call which spans multiple VMAs, but in practice that doesn't happen),
and there's exit_mmap() which happens on exec() and exit().

For the munmap() case, just RCU-free each one individually.  For the
exit_mmap() case, there's no need to use RCU because nobody should still
have a VMA pointer after calling mmdrop() [1]

[1] Sorry, the above example should have been mmgrab()/mmdrop(), not
mmget()/mmput(); you're not allowed to look at the VMA list with an
mmget(), you need to have grabbed.
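
Spelling out footnote [1], the same (intentionally broken) pattern reads as
follows; lock_vma()/unlock_vma() remain the placeholder names used in the
example above:

	mmgrab(mm);
	vma = find_vma(mm);
	lock_vma(vma);
	mmdrop(mm);		/* reference to the mm is dropped here...    */
	vma->a = b;		/* ...yet the VMA is still used afterwards,  */
	unlock_vma(mm, vma);	/* which is the misuse review would catch    */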
Suren Baghdasaryan Jan. 23, 2023, 7:57 p.m. UTC | #40
On Mon, Jan 23, 2023 at 11:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > >
> > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA.  If Michal is saying that somebody could do:
> > >
> > >     mmget(mm);
> > >     vma = find_vma(mm);
> > >     lock_vma(vma);
> > >     mmput(mm);
> > >     vma->a = b;
> > >     unlock_vma(mm, vma);
> > >
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> >
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
>
> We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().
>
> For the munmap() case, just RCU-free each one individually.  For the
> exit_mmap() case, there's no need to use RCU because nobody should still
> have a VMA pointer after calling mmdrop() [1]
>
> [1] Sorry, the above example should have been mmgrab()/mmdrop(), not
> mmget()/mmput(); you're not allowed to look at the VMA list with an
> mmget(), you need to have grabbed.

I think it's clear that this would work with the current code and that
the concern is about any future possible misuse. So, it would be
preferable to proactively prevent such misuse.

vma_write_lock() and vma_write_unlock_mm() both have
mmap_assert_write_locked(), so they always happen under mmap_lock
protection and therefore do not pose any danger. The only issue we
need to be careful with is calling
vma_read_trylock()/vma_read_unlock() outside of mmap_lock protection
while mm is unstable. I don't think doing mmget/mmput inside these
functions is called for but maybe some assertions would prevent future
misuse?
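
A minimal sketch of such an assertion, assuming it lives in vma_read_unlock()
and that checking mm_count is the right signal (both are assumptions, not
something this series implements):

	/* Sketch only: complain if the caller's mm reference is already gone
	 * by the time the per-VMA read lock is dropped. */
	static inline void vma_read_unlock(struct vm_area_struct *vma)
	{
		VM_BUG_ON_MM(atomic_read(&vma->vm_mm->mm_count) <= 0, vma->vm_mm);
		/* ...existing per-VMA unlock body goes here... */
	}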
Michal Hocko Jan. 23, 2023, 8 p.m. UTC | #41
On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > > 
> > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA.  If Michal is saying that somebody could do:
> > > 
> > > 	mmget(mm);
> > > 	vma = find_vma(mm);
> > > 	lock_vma(vma);
> > > 	mmput(mm);
> > > 	vma->a = b;
> > > 	unlock_vma(mm, vma);
> > > 
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> > 
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
> 
> We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().

This requires special casing remove_vma for those two different paths
(exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
code just to avoid handling a potentially large munmap, which might very
well be a rare thing as you say. But haven't we learned that sooner or
later we will find out there is somebody that cares after all? Anyway,
this is not something I care about all that much. It is just weird to
special case exit_mmap, if you ask me. Up to Suren to decide which way
he wants to go. I just really didn't like the initial implementation of
batching based on a completely arbitrary batch limit and lazy freeing.
Suren Baghdasaryan Jan. 23, 2023, 8:08 p.m. UTC | #42
On Mon, Jan 23, 2023 at 12:00 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > >
> > > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA.  If Michal is saying that somebody could do:
> > > >
> > > >   mmget(mm);
> > > >   vma = find_vma(mm);
> > > >   lock_vma(vma);
> > > >   mmput(mm);
> > > >   vma->a = b;
> > > >   unlock_vma(mm, vma);
> > > >
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > >
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> >
> > We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
>
> This requires special casing remove_vma for those two different paths
> (exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody that cares after all? Anyway,
> this is not something I care about all that much. It is just weird to
> special case exit_mmap, if you ask me. Up to Suren to decide which way
> he wants to go. I just really didn't like the initial implementation of
> batching based on a completely arbitrary batch limit and lazy freeing.

I would prefer to go with the simplest sufficient solution. A
potential issue with a large munmap might prove to be real but I think
we know how to easily fix that with batching if the issue ever
materializes (I'll have a fix ready implementing Michal's suggestion).
So, I suggest going with Liam's/Matthew's solution and converting to
Michal's solution if a regression shows up anywhere else. Would that be
acceptable?

> --
> Michal Hocko
> SUSE Labs
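
For reference, a rough sketch of the fallback Suren mentions (Michal's
per-operation batching: collect the VMAs freed by one munmap()/exit_mmap()
and hand the whole batch to a single call_rcu() when the operation finishes).
The vma_batch_* names are hypothetical; the sketch reuses vm_free_list and
vma_free_rcu_callback() from the patch below and is not code from this series:

	/* Hypothetical per-operation batching: the caller initialises b->head
	 * with INIT_LIST_HEAD() and drains once at the end of
	 * remove_mt()/exit_mmap(). */
	struct vma_batch {
		struct list_head head;
	};

	static void vma_batch_add(struct vma_batch *b, struct vm_area_struct *vma)
	{
		free_anon_vma_name(vma);
		list_add(&vma->vm_free_list, &b->head);
	}

	static void vma_batch_drain(struct vma_batch *b)
	{
		struct vm_area_struct *first;

		if (list_empty(&b->head))
			return;
		first = list_first_entry(&b->head, struct vm_area_struct, vm_free_list);
		list_del(&b->head);	/* detach the local list head */
		call_rcu(&first->vm_rcu, vma_free_rcu_callback);	/* one grace period per batch */
	}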
Liam R. Howlett Jan. 23, 2023, 8:38 p.m. UTC | #43
* Michal Hocko <mhocko@suse.com> [230123 15:00]:
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > > 
> > > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA.  If Michal is saying that somebody could do:
> > > > 
> > > > 	mmget(mm);
> > > > 	vma = find_vma(mm);
> > > > 	lock_vma(vma);
> > > > 	mmput(mm);
> > > > 	vma->a = b;
> > > > 	unlock_vma(mm, vma);
> > > > 
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > > 
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> > 
> > We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
> 
> This requires special casing remove_vma for those two different paths
> (exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody that cares after all? Anyway,
> this is not something I care about all that much. It is just weird to
> special case exit_mmap, if you ask me.

exit_mmap() is already a special case for locking (and statistics).
This exists today to optimize the special exit scenario.  I don't think
it's a question of sub-optimal code but of what we can get away with not
doing in the process-exit case.
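
For illustration, the shape of the split Matthew and Liam describe might look
roughly like the following; vm_area_free_unreachable() is a hypothetical name,
and both bodies simply reuse helpers visible in the patch below rather than
anything this series actually adds:

	/* Regular unmap paths keep the RCU deferral... */
	void vm_area_free(struct vm_area_struct *vma)
	{
		free_anon_vma_name(vma);
		call_rcu(&vma->vm_rcu, __vm_area_free);
	}

	/* ...while exit_mmap(), where mm_users has already dropped to zero and
	 * no lock_vma_under_rcu() walker can still reach the VMA, frees it
	 * directly without waiting for a grace period. */
	void vm_area_free_unreachable(struct vm_area_struct *vma)
	{
		free_anon_vma_name(vma);
		kmem_cache_free(vm_area_cachep, vma);
	}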
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@  void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@  struct vm_area_struct {
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
 					   units */
-	struct file * vm_file;		/* File we map to (can be NULL). */
-	void * vm_private_data;		/* was vm_pte (shared mem) */
+	union {
+		struct {
+			/* File we map to (can be NULL). */
+			struct file *vm_file;
+
+			/* was vm_pte (shared mem) */
+			void *vm_private_data;
+		};
+#ifdef CONFIG_PER_VMA_LOCK
+		struct list_head vm_free_list;
+#endif
+	};
 
 #ifdef CONFIG_ANON_VMA_NAME
 	/*
@@ -693,6 +703,11 @@  struct mm_struct {
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
 		int mm_lock_seq;
+		struct {
+			struct list_head head;
+			spinlock_t lock;
+			int size;
+		} vma_free_list;
 #endif
 
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+	struct vm_area_struct *first_vma;
+	struct vm_area_struct *vma, *vma2;
+
+	first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+	list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+		__vm_area_free(vma);
+	__vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+	struct vm_area_struct *first_vma;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_splice_init(&mm->vma_free_list.head, &to_destroy);
+	mm->vma_free_list.size = 0;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (list_empty(&to_destroy))
+		return;
+
+	first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+	/* Remove the head which is allocated on the stack */
+	list_del(&to_destroy);
+
+	call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX	32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool drain;
+
+	free_anon_vma_name(vma);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+	mm->vma_free_list.size++;
+	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (drain)
+		drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
 	kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1150,6 +1199,9 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
 	WRITE_ONCE(mm->mm_lock_seq, 0);
+	INIT_LIST_HEAD(&mm->vma_free_list.head);
+	spin_lock_init(&mm->vma_free_list.lock);
+	mm->vma_free_list.size = 0;
 #endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 33269314e060..b53d23c2d7a3 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -39,6 +39,9 @@  struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
 	.mm_lock_seq	= 0,
+	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
+	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
+	.vma_free_list.size = 0,
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
diff --git a/mm/mmap.c b/mm/mmap.c
index 332af383f7cd..a0d5d3af1d95 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3159,6 +3159,7 @@  void exit_mmap(struct mm_struct *mm)
 	trace_exit_mmap(mm);
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
+	drain_free_vmas(mm);
 	vm_unacct_memory(nr_accounted);
 }