
[2/2] mm/vmalloc: rework the drain logic

Message ID 20201116220033.1837-2-urezki@gmail.com (mailing list archive)
State New, archived
Series [1/2] mm/vmalloc: use free_vm_area() if an allocation fails

Commit Message

Uladzislau Rezki Nov. 16, 2020, 10 p.m. UTC
The current "lazy drain" model suffers from at least two issues.

The first one is related to the unsorted list of vmap areas: in
order to identify the [min:max] range of areas to be drained, a
full list scan is required, which is time consuming if the list
is too long.

The second one, as a next step, is about merging all fragments
with the free space. This is also time consuming, because it
has to iterate over the entire list which holds the outstanding
lazy areas.

See below the "preemptirqsoff" tracer output that illustrates a
high latency. It is ~24676 us. Our workloads, such as audio and
video, are affected by such a long latency:

<snip>
  tracer: preemptirqsoff

  preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
  --------------------------------------------------------------------
  latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
     -----------------
     | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
     -----------------
   => started at: __purge_vmap_area_lazy
   => ended at:   __purge_vmap_area_lazy

                   _------=> CPU#
                  / _-----=> irqs-off
                 | / _----=> need-resched
                 || / _---=> hardirq/softirq
                 ||| / _--=> preempt-depth
                 |||| /     delay
   cmd     pid   ||||| time  |   caller
      \   /      |||||  \    |   /
crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
[...]
crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
crtc_com-261     1...1 24683us : <stack trace>
 => free_vmap_area_noflush
 => remove_vm_area
 => __vunmap
 => vfree
 => drm_property_free_blob
 => drm_mode_object_unreference
 => drm_property_unreference_blob
 => __drm_atomic_helper_crtc_destroy_state
 => sde_crtc_destroy_state
 => drm_atomic_state_default_clear
 => drm_atomic_state_clear
 => drm_atomic_state_free
 => complete_commit
 => _msm_drm_commit_work_cb
 => kthread_worker_fn
 => kthread
 => ret_from_fork
<snip>

To address those two issues we can redesign the purging of the
outstanding lazy areas. Instead of queuing vmap areas to a list,
we replace it with a separate rb-tree. In that case an area is
located in the tree/list in ascending order. This gives us the
following advantages:

a) Outstanding vmap areas are merged, creating bigger coalesced
   blocks, thus the space becomes less fragmented;

b) It is possible to calculate the flush range [min:max] without
   scanning all elements, i.e. with O(1) access time;

c) The final merge of areas with the rb-tree that represents the
   free space is faster because of (a). As a result the lock
   contention is also reduced.
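
For example, with the sorted purge list the flush range is taken
from the first and the last entries (the code below is from the
patch itself):

<snip>
	start = min(start,
		list_first_entry(&local_pure_list,
			struct vmap_area, list)->va_start);

	end = max(end,
		list_last_entry(&local_pure_list,
			struct vmap_area, list)->va_end);
<snip>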

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/vmalloc.h |  8 ++--
 mm/vmalloc.c            | 90 +++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 45 deletions(-)

Comments

huang ying Nov. 17, 2020, 2:37 a.m. UTC | #1
On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
<urezki@gmail.com> wrote:
>
> A current "lazy drain" model suffers from at least two issues.
>
> First one is related to the unsorted list of vmap areas, thus
> in order to identify the [min:max] range of areas to be drained,
> it requires a full list scan. What is a time consuming if the
> list is too long.
>
> Second one and as a next step is about merging all fragments
> with a free space. What is also a time consuming because it
> has to iterate over entire list which holds outstanding lazy
> areas.
>
> See below the "preemptirqsoff" tracer that illustrates a high
> latency. It is ~24 676us. Our workloads like audio and video
> are effected by such long latency:

This seems like a real problem.  But I found there's a long-latency
avoidance mechanism in the loop in __purge_vmap_area_lazy() as
follows,

        if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
            cond_resched_lock(&free_vmap_area_lock);

If it works properly, the latency problem can be solved.  Can you
check why this doesn't work for you?
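
For reference, the check sits at the end of the purge loop in
__purge_vmap_area_lazy() (shown here in a simplified form),

        spin_lock(&free_vmap_area_lock);
        llist_for_each_entry_safe(va, n_va, valist, purge_list) {
                unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;

                /* Return the area back to the free tree/list. */
                va = merge_or_add_vmap_area(va, &free_vmap_area_root,
                                            &free_vmap_area_list);
                atomic_long_sub(nr, &vmap_lazy_nr);

                /* The lock may be dropped and re-taken here. */
                if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                        cond_resched_lock(&free_vmap_area_lock);
        }
        spin_unlock(&free_vmap_area_lock);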

Best Regards,
Huang, Ying

<snip>
Uladzislau Rezki Nov. 17, 2020, 1:04 p.m. UTC | #2
On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> <urezki@gmail.com> wrote:
> >
> > A current "lazy drain" model suffers from at least two issues.
> >
> > First one is related to the unsorted list of vmap areas, thus
> > in order to identify the [min:max] range of areas to be drained,
> > it requires a full list scan. What is a time consuming if the
> > list is too long.
> >
> > Second one and as a next step is about merging all fragments
> > with a free space. What is also a time consuming because it
> > has to iterate over entire list which holds outstanding lazy
> > areas.
> >
> > See below the "preemptirqsoff" tracer that illustrates a high
> > latency. It is ~24 676us. Our workloads like audio and video
> > are effected by such long latency:
> 
> This seems like a real problem.  But I found there's long latency
> avoidance mechanism in the loop in __purge_vmap_area_lazy() as
> follows,
> 
>         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>             cond_resched_lock(&free_vmap_area_lock);
> 
I have added that "resched threshold" because in my tests I could
simply hit out of memory, due to the fact that the drain work is
not up to speed to process such a long outstanding list of vmap
areas.

>
> If it works properly, the latency problem can be solved.  Can you
> check whether this doesn't work for you?
>
We have that cond_resched_lock() in our products. The patch in
question creates bigger vmap areas at an early step (merges them),
so the final structure becomes less fragmented, which speeds up
the drain logic and thus reduces the preemption-off time.

Apart from that, high-priority tasks like RT or DL which are users
of vmalloc()/vfree() can start the draining process from their
contexts, which is also a problem. In that sense, I think we need
to make the vfree() call asynchronous, so latency-sensitive tasks
and others do not perform any draining from their contexts.
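
Just to illustrate the direction, below is a rough sketch (it is
not part of this patch, names are illustrative only):

<snip>
/*
 * Defer the drain to a workqueue, so that no vfree() caller
 * purges the outstanding lazy areas from its own context.
 */
static void drain_vmap_area_work(struct work_struct *work)
{
	mutex_lock(&vmap_purge_lock);
	__purge_vmap_area_lazy(ULONG_MAX, 0);
	mutex_unlock(&vmap_purge_lock);
}
static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

/* Called instead of direct purging when nr_lazy > lazy_max_pages(). */
static void schedule_drain_vmap_work(void)
{
	schedule_work(&drain_vmap_work);
}
<snip>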

--
Vlad Rezki
huang ying Nov. 18, 2020, 2:44 a.m. UTC | #3
On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> > <urezki@gmail.com> wrote:
> > >
> > > A current "lazy drain" model suffers from at least two issues.
> > >
> > > First one is related to the unsorted list of vmap areas, thus
> > > in order to identify the [min:max] range of areas to be drained,
> > > it requires a full list scan. What is a time consuming if the
> > > list is too long.
> > >
> > > Second one and as a next step is about merging all fragments
> > > with a free space. What is also a time consuming because it
> > > has to iterate over entire list which holds outstanding lazy
> > > areas.
> > >
> > > See below the "preemptirqsoff" tracer that illustrates a high
> > > latency. It is ~24 676us. Our workloads like audio and video
> > > are effected by such long latency:
> >
> > This seems like a real problem.  But I found there's long latency
> > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
> > follows,
> >
> >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >             cond_resched_lock(&free_vmap_area_lock);
> >
> I have added that "resched threshold" because of on my tests i could
> simply hit out of memory, due to the fact that a drain work is not up
> to speed to process such long outstanding list of vmap areas.

OK.  Now I think I understand the problem.  For free area purging,
there are multiple "producers" but one "consumer", and it lacks a
mechanism to slow down the "producers" if the "consumer" cannot
catch up.  And your patch tries to resolve the problem via
accelerating the "consumer".
That isn't perfect, but I think we may have quite some opportunities
to merge the free areas, so it should just work.

And I found that the long-latency avoidance logic in
__purge_vmap_area_lazy() appears problematic,

         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
             cond_resched_lock(&free_vmap_area_lock);

Shouldn't it be something as follows?

         if (i >= BATCH &&
             atomic_long_read(&vmap_lazy_nr) < resched_threshold) {
             cond_resched_lock(&free_vmap_area_lock);
             i = 0;
         } else
             i++;

This will accelerate the purging via batching and slow down vmalloc()
via holding free_vmap_area_lock.  If it makes sense, can we try this?

And, can we reduce lazy_max_pages() to control the length of the
purging list?  It could be > 8K if the vmalloc/vfree size is small.

Best Regards,
Huang, Ying
Uladzislau Rezki Nov. 18, 2020, 4:16 p.m. UTC | #4
On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> > > <urezki@gmail.com> wrote:
> > > >
> > > > A current "lazy drain" model suffers from at least two issues.
> > > >
> > > > First one is related to the unsorted list of vmap areas, thus
> > > > in order to identify the [min:max] range of areas to be drained,
> > > > it requires a full list scan. What is a time consuming if the
> > > > list is too long.
> > > >
> > > > Second one and as a next step is about merging all fragments
> > > > with a free space. What is also a time consuming because it
> > > > has to iterate over entire list which holds outstanding lazy
> > > > areas.
> > > >
> > > > See below the "preemptirqsoff" tracer that illustrates a high
> > > > latency. It is ~24 676us. Our workloads like audio and video
> > > > are effected by such long latency:
> > >
> > > This seems like a real problem.  But I found there's long latency
> > > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
> > > follows,
> > >
> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> > >             cond_resched_lock(&free_vmap_area_lock);
> > >
> > I have added that "resched threshold" because of on my tests i could
> > simply hit out of memory, due to the fact that a drain work is not up
> > to speed to process such long outstanding list of vmap areas.
> 
> OK.  Now I think I understand the problem.  For free area purging,
> there are multiple "producers" but one "consumer", and it lacks enough
> mechanism to slow down the "producers" if "consumer" can not catch up.
> And your patch tries to resolve the problem via accelerating the
> "consumer".
>
Seems correct. But just in case, one more time:

the cond_resched_lock was added once upon a time to get rid of a
long preemption-off time. Due to dropping the lock, "producers" can
start generating further vmap areas, so the "consumer" cannot catch
up.

Later on, a resched threshold was added. It is just a simple
protection threshold; once it is passed, freeing is prioritized
back over allocation, so we guarantee that we do not hit out of
memory.
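
In code that guarantee is expressed as below (see also the patch):

<snip>
	resched_threshold = lazy_max_pages() << 1;
	...
	/*
	 * Reschedule only while the number of outstanding lazy
	 * pages stays below twice lazy_max_pages(). Above that,
	 * freeing keeps running with the lock held.
	 */
	if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
		cond_resched_lock(&free_vmap_area_lock);
<snip>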

>
> That isn't perfect, but I think we may have quite some opportunities
> to merge the free areas, so it should just work.
> 
Yes, the merging opportunity should do the job. But of course there
are exceptions.

> And I found the long latency avoidance logic in
> __purge_vmap_area_lazy() appears problematic,
> 
>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>              cond_resched_lock(&free_vmap_area_lock);
> 
> Shouldn't it be something as follows?
> 
>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
> resched_threshold) {
>              cond_resched_lock(&free_vmap_area_lock);
>              i = 0;
>          } else
>              i++;
> 
> This will accelerate the purging via batching and slow down vmalloc()
> via holding free_vmap_area_lock.  If it makes sense, can we try this?
> 
Probably we can switch to just using a "batch" methodology:

<snip>
    if (!(i++ % batch_threshold))
        cond_resched_lock(&free_vmap_area_lock);
<snip>

The question is which value we should use as the batch_threshold:
100, 1000, etc.

Apart from that, and in regard to CONFIG_KASAN_VMALLOC, it seems
that we are not allowed to drop the free_vmap_area_lock at all,
because any simultaneous allocations are not allowed within a
drain region, so they should occur in disjoint regions. But I need
to double-check it.

>
> And, can we reduce lazy_max_pages() to control the length of the
> purging list?  It could be > 8K if the vmalloc/vfree size is small.
>
We can adjust it for sure. But it will influence the number of
global TLB flushes that must be performed.
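
That is because the drain does one flush over the whole outstanding
range (see __purge_vmap_area_lazy() in the patch):

<snip>
	flush_tlb_kernel_range(start, end);
<snip>

so the smaller lazy_max_pages() is, the more often such global
flushes have to be issued.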

Thanks.

--
Vlad Rezki
Huang, Ying Nov. 19, 2020, 1:40 a.m. UTC | #5
Uladzislau Rezki <urezki@gmail.com> writes:

> On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
>> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>> >
>> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
>> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
>> > > <urezki@gmail.com> wrote:
>> > > >
>> > > > A current "lazy drain" model suffers from at least two issues.
>> > > >
>> > > > First one is related to the unsorted list of vmap areas, thus
>> > > > in order to identify the [min:max] range of areas to be drained,
>> > > > it requires a full list scan. What is a time consuming if the
>> > > > list is too long.
>> > > >
>> > > > Second one and as a next step is about merging all fragments
>> > > > with a free space. What is also a time consuming because it
>> > > > has to iterate over entire list which holds outstanding lazy
>> > > > areas.
>> > > >
>> > > > See below the "preemptirqsoff" tracer that illustrates a high
>> > > > latency. It is ~24 676us. Our workloads like audio and video
>> > > > are effected by such long latency:
>> > >
>> > > This seems like a real problem.  But I found there's long latency
>> > > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
>> > > follows,
>> > >
>> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>> > >             cond_resched_lock(&free_vmap_area_lock);
>> > >
>> > I have added that "resched threshold" because of on my tests i could
>> > simply hit out of memory, due to the fact that a drain work is not up
>> > to speed to process such long outstanding list of vmap areas.
>> 
>> OK.  Now I think I understand the problem.  For free area purging,
>> there are multiple "producers" but one "consumer", and it lacks enough
>> mechanism to slow down the "producers" if "consumer" can not catch up.
>> And your patch tries to resolve the problem via accelerating the
>> "consumer".
>>
> Seems, correct. But just in case one more time:
>
> the cond_resched_lock was added once upon a time to get rid of long
> preemption off time. Due to dropping the lock, "producers" can start
> generate further vmap area, so "consumer" can not catch up. Seems

Yes.  And in theory there can be a vfree() storm, that is, thousands
of vfree() calls in a short time.  But I don't think that's a
practical use case.

> Later on, a resched threshold was added. It is just a simple protection
> threshold, passing which, a freeing is prioritized back over allocation,
> so we guarantee that we do not hit out of memory.

Yes.  That can accelerate freeing if necessary.

>>
>> That isn't perfect, but I think we may have quite some opportunities
>> to merge the free areas, so it should just work.
>> 
> Yes, merging opportunity should do the work. But of course there are
> exceptions.
>
>> And I found the long latency avoidance logic in
>> __purge_vmap_area_lazy() appears problematic,
>> 
>>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>>              cond_resched_lock(&free_vmap_area_lock);
>> 
>> Shouldn't it be something as follows?
>> 
>>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
>> resched_threshold) {
>>              cond_resched_lock(&free_vmap_area_lock);
>>              i = 0;
>>          } else
>>              i++;
>> 
>> This will accelerate the purging via batching and slow down vmalloc()
>> via holding free_vmap_area_lock.  If it makes sense, can we try this?
>> 
> Probably we can switch to just using "batch" methodology:
>
> <snip>
>     if (!(i++ % batch_threshold))
>         cond_resched_lock(&free_vmap_area_lock);
> <snip>

That's the typical long-latency avoidance method.

> The question is, which value we should use as a batch_threshold: 100, 1000, etc.

I think we can do some measurement to determine it?

> Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
> allowed to drop the free_vmap_area_lock at all. Because any simultaneous
> allocations are not allowed within a drain region, so it should occur in
> disjoint regions. But i need to double check it.
>
>>
>> And, can we reduce lazy_max_pages() to control the length of the
>> purging list?  It could be > 8K if the vmalloc/vfree size is small.
>>
> We can adjust it for sure. But it will influence on number of global
> TLB flushes that must be performed.

Em...  For example, if we set it to 100, then the number of TLB
flushes can already be reduced to 1% of the un-optimized
implementation.  Do you think so?

Best Regards,
Huang, Ying
Uladzislau Rezki Nov. 19, 2020, 5:36 p.m. UTC | #6
On Thu, Nov 19, 2020 at 09:40:29AM +0800, Huang, Ying wrote:
> Uladzislau Rezki <urezki@gmail.com> writes:
> 
> > On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
> >> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >> >
> >> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> >> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> >> > > <urezki@gmail.com> wrote:
> >> > > >
> >> > > > A current "lazy drain" model suffers from at least two issues.
> >> > > >
> >> > > > First one is related to the unsorted list of vmap areas, thus
> >> > > > in order to identify the [min:max] range of areas to be drained,
> >> > > > it requires a full list scan. What is a time consuming if the
> >> > > > list is too long.
> >> > > >
> >> > > > Second one and as a next step is about merging all fragments
> >> > > > with a free space. What is also a time consuming because it
> >> > > > has to iterate over entire list which holds outstanding lazy
> >> > > > areas.
> >> > > >
> >> > > > See below the "preemptirqsoff" tracer that illustrates a high
> >> > > > latency. It is ~24 676us. Our workloads like audio and video
> >> > > > are effected by such long latency:
> >> > >
> >> > > This seems like a real problem.  But I found there's long latency
> >> > > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
> >> > > follows,
> >> > >
> >> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >> > >             cond_resched_lock(&free_vmap_area_lock);
> >> > >
> >> > I have added that "resched threshold" because of on my tests i could
> >> > simply hit out of memory, due to the fact that a drain work is not up
> >> > to speed to process such long outstanding list of vmap areas.
> >> 
> >> OK.  Now I think I understand the problem.  For free area purging,
> >> there are multiple "producers" but one "consumer", and it lacks enough
> >> mechanism to slow down the "producers" if "consumer" can not catch up.
> >> And your patch tries to resolve the problem via accelerating the
> >> "consumer".
> >>
> > Seems, correct. But just in case one more time:
> >
> > the cond_resched_lock was added once upon a time to get rid of long
> > preemption off time. Due to dropping the lock, "producers" can start
> > generate further vmap area, so "consumer" can not catch up. Seems
> 
> Yes.  And in theory there are vfree() storm, that is, thousands vfree()
> can be called in short time.  But I don't think that's practical use
> case.
> 
> > Later on, a resched threshold was added. It is just a simple protection
> > threshold, passing which, a freeing is prioritized back over allocation,
> > so we guarantee that we do not hit out of memory.
> 
> Yes.  That can accelerate freeing if necessary.
> 
> >>
> >> That isn't perfect, but I think we may have quite some opportunities
> >> to merge the free areas, so it should just work.
> >> 
> > Yes, merging opportunity should do the work. But of course there are
> > exceptions.
> >
> >> And I found the long latency avoidance logic in
> >> __purge_vmap_area_lazy() appears problematic,
> >> 
> >>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >>              cond_resched_lock(&free_vmap_area_lock);
> >> 
> >> Shouldn't it be something as follows?
> >> 
> >>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
> >> resched_threshold) {
> >>              cond_resched_lock(&free_vmap_area_lock);
> >>              i = 0;
> >>          } else
> >>              i++;
> >> 
> >> This will accelerate the purging via batching and slow down vmalloc()
> >> via holding free_vmap_area_lock.  If it makes sense, can we try this?
> >> 
> > Probably we can switch to just using "batch" methodology:
> >
> > <snip>
> >     if (!(i++ % batch_threshold))
> >         cond_resched_lock(&free_vmap_area_lock);
> > <snip>
> 
> That's the typical long latency avoidance method.
> 
> > The question is, which value we should use as a batch_threshold: 100, 1000, etc.
> 
> I think we can do some measurement to determine it?
> 
Hmm... looking at it one more time, I do not see what batching
solves. Anyway, we need to have some threshold (which we do have)
that regulates the priority between vmalloc()/vfree().

What we can do more with it:

- purging could just be performed asynchronously in workqueue
context. Given the fact that now we also merge the outstanding
areas, the data structure (rb-tree) will not be so fragmented.

- lazy_max_pages() can be slightly decreased, if there are existing
workloads which suffer from such a long value. It would be good to
get real complaints and evidence.

> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
> > allocations are not allowed within a drain region, so it should occur in
> > disjoint regions. But i need to double check it.
> >
> >>
> >> And, can we reduce lazy_max_pages() to control the length of the
> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
> >>
> > We can adjust it for sure. But it will influence on number of global
> > TLB flushes that must be performed.
> 
> Em...  For example, if we set it to 100, then the number of the TLB
> flushes can be reduced to 1% of the un-optimized implementation
> already.  Do you think so?
> 
If we set lazy_max_pages() to a vague value such as 100, the
performance will just be destroyed.

Thanks!

--
Vlad Rezki
Huang, Ying Nov. 20, 2020, 2:34 a.m. UTC | #7
Uladzislau Rezki <urezki@gmail.com> writes:

> On Thu, Nov 19, 2020 at 09:40:29AM +0800, Huang, Ying wrote:
>> Uladzislau Rezki <urezki@gmail.com> writes:
>> 
>> > On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
>> >> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>> >> >
>> >> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
>> >> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
>> >> > > <urezki@gmail.com> wrote:
>> >> > > >
>> >> > > > A current "lazy drain" model suffers from at least two issues.
>> >> > > >
>> >> > > > First one is related to the unsorted list of vmap areas, thus
>> >> > > > in order to identify the [min:max] range of areas to be drained,
>> >> > > > it requires a full list scan. What is a time consuming if the
>> >> > > > list is too long.
>> >> > > >
>> >> > > > Second one and as a next step is about merging all fragments
>> >> > > > with a free space. What is also a time consuming because it
>> >> > > > has to iterate over entire list which holds outstanding lazy
>> >> > > > areas.
>> >> > > >
>> >> > > > See below the "preemptirqsoff" tracer that illustrates a high
>> >> > > > latency. It is ~24 676us. Our workloads like audio and video
>> >> > > > are effected by such long latency:
>> >> > >
>> >> > > This seems like a real problem.  But I found there's long latency
>> >> > > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
>> >> > > follows,
>> >> > >
>> >> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>> >> > >             cond_resched_lock(&free_vmap_area_lock);
>> >> > >
>> >> > I have added that "resched threshold" because of on my tests i could
>> >> > simply hit out of memory, due to the fact that a drain work is not up
>> >> > to speed to process such long outstanding list of vmap areas.
>> >> 
>> >> OK.  Now I think I understand the problem.  For free area purging,
>> >> there are multiple "producers" but one "consumer", and it lacks enough
>> >> mechanism to slow down the "producers" if "consumer" can not catch up.
>> >> And your patch tries to resolve the problem via accelerating the
>> >> "consumer".
>> >>
>> > Seems, correct. But just in case one more time:
>> >
>> > the cond_resched_lock was added once upon a time to get rid of long
>> > preemption off time. Due to dropping the lock, "producers" can start
>> > generate further vmap area, so "consumer" can not catch up. Seems
>> 
>> Yes.  And in theory there are vfree() storm, that is, thousands vfree()
>> can be called in short time.  But I don't think that's practical use
>> case.
>> 
>> > Later on, a resched threshold was added. It is just a simple protection
>> > threshold, passing which, a freeing is prioritized back over allocation,
>> > so we guarantee that we do not hit out of memory.
>> 
>> Yes.  That can accelerate freeing if necessary.
>> 
>> >>
>> >> That isn't perfect, but I think we may have quite some opportunities
>> >> to merge the free areas, so it should just work.
>> >> 
>> > Yes, merging opportunity should do the work. But of course there are
>> > exceptions.
>> >
>> >> And I found the long latency avoidance logic in
>> >> __purge_vmap_area_lazy() appears problematic,
>> >> 
>> >>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>> >>              cond_resched_lock(&free_vmap_area_lock);
>> >> 
>> >> Shouldn't it be something as follows?
>> >> 
>> >>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
>> >> resched_threshold) {
>> >>              cond_resched_lock(&free_vmap_area_lock);
>> >>              i = 0;
>> >>          } else
>> >>              i++;
>> >> 
>> >> This will accelerate the purging via batching and slow down vmalloc()
>> >> via holding free_vmap_area_lock.  If it makes sense, can we try this?
>> >> 
>> > Probably we can switch to just using "batch" methodology:
>> >
>> > <snip>
>> >     if (!(i++ % batch_threshold))
>> >         cond_resched_lock(&free_vmap_area_lock);
>> > <snip>
>> 
>> That's the typical long latency avoidance method.
>> 
>> > The question is, which value we should use as a batch_threshold: 100, 1000, etc.
>> 
>> I think we can do some measurement to determine it?
>> 
> Hmm.. looking at it one more time i do not see what batching solves.

Without batch protection, we may release the lock and the CPU at
any time during the loop if "vmap_lazy_nr < resched_threshold".
Too many vmalloc/vfree calls may be done during that.  So I think
we can restrict it.  Batching can improve the performance of the
purging itself too.

> Anyway we need to have some threshold(what we do have), that regulates
> a priority between vmalloc()/vfree().
>
> What we can do more with it are:
>
> - purging should be just performed asynchronously in workqueue context.
> Giving the fact, that now we also do a merge of outstanding areas, the
> data structure(rb-tree) will not be so fragmented.

Async works only if there is idle CPU time on other CPUs.  And it
may punish other innocent workloads instead of the heavy
vmalloc/vfree users.  So we should be careful about that.

> - lazy_max_pages() can slightly be decreased. If there are existing
> workloads which suffer from such long value. It would be good to get
> real complains and evidence.
>
>> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
>> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
>> > allocations are not allowed within a drain region, so it should occur in
>> > disjoint regions. But i need to double check it.
>> >
>> >>
>> >> And, can we reduce lazy_max_pages() to control the length of the
>> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
>> >>
>> > We can adjust it for sure. But it will influence on number of global
>> > TLB flushes that must be performed.
>> 
>> Em...  For example, if we set it to 100, then the number of the TLB
>> flushes can be reduced to 1% of the un-optimized implementation
>> already.  Do you think so?
>> 
> If we set lazy_max_pages() to vague value such as 100, the performance
> will be just destroyed.

Sorry, my original words weren't clear enough.  What I really want
to suggest is to control the length of the purging list instead of
reducing lazy_max_pages() directly.  That is, we can have an
"atomic_t nr_purge_item" to record the length of the purging list
and start purging if (vmap_lazy_nr > lazy_max_pages &&
nr_purge_item > max_purge_item).  vmap_lazy_nr is to control the
virtual address space, nr_purge_item is to control the batching
purging latency.  "100" is just an example, the real value should
be determined according to the test results.
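
A sketch of what I mean (all the names and the "100" are examples
only, the real value should come from measurements):

        static atomic_t nr_purge_item;  /* length of the purging list */
        #define MAX_PURGE_ITEM  100     /* example value only */

        /* In free_vmap_area_noflush(): */
        nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
                                PAGE_SHIFT, &vmap_lazy_nr);
        atomic_inc(&nr_purge_item);

        /* Start purging only when both limits are exceeded. */
        if (nr_lazy > lazy_max_pages() &&
            atomic_read(&nr_purge_item) > MAX_PURGE_ITEM)
                try_purge_vmap_area_lazy();

        /* (nr_purge_item is decremented back in the purge loop.) */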

Best Regards,
Huang, Ying
Uladzislau Rezki Nov. 23, 2020, 1:59 p.m. UTC | #8
On Fri, Nov 20, 2020 at 10:34:19AM +0800, Huang, Ying wrote:
> Uladzislau Rezki <urezki@gmail.com> writes:
> 
> > On Thu, Nov 19, 2020 at 09:40:29AM +0800, Huang, Ying wrote:
> >> Uladzislau Rezki <urezki@gmail.com> writes:
> >> 
> >> > On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
> >> >> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >> >> >
> >> >> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> >> >> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> >> >> > > <urezki@gmail.com> wrote:
> >> >> > > >
> >> >> > > > A current "lazy drain" model suffers from at least two issues.
> >> >> > > >
> >> >> > > > First one is related to the unsorted list of vmap areas, thus
> >> >> > > > in order to identify the [min:max] range of areas to be drained,
> >> >> > > > it requires a full list scan. What is a time consuming if the
> >> >> > > > list is too long.
> >> >> > > >
> >> >> > > > Second one and as a next step is about merging all fragments
> >> >> > > > with a free space. What is also a time consuming because it
> >> >> > > > has to iterate over entire list which holds outstanding lazy
> >> >> > > > areas.
> >> >> > > >
> >> >> > > > See below the "preemptirqsoff" tracer that illustrates a high
> >> >> > > > latency. It is ~24 676us. Our workloads like audio and video
> >> >> > > > are effected by such long latency:
> >> >> > >
> >> >> > > This seems like a real problem.  But I found there's long latency
> >> >> > > avoidance mechanism in the loop in __purge_vmap_area_lazy() as
> >> >> > > follows,
> >> >> > >
> >> >> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >> >> > >             cond_resched_lock(&free_vmap_area_lock);
> >> >> > >
> >> >> > I have added that "resched threshold" because of on my tests i could
> >> >> > simply hit out of memory, due to the fact that a drain work is not up
> >> >> > to speed to process such long outstanding list of vmap areas.
> >> >> 
> >> >> OK.  Now I think I understand the problem.  For free area purging,
> >> >> there are multiple "producers" but one "consumer", and it lacks enough
> >> >> mechanism to slow down the "producers" if "consumer" can not catch up.
> >> >> And your patch tries to resolve the problem via accelerating the
> >> >> "consumer".
> >> >>
> >> > Seems, correct. But just in case one more time:
> >> >
> >> > the cond_resched_lock was added once upon a time to get rid of long
> >> > preemption off time. Due to dropping the lock, "producers" can start
> >> > generate further vmap area, so "consumer" can not catch up. Seems
> >> 
> >> Yes.  And in theory there are vfree() storm, that is, thousands vfree()
> >> can be called in short time.  But I don't think that's practical use
> >> case.
> >> 
> >> > Later on, a resched threshold was added. It is just a simple protection
> >> > threshold, passing which, a freeing is prioritized back over allocation,
> >> > so we guarantee that we do not hit out of memory.
> >> 
> >> Yes.  That can accelerate freeing if necessary.
> >> 
> >> >>
> >> >> That isn't perfect, but I think we may have quite some opportunities
> >> >> to merge the free areas, so it should just work.
> >> >> 
> >> > Yes, merging opportunity should do the work. But of course there are
> >> > exceptions.
> >> >
> >> >> And I found the long latency avoidance logic in
> >> >> __purge_vmap_area_lazy() appears problematic,
> >> >> 
> >> >>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >> >>              cond_resched_lock(&free_vmap_area_lock);
> >> >> 
> >> >> Shouldn't it be something as follows?
> >> >> 
> >> >>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
> >> >> resched_threshold) {
> >> >>              cond_resched_lock(&free_vmap_area_lock);
> >> >>              i = 0;
> >> >>          } else
> >> >>              i++;
> >> >> 
> >> >> This will accelerate the purging via batching and slow down vmalloc()
> >> >> via holding free_vmap_area_lock.  If it makes sense, can we try this?
> >> >> 
> >> > Probably we can switch to just using "batch" methodology:
> >> >
> >> > <snip>
> >> >     if (!(i++ % batch_threshold))
> >> >         cond_resched_lock(&free_vmap_area_lock);
> >> > <snip>
> >> 
> >> That's the typical long latency avoidance method.
> >> 
> >> > The question is, which value we should use as a batch_threshold: 100, 1000, etc.
> >> 
> >> I think we can do some measurement to determine it?
> >> 
> > Hmm.. looking at it one more time i do not see what batching solves.
> 
> Without batch protection, we may release the lock and CPU anytime during
> looping if "vmap_lazy_nr < resched_threshold".  Too many vmalloc/vfree
> may be done during that.  So I think we can restrict it.  Batching can
> improve the performance of purging itself too.
> 
In theory:
I see your point. It is a trade-off though, between a faster
vmalloc() and a faster vfree(). Batching will make allocation
tighter and, yes, speed up the process of draining, holding a CPU
until the batch is drained + introducing latency for other tasks.

In practice:
as I mentioned, I think we need to measure the batching approach,
say we set it to 100, providing some figures so we see some
evidence from a practical point of view. For example, run
test_vmalloc.sh to analyze it. If you see some advantages from a
performance point of view, it would be great. Just share some data.

> > Anyway we need to have some threshold(what we do have), that regulates
> > a priority between vmalloc()/vfree().
> >
> > What we can do more with it are:
> >
> > - purging should be just performed asynchronously in workqueue context.
> > Giving the fact, that now we also do a merge of outstanding areas, the
> > data structure(rb-tree) will not be so fragmented.
> 
> Async works only if there are idle CPU time on other CPUs.  And it may
> punish other innocent workloads instead of the heavy vmalloc/vfree
> users.  So we should be careful about that.
> 
Yep, scheduling latency will be a side effect of such an approach.
The question is whether it is negligible and can be considered a
risk. I do not think it would be a big problem.

I have another issue with it though, which I cannot explain so far.
If I am doing the "purge" in a separate worker, I see that memory
leaks after heavy test runs.

> > - lazy_max_pages() can slightly be decreased. If there are existing
> > workloads which suffer from such long value. It would be good to get
> > real complains and evidence.
> >
> >> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
> >> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
> >> > allocations are not allowed within a drain region, so it should occur in
> >> > disjoint regions. But i need to double check it.
> >> >
> >> >>
> >> >> And, can we reduce lazy_max_pages() to control the length of the
> >> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
> >> >>
> >> > We can adjust it for sure. But it will influence on number of global
> >> > TLB flushes that must be performed.
> >> 
> >> Em...  For example, if we set it to 100, then the number of the TLB
> >> flushes can be reduced to 1% of the un-optimized implementation
> >> already.  Do you think so?
> >> 
> > If we set lazy_max_pages() to vague value such as 100, the performance
> > will be just destroyed.
> 
> Sorry, my original words weren't clear enough.  What I really want to
> suggest is to control the length of the purging list instead of reduce
> lazy_max_pages() directly.  That is, we can have a "atomic_t
> nr_purge_item" to record the length of the purging list and start
> purging if (vmap_lazy_nr > lazy_max_pages && nr_purge_item >
> max_purge_item).  vmap_lazy_nr is to control the virtual address space,
> nr_purge_item is to control the batching purging latency.  "100" is just
> an example, the real value should be determined according to the test
> results.
> 
OK. Now I see what you meant. Please note that the merging is in
place, so the list size gets reduced.

--
Vlad Rezki
Huang, Ying Nov. 24, 2020, 2:25 a.m. UTC | #9
Uladzislau Rezki <urezki@gmail.com> writes:
>> >> >> And I found the long latency avoidance logic in
>> >> >> __purge_vmap_area_lazy() appears problematic,
>> >> >> 
>> >> >>          if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
>> >> >>              cond_resched_lock(&free_vmap_area_lock);
>> >> >> 
>> >> >> Shouldn't it be something as follows?
>> >> >> 
>> >> >>          if (i >= BATCH && atomic_long_read(&vmap_lazy_nr) <
>> >> >> resched_threshold) {
>> >> >>              cond_resched_lock(&free_vmap_area_lock);
>> >> >>              i = 0;
>> >> >>          } else
>> >> >>              i++;
>> >> >> 
>> >> >> This will accelerate the purging via batching and slow down vmalloc()
>> >> >> via holding free_vmap_area_lock.  If it makes sense, can we try this?
>> >> >> 
>> >> > Probably we can switch to just using "batch" methodology:
>> >> >
>> >> > <snip>
>> >> >     if (!(i++ % batch_threshold))
>> >> >         cond_resched_lock(&free_vmap_area_lock);
>> >> > <snip>
>> >> 
>> >> That's the typical long latency avoidance method.
>> >> 
>> >> > The question is, which value we should use as a batch_threshold: 100, 1000, etc.
>> >> 
>> >> I think we can do some measurement to determine it?
>> >> 
>> > Hmm.. looking at it one more time i do not see what batching solves.
>> 
>> Without batch protection, we may release the lock and CPU anytime during
>> looping if "vmap_lazy_nr < resched_threshold".  Too many vmalloc/vfree
>> may be done during that.  So I think we can restrict it.  Batching can
>> improve the performance of purging itself too.
>> 
> In theory:
> I see your point. It is a trade-off though, to allow faster vmalloc or vfree.
> Batching will make alloc more tight, and yes, speed up the process of draining
> holding a CPU until batch is drained + introducing latency for other tasks.
>
> In practical:
> I mentioned about that, i think we need to measure the batching approach, say
> we set it to 100, providing some figures so we see some evidence from practical
> point of view. For example run test_vmalloc.sh to analyze it. If you see some
> advantages from performance point of view it would be great. Just share some
> data.

Per my understanding, this is the common practice in the kernel to
satisfy both throughput and latency requirements.  But it may not
be important for this specific case.  I am afraid I have no time to
work on this now.  Just my 2 cents.  If you don't think that's a
good idea, just ignore it.

>> > Anyway we need to have some threshold(what we do have), that regulates
>> > a priority between vmalloc()/vfree().
>> >
>> > What we can do more with it are:
>> >
>> > - purging should be just performed asynchronously in workqueue context.
>> > Giving the fact, that now we also do a merge of outstanding areas, the
>> > data structure(rb-tree) will not be so fragmented.
>> 
>> Async works only if there are idle CPU time on other CPUs.  And it may
>> punish other innocent workloads instead of the heavy vmalloc/vfree
>> users.  So we should be careful about that.
>> 
> Yep, scheduling latency will be as a side affect of such approach. The question
> is if it is negligible and can be considered as a risk. I do not think it would 
> be a big problem.
>
> I have other issue with it though, which i can not explain so far. If i am doing 
> the "purge" in the separate worker, i see that a memory leaks after heavy test
> runs.
>
>> > - lazy_max_pages() can slightly be decreased. If there are existing
>> > workloads which suffer from such long value. It would be good to get
>> > real complains and evidence.
>> >
>> >> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
>> >> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
>> >> > allocations are not allowed within a drain region, so it should occur in
>> >> > disjoint regions. But i need to double check it.
>> >> >
>> >> >>
>> >> >> And, can we reduce lazy_max_pages() to control the length of the
>> >> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
>> >> >>
>> >> > We can adjust it for sure. But it will influence on number of global
>> >> > TLB flushes that must be performed.
>> >> 
>> >> Em...  For example, if we set it to 100, then the number of the TLB
>> >> flushes can be reduced to 1% of the un-optimized implementation
>> >> already.  Do you think so?
>> >> 
>> > If we set lazy_max_pages() to vague value such as 100, the performance
>> > will be just destroyed.
>> 
>> Sorry, my original words weren't clear enough.  What I really want to
>> suggest is to control the length of the purging list instead of reduce
>> lazy_max_pages() directly.  That is, we can have a "atomic_t
>> nr_purge_item" to record the length of the purging list and start
>> purging if (vmap_lazy_nr > lazy_max_pages && nr_purge_item >
>> max_purge_item).  vmap_lazy_nr is to control the virtual address space,
>> nr_purge_item is to control the batching purging latency.  "100" is just
>> an example, the real value should be determined according to the test
>> results.
>> 
> OK. Now i see what you meant. Please note, the merging is in place, so
> the list size gets reduced.

Yes.  In theory, even with merging, the length of the purging list
may become too long in some cases.  And the code/algorithm changes
that are needed to control the length of the purging list are much
smaller than those needed for merging.  So I suggest doing the
length control first, then merging.  Again, just my 2 cents.

Best Regards,
Huang, Ying
Uladzislau Rezki Nov. 24, 2020, 4:40 p.m. UTC | #10
> >> >> 
> >> >> That's the typical long latency avoidance method.
> >> >> 
> >> >> > The question is, which value we should use as a batch_threshold: 100, 1000, etc.
> >> >> 
> >> >> I think we can do some measurement to determine it?
> >> >> 
> >> > Hmm.. looking at it one more time i do not see what batching solves.
> >> 
> >> Without batch protection, we may release the lock and CPU anytime during
> >> looping if "vmap_lazy_nr < resched_threshold".  Too many vmalloc/vfree
> >> may be done during that.  So I think we can restrict it.  Batching can
> >> improve the performance of purging itself too.
> >> 
> > In theory:
> > I see your point. It is a trade-off though, to allow faster vmalloc or vfree.
> > Batching will make alloc more tight, and yes, speed up the process of draining
> > holding a CPU until batch is drained + introducing latency for other tasks.
> >
> > In practical:
> > I mentioned about that, i think we need to measure the batching approach, say
> > we set it to 100, providing some figures so we see some evidence from practical
> > point of view. For example run test_vmalloc.sh to analyze it. If you see some
> > advantages from performance point of view it would be great. Just share some
> > data.
> 
> Per my understanding, this is the common practice in kernel to satisfy
> both throughput and latency requirement.  But it may be not important
> for this specific case.  I am afraid I have no time to work on this now.
> Just my 2 cents.  If you don't think that's a good idea, just ignore it.
> 
I will keep it in mind, thanks for your input.

> >> > Anyway we need to have some threshold(what we do have), that regulates
> >> > a priority between vmalloc()/vfree().
> >> >
> >> > What we can do more with it are:
> >> >
> >> > - purging should be just performed asynchronously in workqueue context.
> >> > Giving the fact, that now we also do a merge of outstanding areas, the
> >> > data structure(rb-tree) will not be so fragmented.
> >> 
> >> Async works only if there are idle CPU time on other CPUs.  And it may
> >> punish other innocent workloads instead of the heavy vmalloc/vfree
> >> users.  So we should be careful about that.
> >> 
> > Yep, scheduling latency will be as a side affect of such approach. The question
> > is if it is negligible and can be considered as a risk. I do not think it would 
> > be a big problem.
> >
> > I have other issue with it though, which i can not explain so far. If i am doing 
> > the "purge" in the separate worker, i see that a memory leaks after heavy test
> > runs.
> >
> >> > - lazy_max_pages() can slightly be decreased. If there are existing
> >> > workloads which suffer from such long value. It would be good to get
> >> > real complains and evidence.
> >> >
> >> >> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
> >> >> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
> >> >> > allocations are not allowed within a drain region, so it should occur in
> >> >> > disjoint regions. But i need to double check it.
> >> >> >
> >> >> >>
> >> >> >> And, can we reduce lazy_max_pages() to control the length of the
> >> >> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
> >> >> >>
> >> >> > We can adjust it for sure. But it will influence on number of global
> >> >> > TLB flushes that must be performed.
> >> >> 
> >> >> Em...  For example, if we set it to 100, then the number of the TLB
> >> >> flushes can be reduced to 1% of the un-optimized implementation
> >> >> already.  Do you think so?
> >> >> 
> >> > If we set lazy_max_pages() to vague value such as 100, the performance
> >> > will be just destroyed.
> >> 
> >> Sorry, my original words weren't clear enough.  What I really want to
> >> suggest is to control the length of the purging list instead of reduce
> >> lazy_max_pages() directly.  That is, we can have a "atomic_t
> >> nr_purge_item" to record the length of the purging list and start
> >> purging if (vmap_lazy_nr > lazy_max_pages && nr_purge_item >
> >> max_purge_item).  vmap_lazy_nr is to control the virtual address space,
> >> nr_purge_item is to control the batching purging latency.  "100" is just
> >> an example, the real value should be determined according to the test
> >> results.
> >> 
> > OK. Now i see what you meant. Please note, the merging is in place, so
> > the list size gets reduced.
> 
> Yes.  In theory, even with merging, the length of the purging list may
> become too long in some cases.  And the code/algorithm changes that are
> needed by controlling the length of the purging list is much less than
> that are needed by merging.  So I suggest to do length controlling
> firstly, then merging.  Again, just my 2 cents.
> 
All such tuning parameters work for one case and not for others.
Therefore I prefer to have something more generic that tends to
improve things, instead of thinking about how to tune parameters
to cover all test cases and workloads.

Anyway, thank you for your comments!

--
Vlad Rezki
Huang, Ying Nov. 25, 2020, 12:52 a.m. UTC | #11
Uladzislau Rezki <urezki@gmail.com> writes:
>> >> > - lazy_max_pages() can slightly be decreased. If there are existing
>> >> > workloads which suffer from such long value. It would be good to get
>> >> > real complains and evidence.
>> >> >
>> >> >> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
>> >> >> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
>> >> >> > allocations are not allowed within a drain region, so it should occur in
>> >> >> > disjoint regions. But i need to double check it.
>> >> >> >
>> >> >> >>
>> >> >> >> And, can we reduce lazy_max_pages() to control the length of the
>> >> >> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
>> >> >> >>
>> >> >> > We can adjust it for sure. But it will influence on number of global
>> >> >> > TLB flushes that must be performed.
>> >> >> 
>> >> >> Em...  For example, if we set it to 100, then the number of the TLB
>> >> >> flushes can be reduced to 1% of the un-optimized implementation
>> >> >> already.  Do you think so?
>> >> >> 
>> >> > If we set lazy_max_pages() to vague value such as 100, the performance
>> >> > will be just destroyed.
>> >> 
>> >> Sorry, my original words weren't clear enough.  What I really want to
>> >> suggest is to control the length of the purging list instead of reduce
>> >> lazy_max_pages() directly.  That is, we can have a "atomic_t
>> >> nr_purge_item" to record the length of the purging list and start
>> >> purging if (vmap_lazy_nr > lazy_max_pages && nr_purge_item >
>> >> max_purge_item).  vmap_lazy_nr is to control the virtual address space,
>> >> nr_purge_item is to control the batching purging latency.  "100" is just
>> >> an example, the real value should be determined according to the test
>> >> results.
>> >> 
>> > OK. Now i see what you meant. Please note, the merging is in place, so
>> > the list size gets reduced.
>> 
>> Yes.  In theory, even with merging, the length of the purging list may
>> become too long in some cases.  And the code/algorithm changes that are
>> needed by controlling the length of the purging list is much less than
>> that are needed by merging.  So I suggest to do length controlling
>> firstly, then merging.  Again, just my 2 cents.
>> 
> All such kind of tuning parameters work for one case and does not for
> others. Therefore i prefer to have something more generic that tends
> to improve the things, instead of thinking how to tune parameters to
> cover all test cases and workloads.

It's a new mechanism to control the length of the purging list
directly.  So, I don't think that's just parameter tuning.  It's
just a simple and direct method.  It can work together with the
merging method to control the purging latency even if the vmap
areas cannot be merged in some cases.  But these cases may not
exist in practice, so I will not insist on using this method.

Best Regards,
Huang, Ying
Uladzislau Rezki Nov. 25, 2020, 8:34 p.m. UTC | #12
On Wed, Nov 25, 2020 at 08:52:58AM +0800, Huang, Ying wrote:
> Uladzislau Rezki <urezki@gmail.com> writes:
> >> >> > - lazy_max_pages() can slightly be decreased. If there are existing
> >> >> > workloads which suffer from such long value. It would be good to get
> >> >> > real complains and evidence.
> >> >> >
> >> >> >> > Apart of it and in regard to CONFIG_KASAN_VMALLOC, it seems that we are not
> >> >> >> > allowed to drop the free_vmap_area_lock at all. Because any simultaneous
> >> >> >> > allocations are not allowed within a drain region, so it should occur in
> >> >> >> > disjoint regions. But i need to double check it.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> And, can we reduce lazy_max_pages() to control the length of the
> >> >> >> >> purging list?  It could be > 8K if the vmalloc/vfree size is small.
> >> >> >> >>
> >> >> >> > We can adjust it for sure. But it will influence on number of global
> >> >> >> > TLB flushes that must be performed.
> >> >> >> 
> >> >> >> Em...  For example, if we set it to 100, then the number of the TLB
> >> >> >> flushes can be reduced to 1% of the un-optimized implementation
> >> >> >> already.  Do you think so?
> >> >> >> 
> >> >> > If we set lazy_max_pages() to vague value such as 100, the performance
> >> >> > will be just destroyed.
> >> >> 
> >> >> Sorry, my original words weren't clear enough.  What I really want to
> >> >> suggest is to control the length of the purging list instead of reduce
> >> >> lazy_max_pages() directly.  That is, we can have a "atomic_t
> >> >> nr_purge_item" to record the length of the purging list and start
> >> >> purging if (vmap_lazy_nr > lazy_max_pages && nr_purge_item >
> >> >> max_purge_item).  vmap_lazy_nr is to control the virtual address space,
> >> >> nr_purge_item is to control the batching purging latency.  "100" is just
> >> >> an example, the real value should be determined according to the test
> >> >> results.
> >> >> 
> >> > OK. Now i see what you meant. Please note, the merging is in place, so
> >> > the list size gets reduced.
> >> 
> >> Yes.  In theory, even with merging, the length of the purging list may
> >> become too long in some cases.  And the code/algorithm changes that are
> >> needed by controlling the length of the purging list is much less than
> >> that are needed by merging.  So I suggest to do length controlling
> >> firstly, then merging.  Again, just my 2 cents.
> >> 
> > All such kind of tuning parameters work for one case and does not for
> > others. Therefore i prefer to have something more generic that tends
> > to improve the things, instead of thinking how to tune parameters to
> > cover all test cases and workloads.
> 
> It's a new mechanism to control the length of the purging list directly.
> So, I don't think that's just parameter tuning.  It's just a simple and
> direct method.  It can work together with merging method to control the
> purging latency even if the vmap areas cannot be merged in some cases.
> But these cases may not exist in practice, so I will not insist to use
> this method.
> 
No problem. I see your point about an extra mechanism to control
the list length. Let's see if there are still complaints from
users. If we get such feedback, we will rework it further.

Thanks!

--
Vlad Rezki

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..80c0181c411d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -72,16 +72,14 @@  struct vmap_area {
 	struct list_head list;          /* address sorted list */
 
 	/*
-	 * The following three variables can be packed, because
-	 * a vmap_area object is always one of the three states:
+	 * The following two variables can be packed, because
+	 * a vmap_area object can be either:
 	 *    1) in "free" tree (root is vmap_area_root)
-	 *    2) in "busy" tree (root is free_vmap_area_root)
-	 *    3) in purge list  (head is vmap_purge_list)
+	 *    2) or "busy" tree (root is free_vmap_area_root)
 	 */
 	union {
 		unsigned long subtree_max_size; /* in "free" tree */
 		struct vm_struct *vm;           /* in "busy" tree */
-		struct llist_node purge_list;   /* in purge list */
 	};
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b08b06a8cc2a..f16a71fb0624 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -413,10 +413,13 @@  static DEFINE_SPINLOCK(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
-static LLIST_HEAD(vmap_purge_list);
 static struct rb_root vmap_area_root = RB_ROOT;
 static bool vmap_initialized __read_mostly;
 
+static struct rb_root purge_vmap_area_root = RB_ROOT;
+static LIST_HEAD(purge_vmap_area_list);
+static DEFINE_SPINLOCK(purge_vmap_area_lock);
+
 /*
  * This kmem_cache is used for vmap_area objects. Instead of
  * allocating from slab we reuse an object from this cache to
@@ -820,10 +823,17 @@  merge_or_add_vmap_area(struct vmap_area *va,
 	if (!merged)
 		link_va(va, root, parent, link, head);
 
-	/*
-	 * Last step is to check and update the tree.
-	 */
-	augment_tree_propagate_from(va);
+	return va;
+}
+
+static __always_inline struct vmap_area *
+merge_or_add_vmap_area_augment(struct vmap_area *va,
+	struct rb_root *root, struct list_head *head)
+{
+	va = merge_or_add_vmap_area(va, root, head);
+	if (va)
+		augment_tree_propagate_from(va);
+
 	return va;
 }
 
@@ -1138,7 +1148,7 @@  static void free_vmap_area(struct vmap_area *va)
 	 * Insert/Merge it back to the free tree/list.
 	 */
 	spin_lock(&free_vmap_area_lock);
-	merge_or_add_vmap_area(va, &free_vmap_area_root, &free_vmap_area_list);
+	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
 	spin_unlock(&free_vmap_area_lock);
 }
 
@@ -1326,32 +1336,32 @@  void set_iounmap_nonlazy(void)
 static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
 	unsigned long resched_threshold;
-	struct llist_node *valist;
-	struct vmap_area *va;
-	struct vmap_area *n_va;
+	struct list_head local_pure_list;
+	struct vmap_area *va, *n_va;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
-	valist = llist_del_all(&vmap_purge_list);
-	if (unlikely(valist == NULL))
+	spin_lock(&purge_vmap_area_lock);
+	purge_vmap_area_root = RB_ROOT;
+	list_replace_init(&purge_vmap_area_list, &local_pure_list);
+	spin_unlock(&purge_vmap_area_lock);
+
+	if (unlikely(list_empty(&local_pure_list)))
 		return false;
 
-	/*
-	 * TODO: to calculate a flush range without looping.
-	 * The list can be up to lazy_max_pages() elements.
-	 */
-	llist_for_each_entry(va, valist, purge_list) {
-		if (va->va_start < start)
-			start = va->va_start;
-		if (va->va_end > end)
-			end = va->va_end;
-	}
+	start = min(start,
+		list_first_entry(&local_pure_list,
+			struct vmap_area, list)->va_start);
+
+	end = max(end,
+		list_last_entry(&local_pure_list,
+			struct vmap_area, list)->va_end);
 
 	flush_tlb_kernel_range(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
-	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+	list_for_each_entry_safe(va, n_va, &local_pure_list, list) {
 		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 		unsigned long orig_start = va->va_start;
 		unsigned long orig_end = va->va_end;
@@ -1361,8 +1371,8 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		 * detached and there is no need to "unlink" it from
 		 * anything.
 		 */
-		va = merge_or_add_vmap_area(va, &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
+				&free_vmap_area_list);
 
 		if (!va)
 			continue;
@@ -1419,9 +1429,15 @@  static void free_vmap_area_noflush(struct vmap_area *va)
 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
 				PAGE_SHIFT, &vmap_lazy_nr);
 
-	/* After this point, we may free va at any time */
-	llist_add(&va->purge_list, &vmap_purge_list);
+	/*
+	 * Merge or place it to the purge tree/list.
+	 */
+	spin_lock(&purge_vmap_area_lock);
+	merge_or_add_vmap_area(va,
+		&purge_vmap_area_root, &purge_vmap_area_list);
+	spin_unlock(&purge_vmap_area_lock);
 
+	/* After this point, we may free va at any time */
 	if (unlikely(nr_lazy > lazy_max_pages()))
 		try_purge_vmap_area_lazy();
 }
@@ -3351,8 +3367,8 @@  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	while (area--) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
+				&free_vmap_area_list);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end);
@@ -3401,8 +3417,8 @@  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
+				&free_vmap_area_list);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end);
@@ -3482,18 +3498,15 @@  static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 
 static void show_purge_info(struct seq_file *m)
 {
-	struct llist_node *head;
 	struct vmap_area *va;
 
-	head = READ_ONCE(vmap_purge_list.first);
-	if (head == NULL)
-		return;
-
-	llist_for_each_entry(va, head, purge_list) {
+	spin_lock(&purge_vmap_area_lock);
+	list_for_each_entry(va, &purge_vmap_area_list, list) {
 		seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
 			(void *)va->va_start, (void *)va->va_end,
 			va->va_end - va->va_start);
 	}
+	spin_unlock(&purge_vmap_area_lock);
 }
 
 static int s_show(struct seq_file *m, void *p)
@@ -3551,10 +3564,7 @@  static int s_show(struct seq_file *m, void *p)
 	seq_putc(m, '\n');
 
 	/*
-	 * As a final step, dump "unpurged" areas. Note,
-	 * that entire "/proc/vmallocinfo" output will not
-	 * be address sorted, because the purge list is not
-	 * sorted.
+	 * As a final step, dump "unpurged" areas.
 	 */
 	if (list_is_last(&va->list, &vmap_area_list))
 		show_purge_info(m);