diff mbox series

[v1,2/2] mm: add priority threshold to __purge_vmap_area_lazy()

Message ID 20190124115648.9433-3-urezki@gmail.com (mailing list archive)
State New, archived
Headers show
Series stability fixes for vmalloc allocator | expand

Commit Message

Uladzislau Rezki Jan. 24, 2019, 11:56 a.m. UTC
commit 763b218ddfaf ("mm: add preempt points into
__purge_vmap_area_lazy()")

introduced some preempt points, one of those is making an
allocation more prioritized over lazy free of vmap areas.

Prioritizing an allocation over freeing does not work well
all the time, i.e. it should be rather a compromise.

1) Number of lazy pages directly influence on busy list length
thus on operations like: allocation, lookup, unmap, remove, etc.

2) Under heavy stress of vmalloc subsystem i run into a situation
when memory usage gets increased hitting out_of_memory -> panic
state due to completely blocking of logic that frees vmap areas
in the __purge_vmap_area_lazy() function.

Establish a threshold passing which the freeing is prioritized
back over allocation creating a balance between each other.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Comments

Andrew Morton Jan. 28, 2019, 8:04 p.m. UTC | #1
On Thu, 24 Jan 2019 12:56:48 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:

> commit 763b218ddfaf ("mm: add preempt points into
> __purge_vmap_area_lazy()")
> 
> introduced some preempt points, one of those is making an
> allocation more prioritized over lazy free of vmap areas.
> 
> Prioritizing an allocation over freeing does not work well
> all the time, i.e. it should be rather a compromise.
> 
> 1) Number of lazy pages directly influence on busy list length
> thus on operations like: allocation, lookup, unmap, remove, etc.
> 
> 2) Under heavy stress of vmalloc subsystem i run into a situation
> when memory usage gets increased hitting out_of_memory -> panic
> state due to completely blocking of logic that frees vmap areas
> in the __purge_vmap_area_lazy() function.
> 
> Establish a threshold passing which the freeing is prioritized
> back over allocation creating a balance between each other.

It would be useful to credit the vmalloc test driver for this
discovery, and perhaps to identify specifically which test triggered
the kernel misbehaviour.  Please send along suitable words and I'll add
them.


> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	struct llist_node *valist;
>  	struct vmap_area *va;
>  	struct vmap_area *n_va;
> -	bool do_free = false;
> +	int resched_threshold;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
>  
>  	valist = llist_del_all(&vmap_purge_list);
> +	if (unlikely(valist == NULL))
> +		return false;

Why this change?

> +	/*
> +	 * TODO: to calculate a flush range without looping.
> +	 * The list can be up to lazy_max_pages() elements.
> +	 */

How important is this?

>  	llist_for_each_entry(va, valist, purge_list) {
>  		if (va->va_start < start)
>  			start = va->va_start;
>  		if (va->va_end > end)
>  			end = va->va_end;
> -		do_free = true;
>  	}
>  
> -	if (!do_free)
> -		return false;
> -
>  	flush_tlb_kernel_range(start, end);
> +	resched_threshold = (int) lazy_max_pages() << 1;

Is the typecast really needed?

Perhaps resched_threshold shiould have unsigned long type and perhaps
vmap_lazy_nr should be atomic_long_t?

>  	spin_lock(&vmap_area_lock);
>  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  
>  		__free_vmap_area(va);
>  		atomic_sub(nr, &vmap_lazy_nr);
> -		cond_resched_lock(&vmap_area_lock);
> +
> +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> +			cond_resched_lock(&vmap_area_lock);
>  	}
>  	spin_unlock(&vmap_area_lock);
>  	return true;
Joel Fernandes Jan. 28, 2019, 10:45 p.m. UTC | #2
On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> commit 763b218ddfaf ("mm: add preempt points into
> __purge_vmap_area_lazy()")
> 
> introduced some preempt points, one of those is making an
> allocation more prioritized over lazy free of vmap areas.
> 
> Prioritizing an allocation over freeing does not work well
> all the time, i.e. it should be rather a compromise.
> 
> 1) Number of lazy pages directly influence on busy list length
> thus on operations like: allocation, lookup, unmap, remove, etc.
> 
> 2) Under heavy stress of vmalloc subsystem i run into a situation
> when memory usage gets increased hitting out_of_memory -> panic
> state due to completely blocking of logic that frees vmap areas
> in the __purge_vmap_area_lazy() function.
> 
> Establish a threshold passing which the freeing is prioritized
> back over allocation creating a balance between each other.

I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
is greater than half of lazy_max_pages(). Which IIUC will be more likely if
the number of CPUs is large.

In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
so one could say that that's when you *should* reschedule since the frees are
taking too long and hurting real-time tasks.

Could this be better solved by tweaking lazy_max_pages() such that purging is
more aggressive?

Another approach could be to detect the scenario you brought up (allocations
happening faster than free), somehow, and avoid a reschedule?

thanks,

 - Joel

> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index fb4fb5fcee74..abe83f885069 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	struct llist_node *valist;
>  	struct vmap_area *va;
>  	struct vmap_area *n_va;
> -	bool do_free = false;
> +	int resched_threshold;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
>  
>  	valist = llist_del_all(&vmap_purge_list);
> +	if (unlikely(valist == NULL))
> +		return false;
> +
> +	/*
> +	 * TODO: to calculate a flush range without looping.
> +	 * The list can be up to lazy_max_pages() elements.
> +	 */
>  	llist_for_each_entry(va, valist, purge_list) {
>  		if (va->va_start < start)
>  			start = va->va_start;
>  		if (va->va_end > end)
>  			end = va->va_end;
> -		do_free = true;
>  	}
>  
> -	if (!do_free)
> -		return false;
> -
>  	flush_tlb_kernel_range(start, end);
> +	resched_threshold = (int) lazy_max_pages() << 1;
>  
>  	spin_lock(&vmap_area_lock);
>  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  
>  		__free_vmap_area(va);
>  		atomic_sub(nr, &vmap_lazy_nr);
> -		cond_resched_lock(&vmap_area_lock);
> +
> +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> +			cond_resched_lock(&vmap_area_lock);
>  	}
>  	spin_unlock(&vmap_area_lock);
>  	return true;
> -- 
> 2.11.0
>
Uladzislau Rezki Jan. 29, 2019, 4:17 p.m. UTC | #3
On Mon, Jan 28, 2019 at 12:04:29PM -0800, Andrew Morton wrote:
> On Thu, 24 Jan 2019 12:56:48 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:
> 
> > commit 763b218ddfaf ("mm: add preempt points into
> > __purge_vmap_area_lazy()")
> > 
> > introduced some preempt points, one of those is making an
> > allocation more prioritized over lazy free of vmap areas.
> > 
> > Prioritizing an allocation over freeing does not work well
> > all the time, i.e. it should be rather a compromise.
> > 
> > 1) Number of lazy pages directly influence on busy list length
> > thus on operations like: allocation, lookup, unmap, remove, etc.
> > 
> > 2) Under heavy stress of vmalloc subsystem i run into a situation
> > when memory usage gets increased hitting out_of_memory -> panic
> > state due to completely blocking of logic that frees vmap areas
> > in the __purge_vmap_area_lazy() function.
> > 
> > Establish a threshold passing which the freeing is prioritized
> > back over allocation creating a balance between each other.
> 
> It would be useful to credit the vmalloc test driver for this
> discovery, and perhaps to identify specifically which test triggered
> the kernel misbehaviour.  Please send along suitable words and I'll add
> them.
> 
Please see below more detail of testing:

<snip>
Using vmalloc test driver in "stress mode", i.e. When all available test
cases are run simultaneously on all online CPUs applying a pressure on the
vmalloc subsystem, my HiKey 960 board runs out of memory due to the fact
that __purge_vmap_area_lazy() logic simply is not able to free pages in
time.

How i run it:

1) You should build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

during this test "vmap_lazy_nr" pages will go far beyond acceptable
lazy_max_pages() threshold, that will lead to enormous busy list size
and other problems including allocation time and so on.
<snip>
> 
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	struct llist_node *valist;
> >  	struct vmap_area *va;
> >  	struct vmap_area *n_va;
> > -	bool do_free = false;
> > +	int resched_threshold;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> >  	valist = llist_del_all(&vmap_purge_list);
> > +	if (unlikely(valist == NULL))
> > +		return false;
> 
> Why this change?
> 
I decided to refactor a bit, simplify and get rid of unneeded
do_free check logic. I think it is more straightforward just to
check if list is empty or not, instead of accessing to "do_free"
"n" times in a loop.

I can drop it, or upload as separate patch. What is your view?

> > +	/*
> > +	 * TODO: to calculate a flush range without looping.
> > +	 * The list can be up to lazy_max_pages() elements.
> > +	 */
> 
> How important is this?
> 
It depends on vmap_lazy_nr pages in the list we iterate. For example
on my ARM 8 cores with 4Gb system i see that __purge_vmap_area_lazy()
can take up to 12 milliseconds because of long list. That is why there
is the cond_resched_lock().

As for this first loop's time execution, it takes ~4/5 milliseconds to
find out the flush range. Probably it is not so important since it is
not done in atomic context means it can be interrupted or preempted.
So, it will increase execution time of the current process that does:

vfree()/etc -> __purge_vmap_area_lazy().

From the other hand if we could calculate that range in runtime, i
mean when we add a VA to the vmap_purge_list checking va->va_start
and va->va_end with min/max we could get rid of that loop. But this
is just an idea.

> >  	llist_for_each_entry(va, valist, purge_list) {
> >  		if (va->va_start < start)
> >  			start = va->va_start;
> >  		if (va->va_end > end)
> >  			end = va->va_end;
> > -		do_free = true;
> >  	}
> >  
> > -	if (!do_free)
> > -		return false;
> > -
> >  	flush_tlb_kernel_range(start, end);
> > +	resched_threshold = (int) lazy_max_pages() << 1;
> 
> Is the typecast really needed?
> 
> Perhaps resched_threshold shiould have unsigned long type and perhaps
> vmap_lazy_nr should be atomic_long_t?
> 
I think so. Especially that atomit_t is 32 bit integer value on both 32
and 64 bit systems. lazy_max_pages() deals with unsigned long that is 8
bytes on 64 bit system, thus vmap_lazy_nr should be 8 bytes on 64 bit
as well.

Should i send it as separate patch? What is your view?

> >  	spin_lock(&vmap_area_lock);
> >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  
> >  		__free_vmap_area(va);
> >  		atomic_sub(nr, &vmap_lazy_nr);
> > -		cond_resched_lock(&vmap_area_lock);
> > +
> > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > +			cond_resched_lock(&vmap_area_lock);
> >  	}
> >  	spin_unlock(&vmap_area_lock);
> >  	return true;
> 

Thank you for your comments and review.

--
Vlad Rezki
Uladzislau Rezki Jan. 29, 2019, 5:39 p.m. UTC | #4
On Mon, Jan 28, 2019 at 05:45:28PM -0500, Joel Fernandes wrote:
> On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> > commit 763b218ddfaf ("mm: add preempt points into
> > __purge_vmap_area_lazy()")
> > 
> > introduced some preempt points, one of those is making an
> > allocation more prioritized over lazy free of vmap areas.
> > 
> > Prioritizing an allocation over freeing does not work well
> > all the time, i.e. it should be rather a compromise.
> > 
> > 1) Number of lazy pages directly influence on busy list length
> > thus on operations like: allocation, lookup, unmap, remove, etc.
> > 
> > 2) Under heavy stress of vmalloc subsystem i run into a situation
> > when memory usage gets increased hitting out_of_memory -> panic
> > state due to completely blocking of logic that frees vmap areas
> > in the __purge_vmap_area_lazy() function.
> > 
> > Establish a threshold passing which the freeing is prioritized
> > back over allocation creating a balance between each other.
> 
> I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> is greater than half of lazy_max_pages(). Which IIUC will be more likely if
> the number of CPUs is large.
> 
The threshold that we establish is two times more than lazy_max_pages(),
i.e. in case of 4 system CPUs lazy_max_pages() is 24576, therefore the
threshold is 49152, if PAGE_SIZE is 4096.

It means that we allow rescheduling if vmap_lazy_nr < 49152. If vmap_lazy_nr 
is higher then we forbid rescheduling and free areas until it becomes lower
again to stabilize the system. By doing that, we will not allow vmap_lazy_nr
to be enormously increased.

>
> In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
> so one could say that that's when you *should* reschedule since the frees are
> taking too long and hurting real-time tasks.
> 
> Could this be better solved by tweaking lazy_max_pages() such that purging is
> more aggressive?
> 
> Another approach could be to detect the scenario you brought up (allocations
> happening faster than free), somehow, and avoid a reschedule?
> 
This is what i am trying to achieve by this change. 

Thank you for your comments.

--
Vlad Rezki
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index fb4fb5fcee74..abe83f885069 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	struct llist_node *valist;
> >  	struct vmap_area *va;
> >  	struct vmap_area *n_va;
> > -	bool do_free = false;
> > +	int resched_threshold;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> >  	valist = llist_del_all(&vmap_purge_list);
> > +	if (unlikely(valist == NULL))
> > +		return false;
> > +
> > +	/*
> > +	 * TODO: to calculate a flush range without looping.
> > +	 * The list can be up to lazy_max_pages() elements.
> > +	 */
> >  	llist_for_each_entry(va, valist, purge_list) {
> >  		if (va->va_start < start)
> >  			start = va->va_start;
> >  		if (va->va_end > end)
> >  			end = va->va_end;
> > -		do_free = true;
> >  	}
> >  
> > -	if (!do_free)
> > -		return false;
> > -
> >  	flush_tlb_kernel_range(start, end);
> > +	resched_threshold = (int) lazy_max_pages() << 1;
> >  
> >  	spin_lock(&vmap_area_lock);
> >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  
> >  		__free_vmap_area(va);
> >  		atomic_sub(nr, &vmap_lazy_nr);
> > -		cond_resched_lock(&vmap_area_lock);
> > +
> > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > +			cond_resched_lock(&vmap_area_lock);
> >  	}
> >  	spin_unlock(&vmap_area_lock);
> >  	return true;
> > -- 
> > 2.11.0
> >
Andrew Morton Jan. 29, 2019, 6:03 p.m. UTC | #5
On Tue, 29 Jan 2019 17:17:54 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:

> > > +	resched_threshold = (int) lazy_max_pages() << 1;
> > 
> > Is the typecast really needed?
> > 
> > Perhaps resched_threshold shiould have unsigned long type and perhaps
> > vmap_lazy_nr should be atomic_long_t?
> > 
> I think so. Especially that atomit_t is 32 bit integer value on both 32
> and 64 bit systems. lazy_max_pages() deals with unsigned long that is 8
> bytes on 64 bit system, thus vmap_lazy_nr should be 8 bytes on 64 bit
> as well.
> 
> Should i send it as separate patch? What is your view?

Sounds good.  When convenient, please.
Joel Fernandes March 6, 2019, 4:25 p.m. UTC | #6
On Tue, Jan 29, 2019 at 06:39:36PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 28, 2019 at 05:45:28PM -0500, Joel Fernandes wrote:
> > On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> > > commit 763b218ddfaf ("mm: add preempt points into
> > > __purge_vmap_area_lazy()")
> > > 
> > > introduced some preempt points, one of those is making an
> > > allocation more prioritized over lazy free of vmap areas.
> > > 
> > > Prioritizing an allocation over freeing does not work well
> > > all the time, i.e. it should be rather a compromise.
> > > 
> > > 1) Number of lazy pages directly influence on busy list length
> > > thus on operations like: allocation, lookup, unmap, remove, etc.
> > > 
> > > 2) Under heavy stress of vmalloc subsystem i run into a situation
> > > when memory usage gets increased hitting out_of_memory -> panic
> > > state due to completely blocking of logic that frees vmap areas
> > > in the __purge_vmap_area_lazy() function.
> > > 
> > > Establish a threshold passing which the freeing is prioritized
> > > back over allocation creating a balance between each other.
> > 
> > I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> > is greater than half of lazy_max_pages(). Which IIUC will be more likely if
> > the number of CPUs is large.
> > 
> The threshold that we establish is two times more than lazy_max_pages(),
> i.e. in case of 4 system CPUs lazy_max_pages() is 24576, therefore the
> threshold is 49152, if PAGE_SIZE is 4096.
> 
> It means that we allow rescheduling if vmap_lazy_nr < 49152. If vmap_lazy_nr 
> is higher then we forbid rescheduling and free areas until it becomes lower
> again to stabilize the system. By doing that, we will not allow vmap_lazy_nr
> to be enormously increased.

Sorry for late reply.

This sounds reasonable. Such an extreme situation of vmap_lazy_nr being twice
the lazy_max_pages() is probably only possible using a stress test anyway
since (hopefully) the try_purge_vmap_area_lazy() call is happening often
enough to keep the vmap_lazy_nr low.

Have you experimented with what is the highest threshold that prevents the
issues you're seeing? Have you tried 3x or 4x the vmap_lazy_nr?

I also wonder what is the cost these days of the global TLB flush on the most
common Linux architectures and if the whole purge vmap_area lazy stuff is
starting to get a bit dated, and if we can do the purging inline as areas are
freed. There is a cost to having this mechanism too as you said, which is as
the list size grows, all other operations also take time.

thanks,

 - Joel


> > In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
> > so one could say that that's when you *should* reschedule since the frees are
> > taking too long and hurting real-time tasks.
> > 
> > Could this be better solved by tweaking lazy_max_pages() such that purging is
> > more aggressive?
> > 
> > Another approach could be to detect the scenario you brought up (allocations
> > happening faster than free), somehow, and avoid a reschedule?
> > 
> This is what i am trying to achieve by this change. 
> 
> Thank you for your comments.
> 
> --
> Vlad Rezki
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 18 ++++++++++++------
> > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index fb4fb5fcee74..abe83f885069 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > >  	struct llist_node *valist;
> > >  	struct vmap_area *va;
> > >  	struct vmap_area *n_va;
> > > -	bool do_free = false;
> > > +	int resched_threshold;
> > >  
> > >  	lockdep_assert_held(&vmap_purge_lock);
> > >  
> > >  	valist = llist_del_all(&vmap_purge_list);
> > > +	if (unlikely(valist == NULL))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * TODO: to calculate a flush range without looping.
> > > +	 * The list can be up to lazy_max_pages() elements.
> > > +	 */
> > >  	llist_for_each_entry(va, valist, purge_list) {
> > >  		if (va->va_start < start)
> > >  			start = va->va_start;
> > >  		if (va->va_end > end)
> > >  			end = va->va_end;
> > > -		do_free = true;
> > >  	}
> > >  
> > > -	if (!do_free)
> > > -		return false;
> > > -
> > >  	flush_tlb_kernel_range(start, end);
> > > +	resched_threshold = (int) lazy_max_pages() << 1;
> > >  
> > >  	spin_lock(&vmap_area_lock);
> > >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > >  
> > >  		__free_vmap_area(va);
> > >  		atomic_sub(nr, &vmap_lazy_nr);
> > > -		cond_resched_lock(&vmap_area_lock);
> > > +
> > > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > > +			cond_resched_lock(&vmap_area_lock);
> > >  	}
> > >  	spin_unlock(&vmap_area_lock);
> > >  	return true;
> > > -- 
> > > 2.11.0
> > >
Uladzislau Rezki March 7, 2019, 11:15 a.m. UTC | #7
> > > 
> > > I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> > > is greater than half of lazy_max_pages(). Which IIUC will be more likely if
> > > the number of CPUs is large.
> > > 
> > The threshold that we establish is two times more than lazy_max_pages(),
> > i.e. in case of 4 system CPUs lazy_max_pages() is 24576, therefore the
> > threshold is 49152, if PAGE_SIZE is 4096.
> > 
> > It means that we allow rescheduling if vmap_lazy_nr < 49152. If vmap_lazy_nr 
> > is higher then we forbid rescheduling and free areas until it becomes lower
> > again to stabilize the system. By doing that, we will not allow vmap_lazy_nr
> > to be enormously increased.
> 
> Sorry for late reply.
> 
> This sounds reasonable. Such an extreme situation of vmap_lazy_nr being twice
> the lazy_max_pages() is probably only possible using a stress test anyway
> since (hopefully) the try_purge_vmap_area_lazy() call is happening often
> enough to keep the vmap_lazy_nr low.
> 
> Have you experimented with what is the highest threshold that prevents the
> issues you're seeing? Have you tried 3x or 4x the vmap_lazy_nr?
> 
I do not think it make sense to go with 3x/4x/etc threshold for many
reasons. One of them is we just need to prevent that skew, returning back
to reasonable balance.

>
> I also wonder what is the cost these days of the global TLB flush on the most
> common Linux architectures and if the whole purge vmap_area lazy stuff is
> starting to get a bit dated, and if we can do the purging inline as areas are
> freed. There is a cost to having this mechanism too as you said, which is as
> the list size grows, all other operations also take time.
> 
I guess if we go with flushing the TLB each time per each vmap_area freeing,
then i think we are in trouble. Though, i have not analyzed how that approach
impacts performance.

I agree about the cost of having such mechanism. Basically it is one of the
source of bigger fragmentation(not limited to it). But from the other hand
the KVA allocator has to be capable of effective and fast allocation even
in that condition.

I am working on v2 of https://lkml.org/lkml/2018/10/19/786. When i complete
some other job related tasks i will upload a new RFC.

--
Vlad Rezki
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fb4fb5fcee74..abe83f885069 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -661,23 +661,27 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	struct llist_node *valist;
 	struct vmap_area *va;
 	struct vmap_area *n_va;
-	bool do_free = false;
+	int resched_threshold;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
 	valist = llist_del_all(&vmap_purge_list);
+	if (unlikely(valist == NULL))
+		return false;
+
+	/*
+	 * TODO: to calculate a flush range without looping.
+	 * The list can be up to lazy_max_pages() elements.
+	 */
 	llist_for_each_entry(va, valist, purge_list) {
 		if (va->va_start < start)
 			start = va->va_start;
 		if (va->va_end > end)
 			end = va->va_end;
-		do_free = true;
 	}
 
-	if (!do_free)
-		return false;
-
 	flush_tlb_kernel_range(start, end);
+	resched_threshold = (int) lazy_max_pages() << 1;
 
 	spin_lock(&vmap_area_lock);
 	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
@@ -685,7 +689,9 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 
 		__free_vmap_area(va);
 		atomic_sub(nr, &vmap_lazy_nr);
-		cond_resched_lock(&vmap_area_lock);
+
+		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
+			cond_resched_lock(&vmap_area_lock);
 	}
 	spin_unlock(&vmap_area_lock);
 	return true;