
mm: reduce spinlock contention in release_pages()

Message ID 20211124151915.GA6163@haolee.io (mailing list archive)
State New
Series mm: reduce spinlock contention in release_pages()

Commit Message

Hao Lee Nov. 24, 2021, 3:19 p.m. UTC
When several tasks are terminated simultaneously, lots of pages will be
released, which can cause severe spinlock contention. Other tasks running
on the same core will be seriously affected. We can yield the CPU to fix
this problem.

Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
---
 include/linux/memcontrol.h | 26 ++++++++++++++++++++++++++
 mm/memcontrol.c            | 12 ++++++++++++
 mm/swap.c                  |  8 +++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Nov. 24, 2021, 3:57 p.m. UTC | #1
On Wed, Nov 24, 2021 at 03:19:15PM +0000, Hao Lee wrote:
> When several tasks are terminated simultaneously, lots of pages will be
> released, which can cause severe spinlock contention. Other tasks which
> are running on the same core will be seriously affected. We can yield
> cpu to fix this problem.

The realtime people will eat you alive for this suggestion.

> +++ b/mm/swap.c
> @@ -960,8 +960,14 @@ void release_pages(struct page **pages, int nr)
>  		if (PageLRU(page)) {
>  			struct lruvec *prev_lruvec = lruvec;
>  
> -			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> +retry:
> +			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
>  									&flags);
> +			if (!lruvec) {
> +				cond_resched();
> +				goto retry;
> +			}
> +
>  			if (prev_lruvec != lruvec)
Michal Hocko Nov. 24, 2021, 4:31 p.m. UTC | #2
On Wed 24-11-21 15:19:15, Hao Lee wrote:
> When several tasks are terminated simultaneously, lots of pages will be
> released, which can cause severe spinlock contention. Other tasks which
> are running on the same core will be seriously affected. We can yield
> cpu to fix this problem.

How does this actually address the problem? You are effectively losing
fairness completely. We do batch currently, so no single task should be
able to monopolize the cpu for too long. Why is this not sufficient?
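
For context, the batching in question looks roughly like this in
release_pages() (a simplified paraphrase of the 5.16-era code with the
refcount and non-LRU handling elided; not an exact quote):

void release_pages(struct page **pages, int nr)
{
	LIST_HEAD(pages_to_free);
	struct lruvec *lruvec = NULL;
	unsigned long flags;
	unsigned int lock_batch;
	int i;

	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];

		/*
		 * Bound the irq-disabled hold time: after SWAP_CLUSTER_MAX
		 * (32) pages handled under the same lru lock, drop the lock
		 * so other waiters get a turn.
		 */
		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
			unlock_page_lruvec_irqrestore(lruvec, flags);
			lruvec = NULL;
		}

		/* ... refcount drop and non-LRU cases elided ... */

		if (PageLRU(page)) {
			struct lruvec *prev_lruvec = lruvec;

			/* (re)take the lru lock for this page's lruvec */
			lruvec = folio_lruvec_relock_irqsave(page_folio(page),
							     lruvec, &flags);
			if (prev_lruvec != lruvec)
				lock_batch = 0;	/* new lock: restart the batch */

			del_page_from_lru_list(page, lruvec);
		}

		list_add(&page->lru, &pages_to_free);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

	/* ... mem_cgroup_uncharge_list() and free_unref_page_list() ... */
}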

> diff --git a/mm/swap.c b/mm/swap.c
> index e8c9dc6d0377..91850d51a5a5 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -960,8 +960,14 @@ void release_pages(struct page **pages, int nr)
>  		if (PageLRU(page)) {
>  			struct lruvec *prev_lruvec = lruvec;
>  
> -			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> +retry:
> +			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
>  									&flags);
> +			if (!lruvec) {
> +				cond_resched();
> +				goto retry;
> +			}
> +
>  			if (prev_lruvec != lruvec)
>  				lock_batch = 0;
>  
> -- 
> 2.31.1
Hao Lee Nov. 25, 2021, 3:13 a.m. UTC | #3
On Wed, Nov 24, 2021 at 11:58 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Nov 24, 2021 at 03:19:15PM +0000, Hao Lee wrote:
> > When several tasks are terminated simultaneously, lots of pages will be
> > released, which can cause severe spinlock contention. Other tasks which
> > are running on the same core will be seriously affected. We can yield
> > cpu to fix this problem.
>
> The realtime people will eat you alive for this suggestion.

Thanks for pointing this out.

>
> > +++ b/mm/swap.c
> > @@ -960,8 +960,14 @@ void release_pages(struct page **pages, int nr)
> >               if (PageLRU(page)) {
> >                       struct lruvec *prev_lruvec = lruvec;
> >
> > -                     lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> > +retry:
> > +                     lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
> >                                                                       &flags);
> > +                     if (!lruvec) {
> > +                             cond_resched();
> > +                             goto retry;
> > +                     }
> > +
> >                       if (prev_lruvec != lruvec)
Hao Lee Nov. 25, 2021, 3:24 a.m. UTC | #4
On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 24-11-21 15:19:15, Hao Lee wrote:
> > When several tasks are terminated simultaneously, lots of pages will be
> > released, which can cause severe spinlock contention. Other tasks which
> > are running on the same core will be seriously affected. We can yield
> > cpu to fix this problem.
>
> How does this actually address the problem? You are effectivelly losing
> fairness completely.

Got it. Thanks!

> We do batch currently so no single task should be
> able to monopolize the cpu for too long. Why this is not sufficient?

uncharge and unref indeed take advantage of batching, but del_from_lru
needs more time to complete. Several tasks will contend for the spinlock
in the loop if nr is very large. We can notice a transient peak of sys%
reflecting this, and perf also reports that the spinlock slowpath takes
too much time. This scenario is not rare, especially when containers are
destroyed simultaneously, and other latency-critical tasks may be affected
by this problem. So I want to figure out a way to deal with it.

Thanks.

>
> > diff --git a/mm/swap.c b/mm/swap.c
> > index e8c9dc6d0377..91850d51a5a5 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -960,8 +960,14 @@ void release_pages(struct page **pages, int nr)
> >               if (PageLRU(page)) {
> >                       struct lruvec *prev_lruvec = lruvec;
> >
> > -                     lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> > +retry:
> > +                     lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
> >                                                                       &flags);
> > +                     if (!lruvec) {
> > +                             cond_resched();
> > +                             goto retry;
> > +                     }
> > +
> >                       if (prev_lruvec != lruvec)
> >                               lock_batch = 0;
> >
> > --
> > 2.31.1
>
> --
> Michal Hocko
> SUSE Labs
Matthew Wilcox Nov. 25, 2021, 3:30 a.m. UTC | #5
On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > We do batch currently so no single task should be
> > able to monopolize the cpu for too long. Why this is not sufficient?
> 
> uncharge and unref indeed take advantage of the batch process, but
> del_from_lru needs more time to complete. Several tasks will contend
> spinlock in the loop if nr is very large.

Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
implementation need to be fixed?
Hao Lee Nov. 25, 2021, 8:02 a.m. UTC | #6
On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > We do batch currently so no single task should be
> > > able to monopolize the cpu for too long. Why this is not sufficient?
> > 
> > uncharge and unref indeed take advantage of the batch process, but
> > del_from_lru needs more time to complete. Several tasks will contend
> > spinlock in the loop if nr is very large.
> 
> Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> implementation need to be fixed?
> 

My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.

I think lock_batch is not the point. lock_batch only breaks the spinning
time into smaller parts, but it doesn't reduce the total spinning time.
Things may even get worse if lock_batch is very small.

Here is an example of two tasks contending for the spinlock. Let's assume
each task needs a total of 4 seconds in the critical section to complete its work.

Example1:

lock_batch = x

task A      taskB
hold 4s     wait 4s
            hold 4s

total waiting time is 4s.



Example2:

if lock_batch = x/2

task A      taskB
hold 2s     wait 2s
wait 2s     hold 2s
hold 2s     wait 2s
            hold 2s

total waiting time is 6s.


The above theoretical example can also be demonstrated by a test using usemem.

# ./usemem -j 4096 -n 20 10g -s 5

lock_batch=SWAP_CLUSTER_MAX          59.50% native_queued_spin_lock_slowpath
lock_batch=SWAP_CLUSTER_MAX/4        69.95% native_queued_spin_lock_slowpath
lock_batch=SWAP_CLUSTER_MAX/16       82.22% native_queued_spin_lock_slowpath

Nonetheless, enlarging lock_batch doesn't noticeably improve performance,
though it won't make things worse, and it's not a good idea to hold an
irq-disabled spinlock for a long time.


If cond_resched() breaks task fairness, the only way I can think
of is to do the uncharge and unref when the current task can't get the
spinlock. This reduces the wasted cpu cycles, although the
performance gain is still small (about 4%). However, this approach hurts
the batch processing in uncharge(). Maybe there exists a better way...

diff --git a/mm/swap.c b/mm/swap.c
index e8c9dc6d0377..8a947f8d0aaa 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -960,8 +960,16 @@ void release_pages(struct page **pages, int nr)
 		if (PageLRU(page)) {
 			struct lruvec *prev_lruvec = lruvec;
 
-			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
+			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
 									&flags);
+			if (!lruvec) {
+				mem_cgroup_uncharge_list(&pages_to_free);
+				free_unref_page_list(&pages_to_free);
+				INIT_LIST_HEAD(&pages_to_free);
+				lruvec = folio_lruvec_relock_irqsave(folio,
+							lruvec, &flags);
+			}
+
 			if (prev_lruvec != lruvec)
 				lock_batch = 0;
Michal Hocko Nov. 25, 2021, 10:01 a.m. UTC | #7
On Thu 25-11-21 08:02:38, Hao Lee wrote:
> On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > We do batch currently so no single task should be
> > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > 
> > > uncharge and unref indeed take advantage of the batch process, but
> > > del_from_lru needs more time to complete. Several tasks will contend
> > > spinlock in the loop if nr is very large.
> > 
> > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > implementation need to be fixed?
> > 
> 
> My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> 
> I think lock_batch is not the point. lock_batch only break spinning time
> into small parts, but it doesn't reduce spinning time. The thing may get
> worse if lock_batch is very small.
> 
> Here is an example about two tasks contending spinlock. Let's assume each
> task need a total of 4 seconds in critical section to complete its work.
> 
> Example1:
> 
> lock_batch = x
> 
> task A      taskB
> hold 4s     wait 4s
>             hold 4s
> 
> total waiting time is 4s.

4s holding time is _way_ too long and something that this path should
never really reach. We are talking about SWAP_CLUSTER_MAX worth of LRU
pages. Sure there might be a bunch of pages freed that are not on LRU
but those are only added to a list. So again what is the actual problem?

[...]

> If cond_reched() will break the task fairness, the only way I can think
> of is doing uncharge and unref if the current task can't get the
> spinlock. This will reduce the wasted cpu cycles, although the
> performance gain is still small (about 4%). However, this way will hurt
> batch processing in uncharge(). Maybe there exist a better way...
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index e8c9dc6d0377..8a947f8d0aaa 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -960,8 +960,16 @@ void release_pages(struct page **pages, int nr)
>  		if (PageLRU(page)) {
>  			struct lruvec *prev_lruvec = lruvec;
>  
> -			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> +			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
>  									&flags);
> +			if (!lruvec) {
> +				mem_cgroup_uncharge_list(&pages_to_free);
> +				free_unref_page_list(&pages_to_free);
> +				INIT_LIST_HEAD(&pages_to_free);
> +				lruvec = folio_lruvec_relock_irqsave(folio,
> +							lruvec, &flags);
> +			}
> +
>  			if (prev_lruvec != lruvec)
>  				lock_batch = 0;

Aren't you sacrificing one kind of batching for the other? The net result
will really depend on the particular workload. This will effectively throw
lruvec batching out of the window in the presence of contention, even if
there are no pages to be freed (or only very few of them).

TBH I really have a hard time seeing how handling 32 LRU pages in a single
lock batch can be harmful. Maybe if there are gazillions of non-lru
pages where holding the lock is clearly pointless, but that shouldn't
really happen most of the time.
Hao Lee Nov. 25, 2021, 12:31 p.m. UTC | #8
On Thu, Nov 25, 2021 at 11:01:02AM +0100, Michal Hocko wrote:
> On Thu 25-11-21 08:02:38, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > We do batch currently so no single task should be
> > > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > > 
> > > > uncharge and unref indeed take advantage of the batch process, but
> > > > del_from_lru needs more time to complete. Several tasks will contend
> > > > spinlock in the loop if nr is very large.
> > > 
> > > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > > implementation need to be fixed?
> > > 
> > 
> > My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> > 
> > I think lock_batch is not the point. lock_batch only break spinning time
> > into small parts, but it doesn't reduce spinning time. The thing may get
> > worse if lock_batch is very small.
> > 
> > Here is an example about two tasks contending spinlock. Let's assume each
> > task need a total of 4 seconds in critical section to complete its work.
> > 
> > Example1:
> > 
> > lock_batch = x
> > 
> > task A      taskB
> > hold 4s     wait 4s
> >             hold 4s
> > 
> > total waiting time is 4s.
> 
> 4s holding time is _way_ too long and something that this path should
> never really reach. We are talking about SWAP_CLUSTER_MAX worth of LRU
> pages. Sure there might be a bunch of pages freed that are not on LRU
> but those are only added to a list. So again what is the actual problem?
> 

The measurement unit in my example may not be rigorous and may have confused you.
What I mean is that batch processing only gives each task a fair chance to
compete for this spinlock, but it can't reduce the cpu cycles wasted while
spinning, no matter what the batch size is. No matter what lock_batch is set
to, the following perf report won't change much. Many cpu cycles are wasted
on spinning. Other tasks running on the same cores will be delayed, which is
unacceptable for our latency-critical jobs. I'm trying to find out whether
it's possible to reduce the delay and the contention; after all, 59.50% is
too high. This is why I posted the naive `cond_resched()` approach.

Here is the perf report when executing ./usemem -j 4096 -n 20 10g -s 5

+   59.50%  usemem           [kernel.vmlinux]               [k] native_queued_spin_lock_slowpath
+    4.36%  usemem           [kernel.vmlinux]               [k] check_preemption_disabled
+    4.31%  usemem           [kernel.vmlinux]               [k] free_pcppages_bulk
+    3.11%  usemem           [kernel.vmlinux]               [k] release_pages
+    2.12%  usemem           [kernel.vmlinux]               [k] __mod_memcg_lruvec_state
+    2.02%  usemem           [kernel.vmlinux]               [k] __list_del_entry_valid
+    1.98%  usemem           [kernel.vmlinux]               [k] __mod_node_page_state
+    1.67%  usemem           [kernel.vmlinux]               [k] unmap_page_range
+    1.51%  usemem           [kernel.vmlinux]               [k] __mod_zone_page_state

> 
> > If cond_reched() will break the task fairness, the only way I can think
> > of is doing uncharge and unref if the current task can't get the
> > spinlock. This will reduce the wasted cpu cycles, although the
> > performance gain is still small (about 4%). However, this way will hurt
> > batch processing in uncharge(). Maybe there exist a better way...
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index e8c9dc6d0377..8a947f8d0aaa 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -960,8 +960,16 @@ void release_pages(struct page **pages, int nr)
> >  		if (PageLRU(page)) {
> >  			struct lruvec *prev_lruvec = lruvec;
> >  
> > -			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> > +			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
> >  									&flags);
> > +			if (!lruvec) {
> > +				mem_cgroup_uncharge_list(&pages_to_free);
> > +				free_unref_page_list(&pages_to_free);
> > +				INIT_LIST_HEAD(&pages_to_free);
> > +				lruvec = folio_lruvec_relock_irqsave(folio,
> > +							lruvec, &flags);
> > +			}
> > +
> >  			if (prev_lruvec != lruvec)
> >  				lock_batch = 0;
> 
> Aren't you sacrificing one batching over the other and the net result
> will really depend on particullar workload. This will effectivelly throw
> away any lruvec batching out of window in presence of contention even if
> there are no pages to be freed (or only very few of them).

Agree. This is by no means the right way.

> 
> TBH I really have hard time to see how 32 LRU pages handling in a single
> lock batch can be harmfull.

Yes. This may be the most reasonable way for now. I'm just trying my
best to find a slightly better way to reduce the wasted cpu time.

Thanks.

> Maybe if there are gazillions of non-lru
> pages where holding the lock is clearly pointless but that shouldn't
> really happen most of the time.
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 25, 2021, 2:18 p.m. UTC | #9
On Thu 25-11-21 12:31:33, Hao Lee wrote:
> On Thu, Nov 25, 2021 at 11:01:02AM +0100, Michal Hocko wrote:
> > On Thu 25-11-21 08:02:38, Hao Lee wrote:
> > > On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > > > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > We do batch currently so no single task should be
> > > > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > > > 
> > > > > uncharge and unref indeed take advantage of the batch process, but
> > > > > del_from_lru needs more time to complete. Several tasks will contend
> > > > > spinlock in the loop if nr is very large.
> > > > 
> > > > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > > > implementation need to be fixed?
> > > > 
> > > 
> > > My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> > > 
> > > I think lock_batch is not the point. lock_batch only break spinning time
> > > into small parts, but it doesn't reduce spinning time. The thing may get
> > > worse if lock_batch is very small.
> > > 
> > > Here is an example about two tasks contending spinlock. Let's assume each
> > > task need a total of 4 seconds in critical section to complete its work.
> > > 
> > > Example1:
> > > 
> > > lock_batch = x
> > > 
> > > task A      taskB
> > > hold 4s     wait 4s
> > >             hold 4s
> > > 
> > > total waiting time is 4s.
> > 
> > 4s holding time is _way_ too long and something that this path should
> > never really reach. We are talking about SWAP_CLUSTER_MAX worth of LRU
> > pages. Sure there might be a bunch of pages freed that are not on LRU
> > but those are only added to a list. So again what is the actual problem?
> > 
> 
> The measurement unit in my example may not be rigorous and may confuse you.
> What I mean is the batch processing can only gives each task fairness to
> compete for this spinlock, but it can't reduce the wasted cpu cycles during
> spinning waiting, no matter what the batch size is.

Correct. But isn't that the nature of pretty much any spinlock-based
contention? There is not really much to be done except removing the
lock. Batching helps to amortize the spinning over the actual useful work
so that the spinning is not predominant. Trylocking to reduce that
spinning can be helpful only if you _know_ that other useful work could
be done and that you do not delay the locked work way too much. This is a
tricky balance to strike.

> No matter what the
> lock_batch is set, the following perf report won't change much. Many cpu
> cycles are wasted on spinning. Other tasks running on the same cores will be
> delayed, which is unacceptable for our latency-critical jobs.

Could you share more about the requirements for those? Why is unmapping in
any of their hot paths that really require low latencies? Because as
long as unmapping requires a shared resource - like the lru lock - you
have a bottleneck.

> I'm trying to
> find if it's possible to reduce the delay and the contention , after all,
> 59.50% is too high. This is why I post the thoughtless `cond_resched()`
> approach.

What is the baseline for that 59.5%? Also, how representative is this of
your sensitive workload?
 
> Here is the perf report when executing ./usemem -j 4096 -n 20 10g -s 5
> 
> +   59.50%  usemem           [kernel.vmlinux]               [k] native_queued_spin_lock_slowpath
> +    4.36%  usemem           [kernel.vmlinux]               [k] check_preemption_disabled
> +    4.31%  usemem           [kernel.vmlinux]               [k] free_pcppages_bulk
> +    3.11%  usemem           [kernel.vmlinux]               [k] release_pages
> +    2.12%  usemem           [kernel.vmlinux]               [k] __mod_memcg_lruvec_state
> +    2.02%  usemem           [kernel.vmlinux]               [k] __list_del_entry_valid
> +    1.98%  usemem           [kernel.vmlinux]               [k] __mod_node_page_state
> +    1.67%  usemem           [kernel.vmlinux]               [k] unmap_page_range
> +    1.51%  usemem           [kernel.vmlinux]               [k] __mod_zone_page_state
Matthew Wilcox Nov. 25, 2021, 6:04 p.m. UTC | #10
On Thu, Nov 25, 2021 at 08:02:38AM +0000, Hao Lee wrote:
> On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > We do batch currently so no single task should be
> > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > 
> > > uncharge and unref indeed take advantage of the batch process, but
> > > del_from_lru needs more time to complete. Several tasks will contend
> > > spinlock in the loop if nr is very large.
> > 
> > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > implementation need to be fixed?
> > 
> 
> My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> 
> I think lock_batch is not the point. lock_batch only break spinning time
> into small parts, but it doesn't reduce spinning time. The thing may get
> worse if lock_batch is very small.

OK.  So if I understand right, you've got a lot of processes all
calling exit_mmap() at the same time, which eventually becomes calls to
unmap_vmas(), unmap_single_vma(), unmap_page_range(), zap_pte_range(),
tlb_flush_mmu(), tlb_batch_pages_flush(), free_pages_and_swap_cache(),
release_pages(), and then you see high contention on the LRU lock.

Your use-case doesn't seem to mind sleeping (after all, these processes
are exiting).  So we could put a semaphore in exit_mmap() to limit the
number of simultaneous callers to unmap_vmas().  Do you want to try
that out and see if it solves your problem?  You might want to make it
a counting semaphore (eg permit two tasks to exit at once) rather than
a mutex.  But maybe a mutex is just fine.
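
Something along these lines, as a rough sketch only (the semaphore name,
its placement, and the count of two are arbitrary choices for illustration,
untested):

#include <linux/semaphore.h>

/* Allow at most two exiting tasks to run the expensive unmap at once. */
static struct semaphore exit_unmap_sem =
	__SEMAPHORE_INITIALIZER(exit_unmap_sem, 2);

void exit_mmap(struct mm_struct *mm)
{
	/* ... existing setup: lru_add_drain(), mmu_gather, first vma ... */

	down(&exit_unmap_sem);		/* narrow gap for the exiting herd */
	unmap_vmas(&tlb, vma, 0, -1);	/* bulk of release_pages() happens here */
	up(&exit_unmap_sem);

	/* ... free_pgtables(), tlb_finish_mmu(), vma teardown ... */
}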
Hao Lee Nov. 26, 2021, 6:50 a.m. UTC | #11
On Thu, Nov 25, 2021 at 10:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 25-11-21 12:31:33, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 11:01:02AM +0100, Michal Hocko wrote:
> > > On Thu 25-11-21 08:02:38, Hao Lee wrote:
> > > > On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > > > > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > > > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > We do batch currently so no single task should be
> > > > > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > > > >
> > > > > > uncharge and unref indeed take advantage of the batch process, but
> > > > > > del_from_lru needs more time to complete. Several tasks will contend
> > > > > > spinlock in the loop if nr is very large.
> > > > >
> > > > > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > > > > implementation need to be fixed?
> > > > >
> > > >
> > > > My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> > > >
> > > > I think lock_batch is not the point. lock_batch only break spinning time
> > > > into small parts, but it doesn't reduce spinning time. The thing may get
> > > > worse if lock_batch is very small.
> > > >
> > > > Here is an example about two tasks contending spinlock. Let's assume each
> > > > task need a total of 4 seconds in critical section to complete its work.
> > > >
> > > > Example1:
> > > >
> > > > lock_batch = x
> > > >
> > > > task A      taskB
> > > > hold 4s     wait 4s
> > > >             hold 4s
> > > >
> > > > total waiting time is 4s.
> > >
> > > 4s holding time is _way_ too long and something that this path should
> > > never really reach. We are talking about SWAP_CLUSTER_MAX worth of LRU
> > > pages. Sure there might be a bunch of pages freed that are not on LRU
> > > but those are only added to a list. So again what is the actual problem?
> > >
> >
> > The measurement unit in my example may not be rigorous and may confuse you.
> > What I mean is the batch processing can only gives each task fairness to
> > compete for this spinlock, but it can't reduce the wasted cpu cycles during
> > spinning waiting, no matter what the batch size is.
>
> Correct. But isn't that a nature of pretty much any spinlock based
> contention? There is not really much to be done except for removing the
> lock.

Yes...

> Batching helps to amortize the spinning for the actual useful work
> so that the spinning is not predominant. Trylocking to reduce that
> spinning can be helpful only if you _know_ that other useful work could
> be done and that you do not dalay the locked work way to much. This is a
> tricky balance to make.

Exactly, thanks for explanation.

>
> > No matter what the
> > lock_batch is set, the following perf report won't change much. Many cpu
> > cycles are wasted on spinning. Other tasks running on the same cores will be
> > delayed, which is unacceptable for our latency-critical jobs.
>
> Could you share more about requirements for those? Why is unmapping in
> any of their hot paths which really require low latencies? Because as
> long as unmapping requires a shared resource - like lru lock - then you
> have a bottle necks.

We deploy best-effort (BE) jobs (e.g. bigdata, machine learning) and
latency-critical (LC) jobs (e.g. map navigation, payment services) on the
same servers to improve resource utilization. The running time of BE jobs is
very short, but their memory consumption is large, and these jobs run
periodically. The LC jobs are long-running services and are sensitive to delays
because jitter may cause customer churn.

If a batch of BE jobs finish simultaneously, lots of memory is freed,
and spinlock contention happens. BE jobs don't care about this contention,
but it causes them to spend more time in kernel mode, and thus LC
jobs running on the same cpu cores will be delayed and jitter occurs. (Kernel
preemption is disabled on our servers, and we try not to separate
LC/BE using cpuset in order to achieve "complete mixture deployment".) Then
the LC services people will complain about poor service stability. This
scenario has occurred several times, so we want to find a way to avoid it.

>
> > I'm trying to
> > find if it's possible to reduce the delay and the contention , after all,
> > 59.50% is too high. This is why I post the thoughtless `cond_resched()`
> > approach.
>
> What is the base for that 59.5%? Also how representative this is for
> your sensitive workload?

Unfortunately, we haven't found an accurate relationship between the
acceptable jitter and the perf percentage. A reference datum is indeed
essential, and we will try to set it up. Thanks for your tips.

>
> > Here is the perf report when executing ./usemem -j 4096 -n 20 10g -s 5
> >
> > +   59.50%  usemem           [kernel.vmlinux]               [k] native_queued_spin_lock_slowpath
> > +    4.36%  usemem           [kernel.vmlinux]               [k] check_preemption_disabled
> > +    4.31%  usemem           [kernel.vmlinux]               [k] free_pcppages_bulk
> > +    3.11%  usemem           [kernel.vmlinux]               [k] release_pages
> > +    2.12%  usemem           [kernel.vmlinux]               [k] __mod_memcg_lruvec_state
> > +    2.02%  usemem           [kernel.vmlinux]               [k] __list_del_entry_valid
> > +    1.98%  usemem           [kernel.vmlinux]               [k] __mod_node_page_state
> > +    1.67%  usemem           [kernel.vmlinux]               [k] unmap_page_range
> > +    1.51%  usemem           [kernel.vmlinux]               [k] __mod_zone_page_state
> --
> Michal Hocko
> SUSE Labs
Hao Lee Nov. 26, 2021, 6:54 a.m. UTC | #12
On Fri, Nov 26, 2021 at 2:04 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 25, 2021 at 08:02:38AM +0000, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 03:30:44AM +0000, Matthew Wilcox wrote:
> > > On Thu, Nov 25, 2021 at 11:24:02AM +0800, Hao Lee wrote:
> > > > On Thu, Nov 25, 2021 at 12:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > We do batch currently so no single task should be
> > > > > able to monopolize the cpu for too long. Why this is not sufficient?
> > > >
> > > > uncharge and unref indeed take advantage of the batch process, but
> > > > del_from_lru needs more time to complete. Several tasks will contend
> > > > spinlock in the loop if nr is very large.
> > >
> > > Is SWAP_CLUSTER_MAX too large?  Or does your architecture's spinlock
> > > implementation need to be fixed?
> > >
> >
> > My testing server is x86_64 with 5.16-rc2. The spinlock should be normal.
> >
> > I think lock_batch is not the point. lock_batch only break spinning time
> > into small parts, but it doesn't reduce spinning time. The thing may get
> > worse if lock_batch is very small.
>
> OK.  So if I understand right, you've got a lot of processes all
> calling exit_mmap() at the same time, which eventually becomes calls to
> unmap_vmas(), unmap_single_vma(), unmap_page_range(), zap_pte_range(),
> tlb_flush_mmu(), tlb_batch_pages_flush(), free_pages_and_swap_cache(),
> release_pages(), and then you see high contention on the LRU lock.

Exactly.

>
> Your use-case doesn't seem to mind sleeping (after all, these processes
> are exiting).

Yes!

>  So we could put a semaphore in exit_mmap() to limit the
> number of simultaneous callers to unmap_vmas().  Do you want to try
> that out and see if it solves your problem?  You might want to make it
> a counting semaphore (eg permit two tasks to exit at once) rather than
> a mutex.  But maybe a mutex is just fine.

This is really a good idea. My train of thought was stuck on reducing the
lock contention. I will try to implement this idea and see whether the
service stability improves much. Thanks for your help!
Michal Hocko Nov. 26, 2021, 10:46 a.m. UTC | #13
On Fri 26-11-21 14:50:44, Hao Lee wrote:
> On Thu, Nov 25, 2021 at 10:18 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Could you share more about requirements for those? Why is unmapping in
> > any of their hot paths which really require low latencies? Because as
> > long as unmapping requires a shared resource - like lru lock - then you
> > have a bottle necks.
> 
> We deploy best-effort (BE) jobs (e.g. bigdata, machine learning) and
> latency-critical (LC) jobs (e.g. map navigation, payments services) on the
> same servers to improve resource utilization. The running time of BE jobs are
> very short, but its memory consumption is large, and these jobs will run
> periodically. The LC jobs are long-run services and are sensitive to delays
> because jitters may cause customer churn.

Have you tried to isolate those workloads with memory cgroups? That could
help with the lru lock at least. You are likely going to hit other locks on
the way though, e.g. the zone lock in the page allocator, but that might be
less problematic in the end. If you isolate your long-running services
on a different NUMA node, then you get even less interaction.

> If a batch of BE jobs are finished simultaneously, lots of memory are freed,
> and spinlock contentions happen. BE jobs don't care about these contentions,
> but contentions cause them to spend more time in kernel mode, and thus, LC
> jobs running on the same cpu cores will be delayed and jitters occur. (The
> kernel preemption is disabled on our servers, and we try not to separate
> LC/BE using cpuset in order to achieve "complete mixture deployment"). Then
> LC services people will complain about the poor service stability. This
> scenario has occurred several times, so we want to find a way to avoid it.

It will be hard and a constant fight to get reasonably low latencies on
a non-preemptible kernel. It would likely be better to partition CPUs
between latency-sensitive and BE jobs. I can see how that might not be
really practical, but especially with non-preemptible kernels you have a
large space for priority inversions that is hard to foresee or contain.
Hao Lee Nov. 26, 2021, 4:26 p.m. UTC | #14
On Fri, Nov 26, 2021 at 11:46:29AM +0100, Michal Hocko wrote:
> On Fri 26-11-21 14:50:44, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 10:18 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Could you share more about requirements for those? Why is unmapping in
> > > any of their hot paths which really require low latencies? Because as
> > > long as unmapping requires a shared resource - like lru lock - then you
> > > have a bottle necks.
> > 
> > We deploy best-effort (BE) jobs (e.g. bigdata, machine learning) and
> > latency-critical (LC) jobs (e.g. map navigation, payments services) on the
> > same servers to improve resource utilization. The running time of BE jobs are
> > very short, but its memory consumption is large, and these jobs will run
> > periodically. The LC jobs are long-run services and are sensitive to delays
> > because jitters may cause customer churn.
> 
> Have you tried to isolate those workloads by memory cgroups? That could
> help for lru lock at least.

Sure. LC and BE jobs are in different memory cgroups (containers). memcg
indeed avoids lru contention between LC and BE, although it can't reduce
contention between jobs in the same cgroup. BE jobs' memory contention
could cause cpu jitter and then interfere with LC jobs.

> You are likely going to hit other locks on
> the way though. E.g. zone lock in the page allocator but that might be
> less problematic in the end.

Yes, but we haven't encountered lock contention in the allocation path so
far. Maybe this is because the memory allocations of BE jobs are still
gradual and are not clustered into a very short time period.

> If you isolate your long running services
> to a different NUMA node then you can get even less interaction.

Agree.

> 
> > If a batch of BE jobs are finished simultaneously, lots of memory are freed,
> > and spinlock contentions happen. BE jobs don't care about these contentions,
> > but contentions cause them to spend more time in kernel mode, and thus, LC
> > jobs running on the same cpu cores will be delayed and jitters occur. (The
> > kernel preemption is disabled on our servers, and we try not to separate
> > LC/BE using cpuset in order to achieve "complete mixture deployment"). Then
> > LC services people will complain about the poor service stability. This
> > scenario has occurred several times, so we want to find a way to avoid it.
> 
> It will be hard and a constant fight to get reasonably low latencies on
> a non preemptible kernel. It would likely be better to partition CPUs
> between latency sensitive and BE jobs. I can see how that might not be
> really practical but especially with non-preemptible kernels you have a
> large space for priority inversions that is hard to forsee or contain.

Agree. It's really hard. Maybe we will eventually use cpuset to separate LC
and BE if we can't find a better way to mix them on the same set of cpus.

I will try Matthew's idea of using a semaphore or mutex to limit the number of BE
jobs that are in the exit path. This sounds like a feasible approach for
our scenario...

Thanks

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 29, 2021, 8:39 a.m. UTC | #15
On Fri 26-11-21 16:26:23, Hao Lee wrote:
[...]
> I will try Matthew's idea to use semaphore or mutex to limit the number of BE
> jobs that are in the exiting path. This sounds like a feasible approach for
> our scenario...

I am not really sure this is something that would be acceptable. Your
problem is resource partitioning. Papering that over with a lock is not
the right way to go. Besides that, you will likely hit a hard question about
how many tasks to allow to run concurrently. Whatever the value, some
workload is very likely going to suffer. We cannot expect the admin to
choose the right value because there is no clear answer for that. Not to
mention other potential problems - e.g. even more priority inversions
etc.
Matthew Wilcox Nov. 29, 2021, 1:23 p.m. UTC | #16
On Mon, Nov 29, 2021 at 09:39:16AM +0100, Michal Hocko wrote:
> On Fri 26-11-21 16:26:23, Hao Lee wrote:
> [...]
> > I will try Matthew's idea to use semaphore or mutex to limit the number of BE
> > jobs that are in the exiting path. This sounds like a feasible approach for
> > our scenario...
> 
> I am not really sure this is something that would be acceptable. Your
> problem is resource partitioning. Papering that over by a lock is not
> the right way to go. Besides that you will likely hit a hard question on
> how many tasks to allow to run concurrently. Whatever the value some
> workload will very likely going to suffer. We cannot assume admin to
> chose the right value because there is no clear answer for that. Not to
> mention other potential problems - e.g. even more priority inversions
> etc.

I don't see how we get priority inversions.  These tasks are exiting; at
the point they take the semaphore, they should not be holding any locks.
They're holding a resource (memory) that needs to be released, but a
task wanting to acquire memory must already be prepared to sleep.

I see this as being a thundering herd problem.  We have dozens, maybe
hundreds of tasks all trying to free their memory at once.  If we force
the herd to go through a narrow gap, they arrive at the spinlock in an
orderly manner.
Michal Hocko Nov. 29, 2021, 1:39 p.m. UTC | #17
On Mon 29-11-21 13:23:19, Matthew Wilcox wrote:
> On Mon, Nov 29, 2021 at 09:39:16AM +0100, Michal Hocko wrote:
> > On Fri 26-11-21 16:26:23, Hao Lee wrote:
> > [...]
> > > I will try Matthew's idea to use semaphore or mutex to limit the number of BE
> > > jobs that are in the exiting path. This sounds like a feasible approach for
> > > our scenario...
> > 
> > I am not really sure this is something that would be acceptable. Your
> > problem is resource partitioning. Papering that over by a lock is not
> > the right way to go. Besides that you will likely hit a hard question on
> > how many tasks to allow to run concurrently. Whatever the value some
> > workload will very likely going to suffer. We cannot assume admin to
> > chose the right value because there is no clear answer for that. Not to
> > mention other potential problems - e.g. even more priority inversions
> > etc.
> 
> I don't see how we get priority inversions.  These tasks are exiting; at
> the point they take the semaphore, they should not be holding any locks.
> They're holding a resource (memory) that needs to be released, but a
> task wanting to acquire memory must already be prepared to sleep.

At least these scenarios come to mind:
- a task being blocked by other lower-priority tasks slowly tearing down
  their address spaces - essentially a different incarnation of the same
  problem this is trying to handle
- a huge memory-backed task waiting for many smaller ones to finish
- waste of resources on properly partitioned systems. Why should
  somebody block tasks when they are acting on different lruvecs and
  cpus?

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6..b06a5bcfd8f6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -800,6 +800,8 @@  struct lruvec *folio_lruvec_lock(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 						unsigned long *flags);
+struct lruvec *folio_lruvec_trylock_irqsave(struct folio *folio,
+						unsigned long *flags);
 
 #ifdef CONFIG_DEBUG_VM
 void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
@@ -1313,6 +1315,17 @@  static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 	return &pgdat->__lruvec;
 }
 
+static inline struct lruvec *folio_lruvec_trylock_irqsave(struct folio *folio,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	if (spin_trylock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp))
+		return &pgdat->__lruvec;
+	else
+		return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1598,6 +1611,19 @@  static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio,
 	return folio_lruvec_lock_irqsave(folio, flags);
 }
 
+static inline struct lruvec *folio_lruvec_tryrelock_irqsave(struct folio *folio,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	if (locked_lruvec) {
+		if (folio_matches_lruvec(folio, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
+
+	return folio_lruvec_trylock_irqsave(folio, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e92015..b60ba54e2337 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1261,6 +1261,18 @@  struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 	return lruvec;
 }
 
+struct lruvec *folio_lruvec_trylock_irqsave(struct folio *folio,
+		unsigned long *flags)
+{
+	struct lruvec *lruvec = folio_lruvec(folio);
+
+	if (spin_trylock_irqsave(&lruvec->lru_lock, *flags)) {
+		lruvec_memcg_debug(lruvec, folio);
+		return lruvec;
+	}
+
+	return NULL;
+}
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
diff --git a/mm/swap.c b/mm/swap.c
index e8c9dc6d0377..91850d51a5a5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -960,8 +960,14 @@  void release_pages(struct page **pages, int nr)
 		if (PageLRU(page)) {
 			struct lruvec *prev_lruvec = lruvec;
 
-			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
+retry:
+			lruvec = folio_lruvec_tryrelock_irqsave(folio, lruvec,
 									&flags);
+			if (!lruvec) {
+				cond_resched();
+				goto retry;
+			}
+
 			if (prev_lruvec != lruvec)
 				lock_batch = 0;