diff mbox series

sched/fair: don't use waker's cpu if the waker of sync wake-up is interrupt

Message ID 20210427023758.4048-1-song.bao.hua@hisilicon.com (mailing list archive)
State New, archived
Headers show
Series sched/fair: don't use waker's cpu if the waker of sync wake-up is interrupt | expand

Commit Message

Song Bao Hua (Barry Song) April 27, 2021, 2:37 a.m. UTC
a severe qperf performance decrease was reported in the below use case:
For a hardware with 2 NUMA nodes, node0 has cpu0-31, node1 has cpu32-63.
Ethernet is located in node1.

Run the below commands:
$ taskset -c 32-63 stress -c 32 &
$ qperf 192.168.50.166 tcp_lat
tcp_lat:
	latency = 2.95ms.
Normally the latency should be less than 20us. But in the above test,
latency increased dramatically to 2.95ms.

This is caused by ping-pong of qperf between node0 and node1. Since it
is a sync wake-up and waker's nr_running == 1, WAKE_AFFINE will pull
qperf to node1, but LB will soon migrate qperf back to node0.
Not like a normal sync wake-up coming from a task, the waker in the above
test is an interrupt and nr_running happens to be 1 since stress starts
32 threads on node1 with 32 cpus.

Testing also shows the performance of qperf won't drop if the number
of threads are increased to 64, 96 or larger values:
$ taskset -c 32-63 stress -c 96 &
$ qperf 192.168.50.166 tcp_lat
tcp_lat:
	latency = 14.7us.

Obviously "-c 96" makes "cpu_rq(this_cpu)->nr_running == 1" false in
wake_affine_idle() so WAKE_AFFINE won't pull qperf to node1.

To fix this issue, this patch checks the waker of sync wake-up is a task
but not an interrupt. In this case, the waker will schedule out and give
CPU to wakee.

Reported-by: Yongjia Xie <xieyongjia1@huawei.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Mike Galbraith April 27, 2021, 4:21 a.m. UTC | #1
On Tue, 2021-04-27 at 14:37 +1200, Barry Song wrote:
>
> To fix this issue, this patch checks the waker of sync wake-up is a task
> but not an interrupt. In this case, the waker will schedule out and give
> CPU to wakee.

That rash "the waker will schedule out" assumption, ie this really
really is a synchronous wakeup, may be true in your particular case,
but the sync hint is so overused as to be fairly meaningless. We've
squabbled with it repeatedly over the years because of that.  It really
should either become more of a communication of intent than it
currently is, or just go away.

I'd argue for go away, simply because there is no way for the kernel to
know that there isn't more work directly behind any particular wakeup.

Take a pipe, does shoving some bits through a pipe mean you have no
further use of your CPU?  IFF you're doing nothing but playing ping-
pong, sure it does, but how many real proggies have zero overlap among
its threads of execution?  The mere notion of threaded apps having no
overlap *to be converted to throughput* is dainbramaged, which should
be the death knell of the sync wakeup hint.  Threaded apps can't do
stuff like, oh, networking, which uses the sync hint heavily, without
at least to some extent defeating the purpose of threading if we were
to take the hint seriously.  Heck, just look at the beauty (gak) of
wake_wide().  It was born specifically to combat the pure-evil side of
the sync wakeup hint.

Bah, 'nuff "Danger Will Robinson, that thing will *eat you*!!" ;-)

	-Mike
Song Bao Hua (Barry Song) April 27, 2021, 4:44 a.m. UTC | #2
> -----Original Message-----
> From: Mike Galbraith [mailto:efault@gmx.de]
> Sent: Tuesday, April 27, 2021 4:22 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> vincent.guittot@linaro.org; mingo@redhat.com; peterz@infradead.org;
> dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
> mgorman@suse.de
> Cc: valentin.schneider@arm.com; juri.lelli@redhat.com; bristot@redhat.com;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; xuwei (O)
> <xuwei5@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> guodong.xu@linaro.org; yangyicong <yangyicong@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; linuxarm@openeuler.org; wanghuiqiang
> <wanghuiqiang@huawei.com>; xieyongjia (A) <xieyongjia1@huawei.com>
> Subject: Re: [PATCH] sched/fair: don't use waker's cpu if the waker of sync
> wake-up is interrupt
> 
> On Tue, 2021-04-27 at 14:37 +1200, Barry Song wrote:
> >
> > To fix this issue, this patch checks the waker of sync wake-up is a task
> > but not an interrupt. In this case, the waker will schedule out and give
> > CPU to wakee.
> 
> That rash "the waker will schedule out" assumption, ie this really
> really is a synchronous wakeup, may be true in your particular case,

Hi Mike,

For my particular case, sync hint is used by interrupt but not task.
so "the waker will schedule out" is false because the existing task
is relevant at all.

Here the description "the waker will schedule out" is just copying
the general idea of sync wake-up though the real code might overuse
this hint.

> but the sync hint is so overused as to be fairly meaningless. We've
> squabbled with it repeatedly over the years because of that.  It really
> should either become more of a communication of intent than it
> currently is, or just go away.

I agree sync hint might have been overused by other kernel subsystem.
But this patch will at least fix a case: sync waker is interrupt,
in this case, the existing task has nothing to do with waker and wakee,
so this case should be excluded from wake_affine_idle().

> 
> I'd argue for go away, simply because there is no way for the kernel to
> know that there isn't more work directly behind any particular wakeup.

To some extent, "go way" might be a good choice. But this is a more
aggressive action. For those cases waker will really scheduler out,
wakee won't be able to take the advantage of getting the idle cpu
of waker.

> 
> Take a pipe, does shoving some bits through a pipe mean you have no
> further use of your CPU?  IFF you're doing nothing but playing ping-
> pong, sure it does, but how many real proggies have zero overlap among
> its threads of execution?  The mere notion of threaded apps having no
> overlap *to be converted to throughput* is dainbramaged, which should
> be the death knell of the sync wakeup hint.  Threaded apps can't do
> stuff like, oh, networking, which uses the sync hint heavily, without
> at least to some extent defeating the purpose of threading if we were
> to take the hint seriously.  Heck, just look at the beauty (gak) of
> wake_wide().  It was born specifically to combat the pure-evil side of
> the sync wakeup hint.

As above, removing the code of migrating wakee to the cpu of sync waker
could be an option, but needs more investigations.

> 
> Bah, 'nuff "Danger Will Robinson, that thing will *eat you*!!" ;-)
> 
> 	-Mike

Thanks
Barry
Mike Galbraith April 27, 2021, 5:54 a.m. UTC | #3
On Tue, 2021-04-27 at 04:44 +0000, Song Bao Hua (Barry Song) wrote:
>
>
> I agree sync hint might have been overused by other kernel subsystem.
> But this patch will at least fix a case: sync waker is interrupt,
> in this case, the existing task has nothing to do with waker and wakee,
> so this case should be excluded from wake_affine_idle().

I long ago tried filtering interrupt wakeups, and met some surprises.
Wakeup twiddling always managing to end up being a rob Peter to pay
Paul operation despite our best efforts, here's hoping that your pile
of stolen cycles is small enough to escape performance bot notice :)

	-Mike
Song Bao Hua (Barry Song) April 27, 2021, 6:05 a.m. UTC | #4
> -----Original Message-----
> From: Mike Galbraith [mailto:efault@gmx.de]
> Sent: Tuesday, April 27, 2021 5:55 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> vincent.guittot@linaro.org; mingo@redhat.com; peterz@infradead.org;
> dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
> mgorman@suse.de
> Cc: valentin.schneider@arm.com; juri.lelli@redhat.com; bristot@redhat.com;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; xuwei (O)
> <xuwei5@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> guodong.xu@linaro.org; yangyicong <yangyicong@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; linuxarm@openeuler.org; wanghuiqiang
> <wanghuiqiang@huawei.com>; xieyongjia (A) <xieyongjia1@huawei.com>
> Subject: Re: [PATCH] sched/fair: don't use waker's cpu if the waker of sync
> wake-up is interrupt
> 
> On Tue, 2021-04-27 at 04:44 +0000, Song Bao Hua (Barry Song) wrote:
> >
> >
> > I agree sync hint might have been overused by other kernel subsystem.
> > But this patch will at least fix a case: sync waker is interrupt,
> > in this case, the existing task has nothing to do with waker and wakee,
> > so this case should be excluded from wake_affine_idle().
> 
> I long ago tried filtering interrupt wakeups, and met some surprises.
> Wakeup twiddling always managing to end up being a rob Peter to pay
> Paul operation despite our best efforts, here's hoping that your pile
> of stolen cycles is small enough to escape performance bot notice :)

Would you like to share the link you did before to filter interrupt
wakeups?

The wake up path has hundreds of lines of code, so I don't expect that
reading preempt_count will cause visible performance losses to bot. But
who knows :-)

> 
> 	-Mike

Thanks
Barry
Mike Galbraith April 27, 2021, 6:15 a.m. UTC | #5
On Tue, 2021-04-27 at 06:05 +0000, Song Bao Hua (Barry Song) wrote:
>
> Would you like to share the link you did before to filter interrupt
> wakeups?

Can't, failed experiments go only to my Bitmaster-9000 patch shredder
to avoid needing a snow plow to get to near my box.  Besides, it was
long ago in a much changed code base.

	-Mike
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..8ad2d732033d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5829,7 +5829,12 @@  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
 		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	/*
+	 * If this is a sync wake-up and the only running thread is just
+	 * waker, thus, waker is not interrupt, we assume wakee will get
+	 * the cpu of waker soon
+	 */
+	if (sync && cpu_rq(this_cpu)->nr_running == 1 && in_task())
 		return this_cpu;
 
 	if (available_idle_cpu(prev_cpu))