
[RFC,0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

Message ID 20201207091516.24683-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman Dec. 7, 2020, 9:15 a.m. UTC
This is a minimal series to reduce the amount of runqueue scanning in
select_idle_sibling in the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
	of scanning. It should be relatively uncontroversial.

Patches 3-4 scan the runqueues in a single pass for select_idle_core()
	and select_idle_cpu() so runqueues are not scanned twice. It's
	a tradeoff because it benefits deep scans but introduces overhead
	for shallow scans.

Even if patches 3-4 are rejected to allow more time for Aubrey's idle cpu mask
approach to stand on its own, patches 1-2 should be fine. The main decision
with patch 4 is whether select_idle_core() should do a full scan when searching
for an idle core, whether it should be throttled in some other fashion or
whether it should be just left alone.
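
For illustration, the rough shape of the single-pass idea in patches 3-4 is
sketched below. This is a simplified sketch rather than the actual patches:
the function name is invented, SMT corner cases are glossed over, and it
simply reuses the existing select_idle_mask scratch cpumask.

/*
 * Illustrative sketch (not the actual patches): scan the LLC once.  When a
 * core turns out to be busy, drop all of its SMT siblings from the candidate
 * mask so they are not visited again, and remember any idle CPU seen along
 * the way so a second select_idle_cpu() pass is unnecessary.
 */
static int select_idle_core_and_cpu(struct task_struct *p,
				    struct sched_domain *sd, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int cpu, idle_cpu = -1;

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		bool idle_core = true;
		int sibling;

		for_each_cpu(sibling, cpu_smt_mask(cpu)) {
			if (!available_idle_cpu(sibling))
				idle_core = false;
			else if (idle_cpu == -1 &&
				 cpumask_test_cpu(sibling, p->cpus_ptr))
				idle_cpu = sibling;
		}

		/* Whole core is idle: use it. */
		if (idle_core)
			return cpu;

		/* Core is busy: do not visit its siblings again. */
		cpumask_andnot(cpus, cpus, cpu_smt_mask(cpu));
	}

	/* No idle core found: fall back to any idle CPU seen on the way. */
	return idle_cpu;
}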

Comments

Vincent Guittot Dec. 7, 2020, 3:04 p.m. UTC | #1
On Mon, 7 Dec 2020 at 10:15, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This is a minimal series to reduce the amount of runqueue scanning in
> select_idle_sibling in the worst case.
>
> Patch 1 removes SIS_AVG_CPU because it's unused.
>
> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
>         of scanning. It should be relatively uncontroversial
>
> Patches 3-4 scan the runqueues in a single pass for select_idle_core()
>         and select_idle_cpu() so runqueues are not scanned twice. It's
>         a tradeoff because it benefits deep scans but introduces overhead
>         for shallow scans.
>
> Even if patches 3-4 are rejected to allow more time for Aubrey's idle cpu mask

Patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
like patch 4, which manipulates different cpumasks, including
load_balance_mask, outside of load balancing. I'd prefer to wait for v6
of Aubrey's patchset, which should fix the problem of possibly scanning
busy CPUs twice in select_idle_core() and select_idle_cpu().
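
For context, the tail of the current select_idle_sibling() is roughly the
following (paraphrased and wrapped in a hypothetical helper for readability),
which is why busy runqueues can be inspected twice when no idle core exists:

/*
 * Paraphrase of the current mainline behaviour (not a real function):
 * select_idle_core() walks the LLC looking for a fully idle core and, if
 * that fails, select_idle_cpu() walks largely the same CPUs again looking
 * for any idle CPU, so busy runqueues can be inspected twice.
 */
static int select_idle_sibling_tail(struct task_struct *p,
				    struct sched_domain *sd, int target)
{
	int i;

	i = select_idle_core(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	return target;
}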



> approach to stand on its own, patches 1-2 should be fine. The main decision
> with patch 4 is whether select_idle_core() should do a full scan when searching
> for an idle core, whether it should be throttled in some other fashion or
> whether it should be just left alone.
>
> --
> 2.26.2
>
Mel Gorman Dec. 7, 2020, 3:42 p.m. UTC | #2
On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
> On Mon, 7 Dec 2020 at 10:15, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > This is a minimal series to reduce the amount of runqueue scanning in
> > select_idle_sibling in the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
> >         of scanning. It should be relatively uncontroversial
> >
> > Patches 3-4 scan the runqueues in a single pass for select_idle_core()
> >         and select_idle_cpu() so runqueues are not scanned twice. It's
> >         a tradeoff because it benefits deep scans but introduces overhead
> >         for shallow scans.
> >
> > Even if patches 3-4 are rejected to allow more time for Aubrey's idle cpu mask
> 
> Patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
> like patch 4, which manipulates different cpumasks, including
> load_balance_mask, outside of load balancing. I'd prefer to wait for v6
> of Aubrey's patchset, which should fix the problem of possibly scanning
> busy CPUs twice in select_idle_core() and select_idle_cpu().
> 

Seems fair; we can see where we stand after V6 of Aubrey's work. A lot
of the motivation for patch 4 would go away if we managed to avoid calling
select_idle_core() unnecessarily. As it stands, we can call it a lot from
hackbench even though the chance of finding an idle core is minimal.

Assuming I revisit it, I'll update the schedstat debug patches to count
how many times select_idle_core() starts versus how many times it fails,
and see whether I can come up with a useful heuristic.
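
To make that concrete, the instrumentation could be as small as a pair of
schedstat counters, something like the sketch below (the two sd fields are
invented for illustration and do not exist today):

/*
 * Instrumentation sketch only: count how often select_idle_core() starts
 * a scan versus how often that scan fails to find an idle core.  The
 * sis_core_scan/sis_core_fail fields are invented for illustration.
 */
static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
			    int target)
{
	int core = -1;

	schedstat_inc(sd->sis_core_scan);

	/* ... existing search for a fully idle core sets 'core' ... */

	if (core < 0)
		schedstat_inc(sd->sis_core_fail);

	return core;
}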

I'll wait for more review on patches 1-3 and if I hear nothing, I'll
resend just those.

Thanks Vincent.
Aubrey Li Dec. 8, 2020, 2:06 a.m. UTC | #3
On 2020/12/7 23:42, Mel Gorman wrote:
> On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
>> On Mon, 7 Dec 2020 at 10:15, Mel Gorman <mgorman@techsingularity.net> wrote:
>>>
>>> This is a minimal series to reduce the amount of runqueue scanning in
>>> select_idle_sibling in the worst case.
>>>
>>> Patch 1 removes SIS_AVG_CPU because it's unused.
>>>
>>> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
>>>         of scanning. It should be relatively uncontroversial
>>>
>>> Patches 3-4 scan the runqueues in a single pass for select_idle_core()
>>>         and select_idle_cpu() so runqueues are not scanned twice. It's
>>>         a tradeoff because it benefits deep scans but introduces overhead
>>>         for shallow scans.
>>>
>>> Even if patches 3-4 are rejected to allow more time for Aubrey's idle cpu mask
>>
>> Patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
>> like patch 4, which manipulates different cpumasks, including
>> load_balance_mask, outside of load balancing. I'd prefer to wait for v6
>> of Aubrey's patchset, which should fix the problem of possibly scanning
>> busy CPUs twice in select_idle_core() and select_idle_cpu().
>>
> 
> Seems fair; we can see where we stand after V6 of Aubrey's work. A lot
> of the motivation for patch 4 would go away if we managed to avoid calling
> select_idle_core() unnecessarily. As it stands, we can call it a lot from
> hackbench even though the chance of finding an idle core is minimal.
> 

Sorry for the delay, I sent v6 out just now. Compared to v5, v6 follows
Vincent's suggestion to decouple the idle cpumask update from the stop_tick
signal; that is, the CPU is set in the idle cpumask every time it enters
idle. This should address Peter's concern about the Facebook tail-latency
workload, as I didn't see any regression in the schbench 99.0000th
percentile latency report.

However, I also didn't see any significant benefit so far; probably I
should put more load on the system. I'll do more characterization of the
uperf workload to see if I can find anything.
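
Roughly, the idea is that each CPU marks itself in a per-LLC cpumask as it
enters idle and clears itself when it picks up work, and select_idle_cpu()
then scans only that mask. A heavily simplified sketch (not the actual diff;
the idle_cpus_span field name is illustrative):

/*
 * Heavily simplified sketch, not the actual diff.  The idle_cpus_span
 * field is illustrative: set the CPU in a per-LLC idle cpumask whenever
 * it enters idle, independent of whether the tick is stopped, and clear
 * it when it leaves idle, so select_idle_cpu() can restrict its scan to
 * this mask instead of every runqueue in the LLC.
 */
static void update_idle_cpumask(int cpu, bool idle)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sds)
		return;

	if (idle)
		cpumask_set_cpu(cpu, &sds->idle_cpus_span);
	else
		cpumask_clear_cpu(cpu, &sds->idle_cpus_span);
}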

Thanks,
-Aubrey