
[0/4] Reduce scanning of runqueues in select_idle_sibling

Message ID 20201208153501.1467-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman Dec. 8, 2020, 3:34 p.m. UTC
Changelog since v1
o Drop single-pass patch						(vincent)
o Scope variables used for SIS_AVG_CPU					(dietmar)
o Remove redundant assignment						(dietmar)

This reduces the amount of runqueue scanning in select_idle_sibling in
the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 moves all SIS_PROP-related calculations under SIS_PROP

Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
	of scanning. It should be relatively uncontroversial

Patch 4 returns an idle candidate if one is found while scanning for a
	free core.
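
As a rough illustration of patch 4's idea -- a sketch only, not the
actual hunk, and the helper name and signature are made up -- the core
scan remembers any idle CPU it sees so the caller has a fallback when no
fully idle core is found:

/* Sketch: look for a fully idle core, remembering any idle CPU seen. */
static int scan_for_idle_core(const struct cpumask *cpus, int *idle_cpu)
{
	int core, cpu;

	*idle_cpu = -1;
	for_each_cpu(core, cpus) {
		bool idle_core = true;

		for_each_cpu(cpu, cpu_smt_mask(core)) {
			if (!available_idle_cpu(cpu))
				idle_core = false;
			else if (*idle_cpu == -1)
				*idle_cpu = cpu;	/* remember a fallback CPU */
		}

		/* A fully idle core is still the preferred pick. */
		if (idle_core)
			return core;
	}

	return -1;	/* no idle core; caller may fall back to *idle_cpu */
}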

Comments

Mel Gorman Dec. 9, 2020, 2:37 p.m. UTC | #1
On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> Changelog since v1
> o Drop single-pass patch						(vincent)
> o Scope variables used for SIS_AVG_CPU					(dietmar)
> o Remove redundant assignment						(dietmar)
> 
> This reduces the amount of runqueue scanning in select_idle_sibling in
> the worst case.
> 
> Patch 1 removes SIS_AVG_CPU because it's unused.
> 
> Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> 
> Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> 	of scanning. It should be relatively uncontroversial
> 
> Patch 4 returns an idle candidate if one is found while scanning for a
> 	free core.
> 

Any other objections to the series? Vincent marked 1, 3 and 4 as
reviewed. While patch 2 drew some mild cosmetic concerns, I think the
current version's treatment of SIS_PROP is fine as it is: it keeps the
code functionally equivalent to !SIS_PROP without adding too many
SIS_PROP checks.
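
To make that concrete, the shape patch 2 aims for in select_idle_cpu()
is roughly the following (a sketch, not the actual hunk). With the
feature disabled, nr stays at INT_MAX and the scan is unchanged, which
is what keeps it functionally equivalent to !SIS_PROP:

	int nr = INT_MAX;

	if (sched_feat(SIS_PROP)) {
		u64 avg_cost, avg_idle, span_avg;

		/* Only pay for the cost/idle estimates when SIS_PROP is set. */
		avg_idle = this_rq()->avg_idle / 512;
		avg_cost = this_sd->avg_scan_cost + 1;

		span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4 * avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}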
Vincent Guittot Dec. 10, 2020, 8 a.m. UTC | #2
On Wed, 9 Dec 2020 at 15:37, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > Changelog since v1
> > o Drop single-pass patch                                              (vincent)
> > o Scope variables used for SIS_AVG_CPU                                        (dietmar)
> > o Remove redundant assignment                                         (dietmar)
> >
> > This reduces the amount of runqueue scanning in select_idle_sibling in
> > the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> >
> > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> >       of scanning. It should be relatively uncontroversial
> >
> > Patch 4 returns an idle candidate if one is found while scanning for a
> >       free core.
> >
>
> Any other objections to the series? Vincent marked 1, 3 and 4 as
> reviewed. While patch 2 drew some mild cosmetic concerns, I think the
> current version's treatment of SIS_PROP is fine as it is: it keeps the
> code functionally equivalent to !SIS_PROP without adding too many
> SIS_PROP checks.

While testing your patchset and Aubrey's one on top of tip, I'm facing
some perf regression on my arm64 NUMA system on hackbench and reaim.
The regression seems to come from your patchset but I don't know
which patch in particular yet.

hackbench -l 256000 -g 1

v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
with your patchset         15.368(+/- 2.74%) -15.9%

I'm also seeing a perf regression on reaim but this one needs more
investigation before confirming.

TBH, I was not expecting regressions. I'm running more tests to find
which patch is the culprit.


>
> --
> Mel Gorman
> SUSE Labs
Vincent Guittot Dec. 10, 2020, 9:38 a.m. UTC | #3
On Thu, 10 Dec 2020 at 09:00, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 9 Dec 2020 at 15:37, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > > Changelog since v1
> > > o Drop single-pass patch                                              (vincent)
> > > o Scope variables used for SIS_AVG_CPU                                        (dietmar)
> > > o Remove redundant assignment                                         (dietmar)
> > >
> > > This reduces the amount of runqueue scanning in select_idle_sibling in
> > > the worst case.
> > >
> > > Patch 1 removes SIS_AVG_CPU because it's unused.
> > >
> > > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> > >
> > > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> > >       of scanning. It should be relatively uncontroversial
> > >
> > > Patch 4 returns an idle candidate if one is found while scanning for a
> > >       free core.
> > >
> >
> > Any other objections to the series? Vincent marked 1, 3 and 4 as
> > reviewed. While patch 2 drew some mild cosmetic concerns, I think the
> > current version's treatment of SIS_PROP is fine as it is: it keeps the
> > code functionally equivalent to !SIS_PROP without adding too many
> > SIS_PROP checks.
>
> While testing your patchset and Aubrey's one on top of tip, I'm facing
> some perf regression on my arm64 NUMA system on hackbench and reaim.
> The regression seems to come from your patchset but I don't know
> which patch in particular yet.
>
> hackbench -l 256000 -g 1
>
> v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
> with your patchset         15.368(+/- 2.74%) -15.9%
>
> I'm also seeing a perf regression on reaim but this one needs more
> investigation before confirming.
>
> TBH, I was not expecting regressions. I'm running more tests to find
> which patch is the culprit.

The regression comes from patch 3: sched/fair: Do not replace
recent_used_cpu with the new target


>
>
> >
> > --
> > Mel Gorman
> > SUSE Labs
Mel Gorman Dec. 10, 2020, 11:04 a.m. UTC | #4
On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > some perf regression on my arm64 NUMA system on hackbench and reaim.
> > The regression seems to come from your patchset but I don't know
> > which patch in particular yet.
> >
> > hackbench -l 256000 -g 1
> >
> > v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
> > with your patchset         15.368(+/- 2.74%) -15.9%
> >
> > I'm also seeing a perf regression on reaim but this one needs more
> > investigation before confirming.
> >
> > TBH, I was not expecting regressions. I'm running more tests to find
> > which patch is the culprit.
> 
> The regression comes from patch 3: sched/fair: Do not replace
> recent_used_cpu with the new target
> 

That's not entirely surprising. The intent of the patch is to increase the
hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
corner cases. If multiple tasks have the same p->recent_used_cpu, they can
race to use that CPU and stack as a result instead of searching the domain.
If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
a busy sibling which the search would have avoided in select_idle_core().
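
For reference, the guard in question currently looks roughly like this
(condensed from the current select_idle_sibling() code). Note that it
only asks whether recent_used_cpu itself is idle; the state of its SMT
siblings is never consulted:

	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) ||
	     sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
	    asym_fits_capacity(task_util, recent_used_cpu))
		return recent_used_cpu;	/* its SMT sibling may be busy */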

I think you are using processes and sockets for hackbench but as you'll
see later, hackbench can be used both to show losses and gains.

I originally tested with 6 machines covering Broadwell (2 socket), Haswell
(2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
those, just one reported a major problem with 1 group -- the EPYC 1 machine

EPYC hackbench process-sockets
                          5.10.0-rc6             5.10.0-rc6
                       baseline-v2r2         altrecent-v2r5
Amean     1        1.0607 (   0.00%)      1.1480 (  -8.23%)
Amean     4        1.3277 (   0.00%)      1.3117 (   1.21%)
Amean     7        1.6940 (   0.00%)      1.6950 (  -0.06%)
Amean     12       2.1600 (   0.00%)      2.1367 (   1.08%)
Amean     21       3.2450 (   0.00%)      3.5883 ( -10.58%)
Amean     30       4.1673 (   0.00%)      3.9653 (   4.85%)
Amean     48       4.9257 (   0.00%)      5.0000 (  -1.51%)
Amean     79       7.4950 (   0.00%)      7.4563 (   0.52%)
Amean     110     10.4233 (   0.00%)     10.4727 (  -0.47%)
Amean     141     13.4690 (   0.00%)     13.4563 (   0.09%)
Amean     172     16.6450 (   0.00%)     16.6033 (   0.25%)
Amean     203     19.4873 (   0.00%)     19.7893 *  -1.55%*
Amean     234     22.5507 (   0.00%)     22.8033 (  -1.12%)
Amean     265     25.3380 (   0.00%)     25.6490 (  -1.23%)
Amean     296     28.0070 (   0.00%)     28.1270 (  -0.43%)

That's showing an 8% loss for 1 group and also a problem with 21 groups.
Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
group and 9% loss for 21 groups (probably related to the size of the LLC
domain as there are many LLCs per socket on EPYC*).

For the *same* machine running hackbench using pipes instead of sockets
we get

EPYC hackbench process-pipes
Amean     1        0.9497 (   0.00%)      0.9517 (  -0.21%)
Amean     4        1.2253 (   0.00%)      1.1387 (   7.07%)
Amean     7        2.0677 (   0.00%)      1.7460 *  15.56%*
Amean     12       2.8717 (   0.00%)      2.4797 *  13.65%*
Amean     21       4.4053 (   0.00%)      3.7463 *  14.96%*
Amean     30       5.3983 (   0.00%)      4.1097 *  23.87%*
Amean     48       6.1050 (   0.00%)      4.6873 *  23.22%*
Amean     79       7.5640 (   0.00%)      6.8493 (   9.45%)
Amean     110     12.2627 (   0.00%)      9.4613 *  22.84%*
Amean     141     16.9980 (   0.00%)     13.8327 *  18.62%*
Amean     172     21.5280 (   0.00%)     17.3693 *  19.32%*
Amean     203     25.4480 (   0.00%)     20.9947 *  17.50%*
Amean     234     29.6570 (   0.00%)     24.9613 *  15.83%*
Amean     265     33.0713 (   0.00%)     28.1103 *  15.00%*
Amean     296     37.4443 (   0.00%)     31.8757 *  14.87%*

So even on the *same hardware*, hackbench can show very different results
depending on how it is run.

The rest of the machines were more or less neutral for this patch. Once
hackbench saturates the machine, the hit rate on recent_used_cpu is going
to be low

1-socket skylake
Amean     1       1.3183 (   0.00%)      1.2827 *   2.71%*
Amean     3       3.6750 (   0.00%)      3.6610 (   0.38%)
Amean     5       6.1003 (   0.00%)      6.0190 *   1.33%*
Amean     7       8.6063 (   0.00%)      8.6047 (   0.02%)
Amean     12     14.9480 (   0.00%)     15.0327 (  -0.57%)
Amean     18     22.3430 (   0.00%)     22.6680 (  -1.45%)
Amean     24     29.4970 (   0.00%)     29.6677 (  -0.58%)
Amean     30     36.7373 (   0.00%)     36.3687 (   1.00%)
Amean     32     39.0973 (   0.00%)     39.4417 (  -0.88%)

Shows a 2.71% gain for one group, otherwise more or less neutral

2-socket CascadeLake

Amean     1        0.3663 (   0.00%)      0.3657 (   0.18%)
Amean     4        0.7510 (   0.00%)      0.7793 (  -3.77%)
Amean     7        1.2650 (   0.00%)      1.2583 (   0.53%)
Amean     12       1.9510 (   0.00%)      1.9450 (   0.31%)
Amean     21       2.9677 (   0.00%)      3.0277 (  -2.02%)
Amean     30       4.2993 (   0.00%)      4.0237 *   6.41%*
Amean     48       6.5373 (   0.00%)      6.2987 *   3.65%*
Amean     79      10.5513 (   0.00%)     10.3280 (   2.12%)
Amean     110     15.8567 (   0.00%)     13.9817 (  11.82%)
Amean     141     17.4243 (   0.00%)     17.3177 (   0.61%)
Amean     172     21.0473 (   0.00%)     20.9760 (   0.34%)
Amean     203     25.1070 (   0.00%)     25.1150 (  -0.03%)
Amean     234     28.6753 (   0.00%)     28.9383 (  -0.92%)
Amean     265     32.7970 (   0.00%)     32.9663 (  -0.52%)
Amean     296     36.6510 (   0.00%)     36.6753 (  -0.07%)

Neutral for 1 group, small regression for 4 groups, few gains around the
middle, neutral when over-saturated.

select_idle_sibling is a curse because it's very rare that a change to
it is a universal win. On balance, I think it's better to avoid searching
the domain at all where possible even if there are cases where searching
can have a benefit such as finding an idle core instead of picking an
idle CPU with a busy sibling via p->recent_used_cpu.
Vincent Guittot Dec. 11, 2020, 9:51 a.m. UTC | #5
On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > some perf regression on my arm64 NUMA system on hackbench and reaim.
> > > The regression seems to come from your patchset but I don't know
> > > which patch in particular yet.
> > >
> > > hackbench -l 256000 -g 1
> > >
> > > v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
> > > with your patchset         15.368(+/- 2.74%) -15.9%
> > >
> > > I'm also seeing a perf regression on reaim but this one needs more
> > > investigation before confirming.
> > >
> > > TBH, I was not expecting regressions. I'm running more tests to find
> > > which patch is the culprit.
> >
> > The regression comes from patch 3: sched/fair: Do not replace
> > recent_used_cpu with the new target
> >
>
> That's not entirely surprising. The intent of the patch is to increase the
> hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> race to use that CPU and stack as a result instead of searching the domain.
> If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> a busy sibling which the search would have avoided in select_idle_core().
>
> I think you are using processes and sockets for hackbench but as you'll
> see later, hackbench can be used both to show losses and gains.

I ran more hackbench tests with pipe and socket and both show a
regression with patch 3, whereas there is a significant improvement with
the other patches and Aubrey's one.

>
> I originally tested with 6 machines covering Broadwell (2 socket), Haswell
> (2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
> and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
> those, just one reported a major problem with 1 group -- the EPYC 1 machine
>
> EPYC hackbench process-sockets
>                           5.10.0-rc6             5.10.0-rc6
>                        baseline-v2r2         altrecent-v2r5
> Amean     1        1.0607 (   0.00%)      1.1480 (  -8.23%)
> Amean     4        1.3277 (   0.00%)      1.3117 (   1.21%)
> Amean     7        1.6940 (   0.00%)      1.6950 (  -0.06%)
> Amean     12       2.1600 (   0.00%)      2.1367 (   1.08%)
> Amean     21       3.2450 (   0.00%)      3.5883 ( -10.58%)
> Amean     30       4.1673 (   0.00%)      3.9653 (   4.85%)
> Amean     48       4.9257 (   0.00%)      5.0000 (  -1.51%)
> Amean     79       7.4950 (   0.00%)      7.4563 (   0.52%)
> Amean     110     10.4233 (   0.00%)     10.4727 (  -0.47%)
> Amean     141     13.4690 (   0.00%)     13.4563 (   0.09%)
> Amean     172     16.6450 (   0.00%)     16.6033 (   0.25%)
> Amean     203     19.4873 (   0.00%)     19.7893 *  -1.55%*
> Amean     234     22.5507 (   0.00%)     22.8033 (  -1.12%)
> Amean     265     25.3380 (   0.00%)     25.6490 (  -1.23%)
> Amean     296     28.0070 (   0.00%)     28.1270 (  -0.43%)
>
> That's showing an 8% loss for 1 group and also a problem with 21 groups.
> Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
> group and 9% loss for 21 groups (probably related to the size of the LLC
> domain as there are many LLCs per socket on EPYC*).
>
> For the *same* machine running hackbench using pipes instead of sockets
> we get
>
> EPYC hackbench process-pipes
> Amean     1        0.9497 (   0.00%)      0.9517 (  -0.21%)
> Amean     4        1.2253 (   0.00%)      1.1387 (   7.07%)
> Amean     7        2.0677 (   0.00%)      1.7460 *  15.56%*
> Amean     12       2.8717 (   0.00%)      2.4797 *  13.65%*
> Amean     21       4.4053 (   0.00%)      3.7463 *  14.96%*
> Amean     30       5.3983 (   0.00%)      4.1097 *  23.87%*
> Amean     48       6.1050 (   0.00%)      4.6873 *  23.22%*
> Amean     79       7.5640 (   0.00%)      6.8493 (   9.45%)
> Amean     110     12.2627 (   0.00%)      9.4613 *  22.84%*
> Amean     141     16.9980 (   0.00%)     13.8327 *  18.62%*
> Amean     172     21.5280 (   0.00%)     17.3693 *  19.32%*
> Amean     203     25.4480 (   0.00%)     20.9947 *  17.50%*
> Amean     234     29.6570 (   0.00%)     24.9613 *  15.83%*
> Amean     265     33.0713 (   0.00%)     28.1103 *  15.00%*
> Amean     296     37.4443 (   0.00%)     31.8757 *  14.87%*
>
> So even on the *same hardware*, hackbench can show very different results
> depending on how it is run.
>
> The rest of the machines were more or less neutral for this patch. Once
> hackbench saturates the machine, the hit rate on recent_used_cpu is going
> to be low
>
> 1-socket skylake
> Amean     1       1.3183 (   0.00%)      1.2827 *   2.71%*
> Amean     3       3.6750 (   0.00%)      3.6610 (   0.38%)
> Amean     5       6.1003 (   0.00%)      6.0190 *   1.33%*
> Amean     7       8.6063 (   0.00%)      8.6047 (   0.02%)
> Amean     12     14.9480 (   0.00%)     15.0327 (  -0.57%)
> Amean     18     22.3430 (   0.00%)     22.6680 (  -1.45%)
> Amean     24     29.4970 (   0.00%)     29.6677 (  -0.58%)
> Amean     30     36.7373 (   0.00%)     36.3687 (   1.00%)
> Amean     32     39.0973 (   0.00%)     39.4417 (  -0.88%)
>
> Shows a 2.71% gain for one group, otherwise more or less neutral
>
> 2-socket CascadeLake
>
> Amean     1        0.3663 (   0.00%)      0.3657 (   0.18%)
> Amean     4        0.7510 (   0.00%)      0.7793 (  -3.77%)
> Amean     7        1.2650 (   0.00%)      1.2583 (   0.53%)
> Amean     12       1.9510 (   0.00%)      1.9450 (   0.31%)
> Amean     21       2.9677 (   0.00%)      3.0277 (  -2.02%)
> Amean     30       4.2993 (   0.00%)      4.0237 *   6.41%*
> Amean     48       6.5373 (   0.00%)      6.2987 *   3.65%*
> Amean     79      10.5513 (   0.00%)     10.3280 (   2.12%)
> Amean     110     15.8567 (   0.00%)     13.9817 (  11.82%)
> Amean     141     17.4243 (   0.00%)     17.3177 (   0.61%)
> Amean     172     21.0473 (   0.00%)     20.9760 (   0.34%)
> Amean     203     25.1070 (   0.00%)     25.1150 (  -0.03%)
> Amean     234     28.6753 (   0.00%)     28.9383 (  -0.92%)
> Amean     265     32.7970 (   0.00%)     32.9663 (  -0.52%)
> Amean     296     36.6510 (   0.00%)     36.6753 (  -0.07%)
>
> Neutral for 1 group, small regression for 4 groups, few gains around the
> middle, neutral when over-saturated.
>
> select_idle_sibling is a curse because it's very rare that a change to
> it is a universal win. On balance, I think it's better to avoid searching
> the domain at all where possible even if there are cases where searching
> can have a benefit such as finding an idle core instead of picking an
> idle CPU with a busy sibling via p->recent_used_cpu.
>
> --
> Mel Gorman
> SUSE Labs
Mel Gorman Dec. 11, 2020, 10:23 a.m. UTC | #6
On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > > some perf regression on my arm64 NUMA system on hackbench and reaim.
> > > > The regression seems to come from your patchset but I don't know
> > > > which patch in particular yet.
> > > >
> > > > hackbench -l 256000 -g 1
> > > >
> > > > v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
> > > > with your patchset         15.368(+/- 2.74%) -15.9%
> > > >
> > > > I'm also seeing a perf regression on reaim but this one needs more
> > > > investigation before confirming.
> > > >
> > > > TBH, I was not expecting regressions. I'm running more tests to find
> > > > which patch is the culprit.
> > >
> > > The regression comes from patch 3: sched/fair: Do not replace
> > > recent_used_cpu with the new target
> > >
> >
> > That's not entirely surprising. The intent of the patch is to increase the
> > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > race to use that CPU and stack as a result instead of searching the domain.
> > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > a busy sibling which the search would have avoided in select_idle_core().
> >
> > I think you are using processes and sockets for hackbench but as you'll
> > see later, hackbench can be used both to show losses and gains.
> 
> I ran more hackbench tests with pipe and socket and both show a
> regression with patch 3, whereas there is a significant improvement with
> the other patches and Aubrey's one.
> 

Is SMT enabled on your test machine? If not, then patch 4 should make no
difference, but if SMT is enabled, I wonder how this untested version of
patch 3 behaves for you. The main difference is that the recently used
CPU is treated as a search target: the search still checks whether it is
an idle core and, if not, falls through so it is used as an idle CPU
after checking that it is allowed by p->cpus_ptr.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..63980bcf6e70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return prev;
 	}
 
-	/* Check a recently used CPU as a potential idle candidate: */
+	/* Check a recently used CPU as a search target: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
-		return recent_used_cpu;
-	}
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
+		target = recent_used_cpu;
 
 	/*
 	 * For asymmetric CPU capacity systems, our domain of interest is
@@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 	rcu_read_unlock();
Vincent Guittot Dec. 12, 2020, 10:02 a.m. UTC | #7
On Fri, 11 Dec 2020 at 11:23, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> > On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > > > some perf regression on my arm64 NUMA system on hackbench and reaim.
> > > > > The regression seems to come from your patchset but I don't know
> > > > > which patch in particular yet.
> > > > >
> > > > > hackbench -l 256000 -g 1
> > > > >
> > > > > v5.10-rc7 + tip/sched/core 13.255(+/- 3.22%)
> > > > > with your patchset         15.368(+/- 2.74%) -15.9%
> > > > >
> > > > > I'm also seeing a perf regression on reaim but this one needs more
> > > > > investigation before confirming.
> > > > >
> > > > > TBH, I was not expecting regressions. I'm running more tests to find
> > > > > which patch is the culprit.
> > > >
> > > > The regression comes from patch 3: sched/fair: Do not replace
> > > > recent_used_cpu with the new target
> > > >
> > >
> > > That's not entirely surprising. The intent of the patch is to increase the
> > > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > > race to use that CPU and stack as a result instead of searching the domain.
> > > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > > a busy sibling which the search would have avoided in select_idle_core().
> > >
> > > I think you are using processes and sockets for hackbench but as you'll
> > > see later, hackbench can be used both to show losses and gains.
> >
> > I ran more hackbench tests with pipe and socket and both show a
> > regression with patch 3, whereas there is a significant improvement with
> > the other patches and Aubrey's one.
> >
>
> Is SMT enabled on your test machine? If not, then patch 4 should make no

Yes, I have SMT on my system: 2 nodes x 28 cores x 4 SMT threads.



> difference, but if SMT is enabled, I wonder how this untested version of
> patch 3 behaves for you. The main difference is that the recently used
> CPU is treated as a search target: the search still checks whether it is
> an idle core and, if not, falls through so it is used as an idle CPU
> after checking that it is allowed by p->cpus_ptr.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c41875aec23..63980bcf6e70 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>                 return prev;
>         }
>
> -       /* Check a recently used CPU as a potential idle candidate: */
> +       /* Check a recently used CPU as a search target: */
>         recent_used_cpu = p->recent_used_cpu;
> +       p->recent_used_cpu = prev;
>         if (recent_used_cpu != prev &&
>             recent_used_cpu != target &&
>             cpus_share_cache(recent_used_cpu, target) &&
> -           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> -           cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> -           asym_fits_capacity(task_util, recent_used_cpu)) {
> -               /*
> -                * Replace recent_used_cpu with prev as it is a potential
> -                * candidate for the next wake:
> -                */
> -               p->recent_used_cpu = prev;
> -               return recent_used_cpu;
> -       }
> +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
> +               target = recent_used_cpu;
>
>         /*
>          * For asymmetric CPU capacity systems, our domain of interest is
> @@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>                 /* Fast path */
>                 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> -               if (want_affine)
> -                       current->recent_used_cpu = cpu;
>         }
>         rcu_read_unlock();
>
>
> --
> Mel Gorman
> SUSE Labs