[-mm] autonuma: Fix scan period updating

Message ID	20190624025604.30896-1-ying.huang@intel.com (mailing list archive)
State	New, archived
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying <ying.huang@intel.com>, Rik van Riel <riel@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Mel Gorman <mgorman@suse.de>, jhladky@redhat.com, lvenanci@redhat.com, Ingo Molnar <mingo@kernel.org>
Subject: [PATCH -mm] autonuma: Fix scan period updating
Date: Mon, 24 Jun 2019 10:56:04 +0800

Huang, Ying June 24, 2019, 2:56 a.m. UTC

The autonuma scan period should be increased (scanning is slowed down)
if the majority of the page accesses are shared with other processes.
But in the current code, the scan period is decreased (scanning is
sped up) in that situation.

This patch fixes the code.  It has been tested by tracing the scan
period changes and the /proc/vmstat numa_pte_updates counter while
running a multi-threaded memory-access program (most memory areas
are accessed by multiple threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults dominate")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Mel Gorman June 24, 2019, 2:09 p.m. UTC | #1

On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
> The autonuma scan period should be increased (scanning is slowed down)
> if the majority of the page accesses are shared with other processes.
> But in current code, the scan period will be decreased (scanning is
> speeded up) in that situation.
> 
> This patch fixes the code.  And this has been tested via tracing the
> scan period changing and /proc/vmstat numa_pte_updates counter when
> running a multi-threaded memory accessing program (most memory
> areas are accessed by multiple threads).
> 

The patch somewhat flips the logic on whether shared or private is
considered and it's not immediately obvious why that was required. That
aside, other than the impact on numa_pte_updates, what actual
performance difference was measured and on what workloads?

huang ying June 25, 2019, 1:23 p.m. UTC | #2

On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman <mgorman@suse.de> wrote:
>
> On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
> > The autonuma scan period should be increased (scanning is slowed down)
> > if the majority of the page accesses are shared with other processes.
> > But in current code, the scan period will be decreased (scanning is
> > speeded up) in that situation.
> >
> > This patch fixes the code.  And this has been tested via tracing the
> > scan period changing and /proc/vmstat numa_pte_updates counter when
> > running a multi-threaded memory accessing program (most memory
> > areas are accessed by multiple threads).
> >
>
> The patch somewhat flips the logic on whether shared or private is
> considered and it's not immediately obvious why that was required. That
> aside, other than the impact on numa_pte_updates, what actual
> performance difference was measured and on on what workloads?

The original scanning period updating logic doesn't match the original
patch description and comments.  I think the original patch
description and comments make more sense.  So I fix the code logic to
make it match the original patch description and comments.

If my understanding of the original code logic and the original patch
description and comments is correct, do you think the original patch
description and comments are wrong, so that we need to fix the comments
instead?  Or do you think we should first prove whether the original
patch description and comments are correct?

Best Regards,
Huang, Ying

Mel Gorman July 3, 2019, 9:17 a.m. UTC | #3

On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman <mgorman@suse.de> wrote:
> >
> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
> > > The autonuma scan period should be increased (scanning is slowed down)
> > > if the majority of the page accesses are shared with other processes.
> > > But in current code, the scan period will be decreased (scanning is
> > > speeded up) in that situation.
> > >
> > > This patch fixes the code.  And this has been tested via tracing the
> > > scan period changing and /proc/vmstat numa_pte_updates counter when
> > > running a multi-threaded memory accessing program (most memory
> > > areas are accessed by multiple threads).
> > >
> >
> > The patch somewhat flips the logic on whether shared or private is
> > considered and it's not immediately obvious why that was required. That
> > aside, other than the impact on numa_pte_updates, what actual
> > performance difference was measured and on on what workloads?
> 
> The original scanning period updating logic doesn't match the original
> patch description and comments.  I think the original patch
> description and comments make more sense.  So I fix the code logic to
> make it match the original patch description and comments.
> 
> If my understanding to the original code logic and the original patch
> description and comments were correct, do you think the original patch
> description and comments are wrong so we need to fix the comments
> instead?  Or you think we should prove whether the original patch
> description and comments are correct?
> 

I'm about to get knocked offline so cannot answer properly. The code may
indeed be wrong and I have observed higher than expected NUMA scanning
behaviour, although not enough to cause problems. A comment fix is fine
but if you're changing the scanning behaviour, it should be backed up
with data justifying that the change both reduces the observed scanning
and that it has no adverse performance implications.

Huang, Ying July 4, 2019, 12:32 a.m. UTC | #4

Mel Gorman <mgorman@suse.de> writes:

> On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
>> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman <mgorman@suse.de> wrote:
>> >
>> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
>> > > The autonuma scan period should be increased (scanning is slowed down)
>> > > if the majority of the page accesses are shared with other processes.
>> > > But in current code, the scan period will be decreased (scanning is
>> > > speeded up) in that situation.
>> > >
>> > > This patch fixes the code.  And this has been tested via tracing the
>> > > scan period changing and /proc/vmstat numa_pte_updates counter when
>> > > running a multi-threaded memory accessing program (most memory
>> > > areas are accessed by multiple threads).
>> > >
>> >
>> > The patch somewhat flips the logic on whether shared or private is
>> > considered and it's not immediately obvious why that was required. That
>> > aside, other than the impact on numa_pte_updates, what actual
>> > performance difference was measured and on on what workloads?
>> 
>> The original scanning period updating logic doesn't match the original
>> patch description and comments.  I think the original patch
>> description and comments make more sense.  So I fix the code logic to
>> make it match the original patch description and comments.
>> 
>> If my understanding to the original code logic and the original patch
>> description and comments were correct, do you think the original patch
>> description and comments are wrong so we need to fix the comments
>> instead?  Or you think we should prove whether the original patch
>> description and comments are correct?
>> 
>
> I'm about to get knocked offline so cannot answer properly. The code may
> indeed be wrong and I have observed higher than expected NUMA scanning
> behaviour than expected although not enough to cause problems. A comment
> fix is fine but if you're changing the scanning behaviour, it should be
> backed up with data justifying that the change both reduces the observed
> scanning and that it has no adverse performance implications.

Got it!  Thanks for the comments!  As for performance testing, do you
have some candidate workloads?

Best Regards,
Huang, Ying

Mel Gorman July 12, 2019, 8:27 a.m. UTC | #5

On Thu, Jul 04, 2019 at 08:32:06AM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
> >> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman <mgorman@suse.de> wrote:
> >> >
> >> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
> >> > > The autonuma scan period should be increased (scanning is slowed down)
> >> > > if the majority of the page accesses are shared with other processes.
> >> > > But in current code, the scan period will be decreased (scanning is
> >> > > speeded up) in that situation.
> >> > >
> >> > > This patch fixes the code.  And this has been tested via tracing the
> >> > > scan period changing and /proc/vmstat numa_pte_updates counter when
> >> > > running a multi-threaded memory accessing program (most memory
> >> > > areas are accessed by multiple threads).
> >> > >
> >> >
> >> > The patch somewhat flips the logic on whether shared or private is
> >> > considered and it's not immediately obvious why that was required. That
> >> > aside, other than the impact on numa_pte_updates, what actual
> >> > performance difference was measured and on on what workloads?
> >> 
> >> The original scanning period updating logic doesn't match the original
> >> patch description and comments.  I think the original patch
> >> description and comments make more sense.  So I fix the code logic to
> >> make it match the original patch description and comments.
> >> 
> >> If my understanding to the original code logic and the original patch
> >> description and comments were correct, do you think the original patch
> >> description and comments are wrong so we need to fix the comments
> >> instead?  Or you think we should prove whether the original patch
> >> description and comments are correct?
> >> 
> >
> > I'm about to get knocked offline so cannot answer properly. The code may
> > indeed be wrong and I have observed higher than expected NUMA scanning
> > behaviour than expected although not enough to cause problems. A comment
> > fix is fine but if you're changing the scanning behaviour, it should be
> > backed up with data justifying that the change both reduces the observed
> > scanning and that it has no adverse performance implications.
> 
> Got it!  Thanks for comments!  As for performance testing, do you have
> some candidate workloads?
> 

Ordinarily I would hope that the patch was motivated by observed
behaviour so you have a metric for goodness. However, for NUMA balancing
I would typically run basic workloads first -- dbench, tbench, netperf,
hackbench and pipetest. The objective would be to measure the degree to
which automatic NUMA balancing is interfering with a basic workload, to
see if the patch reduces the number of minor faults incurred even though
there is no NUMA balancing to be worried about. This measures the general
overhead of a patch. If your reasoning is correct, you'd expect lower
overhead.

For balancing itself, I usually look at Andrea's original autonuma
benchmark, NAS Parallel Benchmark (D class usually although C class for
much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
than 2015 is (which can be unstable for a variety of reasons). In this
case, I would be looking at whether the overhead is reduced, whether the
ratio of local hits is the same or improved and the primary metric of
each (time to completion for Andrea's and NAS, throughput for JBB).

Even if there is no change to locality and the primary metric but there
is less scanning and overhead overall, it would still be an improvement.

If you have trouble doing such an evaluation, I'll queue tests if they
are based on a patch that addresses the specific point of concern (scan
period not updated) as it's still not obvious why flipping the logic of
whether shared or private is considered was necessary.

Huang, Ying July 12, 2019, 10:48 a.m. UTC | #6

Mel Gorman <mgorman@suse.de> writes:

> On Thu, Jul 04, 2019 at 08:32:06AM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> 
>> > On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
>> >> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman <mgorman@suse.de> wrote:
>> >> >
>> >> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
>> >> > > The autonuma scan period should be increased (scanning is slowed down)
>> >> > > if the majority of the page accesses are shared with other processes.
>> >> > > But in current code, the scan period will be decreased (scanning is
>> >> > > speeded up) in that situation.
>> >> > >
>> >> > > This patch fixes the code.  And this has been tested via tracing the
>> >> > > scan period changing and /proc/vmstat numa_pte_updates counter when
>> >> > > running a multi-threaded memory accessing program (most memory
>> >> > > areas are accessed by multiple threads).
>> >> > >
>> >> >
>> >> > The patch somewhat flips the logic on whether shared or private is
>> >> > considered and it's not immediately obvious why that was required. That
>> >> > aside, other than the impact on numa_pte_updates, what actual
>> >> > performance difference was measured and on on what workloads?
>> >> 
>> >> The original scanning period updating logic doesn't match the original
>> >> patch description and comments.  I think the original patch
>> >> description and comments make more sense.  So I fix the code logic to
>> >> make it match the original patch description and comments.
>> >> 
>> >> If my understanding to the original code logic and the original patch
>> >> description and comments were correct, do you think the original patch
>> >> description and comments are wrong so we need to fix the comments
>> >> instead?  Or you think we should prove whether the original patch
>> >> description and comments are correct?
>> >> 
>> >
>> > I'm about to get knocked offline so cannot answer properly. The code may
>> > indeed be wrong and I have observed higher than expected NUMA scanning
>> > behaviour than expected although not enough to cause problems. A comment
>> > fix is fine but if you're changing the scanning behaviour, it should be
>> > backed up with data justifying that the change both reduces the observed
>> > scanning and that it has no adverse performance implications.
>> 
>> Got it!  Thanks for comments!  As for performance testing, do you have
>> some candidate workloads?
>> 
>
> Ordinarily I would hope that the patch was motivated by observed
> behaviour so you have a metric for goodness. However, for NUMA balancing
> I would typically run basic workloads first -- dbench, tbench, netperf,
> hackbench and pipetest. The objective would be to measure the degree
> automatic NUMA balancing is interfering with a basic workload to see if
> they patch reduces the number of minor faults incurred even though there
> is no NUMA balancing to be worried about. This measures the general
> overhead of a patch. If your reasoning is correct, you'd expect lower
> overhead.
>
> For balancing itself, I usually look at Andrea's original autonuma
> benchmark, NAS Parallel Benchmark (D class usually although C class for
> much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
> benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
> than 2015 is (which can be unstable for a variety of reasons). In this
> case, I would be looking at whether the overhead is reduced, whether the
> ratio of local hits is the same or improved and the primary metric of
> each (time to completion for Andrea's and NAS, throughput for JBB).
>
> Even if there is no change to locality and the primary metric but there
> is less scanning and overhead overall, it would still be an improvement.

Thanks a lot for your detailed guidance.

> If you have trouble doing such an evaluation, I'll queue tests if they
> are based on a patch that addresses the specific point of concern (scan
> period not updated) as it's still not obvious why flipping the logic of
> whether shared or private is considered was necessary.

I can do the evaluation, but it will take quite some time for me to
set up and run all these benchmarks.  So if these benchmarks are already
set up in your environment and the extra effort for you is minimal, it
would be great if you could queue tests for the patch.  Feel free to
decline if this is inconvenient.

Best Regards,
Huang, Ying

Mel Gorman July 12, 2019, 12:50 p.m. UTC | #7

On Fri, Jul 12, 2019 at 06:48:05PM +0800, Huang, Ying wrote:
> > Ordinarily I would hope that the patch was motivated by observed
> > behaviour so you have a metric for goodness. However, for NUMA balancing
> > I would typically run basic workloads first -- dbench, tbench, netperf,
> > hackbench and pipetest. The objective would be to measure the degree
> > automatic NUMA balancing is interfering with a basic workload to see if
> > they patch reduces the number of minor faults incurred even though there
> > is no NUMA balancing to be worried about. This measures the general
> > overhead of a patch. If your reasoning is correct, you'd expect lower
> > overhead.
> >
> > For balancing itself, I usually look at Andrea's original autonuma
> > benchmark, NAS Parallel Benchmark (D class usually although C class for
> > much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
> > benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
> > than 2015 is (which can be unstable for a variety of reasons). In this
> > case, I would be looking at whether the overhead is reduced, whether the
> > ratio of local hits is the same or improved and the primary metric of
> > each (time to completion for Andrea's and NAS, throughput for JBB).
> >
> > Even if there is no change to locality and the primary metric but there
> > is less scanning and overhead overall, it would still be an improvement.
> 
> Thanks a lot for your detailed guidance.
> 

No problem.

> > If you have trouble doing such an evaluation, I'll queue tests if they
> > are based on a patch that addresses the specific point of concern (scan
> > period not updated) as it's still not obvious why flipping the logic of
> > whether shared or private is considered was necessary.
> 
> I can do the evaluation, but it will take quite some time for me to
> setup and run all these benchmarks.  So if these benchmarks have already
> been setup in your environment, so that your extra effort is minimal, it
> will be great if you can queue tests for the patch.  Feel free to reject
> me for any inconvenience.
> 

They're not set up as such, but my testing infrastructure is heavily
automated so it's easy to do and I think it's worth looking at. If you
update your patch to target just the scan period aspects, I'll queue it
up and get back to you. It usually takes a few days for the automation
to finish whatever it's doing and pick up a patch for evaluation.

Huang, Ying July 15, 2019, 8:08 a.m. UTC | #8

Mel Gorman <mgorman@suse.de> writes:

> On Fri, Jul 12, 2019 at 06:48:05PM +0800, Huang, Ying wrote:
>> > Ordinarily I would hope that the patch was motivated by observed
>> > behaviour so you have a metric for goodness. However, for NUMA balancing
>> > I would typically run basic workloads first -- dbench, tbench, netperf,
>> > hackbench and pipetest. The objective would be to measure the degree
>> > automatic NUMA balancing is interfering with a basic workload to see if
>> > they patch reduces the number of minor faults incurred even though there
>> > is no NUMA balancing to be worried about. This measures the general
>> > overhead of a patch. If your reasoning is correct, you'd expect lower
>> > overhead.
>> >
>> > For balancing itself, I usually look at Andrea's original autonuma
>> > benchmark, NAS Parallel Benchmark (D class usually although C class for
>> > much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
>> > benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
>> > than 2015 is (which can be unstable for a variety of reasons). In this
>> > case, I would be looking at whether the overhead is reduced, whether the
>> > ratio of local hits is the same or improved and the primary metric of
>> > each (time to completion for Andrea's and NAS, throughput for JBB).
>> >
>> > Even if there is no change to locality and the primary metric but there
>> > is less scanning and overhead overall, it would still be an improvement.
>> 
>> Thanks a lot for your detailed guidance.
>> 
>
> No problem.
>
>> > If you have trouble doing such an evaluation, I'll queue tests if they
>> > are based on a patch that addresses the specific point of concern (scan
>> > period not updated) as it's still not obvious why flipping the logic of
>> > whether shared or private is considered was necessary.
>> 
>> I can do the evaluation, but it will take quite some time for me to
>> setup and run all these benchmarks.  So if these benchmarks have already
>> been setup in your environment, so that your extra effort is minimal, it
>> will be great if you can queue tests for the patch.  Feel free to reject
>> me for any inconvenience.
>> 
>
> They're not setup as such, but my testing infrastructure is heavily
> automated so it's easy to do and I think it's worth looking at. If you
> update your patch to target just the scan period aspects, I'll queue it
> up and get back to you. It usually takes a few days for the automation
> to finish whatever it's doing and pick up a patch for evaluation.

Thanks a lot for your help!  The updated patch is as follows.  It
targets only the scan period aspects.

Best Regards,
Huang, Ying

----------------------8<----------------------------
From 910a52cbf5a521c1562a573904c9507d0367bb0f Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Sat, 22 Jun 2019 17:36:29 +0800
Subject: [PATCH] autonuma: Fix scan period updating

From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
Slow down scan rate if shared faults dominate"), the autonuma scan
period should be increased (scanning is slowed down) if the majority
of the page accesses are shared with other processes.  But in the
current code, the scan period is decreased (scanning is sped up) in
that situation.

The commit log and comments make more sense.  So this patch fixes the
code to make it match the commit log and comments.  This has been
verified by tracing the scan period changes and the /proc/vmstat
numa_pte_updates counter while running a multi-threaded memory-access
program (most memory areas are accessed by multiple threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults dominate")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..468a1c5038b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
 			unsigned long shared, unsigned long private)
 {
 	unsigned int period_slot;
-	int lr_ratio, ps_ratio;
+	int lr_ratio, sp_ratio;
 	int diff;
 
 	unsigned long remote = p->numa_faults_locality[0];
@@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct *p,
 	 */
 	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
 	lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
-	ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+	sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
 
-	if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+	if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
 		/*
-		 * Most memory accesses are local. There is no need to
-		 * do fast NUMA scanning, since memory is already local.
+		 * Most memory accesses are shared with other tasks.
+		 * There is no point in continuing fast NUMA scanning,
+		 * since other tasks may just move the memory elsewhere.
 		 */
-		int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+		int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
 		if (!slot)
 			slot = 1;
 		diff = slot * period_slot;
 	} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
 		/*
-		 * Most memory accesses are shared with other tasks.
-		 * There is no point in continuing fast NUMA scanning,
-		 * since other tasks may just move the memory elsewhere.
+		 * Most memory accesses are local. There is no need to
+		 * do fast NUMA scanning, since memory is already local.
 		 */
 		int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
 		if (!slot)
@@ -1998,7 +1998,7 @@ static void update_task_scan_period(struct task_struct *p,
 		 * yet they are not on the local NUMA node. Speed up
 		 * NUMA scanning to get the memory moved over.
 		 */
-		int ratio = max(lr_ratio, ps_ratio);
+		int ratio = max(lr_ratio, sp_ratio);
 		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
 	}
