diff mbox

[4.17,regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

Message ID 20180615135212.wq45co7ootvdeo2f@techsingularity.net (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Mel Gorman June 15, 2018, 1:52 p.m. UTC
On Fri, Jun 15, 2018 at 02:23:17PM +0200, Jirka Hladky wrote:
> > I added configurations that used half of the CPUs. However, that would
> > mean it fits too nicely within sockets. I've added another set for one
> > third of the CPUs and scheduled the tests. Unfortunately, they will not
> > complete quickly as my test grid has a massive backlog of work.
> 
> 
> We always use the number of threads being an integer multiple of the number
> of sockets.  With another number of threads, we have seen the bigger
> variation in results (that's variation between subsequent runs of the same
> test).
> 

It's not immediately obvious what's special about those numbers. I did
briefly recheck the variability of NAS on one of the machines but the
coefficient of variation was usually quite low with occasional outliers
of +/- 5% or +/- 7%. Anyway, it's a side-issue.

> > Nice one, thanks. It's fairly clear that rate limiting may be a major
> > component and it's worth testing with the ratelimit increased. Given that
> > there have been a lot of improvements on locality and corner cases since
> > the rate limit was first introduced, it may also be worth considering
> > eliminating the rate limiting entirely and see what falls out.
> 
> 
> How can we tune mm_numa_migrate_ratelimit? It doesn't seem to be a runtime
> tunable nor kernel boot parameter. Could you please share some hints on how
> to change it and what value to use? I would be interested to try it out.
> 

It's not runtime tunable, I'm afraid; it's a code change and a recompile.
For example, the following allows more pages to be migrated within a
100ms window.

Comments

Mel Gorman June 19, 2018, 3:18 p.m. UTC | #1
On Tue, Jun 19, 2018 at 03:36:53PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have tested following variants:
> 
> var1: 4.16 + 2c83362734dad8e48ccc0710b5cd2436a0323893
> fix1: var1+ ratelimit_pages __read_mostly increased by factor 4x
> -static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> fix2: var1+ ratelimit_pages __read_mostly increased by factor 8x
> -static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> fix3: var1+ ratelimit_pages __read_mostly increased by factor 16x
> -static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 2048 << (20 - PAGE_SHIFT);
> 
> Results for the stream benchmark (standalone processes) have gradually
> improved. For fix3, stream benchmark with runtime 60 seconds does not show
> performance drop compared to 4.16 kernel anymore.
> 

Ok, so at least one option is to remove the rate limiting. It'll be ok as
long as a) cross-node migrations are not a regular event, b) each migrated
task remains on the new socket long enough for the migrations to pay off and
c) the bandwidth used for cross-node migration does not interfere badly with
tasks accessing local memory. It'll vary depending on workload and machine
unfortunately, but the rate limiting never accounted for the real capabilities
of the hardware and cannot detect bandwidth used for regular accesses.

> For the OpenMP NAS, results are still worse than with vanilla 4.16 kernel.
> Increasing the ratelimit has helped, but even the best results of {fix1,
> fix2, fix3} are still some 5-10% slower than with vanilla 4.16 kernel.  If
> I should pick the best value for ratelimit_pages __read_mostly it would be
> fix2:
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 -
> PAGE_SHIFT);
> 
> I have also used the Intel LINPACK (OpenMP) benchmark on 4 NUMA server - it
> gives the similar results as NAS test.
> 
> I think that patch 2c83362734dad8e48ccc0710b5cd2436a0323893 needs a review
> for the OpenMP and standalone processes workflow.
> 

I did get some results, although testing of the different potential patches
(revert, numabalance series, faster scanning in isolation etc.) is still
in progress. However, I did find that rate limiting was not a factor for
NAS at least (STREAM was too short-lived in the configuration I used) on
the machine I used. That does not prevent the rate limiting being removed,
but it highlights that the impact is workload and machine specific.

Second, I did manage to see the regression and the fix from the revert *but*
it required both one third of CPUs to be used and the openMP parallelisation
method. Using all CPUs shows no regression and using a third of the CPUs
with MPI shows no regression. In other words, the impact is specific to
the workload, the configuration and the machine.

I don't have a LINPACK configuration but you say that the behaviour is
similar so I'll stick with NAS.

On the topic of STREAM, it's meant to be a memory bandwidth benchmark and
there is no knowledge within the scheduler for automatically moving tasks
to a memory controller. It really should be tuned to run as one instance
bound to one controller for the figures to make sense. For automatic NUMA
balancing to fix it up, it needs to run long enough and it's not guaranteed
to be optimally located. I think it's less relevant as a workload in this
instance and it'll never be optimal as even spreading early does not mean
it'll spread to each memory controller.

Given the specific requirement of CPUs used and parallelisation method, I
think a plain revert is not the answer because it'll fix one particular
workload and reintroduce regressions in others (as laid out in the
original changelog). There are other observations we can make about the
NAS-OpenMP workload, though:

1. The locality sucks
Parallelising with MPI indicates that locality as measured by the NUMA
hinting achieves 94% local hits and minimal migration. With OpenMP,
locality is 66% with large amounts of migration. Many of the updates are
huge PMDs so this may be an instance of false sharing or it might be the
timing of when migrations start.

2. Migrations with the revert are lower
There are fewer migrations when the patch is reverted and this may be an
indication that it simply benefits by spreading early before any memory
is allocated so that migrations are avoided. Unfortunately, this is not
a universal win for every workload.

I'll experiment a bit with faster migrations on cross-node accesses,
but I think no matter which way we jump on this one it'll be a case of
"it helps one workload and hurts another".
Mel Gorman June 21, 2018, 9:23 a.m. UTC | #2
On Wed, Jun 20, 2018 at 07:25:19PM +0200, Jirka Hladky wrote:
> Hi Mel and others,
> 
> I would like to let you know that I have tested following patch
> 

Understood. FWIW, there is a lot in flight at the moment but the first
likely patch is removing rate limiting entirely and see what falls out.
The rest of the experiment series deals with fast-scan-start, reset of
preferred_nid on cross-node load balancing and dealing with THP false
sharing but it's all preliminary and untested.

Furthermore, matters have been complicated by the posting of "Fixes for
sched/numa_balancing". My own testing indicates that this helped which
means that I need to review this first and then rebase anything else on
top of it.

I would also suggest you test that series, paying particular attention to
a) whether it improves performance and b) how close it gets to the
revert in terms of overall performance.

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..edb550493f06 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1862,7 +1862,7 @@  static struct page *alloc_misplaced_dst_page(struct page *page,
  * window of time. Default here says do not migrate more than 1280M per second.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
 
 /* Returns true if the node is migrate rate-limited after the update */
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,