
sysbench throughput degradation in 4.13+

Message ID 20170927135820.61cd077f@cuia.usersys.redhat.com (mailing list archive)
State New, archived

Commit Message

Rik van Riel Sept. 27, 2017, 5:58 p.m. UTC
On Wed, 27 Sep 2017 11:35:30 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> > 
> > MySQL.  We've tried a few different configs with both test=oltp and
> > test=threads, but both show the same behavior.  What I have settled on for
> > my repro is the following:
> >   
> 
> Right, didn't even need to run it in a guest to observe a regression.
> 
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?
> 
> 
> PRE (current tip/master):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>   5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>  10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>  20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>  40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>  80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
> 
> hsw-ex NAS:
> 
> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
> 
> 
> POST (+patch):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>   5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>  10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>  20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>  40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>  80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
> 
> hsw-ex NAS:
> 
> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
> 
> 
> This patch takes out all the shiny wake_affine stuff and goes back to
> utter basics. Rik, was there another NUMA benchmark that wanted your
> fancy stuff? Because NAS isn't it.

I like the simplicity of your approach!  I hope it does not break
stuff like netperf...

I have been working on the patch below, which is much less optimistic
about when to do an affine wakeup than before.

It may be worth testing, in case it works better with some workload,
though relying on cached values still makes me somewhat uneasy.

I will try to get kernels tested here that implement both approaches,
to see what ends up working best.

---8<---
Subject: sched: make wake_affine_llc less eager

With the wake_affine_llc logic, tasks get moved around too eagerly,
and then moved back later, leading to poor performance for some
workloads.

Make wake_affine_llc less eager by comparing the minimum load of
the source LLC with the maximum load of the destination LLC, similar
to how source_load and target_load work for regular migration.

Also, get rid of an overly optimistic test that could potentially
pull across a lot of tasks if the target LLC happened to have fewer
runnable tasks at load balancing time.

Conversely, allow sync wakeups to happen without taking LLC loads
into account, if the waker would leave an idle CPU behind on
the target LLC.

Signed-off-by: Rik van Riel <riel@redhat.com>

---
 include/linux/sched/topology.h |  3 ++-
 kernel/sched/fair.c            | 56 +++++++++++++++++++++++++++++++++---------
 2 files changed, 46 insertions(+), 13 deletions(-)
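
The changelog's reference to source_load() and target_load() is the heart of the idea: a potential migration source is viewed through a min() and the target through a max(), so borderline moves are avoided. A rough standalone model of that pairing (toy names only; the real kernel helpers operate on the rq->cpu_load[] history) looks like this:

/*
 * Toy model of the source_load()/target_load() bias the changelog
 * refers to.  "hist" stands in for a decayed cpu_load[] sample and
 * "now" for the instantaneous weighted load; both names are made up.
 */
unsigned long model_source_load(unsigned long hist, unsigned long now)
{
	return hist < now ? hist : now;	/* min: underestimate the source */
}

unsigned long model_target_load(unsigned long hist, unsigned long now)
{
	return hist > now ? hist : now;	/* max: overestimate the target */
}

The patch applies the same bias at the LLC level: min_max_load() records each CPU's lowest and highest cpu_load[] sample, those values are summed per LLC, and wake_affine_llc() then weighs the destination's max_load against the source's min_load.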

Comments

Eric Farman Sept. 28, 2017, 11:04 a.m. UTC | #1
On 09/27/2017 01:58 PM, Rik van Riel wrote:
> On Wed, 27 Sep 2017 11:35:30 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL.  We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior.  What I have settled on for
>>> my repro is the following:
>>>    
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
>>
>>
>> PRE (current tip/master):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>>    5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>>   10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>>   20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>>   40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>>   80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
>>
>> hsw-ex NAS:
>>
>> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
>> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
>> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
>> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
>> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
>> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
>>
>>
>> POST (+patch):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>>    5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>>   10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>>   20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>>   40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>>   80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
>>
>> hsw-ex NAS:
>>
>> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
>> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
>> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
>>
>>
>> This patch takes out all the shiny wake_affine stuff and goes back to
>> utter basics. Rik, was there another NUMA benchmark that wanted your
>> fancy stuff? Because NAS isn't it.
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.
> 
> It may be worth testing, in case it works better with some workload,
> though relying on cached values still makes me somewhat uneasy.
> 

Here are numbers for our environment, to compare the two patches:

sysbench --test=threads:
next-20170926:		25470.8
-with-Peters-patch:	29559.1
-with-Riks-patch:	29283

sysbench --test=oltp:
next-20170926:		5722.37
-with-Peters-patch:	9623.45
-with-Riks-patch:	9360.59

We didn't record host CPU migrations in all scenarios, but a spot check 
showed a similar reduction with both patches.

  - Eric

> I will try to get kernels tested here that implement both approaches,
> to see what ends up working best.
> 
> ---8<---
> Subject: sched: make wake_affine_llc less eager
> 
> With the wake_affine_llc logic, tasks get moved around too eagerly,
> and then moved back later, leading to poor performance for some
> workloads.
> 
> Make wake_affine_llc less eager by comparing the minimum load of
> the source LLC with the maximum load of the destination LLC, similar
> to how source_load and target_load work for regular migration.
> 
> Also, get rid of an overly optimistic test that could potentially
> pull across a lot of tasks if the target LLC happened to have fewer
> runnable tasks at load balancing time.
> 
> Conversely, allow sync wakeups to happen without taking LLC loads
> into account, if the waker would leave an idle CPU behind on
> the target LLC.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> ---
>   include/linux/sched/topology.h |  3 ++-
>   kernel/sched/fair.c            | 56 +++++++++++++++++++++++++++++++++---------
>   2 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index d7b6dab956ec..0c295ff5049b 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,7 +77,8 @@ struct sched_domain_shared {
>   	 * used by wake_affine().
>   	 */
>   	unsigned long	nr_running;
> -	unsigned long	load;
> +	unsigned long	min_load;
> +	unsigned long	max_load;
>   	unsigned long	capacity;
>   };
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 86195add977f..7740c6776e08 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,23 @@ static unsigned long target_load(int cpu, int type)
>   	return max(rq->cpu_load[type-1], total);
>   }
> 
> +static void min_max_load(int cpu, unsigned long *min_load,
> +		 	 unsigned long *max_load)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	unsigned long minl = ULONG_MAX;
> +	unsigned long maxl = 0;
> +	int i;
> +
> +	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
> +		minl = min(minl, rq->cpu_load[i]);
> +		maxl = max(maxl, rq->cpu_load[i]);
> +	}
> +
> +	*min_load = minl;
> +	*max_load = maxl;
> +}
> +
>   static unsigned long capacity_of(int cpu)
>   {
>   	return cpu_rq(cpu)->cpu_capacity;
> @@ -5310,7 +5327,8 @@ static int wake_wide(struct task_struct *p)
> 
>   struct llc_stats {
>   	unsigned long	nr_running;
> -	unsigned long	load;
> +	unsigned long	min_load;
> +	unsigned long	max_load;
>   	unsigned long	capacity;
>   	int		has_capacity;
>   };
> @@ -5323,7 +5341,8 @@ static bool get_llc_stats(struct llc_stats *stats, int cpu)
>   		return false;
> 
>   	stats->nr_running	= READ_ONCE(sds->nr_running);
> -	stats->load		= READ_ONCE(sds->load);
> +	stats->min_load		= READ_ONCE(sds->min_load);
> +	stats->max_load		= READ_ONCE(sds->max_load);
>   	stats->capacity		= READ_ONCE(sds->capacity);
>   	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
> 
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   		unsigned long current_load = task_h_load(current);
> 
>   		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>   			return true;
> 
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>   	}
> 
>   	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>   		return false;
> 
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>   	/*
>   	 * Check to see if we can move the load without causing too much
>   	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>   	prev_eff_load *= this_stats.capacity;
> 
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
> 
>   	return this_eff_load <= prev_eff_load;
>   }
> @@ -7033,6 +7052,8 @@ enum group_type {
>   struct sg_lb_stats {
>   	unsigned long avg_load; /*Avg load across the CPUs of the group */
>   	unsigned long group_load; /* Total load over the CPUs of the group */
> +	unsigned long min_load;
> +	unsigned long max_load;
>   	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>   	unsigned long load_per_task;
>   	unsigned long group_capacity;
> @@ -7059,6 +7080,8 @@ struct sd_lb_stats {
>   	unsigned long total_load;	/* Total load of all groups in sd */
>   	unsigned long total_capacity;	/* Total capacity of all groups in sd */
>   	unsigned long avg_load;	/* Average load across all groups in sd */
> +	unsigned long min_load;		/* Sum of lowest loadavg on CPUs */
> +	unsigned long max_load;		/* Sum of highest loadavg on CPUs */
> 
>   	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
>   	struct sg_lb_stats local_stat;	/* Statistics of the local group */
> @@ -7077,6 +7100,8 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>   		.local = NULL,
>   		.total_running = 0UL,
>   		.total_load = 0UL,
> +		.min_load = 0UL,
> +		.max_load = 0UL,
>   		.total_capacity = 0UL,
>   		.busiest_stat = {
>   			.avg_load = 0UL,
> @@ -7358,7 +7383,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   			int local_group, struct sg_lb_stats *sgs,
>   			bool *overload)
>   {
> -	unsigned long load;
> +	unsigned long load, min_load, max_load;
>   	int i, nr_running;
> 
>   	memset(sgs, 0, sizeof(*sgs));
> @@ -7372,7 +7397,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		else
>   			load = source_load(i, load_idx);
> 
> +		min_max_load(i, &min_load, &max_load);
> +
>   		sgs->group_load += load;
> +		sgs->min_load += min_load;
> +		sgs->max_load += max_load;
>   		sgs->group_util += cpu_util(i);
>   		sgs->sum_nr_running += rq->cfs.h_nr_running;
> 
> @@ -7569,6 +7598,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   		/* Now, start updating sd_lb_stats */
>   		sds->total_running += sgs->sum_nr_running;
>   		sds->total_load += sgs->group_load;
> +		sds->min_load += sgs->min_load;
> +		sds->max_load += sgs->max_load;
>   		sds->total_capacity += sgs->group_capacity;
> 
>   		sg = sg->next;
> @@ -7596,7 +7627,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   	 * XXX fix that.
>   	 */
>   	WRITE_ONCE(shared->nr_running,	sds->total_running);
> -	WRITE_ONCE(shared->load,	sds->total_load);
> +	WRITE_ONCE(shared->min_load,	sds->min_load);
> +	WRITE_ONCE(shared->max_load,	sds->max_load);
>   	WRITE_ONCE(shared->capacity,	sds->total_capacity);
>   }
> 
> 
>
Peter Zijlstra Sept. 28, 2017, 12:36 p.m. UTC | #2
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...

So the old approach that looks at the weight of the two CPUs behaves
slightly better in the overloaded case. On the threads==nr_cpus load
points they match fairly evenly.

I seem to have misplaced my netperf scripts, but I'll have a play with
it.
Peter Zijlstra Sept. 28, 2017, 12:37 p.m. UTC | #3
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  		unsigned long current_load = task_h_load(current);
>  
>  		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>  			return true;
>  
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>  	}
>  
>  	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>  		return false;
>  
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>  	/*
>  	 * Check to see if we can move the load without causing too much
>  	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>  	prev_eff_load *= this_stats.capacity;
>  
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
>  
>  	return this_eff_load <= prev_eff_load;
>  }

So I would really like a workload that needs this LLC/NUMA stuff.
Because I much prefer the simpler 'on which of these two CPUs can I run
soonest' approach.
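
A minimal standalone sketch of that simpler two-CPU idea (hypothetical struct and function names; this is not the patch Peter actually benchmarked, which is not included in this thread) could look like:

/*
 * Illustrative-only model of "on which of these two CPUs can I run
 * soonest": prefer an idle CPU, otherwise the lighter runqueue.
 */
struct cpu_state {
	int idle;		/* non-zero if the CPU is currently idle */
	unsigned long load;	/* runnable load tracked for the CPU     */
};

int pick_wake_cpu(int this_cpu, const struct cpu_state *this,
		  int prev_cpu, const struct cpu_state *prev)
{
	if (this->idle)
		return this_cpu;
	if (prev->idle)
		return prev_cpu;

	/* Neither CPU is idle: go where the runnable load is lower. */
	return this->load <= prev->load ? this_cpu : prev_cpu;
}

The appeal is that no cached LLC-wide statistics are needed at wakeup time; as noted elsewhere in the thread, the cost is a small regression in the overloaded case.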
Matt Fleming Oct. 2, 2017, 10:53 p.m. UTC | #4
On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.

Running netperf for this patch and Peter's patch shows that Peter's
comes out on top, with scores pretty close to v4.12 in most places on
my 2-NUMA node 48-CPU Xeon box.

I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
but I will next week.

netperf-tcp
                                4.12.0                 4.13.0                 4.13.0                 4.13.0
                               default                default                peterz+                  riel+
Min       64        1653.72 (   0.00%)     1554.29 (  -6.01%)     1601.71 (  -3.15%)     1627.01 (  -1.62%)
Min       128       3240.96 (   0.00%)     2858.49 ( -11.80%)     3122.62 (  -3.65%)     3063.67 (  -5.47%)
Min       256       5840.55 (   0.00%)     4077.43 ( -30.19%)     5529.98 (  -5.32%)     5362.18 (  -8.19%)
Min       1024     16812.11 (   0.00%)    11899.72 ( -29.22%)    16335.83 (  -2.83%)    15075.24 ( -10.33%)
Min       2048     26875.79 (   0.00%)    18852.35 ( -29.85%)    25902.85 (  -3.62%)    22804.82 ( -15.15%)
Min       3312     33060.18 (   0.00%)    20984.28 ( -36.53%)    32817.82 (  -0.73%)    29161.49 ( -11.79%)
Min       4096     34513.24 (   0.00%)    23253.94 ( -32.62%)    34167.80 (  -1.00%)    29349.09 ( -14.96%)
Min       8192     39836.88 (   0.00%)    28881.63 ( -27.50%)    39613.28 (  -0.56%)    35307.95 ( -11.37%)
Min       16384    44203.84 (   0.00%)    31616.74 ( -28.48%)    43608.86 (  -1.35%)    38130.44 ( -13.74%)
Hmean     64        1686.58 (   0.00%)     1613.25 (  -4.35%)     1657.04 (  -1.75%)     1655.38 (  -1.85%)
Hmean     128       3361.84 (   0.00%)     2945.34 ( -12.39%)     3173.47 (  -5.60%)     3122.38 (  -7.12%)
Hmean     256       5993.92 (   0.00%)     4423.32 ( -26.20%)     5618.26 (  -6.27%)     5523.72 (  -7.84%)
Hmean     1024     17225.83 (   0.00%)    12314.23 ( -28.51%)    16574.85 (  -3.78%)    15644.71 (  -9.18%)
Hmean     2048     27944.22 (   0.00%)    21301.63 ( -23.77%)    26395.38 (  -5.54%)    24067.57 ( -13.87%)
Hmean     3312     33760.48 (   0.00%)    22361.07 ( -33.77%)    33198.32 (  -1.67%)    30055.64 ( -10.97%)
Hmean     4096     35077.74 (   0.00%)    29153.73 ( -16.89%)    34479.40 (  -1.71%)    31215.64 ( -11.01%)
Hmean     8192     40674.31 (   0.00%)    33493.01 ( -17.66%)    40443.22 (  -0.57%)    37298.58 (  -8.30%)
Hmean     16384    45492.12 (   0.00%)    37177.64 ( -18.28%)    44308.62 (  -2.60%)    40728.33 ( -10.47%)
Max       64        1745.95 (   0.00%)     1649.03 (  -5.55%)     1710.52 (  -2.03%)     1702.65 (  -2.48%)
Max       128       3509.96 (   0.00%)     3082.35 ( -12.18%)     3204.19 (  -8.71%)     3174.41 (  -9.56%)
Max       256       6138.35 (   0.00%)     4687.62 ( -23.63%)     5694.52 (  -7.23%)     5722.08 (  -6.78%)
Max       1024     17732.13 (   0.00%)    13270.42 ( -25.16%)    16838.69 (  -5.04%)    16580.18 (  -6.50%)
Max       2048     28907.99 (   0.00%)    24816.39 ( -14.15%)    26792.86 (  -7.32%)    25003.60 ( -13.51%)
Max       3312     34512.60 (   0.00%)    23510.32 ( -31.88%)    33762.47 (  -2.17%)    31676.54 (  -8.22%)
Max       4096     35918.95 (   0.00%)    35245.77 (  -1.87%)    34866.23 (  -2.93%)    32537.07 (  -9.42%)
Max       8192     41749.55 (   0.00%)    41068.52 (  -1.63%)    42164.53 (   0.99%)    40105.50 (  -3.94%)
Max       16384    47234.74 (   0.00%)    41728.66 ( -11.66%)    45387.40 (  -3.91%)    44107.25 (  -6.62%)
Peter Zijlstra Oct. 3, 2017, 8:39 a.m. UTC | #5
On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > 
> > I like the simplicity of your approach!  I hope it does not break
> > stuff like netperf...
> > 
> > I have been working on the patch below, which is much less optimistic
> > about when to do an affine wakeup than before.
> 
> Running netperf for this patch and Peter's patch shows that Peter's
> comes out on top, with scores pretty close to v4.12 in most places on
> my 2-NUMA node 48-CPU Xeon box.
> 
> I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> but I will next week.

So I was waiting for Rik, who promised to run a bunch of NUMA workloads
over the weekend.

The trivial thing regresses a wee bit in the overloaded case; I've not
yet tried to fix it.
Rik van Riel Oct. 3, 2017, 4:02 p.m. UTC | #6
On Tue, 2017-10-03 at 10:39 +0200, Peter Zijlstra wrote:
> On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> > On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > > 
> > > I like the simplicity of your approach!  I hope it does not break
> > > stuff like netperf...
> > > 
> > > I have been working on the patch below, which is much less optimistic
> > > about when to do an affine wakeup than before.
> > 
> > Running netperf for this patch and Peter's patch shows that Peter's
> > comes out on top, with scores pretty close to v4.12 in most places on
> > my 2-NUMA node 48-CPU Xeon box.
> > 
> > I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> > but I will next week.
> 
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit in the overloaded case; I've not
> yet tried to fix it.

In Jirka's tests, your simple patch also came out
on top.

Patch

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..0c295ff5049b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,7 +77,8 @@  struct sched_domain_shared {
 	 * used by wake_affine().
 	 */
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86195add977f..7740c6776e08 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,23 @@  static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
+static void min_max_load(int cpu, unsigned long *min_load,
+		 	 unsigned long *max_load)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long minl = ULONG_MAX;
+	unsigned long maxl = 0;
+	int i;
+
+	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
+		minl = min(minl, rq->cpu_load[i]);
+		maxl = max(maxl, rq->cpu_load[i]);
+	}
+
+	*min_load = minl;
+	*max_load = maxl;
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -5310,7 +5327,8 @@  static int wake_wide(struct task_struct *p)
 
 struct llc_stats {
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 	int		has_capacity;
 };
@@ -5323,7 +5341,8 @@  static bool get_llc_stats(struct llc_stats *stats, int cpu)
 		return false;
 
 	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
+	stats->min_load		= READ_ONCE(sds->min_load);
+	stats->max_load		= READ_ONCE(sds->max_load);
 	stats->capacity		= READ_ONCE(sds->capacity);
 	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
 
@@ -5359,10 +5378,14 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 		unsigned long current_load = task_h_load(current);
 
 		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_stats.max_load)
+			return true;
+
+		/* allow if the CPU would go idle, regardless of LLC load */
+		if (current_load >= target_load(this_cpu, sd->wake_idx))
 			return true;
 
-		this_stats.load -= current_load;
+		this_stats.max_load -= current_load;
 	}
 
 	/*
@@ -5375,10 +5398,6 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
 		return false;
 
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
 	/*
 	 * Check to see if we can move the load without causing too much
 	 * imbalance.
@@ -5391,8 +5410,8 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= this_stats.capacity;
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	this_eff_load *= this_stats.max_load + task_load;
+	prev_eff_load *= prev_stats.min_load - task_load;
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -7033,6 +7052,8 @@  enum group_type {
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long min_load;
+	unsigned long max_load;
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
@@ -7059,6 +7080,8 @@  struct sd_lb_stats {
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned long min_load;		/* Sum of lowest loadavg on CPUs */
+	unsigned long max_load;		/* Sum of highest loadavg on CPUs */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7077,6 +7100,8 @@  static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_running = 0UL,
 		.total_load = 0UL,
+		.min_load = 0UL,
+		.max_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -7358,7 +7383,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 			int local_group, struct sg_lb_stats *sgs,
 			bool *overload)
 {
-	unsigned long load;
+	unsigned long load, min_load, max_load;
 	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
@@ -7372,7 +7397,11 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		else
 			load = source_load(i, load_idx);
 
+		min_max_load(i, &min_load, &max_load);
+
 		sgs->group_load += load;
+		sgs->min_load += min_load;
+		sgs->max_load += max_load;
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
@@ -7569,6 +7598,8 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_running += sgs->sum_nr_running;
 		sds->total_load += sgs->group_load;
+		sds->min_load += sgs->min_load;
+		sds->max_load += sgs->max_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
@@ -7596,7 +7627,8 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	 * XXX fix that.
 	 */
 	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
+	WRITE_ONCE(shared->min_load,	sds->min_load);
+	WRITE_ONCE(shared->max_load,	sds->max_load);
 	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
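
To make the final comparison concrete, here is a toy recomputation of the patched check with invented numbers. The imbalance_pct value, capacities and loads are illustrative only, and the initialisation of this_eff_load sits above the quoted hunk, so the 100 * prev_capacity factor is an assumption by analogy with the visible prev_eff_load lines:

#include <stdio.h>

/*
 * Toy recomputation of the comparison at the end of the patched
 * wake_affine_llc().  Every input below is invented for illustration;
 * none of these numbers come from the benchmarks in this thread.
 */
int main(void)
{
	unsigned long imbalance_pct = 117;	/* assumed domain setting       */
	unsigned long this_capacity = 1024;	/* destination LLC capacity     */
	unsigned long prev_capacity = 1024;	/* source LLC capacity          */
	unsigned long task_load     = 100;	/* task_h_load(p)               */
	unsigned long this_max_load = 800;	/* pessimistic destination view */
	unsigned long prev_min_load = 1100;	/* optimistic source view       */

	/* this_eff_load = 100 * prev_capacity is assumed from context above the hunk */
	unsigned long this_eff_load = 100 * prev_capacity *
				      (this_max_load + task_load);
	unsigned long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
				      this_capacity *
				      (prev_min_load - task_load);

	/* 100*1024*900 = 92,160,000 vs 108*1024*1000 = 110,592,000: allowed */
	printf("affine wakeup %s\n",
	       this_eff_load <= prev_eff_load ? "allowed" : "rejected");
	return 0;
}

With the destination viewed through its summed maxima and the source through its summed minima, the wakeup is only pulled across when the gap is large enough to survive both the pessimistic view and the imbalance_pct margin.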