diff mbox series

[v6,4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

Message ID 20230118080011.2258375-5-npiggin@gmail.com (mailing list archive)
State New
Headers show
Series shoot lazy tlbs | expand

Commit Message

Nicholas Piggin Jan. 18, 2023, 8 a.m. UTC
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second, due to contention on the mm refcount.

64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.

This generates 314 additional IPI interrupts on a 144 CPU system doing
a kernel compile, which is in the noise in terms of kernel cycles.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

Comments

Linus Torvalds Jan. 18, 2023, 5:30 p.m. UTC | #1
[ Adding a few more x86 and arm64 maintainers - while linux-arch is
the right mailing list, I'm not convinced people actually follow it
all that closely ]

On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second, due to contention on the mm refcount.
>
> 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> the option. This increases the above benchmark to 118 million context
> switches per second.

Well, the 1M -> 118M change does seem like a good reason for this series.

The patches certainly don't look offensive to me, so Ack as far as I'm
concerned, but honestly, it's been some time since I've personally
been active on the idle and lazy TLB code, so that ack is probably
largely worthless.

If anything, my main reaction to this all is to wonder whether the
config option is a good idea - maybe we could do this unconditionally,
and make the source code (and logic) simpler to follow when you don't
have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.

I wouldn't be surprised to hear that x86 can have the same issue where
the mm_struct refcount is a bigger issue than the possibility of an
extra TLB shootdown at the final exit time.

But having the config options as a way to switch people over gradually
(and perhaps then removing it later) doesn't sound wrong to me either.

And I personally find the argument in patch 3/5 fairly convincing:

  Shootdown IPIs cost could be an issue, but they have not been observed
  to be a serious problem with this scheme, because short-lived processes
  tend not to migrate CPUs much, therefore they don't get much chance to
  leave lazy tlb mm references on remote CPUs.

Andy? PeterZ? Catalin?

Nick - it might be good to link to the actual benchmark, and let
people who have access to big machines perhaps just try it out on
non-powerpc platforms...

                   Linus
Nicholas Piggin Jan. 19, 2023, 3:04 a.m. UTC | #2
On Thu Jan 19, 2023 at 3:30 AM AEST, Linus Torvalds wrote:
> [ Adding a few more x86 and arm64 maintainers - while linux-arch is
> the right mailing list, I'm not convinced people actually follow it
> all that closely ]
>
> On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On a 16-socket 192-core POWER8 system, a context switching benchmark
> > with as many software threads as CPUs (so each switch will go in and
> > out of idle), upstream can achieve a rate of about 1 million context
> > switches per second, due to contention on the mm refcount.
> >
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
>
> Well, the 1M -> 118M change does seem like a good reason for this series.

It was an artificial corner case, mind you. I don't think it's a reason
to panic and likely smaller systems with faster atomics will care far
less than our big 2-hop systems.

Benchmark is will-it-scale:

  ./context_switch1_threads -t 768
  min:2174 max:2690 total:1827952

    33.52%  [k] finish_task_switch
    27.26%  [k] interrupt_return
    22.66%  [k] __schedule
     2.30%  [k] _raw_spin_trylock

  ./context_switch1_threads -t 1536
  min:103000 max:120100 total:177201906

The top case has 1/2 the switching pairs to available CPU, which makes
them all switch the same mm between real and lazy. Bottom case is
just switching between user threads so that doesn't hit the lazy
refcount.

> The patches certainly don't look offensive to me, so Ack as far as I'm
> concerned, but honestly, it's been some time since I've personally
> been active on the idle and lazy TLB code, so that ack is probably
> largely worthless.
>
> If anything, my main reaction to this all is to wonder whether the
> config option is a good idea - maybe we could do this unconditionally,
> and make the source code (and logic) simpler to follow when you don't
> have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.
>
> I wouldn't be surprised to hear that x86 can have the same issue where
> the mm_struct refcount is a bigger issue than the possibility of an
> extra TLB shootdown at the final exit time.
>
> But having the config options as a way to switch people over gradually
> (and perhaps then removing it later) doesn't sound wrong to me either.

IMO it's trivial enough that we could carry both, but everything's a
straw on the camel's back so if we can consolidate it would always be
preferebale. Let's see how it plays out for a few releases.

> And I personally find the argument in patch 3/5 fairly convincing:
>
>   Shootdown IPIs cost could be an issue, but they have not been observed
>   to be a serious problem with this scheme, because short-lived processes
>   tend not to migrate CPUs much, therefore they don't get much chance to
>   leave lazy tlb mm references on remote CPUs.
>
> Andy? PeterZ? Catalin?
>
> Nick - it might be good to link to the actual benchmark, and let
> people who have access to big machines perhaps just try it out on
> non-powerpc platforms...

Yep good point, I'll put it in the changelog. I might submit another
round to Andrew in a bit with acks and any minor tweaks and minus the
last patch, assuming no major changes or objections.

Thanks,
Nick
diff mbox series

Patch

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..600ace5a7f1a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -265,6 +265,7 @@  config PPC
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64