[119/212] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

Message ID 20210902215620._WXglfIJy%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/212] ia64: fix typo in a comment

Commit Message

Andrew Morton Sept. 2, 2021, 9:56 p.m. UTC
From: Nicholas Piggin <npiggin@gmail.com>
Subject: lazy tlb: shoot lazies, a non-refcounting lazy tlb option

On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme that
causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
remaining lazy ones.

Shootdown IPIs are some concern, but they have not been observed to be a
big problem with this scheme (the powerpc implementation generated 314
additional interrupts on a 144 CPU system during a kernel compile).  There
are a number of strategies that could be employed to reduce IPIs if they
turn out to be a problem for some workload.
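
For context, the refcount being contended here is mm_count: today every
transition into or out of lazy tlb mode takes or drops a reference with
mmgrab()/mmdrop(), which are (roughly, as in include/linux/sched/mm.h)
just atomics on a cache line that every CPU doing such switches hits:

	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);
	}

	static inline void mmdrop(struct mm_struct *mm)
	{
		/*
		 * The implicit full barrier implied by atomic_dec_and_test()
		 * is required by the membarrier system call.
		 */
		if (unlikely(atomic_dec_and_test(&mm->mm_count)))
			__mmdrop(mm);
	}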

[npiggin@gmail.com: update comments]
  Link: https://lkml.kernel.org/r/1623121901.mszkmmum0n.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210605014216.446867-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig  |   14 +++++++++++++
 kernel/fork.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

Comments

Andy Lutomirski Sept. 2, 2021, 10:28 p.m. UTC | #1
On Thu, Sep 2, 2021, at 2:56 PM, Andrew Morton wrote:
> From: Nicholas Piggin <npiggin@gmail.com>
> Subject: lazy tlb: shoot lazies, a non-refcounting lazy tlb option
> 
> On big systems, the mm refcount can become highly contended when doing a
> lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme that
> causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
> remaining lazy ones.
> 
> Shootdown IPIs are some concern, but they have not been observed to be a
> big problem with this scheme (the powerpc implementation generated 314
> additional interrupts on a 144 CPU system during a kernel compile).  There
> are a number of strategies that could be employed to reduce IPIs if they
> turn out to be a problem for some workload.

This pile is:

Nacked-by: Andy Lutomirski <luto@kernel.org>

For reasons that have been discussed previously. My series is still in progress.  It’s moving slowly for two reasons.  First, I have limited time to work on it. Second, the existing mm refcounting is a giant pile of worms, and that needs fixing one way or another before we add yet more complexity. For example, has anyone noticed that kthread mms are refcounted using different rules than everything else?
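
(Background for anyone not steeped in this code: there are two counters
with different rules -- a rough sketch, not verbatim kernel source:)

	/*
	 * mm_users: users of the address space.  mmput() at zero tears down
	 * the mappings via exit_mmap() and then drops the mm_count reference
	 * held on behalf of all users.
	 */
	mmget(mm);		/* atomic_inc(&mm->mm_users) */
	mmput(mm);

	/*
	 * mm_count: references to the mm_struct itself -- one on behalf of
	 * all mm_users plus one per lazy tlb user.  __mmdrop() at zero frees
	 * the struct.
	 */
	mmgrab(mm);		/* atomic_inc(&mm->mm_count) */
	mmdrop(mm);

	/*
	 * kthread_use_mm() mixes these conventions when it borrows a user mm
	 * into a kernel thread, which is the sort of special rule referred to
	 * above.
	 */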

Even if my modified refcounting scheme isn’t the eventual winner, the prerequisite cleanups are still prerequisites. I absolutely nack anything that adds yet more nonsensical complexity to the existing scheme, makes it substantially more fragile, and does not fix the underlying crap that makes speeding up responsibly such a mess.

Nick or anyone else, you’re welcome to finish up my series (and I can give pointers) or you can wait.

> 
> [npiggin@gmail.com: update comments]
>   Link: https://lkml.kernel.org/r/1623121901.mszkmmum0n.astroid@bobo.none
> Link: https://lkml.kernel.org/r/20210605014216.446867-4-npiggin@gmail.com
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  arch/Kconfig  |   14 +++++++++++++
>  kernel/fork.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 65 insertions(+)
> 
> --- a/arch/Kconfig~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
> +++ a/arch/Kconfig
> @@ -438,6 +438,20 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>  # to a kthread ->active_mm (non-arch code has been converted already).
>  config MMU_LAZY_TLB_REFCOUNT
>  	def_bool y
> +	depends on !MMU_LAZY_TLB_SHOOTDOWN
> +
> +# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
> +# mm as a lazy tlb beyond its last reference count, by shooting down these
> +# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
> +# be using the mm as a lazy tlb, so that they may switch themselves to using
> +# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
> +# may be using mm as a lazy tlb mm.
> +#
> +# To implement this, an arch must ensure mm_cpumask(mm) contains at least all
> +# possible CPUs in which the mm is lazy, and it must meet the requirements for
> +# MMU_LAZY_TLB_REFCOUNT=n (see above).
> +config MMU_LAZY_TLB_SHOOTDOWN
> +	bool
>  
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
> --- a/kernel/fork.c~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
> +++ a/kernel/fork.c
> @@ -674,6 +674,53 @@ static void check_mm(struct mm_struct *m
>  #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
>  #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
>  
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> +	if (current->active_mm == mm) {
> +		WARN_ON_ONCE(current->mm);
> +		current->active_mm = &init_mm;
> +		switch_mm(mm, &init_mm, current);
> +	}
> +}
> +
> +static void do_check_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> +	WARN_ON_ONCE(current->active_mm == mm);
> +}
> +
> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> +		/*
> +		 * IPI overheads have not been found to be expensive, but they could
> +		 * be reduced in a number of possible ways, for example (in
> +		 * roughly increasing order of complexity):
> +		 * - A batch of mms requiring IPIs could be gathered and freed
> +		 *   at once.
> +		 * - CPUs could store their active mm somewhere that can be
> +		 *   remotely checked without a lock, to filter out
> +		 *   false-positives in the cpumask.
> +		 * - After mm_users or mm_count reaches zero, switching away
> +		 *   from the mm could clear mm_cpumask to reduce some IPIs
> +		 *   (some batching or delaying would help).
> +		 * - A delayed freeing and RCU-like quiescing sequence based on
> +		 *   mm switching to avoid IPIs completely.
> +		 */
> +		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> +		if (IS_ENABLED(CONFIG_DEBUG_VM))
> +			on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
> +	} else {
> +		/*
> +		 * In this case, lazy tlb mms are refcounted and would not reach
> +		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
> +		 */
> +	}
> +}
> +
>  /*
>   * Called when the last reference to the mm
>   * is dropped: either by a lazy thread or by
> @@ -683,6 +730,10 @@ void __mmdrop(struct mm_struct *mm)
>  {
>  	BUG_ON(mm == &init_mm);
>  	WARN_ON_ONCE(mm == current->mm);
> +
> +	/* Ensure no CPUs are using this as their lazy tlb mm */
> +	shoot_lazy_tlbs(mm);
> +
>  	WARN_ON_ONCE(mm == current->active_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> _
>
Linus Torvalds Sept. 2, 2021, 10:50 p.m. UTC | #2
On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> This pile is:
>
> Nacked-by: Andy Lutomirski <luto@kernel.org>

Can you specify exactly the range you want me to drop?

I assume it's the four patches 117-120, ie

  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

but I just want to double-check before I do surgery on that series.

           Linus
Andrew Morton Sept. 2, 2021, 10:53 p.m. UTC | #3
On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > This pile is:
> >
> > Nacked-by: Andy Lutomirski <luto@kernel.org>
> 
> Can you specify exactly the range you want me to drop?
> 
> I assume it's the four patches 117-120, ie
> 
>   lazy tlb: introduce lazy mm refcount helper functions
>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
> 
> but I just want to double-check before I do surgery on that series.

Yes, those 4.

Sorry, I missed that email thread...
Andy Lutomirski Sept. 3, 2021, 12:35 a.m. UTC | #4
On Thu, Sep 2, 2021, at 3:53 PM, Andrew Morton wrote:
> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds 
> <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > This pile is:
> > >
> > > Nacked-by: Andy Lutomirski <luto@kernel.org>
> > 
> > Can you specify exactly the range you want me to drop?
> > 
> > I assume it's the four patches 117-120, ie
> > 
> >   lazy tlb: introduce lazy mm refcount helper functions
> >   lazy tlb: allow lazy tlb mm refcounting to be configurable
> >   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
> >   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
> > 
> > but I just want to double-check before I do surgery on that series.
> 
> Yes, those 4.
> 
> Sorry, I missed that email thread...
> 

Indeed.

If anyone cares, my WIP series is here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=sched/lazymm

It has known bugs and is definitely not ready.  The major known problem is that the kthread and execve paths are still not cleaned up.  (We have several different code paths that change current->mm, and they're not all quite consistent with each other.)  kthread_use_mm() follows its own little set of refcounting rules that is not consistent with the scheduler's expectations, but I think it's just close enough that current kernels will not erroneously free an in-use mm or permanently leak a reference.  I have half-written code to consolidate all the ->mm assignments into a single function, but it's not done.

The CPU offline issue fixed in that series seems to me like it should also affect Nick's series, but I haven't dug in.  I don't immediately see why Nick's series would be able to get away without the same rearrangement I needed.  (You can't *shoot* a lazy TLB entry out from under an offlined CPU -- you need to actually get rid of the reference and account for it correctly.  Perhaps it's all okay in Nick's series because mmdrop() becomes a no-op and the stale logical reference doesn't actually exist.)
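
(For reference, the reason mmdrop() of a lazy reference can become a no-op
is the helper pair introduced earlier in this series; roughly, paraphrasing
rather than quoting the actual patches:)

	static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
	{
		if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
			mmgrab(mm);
	}

	static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
	{
		if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
			mmdrop(mm);
	}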

--Andy
Nicholas Piggin Sept. 3, 2021, 12:46 a.m. UTC | #5
Excerpts from Andrew Morton's message of September 3, 2021 8:53 am:
> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > This pile is:
>> >
>> > Nacked-by: Andy Lutomirski <luto@kernel.org>
>> 
>> Can you specify exactly the range you want me to drop?
>> 
>> I assume it's the four patches 117-120, ie
>> 
>>   lazy tlb: introduce lazy mm refcount helper functions
>>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
>> 
>> but I just want to double-check before I do surgery on that series.
> 
> Yes, those 4.
> 
> Sorry, I missed that email thread...
> 

That's not reasonable. Andy has had complete misunderstandings about the
series, which seem to stem from the horrible x86 hacks that have gone in
and confused him.

My series doesn't affect x86 at all and there's no reason why Andy's series
to improve x86 can't be merged later. But that half-finished series he 
keeps threatening with has been sitting there for almost a year now and 
it's gone nowhere, while there have been no unresolved technical 
objections to mine; it works, and it's simple and small.

I've kept trying to offer to help Andy with reviewing his stuff or fix 
the horrible x86 hacks, but nothing.

So

Un-Nacked-by: Nicholas Piggin <npiggin@gmail.com>


Thanks,
Nick
Nicholas Piggin Sept. 3, 2021, 12:48 a.m. UTC | #6
Excerpts from Andy Lutomirski's message of September 3, 2021 8:28 am:
> 
> 
> On Thu, Sep 2, 2021, at 2:56 PM, Andrew Morton wrote:
>> From: Nicholas Piggin <npiggin@gmail.com>
>> Subject: lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>> 
>> On big systems, the mm refcount can become highly contended when doing a
>> lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so instead implement a non-refcounted scheme that
>> causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
>> remaining lazy ones.
>> 
>> Shootdown IPIs are some concern, but they have not been observed to be a
>> big problem with this scheme (the powerpc implementation generated 314
>> additional interrupts on a 144 CPU system during a kernel compile).  There
>> are a number of strategies that could be employed to reduce IPIs if they
>> turn out to be a problem for some workload.
> 
> This pile is:
> 
> Nacked-by: Andy Lutomirski <luto@kernel.org>
> 
> For reasons that have been discussed previously. My series is still in progress.  It’s moving slowly for two reasons.  First, I have limited time to work on it. Second, the existing mm refcounting is a giant pile of worms, and that needs fixing one way or another before we add yet more complexity. For example, has anyone noticed that kthread mms are refcounted using different rules than everything else?

It's been like a year with ~no progress. mm refcounting is not a pile of
worms; as you can see in my series, it's pretty simple. The _x86_ mm 
refcounting is a huge pile of crap, but that doesn't give any reason to nack 
this series.

> 
> Even if my modified refcounting scheme isn’t the eventual winner, the prerequisite cleanups are still prerequisites. I absolutely nack anything that adds yet more nonsensical complexity to the existing scheme, makes it substantially more fragile, and does not fix the underlying crap that makes speeding up responsibly such a mess.
> 
> Nick or anyone else, you’re welcome to finish up my series (and I can give pointers) or you can wait.

You or anyone else is welcome to rebase your series on top of mine.

Thanks,
Nick
Andy Lutomirski Sept. 3, 2021, 5:11 a.m. UTC | #7
On 9/2/21 5:46 PM, Nicholas Piggin wrote:
> Excerpts from Andrew Morton's message of September 3, 2021 8:53 am:
>> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> This pile is:
>>>>
>>>> Nacked-by: Andy Lutomirski <luto@kernel.org>
>>>
>>> Can you specify exactly the range you want me to drop?
>>>
>>> I assume it's the four patches 117-120, ie
>>>
>>>   lazy tlb: introduce lazy mm refcount helper functions
>>>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>>>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>>>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
>>>
>>> but I just want to double-check before I do surgery on that series.
>>
>> Yes, those 4.
>>
>> Sorry, I missed that email thread...
>>
> 
> That's not reasonable. Andy has had complete misunderstandings about the
> series which seems to stem from x86's horrible hacks that have gone in
> has confused him.

The horrible hacks in question are almost exclusively in core code.
Here's a brief summary of the situation.

There's a messy interaction between mmget()/mmdrop() and membarrier.
membarrier currently depends on some mmget() and mmdrop() calls to be
full barriers.  You make membarrier keep working by putting an ifdef'd
smp_mb() in the core scheduler.  I clean up the code to make it work
independently of smp_mb() and therefore save the cost of the
unconditional barrier for non-membarrier-using programs.
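
The dependency looks roughly like this (a sketch of the idea, not the
actual patch; prev_lazy_mm stands in for whatever variable holds the lazy
mm on that path):

	/* scheduler path that drops the previous lazy mm reference */
	#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
		mmdrop(prev_lazy_mm);	/* atomic_dec_and_test() implies a full barrier */
	#else
		smp_mb();		/* explicit barrier so membarrier keeps working */
	#endif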

Your series adds an option MMU_LAZY_TLB_REFCOUNT=n for architectures to
opt out of lazy TLB refcounting.  This is simply wrong.  Right now, the
core scheduler provides current->active_mm and guarantees that
current->active_mm always points to a live (possibly mm_users == 0 but
definitely not freed) mm_struct.  With MMU_LAZY_TLB_REFCOUNT=n,
current->active_mm still exists, is still updated, but may point to
freed memory.  I consider this unacceptable.  A comment says "This can
be disabled if the architecture ensures no CPUs are using an mm as a
"lazy tlb" beyond its final refcount" -- that's nice, but saying "well,
if you do this, you have to make sure you don't accidentally dereference
that juicy dangling pointer we give you" is, in my book, a poor
justification.

I have no particular objection to the actual shoot lazies part, except
insofar as I think we can do even better (e.g. my patch).  But 90% of
the complexity of my WIP series is cleaning up the mess so that we can
have a maintainable lazy mm mechanism instead of expanding the current
hard-to-maintain part into three separate possible modes.

Maybe I'm holding my own patches to an excessively high standard.



> 
> My series doesn't affect x86 at all and it's no reason why Andy's series
> to improve x86 can't be merged later. But that half finished series he 
> keeps threatening with has been sitting there for almost a year now and 
> it's gone nowhere, while there have been no unresolved technical 
> objections to mine, it works, it's simple and small.

My series barely touches x86.  The only "hack" is that x86 may have a
CPU that has ->mm == NULL, ->active_mm != NULL, CR3 pointing to the init
pgd, and mm_cpumask clear.  I don't see why this is a problem other than
being somewhat unusual.  But x86 bare metal, like every architecture
that can only flush the TLB using an IPI, can very efficiently shoot
lazies, since it shoots the lazies anyway when tearing down pagetables,
but actually enabling the config option with this series applied will
result in ->active_mm pointing to freed memory.  Ick.
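
Concretely, the failure mode is something like this (a sketch, assuming an
arch enables the option while its mm_cpumask does not track lazy CPUs,
which is the x86 situation described above):

	CPU0 (last user exiting)            CPU1 (idle, lazy on mm)
	mmput(mm) ... mmdrop(mm)            current->active_mm == mm
	  __mmdrop(mm)
	    shoot_lazy_tlbs(mm)             /* CPU1 not in mm_cpumask(mm),
	                                       so it never gets the IPI */
	    free_mm(mm)
	                                    current->active_mm now points at
	                                    freed memory; any later use of it
	                                    is a use-after-free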

> 
> I've kept trying to offer to help Andy with reviewing his stuff or fix 
> the horrible x86 hacks, but nothing.

I haven't finished it yet.  Sorry.
Nicholas Piggin Sept. 3, 2021, 5:44 a.m. UTC | #8
Excerpts from Andy Lutomirski's message of September 3, 2021 3:11 pm:
> On 9/2/21 5:46 PM, Nicholas Piggin wrote:
>> Excerpts from Andrew Morton's message of September 3, 2021 8:53 am:
>>> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>
>>>>> This pile is:
>>>>>
>>>>> Nacked-by: Andy Lutomirski <luto@kernel.org>
>>>>
>>>> Can you specify exactly the range you want me to drop?
>>>>
>>>> I assume it's the four patches 117-120, ie
>>>>
>>>>   lazy tlb: introduce lazy mm refcount helper functions
>>>>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>>>>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>>>>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
>>>>
>>>> but I just want to double-check before I do surgery on that series.
>>>
>>> Yes, those 4.
>>>
>>> Sorry, I missed that email thread...
>>>
>> 
>> That's not reasonable. Andy has had complete misunderstandings about the
>> series which seems to stem from x86's horrible hacks that have gone in
>> has confused him.
> 
> The horrible hacks in question are almost exclusively in core code.

No, they're in x86.

> Here's a brief summary of the situation.
> 
> There's a messy interaction between mmget()/mmdrop() and membarrier.
> membarrier currently depends on some mmget() and mmdrop() calls to be
> full barriers.

Membarrier has had (and is improving but still has) some complexity, 
which is caused by interaction with _existing_ lazy-mm code in the tree. 
The complexity is not with lazy-mm itself, and my series does not add
more to the membarrier interaction. So I don't accept the criticism
that it has to do with membarrier complexity.

> You make membarrier keep working by putting an ifdef'd
> smp_mb() in the core scheduler.

Sure, it's well commented and replaces the smp_mb provided by the atomic 
operation that membarrier relied on with an explicit one. That's not a
horrible hack.

> I clean up the code to make it work
> independently of smp_mb() and therefore save the cost of the
> unconditional barrier for non-membarrier-using programs.

Great. Nothing to do with this series though which is not changing 
membarrier ordering.

I can certainly help you rebase it on top of these patches if you need.

> 
> Your series adds an option MMU_LAZY_TLB_REFCOUNT=n for architectures to
> opt out of lazy TLB refcounting.  This is simply wrong.  Right now, the
> core scheduler provides current->active_mm and guarantees that
> current->active_mm always points to a live (possibly mm_users == 0 but
> definitely not freed) mm_struct.  With MMU_LAZY_TLB_REFCOUNT=n,
> current->active_mm still exists, is still updated, but may point to
> freed memory.

Wrong. It does nothing of the sort. I told you this in the previous 
discussion, you obviously ignored me. You are just wrong, and you can't
actually point to where this happens.

This criticism is invalid too.

> I consider this unacceptable.  A comment says "This can
> be disabled if the architecture ensures no CPUs are using an mm as a
> "lazy tlb" beyond its final refcount" -- that's nice, but saying "well,
> if you this, you have to make sure you don't accidentally dereference
> that juicy dangling pointer we give you" is, in my book, a poor
> justification.

It's not a justification, it's a recipe for other archs which might want 
to implement it!

> 
> I have no particular objection to the actual shoot lazies part, except
> insofar as I think we can do even better (e.g. my patch).  But 90% of
> the complexity of my WIP series is cleaning up the mess so that we can
> have a maintainable lazy mm mechanism instead of expanding the current
> hard-to-maintain part into three separate possible modes.

What actually happened is that when you ran out of (incorrect) technical 
disputes like this confusion about the active_mm thing, you then started 
to demand that I massively rework core code that you don't understand so 
that it matches the horrible mess that x86 has got itself into. I can
bring up quotes from previous threads.

> 
> Maybe I'm holding my own patches to an excessively high standard.
> 
> 
> 
>> 
>> My series doesn't affect x86 at all and it's no reason why Andy's series
>> to improve x86 can't be merged later. But that half finished series he 
>> keeps threatening with has been sitting there for almost a year now and 
>> it's gone nowhere, while there have been no unresolved technical 
>> objections to mine, it works, it's simple and small.
> 
> My series barely touches x86.

I'm talking about your previous insistence that my patch series removed 
"active_mm" from core code, because it doesn't match x86 internals, and 
similar such stupidity.

> The only "hack" is that x86 may have a
> CPU that has ->mm == NULL, ->active_mm != NULL, CR3 pointing to the init
> pgd, and mm_cpumask clear.  I don't see why this is a problem other than
> being somewhat unusual.  But x86 bare metal, like every architecture
> that can only flush the TLB using an IPI, can very efficiently shoot
> lazies, since it shoots the lazies anyway when tearing down pagetables,
> but actually enabling the config option with this series applied will
> result in ->active_mm pointing to freed memory.  Ick.
> 
>> 
>> I've kept trying to offer to help Andy with reviewing his stuff or fix 
>> the horrible x86 hacks, but nothing.
> 
> I haven't finished it yet.  Sorry.
> 

No need to be sorry about that, it will be trivial to rebase on top of 
my series, I've even done a quick attempt. No problem at all.

Thanks,
Nick
Nicholas Piggin Sept. 3, 2021, 11:48 p.m. UTC | #9
Excerpts from Nicholas Piggin's message of September 3, 2021 3:44 pm:
> Excerpts from Andy Lutomirski's message of September 3, 2021 3:11 pm:
>> On 9/2/21 5:46 PM, Nicholas Piggin wrote:
>>> Excerpts from Andrew Morton's message of September 3, 2021 8:53 am:
>>>> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>>
>>>>> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>
>>>>>> This pile is:
>>>>>>
>>>>>> Nacked-by: Andy Lutomirski <luto@kernel.org>
>>>>>
>>>>> Can you specify exactly the range you want me to drop?
>>>>>
>>>>> I assume it's the four patches 117-120, ie
>>>>>
>>>>>   lazy tlb: introduce lazy mm refcount helper functions
>>>>>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>>>>>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>>>>>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
>>>>>
>>>>> but I just want to double-check before I do surgery on that series.
>>>>
>>>> Yes, those 4.
>>>>
>>>> Sorry, I missed that email thread...
>>>>
>>> 
>>> That's not reasonable. Andy has had complete misunderstandings about the
>>> series which seems to stem from x86's horrible hacks that have gone in
>>> has confused him.
>> 
>> The horrible hacks in question are almost exclusively in core code.
> 
> No, they're in x86.
> 
>> Here's a brief summary of the situation.
>> 
>> There's a messy interaction between mmget()/mmdrop() and membarrier.
>> membarrier currently depends on some mmget() and mmdrop() calls to be
>> full barriers.
> 
> Membarrier has had (and is improving but still has) some complexity, 
> which is caused by interaction with _existing_ lazy-mm code in the tree. 
> The complexity is not with lazy-mm itself, and my series does not add
> more to the membarrier interaction. So I don't accept the criticism
> that it has to do with membarrier complexity.
> 
>> You make membarrier keep working by putting an ifdef'd
>> smp_mb() in the core scheduler.
> 
> Sure, it's well commented and replaces the smp_mb provided by atomic 
> operation that membarrier relied on to an explicit one. That's not a
> horrible hack.
> 
>> I clean up the code to make it work
>> independently of smp_mb() and therefore save the cost of the
>> unconditional barrier for non-membarrier-using programs.
> 
> Great. Nothing to do with this series though which is not changing 
> membarrier ordering.
> 
> I can certainly help you rebase it on top of these patches if you need.
> 
>> 
>> Your series adds an option MMU_LAZY_TLB_REFCOUNT=n for architectures to
>> opt out of lazy TLB refcounting.  This is simply wrong.  Right now, the
>> core scheduler provides current->active_mm and guarantees that
>> current->active_mm always points to a live (possibly mm_users == 0 but
>> definitely not freed) mm_struct.  With MMU_LAZY_TLB_REFCOUNT=n,
>> current->active_mm still exists, is still updated, but may point to
>> freed memory.
> 
> Wrong. It does nothing of the sort. I told you this in the previous 
> discussion, you obviously ignored me. You are just wrong, and you can't
> actually point to where this happens.
> 
> This criticism is invalid too.

If there are no further objections that can be substantiated, then 
can we merge this please?

By the way I should add -

>>> I've kept trying to offer to help Andy with reviewing his stuff or fix 
>>> the horrible x86 hacks, but nothing.
>> 
>> I haven't finished it yet.  Sorry.
>> 
> 
> No need to be sorry about that, it will be trivial to rebase on top of 
> my series, I've even done a quick attempt. No problem at all.

This alternative is far from a foregone conclusion even if it does ever 
get finished. It adds significant ordering complexity to the core scheduler
that my approach does not have, for no benefit for powerpc and likely 
no measurable benefit for others either. The first hurdle for it will 
obviously be showing, with actual numbers, why x86 can't simply be updated 
to use this shoot-lazies work.

Thanks,
Nick

Patch

--- a/arch/Kconfig~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
+++ a/arch/Kconfig
@@ -438,6 +438,20 @@  config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch must ensure mm_cpumask(mm) contains at least all
+# possible CPUs in which the mm is lazy, and it must meet the requirements for
+# MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
--- a/kernel/fork.c~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
+++ a/kernel/fork.c
@@ -674,6 +674,53 @@  static void check_mm(struct mm_struct *m
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * IPI overheads have not been found to be expensive, but they could
+		 * be reduced in a number of possible ways, for example (in
+		 * roughly increasing order of complexity):
+		 * - A batch of mms requiring IPIs could be gathered and freed
+		 *   at once.
+		 * - CPUs could store their active mm somewhere that can be
+		 *   remotely checked without a lock, to filter out
+		 *   false-positives in the cpumask.
+		 * - After mm_users or mm_count reaches zero, switching away
+		 *   from the mm could clear mm_cpumask to reduce some IPIs
+		 *   (some batching or delaying would help).
+		 * - A delayed freeing and RCU-like quiescing sequence based on
+		 *   mm switching to avoid IPIs completely.
+		 */
+		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+		if (IS_ENABLED(CONFIG_DEBUG_VM))
+			on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+	} else {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+	}
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -683,6 +730,10 @@  void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	shoot_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
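
For reference, an architecture opts in with a one-line Kconfig select plus
whatever is needed to keep mm_cpumask(mm) a superset of the lazy users; the
powerpc/64s patch in this group does something along these lines
(paraphrased, not the exact hunk):

	config PPC
		...
		select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
		...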