
[v6,3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

Message ID 20230118080011.2258375-4-npiggin@gmail.com (mailing list archive)
State New
Series shoot lazy tlbs

Commit Message

Nicholas Piggin Jan. 18, 2023, 8 a.m. UTC
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

The cost of the shootdown IPIs could be an issue, but it has not been
observed to be a serious problem with this scheme, because short-lived
processes tend not to migrate between CPUs much, so they don't get much
chance to leave lazy tlb mm references on remote CPUs. There are a lot of
options to reduce the IPIs if necessary.
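
For reference, the refcounted lazy tlb hot path being avoided looks
roughly like this (a simplified sketch of the scheduler side, using the
lazy tlb refcount helpers introduced earlier in this series, not the code
added by this patch):

	/* context_switch(): keep the old user mm around for a kernel thread */
	if (!next->mm) {
		next->active_mm = prev->active_mm;
		mmgrab_lazy_tlb(prev->active_mm);	/* atomic inc of mm_count */
	}
	/*
	 * finish_task_switch() later pairs this with mmdrop_lazy_tlb(), an
	 * atomic dec on the same mm_count cacheline, which is what becomes
	 * contended when many CPUs keep switching via the idle thread.
	 */

With MMU_LAZY_TLB_SHOOTDOWN the lazy tlb helpers skip those mm_count
atomics, and the final __mmdrop() instead shoots down any remaining lazy
users as described above.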

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig  | 15 ++++++++++++
 kernel/fork.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

Comments

Nadav Amit Jan. 18, 2023, 10:22 p.m. UTC | #1
> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> + 	if (current->active_mm == mm) {
> + 		WARN_ON_ONCE(current->mm);
> + 		current->active_mm = &init_mm;
> + 		switch_mm(mm, &init_mm, current);
> + 	}
> +}

I might be out of touch - doesn't a flush already take place when we free
the page-tables, at least in the common cases on x86?

IIUC exit_mmap() would free page-tables, and whenever page-tables are
freed, on x86, we do a shootdown regardless of whether the target CPU's
TLB state is marked is_lazy. Then, flush_tlb_func() should call
switch_mm_irqs_off() and everything should be fine, no?
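
For reference, the lazy-CPU handling in flush_tlb_func() is roughly the
following (paraphrasing from memory; the exact code differs between
kernel versions):

	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
		/*
		 * A lazy CPU doesn't bother flushing; it just drops the
		 * stale mm and leaves lazy tlb mode entirely.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}
	/* ... non-lazy CPUs go on to perform the requested flush ... */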

[ I understand you care about powerpc, just wondering on the effect on x86 ]
Nicholas Piggin Jan. 19, 2023, 12:53 a.m. UTC | #2
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > + 	if (current->active_mm == mm) {
> > + 		WARN_ON_ONCE(current->mm);
> > + 		current->active_mm = &init_mm;
> > + 		switch_mm(mm, &init_mm, current);
> > + 	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

If you can easily piggyback on IPI work you already do in exit_mmap then
that's likely to be preferable. I don't know the details of x86 these
days, but there is some discussion about it in last year's thread; it
sounded quite feasible.

This is still required at final __mmdrop() time because it's still
possible that lazy mm refs will need to be cleaned. exit_mm() itself
explicitly creates one, so if the __mmdrop() runs on a different CPU,
then there's one. kthreads using the mm could create others. If that
part of it is unclear or under-commented, I can try to improve it.
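
To spell out the exit-side one, roughly (a sketch of the relevant bits of
exit_mm(), not the exact code):

	mmgrab_lazy_tlb(mm);	/* exiting task keeps mm as its lazy active_mm */
	current->mm = NULL;
	mmput(mm);		/* may run exit_mmap() right here */
	/*
	 * This CPU stays a lazy user of mm until it schedules away, so if
	 * the final __mmdrop() ends up running on another CPU, there is
	 * still a lazy reference to shoot down.
	 */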

Thanks,
Nick
Nicholas Piggin Jan. 19, 2023, 4:22 a.m. UTC | #3
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > + 	if (current->active_mm == mm) {
> > + 		WARN_ON_ONCE(current->mm);
> > + 		current->active_mm = &init_mm;
> > + 		switch_mm(mm, &init_mm, current);
> > + 	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

Now I come to think of it, Rik had done this for x86 a while back.

https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/

I didn't know about it when I wrote this, so I never dug into why it
didn't get merged. It might have missed the final __mmdrop races but
I'm not sure; x86 lazy tlb mode is too complicated to know at a
glance. I would check with him though.

Thanks,
Nick
Nadav Amit Jan. 23, 2023, 8:16 a.m. UTC | #4
On 1/19/23 6:22 AM, Nicholas Piggin wrote:
> On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>>
>>
>>> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>
>>> +static void do_shoot_lazy_tlb(void *arg)
>>> +{
>>> +	struct mm_struct *mm = arg;
>>> +
>>> + 	if (current->active_mm == mm) {
>>> + 		WARN_ON_ONCE(current->mm);
>>> + 		current->active_mm = &init_mm;
>>> + 		switch_mm(mm, &init_mm, current);
>>> + 	}
>>> +}
>>
>> I might be out of touch - doesn’t a flush already take place when we free
>> the page-tables, at least on common cases on x86?
>>
>> IIUC exit_mmap() would free page-tables, and whenever page-tables are
>> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
>> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
>> everything should be fine, no?
>>
>> [ I understand you care about powerpc, just wondering on the effect on x86 ]
> 
> Now I come to think of it, Rik had done this for x86 a while back.
> 
> https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/
> 
> I didn't know about it when I wrote this, so I never dug into why it
> didn't get merged. It might have missed the final __mmdrop races but
> I'm not not sure, x86 lazy tlb mode is too complicated to know at a
> glance. I would check with him though.

My point was that naturally (i.e., as done today), when exit_mmap() is 
done, you release the page tables (not just the pages). On x86 it means 
that you also send a shootdown IPI to all the *lazy* CPUs to perform a 
flush, so they would exit lazy mode.

[ this should be true for 99% of the cases, excluding cases where there
   were no page-tables, for instance ]
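
Concretely, the native flush path only skips the lazy CPUs when no
page-tables were freed - roughly (paraphrased from arch/x86/mm/tlb.c;
details vary by kernel version):

	if (info->freed_tables)
		/* page-tables were freed: IPI every CPU in the mask, lazy or not */
		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
	else
		/* plain flush: lazy CPUs are skipped and catch up when they unlazy */
		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
				      (void *)info, true, cpumask);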

So Rik's patch, I think, does not help in the common cases, although it 
may perhaps make implicit actions more explicit in the code.
Nicholas Piggin Jan. 24, 2023, 3:16 a.m. UTC | #5
On Mon Jan 23, 2023 at 6:16 PM AEST, Nadav Amit wrote:
>
>
> On 1/19/23 6:22 AM, Nicholas Piggin wrote:
> > On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
> >>
> >>
> >>> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >>>
> >>> +static void do_shoot_lazy_tlb(void *arg)
> >>> +{
> >>> +	struct mm_struct *mm = arg;
> >>> +
> >>> + 	if (current->active_mm == mm) {
> >>> + 		WARN_ON_ONCE(current->mm);
> >>> + 		current->active_mm = &init_mm;
> >>> + 		switch_mm(mm, &init_mm, current);
> >>> + 	}
> >>> +}
> >>
> >> I might be out of touch - doesn’t a flush already take place when we free
> >> the page-tables, at least on common cases on x86?
> >>
> >> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> >> freed, on x86, we do shootdown regardless to whether the target CPU TLB state
> >> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> >> everything should be fine, no?
> >>
> >> [ I understand you care about powerpc, just wondering on the effect on x86 ]
> > 
> > Now I come to think of it, Rik had done this for x86 a while back.
> > 
> > https://lore.kernel.org/all/20180728215357.3249-10-riel@surriel.com/
> > 
> > I didn't know about it when I wrote this, so I never dug into why it
> > didn't get merged. It might have missed the final __mmdrop races but
> > I'm not not sure, x86 lazy tlb mode is too complicated to know at a
> > glance. I would check with him though.
>
> My point was that naturally (i.e., as done today), when exit_mmap() is 
> done, you release the page tables (not just the pages). On x86 it means 
> that you also send shootdown IPI to all the *lazy* CPUs to perform a 
> flush, so they would exit the lazy mode.
>
> [ this should be true for 99% of the cases, excluding cases where there
>    were not page-tables, for instance ]
>
> So the patch of Rik, I think, does not help in the common cases, 
> although it may perhaps make implicit actions more explicit in the code.

If that's what it does, then sure. IIRC x86 didn't use to work that way
long ago, but you would know what it does today. You might find it
doesn't need much arch change to work. OTOH Andy has major problems with
active_mm and some other x86 use-after-free weirdness that I wasn't
able to comprehend. He'll be NAKing an x86 implementation until that's all
cleaned up, so better to try to understand what's going on with that first.

Thanks,
Nick

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index b07d36f08fea..f7da34e4bc62 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@  config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..263660e78c2a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@  static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refcounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy TLB shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not been found to be expensive, but they could be
+	 * reduced in a number of possible ways, for example (roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -791,6 +852,10 @@  void __mmdrop(struct mm_struct *mm)
 
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);