
[080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable

Message ID 20211105203851.Ise-VKP1_%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/262] scripts/spelling.txt: add more spellings to spelling.txt

Commit Message

Andrew Morton Nov. 5, 2021, 8:38 p.m. UTC
From: Nicholas Piggin <npiggin@gmail.com>
Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable

Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy tlb
mm when it is context switched.  Architectures that don't require this
refcounting can disable it, provided they ensure no CPU is still using an mm
as a lazy tlb by the time its final reference is dropped.  Currently this is
always enabled, which is what existing code does, so the patch is effectively
a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
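
For context, here is a minimal sketch of the usage pattern these helpers cover.  The function names below are illustrative only; the real call sites are context_switch() and finish_task_switch() in the patch.

#include <linux/sched.h>
#include <linux/sched/mm.h>

/*
 * Illustrative only: a kernel thread has no mm of its own, so it borrows the
 * previous task's mm for its kernel page tables ("lazy tlb").
 */
static void example_kthread_borrow_mm(struct task_struct *tsk,
                                      struct mm_struct *mm)
{
        /* Takes a real reference only when CONFIG_MMU_LAZY_TLB_REFCOUNT=y. */
        mmgrab_lazy_tlb(mm);
        tsk->active_mm = mm;
}

static void example_kthread_return_mm(struct task_struct *tsk)
{
        struct mm_struct *mm = tsk->active_mm;

        tsk->active_mm = NULL;
        /*
         * Drops the reference when refcounting is enabled; when it is
         * disabled, this provides only the full memory barrier that
         * membarrier relies on.
         */
        mmdrop_lazy_tlb(mm);
}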

[akpm@linux-foundation.org: fix comment]
[npiggin@gmail.com: update comments]
  Link: https://lkml.kernel.org/r/1623121605.j47gdpccep.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210605014216.446867-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig             |   14 ++++++++++++++
 include/linux/sched/mm.h |   14 ++++++++++++--
 kernel/sched/core.c      |   22 ++++++++++++++++++----
 kernel/sched/sched.h     |    4 +++-
 4 files changed, 47 insertions(+), 7 deletions(-)

Comments

Andy Lutomirski Nov. 6, 2021, 4:29 a.m. UTC | #1
On Fri, Nov 5, 2021, at 1:38 PM, Andrew Morton wrote:
> From: Nicholas Piggin <npiggin@gmail.com>
> Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable
>
> Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy tlb
> mm when it is context switched.  Architectures that don't require this
> refcounting can disable it, provided they ensure no CPU is still using an mm
> as a lazy tlb by the time its final reference is dropped.  Currently this is
> always enabled, which is what existing code does, so the patch is effectively
> a no-op.
>
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Still nacked by me.  Since I seem to have been doing a poor job of explaining my issues with this patch, I'll explain with code:

commit 54b675d9b28d9a56289d06a813250472bc621f40
Author: Andy Lutomirski <luto@kernel.org>
Date:   Fri Nov 5 21:20:47 2021 -0700

    [HACK] demonstrate lazy tlb issues

diff --git a/arch/Kconfig b/arch/Kconfig
index cca27f1b5d0e..19f273642d8f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
+	depends on !X86
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
 # mm as a lazy tlb beyond its last reference count, by shooting down these
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25dd795497e8..c5a0c1e92524 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4902,6 +4902,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * Sanity check: if something went wrong and the previous mm was
+	 * freed while we were still using it, KASAN might not notice
+	 * without help.
+	 */
+	kasan_check_byte(prev->active_mm);
+
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active

Build this with KASAN for x86 and try to boot it.  It splats left and right.

The issue is that the !MMU_LAZY_TLB_REFCOUNT mode, while perhaps safe under certain circumstances (maybe -- still not quite convinced), cheats and ignores the fact that the scheduler itself maintains a pointer to the old mm.  On x86, on bare metal, we *already* don't access lazy mms after the process is gone, because the pagetable freeing process shoots down the lazy mm, so we are compliant with all the supposed preconditions of this new mode.  But the scheduler itself still has this nonsense active_mm pointer, and, if anyone ever tries to do anything with it (e.g. the above hack to force KASAN to validate it), it all blows up.
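
In timeline form, the hazard this demonstrates is roughly the following (a simplified, hypothetical interleaving; the call sites are not literal):

/*
 * CPU0 (kernel thread, lazy)             CPU1 (mm's last user exits)
 * --------------------------             ---------------------------
 * prev->active_mm = mm;                  exit_mmap(mm);
 *   (no reference held when              ... arch shootdown removes mm
 *    MMU_LAZY_TLB_REFCOUNT=n)                from CPU0's hardware state ...
 *                                        __mmdrop(mm);       <- mm is freed
 * kasan_check_byte(prev->active_mm);     <- reads freed memory, KASAN splats
 */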

On top of this, the whole refcount-me-maybe mode seems incredibly fragile, and I don't think the kernel really benefits from having a set of refcount helpers that may or may not keep the supposedly refcounted object alive depending on config.  And the mere fact that my patch appears to work as long as kasan isn't in play should be a pretty good indicator that this whole thing is not terribly robust.

So I think there are a few credible choices:

1. Find an alternative solution that gets the performance we want without dangling references.

2. Make the !MMU_LAZY_TLB_REFCOUNT mode genuinely safe.  This means literally ifdeffing out active_mm so it can't dangle.  Doing that cleanly will be a lot of nasty arch work.
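
A purely illustrative fragment of what option 2 could look like (not from this patch or any existing tree; the accessor name is made up):

/* Hypothetical task_struct fragment: the field only exists when refcounted. */
#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
	struct mm_struct	*active_mm;	/* always holds a lazy reference */
#endif

/* Hypothetical accessor so generic code never touches a stale pointer. */
static inline struct mm_struct *task_lazy_mm(struct task_struct *tsk)
{
#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
	return tsk->active_mm;
#else
	return NULL;	/* nothing exists that could dangle */
#endif
}

Every architecture that dereferences ->active_mm directly would then fail to build until converted, which is the "nasty arch work" above.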

I again apologize that my series is taking so long, although I think it's finally getting into decent shape.  I still need to deal with the scs mess (that's new), finish tidying up kthread, and make sure hotplug is good.  But all this is because this is really hairy code and I'm trying to do it right.

If anyone wants to help, help is welcome.  Otherwise, I really do intend to get it all the way done soon.

--Andy
Linus Torvalds Nov. 6, 2021, 7:10 p.m. UTC | #2
Dropped once again until people can agree on this all..

               Linus

On Fri, Nov 5, 2021 at 9:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> So I think there are a few credible choices:
>
> 1. Find an alternative solution that gets the performance we want without dangling references.
>
> 2. Make the !MMU_LAZY_TLB_REFCOUNT mode genuinely safe.  This means literally ifdeffing out active_mm so it can't dangle.  Doing that cleanly will be a lot of nasty arch work.
>
> I again apologize that my series is taking so long, although I think it's finally getting into decent shape.  I still need to deal with the scs mess (that's new), finish tidying up kthread, and make sure hotplug is good.  But all this is because this is really hairy code and I'm trying to do it right.
>
> If anyone wants to help, help is welcome.  Otherwise, I really do intend to get it all the way done soon.
>
> --Andy

Patch

--- a/arch/Kconfig~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/arch/Kconfig
@@ -428,6 +428,20 @@  config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example. arch code must also ensure the
+# _lazy_tlb variants of mmgrab/mmdrop are used when dropping the lazy reference
+# to a kthread ->active_mm (non-arch code has been converted already).
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
--- a/include/linux/sched/mm.h~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/include/linux/sched/mm.h
@@ -52,12 +52,22 @@  static inline void mmdrop(struct mm_stru
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	mmgrab(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	mmdrop(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+		mmdrop(mm);
+	} else {
+		/*
+		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
+		 * membarrier comment in finish_task_switch which relies on
+		 * this.
+		 */
+		smp_mb();
+	}
 }
 
 /**
--- a/kernel/sched/core.c~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/kernel/sched/core.c
@@ -4772,7 +4772,7 @@  static struct rq *finish_task_switch(str
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
+	struct mm_struct *mm = NULL;
 	long prev_state;
 
 	/*
@@ -4791,7 +4791,10 @@  static struct rq *finish_task_switch(str
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	mm = rq->prev_lazy_mm;
+	rq->prev_lazy_mm = NULL;
+#endif
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -4927,9 +4930,20 @@  context_switch(struct rq *rq, struct tas
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+			/* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+			rq->prev_lazy_mm = prev->active_mm;
 			prev->active_mm = NULL;
+#else
+			/*
+			 * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+			 * tracking (because no rq->prev_lazy_mm) in
+			 * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+			 * memory barrier for membarrier (see the membarrier
+			 * comment in finish_task_switch()).  Do it here.
+			 */
+			smp_mb();
+#endif
 		}
 	}
 
--- a/kernel/sched/sched.h~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/kernel/sched/sched.h
@@ -977,7 +977,9 @@  struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	struct mm_struct	*prev_lazy_mm;
+#endif
 
 	unsigned int		clock_update_flags;
 	u64			clock;