
[RFC,01/11] asm-generic/tlb: Provide a comment

Message ID: 20180913092811.894806629@infradead.org
State: New, archived
Series: my generic mmu_gather patches

Commit Message

Peter Zijlstra Sept. 13, 2018, 9:21 a.m. UTC
Write a comment explaining some of this..

Cc: Will Deacon <will.deacon@arm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/tlb.h |  120 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 117 insertions(+), 3 deletions(-)

Comments

Martin Schwidefsky Sept. 13, 2018, 10:30 a.m. UTC | #1
On Thu, 13 Sep 2018 11:21:11 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Write a comment explaining some of this..
> 
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/asm-generic/tlb.h |  120 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 117 insertions(+), 3 deletions(-)
> 
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -22,6 +22,119 @@
> 
>  #ifdef CONFIG_MMU
> 
> +/*
> + * Generic MMU-gather implementation.
> + *
> + * The mmu_gather data structure is used by the mm code to implement the
> + * correct and efficient ordering of freeing pages and TLB invalidations.
> + *
> + * This correct ordering is:
> + *
> + *  1) unhook page
> + *  2) TLB invalidate page
> + *  3) free page
> + *
> + * That is, we must never free a page before we have ensured there are no live
> + * translations left to it. Otherwise it might be possible to observe (or
> + * worse, change) the page content after it has been reused.
> + *

This first comment already includes the reason why s390 is probably better off
with its own mmu-gather implementation. It depends on the situation whether we have

1) unhook the page and do a TLB flush at the same time
2) free page

or

1) unhook page
2) free page
3) final TLB flush of the whole mm

A variant of the second order we had in the past is to do the mm TLB flush first,
then the unhooks and frees of the individual pages. There are some tricky corners
switching between the two variants, see finish_arch_post_lock_switch.

The point is: we *never* have the order 1) unhook, 2) TLB invalidate, 3) free.
If there is concurrency due to a multi-threaded application we have to do the
unhook of the page-table entry and the TLB flush with a single instruction.
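
For illustration, the two orderings sketched as code; ipte_clear_and_flush()
and flush_mm() below are placeholder names standing in for the real s390
primitives, not actual arch code:

/* multi-threaded mm: unhook and invalidate with a single instruction */
static void zap_one_page_mt(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, struct page *page)
{
        ipte_clear_and_flush(mm, addr, ptep);   /* 1) unhook + TLB flush at once */
        free_page_and_swap_cache(page);         /* 2) free */
}

/* single-threaded mm: unhook and free page by page, flush the mm last */
static void zap_one_page_st(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, struct page *page)
{
        pte_clear(mm, addr, ptep);              /* 1) unhook */
        free_page_and_swap_cache(page);         /* 2) free */
}
/* ... and once the whole range is done: flush_mm(mm), one mm-wide flush */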
Peter Zijlstra Sept. 13, 2018, 10:57 a.m. UTC | #2
On Thu, Sep 13, 2018 at 12:30:14PM +0200, Martin Schwidefsky wrote:

> > + * The mmu_gather data structure is used by the mm code to implement the
> > + * correct and efficient ordering of freeing pages and TLB invalidations.
> > + *
> > + * This correct ordering is:
> > + *
> > + *  1) unhook page
> > + *  2) TLB invalidate page
> > + *  3) free page
> > + *
> > + * That is, we must never free a page before we have ensured there are no live
> > + * translations left to it. Otherwise it might be possible to observe (or
> > + * worse, change) the page content after it has been reused.
> > + *
> 
> This first comment already includes the reason why s390 is probably better off
> with its own mmu-gather implementation. It depends on the situation whether we have
> 
> 1) unhook the page and do a TLB flush at the same time
> 2) free page
> 
> or
> 
> 1) unhook page
> 2) free page
> 3) final TLB flush of the whole mm

that's the fullmm case, right?

> A variant of the second order we had in the past is to do the mm TLB flush first,
> then the unhooks and frees of the individual pages. There are some tricky corners
> switching between the two variants, see finish_arch_post_lock_switch.
> 
> The point is: we *never* have the order 1) unhook, 2) TLB invalidate, 3) free.
> If there is concurrency due to a multi-threaded application we have to do the
> unhook of the page-table entry and the TLB flush with a single instruction.

You can still get the thing you want if for !fullmm you have a no-op
tlb_flush() implementation, assuming your arch page-table frobbing thing
has the required TLB flush in.

Note that that's not utterly unlike how the PowerPC/Sparc hash things
work; they clear and invalidate entries differently from others and don't
use the mmu_gather tlb-flush.
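
A minimal sketch of that suggestion, assuming the arch's PTE-clearing
primitive already did the per-page invalidation (flush_whole_mm() is a
placeholder for an arch-wide flush helper, not a real function):

static inline void tlb_flush(struct mmu_gather *tlb)
{
        if (tlb->fullmm)
                flush_whole_mm(tlb->mm);        /* flush the whole address space */
        /* !fullmm: nothing to do, the unhook instruction already flushed */
}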
Martin Schwidefsky Sept. 13, 2018, 12:18 p.m. UTC | #3
On Thu, 13 Sep 2018 12:57:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 13, 2018 at 12:30:14PM +0200, Martin Schwidefsky wrote:
> 
> > > + * The mmu_gather data structure is used by the mm code to implement the
> > > + * correct and efficient ordering of freeing pages and TLB invalidations.
> > > + *
> > > + * This correct ordering is:
> > > + *
> > > + *  1) unhook page
> > > + *  2) TLB invalidate page
> > > + *  3) free page
> > > + *
> > > + * That is, we must never free a page before we have ensured there are no live
> > > + * translations left to it. Otherwise it might be possible to observe (or
> > > + * worse, change) the page content after it has been reused.
> > > + *  
> > 
> > This first comment already includes the reason why s390 is probably better off
> > with its own mmu-gather implementation. It depends on the situation whether we have
> > 
> > 1) unhook the page and do a TLB flush at the same time
> > 2) free page
> > 
> > or
> > 
> > 1) unhook page
> > 2) free page
> > 3) final TLB flush of the whole mm  
> 
> that's the fullmm case, right?

That includes the fullmm case but we use it for e.g. munmap of a single-threaded
program as well.
 
> > A variant of the second order we had in the past is to do the mm TLB flush first,
> > then the unhooks and frees of the individual pages. There are some tricky corners
> > switching between the two variants, see finish_arch_post_lock_switch.
> > 
> > The point is: we *never* have the order 1) unhook, 2) TLB invalidate, 3) free.
> > If there is concurrency due to a multi-threaded application we have to do the
> > unhook of the page-table entry and the TLB flush with a single instruction.  
> 
> You can still get the thing you want if for !fullmm you have a no-op
> tlb_flush() implementation, assuming your arch page-table frobbing thing
> has the required TLB flush in.

We have a non-empty tlb_flush_mmu_tlbonly to do a full-mm flush for two cases:
1) batches of page-table entries for single-threaded programs
2) flushing of the pages used for the page-table structure itself

In fact only the page-table pages are added to the mmu_gather batch, the target
page of the virtual mapping is always freed immediately.
 
> Note that that's not utterly unlike how the PowerPC/Sparc hash things
> work; they clear and invalidate entries differently from others and don't
> use the mmu_gather tlb-flush.

We may get something working with a common code mmu_gather, but I fear the
day someone makes a "minor" change to it that subtly breaks s390. The debugging of
TLB related problems is just horrible..
Peter Zijlstra Sept. 13, 2018, 12:39 p.m. UTC | #4
On Thu, Sep 13, 2018 at 02:18:27PM +0200, Martin Schwidefsky wrote:
> We may get something working with a common code mmu_gather, but I fear the
> day someone makes a "minor" change to it that subtly breaks s390. The debugging of
> TLB related problems is just horrible..

Yes it is, not just on s390 :/

And this is not something that's easy to write sanity checks for either
AFAIK. I mean we can do a few multi-threaded mmap()/mprotect()/munmap()
proglets and catch faults, but that doesn't even get close to covering
all the 'fun' spots.
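
For what it's worth, a trivial userspace sketch of the kind of proglet meant
here: one thread churns mmap()/mprotect()/munmap() while others write to the
region and catch the resulting faults. A quick illustration only, not an
actual kernel selftest, and error checking is omitted:

/* build: cc -O2 -pthread tlb-stress.c */
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>

#define LEN     (1UL << 20)

static char *region;
static __thread sigjmp_buf env;
static volatile unsigned long faults;

static void segv(int sig)
{
        faults++;
        siglongjmp(env, 1);
}

static void *churn(void *arg)           /* remap the region over and over */
{
        for (int i = 0; i < 100000; i++) {
                mprotect(region, LEN, PROT_NONE);
                mprotect(region, LEN, PROT_READ | PROT_WRITE);
                munmap(region, LEN);
                mmap(region, LEN, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
        }
        return NULL;
}

static void *touch(void *arg)           /* concurrently write the region */
{
        for (unsigned long i = 0; i < 10000000; i++) {
                if (sigsetjmp(env, 1) == 0)
                        region[i % LEN] = (char)i;
        }
        return NULL;
}

int main(void)
{
        struct sigaction sa = { .sa_handler = segv };
        pthread_t t[4];
        int i;

        sigaction(SIGSEGV, &sa, NULL);
        region = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        pthread_create(&t[0], NULL, churn, NULL);
        for (i = 1; i < 4; i++)
                pthread_create(&t[i], NULL, touch, NULL);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        printf("faults taken: %lu\n", faults);
        return 0;
}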

Then again, you're more than welcome to the new:

  MMU GATHER AND TLB INVALIDATION

section in MAINTAINERS.
Martin Schwidefsky Sept. 14, 2018, 10:28 a.m. UTC | #5
On Thu, 13 Sep 2018 14:39:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 13, 2018 at 02:18:27PM +0200, Martin Schwidefsky wrote:
> > We may get something working with a common code mmu_gather, but I fear the
> > day someone makes a "minor" change to it that subtly breaks s390. The debugging of
> > TLB related problems is just horrible..  
> 
> Yes it is, not just on s390 :/
> 
> And this is not something that's easy to write sanity checks for either
> AFAIK. I mean we can do a few multi-threaded mmap()/mprotect()/munmap()
> proglets and catch faults, but that doesn't even get close to covering
> all the 'fun' spots.
> 
> Then again, you're more than welcome to the new:
> 
>   MMU GATHER AND TLB INVALIDATION
> 
> section in MAINTAINERS.

I spent some time to get s390 converted to the common mmu_gather code.
There is one thing I would like to request, namely the ability to
disable the page gather part of mmu_gather. For my prototype patch
see below, it defines the negative HAVE_RCU_NO_GATHER_PAGES Kconfig
symbol that if defined will remove some parts from common code.
Ugly but good enough for the prototype to convey the idea.
For the final solution we better use a positive Kconfig symbol and
add that to all arch Kconfig files except for s390.

The code itself is less hairy than I feared, it worked on the first
try and survived my fork/munmap/mprotect TLB stress test. But as
this is TLB flushing there probably is some subtle problem left..

Here we go:
--
From f222a7e40427b625700f2ca0919c32f07931c19a Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Fri, 14 Sep 2018 10:50:58 +0200
Subject: [PATCH] s390/tlb: convert s390 to generic mmu_gather

Introduce HAVE_RCU_NO_GATHER_PAGES to allow the arch code to disable
page gathering in the generic mmu_gather code, then enable the generic
mmu_gather code for s390.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/Kconfig                |   3 +
 arch/s390/Kconfig           |   3 +
 arch/s390/include/asm/tlb.h | 131 ++++++++++++++------------------------------
 arch/s390/mm/pgalloc.c      |  63 +--------------------
 include/asm-generic/tlb.h   |   7 +++
 mm/mmu_gather.c             |  18 +++++-
 6 files changed, 72 insertions(+), 153 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 053c44703539..9b257929a7c1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -359,6 +359,9 @@ config HAVE_PERF_USER_STACK_DUMP
 config HAVE_ARCH_JUMP_LABEL
 	bool
 
+config HAVE_RCU_NO_GATHER_PAGES
+	bool
+
 config HAVE_RCU_TABLE_FREE
 	bool
 
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 9a9c7a6fe925..521457e3c5e4 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -161,6 +161,9 @@ config S390
 	select HAVE_NOP_MCOUNT
 	select HAVE_OPROFILE
 	select HAVE_PERF_EVENTS
+	select HAVE_RCU_NO_GATHER_PAGES
+	select HAVE_RCU_TABLE_FREE
+	select HAVE_RCU_TABLE_INVALIDATE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index cf3d64313740..8073ff272b2b 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -22,98 +22,40 @@
  * Pages used for the page tables is a different story. FIXME: more
  */
 
-#include <linux/mm.h>
-#include <linux/pagemap.h>
-#include <linux/swap.h>
-#include <asm/processor.h>
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	struct mmu_table_batch *batch;
-	unsigned int fullmm;
-	unsigned long start, end;
-};
-
-struct mmu_table_batch {
-	struct rcu_head		rcu;
-	unsigned int		nr;
-	void			*tables[0];
-};
-
-#define MAX_TABLE_BATCH		\
-	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
-
-extern void tlb_table_flush(struct mmu_gather *tlb);
-extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
-
-static inline void
-arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
-{
-	tlb->mm = mm;
-	tlb->start = start;
-	tlb->end = end;
-	tlb->fullmm = !(start | (end+1));
-	tlb->batch = NULL;
-}
-
-static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
-{
-	__tlb_flush_mm_lazy(tlb->mm);
-}
-
-static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
-{
-	tlb_table_flush(tlb);
-}
-
+void __tlb_remove_table(void *_table);
+static inline void tlb_flush(struct mmu_gather *tlb);
+static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
+					  struct page *page, int page_size);
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-	tlb_flush_mmu_tlbonly(tlb);
-	tlb_flush_mmu_free(tlb);
-}
+#define tlb_start_vma(tlb, vma)			do { } while (0)
+#define tlb_end_vma(tlb, vma)			do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 
-static inline void
-arch_tlb_finish_mmu(struct mmu_gather *tlb,
-		unsigned long start, unsigned long end, bool force)
-{
-	if (force) {
-		tlb->start = start;
-		tlb->end = end;
-	}
+#define tlb_flush tlb_flush
+#define pte_free_tlb pte_free_tlb
+#define pmd_free_tlb pmd_free_tlb
+#define p4d_free_tlb p4d_free_tlb
+#define pud_free_tlb pud_free_tlb
 
-	tlb_flush_mmu(tlb);
-}
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm-generic/tlb.h>
 
 /*
  * Release the page cache reference for a pte removed by
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	free_page_and_swap_cache(page);
-	return false; /* avoid calling tlb_flush_mmu */
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	free_page_and_swap_cache(page);
-}
-
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 					  struct page *page, int page_size)
 {
-	return __tlb_remove_page(tlb, page);
+	free_page_and_swap_cache(page);
+	return false;
 }
 
-static inline void tlb_remove_page_size(struct mmu_gather *tlb,
-					struct page *page, int page_size)
+static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	return tlb_remove_page(tlb, page);
+	__tlb_flush_mm_lazy(tlb->mm);
 }
 
 /*
@@ -121,9 +63,18 @@ static inline void tlb_remove_page_size(struct mmu_gather *tlb,
  * page table from the tlb.
  */
 static inline void pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
-				unsigned long address)
+                                unsigned long address)
 {
-	page_table_free_rcu(tlb, (unsigned long *) pte, address);
+	__tlb_adjust_range(tlb, address, PAGE_SIZE);
+	tlb->mm->context.flush_mm = 1;
+	tlb->freed_tables = 1;
+	tlb->cleared_ptes = 1;
+	/*
+	 * page_table_free_rcu takes care of the allocation bit masks
+	 * of the 2K table fragments in the 4K page table page,
+	 * then calls tlb_remove_table.
+	 */
+        page_table_free_rcu(tlb, (unsigned long *) pte, address);
 }
 
 /*
@@ -139,6 +90,10 @@ static inline void pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
 	if (tlb->mm->context.asce_limit <= _REGION3_SIZE)
 		return;
 	pgtable_pmd_page_dtor(virt_to_page(pmd));
+	__tlb_adjust_range(tlb, address, PAGE_SIZE);
+	tlb->mm->context.flush_mm = 1;
+	tlb->freed_tables = 1;
+	tlb->cleared_puds = 1;
 	tlb_remove_table(tlb, pmd);
 }
 
@@ -154,6 +109,10 @@ static inline void p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
 {
 	if (tlb->mm->context.asce_limit <= _REGION1_SIZE)
 		return;
+	__tlb_adjust_range(tlb, address, PAGE_SIZE);
+	tlb->mm->context.flush_mm = 1;
+	tlb->freed_tables = 1;
+	tlb->cleared_p4ds = 1;
 	tlb_remove_table(tlb, p4d);
 }
 
@@ -169,19 +128,11 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 {
 	if (tlb->mm->context.asce_limit <= _REGION2_SIZE)
 		return;
+	tlb->mm->context.flush_mm = 1;
+	tlb->freed_tables = 1;
+	tlb->cleared_puds = 1;
 	tlb_remove_table(tlb, pud);
 }
 
-#define tlb_start_vma(tlb, vma)			do { } while (0)
-#define tlb_end_vma(tlb, vma)			do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
-#define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr)	do { } while (0)
-#define tlb_migrate_finish(mm)			do { } while (0)
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
-	tlb_remove_tlb_entry(tlb, ptep, address)
-
-static inline void tlb_change_page_size(struct mmu_gather *tlb, unsigned int page_size)
-{
-}
 
 #endif /* _S390_TLB_H */
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 76d89ee8b428..f7656a0b3a1a 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -288,7 +288,7 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	tlb_remove_table(tlb, table);
 }
 
-static void __tlb_remove_table(void *_table)
+void __tlb_remove_table(void *_table)
 {
 	unsigned int mask = (unsigned long) _table & 3;
 	void *table = (void *)((unsigned long) _table ^ mask);
@@ -314,67 +314,6 @@ static void __tlb_remove_table(void *_table)
 	}
 }
 
-static void tlb_remove_table_smp_sync(void *arg)
-{
-	/* Simply deliver the interrupt */
-}
-
-static void tlb_remove_table_one(void *table)
-{
-	/*
-	 * This isn't an RCU grace period and hence the page-tables cannot be
-	 * assumed to be actually RCU-freed.
-	 *
-	 * It is however sufficient for software page-table walkers that rely
-	 * on IRQ disabling. See the comment near struct mmu_table_batch.
-	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
-	__tlb_remove_table(table);
-}
-
-static void tlb_remove_table_rcu(struct rcu_head *head)
-{
-	struct mmu_table_batch *batch;
-	int i;
-
-	batch = container_of(head, struct mmu_table_batch, rcu);
-
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
-
-	free_page((unsigned long)batch);
-}
-
-void tlb_table_flush(struct mmu_gather *tlb)
-{
-	struct mmu_table_batch **batch = &tlb->batch;
-
-	if (*batch) {
-		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
-		*batch = NULL;
-	}
-}
-
-void tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	struct mmu_table_batch **batch = &tlb->batch;
-
-	tlb->mm->context.flush_mm = 1;
-	if (*batch == NULL) {
-		*batch = (struct mmu_table_batch *)
-			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-		if (*batch == NULL) {
-			__tlb_flush_mm_lazy(tlb->mm);
-			tlb_remove_table_one(table);
-			return;
-		}
-		(*batch)->nr = 0;
-	}
-	(*batch)->tables[(*batch)->nr++] = table;
-	if ((*batch)->nr == MAX_TABLE_BATCH)
-		tlb_flush_mmu(tlb);
-}
-
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 21c751cd751e..930e25abf4de 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,6 +179,7 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
 #endif
 
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 /*
  * If we can't allocate a page to make a big batch of page pointers
  * to work on, then just handle a few from the on-stack structure.
@@ -203,6 +204,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
+#endif
+
 /*
  * struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
@@ -249,6 +252,7 @@ struct mmu_gather {
 
 	unsigned int		batch_count;
 
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
@@ -256,6 +260,7 @@ struct mmu_gather {
 #ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+#endif
 };
 
 void arch_tlb_gather_mmu(struct mmu_gather *tlb,
@@ -264,8 +269,10 @@ void tlb_flush_mmu(struct mmu_gather *tlb);
 void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 			 unsigned long start, unsigned long end, bool force);
 void tlb_flush_mmu_free(struct mmu_gather *tlb);
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 				   int page_size);
+#endif
 
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 				      unsigned long address,
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 2d5e617131f6..d3d2763d91b2 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -13,6 +13,8 @@
 
 #ifdef HAVE_GENERIC_MMU_GATHER
 
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
+
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
@@ -41,6 +43,8 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+#endif
+
 void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 				unsigned long start, unsigned long end)
 {
@@ -49,12 +53,14 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 	tlb->need_flush_all = 0;
+
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
-
+#endif
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
@@ -67,16 +73,20 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 
 void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	struct mmu_gather_batch *batch;
+#endif
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
 		batch->nr = 0;
 	}
 	tlb->active = &tlb->local;
+#endif
 }
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
@@ -92,7 +102,9 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end, bool force)
 {
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	struct mmu_gather_batch *batch, *next;
+#endif
 
 	if (force) {
 		__tlb_reset_range(tlb);
@@ -104,13 +116,16 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 	for (batch = tlb->local.next; batch; batch = next) {
 		next = batch->next;
 		free_pages((unsigned long)batch, 0);
 	}
 	tlb->local.next = NULL;
+#endif
 }
 
+#ifndef CONFIG_HAVE_RCU_NO_GATHER_PAGES
 /* __tlb_remove_page
  *	Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
  *	handling the additional races in SMP caused by other CPUs caching valid
@@ -143,6 +158,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 
 	return false;
 }
+#endif
 
 #endif /* HAVE_GENERIC_MMU_GATHER */
Peter Zijlstra Sept. 14, 2018, 1:02 p.m. UTC | #6
On Fri, Sep 14, 2018 at 12:28:24PM +0200, Martin Schwidefsky wrote:

> I spent some time to get s390 converted to the common mmu_gather code.
> There is one thing I would like to request, namely the ability to
> disable the page gather part of mmu_gather. For my prototype patch
> see below, it defines the negative HAVE_RCU_NO_GATHER_PAGES Kconfig
> symbol that if defined will remove some parts from common code.
> Ugly but good enough for the prototype to convey the idea.
> For the final solution we better use a positive Kconfig symbol and
> add that to all arch Kconfig files except for s390.

In a private thread earlier, Linus raised the point that the batching and
freeing of lots of pages at once is probably better for I$.

> +config HAVE_RCU_NO_GATHER_PAGES
> +	bool

I have a problem with the name more than anything else; this name
suggests it is the RCU table freeing that should not batch, which is not
the case, you want the regular page gather gone, but very much require
the RCU table gather to batch.

So I would like to propose calling it:

config HAVE_MMU_GATHER_NO_GATHER

Or something along those lines.
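
For illustration, with that name the struct mmu_gather guard from the
prototype above would simply read:

#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
        struct mmu_gather_batch *active;
        struct mmu_gather_batch local;
        struct page             *__pages[MMU_GATHER_BUNDLE];
#endif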
Martin Schwidefsky Sept. 14, 2018, 2:07 p.m. UTC | #7
On Fri, 14 Sep 2018 15:02:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 14, 2018 at 12:28:24PM +0200, Martin Schwidefsky wrote:
> 
> > I spent some time to get s390 converted to the common mmu_gather code.
> > There is one thing I would like to request, namely the ability to
> > disable the page gather part of mmu_gather. For my prototype patch
> > see below, it defines the negative HAVE_RCU_NO_GATHER_PAGES Kconfig
> > symbol that if defined will remove some parts from common code.
> > Ugly but good enough for the prototype to convey the idea.
> > For the final solution we better use a positive Kconfig symbol and
> > add that to all arch Kconfig files except for s390.  
> 
> In a private thread earlier, Linus raised the point that the batching and
> freeing of lots of pages at once is probably better for I$.

That would be something to try. For now I would like to do a conversion
that more or less preserves the old behavior. You know these pesky TLB
related bugs..

> > +config HAVE_RCU_NO_GATHER_PAGES
> > +	bool  
> 
> I have a problem with the name more than anything else; this name
> suggests it is the RCU table freeing that should not batch, which is not
> the case, you want the regular page gather gone, but very much require
> the RCU table gather to batch.
> 
> So I would like to propose calling it:
> 
> config HAVE_MMU_GATHER_NO_GATHER
> 
> Or something along those lines.
 
Imho a positive config option like HAVE_MMU_GATHER_PAGES would make the
most sense. It has the downside that it needs to be added to all
arch/*/Kconfig files except for s390. 

But I am not hung up on a name; whatever does not sound too awful will do
for me. HAVE_MMU_GATHER_NO_GATHER would be ok.
Will Deacon Sept. 14, 2018, 4:48 p.m. UTC | #8
Hi Peter,

On Thu, Sep 13, 2018 at 11:21:11AM +0200, Peter Zijlstra wrote:
> Write a comment explaining some of this..

This comment is much-needed, thanks! Some comments inline.

> + * The mmu_gather API consists of:
> + *
> + *  - tlb_gather_mmu() / tlb_finish_mmu(); start and finish a mmu_gather
> + *
> + *    Finish in particular will issue a (final) TLB invalidate and free
> + *    all (remaining) queued pages.
> + *
> + *  - tlb_start_vma() / tlb_end_vma(); marks the start / end of a VMA
> + *
> + *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
> + *    there are large holes between the VMAs.
> + *
> + *  - tlb_remove_page() / __tlb_remove_page()
> + *  - tlb_remove_page_size() / __tlb_remove_page_size()
> + *
> + *    __tlb_remove_page_size() is the basic primitive that queues a page for
> + *    freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
> + *    boolean indicating if the queue is (now) full and a call to
> + *    tlb_flush_mmu() is required.
> + *
> + *    tlb_remove_page() and tlb_remove_page_size() imply the call to
> + *    tlb_flush_mmu() when required and have no return value.
> + *
> + *  - tlb_change_page_size()

This doesn't seem to exist in my tree.
[since realised you rename to it in the next patch]

> + *
> + *    call before __tlb_remove_page*() to set the current page-size; implies a
> + *    possible tlb_flush_mmu() call.
> + *
> + *  - tlb_flush_mmu() / tlb_flush_mmu_tlbonly() / tlb_flush_mmu_free()
> + *
> + *    tlb_flush_mmu_tlbonly() - does the TLB invalidate (and resets
> + *                              related state, like the range)
> + *
> + *    tlb_flush_mmu_free() - frees the queued pages; make absolutely
> + *			     sure no additional tlb_remove_page()
> + *			     calls happen between _tlbonly() and this.
> + *
> + *    tlb_flush_mmu() - the above two calls.
> + *
> + *  - mmu_gather::fullmm
> + *
> + *    A flag set by tlb_gather_mmu() to indicate we're going to free
> + *    the entire mm; this allows a number of optimizations.
> + *
> + *    XXX list optimizations

On arm64, we can elide the invalidation altogether because we won't
re-allocate the ASID. We also have an invalidate-by-ASID (mm) instruction,
which we could use if we needed to.
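
A rough sketch of that arm64-style elision (not the actual arm64 code;
flush_tlb_mm_range() here is a placeholder for the real range invalidation):

static inline void tlb_flush(struct mmu_gather *tlb)
{
        if (tlb->fullmm)
                return; /* the ASID is never re-used, nothing to invalidate */

        flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end);      /* placeholder */
}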

> + *
> + *  - mmu_gather::need_flush_all
> + *
> + *    A flag that can be set by the arch code if it wants to force
> + *    flush the entire TLB irrespective of the range. For instance
> + *    x86-PAE needs this when changing top-level entries.
> + *
> + * And requires the architecture to provide and implement tlb_flush().
> + *
> + * tlb_flush() may, in addition to the above mentioned mmu_gather fields, make
> + * use of:
> + *
> + *  - mmu_gather::start / mmu_gather::end
> + *
> + *    which (when !need_flush_all; fullmm will have start = end = ~0UL) provides
> + *    the range that needs to be flushed to cover the pages to be freed.

I don't understand the mention of need_flush_all here -- I didn't think it
was used by the core code at all.

> + *
> + *  - mmu_gather::freed_tables
> + *
> + *    set when we freed page table pages
> + *
> + *  - tlb_get_unmap_shift() / tlb_get_unmap_size()
> + *
> + *    returns the smallest TLB entry size unmapped in this range
> + *
> + * Additionally there are a few opt-in features:
> + *
> + *  HAVE_MMU_GATHER_PAGE_SIZE
> + *
> + *  This ensures we call tlb_flush() every time tlb_change_page_size() actually
> + *  changes the size and provides mmu_gather::page_size to tlb_flush().

Ah, you add this later in the series. I think Nick reckoned we could get rid
of this (the page_size field) eventually...

Will
Peter Zijlstra Sept. 19, 2018, 11:33 a.m. UTC | #9
On Fri, Sep 14, 2018 at 05:48:57PM +0100, Will Deacon wrote:

> > + *  - tlb_change_page_size()
> 
> This doesn't seem to exist in my tree.
> [since realised you rename to it in the next patch]
> 

> > + * Additionally there are a few opt-in features:
> > + *
> > + *  HAVE_MMU_GATHER_PAGE_SIZE
> > + *
> > + *  This ensures we call tlb_flush() every time tlb_change_page_size() actually
> > + *  changes the size and provides mmu_gather::page_size to tlb_flush().
> 
> Ah, you add this later in the series. I think Nick reckoned we could get rid
> of this (the page_size field) eventually...

Right; let me fix that ordering..
Peter Zijlstra Sept. 19, 2018, 11:51 a.m. UTC | #10
On Fri, Sep 14, 2018 at 05:48:57PM +0100, Will Deacon wrote:

> > + *  - mmu_gather::fullmm
> > + *
> > + *    A flag set by tlb_gather_mmu() to indicate we're going to free
> > + *    the entire mm; this allows a number of optimizations.
> > + *
> > + *    XXX list optimizations
> 
> On arm64, we can elide the invalidation altogether because we won't
> re-allocate the ASID. We also have an invalidate-by-ASID (mm) instruction,
> which we could use if we needed to.

Right, but I was also struggling to put into words the normal fullmm
case.

I now ended up with:

--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -82,7 +82,11 @@
  *    A flag set by tlb_gather_mmu() to indicate we're going to free
  *    the entire mm; this allows a number of optimizations.
  *
- *    XXX list optimizations
+ *    - We can ignore tlb_{start,end}_vma(); because we don't
+ *      care about ranges. Everything will be shot down.
+ *
+ *    - (RISC) architectures that use ASIDs can cycle to a new ASID
+ *      and delay the invalidation until ASID space runs out.
  *
  *  - mmu_gather::need_flush_all
  *

Does that about cover things; or do we need more?

> > + *
> > + *  - mmu_gather::need_flush_all
> > + *
> > + *    A flag that can be set by the arch code if it wants to force
> > + *    flush the entire TLB irrespective of the range. For instance
> > + *    x86-PAE needs this when changing top-level entries.
> > + *
> > + * And requires the architecture to provide and implement tlb_flush().
> > + *
> > + * tlb_flush() may, in addition to the above mentioned mmu_gather fields, make
> > + * use of:
> > + *
> > + *  - mmu_gather::start / mmu_gather::end
> > + *
> > + *    which (when !need_flush_all; fullmm will have start = end = ~0UL) provides
> > + *    the range that needs to be flushed to cover the pages to be freed.
> 
> I don't understand the mention of need_flush_all here -- I didn't think it
> was used by the core code at all.

The core does indeed not use that flag; but if the architecture sets
that, the range is still ignored.

Can you suggest clearer wording?
Will Deacon Sept. 19, 2018, 12:23 p.m. UTC | #11
On Wed, Sep 19, 2018 at 01:51:58PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 05:48:57PM +0100, Will Deacon wrote:
> 
> > > + *  - mmu_gather::fullmm
> > > + *
> > > + *    A flag set by tlb_gather_mmu() to indicate we're going to free
> > > + *    the entire mm; this allows a number of optimizations.
> > > + *
> > > + *    XXX list optimizations
> > 
> > On arm64, we can elide the invalidation altogether because we won't
> > re-allocate the ASID. We also have an invalidate-by-ASID (mm) instruction,
> > which we could use if we needed to.
> 
> Right, but I was also struggling to put into words the normal fullmm
> case.
> 
> I now ended up with:
> 
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -82,7 +82,11 @@
>   *    A flag set by tlb_gather_mmu() to indicate we're going to free
>   *    the entire mm; this allows a number of optimizations.
>   *
> - *    XXX list optimizations
> + *    - We can ignore tlb_{start,end}_vma(); because we don't
> + *      care about ranges. Everything will be shot down.
> + *
> + *    - (RISC) architectures that use ASIDs can cycle to a new ASID
> + *      and delay the invalidation until ASID space runs out.
>   *
>   *  - mmu_gather::need_flush_all
>   *
> 
> Does that about cover things; or do we need more?

I think that's fine as a starting point. People can always add more.

> > > + *
> > > + *  - mmu_gather::need_flush_all
> > > + *
> > > + *    A flag that can be set by the arch code if it wants to force
> > > + *    flush the entire TLB irrespective of the range. For instance
> > > + *    x86-PAE needs this when changing top-level entries.
> > > + *
> > > + * And requires the architecture to provide and implement tlb_flush().
> > > + *
> > > + * tlb_flush() may, in addition to the above mentioned mmu_gather fields, make
> > > + * use of:
> > > + *
> > > + *  - mmu_gather::start / mmu_gather::end
> > > + *
> > > + *    which (when !need_flush_all; fullmm will have start = end = ~0UL) provides
> > > + *    the range that needs to be flushed to cover the pages to be freed.
> > 
> > I don't understand the mention of need_flush_all here -- I didn't think it
> > was used by the core code at all.
> 
> The core does indeed not use that flag; but if the architecture sets
> that, the range is still ignored.
> 
> Can you suggest clearer wording?

The range is only ignored if the default tlb_flush() implementation is used
though, right? Since this text is about the fields that tlb_flush() can use,
I think we can just delete the part in brackets.

Will
Peter Zijlstra Sept. 19, 2018, 1:12 p.m. UTC | #12
On Wed, Sep 19, 2018 at 01:23:29PM +0100, Will Deacon wrote:

> > > > + *    which (when !need_flush_all; fullmm will have start = end = ~0UL) provides
> > > > + *    the range that needs to be flushed to cover the pages to be freed.
> > > 
> > > I don't understand the mention of need_flush_all here -- I didn't think it
> > > was used by the core code at all.
> > 
> > The core does indeed not use that flag; but if the architecture sets
> > that, the range is still ignored.
> > 
> > Can you suggest clearer wording?
> 
> The range is only ignored if the default tlb_flush() implementation is used
> though, right? Since this text is about the fields that tlb_flush() can use,
> I think we can just delete the part in brackets.

Well, any architecture that actually uses need_flush_all will obviously
require a tlb_flush implementation that looks at it.

But OK, I'll remove the note.
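
Putting the pieces of this discussion together, an arch tlb_flush() that
consumes these fields might look roughly like this; flush_mm() and
flush_mm_range() are placeholder helpers, not real kernel functions:

static inline void tlb_flush(struct mmu_gather *tlb)
{
        if (tlb->fullmm || tlb->need_flush_all)
                flush_mm(tlb->mm);              /* flush the whole address space */
        else
                flush_mm_range(tlb->mm, tlb->start, tlb->end,
                               tlb_get_unmap_shift(tlb));
}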

Patch

--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -22,6 +22,119 @@ 
 
 #ifdef CONFIG_MMU
 
+/*
+ * Generic MMU-gather implementation.
+ *
+ * The mmu_gather data structure is used by the mm code to implement the
+ * correct and efficient ordering of freeing pages and TLB invalidations.
+ *
+ * This correct ordering is:
+ *
+ *  1) unhook page
+ *  2) TLB invalidate page
+ *  3) free page
+ *
+ * That is, we must never free a page before we have ensured there are no live
+ * translations left to it. Otherwise it might be possible to observe (or
+ * worse, change) the page content after it has been reused.
+ *
+ * The mmu_gather API consists of:
+ *
+ *  - tlb_gather_mmu() / tlb_finish_mmu(); start and finish a mmu_gather
+ *
+ *    Finish in particular will issue a (final) TLB invalidate and free
+ *    all (remaining) queued pages.
+ *
+ *  - tlb_start_vma() / tlb_end_vma(); marks the start / end of a VMA
+ *
+ *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
+ *    there are large holes between the VMAs.
+ *
+ *  - tlb_remove_page() / __tlb_remove_page()
+ *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *
+ *    __tlb_remove_page_size() is the basic primitive that queues a page for
+ *    freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
+ *    boolean indicating if the queue is (now) full and a call to
+ *    tlb_flush_mmu() is required.
+ *
+ *    tlb_remove_page() and tlb_remove_page_size() imply the call to
+ *    tlb_flush_mmu() when required and have no return value.
+ *
+ *  - tlb_change_page_size()
+ *
+ *    call before __tlb_remove_page*() to set the current page-size; implies a
+ *    possible tlb_flush_mmu() call.
+ *
+ *  - tlb_flush_mmu() / tlb_flush_mmu_tlbonly() / tlb_flush_mmu_free()
+ *
+ *    tlb_flush_mmu_tlbonly() - does the TLB invalidate (and resets
+ *                              related state, like the range)
+ *
+ *    tlb_flush_mmu_free() - frees the queued pages; make absolutely
+ *			     sure no additional tlb_remove_page()
+ *			     calls happen between _tlbonly() and this.
+ *
+ *    tlb_flush_mmu() - the above two calls.
+ *
+ *  - mmu_gather::fullmm
+ *
+ *    A flag set by tlb_gather_mmu() to indicate we're going to free
+ *    the entire mm; this allows a number of optimizations.
+ *
+ *    XXX list optimizations
+ *
+ *  - mmu_gather::need_flush_all
+ *
+ *    A flag that can be set by the arch code if it wants to force
+ *    flush the entire TLB irrespective of the range. For instance
+ *    x86-PAE needs this when changing top-level entries.
+ *
+ * And requires the architecture to provide and implement tlb_flush().
+ *
+ * tlb_flush() may, in addition to the above mentioned mmu_gather fields, make
+ * use of:
+ *
+ *  - mmu_gather::start / mmu_gather::end
+ *
+ *    which (when !need_flush_all; fullmm will have start = end = ~0UL) provides
+ *    the range that needs to be flushed to cover the pages to be freed.
+ *
+ *  - mmu_gather::freed_tables
+ *
+ *    set when we freed page table pages
+ *
+ *  - tlb_get_unmap_shift() / tlb_get_unmap_size()
+ *
+ *    returns the smallest TLB entry size unmapped in this range
+ *
+ * Additionally there are a few opt-in features:
+ *
+ *  HAVE_MMU_GATHER_PAGE_SIZE
+ *
+ *  This ensures we call tlb_flush() every time tlb_change_page_size() actually
+ *  changes the size and provides mmu_gather::page_size to tlb_flush().
+ *
+ *  HAVE_RCU_TABLE_FREE
+ *
+ *  This provides tlb_remove_table(), to be used instead of tlb_remove_page()
+ *  for page directories (__p*_free_tlb()). This provides separate freeing of
+ *  the page-table pages themselves in a semi-RCU fashion (see comment below).
+ *  Useful if your architecture doesn't use IPIs for remote TLB invalidates
+ *  and therefore doesn't naturally serialize with software page-table walkers.
+ *
+ *  When used, an architecture is expected to provide __tlb_remove_table()
+ *  which does the actual freeing of these pages.
+ *
+ *  HAVE_RCU_TABLE_INVALIDATE
+ *
+ *  This makes HAVE_RCU_TABLE_FREE call tlb_flush_mmu_tlbonly() before freeing
+ *  the page-table pages. Required if you use HAVE_RCU_TABLE_FREE and your
+ *  architecture uses the Linux page-tables natively.
+ *
+ */
+#define HAVE_GENERIC_MMU_GATHER
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -89,14 +202,17 @@  struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
-/* struct mmu_gather is an opaque type used by the mm code for passing around
+/*
+ * struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
+
 	unsigned long		start;
 	unsigned long		end;
 	/*
@@ -131,8 +247,6 @@  struct mmu_gather {
 	int page_size;
 };
 
-#define HAVE_GENERIC_MMU_GATHER
-
 void arch_tlb_gather_mmu(struct mmu_gather *tlb,
 	struct mm_struct *mm, unsigned long start, unsigned long end);
 void tlb_flush_mmu(struct mmu_gather *tlb);
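
For reference, a condensed sketch of how the calls documented above are
typically sequenced by an unmap path. This is heavily simplified, not the
actual mm/memory.c code, and the pte walk is reduced to comments:

static void unmap_region_sketch(struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                unsigned long start, unsigned long end)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);
        tlb_start_vma(&tlb, vma);

        /* for each present pte in [start, end):                        */
        /*      pte = ptep_get_and_clear(mm, addr, ptep);   unhook      */
        /*      tlb_remove_tlb_entry(&tlb, ptep, addr);     track range */
        /*      tlb_remove_page(&tlb, page);    queue the free, may flush */

        tlb_end_vma(&tlb, vma);
        tlb_finish_mmu(&tlb, start, end); /* final invalidate + free queued pages */
}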