
[2/2] arm64: Mark kernel page ranges contiguous

Message ID 1455293208-6763-3-git-send-email-jeremy.linton@arm.com (mailing list archive)
State New, archived

Commit Message

Jeremy Linton Feb. 12, 2016, 4:06 p.m. UTC
With 64k pages, the next larger segment size is 512M. The Linux
kernel also uses different protection flags to cover its code and data.
Because of this requirement, the vast majority of the kernel code and
data structures end up being mapped with 64k pages instead of the larger
pages common with a 4k page kernel.

Recent ARM processors support a contiguous bit in the
page tables which allows a single TLB entry to cover a range larger than
a single PTE if that range is mapped into physically contiguous
RAM.

So, for the kernel it's a good idea to set this flag. Some basic
micro-benchmarks show it can significantly reduce the number of
L1 dTLB refills.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
 arch/arm64/mm/mmu.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 6 deletions(-)
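
For reference, with 64k pages the contiguous hint groups CONT_PTES (32)
adjacent entries, so a single TLB entry can cover 32 * 64k = 2M. The patch
only sets the bit when the virtual start, virtual end, and physical start
of a run share that alignment; a minimal sketch of the check it uses,
assuming the CONT_* definitions from the arm64 headers:

	/*
	 * Illustrative only: mirrors the alignment test in alloc_init_pte()
	 * below. A run of PTEs may carry PTE_CONT only if the virtual range
	 * and the physical address are all aligned to CONT_SIZE
	 * (CONT_PTES * PAGE_SIZE, i.e. 2M with a 64k granule).
	 */
	static bool can_set_cont_pte(unsigned long addr, unsigned long next,
				     phys_addr_t phys)
	{
		return ((addr | next | phys) & ~CONT_MASK) == 0;
	}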

Comments

Mark Rutland Feb. 12, 2016, 4:57 p.m. UTC | #1
Hi,

On Fri, Feb 12, 2016 at 10:06:48AM -0600, Jeremy Linton wrote:
> With 64k pages, the next larger segment size is 512M. The Linux
> kernel also uses different protection flags to cover its code and data.
> Because of this requirement, the vast majority of the kernel code and
> data structures end up being mapped with 64k pages instead of the larger
> pages common with a 4k page kernel.
> 
> Recent ARM processors support a contiguous bit in the
> page tables which allows a single TLB entry to cover a range larger than
> a single PTE if that range is mapped into physically contiguous
> RAM.
> 
> So, for the kernel it's a good idea to set this flag. Some basic
> micro-benchmarks show it can significantly reduce the number of
> L1 dTLB refills.
> 
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)

This generally looks good.

As a heads-up, I have one concern:

> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> +{
> +	int i;
> +
> +	pte -= CONT_RANGE_OFFSET(addr);
> +	for (i = 0; i < CONT_PTES; i++) {
> +		if (pte_cont(*pte))
> +			set_pte(pte, pte_mknoncont(*pte));
> +		pte++;
> +	}
> +	flush_tlb_all();
> +}

As far as I can tell, "splitting" contiguous entries comes with the same
caveats as splitting sections. In the absence of a BBM sequence we might
end up with conflicting TLB entries.

However, I think we're OK for now.

The way we consistently map/unmap/modify image/linear "chunks" should
prevent us from trying to split those, and if/when we do this for the
EFI runtime page tables they aren't live.

It would be good to figure out how to get rid of the splitting entirely.
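
For comparison, a break-before-make version of the helper quoted above
might look roughly like the sketch below. This is purely illustrative; it
reuses the helpers from the patch series and still leaves open the
question of what happens to anything touching the range while the entries
are invalid:

	static void split_cont_pte_range_bbm(pte_t *pte, unsigned long addr)
	{
		pte_t old[CONT_PTES];
		int i;

		pte -= CONT_RANGE_OFFSET(addr);

		/* Break: take the whole contiguous group out of the tables */
		for (i = 0; i < CONT_PTES; i++) {
			old[i] = pte[i];
			set_pte(&pte[i], __pte(0));
		}
		flush_tlb_all();

		/* Make: reinstall the entries without the contiguous hint */
		for (i = 0; i < CONT_PTES; i++)
			set_pte(&pte[i], pte_mknoncont(old[i]));
	}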

Otherwise, this looks good to me; I'll try to give this a spin next
week.

Thanks,
Mark.
Jeremy Linton Feb. 12, 2016, 5:35 p.m. UTC | #2
On 02/12/2016 10:57 AM, Mark Rutland wrote:
(trimming)
 > On Fri, Feb 12, 2016 at 10:06:48AM -0600, Jeremy Linton wrote:
>> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
>> +{
>> +	int i;
>> +
>> +	pte -= CONT_RANGE_OFFSET(addr);
>> +	for (i = 0; i < CONT_PTES; i++) {
>> +		if (pte_cont(*pte))
>> +			set_pte(pte, pte_mknoncont(*pte));
>> +		pte++;
>> +	}
>> +	flush_tlb_all();
>> +}
>
> As far as I can tell, "splitting" contiguous entries comes with the same
> caveats as splitting sections. In the absence of a BBM sequence we might
> end up with conflicting TLB entries.

As I mentioned a couple of weeks ago, I'm not sure that inverting BBM 
into a full "make a partial copy of the whole table -> break TTBR over 
to the copy" sequence is so bad, if the copy process maintains 
references to the original table entries when they aren't in the 
modification path. It might even work with all the CPUs spun up, 
because the break sequence would just be IPIs to the remaining CPUs to 
replace their TTBR with the new value and flush. I think you mentioned 
the ugly part is arbitrating access to the update functionality (and 
all the implied rules about when it can be done). But doing it that way 
doesn't require stalling the CPUs during the "make partial copy" portion.

> However, I think we're OK for now.
>
> The way we consistently map/unmap/modify image/linear "chunks" should
> prevent us from trying to split those, and if/when we do this for the
> EFI runtime page tables they aren't live.
>
> It would be good to figure out how to get rid of the splitting entirely.

Well, we could hoist some of it earlier by taking the 
create_mapping_late() calls and doing them earlier with RWX permissions, 
and then applying RO/ROX/RW later as necessary.

Which is ugly, but it might solve particular late-splitting cases.
Mark Rutland Feb. 12, 2016, 5:58 p.m. UTC | #3
On Fri, Feb 12, 2016 at 11:35:05AM -0600, Jeremy Linton wrote:
> On 02/12/2016 10:57 AM, Mark Rutland wrote:
> (trimming)
> > On Fri, Feb 12, 2016 at 10:06:48AM -0600, Jeremy Linton wrote:
> >>+static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> >>+{
> >>+	int i;
> >>+
> >>+	pte -= CONT_RANGE_OFFSET(addr);
> >>+	for (i = 0; i < CONT_PTES; i++) {
> >>+		if (pte_cont(*pte))
> >>+			set_pte(pte, pte_mknoncont(*pte));
> >>+		pte++;
> >>+	}
> >>+	flush_tlb_all();
> >>+}
> >
> >As far as I can tell, "splitting" contiguous entries comes with the same
> >caveats as splitting sections. In the absence of a BBM sequence we might
> >end up with conflicting TLB entries.
> 
> As I mentioned a couple of weeks ago, I'm not sure that inverting BBM
> into a full "make a partial copy of the whole table -> break TTBR over
> to the copy" sequence is so bad, if the copy process maintains
> references to the original table entries when they aren't in the
> modification path. It might even work with all the CPUs spun up,
> because the break sequence would just be IPIs to the remaining CPUs to
> replace their TTBR with the new value and flush. I think you mentioned
> the ugly part is arbitrating access to the update functionality (and
> all the implied rules about when it can be done). But doing it that
> way doesn't require stalling the CPUs during the "make partial copy"
> portion.

That may be true, and worthy of investigation.

One problem I envisaged with that is concurrent kernel pagetable
modification (e.g. vmalloc, DEBUG_PAGEALLOC). To handle that correctly
you require global serialization (or your copy may be stale), though as
you point out that doesn't mean stop-the-world entirely.

For the above, I was simply pointing out that in general,
splitting/fusing contiguous ranges comes with the same issues as
splitting/fusing sections, as that may not be immediately obvious.

> >However, I think we're OK for now.
> >
> >The way we consistently map/unmap/modify image/linear "chunks" should
> >prevent us from trying to split those, and if/when we do this for the
> >EFI runtime page tables they aren't live.
> >
> >It would be good to figure out how to get rid of the splitting entirely.
> 
> Well, we could hoist some of it earlier by taking the
> create_mapping_late() calls and doing them earlier with RWX
> permissions, and then applying RO/ROX/RW later as necessary.
> 
> Which is ugly, but it might solve particular late-splitting cases.

I'm not sure I follow.

The aim was that after my changes we should only split/fuse for EFI page
tables, and only for !4K page kernels. See [1] for why. Avoiding that in
the EFI case is very painful, so for now we kept split_pud and
split_pmd.

All create_mapping_late() calls should be performed with the same
physical/virtual start/end as earlier "chunk" mappings, and thus should
never result in a fuse/split or translation change -- only permission
changes (which we believe do not result in TLB conflicts, or we'd need
to do far more work to fix those up).

If we split/fuse in any case other than EFI runtime table creation, that
is a bug that we need to fix. If you're seeing a case we do that, then
please let me know!

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398178.html
Jeremy Linton Feb. 12, 2016, 6:09 p.m. UTC | #4
On 02/12/2016 11:58 AM, Mark Rutland wrote:

(trimming)

> All create_mapping_late() calls should be performed with the same
> physical/virtual start/end as earlier "chunk" mappings, and thus should
> never result in a fuse/split or translation change -- only permission
> changes (which we believe do not result in TLB conflicts, or we'd need
> to do far more work to fix those up).
>
> If we split/fuse in any case other than EFI runtime table creation, that
> is a bug that we need to fix. If you're seeing a case we do that, then
> please let me know!

We are saying the same thing, and right now the biggest violator is 
probably the .rodata patch I just posted!
Mark Rutland Feb. 12, 2016, 6:11 p.m. UTC | #5
On Fri, Feb 12, 2016 at 12:09:41PM -0600, Jeremy Linton wrote:
> On 02/12/2016 11:58 AM, Mark Rutland wrote:
> 
> (trimming)
> 
> >All create_mapping_late() calls should be performed with the same
> >physical/virtual start/end as earlier "chunk" mappings, and thus should
> >never result in a fuse/split or translation change -- only permission
> >changes (which we believe do not result in TLB conflicts, or we'd need
> >to do far more work to fix those up).
> >
> >If we split/fuse in any case other than EFI runtime table creation, that
> >is a bug that we need to fix. If you're seeing a case we do that, then
> >please let me know!
> 
> We are saying the same thing, and right now the biggest violator is
> probably the .rodata patch I just posted!

Ok, phew!

The simple fix is to make .text and .rodata separate "chunks", then it
all falls out in the wash.
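
Concretely, that might look something like the sketch below, i.e. the
image-mapping code giving .rodata its own boundaries from the start so
that create_mapping_late() only ever changes permissions, never a
translation. The helper name and prot values are assumptions, not taken
from an actual patch:

	/*
	 * Sketch only: map .text and .rodata as separate "chunks" so later
	 * permission changes never split an existing mapping. A helper
	 * along the lines of map_kernel_chunk() is assumed to exist in
	 * mm/mmu.c; .rodata starts out writable and is made read-only
	 * later, e.g. from mark_rodata_ro().
	 */
	static void __init map_kernel_text_rodata(pgd_t *pgd)
	{
		map_kernel_chunk(pgd, _stext, _etext, PAGE_KERNEL_EXEC);
		map_kernel_chunk(pgd, __start_rodata, __end_rodata,
				 PAGE_KERNEL);
		/* remaining image segments mapped as before */
	}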

Mark.
Ard Biesheuvel Feb. 13, 2016, 4:43 p.m. UTC | #6
Hi Jeremy,

On 12 February 2016 at 17:06, Jeremy Linton <jeremy.linton@arm.com> wrote:
> With 64k pages, the next larger segment size is 512M. The Linux
> kernel also uses different protection flags to cover its code and data.
> Because of this requirement, the vast majority of the kernel code and
> data structures end up being mapped with 64k pages instead of the larger
> pages common with a 4k page kernel.
>
> Recent ARM processors support a contiguous bit in the
> page tables which allows a single TLB entry to cover a range larger than
> a single PTE if that range is mapped into physically contiguous
> RAM.
>
> So, for the kernel it's a good idea to set this flag. Some basic
> micro-benchmarks show it can significantly reduce the number of
> L1 dTLB refills.
>
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>

AFAICT, extending this patch to implement contiguous PMDs for 16 KB
granule kernels should be fairly straightforward, right? Level 2
contiguous block size on 16 KB is 1 GB, which would be useful for the
linear mapping.
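
For the record, the contiguous-hint geometry per granule (going from the
ARMv8 ARM translation table descriptor tables, so worth double-checking)
is:

  granule   level 3 (PTE)              level 2 (PMD)
  4 KB      16 x 4 KB   = 64 KB        16 x 2 MB   = 32 MB
  16 KB     128 x 16 KB = 2 MB         32 x 32 MB  = 1 GB
  64 KB     32 x 64 KB  = 2 MB         32 x 512 MB = 16 GB

so a contiguous PMD on a 16 KB granule kernel would indeed give 1 GB per
TLB entry.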

> ---
>  arch/arm64/mm/mmu.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 7711554..ab69a99 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1,3 +1,4 @@
> +
>  /*
>   * Based on arch/arm/mm/mmu.c
>   *
> @@ -103,17 +104,49 @@ static void split_pmd(pmd_t *pmd, pte_t *pte)
>                  * Need to have the least restrictive permissions available
>                  * permissions will be fixed up later
>                  */
> -               set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
> +               set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC_CONT));
>                 pfn++;
>         } while (pte++, i++, i < PTRS_PER_PTE);
>  }
>
> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> +{
> +       int i;
> +
> +       pte -= CONT_RANGE_OFFSET(addr);
> +       for (i = 0; i < CONT_PTES; i++) {
> +               if (pte_cont(*pte))
> +                       set_pte(pte, pte_mknoncont(*pte));
> +               pte++;
> +       }
> +       flush_tlb_all();
> +}
> +
> +/*
> + * Given a range of PTEs set the pfn and provided page protection flags
> + */
> +static void __populate_init_pte(pte_t *pte, unsigned long addr,
> +                              unsigned long end, phys_addr_t phys,
> +                              pgprot_t prot)
> +{
> +       unsigned long pfn = __phys_to_pfn(phys);
> +
> +       do {
> +               /* clear all the bits except the pfn, then apply the prot */
> +               set_pte(pte, pfn_pte(pfn, prot));
> +               pte++;
> +               pfn++;
> +               addr += PAGE_SIZE;
> +       } while (addr != end);
> +}
> +
>  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
> -                                 unsigned long end, unsigned long pfn,
> +                                 unsigned long end, phys_addr_t phys,
>                                   pgprot_t prot,
>                                   phys_addr_t (*pgtable_alloc)(void))
>  {
>         pte_t *pte;
> +       unsigned long next;
>
>         if (pmd_none(*pmd) || pmd_sect(*pmd)) {
>                 phys_addr_t pte_phys = pgtable_alloc();
> @@ -127,10 +160,29 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
>         BUG_ON(pmd_bad(*pmd));
>
>         pte = pte_set_fixmap_offset(pmd, addr);
> +
>         do {
> -               set_pte(pte, pfn_pte(pfn, prot));
> -               pfn++;
> -       } while (pte++, addr += PAGE_SIZE, addr != end);
> +               next = min(end, (addr + CONT_SIZE) & CONT_MASK);
> +               if (((addr | next | phys) & ~CONT_MASK) == 0) {
> +                       /* a block of CONT_PTES  */
> +                       __populate_init_pte(pte, addr, next, phys,
> +                                           prot | __pgprot(PTE_CONT));
> +               } else {
> +                       /*
> +                        * If the range being split is already inside of a
> +                        * contiguous range but this PTE isn't going to be
> +                        * contiguous, then we want to unmark the adjacent
> +                        * ranges, then update the portion of the range we
> +                        * are interested in.
> +                        */
> +                       clear_cont_pte_range(pte, addr);
> +                       __populate_init_pte(pte, addr, next, phys, prot);
> +               }
> +
> +               pte += (next - addr) >> PAGE_SHIFT;
> +               phys += next - addr;
> +               addr = next;
> +       } while (addr != end);
>
>         pte_clear_fixmap();
>  }
> @@ -194,7 +246,7 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
>                                 }
>                         }
>                 } else {
> -                       alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
> +                       alloc_init_pte(pmd, addr, next, phys,
>                                        prot, pgtable_alloc);
>                 }
>                 phys += next - addr;
> --
> 2.4.3
>

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7711554..ab69a99 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1,3 +1,4 @@ 
+
 /*
  * Based on arch/arm/mm/mmu.c
  *
@@ -103,17 +104,49 @@  static void split_pmd(pmd_t *pmd, pte_t *pte)
 		 * Need to have the least restrictive permissions available
 		 * permissions will be fixed up later
 		 */
-		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC_CONT));
 		pfn++;
 	} while (pte++, i++, i < PTRS_PER_PTE);
 }
 
+static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
+{
+	int i;
+
+	pte -= CONT_RANGE_OFFSET(addr);
+	for (i = 0; i < CONT_PTES; i++) {
+		if (pte_cont(*pte))
+			set_pte(pte, pte_mknoncont(*pte));
+		pte++;
+	}
+	flush_tlb_all();
+}
+
+/*
+ * Given a range of PTEs set the pfn and provided page protection flags
+ */
+static void __populate_init_pte(pte_t *pte, unsigned long addr,
+			       unsigned long end, phys_addr_t phys,
+			       pgprot_t prot)
+{
+	unsigned long pfn = __phys_to_pfn(phys);
+
+	do {
+		/* clear all the bits except the pfn, then apply the prot */
+		set_pte(pte, pfn_pte(pfn, prot));
+		pte++;
+		pfn++;
+		addr += PAGE_SIZE;
+	} while (addr != end);
+}
+
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
-				  unsigned long end, unsigned long pfn,
+				  unsigned long end, phys_addr_t phys,
 				  pgprot_t prot,
 				  phys_addr_t (*pgtable_alloc)(void))
 {
 	pte_t *pte;
+	unsigned long next;
 
 	if (pmd_none(*pmd) || pmd_sect(*pmd)) {
 		phys_addr_t pte_phys = pgtable_alloc();
@@ -127,10 +160,29 @@  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 	BUG_ON(pmd_bad(*pmd));
 
 	pte = pte_set_fixmap_offset(pmd, addr);
+
 	do {
-		set_pte(pte, pfn_pte(pfn, prot));
-		pfn++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
+		if (((addr | next | phys) & ~CONT_MASK) == 0) {
+			/* a block of CONT_PTES	 */
+			__populate_init_pte(pte, addr, next, phys,
+					    prot | __pgprot(PTE_CONT));
+		} else {
+			/*
+			 * If the range being split is already inside of a
+			 * contiguous range but this PTE isn't going to be
+			 * contiguous, then we want to unmark the adjacent
+			 * ranges, then update the portion of the range we
+			 * are interested in.
+			 */
+			clear_cont_pte_range(pte, addr);
+			__populate_init_pte(pte, addr, next, phys, prot);
+		}
+
+		pte += (next - addr) >> PAGE_SHIFT;
+		phys += next - addr;
+		addr = next;
+	} while (addr != end);
 
 	pte_clear_fixmap();
 }
@@ -194,7 +246,7 @@  static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 				}
 			}
 		} else {
-			alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
+			alloc_init_pte(pmd, addr, next, phys,
 				       prot, pgtable_alloc);
 		}
 		phys += next - addr;