Message ID | 1455903983-23910-3-git-send-email-jeremy.linton@arm.com (mailing list archive) |
---|---|
State | New, archived |
On 19 February 2016 at 18:46, Jeremy Linton <jeremy.linton@arm.com> wrote:
> With 64k pages, the next larger segment size is 512M. The Linux
> kernel also uses different protection flags to cover its code and data.
> Because of these requirements, the vast majority of the kernel code and
> data structures end up being mapped with 64k pages instead of the larger
> pages common with a 4k page kernel.
>
> Recent ARM processors support a contiguous bit in the
> page tables which allows a TLB to cover a range larger than a
> single PTE if that range is mapped into physically contiguous
> RAM.
>
> So, for the kernel it's a good idea to set this flag. Some basic
> micro benchmarks show it can significantly reduce the number of
> L1 dTLB refills.
>
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>

Tested on 4k/3 levels, and the page tables look correct to me

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

You didn't answer my question though, regarding implementing this at
the PMD level for 16k pages kernels, so that we can use 1 GB chunks
for mapping system RAM. Would that just be a matter of doing the exact
same thing at the PMD level? Or is it more complicated than that?
On 02/22/2016 04:28 AM, Ard Biesheuvel wrote:
> On 19 February 2016 at 18:46, Jeremy Linton <jeremy.linton@arm.com> wrote:
>> With 64k pages, the next larger segment size is 512M. The Linux
>> kernel also uses different protection flags to cover its code and data.
>> Because of these requirements, the vast majority of the kernel code and
>> data structures end up being mapped with 64k pages instead of the larger
>> pages common with a 4k page kernel.
>>
>> Recent ARM processors support a contiguous bit in the
>> page tables which allows a TLB to cover a range larger than a
>> single PTE if that range is mapped into physically contiguous
>> RAM.
>>
>> So, for the kernel it's a good idea to set this flag. Some basic
>> micro benchmarks show it can significantly reduce the number of
>> L1 dTLB refills.
>>
>> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
>
> Tested on 4k/3 levels, and the page tables look correct to me
>
> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Thanks,

> You didn't answer my question though, regarding implementing this at
> the PMD level for 16k pages kernels, so that we can use 1 GB chunks
> for mapping system RAM. Would that just be a matter of doing the exact
> same thing at the PMD level? Or is it more complicated than that?

AFAIK, yes. It should be similar with the PMD ranges being unmarked on
permission change or PTE allocation.
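As an illustration of the PMD-level variant Ard asks about: with a 16k granule each PMD entry maps 32M, and the architecture's contiguous hint at that level groups 32 entries, which is the 1 GB range he mentions. The sketch below is standalone C, not kernel code; the constant and helper names (`PMD_SIZE_16K`, `CONT_PMDS`, `CONT_PMD_MASK`, `pmd_range_is_cont`) are illustrative and simply mirror the PTE-level check in the patch.

```c
/*
 * Standalone sketch (not kernel code): the same alignment test the patch
 * applies at the PTE level, moved up to the PMD level for a 16k-granule
 * kernel.  With 16k pages a PMD maps 32M, and the contiguous hint groups
 * 32 PMDs, i.e. 1G per TLB entry.  Names below are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define PMD_SHIFT_16K	25			/* 32M per PMD entry */
#define PMD_SIZE_16K	(1UL << PMD_SHIFT_16K)
#define CONT_PMDS	32			/* 32 x 32M = 1G     */
#define CONT_PMD_SIZE	(CONT_PMDS * PMD_SIZE_16K)
#define CONT_PMD_MASK	(~(CONT_PMD_SIZE - 1))

/* Can [addr, next) backed by phys be marked with the contiguous hint? */
static int pmd_range_is_cont(uint64_t addr, uint64_t next, uint64_t phys)
{
	return ((addr | next | phys) & ~CONT_PMD_MASK) == 0;
}

int main(void)
{
	/* 1G-aligned VA and PA: eligible */
	printf("%d\n", pmd_range_is_cont(0x40000000, 0x80000000, 0x80000000));
	/* PA only 32M-aligned: not eligible */
	printf("%d\n", pmd_range_is_cont(0x40000000, 0x80000000, 0x82000000));
	return 0;
}
```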
On Fri, Feb 19, 2016 at 11:46:23AM -0600, Jeremy Linton wrote:
> With 64k pages, the next larger segment size is 512M. The Linux
> kernel also uses different protection flags to cover its code and data.
> Because of these requirements, the vast majority of the kernel code and
> data structures end up being mapped with 64k pages instead of the larger
> pages common with a 4k page kernel.
>
> Recent ARM processors support a contiguous bit in the
> page tables which allows a TLB to cover a range larger than a
> single PTE if that range is mapped into physically contiguous
> RAM.
>
> So, for the kernel it's a good idea to set this flag. Some basic
> micro benchmarks show it can significantly reduce the number of
> L1 dTLB refills.
>
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 7711554..ab69a99 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1,3 +1,4 @@
> +
>  /*
>   * Based on arch/arm/mm/mmu.c
>   *
> @@ -103,17 +104,49 @@ static void split_pmd(pmd_t *pmd, pte_t *pte)
>  		 * Need to have the least restrictive permissions available
>  		 * permissions will be fixed up later
>  		 */
> -		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
> +		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC_CONT));
>  		pfn++;
>  	} while (pte++, i++, i < PTRS_PER_PTE);
>  }
>
> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> +{
> +	int i;
> +
> +	pte -= CONT_RANGE_OFFSET(addr);
> +	for (i = 0; i < CONT_PTES; i++) {
> +		if (pte_cont(*pte))
> +			set_pte(pte, pte_mknoncont(*pte));
> +		pte++;
> +	}
> +	flush_tlb_all();

Do you still need this invalidation? I thought the tables weren't even
live at this point?

> +}
> +
> +/*
> + * Given a range of PTEs set the pfn and provided page protection flags
> + */
> +static void __populate_init_pte(pte_t *pte, unsigned long addr,
> +				unsigned long end, phys_addr_t phys,
> +				pgprot_t prot)
> +{
> +	unsigned long pfn = __phys_to_pfn(phys);
> +
> +	do {
> +		/* clear all the bits except the pfn, then apply the prot */
> +		set_pte(pte, pfn_pte(pfn, prot));
> +		pte++;
> +		pfn++;
> +		addr += PAGE_SIZE;
> +	} while (addr != end);
> +}
> +
>  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
> -				  unsigned long end, unsigned long pfn,
> +				  unsigned long end, phys_addr_t phys,
>  				  pgprot_t prot,
>  				  phys_addr_t (*pgtable_alloc)(void))
>  {
>  	pte_t *pte;
> +	unsigned long next;
>
>  	if (pmd_none(*pmd) || pmd_sect(*pmd)) {
>  		phys_addr_t pte_phys = pgtable_alloc();
> @@ -127,10 +160,29 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
>  	BUG_ON(pmd_bad(*pmd));
>
>  	pte = pte_set_fixmap_offset(pmd, addr);
> +
>  	do {
> -		set_pte(pte, pfn_pte(pfn, prot));
> -		pfn++;
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
> +		if (((addr | next | phys) & ~CONT_MASK) == 0) {
> +			/* a block of CONT_PTES */
> +			__populate_init_pte(pte, addr, next, phys,
> +					    prot | __pgprot(PTE_CONT));
> +		} else {
> +			/*
> +			 * If the range being split is already inside of a
> +			 * contiguous range but this PTE isn't going to be
> +			 * contiguous, then we want to unmark the adjacent
> +			 * ranges, then update the portion of the range we
> +			 * are interested in.
> +			 */
> +			clear_cont_pte_range(pte, addr);
> +			__populate_init_pte(pte, addr, next, phys, prot);

I don't understand the comment or the code here... the splitting is now
done separately, and I can't think of a scenario where you need to clear
the cont hint explicitly for adjacent ptes.

What am I missing?

Will
On 02/25/2016 10:16 AM, Will Deacon wrote:
> On Fri, Feb 19, 2016 at 11:46:23AM -0600, Jeremy Linton wrote:

(trimming)

>> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
>> +{
>> +	int i;
>> +
>> +	pte -= CONT_RANGE_OFFSET(addr);
>> +	for (i = 0; i < CONT_PTES; i++) {
>> +		if (pte_cont(*pte))
>> +			set_pte(pte, pte_mknoncont(*pte));
>> +		pte++;
>> +	}
>> +	flush_tlb_all();
>
> Do you still need this invalidation? I thought the tables weren't even
> live at this point?

Well it continues to match the calls in alloc_init_p*.

I guess the worry is the extra flush that happens at create_mapping_late(),
if mapping ranges aren't cont aligned? (because the loop won't actually be
doing any set_pte's)

If this and the alloc_init_p* flushes are to be removed, there should
probably be a way to detect any cases where the splits are happening after
the tables have been activated. This might be a little less straightforward
given efi_create_mapping().

>> +}
>> +
>> +/*
>> + * Given a range of PTEs set the pfn and provided page protection flags
>> + */
>> +static void __populate_init_pte(pte_t *pte, unsigned long addr,
>> +				unsigned long end, phys_addr_t phys,
>> +				pgprot_t prot)
>> +{
>> +	unsigned long pfn = __phys_to_pfn(phys);
>> +
>> +	do {
>> +		/* clear all the bits except the pfn, then apply the prot */
>> +		set_pte(pte, pfn_pte(pfn, prot));
>> +		pte++;
>> +		pfn++;
>> +		addr += PAGE_SIZE;
>> +	} while (addr != end);
>> +}
>> +

(trimming)

>> +
>>  	do {
>> -		set_pte(pte, pfn_pte(pfn, prot));
>> -		pfn++;
>> -	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
>> +		if (((addr | next | phys) & ~CONT_MASK) == 0) {
>> +			/* a block of CONT_PTES */
>> +			__populate_init_pte(pte, addr, next, phys,
>> +					    prot | __pgprot(PTE_CONT));
>> +		} else {
>> +			/*
>> +			 * If the range being split is already inside of a
>> +			 * contiguous range but this PTE isn't going to be
>> +			 * contiguous, then we want to unmark the adjacent
>> +			 * ranges, then update the portion of the range we
>> +			 * are interested in.
>> +			 */
>> +			clear_cont_pte_range(pte, addr);
>> +			__populate_init_pte(pte, addr, next, phys, prot);
>
> I don't understand the comment or the code here... the splitting is now
> done separately, and I can't think of a scenario where you need to clear
> the cont hint explicitly for adjacent ptes.

My understanding is that splitting is initially running this code path (via
map_kernel_chunk, then again via create_mapping_late where splits won't
happen). So, split_pmd() is creating cont ranges. When the ranges aren't
sufficiently aligned then this is wiping out the cont mapping immediately
after their creation.
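The point of contention above is that the contiguous hint is only valid if every entry in a naturally aligned group of CONT_PTES carries it, so rewriting any one entry means unmarking the whole group. The standalone model below mirrors what clear_cont_pte_range() does; the CONT_RANGE_OFFSET() definition is assumed (it lives in a companion header patch of this series), and the 4k-granule group size of 16 PTEs is used.

```c
/*
 * Standalone model (not kernel code) of clear_cont_pte_range(): step back
 * to the first entry of the naturally aligned CONT_PTES group containing
 * 'addr' and strip the contiguous hint from every entry in that group.
 * CONT_RANGE_OFFSET() is assumed to be defined as below; 4k-granule
 * values (16 PTEs per 64k group) are used.
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define CONT_PTES	16				/* 16 x 4k = 64k   */
#define PTE_CONT	(UINT64_C(1) << 52)		/* contiguous hint */
#define CONT_RANGE_OFFSET(addr)	(((addr) >> PAGE_SHIFT) & (CONT_PTES - 1))

typedef uint64_t pte_t;

static void clear_cont_range(pte_t *pte, uint64_t addr)
{
	pte -= CONT_RANGE_OFFSET(addr);		/* back to group start */
	for (int i = 0; i < CONT_PTES; i++)
		pte[i] &= ~PTE_CONT;		/* drop the hint */
}

int main(void)
{
	pte_t ptes[32];

	/* pretend the first two 64k groups were mapped contiguous */
	for (int i = 0; i < 32; i++)
		ptes[i] = PTE_CONT;

	/* rewriting the entry for the 5th page unmarks group 0 only */
	clear_cont_range(&ptes[5], 5 * PAGE_SIZE);
	assert(!(ptes[0] & PTE_CONT) && !(ptes[15] & PTE_CONT));
	assert(ptes[16] & PTE_CONT);
	return 0;
}
```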
On Thu, Feb 25, 2016 at 02:46:54PM -0600, Jeremy Linton wrote:
> On 02/25/2016 10:16 AM, Will Deacon wrote:
> >On Fri, Feb 19, 2016 at 11:46:23AM -0600, Jeremy Linton wrote:
>
> (trimming)
>
> >>+static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> >>+{
> >>+	int i;
> >>+
> >>+	pte -= CONT_RANGE_OFFSET(addr);
> >>+	for (i = 0; i < CONT_PTES; i++) {
> >>+		if (pte_cont(*pte))
> >>+			set_pte(pte, pte_mknoncont(*pte));
> >>+		pte++;
> >>+	}
> >>+	flush_tlb_all();
> >
> >Do you still need this invalidation? I thought the tables weren't even
> >live at this point?
>
> Well it continues to match the calls in alloc_init_p*.

Ok, but if it's not needed (and I don't think it is), then we should
remove the invalidation from there too rather than add more of it.

> I guess the worry is the extra flush that happens at create_mapping_late(),
> if mapping ranges aren't cont aligned? (because the loop won't actually be
> doing any set_pte's)

I'm just concerned with trying to make this code understandable! I doubt
there's a performance argument to be made.

> If this and the alloc_init_p* flushes are to be removed, there should
> probably be a way to detect any cases where the splits are happening after
> the tables have been activated. This might be a little less straightforward
> given efi_create_mapping().

I think that's a separate issue, since splitting a live page table is
dodgy regardless of the flushing. But yes, it would be nice to detect
that case and scream about it.

> >>+}
> >>+
> >>+/*
> >>+ * Given a range of PTEs set the pfn and provided page protection flags
> >>+ */
> >>+static void __populate_init_pte(pte_t *pte, unsigned long addr,
> >>+				unsigned long end, phys_addr_t phys,
> >>+				pgprot_t prot)
> >>+{
> >>+	unsigned long pfn = __phys_to_pfn(phys);
> >>+
> >>+	do {
> >>+		/* clear all the bits except the pfn, then apply the prot */
> >>+		set_pte(pte, pfn_pte(pfn, prot));
> >>+		pte++;
> >>+		pfn++;
> >>+		addr += PAGE_SIZE;
> >>+	} while (addr != end);
> >>+}
> >>+
> (trimming)
> >>+
> >>  	do {
> >>-		set_pte(pte, pfn_pte(pfn, prot));
> >>-		pfn++;
> >>-	} while (pte++, addr += PAGE_SIZE, addr != end);
> >>+		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
> >>+		if (((addr | next | phys) & ~CONT_MASK) == 0) {
> >>+			/* a block of CONT_PTES */
> >>+			__populate_init_pte(pte, addr, next, phys,
> >>+					    prot | __pgprot(PTE_CONT));
> >>+		} else {
> >>+			/*
> >>+			 * If the range being split is already inside of a
> >>+			 * contiguous range but this PTE isn't going to be
> >>+			 * contiguous, then we want to unmark the adjacent
> >>+			 * ranges, then update the portion of the range we
> >>+			 * are interested in.
> >>+			 */
> >>+			clear_cont_pte_range(pte, addr);
> >>+			__populate_init_pte(pte, addr, next, phys, prot);
> >
> >I don't understand the comment or the code here... the splitting is now
> >done separately, and I can't think of a scenario where you need to clear
> >the cont hint explicitly for adjacent ptes.
>
> My understanding is that splitting is initially running this code path (via
> map_kernel_chunk, then again via create_mapping_late where splits won't
> happen). So, split_pmd() is creating cont ranges. When the ranges aren't
> sufficiently aligned then this is wiping out the cont mapping immediately
> after their creation.

Gotcha, thanks for the explanation (I somehow overlooked the initial page
table creation).

Will
On Fri, Feb 26, 2016 at 05:28:26PM +0000, Will Deacon wrote:
> On Thu, Feb 25, 2016 at 02:46:54PM -0600, Jeremy Linton wrote:
> > On 02/25/2016 10:16 AM, Will Deacon wrote:
> > >On Fri, Feb 19, 2016 at 11:46:23AM -0600, Jeremy Linton wrote:
> > >>+static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> > >>+{
> > >>+	int i;
> > >>+
> > >>+	pte -= CONT_RANGE_OFFSET(addr);
> > >>+	for (i = 0; i < CONT_PTES; i++) {
> > >>+		if (pte_cont(*pte))
> > >>+			set_pte(pte, pte_mknoncont(*pte));
> > >>+		pte++;
> > >>+	}
> > >>+	flush_tlb_all();
> > >
> > >Do you still need this invalidation? I thought the tables weren't even
> > >live at this point?
> >
> > Well it continues to match the calls in alloc_init_p*.
>
> Ok, but if it's not needed (and I don't think it is), then we should
> remove the invalidation from there too rather than add more of it.

I agree. Talking to Mark R, I think all the create_pgd_mapping code
should be trimmed of p*d splitting entirely (with some warnings left in
place).

The efi_create_mapping() case is done on efi_mm anyway with a low
address and non-active TTBR0_EL1. As Mark suggested, dropping p*d
splitting should work for this case as well by ignoring the next
consecutive mapping if it overlaps, since it's a linear mapping
(possibly some permissions may need to be upgraded). In this case, the
TLBI would be done by efi_create_mapping() or even skipped altogether,
assuming that no page tables are changed (only going from invalid to
valid).

My preference, however, would be for efi_create_mapping() to use a
different mechanism from create_pgd_mapping() (aimed at very early
memmap), like remap_pfn_range(), but I haven't looked in detail at
whether this is feasible.

Once we have this clean-up in place, we can rebase the contiguous PTE
patches on top and remove unnecessary TLBI.
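Both the "scream about it" suggestion and the "warnings left in place" plan amount to the same policy: refuse to split a block mapping once the tables are live, and warn if anyone tries. The following is a standalone sketch of that policy only; `tables_are_live`, the entry layout and `create_pte_mapping()` are invented for the illustration and are not the arm64 implementation.

```c
/*
 * Standalone sketch of the "warn instead of splitting a live mapping"
 * policy discussed above.  Everything here (tables_are_live,
 * struct pmd_entry, create_pte_mapping) is invented for illustration;
 * it is not the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

static bool tables_are_live;	/* set once the new page tables are in use */

struct pmd_entry {
	bool is_section;	/* block mapping covering the whole PMD range */
	bool valid;
};

/* Populate PTEs under @pmd; refuse to split a section once tables are live. */
static int create_pte_mapping(struct pmd_entry *pmd)
{
	if (pmd->valid && pmd->is_section) {
		if (tables_are_live) {
			fprintf(stderr,
				"WARN: refusing to split live section mapping\n");
			return -1;
		}
		pmd->is_section = false;	/* split: section -> table */
	}
	pmd->valid = true;
	/* ...a real implementation would allocate and fill a PTE table... */
	return 0;
}

int main(void)
{
	struct pmd_entry pmd = { .is_section = true, .valid = true };

	create_pte_mapping(&pmd);	/* early: split is allowed   */
	tables_are_live = true;
	pmd.is_section = true;
	create_pte_mapping(&pmd);	/* late: warns and bails out */
	return 0;
}
```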
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7711554..ab69a99 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1,3 +1,4 @@
+
 /*
  * Based on arch/arm/mm/mmu.c
  *
@@ -103,17 +104,49 @@ static void split_pmd(pmd_t *pmd, pte_t *pte)
 		 * Need to have the least restrictive permissions available
 		 * permissions will be fixed up later
 		 */
-		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+		set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC_CONT));
 		pfn++;
 	} while (pte++, i++, i < PTRS_PER_PTE);
 }
 
+static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
+{
+	int i;
+
+	pte -= CONT_RANGE_OFFSET(addr);
+	for (i = 0; i < CONT_PTES; i++) {
+		if (pte_cont(*pte))
+			set_pte(pte, pte_mknoncont(*pte));
+		pte++;
+	}
+	flush_tlb_all();
+}
+
+/*
+ * Given a range of PTEs set the pfn and provided page protection flags
+ */
+static void __populate_init_pte(pte_t *pte, unsigned long addr,
+				unsigned long end, phys_addr_t phys,
+				pgprot_t prot)
+{
+	unsigned long pfn = __phys_to_pfn(phys);
+
+	do {
+		/* clear all the bits except the pfn, then apply the prot */
+		set_pte(pte, pfn_pte(pfn, prot));
+		pte++;
+		pfn++;
+		addr += PAGE_SIZE;
+	} while (addr != end);
+}
+
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
-				  unsigned long end, unsigned long pfn,
+				  unsigned long end, phys_addr_t phys,
 				  pgprot_t prot,
 				  phys_addr_t (*pgtable_alloc)(void))
 {
 	pte_t *pte;
+	unsigned long next;
 
 	if (pmd_none(*pmd) || pmd_sect(*pmd)) {
 		phys_addr_t pte_phys = pgtable_alloc();
@@ -127,10 +160,29 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 	BUG_ON(pmd_bad(*pmd));
 
 	pte = pte_set_fixmap_offset(pmd, addr);
+
 	do {
-		set_pte(pte, pfn_pte(pfn, prot));
-		pfn++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
+		if (((addr | next | phys) & ~CONT_MASK) == 0) {
+			/* a block of CONT_PTES */
+			__populate_init_pte(pte, addr, next, phys,
+					    prot | __pgprot(PTE_CONT));
+		} else {
+			/*
+			 * If the range being split is already inside of a
+			 * contiguous range but this PTE isn't going to be
+			 * contiguous, then we want to unmark the adjacent
+			 * ranges, then update the portion of the range we
+			 * are interested in.
+			 */
+			clear_cont_pte_range(pte, addr);
+			__populate_init_pte(pte, addr, next, phys, prot);
+		}
+
+		pte += (next - addr) >> PAGE_SHIFT;
+		phys += next - addr;
+		addr = next;
+	} while (addr != end);
 
 	pte_clear_fixmap();
 }
@@ -194,7 +246,7 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 			}
 		}
 	} else {
-		alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
+		alloc_init_pte(pmd, addr, next, phys,
 			       prot, pgtable_alloc);
 	}
 	phys += next - addr;
With 64k pages, the next larger segment size is 512M. The Linux
kernel also uses different protection flags to cover its code and data.
Because of these requirements, the vast majority of the kernel code and
data structures end up being mapped with 64k pages instead of the larger
pages common with a 4k page kernel.

Recent ARM processors support a contiguous bit in the
page tables which allows a TLB to cover a range larger than a
single PTE if that range is mapped into physically contiguous
RAM.

So, for the kernel it's a good idea to set this flag. Some basic
micro benchmarks show it can significantly reduce the number of
L1 dTLB refills.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
 arch/arm64/mm/mmu.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 6 deletions(-)
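The core of the patch is the per-chunk test in alloc_init_pte(): carve [addr, end) at CONT_SIZE boundaries and apply PTE_CONT only when the virtual range and the backing physical address are all naturally aligned. The standalone walk below restates that test with the 4k-granule values (CONT_PTES = 16, so CONT_SIZE = 64k); it prints the decision for each chunk rather than writing page-table entries.

```c
/*
 * Standalone walk of the alignment test used in alloc_init_pte(): split
 * [addr, end) at CONT_SIZE boundaries and report which chunks are
 * eligible for the contiguous hint.  4k-granule values are used
 * (CONT_PTES = 16, CONT_SIZE = 64k).
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define CONT_PTES	16
#define CONT_SIZE	(CONT_PTES << PAGE_SHIFT)	/* 64k */
#define CONT_MASK	(~((uint64_t)CONT_SIZE - 1))

#define min(a, b)	((a) < (b) ? (a) : (b))

static void walk(uint64_t addr, uint64_t end, uint64_t phys)
{
	uint64_t next;

	do {
		/* end of this chunk: next 64k boundary, capped at 'end' */
		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
		if (((addr | next | phys) & ~CONT_MASK) == 0)
			printf("0x%" PRIx64 "-0x%" PRIx64 ": PTE_CONT\n", addr, next);
		else
			printf("0x%" PRIx64 "-0x%" PRIx64 ": plain PTEs\n", addr, next);
		phys += next - addr;
		addr = next;
	} while (addr != end);
}

int main(void)
{
	/* 64k-aligned VA and PA: every chunk gets the hint */
	walk(0xffff000000200000, 0xffff000000220000, 0x80200000);
	/* PA offset by one page: no chunk is eligible */
	walk(0xffff000000200000, 0xffff000000220000, 0x80201000);
	return 0;
}
```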