[v9,00/21] Generic page walk and ptdump

Message ID	20190722154210.42799-1-steven.price@arm.com (mailing list archive)
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Steven Price <steven.price@arm.com> To: linux-mm@kvack.org Subject: [PATCH v9 00/21] Generic page walk and ptdump Date: Mon, 22 Jul 2019 16:41:49 +0100 Message-Id: <20190722154210.42799-1-steven.price@arm.com> MIME-Version: 1.0 Precedence: list Cc: Mark Rutland <Mark.Rutland@arm.com>, x86@kernel.org, Arnd Bergmann <arnd@arndb.de>, Ard Biesheuvel <ard.biesheuvel@linaro.org>, Peter Zijlstra <peterz@infradead.org>, Catalin Marinas <catalin.marinas@arm.com>, Dave Hansen <dave.hansen@linux.intel.com>, linux-kernel@vger.kernel.org, Steven Price <steven.price@arm.com>, =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>, James Morse <james.morse@arm.com>, Thomas Gleixner <tglx@linutronix.de>, Will Deacon <will@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, linux-arm-kernel@lists.infradead.org, "Liang, Kan" <kan.liang@linux.intel.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	Generic page walk and ptdump \| expand [v9,00/21] Generic page walk and ptdump [v9,01/21] arc: mm: Add p?d_leaf() definitions [v9,02/21] arm: mm: Add p?d_leaf() definitions [v9,03/21] arm64: mm: Add p?d_leaf() definitions [v9,04/21] mips: mm: Add p?d_leaf() definitions [v9,05/21] powerpc: mm: Add p?d_leaf() definitions [v9,06/21] riscv: mm: Add p?d_leaf() definitions [v9,07/21] s390: mm: Add p?d_leaf() definitions [v9,08/21] sparc: mm: Add p?d_leaf() definitions [v9,09/21] x86: mm: Add p?d_leaf() definitions [v9,10/21] mm: Add generic p?d_leaf() macros [v9,11/21] mm: pagewalk: Add p4d_entry() and pgd_entry() [v9,12/21] mm: pagewalk: Allow walking without vma [v9,13/21] mm: pagewalk: Add test_p?d callbacks [v9,14/21] x86: mm: Don't display pages which aren't present in debugfs [v9,15/21] x86: mm: Point to struct seq_file from struct pg_state [v9,16/21] x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct [v9,17/21] x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct [v9,18/21] x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct [v9,19/21] mm: Add generic ptdump [v9,20/21] x86: mm: Convert dump_pagetables to use walk_page_range [v9,21/21] arm64: mm: Convert mm/dump.c to use walk_page_range()

Steven Price July 22, 2019, 3:41 p.m. UTC

This is a slight reworking and extension of my previous patch set
(Convert x86 & arm64 to use generic page walk), but I've continued the
version numbering as most of the changes are the same. In particular
this series ends with a generic PTDUMP implemention for arm64 and x86.

Many architectures current have a debugfs file for dumping the kernel
page tables. Currently each architecture has to implement custom
functions for this because the details of walking the page tables used
by the kernel are different between architectures.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). A generic PTDUMP
implementation is the implemented making use of the new functionality of
walk_page_range() and finally arm64 and x86 are switch to using it,
removing the custom table walkers.

To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached the leaf entry. After a suggestion from Will
Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
the purpose (and is a new name so has no historic baggage). Some
architectures have p?d_large macros but this is easily confused with
"large pages".

Mostly this is a clean up and there should be very little functional
change. The exceptions are:

* x86 PTDUMP debugfs output no longer display pages which aren't
  present (patch 14).

* arm64 has the ability to efficiently process KASAN pages (which
  previously only x86 implemented). This means that the combination of
  KASAN and DEBUG_WX is now useable.

Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v9

Changes since v8:
https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/
 * Rename from p?d_large() to p?d_leaf()
 * Dropped patches migrating arm64/x86 custom walkers to
   walk_page_range() in favour of adding a generic PTDUMP implementation
   and migrating arm64/x86 to that instead.
 * Rebased to v5.3-rc1

Steven Price (21):
  arc: mm: Add p?d_leaf() definitions
  arm: mm: Add p?d_leaf() definitions
  arm64: mm: Add p?d_leaf() definitions
  mips: mm: Add p?d_leaf() definitions
  powerpc: mm: Add p?d_leaf() definitions
  riscv: mm: Add p?d_leaf() definitions
  s390: mm: Add p?d_leaf() definitions
  sparc: mm: Add p?d_leaf() definitions
  x86: mm: Add p?d_leaf() definitions
  mm: Add generic p?d_leaf() macros
  mm: pagewalk: Add p4d_entry() and pgd_entry()
  mm: pagewalk: Allow walking without vma
  mm: pagewalk: Add test_p?d callbacks
  x86: mm: Don't display pages which aren't present in debugfs
  x86: mm: Point to struct seq_file from struct pg_state
  x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
  mm: Add generic ptdump
  x86: mm: Convert dump_pagetables to use walk_page_range
  arm64: mm: Convert mm/dump.c to use walk_page_range()

 arch/arc/include/asm/pgtable.h               |   1 +
 arch/arm/include/asm/pgtable-2level.h        |   1 +
 arch/arm/include/asm/pgtable-3level.h        |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/Kconfig.debug                     |  19 +-
 arch/arm64/include/asm/pgtable.h             |   2 +
 arch/arm64/include/asm/ptdump.h              |   8 +-
 arch/arm64/mm/Makefile                       |   4 +-
 arch/arm64/mm/dump.c                         | 117 +++----
 arch/arm64/mm/ptdump_debugfs.c               |   2 +-
 arch/mips/include/asm/pgtable-64.h           |   8 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  30 +-
 arch/riscv/include/asm/pgtable-64.h          |   7 +
 arch/riscv/include/asm/pgtable.h             |   7 +
 arch/s390/include/asm/pgtable.h              |   2 +
 arch/sparc/include/asm/pgtable_64.h          |   2 +
 arch/x86/Kconfig                             |   1 +
 arch/x86/Kconfig.debug                       |  20 +-
 arch/x86/include/asm/pgtable.h               |  10 +-
 arch/x86/mm/Makefile                         |   4 +-
 arch/x86/mm/debug_pagetables.c               |   8 +-
 arch/x86/mm/dump_pagetables.c                | 339 +++++--------------
 arch/x86/platform/efi/efi_32.c               |   2 +-
 arch/x86/platform/efi/efi_64.c               |   4 +-
 drivers/firmware/efi/arm-runtime.c           |   2 +-
 include/asm-generic/pgtable.h                |  19 ++
 include/linux/mm.h                           |  26 +-
 include/linux/ptdump.h                       |  19 ++
 mm/Kconfig.debug                             |  21 ++
 mm/Makefile                                  |   1 +
 mm/pagewalk.c                                |  76 +++--
 mm/ptdump.c                                  | 161 +++++++++
 32 files changed, 507 insertions(+), 418 deletions(-)
 create mode 100644 include/linux/ptdump.h
 create mode 100644 mm/ptdump.c

Anshuman Khandual July 23, 2019, 6:39 a.m. UTC | #1

Hello Steven,

On 07/22/2019 09:11 PM, Steven Price wrote:
> This is a slight reworking and extension of my previous patch set
> (Convert x86 & arm64 to use generic page walk), but I've continued the
> version numbering as most of the changes are the same. In particular
> this series ends with a generic PTDUMP implemention for arm64 and x86.
> 
> Many architectures current have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
> 
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is the implemented making use of the new functionality of
> walk_page_range() and finally arm64 and x86 are switch to using it,
> removing the custom table walkers.

Could other architectures just enable this new generic PTDUMP feature if
required without much problem ?

> 
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".

I have not been following the previous version of the series closely, hence
might be missing something here. But p?d_large() which identifies large
mappings on a given level can only signify a leaf entry. Large pages on the
table exist only as leaf entries. So what is the problem for it being used
directly instead. Is there any possibility in the kernel mapping when these
large pages are not leaf entries ?

> 
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
> 
> * x86 PTDUMP debugfs output no longer display pages which aren't
>   present (patch 14).

Hmm, kernel mappings pages which are not present! which ones are those ?
Just curious.

> 
> * arm64 has the ability to efficiently process KASAN pages (which
>   previously only x86 implemented). This means that the combination of
>   KASAN and DEBUG_WX is now useable.
> 
> Also available as a git tree:
> git://linux-arm.org/linux-sp.git walk_page_range/v9
> 
> Changes since v8:
> https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/
>  * Rename from p?d_large() to p?d_leaf()

As mentioned before wondering if this is actually required or it is even a
good idea to introduce something like this which expands page table helper
semantics scope further in generic MM.

>  * Dropped patches migrating arm64/x86 custom walkers to
>    walk_page_range() in favour of adding a generic PTDUMP implementation
>    and migrating arm64/x86 to that instead.
>  * Rebased to v5.3-rc1

Creating a generic PTDUMP implementation is definitely a better idea.

Mark Rutland July 23, 2019, 10:16 a.m. UTC | #2

On Mon, Jul 22, 2019 at 04:41:49PM +0100, Steven Price wrote:
> This is a slight reworking and extension of my previous patch set
> (Convert x86 & arm64 to use generic page walk), but I've continued the
> version numbering as most of the changes are the same. In particular
> this series ends with a generic PTDUMP implemention for arm64 and x86.
> 
> Many architectures current have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
> 
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is the implemented making use of the new functionality of
> walk_page_range() and finally arm64 and x86 are switch to using it,
> removing the custom table walkers.
> 
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".
> 
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
> 
> * x86 PTDUMP debugfs output no longer display pages which aren't
>   present (patch 14).
> 
> * arm64 has the ability to efficiently process KASAN pages (which
>   previously only x86 implemented). This means that the combination of
>   KASAN and DEBUG_WX is now useable.

Are there any visible changes to the arm64 output?

Could you dump a before/after example somewhere?

Thanks,
Mark.

Steven Price July 24, 2019, 1:35 p.m. UTC | #3

On 23/07/2019 11:16, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:41:49PM +0100, Steven Price wrote:
>> This is a slight reworking and extension of my previous patch set
>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>> version numbering as most of the changes are the same. In particular
>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>
>> Many architectures current have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>>
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is the implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switch to using it,
>> removing the custom table walkers.
>>
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
>>
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>>
>> * x86 PTDUMP debugfs output no longer display pages which aren't
>>   present (patch 14).
>>
>> * arm64 has the ability to efficiently process KASAN pages (which
>>   previously only x86 implemented). This means that the combination of
>>   KASAN and DEBUG_WX is now useable.
> 
> Are there any visible changes to the arm64 output?

arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
identical on a Juno before/after the change. "kernel_page_tables"
obviously will vary depending on the exact layout of memory, but the
format isn't changed.

x86 output does change due to patch 14. In this case the change is
removing the lines from the output of the form...

> 0xffffffff84800000-0xffffffffa0000000         440M                               pmd

...which are unpopulated areas of the memory map. Populated lines which
have attributes are unchanged.

Steve

Steven Price July 24, 2019, 1:35 p.m. UTC | #4

On 23/07/2019 07:39, Anshuman Khandual wrote:
> Hello Steven,
> 
> On 07/22/2019 09:11 PM, Steven Price wrote:
>> This is a slight reworking and extension of my previous patch set
>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>> version numbering as most of the changes are the same. In particular
>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>
>> Many architectures current have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>>
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is the implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switch to using it,
>> removing the custom table walkers.
> 
> Could other architectures just enable this new generic PTDUMP feature if
> required without much problem ?

The generic PTDUMP is implemented as a library - so the architectures
would have to provide the call into ptdump_walk_pgd() and provide the
necessary callback note_page() which formats the lines in the output.

Hopefully the implementation is generic enough that it should be
flexible enough to work for most architectures.

arm, powerpc and s390 are the obvious architectures to convert next as
they already have note_page() functions which shouldn't be too difficult
to convert to match the callback prototype.

>>
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
> 
> I have not been following the previous version of the series closely, hence
> might be missing something here. But p?d_large() which identifies large
> mappings on a given level can only signify a leaf entry. Large pages on the
> table exist only as leaf entries. So what is the problem for it being used
> directly instead. Is there any possibility in the kernel mapping when these
> large pages are not leaf entries ?

There isn't any problem as such with using p?d_large macros. However the
name "large" has caused confusion in the past. In particular there are
two types of "large" page:

1. leaf entries at high levels than normal ('sections' on Arm, for 4K
pages this gives you 2MB and 1GB pages).

2. sets of contiguous entries that can share a TLB entry (the
'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
KB 'pages').

In many cases both give the same effect (reduce pressure on TLBs and
requires contiguous and aligned physical addresses). But for this case
we only care about the 'leaf' case (because the contiguous bit makes no
difference to walking the page tables).

As far as I'm aware p?d_large() currently implements the first and
p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.

Will[1] suggested the same p?d_leaf() and this also avoids stepping on
the existing usage of p?d_large() which isn't always available on every
architecture.

[1]
https://lore.kernel.org/linux-mm/20190701101510.qup3nd6vm6cbdgjv@willie-the-truck/

>>
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>>
>> * x86 PTDUMP debugfs output no longer display pages which aren't
>>   present (patch 14).
> 
> Hmm, kernel mappings pages which are not present! which ones are those ?
> Just curious.

x86 currently outputs a line for every range, including those which are
unpopulated. Patch 14 removes those lines from the output, only showing
the populated ranges. This was discussed[2] previously.

[2]
https://lore.kernel.org/lkml/26df02dd-c54e-ea91-bdd1-0a4aad3a30ac@arm.com/

>>
>> * arm64 has the ability to efficiently process KASAN pages (which
>>   previously only x86 implemented). This means that the combination of
>>   KASAN and DEBUG_WX is now useable.
>>
>> Also available as a git tree:
>> git://linux-arm.org/linux-sp.git walk_page_range/v9
>>
>> Changes since v8:
>> https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/
>>  * Rename from p?d_large() to p?d_leaf()
> 
> As mentioned before wondering if this is actually required or it is even a
> good idea to introduce something like this which expands page table helper
> semantics scope further in generic MM.
> 
>>  * Dropped patches migrating arm64/x86 custom walkers to
>>    walk_page_range() in favour of adding a generic PTDUMP implementation
>>    and migrating arm64/x86 to that instead.
>>  * Rebased to v5.3-rc1
> 
> Creating a generic PTDUMP implementation is definitely a better idea.

Yes, that was always where I was heading. But I initially thought it
would be easier to get the generic walking code in, followed by
implementing generic PTDUMP. But it turns out the generic PTDUMP is
actually the easy bit :)

Steve

Thomas Gleixner July 24, 2019, 1:57 p.m. UTC | #5

On Wed, 24 Jul 2019, Steven Price wrote:
> On 23/07/2019 11:16, Mark Rutland wrote:
> > Are there any visible changes to the arm64 output?
> 
> arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
> identical on a Juno before/after the change. "kernel_page_tables"
> obviously will vary depending on the exact layout of memory, but the
> format isn't changed.
> 
> x86 output does change due to patch 14. In this case the change is
> removing the lines from the output of the form...
> 
> > 0xffffffff84800000-0xffffffffa0000000         440M                               pmd
> 
> ...which are unpopulated areas of the memory map. Populated lines which
> have attributes are unchanged.

Having the hole size and the level in the dump is a very conveniant thing.

Right now we have:

0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
0xffffffffc042e000-0xffffffffc042f000           4K                               pte
0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
0xffffffffc0433000-0xffffffffc0434000           4K                               pte
0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
0xffffffffc043a000-0xffffffffc043f000          20K                               pte
0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
0xffffffffc0449000-0xffffffffc044f000          24K                               pte
0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
0xffffffffc0453000-0xffffffffc0458000          20K                               pte
0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte

with your change this becomes:

0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte

which is 5 lines less, but a pain to figure out the size of the holes. And
it becomes even more painful when the holes go across different mapping
levels.

From your 14/N changelog:

> This keeps the output shorter and will help with a future change

I don't care about shorter at all. It's debug information.

> switching to using the generic page walk code as we no longer care about
> the 'level' that the page table holes are at.

I really do not understand why you think that WE no longer care about the
level (and the size) of the holes. I assume that WE is pluralis majestatis
and not meant to reflect the opinion of you and everyone else.

I have no idea whether you ever had to do serious work with PT dump, but I
surely have at various occasions including the PTI mess and I definitely
found the size and the level information from holes very useful.

Thanks,

	tglx

Mark Rutland July 24, 2019, 2:07 p.m. UTC | #6

On Wed, Jul 24, 2019 at 03:57:33PM +0200, Thomas Gleixner wrote:
> On Wed, 24 Jul 2019, Steven Price wrote:
> > On 23/07/2019 11:16, Mark Rutland wrote:
> > > Are there any visible changes to the arm64 output?
> > 
> > arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
> > identical on a Juno before/after the change. "kernel_page_tables"
> > obviously will vary depending on the exact layout of memory, but the
> > format isn't changed.
> > 
> > x86 output does change due to patch 14. In this case the change is
> > removing the lines from the output of the form...
> > 
> > > 0xffffffff84800000-0xffffffffa0000000         440M                               pmd
> > 
> > ...which are unpopulated areas of the memory map. Populated lines which
> > have attributes are unchanged.
> 
> Having the hole size and the level in the dump is a very conveniant thing.

Mhmm; I thought that we logged which level was empty on arm64 (but
apparently not), since knowing the structure can be important.

> Right now we have:
> 
> 0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
> 0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
> 0xffffffffc042e000-0xffffffffc042f000           4K                               pte
> 0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
> 0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
> 0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
> 0xffffffffc0433000-0xffffffffc0434000           4K                               pte
> 0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
> 0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
> 0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
> 0xffffffffc043a000-0xffffffffc043f000          20K                               pte
> 0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
> 0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
> 0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
> 0xffffffffc0449000-0xffffffffc044f000          24K                               pte
> 0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
> 0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
> 0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
> 0xffffffffc0453000-0xffffffffc0458000          20K                               pte
> 0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
> 0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte
> 
> with your change this becomes:
> 
> 0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
> 0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
> 0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
> 0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
> 0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
> 0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
> 0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
> 0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
> 0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
> 0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
> 0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
> 0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
> 0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
> 0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
> 0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
> 0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte
> 
> which is 5 lines less, but a pain to figure out the size of the holes. And
> it becomes even more painful when the holes go across different mapping
> levels.

I agree.

Steven, could you align arm64 with the x86 behaviour here?

Thanks,
Mark.

Steven Price July 24, 2019, 2:18 p.m. UTC | #7

On 24/07/2019 14:57, Thomas Gleixner wrote:
> On Wed, 24 Jul 2019, Steven Price wrote:
>> On 23/07/2019 11:16, Mark Rutland wrote:
>>> Are there any visible changes to the arm64 output?
>>
>> arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
>> identical on a Juno before/after the change. "kernel_page_tables"
>> obviously will vary depending on the exact layout of memory, but the
>> format isn't changed.
>>
>> x86 output does change due to patch 14. In this case the change is
>> removing the lines from the output of the form...
>>
>>> 0xffffffff84800000-0xffffffffa0000000         440M                               pmd
>>
>> ...which are unpopulated areas of the memory map. Populated lines which
>> have attributes are unchanged.
> 
> Having the hole size and the level in the dump is a very conveniant thing.
> 
> Right now we have:
> 
> 0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
> 0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
> 0xffffffffc042e000-0xffffffffc042f000           4K                               pte
> 0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
> 0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
> 0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
> 0xffffffffc0433000-0xffffffffc0434000           4K                               pte
> 0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
> 0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
> 0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
> 0xffffffffc043a000-0xffffffffc043f000          20K                               pte
> 0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
> 0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
> 0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
> 0xffffffffc0449000-0xffffffffc044f000          24K                               pte
> 0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
> 0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
> 0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
> 0xffffffffc0453000-0xffffffffc0458000          20K                               pte
> 0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
> 0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte
> 
> with your change this becomes:
> 
> 0xffffffffc0427000-0xffffffffc042b000          16K     ro                     NX pte
> 0xffffffffc042b000-0xffffffffc042e000          12K     RW                     NX pte
> 0xffffffffc042f000-0xffffffffc0430000           4K     ro                     x  pte
> 0xffffffffc0430000-0xffffffffc0431000           4K     ro                     NX pte
> 0xffffffffc0431000-0xffffffffc0433000           8K     RW                     NX pte
> 0xffffffffc0434000-0xffffffffc0436000           8K     ro                     x  pte
> 0xffffffffc0436000-0xffffffffc0438000           8K     ro                     NX pte
> 0xffffffffc0438000-0xffffffffc043a000           8K     RW                     NX pte
> 0xffffffffc043f000-0xffffffffc0444000          20K     ro                     x  pte
> 0xffffffffc0444000-0xffffffffc0447000          12K     ro                     NX pte
> 0xffffffffc0447000-0xffffffffc0449000           8K     RW                     NX pte
> 0xffffffffc044f000-0xffffffffc0450000           4K     ro                     x  pte
> 0xffffffffc0450000-0xffffffffc0451000           4K     ro                     NX pte
> 0xffffffffc0451000-0xffffffffc0453000           8K     RW                     NX pte
> 0xffffffffc0458000-0xffffffffc0459000           4K     ro                     x  pte
> 0xffffffffc0459000-0xffffffffc045b000           8K     ro                     NX pte
> 
> which is 5 lines less, but a pain to figure out the size of the holes. And
> it becomes even more painful when the holes go across different mapping
> levels.
> 
> From your 14/N changelog:
> 
>> This keeps the output shorter and will help with a future change
> 
> I don't care about shorter at all. It's debug information.

Sorry, the "shorter" part was because Dave Hansen originally said[1]:
> I think I'd actually be OK with the holes just not showing up.  I
> actually find it kinda hard to read sometimes with the holes in there.
> I'd be curious what others think though.

[1]
https://lore.kernel.org/lkml/5f354bf5-4ac8-d0e2-048c-0857c91a21e6@intel.com/

And I'd abbreviated "holes not showing up" as "shorter" in the commit
message - not the best wording I agree.

>> switching to using the generic page walk code as we no longer care about
>> the 'level' that the page table holes are at.
> 
> I really do not understand why you think that WE no longer care about the
> level (and the size) of the holes. I assume that WE is pluralis majestatis
> and not meant to reflect the opinion of you and everyone else.

Again, I apologise - that was sloppy wording in the commit message. By
"we" I meant the code not any particular person. In my original patch[2]
the only use of the 'depth' argument to pte_hole was to report the level
for these debug lines. Removing those lines simplified the code and at
the time nobody raised any objections.

[2]
https://lore.kernel.org/lkml/20190227170608.27963-28-steven.price@arm.com/

> I have no idea whether you ever had to do serious work with PT dump, but I
> surely have at various occasions including the PTI mess and I definitely
> found the size and the level information from holes very useful.

On arm64 we don't have those lines, but equally it's possible they might
be useful in the future. So this might be something to add.

As I said in a previous email[3] I was dropping the lines from the
output assuming nobody had any objections. Since you find these lines
useful, I'll see about reworking the change to retain the lines.

Steve

[3]
https://lore.kernel.org/lkml/26df02dd-c54e-ea91-bdd1-0a4aad3a30ac@arm.com/

Thomas Gleixner July 24, 2019, 2:37 p.m. UTC | #8

On Wed, 24 Jul 2019, Steven Price wrote:
> On 24/07/2019 14:57, Thomas Gleixner wrote:
> > From your 14/N changelog:
> > 
> >> This keeps the output shorter and will help with a future change
> > 
> > I don't care about shorter at all. It's debug information.
> 
> Sorry, the "shorter" part was because Dave Hansen originally said[1]:
> > I think I'd actually be OK with the holes just not showing up.  I
> > actually find it kinda hard to read sometimes with the holes in there.
> > I'd be curious what others think though.

I missed that otherwise I'd have disagreed right away.

> > I really do not understand why you think that WE no longer care about the
> > level (and the size) of the holes. I assume that WE is pluralis majestatis
> > and not meant to reflect the opinion of you and everyone else.
> 
> Again, I apologise - that was sloppy wording in the commit message. By
> "we" I meant the code not any particular person.

Nothing to apologize. Common mistake of trying to impersonate code. That
always reads wrong :)

> > I have no idea whether you ever had to do serious work with PT dump, but I
> > surely have at various occasions including the PTI mess and I definitely
> > found the size and the level information from holes very useful.
> 
> On arm64 we don't have those lines, but equally it's possible they might
> be useful in the future. So this might be something to add.
> 
> As I said in a previous email[3] I was dropping the lines from the
> output assuming nobody had any objections. Since you find these lines
> useful, I'll see about reworking the change to retain the lines.

That would be great and as I saw in the other mail, Mark wants to have them
as well :)

That reminds me, that I had a patch when dealing with L1TF which printed
the PFNs so I could verify that the mitigations do what they are supposed
to do, but that patch got obviously lost somewhere down the road.

Thanks,

	tglx

Anshuman Khandual July 25, 2019, 9:09 a.m. UTC | #9

On 07/24/2019 07:05 PM, Steven Price wrote:
> On 23/07/2019 07:39, Anshuman Khandual wrote:
>> Hello Steven,
>>
>> On 07/22/2019 09:11 PM, Steven Price wrote:
>>> This is a slight reworking and extension of my previous patch set
>>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>>> version numbering as most of the changes are the same. In particular
>>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>>
>>> Many architectures current have a debugfs file for dumping the kernel
>>> page tables. Currently each architecture has to implement custom
>>> functions for this because the details of walking the page tables used
>>> by the kernel are different between architectures.
>>>
>>> This series extends the capabilities of walk_page_range() so that it can
>>> deal with the page tables of the kernel (which have no VMAs and can
>>> contain larger huge pages than exist for user space). A generic PTDUMP
>>> implementation is the implemented making use of the new functionality of
>>> walk_page_range() and finally arm64 and x86 are switch to using it,
>>> removing the custom table walkers.
>>
>> Could other architectures just enable this new generic PTDUMP feature if
>> required without much problem ?
> 
> The generic PTDUMP is implemented as a library - so the architectures
> would have to provide the call into ptdump_walk_pgd() and provide the
> necessary callback note_page() which formats the lines in the output.

Though I understand that the leaf flag (any given level) details are very much
arch specific would there be any possibility for note_page() call back to be
unified as well. This is extracted from current PTDUMP output on arm64.

0xffffffc000000000-0xffffffc000080000  512K PTE  RW NX SHD AF  UXN MEM/NORMAL

The first three columns are generic

1. Kernel virtual range span
2. Kernel virtual range size
3. Kernel virtual range mapping level

Where as rest of the output are architecture specific page table entry flags.
Just wondering if we could print the first three columns in ptdump_walk_pgd()
itself before calling arch specific callback to fetch a print buffer for rest
of the line bounded with some character limit so that line does not overflow.
Its not something which must be done but I guess it's worth giving it a try.
This will help consolidate ptdump_walk_pgd() further.

> 
> Hopefully the implementation is generic enough that it should be
> flexible enough to work for most architectures.
> 
> arm, powerpc and s390 are the obvious architectures to convert next as
> they already have note_page() functions which shouldn't be too difficult
> to convert to match the callback prototype.

Which can be done independently later on, fair enough.

> 
>>>
>>> To enable a generic page table walker to walk the unusual mappings of
>>> the kernel we need to implement a set of functions which let us know
>>> when the walker has reached the leaf entry. After a suggestion from Will
>>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>>> the purpose (and is a new name so has no historic baggage). Some
>>> architectures have p?d_large macros but this is easily confused with
>>> "large pages".
>>
>> I have not been following the previous version of the series closely, hence
>> might be missing something here. But p?d_large() which identifies large
>> mappings on a given level can only signify a leaf entry. Large pages on the
>> table exist only as leaf entries. So what is the problem for it being used
>> directly instead. Is there any possibility in the kernel mapping when these
>> large pages are not leaf entries ?
> 
> There isn't any problem as such with using p?d_large macros. However the
> name "large" has caused confusion in the past. In particular there are
> two types of "large" page:
> 
> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
> pages this gives you 2MB and 1GB pages).
> 
> 2. sets of contiguous entries that can share a TLB entry (the
> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
> KB 'pages').

This is arm64 specific and AFAIK there are no other architectures where there
will be any confusion wrt p?d_large() not meaning a single entry.

As you have noted before if we are printing individual entries with PTE_CONT
then they need not be identified as p??d_large(). In which case p?d_large()
can just safely point to p?d_sect() identifying regular huge leaf entries.

> 
> In many cases both give the same effect (reduce pressure on TLBs and
> requires contiguous and aligned physical addresses). But for this case
> we only care about the 'leaf' case (because the contiguous bit makes no
> difference to walking the page tables).

Right and we can just safely identify section entries with it. What will be
the problem with that ? Again this is only arm64 specific.

> 
> As far as I'm aware p?d_large() currently implements the first and
> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.

AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
different huge page dentition from platform. HugeTLB identifies large entries
at PGD|PUD|PMD after converting it's content into PTE first. So there is no
need for direct large page definitions for other levels.

1. THP		- pmd_trans_huge()
2. HugeTLB	- pte_huge()	   CONFIG_ARCH_WANT_GENERAL_HUGETLB is set

A simple check for p?d_large() on mm/ and include/linux shows that there are
no existing usage for these in generic MM. Hence it is available.

p?d_trans_huge() cannot use contiguous entries, so its definitely 1 in above
example.

The problem is there is no other type of mapped leaf entries apart from a large
mapping at PGD, PUD, PMD level. Had there been another type of leaf entry then
p?d_leaf() would have made sense as p?d_large() would not have been sufficient.
Hence just wondering if it is really necessary to add brand new p?d_leaf() page
table helper in generic MM functions.

IMHO the new addition of p?d_leaf() can be avoided and p?d_large() should be
cleaned up (if required) in platforms and used in generic functions.

> 
> Will[1] suggested the same p?d_leaf() and this also avoids stepping on
> the existing usage of p?d_large() which isn't always available on every
> architecture.

PTDUMP now needs to identify large leaf entries uniformly on each platform.
Hence platforms enabling generic PTDUMP need to provide clean p?d_large()
definitions.

If there are existing definitions and usage of p?d_large() functions on some
platforms, those need to be fixed before they can use generic PTDUMP. IMHO we
should not be just adding new page table helpers in order to avoid cleaning
up these in platform code.

> 
> [1]
> https://lore.kernel.org/linux-mm/20190701101510.qup3nd6vm6cbdgjv@willie-the-truck/

I guess the primary concern was with the existence of p?d_large()or p?d_huge()
definitions in various platform code and p?d_leaf() was an way of working around
it. The problem is, it adds a new helper without a real need for one.

> 
>>>
>>> Mostly this is a clean up and there should be very little functional
>>> change. The exceptions are:
>>>
>>> * x86 PTDUMP debugfs output no longer display pages which aren't
>>>   present (patch 14).
>>
>> Hmm, kernel mappings pages which are not present! which ones are those ?
>> Just curious.
> 
> x86 currently outputs a line for every range, including those which are
> unpopulated. Patch 14 removes those lines from the output, only showing
> the populated ranges. This was discussed[2] previously.
> 
> [2]
> https://lore.kernel.org/lkml/26df02dd-c54e-ea91-bdd1-0a4aad3a30ac@arm.com/

Currently that is a difference between x86 and arm64 ptdump output. Whether to
show the gaps or not could not be achieved by defining a note_page() callback
function which does nothing but just return ? But if the single line output is
split between generic and callback as I had proposed earlier this will not be
possible any more as half the line would have been already printed.

Will Deacon July 25, 2019, 9:30 a.m. UTC | #10

On Thu, Jul 25, 2019 at 02:39:22PM +0530, Anshuman Khandual wrote:
> On 07/24/2019 07:05 PM, Steven Price wrote:
> > There isn't any problem as such with using p?d_large macros. However the
> > name "large" has caused confusion in the past. In particular there are
> > two types of "large" page:
> > 
> > 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
> > pages this gives you 2MB and 1GB pages).
> > 
> > 2. sets of contiguous entries that can share a TLB entry (the
> > 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
> > KB 'pages').
> 
> This is arm64 specific and AFAIK there are no other architectures where there
> will be any confusion wrt p?d_large() not meaning a single entry.
> 
> As you have noted before if we are printing individual entries with PTE_CONT
> then they need not be identified as p??d_large(). In which case p?d_large()
> can just safely point to p?d_sect() identifying regular huge leaf entries.

Steven's stuck in the middle of things here, but I do object to p?d_large()
because I find it bonkers to have p?d_large() and p?d_huge() mean completely
different things when they are synonyms in the English language.

Yes, p?d_leaf() matches the terminology used by the Arm architecture, but
given that most page table structures are arranged as a 'tree', then it's
not completely unreasonable, in my opinion. If you have a more descriptive
name, we could use that instead. We could also paint it blue.

> > In many cases both give the same effect (reduce pressure on TLBs and
> > requires contiguous and aligned physical addresses). But for this case
> > we only care about the 'leaf' case (because the contiguous bit makes no
> > difference to walking the page tables).
> 
> Right and we can just safely identify section entries with it. What will be
> the problem with that ? Again this is only arm64 specific.
> 
> > 
> > As far as I'm aware p?d_large() currently implements the first and
> > p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
> 
> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
> different huge page dentition from platform. HugeTLB identifies large entries
> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
> need for direct large page definitions for other levels.
> 
> 1. THP		- pmd_trans_huge()
> 2. HugeTLB	- pte_huge()	   CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
> 
> A simple check for p?d_large() on mm/ and include/linux shows that there are
> no existing usage for these in generic MM. Hence it is available.

Alternatively, it means we have a good opportunity to give it a better name
before it spreads into the core code.

> IMHO the new addition of p?d_leaf() can be avoided and p?d_large() should be
> cleaned up (if required) in platforms and used in generic functions.

Again, I disagree and think p?d_large() should be confined to arch code
if it sticks around at all.

I don't usually care much about naming, but in this case I really find
the status quo needlessly confusing.

Will

Steven Price July 25, 2019, 10:15 a.m. UTC | #11

On 25/07/2019 10:09, Anshuman Khandual wrote:
> 
> 
> On 07/24/2019 07:05 PM, Steven Price wrote:
>> On 23/07/2019 07:39, Anshuman Khandual wrote:
>>> Hello Steven,
>>>
>>> On 07/22/2019 09:11 PM, Steven Price wrote:
>>>> This is a slight reworking and extension of my previous patch set
>>>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>>>> version numbering as most of the changes are the same. In particular
>>>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>>>
>>>> Many architectures current have a debugfs file for dumping the kernel
>>>> page tables. Currently each architecture has to implement custom
>>>> functions for this because the details of walking the page tables used
>>>> by the kernel are different between architectures.
>>>>
>>>> This series extends the capabilities of walk_page_range() so that it can
>>>> deal with the page tables of the kernel (which have no VMAs and can
>>>> contain larger huge pages than exist for user space). A generic PTDUMP
>>>> implementation is the implemented making use of the new functionality of
>>>> walk_page_range() and finally arm64 and x86 are switch to using it,
>>>> removing the custom table walkers.
>>>
>>> Could other architectures just enable this new generic PTDUMP feature if
>>> required without much problem ?
>>
>> The generic PTDUMP is implemented as a library - so the architectures
>> would have to provide the call into ptdump_walk_pgd() and provide the
>> necessary callback note_page() which formats the lines in the output.
> 
> Though I understand that the leaf flag (any given level) details are very much
> arch specific would there be any possibility for note_page() call back to be
> unified as well. This is extracted from current PTDUMP output on arm64.
> 
> 0xffffffc000000000-0xffffffc000080000  512K PTE  RW NX SHD AF  UXN MEM/NORMAL
> 
> The first three columns are generic
> 
> 1. Kernel virtual range span
> 2. Kernel virtual range size
> 3. Kernel virtual range mapping level
> 
> Where as rest of the output are architecture specific page table entry flags.
> Just wondering if we could print the first three columns in ptdump_walk_pgd()
> itself before calling arch specific callback to fetch a print buffer for rest
> of the line bounded with some character limit so that line does not overflow.
> Its not something which must be done but I guess it's worth giving it a try.
> This will help consolidate ptdump_walk_pgd() further.

It's not quite as simple as it seems. One of the things note_page() does
is work out whether a contiguous set of pages are "the same" (i.e.
should appear as one range). This is ultimately an architecture specific
decision: we need to look at the flags to do this.

I'm of course happy to be proved wrong if you can see a neat way of
making this work.

>>
>> Hopefully the implementation is generic enough that it should be
>> flexible enough to work for most architectures.
>>
>> arm, powerpc and s390 are the obvious architectures to convert next as
>> they already have note_page() functions which shouldn't be too difficult
>> to convert to match the callback prototype.
> 
> Which can be done independently later on, fair enough.
> 
>>
>>>>
>>>> To enable a generic page table walker to walk the unusual mappings of
>>>> the kernel we need to implement a set of functions which let us know
>>>> when the walker has reached the leaf entry. After a suggestion from Will
>>>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>>>> the purpose (and is a new name so has no historic baggage). Some
>>>> architectures have p?d_large macros but this is easily confused with
>>>> "large pages".
>>>
>>> I have not been following the previous version of the series closely, hence
>>> might be missing something here. But p?d_large() which identifies large
>>> mappings on a given level can only signify a leaf entry. Large pages on the
>>> table exist only as leaf entries. So what is the problem for it being used
>>> directly instead. Is there any possibility in the kernel mapping when these
>>> large pages are not leaf entries ?
>>
>> There isn't any problem as such with using p?d_large macros. However the
>> name "large" has caused confusion in the past. In particular there are
>> two types of "large" page:
>>
>> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
>> pages this gives you 2MB and 1GB pages).
>>
>> 2. sets of contiguous entries that can share a TLB entry (the
>> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
>> KB 'pages').
> 
> This is arm64 specific and AFAIK there are no other architectures where there
> will be any confusion wrt p?d_large() not meaning a single entry.

This isn't arm64 specific (or even Arm specific) - only the examples I
gave are. There are several architectures with software walks where the
TLB can be populated with arbitrary sized entries. I have to admit I
don't fully understand the page table layouts of many of the other
architectures that Linux supports.

> As you have noted before if we are printing individual entries with PTE_CONT
> then they need not be identified as p??d_large(). In which case p?d_large()
> can just safely point to p?d_sect() identifying regular huge leaf entries.

The printing is largely irrelevant here (it's handled by arch code), so
PTE_CONT isn't a problem. However to walk the page tables we need to
know precisely "is this the leaf of the tree", we don't really care what
size page is being mapped, just whether we should continue the walk or not.

>>
>> In many cases both give the same effect (reduce pressure on TLBs and
>> requires contiguous and aligned physical addresses). But for this case
>> we only care about the 'leaf' case (because the contiguous bit makes no
>> difference to walking the page tables).
> 
> Right and we can just safely identify section entries with it. What will be
> the problem with that ? Again this is only arm64 specific.

It's not arm64 specific.

>>
>> As far as I'm aware p?d_large() currently implements the first and
>> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
> 
> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
> different huge page dentition from platform. HugeTLB identifies large entries
> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
> need for direct large page definitions for other levels.
> 
> 1. THP		- pmd_trans_huge()
> 2. HugeTLB	- pte_huge()	   CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
> 
> A simple check for p?d_large() on mm/ and include/linux shows that there are
> no existing usage for these in generic MM. Hence it is available.

As Will has already replied - this is probably a good opportunity to
pick a better name - arch code can then be tidied up to use the new name.

[...]
> 
> Currently that is a difference between x86 and arm64 ptdump output. Whether to
> show the gaps or not could not be achieved by defining a note_page() callback
> function which does nothing but just return ? But if the single line output is
> split between generic and callback as I had proposed earlier this will not be
> possible any more as half the line would have been already printed.

I think the proposal at the moment is for arm64 to match x86 as it seems
like it would be useful to know at what level the gaps are. But I also
like giving each arch the flexibility to display what information is
relevant for that architecture. It's the custom page walkers I'm trying
to remove as really there isn't much difference between architectures
there (as lots of generic code has to deal with page tables in one way
or another).

Steve

Anshuman Khandual July 26, 2019, 6:03 a.m. UTC | #12

On 07/25/2019 03:00 PM, Will Deacon wrote:
> On Thu, Jul 25, 2019 at 02:39:22PM +0530, Anshuman Khandual wrote:
>> On 07/24/2019 07:05 PM, Steven Price wrote:
>>> There isn't any problem as such with using p?d_large macros. However the
>>> name "large" has caused confusion in the past. In particular there are
>>> two types of "large" page:
>>>
>>> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
>>> pages this gives you 2MB and 1GB pages).
>>>
>>> 2. sets of contiguous entries that can share a TLB entry (the
>>> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
>>> KB 'pages').
>>
>> This is arm64 specific and AFAIK there are no other architectures where there
>> will be any confusion wrt p?d_large() not meaning a single entry.
>>
>> As you have noted before if we are printing individual entries with PTE_CONT
>> then they need not be identified as p??d_large(). In which case p?d_large()
>> can just safely point to p?d_sect() identifying regular huge leaf entries.
> 
> Steven's stuck in the middle of things here, but I do object to p?d_large()
> because I find it bonkers to have p?d_large() and p?d_huge() mean completely
> different things when they are synonyms in the English language.

Agreed that both p?d_large() and p?d_huge() should not exist at the same time
when they imply the same thing. I believe all these name proliferation happened
because THP, HugeTLB and kernel large mappings like linear, vmemmap, ioremap etc
which the platform code had to deal with in various forms.

> 
> Yes, p?d_leaf() matches the terminology used by the Arm architecture, but
> given that most page table structures are arranged as a 'tree', then it's
> not completely unreasonable, in my opinion. If you have a more descriptive
> name, we could use that instead. We could also paint it blue.

The alternate name chosen p?d_leaf() is absolutely fine and it describes the
entry as intended. There is no disagreement over that. My original concern
was introduction of yet another page table helper.

> 
>>> In many cases both give the same effect (reduce pressure on TLBs and
>>> requires contiguous and aligned physical addresses). But for this case
>>> we only care about the 'leaf' case (because the contiguous bit makes no
>>> difference to walking the page tables).
>>
>> Right and we can just safely identify section entries with it. What will be
>> the problem with that ? Again this is only arm64 specific.
>>
>>>
>>> As far as I'm aware p?d_large() currently implements the first and
>>> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
>>
>> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
>> different huge page dentition from platform. HugeTLB identifies large entries
>> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
>> need for direct large page definitions for other levels.
>>
>> 1. THP		- pmd_trans_huge()
>> 2. HugeTLB	- pte_huge()	   CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
>>
>> A simple check for p?d_large() on mm/ and include/linux shows that there are
>> no existing usage for these in generic MM. Hence it is available.
> 
> Alternatively, it means we have a good opportunity to give it a better name
> before it spreads into the core code.

Fair enough, that is another way. So you expect existing platform definitions
for p?d_large()/p?d_huge() getting cleaned up and to start using new p?d_leaf()
instead ?

> 
>> IMHO the new addition of p?d_leaf() can be avoided and p?d_large() should be
>> cleaned up (if required) in platforms and used in generic functions.
> 
> Again, I disagree and think p?d_large() should be confined to arch code
> if it sticks around at all.

All of those instances should migrate to using p?d_leaf() eventually else
there will be three different helpers which probably mean the same thing.

Anshuman Khandual July 28, 2019, 11:20 a.m. UTC | #13

On 07/22/2019 09:11 PM, Steven Price wrote:
> Steven Price (21):
>   arc: mm: Add p?d_leaf() definitions
>   arm: mm: Add p?d_leaf() definitions
>   arm64: mm: Add p?d_leaf() definitions
>   mips: mm: Add p?d_leaf() definitions
>   powerpc: mm: Add p?d_leaf() definitions
>   riscv: mm: Add p?d_leaf() definitions
>   s390: mm: Add p?d_leaf() definitions
>   sparc: mm: Add p?d_leaf() definitions
>   x86: mm: Add p?d_leaf() definitions

The set of architectures here is neither complete (e.g ia64, parisc missing)
nor does it only include architectures which had previously enabled PTDUMP
like arm, arm64, powerpc, s390 and x86. Is there any reason for this set of
archs to be on the list and not the others which are currently falling back
on generic p?d_leaf() defined later in the series ? Are the missing archs
do not have huge page support in the MMU ? If there is a direct dependency
for these symbols with CONFIG_HUGETLB_PAGE then it must be checked before
falling back on the generic ones.

Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
functions need to be defined on all arch irrespective if they use PTDUMP or not
or otherwise just define it for archs which need them now for sure i.e x86 and
arm64 (which are moving to new generic PTDUMP framework). Other archs can
implement these later.

Steven Price July 29, 2019, 11:32 a.m. UTC | #14

On 28/07/2019 12:20, Anshuman Khandual wrote:
> On 07/22/2019 09:11 PM, Steven Price wrote:
>> Steven Price (21):
>>   arc: mm: Add p?d_leaf() definitions
>>   arm: mm: Add p?d_leaf() definitions
>>   arm64: mm: Add p?d_leaf() definitions
>>   mips: mm: Add p?d_leaf() definitions
>>   powerpc: mm: Add p?d_leaf() definitions
>>   riscv: mm: Add p?d_leaf() definitions
>>   s390: mm: Add p?d_leaf() definitions
>>   sparc: mm: Add p?d_leaf() definitions
>>   x86: mm: Add p?d_leaf() definitions
> 
> The set of architectures here is neither complete (e.g ia64, parisc missing)
> nor does it only include architectures which had previously enabled PTDUMP
> like arm, arm64, powerpc, s390 and x86. Is there any reason for this set of
> archs to be on the list and not the others which are currently falling back
> on generic p?d_leaf() defined later in the series ? Are the missing archs
> do not have huge page support in the MMU ? If there is a direct dependency
> for these symbols with CONFIG_HUGETLB_PAGE then it must be checked before
> falling back on the generic ones.

The list of architectures here is what I believe to be the list of
architectures which can have leaf entries further up the tree than
normal. I'm by no means an expert on all these architectures so I'm
hoping someone will chime in if they notice something amiss. Obviously
all the NO_MMU

ia64 as far as I can tell doesn't implement leaf entries further up - it
has an interesting hybrid hardware/software walk mechanism and as I
understand it the hardware never walks the page table tree that the
p?d_xxx() operations operate on. So this is a software implementation
detail - but the existance of p?d_huge functions which always return 0
were my first clue that leaf entries are only at the bottom of the tree.

parisc is more interesting and I'm not sure if this is necessarily
correct. I originally proposed a patch with the line "For parisc, we
don't support large pages, so add stubs returning 0" which got Acked by
Helge Deller. However going back to look at that again I see there was a
follow up thread[2] which possibly suggests I was wrong?

Can anyone shed some light on whether parisc does support leaf entries
of the page table tree at a higher than the normal depth?

[1] https://lkml.org/lkml/2019/2/27/572
[2] https://lkml.org/lkml/2019/3/5/610

The intention is that the page table walker would be available for all
architectures so that it can be used in any generic code - PTDUMP simply
seemed like a good place to start.

> Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
> functions need to be defined on all arch irrespective if they use PTDUMP or not
> or otherwise just define it for archs which need them now for sure i.e x86 and
> arm64 (which are moving to new generic PTDUMP framework). Other archs can
> implement these later.

The intention was to have valid definitions for all architectures, but
obviously I need help from those familiar with those other architectures
to check whether I've understood them correctly.

Thanks,

Steve

Sven Schnelle July 31, 2019, 9:27 a.m. UTC | #15

Hi Steven,

On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
> 
> parisc is more interesting and I'm not sure if this is necessarily
> correct. I originally proposed a patch with the line "For parisc, we
> don't support large pages, so add stubs returning 0" which got Acked by
> Helge Deller. However going back to look at that again I see there was a
> follow up thread[2] which possibly suggests I was wrong?

I just started a week ago implementing ptdump for PA-RISC. Didn't notice that
you're working on making it generic, which is nice. I'll adjust my code
to use the infrastructure you're currently developing.

> Can anyone shed some light on whether parisc does support leaf entries
> of the page table tree at a higher than the normal depth?
> 
> [1] https://lkml.org/lkml/2019/2/27/572
> [2] https://lkml.org/lkml/2019/3/5/610

My understanding is that PA-RISC only has leaf entries on PTE level.

> The intention is that the page table walker would be available for all
> architectures so that it can be used in any generic code - PTDUMP simply
> seemed like a good place to start.
> 
> > Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
> > functions need to be defined on all arch irrespective if they use PTDUMP or not
> > or otherwise just define it for archs which need them now for sure i.e x86 and
> > arm64 (which are moving to new generic PTDUMP framework). Other archs can
> > implement these later.

I'll take care of the PA-RISC part - for 32 bit your generic code works, for 64Bit
i need to learn a bit more about the following hack:

arch/parisc/include/asm/pgalloc.h:15
/* Allocate the top level pgd (page directory)
 *
 * Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
 * allocate the first pmd adjacent to the pgd.  This means that we can
 * subtract a constant offset to get to it.  The pmd and pgd sizes are
 * arranged so that a single pmd covers 4GB (giving a full 64-bit
 * process access to 8TB) so our lookups are effectively L2 for the
 * first 4GB of the kernel (i.e. for all ILP32 processes and all the
 * kernel for machines with under 4GB of memory)
 */

I see that your change clear P?D entries when p?d_bad() returns true, which - i think -
would be the case with the PA-RISC implementation.

Regards
Sven

Steven Price July 31, 2019, 11:18 a.m. UTC | #16

On 31/07/2019 10:27, Sven Schnelle wrote:
> Hi Steven,
> 
> On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
>>
>> parisc is more interesting and I'm not sure if this is necessarily
>> correct. I originally proposed a patch with the line "For parisc, we
>> don't support large pages, so add stubs returning 0" which got Acked by
>> Helge Deller. However going back to look at that again I see there was a
>> follow up thread[2] which possibly suggests I was wrong?
> 
> I just started a week ago implementing ptdump for PA-RISC. Didn't notice that
> you're working on making it generic, which is nice. I'll adjust my code
> to use the infrastructure you're currently developing.

Great, hopefully it will make it easier to implement.

>> Can anyone shed some light on whether parisc does support leaf entries
>> of the page table tree at a higher than the normal depth?
>>
>> [1] https://lkml.org/lkml/2019/2/27/572
>> [2] https://lkml.org/lkml/2019/3/5/610
> 
> My understanding is that PA-RISC only has leaf entries on PTE level.

Yes, that's my current interpretation.

>> The intention is that the page table walker would be available for all
>> architectures so that it can be used in any generic code - PTDUMP simply
>> seemed like a good place to start.
>>
>>> Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
>>> functions need to be defined on all arch irrespective if they use PTDUMP or not
>>> or otherwise just define it for archs which need them now for sure i.e x86 and
>>> arm64 (which are moving to new generic PTDUMP framework). Other archs can
>>> implement these later.
> 
> I'll take care of the PA-RISC part - for 32 bit your generic code works, for 64Bit
> i need to learn a bit more about the following hack:
> 
> arch/parisc/include/asm/pgalloc.h:15
> /* Allocate the top level pgd (page directory)
>  *
>  * Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
>  * allocate the first pmd adjacent to the pgd.  This means that we can
>  * subtract a constant offset to get to it.  The pmd and pgd sizes are
>  * arranged so that a single pmd covers 4GB (giving a full 64-bit
>  * process access to 8TB) so our lookups are effectively L2 for the
>  * first 4GB of the kernel (i.e. for all ILP32 processes and all the
>  * kernel for machines with under 4GB of memory)
>  */

As far as I understand this, the page table tree isn't any different
here. It's just that there's a PMD which is allocated at the same time
as the PGD. The PGD's first entry then points to the PMD (P4D/PUD are
folded). There are then some tricks which means that for addresses < 4GB
the PGD stage can be skipped because you already know where the relevant
PMD is.

However, nothing should stop a simple walk from PGD down - it's just an
optimisation to remove the pointer fetch from PGD in the usual case for
accesses < 4GB.

> I see that your change clear P?D entries when p?d_bad() returns true, which - i think -
> would be the case with the PA-RISC implementation.

The only case where p?d_bad() is checked is at the PGD and P4D levels
(unless I'm missing something?). I have to admit I'm a little unsure
about this. Basically the code as it stands doesn't allow leaf entries
at PGD or P4D. I'm not aware of any architectures that do this though.

Thanks,

Steve

[v9,00/21] Generic page walk and ptdump

Message

Comments