[v15,00/23] Generic page walk and ptdump

Message ID 20191101140942.51554-1-steven.price@arm.com

Message

Steven Price Nov. 1, 2019, 2:09 p.m. UTC
Many architectures currently have a debugfs file for dumping the kernel
page tables. Currently each architecture has to implement custom
functions for this because the details of walking the page tables used
by the kernel are different between architectures.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). A generic PTDUMP
implementation is then implemented making use of the new functionality of
walk_page_range() and finally arm64 and x86 are switched to using it,
removing the custom table walkers.

To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached the leaf entry. After a suggestion from Will
Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
the purpose (and is a new name so has no historic baggage). Some
architectures have p?d_large macros but this is easily confused with
"large pages".
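
As a rough illustration of the leaf check described above, here is a
userspace toy model (not kernel code). Every toy_* name, the flag values
and the three-level layout are assumptions made up for this sketch; the
real p?d_leaf() helpers are per-architecture and operate on hardware
page table entries.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_PRESENT 0x1u   /* entry is valid */
#define TOY_LEAF    0x2u   /* entry maps memory directly, no lower table */
#define TOY_ENTRIES 4      /* entries per table, tiny for illustration */
#define TOY_LEVELS  3      /* level 0 is the top, level 2 the last */

struct toy_entry {
    unsigned flags;
    struct toy_entry *next;   /* lower-level table, NULL for a leaf */
};

/* Analogue of p?d_leaf(): does this entry terminate the walk? */
static int toy_leaf(const struct toy_entry *e, int level)
{
    /* Last-level entries are always leaves; higher ones only if flagged. */
    return level == TOY_LEVELS - 1 || (e->flags & TOY_LEAF) != 0;
}

/* Count leaf mappings below 'table' and record the level of each one. */
static size_t toy_walk(const struct toy_entry *table, int level,
                       size_t leaves_per_level[TOY_LEVELS])
{
    size_t n = 0;

    assert(level >= 0 && level < TOY_LEVELS);
    for (int i = 0; i < TOY_ENTRIES; i++) {
        const struct toy_entry *e = &table[i];

        if (!(e->flags & TOY_PRESENT))
            continue;                   /* hole: nothing mapped here */
        if (toy_leaf(e, level)) {
            leaves_per_level[level]++;  /* stop descending at a leaf */
            n++;
        } else {
            assert(e->next != NULL);    /* non-leaf must point to a table */
            n += toy_walk(e->next, level + 1, leaves_per_level);
        }
    }
    return n;
}
```

The point of the sketch is that the walker never needs to know how big a
"large page" is: it only asks whether an entry is a leaf at the current
level and stops descending if so.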

This series ends with a generic PTDUMP implementation for arm64 and x86.

Mostly this is a cleanup and there should be very little functional
change. The exceptions are:

* arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).

* arm64 has the ability to efficiently process KASAN pages (which
  previously only x86 implemented). This means that the combination of
  KASAN and DEBUG_WX is now useable.

Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v15

Changes since v14:
https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
 * Split walk_page_range() into two functions: the existing
   walk_page_range() still requires VMAs (and treats areas without a
   VMA as a 'hole'), while the new walk_page_range_novma() ignores VMAs
   and reports the actual page table layout. This fixes the previous
   breakage of /proc/<pid>/pagemap
 * New patch at the end of the series which reduces the 'level' numbers
   by 1 to simplify the code slightly
 * Added tags
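
The difference between the two walkers described in the first item can
be sketched with a userspace toy model. This is not the kernel
implementation: struct toy_vma, toy_walk_vma() and toy_walk_novma() are
hypothetical names, and a flat array of booleans stands in for the page
tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the names below are illustrative, not kernel APIs. */
struct toy_vma {
    size_t start, end;          /* page range [start, end) */
};

static bool toy_in_vma(const struct toy_vma *vmas, size_t nr, size_t page)
{
    for (size_t i = 0; i < nr; i++)
        if (page >= vmas[i].start && page < vmas[i].end)
            return true;
    return false;
}

/* VMA-gated walk: pages outside every VMA are reported as holes. */
static size_t toy_walk_vma(const bool *mapped, size_t nr_pages,
                           const struct toy_vma *vmas, size_t nr_vmas,
                           size_t *holes)
{
    size_t found = 0;

    assert(holes != NULL);
    *holes = 0;
    for (size_t p = 0; p < nr_pages; p++) {
        if (toy_in_vma(vmas, nr_vmas, p) && mapped[p])
            found++;
        else
            (*holes)++;
    }
    return found;
}

/* VMA-less walk: report what the table actually contains. */
static size_t toy_walk_novma(const bool *mapped, size_t nr_pages,
                             size_t *holes)
{
    size_t found = 0;

    assert(holes != NULL);
    *holes = 0;
    for (size_t p = 0; p < nr_pages; p++) {
        if (mapped[p])
            found++;
        else
            (*holes)++;
    }
    return found;
}
```

In this model a mapping that exists in the table but lies outside every
VMA is invisible to the first walker and visible to the second, which is
the behaviour a kernel page table dump needs since kernel mappings have
no VMAs.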

Changes since v13:
https://lore.kernel.org/lkml/20191024093716.49420-1-steven.price@arm.com/
 * Fixed typo in arc definition of pmd_leaf() spotted by the kbuild test
   robot
 * Added tags

Changes since v12:
https://lore.kernel.org/lkml/20191018101248.33727-1-steven.price@arm.com/
 * Correct code format in riscv pud_leaf()/pmd_leaf()
 * v12 may not have reached everyone because of mail server problems
   (which are now hopefully resolved!)

Changes since v11:
https://lore.kernel.org/lkml/20191007153822.16518-1-steven.price@arm.com/
 * Use "-1" as dummy depth parameter in patch 14.

Changes since v10:
https://lore.kernel.org/lkml/20190731154603.41797-1-steven.price@arm.com/
 * Rebased to v5.4-rc1 - mainly various updates to deal with the
   splitting out of ops from struct mm_walk.
 * Deal with PGD_LEVEL_MULT not always being constant on x86.

Changes since v9:
https://lore.kernel.org/lkml/20190722154210.42799-1-steven.price@arm.com/
 * Moved generic macros to first patch in the series and explained the
   macro naming in the commit message.
 * mips: Moved macros to pgtable.h as they are now valid for both 32 and 64
   bit
 * x86: Dropped patch which changed the debugfs output for x86, instead
   we have...
 * new patch adding 'depth' parameter to pte_hole. This is used to
   provide the necessary information to output lines for 'holes' in the
   debugfs files
 * new patch changing arm64 debugfs output to include holes to match x86
 * generic ptdump KASAN handling has been simplified and now works with
   CONFIG_DEBUG_VIRTUAL.

Changes since v8:
https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/
 * Rename from p?d_large() to p?d_leaf()
 * Dropped patches migrating arm64/x86 custom walkers to
   walk_page_range() in favour of adding a generic PTDUMP implementation
   and migrating arm64/x86 to that instead.
 * Rebased to v5.3-rc1

Steven Price (23):
  mm: Add generic p?d_leaf() macros
  arc: mm: Add p?d_leaf() definitions
  arm: mm: Add p?d_leaf() definitions
  arm64: mm: Add p?d_leaf() definitions
  mips: mm: Add p?d_leaf() definitions
  powerpc: mm: Add p?d_leaf() definitions
  riscv: mm: Add p?d_leaf() definitions
  s390: mm: Add p?d_leaf() definitions
  sparc: mm: Add p?d_leaf() definitions
  x86: mm: Add p?d_leaf() definitions
  mm: pagewalk: Add p4d_entry() and pgd_entry()
  mm: pagewalk: Allow walking without vma
  mm: pagewalk: Add test_p?d callbacks
  mm: pagewalk: Add 'depth' parameter to pte_hole
  x86: mm: Point to struct seq_file from struct pg_state
  x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
  mm: Add generic ptdump
  x86: mm: Convert dump_pagetables to use walk_page_range
  arm64: mm: Convert mm/dump.c to use walk_page_range()
  arm64: mm: Display non-present entries in ptdump
  mm: ptdump: Reduce level numbers by 1 in note_page()

 arch/arc/include/asm/pgtable.h               |   1 +
 arch/arm/include/asm/pgtable-2level.h        |   1 +
 arch/arm/include/asm/pgtable-3level.h        |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/Kconfig.debug                     |  19 +-
 arch/arm64/include/asm/pgtable.h             |   2 +
 arch/arm64/include/asm/ptdump.h              |   8 +-
 arch/arm64/mm/Makefile                       |   4 +-
 arch/arm64/mm/dump.c                         | 148 +++-----
 arch/arm64/mm/mmu.c                          |   4 +-
 arch/arm64/mm/ptdump_debugfs.c               |   2 +-
 arch/mips/include/asm/pgtable.h              |   5 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  30 +-
 arch/riscv/include/asm/pgtable-64.h          |   7 +
 arch/riscv/include/asm/pgtable.h             |   7 +
 arch/s390/include/asm/pgtable.h              |   2 +
 arch/sparc/include/asm/pgtable_64.h          |   2 +
 arch/x86/Kconfig                             |   1 +
 arch/x86/Kconfig.debug                       |  20 +-
 arch/x86/include/asm/pgtable.h               |  10 +-
 arch/x86/mm/Makefile                         |   4 +-
 arch/x86/mm/debug_pagetables.c               |   8 +-
 arch/x86/mm/dump_pagetables.c                | 343 +++++--------------
 arch/x86/platform/efi/efi_32.c               |   2 +-
 arch/x86/platform/efi/efi_64.c               |   4 +-
 drivers/firmware/efi/arm-runtime.c           |   2 +-
 fs/proc/task_mmu.c                           |   4 +-
 include/asm-generic/pgtable.h                |  20 ++
 include/linux/pagewalk.h                     |  42 ++-
 include/linux/ptdump.h                       |  22 ++
 mm/Kconfig.debug                             |  21 ++
 mm/Makefile                                  |   1 +
 mm/hmm.c                                     |   8 +-
 mm/migrate.c                                 |   5 +-
 mm/mincore.c                                 |   1 +
 mm/pagewalk.c                                | 126 +++++--
 mm/ptdump.c                                  | 151 ++++++++
 37 files changed, 586 insertions(+), 453 deletions(-)
 create mode 100644 include/linux/ptdump.h
 create mode 100644 mm/ptdump.c

Comments

Qian Cai Nov. 4, 2019, 7:35 p.m. UTC | #1
On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
> Many architectures currently have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
> 
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is then implemented making use of the new functionality of
> walk_page_range() and finally arm64 and x86 are switched to using it,
> removing the custom table walkers.
> 
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".
> 
> This series ends with a generic PTDUMP implementation for arm64 and x86.
> 
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
> 
> * arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).
> 
> * arm64 has the ability to efficiently process KASAN pages (which
>   previously only x86 implemented). This means that the combination of
>   KASAN and DEBUG_WX is now useable.
> 
> Also available as a git tree:
> git://linux-arm.org/linux-sp.git walk_page_range/v15
> 
> Changes since v14:
> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>  * Switch walk_page_range() into two functions, the existing
>    walk_page_range() now still requires VMAs (and treats areas without a
>    VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>    will report the actual page table layout. This fixes the previous
>    breakage of /proc/<pid>/pagemap
>  * New patch at the end of the series which reduces the 'level' numbers
>    by 1 to simplify the code slightly
>  * Added tags

Does this new version also take care of this boot crash seen with v14? I
suppose it now breaks CONFIG_EFI_PGT_DUMP=y? The full config is:

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[   10.550313][    T0] Switched APIC routing to physical flat.
[   10.563899][    T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[   10.614633][    T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa6f481074, max_idle_ns: 440795311917 ns
[   10.625979][    T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 4391.73 BogoMIPS (lpj=21958690)
[   10.635990][    T0] pid_max: default: 131072 minimum: 1024
[   11.259736][    T0] ---[ User Space ]---
[   11.263737][    T0] 0x0000000000000000-0x0000000000001000           4K     RW                     x  pte
[   11.266028][    T0] 0x0000000000001000-0x0000000000200000        2044K                               pte
[   11.275992][    T0] 0x0000000000200000-0x0000000004000000          62M                               pmd
[   11.285998][    T0] 0x0000000004000000-0x0000000004076000         472K                               pte
[   11.296019][    T0] 0x0000000004076000-0x0000000004200000        1576K                               pte
[   11.305997][    T0] 0x0000000004200000-0x0000000011000000         206M                               pmd
[   11.316008][    T0] 0x0000000011000000-0x0000000011100000           1M                               pte
[   11.326008][    T0] 0x0000000011100000-0x0000000011200000           1M                               pte
[   11.335990][    T0] 0x0000000011200000-0x0000000011800000           6M                               pmd
[   11.346054][    T0] ==================================================================
[   11.354074][    T0] BUG: KASAN: wild-memory-access in ptdump_pte_entry+0x39/0x60
[   11.355975][    T0] Read of size 8 at addr 000f887fee5ff000 by task swapper/0/0
[   11.355975][    T0] 
[   11.355975][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc5-mm1+ #1
[   11.355975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.355975][    T0] Call Trace:
[   11.355975][    T0]  dump_stack+0xa0/0xea
[   11.355975][    T0]  __kasan_report.cold.7+0xb0/0xc0
[   11.355975][    T0]  ? note_page+0x7f8/0xa70
[   11.355975][    T0]  ? ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  kasan_report+0x12/0x20
[   11.355975][    T0]  __asan_load8+0x71/0xa0
[   11.355975][    T0]  ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.355975][    T0]  __walk_page_range+0x206/0x230
[   11.355975][    T0]  ? vmacache_find+0x3a/0x170
[   11.355975][    T0]  walk_page_range+0x136/0x210
[   11.355975][    T0]  ? __walk_page_range+0x230/0x230
[   11.355975][    T0]  ? find_held_lock+0xca/0xf0
[   11.355975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.355975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.355975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.355975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.355975][    T0]  ? __enc_copy+0x90/0x90
[   11.355975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.355975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.355975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.355975][    T0]  start_kernel+0x607/0x6a9
[   11.355975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.355975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.355975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.355975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.355975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.355975][    T0] ==================================================================
[   11.355975][    T0] Disabling lock debugging due to kernel taint
[   11.355991][    T0] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   11.364049][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.4.0-rc5-mm1+ #1
[   11.365975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.365975][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.365975][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.365975][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.365975][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.365975][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.365975][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.365975][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.365975][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.365975][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.365975][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.365975][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.365975][    T0] Call Trace:
[   11.365975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.365975][    T0]  __walk_page_range+0x206/0x230
[   11.365975][    T0]  ? vmacache_find+0x3a/0x170
[   11.365975][    T0]  walk_page_range+0x136/0x210
[   11.365975][    T0]  ? __walk_page_range+0x230/0x230
[   11.365975][    T0]  ? find_held_lock+0xca/0xf0
[   11.365975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.365975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.365975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.365975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.365975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.365975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.365975][    T0]  ? __enc_copy+0x90/0x90
[   11.365975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.365975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.365975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.365975][    T0]  start_kernel+0x607/0x6a9
[   11.365975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.365975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.365975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.365975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.365975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.365975][    T0] Modules linked in:
[   11.365988][    T0] ---[ end trace 8e90dc89e2468d55 ]---
[   11.375984][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.381335][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.385982][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.395982][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.403864][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.405982][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.415982][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.425982][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.435982][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.445982][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.452466][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.455981][    T0] Kernel panic - not syncing: Fatal exception
[   11.462246][    T0] ---[ end Kernel panic - not syncing: Fatal exception ]---
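
One possible reading of the trace above: the address KASAN reports
(000f887fee5ff000, also the value in R12) is not a canonical x86-64
address, which would be consistent with the general protection fault
that follows. The check below is a minimal sketch and assumes 48-bit
virtual addresses (4-level paging) on the reporting machine;
is_canonical_48() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses, so bits 63..47 must all be
 * equal (all zero or all one) for an address to be canonical.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;      /* the 17 bits 63..47 */

    return top == 0 || top == 0x1ffff;
}
```

Under that assumption the faulting value fails the check, because bit 47
is set while bits 63..52 are clear, whereas the stack pointer and R14
from the same register dump pass it.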

Qian Cai Nov. 6, 2019, 1:31 p.m. UTC | #2
> On Nov 4, 2019, at 2:35 PM, Qian Cai <cai@lca.pw> wrote:
> 
> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
>> Many architectures currently have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>> 
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is then implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switched to using it,
>> removing the custom table walkers.
>> 
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
>> 
>> This series ends with a generic PTDUMP implementation for arm64 and x86.
>> 
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>> 
>> * arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).
>> 
>> * arm64 has the ability to efficiently process KASAN pages (which
>>  previously only x86 implemented). This means that the combination of
>>  KASAN and DEBUG_WX is now useable.
>> 
>> Also available as a git tree:
>> git://linux-arm.org/linux-sp.git walk_page_range/v15
>> 
>> Changes since v14:
>> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>> * Switch walk_page_range() into two functions, the existing
>>   walk_page_range() now still requires VMAs (and treats areas without a
>>   VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>   will report the actual page table layout. This fixes the previous
>>   breakage of /proc/<pid>/pagemap
>> * New patch at the end of the series which reduces the 'level' numbers
>>   by 1 to simplify the code slightly
>> * Added tags
> 
> Does this new version also take care of this boot crash seen with v14? Suppose
> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 

V15 is indeed DOA here.

[   10.957006][    T0] pid_max: default: 131072 minimum: 1024
[   11.543186][    T0] ---[ User Space ]---
[   11.547009][    T0] 0x0000000000000000-0x0000000000001000           4K     RW                     x  pte
[   11.556612][    T0] 0x0000000000001000-0x0000000000200000        2044K                               pte
[   11.557008][    T0] 0x0000000000200000-0x0000000004000000          62M                               pmd
[   11.567014][    T0] 0x0000000004000000-0x0000000004076000         472K                               pte
[   11.577033][    T0] 0x0000000004076000-0x0000000004200000        1576K                               pte
[   11.587013][    T0] 0x0000000004200000-0x0000000011000000         206M                               pmd
[   11.597023][    T0] 0x0000000011000000-0x0000000011100000           1M                               pte
[   11.607023][    T0] 0x0000000011100000-0x0000000011200000           1M                               pte
[   11.617006][    T0] 0x0000000011200000-0x0000000011800000           6M                               pmd
[   11.627068][    T0] ==================================================================
[   11.635087][    T0] BUG: KASAN: wild-memory-access in ptdump_pte_entry+0x39/0x60
[   11.636992][    T0] Read of size 8 at addr 000f887fee5ff000 by task swapper/0/0
[   11.636992][    T0] 
[   11.636992][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc6-next-20191106+ #6
[   11.636992][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.636992][    T0] Call Trace:
[   11.636992][    T0]  dump_stack+0xa0/0xea
[   11.636992][    T0]  __kasan_report.cold.7+0xb0/0xc0
[   11.636992][    T0]  ? note_page+0x6a9/0xa70
[   11.636992][    T0]  ? ptdump_pte_entry+0x39/0x60
[   11.636992][    T0]  ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[   11.636992][    T0]  kasan_report+0x12/0x20
[   11.636992][    T0]  __asan_load8+0x71/0xa0
[   11.636992][    T0]  ptdump_pte_entry+0x39/0x60
[   11.636992][    T0]  walk_pgd_range+0x9e5/0xdb0
[   11.636992][    T0]  __walk_page_range+0x206/0x230
[   11.636992][    T0]  walk_page_range_novma+0xc5/0x130
[   11.636992][    T0]  ? walk_page_range+0x220/0x220
[   11.636992][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.636992][    T0]  ptdump_walk_pgd_level_core+0x169/0x1e0
[   11.636992][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.636992][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.636992][    T0]  ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[   11.636992][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.636992][    T0]  ? __enc_copy+0x90/0x90
[   11.636992][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.636992][    T0]  efi_dump_pagetable+0x35/0x37
[   11.636992][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.636992][    T0]  start_kernel+0x607/0x6a9
[   11.636992][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.636992][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.636992][    T0]  x86_64_start_reservations+0x24/0x26
[   11.636992][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.636992][    T0]  secondary_startup_64+0xb6/0xc0
[   11.636992][    T0] ==================================================================
[   11.636992][    T0] Disabling lock debugging due to kernel taint
[   11.637009][    T0] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   11.645067][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.4.0-rc6-next-20191106+ #6
[   11.646992][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.646992][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.646992][    T0] Code: 55 41 54 49 89 fc 48 8d 79 20 53 48 89 cb e8 8e 9d fa ff 48 8b 5b 20 48 89 df e8 82 9d fa ff 4c 89 e7 4c 8b 2b e8 77 9d fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 04 00 00 00 e8 f3 8d 9b 00 31 c0
[   11.646992][    T0] RSP: 0000:ffffffff8a2079f0 EFLAGS: 00010286
[   11.646992][    T0] RAX: 0000000000000000 RBX: ffffffff8a207cf0 RCX: ffffffff88d74576
[   11.646992][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffff8a8f53d4
[   11.646992][    T0] RBP: ffffffff8a207a10 R08: fffffbfff151c01a R09: fffffbfff151c01a
[   11.646992][    T0] R10: fffffbfff151c019 R11: ffffffff8a8e00cb R12: 000f887fee5ff000
[   11.646992][    T0] R13: ffffffff88caf040 R14: 0000000011a00000 R15: ffffffff89cfdcc0
[   11.646992][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.646992][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.646992][    T0] CR2: ffff8890779ff000 CR3: 0000000c54a12000 CR4: 00000000000406b0
[   11.646992][    T0] Call Trace:
[   11.646992][    T0]  walk_pgd_range+0x9e5/0xdb0
[   11.646992][    T0]  __walk_page_range+0x206/0x230
[   11.646992][    T0]  walk_page_range_novma+0xc5/0x130
[   11.646992][    T0]  ? walk_page_range+0x220/0x220
[   11.646992][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.646992][    T0]  ptdump_walk_pgd_level_core+0x169/0x1e0
[   11.646992][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.646992][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.646992][    T0]  ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[   11.646992][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.646992][    T0]  ? __enc_copy+0x90/0x90
[   11.646992][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.646992][    T0]  efi_dump_pagetable+0x35/0x37
[   11.646992][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.646992][    T0]  start_kernel+0x607/0x6a9
[   11.646992][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.646992][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.646992][    T0]  x86_64_start_reservations+0x24/0x26
[   11.646992][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.646992][    T0]  secondary_startup_64+0xb6/0xc0
[   11.646992][    T0] Modules linked in:
[   11.647003][    T0] ---[ end trace 751e8882de194a93 ]---
[   11.652355][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.657001][    T0] Code: 55 41 54 49 89 fc 48 8d 79 20 53 48 89 cb e8 8e 9d fa ff 48 8b 5b 20 48 89 df e8 82 9d fa ff 4c 89 e7 4c 8b 2b e8 77 9d fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 04 00 00 00 e8 f3 8d 9b 00 31 c0
[   11.666998][    T0] RSP: 0000:ffffffff8a2079f0 EFLAGS: 00010286
[   11.672961][    T0] RAX: 0000000000000000 RBX: ffffffff8a207cf0 RCX: ffffffff88d74576
[   11.676998][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffff8a8f53d4
[   11.686998][    T0] RBP: ffffffff8a207a10 R08: fffffbfff151c01a R09: fffffbfff151c01a
[   11.696998][    T0] R10: fffffbfff151c019 R11: ffffffff8a8e00cb R12: 000f887fee5ff000
[   11.704882][    T0] R13: ffffffff88caf040 R14: 0000000011a00000 R15: ffffffff89cfdcc0
[   11.706999][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.716998][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.726998][    T0] CR2: ffff8890779ff000 CR3: 0000000c54a12000 CR4: 00000000000406b0
[   11.736998][    T0] Kernel panic - not syncing: Fatal exception
[   11.743272][    T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

>> 
>> Changes since v13:
>> https://lore.kernel.org/lkml/20191024093716.49420-1-steven.price@arm.com/
>> * Fixed typo in arc definition of pmd_leaf() spotted by the kbuild test
>>   robot
>> * Added tags
>> 
>> Changes since v12:
>> https://lore.kernel.org/lkml/20191018101248.33727-1-steven.price@arm.com/
>> * Correct code format in riscv pud_leaf()/pmd_leaf()
>> * v12 may not have reached everyone because of mail server problems
>>   (which are now hopefully resolved!)
>> 
>> Changes since v11:
>> https://lore.kernel.org/lkml/20191007153822.16518-1-steven.price@arm.com/
>> * Use "-1" as dummy depth parameter in patch 14.
>> 
>> Changes since v10:
>> https://lore.kernel.org/lkml/20190731154603.41797-1-steven.price@arm.com/
>> * Rebased to v5.4-rc1 - mainly various updates to deal with the
>>   splitting out of ops from struct mm_walk.
>> * Deal with PGD_LEVEL_MULT not always being constant on x86.
>> 
>> Changes since v9:
>> https://lore.kernel.org/lkml/20190722154210.42799-1-steven.price@arm.com/
>> * Moved generic macros to first patch in the series and explained the
>>   macro naming in the commit message.
>> * mips: Moved macros to pgtable.h as they are now valid for both 32 and 64
>>   bit
>> * x86: Dropped patch which changed the debugfs output for x86, instead
>>   we have...
>> * new patch adding 'depth' parameter to pte_hole. This is used to
>>   provide the necessary information to output lines for 'holes' in the
>>   debugfs files
>> * new patch changing arm64 debugfs output to include holes to match x86
>> * generic ptdump KASAN handling has been simplified and now works with
>>   CONFIG_DEBUG_VIRTUAL.
>> 
>> Changes since v8:
>> https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/
>> * Rename from p?d_large() to p?d_leaf()
>> * Dropped patches migrating arm64/x86 custom walkers to
>>   walk_page_range() in favour of adding a generic PTDUMP implementation
>>   and migrating arm64/x86 to that instead.
>> * Rebased to v5.3-rc1
>> 
>> Steven Price (23):
>>  mm: Add generic p?d_leaf() macros
>>  arc: mm: Add p?d_leaf() definitions
>>  arm: mm: Add p?d_leaf() definitions
>>  arm64: mm: Add p?d_leaf() definitions
>>  mips: mm: Add p?d_leaf() definitions
>>  powerpc: mm: Add p?d_leaf() definitions
>>  riscv: mm: Add p?d_leaf() definitions
>>  s390: mm: Add p?d_leaf() definitions
>>  sparc: mm: Add p?d_leaf() definitions
>>  x86: mm: Add p?d_leaf() definitions
>>  mm: pagewalk: Add p4d_entry() and pgd_entry()
>>  mm: pagewalk: Allow walking without vma
>>  mm: pagewalk: Add test_p?d callbacks
>>  mm: pagewalk: Add 'depth' parameter to pte_hole
>>  x86: mm: Point to struct seq_file from struct pg_state
>>  x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
>>  x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
>>  x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
>>  mm: Add generic ptdump
>>  x86: mm: Convert dump_pagetables to use walk_page_range
>>  arm64: mm: Convert mm/dump.c to use walk_page_range()
>>  arm64: mm: Display non-present entries in ptdump
>>  mm: ptdump: Reduce level numbers by 1 in note_page()
>> 
>> arch/arc/include/asm/pgtable.h               |   1 +
>> arch/arm/include/asm/pgtable-2level.h        |   1 +
>> arch/arm/include/asm/pgtable-3level.h        |   1 +
>> arch/arm64/Kconfig                           |   1 +
>> arch/arm64/Kconfig.debug                     |  19 +-
>> arch/arm64/include/asm/pgtable.h             |   2 +
>> arch/arm64/include/asm/ptdump.h              |   8 +-
>> arch/arm64/mm/Makefile                       |   4 +-
>> arch/arm64/mm/dump.c                         | 148 +++-----
>> arch/arm64/mm/mmu.c                          |   4 +-
>> arch/arm64/mm/ptdump_debugfs.c               |   2 +-
>> arch/mips/include/asm/pgtable.h              |   5 +
>> arch/powerpc/include/asm/book3s/64/pgtable.h |  30 +-
>> arch/riscv/include/asm/pgtable-64.h          |   7 +
>> arch/riscv/include/asm/pgtable.h             |   7 +
>> arch/s390/include/asm/pgtable.h              |   2 +
>> arch/sparc/include/asm/pgtable_64.h          |   2 +
>> arch/x86/Kconfig                             |   1 +
>> arch/x86/Kconfig.debug                       |  20 +-
>> arch/x86/include/asm/pgtable.h               |  10 +-
>> arch/x86/mm/Makefile                         |   4 +-
>> arch/x86/mm/debug_pagetables.c               |   8 +-
>> arch/x86/mm/dump_pagetables.c                | 343 +++++--------------
>> arch/x86/platform/efi/efi_32.c               |   2 +-
>> arch/x86/platform/efi/efi_64.c               |   4 +-
>> drivers/firmware/efi/arm-runtime.c           |   2 +-
>> fs/proc/task_mmu.c                           |   4 +-
>> include/asm-generic/pgtable.h                |  20 ++
>> include/linux/pagewalk.h                     |  42 ++-
>> include/linux/ptdump.h                       |  22 ++
>> mm/Kconfig.debug                             |  21 ++
>> mm/Makefile                                  |   1 +
>> mm/hmm.c                                     |   8 +-
>> mm/migrate.c                                 |   5 +-
>> mm/mincore.c                                 |   1 +
>> mm/pagewalk.c                                | 126 +++++--
>> mm/ptdump.c                                  | 151 ++++++++
>> 37 files changed, 586 insertions(+), 453 deletions(-)
>> create mode 100644 include/linux/ptdump.h
>> create mode 100644 mm/ptdump.c
>>
Steven Price Nov. 6, 2019, 3:05 p.m. UTC | #3
On 06/11/2019 13:31, Qian Cai wrote:
> 
> 
>> On Nov 4, 2019, at 2:35 PM, Qian Cai <cai@lca.pw> wrote:
>>
>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
[...]
>>> Changes since v14:
>>> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>>> * Switch walk_page_range() into two functions, the existing
>>>    walk_page_range() now still requires VMAs (and treats areas without a
>>>    VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>>    will report the actual page table layout. This fixes the previous
>>>    breakage of /proc/<pid>/pagemap
>>> * New patch at the end of the series which reduces the 'level' numbers
>>>    by 1 to simplify the code slightly
>>> * Added tags
>>
>> Does this new version also take care of this boot crash seen with v14? Suppose
>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>
>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>
> 
> V15 is indeed DOA here.

Thanks for finding this, it looks like EFI causes issues here. The below fixes
this for me (booting in QEMU).

Andrew: do you want me to send out the entire series again for this fix, or
can you squash this into mm-pagewalk-allow-walking-without-vma.patch?

Thanks,

Steve

---8<---
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c7529dc4f82b..70dcaa23598f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
  			split_huge_pmd(walk->vma, pmd, addr);
  			if (pmd_trans_unstable(pmd))
  				goto again;
-		} else if (pmd_leaf(*pmd)) {
+		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
  			continue;
  		}
  
@@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
  			split_huge_pud(walk->vma, pud, addr);
  			if (pud_none(*pud))
  				goto again;
-		} else if (pud_leaf(*pud)) {
+		} else if (pud_leaf(*pud) || !pud_present(*pud)) {
  			continue;
  		}
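
As a rough illustration of why the extra !p?d_present() check matters, here is a minimal userspace sketch (not kernel code). The struct entry type, the ENTRY_* flags, walk_table() and demo_walk() are hypothetical stand-ins for the kernel's page table types and walker. The only idea taken from the patch above is that an entry which is a leaf, or which is not present, must be skipped rather than followed to a lower-level table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a page table entry; not the kernel's pmd_t/pud_t. */
#define ENTRY_PRESENT (1u << 0)
#define ENTRY_LEAF    (1u << 1)

struct entry {
	uint32_t flags;
	const struct entry *next;	/* lower-level table, only valid for present non-leaf entries */
	size_t nr_next;			/* number of entries in that table */
};

static bool entry_present(const struct entry *e)
{
	return (e->flags & ENTRY_PRESENT) != 0;
}

static bool entry_leaf(const struct entry *e)
{
	return entry_present(e) && (e->flags & ENTRY_LEAF) != 0;
}

/*
 * Returns the number of lowest-level entries reached. At higher levels,
 * leaf and non-present entries are skipped, mirroring the patched condition.
 */
static size_t walk_table(const struct entry *table, size_t nr, int depth)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr; i++) {
		const struct entry *e = &table[i];

		if (depth == 0) {
			visited++;
			continue;
		}
		if (entry_leaf(e) || !entry_present(e))
			continue;
		/* Only present table pointers get here, so next must be valid. */
		assert(e->next != NULL);
		visited += walk_table(e->next, e->nr_next, depth - 1);
	}
	return visited;
}

/* Builds a small two-level table and walks it. */
static size_t demo_walk(void)
{
	static const struct entry ptes[2] = {
		{ ENTRY_PRESENT, NULL, 0 },
		{ 0, NULL, 0 },
	};
	const struct entry pmds[3] = {
		{ ENTRY_PRESENT, ptes, 2 },			/* table pointer: descend */
		{ ENTRY_PRESENT | ENTRY_LEAF, NULL, 0 },	/* huge mapping: skip */
		{ 0, NULL, 0 },					/* not present: skip */
	};

	return walk_table(pmds, 3, 1);
}
```

In this model, dropping the !entry_present(e) test would make the walker follow the NULL pointer stored in the third upper-level entry, which is loosely analogous to the crash reported above.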
David Hildenbrand Dec. 3, 2019, 11:02 a.m. UTC | #4
On 06.11.19 16:05, Steven Price wrote:
> On 06/11/2019 13:31, Qian Cai wrote:
>>
>>
>>> On Nov 4, 2019, at 2:35 PM, Qian Cai <cai@lca.pw> wrote:
>>>
>>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
> [...]
>>>> Changes since v14:
>>>> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>>>> * Switch walk_page_range() into two functions, the existing
>>>>    walk_page_range() now still requires VMAs (and treats areas without a
>>>>    VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>>>    will report the actual page table layout. This fixes the previous
>>>>    breakage of /proc/<pid>/pagemap
>>>> * New patch at the end of the series which reduces the 'level' numbers
>>>>    by 1 to simplify the code slightly
>>>> * Added tags
>>>
>>> Does this new version also take care of this boot crash seen with v14? Suppose
>>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>>
>>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>>
>>
>> V15 is indeed DOA here.
> 
> Thanks for finding this, it looks like EFI causes issues here. The below fixes
> this for me (booting in QEMU).
> 
> Andrew: do you want me to send out the entire series again for this fix, or
> can you squash this into mm-pagewalk-allow-walking-without-vma.patch?
> 
> Thanks,
> 
> Steve
> 
> ---8<---
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index c7529dc4f82b..70dcaa23598f 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>   			split_huge_pmd(walk->vma, pmd, addr);
>   			if (pmd_trans_unstable(pmd))
>   				goto again;
> -		} else if (pmd_leaf(*pmd)) {
> +		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
>   			continue;
>   		}
>   
> @@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   			split_huge_pud(walk->vma, pud, addr);
>   			if (pud_none(*pud))
>   				goto again;
> -		} else if (pud_leaf(*pud)) {
> +		} else if (pud_leaf(*pud) || !pud_present(*pud)) {
>   			continue;
>   		}
>   
> 

Even with this fix, booting for me under QEMU fails. See

https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
Qian Cai Dec. 4, 2019, 2:54 p.m. UTC | #5
> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> [...]
>
> Even with this fix, booting for me under QEMU fails. See
> 
> https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
> 

Yes, for some reason this now crashes on almost all arches here, so it might be worth
having Andrew revert those in the meantime while Steven reworks the series.
David Hildenbrand Dec. 4, 2019, 2:56 p.m. UTC | #6
On 04.12.19 15:54, Qian Cai wrote:
> 
> 
>> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> Even with this fix, booting for me under QEMU fails. See
>>
>> https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
>>
> 
> Yes, for some reason this now crashes on almost all arches here, so it might be worth
> having Andrew revert those in the meantime while Steven reworks the series.

I agree, this produces too much noise.
Steven Price Dec. 4, 2019, 4:32 p.m. UTC | #7
On Wed, Dec 04, 2019 at 02:56:58PM +0000, David Hildenbrand wrote:
> On 04.12.19 15:54, Qian Cai wrote:
> > 
> > 
> >> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> [...]
> >>
> >> Even with this fix, booting for me under QEMU fails. See
> >>
> >> https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
> >>
> > 
> > Yes, for some reason this now crashes on almost all arches here, so it might be worth
> > having Andrew revert those in the meantime while Steven reworks the series.
> 
> I agree, this produces too much noise.

I've bisected this problem and it's a merge conflict with:

ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")

Reverting that commit "fixes" the problem. That commit adds a call to
pte_offset_map_lock(), however that isn't necessarily safe when
considering an "unusual" mapping in the kernel. Combined with my patch
set this leads to the BUG when walking the kernel's page tables.

At this stage I think it's best if Andrew drops my series and I'll try
to rework it on top of -rc1, fixing up this conflict and the other x86
32-bit issue that has cropped up.

Steve
Thomas Hellstrom Dec. 4, 2019, 5:51 p.m. UTC | #8
On 12/4/19 5:32 PM, Steven Price wrote:
> On Wed, Dec 04, 2019 at 02:56:58PM +0000, David Hildenbrand wrote:
>> On 04.12.19 15:54, Qian Cai wrote:
>>>
>>>> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 06.11.19 16:05, Steven Price wrote:
>>>>> On 06/11/2019 13:31, Qian Cai wrote:
>>>>>>
>>>>>>> On Nov 4, 2019, at 2:35 PM, Qian Cai <cai@lca.pw> wrote:
>>>>>>>
>>>>>>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
>>>>> [...]
>>>>>>>> Changes since v14:
>>>>>>>> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>>>>>>>> * Switch walk_page_range() into two functions, the existing
>>>>>>>>   walk_page_range() now still requires VMAs (and treats areas without a
>>>>>>>>   VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>>>>>>>   will report the actual page table layout. This fixes the previous
>>>>>>>>   breakage of /proc/<pid>/pagemap
>>>>>>>> * New patch at the end of the series which reduces the 'level' numbers
>>>>>>>>   by 1 to simplify the code slightly
>>>>>>>> * Added tags
>>>>>>> Does this new version also take care of this boot crash seen with v14? Suppose
>>>>>>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>>>>>>
>>>>>>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>>>>>>
>>>>>> V15 is indeed DOA here.
>>>>> [...]
>>>> Even with this fix, booting for me under QEMU fails. See
>>>>
>>>> https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
>>>>
>>> Yes, for some reason this now crashes on almost all arches here, so it might be worth
>>> having Andrew revert those in the meantime while Steven reworks the series.
>> I agree, this produces too much noise.
> I've bisected this problem and it's a merge conflict with:
>
> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")
>
> Reverting that commit "fixes" the problem. That commit adds a call to
> pte_offset_map_lock(), however that isn't necessarily safe when
> considering an "unusual" mapping in the kernel. Combined with my patch
> set this leads to the BUG when walking the kernel's page tables.
>
> At this stage I think it's best if Andrew drops my series and I'll try
> to rework it on top of -rc1, fixing up this conflict and the other x86
> 32-bit issue that has cropped up.

Hi,

Unfortunately I wasn't aware of that conflict.

Perhaps something similar to this

https://elixir.bootlin.com/linux/v5.4/source/mm/memory.c#L2012

would fix at least this particular issue?

/Thomas
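
One possible reading of this suggestion, assuming the linked code special-cases the kernel's own page tables, is to take the page table lock only for walks of user mappings. The following is a minimal userspace sketch of that idea, not kernel code: struct fake_lock, struct fake_walk, the no_vma flag, walk_pte_level() and demo_lock_acquisitions() are all hypothetical names introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are real kernel types or functions. */
struct fake_lock {
	bool held;
	int acquisitions;
};

struct fake_walk {
	struct fake_lock *ptl;	/* models the page table lock */
	bool no_vma;		/* assumed flag: true when walking kernel page tables */
	int ptes_seen;
};

static void visit_ptes(struct fake_walk *walk, int nr)
{
	walk->ptes_seen += nr;
}

static void walk_pte_level(struct fake_walk *walk, int nr)
{
	if (walk->no_vma) {
		/* Kernel-style walk: visit the entries without taking the lock. */
		visit_ptes(walk, nr);
		return;
	}

	/* User-style walk: hold the lock while the entries are visited. */
	walk->ptl->held = true;
	walk->ptl->acquisitions++;
	visit_ptes(walk, nr);
	walk->ptl->held = false;
}

/* Returns how many times the lock was taken for one PTE-level walk. */
static int demo_lock_acquisitions(bool no_vma)
{
	struct fake_lock ptl = { false, 0 };
	struct fake_walk walk = { &ptl, no_vma, 0 };

	walk_pte_level(&walk, 4);
	return ptl.acquisitions;
}
```

Whether skipping the lock is actually safe for a given kernel mapping is a separate question that this sketch does not answer; it only shows the shape of the branch.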




>
> Steve
>
Qian Cai Dec. 5, 2019, 1:15 p.m. UTC | #9
> On Dec 4, 2019, at 11:32 AM, Steven Price <Steven.Price@arm.com> wrote:
> 
> I've bisected this problem and it's a merge conflict with:
> 
> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")

Sigh, how did that commit end up merged into mainline without going through Andrew’s tree, missing all the linux-next testing? Was it merged into mainline on Oct 4th?

> Reverting that commit "fixes" the problem. That commit adds a call to
> pte_offset_map_lock(), however that isn't necessarily safe when
> considering an "unusual" mapping in the kernel. Combined with my patch
> set this leads to the BUG when walking the kernel's page tables.
> 
> At this stage I think it's best if Andrew drops my series and I'll try
> to rework it on top of -rc1, fixing up this conflict and the other x86
> 32-bit issue that has cropped up.
Thomas Hellstrom Dec. 5, 2019, 2:32 p.m. UTC | #10
On Thu, 2019-12-05 at 08:15 -0500, Qian Cai wrote:
> > On Dec 4, 2019, at 11:32 AM, Steven Price <Steven.Price@arm.com>
> > wrote:
> > 
> > I've bisected this problem and it's a merge conflict with:
> > 
> > ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in
> > walk_pte_range()")
> 
> > Sigh, how did that commit end up merged into mainline without
> > going through Andrew’s tree, missing all the linux-next testing? Was
> > it merged into mainline on Oct 4th?

It was acked by Andrew to be merged through a drm tree, since it was
part of graphics driver functionality. It was preceded by a fairly
lengthy discussion on linux-mm / linux-kernel.

It was merged into drm-next on 19-11-28; I think that's when it
is normally seen by linux-next. It was merged into mainline on 19-11-30.
Andrew's tree got merged on 19-12-05.

linux-next signaled a merge conflict from one of the patches in this
series (not this one), which was resolved manually with the akpm tree
on 19-12-02.

Thomas
Qian Cai Dec. 5, 2019, 2:38 p.m. UTC | #11
> On Dec 5, 2019, at 9:32 AM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> 
> On Thu, 2019-12-05 at 08:15 -0500, Qian Cai wrote:
>>> On Dec 4, 2019, at 11:32 AM, Steven Price <Steven.Price@arm.com>
>>> wrote:
>>> 
>>> I've bisected this problem and it's a merge conflict with:
>>> 
>>> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in
>>> walk_pte_range()")
>> 
>> Sigh, how did that commit end up merged into mainline without
>> going through Andrew’s tree, missing all the linux-next testing? Was
>> it merged into mainline on Oct 4th?
> 
> It was acked by Andrew to be merged through a drm tree, since it was
> part of graphics driver functionality. It was preceded by a fairly
> lengthy discussion on linux-mm / linux-kernel.
> 
> It was merged into drm-next on 19-11-28; I think that's when it
> is normally seen by linux-next. It was merged into mainline on 19-11-30.
> Andrew's tree got merged on 19-12-05.

Ah, that was the problem. It was merged into mainline only a day or two
after it showed up in linux-next. That isn’t enough time for integration testing.

> 
> linux-next signaled a merge conflict from one of the patches in this
> series (not this one), which was resolved manually with the akpm tree
> on 19-12-02.
> 
> Thomas