mbox series

[v1,0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode

Message ID cover.1744037648.git.agordeev@linux.ibm.com (mailing list archive)
Headers show
Series mm: Fix apply_to_pte_range() vs lazy MMU mode | expand

Message

Alexander Gordeev April 7, 2025, 3:11 p.m. UTC
Hi All,

This series is an attempt to fix the violation of lazy MMU mode context
requirement as described for arch_enter_lazy_mmu_mode():

    This mode can only be entered and left under the protection of
    the page table locks for all page tables which may be modified.

On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:

    [  553.332108] preempt_count: 1, expected: 0
    [  553.332117] no locks held by multipathd/2116.
    [  553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
    [  553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
    [  553.332146] Call Trace:
    [  553.332152]  [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
    [  553.332167]  [<0000000013e10d12>] __might_resched+0x57a/0x5e8
    [  553.332178]  [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
    [  553.332189]  [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
    [  553.332198]  [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
    [  553.332207]  [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
    [  553.332218]  [<000000001448125a>] apply_to_pmd_range+0xda/0x318
    [  553.332226]  [<000000001448181c>] __apply_to_page_range+0x384/0x768
    [  553.332233]  [<0000000014481c28>] apply_to_page_range+0x28/0x38
    [  553.332241]  [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
    [  553.332249]  [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
    [  553.332257]  [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
    [  553.332265]  [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
    [  553.332274]  [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
    [  553.332284]  [<0000000013d62726>] dup_task_struct+0x66/0x430
    [  553.332293]  [<0000000013d63962>] copy_process+0x432/0x4b80
    [  553.332302]  [<0000000013d68300>] kernel_clone+0xf0/0x7d0
    [  553.332311]  [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
    [  553.332400]  [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
    [  553.332410]  [<0000000013c9d34c>] do_syscall+0x22c/0x328
    [  553.332419]  [<00000000158e7366>] __do_syscall+0xce/0xf0
    [  553.332428]  [<0000000015913260>] system_call+0x70/0x98

This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
issue fixed with patches 2-3. Patch 4 is a debug improvement on top,
that could have helped to notice the issue.

Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
mode") looks like powerpc-only fix, yet not entirely conforming to the
above provided requirement (page tables itself are still not protected).
If I am not mistaken, xen and sparc are alike.

Thanks!

Alexander Gordeev (4):
  kasan: Avoid sleepable page allocation from atomic context
  mm: Cleanup apply_to_pte_range() routine
  mm: Protect kernel pgtables in apply_to_pte_range()
  mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context

 include/linux/pgtable.h | 15 ++++++++++++---
 mm/kasan/shadow.c       |  9 +++------
 mm/memory.c             | 33 +++++++++++++++++++++------------
 3 files changed, 36 insertions(+), 21 deletions(-)

Comments

Andrew Morton April 7, 2025, 6:35 p.m. UTC | #1
On Mon,  7 Apr 2025 17:11:26 +0200 Alexander Gordeev <agordeev@linux.ibm.com> wrote:

> This series is an attempt to fix the violation of lazy MMU mode context
> requirement as described for arch_enter_lazy_mmu_mode():
> 
>     This mode can only be entered and left under the protection of
>     the page table locks for all page tables which may be modified.
> 
> On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
> arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
>
> ...
>

Could you please reorganize this into two series?  One series which
should be fast-tracked into 6.15-rcX and one series for 6.16-rc1?

And in the first series, please suggest whether its patches should be
backported into -stable and see if we can come up with suitable Fixes:
targets?

Thanks.
Nicholas Piggin April 11, 2025, 7:12 a.m. UTC | #2
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> Hi All,
>
> This series is an attempt to fix the violation of lazy MMU mode context
> requirement as described for arch_enter_lazy_mmu_mode():
>
>     This mode can only be entered and left under the protection of
>     the page table locks for all page tables which may be modified.
>
> On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
> arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
>
>     [  553.332108] preempt_count: 1, expected: 0
>     [  553.332117] no locks held by multipathd/2116.
>     [  553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
>     [  553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
>     [  553.332146] Call Trace:
>     [  553.332152]  [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
>     [  553.332167]  [<0000000013e10d12>] __might_resched+0x57a/0x5e8
>     [  553.332178]  [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
>     [  553.332189]  [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
>     [  553.332198]  [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
>     [  553.332207]  [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
>     [  553.332218]  [<000000001448125a>] apply_to_pmd_range+0xda/0x318
>     [  553.332226]  [<000000001448181c>] __apply_to_page_range+0x384/0x768
>     [  553.332233]  [<0000000014481c28>] apply_to_page_range+0x28/0x38
>     [  553.332241]  [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
>     [  553.332249]  [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
>     [  553.332257]  [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
>     [  553.332265]  [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
>     [  553.332274]  [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
>     [  553.332284]  [<0000000013d62726>] dup_task_struct+0x66/0x430
>     [  553.332293]  [<0000000013d63962>] copy_process+0x432/0x4b80
>     [  553.332302]  [<0000000013d68300>] kernel_clone+0xf0/0x7d0
>     [  553.332311]  [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
>     [  553.332400]  [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
>     [  553.332410]  [<0000000013c9d34c>] do_syscall+0x22c/0x328
>     [  553.332419]  [<00000000158e7366>] __do_syscall+0xce/0xf0
>     [  553.332428]  [<0000000015913260>] system_call+0x70/0x98
>
> This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
> issue fixed with patches 2-3. Patch 4 is a debug improvement on top,
> that could have helped to notice the issue.
>
> Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
> mode") looks like powerpc-only fix, yet not entirely conforming to the
> above provided requirement (page tables itself are still not protected).
> If I am not mistaken, xen and sparc are alike.

Huh. powerpc actually has some crazy code in __switch_to() that is
supposed to handle preemption while in lazy mmu mode. So we probably
don't even need to disable preemption, just use the raw per-cpu
accessors (or keep disabling preemption and remove the now dead code
from context switch).

IIRC all this got built up over a long time with some TLB flush
rules changing at the same time, we could probably stay in lazy mmu
mode for a longer time until it was discovered we really need to
flush before dropping the PTL.

ppc64 and sparc I think don't even need lazy mmu mode for kasan (TLBs
do not require flushing) and will function just fine if not in lazy
mode (they just flush one TLB at a time), not sure about xen. We could
actually go the other way and require that archs operate properly when
not in lazy mode (at least for kernel page tables) and avoid it for
apply_to_page_range()?

Thanks,
Nick
Alexander Gordeev April 11, 2025, 12:04 p.m. UTC | #3
On Fri, Apr 11, 2025 at 05:12:28PM +1000, Nicholas Piggin wrote:
...
> Huh. powerpc actually has some crazy code in __switch_to() that is
> supposed to handle preemption while in lazy mmu mode. So we probably
> don't even need to disable preemption, just use the raw per-cpu
> accessors (or keep disabling preemption and remove the now dead code
> from context switch).

Well, I tried to do the latter ;)
https://lore.kernel.org/linuxppc-dev/3b4e3e28172f09165b19ee7cac67a860d7cc1c6e.1742915600.git.agordeev@linux.ibm.com/
Not sure how it is aligned with the current state (see below).

> IIRC all this got built up over a long time with some TLB flush
> rules changing at the same time, we could probably stay in lazy mmu
> mode for a longer time until it was discovered we really need to
> flush before dropping the PTL.
> 
> ppc64 and sparc I think don't even need lazy mmu mode for kasan (TLBs
> do not require flushing) and will function just fine if not in lazy
> mode (they just flush one TLB at a time), not sure about xen. We could
> actually go the other way and require that archs operate properly when
> not in lazy mode (at least for kernel page tables) and avoid it for
> apply_to_page_range()?

Ryan Roberts hopefully brought some order to the topic:
https://lore.kernel.org/linux-mm/20250303141542.3371656-1-ryan.roberts@arm.com/

> Thanks,
> Nick