Message ID | 20230802151406.3735276-7-willy@infradead.org (mailing list archive) |
---|---|
State | New |
Series | New page table range API
On Wed, 2023-08-02 at 16:13 +0100, Matthew Wilcox (Oracle) wrote:
> Most architectures can just define set_pte() and PFN_PTE_SHIFT to
> use this definition. It's also a handy spot to document the guarantees
> provided by the MM.
>
> Suggested-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
>  include/linux/pgtable.h | 81 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 60 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f34e0f2cb4d8..3fde0d5d1c29 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -182,6 +182,66 @@ static inline int pmd_young(pmd_t pmd)
>  }
>  #endif
>
> +/*
> + * A facility to provide lazy MMU batching. This allows PTE updates and
> + * page invalidations to be delayed until a call to leave lazy MMU mode
> + * is issued. Some architectures may benefit from doing this, and it is
> + * beneficial for both shadow and direct mode hypervisors, which may batch
> + * the PTE updates which happen during this window. Note that using this
> + * interface requires that read hazards be removed from the code. A read
> + * hazard could result in the direct mode hypervisor case, since the actual
> + * write to the page tables may not yet have taken place, so reads though
> + * a raw PTE pointer after it has been modified are not guaranteed to be
> + * up to date. This mode can only be entered and left under the protection of
> + * the page table locks for all page tables which may be modified. In the UP
> + * case, this is required so that preemption is disabled, and in the SMP case,
> + * it must synchronize the delayed page table writes properly on other CPUs.
> + */
> +#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#define arch_enter_lazy_mmu_mode()      do {} while (0)
> +#define arch_leave_lazy_mmu_mode()      do {} while (0)
> +#define arch_flush_lazy_mmu_mode()      do {} while (0)
> +#endif
> +
> +#ifndef set_ptes
> +#ifdef PFN_PTE_SHIFT
> +/**
> + * set_ptes - Map consecutive pages to a contiguous range of addresses.
> + * @mm: Address space to map the pages into.
> + * @addr: Address to map the first page at.
> + * @ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @nr: Number of pages to map.
> + *
> + * May be overridden by the architecture, or the architecture can define
> + * set_pte() and PFN_PTE_SHIFT.
> + *
> + * Context: The caller holds the page table lock. The pages all belong
> + * to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +                pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +        page_table_check_ptes_set(mm, ptep, pte, nr);
> +
> +        arch_enter_lazy_mmu_mode();
> +        for (;;) {
> +                set_pte(ptep, pte);
> +                if (--nr == 0)
> +                        break;
> +                ptep++;
> +                pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +        }
> +        arch_leave_lazy_mmu_mode();
> +}

This breaks the Xen PV guest.

In move_ptes() in mm/mremap.c we arch_enter_lazy_mmu_mode() and then
loop calling set_pte_at(). Which now (or at least in a few commits time
when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
and:

[    0.628700] ------------[ cut here ]------------
[    0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!
[    0.628743] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    0.628769] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-rc4+ #1295
[    0.628818] RIP: e030:paravirt_enter_lazy_mmu+0x24/0x30
[    0.628839] Code: 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 8b 05 90 28 f9 7e 85 c0 75 10 65 c7 05 81 28 f9 7e 01 00 00 00 c3 cc cc cc cc <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[    0.628875] RSP: e02b:ffffc9004000ba48 EFLAGS: 00010202
[    0.628891] RAX: 0000000000000001 RBX: ffff8880051b7100 RCX: 000ffffffffff000
[    0.628908] RDX: 80000000763ff967 RSI: 80000000763ff967 RDI: ffff8880051b7100
[    0.628925] RBP: 80000000763ff967 R08: ffff8880051b6868 R09: 00007ffce1a20000
[    0.628943] R10: deadbeefdeadf00d R11: 0000000000000000 R12: 00007ffffffff000
[    0.628964] R13: ffff8880050b7000 R14: 0000000000000001 R15: 00007fffffffe000
[    0.628988] FS:  0000000000000000(0000) GS:ffff88807b800000(0000) knlGS:0000000000000000
[    0.629007] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.629024] CR2: ffffc900003f5000 CR3: 0000000003904000 CR4: 0000000000050660
[    0.629046] Call Trace:
[    0.629055]  <TASK>
[    0.629066]  ? die+0x36/0x90
[    0.629081]  ? do_trap+0xda/0x100
[    0.629093]  ? paravirt_enter_lazy_mmu+0x24/0x30
[    0.629112]  ? do_error_trap+0x6a/0x90
[    0.629123]  ? paravirt_enter_lazy_mmu+0x24/0x30
[    0.629138]  ? exc_invalid_op+0x50/0x70
[    0.629155]  ? paravirt_enter_lazy_mmu+0x24/0x30
[    0.629169]  ? asm_exc_invalid_op+0x1a/0x20
[    0.629185]  ? paravirt_enter_lazy_mmu+0x24/0x30
[    0.629212]  ? pte_offset_map_nolock+0x48/0xc0
[    0.629226]  set_ptes.constprop.0+0xd/0x30
[    0.629240]  move_ptes.isra.0+0xdd/0x290
[    0.629253]  ? pmd_install+0xab/0xd0
[    0.629267]  move_page_tables+0x3a0/0x850
[    0.629294]  shift_arg_pages+0xf4/0x1d0
[    0.629317]  setup_arg_pages+0x205/0x380
[    0.629330]  load_elf_binary+0x398/0xe00

I'm working on making PV kernels testable in qemu. With...

 • some qemu fixes and a nasty hackish Xen console implementation:
   https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv-console
 • a CONFIG_PV_SHIM_EXCLUSIVE build of Xen itself to run in the guest,
 • some suitable disk image lying around, in ${GUEST_IMAGE}, and
 • CONFIG_KVM_XEN enabled in your host kernel,

...you should be able to do something like:

$ ./qemu-system-x86_64 --accel kvm,xen-version=0x40011,kernel-irqchip=split \
    -drive file=${GUEST_IMAGE},if=none,id=disk \
    -device xen-disk,drive=disk,vdev=xvda \
    -m 1G -kernel ~/git/xen/xen/xen \
    -initrd ~/git/linux/arch/x86/boot/bzImage \
    -append "loglvl=all -- console=hvc0 root=/dev/xvda1" -display none
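For context on why the nested call is fatal rather than merely redundant: x86's
paravirt code tracks lazy MMU mode as a strict on/off per-CPU state and BUGs on
re-entry. A rough sketch of the relevant logic, paraphrased from
arch/x86/kernel/paravirt.c of this era (details approximate):

static inline void enter_lazy(enum paravirt_lazy_mode mode)
{
        /* Nesting is not supported: entering while already lazy is a bug. */
        BUG_ON(this_cpu_read(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);

        this_cpu_write(paravirt_lazy_mode, mode);
}

void paravirt_enter_lazy_mmu(void)
{
        enter_lazy(PARAVIRT_LAZY_MMU);
}

So when move_ptes() has already entered lazy MMU mode and the new set_ptes()
enters it again, the BUG_ON() fires and the guest dies in init's execve() path,
as the trace above shows.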
On Thu, Oct 12, 2023 at 02:53:05PM +0100, David Woodhouse wrote:
> > +        arch_enter_lazy_mmu_mode();
> > +        for (;;) {
> > +                set_pte(ptep, pte);
> > +                if (--nr == 0)
> > +                        break;
> > +                ptep++;
> > +                pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> > +        }
> > +        arch_leave_lazy_mmu_mode();
>
> This breaks the Xen PV guest.
>
> In move_ptes() in mm/mremap.c we arch_enter_lazy_mmu_mode() and then
> loop calling set_pte_at(). Which now (or at least in a few commits time
> when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
> implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
> and:
>
> [    0.628700] ------------[ cut here ]------------
> [    0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!

Easy fix ... don't do that ;-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..f3da8836f689 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,9 +231,11 @@ static inline pte_t pte_next_pfn(pte_t pte)
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                 pte_t *ptep, pte_t pte, unsigned int nr)
 {
+        bool multiple = nr > 1;
         page_table_check_ptes_set(mm, ptep, pte, nr);
 
-        arch_enter_lazy_mmu_mode();
+        if (multiple)
+                arch_enter_lazy_mmu_mode();
         for (;;) {
                 set_pte(ptep, pte);
                 if (--nr == 0)
@@ -241,7 +243,8 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                 ptep++;
                 pte = pte_next_pfn(pte);
         }
-        arch_leave_lazy_mmu_mode();
+        if (multiple)
+                arch_leave_lazy_mmu_mode();
 }
 #endif
 #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

I think long-term, we should make lazy_mmu_mode nestable. But this is
a reasonable quick fix.
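A hedged sketch of what making lazy_mmu_mode nestable could look like: a
per-CPU nesting depth, with the real arch hooks invoked only at the outermost
level. The helper names and the per-CPU variable here are hypothetical, not
from any posted patch; a per-CPU counter is safe because callers hold the page
table lock with preemption disabled:

static DEFINE_PER_CPU(unsigned int, lazy_mmu_nest_depth);

static inline void lazy_mmu_mode_enter(void)
{
        /* Only the outermost enter reaches the architecture hook. */
        if (this_cpu_inc_return(lazy_mmu_nest_depth) == 1)
                arch_enter_lazy_mmu_mode();
}

static inline void lazy_mmu_mode_leave(void)
{
        /* Only the outermost leave flushes the batched updates. */
        if (this_cpu_dec_return(lazy_mmu_nest_depth) == 0)
                arch_leave_lazy_mmu_mode();
}

With something of that shape, both move_ptes() and the generic set_ptes()
could enter and leave unconditionally, and the nr > 1 special-casing in the
quick fix above would become unnecessary.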
On Thu, 2023-10-12 at 15:05 +0100, Matthew Wilcox wrote:
> On Thu, Oct 12, 2023 at 02:53:05PM +0100, David Woodhouse wrote:
> > > +        arch_enter_lazy_mmu_mode();
> > > +        for (;;) {
> > > +                set_pte(ptep, pte);
> > > +                if (--nr == 0)
> > > +                        break;
> > > +                ptep++;
> > > +                pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> > > +        }
> > > +        arch_leave_lazy_mmu_mode();
> >
> > This breaks the Xen PV guest.
> >
> > In move_ptes() in mm/mremap.c we arch_enter_lazy_mmu_mode() and then
> > loop calling set_pte_at(). Which now (or at least in a few commits time
> > when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
> > implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
> > and:
> >
> > [    0.628700] ------------[ cut here ]------------
> > [    0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!
>
> Easy fix ... don't do that ;-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..f3da8836f689 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,9 +231,11 @@ static inline pte_t pte_next_pfn(pte_t pte)
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>                  pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> +        bool multiple = nr > 1;
>          page_table_check_ptes_set(mm, ptep, pte, nr);
>  
> -        arch_enter_lazy_mmu_mode();
> +        if (multiple)
> +                arch_enter_lazy_mmu_mode();
>          for (;;) {
>                  set_pte(ptep, pte);
>                  if (--nr == 0)
> @@ -241,7 +243,8 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>                  ptep++;
>                  pte = pte_next_pfn(pte);
>          }
> -        arch_leave_lazy_mmu_mode();
> +        if (multiple)
> +                arch_leave_lazy_mmu_mode();
>  }
>  #endif
>  #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
>
> I think long-term, we should make lazy_mmu_mode nestable. But this is
> a reasonable quick fix.

I don't much like doing it implicitly based on (nr==1) but sure, as a
quick fix that works. The 64-bit PV guest now boots again.

Tested-by: David Woodhouse <dwmw@amazon.co.uk>

Thanks.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f34e0f2cb4d8..3fde0d5d1c29 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -182,6 +182,66 @@ static inline int pmd_young(pmd_t pmd)
 }
 #endif
 
+/*
+ * A facility to provide lazy MMU batching. This allows PTE updates and
+ * page invalidations to be delayed until a call to leave lazy MMU mode
+ * is issued. Some architectures may benefit from doing this, and it is
+ * beneficial for both shadow and direct mode hypervisors, which may batch
+ * the PTE updates which happen during this window. Note that using this
+ * interface requires that read hazards be removed from the code. A read
+ * hazard could result in the direct mode hypervisor case, since the actual
+ * write to the page tables may not yet have taken place, so reads though
+ * a raw PTE pointer after it has been modified are not guaranteed to be
+ * up to date. This mode can only be entered and left under the protection of
+ * the page table locks for all page tables which may be modified. In the UP
+ * case, this is required so that preemption is disabled, and in the SMP case,
+ * it must synchronize the delayed page table writes properly on other CPUs.
+ */
+#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode()      do {} while (0)
+#define arch_leave_lazy_mmu_mode()      do {} while (0)
+#define arch_flush_lazy_mmu_mode()      do {} while (0)
+#endif
+
+#ifndef set_ptes
+#ifdef PFN_PTE_SHIFT
+/**
+ * set_ptes - Map consecutive pages to a contiguous range of addresses.
+ * @mm: Address space to map the pages into.
+ * @addr: Address to map the first page at.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @nr: Number of pages to map.
+ *
+ * May be overridden by the architecture, or the architecture can define
+ * set_pte() and PFN_PTE_SHIFT.
+ *
+ * Context: The caller holds the page table lock. The pages all belong
+ * to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+                pte_t *ptep, pte_t pte, unsigned int nr)
+{
+        page_table_check_ptes_set(mm, ptep, pte, nr);
+
+        arch_enter_lazy_mmu_mode();
+        for (;;) {
+                set_pte(ptep, pte);
+                if (--nr == 0)
+                        break;
+                ptep++;
+                pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+        }
+        arch_leave_lazy_mmu_mode();
+}
+#ifndef set_pte_at
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+#endif
+#else
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
                                  unsigned long address, pte_t *ptep,
@@ -1051,27 +1111,6 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define pgprot_decrypted(prot)  (prot)
 #endif
 
-/*
- * A facility to provide lazy MMU batching. This allows PTE updates and
- * page invalidations to be delayed until a call to leave lazy MMU mode
- * is issued. Some architectures may benefit from doing this, and it is
- * beneficial for both shadow and direct mode hypervisors, which may batch
- * the PTE updates which happen during this window. Note that using this
- * interface requires that read hazards be removed from the code. A read
- * hazard could result in the direct mode hypervisor case, since the actual
- * write to the page tables may not yet have taken place, so reads though
- * a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date. This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified. In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
- */
-#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode()      do {} while (0)
-#define arch_leave_lazy_mmu_mode()      do {} while (0)
-#define arch_flush_lazy_mmu_mode()      do {} while (0)
-#endif
-
 /*
  * A facility to provide batching of the reload of page tables and
  * other process state with the actual context switch code for
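To make the opt-in path from the commit message concrete ("Most architectures
can just define set_pte() and PFN_PTE_SHIFT"), here is a purely hypothetical
arch header fragment, assuming a PTE layout whose PFN field starts at bit 12;
neither the shift value nor the store is taken from any real port:

/* Hypothetical <asm/pgtable.h> fragment, illustrative only. */
#define PFN_PTE_SHIFT   12      /* assumed: PFN occupies bits 12 and up */

static inline void set_pte(pte_t *ptep, pte_t pte)
{
        *ptep = pte;    /* assumed: a PTE is a single word-sized store */
}

With those two definitions, each iteration of the generic set_ptes() loop
advances to the next page by adding 1UL << PFN_PTE_SHIFT to the raw PTE value,
which increments the PFN field by one while leaving the low permission bits
untouched.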