Message ID | 20240605203723.643329-1-yang@os.amperecomputing.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v4] arm64: mm: force write fault for atomic RMW instructions |
On 05.06.24 22:37, Yang Shi wrote:
> The atomic RMW instructions, for example, ldadd, actually does load +
> add + store in one instruction, it will trigger two page faults per the
> ARM64 architecture spec, the first fault is a read fault, the second
> fault is a write fault.
>
> Some applications use atomic RMW instructions to populate memory, for
> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> at launch time) between v18 and v22 in order to permit use of memory
> concurrently with pretouch.
>
> But the double page fault has some problems:
>
> 1. Noticeable TLB overhead. The kernel actually installs zero page with
>    readonly PTE for the read fault. The write fault will trigger a
>    write-protection fault (CoW). The CoW will allocate a new page and
>    make the PTE point to the new page, this needs TLB invalidations. The
>    tlb invalidation and the mandatory memory barriers may incur
>    significant overhead, particularly on the machines with many cores.
>
> 2. Break up huge pages. If THP is on the read fault will install huge
>    zero pages. The later CoW will break up the huge page and allocate
>    base pages instead of huge page. The applications have to rely on
>    khugepaged (kernel thread) to collapse huge pages asynchronously.
>    This also incurs noticeable performance penalty.
>
> 3. 512x page faults with huge page. Due to #2, the applications have to
>    have page faults for every 4K area for the write, this makes the speed
>    up by using huge page actually gone.

All interesting and valid points. As raised, the app likely really
should be using MADV_POPULATE_WRITE.

Acked-by: David Hildenbrand <david@redhat.com>

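The user-space alternative mentioned in the reply above can be sketched as follows. This is not part of the patch: it assumes Linux 5.14 or later for `MADV_POPULATE_WRITE`, and the helper names `pretouch_writable` and `pretouch_demo` are hypothetical. If the kernel rejects the advice, the sketch falls back to the per-page atomic-add-0 touch that the commit message attributes to OpenJDK.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* value from the Linux UAPI headers */
#endif

/*
 * Prefault [addr, addr + len) with writable pages.
 * Returns 0 if MADV_POPULATE_WRITE did the work, 1 if the fallback ran.
 * addr must be page aligned (as returned by mmap).
 */
static int pretouch_writable(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
		return 0;

	/* Fallback: atomic add of 0 to one byte per page, so the data is
	 * left unchanged even if other threads already use the memory. */
	for (size_t off = 0; off < len; off += page)
		__atomic_fetch_add((unsigned char *)addr + off, 0,
				   __ATOMIC_RELAXED);
	return 1;
}

/* Map 4 MiB of anonymous memory, pretouch it, and release it again. */
static int pretouch_demo(void)
{
	size_t len = 4UL << 20;
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	int ret = pretouch_writable(buf, len);
	buf[len - 1] = 1;	/* the range is writable either way */
	munmap(buf, len);
	return ret;
}
```

With `MADV_POPULATE_WRITE` the kernel takes write faults directly, so the read fault followed by CoW described in the commit message does not occur on that path. The fallback path is the case the patch is meant to speed up.
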
On Wed, Jun 05, 2024 at 01:37:23PM -0700, Yang Shi wrote:
> +static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
> +{
> +	return aarch64_insn_is_cas(insn) ||
> +	       aarch64_insn_is_casp(insn);
> +}
> +
> +/*
> + * Exclude unallocated atomic instructions and LD64B/LDAPR.
> + * The masks and values were generated by using Python sympy module.
> + */
> +static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
> +{
> +	return ((insn & 0x3f207c00) == 0x38200000) ||
> +	       ((insn & 0x3f208c00) == 0x38200000) ||
> +	       ((insn & 0x7fe06c00) == 0x78202000) ||
> +	       ((insn & 0xbf204c00) == 0x38200000);
> +}

This is still pretty opaque if we want to modify it in the future. I
guess we could add more tests on top but it would be nice to have a way
to re-generate these masks. I'll think about, for now these tests would
do.

> @@ -511,6 +539,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	unsigned long addr = untagged_addr(far);
>  	struct vm_area_struct *vma;
>  	int si_code;
> +	bool may_force_write = false;
>
>  	if (kprobe_page_fault(regs, esr))
>  		return 0;
> @@ -547,6 +576,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  		/* If EPAN is absent then exec implies read */
>  		if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
>  			vm_flags |= VM_EXEC;
> +		may_force_write = true;
>  	}
>
>  	if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
> @@ -568,6 +598,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>
> +	if (may_force_write && (vma->vm_flags & VM_WRITE) &&
> +	    is_el0_atomic_instr(regs)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

I think we can get rid of may_force_write and just test (vm_flags &
VM_READ).

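Since the masks are hard to read, a small user-space check can show what they accept and reject. The sketch below copies the mask/value pairs from the patch into standalone functions. The sample encodings were assembled by hand for this illustration and are assumptions, not taken from the thread; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask/value pairs copied from the patch. */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & 0x3b200c00) == 0x38200000;
}

static bool insn_is_class_cas(uint32_t insn)
{
	return (insn & 0x3fa07c00) == 0x08a07c00 ||	/* cas */
	       (insn & 0xbfa07c00) == 0x08207c00;	/* casp */
}

static bool atomic_insn_has_wr_perm(uint32_t insn)
{
	return ((insn & 0x3f207c00) == 0x38200000) ||
	       ((insn & 0x3f208c00) == 0x38200000) ||
	       ((insn & 0x7fe06c00) == 0x78202000) ||
	       ((insn & 0xbf204c00) == 0x38200000);
}

/* Same decision as is_el0_atomic_instr() makes after fetching the insn. */
static bool is_atomic_rmw(uint32_t insn)
{
	if (insn_is_class_atomic(insn) && atomic_insn_has_wr_perm(insn))
		return true;
	return insn_is_class_cas(insn);
}

/* Hand-assembled sample encodings (assumed, for illustration only). */
static void check_sample_encodings(void)
{
	assert(is_atomic_rmw(0xb8200041));	/* ldadd w0, w1, [x2] */
	assert(is_atomic_rmw(0xb8208041));	/* swp   w0, w1, [x2] */
	assert(is_atomic_rmw(0x88a07c41));	/* cas   w0, w1, [x2] */
	assert(!is_atomic_rmw(0xb8bfc041));	/* ldapr w1, [x2]: load only */
	assert(!is_atomic_rmw(0xb9400041));	/* ldr   w1, [x2]: plain load */
}
```

LDAPR is the interesting case: it matches the `class_atomic` mask but fails all four `has_wr_perm` tests, which is the exclusion the comment in the patch describes.
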
On 6/14/24 5:20 AM, Catalin Marinas wrote:
> On Wed, Jun 05, 2024 at 01:37:23PM -0700, Yang Shi wrote:
>> +static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
>> +{
>> +	return aarch64_insn_is_cas(insn) ||
>> +	       aarch64_insn_is_casp(insn);
>> +}
>> +
>> +/*
>> + * Exclude unallocated atomic instructions and LD64B/LDAPR.
>> + * The masks and values were generated by using Python sympy module.
>> + */
>> +static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
>> +{
>> +	return ((insn & 0x3f207c00) == 0x38200000) ||
>> +	       ((insn & 0x3f208c00) == 0x38200000) ||
>> +	       ((insn & 0x7fe06c00) == 0x78202000) ||
>> +	       ((insn & 0xbf204c00) == 0x38200000);
>> +}
> This is still pretty opaque if we want to modify it in the future. I
> guess we could add more tests on top but it would be nice to have a way
> to re-generate these masks. I'll think about, for now these tests would
> do.

Sorry for the late reply, just came back from vacation and tried to
catch up all the emails and TODOs. We should be able to share the tool
used by us to generate the tests. But it may take some time.

>
>> @@ -511,6 +539,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>  	unsigned long addr = untagged_addr(far);
>>  	struct vm_area_struct *vma;
>>  	int si_code;
>> +	bool may_force_write = false;
>>
>>  	if (kprobe_page_fault(regs, esr))
>>  		return 0;
>> @@ -547,6 +576,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>  		/* If EPAN is absent then exec implies read */
>>  		if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
>>  			vm_flags |= VM_EXEC;
>> +		may_force_write = true;
>>  	}
>>
>>  	if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
>> @@ -568,6 +598,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>  	if (!vma)
>>  		goto lock_mmap;
>>
>> +	if (may_force_write && (vma->vm_flags & VM_WRITE) &&
>> +	    is_el0_atomic_instr(regs)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
> I think we can get rid of may_force_write and just test (vm_flags &
> VM_READ).

Yes, will fix it in v5.

>

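The simplification agreed on here can be modelled in user space. The sketch below is one possible reading of the suggestion, not the v5 code: it relies on `vm_flags` containing `VM_READ` only on the read-fault path of `do_page_fault()`, so that bit can stand in for `may_force_write`. The flag macros are local stand-ins, and `struct fault_state`, `maybe_force_write` and `forced` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel flags; values are for this model only. */
#define VM_READ			0x1UL
#define VM_WRITE		0x2UL
#define VM_EXEC			0x4UL
#define FAULT_FLAG_WRITE	0x1U

struct fault_state {
	unsigned long vm_flags;	/* permissions the fault requires */
	unsigned int mm_flags;	/* flags handed to the generic handler */
};

/*
 * Upgrade a read fault to a write fault when the VMA is writable and the
 * faulting instruction is an atomic RMW. Returns true if upgraded.
 */
static bool maybe_force_write(struct fault_state *f, unsigned long vma_flags,
			      bool atomic_insn)
{
	if (!(f->vm_flags & VM_READ))	/* exec or write abort: leave alone */
		return false;
	if (!(vma_flags & VM_WRITE) || !atomic_insn)
		return false;

	f->vm_flags = VM_WRITE;
	f->mm_flags |= FAULT_FLAG_WRITE;
	return true;
}

/* Convenience wrapper: was a fault with these inputs upgraded? */
static bool forced(unsigned long fault_vm_flags, unsigned long vma_flags,
		   bool atomic_insn)
{
	struct fault_state f = { .vm_flags = fault_vm_flags, .mm_flags = 0 };

	return maybe_force_write(&f, vma_flags, atomic_insn);
}
```

In the patch the instruction check is evaluated last, presumably so that the `get_user()` fetch only happens for read faults on writable VMAs; the model takes the result as a plain boolean instead.
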
On 6/26/24 11:45 AM, Yang Shi wrote:
>
>
> On 6/14/24 5:20 AM, Catalin Marinas wrote:
>> On Wed, Jun 05, 2024 at 01:37:23PM -0700, Yang Shi wrote:
>>> +static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
>>> +{
>>> +	return aarch64_insn_is_cas(insn) ||
>>> +	       aarch64_insn_is_casp(insn);
>>> +}
>>> +
>>> +/*
>>> + * Exclude unallocated atomic instructions and LD64B/LDAPR.
>>> + * The masks and values were generated by using Python sympy module.
>>> + */
>>> +static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
>>> +{
>>> +	return ((insn & 0x3f207c00) == 0x38200000) ||
>>> +	       ((insn & 0x3f208c00) == 0x38200000) ||
>>> +	       ((insn & 0x7fe06c00) == 0x78202000) ||
>>> +	       ((insn & 0xbf204c00) == 0x38200000);
>>> +}
>> This is still pretty opaque if we want to modify it in the future. I
>> guess we could add more tests on top but it would be nice to have a way
>> to re-generate these masks. I'll think about, for now these tests would
>> do.
>
> Sorry for the late reply, just came back from vacation and tried to
> catch up all the emails and TODOs. We should be able to share the tool
> used by us to generate the tests. But it may take some time.

D Scott made the tool available publicly. Please refer to
https://gitlab.com/scott-ph/arm64-insn-group-minimizer

We can re-generate the tests with this tool in the future.

>
>>
>>> @@ -511,6 +539,7 @@ static int __kprobes do_page_fault(unsigned long
>>> far, unsigned long esr,
>>>  	unsigned long addr = untagged_addr(far);
>>>  	struct vm_area_struct *vma;
>>>  	int si_code;
>>> +	bool may_force_write = false;
>>>  	if (kprobe_page_fault(regs, esr))
>>>  		return 0;
>>> @@ -547,6 +576,7 @@ static int __kprobes do_page_fault(unsigned long
>>> far, unsigned long esr,
>>>  		/* If EPAN is absent then exec implies read */
>>>  		if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
>>>  			vm_flags |= VM_EXEC;
>>> +		may_force_write = true;
>>>  	}
>>>  	if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr,
>>> esr, regs)) {
>>> @@ -568,6 +598,12 @@ static int __kprobes do_page_fault(unsigned
>>> long far, unsigned long esr,
>>>  	if (!vma)
>>>  		goto lock_mmap;
>>> +	if (may_force_write && (vma->vm_flags & VM_WRITE) &&
>>> +	    is_el0_atomic_instr(regs)) {
>>> +		vm_flags = VM_WRITE;
>>> +		mm_flags |= FAULT_FLAG_WRITE;
>>> +	}
>> I think we can get rid of may_force_write and just test (vm_flags &
>> VM_READ).
>
> Yes, will fix it in v5.
>>
>

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..efcc8b2050db 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -325,6 +325,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys, 0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic, 0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr, 0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp, 0x9F000000, 0x90000000)
@@ -345,6 +346,7 @@ __AARCH64_INSN_FUNCS(ldeor, 0x3F20FC00, 0x38202000)
 __AARCH64_INSN_FUNCS(ldset, 0x3F20FC00, 0x38203000)
 __AARCH64_INSN_FUNCS(swp, 0x3F20FC00, 0x38208000)
 __AARCH64_INSN_FUNCS(cas, 0x3FA07C00, 0x08A07C00)
+__AARCH64_INSN_FUNCS(casp, 0xBFA07C00, 0x08207C00)
 __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800)
 __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800)
 __AARCH64_INSN_FUNCS(ldr_imm, 0x3FC00000, 0x39400000)
@@ -549,6 +551,24 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
+{
+	return aarch64_insn_is_cas(insn) ||
+	       aarch64_insn_is_casp(insn);
+}
+
+/*
+ * Exclude unallocated atomic instructions and LD64B/LDAPR.
+ * The masks and values were generated by using Python sympy module.
+ */
+static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
+{
+	return ((insn & 0x3f207c00) == 0x38200000) ||
+	       ((insn & 0x3f208c00) == 0x38200000) ||
+	       ((insn & 0x7fe06c00) == 0x78202000) ||
+	       ((insn & 0xbf204c00) == 0x38200000);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 451ba7cbd5ad..3dbeb6069fbe 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -500,6 +500,34 @@ static bool is_write_abort(unsigned long esr)
 	return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM);
 }
 
+static bool is_el0_atomic_instr(struct pt_regs *regs)
+{
+	u32 insn;
+	__le32 insn_le;
+	unsigned long pc = instruction_pointer(regs);
+
+	if (compat_user_mode(regs))
+		return false;
+
+	pagefault_disable();
+	if (get_user(insn_le, (__le32 __user *)pc)) {
+		pagefault_enable();
+		return false;
+	}
+	pagefault_enable();
+
+	insn = le32_to_cpu(insn_le);
+
+	if (aarch64_insn_is_class_atomic(insn) &&
+	    aarch64_atomic_insn_has_wr_perm(insn))
+		return true;
+
+	if (aarch64_insn_is_class_cas(insn))
+		return true;
+
+	return false;
+}
+
 static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 				   struct pt_regs *regs)
 {
@@ -511,6 +539,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	unsigned long addr = untagged_addr(far);
 	struct vm_area_struct *vma;
 	int si_code;
+	bool may_force_write = false;
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
@@ -547,6 +576,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		/* If EPAN is absent then exec implies read */
 		if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
 			vm_flags |= VM_EXEC;
+		may_force_write = true;
 	}
 
 	if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
@@ -568,6 +598,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	if (may_force_write && (vma->vm_flags & VM_WRITE) &&
+	    is_el0_atomic_instr(regs)) {
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+	}
+
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		fault = 0;