Message ID | 20240626191830.3819324-1-yang@os.amperecomputing.com
---|---
State | New
Series | [v5] arm64: mm: force write fault for atomic RMW instructions
On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> 	if (!vma)
> 		goto lock_mmap;
>
> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> +	    is_el0_atomic_instr(regs)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

The patch looks fine now and AFAICT there's no ABI change.

However, before deciding whether to merge this patch, I'd like to
understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
would be the portable (Linux) solution that works better on
architectures without such atomic instructions (e.g. arm64 without LSE
atomics). So fixing user-space would be my preferred solution.

(I poked some people in Arm working in the area, hopefully I'll get
some more information)
On Fri, 28 Jun 2024, Catalin Marinas wrote:
> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>> 	if (!vma)
>> 		goto lock_mmap;
>>
>> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
>> +	    is_el0_atomic_instr(regs)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
>
> The patch looks fine now and AFAICT there's no ABI change.
>
> However, before deciding whether to merge this patch, I'd like to
> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> would be the portable (Linux) solution that works better on
> architectures without such atomic instructions (e.g. arm64 without LSE
> atomics). So fixing user-space would be my preferred solution.

Doing so would be requesting application code changes that are Linux
and ARM64 specific from applications running on Linux. A lot of these
are proprietary.
On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 28 Jun 2024, Catalin Marinas wrote:
> > On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> > > [...]
> >
> > The patch looks fine now and AFAICT there's no ABI change.
> >
> > However, before deciding whether to merge this patch, I'd like to
> > understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> > would be the portable (Linux) solution that works better on
> > architectures without such atomic instructions (e.g. arm64 without LSE
> > atomics). So fixing user-space would be my preferred solution.
>
> Doing so would be requesting application code changes that are linux and
> ARM64 specific from applications running on Linux.

Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
not. I'd argue that expecting the atomic_add(0) to only trigger a single
write fault is arch-specific. You can't do this on arm32 or arm64
pre-LSE (I haven't checked other architectures).

IIUC, OpenJDK added this feature about two years ago but the arm64
behaviour hasn't changed in the meantime. So it's not like we broke the
ABI and are forcing user space to update.

This patch does feel a bit like working around a non-optimal user choice
in kernel space. Who knows, madvise() may even be quicker if you do a
single call for a larger VA vs touching each page.

> A lot of these are proprietary.

Are you aware of other (proprietary) software relying on such a pattern
to fault pages in as writeable?
On 6/28/24 10:24 AM, Catalin Marinas wrote:
> On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
>> On Fri, 28 Jun 2024, Catalin Marinas wrote:
>>> [...]
>>>
>>> However, before deciding whether to merge this patch, I'd like to
>>> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
>>> would be the portable (Linux) solution that works better on
>>> architectures without such atomic instructions (e.g. arm64 without LSE
>>> atomics). So fixing user-space would be my preferred solution.
>> Doing so would be requesting application code changes that are linux and
>> ARM64 specific from applications running on Linux.
> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).
>
> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.
>
> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

IMHO, I don't think so. I view this patch as solving, or working
around, an ISA inefficiency in the kernel. Two faults are not necessary
if we know we are definitely going to write the memory very soon, right?

>> A lot of these are proprietary.
>
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?
On Fri, 28 Jun 2024, Catalin Marinas wrote:

> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).

The single write fault is x86 behavior. I am not sure how other
architectures handle that.

> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.

The focus of OpenJDK may not be arm64, so they never saw the issue? We
only know this because we have an insider on staff. AFAICT we get
pushback from them as well. There are certainly numerous other open
source applications that behave in a similar way; we just don't know
about them.

> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

Looks to me like unexpected, surprising behavior on ARM64. madvise() is
rather particular to Linux and its semantics are ever evolving.

>> A lot of these are proprietary.
>
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?

I would not be told about such things by companies I did not work for,
and if I have gained knowledge about this in some way in the past then
I would not be allowed to talk about it.
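For context, the atomic_add(0) pre-touch pattern being discussed boils down to something like the sketch below (a hypothetical stand-in, not OpenJDK's actual code; the page size and region handling are illustrative). The comment summarizes the fault behaviour described in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pre-touch a region by doing an atomic add of 0 on one word per page.
 * On x86 (lock xadd) and arm64 with LSE atomics (ldadd), the first
 * access to each page is already a write, so the page is faulted in
 * writeable with a single fault. On arm64 without LSE the same add
 * compiles to an LL/SC loop whose leading ldxr takes a read fault
 * first, so each page is faulted twice (read, then write on stxr).
 */
static void pretouch(void *start, size_t len, size_t page_size)
{
	char *p = start;
	char *end = p + len;

	for (; p < end; p += page_size)
		__atomic_fetch_add((volatile uint32_t *)p, 0,
				   __ATOMIC_RELAXED);
}
```

Adding 0 leaves the memory contents unchanged; only the fault type (and hence the resulting PTE permissions) is the point of the exercise.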
On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
> On 6/28/24 10:24 AM, Catalin Marinas wrote:
> > This patch does feel a bit like working around a non-optimal user choice
> > in kernel space. Who knows, madvise() may even be quicker if you do a
> > single call for a larger VA vs touching each page.
>
> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
> inefficiency in kernel. Two faults are not necessary if we know we are
> definitely going to write the memory very soon, right?

I agree the Arm architecture behaviour is not ideal here, and any
timelines for fixing it in hardware, if that does happen, are far into
the future. Purely from a kernel perspective, what I want though is to
make sure that longer term (a) we don't create additional maintenance
burden and (b) we don't keep dead code around.

Point (a) could be mitigated if the architecture is changed so that any
new atomic instructions added to this range also come with additional
syndrome information, so that we don't have to update the decoding
patterns.

Point (b), however, depends on the OpenJDK and kernel versions in
distros. Nick Gasson kindly provided some information on the OpenJDK
changes. The atomic_add(0) change happened in early 2022, about 5-6
months after MADV_POPULATE_WRITE support was added to the kernel.
What's interesting is that Ampere already contributed
MADV_POPULATE_WRITE support to OpenJDK a few months ago:

https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a

The OpenJDK commit lacks explanation, but what I gathered from the diff
is that this option is the preferred one in the presence of THP (which
most/all distros enable by default). If we merge your proposed kernel
patch, it will take time before it makes its way into distros. I'm
hoping that by that time distros would have picked up a new OpenJDK
version that no longer needs the atomic_add(0) pattern.

If that's the case, we end up with some dead code in the kernel that's
almost never exercised. I don't follow OpenJDK development but I hear
that updates drag quite a lot. I can't tell whether people have picked
up the atomic_add(0) feature and whether, by the time a kernel patch
made it into distros, they'd also have moved to the MADV_POPULATE_WRITE
pattern.

There's a point (c) as well on the overhead of reading the faulting
instruction. I hope that's negligible but I haven't measured it.
diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..efcc8b2050db 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -325,6 +325,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
@@ -345,6 +346,7 @@ __AARCH64_INSN_FUNCS(ldeor,	0x3F20FC00, 0x38202000)
 __AARCH64_INSN_FUNCS(ldset,	0x3F20FC00, 0x38203000)
 __AARCH64_INSN_FUNCS(swp,	0x3F20FC00, 0x38208000)
 __AARCH64_INSN_FUNCS(cas,	0x3FA07C00, 0x08A07C00)
+__AARCH64_INSN_FUNCS(casp,	0xBFA07C00, 0x08207C00)
 __AARCH64_INSN_FUNCS(ldr_reg,	0x3FE0EC00, 0x38606800)
 __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800)
 __AARCH64_INSN_FUNCS(ldr_imm,	0x3FC00000, 0x39400000)
@@ -549,6 +551,24 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
+{
+	return aarch64_insn_is_cas(insn) ||
+	       aarch64_insn_is_casp(insn);
+}
+
+/*
+ * Exclude unallocated atomic instructions and LD64B/LDAPR.
+ * The masks and values were generated by using Python sympy module.
+ */
+static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
+{
+	return ((insn & 0x3f207c00) == 0x38200000) ||
+	       ((insn & 0x3f208c00) == 0x38200000) ||
+	       ((insn & 0x7fe06c00) == 0x78202000) ||
+	       ((insn & 0xbf204c00) == 0x38200000);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 451ba7cbd5ad..6a8b71917e3b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -500,6 +500,34 @@ static bool is_write_abort(unsigned long esr)
 	return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM);
 }
 
+static bool is_el0_atomic_instr(struct pt_regs *regs)
+{
+	u32 insn;
+	__le32 insn_le;
+	unsigned long pc = instruction_pointer(regs);
+
+	if (compat_user_mode(regs))
+		return false;
+
+	pagefault_disable();
+	if (get_user(insn_le, (__le32 __user *)pc)) {
+		pagefault_enable();
+		return false;
+	}
+	pagefault_enable();
+
+	insn = le32_to_cpu(insn_le);
+
+	if (aarch64_insn_is_class_atomic(insn) &&
+	    aarch64_atomic_insn_has_wr_perm(insn))
+		return true;
+
+	if (aarch64_insn_is_class_cas(insn))
+		return true;
+
+	return false;
+}
+
 static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 				   struct pt_regs *regs)
 {
@@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
+	    is_el0_atomic_instr(regs)) {
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+	}
+
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		fault = 0;