
[v5] arm64: mm: force write fault for atomic RMW instructions

Message ID 20240626191830.3819324-1-yang@os.amperecomputing.com (mailing list archive)
State New
Series [v5] arm64: mm: force write fault for atomic RMW instructions

Commit Message

Yang Shi June 26, 2024, 7:18 p.m. UTC
Atomic RMW instructions, for example ldadd, perform load + add + store in
a single instruction.  Per the ARM64 architecture spec, such an instruction
triggers two page faults: the first is a read fault and the second is a
write fault.

Some applications use atomic RMW instructions to populate memory.  For
example, OpenJDK uses atomic-add-0 to pretouch (populate heap memory at
launch time) between v18 and v22, in order to permit use of the memory
concurrently with pretouch.
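
As an illustration of the pattern described above, here is a minimal sketch of an atomic-add-0 pretouch loop. It is not OpenJDK's actual code; the helper name `pretouch_atomic_add0` and its parameters are hypothetical, and whether the compiler emits an LSE instruction such as ldadd depends on the target flags.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical pretouch helper (illustration only, not OpenJDK code).
 * Touches one 32-bit word in every page of [base, base + len) with an
 * atomic add of 0, so the contents stay unchanged while each page is
 * faulted in.  Assumes base is 4-byte aligned and page_size is a
 * non-zero multiple of 4.
 */
static void pretouch_atomic_add0(void *base, size_t len, size_t page_size)
{
	char *p = base;

	if (page_size == 0)
		return;

	for (size_t off = 0; off < len; off += page_size) {
		/* Adding 0 leaves the value intact but is still an RMW access. */
		__atomic_fetch_add((uint32_t *)(p + off), 0, __ATOMIC_RELAXED);
	}
}
```

Because the added value is 0, other threads may safely use the memory while the loop runs, which is the property the commit message refers to.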

But the double page fault has some problems:

1. Noticeable TLB overhead.  For the read fault, the kernel installs the
   zero page with a read-only PTE.  The write fault then triggers a
   write-protection fault (CoW).  The CoW allocates a new page and makes
   the PTE point to it, which requires TLB invalidation.  The TLB
   invalidation and the mandatory memory barriers may incur significant
   overhead, particularly on machines with many cores.

2. Huge pages are broken up.  If THP is on, the read fault installs a huge
   zero page.  The later CoW breaks up the huge page and allocates base
   pages instead.  Applications then have to rely on khugepaged (a kernel
   thread) to collapse the base pages into huge pages asynchronously.
   This also incurs a noticeable performance penalty.

3. 512x page faults with huge pages.  Due to #2, applications take a page
   fault for every 4K area on write, which wipes out the speedup from
   using huge pages.

So it seems pointless to take two page faults when we know the memory
will definitely be written very soon.  Forcing a write fault for atomic RMW
instructions makes sense and solves the aforementioned problems:

Firstly, it just allocates a zeroed page, so there is no TLB invalidation
and no memory barriers anymore.
Secondly, it can populate writable huge pages in the first place and does
not break them up.  Just one page fault is needed for a 2M area instead
of 512 faults, and CPU time is saved by not using khugepaged.

A simple micro benchmark which populates 1G of memory shows that the number
of page faults is reduced by half and the system time is reduced
by 60% on a VM running on an Ampere Altra platform.
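
The benchmark source is not part of this posting. A rough sketch of how such a measurement could be made, assuming anonymous private memory and minor-fault counts from getrusage(), is shown here; the function name `count_populate_faults` is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical micro benchmark (not the one used for the numbers above).
 * Maps len bytes of anonymous memory, touches every page with an atomic
 * add of 0, and returns the number of minor faults taken by the loop,
 * or -1 on error.
 */
static long count_populate_faults(size_t len)
{
	struct rusage before, after;
	long page = sysconf(_SC_PAGESIZE);
	char *mem;

	if (page <= 0)
		return -1;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	getrusage(RUSAGE_SELF, &before);
	for (size_t off = 0; off < len; off += (size_t)page)
		__atomic_fetch_add((uint32_t *)(mem + off), 0, __ATOMIC_RELAXED);
	getrusage(RUSAGE_SELF, &after);

	munmap(mem, len);
	return after.ru_minflt - before.ru_minflt;
}
```

Calling `count_populate_faults(1UL << 30)` on kernels with and without the patch would give fault counts comparable to the 1G case described above; system time could be read from `ru_stime` in the same way.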

Benchmarks for anonymous read faults on 1G of memory and file read faults
on a 1G file (cold page cache and warm page cache) don't show a noticeable
regression.

Exclude unallocated instructions and LD64B/LDAPR instructions.
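
To illustrate how the mask/value checks work, here is a user-space sketch that copies the constants from the patch below. The function names are illustrative, and the sample encodings used in the note after the block (ldadd, ldapr, ldr, cas) are my own reading of the A64 encoding tables, not values taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Atomic memory operation class: (insn & mask) == value. */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & 0x3b200c00u) == 0x38200000u;
}

/* Allocated atomics that need write permission (excludes LD64B/LDAPR). */
static bool atomic_insn_has_wr_perm(uint32_t insn)
{
	return ((insn & 0x3f207c00u) == 0x38200000u) ||
	       ((insn & 0x3f208c00u) == 0x38200000u) ||
	       ((insn & 0x7fe06c00u) == 0x78202000u) ||
	       ((insn & 0xbf204c00u) == 0x38200000u);
}

/* CAS and CASP variants. */
static bool insn_is_class_cas(uint32_t insn)
{
	return ((insn & 0x3fa07c00u) == 0x08a07c00u) ||
	       ((insn & 0xbfa07c00u) == 0x08207c00u);
}

/* Mirrors the decision made in is_el0_atomic_instr() after the fetch. */
static bool would_force_write_fault(uint32_t insn)
{
	if (insn_is_class_atomic(insn) && atomic_insn_has_wr_perm(insn))
		return true;

	return insn_is_class_cas(insn);
}
```

With these helpers, 0xb8200041 (ldadd w0, w1, [x2]) and 0x88a07c41 (cas w0, w1, [x2]) are accepted, while 0xb8bfc041 (ldapr w1, [x2]) matches the atomic class but is rejected by the write-permission check, and 0xb9400041 (ldr w1, [x2]) matches neither class.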

Some other architectures also have code inspection in page fault path,
for example, SPARC and x86.

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/insn.h | 20 ++++++++++++++++++++
 arch/arm64/mm/fault.c         | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

v5: 1. Used vm_flags & VM_READ per Catalin.
    2. Collected ack tag from David.
v4: 1. Fixed the comments from Catalin.
    2. Rebased to v6.10-rc2.
    3. Collected the review tag from Christoph Lameter.
v3: Exclude unallocated insns and LD64B/LDAPR per Catalin. Thanks to
    D Scott for helping figure out the minimum conditions.
v2: 1. Made commit log more precise per Anshuman and Catalin
    2. Made pagefault_disable/enable window narrower per Anshuman
    3. Covered CAS and CASP variants per Catalin
    4. Put instruction fetching and decoding into a helper function and
       take into account endianness per Catalin
    5. Don't fetch and decode insn for 32 bit mode (compat) per Catalin
    6. More performance tests and exec-only test per Anshuman and Catalin

Comments

Catalin Marinas June 28, 2024, 4:54 p.m. UTC | #1
On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>  
> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> +	    is_el0_atomic_instr(regs)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

The patch looks fine now and AFAICT there's no ABI change.

However, before deciding whether to merge this patch, I'd like to
understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
would be the portable (Linux) solution that works better on
architectures without such atomic instructions (e.g. arm64 without LSE
atomics). So fixing user-space would be my preferred solution.
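
For reference, a minimal sketch of the madvise(MADV_POPULATE_WRITE) approach mentioned above. This is not code from the thread; the helper name `populate_writable` is hypothetical, and the fallback define assumes the Linux UAPI value for headers that predate Linux 5.14.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* MADV_POPULATE_WRITE was added in Linux 5.14; older headers may lack it. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) as writable with a
 * single call instead of touching each page.  Returns 0 on success, or
 * -1 with errno set (for example EINVAL on kernels without support).
 */
static int populate_writable(void *addr, size_t len)
{
	return madvise(addr, len, MADV_POPULATE_WRITE);
}
```

A caller would need a fallback, such as a per-page touch loop, for kernels where the call fails.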

(I poked some people in Arm working in the area, hopefully I'll get some
more information)

Christoph Lameter (Ampere) June 28, 2024, 4:57 p.m. UTC | #2
On Fri, 28 Jun 2024, Catalin Marinas wrote:

> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>  	if (!vma)
>>  		goto lock_mmap;
>>
>> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
>> +	    is_el0_atomic_instr(regs)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
>
> The patch looks fine now and AFAICT there's no ABI change.
>
> However, before deciding whether to merge this patch, I'd like to
> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> would be the portable (Linux) solution that works better on
> architectures without such atomic instructions (e.g. arm64 without LSE
> atomics). So fixing user-space would be my preferred solution.

Doing so would be requesting application code changes that are linux 
and ARM64 specific from applications running on Linux. A lot of these are 
proprietary.

Catalin Marinas June 28, 2024, 5:24 p.m. UTC | #3
On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 28 Jun 2024, Catalin Marinas wrote:
> > On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> > > @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >  	if (!vma)
> > >  		goto lock_mmap;
> > > 
> > > +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> > > +	    is_el0_atomic_instr(regs)) {
> > > +		vm_flags = VM_WRITE;
> > > +		mm_flags |= FAULT_FLAG_WRITE;
> > > +	}
> > 
> > The patch looks fine now and AFAICT there's no ABI change.
> > 
> > However, before deciding whether to merge this patch, I'd like to
> > understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> > would be the portable (Linux) solution that works better on
> > architectures without such atomic instructions (e.g. arm64 without LSE
> > atomics). So fixing user-space would be my preferred solution.
> 
> Doing so would be requesting application code changes that are linux and
> ARM64 specific from applications running on Linux.

Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
not. I'd argue that expecting the atomic_add(0) to only trigger a single
write fault is arch specific. You can't do this on arm32 or arm64
pre-LSE (I haven't checked other architectures).

IIUC, OpenJDK added this feature about two years ago but the arm64
behaviour hasn't changed in the meantime. So it's not like we broke the
ABI and forcing user space to update.

This patch does feel a bit like working around a non-optimal user choice
in kernel space. Who knows, madvise() may even be quicker if you do a
single call for a larger VA vs touching each page.

> A lot of these are proprietary.

Are you aware of other (proprietary) software relying on such pattern to
fault pages in as writeable?

Yang Shi June 28, 2024, 6:20 p.m. UTC | #4
On 6/28/24 10:24 AM, Catalin Marinas wrote:
> On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
>> On Fri, 28 Jun 2024, Catalin Marinas wrote:
>>> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
>>>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>   	if (!vma)
>>>>   		goto lock_mmap;
>>>>
>>>> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
>>>> +	    is_el0_atomic_instr(regs)) {
>>>> +		vm_flags = VM_WRITE;
>>>> +		mm_flags |= FAULT_FLAG_WRITE;
>>>> +	}
>>> The patch looks fine now and AFAICT there's no ABI change.
>>>
>>> However, before deciding whether to merge this patch, I'd like to
>>> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
>>> would be the portable (Linux) solution that works better on
>>> architectures without such atomic instructions (e.g. arm64 without LSE
>>> atomics). So fixing user-space would be my preferred solution.
>> Doing so would be requesting application code changes that are linux and
>> ARM64 specific from applications running on Linux.
> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).
>
> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.
>
> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

IMHO, I don't think so. I view this patch as solving or working around some 
ISA inefficiency in the kernel. Two faults are not necessary if we know we 
are definitely going to write the memory very soon, right?

>
>> A lot of these are proprietary.
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?
>

Christoph Lameter (Ampere) June 28, 2024, 6:26 p.m. UTC | #5
On Fri, 28 Jun 2024, Catalin Marinas wrote:

> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).

The single write fault is x86 behavior. I am not sure how other 
architectures handle that.

> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.

The focus of OpenJDK may not be arm64 and they never saw the issue? We 
only know this because we have an insider on staff. AFAICT we get pushback 
from them as well. There are certainly numerous other open 
source applications that behave in a similar way. We just don't know about 
it.

> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

Looks to me like unexpected surprising behavior on ARM64. madvise is 
rather particular to Linux and its semantics are ever evolving.

>> A lot of these are proprietary.
>
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?

I would not be told about such things by companies I did not work for and 
if I have gotten knowledge about this in some way in the past then I would 
not be allowed to talk about it.

Patch

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..efcc8b2050db 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -325,6 +325,7 @@  static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
@@ -345,6 +346,7 @@  __AARCH64_INSN_FUNCS(ldeor,	0x3F20FC00, 0x38202000)
 __AARCH64_INSN_FUNCS(ldset,	0x3F20FC00, 0x38203000)
 __AARCH64_INSN_FUNCS(swp,	0x3F20FC00, 0x38208000)
 __AARCH64_INSN_FUNCS(cas,	0x3FA07C00, 0x08A07C00)
+__AARCH64_INSN_FUNCS(casp,	0xBFA07C00, 0x08207C00)
 __AARCH64_INSN_FUNCS(ldr_reg,	0x3FE0EC00, 0x38606800)
 __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800)
 __AARCH64_INSN_FUNCS(ldr_imm,	0x3FC00000, 0x39400000)
@@ -549,6 +551,24 @@  static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
+{
+	return aarch64_insn_is_cas(insn) ||
+	       aarch64_insn_is_casp(insn);
+}
+
+/*
+ * Exclude unallocated atomic instructions and LD64B/LDAPR.
+ * The masks and values were generated by using Python sympy module.
+ */
+static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
+{
+	return ((insn & 0x3f207c00) == 0x38200000) ||
+	       ((insn & 0x3f208c00) == 0x38200000) ||
+	       ((insn & 0x7fe06c00) == 0x78202000) ||
+	       ((insn & 0xbf204c00) == 0x38200000);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 451ba7cbd5ad..6a8b71917e3b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -500,6 +500,34 @@  static bool is_write_abort(unsigned long esr)
 	return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM);
 }
 
+static bool is_el0_atomic_instr(struct pt_regs *regs)
+{
+	u32 insn;
+	__le32 insn_le;
+	unsigned long pc = instruction_pointer(regs);
+
+	if (compat_user_mode(regs))
+		return false;
+
+	pagefault_disable();
+	if (get_user(insn_le, (__le32 __user *)pc)) {
+		pagefault_enable();
+		return false;
+	}
+	pagefault_enable();
+
+	insn = le32_to_cpu(insn_le);
+
+	if (aarch64_insn_is_class_atomic(insn) &&
+	    aarch64_atomic_insn_has_wr_perm(insn))
+		return true;
+
+	if (aarch64_insn_is_class_cas(insn))
+		return true;
+
+	return false;
+}
+
 static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 				   struct pt_regs *regs)
 {
@@ -568,6 +596,12 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
+	    is_el0_atomic_instr(regs)) {
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+	}
+
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		fault = 0;