arm64: mm: force write fault for atomic RMW instructions

Message ID 20240507223558.3039562-1-yang@os.amperecomputing.com (mailing list archive)
State New, archived

Commit Message

Yang Shi May 7, 2024, 10:35 p.m. UTC
The atomic RMW instructions, for example, ldadd, actually do load +
add + store in one instruction, so they may trigger two page faults: the
first fault is a read fault, the second fault is a write fault.

Some applications use atomic RMW instructions to populate memory, for
example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
at launch time) between v18 and v22.

But the double page fault has some problems:

1. Noticeable TLB overhead.  The kernel actually installs the zero page with
   a read-only PTE for the read fault.  The write fault will then trigger a
   write-protection fault (CoW).  The CoW will allocate a new page and
   make the PTE point to the new page, which needs TLB invalidation.  The
   TLB invalidation and the mandatory memory barriers may incur
   significant overhead, particularly on machines with many cores.

2. Breaking up huge pages.  If THP is on, the read fault will install a huge
   zero page.  The later CoW will break up the huge page and allocate
   base pages instead of a huge page.  The application has to rely on
   khugepaged (a kernel thread) to collapse huge pages asynchronously.
   This also incurs a noticeable performance penalty.

3. 512x page faults with huge pages.  Due to #2, the application has to
   take a page fault for every 4K area on write, which cancels out the
   speedup from using huge pages.

So it sounds pointless to have two page faults since we know the memory
will definitely be written very soon.  Forcing a write fault for atomic RMW
instructions makes some sense and it can solve the aforementioned problems:

Firstly, it just allocates a zeroed page, with no TLB invalidation or memory
barriers anymore.
Secondly, it can populate writable huge pages in the first place and
doesn't break them up.  Just one page fault is needed for a 2M area instead
of 512 faults, and CPU time is also saved by not using khugepaged.

A simple micro benchmark which populates 1G of memory shows that the number
of page faults is reduced by half and the system time is reduced by 60% on
a VM running on an Ampere Altra platform.

The benchmarks for anonymous read fault on 1G of memory and file read fault
on a 1G file (cold page cache and warm page cache) don't show noticeable
regression.

Some other architectures also have code inspection in page fault path,
for example, SPARC and x86.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/insn.h |  1 +
 arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)
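
The micro benchmark mentioned in the commit message is not included in the posting. A minimal userspace sketch of that kind of measurement might look like the following. It is not the author's benchmark: the helper names `minor_faults()` and `populate_with_atomic_add()` are hypothetical, the buffer is touched once per base page with an atomic add of 0, and the fault count is read from `getrusage()`. Whether the builtin compiles to ldadd depends on the compiler and target flags.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Keeps the result of each atomic op live so it is not optimized away. */
static volatile uint32_t sink;

/* Minor page faults taken by the calling process so far, or -1 on error. */
static long minor_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_minflt;
}

/*
 * Hypothetical helper: map len bytes of anonymous memory, touch one word
 * per base page with an atomic add of 0, and return the number of minor
 * faults taken while doing so (or -1 on error).
 */
static long populate_with_atomic_add(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	long before, after;
	size_t off;
	char *buf;

	if (page <= 0)
		return -1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	before = minor_faults();
	for (off = 0; off < len; off += (size_t)page) {
		volatile uint32_t *p = (volatile uint32_t *)(buf + off);

		/* Read-modify-write that leaves the value unchanged. */
		sink ^= __atomic_fetch_add(p, 0, __ATOMIC_RELAXED);
	}
	after = minor_faults();

	munmap(buf, len);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

Calling `populate_with_atomic_add(1UL << 30)` on kernels with and without the patch, and comparing the returned counts together with the system time reported by `time`, would be one way to reproduce the kind of comparison described above.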

Comments

Christoph Lameter (Ampere) May 7, 2024, 10:42 p.m. UTC | #1
Reviewed-by: Christoph Lameter <cl@linux.com>
Anshuman Khandual May 8, 2024, 6:45 a.m. UTC | #2
Hello Yang,

On 5/8/24 04:05, Yang Shi wrote:
> The atomic RMW instructions, for example, ldadd, actually does load +
> add + store in one instruction, it may trigger two page faults, the
> first fault is a read fault, the second fault is a write fault.

Does it possibly, or definitely, create two consecutive page faults? What
if the second write fault never comes about? In that case a writable
page table entry would be created unnecessarily (or even wrongfully),
thus breaking the CoW.

Just trying to understand: is the double page fault a possibility or
a certainty? Does that depend on the architecture (please do provide some
links) or is it implementation defined?

> 
> Some applications use atomic RMW instructions to populate memory, for
> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory

But why is a normal store operation not sufficient for pre-touching
the heap memory? Why is read-modify-write (RMW) required instead?

> at launch time) between v18 and v22.

V18, V22 ?

> 
> But the double page fault has some problems:
> 
> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>    readonly PTE for the read fault.  The write fault will trigger a
>    write-protection fault (CoW).  The CoW will allocate a new page and
>    make the PTE point to the new page, this needs TLB invalidations.  The
>    tlb invalidation and the mandatory memory barriers may incur
>    significant overhead, particularly on the machines with many cores.
> 
> 2. Break up huge pages.  If THP is on the read fault will install huge
>    zero pages.  The later CoW will break up the huge page and allocate
>    base pages instead of huge page.  The applications have to rely on
>    khugepaged (kernel thread) to collapse huge pages asynchronously.
>    This also incurs noticeable performance penalty.
> 
> 3. 512x page faults with huge page.  Due to #2, the applications have to
>    have page faults for every 4K area for the write, this makes the speed
>    up by using huge page actually gone.

The problems mentioned above are reasonable and expected.
 
If the memory address has some valid data, it must have already reached there
via a previous write access, which would have caused initial CoW transition ?
If the memory address has no valid data to begin with, why even use RMW ?

> 
> So it sounds pointless to have two page faults since we know the memory
> will be definitely written very soon.  Forcing write fault for atomic RMW
> instruction makes some sense and it can solve the aforementioned problems:
> 
> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
> barriers anymore.
> Secondly, it can populate writable huge pages in the first place and
> don't break them up.  Just one page fault is needed for 2M area instead
> of 512 faults and also save cpu time by not using khugepaged.
> 
> A simple micro benchmark which populates 1G memory shows the number of
> page faults is reduced by half and the time spent by system is reduced
> by 60% on a VM running on Ampere Altra platform.
> 
> And the benchmark for anonymous read fault on 1G memory, file read fault
> on 1G file (cold page cache and warm page cache) don't show noticeable
> regression.
> 
> Some other architectures also have code inspection in page fault path,
> for example, SPARC and x86.

Okay, I was about to ask, but doesn't calling get_user() for all data
read page faults increase the cost of a hot code path in general, for
some potential savings in a very specific use case? Not sure if that
is worth the trade-off.

> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  arch/arm64/include/asm/insn.h |  1 +
>  arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index db1aeacd4cd9..5d5a3fbeecc0 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>   * "-" means "don't care"
>   */
>  __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
>  
>  __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
>  __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 8251e2fea9c7..f7bceedf5ef3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>  	unsigned long addr = untagged_addr(far);
>  	struct vm_area_struct *vma;
> +	unsigned int insn;
>  
>  	if (kprobe_page_fault(regs, esr))
>  		return 0;
> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>  
> +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
> +		goto continue_fault;
> +
> +	pagefault_disable();
> +
> +	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
> +		pagefault_enable();
> +		goto continue_fault;
> +	}
> +
> +	if (aarch64_insn_is_class_atomic(insn)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}
> +
> +	pagefault_enable();
> +
> +continue_fault:
>  	if (!(vma->vm_flags & vm_flags)) {
>  		vma_end_read(vma);
>  		goto lock_mmap;
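
For readers following the diff, the check it adds can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel code: the mask and value are taken from the patch, the comparison mirrors what `__AARCH64_INSN_FUNCS` generates (`(insn & mask) == value`), and the `DEMO_FLAG_*` constants and `maybe_force_write()` are hypothetical stand-ins for the `FAULT_FLAG_*` handling in do_page_fault().

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask and value from the patch's class_atomic definition. */
#define ATOMIC_CLASS_MASK	0x3b200c00u
#define ATOMIC_CLASS_VALUE	0x38200000u

/* Illustrative stand-ins; these are not the kernel's FAULT_FLAG_* values. */
#define DEMO_FLAG_WRITE		(1u << 0)
#define DEMO_FLAG_INSTRUCTION	(1u << 1)

/* True if the 32-bit instruction word matches the class_atomic pattern. */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & ATOMIC_CLASS_MASK) == ATOMIC_CLASS_VALUE;
}

/*
 * Mirror of the decision in the patch: leave write faults and instruction
 * faults alone, otherwise promote the fault to a write fault when the
 * faulting instruction is in the atomic class.
 */
static unsigned int maybe_force_write(unsigned int flags, uint32_t insn)
{
	if (flags & (DEMO_FLAG_WRITE | DEMO_FLAG_INSTRUCTION))
		return flags;
	if (insn_is_class_atomic(insn))
		flags |= DEMO_FLAG_WRITE;
	return flags;
}
```

As a usage example, 0xb8200041 (ldadd w0, w1, [x2]) matches the pattern, while 0xb9400041 (ldr w1, [x2]) and 0xb9000041 (str w1, [x2]) do not, so only the first would turn a read fault into a write fault.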
Christoph Lameter (Ampere) May 8, 2024, 5:15 p.m. UTC | #3
On Wed, 8 May 2024, Anshuman Khandual wrote:

>> The atomic RMW instructions, for example, ldadd, actually does load +
>> add + store in one instruction, it may trigger two page faults, the
>> first fault is a read fault, the second fault is a write fault.
>
> It may or it will definitely create two consecutive page faults. What
> if the second write fault never came about. In that case an writable
> page table entry would be created unnecessarily (or even wrongfully),
> thus breaking the CoW.

An atomic RMW will always perform a write? If there is a read fault 
then a write fault will follow.

>> Some applications use atomic RMW instructions to populate memory, for
>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>
> But why cannot normal store operation is sufficient for pre-touching
> the heap memory, why read-modify-write (RMW) is required instead ?

Sure, a regular write operation is sufficient, but you would have to modify 
existing applications to get that done. x86 does not do a read fault on 
atomics, so we have an issue here.

> If the memory address has some valid data, it must have already reached there
> via a previous write access, which would have caused initial CoW transition ?
> If the memory address has no valid data to begin with, why even use RMW ?

Because the application can reasonably assume that all uninitialized data 
is zero and therefore it is not necessary to have a prior write access.

>> Some other architectures also have code inspection in page fault path,
>> for example, SPARC and x86.
>
> Okay, I was about to ask, but is not calling get_user() for all data
> read page faults increase the cost for a hot code path in general for
> some potential savings for a very specific use case. Not sure if that
> is worth the trade-off.

The instruction is cache hot since it must be present in the cpu cache for 
the fault. So the overhead is minimal.
Yang Shi May 8, 2024, 6:37 p.m. UTC | #4
On 5/7/24 11:45 PM, Anshuman Khandual wrote:
> Hello Yang,
>
> On 5/8/24 04:05, Yang Shi wrote:
>> The atomic RMW instructions, for example, ldadd, actually does load +
>> add + store in one instruction, it may trigger two page faults, the
>> first fault is a read fault, the second fault is a write fault.
> It may or it will definitely create two consecutive page faults. What
> if the second write fault never came about. In that case an writable
> page table entry would be created unnecessarily (or even wrongfully),
> thus breaking the CoW.
>
> Just trying to understand, is the double page fault a possibility or
> a certainty. Does that depend on architecture (please do provide some
> links) or is it implementation defined.

Christoph helped answer some questions, so I will skip those if I have 
nothing to add.

It is defined in the Arm Architecture Reference Manual, so it is not 
implementation defined.

>
>> Some applications use atomic RMW instructions to populate memory, for
>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> But why cannot normal store operation is sufficient for pre-touching
> the heap memory, why read-modify-write (RMW) is required instead ?

A plain memory write is fine, but it depends on the application. For 
example, the JVM may want to "permit use of memory concurrently with 
pretouch". So they chose to use an atomic instead of a plain memory write.
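
To illustrate the point about concurrent use, here is a minimal sketch of a pretouch loop built on an atomic add of 0. It is not the OpenJDK code; the function name `pretouch_atomic()` and its parameters are hypothetical. Because the add of 0 is an atomic read-modify-write, it leaves whatever another thread has already stored in place, whereas a plain store of 0 could overwrite it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical pretouch helper: touch one 32-bit word per page with an
 * atomic add of 0. The stored value is unchanged, so data written
 * concurrently by other threads is preserved.
 */
static void pretouch_atomic(void *start, size_t len, size_t page_size)
{
	char *base = start;
	size_t off;

	if (page_size == 0)
		return;

	for (off = 0; off < len; off += page_size)
		__atomic_fetch_add((uint32_t *)(base + off), 0,
				   __ATOMIC_RELAXED);
}
```

For example, after storing 42 in the first word of a buffer and calling `pretouch_atomic(buf, len, 4096)`, the first word still reads 42.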

>
>> at launch time) between v18 and v22.
> V18, V22 ?

v18/v19/v20/v21/v22

>
>> But the double page fault has some problems:
>>
>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>     readonly PTE for the read fault.  The write fault will trigger a
>>     write-protection fault (CoW).  The CoW will allocate a new page and
>>     make the PTE point to the new page, this needs TLB invalidations.  The
>>     tlb invalidation and the mandatory memory barriers may incur
>>     significant overhead, particularly on the machines with many cores.
>>
>> 2. Break up huge pages.  If THP is on the read fault will install huge
>>     zero pages.  The later CoW will break up the huge page and allocate
>>     base pages instead of huge page.  The applications have to rely on
>>     khugepaged (kernel thread) to collapse huge pages asynchronously.
>>     This also incurs noticeable performance penalty.
>>
>> 3. 512x page faults with huge page.  Due to #2, the applications have to
>>     have page faults for every 4K area for the write, this makes the speed
>>     up by using huge page actually gone.
> The problems mentioned above are reasonable and expected.
>   
> If the memory address has some valid data, it must have already reached there
> via a previous write access, which would have caused initial CoW transition ?
> If the memory address has no valid data to begin with, why even use RMW ?
>
>> So it sounds pointless to have two page faults since we know the memory
>> will be definitely written very soon.  Forcing write fault for atomic RMW
>> instruction makes some sense and it can solve the aforementioned problems:
>>
>> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
>> barriers anymore.
>> Secondly, it can populate writable huge pages in the first place and
>> don't break them up.  Just one page fault is needed for 2M area instead
>> of 512 faults and also save cpu time by not using khugepaged.
>>
>> A simple micro benchmark which populates 1G memory shows the number of
>> page faults is reduced by half and the time spent by system is reduced
>> by 60% on a VM running on Ampere Altra platform.
>>
>> And the benchmark for anonymous read fault on 1G memory, file read fault
>> on 1G file (cold page cache and warm page cache) don't show noticeable
>> regression.
>>
>> Some other architectures also have code inspection in page fault path,
>> for example, SPARC and x86.
> Okay, I was about to ask, but is not calling get_user() for all data
> read page faults increase the cost for a hot code path in general for
> some potential savings for a very specific use case. Not sure if that
> is worth the trade-off.

I tested read fault latency (anonymous read fault and file read fault), 
I didn't see noticeable regression.

>
>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>> ---
>>   arch/arm64/include/asm/insn.h |  1 +
>>   arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>>   2 files changed, 20 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>>    * "-" means "don't care"
>>    */
>>   __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
>> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
>>   
>>   __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
>>   __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 8251e2fea9c7..f7bceedf5ef3 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>   	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>   	unsigned long addr = untagged_addr(far);
>>   	struct vm_area_struct *vma;
>> +	unsigned int insn;
>>   
>>   	if (kprobe_page_fault(regs, esr))
>>   		return 0;
>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>   	if (!vma)
>>   		goto lock_mmap;
>>   
>> +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>> +		goto continue_fault;
>> +
>> +	pagefault_disable();
>> +
>> +	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>> +		pagefault_enable();
>> +		goto continue_fault;
>> +	}
>> +
>> +	if (aarch64_insn_is_class_atomic(insn)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
>> +
>> +	pagefault_enable();
>> +
>> +continue_fault:
>>   	if (!(vma->vm_flags & vm_flags)) {
>>   		vma_end_read(vma);
>>   		goto lock_mmap;
Anshuman Khandual May 9, 2024, 4:23 a.m. UTC | #5
On 5/8/24 22:45, Christoph Lameter (Ampere) wrote:
> On Wed, 8 May 2024, Anshuman Khandual wrote:
> 
>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>> add + store in one instruction, it may trigger two page faults, the
>>> first fault is a read fault, the second fault is a write fault.
>>
>> It may or it will definitely create two consecutive page faults. What
>> if the second write fault never came about. In that case an writable
>> page table entry would be created unnecessarily (or even wrongfully),
>> thus breaking the CoW.
> 
> An atomic RMW will always perform a write? If there is a read fault then write fault will follow.

Alright, but the wording above in the commit message is a bit misleading.

> 
>>> Some applications use atomic RMW instructions to populate memory, for
>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>
>> But why cannot normal store operation is sufficient for pre-touching
>> the heap memory, why read-modify-write (RMW) is required instead ?
> 
> Sure a regular write operation is sufficient but you would have to modify existing applications to get that done. x86 does not do a read fault on atomics so we have an issue here.

Understood, although not being able to change an application to optimize
might not be a compelling argument on its own, but treating such atomic
operations differently in page fault path for improved performance sounds
feasible. But will probably let others weigh in on this and possible need
for parity with x86 behaviour.

> 
>> If the memory address has some valid data, it must have already reached there
>> via a previous write access, which would have caused initial CoW transition ?
>> If the memory address has no valid data to begin with, why even use RMW ?
> 
> Because the application can reasonably assume that all uninitialized data is zero and therefore it is not necessary to have a prior write access.

Alright, but again I wonder why an atomic operation is required to init
or pre-touch uninitialized data. Somehow it does not make sense unless
there is some more context here.

> 
>>> Some other architectures also have code inspection in page fault path,
>>> for example, SPARC and x86.
>>
>> Okay, I was about to ask, but is not calling get_user() for all data
>> read page faults increase the cost for a hot code path in general for
>> some potential savings for a very specific use case. Not sure if that
>> is worth the trade-off.
> 
> The instruction is cache hot since it must be present in the cpu cache for the fault. So the overhead is minimal.
> 

But couldn't a pagefault_disable()-enable() window prevent concurrent
page faults for the current process, thus degrading its performance?
Anshuman Khandual May 9, 2024, 4:31 a.m. UTC | #6
On 5/9/24 00:07, Yang Shi wrote:
> 
> 
> On 5/7/24 11:45 PM, Anshuman Khandual wrote:
>> Hello Yang,
>>
>> On 5/8/24 04:05, Yang Shi wrote:
>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>> add + store in one instruction, it may trigger two page faults, the
>>> first fault is a read fault, the second fault is a write fault.
>> It may or it will definitely create two consecutive page faults. What
>> if the second write fault never came about. In that case an writable
>> page table entry would be created unnecessarily (or even wrongfully),
>> thus breaking the CoW.
>>
>> Just trying to understand, is the double page fault a possibility or
>> a certainty. Does that depend on architecture (please do provide some
>> links) or is it implementation defined.
> 
> Christoph helped answer some questions, I will skip those if I have nothing to add.
> 
> It is defined in ARM architecture reference manual, so it is not implementation defined.

Sure, but please replace the "may trigger" phrase above as appropriate.

> 
>>
>>> Some applications use atomic RMW instructions to populate memory, for
>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>> But why cannot normal store operation is sufficient for pre-touching
>> the heap memory, why read-modify-write (RMW) is required instead ?
> 
> Memory write is fine, but it depends on applications. For example, JVM may want to "permit use of memory concurrently with pretouch". So they chose use atomic instead of memory write.
> 
>>
>>> at launch time) between v18 and v22.
>> V18, V22 ?
> 
> v18/v19/v20/v21/v22
> 
>>
>>> But the double page fault has some problems:
>>>
>>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>>     readonly PTE for the read fault.  The write fault will trigger a
>>>     write-protection fault (CoW).  The CoW will allocate a new page and
>>>     make the PTE point to the new page, this needs TLB invalidations.  The
>>>     tlb invalidation and the mandatory memory barriers may incur
>>>     significant overhead, particularly on the machines with many cores.
>>>
>>> 2. Break up huge pages.  If THP is on the read fault will install huge
>>>     zero pages.  The later CoW will break up the huge page and allocate
>>>     base pages instead of huge page.  The applications have to rely on
>>>     khugepaged (kernel thread) to collapse huge pages asynchronously.
>>>     This also incurs noticeable performance penalty.
>>>
>>> 3. 512x page faults with huge page.  Due to #2, the applications have to
>>>     have page faults for every 4K area for the write, this makes the speed
>>>     up by using huge page actually gone.
>> The problems mentioned above are reasonable and expected.
>>   If the memory address has some valid data, it must have already reached there
>> via a previous write access, which would have caused initial CoW transition ?
>> If the memory address has no valid data to begin with, why even use RMW ?
>>
>>> So it sounds pointless to have two page faults since we know the memory
>>> will be definitely written very soon.  Forcing write fault for atomic RMW
>>> instruction makes some sense and it can solve the aforementioned problems:
>>>
>>> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
>>> barriers anymore.
>>> Secondly, it can populate writable huge pages in the first place and
>>> don't break them up.  Just one page fault is needed for 2M area instead
>>> of 512 faults and also save cpu time by not using khugepaged.
>>>
>>> A simple micro benchmark which populates 1G memory shows the number of
>>> page faults is reduced by half and the time spent by system is reduced
>>> by 60% on a VM running on Ampere Altra platform.
>>>
>>> And the benchmark for anonymous read fault on 1G memory, file read fault
>>> on 1G file (cold page cache and warm page cache) don't show noticeable
>>> regression.
>>>
>>> Some other architectures also have code inspection in page fault path,
>>> for example, SPARC and x86.
>> Okay, I was about to ask, but is not calling get_user() for all data
>> read page faults increase the cost for a hot code path in general for
>> some potential savings for a very specific use case. Not sure if that
>> is worth the trade-off.
> 
> I tested read fault latency (anonymous read fault and file read fault), I didn't see noticeable regression.

Could you please run a multi-threaded application accessing one common
buffer while running these atomic operations? We just need to ensure
that the pagefault_disable()-enable() window is not preventing concurrent
page faults and adding access latency to other threads.

> 
>>
>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>> ---
>>>   arch/arm64/include/asm/insn.h |  1 +
>>>   arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>>>   2 files changed, 20 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>>> --- a/arch/arm64/include/asm/insn.h
>>> +++ b/arch/arm64/include/asm/insn.h
>>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)    \
>>>    * "-" means "don't care"
>>>    */
>>>   __AARCH64_INSN_FUNCS(class_branch_sys,    0x1c000000, 0x14000000)
>>> +__AARCH64_INSN_FUNCS(class_atomic,    0x3b200c00, 0x38200000)
>>>
>>>   __AARCH64_INSN_FUNCS(adr,    0x9F000000, 0x10000000)
>>>   __AARCH64_INSN_FUNCS(adrp,    0x9F000000, 0x90000000)
>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>> index 8251e2fea9c7..f7bceedf5ef3 100644
>>> --- a/arch/arm64/mm/fault.c
>>> +++ b/arch/arm64/mm/fault.c
>>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>       unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>>       unsigned long addr = untagged_addr(far);
>>>       struct vm_area_struct *vma;
>>> +    unsigned int insn;
>>>
>>>       if (kprobe_page_fault(regs, esr))
>>>           return 0;
>>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>       if (!vma)
>>>           goto lock_mmap;
>>>
>>> +    if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>>> +        goto continue_fault;
>>> +
>>> +    pagefault_disable();
>>> +
>>> +    if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>>> +        pagefault_enable();
>>> +        goto continue_fault;
>>> +    }
>>> +
>>> +    if (aarch64_insn_is_class_atomic(insn)) {
>>> +        vm_flags = VM_WRITE;
>>> +        mm_flags |= FAULT_FLAG_WRITE;
>>> +    }
>>> +
>>> +    pagefault_enable();
>>> +
>>> +continue_fault:
>>>       if (!(vma->vm_flags & vm_flags)) {
>>>           vma_end_read(vma);
>>>           goto lock_mmap;
>
Yang Shi May 9, 2024, 9:46 p.m. UTC | #7
On 5/8/24 9:31 PM, Anshuman Khandual wrote:
>
> On 5/9/24 00:07, Yang Shi wrote:
>>
>> On 5/7/24 11:45 PM, Anshuman Khandual wrote:
>>> Hello Yang,
>>>
>>> On 5/8/24 04:05, Yang Shi wrote:
>>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>>> add + store in one instruction, it may trigger two page faults, the
>>>> first fault is a read fault, the second fault is a write fault.
>>> It may or it will definitely create two consecutive page faults. What
>>> if the second write fault never came about. In that case an writable
>>> page table entry would be created unnecessarily (or even wrongfully),
>>> thus breaking the CoW.
>>>
>>> Just trying to understand, is the double page fault a possibility or
>>> a certainty. Does that depend on architecture (please do provide some
>>> links) or is it implementation defined.
>> Christoph helped answer some questions, I will skip those if I have nothing to add.
>>
>> It is defined in ARM architecture reference manual, so it is not implementation defined.
> Sure, but please replace the "may trigger" phrase above as appropriate.

Yeah, sure.

>
>>>> Some applications use atomic RMW instructions to populate memory, for
>>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>> But why cannot normal store operation is sufficient for pre-touching
>>> the heap memory, why read-modify-write (RMW) is required instead ?
>> Memory write is fine, but it depends on applications. For example, JVM may want to "permit use of memory concurrently with pretouch". So they chose use atomic instead of memory write.
>>
>>>> at launch time) between v18 and v22.
>>> V18, V22 ?
>> v18/v19/v20/v21/v22
>>
>>>> But the double page fault has some problems:
>>>>
>>>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>>>      readonly PTE for the read fault.  The write fault will trigger a
>>>>      write-protection fault (CoW).  The CoW will allocate a new page and
>>>>      make the PTE point to the new page, this needs TLB invalidations.  The
>>>>      tlb invalidation and the mandatory memory barriers may incur
>>>>      significant overhead, particularly on the machines with many cores.
>>>>
>>>> 2. Break up huge pages.  If THP is on the read fault will install huge
>>>>      zero pages.  The later CoW will break up the huge page and allocate
>>>>      base pages instead of huge page.  The applications have to rely on
>>>>      khugepaged (kernel thread) to collapse huge pages asynchronously.
>>>>      This also incurs noticeable performance penalty.
>>>>
>>>> 3. 512x page faults with huge page.  Due to #2, the applications have to
>>>>      have page faults for every 4K area for the write, this makes the speed
>>>>      up by using huge page actually gone.
>>> The problems mentioned above are reasonable and expected.
>>>    If the memory address has some valid data, it must have already reached there
>>> via a previous write access, which would have caused initial CoW transition ?
>>> If the memory address has no valid data to begin with, why even use RMW ?
>>>
>>>> So it sounds pointless to have two page faults since we know the memory
>>>> will be definitely written very soon.  Forcing write fault for atomic RMW
>>>> instruction makes some sense and it can solve the aforementioned problems:
>>>>
>>>> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
>>>> barriers anymore.
>>>> Secondly, it can populate writable huge pages in the first place and
>>>> don't break them up.  Just one page fault is needed for 2M area instead
>>>> of 512 faults and also save cpu time by not using khugepaged.
>>>>
>>>> A simple micro benchmark which populates 1G memory shows the number of
>>>> page faults is reduced by half and the time spent by system is reduced
>>>> by 60% on a VM running on Ampere Altra platform.
>>>>
>>>> And the benchmark for anonymous read fault on 1G memory, file read fault
>>>> on 1G file (cold page cache and warm page cache) don't show noticeable
>>>> regression.
>>>>
>>>> Some other architectures also have code inspection in page fault path,
>>>> for example, SPARC and x86.
>>> Okay, I was about to ask, but is not calling get_user() for all data
>>> read page faults increase the cost for a hot code path in general for
>>> some potential savings for a very specific use case. Not sure if that
>>> is worth the trade-off.
>> I tested read fault latency (anonymous read fault and file read fault), I didn't see noticeable regression.
> Could you please run a multi threaded application accessing one common
> buffer while running these atomic operations. We just need to ensure
> that pagefault_disable()-enable() window is not preventing concurrent
> page faults and adding access latency to other threads.

I modified the page_fault1 test in will-it-scale so that it generates only 
read faults (the original code generates write faults); anonymous read 
fault should be the case most sensitive to this change. Then I ran the 
test with different numbers of threads (1 - 160, because my test machine 
has 160 cores in total). Please see the table below (hopefully my email 
client won't mess it up):

nr_threads    before      after       +/-
1             2056996     2048030     -0.4%
20            17836422    16718606    -6.27%
40            28536237    27958875    -2.03%
60            35947854    35236884    -2%
80            31646632    39209665    +24%
100           20836142    21017796    +0.9%
120           20350980    20635603    +1.4%
140           20041920    19904015    -0.7%
160           19561908    20264360    +3.6%

Sometimes the "after" number is better than the "before" number, sometimes
the opposite. Apart from two outliers, there is no noticeable regression.

To rule out a worst-case regression, I also ran the test for 100
iterations with 160 threads and compared the worst case:

     N       Min       Max    Median         Avg      Stddev
   100     34770     84979     65536     63537.7   10358.873
   100     38077     87652     65536    63119.02   8792.7399

Still no noticeable regression.
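
For readers who want to reproduce the kind of access pattern described
above, here is a minimal user-space sketch. It is not the will-it-scale
code; the function name read_fault_pass() and the overall structure are
assumptions made for illustration. It maps anonymous memory and only loads
from each page, so each first touch should be a read fault and never a
write fault.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and read one
 * byte from each page, so every page is touched by a load only.
 * Returns the sum of the bytes read (0 for fresh anonymous memory),
 * or -1 if mmap() fails.
 */
static long read_fault_pass(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	long sum = 0;
	volatile unsigned char *p;

	assert(page > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Load only: the first access to each page is a read fault. */
	for (size_t off = 0; off < len; off += (size_t)page)
		sum += p[off];

	munmap((void *)p, len);
	return sum;
}
```

Calling read_fault_pass() in a loop from N threads and counting completed
passes per second would give a throughput figure comparable in spirit to
the numbers above, though not to their absolute values.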

>
>>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>>> ---
>>>>    arch/arm64/include/asm/insn.h |  1 +
>>>>    arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>>>>    2 files changed, 20 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>>>> --- a/arch/arm64/include/asm/insn.h
>>>> +++ b/arch/arm64/include/asm/insn.h
>>>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)    \
>>>>     * "-" means "don't care"
>>>>     */
>>>>    __AARCH64_INSN_FUNCS(class_branch_sys,    0x1c000000, 0x14000000)
>>>> +__AARCH64_INSN_FUNCS(class_atomic,    0x3b200c00, 0x38200000)
>>>>      __AARCH64_INSN_FUNCS(adr,    0x9F000000, 0x10000000)
>>>>    __AARCH64_INSN_FUNCS(adrp,    0x9F000000, 0x90000000)
>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>> index 8251e2fea9c7..f7bceedf5ef3 100644
>>>> --- a/arch/arm64/mm/fault.c
>>>> +++ b/arch/arm64/mm/fault.c
>>>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>        unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>>>        unsigned long addr = untagged_addr(far);
>>>>        struct vm_area_struct *vma;
>>>> +    unsigned int insn;
>>>>          if (kprobe_page_fault(regs, esr))
>>>>            return 0;
>>>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>        if (!vma)
>>>>            goto lock_mmap;
>>>>    +    if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>>>> +        goto continue_fault;
>>>> +
>>>> +    pagefault_disable();
>>>> +
>>>> +    if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>>>> +        pagefault_enable();
>>>> +        goto continue_fault;
>>>> +    }
>>>> +
>>>> +    if (aarch64_insn_is_class_atomic(insn)) {
>>>> +        vm_flags = VM_WRITE;
>>>> +        mm_flags |= FAULT_FLAG_WRITE;
>>>> +    }
>>>> +
>>>> +    pagefault_enable();
>>>> +
>>>> +continue_fault:
>>>>        if (!(vma->vm_flags & vm_flags)) {
>>>>            vma_end_read(vma);
>>>>            goto lock_mmap;
Anshuman Khandual May 10, 2024, 4:28 a.m. UTC | #8
On 5/10/24 03:16, Yang Shi wrote:
> 
> 
> On 5/8/24 9:31 PM, Anshuman Khandual wrote:
>>
>> On 5/9/24 00:07, Yang Shi wrote:
>>>
>>> On 5/7/24 11:45 PM, Anshuman Khandual wrote:
>>>> Hello Yang,
>>>>
>>>> On 5/8/24 04:05, Yang Shi wrote:
>>>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>>>> add + store in one instruction, it may trigger two page faults, the
>>>>> first fault is a read fault, the second fault is a write fault.
>>>> It may or it will definitely create two consecutive page faults. What
>>>> if the second write fault never came about. In that case an writable
>>>> page table entry would be created unnecessarily (or even wrongfully),
>>>> thus breaking the CoW.
>>>>
>>>> Just trying to understand, is the double page fault a possibility or
>>>> a certainty. Does that depend on architecture (please do provide some
>>>> links) or is it implementation defined.
>>> Christopher helped answer some questions, I will skip those if I have nothing to add.
>>>
>>> It is defined in ARM architecture reference manual, so it is not implementation defined.
>> Sure, but please replace the "may trigger" phrase above as appropriate.
> 
> Yeah, sure.
> 
>>
>>>>> Some applications use atomic RMW instructions to populate memory, for
>>>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>>> But why cannot normal store operation is sufficient for pre-touching
>>>> the heap memory, why read-modify-write (RMW) is required instead ?
>>> Memory write is fine, but it depends on applications. For example, JVM may want to "permit use of memory concurrently with pretouch". So they chose use atomic instead of memory write.
>>>
>>>>> at launch time) between v18 and v22.
>>>> V18, V22 ?
>>> v18/v19/v20/v21/v22
>>>
>>>>> But the double page fault has some problems:
>>>>>
>>>>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>>>>      readonly PTE for the read fault.  The write fault will trigger a
>>>>>      write-protection fault (CoW).  The CoW will allocate a new page and
>>>>>      make the PTE point to the new page, this needs TLB invalidations.  The
>>>>>      tlb invalidation and the mandatory memory barriers may incur
>>>>>      significant overhead, particularly on the machines with many cores.
>>>>>
>>>>> 2. Break up huge pages.  If THP is on the read fault will install huge
>>>>>      zero pages.  The later CoW will break up the huge page and allocate
>>>>>      base pages instead of huge page.  The applications have to rely on
>>>>>      khugepaged (kernel thread) to collapse huge pages asynchronously.
>>>>>      This also incurs noticeable performance penalty.
>>>>>
>>>>> 3. 512x page faults with huge page.  Due to #2, the applications have to
>>>>>      have page faults for every 4K area for the write, this makes the speed
>>>>>      up by using huge page actually gone.
>>>> The problems mentioned above are reasonable and expected.
>>>>    If the memory address has some valid data, it must have already reached there
>>>> via a previous write access, which would have caused initial CoW transition ?
>>>> If the memory address has no valid data to begin with, why even use RMW ?
>>>>
>>>>> So it sounds pointless to have two page faults since we know the memory
>>>>> will be definitely written very soon.  Forcing write fault for atomic RMW
>>>>> instruction makes some sense and it can solve the aforementioned problems:
>>>>>
>>>>> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
>>>>> barriers anymore.
>>>>> Secondly, it can populate writable huge pages in the first place and
>>>>> don't break them up.  Just one page fault is needed for 2M area instead
>>>>> of 512 faults and also save cpu time by not using khugepaged.
>>>>>
>>>>> A simple micro benchmark which populates 1G memory shows the number of
>>>>> page faults is reduced by half and the time spent by system is reduced
>>>>> by 60% on a VM running on Ampere Altra platform.
>>>>>
>>>>> And the benchmark for anonymous read fault on 1G memory, file read fault
>>>>> on 1G file (cold page cache and warm page cache) don't show noticeable
>>>>> regression.
>>>>>
>>>>> Some other architectures also have code inspection in page fault path,
>>>>> for example, SPARC and x86.
>>>> Okay, I was about to ask, but is not calling get_user() for all data
>>>> read page faults increase the cost for a hot code path in general for
>>>> some potential savings for a very specific use case. Not sure if that
>>>> is worth the trade-off.
>>> I tested read fault latency (anonymous read fault and file read fault), I didn't see noticeable regression.
>> Could you please run a multi threaded application accessing one common
>> buffer while running these atomic operations. We just need to ensure
>> that pagefault_disable()-enable() window is not preventing concurrent
>> page faults and adding access latency to other threads.
> 
> I modified page_fault1 test in will-it-scale to make it just generate read fault (the original code generated write fault), and anonymous read fault should be the most sensitive case to this change. Then I ran the test with different number of threads (1 - 160 because total 160 cores on my test machine), please see the below table (hopefully my email client won't mess it)

Right, only read data faults, i.e. (!FAULT_FLAG_WRITE and !FAULT_FLAG_INSTRUCTION),
enter the pagefault_disable/enable() window; all other faults skip it.

Thanks for providing the test results.

> 
> nr_threads           before                after            +/-
> 1                      2056996            2048030        -0.4%
> 20                    17836422          16718606      -6.27%
> 40                    28536237          27958875      -2.03%
> 60                    35947854          35236884      -2%
> 80                    31646632          39209665      +24%
> 100                  20836142          21017796      +0.9%
> 120                  20350980          20635603      +1.4%
> 140                  20041920          19904015      -0.7%
> 160                  19561908          20264360      +3.6%
> 
> Sometimes the after is better than the before, sometimes opposite. There are two outliers, other than them there is not noticeable regression.

This does not look that bad, but will probably let others weigh in.

> 
> To rule out the worst case, I also ran the test 100 iterations with 160 threads then compared the worst case:
> 
>     N           Min           Max        Median           Avg Stddev
>  100         34770         84979         65536       63537.7 10358.873
>  100         38077         87652         65536      63119.02 8792.7399
> 
> Still no noticeable regression.

To tighten this up, pagefault_enable() could probably be moved before
aarch64_insn_is_class_atomic(), which should not need page faults to be
disabled. Also, what about atomic instructions executed outside user mode
that cause similar scenarios? get_user() will not be able to fetch those.

> 
>>
>>>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>>>> ---
>>>>>    arch/arm64/include/asm/insn.h |  1 +
>>>>>    arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>>>>>    2 files changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)    \
>>>>>     * "-" means "don't care"
>>>>>     */
>>>>>    __AARCH64_INSN_FUNCS(class_branch_sys,    0x1c000000, 0x14000000)
>>>>> +__AARCH64_INSN_FUNCS(class_atomic,    0x3b200c00, 0x38200000)
>>>>>      __AARCH64_INSN_FUNCS(adr,    0x9F000000, 0x10000000)
>>>>>    __AARCH64_INSN_FUNCS(adrp,    0x9F000000, 0x90000000)
>>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>>> index 8251e2fea9c7..f7bceedf5ef3 100644
>>>>> --- a/arch/arm64/mm/fault.c
>>>>> +++ b/arch/arm64/mm/fault.c
>>>>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>        unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>>>>        unsigned long addr = untagged_addr(far);
>>>>>        struct vm_area_struct *vma;
>>>>> +    unsigned int insn;
>>>>>          if (kprobe_page_fault(regs, esr))
>>>>>            return 0;
>>>>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>        if (!vma)
>>>>>            goto lock_mmap;
>>>>>    +    if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>>>>> +        goto continue_fault;
>>>>> +
>>>>> +    pagefault_disable();
>>>>> +
>>>>> +    if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>>>>> +        pagefault_enable();
>>>>> +        goto continue_fault;
>>>>> +    }
>>>>> +
>>>>> +    if (aarch64_insn_is_class_atomic(insn)) {
>>>>> +        vm_flags = VM_WRITE;
>>>>> +        mm_flags |= FAULT_FLAG_WRITE;
>>>>> +    }
>>>>> +
>>>>> +    pagefault_enable();
>>>>> +
>>>>> +continue_fault:
>>>>>        if (!(vma->vm_flags & vm_flags)) {
>>>>>            vma_end_read(vma);
>>>>>            goto lock_mmap;
>
Catalin Marinas May 10, 2024, 12:11 p.m. UTC | #9
On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
> The atomic RMW instructions, for example, ldadd, actually does load +
> add + store in one instruction, it may trigger two page faults, the
> first fault is a read fault, the second fault is a write fault.
> 
> Some applications use atomic RMW instructions to populate memory, for
> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> at launch time) between v18 and v22.

I'd also argue that this should be optimised in openjdk. Is an LDADD
more efficient on your hardware than a plain STR? I hope it only does
one operation per page rather than per long. There's also MAP_POPULATE
that openjdk can use to pre-fault the pages with no additional fault.
This would be even more efficient than any store or atomic operation.
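
A minimal user-space sketch of the two pretouch styles being compared
here, for illustration only. The helper names pretouch_atomic_add0() and
map_populated() are made up, and whether the atomic add compiles to LDADD
depends on the toolchain and target options, which is an assumption, not
something stated in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: pretouch by atomically adding 0 to the first int
 * of every page. The stored value is unchanged, so other threads may use
 * the memory concurrently. Returns the number of pages touched.
 */
static size_t pretouch_atomic_add0(void *base, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t touched = 0;

	assert(base != NULL);
	for (size_t off = 0; off < len; off += page) {
		_Atomic int *p = (_Atomic int *)((char *)base + off);

		atomic_fetch_add_explicit(p, 0, memory_order_relaxed);
		touched++;
	}
	return touched;
}

/*
 * Hypothetical helper: let the kernel pre-fault the whole mapping at
 * mmap() time instead. Returns MAP_FAILED on error.
 */
static void *map_populated(size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}
```

The first helper takes one user-visible access per page; the second asks
the kernel to populate the range up front, so no per-page fault is taken
from user space afterwards.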

Not sure the reason for the architecture to report a read fault only on
atomics. Looking at the pseudocode, it checks for both but the read
permission takes priority. Also in case of a translation fault (which is
what we get on the first fault), I think the syndrome write bit is
populated as (!read && write), so 0 since 'read' is 1 for atomics.

> But the double page fault has some problems:
> 
> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>    readonly PTE for the read fault.  The write fault will trigger a
>    write-protection fault (CoW).  The CoW will allocate a new page and
>    make the PTE point to the new page, this needs TLB invalidations.  The
>    tlb invalidation and the mandatory memory barriers may incur
>    significant overhead, particularly on the machines with many cores.

I can see why the current behaviour is not ideal but I can't tell why
openjdk does it this way either.

A bigger hammer would be to implement mm_forbids_zeropage() but this may
affect some workloads that rely on sparsely populated large arrays.

> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index db1aeacd4cd9..5d5a3fbeecc0 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>   * "-" means "don't care"
>   */
>  __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)

This looks correct, it covers the LDADD and SWP instructions. However,
one concern is whether future architecture versions will add some
instructions in this space that are allowed to do a read only operation
(e.g. skip writing if the value is the same or fails some comparison).
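
The mask/value check can be tried out in user space with a short sketch
like the one below. The mask and value are copied from the patch; the
helper names and the four sample encodings are hand-assembled assumptions
added for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask and value from the __AARCH64_INSN_FUNCS(class_atomic, ...) line. */
#define CLASS_ATOMIC_MASK  0x3b200c00u
#define CLASS_ATOMIC_VALUE 0x38200000u

/* Hand-assembled A64 sample encodings (assumptions for illustration). */
#define INSN_LDADD_X0_X1_X2 0xf8200041u /* ldadd x0, x1, [x2] */
#define INSN_SWP_X0_X1_X2   0xf8208041u /* swp   x0, x1, [x2] */
#define INSN_LDR_X1_X2      0xf9400041u /* ldr   x1, [x2]     */
#define INSN_STR_X1_X2      0xf9000041u /* str   x1, [x2]     */

/* User-space stand-in for the generated aarch64_insn_is_class_atomic(). */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & CLASS_ATOMIC_MASK) == CLASS_ATOMIC_VALUE;
}

/* Hypothetical self-check: LDADD and SWP match, plain LDR/STR do not. */
static void class_atomic_self_check(void)
{
	assert(insn_is_class_atomic(INSN_LDADD_X0_X1_X2));
	assert(insn_is_class_atomic(INSN_SWP_X0_X1_X2));
	assert(!insn_is_class_atomic(INSN_LDR_X1_X2));
	assert(!insn_is_class_atomic(INSN_STR_X1_X2));
}
```

A check like this only covers the sample words listed; it says nothing
about encodings that may be allocated in the same space later.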

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 8251e2fea9c7..f7bceedf5ef3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>  	unsigned long addr = untagged_addr(far);
>  	struct vm_area_struct *vma;
> +	unsigned int insn;
>  
>  	if (kprobe_page_fault(regs, esr))
>  		return 0;
> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>  
> +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
> +		goto continue_fault;

I'd avoid the goto if possible. Even better, move this higher up into
the block of if/else statements building the vm_flags and mm_flags.
Factor out the checks into a different function - is_el0_atomic_instr()
or something.

> +
> +	pagefault_disable();

This prevents recursively entering do_page_fault() but it may be worth
testing it with an execute-only permission.

> +
> +	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
> +		pagefault_enable();
> +		goto continue_fault;
> +	}
> +
> +	if (aarch64_insn_is_class_atomic(insn)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

The above would need to check if the fault is coming from a 64-bit user
mode, otherwise the decoding wouldn't make sense:

	if (!user_mode(regs) || compat_user_mode(regs))
		return false;

(assuming a separate function that checks the above and returns a bool;
you'd need to re-enable the page faults)

You also need to take care of endianness since the instructions are
always little-endian. We use a similar pattern in user_insn_read():

	u32 instr;
	__le32 instr_le;
	if (get_user(instr_le, (__le32 __user *)instruction_pointer(regs)))
		return false;
	instr = le32_to_cpu(instr_le);
	...
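
Outside the kernel, the effect of the le32_to_cpu() step can be sketched
as below. The helper name insn_from_le_bytes() is made up for
illustration; the point is only that the instruction word is assembled
from its bytes in little-endian order regardless of the host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 32-bit A64 instruction word from its four
 * bytes as they sit in memory. Byte 0 is the least significant, so the
 * explicit shifts give the same result on little- and big-endian hosts.
 */
static uint32_t insn_from_le_bytes(const uint8_t *bytes)
{
	assert(bytes != NULL);
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

Reading the word with a plain 32-bit load instead would give a
byte-swapped value on a big-endian kernel, and the mask comparison would
then be applied to the wrong bits.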

That said, I'm not keen on this kernel workaround. If openjdk decides to
improve some security and goes for PROT_EXEC-only mappings of its text
sections, the above trick will no longer work.
Yang Shi May 10, 2024, 4:37 p.m. UTC | #10
On 5/9/24 9:28 PM, Anshuman Khandual wrote:
>
> On 5/10/24 03:16, Yang Shi wrote:
>>
>> On 5/8/24 9:31 PM, Anshuman Khandual wrote:
>>> On 5/9/24 00:07, Yang Shi wrote:
>>>> On 5/7/24 11:45 PM, Anshuman Khandual wrote:
>>>>> Hello Yang,
>>>>>
>>>>> On 5/8/24 04:05, Yang Shi wrote:
>>>>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>>>>> add + store in one instruction, it may trigger two page faults, the
>>>>>> first fault is a read fault, the second fault is a write fault.
>>>>> It may or it will definitely create two consecutive page faults. What
>>>>> if the second write fault never came about. In that case an writable
>>>>> page table entry would be created unnecessarily (or even wrongfully),
>>>>> thus breaking the CoW.
>>>>>
>>>>> Just trying to understand, is the double page fault a possibility or
>>>>> a certainty. Does that depend on architecture (please do provide some
>>>>> links) or is it implementation defined.
>>>> Christopher helped answer some questions, I will skip those if I have nothing to add.
>>>>
>>>> It is defined in ARM architecture reference manual, so it is not implementation defined.
>>> Sure, but please replace the "may trigger" phrase above as appropriate.
>> Yeah, sure.
>>
>>>>>> Some applications use atomic RMW instructions to populate memory, for
>>>>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>>>> But why cannot normal store operation is sufficient for pre-touching
>>>>> the heap memory, why read-modify-write (RMW) is required instead ?
>>>> Memory write is fine, but it depends on applications. For example, JVM may want to "permit use of memory concurrently with pretouch". So they chose use atomic instead of memory write.
>>>>
>>>>>> at launch time) between v18 and v22.
>>>>> V18, V22 ?
>>>> v18/v19/v20/v21/v22
>>>>
>>>>>> But the double page fault has some problems:
>>>>>>
>>>>>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>>>>>       readonly PTE for the read fault.  The write fault will trigger a
>>>>>>       write-protection fault (CoW).  The CoW will allocate a new page and
>>>>>>       make the PTE point to the new page, this needs TLB invalidations.  The
>>>>>>       tlb invalidation and the mandatory memory barriers may incur
>>>>>>       significant overhead, particularly on the machines with many cores.
>>>>>>
>>>>>> 2. Break up huge pages.  If THP is on the read fault will install huge
>>>>>>       zero pages.  The later CoW will break up the huge page and allocate
>>>>>>       base pages instead of huge page.  The applications have to rely on
>>>>>>       khugepaged (kernel thread) to collapse huge pages asynchronously.
>>>>>>       This also incurs noticeable performance penalty.
>>>>>>
>>>>>> 3. 512x page faults with huge page.  Due to #2, the applications have to
>>>>>>       have page faults for every 4K area for the write, this makes the speed
>>>>>>       up by using huge page actually gone.
>>>>> The problems mentioned above are reasonable and expected.
>>>>>     If the memory address has some valid data, it must have already reached there
>>>>> via a previous write access, which would have caused initial CoW transition ?
>>>>> If the memory address has no valid data to begin with, why even use RMW ?
>>>>>
>>>>>> So it sounds pointless to have two page faults since we know the memory
>>>>>> will be definitely written very soon.  Forcing write fault for atomic RMW
>>>>>> instruction makes some sense and it can solve the aforementioned problems:
>>>>>>
>>>>>> Firstly, it just allocates zero'ed page, no tlb invalidation and memory
>>>>>> barriers anymore.
>>>>>> Secondly, it can populate writable huge pages in the first place and
>>>>>> don't break them up.  Just one page fault is needed for 2M area instead
>>>>>> of 512 faults and also save cpu time by not using khugepaged.
>>>>>>
>>>>>> A simple micro benchmark which populates 1G memory shows the number of
>>>>>> page faults is reduced by half and the time spent by system is reduced
>>>>>> by 60% on a VM running on Ampere Altra platform.
>>>>>>
>>>>>> And the benchmark for anonymous read fault on 1G memory, file read fault
>>>>>> on 1G file (cold page cache and warm page cache) don't show noticeable
>>>>>> regression.
>>>>>>
>>>>>> Some other architectures also have code inspection in page fault path,
>>>>>> for example, SPARC and x86.
>>>>> Okay, I was about to ask, but is not calling get_user() for all data
>>>>> read page faults increase the cost for a hot code path in general for
>>>>> some potential savings for a very specific use case. Not sure if that
>>>>> is worth the trade-off.
>>>> I tested read fault latency (anonymous read fault and file read fault), I didn't see noticeable regression.
>>> Could you please run a multi threaded application accessing one common
>>> buffer while running these atomic operations. We just need to ensure
>>> that pagefault_disable()-enable() window is not preventing concurrent
>>> page faults and adding access latency to other threads.
>> I modified page_fault1 test in will-it-scale to make it just generate read fault (the original code generated write fault), and anonymous read fault should be the most sensitive case to this change. Then I ran the test with different number of threads (1 - 160 because total 160 cores on my test machine), please see the below table (hopefully my email client won't mess it)
> Right, only with read data faults i.e (!FAULT_FLAG_WRITE and !FAULT_FLAG_INSTRUCTION)
> code path enters the pagefault_disable/enable() window, but all others will skip it.
>
> Thanks for providing the test results.
>
>> nr_threads           before                after            +/-
>> 1                      2056996            2048030        -0.4%
>> 20                    17836422          16718606      -6.27%
>> 40                    28536237          27958875      -2.03%
>> 60                    35947854          35236884      -2%
>> 80                    31646632          39209665      +24%
>> 100                  20836142          21017796      +0.9%
>> 120                  20350980          20635603      +1.4%
>> 140                  20041920          19904015      -0.7%
>> 160                  19561908          20264360      +3.6%
>>
>> Sometimes the after is better than the before, sometimes opposite. There are two outliers, other than them there is not noticeable regression.
> This does not look that bad, but will probably let others weigh in.
>
>> To rule out the worst case, I also ran the test 100 iterations with 160 threads then compared the worst case:
>>
>>      N           Min           Max        Median           Avg Stddev
>>   100         34770         84979         65536       63537.7 10358.873
>>   100         38077         87652         65536      63119.02 8792.7399
>>
>> Still no noticeable regression.
> I guess to make things better, probably pagefault_enable() could be moved
> before aarch64_insn_is_class_atomic() which might not need page faults to
> be disabled ? Also what about non user mode atomic instructions, causing
> similar scenarios ? Because get_user() will not be able to fetch those.

Yes, you are right; we can re-enable page faults as soon as get_user()
returns.

We don't handle non-user-mode atomic instructions. Is that a problem? The
kernel address space should be mapped always, or at least most of the time.

>
>>>>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>>>>> ---
>>>>>>     arch/arm64/include/asm/insn.h |  1 +
>>>>>>     arch/arm64/mm/fault.c         | 19 +++++++++++++++++++
>>>>>>     2 files changed, 20 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)    \
>>>>>>      * "-" means "don't care"
>>>>>>      */
>>>>>>     __AARCH64_INSN_FUNCS(class_branch_sys,    0x1c000000, 0x14000000)
>>>>>> +__AARCH64_INSN_FUNCS(class_atomic,    0x3b200c00, 0x38200000)
>>>>>>       __AARCH64_INSN_FUNCS(adr,    0x9F000000, 0x10000000)
>>>>>>     __AARCH64_INSN_FUNCS(adrp,    0x9F000000, 0x90000000)
>>>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>>>> index 8251e2fea9c7..f7bceedf5ef3 100644
>>>>>> --- a/arch/arm64/mm/fault.c
>>>>>> +++ b/arch/arm64/mm/fault.c
>>>>>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>>         unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>>>>>         unsigned long addr = untagged_addr(far);
>>>>>>         struct vm_area_struct *vma;
>>>>>> +    unsigned int insn;
>>>>>>           if (kprobe_page_fault(regs, esr))
>>>>>>             return 0;
>>>>>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>>         if (!vma)
>>>>>>             goto lock_mmap;
>>>>>>     +    if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>>>>>> +        goto continue_fault;
>>>>>> +
>>>>>> +    pagefault_disable();
>>>>>> +
>>>>>> +    if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>>>>>> +        pagefault_enable();
>>>>>> +        goto continue_fault;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (aarch64_insn_is_class_atomic(insn)) {
>>>>>> +        vm_flags = VM_WRITE;
>>>>>> +        mm_flags |= FAULT_FLAG_WRITE;
>>>>>> +    }
>>>>>> +
>>>>>> +    pagefault_enable();
>>>>>> +
>>>>>> +continue_fault:
>>>>>>         if (!(vma->vm_flags & vm_flags)) {
>>>>>>             vma_end_read(vma);
>>>>>>             goto lock_mmap;
Yang Shi May 10, 2024, 5:13 p.m. UTC | #11
On 5/10/24 5:11 AM, Catalin Marinas wrote:
> On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
>> The atomic RMW instructions, for example, ldadd, actually does load +
>> add + store in one instruction, it may trigger two page faults, the
>> first fault is a read fault, the second fault is a write fault.
>>
>> Some applications use atomic RMW instructions to populate memory, for
>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>> at launch time) between v18 and v22.
> I'd also argue that this should be optimised in openjdk. Is an LDADD
> more efficient on your hardware than a plain STR? I hope it only does
> one operation per page rather than per long. There's also MAP_POPULATE
> that openjdk can use to pre-fault the pages with no additional fault.
> This would be even more efficient than any store or atomic operation.

It is not about whether an atomic is more efficient than a plain store on
our hardware. It is an arch-independent solution used by openjdk.

I agree that applications can use other ways to populate memory, but that
depends on the use case of each application. openjdk is just one example;
I can't scan all applications, but using atomic-add-0 to populate memory
looks like a valid use case.

> Not sure the reason for the architecture to report a read fault only on
> atomics. Looking at the pseudocode, it checks for both but the read
> permission takes priority. Also in case of a translation fault (which is
> what we get on the first fault), I think the syndrome write bit is
> populated as (!read && write), so 0 since 'read' is 1 for atomics.

Yeah, I'm confused too. Triggering a write fault in the first place should
be fine, right? Can we update the spec?

>> But the double page fault has some problems:
>>
>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>     readonly PTE for the read fault.  The write fault will trigger a
>>     write-protection fault (CoW).  The CoW will allocate a new page and
>>     make the PTE point to the new page, this needs TLB invalidations.  The
>>     tlb invalidation and the mandatory memory barriers may incur
>>     significant overhead, particularly on the machines with many cores.
> I can see why the current behaviour is not ideal but I can't tell why
> openjdk does it this way either.
>
> A bigger hammer would be to implement mm_forbids_zeropage() but this may
> affect some workloads that rely on sparsely populated large arrays.

But we still need to decode the insn, right? Or do you mean forbidding the
zero page for all read faults? IMHO, this may incur noticeable overhead for
read faults since the fault handler has to allocate a real page every time.

>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>>    * "-" means "don't care"
>>    */
>>   __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
>> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
> This looks correct, it covers the LDADD and SWP instructions. However,
> one concern is whether future architecture versions will add some
> instructions in this space that are allowed to do a read only operation
> (e.g. skip writing if the value is the same or fails some comparison).

I think we can identify the instruction by decoding it, right? Then we can
decide whether or not to force a write fault by decoding further.

>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 8251e2fea9c7..f7bceedf5ef3 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>   	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>   	unsigned long addr = untagged_addr(far);
>>   	struct vm_area_struct *vma;
>> +	unsigned int insn;
>>   
>>   	if (kprobe_page_fault(regs, esr))
>>   		return 0;
>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>   	if (!vma)
>>   		goto lock_mmap;
>>   
>> +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>> +		goto continue_fault;
> I'd avoid the goto if possible. Even better, move this higher up into
> the block of if/else statements building the vm_flags and mm_flags.
> Factor out the checks into a different function - is_el0_atomic_instr()
> or something.

Yeah, sure. I can hide all the details in is_el0_atomic_instr().

>> +
>> +	pagefault_disable();
> This prevents recursively entering do_page_fault() but it may be worth
> testing it with an execute-only permission.

You mean the text section of the test is mapped execute-only?

>> +
>> +	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>> +		pagefault_enable();
>> +		goto continue_fault;
>> +	}
>> +
>> +	if (aarch64_insn_is_class_atomic(insn)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
> The above would need to check if the fault is coming from a 64-bit user
> mode, otherwise the decoding wouldn't make sense:
>
> 	if (!user_mode(regs) || compat_user_mode(regs))
> 		return false;
>
> (assuming a separate function that checks the above and returns a bool;
> you'd need to re-enable the page faults)

Thanks for catching this. Will fix in v2.

> You also need to take care of endianness since the instructions are
> always little-endian. We use a similar pattern in user_insn_read():
>
> 	u32 instr;
> 	__le32 instr_le;
> 	if (get_user(instr_le, (__le32 __user *)instruction_pointer(regs)))
> 		return false;
> 	instr = le32_to_cpu(instr_le);

Sure, will take care of the endianness in v2.
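
To illustrate the endianness point outside the kernel, here is a minimal
userspace sketch. The helper name insn_from_le_bytes() is hypothetical and
is not part of the patch; it only shows that an A64 instruction word has to
be assembled from little-endian bytes whatever the data endianness of the
host is. The sample bytes in the note below are a hand-assembled
"ldadd x0, x1, [x2]" and are worth double-checking with an assembler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assemble a 32-bit A64 instruction word from four bytes in memory.
 * Instructions are stored little-endian, so byte 0 is the least
 * significant byte. This gives the same result on little-endian and
 * big-endian hosts, which is the effect le32_to_cpu() has in the kernel.
 */
static uint32_t insn_from_le_bytes(const uint8_t *bytes)
{
	assert(bytes != NULL);

	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

With the bytes 41 00 20 f8 in memory, the helper yields 0xf8200041.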

> That said, I'm not keen on this kernel workaround. If openjdk decides to
> improve some security and goes for PROT_EXEC-only mappings of its text
> sections, the above trick will no longer work.

I agree the optimization/workaround has limitations. The best fix is
still to change the spec so that a write fault is triggered in the first
place, but it will take time for the spec to be updated and for hardware
that implements it to become available.

Old hardware would still benefit from this optimization, though.
Christoph Lameter (Ampere) May 13, 2024, 10:39 p.m. UTC | #12
On Thu, 9 May 2024, Anshuman Khandual wrote:

>
>>> Okay, I was about to ask, but doesn't calling get_user() for all data
>>> read page faults increase the cost of a hot code path in general, in
>>> exchange for some potential savings in a very specific use case? Not
>>> sure that is worth the trade-off.
>>
>> The instruction is cache hot since it must be present in the cpu cache for the fault. So the overhead is minimal.
>>
>
> But couldn't a pagefault_disable()/pagefault_enable() window block
> concurrent page faults for the current process, thus degrading its
> performance?

The cpu is already executing a fault handler in kernel space. There cannot 
be an additional user space fault since we do not execute that code 
currently.
Christoph Lameter (Ampere) May 13, 2024, 10:41 p.m. UTC | #13
On Fri, 10 May 2024, Yang Shi wrote:

> Yeah, I'm confused too. Triggering write fault in the first place should be 
> fine, right? Can we update the spec?

That is certainly the best solution. However, there is a large base of
machines out there now with this issue.
Yang Shi May 14, 2024, 3:19 a.m. UTC | #14
>> +
>> +	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
>> +		pagefault_enable();
>> +		goto continue_fault;
>> +	}
>> +
>> +	if (aarch64_insn_is_class_atomic(insn)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
> The above would need to check if the fault is coming from a 64-bit user
> mode, otherwise the decoding wouldn't make sense:
>
> 	if (!user_mode(regs) || compat_user_mode(regs))
> 		return false;
>
> (assuming a separate function that checks the above and returns a bool;
> you'd need to re-enable the page faults)
>
> You also need to take care of endianness since the instructions are
> always little-endian. We use a similar pattern in user_insn_read():
>
> 	u32 instr;
> 	__le32 instr_le;
> 	if (get_user(instr_le, (__le32 __user *)instruction_pointer(regs)))
> 		return false;
> 	instr = le32_to_cpu(instr_le);
> 	...
>
> That said, I'm not keen on this kernel workaround. If openjdk decides to
> improve some security and goes for PROT_EXEC-only mappings of its text
> sections, the above trick will no longer work.

I noticed futex does replace insns. IIUC, the sequence below should do
the trick for exec-only, right?

disable privileged
read insn with ldxr
enable privileged

>
Catalin Marinas May 14, 2024, 10:39 a.m. UTC | #15
On Fri, May 10, 2024 at 10:13:02AM -0700, Yang Shi wrote:
> On 5/10/24 5:11 AM, Catalin Marinas wrote:
> > On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
> > > The atomic RMW instructions, for example, ldadd, actually does load +
> > > add + store in one instruction, it may trigger two page faults, the
> > > first fault is a read fault, the second fault is a write fault.
> > > 
> > > Some applications use atomic RMW instructions to populate memory, for
> > > example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> > > at launch time) between v18 and v22.
> > I'd also argue that this should be optimised in openjdk. Is an LDADD
> > more efficient on your hardware than a plain STR? I hope it only does
> > one operation per page rather than per long. There's also MAP_POPULATE
> > that openjdk can use to pre-fault the pages with no additional fault.
> > This would be even more efficient than any store or atomic operation.
> 
> It is not about whether atomic is more efficient than plain store on our
> hardware or not. It is arch-independent solution used by openjdk.

It may be arch independent but it's not a great choice. If you run this
on pre-LSE atomics hardware (ARMv8.0), this operation would involve
LDXR+STXR and there's no way for the kernel to "upgrade" it to a write
operation on the first LDXR fault.

It would be good to understand why openjdk is doing this instead of a
plain write. Is it because it may be racing with some other threads
already using the heap? That would be a valid pattern.

> > Not sure the reason for the architecture to report a read fault only on
> > atomics. Looking at the pseudocode, it checks for both but the read
> > permission takes priority. Also in case of a translation fault (which is
> > what we get on the first fault), I think the syndrome write bit is
> > populated as (!read && write), so 0 since 'read' is 1 for atomics.
> 
> Yeah, I'm confused too. Triggering write fault in the first place should be
> fine, right? Can we update the spec?

As you noticed, even if we change the spec, we still have the old
hardware. Also, changing the spec would probably need to come with a new
CPUID field since that's software visible. I'll raise it with the
architects, maybe in the future it will allow us to skip the instruction
read.

> > > But the double page fault has some problems:
> > > 
> > > 1. Noticeable TLB overhead.  The kernel actually installs zero page with
> > >     readonly PTE for the read fault.  The write fault will trigger a
> > >     write-protection fault (CoW).  The CoW will allocate a new page and
> > >     make the PTE point to the new page, this needs TLB invalidations.  The
> > >     tlb invalidation and the mandatory memory barriers may incur
> > >     significant overhead, particularly on the machines with many cores.
> > I can see why the current behaviour is not ideal but I can't tell why
> > openjdk does it this way either.
> > 
> > A bigger hammer would be to implement mm_forbids_zeropage() but this may
> > affect some workloads that rely on sparsely populated large arrays.
> 
> > But we still need to decode the insn, right? Or do you mean forbidding
> > the zero page for all read faults? IMHO, this may incur noticeable
> > overhead for read faults since the fault handler has to allocate a real
> > page every time.

The current kernel mm_forbids_zeropage() is a big knob irrespective of
the instruction triggering the fault.

> > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > > index db1aeacd4cd9..5d5a3fbeecc0 100644
> > > --- a/arch/arm64/include/asm/insn.h
> > > +++ b/arch/arm64/include/asm/insn.h
> > > @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
> > >    * "-" means "don't care"
> > >    */
> > >   __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
> > > +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
> > 
> > This looks correct, it covers the LDADD and SWP instructions. However,
> > one concern is whether future architecture versions will add some
> > instructions in this space that are allowed to do a read only operation
> > (e.g. skip writing if the value is the same or fails some comparison).
> 
> > I think we can identify the instruction by decoding it, right? Then we
> > can decide whether to force a write fault by further decoding.

Your mask above covers unallocated opcodes, we don't know what else will
get in there in the future, whether we get instructions that only do
reads. We could ask for clarification from the architects but I doubt
they'd commit to allocating it only to instructions that do a write in
this space. The alternative is to check for the individual instructions
already allocated in here (after the big mask check above) but this will
increase the fault cost a bit.

There are CAS and CASP variants that also require a write permission
even if they fail the check. We should cover them as well.
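
As a rough illustration of what the class_atomic mask/value pair from the
patch accepts, here is a userspace sketch. The macro and function names are
made up for this example; the comparison is assumed to be the usual
(insn & mask) == value test. The sample encodings in the note below were
assembled by hand from the A64 encoding tables and should be verified with
an assembler before being relied on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask and value copied from the class_atomic entry in the patch. */
#define CLASS_ATOMIC_MASK	0x3b200c00u
#define CLASS_ATOMIC_VALUE	0x38200000u

/* True if the instruction word falls into the masked opcode space. */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & CLASS_ATOMIC_MASK) == CLASS_ATOMIC_VALUE;
}
```

Under these assumptions, "ldadd x0, x1, [x2]" (0xf8200041) and
"swp x0, x1, [x2]" (0xf8208041) match, a plain "ldr x1, [x2]" (0xf9400041)
does not, and "cas x0, x1, [x2]" (0xc8a07c41) does not match either, which
is why CAS/CASP would need a separate check.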

> > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > index 8251e2fea9c7..f7bceedf5ef3 100644
> > > --- a/arch/arm64/mm/fault.c
> > > +++ b/arch/arm64/mm/fault.c
> > > @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >   	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
> > >   	unsigned long addr = untagged_addr(far);
> > >   	struct vm_area_struct *vma;
> > > +	unsigned int insn;
> > >   	if (kprobe_page_fault(regs, esr))
> > >   		return 0;
> > > @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >   	if (!vma)
> > >   		goto lock_mmap;
> > > +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
> > > +		goto continue_fault;
[...]
> > > +
> > > +	pagefault_disable();
> > 
> > This prevents recursively entering do_page_fault() but it may be worth
> > testing it with an execute-only permission.
> 
> > You mean the text section of the test is mapped execute-only?

Yes. Not widely used though.

A point Will raised was on potential ABI changes introduced by this
patch. The ESR_EL1 reported to user remains the same as per the hardware
spec (read-only), so from a SIGSEGV we may have some slight behaviour
changes:

1. PTE invalid:

   a) vma is VM_READ && !VM_WRITE permission - SIGSEGV reported with
      ESR_EL1.WnR == 0 in sigcontext with your patch. Without this
      patch, the PTE is mapped as PTE_RDONLY first and a subsequent
      fault will report SIGSEGV with ESR_EL1.WnR == 1.

   b) vma is !VM_READ && !VM_WRITE permission - SIGSEGV reported with
      ESR_EL1.WnR == 0, so no change from current behaviour, unless we
      fix the patch for (1.a) to fake the WnR bit which would change the
      current expectations.

2. PTE valid with PTE_RDONLY - we get a normal writeable fault in
   hardware, no need to fix ESR_EL1 up.

The patch would have to address (1) above but faking the ESR_EL1.WnR bit
based on the vma flags looks a bit fragile.

Similarly, we have userfaultfd that reports the fault to user. I think
in scenario (1) the kernel will report UFFD_PAGEFAULT_FLAG_WRITE with
your patch but no UFFD_PAGEFAULT_FLAG_WP. Without this patch, there are
indeed two faults, with the second having both UFFD_PAGEFAULT_FLAG_WP
and UFFD_PAGEFAULT_FLAG_WRITE set.
Catalin Marinas May 14, 2024, 10:53 a.m. UTC | #16
On Mon, May 13, 2024 at 09:19:39PM -0600, Yang Shi wrote:
> > That said, I'm not keen on this kernel workaround. If openjdk decides to
> > improve some security and goes for PROT_EXEC-only mappings of its text
> > sections, the above trick will no longer work.
> 
> I noticed futex does replace insns. IIUC, the sequence below should do
> the trick for exec-only, right?
> 
> disable privileged
> read insn with ldxr
> enable privileged

Do you mean not using the unprivileged LDTR as in get_user()? You don't
even need an LDXR, just plain LDR but with the extable entry etc.

However, with PIE we got proper execute-only permission (not the kind of
fake one where we disabled the PTE_USER bit while keeping PTE_UXN as 0).
So the futex-style approach won't work unless we changed the PIE_E1
entry for _PAGE_EXECONLY to be PIE_R by the kernel.
David Hildenbrand May 14, 2024, 3:57 p.m. UTC | #17
On 14.05.24 12:39, Catalin Marinas wrote:
> On Fri, May 10, 2024 at 10:13:02AM -0700, Yang Shi wrote:
>> On 5/10/24 5:11 AM, Catalin Marinas wrote:
>>> On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
>>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>>> add + store in one instruction, it may trigger two page faults, the
>>>> first fault is a read fault, the second fault is a write fault.
>>>>
>>>> Some applications use atomic RMW instructions to populate memory, for
>>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>>> at launch time) between v18 and v22.
>>> I'd also argue that this should be optimised in openjdk. Is an LDADD
>>> more efficient on your hardware than a plain STR? I hope it only does
>>> one operation per page rather than per long. There's also MAP_POPULATE
>>> that openjdk can use to pre-fault the pages with no additional fault.
>>> This would be even more efficient than any store or atomic operation.
>>
>> It is not about whether atomic is more efficient than plain store on our
>> hardware or not. It is arch-independent solution used by openjdk.
> 
> It may be arch independent but it's not a great choice. If you run this
> on pre-LSE atomics hardware (ARMv8.0), this operation would involve
> LDXR+STXR and there's no way for the kernel to "upgrade" it to a write
> operation on the first LDXR fault.
> 
> It would be good to understand why openjdk is doing this instead of a
> plain write. Is it because it may be racing with some other threads
> already using the heap? That would be a valid pattern.

Maybe openjdk should be switching to MADV_POPULATE_WRITE. QEMU did that 
for the preallocate/populate use case.
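
For reference, a minimal sketch of what that could look like from
userspace. The wrapper name prefault_writable() is hypothetical, and this
is not how openjdk or QEMU actually structure their code. It assumes a
kernel that supports MADV_POPULATE_WRITE (Linux 5.14 or later); on older
kernels the call is expected to fail with EINVAL and the caller has to
fall back to touching the pages itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* uapi value, for older libc headers */
#endif

/*
 * Ask the kernel to populate [addr, addr + len) with writable pages.
 * Returns 0 on success, or -1 with errno set (for example EINVAL when
 * the running kernel does not know this advice value).
 */
static int prefault_writable(void *addr, size_t len)
{
	return madvise(addr, len, MADV_POPULATE_WRITE);
}
```

Because the population is done by the kernel and does not modify page
contents, no load/store instruction is involved and the read-then-write
double fault does not arise.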
Yang Shi May 17, 2024, 4:10 p.m. UTC | #18
On 5/14/24 3:53 AM, Catalin Marinas wrote:
> On Mon, May 13, 2024 at 09:19:39PM -0600, Yang Shi wrote:
>>> That said, I'm not keen on this kernel workaround. If openjdk decides to
>>> improve some security and goes for PROT_EXEC-only mappings of its text
>>> sections, the above trick will no longer work.
>> I noticed futex does replace insns. IIUC, the sequence below should do
>> the trick for exec-only, right?
>>
>> disable privileged
>> read insn with ldxr
>> enable privileged
> Do you mean not using the unprivileged LDTR as in get_user()? You don't
> even need an LDXR, just plain LDR but with the extable entry etc.
>
> However, with PIE we got proper execute-only permission (not the kind of
> fake one where we disabled the PTE_USER bit while keeping PTE_UXN as 0).
> So the futex-style approach won't work unless we changed the PIE_E1
> entry for _PAGE_EXECONLY to be PIE_R by the kernel.

I see. Thanks. Yes, I did see this works without PIE. As you said in the 
earlier email, exec-only is not that popular yet. I think we can just 
ignore it for now.

>
Yang Shi May 17, 2024, 4:30 p.m. UTC | #19
On 5/14/24 3:39 AM, Catalin Marinas wrote:
> On Fri, May 10, 2024 at 10:13:02AM -0700, Yang Shi wrote:
>> On 5/10/24 5:11 AM, Catalin Marinas wrote:
>>> On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
>>>> The atomic RMW instructions, for example, ldadd, actually does load +
>>>> add + store in one instruction, it may trigger two page faults, the
>>>> first fault is a read fault, the second fault is a write fault.
>>>>
>>>> Some applications use atomic RMW instructions to populate memory, for
>>>> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
>>>> at launch time) between v18 and v22.
>>> I'd also argue that this should be optimised in openjdk. Is an LDADD
>>> more efficient on your hardware than a plain STR? I hope it only does
>>> one operation per page rather than per long. There's also MAP_POPULATE
>>> that openjdk can use to pre-fault the pages with no additional fault.
>>> This would be even more efficient than any store or atomic operation.
>> It is not about whether atomic is more efficient than plain store on our
>> hardware or not. It is arch-independent solution used by openjdk.
> It may be arch independent but it's not a great choice. If you run this
> on pre-LSE atomics hardware (ARMv8.0), this operation would involve
> LDXR+STXR and there's no way for the kernel to "upgrade" it to a write
> operation on the first LDXR fault.
>
> It would be good to understand why openjdk is doing this instead of a
> plain write. Is it because it may be racing with some other threads
> already using the heap? That would be a valid pattern.

Yes, you are right. I think I quoted the JVM justification in an earlier
email. Anyway, they said it is to "permit use of memory concurrently with
pretouch".
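
To make the pattern concrete, here is a small C11 sketch of an
atomic-add-0 pretouch loop. It is not the JVM code; the function name and
parameters are made up for illustration. Adding zero atomically leaves any
value already stored by another thread intact, which a plain store of zero
would not. On arm64 built with LSE atomics the fetch-add is typically
emitted as an LDADD-family instruction; without LSE it becomes an
LDXR/STXR loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Touch one word per page with an atomic add of zero. The word's value
 * is never changed, so the loop is safe even if other threads are
 * already using the memory.
 */
static void pretouch_atomic(atomic_int *base, size_t nwords,
			    size_t words_per_page)
{
	assert(words_per_page > 0);

	for (size_t i = 0; i < nwords; i += words_per_page)
		atomic_fetch_add_explicit(&base[i], 0, memory_order_relaxed);
}
```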

>
>>> Not sure the reason for the architecture to report a read fault only on
>>> atomics. Looking at the pseudocode, it checks for both but the read
>>> permission takes priority. Also in case of a translation fault (which is
>>> what we get on the first fault), I think the syndrome write bit is
>>> populated as (!read && write), so 0 since 'read' is 1 for atomics.
>> Yeah, I'm confused too. Triggering write fault in the first place should be
>> fine, right? Can we update the spec?
> As you noticed, even if we change the spec, we still have the old
> hardware. Also, changing the spec would probably need to come with a new
> CPUID field since that's software visible. I'll raise it with the
> architects, maybe in the future it will allow us to skip the instruction
> read.

Thank you.

>
>>>> But the double page fault has some problems:
>>>>
>>>> 1. Noticeable TLB overhead.  The kernel actually installs zero page with
>>>>      readonly PTE for the read fault.  The write fault will trigger a
>>>>      write-protection fault (CoW).  The CoW will allocate a new page and
>>>>      make the PTE point to the new page, this needs TLB invalidations.  The
>>>>      tlb invalidation and the mandatory memory barriers may incur
>>>>      significant overhead, particularly on the machines with many cores.
>>> I can see why the current behaviour is not ideal but I can't tell why
>>> openjdk does it this way either.
>>>
>>> A bigger hammer would be to implement mm_forbids_zeropage() but this may
>>> affect some workloads that rely on sparsely populated large arrays.
>> But we still need to decode the insn, right? Or do you mean forbidding
>> the zero page for all read faults? IMHO, this may incur noticeable
>> overhead for read faults since the fault handler has to allocate a real
>> page every time.
> The current kernel mm_forbids_zeropage() is a big knob irrespective of
> the instruction triggering the fault.

Yes.

>
>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>> index db1aeacd4cd9..5d5a3fbeecc0 100644
>>>> --- a/arch/arm64/include/asm/insn.h
>>>> +++ b/arch/arm64/include/asm/insn.h
>>>> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
>>>>     * "-" means "don't care"
>>>>     */
>>>>    __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
>>>> +__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
>>> This looks correct, it covers the LDADD and SWP instructions. However,
>>> one concern is whether future architecture versions will add some
>>> instructions in this space that are allowed to do a read only operation
>>> (e.g. skip writing if the value is the same or fails some comparison).
>> I think we can identify the instruction by decoding it, right? Then we
>> can decide whether to force a write fault by further decoding.
> Your mask above covers unallocated opcodes, we don't know what else will
> get in there in the future, whether we get instructions that only do
> reads. We could ask for clarification from the architects but I doubt
> they'd commit to allocating it only to instructions that do a write in
> this space. The alternative is to check for the individual instructions
> already allocated in here (after the big mask check above) but this will
> increase the fault cost a bit.
>
> There are CAS and CASP variants that also require a write permission
> even if they fail the check. We should cover them as well.

Sure, will cover in v2.

>
>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>> index 8251e2fea9c7..f7bceedf5ef3 100644
>>>> --- a/arch/arm64/mm/fault.c
>>>> +++ b/arch/arm64/mm/fault.c
>>>> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>    	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>>>>    	unsigned long addr = untagged_addr(far);
>>>>    	struct vm_area_struct *vma;
>>>> +	unsigned int insn;
>>>>    	if (kprobe_page_fault(regs, esr))
>>>>    		return 0;
>>>> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>    	if (!vma)
>>>>    		goto lock_mmap;
>>>> +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
>>>> +		goto continue_fault;
> [...]
>>>> +
>>>> +	pagefault_disable();
>>> This prevents recursively entering do_page_fault() but it may be worth
>>> testing it with an execute-only permission.
>> You mean the text section of the test is mapped execute-only?
> Yes. Not widely used though.

I tested my patch with exec-only. There was no crash; as expected, the
optimization simply didn't take effect.

>
> A point Will raised was on potential ABI changes introduced by this
> patch. The ESR_EL1 reported to user remains the same as per the hardware
> spec (read-only), so from a SIGSEGV we may have some slight behaviour
> changes:
>
> 1. PTE invalid:
>
>     a) vma is VM_READ && !VM_WRITE permission - SIGSEGV reported with
>        ESR_EL1.WnR == 0 in sigcontext with your patch. Without this
>        patch, the PTE is mapped as PTE_RDONLY first and a subsequent
>        fault will report SIGSEGV with ESR_EL1.WnR == 1.

I think I can do something like the below conceptually:

if is_el0_atomic_instr && !is_write_abort
     force_write = true

if VM_READ && !VM_WRITE && force_write == true
     vm_flags = VM_READ
     mm_flags &= ~FAULT_FLAG_WRITE

Then we just fall back to a read fault. The subsequent write fault will
trigger SIGSEGV with a consistent ABI.
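
Spelled out as compilable C, the idea might look roughly like the sketch
below. This is a userspace model only: the SK_* constants, struct
fault_req and pick_fault_flags() are hypothetical stand-ins, their values
are not meant to match the kernel definitions, and the real change would
live in do_page_fault().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flags; values are arbitrary. */
#define SK_VM_READ		0x1u
#define SK_VM_WRITE		0x2u
#define SK_FAULT_FLAG_WRITE	0x1u

struct fault_req {
	unsigned int vm_flags;	/* permission the vma must grant */
	unsigned int mm_flags;	/* flags handed to the core fault code */
};

static struct fault_req pick_fault_flags(bool is_atomic_insn,
					 bool is_write_abort,
					 unsigned int vma_flags)
{
	struct fault_req req = { SK_VM_READ, 0 };
	/* A read abort raised by an atomic RMW instruction. */
	bool force_write = is_atomic_insn && !is_write_abort;

	if (is_write_abort || force_write) {
		req.vm_flags = SK_VM_WRITE;
		req.mm_flags |= SK_FAULT_FLAG_WRITE;
	}

	/* Undo the forced write when the vma does not allow writes. */
	if (force_write && !(vma_flags & SK_VM_WRITE)) {
		req.vm_flags = SK_VM_READ;
		req.mm_flags &= ~SK_FAULT_FLAG_WRITE;
	}

	return req;
}
```

A genuine write abort is left untouched, so a store to a read-only vma
still fails the permission check exactly as it does today.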

>
>     b) vma is !VM_READ && !VM_WRITE permission - SIGSEGV reported with
>        ESR_EL1.WnR == 0, so no change from current behaviour, unless we
>        fix the patch for (1.a) to fake the WnR bit which would change the
>        current expectations.
>
> 2. PTE valid with PTE_RDONLY - we get a normal writeable fault in
>     hardware, no need to fix ESR_EL1 up.
>
> The patch would have to address (1) above but faking the ESR_EL1.WnR bit
> based on the vma flags looks a bit fragile.

I think we don't need to fake the ESR_EL1.WnR bit with the fallback.

>
> Similarly, we have userfaultfd that reports the fault to user. I think
> in scenario (1) the kernel will report UFFD_PAGEFAULT_FLAG_WRITE with
> your patch but no UFFD_PAGEFAULT_FLAG_WP. Without this patch, there are
> indeed two faults, with the second having both UFFD_PAGEFAULT_FLAG_WP
> and UFFD_PAGEFAULT_FLAG_WRITE set.

I don't quite get what the problem is. IIUC, uffd just needs a signal
from the kernel telling it that this area will be written, so this does
not seem to break the semantics. I have added Peter Xu, the uffd
developer, to this thread. He may be able to shed some light.

>
Catalin Marinas May 17, 2024, 5:25 p.m. UTC | #20
On Fri, May 17, 2024 at 09:30:23AM -0700, Yang Shi wrote:
> On 5/14/24 3:39 AM, Catalin Marinas wrote:
> > It would be good to understand why openjdk is doing this instead of a
> > plain write. Is it because it may be racing with some other threads
> > already using the heap? That would be a valid pattern.
> 
> Yes, you are right. I think I quoted the JVM justification in earlier email,
> anyway they said "permit use of memory concurrently with pretouch".

Ah, sorry, I missed that. This seems like a valid reason.

> > A point Will raised was on potential ABI changes introduced by this
> > patch. The ESR_EL1 reported to user remains the same as per the hardware
> > spec (read-only), so from a SIGSEGV we may have some slight behaviour
> > changes:
> > 
> > 1. PTE invalid:
> > 
> >     a) vma is VM_READ && !VM_WRITE permission - SIGSEGV reported with
> >        ESR_EL1.WnR == 0 in sigcontext with your patch. Without this
> >        patch, the PTE is mapped as PTE_RDONLY first and a subsequent
> >        fault will report SIGSEGV with ESR_EL1.WnR == 1.
> 
> I think I can do something like the below conceptually:
> 
> if is_el0_atomic_instr && !is_write_abort
>     force_write = true
> 
> if VM_READ && !VM_WRITE && force_write == true

Nit: write implies read, so you only need to check !write.

>     vm_flags = VM_READ
>     mm_flags &= ~FAULT_FLAG_WRITE
> 
> Then we just fallback to read fault. The following write fault will trigger
> SIGSEGV with consistent ABI.

I think this should work. So instead of reporting the write fault
directly in case of a read-only vma, we let the core code handle the
read fault first and then we retry the atomic instruction.

> >     b) vma is !VM_READ && !VM_WRITE permission - SIGSEGV reported with
> >        ESR_EL1.WnR == 0, so no change from current behaviour, unless we
> >        fix the patch for (1.a) to fake the WnR bit which would change the
> >        current expectations.
> > 
> > 2. PTE valid with PTE_RDONLY - we get a normal writeable fault in
> >     hardware, no need to fix ESR_EL1 up.
> > 
> > The patch would have to address (1) above but faking the ESR_EL1.WnR bit
> > based on the vma flags looks a bit fragile.
> 
> I think we don't need to fake the ESR_EL1.WnR bit with the fallback.

I agree, with your approach above we don't need to fake WnR.

> > Similarly, we have userfaultfd that reports the fault to user. I think
> > in scenario (1) the kernel will report UFFD_PAGEFAULT_FLAG_WRITE with
> > your patch but no UFFD_PAGEFAULT_FLAG_WP. Without this patch, there are
> > indeed two faults, with the second having both UFFD_PAGEFAULT_FLAG_WP
> > and UFFD_PAGEFAULT_FLAG_WRITE set.
> 
> I don't quite get what the problem is. IIUC, uffd just needs a signal from
> kernel to tell this area will be written. It seems not break the semantic.
> Added Peter Xu in this loop, who is the uffd developer. He may shed some
> light.

Not really familiar with uffd but just looking at the code, if a handler
is registered for both MODE_MISSING and MODE_WP, currently the atomic
instruction signals a user fault without UFFD_PAGEFAULT_FLAG_WRITE (the
do_anonymous_page() path). If the page is mapped by the uffd handler as
the zero page, a restart of the instruction would signal
UFFD_PAGEFAULT_FLAG_WRITE and UFFD_PAGEFAULT_FLAG_WP (the do_wp_page()
path).

With your patch, we get the equivalent of UFFD_PAGEFAULT_FLAG_WRITE on
the first attempt, just like having a STR instruction instead of
separate LDR + STR (as the atomics behave from a fault perspective).

However, I don't think that's a problem, the uffd handler should cope
with an STR anyway, so it's not some unexpected combination of flags.
Yang Shi May 17, 2024, 5:35 p.m. UTC | #21
On 5/17/24 10:25 AM, Catalin Marinas wrote:
> On Fri, May 17, 2024 at 09:30:23AM -0700, Yang Shi wrote:
>> On 5/14/24 3:39 AM, Catalin Marinas wrote:
>>> It would be good to understand why openjdk is doing this instead of a
>>> plain write. Is it because it may be racing with some other threads
>>> already using the heap? That would be a valid pattern.
>> Yes, you are right. I think I quoted the JVM justification in earlier email,
>> anyway they said "permit use of memory concurrently with pretouch".
> Ah, sorry, I missed that. This seems like a valid reason.

I should have articulated this in the commit log. Will add this in v2.

>
>>> A point Will raised was on potential ABI changes introduced by this
>>> patch. The ESR_EL1 reported to user remains the same as per the hardware
>>> spec (read-only), so from a SIGSEGV we may have some slight behaviour
>>> changes:
>>>
>>> 1. PTE invalid:
>>>
>>>      a) vma is VM_READ && !VM_WRITE permission - SIGSEGV reported with
>>>         ESR_EL1.WnR == 0 in sigcontext with your patch. Without this
>>>         patch, the PTE is mapped as PTE_RDONLY first and a subsequent
>>>         fault will report SIGSEGV with ESR_EL1.WnR == 1.
>> I think I can do something like the below conceptually:
>>
>> if is_el0_atomic_instr && !is_write_abort
>>      force_write = true
>>
>> if VM_READ && !VM_WRITE && force_write == true
> Nit: write implies read, so you only need to check !write.
>
>>      vm_flags = VM_READ
>>      mm_flags &= ~FAULT_FLAG_WRITE
>>
>> Then we just fallback to read fault. The following write fault will trigger
>> SIGSEGV with consistent ABI.
> I think this should work. So instead of reporting the write fault
> directly in case of a read-only vma, we let the core code handle the
> read fault first and then we retry the atomic instruction.

Yes, just undo the force write when vma flags don't allow it.

>
>>>      b) vma is !VM_READ && !VM_WRITE permission - SIGSEGV reported with
>>>         ESR_EL1.WnR == 0, so no change from current behaviour, unless we
>>>         fix the patch for (1.a) to fake the WnR bit which would change the
>>>         current expectations.
>>>
>>> 2. PTE valid with PTE_RDONLY - we get a normal writeable fault in
>>>      hardware, no need to fix ESR_EL1 up.
>>>
>>> The patch would have to address (1) above but faking the ESR_EL1.WnR bit
>>> based on the vma flags looks a bit fragile.
>> I think we don't need to fake the ESR_EL1.WnR bit with the fallback.
> I agree, with your approach above we don't need to fake WnR.
>
>>> Similarly, we have userfaultfd that reports the fault to user. I think
>>> in scenario (1) the kernel will report UFFD_PAGEFAULT_FLAG_WRITE with
>>> your patch but no UFFD_PAGEFAULT_FLAG_WP. Without this patch, there are
>>> indeed two faults, with the second having both UFFD_PAGEFAULT_FLAG_WP
>>> and UFFD_PAGEFAULT_FLAG_WRITE set.
>> I don't quite get what the problem is. IIUC, uffd just needs a signal from
>> kernel to tell this area will be written. It seems not break the semantic.
>> Added Peter Xu in this loop, who is the uffd developer. He may shed some
>> light.
> Not really familiar with uffd but just looking at the code, if a handler
> is registered for both MODE_MISSING and MODE_WP, currently the atomic
> instruction signals a user fault without UFFD_PAGEFAULT_FLAG_WRITE (the
> do_anonymous_page() path). If the page is mapped by the uffd handler as
> the zero page, a restart of the instruction would signal
> UFFD_PAGEFAULT_FLAG_WRITE and UFFD_PAGEFAULT_FLAG_WP (the do_wp_page()
> path).
>
> With your patch, we get the equivalent of UFFD_PAGEFAULT_FLAG_WRITE on
> the first attempt, just like having a STR instruction instead of
> separate LDR + STR (as the atomics behave from a fault perspective).
>
> However, I don't think that's a problem, the uffd handler should cope
> with an STR anyway, so it's not some unexpected combination of flags.

Yes, this is what I thought.

>

Patch

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index db1aeacd4cd9..5d5a3fbeecc0 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -319,6 +319,7 @@  static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8251e2fea9c7..f7bceedf5ef3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -529,6 +529,7 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
 	unsigned long addr = untagged_addr(far);
 	struct vm_area_struct *vma;
+	unsigned int insn;
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
@@ -586,6 +587,24 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
+		goto continue_fault;
+
+	pagefault_disable();
+
+	if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
+		pagefault_enable();
+		goto continue_fault;
+	}
+
+	if (aarch64_insn_is_class_atomic(insn)) {
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+	}
+
+	pagefault_enable();
+
+continue_fault:
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		goto lock_mmap;