
[v5] arm64: mm: force write fault for atomic RMW instructions

Message ID 20240626191830.3819324-1-yang@os.amperecomputing.com (mailing list archive)
State New, archived
Series [v5] arm64: mm: force write fault for atomic RMW instructions

Commit Message

Yang Shi June 26, 2024, 7:18 p.m. UTC
Atomic RMW instructions, for example ldadd, do load + add + store in
one instruction, yet per the ARM64 architecture spec they trigger two
page faults: the first fault is a read fault, the second is a write
fault.

Some applications use atomic RMW instructions to populate memory.  For
example, OpenJDK v18 through v22 uses atomic-add-0 for pretouch
(populating heap memory at launch time) so that the memory can be used
concurrently with the pretouch.

But the double page fault has some problems:

1. Noticeable TLB overhead.  The kernel installs the zero page with a
   readonly PTE for the read fault.  The write fault then triggers a
   write-protection fault (CoW).  The CoW allocates a new page and
   points the PTE at the new page, which needs TLB invalidation.  The
   TLB invalidation and the mandatory memory barriers may incur
   significant overhead, particularly on machines with many cores.

2. Break up huge pages.  If THP is on, the read fault installs the huge
   zero page.  The later CoW breaks up the huge page and allocates base
   pages instead of a huge page.  The application then has to rely on
   khugepaged (a kernel thread) to collapse huge pages asynchronously,
   which also incurs a noticeable performance penalty.

3. 512x page faults with huge pages.  Due to #2, the application takes
   a page fault for every 4K area of the write, negating the speedup
   from using huge pages.

So it is pointless to take two page faults when we know the memory will
definitely be written very soon.  Forcing a write fault for atomic RMW
instructions makes sense and solves the aforementioned problems:

Firstly, it just allocates a zeroed page; no TLB invalidation or memory
barriers are needed anymore.
Secondly, it can populate writable huge pages in the first place
without breaking them up.  Just one page fault is needed for a 2M area
instead of 512 faults, and CPU time is saved by not involving
khugepaged.

A simple micro benchmark that populates 1G of memory shows the number
of page faults is reduced by half and system time is reduced by 60% on
a VM running on an Ampere Altra platform.

Benchmarks for anonymous read faults on 1G of memory and file read
faults on a 1G file (cold and warm page cache) show no noticeable
regression.

Exclude unallocated instructions and LD64B/LDAPR instructions.

Some other architectures also have code inspection in page fault path,
for example, SPARC and x86.

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/insn.h | 20 ++++++++++++++++++++
 arch/arm64/mm/fault.c         | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

v5: 1. Used vm_flags & VM_READ per Catalin.
    2. Collected ack tag from David.
v4: 1. Fixed the comments from Catalin.
    2. Rebased to v6.10-rc2.
    3. Collected the review tag from Christopher Lameter.
v3: Exclude unallocated insns and LD64B/LDAPR per Catalin. And thanks
    to D Scott for helping figure out the minimum conditions.
v2: 1. Made commit log more precise per Anshuman and Catalin
    2. Made pagefault_disable/enable window narrower per Anshuman
    3. Covered CAS and CASP variants per Catalin
    4. Put instruction fetching and decoding into a helper function and
       take into account endianness per Catalin
    5. Don't fetch and decode insn for 32 bit mode (compat) per Catalin
    6. More performance tests and exec-only test per Anshuman and Catalin

Comments

Catalin Marinas June 28, 2024, 4:54 p.m. UTC | #1
On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>  
> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> +	    is_el0_atomic_instr(regs)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

The patch looks fine now and AFAICT there's no ABI change.

However, before deciding whether to merge this patch, I'd like to
understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
would be the portable (Linux) solution that works better on
architectures without such atomic instructions (e.g. arm64 without LSE
atomics). So fixing user-space would be my preferred solution.

(I poked some people in Arm working in the area, hopefully I'll get some
more information)
Christoph Lameter (Ampere) June 28, 2024, 4:57 p.m. UTC | #2
On Fri, 28 Jun 2024, Catalin Marinas wrote:

> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>  	if (!vma)
>>  		goto lock_mmap;
>>
>> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
>> +	    is_el0_atomic_instr(regs)) {
>> +		vm_flags = VM_WRITE;
>> +		mm_flags |= FAULT_FLAG_WRITE;
>> +	}
>
> The patch looks fine now and AFAICT there's no ABI change.
>
> However, before deciding whether to merge this patch, I'd like to
> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> would be the portable (Linux) solution that works better on
> architectures without such atomic instructions (e.g. arm64 without LSE
> atomics). So fixing user-space would be my preferred solution.

Doing so would be requesting application code changes that are Linux-
and arm64-specific from applications running on Linux. A lot of these
are proprietary.
Catalin Marinas June 28, 2024, 5:24 p.m. UTC | #3
On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 28 Jun 2024, Catalin Marinas wrote:
> > On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> > > @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >  	if (!vma)
> > >  		goto lock_mmap;
> > > 
> > > +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> > > +	    is_el0_atomic_instr(regs)) {
> > > +		vm_flags = VM_WRITE;
> > > +		mm_flags |= FAULT_FLAG_WRITE;
> > > +	}
> > 
> > The patch looks fine now and AFAICT there's no ABI change.
> > 
> > However, before deciding whether to merge this patch, I'd like to
> > understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
> > would be the portable (Linux) solution that works better on
> > architectures without such atomic instructions (e.g. arm64 without LSE
> > atomics). So fixing user-space would be my preferred solution.
> 
> Doing so would be requesting application code changes that are linux and
> ARM64 specific from applications running on Linux.

Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
not. I'd argue that expecting the atomic_add(0) to only trigger a single
write fault is arch specific. You can't do this on arm32 or arm64
pre-LSE (I haven't checked other architectures).

IIUC, OpenJDK added this feature about two years ago but the arm64
behaviour hasn't changed in the meantime. So it's not like we broke the
ABI and are forcing user space to update.

This patch does feel a bit like working around a non-optimal user choice
in kernel space. Who knows, madvise() may even be quicker if you do a
single call for a larger VA vs touching each page.

> A lot of these are proprietary.

Are you aware of other (proprietary) software relying on such pattern to
fault pages in as writeable?
Yang Shi June 28, 2024, 6:20 p.m. UTC | #4
On 6/28/24 10:24 AM, Catalin Marinas wrote:
> On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
>> On Fri, 28 Jun 2024, Catalin Marinas wrote:
>>> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
>>>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>   	if (!vma)
>>>>   		goto lock_mmap;
>>>>
>>>> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
>>>> +	    is_el0_atomic_instr(regs)) {
>>>> +		vm_flags = VM_WRITE;
>>>> +		mm_flags |= FAULT_FLAG_WRITE;
>>>> +	}
>>> The patch looks fine now and AFAICT there's no ABI change.
>>>
>>> However, before deciding whether to merge this patch, I'd like to
>>> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This
>>> would be the portable (Linux) solution that works better on
>>> architectures without such atomic instructions (e.g. arm64 without LSE
>>> atomics). So fixing user-space would be my preferred solution.
>> Doing so would be requesting application code changes that are linux and
>> ARM64 specific from applications running on Linux.
> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).
>
> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.
>
> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

IMHO, I don't think so. I view this patch as solving, or working
around, an ISA inefficiency in the kernel. Two faults are not necessary
if we know we are definitely going to write the memory very soon,
right?

>
>> A lot of these are proprietary.
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?
>
Christoph Lameter (Ampere) June 28, 2024, 6:26 p.m. UTC | #5
On Fri, 28 Jun 2024, Catalin Marinas wrote:

> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely
> not. I'd argue that expecting the atomic_add(0) to only trigger a single
> write fault is arch specific. You can't do this on arm32 or arm64
> pre-LSE (I haven't checked other architectures).

The single write fault is x86 behavior. I am not sure how other 
architectures handle that.

> IIUC, OpenJDK added this feature about two years ago but the arm64
> behaviour hasn't changed in the meantime. So it's not like we broke the
> ABI and forcing user space to update.

The focus of OpenJDK may not be arm64 and they never saw the issue? We
only know this because we have an insider on staff. AFAICT we get
pushback from them as well. There are certainly numerous other open
source applications that behave in a similar way. We just don't know
about it.

> This patch does feel a bit like working around a non-optimal user choice
> in kernel space. Who knows, madvise() may even be quicker if you do a
> single call for a larger VA vs touching each page.

Looks to me like unexpected surprising behavior on ARM64. madvise is 
rather particular to Linux and its semantics are ever evolving.

>> A lot of these are proprietary.
>
> Are you aware of other (proprietary) software relying on such pattern to
> fault pages in as writeable?

I would not be told about such things by companies I did not work for,
and if I had gained knowledge about this in some way in the past, I
would not be allowed to talk about it.
Catalin Marinas July 1, 2024, 7:43 p.m. UTC | #6
On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
> On 6/28/24 10:24 AM, Catalin Marinas wrote:
> > This patch does feel a bit like working around a non-optimal user choice
> > in kernel space. Who knows, madvise() may even be quicker if you do a
> > single call for a larger VA vs touching each page.
> 
> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
> inefficiency in kernel. Two faults are not necessary if we know we are
> definitely going to write the memory very soon, right?

I agree the Arm architecture behaviour is not ideal here, and any
timelines for fixing it in hardware, if that ever happens, are far into
the future. Purely from a kernel perspective, what I want is to make
sure that longer term (a) we don't create additional maintenance burden
and (b) we don't keep dead code around.

Point (a) could be mitigated if the architecture is changed so that any
new atomic instructions added to this range would also come with
additional syndrome information so that we don't have to update the
decoding patterns.

Point (b), however, depends on the OpenJDK and the kernel versions in
distros. Nick Gasson kindly provided some information on the OpenJDK
changes. The atomic_add(0) change happened in early 2022, about 5-6
months after MADV_POPULATE_WRITE support was added to the kernel. What's
interesting is Ampere already contributed MADV_POPULATE_WRITE support to
OpenJDK a few months ago:

https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a

The OpenJDK commit lacks explanation but what I gathered from the diff
is that this option is the preferred one in the presence of THP (which
most/all distros enable by default). If we merge your proposed kernel
patch, it will take time before it makes its way into distros. I'm
hoping that by that time, distros would have picked a new OpenJDK
version already that doesn't need the atomic_add(0) pattern. If that's
the case, we end up with some dead code in the kernel that's almost
never exercised.

I don't follow OpenJDK development but I heard that updates are dragging
quite a lot. I can't tell whether people have picked up the
atomic_add(0) feature and whether, by the time a kernel patch would make
it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.

There's a point (c) as well on the overhead of reading the faulting
instruction. I hope that's negligible but I haven't measured it.
Ryan Roberts July 2, 2024, 10:26 a.m. UTC | #7
On 01/07/2024 20:43, Catalin Marinas wrote:
> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>> This patch does feel a bit like working around a non-optimal user choice
>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>> single call for a larger VA vs touching each page.
>>
>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>> inefficiency in kernel. Two faults are not necessary if we know we are
>> definitely going to write the memory very soon, right?
> 
> I agree the Arm architecture behaviour is not ideal here and any
> timelines for fixing it in hardware, if they do happen, are far into the
> future. Purely from a kernel perspective, what I want though is make
> sure that longer term (a) we don't create additional maintenance burden
> and (b) we don't keep dead code around.
> 
> Point (a) could be mitigated if the architecture is changed so that any
> new atomic instructions added to this range would also come with
> additional syndrome information so that we don't have to update the
> decoding patterns.
> 
> Point (b), however, depends on the OpenJDK and the kernel versions in
> distros. Nick Gasson kindly provided some information on the OpenJDK
> changes. The atomic_add(0) change happened in early 2022, about 5-6
> months after MADV_POPULATE_WRITE support was added to the kernel. What's
> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
> OpenJDK a few months ago:
> 
> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
> 
> The OpenJDK commit lacks explanation but what I gathered from the diff
> is that this option is the preferred one in the presence of THP (which
> most/all distros enable by default). If we merge your proposed kernel
> patch, it will take time before it makes its way into distros. I'm
> hoping that by that time, distros would have picked a new OpenJDK
> version already that doesn't need the atomic_add(0) pattern. If that's
> the case, we end up with some dead code in the kernel that's almost
> never exercised.
> 
> I don't follow OpenJDK development but I heard that updates are dragging
> quite a lot. I can't tell whether people have picked up the
> atomic_add(0) feature and whether, by the time a kernel patch would make
> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
> 
> There's a point (c) as well on the overhead of reading the faulting
> instruction. I hope that's negligible but I haven't measured it.
> 

Just to add to this, I note the existing kernel behaviour is that if a write
fault happens in a region that has a (RO) huge zero page mapped at PMD level,
then the PMD is shattered, the PTE of the fault address is populated with a
writable page and the remaining PTEs are populated with order-0 zero pages
(read-only).

This seems like odd behaviour to me. Surely it would be less effort and more
aligned with the app's expectations to notice the huge zero page in the PMD,
remove it, and install a THP, as would have been done if pmd_none() was true? I
don't think there is a memory bloat argument here because, IIUC, with the
current behaviour, khugepaged would eventually upgrade it to a THP anyway?

Changing to this new behaviour would only be a partial solution for your use
case, since you would still have 2 faults. But it would remove the cost of the
shattering and ensure you have a THP immediately after the write fault. But I
can't think of a reason why this wouldn't be a generally useful change
regardless? Any thoughts?

Thanks,
Ryan
David Hildenbrand July 2, 2024, 11:22 a.m. UTC | #8
On 02.07.24 12:26, Ryan Roberts wrote:
> On 01/07/2024 20:43, Catalin Marinas wrote:
>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>> This patch does feel a bit like working around a non-optimal user choice
>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>> single call for a larger VA vs touching each page.
>>>
>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>>> inefficiency in kernel. Two faults are not necessary if we know we are
>>> definitely going to write the memory very soon, right?
>>
>> I agree the Arm architecture behaviour is not ideal here and any
>> timelines for fixing it in hardware, if they do happen, are far into the
>> future. Purely from a kernel perspective, what I want though is make
>> sure that longer term (a) we don't create additional maintenance burden
>> and (b) we don't keep dead code around.
>>
>> Point (a) could be mitigated if the architecture is changed so that any
>> new atomic instructions added to this range would also come with
>> additional syndrome information so that we don't have to update the
>> decoding patterns.
>>
>> Point (b), however, depends on the OpenJDK and the kernel versions in
>> distros. Nick Gasson kindly provided some information on the OpenJDK
>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>> OpenJDK a few months ago:
>>
>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>
>> The OpenJDK commit lacks explanation but what I gathered from the diff
>> is that this option is the preferred one in the presence of THP (which
>> most/all distros enable by default). If we merge your proposed kernel
>> patch, it will take time before it makes its way into distros. I'm
>> hoping that by that time, distros would have picked a new OpenJDK
>> version already that doesn't need the atomic_add(0) pattern. If that's
>> the case, we end up with some dead code in the kernel that's almost
>> never exercised.
>>
>> I don't follow OpenJDK development but I heard that updates are dragging
>> quite a lot. I can't tell whether people have picked up the
>> atomic_add(0) feature and whether, by the time a kernel patch would make
>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>
>> There's a point (c) as well on the overhead of reading the faulting
>> instruction. I hope that's negligible but I haven't measured it.
>>
> 
> Just to add to this, I note the existing kernel behaviour is that if a write
> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
> then the PMD is shattered, the PTE of the fault address is populated with a
> writable page and the remaining PTEs are populated with order-0 zero pages
> (read-only).

That also recently popped up in [1]. CCing Jinjiang. Ever since I
replied there, I also thought some more about that handling in regard to the
huge zeropage.

> 
> This seems like odd behaviour to me. Surely it would be less effort and more
> aligned with the app's expectations to notice the huge zero page in the PMD,
> remove it, and install a THP, as would have been done if pmd_none() was true? I
> don't think there is a memory bloat argument here because, IIUC, with the
> current behaviour, khugepaged would eventually upgrade it to a THP anyway?

One detail: depending on the setting of khugepaged_max_ptes_none,
zeropages are treated like pte_none. But in the common case, that
setting is left alone.

> 
> Changing to this new behaviour would only be a partial solution for your use
> case, since you would still have 2 faults. But it would remove the cost of the
> shattering and ensure you have a THP immediately after the write fault. But I
> can't think of a reason why this wouldn't be a generally useful change
> regardless? Any thoughts?

The "let's read before we write" pattern as used by the QEMU migration
code reflects the desire not to waste memory: only zeropages are
populated, deferring real memory consumption until actually required.

     /*
      * We read one byte of each page; this will preallocate page tables if
      * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
      * where no page was populated yet. This might require adaption when
      * supporting other mappings, like shmem.
      */


Without THP this works as expected. With THP this currently also works as
expected, but of course with the price [1] of not getting anon THP
immediately, which usually we don't care about. As you note, khugepaged might
fix this up later.

If we disable the huge zeropage, we would get anon THPs when reading instead of
small zeropages.

As a reply to [1], I suggested using preallocation (using
MADV_POPULATE_WRITE) when we really care about that performance
difference, which would also avoid the huge zeropage completely, but
it's also not quite optimal in some cases.


I don't really know what to do here: changing the handling for the huge zeropage
only unconditionally does not sound too wrong, but the change in behavior
might (or might not) be desired for some use cases.

Reading from unpopulated memory can be a clear sign that the shared
zeropage really is desired (as for QEMU), and concurrent memory
preallocation/population should ideally use MADV_POPULATE_WRITE. Maybe
there are some details buried in [2] regarding the common use cases for
the huge zeropage back then.

[1] https://lkml.kernel.org/r/740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com
[2] https://lwn.net/Articles/517465/
Ryan Roberts July 2, 2024, 12:36 p.m. UTC | #9
On 02/07/2024 12:22, David Hildenbrand wrote:
> On 02.07.24 12:26, Ryan Roberts wrote:
>> On 01/07/2024 20:43, Catalin Marinas wrote:
>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>>> This patch does feel a bit like working around a non-optimal user choice
>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>>> single call for a larger VA vs touching each page.
>>>>
>>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>>>> inefficiency in kernel. Two faults are not necessary if we know we are
>>>> definitely going to write the memory very soon, right?
>>>
>>> I agree the Arm architecture behaviour is not ideal here and any
>>> timelines for fixing it in hardware, if they do happen, are far into the
>>> future. Purely from a kernel perspective, what I want though is make
>>> sure that longer term (a) we don't create additional maintenance burden
>>> and (b) we don't keep dead code around.
>>>
>>> Point (a) could be mitigated if the architecture is changed so that any
>>> new atomic instructions added to this range would also come with
>>> additional syndrome information so that we don't have to update the
>>> decoding patterns.
>>>
>>> Point (b), however, depends on the OpenJDK and the kernel versions in
>>> distros. Nick Gasson kindly provided some information on the OpenJDK
>>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>>> OpenJDK a few months ago:
>>>
>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>>
>>> The OpenJDK commit lacks explanation but what I gathered from the diff
>>> is that this option is the preferred one in the presence of THP (which
>>> most/all distros enable by default). If we merge your proposed kernel
>>> patch, it will take time before it makes its way into distros. I'm
>>> hoping that by that time, distros would have picked a new OpenJDK
>>> version already that doesn't need the atomic_add(0) pattern. If that's
>>> the case, we end up with some dead code in the kernel that's almost
>>> never exercised.
>>>
>>> I don't follow OpenJDK development but I heard that updates are dragging
>>> quite a lot. I can't tell whether people have picked up the
>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>>
>>> There's a point (c) as well on the overhead of reading the faulting
>>> instruction. I hope that's negligible but I haven't measured it.
>>>
>>
>> Just to add to this, I note the existing kernel behaviour is that if a write
>> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
>> then the PMD is shattered, the PTE of the fault address is populated with a
>> writable page and the remaining PTEs are populated with order-0 zero pages
>> (read-only).
> 
> That also recently popped up in [1]. CCing Jinjiang. Ever since I
> replied there, I also thought some more about that handling in regard to the
> huge zeropage.
> 
>>
>> This seems like odd behaviour to me. Surely it would be less effort and more
>> aligned with the app's expectations to notice the huge zero page in the PMD,
>> remove it, and install a THP, as would have been done if pmd_none() was true? I
>> don't think there is a memory bloat argument here because, IIUC, with the
>> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
> 
> One detail: depending on the setting of khugepaged_max_ptes_none. zeropages
> are treated like pte_none. But in the common case, that setting is left alone.

Ahh, got it. So in the common case, khugepaged won't actually collapse
unless/until a bunch more write faults occur in the 2M region, and in that case
there is a risk that changing this behaviour could lead to a memory bloat
regression.

> 
>>
>> Changing to this new behaviour would only be a partial solution for your use
>> case, since you would still have 2 faults. But it would remove the cost of the
>> shattering and ensure you have a THP immediately after the write fault. But I
>> can't think of a reason why this wouldn't be a generally useful change
>> regardless? Any thoughts?
> 
> The "let's read before we write" as used by QEMU migration code is the desire
> to not waste memory by populating the zeropages. Deferring consuming memory
> until really required.
> 
>     /*
>      * We read one byte of each page; this will preallocate page tables if
>      * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
>      * where no page was populated yet. This might require adaption when
>      * supporting other mappings, like shmem.
>      */

So QEMU is concerned with preallocating page tables? I would have
thought you could make that a lot more efficient with an explicit
MADV_POPULATE_PGTABLE call? (i.e. 1 kernel call vs 1 call per 2M,
allocate all the pages in one trip through the allocator, fewer pud/pmd
lock/unlocks, etc).

TBH I always assumed in the past that the huge zero page is only useful
because it's a placeholder for a real THP that would be populated on
write. But that's obviously not the case at the moment. So other than a
hack to preallocate the pgtables with only 1 fault per 2M, what other
benefits does it have?

> 
> 
> Without THP this works as expected. With THP this currently also works as
> expected, but of course with the price [1] of not getting anon THP
> immediately, which usually we don't care about. As you note, khugepaged might
> fix this up later.
> 
> If we disable the huge zeropage, we would get anon THPs when reading instead of
> small zeropages.

I wasn't aware of that behaviour either. Although that sounds like another
reason why allocating a THP over the huge zero page on write fault should be the
"more consistent" behaviour.

> 
> As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE)
> when we really care about that performance difference, which would also
> avoid the huge zeropage completely, but it's also not quite optimal in some cases.

I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
which would just mark the VMA so that any read fault is upgraded to write.

> 
> 
> I don't really know what to do here: changing the handling for the huge zeropage
> only unconditionally does not sound too wrong, but the change in behavior
> might (or might not) be desired for some use cases.
> 
> Reading from unpopulated memory can be a clear sign that the shared zeropage
> really is desired (as for QEMU), and concurrent memory preallocation/population
> should ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in
> [2] regarding the common use cases for the huge zeropage back then.

The current huge zero page behavior on write fault sounds wonky to me. But I
agree there are better and more complete solutions to the identified use cases.
So unless something pops up where the change is a clear benefit, I guess better
to be safe and leave as is.

> 
> [1] https://lkml.kernel.org/r/740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com
> [2] https://lwn.net/Articles/517465/
>
David Hildenbrand July 2, 2024, 12:58 p.m. UTC | #10
On 02.07.24 14:36, Ryan Roberts wrote:
> On 02/07/2024 12:22, David Hildenbrand wrote:
>> On 02.07.24 12:26, Ryan Roberts wrote:
>>> On 01/07/2024 20:43, Catalin Marinas wrote:
>>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>>>> This patch does feel a bit like working around a non-optimal user choice
>>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>>>> single call for a larger VA vs touching each page.
>>>>>
>>>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>>>>> inefficiency in kernel. Two faults are not necessary if we know we are
>>>>> definitely going to write the memory very soon, right?
>>>>
>>>> I agree the Arm architecture behaviour is not ideal here and any
>>>> timelines for fixing it in hardware, if they do happen, are far into the
>>>> future. Purely from a kernel perspective, what I want though is make
>>>> sure that longer term (a) we don't create additional maintenance burden
>>>> and (b) we don't keep dead code around.
>>>>
>>>> Point (a) could be mitigated if the architecture is changed so that any
>>>> new atomic instructions added to this range would also come with
>>>> additional syndrome information so that we don't have to update the
>>>> decoding patterns.
>>>>
>>>> Point (b), however, depends on the OpenJDK and the kernel versions in
>>>> distros. Nick Gasson kindly provided some information on the OpenJDK
>>>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>>>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>>>> OpenJDK a few months ago:
>>>>
>>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>>>
>>>> The OpenJDK commit lacks explanation but what I gathered from the diff
>>>> is that this option is the preferred one in the presence of THP (which
>>>> most/all distros enable by default). If we merge your proposed kernel
>>>> patch, it will take time before it makes its way into distros. I'm
>>>> hoping that by that time, distros would have picked a new OpenJDK
>>>> version already that doesn't need the atomic_add(0) pattern. If that's
>>>> the case, we end up with some dead code in the kernel that's almost
>>>> never exercised.
>>>>
>>>> I don't follow OpenJDK development but I heard that updates are dragging
>>>> quite a lot. I can't tell whether people have picked up the
>>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>>>
>>>> There's a point (c) as well on the overhead of reading the faulting
>>>> instruction. I hope that's negligible but I haven't measured it.
>>>>
>>>
>>> Just to add to this, I note the existing kernel behaviour is that if a write
>>> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
>>> then the PMD is shattered, the PTE of the fault address is populated with a
>>> writable page and the remaining PTEs are populated with order-0 zero pages
>>> (read-only).
>>
>> That also recently popped up in [1]. CCing Jinjiang. Ever since I
>> replied there, I also thought some more about that handling in regard to the
>> huge zeropage.
>>
>>>
>>> This seems like odd behaviour to me. Surely it would be less effort and more
>>> aligned with the app's expectations to notice the huge zero page in the PMD,
>>> remove it, and install a THP, as would have been done if pmd_none() was true? I
>>> don't think there is a memory bloat argument here because, IIUC, with the
>>> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
>>
>> One detail: depending on the setting of khugepaged_max_ptes_none, zeropages
>> are treated like pte_none. But in the common case, that setting is left alone.
> 
> Ahh, got it. So in the common case, khugepaged won't actually collapse
> unless/until a bunch more write faults occur in the 2M region, and in that case
> there is a risk that changing this behaviour could lead to a memory bloat
> regression.
> 
>>
>>>
>>> Changing to this new behaviour would only be a partial solution for your use
>>> case, since you would still have 2 faults. But it would remove the cost of the
>>> shattering and ensure you have a THP immediately after the write fault. But I
>>> can't think of a reason why this wouldn't be a generally useful change
>>> regardless? Any thoughts?
>>
>> The "let's read before we write" as used by QEMU migration code is the desire
>> to not waste memory by populating the zeropages. Deferring consuming memory
>> until really required.
>>
>>      /*
>>       * We read one byte of each page; this will preallocate page tables if
>>       * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
>>       * where no page was populated yet. This might require adaption when
>>       * supporting other mappings, like shmem.
>>       */
> 
> So QEMU is concerned with preallocating page tables? I would have thought you
> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
> through the allocator, fewer pud/pmd lock/unlocks, etc).

I think we are only concerned about the "shared zeropage" part. 
Everything else is just unnecessary detail that adds confusion here :) 
One requires the other.

Note that this is from migration code where we're supposed to write a 
single page we received from the migration source right now (not more). 
And we want to avoid allocating memory if it can be avoided (usually for
overcommit).



> 
> TBH I always assumed in the past that the huge zero page is only useful because
> it's a placeholder for a real THP that would be populated on write. But that's
> obviously not the case at the moment. So other than a hack to preallocate the
> pgtables with only 1 fault per 2M, what other benefits does it have?

I don't quite understand that question. [2] has some details on why the
huge zeropage was added -- because we would have never otherwise 
received huge zeropages with THP enabled but always anon THP directly on 
read.

> 
>>
>>
>> Without THP this works as expected. With THP this currently also works as
>> expected, but of course with the price [1] of not getting anon THP
>> immediately, which usually we don't care about. As you note, khugepaged might
>> fix this up later.
>>
>> If we disable the huge zeropage, we would get anon THPs when reading instead of
>> small zeropages.
> 
> I wasn't aware of that behaviour either. Although that sounds like another
> reason why allocating a THP over the huge zero page on write fault should be the
> "more consistent" behaviour.

Reading [2] I think the huge zeropage was added to avoid the allocation 
of THP on read. Maybe really only for large readable regions, not sure
why exactly.

> 
>>
>> As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE)
>> when we really care about that performance difference, which would also
>> avoid the huge zeropage completely, but it's also not quite optimal in some cases.
> 
> I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
> which would just mark the VMA so that any read fault is upgraded to write.
> 
>>
>>
>> I don't really know what to do here: changing the handling for the huge zeropage
>> only unconditionally does not sound too wrong, but the change in behavior
>> might (or might not) be desired for some use cases.
>>
>> Reading from unpopulated memory can be a clear sign that really the shared zeropage
>> is desired (as for QEMU), and concurrent memory preallocation/population should
>> ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2]
>> regarding the common use cases for the huge zeropage back then.
> 
> The current huge zero page behavior on write fault sounds wonky to me. But I
> agree there are better and more complete solutions to the identified use cases.
> So unless something pops up where the change is a clear benefit, I guess it's
> better to be safe and leave it as is.

We've had that behavior for quite a while ... so it's rather
surprising to see multiple people reporting this right now.

I guess most use cases don't read from uninitialized memory, then barely
write to it, and care about getting THPs immediately.

For preallocation, MADV_POPULATE_WRITE is better. For QEMU migration?
not sure what's really better. Maybe replacing the huge zeropage by a 
THP would be faster in some cases, but result in more memory consumption 
(and more page zeroing?) during migration in other cases.
Ryan Roberts July 2, 2024, 1:26 p.m. UTC | #11
On 02/07/2024 13:58, David Hildenbrand wrote:
> On 02.07.24 14:36, Ryan Roberts wrote:
>> On 02/07/2024 12:22, David Hildenbrand wrote:
>>> On 02.07.24 12:26, Ryan Roberts wrote:
>>>> On 01/07/2024 20:43, Catalin Marinas wrote:
>>>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>>>>> This patch does feel a bit like working around a non-optimal user choice
>>>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>>>>> single call for a larger VA vs touching each page.
>>>>>>
>>>>>> IMHO, I don't think so. I view this patch as solving or working around some
>>>>>> ISA inefficiency in the kernel. Two faults are not necessary if we know we are
>>>>>> definitely going to write the memory very soon, right?
>>>>>
>>>>> I agree the Arm architecture behaviour is not ideal here and any
>>>>> timelines for fixing it in hardware, if they do happen, are far into the
>>>>> future. Purely from a kernel perspective, what I want though is to make
>>>>> sure that longer term (a) we don't create additional maintenance burden
>>>>> and (b) we don't keep dead code around.
>>>>>
>>>>> Point (a) could be mitigated if the architecture is changed so that any
>>>>> new atomic instructions added to this range would also come with
>>>>> additional syndrome information so that we don't have to update the
>>>>> decoding patterns.
>>>>>
>>>>> Point (b), however, depends on the OpenJDK and the kernel versions in
>>>>> distros. Nick Gasson kindly provided some information on the OpenJDK
>>>>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>>>>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>>>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>>>>> OpenJDK a few months ago:
>>>>>
>>>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>>>>
>>>>> The OpenJDK commit lacks explanation but what I gathered from the diff
>>>>> is that this option is the preferred one in the presence of THP (which
>>>>> most/all distros enable by default). If we merge your proposed kernel
>>>>> patch, it will take time before it makes its way into distros. I'm
>>>>> hoping that by that time, distros would have picked a new OpenJDK
>>>>> version already that doesn't need the atomic_add(0) pattern. If that's
>>>>> the case, we end up with some dead code in the kernel that's almost
>>>>> never exercised.
>>>>>
>>>>> I don't follow OpenJDK development but I heard that updates are dragging
>>>>> quite a lot. I can't tell whether people have picked up the
>>>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>>>>
>>>>> There's a point (c) as well on the overhead of reading the faulting
>>>>> instruction. I hope that's negligible but I haven't measured it.
>>>>>
>>>>
>>>> Just to add to this, I note the existing kernel behaviour is that if a write
>>>> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
>>>> then the PMD is shattered, the PTE of the fault address is populated with a
>>>> writable page and the remaining PTEs are populated with order-0 zero pages
>>>> (read-only).
>>>
>>> That also recently popped up in [1]. CCing Jinjiang. Ever since I
>>> replied there, I also thought some more about that handling in regard to the
>>> huge zeropage.
>>>
>>>>
>>>> This seems like odd behaviour to me. Surely it would be less effort and more
>>>> aligned with the app's expectations to notice the huge zero page in the PMD,
>>>> remove it, and install a THP, as would have been done if pmd_none() was true? I
>>>> don't think there is a memory bloat argument here because, IIUC, with the
>>>> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
>>>
>>> One detail: depending on the setting of khugepaged_max_ptes_none, zeropages
>>> are treated like pte_none. But in the common case, that setting is left alone.
>>
>> Ahh, got it. So in the common case, khugepaged won't actually collapse
>> unless/until a bunch more write faults occur in the 2M region, and in that case
>> there is a risk that changing this behaviour could lead to a memory bloat
>> regression.
>>
>>>
>>>>
>>>> Changing to this new behaviour would only be a partial solution for your use
>>>> case, since you would still have 2 faults. But it would remove the cost of the
>>>> shattering and ensure you have a THP immediately after the write fault. But I
>>>> can't think of a reason why this wouldn't be a generally useful change
>>>> regardless? Any thoughts?
>>>
>>> The "let's read before we write" as used by QEMU migration code is the desire
>>> to not waste memory by populating the zeropages. Deferring consuming memory
>>> until really required.
>>>
>>>      /*
>>>       * We read one byte of each page; this will preallocate page tables if
>>>       * required and populate the shared zeropage on MAP_PRIVATE anonymous
>>> memory
>>>       * where no page was populated yet. This might require adaption when
>>>       * supporting other mappings, like shmem.
>>>       */
>>
>> So QEMU is concerned with preallocating page tables? I would have thought you
>> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
>> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
>> through the allocator, fewer pud/pmd lock/unlocks, etc).
> 
> I think we are only concerned about the "shared zeropage" part. Everything else
> is just unnecessary detail that adds confusion here :) One requires the other.

Sorry I don't quite follow your comment. As I understand it, the zeropage
concept is intended as a memory-saving mechanism for applications that read
memory but never write it. I don't think that really applies in your migration
case, because you are definitely going to write all the memory eventually, I
think? So I guess you are not interested in the "memory-saving" property, but in
the side-effect, which is the pre-allocation of pagetables? (if you just wanted
the memory-saving property, why not just skip the "read one byte of each page"
op?) It's not important though, so let's not go down the rabbit hole.

> 
> Note that this is from migration code where we're supposed to write a single
> page we received from the migration source right now (not more). And we want to
> avoid allocating memory if it can be avoided (usually for overcommit).
> 
> 
> 
>>
>> TBH I always assumed in the past that the huge zero page is only useful because
>> it's a placeholder for a real THP that would be populated on write. But that's
>> obviously not the case at the moment. So other than a hack to preallocate the
>> pgtables with only 1 fault per 2M, what other benefits does it have?
> 
> I don't quite understand that question. [2] has some details on why the huge
> zeropage was added -- because we would have never otherwise received huge
> zeropages with THP enabled but always anon THP directly on read.
> 
>>
>>>
>>>
>>> Without THP this works as expected. With THP this currently also works as
>>> expected, but of course with the price [1] of not getting anon THP
>>> immediately, which usually we don't care about. As you note, khugepaged might
>>> fix this up later.
>>>
>>> If we disable the huge zeropage, we would get anon THPs when reading instead of
>>> small zeropages.
>>
>> I wasn't aware of that behaviour either. Although that sounds like another
>> reason why allocating a THP over the huge zero page on write fault should be the
>> "more consistent" behaviour.
> 
> Reading [2] I think the huge zeropage was added to avoid the allocation of THP
> on read. Maybe really only for large readable regions, not sure why exactly.

I might raise this on the THP call tomorrow, if Kyril joins and get his view.

> 
>>
>>>
>>> As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE)
>>> when we really care about that performance difference, which would also
>>> avoid the huge zeropage completely, but it's also not quite optimal in some
>>> cases.
>>
>> I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
>> which would just mark the VMA so that any read fault is upgraded to write.
>>
>>>
>>>
>>> I don't really know what to do here: changing the handling for the huge zeropage
>>> only unconditionally does not sound too wrong, but the change in behavior
>>> might (or might not) be desired for some use cases.
>>>
>>> Reading from unpopulated memory can be a clear sign that really the shared
>>> zeropage
>>> is desired (as for QEMU), and concurrent memory preallocation/population should
>>> ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2]
>>> regarding the common use cases for the huge zeropage back then.
>>
>> The current huge zero page behavior on write fault sounds wonky to me. But I
>> agree there are better and more complete solutions to the identified use cases.
>> So unless something pops up where the change is a clear benefit, I guess it's
>> better to be safe and leave it as is.
> 
> We've had that behavior for quite a while ... so it's rather surprising to see
> multiple people reporting this right now.
> 
> I guess most use cases don't read from uninitialized memory, then barely write
> to it, and care about getting THPs immediately.
> 
> For preallocation, MADV_POPULATE_WRITE is better. For QEMU migration? not
> sure what's really better. Maybe replacing the huge zeropage by a THP would be
> faster in some cases, but result in more memory consumption (and more page
> zeroing?) during migration in other cases.
>
David Hildenbrand July 2, 2024, 1:50 p.m. UTC | #12
>>>>> Changing to this new behaviour would only be a partial solution for your use
>>>>> case, since you would still have 2 faults. But it would remove the cost of the
>>>>> shattering and ensure you have a THP immediately after the write fault. But I
>>>>> can't think of a reason why this wouldn't be a generally useful change
>>>>> regardless? Any thoughts?
>>>>
>>>> The "let's read before we write" as used by QEMU migration code is the desire
>>>> to not waste memory by populating the zeropages. Deferring consuming memory
>>>> until really required.
>>>>
>>>>       /*
>>>>        * We read one byte of each page; this will preallocate page tables if
>>>>        * required and populate the shared zeropage on MAP_PRIVATE anonymous
>>>> memory
>>>>        * where no page was populated yet. This might require adaption when
>>>>        * supporting other mappings, like shmem.
>>>>        */
>>>
>>> So QEMU is concerned with preallocating page tables? I would have thought you
>>> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
>>> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
>>> through the allocator, fewer pud/pmd lock/unlocks, etc).
>>
>> I think we are only concerned about the "shared zeropage" part. Everything else
>> is just unnecessary detail that adds confusion here :) One requires the other.
> 

BTW, I was quoting the wrong QEMU code. The relevant QEMU commit that first added
that handling is:

commit 211ea74022f51164a7729030b28eec90b6c99a08
Author: Peter Lieven <pl@kamp.de>
Date:   Mon Jun 10 12:14:20 2013 +0200

     migration: do not overwrite zero pages
     
     on incoming migration do not memset pages to zero if they already read as zero.
     this will allocate a new zero page and consume memory unnecessarily. even
     if we madvise a MADV_DONTNEED later this will only deallocate the memory
     asynchronously.

(note that the MADV_DONTNEED handling in that form does not really apply
anymore AFAIK)
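The "do not overwrite zero pages" logic from that commit boils down to roughly the following (a simplified sketch, not the actual QEMU code; `page_is_zero()` is a stand-in for QEMU's buffer_is_zero()):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for QEMU's buffer_is_zero(). */
static bool page_is_zero(const char *buf, size_t len)
{
	return buf[0] == 0 && !memcmp(buf, buf + 1, len - 1);
}

/*
 * Sketch of the incoming-migration fast path: if the received page is
 * zero and the destination already reads as zero (e.g. it is backed by
 * the shared zeropage), skip the memset/memcpy entirely so no anonymous
 * page is allocated and overcommitted memory is not consumed.
 */
static void place_page(char *dst, const char *src, size_t len)
{
	if (page_is_zero(src, len)) {
		if (page_is_zero(dst, len))
			return;	/* reading dst populated at most a zeropage */
		memset(dst, 0, len);
	} else {
		memcpy(dst, src, len);
	}
}
```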


> Sorry I don't quite follow your comment. As I understand it, the zeropage
> concept is intended as a memory-saving mechanism for applications that read
> memory but never write it.

"never written" -- then we wouldn't support COW faults on it, right? :P

> I don't think that really applies in your migration
> case, because you are definitely going to write all the memory eventually, I
> think?

Especially with memory ballooning and things like free-page-reporting we might
be migrating a whole bunch of zero-memory and only have to make sure that the
destination is actually zero. We don't want to consume memory.

I recently fixed that handling for s390x where customers were running into
precisely that issue (and zeropages in s390x VMs were disabled for historical
reasons).

> So I guess you are not interested in the "memory-saving" property, but in
> the side-effect, which is the pre-allocation of pagetables? (if you just wanted
> the memory-saving property, why not just skip the "read one byte of each page"
> op?) It's not important though, so let's not go down the rabbit hole.

There are cases where, during an incoming migration, some memory pages might
already have been populated and might not be zero. And IIRC there are some cases
(postcopy -> uffd) where I think it is important that we actually do have a page
in place. But I forgot some of the details around userfaultfd handling in QEMU.

> 
>>
>> Note that this is from migration code where we're supposed to write a single
>> page we received from the migration source right now (not more). And we want to
>> avoid allocating memory if it can be avoided (usually for overcommit).
>>
>>
>>
>>>
>>> TBH I always assumed in the past that the huge zero page is only useful because
>>> it's a placeholder for a real THP that would be populated on write. But that's
>>> obviously not the case at the moment. So other than a hack to preallocate the
>>> pgtables with only 1 fault per 2M, what other benefits does it have?
>>
>> I don't quite understand that question. [2] has some details on why the huge
>> zeropage was added -- because we would have never otherwise received huge
>> zeropages with THP enabled but always anon THP directly on read.
>>
>>>
>>>>
>>>>
>>>> Without THP this works as expected. With THP this currently also works as
>>>> expected, but of course with the price [1] of not getting anon THP
>>>> immediately, which usually we don't care about. As you note, khugepaged might
>>>> fix this up later.
>>>>
>>>> If we disable the huge zeropage, we would get anon THPs when reading instead of
>>>> small zeropages.
>>>
>>> I wasn't aware of that behaviour either. Although that sounds like another
>>> reason why allocating a THP over the huge zero page on write fault should be the
>>> "more consistent" behaviour.
>>
>> Reading [2] I think the huge zeropage was added to avoid the allocation of THP
>> on read. Maybe really only for large readable regions, not sure why exactly.
> 
> I might raise this on the THP call tomorrow, if Kyril joins and get his view.

Makes sense.
Ryan Roberts July 2, 2024, 2:51 p.m. UTC | #13
>> I don't think that really applies in your migration
>> case, because you are definitely going to write all the memory eventually, I
>> think?
> 
> Especially with memory ballooning and things like free-page-reporting we might
> be migrating a whole bunch of zero-memory and only have to make sure that the
> destination is actually zero. We don't want to consume memory.

Ahh, you know if the contents of a page being migrated are all zero. In that case,
I get it. Thanks!
Yang Shi July 2, 2024, 10:21 p.m. UTC | #14
On 7/1/24 12:43 PM, Catalin Marinas wrote:
> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>> This patch does feel a bit like working around a non-optimal user choice
>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>> single call for a larger VA vs touching each page.
>> IMHO, I don't think so. I view this patch as solving or working around some
>> ISA inefficiency in the kernel. Two faults are not necessary if we know we are
>> definitely going to write the memory very soon, right?
> I agree the Arm architecture behaviour is not ideal here and any
> timelines for fixing it in hardware, if they do happen, are far into the
> future. Purely from a kernel perspective, what I want though is to make
> sure that longer term (a) we don't create additional maintenance burden
> and (b) we don't keep dead code around.
>
> Point (a) could be mitigated if the architecture is changed so that any
> new atomic instructions added to this range would also come with
> additional syndrome information so that we don't have to update the
> decoding patterns.
>
> Point (b), however, depends on the OpenJDK and the kernel versions in
> distros. Nick Gasson kindly provided some information on the OpenJDK
> changes. The atomic_add(0) change happened in early 2022, about 5-6
> months after MADV_POPULATE_WRITE support was added to the kernel. What's
> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
> OpenJDK a few months ago:
>
> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>
> The OpenJDK commit lacks explanation but what I gathered from the diff
> is that this option is the preferred one in the presence of THP (which
> most/all distros enable by default). If we merge your proposed kernel
> patch, it will take time before it makes its way into distros. I'm
> hoping that by that time, distros would have picked a new OpenJDK
> version already that doesn't need the atomic_add(0) pattern. If that's
> the case, we end up with some dead code in the kernel that's almost
> never exercised.
>
> I don't follow OpenJDK development but I heard that updates are dragging
> quite a lot. I can't tell whether people have picked up the
> atomic_add(0) feature and whether, by the time a kernel patch would make
> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.

As Christoph said there may be similar uses of atomics in other 
applications, so I don't worry too much about the dead code problem IMHO. 
OpenJDK is just the use case that we know. There may be unknown unknowns. 
And the distros typically backport patches from the mainline kernel to 
their kernels, so there should be combos like old kernel + backported 
patch + old OpenJDK.

AFAICT, the users do expect behavior similar to x86's (one fault instead 
of two faults). Actually we noticed this problem due to a customer report.

>
> There's a point (c) as well on the overhead of reading the faulting
> instruction. I hope that's negligible but I haven't measured it.

I think I showed benchmark data requested by Anshuman in the earlier 
email discussion.

>
Catalin Marinas July 4, 2024, 10:03 a.m. UTC | #15
On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
> On 7/1/24 12:43 PM, Catalin Marinas wrote:
> > I don't follow OpenJDK development but I heard that updates are dragging
> > quite a lot. I can't tell whether people have picked up the
> > atomic_add(0) feature and whether, by the time a kernel patch would make
> > it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
> 
> As Christoph said there may be similar uses of atomics in other
> applications, so I don't worry too much about the dead code problem IMHO.
> OpenJDK is just the use case that we know. There may be unknown unknowns. And
> the distros typically backport patches from the mainline kernel to their
> kernels, so there should be combos like old kernel + backported patch + old OpenJDK.

That's a somewhat valid argument I heard internally as well. People tend
to change or patch kernel versions more often than OpenJDK versions
because of the risk of breaking their Java stack. But, arguably, one can
backport the madvise() OpenJDK change since it seems to have other
benefits on x86 as well.

> AFAICT, the users do expect behavior similar to x86's (one fault instead of
> two faults). Actually we noticed this problem due to a customer report.

It's not a correctness problem, only a performance one. A big part of that
could be mitigated by some adjustment to how THP pages are allocated on
a write fault (though we'd still have an unnecessary read fault and some
TLBI). See Ryan's sub-thread.

> > There's a point (c) as well on the overhead of reading the faulting
> > instruction. I hope that's negligible but I haven't measured it.
> 
> I think I showed benchmark data requested by Anshuman in the earlier email
> discussion.

Do you mean this:

https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@os.amperecomputing.com

I haven't figured out what the +24% case is in there, it seems pretty
large.

What you haven't benchmarked (I think) is the case where the instruction
is in an exec-only mapping. The subsequent instruction read will fault
and it adds to the overhead. Currently exec-only mappings are not
widespread but I heard some people planning to move in this direction as
a default build configuration.

It could be worked around with a new flavour of get_user() that uses the
non-T LDR instruction and the user mapping is readable by the kernel
(that's the case with EPAN, prior to PIE and I think we can change this
for PIE configurations as well). But it adds to the complexity of this
patch when the kernel already offers a MADV_POPULATE_WRITE solution.
Christoph Lameter (Ampere) July 5, 2024, 5:05 p.m. UTC | #16
On Thu, 4 Jul 2024, Catalin Marinas wrote:

> It could be worked around with a new flavour of get_user() that uses the
> non-T LDR instruction and the user mapping is readable by the kernel
> (that's the case with EPAN, prior to PIE and I think we can change this
> for PIE configurations as well). But it adds to the complexity of this
> patch when the kernel already offers a MADV_POPULATE_WRITE solution.

The use of MADV_POPULATE_WRITE here is arch specific and not a general 
solution. It requires specialized knowledge and research before someone 
can figure out that this particular trick is required on Linux ARM64 
processors. The builders need to detect this special situation in the 
build process and activate this workaround.

It would be much simpler to just merge the patch and be done with it. 
Otherwise this issue will continue to cause uncountably many hours of 
anguish for sysadmins and developers all over the Linux ecosystem trying 
to figure out what in the world is going on with ARM.
Catalin Marinas July 5, 2024, 6:24 p.m. UTC | #17
On Fri, Jul 05, 2024 at 10:05:29AM -0700, Christoph Lameter (Ampere) wrote:
> On Thu, 4 Jul 2024, Catalin Marinas wrote:
> > It could be worked around with a new flavour of get_user() that uses the
> > non-T LDR instruction and the user mapping is readable by the kernel
> > (that's the case with EPAN, prior to PIE and I think we can change this
> > for PIE configurations as well). But it adds to the complexity of this
> > patch when the kernel already offers a MADV_POPULATE_WRITE solution.
> 
> The use of MADV_POPULATE_WRITE here is arch specific and not a general
> solution. It requires specialized knowledge and research before someone can
> figure out that this particular trick is required on Linux ARM64 processors.
> The builders need to detect this special situation in the build process and
> activate this workaround.

Not really, see this OpenJDK commit:

https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a

There's nothing about arm64 in there and it looks like the code prefers
MADV_POPULATE_WRITE if THPs are enabled (which is the case in all
enterprise distros). I can't tell whether the change was made to work
around the arm64 behaviour, there's no commit log (it was contributed by
Ampere).

There's a separate thread with the mm folk on the THP behaviour for
pmd_none() vs pmd mapping the zero huge page but it is more portable for
OpenJDK to use madvise() than guess the kernel behaviour and touch small
pages or a single large page. Even if one claims that atomic_add(0) is
portable across operating systems, the OpenJDK code was already treating
Linux as a special case in the presence of THP.

> It would be much simpler to just merge the patch and be done with it.
> Otherwise this issue will continue to cause uncountably many hours of
> anguish for sysadmins and developers all over the Linux ecosystem trying to
> figure out what in the world is going on with ARM.

People will be happy until one enables execute-only ELF text sections in
a distro and all that opcode parsing will add considerable overhead for
many read faults (those with a writeable vma).

I'd also like to understand (probably have to re-read the older threads)
whether the overhead is caused mostly by the double fault or the actual
breaking of a THP. For the latter, the mm folk are willing to change the
behaviour so that pmd_none() and pmd mapping the zero huge page are treated
similarly (i.e. allocate a huge page on write fault). If that's good
enough, I'd rather not merge this patch (or some form of it) and wait
for a proper fix in hardware in the future.

Just to be clear, there are still potential issues to address (or
understand the impact of) in this patch with exec-only mappings and
the performance gain _after_ the THP behaviour changed in the mm code.
We can make a call once we have more data but, TBH, my inclination is
towards 'no' given that OpenJDK already supports madvise() and it's not
arm64 specific.
Christoph Lameter (Ampere) July 5, 2024, 6:51 p.m. UTC | #18
On Fri, 5 Jul 2024, Catalin Marinas wrote:

> There's nothing about arm64 in there and it looks like the code prefers
> MADV_POPULATE_WRITE if THPs are enabled (which is the case in all
> enterprise distros). I can't tell whether the change was made to work
> around the arm64 behaviour, there's no commit log (it was contributed by
> Ampere).

It took us a long time and numerous developers and QA teams to get to 
this insight. You don't want to replicate this for other applications.

> There's a separate thread with the mm folk on the THP behaviour for
> pmd_none() vs pmd mapping the zero huge page but it is more portable for
> OpenJDK to use madvise() than guess the kernel behaviour and touch small
> pages or a single large pages. Even if one claims that atomic_add(0) is
> portable across operating systems, the OpenJDK code was already treating
> Linux as a special case in the presence of THP.

Other apps do not have such a vibrant developer community and no Ampere 
employees contributing. They will never know and will just say ARM has 
bad performance.


>> It would be much simpler to just merge the patch and be done with it.
>> Otherwise this issue will continue to cause uncountably many hours of
>> anguish for sysadmins and developers all over the Linux ecosystem trying to
>> figure out what in the world is going on with ARM.
>
> People will be happy until one enables execute-only ELF text sections in
> a distro and all that opcode parsing will add considerable overhead for
> many read faults (those with a writeable vma).

The opcode is in the L1 cache since we just faulted on it. There is no 
"considerable" overhead.

> I'd also like to understand (probably have to re-read the older threads)
> whether the overhead is caused mostly by the double fault or the actual
> breaking of a THP. For the latter, the mm folk are willing to change the
> behaviour so that pmd_none() and pmd mapping the zero huge page are treated
> similarly (i.e. allocate a huge page on write fault). If that's good
> enough, I'd rather not merge this patch (or some form of it) and wait
> for a proper fix in hardware in the future.

THP is a secondary effect here. Note that similar approaches have been 
implemented for other architectures. This is not a new approach; it is 
widely used on other platforms.

If those on other Linux platforms encounter this strange discussion here 
then they would come to the same conclusion that I have.

> Just to be clear, there are still potential issues to address (or
> understand the impact of) in this patch with exec-only mappings and
> the performance gain _after_ the THP behaviour changed in the mm code.
> We can make a call once we have more data but, TBH, my inclination is
> towards 'no' given that OpenJDK already supports madvise() and it's not
> arm64 specific.

It is arm64 specific. Other Linux architectures either have 
optimizations for similar issues in their arch code, as mentioned in the 
patch, or their processors will not double fault.

Is there a particular reason for ARM as processor manufacturer to oppose 
this patch? We have mostly hand waving and speculations coming from you 
here.

What the patch does is clearly beneficial and it is an established 
way of implementing read->write fault handling.
Catalin Marinas July 6, 2024, 9:47 a.m. UTC | #19
On Fri, Jul 05, 2024 at 11:51:33AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 5 Jul 2024, Catalin Marinas wrote:
> > People will be happy until one enables execute-only ELF text sections in
> > a distro and all that opcode parsing will add considerable overhead for
> > many read faults (those with a writeable vma).
> 
> The opcode is in the l1 cache since we just faulted on it. There is no
> "considerable" overhead.

This has nothing to do with caches (and many Arm implementations still
have separate I and D caches). With the right linker flags (e.g.
--execute-only for lld), one can generate a PROT_EXEC only (no
PROT_READ) ELF text section. On newer Arm CPUs with FEAT_EPAN, the
kernel no longer forces PROT_READ on PROT_EXEC only mappings. The
get_user() in this patch to read the opcode will fault. So instead of
two faults you get now for an atomic instruction, you'd get three (the
additional one for opcode reading). What's worse, it affects standard
loads as well in the same way.

Yang Shi did test this scenario but for correctness only, not
performance. It would be good to recompile the benchmark with
--execute-only (or --rosegment, I think, for GNU ld) and post the
results.

> > Just to be clear, there are still potential issues to address (or
> > understand the impact of) in this patch with exec-only mappings and
> > the performance gain _after_ the THP behaviour changed in the mm code.
> > We can make a call once we have more data but, TBH, my inclination is
> > towards 'no' given that OpenJDK already supports madvise() and it's not
> > arm64 specific.
> 
> It is arm64 specific. Other Linux architectures have optimizations for
> similar issues in their arch code as mentioned in the patch or the
> processors will not double fault.
> 
> Is there a particular reason for ARM as processor manufacturer to oppose
> this patch? We have mostly hand waving and speculations coming from you
> here.

Arm Ltd has no involvement at all in this decision (and probably if you
ask the architects, they wouldn't see any problem). Even though I have
an arm.com email address, my opinions on the list are purely from a
kernel maintainer perspective.

There's no speculation but some real concerns here. Please see above.

> What the patch does is clearly beneficial and it is an established way of
> implementing read->write fault handling.

It is clearly beneficial for this specific case but, as I said, we still
need to address the execute-only mappings causing an additional fault on
the opcode reading. You may not find many such binaries now in the field
but there's a strong push from security people to enable it (it's a
user-space decision; the kernel simply provides PROT_EXEC-only
mappings).

In addition, there's a 24% performance overhead in one of Yang Shi's
results. This has not been explained.
Yang Shi July 9, 2024, 5:56 p.m. UTC | #20
On 7/4/24 3:03 AM, Catalin Marinas wrote:
> On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
>> On 7/1/24 12:43 PM, Catalin Marinas wrote:
>>> I don't follow OpenJDK development but I heard that updates are dragging
>>> quite a lot. I can't tell whether people have picked up the
>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>> As Christoph said, there may be similar uses of atomics in other
>> applications, so I don't worry too much about the dead code problem
>> IMHO. OpenJDK is just the use case that we know of. There may be
>> unknown unknowns. And distros typically backport patches from the
>> mainline kernel, so there should be combos like old kernel +
>> backported patch + old OpenJDK.
> That's a somewhat valid argument I heard internally as well. People tend
> to change or patch kernel versions more often than OpenJDK versions
> because of the risk of breaking their Java stack. But, arguably, one can
> backport the madvise() OpenJDK change since it seems to have other
> benefits on x86 as well.
>
>> AFAICT, the users do expect similar behavior as x86 (one fault instead of
>> two faults). Actually we noticed this problem due to a customer report.
> It's not a correctness problem, only a performance one. Big part of that
> could be mitigated by some adjustment to how THP pages are allocated on
> a write fault (though we'd still have an unnecessary read fault and some
> TLBI). See Ryan's sub-thread.

Yes, the TLBI is still sub-optimal and it is quite noticeable.

>
>>> There's a point (c) as well on the overhead of reading the faulting
>>> instruction. I hope that's negligible but I haven't measured it.
>> I think I showed benchmark data requested by Anshuman in the earlier email
>> discussion.
> Do you mean this:
>
> https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@os.amperecomputing.com
>
> I haven't figured out what the +24% case is in there, it seems pretty
> large.

I think I ran the test with many more iterations and didn't see such an 
outlier anymore.

>> To rule out the worst case, I also ran the test 100 iterations with
>> 160 threads then compared the worst case:
>>
>>      N           Min           Max        Median           Avg     Stddev
>>    100         34770         84979         65536       63537.7  10358.873
>>    100         38077         87652         65536      63119.02  8792.7399

>
> What you haven't benchmarked (I think) is the case where the instruction
> is in an exec-only mapping. The subsequent instruction read will fault
> and it adds to the overhead. Currently exec-only mappings are not
> widespread but I heard some people planning to move in this direction as
> a default build configuration.

I tested exec-only on QEMU TCG, but I don't have hardware that supports 
EPAN. I don't think a performance benchmark on QEMU TCG makes sense 
since it is quite slow; such a small overhead is unlikely to be 
measurable on it.

>
> It could be worked around with a new flavour of get_user() that uses the
> non-T LDR instruction and the user mapping is readable by the kernel
> (that's the case with EPAN, prior to PIE and I think we can change this
> for PIE configurations as well). But it adds to the complexity of this
> patch when the kernel already offers a MADV_POPULATE_WRITE solution.
>
Catalin Marinas July 9, 2024, 6:35 p.m. UTC | #21
On Tue, Jul 09, 2024 at 10:56:55AM -0700, Yang Shi wrote:
> On 7/4/24 3:03 AM, Catalin Marinas wrote:
> > I haven't figured out what the +24% case is in there, it seems pretty
> > large.
> 
> I think I ran the test much more iterations and I didn't see such outlier
> anymore.

That's good, thanks for confirming.

> > What you haven't benchmarked (I think) is the case where the instruction
> > is in an exec-only mapping. The subsequent instruction read will fault
> > and it adds to the overhead. Currently exec-only mappings are not
> > widespread but I heard some people planning to move in this direction as
> > a default build configuration.
> 
> I tested exec-only on QEMU tcg, but I don't have a hardware supported EPAN.
> I don't think performance benchmark on QEMU tcg makes sense since it is
> quite slow, such small overhead is unlikely measurable on it.

Yeah, benchmarking under QEMU is pointless. I think you can remove some
of the ARM64_HAS_EPAN checks (or replace them with ARM64_HAS_PAN) just
for testing. For security reasons, we removed this behaviour in commit
24cecc377463 ("arm64: Revert support for execute-only user mappings")
but it's good enough for testing. This should give you PROT_EXEC-only
mappings on your hardware.
Yang Shi July 9, 2024, 10:29 p.m. UTC | #22
On 7/9/24 11:35 AM, Catalin Marinas wrote:
> On Tue, Jul 09, 2024 at 10:56:55AM -0700, Yang Shi wrote:
>> On 7/4/24 3:03 AM, Catalin Marinas wrote:
>>> I haven't figured out what the +24% case is in there, it seems pretty
>>> large.
>> I think I ran the test much more iterations and I didn't see such outlier
>> anymore.
> That's good, thanks for confirming.
>
>>> What you haven't benchmarked (I think) is the case where the instruction
>>> is in an exec-only mapping. The subsequent instruction read will fault
>>> and it adds to the overhead. Currently exec-only mappings are not
>>> widespread but I heard some people planning to move in this direction as
>>> a default build configuration.
>> I tested exec-only on QEMU tcg, but I don't have a hardware supported EPAN.
>> I don't think performance benchmark on QEMU tcg makes sense since it is
>> quite slow, such small overhead is unlikely measurable on it.
> Yeah, benchmarking under qemu is pointless. I think you can remove some
> of the ARM64_HAS_EPAN checks (or replaced them with ARM64_HAS_PAN) just
> for testing. For security reason, we removed this behaviour in commit
> 24cecc377463 ("arm64: Revert support for execute-only user mappings")
> but it's good enough for testing. This should give you PROT_EXEC-only
> mappings on your hardware.

Thanks for the suggestion. IIUC, I can still emulate exec-only even 
though the hardware doesn't support EPAN? So reading an exec-only area 
from the kernel can still trigger a fault, right?

And 24cecc377463 ("arm64: Revert support for execute-only user 
mappings") can't be reverted cleanly by git revert, so I did it manually 
as below.

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6a8b71917e3b..0bdedd415e56 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -573,8 +573,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
                 /* Write implies read */
                 vm_flags |= VM_WRITE;
                 /* If EPAN is absent then exec implies read */
-               if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
-                       vm_flags |= VM_EXEC;
+               //if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
+               //      vm_flags |= VM_EXEC;
         }

         if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index 642bdf908b22..d30265d424e4 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
         [VM_WRITE]                                      = PAGE_READONLY,
         [VM_WRITE | VM_READ]                            = PAGE_READONLY,
         /* PAGE_EXECONLY if Enhanced PAN */
-       [VM_EXEC]                                       = PAGE_READONLY_EXEC,
+       [VM_EXEC]                                       = PAGE_EXECONLY,
         [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
         [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
         [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,


Is this correct? I just wanted to make sure I did the right thing before 
running the test.

>
Catalin Marinas July 10, 2024, 9:22 a.m. UTC | #23
On Tue, Jul 09, 2024 at 03:29:58PM -0700, Yang Shi wrote:
> On 7/9/24 11:35 AM, Catalin Marinas wrote:
> > On Tue, Jul 09, 2024 at 10:56:55AM -0700, Yang Shi wrote:
> > > On 7/4/24 3:03 AM, Catalin Marinas wrote:
> > > I tested exec-only on QEMU tcg, but I don't have a hardware supported EPAN.
> > > I don't think performance benchmark on QEMU tcg makes sense since it is
> > > quite slow, such small overhead is unlikely measurable on it.
> > 
> > Yeah, benchmarking under qemu is pointless. I think you can remove some
> > of the ARM64_HAS_EPAN checks (or replaced them with ARM64_HAS_PAN) just
> > for testing. For security reason, we removed this behaviour in commit
> > 24cecc377463 ("arm64: Revert support for execute-only user mappings")
> > but it's good enough for testing. This should give you PROT_EXEC-only
> > mappings on your hardware.
> 
> Thanks for the suggestion. IIUC, I still can emulate exec-only even though
> hardware doesn't support EPAN? So it means reading exec-only area in kernel
> still can trigger fault, right?

Yes, it's been supported since ARMv8.0. We limited it to EPAN only since
setting a PROT_EXEC mapping still allowed the kernel to access the
memory even if PSTATE.PAN was set.

> And 24cecc377463 ("arm64: Revert support for execute-only user mappings")
> can't be reverted cleanly by git revert, so I did it manually as below.

Yeah, I wasn't expecting that to work.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 6a8b71917e3b..0bdedd415e56 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -573,8 +573,8 @@ static int __kprobes do_page_fault(unsigned long far,
> unsigned long esr,
>                 /* Write implies read */
>                 vm_flags |= VM_WRITE;
>                 /* If EPAN is absent then exec implies read */
> -               if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
> -                       vm_flags |= VM_EXEC;
> +               //if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
> +               //      vm_flags |= VM_EXEC;
>         }
> 
>         if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs))
> {
> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> index 642bdf908b22..d30265d424e4 100644
> --- a/arch/arm64/mm/mmap.c
> +++ b/arch/arm64/mm/mmap.c
> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>         [VM_WRITE]                                      = PAGE_READONLY,
>         [VM_WRITE | VM_READ]                            = PAGE_READONLY,
>         /* PAGE_EXECONLY if Enhanced PAN */
> -       [VM_EXEC]                                       = PAGE_READONLY_EXEC,
> +       [VM_EXEC]                                       = PAGE_EXECONLY,
>         [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
>         [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
>         [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,

In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
Otherwise it looks fine.
Yang Shi July 10, 2024, 6:43 p.m. UTC | #24
On 7/10/24 2:22 AM, Catalin Marinas wrote:
> On Tue, Jul 09, 2024 at 03:29:58PM -0700, Yang Shi wrote:
>> On 7/9/24 11:35 AM, Catalin Marinas wrote:
>>> On Tue, Jul 09, 2024 at 10:56:55AM -0700, Yang Shi wrote:
>>>> On 7/4/24 3:03 AM, Catalin Marinas wrote:
>>>> I tested exec-only on QEMU tcg, but I don't have a hardware supported EPAN.
>>>> I don't think performance benchmark on QEMU tcg makes sense since it is
>>>> quite slow, such small overhead is unlikely measurable on it.
>>> Yeah, benchmarking under qemu is pointless. I think you can remove some
>>> of the ARM64_HAS_EPAN checks (or replaced them with ARM64_HAS_PAN) just
>>> for testing. For security reason, we removed this behaviour in commit
>>> 24cecc377463 ("arm64: Revert support for execute-only user mappings")
>>> but it's good enough for testing. This should give you PROT_EXEC-only
>>> mappings on your hardware.
>> Thanks for the suggestion. IIUC, I still can emulate exec-only even though
>> hardware doesn't support EPAN? So it means reading exec-only area in kernel
>> still can trigger fault, right?
> Yes, it's been supported since ARMv8.0. We limited it to EPAN only since
> setting a PROT_EXEC mapping still allowed the kernel to access the
> memory even if PSTATE.PAN was set.
>
>> And 24cecc377463 ("arm64: Revert support for execute-only user mappings")
>> can't be reverted cleanly by git revert, so I did it manually as below.
> Yeah, I wasn't expecting that to work.
>
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 6a8b71917e3b..0bdedd415e56 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -573,8 +573,8 @@ static int __kprobes do_page_fault(unsigned long far,
>> unsigned long esr,
>>                  /* Write implies read */
>>                  vm_flags |= VM_WRITE;
>>                  /* If EPAN is absent then exec implies read */
>> -               if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
>> -                       vm_flags |= VM_EXEC;
>> +               //if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
>> +               //      vm_flags |= VM_EXEC;
>>          }
>>
>>          if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs))
>> {
>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>> index 642bdf908b22..d30265d424e4 100644
>> --- a/arch/arm64/mm/mmap.c
>> +++ b/arch/arm64/mm/mmap.c
>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>          [VM_WRITE]                                      = PAGE_READONLY,
>>          [VM_WRITE | VM_READ]                            = PAGE_READONLY,
>>          /* PAGE_EXECONLY if Enhanced PAN */
>> -       [VM_EXEC]                                       = PAGE_READONLY_EXEC,
>> +       [VM_EXEC]                                       = PAGE_EXECONLY,
>>          [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
>>          [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
>>          [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
> Otherwise it looks fine.

Thanks. I just ran the same benchmark: the modified page_fault1_thread 
(triggering read faults) for 100 iterations with 160 threads on 160 
cores. This should be the worst contention case, and I collected the max 
data (worst latency). It shows the patch may incur ~30% overhead for the 
exec-only case. The overhead should come just from the permission fault.

     N           Min           Max        Median           Avg Stddev
x 100        163840        219083        184471        183262 12593.229
+ 100        211198        285947        233608     238819.98 15253.967
Difference at 95.0% confidence
     55558 +/- 3877
     30.3161% +/- 2.11555%

This is a very extreme benchmark; I don't think any real-life workload 
will spend that much time (sys vs user) in page faults, particularly 
read faults.

With my atomic fault benchmark (populate 1G of memory with atomic 
instructions, then manipulate the values stored in the memory for 100 
iterations so the user time is much longer than sys time), I saw around 
13% overhead on sys time due to the permission fault, but no noticeable 
change in user and real time.

So the permission fault does incur noticeable overhead for read faults 
on exec-only mappings, but it may not be that bad for real-life 
workloads.

>
Catalin Marinas July 11, 2024, 5:43 p.m. UTC | #25
On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
> On 7/10/24 2:22 AM, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> > > index 642bdf908b22..d30265d424e4 100644
> > > --- a/arch/arm64/mm/mmap.c
> > > +++ b/arch/arm64/mm/mmap.c
> > > @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
> > >          [VM_WRITE]                                      = PAGE_READONLY,
> > >          [VM_WRITE | VM_READ]                            = PAGE_READONLY,
> > >          /* PAGE_EXECONLY if Enhanced PAN */
> > > -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
> > > +        [VM_EXEC]                                       = PAGE_EXECONLY,
> > >          [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
> > >          [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
> > >          [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
> > 
> > In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
> > Otherwise it looks fine.
> 
> Thanks. I just ran the same benchmark. Ran the modified page_fault1_thread
> (trigger read fault) in 100 iterations with 160 threads on 160 cores. This
> should be the worst contention case and collected the max data (worst
> latency). It shows the patch may incur ~30% overhead for exec-only case. The
> overhead should just come from the permission fault.
> 
>     N           Min           Max        Median           Avg Stddev
> x 100        163840        219083        184471        183262 12593.229
> + 100        211198        285947        233608     238819.98 15253.967
> Difference at 95.0% confidence
>     55558 +/- 3877
>     30.3161% +/- 2.11555%
> 
> This is a very extreme benchmark, I don't think any real life workload will
> spend that much time (sys vs user) in page fault, particularly read fault.
> 
> With my atomic fault benchmark (populate 1G memory with atomic instruction
> then manipulate the value stored in the memory in 100 iterations so the user
> time is much longer than sys time), I saw around 13% overhead on sys time
> due to the permission fault, but no noticeable change for user and real
> time.

Thanks for running these tests.

> So the permission fault does incur noticeable overhead for read fault on
> exec-only, but it may be not that bad for real life workloads.

So you are saying 3 faults is not that bad for real life workloads but
the 2 faults behaviour we have currently is problematic for OpenJDK. For
the OpenJDK case, I don't think the faulting is the main issue affecting
run-time performance but, over a longer run (or with more iterations in
this benchmark after the faulting in), you'd notice the breaking up of
the initial THP pages.

Do you have any OpenJDK benchmarks that show the problem? It might be
worth running them with this patch:

https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7

Or, if not, do you see any difference in the user time in your benchmark
with the above mm patch? In subsequent iterations, linear accesses are
not ideal for testing. Better to have some random accesses to this 1GB
of memory (after the initial touching). That would mimic heap accesses a
bit better.

Anyway, as it stands, I don't think we should merge this patch since it
does incur an additional penalty with exec-only mappings and it would
make things even worse for OpenJDK if distros change their default
toolchain flags at some point to generate exec-only ELF text sections.
While we could work around this by allowing the kernel to read the
exec-only user mapping with a new flavour of get_user() (toggling PAN as
well), I don't think it's worth it. Especially with the above mm change,
I think the benefits of this patch aren't justified. Longer term, I hope
that customers upgrade to OpenJDK v22 or, for proprietary tools, they
either follow the madvise() approach or wait for the Arm architects to
change the hardware behaviour.
Yang Shi July 11, 2024, 6:17 p.m. UTC | #26
On 7/11/24 10:43 AM, Catalin Marinas wrote:
> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>> index 642bdf908b22..d30265d424e4 100644
>>>> --- a/arch/arm64/mm/mmap.c
>>>> +++ b/arch/arm64/mm/mmap.c
>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>>>           [VM_WRITE]                                      = PAGE_READONLY,
>>>>           [VM_WRITE | VM_READ]                            = PAGE_READONLY,
>>>>           /* PAGE_EXECONLY if Enhanced PAN */
>>>> -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
>>>> +        [VM_EXEC]                                       = PAGE_EXECONLY,
>>>>           [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
>>>>           [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
>>>>           [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>> Otherwise it looks fine.
>> Thanks. I just ran the same benchmark. Ran the modified page_fault1_thread
>> (trigger read fault) in 100 iterations with 160 threads on 160 cores. This
>> should be the worst contention case and collected the max data (worst
>> latency). It shows the patch may incur ~30% overhead for exec-only case. The
>> overhead should just come from the permission fault.
>>
>>      N           Min           Max        Median           Avg Stddev
>> x 100        163840        219083        184471        183262 12593.229
>> + 100        211198        285947        233608     238819.98 15253.967
>> Difference at 95.0% confidence
>>      55558 +/- 3877
>>      30.3161% +/- 2.11555%
>>
>> This is a very extreme benchmark, I don't think any real life workload will
>> spend that much time (sys vs user) in page fault, particularly read fault.
>>
>> With my atomic fault benchmark (populate 1G memory with atomic instruction
>> then manipulate the value stored in the memory in 100 iterations so the user
>> time is much longer than sys time), I saw around 13% overhead on sys time
>> due to the permission fault, but no noticeable change for user and real
>> time.
> Thanks for running these tests.
>
>> So the permission fault does incur noticeable overhead for read fault on
>> exec-only, but it may be not that bad for real life workloads.
> So you are saying 3 faults is not that bad for real life workloads but
> the 2 faults behaviour we have currently is problematic for OpenJDK. For
> the OpenJDK case, I don't think the faulting is the main issue affecting
> run-time performance but, over a longer run (or with more iterations in
> this benchmark after the faulting in), you'd notice the breaking up of
> the initial THP pages.

I meant the extra permission fault for exec-only should be OK since the 
current implementation can't force a write fault for exec-only anyway. 
It does incur noticeable overhead for read faults, but I'm not aware of 
any real-life workloads that are sensitive to read faults. The benchmark 
is a very extreme worst case.

>
> Do you have any OpenJDK benchmarks that show the problem? It might be
> worth running them with this patch:
>
> https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7
>
> Or, if not, do you see any difference in the user time in your benchmark
> with the above mm patch? In subsequent iterations, linear accesses are
> not ideal for testing. Better to have some random accesses this 1GB of
> memory (after the initial touching). That would mimic heap accesses a
> bit better.

I didn't try that patch. I think we discussed this before. That patch 
can remove the THP shattering, so we should be able to see some 
improvement, but the TLBI is still needed and its cost should still be 
noticeable because the write protection fault still happens.

>
> Anyway, as it stands, I don't think we should merge this patch since it
> does incur an additional penalty with exec-only mappings and it would
> make things even worse for OpenJDK if distros change their default
> toolchain flags at some point to generate exec-only ELF text sections.
> While we could work around this by allowing the kernel to read the
> exec-only user mapping with a new flavour of get_user() (toggling PAN as
> well), I don't think it's worth it. Especially with the above mm change,
> I think the benefits of this patch aren't justified. Longer term, I hope
> that customers upgrade to OpenJDK v22 or, for proprietary tools, they
> either follow the madvise() approach or wait for the Arm architects to
> change the hardware behaviour.

If the overhead for exec-only is a concern, I think we can just skip 
exec-only segments for now, right? Exec-only is not that popular yet. 
And if users prefer security, they are typically not that sensitive to 
performance.

>
Ryan Roberts July 15, 2024, 1:09 p.m. UTC | #27
On 02/07/2024 11:26, Ryan Roberts wrote:
> On 01/07/2024 20:43, Catalin Marinas wrote:
>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>> This patch does feel a bit like working around a non-optimal user choice
>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>> single call for a larger VA vs touching each page.
>>>
>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>>> inefficiency in kernel. Two faults are not necessary if we know we are
>>> definitely going to write the memory very soon, right?
>>
>> I agree the Arm architecture behaviour is not ideal here and any
>> timelines for fixing it in hardware, if they do happen, are far into the
>> future. Purely from a kernel perspective, what I want though is make
>> sure that longer term (a) we don't create additional maintenance burden
>> and (b) we don't keep dead code around.
>>
>> Point (a) could be mitigated if the architecture is changed so that any
>> new atomic instructions added to this range would also come with
>> additional syndrome information so that we don't have to update the
>> decoding patterns.
>>
>> Point (b), however, depends on the OpenJDK and the kernel versions in
>> distros. Nick Gasson kindly provided some information on the OpenJDK
>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>> OpenJDK a few months ago:
>>
>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>
>> The OpenJDK commit lacks explanation but what I gathered from the diff
>> is that this option is the preferred one in the presence of THP (which
>> most/all distros enable by default). If we merge your proposed kernel
>> patch, it will take time before it makes its way into distros. I'm
>> hoping that by that time, distros would have picked a new OpenJDK
>> version already that doesn't need the atomic_add(0) pattern. If that's
>> the case, we end up with some dead code in the kernel that's almost
>> never exercised.
>>
>> I don't follow OpenJDK development but I heard that updates are dragging
>> quite a lot. I can't tell whether people have picked up the
>> atomic_add(0) feature and whether, by the time a kernel patch would make
>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>
>> There's a point (c) as well on the overhead of reading the faulting
>> instruction. I hope that's negligible but I haven't measured it.
>>
> 
> Just to add to this, I note the existing kernel behaviour is that if a write
> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
> then the PMD is shattered, the PTE of the fault address is populated with a
> writable page and the remaining PTEs are populated with order-0 zero pages
> (read-only).
> 
> This seems like odd behaviour to me. Surely it would be less effort and more
> aligned with the app's expectations to notice the huge zero page in the PMD,
> remove it, and install a THP, as would have been done if pmd_none() was true? I
> don't think there is a memory bloat argument here because, IIUC, with the
> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
> 
> Changing to this new behaviour would only be a partial solution for your use
> case, since you would still have 2 faults. But it would remove the cost of the
> shattering and ensure you have a THP immediately after the write fault. But I
> can't think of a reason why this wouldn't be a generally useful change
> regardless? Any thoughts?

Hi All,

FYI, I had some more conversation on this at [1] and the conclusion was that the
current kernel behaviour is undesirable and we should change it so that if a
write fault occurs in a region mapped by a huge zero page, then the huge zero
page is uninstalled and a (private) THP is installed in its place. Shattering of
the huge zero page to PTEs should *not* occur.

So we will make this change in due course and submit to the list.

[1] https://lore.kernel.org/linux-mm/1cfae0c0-96a2-4308-9c62-f7a640520242@arm.com/

Thanks,
Ryan

> 
> Thanks,
> Ryan
>
Yang Shi Aug. 13, 2024, 5:09 p.m. UTC | #28
On 7/11/24 11:17 AM, Yang Shi wrote:
>
>
> On 7/11/24 10:43 AM, Catalin Marinas wrote:
>> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>>> index 642bdf908b22..d30265d424e4 100644
>>>>> --- a/arch/arm64/mm/mmap.c
>>>>> +++ b/arch/arm64/mm/mmap.c
>>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>>>>           [VM_WRITE]                                      = PAGE_READONLY,
>>>>>           [VM_WRITE | VM_READ]                            = PAGE_READONLY,
>>>>>           /* PAGE_EXECONLY if Enhanced PAN */
>>>>> -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
>>>>> +        [VM_EXEC]                                       = PAGE_EXECONLY,
>>>>>           [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
>>>>>           [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
>>>>>           [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
>>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>>> Otherwise it looks fine.
>>> Thanks. I just ran the same benchmark. Ran the modified 
>>> page_fault1_thread
>>> (trigger read fault) in 100 iterations with 160 threads on 160 
>>> cores. This
>>> should be the worst contention case and collected the max data (worst
>>> latency). It shows the patch may incur ~30% overhead for exec-only 
>>> case. The
>>> overhead should just come from the permission fault.
>>>
>>>      N           Min           Max        Median           Avg Stddev
>>> x 100        163840        219083        184471        183262 12593.229
>>> + 100        211198        285947        233608     238819.98 15253.967
>>> Difference at 95.0% confidence
>>>      55558 +/- 3877
>>>      30.3161% +/- 2.11555%
>>>
>>> This is a very extreme benchmark, I don't think any real life 
>>> workload will
>>> spend that much time (sys vs user) in page fault, particularly read 
>>> fault.
>>>
>>> With my atomic fault benchmark (populate 1G memory with atomic 
>>> instruction
>>> then manipulate the value stored in the memory in 100 iterations so 
>>> the user
>>> time is much longer than sys time), I saw around 13% overhead on sys 
>>> time
>>> due to the permission fault, but no noticeable change for user and real
>>> time.
>> Thanks for running these tests.
>>
>>> So the permission fault does incur noticeable overhead for read 
>>> fault on
>>> exec-only, but it may be not that bad for real life workloads.
>> So you are saying 3 faults is not that bad for real life workloads but
>> the 2 faults behaviour we have currently is problematic for OpenJDK. For
>> the OpenJDK case, I don't think the faulting is the main issue affecting
>> run-time performance but, over a longer run (or with more iterations in
>> this benchmark after the faulting in), you'd notice the breaking up of
>> the initial THP pages.

Gentle ping...

>
> I meant the extra permission fault for exec-only should be ok since 
> the current implementation can't force write fault for exec-only 
> anyway. It does incur noticeable overhead for read fault, but I'm not 
> aware of any real life workloads that are sensitive to read fault. The 
> benchmark is for a very extreme worst case.
>
>>
>> Do you have any OpenJDK benchmarks that show the problem? It might be
>> worth running them with this patch:
>>
>> https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7 
>>
>>
>> Or, if not, do you see any difference in the user time in your benchmark
>> with the above mm patch? In subsequent iterations, linear accesses are
>> not ideal for testing. Better to have some random accesses to this 1GB of
>> memory (after the initial touching). That would mimic heap accesses a
>> bit better.
>
> I didn't try that patch. I think we discussed this before. This patch 
> can remove the THP shattering, we should be able to see some 
> improvement, but TLBI is still needed and its cost should still be 
> noticeable because the write protection fault still happens.
>
>>
>> Anyway, as it stands, I don't think we should merge this patch since it
>> does incur an additional penalty with exec-only mappings and it would
>> make things even worse for OpenJDK if distros change their default
>> toolchain flags at some point to generate exec-only ELF text sections.
>> While we could work around this by allowing the kernel to read the
>> exec-only user mapping with a new flavour of get_user() (toggling PAN as
>> well), I don't think it's worth it. Especially with the above mm change,
>> I think the benefits of this patch aren't justified. Longer term, I hope
>> that customers upgrade to OpenJDK v22 or, for proprietary tools, they
>> either follow the madvise() approach or wait for the Arm architects to
>> change the hardware behaviour.
>
> If the overhead for exec-only is a concern, I think we can just skip 
> exec-only segment for now, right? The exec-only is not that popular 
> yet. And if the users prefer security, typically they may be not that 
> sensitive to performance.
>
>>
>
Catalin Marinas Aug. 21, 2024, 10:18 a.m. UTC | #29
On Thu, Jul 11, 2024 at 11:17:51AM -0700, Yang Shi wrote:
> On 7/11/24 10:43 AM, Catalin Marinas wrote:
> > On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
> > > On 7/10/24 2:22 AM, Catalin Marinas wrote:
> > > > > diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> > > > > index 642bdf908b22..d30265d424e4 100644
> > > > > --- a/arch/arm64/mm/mmap.c
> > > > > +++ b/arch/arm64/mm/mmap.c
> > > > > @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
> > > > >           [VM_WRITE]                                      = PAGE_READONLY,
> > > > >           [VM_WRITE | VM_READ]                            = PAGE_READONLY,
> > > > >           /* PAGE_EXECONLY if Enhanced PAN */
> > > > > -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
> > > > > +        [VM_EXEC]                                       = PAGE_EXECONLY,
> > > > >           [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
> > > > >           [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
> > > > >           [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
> > > > In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
> > > > Otherwise it looks fine.
> > > Thanks. I just ran the same benchmark. Ran the modified page_fault1_thread
> > > (trigger read fault) in 100 iterations with 160 threads on 160 cores. This
> > > should be the worst contention case and collected the max data (worst
> > > latency). It shows the patch may incur ~30% overhead for exec-only case. The
> > > overhead should just come from the permission fault.
> > > 
> > >      N           Min           Max        Median           Avg Stddev
> > > x 100        163840        219083        184471        183262 12593.229
> > > + 100        211198        285947        233608     238819.98 15253.967
> > > Difference at 95.0% confidence
> > >      55558 +/- 3877
> > >      30.3161% +/- 2.11555%
> > > 
> > > This is a very extreme benchmark, I don't think any real life workload will
> > > spend that much time (sys vs user) in page fault, particularly read fault.
> > > 
> > > With my atomic fault benchmark (populate 1G memory with atomic instruction
> > > then manipulate the value stored in the memory in 100 iterations so the user
> > > time is much longer than sys time), I saw around 13% overhead on sys time
> > > due to the permission fault, but no noticeable change for user and real
> > > time.
> > Thanks for running these tests.
> > 
> > > So the permission fault does incur noticeable overhead for read fault on
> > > exec-only, but it may be not that bad for real life workloads.
> > So you are saying 3 faults is not that bad for real life workloads but
> > the 2 faults behaviour we have currently is problematic for OpenJDK. For
> > the OpenJDK case, I don't think the faulting is the main issue affecting
> > run-time performance but, over a longer run (or with more iterations in
> > this benchmark after the faulting in), you'd notice the breaking up of
> > the initial THP pages.
> 
> I meant the extra permission fault for exec-only should be ok since the
> current implementation can't force write fault for exec-only anyway. It does
> incur noticeable overhead for read fault, but I'm not aware of any real life
> workloads that are sensitive to read fault. The benchmark is for a very extreme
> worst case.

I agree microbenchmarks like this are not realistic but I wouldn't be
surprised if we hear of some large file mmap() read or very sparse
arrays read being affected.

Sorry, I forgot the details from the previous discussions. Are there any
benchmarks that show OpenJDK performing badly _and_ what the improvement
is with this patch? We should do similar comparison with the THP kernel
fix. If the THP fix gets us close to the same level of improvement, I
don't think we should bother with this patch.

> > Do you have any OpenJDK benchmarks that show the problem? It might be
> > worth running them with this patch:
> > 
> > https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7
> > 
> > Or, if not, do you see any difference in the user time in your benchmark
> > with the above mm patch? In subsequent iterations, linear accesses are
> > not ideal for testing. Better to have some random accesses to this 1GB of
> > memory (after the initial touching). That would mimic heap accesses a
> > bit better.
> 
> I didn't try that patch. I think we discussed this before. This patch can
> remove the THP shattering, we should be able to see some improvement, but
> TLBI is still needed and its cost should still be noticeable because the
> write protection fault still happens.

Yeah, with two faults we do need the subsequent TLBI. But if it's only
one per huge page, it might not be that bad.

> > Anyway, as it stands, I don't think we should merge this patch since it
> > does incur an additional penalty with exec-only mappings and it would
> > make things even worse for OpenJDK if distros change their default
> > toolchain flags at some point to generate exec-only ELF text sections.
> > While we could work around this by allowing the kernel to read the
> > exec-only user mapping with a new flavour of get_user() (toggling PAN as
> > well), I don't think it's worth it. Especially with the above mm change,
> > I think the benefits of this patch aren't justified. Longer term, I hope
> > that customers upgrade to OpenJDK v22 or, for proprietary tools, they
> > either follow the madvise() approach or wait for the Arm architects to
> > change the hardware behaviour.
> 
> If the overhead for exec-only is a concern, I think we can just skip
> exec-only segment for now, right? The exec-only is not that popular yet. And
> if the users prefer security, typically they may be not that sensitive to
> performance.

I don't think we should penalise something that currently works when we
already have solutions for the problematic case (madvise()) and a kernel
improvement on the THP handling. At least not without real-world numbers
to back it up. I don't want to advise people not to compile OpenJDK (or
other software) with exec-only permissions just because the in-kernel
RMW handling will get worse.

Currently exec-only binaries are not widespread but, as FEAT_EPAN (a
precondition) starts to appear in hardware, software may move in this
direction, at least for the pre-built executables (JIT'ed code may incur
additional size penalty to keep the literal pool separated).

My view (as kernel maintainer) is that there aren't strong enough
arguments for merging this patch. It does show an improvement in
specific cases but the downside is the risk of affecting other cases
(plus additional maintenance cost, though I'd expect the opcode checking
to remain the same). I do think the architecture should be changed so
that the kernel gets the right information in the fault handler but it
would be a while before such a feature appears in hardware.
Dev Jain Aug. 21, 2024, 11:32 a.m. UTC | #30
On 8/21/24 15:48, Catalin Marinas wrote:
> On Thu, Jul 11, 2024 at 11:17:51AM -0700, Yang Shi wrote:
>> On 7/11/24 10:43 AM, Catalin Marinas wrote:
>>> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>>>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>>>> index 642bdf908b22..d30265d424e4 100644
>>>>>> --- a/arch/arm64/mm/mmap.c
>>>>>> +++ b/arch/arm64/mm/mmap.c
>>>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>>>>>           [VM_WRITE]                                      = PAGE_READONLY,
>>>>>>           [VM_WRITE | VM_READ]                            = PAGE_READONLY,
>>>>>>           /* PAGE_EXECONLY if Enhanced PAN */
>>>>>> -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
>>>>>> +        [VM_EXEC]                                       = PAGE_EXECONLY,
>>>>>>           [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
>>>>>>           [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
>>>>>>           [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
>>>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>>>> Otherwise it looks fine.
>>>> Thanks. I just ran the same benchmark. Ran the modified page_fault1_thread
>>>> (trigger read fault) in 100 iterations with 160 threads on 160 cores. This
>>>> should be the worst contention case and collected the max data (worst
>>>> latency). It shows the patch may incur ~30% overhead for exec-only case. The
>>>> overhead should just come from the permission fault.
>>>>
>>>>      N           Min           Max        Median           Avg Stddev
>>>> x 100        163840        219083        184471        183262 12593.229
>>>> + 100        211198        285947        233608     238819.98 15253.967
>>>> Difference at 95.0% confidence
>>>>      55558 +/- 3877
>>>>      30.3161% +/- 2.11555%
>>>>
>>>> This is a very extreme benchmark, I don't think any real life workload will
>>>> spend that much time (sys vs user) in page fault, particularly read fault.
>>>>
>>>> With my atomic fault benchmark (populate 1G memory with atomic instruction
>>>> then manipulate the value stored in the memory in 100 iterations so the user
>>>> time is much longer than sys time), I saw around 13% overhead on sys time
>>>> due to the permission fault, but no noticeable change for user and real
>>>> time.
>>> Thanks for running these tests.
>>>
>>>> So the permission fault does incur noticeable overhead for read fault on
>>>> exec-only, but it may be not that bad for real life workloads.
>>> So you are saying 3 faults is not that bad for real life workloads but
>>> the 2 faults behaviour we have currently is problematic for OpenJDK. For
>>> the OpenJDK case, I don't think the faulting is the main issue affecting
>>> run-time performance but, over a longer run (or with more iterations in
>>> this benchmark after the faulting in), you'd notice the breaking up of
>>> the initial THP pages.
>> I meant the extra permission fault for exec-only should be ok since the
>> current implementation can't force write fault for exec-only anyway. It does
>> incur noticeable overhead for read fault, but I'm not aware of any real life
>> workloads that are sensitive to read fault. The benchmark is for a very extreme
>> worst case.
> I agree microbenchmarks like this are not realistic but I wouldn't be
> surprised if we hear of some large file mmap() read or very sparse
> arrays read being affected.
>
> Sorry, I forgot the details from the previous discussions. Are there any
> benchmarks that show OpenJDK performing badly _and_ what the improvement
> is with this patch? We should do similar comparison with the THP kernel
> fix. If the THP fix gets us close to the same level of improvement, I
> don't think we should bother with this patch.

Just for the note, I am working on replacing the hugezeropage with a THP;
I will be posting that out (after some testing) soon.
Will Deacon Aug. 23, 2024, 9:59 a.m. UTC | #31
On Wed, Aug 21, 2024 at 11:18:42AM +0100, Catalin Marinas wrote:
> My view (as kernel maintainer) is that there aren't strong enough
> arguments for merging this patch. It does show an improvement in
> specific cases but the downside is the risk of affecting other cases
> (plus additional maintenance cost, though I'd expect the opcode checking
> to remain the same). I do think the architecture should be changed so
> that the kernel gets the right information in the fault handler but it
> would be a while before such feature appears in hardware.

I agree. This patch is a kludge to work around a bad assumption in OpenJDK
which can be (has been?) easily fixed in userspace.

Will
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..efcc8b2050db 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -325,6 +325,7 @@  static __always_inline u32 aarch64_insn_get_##abbr##_value(void)	\
  * "-" means "don't care"
  */
 __AARCH64_INSN_FUNCS(class_branch_sys,	0x1c000000, 0x14000000)
+__AARCH64_INSN_FUNCS(class_atomic,	0x3b200c00, 0x38200000)
 
 __AARCH64_INSN_FUNCS(adr,	0x9F000000, 0x10000000)
 __AARCH64_INSN_FUNCS(adrp,	0x9F000000, 0x90000000)
@@ -345,6 +346,7 @@  __AARCH64_INSN_FUNCS(ldeor,	0x3F20FC00, 0x38202000)
 __AARCH64_INSN_FUNCS(ldset,	0x3F20FC00, 0x38203000)
 __AARCH64_INSN_FUNCS(swp,	0x3F20FC00, 0x38208000)
 __AARCH64_INSN_FUNCS(cas,	0x3FA07C00, 0x08A07C00)
+__AARCH64_INSN_FUNCS(casp,	0xBFA07C00, 0x08207C00)
 __AARCH64_INSN_FUNCS(ldr_reg,	0x3FE0EC00, 0x38606800)
 __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800)
 __AARCH64_INSN_FUNCS(ldr_imm,	0x3FC00000, 0x39400000)
@@ -549,6 +551,24 @@  static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_class_cas(u32 insn)
+{
+	return aarch64_insn_is_cas(insn) ||
+	       aarch64_insn_is_casp(insn);
+}
+
+/*
+ * Exclude unallocated atomic instructions and LD64B/LDAPR.
+ * The masks and values were generated by using Python sympy module.
+ */
+static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn)
+{
+	return ((insn & 0x3f207c00) == 0x38200000) ||
+	       ((insn & 0x3f208c00) == 0x38200000) ||
+	       ((insn & 0x7fe06c00) == 0x78202000) ||
+	       ((insn & 0xbf204c00) == 0x38200000);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 451ba7cbd5ad..6a8b71917e3b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -500,6 +500,34 @@  static bool is_write_abort(unsigned long esr)
 	return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM);
 }
 
+static bool is_el0_atomic_instr(struct pt_regs *regs)
+{
+	u32 insn;
+	__le32 insn_le;
+	unsigned long pc = instruction_pointer(regs);
+
+	if (compat_user_mode(regs))
+		return false;
+
+	pagefault_disable();
+	if (get_user(insn_le, (__le32 __user *)pc)) {
+		pagefault_enable();
+		return false;
+	}
+	pagefault_enable();
+
+	insn = le32_to_cpu(insn_le);
+
+	if (aarch64_insn_is_class_atomic(insn) &&
+	    aarch64_atomic_insn_has_wr_perm(insn))
+		return true;
+
+	if (aarch64_insn_is_class_cas(insn))
+		return true;
+
+	return false;
+}
+
 static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 				   struct pt_regs *regs)
 {
@@ -568,6 +596,12 @@  static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!vma)
 		goto lock_mmap;
 
+	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
+	    is_el0_atomic_instr(regs)) {
+		vm_flags = VM_WRITE;
+		mm_flags |= FAULT_FLAG_WRITE;
+	}
+
 	if (!(vma->vm_flags & vm_flags)) {
 		vma_end_read(vma);
 		fault = 0;