Message ID | 20240626191830.3819324-1-yang@os.amperecomputing.com (mailing list archive)
---|---
State | New
Series | [v5] arm64: mm: force write fault for atomic RMW instructions
On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  	if (!vma)
>  		goto lock_mmap;
>
> +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> +	    is_el0_atomic_instr(regs)) {
> +		vm_flags = VM_WRITE;
> +		mm_flags |= FAULT_FLAG_WRITE;
> +	}

The patch looks fine now and AFAICT there's no ABI change.

However, before deciding whether to merge this patch, I'd like to understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This would be the portable (Linux) solution that works better on architectures without such atomic instructions (e.g. arm64 without LSE atomics). So fixing user-space would be my preferred solution.

(I poked some people in Arm working in the area, hopefully I'll get some more information)
On Fri, 28 Jun 2024, Catalin Marinas wrote: > On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote: >> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, >> if (!vma) >> goto lock_mmap; >> >> + if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) && >> + is_el0_atomic_instr(regs)) { >> + vm_flags = VM_WRITE; >> + mm_flags |= FAULT_FLAG_WRITE; >> + } > > The patch looks fine now and AFAICT there's no ABI change. > > However, before deciding whether to merge this patch, I'd like to > understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This > would be the portable (Linux) solution that works better on > architectures without such atomic instructions (e.g. arm64 without LSE > atomics). So fixing user-space would be my preferred solution. Doing so would be requesting application code changes that are linux and ARM64 specific from applications running on Linux. A lot of these are proprietary.
On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 28 Jun 2024, Catalin Marinas wrote:
> > On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote:
> > > @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >  	if (!vma)
> > >  		goto lock_mmap;
> > >
> > > +	if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) &&
> > > +	    is_el0_atomic_instr(regs)) {
> > > +		vm_flags = VM_WRITE;
> > > +		mm_flags |= FAULT_FLAG_WRITE;
> > > +	}
> >
> > The patch looks fine now and AFAICT there's no ABI change.
> >
> > However, before deciding whether to merge this patch, I'd like to understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This would be the portable (Linux) solution that works better on architectures without such atomic instructions (e.g. arm64 without LSE atomics). So fixing user-space would be my preferred solution.
>
> Doing so would be requesting application code changes that are linux and ARM64 specific from applications running on Linux.

Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely not. I'd argue that expecting the atomic_add(0) to only trigger a single write fault is arch-specific. You can't do this on arm32 or arm64 pre-LSE (I haven't checked other architectures).

IIUC, OpenJDK added this feature about two years ago but the arm64 behaviour hasn't changed in the meantime. So it's not like we broke the ABI and are forcing user space to update.

This patch does feel a bit like working around a non-optimal user choice in kernel space. Who knows, madvise() may even be quicker if you do a single call for a larger VA vs touching each page.

> A lot of these are proprietary.

Are you aware of other (proprietary) software relying on such a pattern to fault pages in as writeable?
On 6/28/24 10:24 AM, Catalin Marinas wrote: > On Fri, Jun 28, 2024 at 09:57:37AM -0700, Christoph Lameter (Ampere) wrote: >> On Fri, 28 Jun 2024, Catalin Marinas wrote: >>> On Wed, Jun 26, 2024 at 12:18:30PM -0700, Yang Shi wrote: >>>> @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, >>>> if (!vma) >>>> goto lock_mmap; >>>> >>>> + if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) && >>>> + is_el0_atomic_instr(regs)) { >>>> + vm_flags = VM_WRITE; >>>> + mm_flags |= FAULT_FLAG_WRITE; >>>> + } >>> The patch looks fine now and AFAICT there's no ABI change. >>> >>> However, before deciding whether to merge this patch, I'd like to >>> understand why OpenJDK cannot use madvise(MADV_POPULATE_WRITE). This >>> would be the portable (Linux) solution that works better on >>> architectures without such atomic instructions (e.g. arm64 without LSE >>> atomics). So fixing user-space would be my preferred solution. >> Doing so would be requesting application code changes that are linux and >> ARM64 specific from applications running on Linux. > Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely > not. I'd argue that expecting the atomic_add(0) to only trigger a single > write fault is arch specific. You can't do this on arm32 or arm64 > pre-LSE (I haven't checked other architectures). > > IIUC, OpenJDK added this feature about two years ago but the arm64 > behaviour hasn't changed in the meantime. So it's not like we broke the > ABI and forcing user space to update. > > This patch does feel a bit like working around a non-optimal user choice > in kernel space. Who knows, madvise() may even be quicker if you do a > single call for a larger VA vs touching each page. IMHO, I don't think so. I viewed this patch to solve or workaround some ISA inefficiency in kernel. Two faults are not necessary if we know we are definitely going to write the memory very soon, right? > >> A lot of these are proprietary. 
> Are you aware of other (proprietary) software relying on such pattern to > fault pages in as writeable? >
On Fri, 28 Jun 2024, Catalin Marinas wrote:
> Linux-specific (e.g. madvise()), I agree, but arm64-specific definitely not. I'd argue that expecting the atomic_add(0) to only trigger a single write fault is arch specific. You can't do this on arm32 or arm64 pre-LSE (I haven't checked other architectures).

The single write fault is x86 behavior. I am not sure how other architectures handle that.

> IIUC, OpenJDK added this feature about two years ago but the arm64 behaviour hasn't changed in the meantime. So it's not like we broke the ABI and are forcing user space to update.

The focus of OpenJDK may not be arm64, so they never saw the issue? We only know this because we have an insider on staff. AFAICT we get pushback from them as well. There are certainly numerous other open source applications that behave in a similar way. We just don't know about it.

> This patch does feel a bit like working around a non-optimal user choice in kernel space. Who knows, madvise() may even be quicker if you do a single call for a larger VA vs touching each page.

Looks to me like unexpected, surprising behavior on ARM64. madvise() is rather particular to Linux and its semantics are ever evolving.

> > A lot of these are proprietary.
>
> Are you aware of other (proprietary) software relying on such pattern to fault pages in as writeable?

I would not be told about such things by companies I did not work for, and if I have gotten knowledge about this in some way in the past then I would not be allowed to talk about it.
On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: > On 6/28/24 10:24 AM, Catalin Marinas wrote: > > This patch does feel a bit like working around a non-optimal user choice > > in kernel space. Who knows, madvise() may even be quicker if you do a > > single call for a larger VA vs touching each page. > > IMHO, I don't think so. I viewed this patch to solve or workaround some ISA > inefficiency in kernel. Two faults are not necessary if we know we are > definitely going to write the memory very soon, right? I agree the Arm architecture behaviour is not ideal here and any timelines for fixing it in hardware, if they do happen, are far into the future. Purely from a kernel perspective, what I want though is make sure that longer term (a) we don't create additional maintenance burden and (b) we don't keep dead code around. Point (a) could be mitigated if the architecture is changed so that any new atomic instructions added to this range would also come with additional syndrome information so that we don't have to update the decoding patterns. Point (b), however, depends on the OpenJDK and the kernel versions in distros. Nick Gasson kindly provided some information on the OpenJDK changes. The atomic_add(0) change happened in early 2022, about 5-6 months after MADV_POPULATE_WRITE support was added to the kernel. What's interesting is Ampere already contributed MADV_POPULATE_WRITE support to OpenJDK a few months ago: https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a The OpenJDK commit lacks explanation but what I gathered from the diff is that this option is the preferred one in the presence of THP (which most/all distros enable by default). If we merge your proposed kernel patch, it will take time before it makes its way into distros. I'm hoping that by that time, distros would have picked a new OpenJDK version already that doesn't need the atomic_add(0) pattern. 
If that's the case, we end up with some dead code in the kernel that's almost never exercised. I don't follow OpenJDK development but I heard that updates are dragging quite a lot. I can't tell whether people have picked up the atomic_add(0) feature and whether, by the time a kernel patch would make it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. There's a point (c) as well on the overhead of reading the faulting instruction. I hope that's negligible but I haven't measured it.
On 01/07/2024 20:43, Catalin Marinas wrote: > On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>> This patch does feel a bit like working around a non-optimal user choice >>> in kernel space. Who knows, madvise() may even be quicker if you do a >>> single call for a larger VA vs touching each page. >> >> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >> inefficiency in kernel. Two faults are not necessary if we know we are >> definitely going to write the memory very soon, right? > > I agree the Arm architecture behaviour is not ideal here and any > timelines for fixing it in hardware, if they do happen, are far into the > future. Purely from a kernel perspective, what I want though is make > sure that longer term (a) we don't create additional maintenance burden > and (b) we don't keep dead code around. > > Point (a) could be mitigated if the architecture is changed so that any > new atomic instructions added to this range would also come with > additional syndrome information so that we don't have to update the > decoding patterns. > > Point (b), however, depends on the OpenJDK and the kernel versions in > distros. Nick Gasson kindly provided some information on the OpenJDK > changes. The atomic_add(0) change happened in early 2022, about 5-6 > months after MADV_POPULATE_WRITE support was added to the kernel. What's > interesting is Ampere already contributed MADV_POPULATE_WRITE support to > OpenJDK a few months ago: > > https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a > > The OpenJDK commit lacks explanation but what I gathered from the diff > is that this option is the preferred one in the presence of THP (which > most/all distros enable by default). If we merge your proposed kernel > patch, it will take time before it makes its way into distros. 
I'm > hoping that by that time, distros would have picked a new OpenJDK > version already that doesn't need the atomic_add(0) pattern. If that's > the case, we end up with some dead code in the kernel that's almost > never exercised. > > I don't follow OpenJDK development but I heard that updates are dragging > quite a lot. I can't tell whether people have picked up the > atomic_add(0) feature and whether, by the time a kernel patch would make > it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. > > There's a point (c) as well on the overhead of reading the faulting > instruction. I hope that's negligible but I haven't measured it. > Just to add to this, I note the existing kernel behaviour is that if a write fault happens in a region that has a (RO) huge zero page mapped at PMD level, then the PMD is shattered, the PTE of the fault address is populated with a writable page and the remaining PTEs are populated with order-0 zero pages (read-only). This seems like odd behaviour to me. Surely it would be less effort and more aligned with the app's expectations to notice the huge zero page in the PMD, remove it, and install a THP, as would have been done if pmd_none() was true? I don't think there is a memory bloat argument here because, IIUC, with the current behaviour, khugepaged would eventually upgrade it to a THP anyway? Changing to this new behaviour would only be a partial solution for your use case, since you would still have 2 faults. But it would remove the cost of the shattering and ensure you have a THP immediately after the write fault. But I can't think of a reason why this wouldn't be a generally useful change regardless? Any thoughts? Thanks, Ryan
On 02.07.24 12:26, Ryan Roberts wrote: > On 01/07/2024 20:43, Catalin Marinas wrote: >> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >>> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>>> This patch does feel a bit like working around a non-optimal user choice >>>> in kernel space. Who knows, madvise() may even be quicker if you do a >>>> single call for a larger VA vs touching each page. >>> >>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >>> inefficiency in kernel. Two faults are not necessary if we know we are >>> definitely going to write the memory very soon, right? >> >> I agree the Arm architecture behaviour is not ideal here and any >> timelines for fixing it in hardware, if they do happen, are far into the >> future. Purely from a kernel perspective, what I want though is make >> sure that longer term (a) we don't create additional maintenance burden >> and (b) we don't keep dead code around. >> >> Point (a) could be mitigated if the architecture is changed so that any >> new atomic instructions added to this range would also come with >> additional syndrome information so that we don't have to update the >> decoding patterns. >> >> Point (b), however, depends on the OpenJDK and the kernel versions in >> distros. Nick Gasson kindly provided some information on the OpenJDK >> changes. The atomic_add(0) change happened in early 2022, about 5-6 >> months after MADV_POPULATE_WRITE support was added to the kernel. What's >> interesting is Ampere already contributed MADV_POPULATE_WRITE support to >> OpenJDK a few months ago: >> >> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a >> >> The OpenJDK commit lacks explanation but what I gathered from the diff >> is that this option is the preferred one in the presence of THP (which >> most/all distros enable by default). If we merge your proposed kernel >> patch, it will take time before it makes its way into distros. 
I'm >> hoping that by that time, distros would have picked a new OpenJDK >> version already that doesn't need the atomic_add(0) pattern. If that's >> the case, we end up with some dead code in the kernel that's almost >> never exercised. >> >> I don't follow OpenJDK development but I heard that updates are dragging >> quite a lot. I can't tell whether people have picked up the >> atomic_add(0) feature and whether, by the time a kernel patch would make >> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. >> >> There's a point (c) as well on the overhead of reading the faulting >> instruction. I hope that's negligible but I haven't measured it. >> > > Just to add to this, I note the existing kernel behaviour is that if a write > fault happens in a region that has a (RO) huge zero page mapped at PMD level, > then the PMD is shattered, the PTE of the fault address is populated with a > writable page and the remaining PTEs are populated with order-0 zero pages > (read-only). That also recently popped up in [1]. CCing Jinjiang. Ever since I replied there, I also thought some more about that handling in regard to the huge zeropage. > > This seems like odd behaviour to me. Surely it would be less effort and more > aligned with the app's expectations to notice the huge zero page in the PMD, > remove it, and install a THP, as would have been done if pmd_none() was true? I > don't think there is a memory bloat argument here because, IIUC, with the > current behaviour, khugepaged would eventually upgrade it to a THP anyway? One detail: depending on the setting of khugepaged_max_ptes_none. zeropages are treated like pte_none. But in the common case, that setting is left alone. > > Changing to this new behaviour would only be a partial solution for your use > case, since you would still have 2 faults. But it would remove the cost of the > shattering and ensure you have a THP immediately after the write fault. 
> But I can't think of a reason why this wouldn't be a generally useful change regardless? Any thoughts?

The "let's read before we write" as used by QEMU migration code is the desire to not waste memory by populating the zeropages, deferring memory consumption until really required.

/*
 * We read one byte of each page; this will preallocate page tables if
 * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
 * where no page was populated yet. This might require adaption when
 * supporting other mappings, like shmem.
 */

Without THP this works as expected. With THP this currently also works as expected, but of course with the price [1] of not getting anon THP immediately, which usually we don't care about. As you note, khugepaged might fix this up later.

If we disable the huge zeropage, we would get anon THPs when reading instead of small zeropages.

As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE) when we really care about that performance difference, which would also avoid the huge zeropage completely, but it's also not quite optimal in some cases.

I don't really know what to do here: changing the handling for the huge zeropage only unconditionally does not sound too wrong, but the change in behavior might (or might not) be desired for some use cases.

Reading from unpopulated memory can be a clear sign that really the shared zeropage is desired (as for QEMU), and concurrent memory preallocation/population should ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2] regarding the common use cases for the huge zeropage back then.

[1] https://lkml.kernel.org/r/740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com
[2] https://lwn.net/Articles/517465/
On 02/07/2024 12:22, David Hildenbrand wrote: > On 02.07.24 12:26, Ryan Roberts wrote: >> On 01/07/2024 20:43, Catalin Marinas wrote: >>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >>>> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>>>> This patch does feel a bit like working around a non-optimal user choice >>>>> in kernel space. Who knows, madvise() may even be quicker if you do a >>>>> single call for a larger VA vs touching each page. >>>> >>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >>>> inefficiency in kernel. Two faults are not necessary if we know we are >>>> definitely going to write the memory very soon, right? >>> >>> I agree the Arm architecture behaviour is not ideal here and any >>> timelines for fixing it in hardware, if they do happen, are far into the >>> future. Purely from a kernel perspective, what I want though is make >>> sure that longer term (a) we don't create additional maintenance burden >>> and (b) we don't keep dead code around. >>> >>> Point (a) could be mitigated if the architecture is changed so that any >>> new atomic instructions added to this range would also come with >>> additional syndrome information so that we don't have to update the >>> decoding patterns. >>> >>> Point (b), however, depends on the OpenJDK and the kernel versions in >>> distros. Nick Gasson kindly provided some information on the OpenJDK >>> changes. The atomic_add(0) change happened in early 2022, about 5-6 >>> months after MADV_POPULATE_WRITE support was added to the kernel. What's >>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to >>> OpenJDK a few months ago: >>> >>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a >>> >>> The OpenJDK commit lacks explanation but what I gathered from the diff >>> is that this option is the preferred one in the presence of THP (which >>> most/all distros enable by default). 
If we merge your proposed kernel >>> patch, it will take time before it makes its way into distros. I'm >>> hoping that by that time, distros would have picked a new OpenJDK >>> version already that doesn't need the atomic_add(0) pattern. If that's >>> the case, we end up with some dead code in the kernel that's almost >>> never exercised. >>> >>> I don't follow OpenJDK development but I heard that updates are dragging >>> quite a lot. I can't tell whether people have picked up the >>> atomic_add(0) feature and whether, by the time a kernel patch would make >>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. >>> >>> There's a point (c) as well on the overhead of reading the faulting >>> instruction. I hope that's negligible but I haven't measured it. >>> >> >> Just to add to this, I note the existing kernel behaviour is that if a write >> fault happens in a region that has a (RO) huge zero page mapped at PMD level, >> then the PMD is shattered, the PTE of the fault address is populated with a >> writable page and the remaining PTEs are populated with order-0 zero pages >> (read-only). > > That also recently popped up in [1]. CCing Jinjiang. Ever since I > replied there, I also thought some more about that handling in regard to the > huge zeropage. > >> >> This seems like odd behaviour to me. Surely it would be less effort and more >> aligned with the app's expectations to notice the huge zero page in the PMD, >> remove it, and install a THP, as would have been done if pmd_none() was true? I >> don't think there is a memory bloat argument here because, IIUC, with the >> current behaviour, khugepaged would eventually upgrade it to a THP anyway? > > One detail: depending on the setting of khugepaged_max_ptes_none. zeropages > are treated like pte_none. But in the common case, that setting is left alone. Ahh, got it. 
So in the common case, khugepaged won't actually collapse unless/until a bunch more write faults occur in the 2M region, and in that case there is a risk that changing this behaviour could lead to a memory bloat regression.

>> Changing to this new behaviour would only be a partial solution for your use case, since you would still have 2 faults. But it would remove the cost of the shattering and ensure you have a THP immediately after the write fault. But I can't think of a reason why this wouldn't be a generally useful change regardless? Any thoughts?
>
> The "let's read before we write" as used by QEMU migration code is the desire to not waste memory by populating the zeropages. Deferring consuming memory until really required.
>
> /*
>  * We read one byte of each page; this will preallocate page tables if
>  * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
>  * where no page was populated yet. This might require adaption when
>  * supporting other mappings, like shmem.
>  */

So QEMU is concerned with preallocating page tables? I would have thought you could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip through the allocator, fewer pud/pmd lock/unlocks, etc.)

TBH I always assumed in the past that the huge zero page is only useful because it's a placeholder for a real THP that would be populated on write. But that's obviously not the case at the moment. So other than a hack to preallocate the pgtables with only 1 fault per 2M, what other benefits does it have?

> Without THP this works as expected. With THP this currently also works as expected, but of course with the price [1] of not getting anon THP immediately, which usually we don't care about. As you note, khugepaged might fix this up later.
>
> If we disable the huge zeropage, we would get anon THPs when reading instead of small zeropages.
I wasn't aware of that behaviour either. Although that sounds like another reason why allocating a THP over the huge zero page on write fault should be the "more consistent" behaviour.

> As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE) when we really care about that performance difference, which would also avoid the huge zeropage completely, but it's also not quite optimal in some cases.

I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT, which would just mark the VMA so that any read fault is upgraded to a write fault.

> I don't really know what to do here: changing the handling for the huge zeropage only unconditionally does not sound too wrong, but the change in behavior might (or might not) be desired for some use cases.
>
> Reading from unpopulated memory can be a clear sign that really the shared zeropage is desired (as for QEMU), and concurrent memory preallocation/population should ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2] regarding the common use cases for the huge zeropage back then.

The current huge zero page behavior on write fault sounds wonky to me. But I agree there are better and more complete solutions to the identified use cases. So unless something pops up where the change is a clear benefit, I guess better to be safe and leave as is.

> [1] https://lkml.kernel.org/r/740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com
> [2] https://lwn.net/Articles/517465/
On 02.07.24 14:36, Ryan Roberts wrote: > On 02/07/2024 12:22, David Hildenbrand wrote: >> On 02.07.24 12:26, Ryan Roberts wrote: >>> On 01/07/2024 20:43, Catalin Marinas wrote: >>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>>>>> This patch does feel a bit like working around a non-optimal user choice >>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a >>>>>> single call for a larger VA vs touching each page. >>>>> >>>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >>>>> inefficiency in kernel. Two faults are not necessary if we know we are >>>>> definitely going to write the memory very soon, right? >>>> >>>> I agree the Arm architecture behaviour is not ideal here and any >>>> timelines for fixing it in hardware, if they do happen, are far into the >>>> future. Purely from a kernel perspective, what I want though is make >>>> sure that longer term (a) we don't create additional maintenance burden >>>> and (b) we don't keep dead code around. >>>> >>>> Point (a) could be mitigated if the architecture is changed so that any >>>> new atomic instructions added to this range would also come with >>>> additional syndrome information so that we don't have to update the >>>> decoding patterns. >>>> >>>> Point (b), however, depends on the OpenJDK and the kernel versions in >>>> distros. Nick Gasson kindly provided some information on the OpenJDK >>>> changes. The atomic_add(0) change happened in early 2022, about 5-6 >>>> months after MADV_POPULATE_WRITE support was added to the kernel. 
What's >>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to >>>> OpenJDK a few months ago: >>>> >>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a >>>> >>>> The OpenJDK commit lacks explanation but what I gathered from the diff >>>> is that this option is the preferred one in the presence of THP (which >>>> most/all distros enable by default). If we merge your proposed kernel >>>> patch, it will take time before it makes its way into distros. I'm >>>> hoping that by that time, distros would have picked a new OpenJDK >>>> version already that doesn't need the atomic_add(0) pattern. If that's >>>> the case, we end up with some dead code in the kernel that's almost >>>> never exercised. >>>> >>>> I don't follow OpenJDK development but I heard that updates are dragging >>>> quite a lot. I can't tell whether people have picked up the >>>> atomic_add(0) feature and whether, by the time a kernel patch would make >>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. >>>> >>>> There's a point (c) as well on the overhead of reading the faulting >>>> instruction. I hope that's negligible but I haven't measured it. >>>> >>> >>> Just to add to this, I note the existing kernel behaviour is that if a write >>> fault happens in a region that has a (RO) huge zero page mapped at PMD level, >>> then the PMD is shattered, the PTE of the fault address is populated with a >>> writable page and the remaining PTEs are populated with order-0 zero pages >>> (read-only). >> >> That also recently popped up in [1]. CCing Jinjiang. Ever since I >> replied there, I also thought some more about that handling in regard to the >> huge zeropage. >> >>> >>> This seems like odd behaviour to me. Surely it would be less effort and more >>> aligned with the app's expectations to notice the huge zero page in the PMD, >>> remove it, and install a THP, as would have been done if pmd_none() was true? 
I >>> don't think there is a memory bloat argument here because, IIUC, with the >>> current behaviour, khugepaged would eventually upgrade it to a THP anyway? >> >> One detail: depending on the setting of khugepaged_max_ptes_none. zeropages >> are treated like pte_none. But in the common case, that setting is left alone. > > Ahh, got it. So in the common case, khugepaged won't actually collapse > unless/until a bunch more write faults occur in the 2M region, and in that case > there is a risk that changing this behaviour could lead to a memory bloat > regression. > >> >>> >>> Changing to this new behaviour would only be a partial solution for your use >>> case, since you would still have 2 faults. But it would remove the cost of the >>> shattering and ensure you have a THP immediately after the write fault. But I >>> can't think of a reason why this wouldn't be a generally useful change >>> regardless? Any thoughts? >> >> The "let's read before we write" as used by QEMU migration code is the desire >> to not waste memory by populating the zeropages. Deferring consuming memory >> until really required. >> >> /* >> * We read one byte of each page; this will preallocate page tables if >> * required and populate the shared zeropage on MAP_PRIVATE anonymous memory >> * where no page was populated yet. This might require adaption when >> * supporting other mappings, like shmem. >> */ > > So QEMU is concerned with preallocatiing page tables? I would have thought you > could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE > call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip > through the allocator, fewer pud/pmd lock/unlocks, etc). I think we are only concerned about the "shared zeropage" part. Everything else is just unnecessary detail that adds confusion here :) One requires the other. 
Note that this is from migration code where we're supposed to write a single page we received from the migration source right now (not more). And we want to avoid allocating memory if it can be avoided (usually for overcommit).

> TBH I always assumed in the past that the huge zero page is only useful because it's a placeholder for a real THP that would be populated on write. But that's obviously not the case at the moment. So other than a hack to preallocate the pgtables with only 1 fault per 2M, what other benefits does it have?

I don't quite understand that question. [2] has some details on why the huge zeropage was added -- because we would have never otherwise received huge zeropages with THP enabled but always anon THP directly on read.

>> Without THP this works as expected. With THP this currently also works as expected, but of course with the price [1] of not getting anon THP immediately, which usually we don't care about. As you note, khugepaged might fix this up later.
>>
>> If we disable the huge zeropage, we would get anon THPs when reading instead of small zeropages.
>
> I wasn't aware of that behaviour either. Although that sounds like another reason why allocating a THP over the huge zero page on write fault should be the "more consistent" behaviour.

Reading [2], I think the huge zeropage was added to avoid the allocation of THP on read. Maybe really only for large readable regions, not sure why exactly.

>> As a reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE) when we really care about that performance difference, which would also avoid the huge zeropage completely, but it's also not quite optimal in some cases.
>
> I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT, which would just mark the VMA so that any read fault is upgraded to write.
> >> >> >> I don't really know what to do here: changing the handling for the huge zeropage >> only unconditionally does not sound too wrong, but the change in behavior >> might (or might not) be desired for some use cases. >> >> Reading from unpopulated memory can be a clear sign that really the shared zeropage >> is desired (as for QEMU), and concurrent memory preallocation/population should >> ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2] >> regarding >> the common use cases for the huge zeropage back then. > > The current huge zero page behavior on write fault sounds wonky to me. But I > agree there are better and more complete solutions to the identified use cases. > So unless something pops up where the change is a clear benefit, I guess better > to be safe and leave as is. We've had that behavior for quite a while ... so it's rather surprising to see multiple people reporting this right now. I guess most use cases don't read from uninitialized memory, barely write to it, and care about getting THPs immediately. For preallocation, MADV_POPULATE_WRITE is better. For QEMU migration? Not sure what's really better. Maybe replacing the huge zeropage by a THP would be faster in some cases, but result in more memory consumption (and more page zeroing?) during migration in other cases.
On 02/07/2024 13:58, David Hildenbrand wrote: > On 02.07.24 14:36, Ryan Roberts wrote: >> On 02/07/2024 12:22, David Hildenbrand wrote: >>> On 02.07.24 12:26, Ryan Roberts wrote: >>>> On 01/07/2024 20:43, Catalin Marinas wrote: >>>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >>>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>>>>>> This patch does feel a bit like working around a non-optimal user choice >>>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a >>>>>>> single call for a larger VA vs touching each page. >>>>>> >>>>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >>>>>> inefficiency in kernel. Two faults are not necessary if we know we are >>>>>> definitely going to write the memory very soon, right? >>>>> >>>>> I agree the Arm architecture behaviour is not ideal here and any >>>>> timelines for fixing it in hardware, if they do happen, are far into the >>>>> future. Purely from a kernel perspective, what I want though is make >>>>> sure that longer term (a) we don't create additional maintenance burden >>>>> and (b) we don't keep dead code around. >>>>> >>>>> Point (a) could be mitigated if the architecture is changed so that any >>>>> new atomic instructions added to this range would also come with >>>>> additional syndrome information so that we don't have to update the >>>>> decoding patterns. >>>>> >>>>> Point (b), however, depends on the OpenJDK and the kernel versions in >>>>> distros. Nick Gasson kindly provided some information on the OpenJDK >>>>> changes. The atomic_add(0) change happened in early 2022, about 5-6 >>>>> months after MADV_POPULATE_WRITE support was added to the kernel. 
What's >>>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to >>>>> OpenJDK a few months ago: >>>>> >>>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a >>>>> >>>>> The OpenJDK commit lacks explanation but what I gathered from the diff >>>>> is that this option is the preferred one in the presence of THP (which >>>>> most/all distros enable by default). If we merge your proposed kernel >>>>> patch, it will take time before it makes its way into distros. I'm >>>>> hoping that by that time, distros would have picked a new OpenJDK >>>>> version already that doesn't need the atomic_add(0) pattern. If that's >>>>> the case, we end up with some dead code in the kernel that's almost >>>>> never exercised. >>>>> >>>>> I don't follow OpenJDK development but I heard that updates are dragging >>>>> quite a lot. I can't tell whether people have picked up the >>>>> atomic_add(0) feature and whether, by the time a kernel patch would make >>>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. >>>>> >>>>> There's a point (c) as well on the overhead of reading the faulting >>>>> instruction. I hope that's negligible but I haven't measured it. >>>>> >>>> >>>> Just to add to this, I note the existing kernel behaviour is that if a write >>>> fault happens in a region that has a (RO) huge zero page mapped at PMD level, >>>> then the PMD is shattered, the PTE of the fault address is populated with a >>>> writable page and the remaining PTEs are populated with order-0 zero pages >>>> (read-only). >>> >>> That also recently popped up in [1]. CCing Jinjiang. Ever since I >>> replied there, I also thought some more about that handling in regard to the >>> huge zeropage. >>> >>>> >>>> This seems like odd behaviour to me. 
Surely it would be less effort and more >>>> aligned with the app's expectations to notice the huge zero page in the PMD, >>>> remove it, and install a THP, as would have been done if pmd_none() was true? I >>>> don't think there is a memory bloat argument here because, IIUC, with the >>>> current behaviour, khugepaged would eventually upgrade it to a THP anyway? >>> >>> One detail: depending on the setting of khugepaged_max_ptes_none. zeropages >>> are treated like pte_none. But in the common case, that setting is left alone. >> >> Ahh, got it. So in the common case, khugepaged won't actually collapse >> unless/until a bunch more write faults occur in the 2M region, and in that case >> there is a risk that changing this behaviour could lead to a memory bloat >> regression. >> >>> >>>> >>>> Changing to this new behaviour would only be a partial solution for your use >>>> case, since you would still have 2 faults. But it would remove the cost of the >>>> shattering and ensure you have a THP immediately after the write fault. But I >>>> can't think of a reason why this wouldn't be a generally useful change >>>> regardless? Any thoughts? >>> >>> The "let's read before we write" as used by QEMU migration code is the desire >>> to not waste memory by populating the zeropages. Deferring consuming memory >>> until really required. >>> >>> /* >>> * We read one byte of each page; this will preallocate page tables if >>> * required and populate the shared zeropage on MAP_PRIVATE anonymous >>> memory >>> * where no page was populated yet. This might require adaption when >>> * supporting other mappings, like shmem. >>> */ >> >> So QEMU is concerned with preallocating page tables? I would have thought you >> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE >> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip >> through the allocator, fewer pud/pmd lock/unlocks, etc). 
> > I think we are only concerned about the "shared zeropage" part. Everything else > is just unnecessary detail that adds confusion here :) One requires the other. Sorry, I don't quite follow your comment. As I understand it, the zeropage concept is intended as a memory-saving mechanism for applications that read memory but never write it. I don't think that really applies in your migration case, because you are definitely going to write all the memory eventually, I think? So I guess you are not interested in the "memory-saving" property, but in the side-effect, which is the pre-allocation of pagetables? (if you just wanted the memory-saving property, why not just skip the "read one byte of each page" op?) It's not important though, so let's not go down the rabbit hole. > > Note that this is from migration code where we're supposed to write a single > page we received from the migration source right now (not more). And we want to > avoid allocating memory if it can be avoided (usually for overcommit). > > > >> >> TBH I always assumed in the past that the huge zero page is only useful because >> it's a placeholder for a real THP that would be populated on write. But that's >> obviously not the case at the moment. So other than a hack to preallocate the >> pgtables with only 1 fault per 2M, what other benefits does it have? > > I don't quite understand that question. [2] has some details on why the huge > zeropage was added -- because we would have never otherwise received huge > zeropages with THP enabled but always anon THP directly on read. > >> >>> >>> >>> Without THP this works as expected. With THP this currently also works as >>> expected, but of course with the price [1] of not getting anon THP >>> immediately, which usually we don't care about. As you note, khugepaged might >>> fix this up later. >>> >>> If we disable the huge zeropage, we would get anon THPs when reading instead of >>> small zeropages. >> >> I wasn't aware of that behaviour either. 
Although that sounds like another >> reason why allocating a THP over the huge zero page on write fault should be the >> "more consistent" behaviour. > > Reading [2] I think the huge zeropage was added to avoid the allocation of THP > on read. Maybe for really only large readable regions, not sure why exactly. I might raise this on the THP call tomorrow, if Kyril joins, and get his view. > >> >>> >>> As reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE) >>> when we really care about that performance difference, which would also >>> avoid the huge zeropage completely, but it's also not quite optimal in some >>> cases. >> >> I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT, >> which would just mark the VMA so that any read fault is upgraded to write. >> >>> >>> >>> I don't really know what to do here: changing the handling for the huge zeropage >>> only unconditionally does not sound too wrong, but the change in behavior >>> might (or might not) be desired for some use cases. >>> >>> Reading from unpopulated memory can be a clear sign that really the shared >>> zeropage >>> is desired (as for QEMU), and concurrent memory preallocation/population should >>> ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2] >>> regarding >>> the common use cases for the huge zeropage back then. >> >> The current huge zero page behavior on write fault sounds wonky to me. But I >> agree there are better and more complete solutions to the identified use cases. >> So unless something pops up where the change is a clear benefit, I guess better >> to be safe and leave as is. > > We've had that behavior for quite a while ... so it's rather surprising to see > multiple people reporting this right now. > > I guess most use cases don't read from uninitialized memory, barely write to it, > and care about getting THPs immediately. > > For preallocation, MADV_POPULATE_WRITE is better. For QEMU migration? 
not > sure what's really better. Maybe replacing the huge zeropage by a THP would be > faster in some cases, but result in more memory consumption (and more page > zeroing?) during migration in other cases. >
>>>>> Changing to this new behaviour would only be a partial solution for your use >>>>> case, since you would still have 2 faults. But it would remove the cost of the >>>>> shattering and ensure you have a THP immediately after the write fault. But I >>>>> can't think of a reason why this wouldn't be a generally useful change >>>>> regardless? Any thoughts? >>>> >>>> The "let's read before we write" as used by QEMU migration code is the desire >>>> to not waste memory by populating the zeropages. Deferring consuming memory >>>> until really required. >>>> >>>> /* >>>> * We read one byte of each page; this will preallocate page tables if >>>> * required and populate the shared zeropage on MAP_PRIVATE anonymous >>>> memory >>>> * where no page was populated yet. This might require adaption when >>>> * supporting other mappings, like shmem. >>>> */ >>> >>> So QEMU is concerned with preallocating page tables? I would have thought you >>> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE >>> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip >>> through the allocator, fewer pud/pmd lock/unlocks, etc). >> >> I think we are only concerned about the "shared zeropage" part. Everything else >> is just unnecessary detail that adds confusion here :) One requires the other. > BTW, I was quoting the wrong QEMU code. The relevant QEMU commit that first added that handling is: commit 211ea74022f51164a7729030b28eec90b6c99a08 Author: Peter Lieven <pl@kamp.de> Date: Mon Jun 10 12:14:20 2013 +0200 migration: do not overwrite zero pages on incoming migration do not memset pages to zero if they already read as zero. this will allocate a new zero page and consume memory unnecessarily. even if we madvise a MADV_DONTNEED later this will only deallocate the memory asynchronously. (note that the MADV_DONTNEED handling in that form does not really apply anymore AFAIK) > Sorry I don't quite follow your comment. 
As I understand it, the zeropage > concept is intended as a memory-saving mechanism for applications that read > memory but never write it. "never written" -- then we wouldn't support COW faults on it, right? :P > I don't think that really applies in your migration > case, because you are definitely going to write all the memory eventually, I > think? Especially with memory ballooning and things like free-page-reporting we might be migrating a whole bunch of zero-memory and only have to make sure that the destination is actually zero. We don't want to consume memory. I recently fixed that handling for s390x where customers were running into precisely that issue (and zeropages in s390x VMs were disabled for historical reasons). > So I guess you are not interested in the "memory-saving" property, but in > the side-effect, which is the pre-allocation of pagetables? (if you just wanted > the memory-saving property, why not just skip the "read one byte of each page" > op?) It's not important though, so let's not go down the rabbit hole. There are cases where, during an incoming migration, some memory pages might already have been populated and might not be zero. And IIRC there are some cases (postcopy -> uffd) where I think it is important that we actually do have a page in place. But I forgot some of the details around userfaultfd handling in QEMU. > >> >> Note that this is from migration code where we're supposed to write a single >> page we received from the migration source right now (not more). And we want to >> avoid allocating memory if it can be avoided (usually for overcommit). >> >> >> >>> >>> TBH I always assumed in the past that the huge zero page is only useful because >>> it's a placeholder for a real THP that would be populated on write. But that's >>> obviously not the case at the moment. So other than a hack to preallocate the >>> pgtables with only 1 fault per 2M, what other benefits does it have? >> >> I don't quite understand that question. 
[2] has some details why the huge >> zeropage was added -- because we would have never otherwise received huge >> zeropages with THP enabled but always anon THP directly on read. >> >>> >>>> >>>> >>>> Without THP this works as expected. With THP this currently also works as >>>> expected, but of course with the price [1] of not getting anon THP >>>> immediately, which usually we don't care about. As you note, khugepaged might >>>> fix this up later. >>>> >>>> If we disable the huge zeropage, we would get anon THPs when reading instead of >>>> small zeropages. >>> >>> I wasn't aware of that behaviour either. Although that sounds like another >>> reason why allocating a THP over the huge zero page on write fault should be the >>> "more consistent" behaviour. >> >> Reading [2] I think the huge zeropage was added to avoid the allocation of THP >> on read. Maybe for really only large readable regions, not sure why exactly. > > I might raise this on the THP call tomorrow, if Kyril joins and get his view. Makes sense.
>> I don't think that really applies in your migration >> case, because you are definitely going to write all the memory eventually, I >> think? > > Especially with memory ballooning and things like free-page-reporting we might > be migrating a whole bunch of zero-memory and only have to make sure that the > destination is actually zero. We don't want to consume memory. Ahh, you know whether the contents of a page being migrated are all zero. In that case, I get it. Thanks!
On 7/1/24 12:43 PM, Catalin Marinas wrote: > On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote: >> On 6/28/24 10:24 AM, Catalin Marinas wrote: >>> This patch does feel a bit like working around a non-optimal user choice >>> in kernel space. Who knows, madvise() may even be quicker if you do a >>> single call for a larger VA vs touching each page. >> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA >> inefficiency in kernel. Two faults are not necessary if we know we are >> definitely going to write the memory very soon, right? > I agree the Arm architecture behaviour is not ideal here and any > timelines for fixing it in hardware, if they do happen, are far into the > future. Purely from a kernel perspective, what I want though is make > sure that longer term (a) we don't create additional maintenance burden > and (b) we don't keep dead code around. > > Point (a) could be mitigated if the architecture is changed so that any > new atomic instructions added to this range would also come with > additional syndrome information so that we don't have to update the > decoding patterns. > > Point (b), however, depends on the OpenJDK and the kernel versions in > distros. Nick Gasson kindly provided some information on the OpenJDK > changes. The atomic_add(0) change happened in early 2022, about 5-6 > months after MADV_POPULATE_WRITE support was added to the kernel. What's > interesting is Ampere already contributed MADV_POPULATE_WRITE support to > OpenJDK a few months ago: > > https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a > > The OpenJDK commit lacks explanation but what I gathered from the diff > is that this option is the preferred one in the presence of THP (which > most/all distros enable by default). If we merge your proposed kernel > patch, it will take time before it makes its way into distros. 
I'm > hoping that by that time, distros would have picked a new OpenJDK > version already that doesn't need the atomic_add(0) pattern. If that's > the case, we end up with some dead code in the kernel that's almost > never exercised. > > I don't follow OpenJDK development but I heard that updates are dragging > quite a lot. I can't tell whether people have picked up the > atomic_add(0) feature and whether, by the time a kernel patch would make > it into distros, they'd also move to the MADV_POPULATE_WRITE pattern. As Christoph said, there may be similar uses of atomics in other applications, so I don't worry too much about the dead code problem IMHO. OpenJDK is just the use case that we know of. There may be unknown unknowns. And the distros typically backport patches from the mainline kernel to their kernel, so there should be combos like old kernel + backported patch + old OpenJDK. AFAICT, the users do expect similar behavior to x86 (one fault instead of two faults). Actually we noticed this problem due to a customer report. > > There's a point (c) as well on the overhead of reading the faulting > instruction. I hope that's negligible but I haven't measured it. I think I showed the benchmark data requested by Anshuman in the earlier email discussion. >
diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 8c0a36f72d6f..efcc8b2050db 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -325,6 +325,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ * "-" means "don't care" */ __AARCH64_INSN_FUNCS(class_branch_sys, 0x1c000000, 0x14000000) +__AARCH64_INSN_FUNCS(class_atomic, 0x3b200c00, 0x38200000) __AARCH64_INSN_FUNCS(adr, 0x9F000000, 0x10000000) __AARCH64_INSN_FUNCS(adrp, 0x9F000000, 0x90000000) @@ -345,6 +346,7 @@ __AARCH64_INSN_FUNCS(ldeor, 0x3F20FC00, 0x38202000) __AARCH64_INSN_FUNCS(ldset, 0x3F20FC00, 0x38203000) __AARCH64_INSN_FUNCS(swp, 0x3F20FC00, 0x38208000) __AARCH64_INSN_FUNCS(cas, 0x3FA07C00, 0x08A07C00) +__AARCH64_INSN_FUNCS(casp, 0xBFA07C00, 0x08207C00) __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) __AARCH64_INSN_FUNCS(signed_ldr_reg, 0X3FE0FC00, 0x38A0E800) __AARCH64_INSN_FUNCS(ldr_imm, 0x3FC00000, 0x39400000) @@ -549,6 +551,24 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn) aarch64_insn_is_prfm_lit(insn); } +static __always_inline bool aarch64_insn_is_class_cas(u32 insn) +{ + return aarch64_insn_is_cas(insn) || + aarch64_insn_is_casp(insn); +} + +/* + * Exclude unallocated atomic instructions and LD64B/LDAPR. + * The masks and values were generated by using Python sympy module. 
+ */ +static __always_inline bool aarch64_atomic_insn_has_wr_perm(u32 insn) +{ + return ((insn & 0x3f207c00) == 0x38200000) || + ((insn & 0x3f208c00) == 0x38200000) || + ((insn & 0x7fe06c00) == 0x78202000) || + ((insn & 0xbf204c00) == 0x38200000); +} + enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn); u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn); u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 451ba7cbd5ad..6a8b71917e3b 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -500,6 +500,34 @@ static bool is_write_abort(unsigned long esr) return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM); } +static bool is_el0_atomic_instr(struct pt_regs *regs) +{ + u32 insn; + __le32 insn_le; + unsigned long pc = instruction_pointer(regs); + + if (compat_user_mode(regs)) + return false; + + pagefault_disable(); + if (get_user(insn_le, (__le32 __user *)pc)) { + pagefault_enable(); + return false; + } + pagefault_enable(); + + insn = le32_to_cpu(insn_le); + + if (aarch64_insn_is_class_atomic(insn) && + aarch64_atomic_insn_has_wr_perm(insn)) + return true; + + if (aarch64_insn_is_class_cas(insn)) + return true; + + return false; +} + static int __kprobes do_page_fault(unsigned long far, unsigned long esr, struct pt_regs *regs) { @@ -568,6 +596,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, if (!vma) goto lock_mmap; + if ((vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE) && + is_el0_atomic_instr(regs)) { + vm_flags = VM_WRITE; + mm_flags |= FAULT_FLAG_WRITE; + } + if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); fault = 0;