Message ID | 20240919121719.2148361-1-liaochang1@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: uprobes: Optimize cache flushes for xol slot |
On 09/19, Liao Chang wrote:
>
> --- a/arch/arm64/kernel/probes/uprobes.c
> +++ b/arch/arm64/kernel/probes/uprobes.c
> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  	void *xol_page_kaddr = kmap_atomic(page);
>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>
> +	if (!memcmp(dst, src, len))
> +		goto done;

can't really comment, I know nothing about arm64...

but don't we need to change __create_xol_area()

-	area->page = alloc_page(GFP_HIGHUSER);
+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);

to avoid the false positives?

Oleg.
On 2024/9/19 22:18, Oleg Nesterov wrote:
> On 09/19, Liao Chang wrote:
>>
>> --- a/arch/arm64/kernel/probes/uprobes.c
>> +++ b/arch/arm64/kernel/probes/uprobes.c
>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>  	void *xol_page_kaddr = kmap_atomic(page);
>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>
>> +	if (!memcmp(dst, src, len))
>> +		goto done;
>
> can't really comment, I know nothing about arm64...
>
> but don't we need to change __create_xol_area()
>
> -	area->page = alloc_page(GFP_HIGHUSER);
> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>
> to avoid the false positives?

Indeed, it would be safer.

Could we tolerate these false positives? Even if the page is not reset
to zero bits, if the existing bits are the same as the instruction being
copied, it still can execute the correct instruction.

>
> Oleg.
>
>
On 09/20, Liao, Chang wrote:
>
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > -	area->page = alloc_page(GFP_HIGHUSER);
> > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it still can execute the correct instruction.

OK, agreed, the task should see the same data after page fault.

Oleg.
On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>
>
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > -	area->page = alloc_page(GFP_HIGHUSER);
> > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it still can execute the correct instruction.

Not if the I-cache has stale data. If alloc_page() returns a page with
some random data that resembles a valid instruction but there was never
a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
the compare (on the D-cache side) succeeds or not.

I think using __GFP_ZERO should do the trick. All 0s is a permanently
undefined instruction, not something we'd use with xol.
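To make the "false positive" in this sub-thread concrete, here is a minimal userspace sketch, not kernel code. It only models the memcmp() check from the patch. The helper name `would_skip_flush()` and the two sample slot values are hypothetical; 0xd503201f is the A64 NOP encoding, used as a stand-in for a probed instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 4	/* one A64 instruction */

/*
 * Hypothetical helper: returns 1 when the patched fast path would skip
 * both the copy and the cache flush because the slot already holds the
 * same bytes as the instruction.
 */
static int would_skip_flush(const void *slot, const void *insn)
{
	assert(slot && insn);
	return memcmp(slot, insn, SLOT_BYTES) == 0;
}

/* A64 NOP, standing in for the instruction being copied to the slot. */
static const uint32_t insn_nop = 0xd503201f;

/* Illustrative slot contents right after page allocation. */
static const uint32_t slot_zeroed = 0x00000000;	/* page allocated with __GFP_ZERO */
static const uint32_t slot_stale  = 0xd503201f;	/* leftover bytes that happen to match */
```

With a zeroed page the first write of any real instruction can never match, so it always takes the copy-and-flush path. With leftover bytes, the compare can succeed before anything was ever written to the slot, which is the case Oleg asks about.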
On 09/20, Catalin Marinas wrote:
>
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >
> >
> > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > On 09/19, Liao Chang wrote:
> > >>
> > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > >>
> > >> +	if (!memcmp(dst, src, len))
> > >> +		goto done;
> > >
> > > can't really comment, I know nothing about arm64...
> > >
> > > but don't we need to change __create_xol_area()
> > >
> > > -	area->page = alloc_page(GFP_HIGHUSER);
> > > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > >
> > > to avoid the false positives?
> >
> > Indeed, it would be safer.
> >
> > Could we tolerate these false positives? Even if the page is not reset
> > to zero bits, if the existing bits are the same as the instruction being
> > copied, it still can execute the correct instruction.
>
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

But shouldn't the page fault paths on arm64 flush I-cache?

If alloc_page() returns a page with some random data that resembles a valid
instruction, user-space can't execute this instruction until
special_mapping_fault() installs the page allocated in __create_xol_area().

Again, I know nothing about arm64/icache/etc, I am just curious and trying
to understand...

Oleg.
On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
> On 09/20, Catalin Marinas wrote:
> >
> > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> > >
> > >
> > > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > > On 09/19, Liao Chang wrote:
> > > >>
> > > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > > >>
> > > >> +	if (!memcmp(dst, src, len))
> > > >> +		goto done;
> > > >
> > > > can't really comment, I know nothing about arm64...
> > > >
> > > > but don't we need to change __create_xol_area()
> > > >
> > > > -	area->page = alloc_page(GFP_HIGHUSER);
> > > > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > > >
> > > > to avoid the false positives?
> > >
> > > Indeed, it would be safer.
> > >
> > > Could we tolerate these false positives? Even if the page is not reset
> > > to zero bits, if the existing bits are the same as the instruction being
> > > copied, it still can execute the correct instruction.
> >
> > Not if the I-cache has stale data. If alloc_page() returns a page with
> > some random data that resembles a valid instruction but there was never
> > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > the compare (on the D-cache side) succeeds or not.
>
> But shouldn't the page fault paths on arm64 flush I-cache?
>
> If alloc_page() returns a page with some random data that resembles a valid
> instruction, user-space can't execute this instruction until
> special_mapping_fault() installs the page allocated in __create_xol_area().
>
> Again, I know nothing about arm64/icache/etc, I am just curious and trying
> to understand...

We defer the icache maintenance until set_pte_at() time, where we call
__sync_icache_dcache() if we're installing a present, executable user
entry. That also elides the maintenance if PG_arch_1 is set (i.e. the
kernel only takes responsibility for the freshly allocated page).

Will

>
> Oleg.
>
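The deferred maintenance Will describes can be pictured with a small toy model. This is a userspace sketch under stated assumptions, not the arm64 implementation: `struct toy_page`, `toy_sync_icache_dcache()` and `toy_set_pte_at()` are hypothetical names, the boolean stands in for PG_arch_1, and a counter stands in for the real cache maintenance.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the one flag matters here. */
struct toy_page {
	bool arch_1_set;	/* models PG_arch_1: page already synced once */
	int icache_syncs;	/* how many times maintenance actually ran */
};

/* Models the check made when an executable user PTE is installed. */
static void toy_sync_icache_dcache(struct toy_page *page)
{
	assert(page);
	if (page->arch_1_set)
		return;			/* flag set: maintenance is elided */
	page->icache_syncs++;		/* stands in for the D/I cache sync */
	page->arch_1_set = true;
}

/* Models set_pte_at(): only present, executable user entries trigger it. */
static void toy_set_pte_at(struct toy_page *page, bool present, bool user_exec)
{
	if (present && user_exec)
		toy_sync_icache_dcache(page);
}
```

In this model the first executable mapping of a fresh page pays for one sync and later mappings of the same page pay nothing. Any later write to the page, such as copying a new instruction into an xol slot, is outside what this path covers and needs its own maintenance.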
On 09/22, Will Deacon wrote:
>
> On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
> > On 09/20, Catalin Marinas wrote:
> > >
> > > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> > > >
> > > >
> > > > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > > > On 09/19, Liao Chang wrote:
> > > > >>
> > > > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > > > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > > > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > > > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > > > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > > > >>
> > > > >> +	if (!memcmp(dst, src, len))
> > > > >> +		goto done;
> > > > >
> > > > > can't really comment, I know nothing about arm64...
> > > > >
> > > > > but don't we need to change __create_xol_area()
> > > > >
> > > > > -	area->page = alloc_page(GFP_HIGHUSER);
> > > > > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > > > >
> > > > > to avoid the false positives?
> > > >
> > > > Indeed, it would be safer.
> > > >
> > > > Could we tolerate these false positives? Even if the page is not reset
> > > > to zero bits, if the existing bits are the same as the instruction being
> > > > copied, it still can execute the correct instruction.
> > >
> > > Not if the I-cache has stale data. If alloc_page() returns a page with
> > > some random data that resembles a valid instruction but there was never
> > > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > > the compare (on the D-cache side) succeeds or not.
> >
> > But shouldn't the page fault paths on arm64 flush I-cache?
> >
> > If alloc_page() returns a page with some random data that resembles a valid
> > instruction, user-space can't execute this instruction until
> > special_mapping_fault() installs the page allocated in __create_xol_area().
> >
> > Again, I know nothing about arm64/icache/etc, I am just curious and trying
> > to understand...
>
> We defer the icache maintenance until set_pte_at() time, where we call
> __sync_icache_dcache() if we're installing a present, executable user
> entry.

And to me this looks as if __sync_icache_dcache() must be called when user
space tries to fault-in the page allocated in __create_xol_area() and
returned by special_mapping_fault().

So I still don't understand the problem.

Oleg.
On 2024/9/20 23:32, Catalin Marinas wrote:
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>
>>
>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>> On 09/19, Liao Chang wrote:
>>>>
>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>
>>>> +	if (!memcmp(dst, src, len))
>>>> +		goto done;
>>>
>>> can't really comment, I know nothing about arm64...
>>>
>>> but don't we need to change __create_xol_area()
>>>
>>> -	area->page = alloc_page(GFP_HIGHUSER);
>>> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>
>>> to avoid the false positives?
>>
>> Indeed, it would be safer.
>>
>> Could we tolerate these false positives? Even if the page is not reset
>> to zero bits, if the existing bits are the same as the instruction being
>> copied, it still can execute the correct instruction.
>
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

Absolutely right, I overlooked that the comparison is still performed in the D-cache.
However, the most important thing is ensuring the I-cache sees the accurate bits,
which is why a cache flush is necessary for each xol slot.

>
> I think using __GFP_ZERO should do the trick. All 0s is a permanently
> undefined instruction, not something we'd use with xol.

Unfortunately, the comparison assumes the D-cache and I-cache are already
in sync for the slot being copied. But this assumption is flawed if we start
with a page with some random bits and D-cache has not been synchronized with
I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
after page allocation?

>
On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
> On 2024/9/20 23:32, Catalin Marinas wrote:
> > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >> On 2024/9/19 22:18, Oleg Nesterov wrote:
> >>> On 09/19, Liao Chang wrote:
> >>>> --- a/arch/arm64/kernel/probes/uprobes.c
> >>>> +++ b/arch/arm64/kernel/probes/uprobes.c
> >>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>>>  	void *xol_page_kaddr = kmap_atomic(page);
> >>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>>>
> >>>> +	if (!memcmp(dst, src, len))
> >>>> +		goto done;
> >>>
> >>> can't really comment, I know nothing about arm64...
> >>>
> >>> but don't we need to change __create_xol_area()
> >>>
> >>> -	area->page = alloc_page(GFP_HIGHUSER);
> >>> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >>>
> >>> to avoid the false positives?
> >>
> >> Indeed, it would be safer.
> >>
> >> Could we tolerate these false positives? Even if the page is not reset
> >> to zero bits, if the existing bits are the same as the instruction being
> >> copied, it still can execute the correct instruction.
> >
> > Not if the I-cache has stale data. If alloc_page() returns a page with
> > some random data that resembles a valid instruction but there was never
> > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > the compare (on the D-cache side) succeeds or not.
>
> Absolutely right, I overlooked that the comparison is still performed in the D-cache.
> However, the most important thing is ensuring the I-cache sees the accurate bits,
> which is why a cache flush is necessary for each xol slot.
>
> >
> > I think using __GFP_ZERO should do the trick. All 0s is a permanently
> > undefined instruction, not something we'd use with xol.
>
> Unfortunately, the comparison assumes the D-cache and I-cache are already
> in sync for the slot being copied. But this assumption is flawed if we start
> with a page with some random bits and D-cache has not been synchronized with
> I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
> after page allocation?

No, I think Oleg's right. The initial cache maintenance will happen when the
executable pte is installed. However, we should use __GFP_ZERO anyway
because I don't think it's a good idea to map an uninitialised page into
userspace.

Will
On 09/23, Will Deacon wrote:
>
> However, we should use __GFP_ZERO anyway
> because I don't think it's a good idea to map an uninitialised page into
> userspace.

Agreed, and imo this even needs a separate "fix info leak" patch.

Oleg.
On 2024/9/22 22:09, Will Deacon wrote:
> On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
>> On 09/20, Catalin Marinas wrote:
>>>
>>> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>>>
>>>>
>>>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>>>> On 09/19, Liao Chang wrote:
>>>>>>
>>>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>>>
>>>>>> +	if (!memcmp(dst, src, len))
>>>>>> +		goto done;
>>>>>
>>>>> can't really comment, I know nothing about arm64...
>>>>>
>>>>> but don't we need to change __create_xol_area()
>>>>>
>>>>> -	area->page = alloc_page(GFP_HIGHUSER);
>>>>> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>>>
>>>>> to avoid the false positives?
>>>>
>>>> Indeed, it would be safer.
>>>>
>>>> Could we tolerate these false positives? Even if the page is not reset
>>>> to zero bits, if the existing bits are the same as the instruction being
>>>> copied, it still can execute the correct instruction.
>>>
>>> Not if the I-cache has stale data. If alloc_page() returns a page with
>>> some random data that resembles a valid instruction but there was never
>>> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
>>> the compare (on the D-cache side) succeeds or not.
>>
>> But shouldn't the page fault paths on arm64 flush I-cache?
>>
>> If alloc_page() returns a page with some random data that resembles a valid
>> instruction, user-space can't execute this instruction until
>> special_mapping_fault() installs the page allocated in __create_xol_area().
>>
>> Again, I know nothing about arm64/icache/etc, I am just curious and trying
>> to understand...
>
> We defer the icache maintenance until set_pte_at() time, where we call
> __sync_icache_dcache() if we're installing a present, executable user
> entry. That also elides the maintenance if PG_arch_1 is set (i.e. the
> kernel only takes responsibility for the freshly allocated page).

The newly allocated page should always have PG_arch_1 cleared, correct?
Is it possible for alloc_page() to return a page with PG_arch_1 set in
the current arm64 kernel?

>
> Will
>
>>
>> Oleg.
>>
On 2024/9/23 15:18, Will Deacon wrote:
> On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
>> On 2024/9/20 23:32, Catalin Marinas wrote:
>>> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>>>> On 09/19, Liao Chang wrote:
>>>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>>>
>>>>>> +	if (!memcmp(dst, src, len))
>>>>>> +		goto done;
>>>>>
>>>>> can't really comment, I know nothing about arm64...
>>>>>
>>>>> but don't we need to change __create_xol_area()
>>>>>
>>>>> -	area->page = alloc_page(GFP_HIGHUSER);
>>>>> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>>>
>>>>> to avoid the false positives?
>>>>
>>>> Indeed, it would be safer.
>>>>
>>>> Could we tolerate these false positives? Even if the page is not reset
>>>> to zero bits, if the existing bits are the same as the instruction being
>>>> copied, it still can execute the correct instruction.
>>>
>>> Not if the I-cache has stale data. If alloc_page() returns a page with
>>> some random data that resembles a valid instruction but there was never
>>> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
>>> the compare (on the D-cache side) succeeds or not.
>>
>> Absolutely right, I overlooked that the comparison is still performed in the D-cache.
>> However, the most important thing is ensuring the I-cache sees the accurate bits,
>> which is why a cache flush is necessary for each xol slot.
>>
>>>
>>> I think using __GFP_ZERO should do the trick. All 0s is a permanently
>>> undefined instruction, not something we'd use with xol.
>>
>> Unfortunately, the comparison assumes the D-cache and I-cache are already
>> in sync for the slot being copied. But this assumption is flawed if we start
>> with a page with some random bits and D-cache has not been synchronized with
>> I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
>> after page allocation?
>
> No, I think Oleg's right. The initial cache maintenance will happen when the
> executable pte is installed. However, we should use __GFP_ZERO anyway
> because I don't think it's a good idea to map an uninitialised page into
> userspace.

I will use __GFP_ZERO for xol page allocation in v2.

>
> Will
>
On Mon, Sep 23, 2024 at 08:18:57AM +0100, Will Deacon wrote:
> On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
> > Unfortunately, the comparison assumes the D-cache and I-cache are already
> > in sync for the slot being copied. But this assumption is flawed if we start
> > with a page with some random bits and D-cache has not been synchronized with
> > I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
> > after page allocation?
>
> No, I think Oleg's right. The initial cache maintenance will happen when the
> executable pte is installed.

For some reason I had kprobes in mind, did not realise that this page
ends up in user-space. So yes, we have cache maintenance when the pte is
set to point to this page. Subsequent changes will need cache
maintenance.

> However, we should use __GFP_ZERO anyway because I don't think it's a
> good idea to map an uninitialised page into userspace.

Oh, that's not good.
On 2024/9/23 18:52, Oleg Nesterov wrote:
> On 09/23, Will Deacon wrote:
>>
>> However, we should use __GFP_ZERO anyway
>> because I don't think it's a good idea to map an uninitialised page into
>> userspace.
>
> Agreed, and imo this even needs a separate "fix info leak" patch.

Do you mean to fill the entire page with CPU specific illegal instructions
in this patch?

>
> Oleg.
>
>
On 09/26, Liao, Chang wrote:
>
> On 2024/9/23 18:52, Oleg Nesterov wrote:
> > On 09/23, Will Deacon wrote:
> >>
> >> However, we should use __GFP_ZERO anyway
> >> because I don't think it's a good idea to map an uninitialised page into
> >> userspace.
> >
> > Agreed, and imo this even needs a separate "fix info leak" patch.
>
> Do you mean to fill the entire page with CPU specific illegal instructions
> in this patch?

Hmm. Why?? No...

OK, I'll write the changelog and send the trivial patch in a minute.

Oleg.
Hi, Will and Catalin

On 2024/9/19 20:17, Liao Chang wrote:
> On 09/23, Will Deacon wrote:
>> However, we should use __GFP_ZERO anyway
>> because I don't think it's a good idea to map an uninitialised page into
>> userspace.
> Agreed, and imo this even needs a separate "fix info leak" patch.
>
> Oleg.

Given that Oleg's "fix info leak" patch has been merged [1], the risk of leakage
is gone, so I am looking forward to your opinions about this patch. Since many
functions start with the same instructions, such as 'stp fp, lr, [sp, #imm]' or
'paciasp', I think this patch could avoid unnecessary D/I cache synchronization.

[1] https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/
On Wed, Nov 06, 2024 at 05:55:16PM +0800, Liao, Chang wrote:
> On 2024/9/19 20:17, Liao Chang wrote:
> > On 09/23, Will Deacon wrote:
> >> However, we should use __GFP_ZERO anyway
> >> because I don't think it's a good idea to map an uninitialised page into
> >> userspace.
> > Agreed, and imo this even needs a separate "fix info leak" patch.
> >
> > Oleg.
>
> Given that Oleg's "fix info leak" patch has been merged [1], the risk of leakage
> is gone, so I am looking forward to your opinions about this patch. Since many
> functions start with the same instructions, such as 'stp fp, lr, [sp, #imm]' or
> 'paciasp', I think this patch could avoid unnecessary D/I cache synchronization.
>
> [1] https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/

The patch is fine with the fix in __create_xol_area(). But please add a
comment on why it is safe to skip the cache maintenance, something like
"the initial cache maintenance was done via set_pte_at()" (well, I can
do this when applying).
On Thu, 19 Sep 2024 12:17:19 +0000, Liao Chang wrote:
> The profiling of single-thread selftests bench reveals a bottleneck in
> caches_clean_inval_pou() on ARM64. On my local testing machine, this
> function takes approximately 34% of CPU cycles for trig-uprobe-nop and
> trig-uprobe-push.
>
> This patch adds a check to avoid unnecessary cache flush when writing
> instruction to the xol slot. If the instruction is the same as the
> existing instruction in slot, there is no need to synchronize D/I cache.
> xol slot allocation and updates occur on the hot path of uprobe
> handling. The upstream kernel running on Kunpeng916 (Hi1616), 4 NUMA
> nodes, 64 cores @ 2.4GHz shows that this optimization has an obvious
> gain for the nop and push testcases.
>
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: uprobes: Optimize cache flushes for xol slot
      https://git.kernel.org/arm64/c/bdf94836c22a
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..5ee27509d6f6 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 	void *xol_page_kaddr = kmap_atomic(page);
 	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
 
+	if (!memcmp(dst, src, len))
+		goto done;
+
 	/* Initialize the slot */
 	memcpy(dst, src, len);
 
 	/* flush caches (dcache/icache) */
 	sync_icache_aliases((unsigned long)dst, (unsigned long)dst + len);
 
+done:
 	kunmap_atomic(xol_page_kaddr);
 }
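The effect of the added check can be sketched in userspace by replacing the cache flush with a counter. This is a rough model under stated assumptions, not the kernel function: `toy_copy_ixol()`, `toy_sync_icache()`, `toy_flush_count` and the 4 KiB `TOY_PAGE_SIZE` are hypothetical, and the page is a plain byte array rather than a kmap'd `struct page`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL			/* assumed page size for the model */
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

static unsigned long toy_flush_count;		/* counts simulated cache flushes */

/* Stand-in for sync_icache_aliases(); it only counts calls. */
static void toy_sync_icache(void *start, size_t len)
{
	(void)start;
	(void)len;
	toy_flush_count++;
}

/* Userspace model of the patched arch_uprobe_copy_ixol(). */
static void toy_copy_ixol(uint8_t *page, unsigned long vaddr,
			  const void *src, size_t len)
{
	unsigned long off = vaddr & ~TOY_PAGE_MASK;	/* slot offset within the page */
	uint8_t *dst = page + off;

	assert(off + len <= TOY_PAGE_SIZE);

	if (!memcmp(dst, src, len))
		return;		/* same bytes already in the slot: skip copy and flush */

	memcpy(dst, src, len);		/* initialize the slot */
	toy_sync_icache(dst, len);	/* would be the D/I cache sync */
}
```

Writing the same instruction to the same slot twice costs one simulated flush instead of two, which is the saving the patch targets when many probed functions begin with the same instruction.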
The profiling of single-thread selftests bench reveals a bottleneck in
caches_clean_inval_pou() on ARM64. On my local testing machine, this
function takes approximately 34% of CPU cycles for trig-uprobe-nop and
trig-uprobe-push.

This patch adds a check to avoid unnecessary cache flush when writing
instruction to the xol slot. If the instruction is the same as the
existing instruction in slot, there is no need to synchronize D/I cache.
xol slot allocation and updates occur on the hot path of uprobe
handling. The upstream kernel running on Kunpeng916 (Hi1616), 4 NUMA
nodes, 64 cores @ 2.4GHz shows that this optimization has an obvious
gain for the nop and push testcases.

Before (next-20240918)
----------------------
uprobe-nop      ( 1 cpus):    0.418 ± 0.001M/s  (  0.418M/s/cpu)
uprobe-push     ( 1 cpus):    0.411 ± 0.005M/s  (  0.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.350 ± 0.000M/s  (  0.350M/s/cpu)
uretprobe-push  ( 1 cpus):    0.353 ± 0.000M/s  (  0.353M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.074 ± 0.001M/s  (  1.074M/s/cpu)

After
-----
uprobe-nop      ( 1 cpus):    0.926 ± 0.000M/s  (  0.926M/s/cpu)
uprobe-push     ( 1 cpus):    0.910 ± 0.001M/s  (  0.910M/s/cpu)
uprobe-ret      ( 1 cpus):    2.056 ± 0.001M/s  (  2.056M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.653 ± 0.001M/s  (  0.653M/s/cpu)
uretprobe-push  ( 1 cpus):    0.645 ± 0.000M/s  (  0.645M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.093 ± 0.001M/s  (  1.093M/s/cpu)

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/kernel/probes/uprobes.c | 4 ++++
 1 file changed, 4 insertions(+)
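For a quick read of the numbers above, the before/after throughputs can be turned into ratios. The sketch below only restates the figures from the commit message; the `bench_row` struct and `speedup()` helper are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

struct bench_row {
	const char *name;
	double before;	/* M/s on next-20240918 */
	double after;	/* M/s with the patch applied */
};

/* Throughput figures copied from the commit message. */
static const struct bench_row rows[] = {
	{ "uprobe-nop",     0.418, 0.926 },
	{ "uprobe-push",    0.411, 0.910 },
	{ "uprobe-ret",     2.052, 2.056 },
	{ "uretprobe-nop",  0.350, 0.653 },
	{ "uretprobe-push", 0.353, 0.645 },
	{ "uretprobe-ret",  1.074, 1.093 },
};

/* Ratio of patched to unpatched throughput for one benchmark row. */
static double speedup(const struct bench_row *r)
{
	assert(r && r->before > 0.0);
	return r->after / r->before;
}
```

By these ratios the uprobe nop and push cases improve by roughly 2.2x and the uretprobe nop and push cases by roughly 1.8x to 1.9x, while the ret cases stay within about 2% of their previous throughput.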