arm64: uprobes: Optimize cache flushes for xol slot

Message ID 20240919121719.2148361-1-liaochang1@huawei.com (mailing list archive)
State New
Series arm64: uprobes: Optimize cache flushes for xol slot

Commit Message

Liao, Chang Sept. 19, 2024, 12:17 p.m. UTC
The profiling of the single-threaded selftests bench reveals a bottleneck in
caches_clean_inval_pou() on ARM64. On my local testing machine, this
function takes approximately 34% of CPU cycles for trig-uprobe-nop and
trig-uprobe-push.

This patch adds a check to avoid an unnecessary cache flush when writing
an instruction to the xol slot. If the instruction is the same as the
existing instruction in the slot, there is no need to synchronize the D/I
caches. Since xol slot allocation and updates occur on the hot path of
uprobe handling, the upstream kernel running on Kunpeng916 (Hi1616),
4 NUMA nodes, 64 cores @ 2.4GHz shows an obvious gain from this
optimization for the nop and push testcases.

Before (next-20240918)
----------------------
uprobe-nop      ( 1 cpus):    0.418 ± 0.001M/s  (  0.418M/s/cpu)
uprobe-push     ( 1 cpus):    0.411 ± 0.005M/s  (  0.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.350 ± 0.000M/s  (  0.350M/s/cpu)
uretprobe-push  ( 1 cpus):    0.353 ± 0.000M/s  (  0.353M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.074 ± 0.001M/s  (  1.074M/s/cpu)

After
-----
uprobe-nop      ( 1 cpus):    0.926 ± 0.000M/s  (  0.926M/s/cpu)
uprobe-push     ( 1 cpus):    0.910 ± 0.001M/s  (  0.910M/s/cpu)
uprobe-ret      ( 1 cpus):    2.056 ± 0.001M/s  (  2.056M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.653 ± 0.001M/s  (  0.653M/s/cpu)
uretprobe-push  ( 1 cpus):    0.645 ± 0.000M/s  (  0.645M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.093 ± 0.001M/s  (  1.093M/s/cpu)

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/kernel/probes/uprobes.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Oleg Nesterov Sept. 19, 2024, 2:18 p.m. UTC | #1
On 09/19, Liao Chang wrote:
>
> --- a/arch/arm64/kernel/probes/uprobes.c
> +++ b/arch/arm64/kernel/probes/uprobes.c
> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  	void *xol_page_kaddr = kmap_atomic(page);
>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>
> +	if (!memcmp(dst, src, len))
> +		goto done;

can't really comment, I know nothing about arm64...

but don't we need to change __create_xol_area()

	-	area->page = alloc_page(GFP_HIGHUSER);
	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);

to avoid the false positives?

Oleg.
Liao, Chang Sept. 20, 2024, 8:58 a.m. UTC | #2
在 2024/9/19 22:18, Oleg Nesterov 写道:
> On 09/19, Liao Chang wrote:
>>
>> --- a/arch/arm64/kernel/probes/uprobes.c
>> +++ b/arch/arm64/kernel/probes/uprobes.c
>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>  	void *xol_page_kaddr = kmap_atomic(page);
>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>
>> +	if (!memcmp(dst, src, len))
>> +		goto done;
> 
> can't really comment, I know nothing about arm64...
> 
> but don't we need to change __create_xol_area()
> 
> 	-	area->page = alloc_page(GFP_HIGHUSER);
> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> 
> to avoid the false positives?

Indeed, it would be safer.

Could we tolerate these false positives? Even if the page is not reset
to zero bits, if the existing bits are the same as the instruction being
copied, it can still execute the correct instruction.

> 
> Oleg.
> 
>
Oleg Nesterov Sept. 20, 2024, 11:03 a.m. UTC | #3
On 09/20, Liao, Chang wrote:
>
> 在 2024/9/19 22:18, Oleg Nesterov 写道:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page are not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it still can execute the correct instruction.

OK, agreed, the task should see the same data after the page fault.

Oleg.
Catalin Marinas Sept. 20, 2024, 3:32 p.m. UTC | #4
On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> 
> 
> 在 2024/9/19 22:18, Oleg Nesterov 写道:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> > 
> > can't really comment, I know nothing about arm64...
> > 
> > but don't we need to change __create_xol_area()
> > 
> > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > 
> > to avoid the false positives?
> 
> Indeed, it would be safer.
> 
> Could we tolerate these false positives? Even if the page are not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it still can execute the correct instruction.

Not if the I-cache has stale data. If alloc_page() returns a page with
some random data that resembles a valid instruction but there was never
a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
the compare (on the D-cache side) succeeds or not.

I think using __GFP_ZERO should do the trick. All 0s is a permanently
undefined instruction, not something we'd use with xol.
Oleg Nesterov Sept. 20, 2024, 5:32 p.m. UTC | #5
On 09/20, Catalin Marinas wrote:
>
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >
> >
> > 在 2024/9/19 22:18, Oleg Nesterov 写道:
> > > On 09/19, Liao Chang wrote:
> > >>
> > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > >>
> > >> +	if (!memcmp(dst, src, len))
> > >> +		goto done;
> > >
> > > can't really comment, I know nothing about arm64...
> > >
> > > but don't we need to change __create_xol_area()
> > >
> > > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > >
> > > to avoid the false positives?
> >
> > Indeed, it would be safer.
> >
> > Could we tolerate these false positives? Even if the page are not reset
> > to zero bits, if the existing bits are the same as the instruction being
> > copied, it still can execute the correct instruction.
>
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

But shouldn't the page fault paths on arm64 flush the I-cache?

If alloc_page() returns a page with some random data that resembles a valid
instruction, user-space can't execute this instruction until
special_mapping_fault() installs the page allocated in __create_xol_area().

Again, I know nothing about arm64/icache/etc, I am just curious and trying
to understand...

Oleg.
Patch

diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..5ee27509d6f6 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -17,12 +17,16 @@  void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 	void *xol_page_kaddr = kmap_atomic(page);
 	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
 
+	if (!memcmp(dst, src, len))
+		goto done;
+
 	/* Initialize the slot */
 	memcpy(dst, src, len);
 
 	/* flush caches (dcache/icache) */
 	sync_icache_aliases((unsigned long)dst, (unsigned long)dst + len);
 
+done:
 	kunmap_atomic(xol_page_kaddr);
 }