Message ID | 20240919121719.2148361-1-liaochang1@huawei.com |
---|---|
State | New |
Series | arm64: uprobes: Optimize cache flushes for xol slot |
On 09/19, Liao Chang wrote:
>
> --- a/arch/arm64/kernel/probes/uprobes.c
> +++ b/arch/arm64/kernel/probes/uprobes.c
> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  	void *xol_page_kaddr = kmap_atomic(page);
>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>
> +	if (!memcmp(dst, src, len))
> +		goto done;

can't really comment, I know nothing about arm64...

but don't we need to change __create_xol_area()

	-	area->page = alloc_page(GFP_HIGHUSER);
	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);

to avoid the false positives?

Oleg.
On 2024/9/19 22:18, Oleg Nesterov wrote:
> On 09/19, Liao Chang wrote:
>>
>> --- a/arch/arm64/kernel/probes/uprobes.c
>> +++ b/arch/arm64/kernel/probes/uprobes.c
>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>  	void *xol_page_kaddr = kmap_atomic(page);
>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>
>> +	if (!memcmp(dst, src, len))
>> +		goto done;
>
> can't really comment, I know nothing about arm64...
>
> but don't we need to change __create_xol_area()
>
> -	area->page = alloc_page(GFP_HIGHUSER);
> +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>
> to avoid the false positives?

Indeed, it would be safer.

Could we tolerate these false positives? Even if the page is not reset
to zero bits, if the existing bits are the same as the instruction being
copied, it will still execute the correct instruction.
On 09/20, Liao, Chang wrote:
>
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > -	area->page = alloc_page(GFP_HIGHUSER);
> > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it will still execute the correct instruction.

OK, agreed, the task should see the same data after the page fault.

Oleg.
On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > -	area->page = alloc_page(GFP_HIGHUSER);
> > +	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it will still execute the correct instruction.

Not if the I-cache has stale data. If alloc_page() returns a page with
some random data that resembles a valid instruction but there was never
a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
the compare (on the D-cache side) succeeds or not.

I think using __GFP_ZERO should do the trick. All 0s is a permanently
undefined instruction, not something we'd use with xol.
On 09/20, Catalin Marinas wrote:
>
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >
> > Could we tolerate these false positives? Even if the page is not reset
> > to zero bits, if the existing bits are the same as the instruction being
> > copied, it will still execute the correct instruction.
>
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

But shouldn't the page fault paths on arm64 flush the I-cache?

If alloc_page() returns a page with some random data that resembles a
valid instruction, user-space can't execute this instruction until
special_mapping_fault() installs the page allocated in
__create_xol_area().

Again, I know nothing about arm64/icache/etc, I am just curious and
trying to understand...

Oleg.
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..5ee27509d6f6 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 	void *xol_page_kaddr = kmap_atomic(page);
 	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);

+	if (!memcmp(dst, src, len))
+		goto done;
+
 	/* Initialize the slot */
 	memcpy(dst, src, len);

 	/* flush caches (dcache/icache) */
 	sync_icache_aliases((unsigned long)dst, (unsigned long)dst + len);

+done:
 	kunmap_atomic(xol_page_kaddr);
 }
Profiling of the single-thread selftests bench reveals a bottleneck in
caches_clean_inval_pou() on ARM64. On my local testing machine, this
function takes approximately 34% of CPU cycles for trig-uprobe-nop and
trig-uprobe-push.

This patch adds a check to avoid an unnecessary cache flush when
writing an instruction to the xol slot. If the instruction is the same
as the one already in the slot, there is no need to synchronize the
D/I caches. Since xol slot allocation and updates occur on the hot
path of uprobe handling, the upstream kernel running on Kunpeng916
(Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz, shows this optimization
has an obvious gain for the nop and push testcases.

Before (next-20240918)
----------------------
uprobe-nop      ( 1 cpus):    0.418 ± 0.001M/s  (  0.418M/s/cpu)
uprobe-push     ( 1 cpus):    0.411 ± 0.005M/s  (  0.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.350 ± 0.000M/s  (  0.350M/s/cpu)
uretprobe-push  ( 1 cpus):    0.353 ± 0.000M/s  (  0.353M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.074 ± 0.001M/s  (  1.074M/s/cpu)

After
-----
uprobe-nop      ( 1 cpus):    0.926 ± 0.000M/s  (  0.926M/s/cpu)
uprobe-push     ( 1 cpus):    0.910 ± 0.001M/s  (  0.910M/s/cpu)
uprobe-ret      ( 1 cpus):    2.056 ± 0.001M/s  (  2.056M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.653 ± 0.001M/s  (  0.653M/s/cpu)
uretprobe-push  ( 1 cpus):    0.645 ± 0.000M/s  (  0.645M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.093 ± 0.001M/s  (  1.093M/s/cpu)

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/kernel/probes/uprobes.c | 4 ++++
 1 file changed, 4 insertions(+)