arm64: uprobes: Optimize cache flushes for xol slot

Message ID 20240919121719.2148361-1-liaochang1@huawei.com (mailing list archive)
State New, archived

Commit Message

Liao, Chang Sept. 19, 2024, 12:17 p.m. UTC
Profiling the single-threaded selftests bench reveals a bottleneck in
caches_clean_inval_pou() on ARM64. On my local testing machine, this
function takes approximately 34% of CPU cycles for trig-uprobe-nop and
trig-uprobe-push.

This patch adds a check to avoid an unnecessary cache flush when writing
an instruction to the xol slot. If the instruction is the same as the
existing instruction in the slot, there is no need to synchronize the
D/I caches. Since xol slot allocation and updates occur on the hot path
of uprobe handling, this optimization yields a clear gain for the nop
and push test cases on an upstream kernel running on Kunpeng916
(Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz.
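
The check described above can be illustrated outside the kernel. The
following is a minimal userspace sketch, not the kernel code:
copy_to_slot() and fake_sync_icache() are hypothetical names, and the
cache maintenance is replaced by a counter so that skipped flushes can
be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the D/I-cache sync; it only counts calls. */
static unsigned long nr_flushes;

static void fake_sync_icache(void *start, size_t len)
{
	(void)start;
	(void)len;
	nr_flushes++;
}

/*
 * Copy an instruction into a slot, skipping both the copy and the
 * (simulated) cache maintenance when the slot already holds the same
 * bytes. Returns 1 if the slot was rewritten, 0 if it was left alone.
 */
static int copy_to_slot(void *dst, const void *src, size_t len)
{
	assert(dst && src && len > 0);

	if (!memcmp(dst, src, len))
		return 0;

	memcpy(dst, src, len);
	fake_sync_icache(dst, len);
	return 1;
}
```

Writing the same instruction to a slot twice triggers the simulated
flush only once; the second call returns 0 without touching the slot.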

Before (next-20240918)
----------------------
uprobe-nop      ( 1 cpus):    0.418 ± 0.001M/s  (  0.418M/s/cpu)
uprobe-push     ( 1 cpus):    0.411 ± 0.005M/s  (  0.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.350 ± 0.000M/s  (  0.350M/s/cpu)
uretprobe-push  ( 1 cpus):    0.353 ± 0.000M/s  (  0.353M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.074 ± 0.001M/s  (  1.074M/s/cpu)

After
-----
uprobe-nop      ( 1 cpus):    0.926 ± 0.000M/s  (  0.926M/s/cpu)
uprobe-push     ( 1 cpus):    0.910 ± 0.001M/s  (  0.910M/s/cpu)
uprobe-ret      ( 1 cpus):    2.056 ± 0.001M/s  (  2.056M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.653 ± 0.001M/s  (  0.653M/s/cpu)
uretprobe-push  ( 1 cpus):    0.645 ± 0.000M/s  (  0.645M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.093 ± 0.001M/s  (  1.093M/s/cpu)
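
To make the gain easier to read, the ratios can be computed from the
two tables. This is a small sketch for the arithmetic only; the struct
and function names are made up for illustration, and the figures are
copied from the tables above.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput in M/s, copied from the "Before" and "After" tables. */
struct bench {
	const char *name;
	double before;
	double after;
};

static const struct bench results[] = {
	{ "uprobe-nop",     0.418, 0.926 },
	{ "uprobe-push",    0.411, 0.910 },
	{ "uprobe-ret",     2.052, 2.056 },
	{ "uretprobe-nop",  0.350, 0.653 },
	{ "uretprobe-push", 0.353, 0.645 },
	{ "uretprobe-ret",  1.074, 1.093 },
};

#define NR_RESULTS (sizeof(results) / sizeof(results[0]))

/* Ratio of throughput after the patch to throughput before it. */
static double speedup(const struct bench *b)
{
	assert(b && b->before > 0.0);
	return b->after / b->before;
}
```

The ratios come out at roughly 2.2x for uprobe-nop and uprobe-push,
about 1.8x to 1.9x for uretprobe-nop and uretprobe-push, and
essentially unchanged for the two ret cases.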

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/kernel/probes/uprobes.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Oleg Nesterov Sept. 19, 2024, 2:18 p.m. UTC | #1
On 09/19, Liao Chang wrote:
>
> --- a/arch/arm64/kernel/probes/uprobes.c
> +++ b/arch/arm64/kernel/probes/uprobes.c
> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  	void *xol_page_kaddr = kmap_atomic(page);
>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>
> +	if (!memcmp(dst, src, len))
> +		goto done;

can't really comment, I know nothing about arm64...

but don't we need to change __create_xol_area()

	-	area->page = alloc_page(GFP_HIGHUSER);
	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);

to avoid the false positives?

Oleg.
Liao, Chang Sept. 20, 2024, 8:58 a.m. UTC | #2
On 2024/9/19 22:18, Oleg Nesterov wrote:
> On 09/19, Liao Chang wrote:
>>
>> --- a/arch/arm64/kernel/probes/uprobes.c
>> +++ b/arch/arm64/kernel/probes/uprobes.c
>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>  	void *xol_page_kaddr = kmap_atomic(page);
>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>
>> +	if (!memcmp(dst, src, len))
>> +		goto done;
> 
> can't really comment, I know nothing about arm64...
> 
> but don't we need to change __create_xol_area()
> 
> 	-	area->page = alloc_page(GFP_HIGHUSER);
> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> 
> to avoid the false positives?

Indeed, it would be safer.

Could we tolerate these false positives? Even if the page is not reset
to zero bits, if the existing bits are the same as the instruction being
copied, it can still execute the correct instruction.

> 
> Oleg.
> 
>
Oleg Nesterov Sept. 20, 2024, 11:03 a.m. UTC | #3
On 09/20, Liao, Chang wrote:
>
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> >
> > can't really comment, I know nothing about arm64...
> >
> > but don't we need to change __create_xol_area()
> >
> > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >
> > to avoid the false positives?
>
> Indeed, it would be safer.
>
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it can still execute the correct instruction.

OK, agreed, the task should see the same data after page fault.

Oleg.
Catalin Marinas Sept. 20, 2024, 3:32 p.m. UTC | #4
On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> 
> 
> On 2024/9/19 22:18, Oleg Nesterov wrote:
> > On 09/19, Liao Chang wrote:
> >>
> >> --- a/arch/arm64/kernel/probes/uprobes.c
> >> +++ b/arch/arm64/kernel/probes/uprobes.c
> >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>  	void *xol_page_kaddr = kmap_atomic(page);
> >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>
> >> +	if (!memcmp(dst, src, len))
> >> +		goto done;
> > 
> > can't really comment, I know nothing about arm64...
> > 
> > but don't we need to change __create_xol_area()
> > 
> > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > 
> > to avoid the false positives?
> 
> Indeed, it would be safer.
> 
> Could we tolerate these false positives? Even if the page is not reset
> to zero bits, if the existing bits are the same as the instruction being
> copied, it can still execute the correct instruction.

Not if the I-cache has stale data. If alloc_page() returns a page with
some random data that resembles a valid instruction but there was never
a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
the compare (on the D-cache side) succeeds or not.

I think using __GFP_ZERO should do the trick. All 0s is a permanently
undefined instruction, not something we'd use with xol.
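
As an illustration of the point about all-zero words, here is a small
sketch that classifies A64 instruction words. It assumes the A64
encoding in which UDF covers every word whose upper 16 bits are zero
(the low 16 bits being an ignored immediate); the helper names are
hypothetical and not taken from the kernel.

```c
#include <assert.h>	/* for the checks mentioned in the note below */
#include <stdint.h>
#include <string.h>

#define A64_UDF_MASK	0xffff0000u	/* bits [31:16] are zero for UDF */
#define A64_NOP		0xd503201fu	/* encoding of NOP */

/* True if the word is a permanently undefined (UDF) instruction. */
static int a64_is_udf(uint32_t insn)
{
	return (insn & A64_UDF_MASK) == 0;
}

/*
 * True if a freshly zeroed slot would compare equal to insn, i.e. if the
 * memcmp() shortcut could fire on a slot that was never written.
 */
static int zeroed_slot_matches(uint32_t insn)
{
	uint32_t slot = 0;

	return !memcmp(&slot, &insn, sizeof(insn));
}
```

With a zeroed page, the shortcut can only fire on a never-written slot
if the instruction being copied is itself the all-zero word, which is
a UDF; for example, assert(!zeroed_slot_matches(A64_NOP)) holds.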
Oleg Nesterov Sept. 20, 2024, 5:32 p.m. UTC | #5
On 09/20, Catalin Marinas wrote:
>
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >
> >
> > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > On 09/19, Liao Chang wrote:
> > >>
> > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > >>
> > >> +	if (!memcmp(dst, src, len))
> > >> +		goto done;
> > >
> > > can't really comment, I know nothing about arm64...
> > >
> > > but don't we need to change __create_xol_area()
> > >
> > > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > >
> > > to avoid the false positives?
> >
> > Indeed, it would be safer.
> >
> > Could we tolerate these false positives? Even if the page is not reset
> > to zero bits, if the existing bits are the same as the instruction being
> > copied, it can still execute the correct instruction.
>
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

But shouldn't the page fault paths on arm64 flush I-cache ?

If alloc_page() returns a page with some random data that resembles a valid
instruction, user-space can't execute this instruction until
special_mapping_fault() installs the page allocated in __create_xol_area().

Again, I know nothing about arm64/icache/etc, I am just curious and trying
to understand...

Oleg.
Will Deacon Sept. 22, 2024, 2:09 p.m. UTC | #6
On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
> On 09/20, Catalin Marinas wrote:
> >
> > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> > >
> > >
> > > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > > On 09/19, Liao Chang wrote:
> > > >>
> > > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > > >>
> > > >> +	if (!memcmp(dst, src, len))
> > > >> +		goto done;
> > > >
> > > > can't really comment, I know nothing about arm64...
> > > >
> > > > but don't we need to change __create_xol_area()
> > > >
> > > > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > > > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > > >
> > > > to avoid the false positives?
> > >
> > > Indeed, it would be safer.
> > >
> > > Could we tolerate these false positives? Even if the page is not reset
> > > to zero bits, if the existing bits are the same as the instruction being
> > > copied, it can still execute the correct instruction.
> >
> > Not if the I-cache has stale data. If alloc_page() returns a page with
> > some random data that resembles a valid instruction but there was never
> > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > the compare (on the D-cache side) succeeds or not.
> 
> But shouldn't the page fault paths on arm64 flush I-cache ?
> 
> If alloc_page() returns a page with some random data that resembles a valid
> instruction, user-space can't execute this instruction until
> special_mapping_fault() installs the page allocated in __create_xol_area().
> 
> Again, I know nothing about arm64/icache/etc, I am just curious and trying
> to understand...

We defer the icache maintenance until set_pte_at() time, where we call
__sync_icache_dcache() if we're installing a present, executable user
entry. That also elides the maintenance if PG_arch_1 is set (i.e. the
kernel only takes responsibility for the freshly allocated page).
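
The deferral described above can be modelled in a few lines. This is a
toy userspace model under stated assumptions, not the arm64
implementation: struct toy_page and toy_set_pte() are hypothetical, a
boolean stands in for PG_arch_1 (used on arm64 as a "D-cache clean"
marker), and the maintenance itself is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: one flag standing in for PG_arch_1, plus a sync counter. */
struct toy_page {
	bool dcache_clean;		/* models PG_arch_1 */
	unsigned int icache_syncs;	/* times maintenance was "performed" */
};

/*
 * Model of installing a PTE. Maintenance happens only for a present,
 * executable user mapping, and only if the page has not been synced yet.
 */
static void toy_set_pte(struct toy_page *page, bool user_exec)
{
	assert(page);

	if (!user_exec)
		return;

	if (!page->dcache_clean) {
		page->icache_syncs++;
		page->dcache_clean = true;
	}
}
```

In this model a freshly allocated page (flag clear) is synced exactly
once, when it is first mapped executable; later mappings of the same
page skip the maintenance because the flag is already set.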

Will

> 
> Oleg.
>
Oleg Nesterov Sept. 22, 2024, 2:39 p.m. UTC | #7
On 09/22, Will Deacon wrote:
>
> On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
> > On 09/20, Catalin Marinas wrote:
> > >
> > > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> > > >
> > > >
> > > > On 2024/9/19 22:18, Oleg Nesterov wrote:
> > > > > On 09/19, Liao Chang wrote:
> > > > >>
> > > > >> --- a/arch/arm64/kernel/probes/uprobes.c
> > > > >> +++ b/arch/arm64/kernel/probes/uprobes.c
> > > > >> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > > > >>  	void *xol_page_kaddr = kmap_atomic(page);
> > > > >>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> > > > >>
> > > > >> +	if (!memcmp(dst, src, len))
> > > > >> +		goto done;
> > > > >
> > > > > can't really comment, I know nothing about arm64...
> > > > >
> > > > > but don't we need to change __create_xol_area()
> > > > >
> > > > > 	-	area->page = alloc_page(GFP_HIGHUSER);
> > > > > 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> > > > >
> > > > > to avoid the false positives?
> > > >
> > > > Indeed, it would be safer.
> > > >
> > > > Could we tolerate these false positives? Even if the page is not reset
> > > > to zero bits, if the existing bits are the same as the instruction being
> > > > copied, it can still execute the correct instruction.
> > >
> > > Not if the I-cache has stale data. If alloc_page() returns a page with
> > > some random data that resembles a valid instruction but there was never
> > > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > > the compare (on the D-cache side) succeeds or not.
> >
> > But shouldn't the page fault paths on arm64 flush I-cache ?
> >
> > If alloc_page() returns a page with some random data that resembles a valid
> > instruction, user-space can't execute this instruction until
> > special_mapping_fault() installs the page allocated in __create_xol_area().
> >
> > Again, I know nothing about arm64/icache/etc, I am just curious and trying
> > to understand...
>
> We defer the icache maintenance until set_pte_at() time, where we call
> __sync_icache_dcache() if we're installing a present, executable user
> entry.

And to me this looks as if __sync_icache_dcache() must be called when
user space tries to fault-in the page allocated in __create_xol_area()
and returned by special_mapping_fault().

So I still don't understand the problem.

Oleg.
Liao, Chang Sept. 23, 2024, 1:57 a.m. UTC | #8
On 2024/9/20 23:32, Catalin Marinas wrote:
> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>
>>
>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>> On 09/19, Liao Chang wrote:
>>>>
>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>
>>>> +	if (!memcmp(dst, src, len))
>>>> +		goto done;
>>>
>>> can't really comment, I know nothing about arm64...
>>>
>>> but don't we need to change __create_xol_area()
>>>
>>> 	-	area->page = alloc_page(GFP_HIGHUSER);
>>> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>
>>> to avoid the false positives?
>>
>> Indeed, it would be safer.
>>
>> Could we tolerate these false positives? Even if the page is not reset
>> to zero bits, if the existing bits are the same as the instruction being
>> copied, it can still execute the correct instruction.
> 
> Not if the I-cache has stale data. If alloc_page() returns a page with
> some random data that resembles a valid instruction but there was never
> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> the compare (on the D-cache side) succeeds or not.

Absolutely right, I overlooked that the comparison is still performed in the D-cache.
However, the most important thing is ensuring the I-cache sees the accurate bits,
which is why a cache flush is necessary for each xol slot.

> 
> I think using __GFP_ZERO should do the trick. All 0s is a permanently
> undefined instruction, not something we'd use with xol.

Unfortunately, the comparison assumes the D-cache and I-cache are already
in sync for the slot being copied. But this assumption is flawed if we start
with a page with some random bits and the D-cache has not been synchronized with
the I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
after page allocation?

>
Will Deacon Sept. 23, 2024, 7:18 a.m. UTC | #9
On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
> On 2024/9/20 23:32, Catalin Marinas wrote:
> > On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
> >> On 2024/9/19 22:18, Oleg Nesterov wrote:
> >>> On 09/19, Liao Chang wrote:
> >>>> --- a/arch/arm64/kernel/probes/uprobes.c
> >>>> +++ b/arch/arm64/kernel/probes/uprobes.c
> >>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >>>>  	void *xol_page_kaddr = kmap_atomic(page);
> >>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
> >>>>
> >>>> +	if (!memcmp(dst, src, len))
> >>>> +		goto done;
> >>>
> >>> can't really comment, I know nothing about arm64...
> >>>
> >>> but don't we need to change __create_xol_area()
> >>>
> >>> 	-	area->page = alloc_page(GFP_HIGHUSER);
> >>> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> >>>
> >>> to avoid the false positives?
> >>
> >> Indeed, it would be safer.
> >>
> >> Could we tolerate these false positives? Even if the page is not reset
> >> to zero bits, if the existing bits are the same as the instruction being
> >> copied, it can still execute the correct instruction.
> > 
> > Not if the I-cache has stale data. If alloc_page() returns a page with
> > some random data that resembles a valid instruction but there was never
> > a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
> > the compare (on the D-cache side) succeeds or not.
> 
> Absolutely right, I overlooked that the comparison is still performed in the D-cache.
> However, the most important thing is ensuring the I-cache sees the accurate bits,
> which is why a cache flush is necessary for each xol slot.
> 
> > 
> > I think using __GFP_ZERO should do the trick. All 0s is a permanently
> > undefined instruction, not something we'd use with xol.
> 
> Unfortunately, the comparison assumes the D-cache and I-cache are already
> in sync for the slot being copied. But this assumption is flawed if we start
> with a page with some random bits and the D-cache has not been synchronized with
> the I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
> after page allocation?

No, I think Oleg's right. The initial cache maintenance will happen when the
executable pte is installed. However, we should use __GFP_ZERO anyway
because I don't think it's a good idea to map an uninitialised page into
userspace.

Will
Oleg Nesterov Sept. 23, 2024, 10:52 a.m. UTC | #10
On 09/23, Will Deacon wrote:
>
> However, we should use __GFP_ZERO anyway
> because I don't think it's a good idea to map an uninitialised page into
> userspace.

Agreed, and imo this even needs a separate "fix info leak" patch.

Oleg.
Liao, Chang Sept. 23, 2024, 11:16 a.m. UTC | #11
On 2024/9/22 22:09, Will Deacon wrote:
> On Fri, Sep 20, 2024 at 07:32:23PM +0200, Oleg Nesterov wrote:
>> On 09/20, Catalin Marinas wrote:
>>>
>>> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>>>
>>>>
>>>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>>>> On 09/19, Liao Chang wrote:
>>>>>>
>>>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>>>
>>>>>> +	if (!memcmp(dst, src, len))
>>>>>> +		goto done;
>>>>>
>>>>> can't really comment, I know nothing about arm64...
>>>>>
>>>>> but don't we need to change __create_xol_area()
>>>>>
>>>>> 	-	area->page = alloc_page(GFP_HIGHUSER);
>>>>> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>>>
>>>>> to avoid the false positives?
>>>>
>>>> Indeed, it would be safer.
>>>>
>>>> Could we tolerate these false positives? Even if the page is not reset
>>>> to zero bits, if the existing bits are the same as the instruction being
>>>> copied, it can still execute the correct instruction.
>>>
>>> Not if the I-cache has stale data. If alloc_page() returns a page with
>>> some random data that resembles a valid instruction but there was never
>>> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
>>> the compare (on the D-cache side) succeeds or not.
>>
>> But shouldn't the page fault paths on arm64 flush I-cache ?
>>
>> If alloc_page() returns a page with some random data that resembles a valid
>> instruction, user-space can't execute this instruction until
>> special_mapping_fault() installs the page allocated in __create_xol_area().
>>
>> Again, I know nothing about arm64/icache/etc, I am just curious and trying
>> to understand...
> 
> We defer the icache maintenance until set_pte_at() time, where we call
> __sync_icache_dcache() if we're installing a present, executable user
> entry. That also elides the maintenance if PG_arch_1 is set (i.e. the
> kernel only takes responsibility for the freshly allocated page).

The newly allocated page should always have PG_arch_1 cleared, correct? Is it
possible for alloc_page() to return a page with PG_arch_1 set in the current
arm64 kernel?

> 
> Will
> 
>>
>> Oleg.
>>
Liao, Chang Sept. 23, 2024, 11:16 a.m. UTC | #12
On 2024/9/23 15:18, Will Deacon wrote:
> On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
>> On 2024/9/20 23:32, Catalin Marinas wrote:
>>> On Fri, Sep 20, 2024 at 04:58:31PM +0800, Liao, Chang wrote:
>>>> On 2024/9/19 22:18, Oleg Nesterov wrote:
>>>>> On 09/19, Liao Chang wrote:
>>>>>> --- a/arch/arm64/kernel/probes/uprobes.c
>>>>>> +++ b/arch/arm64/kernel/probes/uprobes.c
>>>>>> @@ -17,12 +17,16 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>>>>>>  	void *xol_page_kaddr = kmap_atomic(page);
>>>>>>  	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
>>>>>>
>>>>>> +	if (!memcmp(dst, src, len))
>>>>>> +		goto done;
>>>>>
>>>>> can't really comment, I know nothing about arm64...
>>>>>
>>>>> but don't we need to change __create_xol_area()
>>>>>
>>>>> 	-	area->page = alloc_page(GFP_HIGHUSER);
>>>>> 	+	area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
>>>>>
>>>>> to avoid the false positives?
>>>>
>>>> Indeed, it would be safer.
>>>>
>>>> Could we tolerate these false positives? Even if the page is not reset
>>>> to zero bits, if the existing bits are the same as the instruction being
>>>> copied, it can still execute the correct instruction.
>>>
>>> Not if the I-cache has stale data. If alloc_page() returns a page with
>>> some random data that resembles a valid instruction but there was never
>>> a cache flush (sync_icache_aliases() on arm64), it's irrelevant whether
>>> the compare (on the D-cache side) succeeds or not.
>>
>> Absolutely right, I overlooked that the comparison is still performed in the D-cache.
>> However, the most important thing is ensuring the I-cache sees the accurate bits,
>> which is why a cache flush is necessary for each xol slot.
>>
>>>
>>> I think using __GFP_ZERO should do the trick. All 0s is a permanently
>>> undefined instruction, not something we'd use with xol.
>>
>> Unfortunately, the comparison assumes the D-cache and I-cache are already
>> in sync for the slot being copied. But this assumption is flawed if we start
>> with a page with some random bits and the D-cache has not been synchronized with
>> the I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
>> after page allocation?
> 
> No, I think Oleg's right. The initial cache maintenance will happen when the
> executable pte is installed. However, we should use __GFP_ZERO anyway
> because I don't think it's a good idea to map an uninitialised page into
> userspace.

I will use __GFP_ZERO for xol page allocation in v2.

> 
> Will
>
Catalin Marinas Sept. 23, 2024, 4:03 p.m. UTC | #13
On Mon, Sep 23, 2024 at 08:18:57AM +0100, Will Deacon wrote:
> On Mon, Sep 23, 2024 at 09:57:14AM +0800, Liao, Chang wrote:
> > Unfortunately, the comparison assumes the D-cache and I-cache are already
> > in sync for the slot being copied. But this assumption is flawed if we start
> > with a page with some random bits and the D-cache has not been synchronized with
> > the I-cache. So, besides __GFP_ZERO, should we have an additional cache flush
> > after page allocation?
> 
> No, I think Oleg's right. The initial cache maintenance will happen when the
> executable pte is installed.

For some reason I had kprobes in mind, did not realise that this page
ends up in user-space. So yes, we have cache maintenance when the pte is
set to point to this page. Subsequent changes will need cache
maintenance.
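
The argument so far can be checked with a toy model. The sketch below
is an illustration under loud assumptions, not kernel code: the D-side
and I-side views of a page are modelled as two arrays, "cache
maintenance" is a copy from one to the other, and every toy_* name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOT_WORDS 4

struct toy_xol {
	uint32_t dside[SLOT_WORDS];	/* what loads, stores and memcmp() see */
	uint32_t iside[SLOT_WORDS];	/* what instruction fetch would see */
};

/* Stand-in for D/I-cache maintenance: make the I-side match the D-side. */
static void toy_sync(struct toy_xol *x)
{
	memcpy(x->iside, x->dside, sizeof(x->dside));
}

/* Page allocation: D-side filled with dfill, I-side left with stale bits. */
static void toy_alloc(struct toy_xol *x, uint32_t dfill, uint32_t stale)
{
	for (unsigned int i = 0; i < SLOT_WORDS; i++) {
		x->dside[i] = dfill;
		x->iside[i] = stale;
	}
}

/* Installing the executable PTE performs the initial maintenance. */
static void toy_map(struct toy_xol *x)
{
	toy_sync(x);
}

/* Slot update with the compare shortcut from the patch. */
static void toy_copy_ixol(struct toy_xol *x, unsigned int slot, uint32_t insn)
{
	assert(slot < SLOT_WORDS);

	if (x->dside[slot] == insn)
		return;		/* skip the write and the maintenance */

	x->dside[slot] = insn;
	toy_sync(x);
}

static bool toy_coherent(const struct toy_xol *x)
{
	return !memcmp(x->dside, x->iside, sizeof(x->dside));
}
```

In this model, once toy_map() has run, every later toy_copy_ixol()
leaves the two views equal whether or not the shortcut fires. If a
page whose D-side already happens to hold the instruction is used
without toy_map(), the shortcut fires and the I-side stays stale, which
is the case the mapping step rules out.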

> However, we should use __GFP_ZERO anyway because I don't think it's a
> good idea to map an uninitialised page into userspace.

Oh, that's not good.
Liao, Chang Sept. 26, 2024, 12:06 p.m. UTC | #14
On 2024/9/23 18:52, Oleg Nesterov wrote:
> On 09/23, Will Deacon wrote:
>>
>> However, we should use __GFP_ZERO anyway
>> because I don't think it's a good idea to map an uninitialised page into
>> userspace.
> 
> Agreed, and imo this even needs a separate "fix info leak" patch.

Do you mean to fill the entire page with CPU specific illegal instructions
in this patch?

> 
> Oleg.
> 
>
Oleg Nesterov Sept. 26, 2024, 4:08 p.m. UTC | #15
On 09/26, Liao, Chang wrote:
>
> On 2024/9/23 18:52, Oleg Nesterov wrote:
> > On 09/23, Will Deacon wrote:
> >>
> >> However, we should use __GFP_ZERO anyway
> >> because I don't think it's a good idea to map an uninitialised page into
> >> userspace.
> >
> > Agreed, and imo this even needs a separate "fix info leak" patch.
>
> Do you mean to fill the entire page with CPU specific illegal instructions
> in this patch?

Hmm. Why?? No... OK, I'll write the changelog and send the trivial patch
in a minute.

Oleg.
Liao, Chang Nov. 6, 2024, 9:55 a.m. UTC | #16
Hi, Will and Catalin

On 2024/9/19 20:17, Liao Chang wrote:
> On 09/23, Will Deacon wrote:
>> However, we should use __GFP_ZERO anyway
>> because I don't think it's a good idea to map an uninitialised page into
>> userspace.
> Agreed, and imo this even needs a separate "fix info leak" patch.
> 
> Oleg.

Given that Oleg's "fix info leak" patch has been merged [1], the risk of leakage
is gone. So I am looking forward to your opinions on this patch. Many
functions start with the same instructions, such as 'stp fp, lr, [sp, #imm]' or
'paciasp', so I think this patch could avoid unnecessary D/I cache synchronization.

[1] https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/
Catalin Marinas Nov. 7, 2024, 6:35 p.m. UTC | #17
On Wed, Nov 06, 2024 at 05:55:16PM +0800, Liao, Chang wrote:
> On 2024/9/19 20:17, Liao Chang wrote:
> > On 09/23, Will Deacon wrote:
> >> However, we should use __GFP_ZERO anyway
> >> because I don't think it's a good idea to map an uninitialised page into
> >> userspace.
> > Agreed, and imo this even needs a separate "fix info leak" patch.
> > 
> > Oleg.
> 
> Given that Oleg's "fix info leak" patch has been merged [1], the risk of leakage
> is gone. So I am looking forward to your opinions on this patch. Many
> functions start with the same instructions, such as 'stp fp, lr, [sp, #imm]' or
> 'paciasp', so I think this patch could avoid unnecessary D/I cache synchronization.
> 
> [1] https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/

The patch is fine with the fix in __create_xol_area(). But please add a
comment on why it is safe to skip the cache maintenance, something like
"the initial cache maintenance was done via set_pte_at()" (well, I can
do this when applying).
Catalin Marinas Nov. 8, 2024, 4:49 p.m. UTC | #18
On Thu, 19 Sep 2024 12:17:19 +0000, Liao Chang wrote:
> Profiling the single-threaded selftests bench reveals a bottleneck in
> caches_clean_inval_pou() on ARM64. On my local testing machine, this
> function takes approximately 34% of CPU cycles for trig-uprobe-nop and
> trig-uprobe-push.
> 
> This patch adds a check to avoid an unnecessary cache flush when writing
> an instruction to the xol slot. If the instruction is the same as the
> existing instruction in the slot, there is no need to synchronize the
> D/I caches. Since xol slot allocation and updates occur on the hot path
> of uprobe handling, this optimization yields a clear gain for the nop
> and push test cases on an upstream kernel running on Kunpeng916
> (Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: uprobes: Optimize cache flushes for xol slot
      https://git.kernel.org/arm64/c/bdf94836c22a

Patch

diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..5ee27509d6f6 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -17,12 +17,16 @@  void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 	void *xol_page_kaddr = kmap_atomic(page);
 	void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);
 
+	if (!memcmp(dst, src, len))
+		goto done;
+
 	/* Initialize the slot */
 	memcpy(dst, src, len);
 
 	/* flush caches (dcache/icache) */
 	sync_icache_aliases((unsigned long)dst, (unsigned long)dst + len);
 
+done:
 	kunmap_atomic(xol_page_kaddr);
 }