diff mbox series

[v2,1/2] arm64: clear_page() shouldn't use DC ZVA when DCZID_EL0.DZP == 1

Message ID 20211108071149.823570-2-reijiw@google.com (mailing list archive)
State New, archived
Headers show
Series arm64: DC {ZVA,GVA,GZVA} shouldn't be used when DCZID_EL0.DZP == 1 | expand

Commit Message

Reiji Watanabe Nov. 8, 2021, 7:11 a.m. UTC
Currently, clear_page() uses DC ZVA instruction unconditionally.  But it
should make sure that DCZID_EL0.DZP, which indicates whether or not use
of DC ZVA instruction is prohibited, is zero when using the instruction.
Use STP as memset does instead when DCZID_EL0.DZP == 1.

Fixes: f27bb139c387 ("arm64: Miscellaneous library functions")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
---
 arch/arm64/lib/clear_page.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Robin Murphy Nov. 16, 2021, 11 p.m. UTC | #1
On 2021-11-08 07:11, Reiji Watanabe wrote:
> Currently, clear_page() uses DC ZVA instruction unconditionally.  But it
> should make sure that DCZID_EL0.DZP, which indicates whether or not use
> of DC ZVA instruction is prohibited, is zero when using the instruction.
> Use STP as memset does instead when DCZID_EL0.DZP == 1.

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to 
Mark's nod toward the original context, but a quick survey of various 
optimisation guides implied that the explicit add should generally be 
preferred over post-index writeback anyway, so I think we're all good here.

Cheers,
Robin.

> Fixes: f27bb139c387 ("arm64: Miscellaneous library functions")
> Signed-off-by: Reiji Watanabe <reijiw@google.com>
> ---
>   arch/arm64/lib/clear_page.S | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
> index b84b179edba3..ea3a6c927fe8 100644
> --- a/arch/arm64/lib/clear_page.S
> +++ b/arch/arm64/lib/clear_page.S
> @@ -16,6 +16,7 @@
>    */
>   SYM_FUNC_START_PI(clear_page)
>   	mrs	x1, dczid_el0
> +	tbnz	x1, #4, 2f	/* Branch if DC ZVA is prohibited */
>   	and	w1, w1, #0xf
>   	mov	x2, #4
>   	lsl	x1, x2, x1
> @@ -25,5 +26,14 @@ SYM_FUNC_START_PI(clear_page)
>   	tst	x0, #(PAGE_SIZE - 1)
>   	b.ne	1b
>   	ret
> +
> +2:	stp	xzr, xzr, [x0]
> +	stp	xzr, xzr, [x0, #16]
> +	stp	xzr, xzr, [x0, #32]
> +	stp	xzr, xzr, [x0, #48]
> +	add	x0, x0, #64
> +	tst	x0, #(PAGE_SIZE - 1)
> +	b.ne	2b
> +	ret
>   SYM_FUNC_END_PI(clear_page)
>   EXPORT_SYMBOL(clear_page)
>
Reiji Watanabe Nov. 18, 2021, 8:18 a.m. UTC | #2
On Tue, Nov 16, 2021 at 3:00 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2021-11-08 07:11, Reiji Watanabe wrote:
> > Currently, clear_page() uses DC ZVA instruction unconditionally.  But it
> > should make sure that DCZID_EL0.DZP, which indicates whether or not use
> > of DC ZVA instruction is prohibited, is zero when using the instruction.
> > Use STP as memset does instead when DCZID_EL0.DZP == 1.
>
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>
> FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to
> Mark's nod toward the original context, but a quick survey of various
> optimisation guides implied that the explicit add should generally be
> preferred over post-index writeback anyway, so I think we're all good here.

Thank you for the review!
The original code, which used *pre*-index (not post-index) addressing,
made no significant difference in page_clear performance on my test
environment from the current code.

Now, I am looking at creating v3 patches to use stnp instead of stp
in page_clear (NOTE: DC ZVA shows much better performance on my test
system than stp/stnp).

Although using stnp didn't show significant difference in clear_page()
performance on my test system from stp (no significant difference in
cache-misses, cache_refill, cache_wb, or cache_allocate event counter
either), using stnp should be more appropriate for page_clear than stp,
and I understand it could show better performance on some CPUs.

Thanks,
Reiji




>
> Cheers,
> Robin.
>
> > Fixes: f27bb139c387 ("arm64: Miscellaneous library functions")
> > Signed-off-by: Reiji Watanabe <reijiw@google.com>
> > ---
> >   arch/arm64/lib/clear_page.S | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
> > index b84b179edba3..ea3a6c927fe8 100644
> > --- a/arch/arm64/lib/clear_page.S
> > +++ b/arch/arm64/lib/clear_page.S
> > @@ -16,6 +16,7 @@
> >    */
> >   SYM_FUNC_START_PI(clear_page)
> >       mrs     x1, dczid_el0
> > +     tbnz    x1, #4, 2f      /* Branch if DC ZVA is prohibited */
> >       and     w1, w1, #0xf
> >       mov     x2, #4
> >       lsl     x1, x2, x1
> > @@ -25,5 +26,14 @@ SYM_FUNC_START_PI(clear_page)
> >       tst     x0, #(PAGE_SIZE - 1)
> >       b.ne    1b
> >       ret
> > +
> > +2:   stp     xzr, xzr, [x0]
> > +     stp     xzr, xzr, [x0, #16]
> > +     stp     xzr, xzr, [x0, #32]
> > +     stp     xzr, xzr, [x0, #48]
> > +     add     x0, x0, #64
> > +     tst     x0, #(PAGE_SIZE - 1)
> > +     b.ne    2b
> > +     ret
> >   SYM_FUNC_END_PI(clear_page)
> >   EXPORT_SYMBOL(clear_page)
> >
Robin Murphy Nov. 18, 2021, 11:42 a.m. UTC | #3
On 2021-11-18 08:18, Reiji Watanabe wrote:
> On Tue, Nov 16, 2021 at 3:00 PM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2021-11-08 07:11, Reiji Watanabe wrote:
>>> Currently, clear_page() uses DC ZVA instruction unconditionally.  But it
>>> should make sure that DCZID_EL0.DZP, which indicates whether or not use
>>> of DC ZVA instruction is prohibited, is zero when using the instruction.
>>> Use STP as memset does instead when DCZID_EL0.DZP == 1.
>>
>> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>>
>> FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to
>> Mark's nod toward the original context, but a quick survey of various
>> optimisation guides implied that the explicit add should generally be
>> preferred over post-index writeback anyway, so I think we're all good here.
> 
> Thank you for the review!
> The original code, which used *pre*-index (not post-index) addressing,

Oops, in the context I think I meant writeback in general anyway. This 
is what happens when a sudden urge to review random patches at 11PM 
strikes :)

> made no significant difference in page_clear performance on my test
> environment from the current code.
> 
> Now, I am looking at creating v3 patches to use stnp instead of stp
> in page_clear (NOTE: DC ZVA shows much better performance on my test
> system than stp/stnp).
> 
> Although using stnp didn't show significant difference in clear_page()
> performance on my test system from stp (no significant difference in
> cache-misses, cache_refill, cache_wb, or cache_allocate event counter
> either), using stnp should be more appropriate for page_clear than stp,
> and I understand it could show better performance on some CPUs.

Indeed - certainly most Arm Ltd. cores tend to be good at spotting the 
store pattern and switching into write-streaming mode automatically - 
but semantically, STNP probably is appropriate for the great majority of 
clear_page() usage. Feel free to keep my review tag with that change.

Thanks,
Robin.
diff mbox series

Patch

diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
index b84b179edba3..ea3a6c927fe8 100644
--- a/arch/arm64/lib/clear_page.S
+++ b/arch/arm64/lib/clear_page.S
@@ -16,6 +16,7 @@ 
  */
 SYM_FUNC_START_PI(clear_page)
 	mrs	x1, dczid_el0
+	tbnz	x1, #4, 2f	/* Branch if DC ZVA is prohibited */
 	and	w1, w1, #0xf
 	mov	x2, #4
 	lsl	x1, x2, x1
@@ -25,5 +26,14 @@  SYM_FUNC_START_PI(clear_page)
 	tst	x0, #(PAGE_SIZE - 1)
 	b.ne	1b
 	ret
+
+2:	stp	xzr, xzr, [x0]
+	stp	xzr, xzr, [x0, #16]
+	stp	xzr, xzr, [x0, #32]
+	stp	xzr, xzr, [x0, #48]
+	add	x0, x0, #64
+	tst	x0, #(PAGE_SIZE - 1)
+	b.ne	2b
+	ret
 SYM_FUNC_END_PI(clear_page)
 EXPORT_SYMBOL(clear_page)