
xen/livepatch: Make check_for_livepatch_work() faster in the common case

Message ID 20231222220045.2840714-1-andrew.cooper3@citrix.com (mailing list archive)
State Superseded
Series xen/livepatch: Make check_for_livepatch_work() faster in the common case

Commit Message

Andrew Cooper Dec. 22, 2023, 10 p.m. UTC
When livepatching is enabled, this function is used all the time.  Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).

This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
the optimiser has an easier time too.  Bloat-o-meter reports:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
  Function                                     old     new   delta
  check_for_livepatch_work.cold               1201    1183     -18
  check_for_livepatch_work                    1021     982     -39

which isn't too shabby for no logical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

I'm still a little disappointed with the code generation.  GCC still chooses
to set up the full stack frame (6 regs, +3 more slots) intermixed with the
per-cpu calculations.

In isolation, GCC can check the boolean without creating a stack frame:

  <work_to_to>:
    48 89 e2                mov    %rsp,%rdx
    48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
    48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
    8b 4a c1                mov    -0x3f(%rdx),%ecx
    48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
    48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
    0f b6 04 02             movzbl (%rdx,%rax,1),%eax
    c3                      retq

but I can't find a way to convince GCC that it would be worth not setting up a
stack frame in the common case, and having a few extra mov reg/reg's later
in the uncommon case.

I haven't tried manually splitting the function into a check() and a do()
function.  Views on whether that might be acceptable?  At a guess, do() would
need to be a static noinline to avoid it turning back into what it currently
is.
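As a rough illustration of the check()/do() split being floated, a sketch might
look like the following.  (Hypothetical: the per-CPU work_to_do flag is reduced
to a plain variable, and the slow path is reduced to a counter bump, purely so
the shape of the split is visible.)

```c
#include <stdbool.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* Stand-ins for the real state; per_cpu(work_to_do, cpu) is
 * modelled as a plain variable for illustration only. */
static bool work_to_do;
static int work_done;

/* The slow path, kept out of line (noinline) so the hot caller
 * doesn't need to set up a full stack frame for it. */
static __attribute__((noinline)) void livepatch_do_work(void)
{
    work_done++;
    work_to_do = false;
}

void check_for_livepatch_work(void)
{
    /* Fast path: no work to do. */
    if ( likely(!work_to_do) )
        return;

    livepatch_do_work();
}
```

The point of the noinline is exactly the concern raised above: without it, GCC
would merge do() back into check() and regenerate the single large frame.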
---
 xen/common/livepatch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


base-commit: 49818cde637b5ec20383e46b71f93b2e7d867686

Comments

Jan Beulich Jan. 4, 2024, 1:24 p.m. UTC | #1
On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> 
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
> 
> which isn't too shabby for no logical change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I'm still a little disappointed with the code generation.  GCC still chooses
> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
> per-cpu calculations.
> 
> In isolation, GCC can check the boolean without creating a stack frame:
> 
>   <work_to_to>:
>     48 89 e2                mov    %rsp,%rdx
>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>     c3                      retq
> 
> but I can't find a way to convince GCC that it would be worth not setting up a
> stack frame in the common case, and having a few extra mov reg/reg's later
> in the uncommon case.
> 
> I haven't tried manually splitting the function into a check() and a do()
> function.  Views on whether that might be acceptable?  At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when the fast-path check passes? Downside
would be that the per-CPU work_to_do variable then couldn't be static anymore.

Jan
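The shape Jan suggests might look roughly like this (hypothetical sketch: the
wrapper would really live in livepatch.h, and the flag would be a non-static
per-CPU variable rather than the plain variable used here):

```c
#include <stdbool.h>

/* Would live in livepatch.h; work_to_do can no longer be static
 * in livepatch.c, as the inlined check needs to see it. */
extern bool work_to_do;
void livepatch_do_work(void);

static inline void check_for_livepatch_work(void)
{
    /* Fast-path check, inlined at every call site. */
    if ( !work_to_do )
        return;

    livepatch_do_work();
}

/* Out-of-line part, normally in livepatch.c. */
bool work_to_do;
static int work_done;

void livepatch_do_work(void)
{
    work_done++;
    work_to_do = false;
}
```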
Andrew Cooper Jan. 4, 2024, 1:32 p.m. UTC | #2
On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time.  Really do
>> check the fastpath first, and annotate it likely() as this is the right answer
>> 100% of the time (to many significant figures).
>>
>> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
>> the optimiser has an easier time too.  Bloat-o-meter reports:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>>   Function                                     old     new   delta
>>   check_for_livepatch_work.cold               1201    1183     -18
>>   check_for_livepatch_work                    1021     982     -39
>>
>> which isn't too shabby for no logical change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> I'm still a little disappointed with the code generation.  GCC still chooses
>> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
>> per-cpu calculations.
>>
>> In isolation, GCC can check the boolean without creating a stack frame:
>>
>>   <work_to_to>:
>>     48 89 e2                mov    %rsp,%rdx
>>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>>     c3                      retq
>>
>> but I can't find a way to convince GCC that it would be worth not setting up a
>> stack frame in the common case, and having a few extra mov reg/reg's later
>> in the uncommon case.
>>
>> I haven't tried manually splitting the function into a check() and a do()
>> function.  Views on whether that might be acceptable?  At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that, unfortunately.  It's called from assembly in
reset_stack_and_*().

But, I've had another idea.  I think attribute cold can help here, as it
will move the majority of the implementation into a separate section.

~Andrew
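For reference, the attribute-cold idea would put the annotation on the
out-of-line slow path, roughly like this (hypothetical sketch with the per-CPU
flag as a plain variable; GCC optimises cold functions for size and places
them in .text.unlikely, away from the hot caller):

```c
#include <stdbool.h>

#define likely(x) __builtin_expect(!!(x), 1)

static bool work_to_do;
static int work_done;

/* attribute cold: GCC treats all paths reaching this function as
 * unlikely, optimises it for size, and moves it into .text.unlikely,
 * keeping the fast path's code compact. */
static __attribute__((cold)) void livepatch_do_work(void)
{
    work_done++;
    work_to_do = false;
}

void check_for_livepatch_work(void)
{
    if ( likely(!work_to_do) )
        return;

    livepatch_do_work();
}
```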
Ross Lagerwall Jan. 5, 2024, 3:24 p.m. UTC | #3
On Fri, Dec 22, 2023 at 10:01 PM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
>
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
>
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
>
> which isn't too shabby for no logical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Patch

diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..b6275339f663 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1706,15 +1706,15 @@ void check_for_livepatch_work(void)
     s_time_t timeout;
     unsigned long flags;
 
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
     /* Only do any work when invoked in truly idle state. */
     if ( system_state != SYS_STATE_active ||
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )