diff mbox series

[RFC,5/6] arm64: call enter_hypervisor_head only when it is needed

Message ID 1564137460-25629-7-git-send-email-andrii.anisov@gmail.com (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

Andrii Anisov July 26, 2019, 10:37 a.m. UTC
From: Andrii Anisov <andrii_anisov@epam.com>

On ARM64 we know exactly if trap happened from hypervisor or guest, so
we do not need to take that decision. This reduces a condition for
all enter_hypervisor_head calls and the function call for traps from
the hypervisor mode.

Currently, it is implemented for ARM64 only. Integrating the stuff
with ARM32 requires moving ` if ( guest_mode(regs) )` condition
into ARM32 specific traps.c

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/arch/arm/arm64/entry.S |  6 ++--
 xen/arch/arm/traps.c       | 75 ++++++++++++++++++++++++----------------------
 2 files changed, 43 insertions(+), 38 deletions(-)

Comments

Julien Grall July 26, 2019, 10:59 a.m. UTC | #1
Hi,

On 26/07/2019 11:37, Andrii Anisov wrote:
> From: Andrii Anisov <andrii_anisov@epam.com>
> 
> On ARM64 we know exactly if trap happened from hypervisor or guest, so
> we do not need to take that decision. This reduces a condition for
> all enter_hypervisor_head calls and the function call for traps from
> the hypervisor mode.

One condition lost but ...

> 
> Currently, it is implemented for ARM64 only. Integrating the stuff
> with ARM32 requires moving ` if ( guest_mode(regs) )` condition
> into ARM32 specific traps.

... one more divergence between arm32 and arm64.

There are probably dozens of more conditions in the code that are not necessary 
for one of the architectures. Yet there are value to keep everything common 
because the benefits outweigh the likely non performance improvement.

So I am not convinced that this has any value for Xen.

Cheers,
Andrii Anisov July 30, 2019, 5:35 p.m. UTC | #2
On 26.07.19 13:59, Julien Grall wrote:
> Hi,
> 
> On 26/07/2019 11:37, Andrii Anisov wrote:
>> From: Andrii Anisov <andrii_anisov@epam.com>
>>
>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
>> we do not need to take that decision. This reduces a condition for
>> all enter_hypervisor_head calls and the function call for traps from
>> the hypervisor mode.
> 
> One condition lost but ...

...In the hot path (actually at any trap).
And the function call for traps from hyp.

>> Currently, it is implemented for ARM64 only. Integrating the stuff
>> with ARM32 requires moving ` if ( guest_mode(regs) )` condition
>> into ARM32 specific traps.
> 
> ... one more divergence between arm32 and arm64.
> 
> There are probably dozens of more conditions in the code that are not necessary for one of the architectures.
> Yet there are value to keep everything common because the benefits outweigh the likely non performance improvement.

I'm not seeing any benefits in calling `enter_hypervisor_head()` from functions named `do_trap_hyp_*`. That code is confusing for me.
IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is not a big deal. Even for ARM32. Moreover, it will make more obvious the fact that nothing from `enter_hypervisor_head()` is done for IRQ traps taken from hyp.

> So I am not convinced that this has any value for Xen.

OK, I wouldn't insist.
Julien Grall July 31, 2019, 11:02 a.m. UTC | #3
Hi Andrii,

On 30/07/2019 18:35, Andrii Anisov wrote:
> 
> On 26.07.19 13:59, Julien Grall wrote:
>> Hi,
>>
>> On 26/07/2019 11:37, Andrii Anisov wrote:
>>> From: Andrii Anisov <andrii_anisov@epam.com>
>>>
>>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
>>> we do not need to take that decision. This reduces a condition for
>>> all enter_hypervisor_head calls and the function call for traps from
>>> the hypervisor mode.
>>
>> One condition lost but ...
> 
> ...In the hot path (actually at any trap).

Everything is in the hot path here, yet there are a lot of other branches. So 
why this branch in particular?

As I have mentioned a few times before, there are a difference between the 
theory and the practice. In theory, removing a branch looks nice. But in 
practice this may not be the case.

In this particular case, I don't believe this is going to have a real impact on 
the performance.

The PSTATE has been saved a few instructions before in cpu_user_regs, so there
are high chance the value will still be in the L1 cache.

The compiler may also decide to do the direct branch when not in guest_mode. A 
trap from the hypervisor is mostly for interrupts. So there are chance this is 
not going to have a real impact on the overall of the interrupt handling.

If you are really worry of the impact of branch then there are a few more 
important places (with a greater benefits) to look:
     1) It seems the compiler use a jump table for the switch case in 
do_trap_guest_sync(), so it will result to multiple direct branch everytime.
     2) Indirect branch have a non-negligible cost compare to direct branch. 
This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of 
them are known at boot time, so they could be replace with direct branch. x86 
recently introduced alternative_call() for this purpose. This could be re-used 
on Arm.

Cheers,
Andre Przywara July 31, 2019, 11:33 a.m. UTC | #4
On Wed, 31 Jul 2019 12:02:20 +0100
Julien Grall <julien.grall@arm.com> wrote:

Hi,

> On 30/07/2019 18:35, Andrii Anisov wrote:
> > 
> > On 26.07.19 13:59, Julien Grall wrote:  
> >> Hi,
> >>
> >> On 26/07/2019 11:37, Andrii Anisov wrote:  
> >>> From: Andrii Anisov <andrii_anisov@epam.com>
> >>>
> >>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
> >>> we do not need to take that decision. This reduces a condition for
> >>> all enter_hypervisor_head calls and the function call for traps from
> >>> the hypervisor mode.  
> >>
> >> One condition lost but ...  
> > 
> > ...In the hot path (actually at any trap).  
> 
> Everything is in the hot path here, yet there are a lot of other branches. So 
> why this branch in particular?
> 
> As I have mentioned a few times before, there are a difference between the 
> theory and the practice. In theory, removing a branch looks nice. But in 
> practice this may not be the case.
> 
> In this particular case, I don't believe this is going to have a real impact on 
> the performance.
> 
> The PSTATE has been saved a few instructions before in cpu_user_regs, so there
> are high chance the value will still be in the L1 cache.

I agree on this, and second the idea of *not* micro-optimising code just for the sake of it. If you have numbers that back this up, it would be a different story.

> The compiler may also decide to do the direct branch when not in guest_mode. A 
> trap from the hypervisor is mostly for interrupts. So there are chance this is 
> not going to have a real impact on the overall of the interrupt handling.
> 
> If you are really worry of the impact of branch then there are a few more 
> important places (with a greater benefits) to look:
>      1) It seems the compiler use a jump table for the switch case in 
> do_trap_guest_sync(), so it will result to multiple direct branch everytime.

I don't think it's worth to "fix" this issue. The compiler has done this for a reason, and I would guess it figured that this is cheaper than other ways of solving this. If you are really paranoid about this, I would try to compile this with tuning for your particular core (-mtune), so that the compiler can throw in more micro-architectural knowledge about the cost of certain instructions.

>      2) Indirect branch have a non-negligible cost compare to direct branch. 
> This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of 
> them are known at boot time, so they could be replace with direct branch. x86 
> recently introduced alternative_call() for this purpose. This could be re-used 
> on Arm.

This is indeed something I was always worried about: It looks cheap and elegant in the C source code, but is potentially expensive on hardware. The particular snippet is:
...
  249024:       d5033fdf        isb
  249028:       f9401e80        ldr     x0, [x20, #56]
  24902c:       f9407801        ldr     x1, [x0, #240]
  249030:       2a1303e0        mov     w0, w19
  249034:       d63f0020        blr     x1
...
In case of an interrupt, the first load will probably miss the cache, and the CPU is stuck now, because due to the dependencies it can't do much else. It can't even predict the branch and speculatively execute anything, because the destination address is yet another dependent load away.
This might not matter for little cores like A53s, because they wouldn't speculate anyway. But better cores (A72, for instance) would most likely benefit from an optimisation in this area.

Cheers,
Andre.
Andrii Anisov Aug. 1, 2019, 7:33 a.m. UTC | #5
On 31.07.19 14:02, Julien Grall wrote:
> Everything is in the hot path here, yet there are a lot of other branches. So why this branch in particular?

This branch and function call is particular because I find this piece of code quite confusing:

>> I'm not seeing any benefits in calling `enter_hypervisor_head()` from functions named `do_trap_hyp_*`. That code is confusing for me.
>> IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is not a big deal. Even for ARM32. Moreover, it will make more obvious the fact that nothing from `enter_hypervisor_head()` is done for IRQ traps taken from hyp.

And I believe this patch balances patch "xen/arm: Re-enable interrupt later in the trap path" what you NAKed because of increased IRQ latency.
While them together make the code more straight forward and clear, because:
  - you start all C-coded common trap handlers with IRQs locked, and free from asm code misunderstanding
  - all common trap handlers are distinct in their naming, purpose and action. In terms of calling `enter_hypervisor_head()` only from the traps taken from guest.

> If you are really worry of the impact of branch then there are a few more important places (with a greater benefits) to look:
>      1) It seems the compiler use a jump table for the switch case in do_trap_guest_sync(), so it will result to multiple direct branch everytime.
>      2) Indirect branch have a non-negligible cost compare to direct branch. This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of them are known at boot time, so they could be replace with direct branch. x86 recently introduced alternative_call() for this purpose. This could be re-used on Arm.

I remember this. But I'm not ready to code it. I admit that I have not yet good understanding about how alternatives work.
Julien Grall Aug. 1, 2019, 10:17 a.m. UTC | #6
Hi Andrii,

On 01/08/2019 08:33, Andrii Anisov wrote:
> 
> 
> On 31.07.19 14:02, Julien Grall wrote:
>> Everything is in the hot path here, yet there are a lot of other branches. So 
>> why this branch in particular?
> 
> This branch and function call is particular because I find this piece of code 
> quite confusing:

All the commit message is based on "performance improvement".... Now you are 
selling it as this is confusing. What are the real reasons for this patch?

> 
>>> I'm not seeing any benefits in calling `enter_hypervisor_head()` from 
>>> functions named `do_trap_hyp_*`. That code is confusing for me.
>>> IMHO, dividing `do_trap_irq/fiq()` into guest and hyp specific functions is 
>>> not a big deal. Even for ARM32. Moreover, it will make more obvious the fact 
>>> that nothing from `enter_hypervisor_head()` is done for IRQ traps taken from 
>>> hyp.
> 
> And I believe this patch balances patch "xen/arm: Re-enable interrupt later in 
> the trap path" what you NAKed because of increased IRQ latency.

I never NAKed the patch as you keep claiming it. You are sending a patch without 
any justification three time in a row, so it is normal to request more details 
and to be slightly annoyed.

If you expect me to ack a patch without understanding the implications, then I 
am afraid this is not going to happen. Additionally, it is important to keep 
track of the reasoning of we can come back in 2 years time and find out quickly 
why interrupts are masked for a long period of time.

As I pointed out Xen supports multiple use-cases. I am concerned you are trying 
to optimize for your use-case and disregard the rest. I have actually requested 
more details on your use case to understand a bit more where you are coming 
from. See my e-mail [1].

I know I wrote the patch but it was from debugging other than real improvement. 
 From my understanding, you are using to optimize the case where all LRs are 
full. Is it something you have seen during your testing?

If so, how many LRs does the platform provide? Can you provide more details on 
your use case? I don't need the full details, but roughly the number of 
interrupts used and often they trigger.

Additionally, it would be good to know the usage over the time.  You could 
modify Xen to record the number of LRs used to each entry.

> While them together make the code more straight forward and clear, because:
>   - you start all C-coded common trap handlers with IRQs locked, and free from 
> asm code misunderstanding

Well, there are only one (two if you count the FIQ) case where interrupts are 
kept disabled, this is when receiving an interrupt. I don't see it as a good 
enough justification to have to impose that to all the handlers.

>   - all common trap handlers are distinct in their naming, purpose and action. 
> In terms of calling `enter_hypervisor_head()` only from the traps taken from guest.

There are nearly no difference between receiving an interrupt while running the 
guest mode and while running in hypervisor mode. So what do you really gain with 
the duplication?

Cheers,

[1] https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg02335.html
Andrii Anisov Aug. 2, 2019, 1:50 p.m. UTC | #7
On 01.08.19 13:17, Julien Grall wrote:
> All the commit message is based on "performance improvement".... Now you are selling it as this is confusing. 

Sorry Julien, I have no more arguments for you.
I'll drop these two patches for the next iteration.
diff mbox series

Patch

diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
index 8f28789..21c710d 100644
--- a/xen/arch/arm/arm64/entry.S
+++ b/xen/arch/arm/arm64/entry.S
@@ -211,7 +211,7 @@  hyp_irq:
         entry   hyp=1
         msr     daifclr, #4
         mov     x0, sp
-        bl      do_trap_irq
+        bl      do_trap_hyp_irq
         exit    hyp=1
 
 guest_sync:
@@ -321,7 +321,7 @@  guest_irq:
                     SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
         msr     daifclr, #4
         mov     x0, sp
-        bl      do_trap_irq
+        bl      do_trap_guest_irq
 1:
         exit    hyp=0, compat=0
 
@@ -364,7 +364,7 @@  guest_irq_compat:
                     SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT)
         msr     daifclr, #4
         mov     x0, sp
-        bl      do_trap_irq
+        bl      do_trap_guest_irq
 1:
         exit    hyp=0, compat=1
 
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 5a9dc66..13726db 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2011,48 +2011,45 @@  static inline bool needs_ssbd_flip(struct vcpu *v)
              cpu_require_ssbd_mitigation();
 }
 
-static void enter_hypervisor_head(struct cpu_user_regs *regs)
+static void enter_hypervisor_head(void)
 {
-    if ( guest_mode(regs) )
-    {
-        struct vcpu *v = current;
+    struct vcpu *v = current;
 
-        ASSERT(!local_irq_is_enabled());
+    ASSERT(!local_irq_is_enabled());
 
-        /* If the guest has disabled the workaround, bring it back on. */
-        if ( needs_ssbd_flip(v) )
-            arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL);
+    /* If the guest has disabled the workaround, bring it back on. */
+    if ( needs_ssbd_flip(v) )
+        arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL);
 
-        /*
-         * If we pended a virtual abort, preserve it until it gets cleared.
-         * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
-         * but the crucial bit is "On taking a vSError interrupt, HCR_EL2.VSE
-         * (alias of HCR.VA) is cleared to 0."
-         */
-        if ( v->arch.hcr_el2 & HCR_VA )
-            v->arch.hcr_el2 = READ_SYSREG(HCR_EL2);
+    /*
+     * If we pended a virtual abort, preserve it until it gets cleared.
+     * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
+     * but the crucial bit is "On taking a vSError interrupt, HCR_EL2.VSE
+     * (alias of HCR.VA) is cleared to 0."
+     */
+    if ( v->arch.hcr_el2 & HCR_VA )
+        v->arch.hcr_el2 = READ_SYSREG(HCR_EL2);
 
 #ifdef CONFIG_NEW_VGIC
-        /*
-         * We need to update the state of our emulated devices using level
-         * triggered interrupts before syncing back the VGIC state.
-         *
-         * TODO: Investigate whether this is necessary to do on every
-         * trap and how it can be optimised.
-         */
-        vtimer_update_irqs(v);
-        vcpu_update_evtchn_irq(v);
+    /*
+     * We need to update the state of our emulated devices using level
+     * triggered interrupts before syncing back the VGIC state.
+     *
+     * TODO: Investigate whether this is necessary to do on every
+     * trap and how it can be optimised.
+     */
+    vtimer_update_irqs(v);
+    vcpu_update_evtchn_irq(v);
 #endif
 
-        vgic_sync_from_lrs(v);
-    }
+    vgic_sync_from_lrs(v);
 }
 
 void do_trap_guest_sync(struct cpu_user_regs *regs)
 {
     const union hsr hsr = { .bits = regs->hsr };
 
-    enter_hypervisor_head(regs);
+    enter_hypervisor_head();
     local_irq_enable();
 
     switch ( hsr.ec )
@@ -2188,7 +2185,6 @@  void do_trap_hyp_sync(struct cpu_user_regs *regs)
 {
     const union hsr hsr = { .bits = regs->hsr };
 
-    enter_hypervisor_head(regs);
     local_irq_enable();
 
     switch ( hsr.ec )
@@ -2227,7 +2223,6 @@  void do_trap_hyp_sync(struct cpu_user_regs *regs)
 
 void do_trap_hyp_serror(struct cpu_user_regs *regs)
 {
-    enter_hypervisor_head(regs);
     local_irq_enable();
 
     __do_trap_serror(regs, VABORT_GEN_BY_GUEST(regs));
@@ -2235,21 +2230,31 @@  void do_trap_hyp_serror(struct cpu_user_regs *regs)
 
 void do_trap_guest_serror(struct cpu_user_regs *regs)
 {
-    enter_hypervisor_head(regs);
+    enter_hypervisor_head();
     local_irq_enable();
 
     __do_trap_serror(regs, true);
 }
 
-void do_trap_irq(struct cpu_user_regs *regs)
+void do_trap_guest_irq(struct cpu_user_regs *regs)
+{
+    enter_hypervisor_head();
+    gic_interrupt(regs, 0);
+}
+
+void do_trap_guest_fiq(struct cpu_user_regs *regs)
+{
+    enter_hypervisor_head();
+    gic_interrupt(regs, 1);
+}
+
+void do_trap_hyp_irq(struct cpu_user_regs *regs)
 {
-    enter_hypervisor_head(regs);
     gic_interrupt(regs, 0);
 }
 
-void do_trap_fiq(struct cpu_user_regs *regs)
+void do_trap_hyp_fiq(struct cpu_user_regs *regs)
 {
-    enter_hypervisor_head(regs);
     gic_interrupt(regs, 1);
 }