
[1/2] x86/hvm: improve performance of HVMOP_flush_tlbs

Message ID 20191224132616.47441-2-roger.pau@citrix.com (mailing list archive)
State Superseded
Series x86: improve assisted tlb flush and use it in guest mode

Commit Message

Roger Pau Monne Dec. 24, 2019, 1:26 p.m. UTC
There's no need to call paging_update_cr3 unless CR3 trapping is
enabled, and that's only the case when using shadow paging or when
requested for introspection purposes; otherwise there's no need to
pause all the vCPUs of the domain in order to perform the flush.

Check whether CR3 trapping is currently in use in order to decide
whether the vCPUs should be paused, otherwise just perform the flush.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/hvm/hvm.c | 55 ++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 18 deletions(-)

Comments

Andrew Cooper Dec. 27, 2019, 2:52 p.m. UTC | #1
On 24/12/2019 13:26, Roger Pau Monne wrote:
> There's no need to call paging_update_cr3 unless CR3 trapping is
> enabled, and that's only the case when using shadow paging or when
> requested for introspection purposes, otherwise there's no need to
> pause all the vCPUs of the domain in order to perform the flush.
>
> Check whether CR3 trapping is currently in use in order to decide
> whether the vCPUs should be paused, otherwise just perform the flush.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

I agree that the existing logic is poor, but this direction looks to be
even more fragile.

Instead, I think it would be better to follow the EPT invalidation
example; mark all vcpus as needing a tlb flush, and IPI the domain dirty
mask, having the return-to-guest path do the flushing.

This avoids all vcpu pausing/unpausing activities, and the cost of the
flush is incurred by the target vcpu, rather than the vcpu making the
hypercall accumulating the cost for everything, along with a large
number of remote VMCS accesses.

It can probably also remove the need for the flush_vcpu() callback which
is going to be expensive due to retpoline, and whose contents are trivial.

~Andrew
Roger Pau Monne Dec. 31, 2019, 12:10 p.m. UTC | #2
On Fri, Dec 27, 2019 at 02:52:17PM +0000, Andrew Cooper wrote:
> On 24/12/2019 13:26, Roger Pau Monne wrote:
> > There's no need to call paging_update_cr3 unless CR3 trapping is
> > enabled, and that's only the case when using shadow paging or when
> > requested for introspection purposes, otherwise there's no need to
> > pause all the vCPUs of the domain in order to perform the flush.
> >
> > Check whether CR3 trapping is currently in use in order to decide
> > whether the vCPUs should be paused, otherwise just perform the flush.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I agree that the existing logic is poor, but this direction looks to be
> even more fragile.
> 
> Instead, I think it would be better to follow the EPT invalidation
> example; mark all vcpus as needing a tlb flush, and IPI the domain dirty
> mask, having the return-to-guest path do the flushing.

AFAICT there's no need to call the tlb flush; the vmexit/vmentry
itself will perform the necessary flushes, so the only requirement is
to IPI the pCPUs in order to force a vmexit.

> This avoids all vcpu pausing/unpausing activities, and the cost of the
> flush is incurred by the target vcpu, rather than the vcpu making the
> hypercall accumulating the cost for everything, along with a large
> number of remote VMCS accesses.

Hm, then we would need a way to pin the vCPUs to the pCPUs they are
running on, or else in the introspection-enabled case you could end up
calling paging_update_cr3 on vCPUs of other domains (maybe that's
fine, but it could mess with introspection, I guess).

AFAICT the call to paging_update_cr3 needs to be done from
hvm_flush_vcpu_tlb, or else we would have to freeze the scheduler so
that vCPUs don't move around pCPUs (or get de-scheduled). I think
we still need the pause in the introspection case, but the open-coded
pause loop could be replaced with domain_pause_except_self.

> It can probably also remove the need for the flush_vcpu() callback which
> is going to be expensive due to retpoline, and whose contents are trivial.

I was planning to look into this, but wanted to send this version
first since it's already a big improvement in terms of performance.

Thanks, Roger.
Jan Beulich Jan. 3, 2020, 11:18 a.m. UTC | #3
On 31.12.2019 13:10, Roger Pau Monné wrote:
> On Fri, Dec 27, 2019 at 02:52:17PM +0000, Andrew Cooper wrote:
>> On 24/12/2019 13:26, Roger Pau Monne wrote:
>>> There's no need to call paging_update_cr3 unless CR3 trapping is
>>> enabled, and that's only the case when using shadow paging or when
>>> requested for introspection purposes, otherwise there's no need to
>>> pause all the vCPUs of the domain in order to perform the flush.
>>>
>>> Check whether CR3 trapping is currently in use in order to decide
>>> whether the vCPUs should be paused, otherwise just perform the flush.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> I agree that the existing logic is poor, but this direction looks to be
>> even more fragile.
>>
>> Instead, I think it would be better to follow the EPT invalidation
>> example; mark all vcpus as needing a tlb flush, and IPI the domain dirty
>> mask, having the return-to-guest path do the flushing.
> 
> AFAICT there's no need to call the tlb flush, the vmexit/vmentry
> itself will perform the necessary flushes, so the only requirement is
> to IPI the pCPUs in order to force a vmexit.

TLB flushing is at best conditional upon VM entry - see the callers
of hvm_asid_handle_vmenter().

Jan
Jan Beulich Jan. 3, 2020, 3:17 p.m. UTC | #4
On 24.12.2019 14:26, Roger Pau Monne wrote:
> There's no need to call paging_update_cr3 unless CR3 trapping is
> enabled, and that's only the case when using shadow paging or when
> requested for introspection purposes, otherwise there's no need to
> pause all the vCPUs of the domain in order to perform the flush.
> 
> Check whether CR3 trapping is currently in use in order to decide
> whether the vCPUs should be paused, otherwise just perform the flush.

First of all - with the commit introducing the pausing not saying
anything on the "why", you must have gained some understanding
there. Could you share this? I can't see why this was needed, and
sh_update_cr3() also doesn't look to have any respective ASSERT()
or the like. I'm having even more trouble seeing why in HAP mode the
pausing would be needed.

As a result I wonder whether, rather than determining whether
pausing is needed inside the function, this shouldn't be a flag
in struct paging_mode.

Next I seriously doubt introspection hooks should be called here.
Introspection should be about guest actions, and us calling
paging_update_cr3() is an implementation detail of Xen, not
something the guest controls. Even more, there not being any CR3
change here I wonder whether the call by the hooks to
hvm_update_guest_cr3() couldn't be suppressed altogether in this
case. Quite possibly in the shadow case there could be more
steps that aren't really needed, so perhaps a separate hook might
be in order.

Jan
Roger Pau Monne Jan. 8, 2020, 5:37 p.m. UTC | #5
On Fri, Jan 03, 2020 at 04:17:14PM +0100, Jan Beulich wrote:
> On 24.12.2019 14:26, Roger Pau Monne wrote:
> > There's no need to call paging_update_cr3 unless CR3 trapping is
> > enabled, and that's only the case when using shadow paging or when
> > requested for introspection purposes, otherwise there's no need to
> > pause all the vCPUs of the domain in order to perform the flush.
> > 
> > Check whether CR3 trapping is currently in use in order to decide
> > whether the vCPUs should be paused, otherwise just perform the flush.
> 
> First of all - with the commit introducing the pausing not saying
> anything on the "why", you must have gained some understanding
> there. Could you share this?

hap_update_cr3 does a "v->arch.hvm.hw_cr[3] = v->arch.hvm.guest_cr[3]"
unconditionally, and such access would be racy if the vCPU is running
and also modifying cr3 at the same time AFAICT.

Just pausing each vCPU before calling paging_update_cr3 should be fine
and would have a smaller performance penalty.

> I can't see why this was needed, and
> sh_update_cr3() also doesn't look to have any respective ASSERT()
> or the like. I'm having even more trouble seeing why in HAP mode the
> pausing would be needed.
>
> As a result I wonder whether, rather than determining whether
> pausing is needed inside the function, this shouldn't be a flag
> in struct paging_mode.
>
> Next I seriously doubt introspection hooks should be called here.
> Introspection should be about guest actions, and us calling
> paging_update_cr3() is an implementation detail of Xen, not
> something the guest controls. Even more, there not being any CR3
> change here I wonder whether the call by the hooks to
> hvm_update_guest_cr3() couldn't be suppressed altogether in this
> case. Quite possibly in the shadow case there could be more
> steps that aren't really needed, so perhaps a separate hook might
> be in order.

Right, I guess just having a hook that does a flush would be enough.
Let me try to propose something slightly better.

Thanks, Roger.

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 4dfaf35566..7dcc16afc6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3985,25 +3985,36 @@  bool hvm_flush_vcpu_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
     static DEFINE_PER_CPU(cpumask_t, flush_cpumask);
     cpumask_t *mask = &this_cpu(flush_cpumask);
     struct domain *d = current->domain;
+    /*
+     * CR3 trapping is only enabled when running with shadow paging or when
+     * requested for introspection purposes, otherwise there's no need to call
+     * paging_update_cr3 and hence pause all vCPUs.
+     */
+    bool trap_cr3 = !paging_mode_hap(d) ||
+                    (d->arch.monitor.write_ctrlreg_enabled &
+                    monitor_ctrlreg_bitmask(VM_EVENT_X86_CR3));
     struct vcpu *v;
 
-    /* Avoid deadlock if more than one vcpu tries this at the same time. */
-    if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
-        return false;
+    if ( trap_cr3 )
+    {
+        /* Avoid deadlock if more than one vcpu tries this at the same time. */
+        if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
+            return false;
 
-    /* Pause all other vcpus. */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            vcpu_pause_nosync(v);
+        /* Pause all other vcpus. */
+        for_each_vcpu ( d, v )
+            if ( v != current && flush_vcpu(ctxt, v) )
+                vcpu_pause_nosync(v);
 
-    /* Now that all VCPUs are signalled to deschedule, we wait... */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            while ( !vcpu_runnable(v) && v->is_running )
-                cpu_relax();
+        /* Now that all VCPUs are signalled to deschedule, we wait... */
+        for_each_vcpu ( d, v )
+            if ( v != current && flush_vcpu(ctxt, v) )
+                while ( !vcpu_runnable(v) && v->is_running )
+                    cpu_relax();
 
-    /* All other vcpus are paused, safe to unlock now. */
-    spin_unlock(&d->hypercall_deadlock_mutex);
+        /* All other vcpus are paused, safe to unlock now. */
+        spin_unlock(&d->hypercall_deadlock_mutex);
+    }
 
     cpumask_clear(mask);
 
@@ -4015,8 +4026,15 @@  bool hvm_flush_vcpu_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
         if ( !flush_vcpu(ctxt, v) )
             continue;
 
-        paging_update_cr3(v, false);
+        if ( trap_cr3 )
+            paging_update_cr3(v, false);
 
+        /*
+         * It's correct to do this flush without pausing the vCPUs: any vCPU
+         * context switch will already flush the tlb and the worst that could
+         * happen is that Xen ends up performing flushes on pCPUs that are no
+         * longer running the target vCPUs.
+         */
         cpu = read_atomic(&v->dirty_cpu);
         if ( is_vcpu_dirty_cpu(cpu) )
             __cpumask_set_cpu(cpu, mask);
@@ -4026,9 +4044,10 @@  bool hvm_flush_vcpu_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
     flush_tlb_mask(mask);
 
     /* Done. */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            vcpu_unpause(v);
+    if ( trap_cr3 )
+        for_each_vcpu ( d, v )
+            if ( v != current && flush_vcpu(ctxt, v) )
+                vcpu_unpause(v);
 
     return true;
 }