
[v5,3/7] x86/hap: improve hypervisor assisted guest TLB flush

Message ID 20200219174354.84726-4-roger.pau@citrix.com
State Superseded
Series x86: improve assisted tlb flush and use it in guest mode

Commit Message

Roger Pau Monné Feb. 19, 2020, 5:43 p.m. UTC
The current implementation of the hypervisor assisted flush for HAP is
extremely inefficient.

First of all, there's no need to call paging_update_cr3: the only
relevant part of that function when doing a flush is the ASID vCPU
flush, so call hvm_asid_flush_vcpu directly instead.

Since hvm_asid_flush_vcpu is protected against concurrent callers by
using atomic operations, there's no longer any need to pause the
affected vCPUs.
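
For context, the concurrency safety comes from the generation-based ASID
allocator: invalidating a vCPU's ASID amounts to a single atomic store, so
there is nothing for concurrent callers to race on. A minimal sketch of the
mechanism (simplified from xen/arch/x86/hvm/asid.c, not the verbatim source):

    void hvm_asid_flush_vcpu_asid(struct hvm_vcpu_asid *asid)
    {
        /* Generation 0 never matches a per-core generation, so the next
         * vmentry on any pCPU is forced to allocate a fresh ASID. */
        write_atomic(&asid->generation, 0);
    }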

Finally, the global TLB flush performed by flush_tlb_mask is also
unnecessary: since we only want to flush the guest TLB state, it's
enough to trigger a vmexit on the pCPUs currently holding any vCPU
state, as such a vmexit will already perform an ASID/VPID update and
thus clear the guest TLB.
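
The vmexit-driven flush then relies on the vmentry path noticing the stale
generation and allocating a fresh ASID/VPID, which makes the hardware stop
using the old guest TLB entries. Roughly (a simplified sketch of
hvm_asid_handle_vmenter, omitting generation rollover and the disabled-ASID
path):

    bool hvm_asid_handle_vmenter(struct hvm_vcpu_asid *asid)
    {
        struct hvm_asid_data *data = &this_cpu(hvm_asid_data);

        /* ASID still valid for this core's generation: nothing to do. */
        if ( read_atomic(&asid->generation) == data->core_asid_generation )
            return false;

        /* Stale (e.g. zeroed by hvm_asid_flush_vcpu): hand out a fresh
         * ASID/VPID so old guest TLB entries can no longer be used. */
        asid->asid = data->next_asid++;
        write_atomic(&asid->generation, data->core_asid_generation);

        /* The first ASID of a generation requires a full TLB flush. */
        return asid->asid == 1;
    }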

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
---
Changes since v3:
 - s/do_flush/handle_flush/.
 - Add comment about handle_flush usage.
 - Fix VPID typo in comment.
---
 xen/arch/x86/mm/hap/hap.c | 52 +++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 27 deletions(-)

Comments

Jan Beulich Feb. 28, 2020, 1:58 p.m. UTC | #1
On 19.02.2020 18:43, Roger Pau Monne wrote:
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -669,32 +669,28 @@ static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
>      hvm_update_guest_cr3(v, noflush);
>  }
>  
> +/*
> + * NB: doesn't actually perform any flush, used just to clear the CPU from the
> + * mask and hence signal that the guest TLB flush has been done.
> + */

"has been done" isn't correct, as the flush may happen only later
on (upon next VM entry). I think wording here needs to be as
precise as possible, however the comment may turn out unnecessary
altogether:

> @@ -705,20 +701,22 @@ bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
>          if ( !flush_vcpu(ctxt, v) )
>              continue;
>  
> -        paging_update_cr3(v, false);
> +        hvm_asid_flush_vcpu(v);
>  
>          cpu = read_atomic(&v->dirty_cpu);
> -        if ( is_vcpu_dirty_cpu(cpu) )
> +        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
>              __cpumask_set_cpu(cpu, mask);
>      }
>  
> -    /* Flush TLBs on all CPUs with dirty vcpu state. */
> -    flush_tlb_mask(mask);
> -
> -    /* Done. */
> -    for_each_vcpu ( d, v )
> -        if ( v != current && flush_vcpu(ctxt, v) )
> -            vcpu_unpause(v);
> +    /*
> +     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
> +     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
> +     * not currently running will already be flushed when scheduled because of
> +     * the ASID tickle done in the loop above.
> +     */
> +    on_selected_cpus(mask, handle_flush, mask, 0);
> +    while ( !cpumask_empty(mask) )
> +        cpu_relax();

Why do you pass 0 as last argument? If you passed 1, you wouldn't
need the while() here, handle_flush() could be empty, and you'd
spare a perhaps large number of CPUs from all trying to modify two
cache lines (the one used by on_selected_cpus() itself and the
one here) instead of just one. Yes, latency from the last CPU
clearing its bit to you being able to exit from here may be a
little larger this way, but overall latency may shrink with the
amount of atomic ops issued to the bus cut in half.

(Forcing an empty function to be called may seem odd, but sending
 just some IPI [like the event check one] wouldn't be enough, as
 you want to be sure the other side has actually received the
 request.)

Jan
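
For illustration, the alternative suggested above would look roughly like
this (a sketch of the idea, not a tested change):

    /* Empty: the vmexit taken to service the call-function IPI is all
     * that is needed to force an ASID/VPID update on the next vmentry. */
    static void handle_flush(void *data)
    {
    }

        /* wait == 1: on_selected_cpus() only returns once handle_flush()
         * has run on every CPU in mask, so no private mask protocol or
         * spin loop is needed in hap_flush_tlb(). */
        on_selected_cpus(mask, handle_flush, NULL, 1);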
Roger Pau Monné Feb. 28, 2020, 4:31 p.m. UTC | #2
On Fri, Feb 28, 2020 at 02:58:42PM +0100, Jan Beulich wrote:
> On 19.02.2020 18:43, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -669,32 +669,28 @@ static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
> >      hvm_update_guest_cr3(v, noflush);
> >  }
> >  
> > +/*
> > + * NB: doesn't actually perform any flush, used just to clear the CPU from the
> > + * mask and hence signal that the guest TLB flush has been done.
> > + */
> 
> "has been done" isn't correct, as the flush may happen only later
> on (upon next VM entry). I think wording here needs to be as
> precise as possible, however the comment may turn out unnecessary
> altogether:

What about:

/*
 * NB: Dummy function exclusively used as a way to trigger a vmexit,
 * and thus force an ASID/VPID update on vmentry (that will flush the
 * cache).
 */

> > @@ -705,20 +701,22 @@ bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
> >          if ( !flush_vcpu(ctxt, v) )
> >              continue;
> >  
> > -        paging_update_cr3(v, false);
> > +        hvm_asid_flush_vcpu(v);
> >  
> >          cpu = read_atomic(&v->dirty_cpu);
> > -        if ( is_vcpu_dirty_cpu(cpu) )
> > +        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
> >              __cpumask_set_cpu(cpu, mask);
> >      }
> >  
> > -    /* Flush TLBs on all CPUs with dirty vcpu state. */
> > -    flush_tlb_mask(mask);
> > -
> > -    /* Done. */
> > -    for_each_vcpu ( d, v )
> > -        if ( v != current && flush_vcpu(ctxt, v) )
> > -            vcpu_unpause(v);
> > +    /*
> > +     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
> > +     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
> > +     * not currently running will already be flushed when scheduled because of
> > +     * the ASID tickle done in the loop above.
> > +     */
> > +    on_selected_cpus(mask, handle_flush, mask, 0);
> > +    while ( !cpumask_empty(mask) )
> > +        cpu_relax();
> 
> Why do you pass 0 as last argument? If you passed 1, you wouldn't
> need the while() here, handle_flush() could be empty, and you'd
> spare a perhaps large number of CPUs from all trying to modify two
> cache lines (the one used by on_selected_cpus() itself and the
> one here) instead of just one. Yes, latency from the last CPU
> clearing its bit to you being able to exit from here may be a
> little larger this way, but overall latency may shrink with the
> amount of atomic ops issued to the bus cut in half.

In fact I think passing 0 as the last argument is fine, and
handle_flush can be empty in that case anyway. We don't really care
whether on_selected_cpus returns before all CPUs have executed the
dummy function, as long as all of them have taken a vmexit. Using 0
already guarantees that AFAICT.

> (Forcing an empty function to be called may seem odd, but sending
>  just some IPI [like the event check one] wouldn't be enough, as
>  you want to be sure the other side has actually received the
>  request.)

Exactly, that's the reason for the empty function.

Thanks, Roger.
Jan Beulich Feb. 28, 2020, 4:44 p.m. UTC | #3
On 28.02.2020 17:31, Roger Pau Monné wrote:
> On Fri, Feb 28, 2020 at 02:58:42PM +0100, Jan Beulich wrote:
>> On 19.02.2020 18:43, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/mm/hap/hap.c
>>> +++ b/xen/arch/x86/mm/hap/hap.c
>>> @@ -669,32 +669,28 @@ static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
>>>      hvm_update_guest_cr3(v, noflush);
>>>  }
>>>  
>>> +/*
>>> + * NB: doesn't actually perform any flush, used just to clear the CPU from the
>>> + * mask and hence signal that the guest TLB flush has been done.
>>> + */
>>
>> "has been done" isn't correct, as the flush may happen only later
>> on (upon next VM entry). I think wording here needs to be as
>> precise as possible, however the comment may turn out unnecessary
>> altogether:
> 
> What about:
> 
> /*
>  * NB: Dummy function exclusively used as a way to trigger a vmexit,
>  * and thus force an ASID/VPID update on vmentry (that will flush the
>  * cache).
>  */

Once it's empty - yes, looks okay (with s/cache/TLB/).

>>> @@ -705,20 +701,22 @@ bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
>>>          if ( !flush_vcpu(ctxt, v) )
>>>              continue;
>>>  
>>> -        paging_update_cr3(v, false);
>>> +        hvm_asid_flush_vcpu(v);
>>>  
>>>          cpu = read_atomic(&v->dirty_cpu);
>>> -        if ( is_vcpu_dirty_cpu(cpu) )
>>> +        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
>>>              __cpumask_set_cpu(cpu, mask);
>>>      }
>>>  
>>> -    /* Flush TLBs on all CPUs with dirty vcpu state. */
>>> -    flush_tlb_mask(mask);
>>> -
>>> -    /* Done. */
>>> -    for_each_vcpu ( d, v )
>>> -        if ( v != current && flush_vcpu(ctxt, v) )
>>> -            vcpu_unpause(v);
>>> +    /*
>>> +     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
>>> +     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
>>> +     * not currently running will already be flushed when scheduled because of
>>> +     * the ASID tickle done in the loop above.
>>> +     */
>>> +    on_selected_cpus(mask, handle_flush, mask, 0);
>>> +    while ( !cpumask_empty(mask) )
>>> +        cpu_relax();
>>
>> Why do you pass 0 as last argument? If you passed 1, you wouldn't
>> need the while() here, handle_flush() could be empty, and you'd
>> spare a perhaps large number of CPUs from all trying to modify two
>> cache lines (the one used by on_selected_cpus() itself and the
>> one here) instead of just one. Yes, latency from the last CPU
>> clearing its bit to you being able to exit from here may be a
>> little larger this way, but overall latency may shrink with the
>> amount of atomic ops issued to the bus cut in half.
> 
> In fact I think passing 0 as the last argument is fine, and
> handle_flush can be empty in that case anyway. We don't really care
> whether on_selected_cpus returns before all CPUs have executed the
> dummy function, as long as all of them have taken a vmexit. Using 0
> already guarantees that AFAICT.

Isn't it that the caller wants to be guaranteed that the flush
has actually occurred (or at least that no further insns can
be executed prior to the flush happening, i.e. at least the VM
exit having occurred)?

>> (Forcing an empty function to be called may seem odd, but sending
>>  just some IPI [like the event check one] wouldn't be enough, as
>>  you want to be sure the other side has actually received the
>>  request.)
> 
> Exactly, that's the reason for the empty function.

But the function isn't empty.

Jan
Roger Pau Monné Feb. 28, 2020, 4:57 p.m. UTC | #4
On Fri, Feb 28, 2020 at 05:44:57PM +0100, Jan Beulich wrote:
> On 28.02.2020 17:31, Roger Pau Monné wrote:
> > On Fri, Feb 28, 2020 at 02:58:42PM +0100, Jan Beulich wrote:
> >> On 19.02.2020 18:43, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/mm/hap/hap.c
> >>> +++ b/xen/arch/x86/mm/hap/hap.c
> >>> @@ -669,32 +669,28 @@ static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
> >>>      hvm_update_guest_cr3(v, noflush);
> >>>  }
> >>>  
> >>> +/*
> >>> + * NB: doesn't actually perform any flush, used just to clear the CPU from the
> >>> + * mask and hence signal that the guest TLB flush has been done.
> >>> + */
> >>
> >> "has been done" isn't correct, as the flush may happen only later
> >> on (upon next VM entry). I think wording here needs to be as
> >> precise as possible, however the comment may turn out unnecessary
> >> altogether:
> > 
> > What about:
> > 
> > /*
> >  * NB: Dummy function exclusively used as a way to trigger a vmexit,
> >  * and thus force an ASID/VPID update on vmentry (that will flush the
> >  * cache).
> >  */
> 
> Once it's empty - yes, looks okay (with s/cache/TLB/).

Ack.

> >>> @@ -705,20 +701,22 @@ bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
> >>>          if ( !flush_vcpu(ctxt, v) )
> >>>              continue;
> >>>  
> >>> -        paging_update_cr3(v, false);
> >>> +        hvm_asid_flush_vcpu(v);
> >>>  
> >>>          cpu = read_atomic(&v->dirty_cpu);
> >>> -        if ( is_vcpu_dirty_cpu(cpu) )
> >>> +        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
> >>>              __cpumask_set_cpu(cpu, mask);
> >>>      }
> >>>  
> >>> -    /* Flush TLBs on all CPUs with dirty vcpu state. */
> >>> -    flush_tlb_mask(mask);
> >>> -
> >>> -    /* Done. */
> >>> -    for_each_vcpu ( d, v )
> >>> -        if ( v != current && flush_vcpu(ctxt, v) )
> >>> -            vcpu_unpause(v);
> >>> +    /*
> >>> +     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
> >>> +     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
> >>> +     * not currently running will already be flushed when scheduled because of
> >>> +     * the ASID tickle done in the loop above.
> >>> +     */
> >>> +    on_selected_cpus(mask, handle_flush, mask, 0);
> >>> +    while ( !cpumask_empty(mask) )
> >>> +        cpu_relax();
> >>
> >> Why do you pass 0 as last argument? If you passed 1, you wouldn't
> >> need the while() here, handle_flush() could be empty, and you'd
> >> spare a perhaps large number of CPUs from all trying to modify two
> >> cache lines (the one used by on_selected_cpus() itself and the
> >> one here) instead of just one. Yes, latency from the last CPU
> >> clearing its bit to you being able to exit from here may be a
> >> little larger this way, but overall latency may shrink with the
> >> amount of atomic ops issued to the bus cut in half.
> > 
> > In fact I think passing 0 as the last argument is fine, and
> > handle_flush can be empty in that case anyway. We don't really care
> > whether on_selected_cpus returns before all CPUs have executed the
> > dummy function, as long as all of them have taken a vmexit. Using 0
> > already guarantees that AFAICT.
> 
> Isn't it that the caller wants to be guaranteed that the flush
> has actually occurred (or at least that no further insns can
> be executed prior to the flush happening, i.e. at least the VM
> exit having occurred)?

Yes, but that would happen with 0 as the last argument already, see
the code in smp_call_function_interrupt:

    if ( call_data.wait )
    {
        (*func)(info);
        smp_mb();
        cpumask_clear_cpu(cpu, &call_data.selected);
    }
    else
    {
        smp_mb();
        cpumask_clear_cpu(cpu, &call_data.selected);
        (*func)(info);
    }

The only difference is whether on_selected_cpus can return before all
the functions have been executed on all CPUs, or whether all CPUs have
taken a vmexit. The latter is fine for our use case.
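
For completeness, the tail of on_selected_cpus itself spins until every
selected CPU has entered the handler, regardless of the wait value (trimmed
from xen/common/smp.c):

    smp_send_call_function_mask(&call_data.selected);

    /* Returns only once every CPU has at least entered the IPI handler
     * (and hence taken a vmexit), even when wait == 0. */
    while ( !cpumask_empty(&call_data.selected) )
        cpu_relax();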

> >> (Forcing an empty function to be called may seem odd, but sending
> >>  just some IPI [like the event check one] wouldn't be enough, as
> >>  you want to be sure the other side has actually received the
> >>  request.)
> > 
> > Exactly, that's the reason for the empty function.
> 
> But the function isn't empty.

It will be now, since on_selected_cpus already does the needed
synchronization as you pointed out.

Thanks, Roger.
Jan Beulich Feb. 28, 2020, 5:02 p.m. UTC | #5
On 28.02.2020 17:57, Roger Pau Monné wrote:
> On Fri, Feb 28, 2020 at 05:44:57PM +0100, Jan Beulich wrote:
>> On 28.02.2020 17:31, Roger Pau Monné wrote:
>>> On Fri, Feb 28, 2020 at 02:58:42PM +0100, Jan Beulich wrote:
>>>> On 19.02.2020 18:43, Roger Pau Monne wrote:
>>>>> @@ -705,20 +701,22 @@ bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
>>>>>          if ( !flush_vcpu(ctxt, v) )
>>>>>              continue;
>>>>>  
>>>>> -        paging_update_cr3(v, false);
>>>>> +        hvm_asid_flush_vcpu(v);
>>>>>  
>>>>>          cpu = read_atomic(&v->dirty_cpu);
>>>>> -        if ( is_vcpu_dirty_cpu(cpu) )
>>>>> +        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
>>>>>              __cpumask_set_cpu(cpu, mask);
>>>>>      }
>>>>>  
>>>>> -    /* Flush TLBs on all CPUs with dirty vcpu state. */
>>>>> -    flush_tlb_mask(mask);
>>>>> -
>>>>> -    /* Done. */
>>>>> -    for_each_vcpu ( d, v )
>>>>> -        if ( v != current && flush_vcpu(ctxt, v) )
>>>>> -            vcpu_unpause(v);
>>>>> +    /*
>>>>> +     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
>>>>> +     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
>>>>> +     * not currently running will already be flushed when scheduled because of
>>>>> +     * the ASID tickle done in the loop above.
>>>>> +     */
>>>>> +    on_selected_cpus(mask, handle_flush, mask, 0);
>>>>> +    while ( !cpumask_empty(mask) )
>>>>> +        cpu_relax();
>>>>
>>>> Why do you pass 0 as last argument? If you passed 1, you wouldn't
>>>> need the while() here, handle_flush() could be empty, and you'd
>>>> spare a perhaps large number of CPUs from all trying to modify two
>>>> cache lines (the one used by on_selected_cpus() itself and the
>>>> one here) instead of just one. Yes, latency from the last CPU
>>>> clearing its bit to you being able to exit from here may be a
>>>> little larger this way, but overall latency may shrink with the
>>>> amount of atomic ops issued to the bus cut in half.
>>>
>>> In fact I think passing 0 as the last argument is fine, and
>>> handle_flush can be empty in that case anyway. We don't really care
>>> whether on_selected_cpus returns before all CPUs have executed the
>>> dummy function, as long as all of them have taken a vmexit. Using 0
>>> already guarantees that AFAICT.
>>
>> Isn't it that the caller wants to be guaranteed that the flush
>> has actually occurred (or at least that no further insns can
>> be executed prior to the flush happening, i.e. at least the VM
>> exit having occurred)?
> 
> Yes, but that would happen with 0 as the last argument already, see
> the code in smp_call_function_interrupt:
> 
>     if ( call_data.wait )
>     {
>         (*func)(info);
>         smp_mb();
>         cpumask_clear_cpu(cpu, &call_data.selected);
>     }
>     else
>     {
>         smp_mb();
>         cpumask_clear_cpu(cpu, &call_data.selected);
>         (*func)(info);
>     }
> 
> The only difference is whether on_selected_cpus can return before all
> the functions have been executed on all CPUs, or whether all CPUs have
> taken a vmexit. The latter is fine for our use case.

Oh, yes - right you are.

Jan
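
Given where the thread lands, the follow-up revision would presumably keep
wait == 0 but make the callback a true no-op and drop the private mask
protocol, along these lines (a sketch of the agreed direction, not the
committed change):

    /*
     * Dummy function exclusively used as a way to trigger a vmexit, and thus
     * force an ASID/VPID update on vmentry (which will flush the TLB).
     */
    static void handle_flush(void *data)
    {
    }

        /* on_selected_cpus() already waits for every CPU in mask to enter
         * the IPI handler, i.e. to take the vmexit we are after. */
        on_selected_cpus(mask, handle_flush, NULL, 0);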

Patch

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 6894c1aa38..dbb61bf9c6 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -669,32 +669,28 @@  static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
     hvm_update_guest_cr3(v, noflush);
 }
 
+/*
+ * NB: doesn't actually perform any flush, used just to clear the CPU from the
+ * mask and hence signal that the guest TLB flush has been done.
+ */
+static void handle_flush(void *data)
+{
+    cpumask_t *mask = data;
+    unsigned int cpu = smp_processor_id();
+
+    ASSERT(cpumask_test_cpu(cpu, mask));
+    cpumask_clear_cpu(cpu, mask);
+}
+
 bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
                    void *ctxt)
 {
     static DEFINE_PER_CPU(cpumask_t, flush_cpumask);
     cpumask_t *mask = &this_cpu(flush_cpumask);
     struct domain *d = current->domain;
+    unsigned int this_cpu = smp_processor_id();
     struct vcpu *v;
 
-    /* Avoid deadlock if more than one vcpu tries this at the same time. */
-    if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
-        return false;
-
-    /* Pause all other vcpus. */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            vcpu_pause_nosync(v);
-
-    /* Now that all VCPUs are signalled to deschedule, we wait... */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            while ( !vcpu_runnable(v) && v->is_running )
-                cpu_relax();
-
-    /* All other vcpus are paused, safe to unlock now. */
-    spin_unlock(&d->hypercall_deadlock_mutex);
-
     cpumask_clear(mask);
 
     /* Flush paging-mode soft state (e.g., va->gfn cache; PAE PDPE cache). */
@@ -705,20 +701,22 @@  bool hap_flush_tlb(bool (*flush_vcpu)(void *ctxt, struct vcpu *v),
         if ( !flush_vcpu(ctxt, v) )
             continue;
 
-        paging_update_cr3(v, false);
+        hvm_asid_flush_vcpu(v);
 
         cpu = read_atomic(&v->dirty_cpu);
-        if ( is_vcpu_dirty_cpu(cpu) )
+        if ( cpu != this_cpu && is_vcpu_dirty_cpu(cpu) )
             __cpumask_set_cpu(cpu, mask);
     }
 
-    /* Flush TLBs on all CPUs with dirty vcpu state. */
-    flush_tlb_mask(mask);
-
-    /* Done. */
-    for_each_vcpu ( d, v )
-        if ( v != current && flush_vcpu(ctxt, v) )
-            vcpu_unpause(v);
+    /*
+     * Trigger a vmexit on all pCPUs with dirty vCPU state in order to force an
+     * ASID/VPID change and hence accomplish a guest TLB flush. Note that vCPUs
+     * not currently running will already be flushed when scheduled because of
+     * the ASID tickle done in the loop above.
+     */
+    on_selected_cpus(mask, handle_flush, mask, 0);
+    while ( !cpumask_empty(mask) )
+        cpu_relax();
 
     return true;
 }
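
As a usage illustration, callers hand hap_flush_tlb a predicate selecting
which vCPUs to flush; flushing every vCPU of the current domain might look
like this (hypothetical caller, names not taken from this series):

    static bool always_flush(void *ctxt, struct vcpu *v)
    {
        return true;   /* select every vCPU of the domain */
    }

        /* With the trylock on hypercall_deadlock_mutex removed, the
         * function no longer asks the caller to retry: it returns true. */
        hap_flush_tlb(always_flush, NULL);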