[v2,1/2] VMX: fix VMCS race on context-switch paths

Message ID	5A0177B8020000780018CCC9@prv-mh.provo.novell.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> Message-Id: <5A0177B8020000780018CCC9@prv-mh.provo.novell.com> Date: Tue, 07 Nov 2017 01:07:04 -0700 From: "Jan Beulich" <JBeulich@suse.com> To: "Igor Druzhinin" <igor.druzhinin@citrix.com> References: <58A596C0020000780013AA84@prv-mh.provo.novell.com> <58A597D8020000780013AAAF@prv-mh.provo.novell.com> <b24ef6a3-a8f6-7da6-0295-6bc96c0725e8@citrix.com> <5ca9f140-a574-a8d0-1231-4ce0aec0e124@citrix.com> In-Reply-To: <5ca9f140-a574-a8d0-1231-4ce0aec0e124@citrix.com> Mime-Version: 1.0 Content-Disposition: inline Cc: Sergey Dyasli <sergey.dyasli@citrix.com>, Kevin Tian <kevin.tian@intel.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Anshul Makkar <anshul.makkar@citrix.com>, raistlin@linux.it, Jun Nakajima <jun.nakajima@intel.com>, xen-devel <xen-devel@lists.xenproject.org> Subject: Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Jan Beulich Nov. 7, 2017, 8:07 a.m. UTC

>>> On 02.11.17 at 20:46, <igor.druzhinin@citrix.com> wrote:
>> Any ideas about the root cause of the fault and suggestions how to reproduce it
>> would be welcome. Does this crash really has something to do with PML? I doubt
>> because the original environment may hardly be called PML-heavy.

Well, PML-heaviness doesn't matter. It's the mere fact that PML
is enabled on the vCPU being destroyed.

> So we finally have complete understanding of what's going on:
> 
> Some vCPU has just migrated to another pCPU and we switched to idle but
> per_cpu(curr_vcpu) on the current pCPU is still pointing to it - this is
> how the current logic works. While we're in idle we're issuing
> vcpu_destroy() for some other domain which eventually calls
> vmx_vcpu_disable_pml() and trashes VMCS pointer on the current pCPU. At
> this moment we get a TLB flush IPI from that same vCPU which is now
> context switching on another pCPU - it appears to clean TLB after
> itself. This vCPU is already marked is_running=1 by the scheduler. In
> the IPI handler we enter __sync_local_execstate() and trying to call
> vmx_ctxt_switch_from() for the migrated vCPU which is supposed to call
> vmcs_reload() but doesn't do it because is_running==1. The next VMWRITE
> crashes the hypervisor.
> 
> So the state transition diagram might look like:
> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->

I'm not really clear about who/what is "idle" here: pCPU1,
pCPU2, or yet something else? If vCPUx migrated to pCPU2,
wouldn't it be put back into runnable state right away, and
hence pCPU2 can't be idle at this point? Yet for pCPU1 I don't
think its idleness would matter much, i.e. the situation could
also arise without it becoming idle afaics. pCPU1 making it
anywhere softirqs are being processed would suffice.

> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
> 
> We can basically just fix the condition around vmcs_reload() call but
> I'm not completely sure that it's the right way to do - I don't think
> leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good idea
> (maybe we need to clean it). What are your thoughts?

per_cpu(curr_vcpu) can only validly be written inside
__context_switch(), hence the only way to achieve this would
be to force __context_switch() to be called earlier than out of
the TLB flush IPI handler, perhaps like in the (untested!) patch
below. Two questions then remain:
- Should we perhaps rather do this in an arch-independent way
  (i.e. ahead of the call to vcpu_destroy() in common code)?
- This deals with only a special case of the more general "TLB
  flush behind the back of a vmx_vmcs_enter() /
  vmx_vmcs_exit() section" - does this need dealing with in a
  more general way? Here I'm thinking of introducing a
  FLUSH_STATE flag to be passed to flush_mask() instead of
  the current flush_tlb_mask() in context_switch() and
  sync_vcpu_execstate(). This could at the same time be used
  for a small performance optimization: At least for HAP vCPU-s
  I don't think we really need the TLB part of the flushes here.

Jan

Igor Druzhinin Nov. 7, 2017, 2:24 p.m. UTC | #1

On 07/11/17 08:07, Jan Beulich wrote:
>>>> On 02.11.17 at 20:46, <igor.druzhinin@citrix.com> wrote:
>>> Any ideas about the root cause of the fault and suggestions how to reproduce it
>>> would be welcome. Does this crash really has something to do with PML? I doubt
>>> because the original environment may hardly be called PML-heavy.
> 
> Well, PML-heaviness doesn't matter. It's the mere fact that PML
> is enabled on the vCPU being destroyed.
> 
>> So we finally have complete understanding of what's going on:
>>
>> Some vCPU has just migrated to another pCPU and we switched to idle but
>> per_cpu(curr_vcpu) on the current pCPU is still pointing to it - this is
>> how the current logic works. While we're in idle we're issuing
>> vcpu_destroy() for some other domain which eventually calls
>> vmx_vcpu_disable_pml() and trashes VMCS pointer on the current pCPU. At
>> this moment we get a TLB flush IPI from that same vCPU which is now
>> context switching on another pCPU - it appears to clean TLB after
>> itself. This vCPU is already marked is_running=1 by the scheduler. In
>> the IPI handler we enter __sync_local_execstate() and trying to call
>> vmx_ctxt_switch_from() for the migrated vCPU which is supposed to call
>> vmcs_reload() but doesn't do it because is_running==1. The next VMWRITE
>> crashes the hypervisor.
>>
>> So the state transition diagram might look like:
>> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->
> 
> I'm not really clear about who/what is "idle" here: pCPU1,
> pCPU2, or yet something else?

It's switching to the "current" idle context on pCPU1.

> If vCPUx migrated to pCPU2,
> wouldn't it be put back into runnable state right away, and
> hence pCPU2 can't be idle at this point? Yet for pCPU1 I don't
> think its idleness would matter much, i.e. the situation could
> also arise without it becoming idle afaics. pCPU1 making it
> anywhere softirqs are being processed would suffice.
> 

Idleness matters in that case because we are not switching
per_cpu(curr_vcpu) which I think is the main problem when vCPU migration
comes into play.

>> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
>> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
>> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
>>
>> We can basically just fix the condition around vmcs_reload() call but
>> I'm not completely sure that it's the right way to do - I don't think
>> leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good idea
>> (maybe we need to clean it). What are your thoughts?
> 
> per_cpu(curr_vcpu) can only validly be written inside
> __context_switch(), hence the only way to achieve this would
> be to force __context_switch() to be called earlier than out of
> the TLB flush IPI handler, perhaps like in the (untested!) patch
> below. Two questions then remain:
> - Should we perhaps rather do this in an arch-independent way
>   (i.e. ahead of the call to vcpu_destroy() in common code)?
> - This deals with only a special case of the more general "TLB
>   flush behind the back of a vmx_vmcs_enter() /
>   vmx_vmcs_exit() section" - does this need dealing with in a
>   more general way? Here I'm thinking of introducing a
>   FLUSH_STATE flag to be passed to flush_mask() instead of
>   the current flush_tlb_mask() in context_switch() and
>   sync_vcpu_execstate(). This could at the same time be used
>   for a small performance optimization: At least for HAP vCPU-s
>   I don't think we really need the TLB part of the flushes here.
> 
> Jan
> 
> --- unstable.orig/xen/arch/x86/domain.c
> +++ unstable/xen/arch/x86/domain.c
> @@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
>  
>  void vcpu_destroy(struct vcpu *v)
>  {
> +    /*
> +     * Flush all state for this vCPU before fully tearing it down. This is
> +     * particularly important for HVM ones on VMX, so that this flushing of
> +     * state won't happen from the TLB flush IPI handler behind the back of
> +     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
> +     */
> +    sync_vcpu_execstate(v);
> +
>      xfree(v->arch.vm_event);
>      v->arch.vm_event = NULL;
>  

I don't think this is going to fix the problem since vCPU we are
currently destroying has nothing to do with the vCPUx that actually
caused the problem by its migration. We still are going to call
vmx_vcpu_disable_pml() which loads and cleans VMCS on the current pCPU.
Perhaps I should improve my diagram:

pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle context
-> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at this
point on pCPU1)

pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB flush
from context switch to clean TLB on pCPU1

(pCPU1 is still somewhere in vcpu_destroy() loop and with VMCS cleared
by vmx_vcpu_disable_pml())

pCPU1: IPI handler for TLB flush -> context switch out of vCPUx (this is
here because we haven't switched per_cpu(curr_vcpu) before) -> (no
vmcs_reload() here because pCPU2 already set is_running to 1) -> VMWRITE
-> CRASH!

Igor

Raistlin Nov. 9, 2017, 9:54 a.m. UTC | #2

On Tue, 2017-11-07 at 14:24 +0000, Igor Druzhinin wrote:
> Perhaps I should improve my diagram:
> 
> pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle
> context
> -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
> vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at this
> point on pCPU1)
> 
> pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB flush
> from context switch to clean TLB on pCPU1
> 
Sorry, there must be something I'm missing (or misunderstanding).

What is this code that checks is_running and triggers the TLB flush?

But, more important, how come you are context switching to something
that has is_running == 1 ? That should not be possible.

In fact, from a scheduling code perspective, since you're mentioning
vCPU migration between pCPUs:

 pCPU1
 .
 .
 //vCPUx->is_running is 1
 vCPUx->pause_flags |= _VPF_migrating
 schedule()
  idle->is_running = 1
  //vCPUx->pause_flags != 0 ==> it's blocked and can't be scheduled!
  context_switch( prev=vCPUx, next=idle )
   set_current( idle )
   //let's be lazy! don't call __context_switch()
   context_saved( vCPUx )
    vCPUx->is_running = 0
    SCHED_OP( context_saved ) //NULL for Credit1
    vcpu_migrate( vCPUx )
     if ( vCPUx->is_running || !test_and_clear(_VPF_migrating) )
      return;
     vcpu_wake( vCPUx )
 .
 .
 .

So, basically, the scheduler on pCPU2 can decide to pick vCPUx from the
runqueue and switch to it _only_ if it has gone through vcpu_wake(),
which must actually have woken up it, which happens if _VPF_migrating
has been cleared, which means is_running was 0 already.

Dario

Jan Beulich Nov. 9, 2017, 10:17 a.m. UTC | #3

>>> On 09.11.17 at 10:54, <raistlin@linux.it> wrote:
> On Tue, 2017-11-07 at 14:24 +0000, Igor Druzhinin wrote:
>> Perhaps I should improve my diagram:
>> 
>> pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle
>> context
>> -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
>> vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at this
>> point on pCPU1)
>> 
>> pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB flush
>> from context switch to clean TLB on pCPU1
>> 
> Sorry, there must be something I'm missing (or misunderstanding).
> 
> What is this code that checks is_running and triggers the TLB flush?

I don't see where Igor said is_running is being checked around a
TLB flush. The TLB flush itself is what happens first thing in
context_switch() (and it's really using the TLB flush interface to
mainly effect the state flush, with the TLB flush being an implied
side effect; I've already got a series of further patches to make
this less implicit).

> But, more important, how come you are context switching to something
> that has is_running == 1 ? That should not be possible.

That's not what Igor's diagram says - it's indicating the fact that
is_running is being set to 1 in the process of context switching
into vCPUx.

Jan

Sergey Dyasli Nov. 9, 2017, 10:36 a.m. UTC | #4

On Thu, 2017-11-09 at 03:17 -0700, Jan Beulich wrote:
> > > > On 09.11.17 at 10:54, <raistlin@linux.it> wrote:

> > 

> > On Tue, 2017-11-07 at 14:24 +0000, Igor Druzhinin wrote:

> > > Perhaps I should improve my diagram:

> > > 

> > > pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle

> > > context

> > > -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->

> > > vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at this

> > > point on pCPU1)

> > > 

> > > pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB flush

> > > from context switch to clean TLB on pCPU1

> > > 

> > 

> > Sorry, there must be something I'm missing (or misunderstanding).

> > 

> > What is this code that checks is_running and triggers the TLB flush?

> 

> I don't see where Igor said is_running is being checked around a

> TLB flush. The TLB flush itself is what happens first thing in

> context_switch() (and it's really using the TLB flush interface to

> mainly effect the state flush, with the TLB flush being an implied

> side effect; I've already got a series of further patches to make

> this less implicit).

> 

> > But, more important, how come you are context switching to something

> > that has is_running == 1 ? That should not be possible.

> 

> That's not what Igor's diagram says - it's indicating the fact that

> is_running is being set to 1 in the process of context switching

> into vCPUx.


Jan, Dario,

Igor was referring to the following situation:


pCPU1                                   pCPU2
=====                                   =====
current == vCPU1
context_switch(next == idle)
!! __context_switch() is skipped
vcpu_migrate(vCPU1)
RCU callbacks
vmx_vcpu_destroy()
vmx_vcpu_disable_pml()
current_vmcs = 0

                                        schedule(next == vCPU1)
                                        vCPU1->is_running = 1;
                                        context_switch(next == vCPU1)
                                        flush_tlb_mask(&dirty_mask);
                                    
                                <--- IPI

__sync_local_execstate()
__context_switch(prev == vCPU1)
vmx_ctxt_switch_from(vCPU1)
vCPU1->is_running == 1
!! vmx_vmcs_reload() is skipped


I hope that this better illustrates the root cause.

-- 
Thanks,
Sergey

Raistlin Nov. 9, 2017, 10:39 a.m. UTC | #5

On Thu, 2017-11-09 at 03:17 -0700, Jan Beulich wrote:
> > > > On 09.11.17 at 10:54, <raistlin@linux.it> wrote:
> > On Tue, 2017-11-07 at 14:24 +0000, Igor Druzhinin wrote:
> > > Perhaps I should improve my diagram:
> > > 
> > > pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle
> > > context
> > > -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
> > > vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at
> > > this
> > > point on pCPU1)
> > > 
> > > pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB
> > > flush
> > > from context switch to clean TLB on pCPU1
> > 
> > But, more important, how come you are context switching to
> > something
> > that has is_running == 1 ? That should not be possible.
> 
> That's not what Igor's diagram says - it's indicating the fact that
> is_running is being set to 1 in the process of context switching
> into vCPUx.
> 
Ah, ok. So I was right: I indeed was misunderstanding something, i.e.,
the diagram itself. :-)

Now I get it.

Sorry for the noise,
Dario

Raistlin Nov. 9, 2017, 11:01 a.m. UTC | #6

On Thu, 2017-11-09 at 10:36 +0000, Sergey Dyasli wrote:
> On Thu, 2017-11-09 at 03:17 -0700, Jan Beulich wrote:
> > > > > On 09.11.17 at 10:54, <raistlin@linux.it> wrote:
> > > 
> > > On Tue, 2017-11-07 at 14:24 +0000, Igor Druzhinin wrote:
> > > > Perhaps I should improve my diagram:
> > > > 
> > > > pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle
> > > > context
> > > > -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
> > > > vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at
> > > > this
> > > > point on pCPU1)
> > > > 
> > > > pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB
> > > > flush
> > > > from context switch to clean TLB on pCPU1
> > > > 
> > > 
> > > Sorry, there must be something I'm missing (or misunderstanding).
> > > 
> > > What is this code that checks is_running and triggers the TLB
> > > flush?
> > 
> > I don't see where Igor said is_running is being checked around a
> > TLB flush. The TLB flush itself is what happens first thing in
> > context_switch() (and it's really using the TLB flush interface to
> > mainly effect the state flush, with the TLB flush being an implied
> > side effect; I've already got a series of further patches to make
> > this less implicit).
> > 
> > > But, more important, how come you are context switching to
> > > something
> > > that has is_running == 1 ? That should not be possible.
> > 
> > That's not what Igor's diagram says - it's indicating the fact that
> > is_running is being set to 1 in the process of context switching
> > into vCPUx.
> 
> Jan, Dario,
> 
Hi,

> Igor was referring to the following situation:
> 
> 
> pCPU1                                   pCPU2
> =====                                   =====
> current == vCPU1
> context_switch(next == idle)
> !! __context_switch() is skipped
> vcpu_migrate(vCPU1)
> RCU callbacks
> vmx_vcpu_destroy()
> vmx_vcpu_disable_pml()
> current_vmcs = 0
> 
>                                         schedule(next == vCPU1)
>                                         vCPU1->is_running = 1;
>                                         context_switch(next == vCPU1)
>                                         flush_tlb_mask(&dirty_mask);
>                                     
>                                 <--- IPI
> 
> __sync_local_execstate()
> __context_switch(prev == vCPU1)
> vmx_ctxt_switch_from(vCPU1)
> vCPU1->is_running == 1
> !! vmx_vmcs_reload() is skipped
> 
> I hope that this better illustrates the root cause.
> 
Yes, I think this is clearer, and easier to understand. But that's
probably a mater of habit and personal taste, so sorry again for
misunderstanding it in its other form.

Anyway, as I was trying to explain replaying to Jan, although in this
situation the issue manifests as a consequence of vCPU migration, I
think it is indeed more general, as in, without even the need to
consider a second pCPU:

pCPU1
=====
current == vCPU1
context_switch(next == idle)
!! __context_switch() is skipped
vcpu_migrate(vCPU1)
anything_that_uses_or_touches_context()

So, it must be anything_that_uses_or_touches_context() --knowing it
will be reading or touching the pCPU context-- that syncs the state,
before using or touching it (which is, e.g., what Jan's patch does).

The only other solution I see, is to get rid of lazy context switch.

Regards,
Dario

Jan Beulich Nov. 9, 2017, 1:08 p.m. UTC | #7

>>> On 09.11.17 at 12:01, <raistlin@linux.it> wrote:
> Anyway, as I was trying to explain replaying to Jan, although in this
> situation the issue manifests as a consequence of vCPU migration, I
> think it is indeed more general, as in, without even the need to
> consider a second pCPU:
> 
> pCPU1
> =====
> current == vCPU1
> context_switch(next == idle)
> !! __context_switch() is skipped
> vcpu_migrate(vCPU1)
> anything_that_uses_or_touches_context()
> 
> So, it must be anything_that_uses_or_touches_context() --knowing it
> will be reading or touching the pCPU context-- that syncs the state,
> before using or touching it (which is, e.g., what Jan's patch does).

Well, generally after the vcpu_migrate(vCPU1) above we expect
nothing to happen at all on the pCPU, until another (non-idle)
vCPU gets scheduled onto it. The problem is with tasklet / softirq
(and hence also RCU) work. Tasklets already take care of this by
calling sync_local_execstate() before calling the handler. But
for softirqs this isn't really an option; I'm surprised to see that
tasklet code does this independently of what kind of tasklet that
is. Softirq tasklets aren't used very often, so I wonder if we have
a latent bug here. Otoh, if this was actually fine, adding a similar
call to rcu_do_batch() (or its caller) would be an option, I think.

Jan

Raistlin Nov. 9, 2017, 2:16 p.m. UTC | #8

On Thu, 2017-11-09 at 06:08 -0700, Jan Beulich wrote:
> > > > On 09.11.17 at 12:01, <raistlin@linux.it> wrote:
> > 
> > pCPU1
> > =====
> > current == vCPU1
> > context_switch(next == idle)
> > !! __context_switch() is skipped
> > vcpu_migrate(vCPU1)
> > anything_that_uses_or_touches_context()
> > 
> > So, it must be anything_that_uses_or_touches_context() --knowing it
> > will be reading or touching the pCPU context-- that syncs the
> > state,
> > before using or touching it (which is, e.g., what Jan's patch
> > does).
> 
> Well, generally after the vcpu_migrate(vCPU1) above we expect
> nothing to happen at all on the pCPU, until another (non-idle)
> vCPU gets scheduled onto it. 
>
Ah, yes, my bad! What if I take vcpu_migrate() out of the above exec-
trace (which is what I wanted to do in my email already)?

pCPU1
=====
current == vCPU1
context_switch(next == idle)
!! __context_switch() is skipped
anything_that_uses_or_touches_context()

Point being, is the underlying and general "issue" here, really bound
to migrations, or is it something intrinsic of lazy context switch? I'm
saying it's the latter.

That being said, sure it makes sense to assume that, if we migrated the
vCPU from pCPU1 to pCPU2, it's highly unlikely that it will resume the
execution on pCPU1, and hence there is no point in leaving its context
there, and we could very well sync. It's a reasonable best-effort
measure, but can we rely on it for correctness? I don't think we can.

And generalizing the idea enough that we could then rely on it for
correctness, tends to be close enough to not doing lazy context switch
at all, I think.

> The problem is with tasklet / softirq
> (and hence also RCU) work. 
>
Yes.

> Tasklets already take care of this by
> calling sync_local_execstate() before calling the handler. But
> for softirqs this isn't really an option; I'm surprised to see that
> tasklet code does this independently of what kind of tasklet that
> is. 
>
Good point. Weird indeed.

> Softirq tasklets aren't used very often, so I wonder if we have
> a latent bug here. Otoh, if this was actually fine, adding a similar
> call to rcu_do_batch() (or its caller) would be an option, I think.
> 
We can have a look.

What about the effect on performance, though?

I mean, assuming that lazy context switch does a good job, with respect
to that, by avoiding synching in enough case where it is actually not
necessary, how do things change if we start to sync at any softirq,
even when the handler would have not required that?

Regards,
Dario

Jan Beulich Nov. 9, 2017, 2:39 p.m. UTC | #9

>>> On 09.11.17 at 15:16, <raistlin@linux.it> wrote:
> Ah, yes, my bad! What if I take vcpu_migrate() out of the above exec-
> trace (which is what I wanted to do in my email already)?
> 
> pCPU1
> =====
> current == vCPU1
> context_switch(next == idle)
> !! __context_switch() is skipped
> anything_that_uses_or_touches_context()
> 
> Point being, is the underlying and general "issue" here, really bound
> to migrations, or is it something intrinsic of lazy context switch? I'm
> saying it's the latter.

The general issue doesn't require vcpu_migrate(), I agree. The
specific VMX issue here does, though.

Thing is - I'm not convinced there's a general issue here in the first
place: Asynchronous code isn't supposed to modify state behind
the back of synchronous code. It just so happens that VMX code is
structured to violate that assumption when PML is in use.

> That being said, sure it makes sense to assume that, if we migrated the
> vCPU from pCPU1 to pCPU2, it's highly unlikely that it will resume the
> execution on pCPU1, and hence there is no point in leaving its context
> there, and we could very well sync. It's a reasonable best-effort
> measure, but can we rely on it for correctness? I don't think we can.

We can't right now, but code (from an abstract pov at least)
could be structured so we could rely on it.

> And generalizing the idea enough that we could then rely on it for
> correctness, tends to be close enough to not doing lazy context switch
> at all, I think.

I don't think so, no - we could still leave state in hardware in
anticipation that no other non-idle vCPU would be scheduled
on the local CPU. That's something that ought to help in
particular pinned vCPU-s.

>> The problem is with tasklet / softirq
>> (and hence also RCU) work. 
>>
> Yes.
> 
>> Tasklets already take care of this by
>> calling sync_local_execstate() before calling the handler. But
>> for softirqs this isn't really an option; I'm surprised to see that
>> tasklet code does this independently of what kind of tasklet that
>> is. 
>>
> Good point. Weird indeed.

I've added an item to my todo list to see whether I can figure
whether this is an okay thing to do.

>> Softirq tasklets aren't used very often, so I wonder if we have
>> a latent bug here. Otoh, if this was actually fine, adding a similar
>> call to rcu_do_batch() (or its caller) would be an option, I think.
>> 
> We can have a look.
> 
> What about the effect on performance, though?
> 
> I mean, assuming that lazy context switch does a good job, with respect
> to that, by avoiding synching in enough case where it is actually not
> necessary, how do things change if we start to sync at any softirq,
> even when the handler would have not required that?

I wasn't suggesting this for softirqs, but only (if at all) for RCU. But
yes, there would a performance hit from this; not sure how small
or large. But as you can see, for tasklets the hit is taken
unconditionally.

Jan

Jan Beulich Nov. 9, 2017, 4:38 p.m. UTC | #10

>>> On 09.11.17 at 15:16, <raistlin@linux.it> wrote:
> On Thu, 2017-11-09 at 06:08 -0700, Jan Beulich wrote:
>> Tasklets already take care of this by
>> calling sync_local_execstate() before calling the handler. But
>> for softirqs this isn't really an option; I'm surprised to see that
>> tasklet code does this independently of what kind of tasklet that
>> is. 
>>
> Good point. Weird indeed.
> 
>> Softirq tasklets aren't used very often, so I wonder if we have
>> a latent bug here. Otoh, if this was actually fine, adding a similar
>> call to rcu_do_batch() (or its caller) would be an option, I think.
>> 
> We can have a look.

I'm sorry for the noise here - I've once again forgotten that
sync_local_execstate() does nothing if a lazy switch hasn't
happened already. Hence leaving the potentially bad
performance effect aside, doing the same for RCU (or more
generally softirqs) would be an option.

Jan

[v2,1/2] VMX: fix VMCS race on context-switch paths

Commit Message

Comments

Patch