x86/vm_event: fix race between vmx_vmexit_handler() and vm_event_resume()

Message ID: 1493277744-9032-1-git-send-email-rcojocaru@bitdefender.com

Commit Message

Razvan Cojocaru April 27, 2017, 7:22 a.m. UTC
The introspection agent can reply to a vm_event faster than
vmx_vmexit_handler() can complete in some cases, where it is then
not safe for vm_event_set_registers() to modify v->arch.user_regs.
This patch ensures that vm_event_resume() code only sets per-VCPU
data to be used for the actual setting of registers only in
hvm_do_resume() (similar to the model used to control setting of CRs
and MSRs).
The patch additionally removes the sync_vcpu_execstate(v) call from
vm_event_resume(), which is no longer necessary, which removes the
associated broadcast TLB flush (read: performance improvement).

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/hvm.c         | 35 +++++++++++++++++++++++++++++++++++
 xen/arch/x86/vm_event.c        | 22 ++--------------------
 xen/common/vm_event.c          | 14 +++++++-------
 xen/include/asm-x86/vm_event.h |  2 ++
 4 files changed, 46 insertions(+), 27 deletions(-)

Comments

Andrew Cooper April 27, 2017, 7:46 a.m. UTC | #1
On 27/04/2017 08:22, Razvan Cojocaru wrote:
> The introspection agent can reply to a vm_event faster than
> vmx_vmexit_handler() can complete in some cases, where it is then
> not safe for vm_event_set_registers() to modify v->arch.user_regs.
> This patch ensures that vm_event_resume() code only sets per-VCPU
> data to be used for the actual setting of registers only in
> hvm_do_resume() (similar to the model used to control setting of CRs
> and MSRs).
> The patch additionally removes the sync_vcpu_execstate(v) call from
> vm_event_resume(), which is no longer necessary, which removes the
> associated broadcast TLB flush (read: performance improvement).
>
> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

FWIW, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (this has
taken an embarrassingly long time to get to the root cause of,
considering the eventual simplicity of the patch) with one trivial
correction I have just spotted below.

CC'ing Julien and Ian in their RM capacity, as this bugfix should be
taken for 4.9.

> diff --git a/xen/common/vm_event.c b/xen/common/vm_event.c
> index 0fe9a53..498749b 100644
> --- a/xen/common/vm_event.c
> +++ b/xen/common/vm_event.c
> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>  {
>      vm_event_response_t rsp;
>  
> +    /*
> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
> +     * EVTCHN_send from the introspection consumer.  Both contexts are

I have just spotted that this should read "runs in either
XEN_DOMCTL_VM_EVENT_OP_*, or EVTCHN_send context from".  I can fix up on
commit, if there are no other issues.

~Andrew

> +     * guaranteed not to be the subject of vm_event responses.
> +     */
> +    ASSERT(d != current->domain);
> +
>      /* Pull all responses off the ring. */
>      while ( vm_event_get_response(d, ved, &rsp) )
>      {
Jan Beulich April 27, 2017, 8:01 a.m. UTC | #2
>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
> The introspection agent can reply to a vm_event faster than
> vmx_vmexit_handler() can complete in some cases, where it is then
> not safe for vm_event_set_registers() to modify v->arch.user_regs.

This needs more explanation: I cannot see the connection between
vmx_vmexit_handler() completing and (un)safety of modification of
v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
gone through __context_switch(), and the former doesn't call that
function directly or indirectly (i.e. I think it is more than just
vmx_vmexit_handler() which needs to be completed by the time
register modification becomes safe to do).

> This patch ensures that vm_event_resume() code only sets per-VCPU
> data to be used for the actual setting of registers only in
> hvm_do_resume() (similar to the model used to control setting of CRs
> and MSRs).

I think the second "only" would better be "later".

> The patch additionally removes the sync_vcpu_execstate(v) call from
> vm_event_resume(), which is no longer necessary, which removes the
> associated broadcast TLB flush (read: performance improvement).

Depending on the better explanation above, it may or may not be
further necessary to also better explain the "no longer necessary"
part here.

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>      return hvm_funcs.get_pending_event(v, info);
>  }
>  
> +static void hvm_vm_event_set_registers(struct vcpu *v)

This could be const, as *v itself isn't being modified, but maybe it's
better to leave it non-const indeed.

> +{
> +    ASSERT(v == current);
> +
> +    if ( v->arch.vm_event->set_gprs )

I think we will want an unlikely() here. We anyway can only hope for
the compiler to always inline this function, such that non-VM-event
setups don't get penalized by the extra call here. Strictly speaking
the function doesn't belong into this file ...

> +    {
> +        struct cpu_user_regs *regs = guest_cpu_user_regs();
> +
> +        regs->rax = v->arch.vm_event->gprs.rax;
> +        regs->rbx = v->arch.vm_event->gprs.rbx;
> +        regs->rcx = v->arch.vm_event->gprs.rcx;
> +        regs->rdx = v->arch.vm_event->gprs.rdx;
> +        regs->rsp = v->arch.vm_event->gprs.rsp;
> +        regs->rbp = v->arch.vm_event->gprs.rbp;
> +        regs->rsi = v->arch.vm_event->gprs.rsi;
> +        regs->rdi = v->arch.vm_event->gprs.rdi;
> +
> +        regs->r8 = v->arch.vm_event->gprs.r8;
> +        regs->r9 = v->arch.vm_event->gprs.r9;
> +        regs->r10 = v->arch.vm_event->gprs.r10;
> +        regs->r11 = v->arch.vm_event->gprs.r11;
> +        regs->r12 = v->arch.vm_event->gprs.r12;
> +        regs->r13 = v->arch.vm_event->gprs.r13;
> +        regs->r14 = v->arch.vm_event->gprs.r14;
> +        regs->r15 = v->arch.vm_event->gprs.r15;
> +
> +        regs->rflags = v->arch.vm_event->gprs.rflags;
> +        regs->rip = v->arch.vm_event->gprs.rip;
> +
> +        v->arch.vm_event->set_gprs = 0;

false (and true/bool elsewhere)

> --- a/xen/common/vm_event.c
> +++ b/xen/common/vm_event.c
> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>  {
>      vm_event_response_t rsp;
>  
> +    /*
> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
> +     * EVTCHN_send from the introspection consumer.  Both contexts are
> +     * guaranteed not to be the subject of vm_event responses.
> +     */
> +    ASSERT(d != current->domain);

What is this meant to guard against? It surely isn't ...

> @@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>          v = d->vcpu[rsp.vcpu_id];
>  
>          /*
> -         * Make sure the vCPU state has been synchronized for the custom
> -         * handlers.
> -         */
> -        if ( atomic_read(&v->vm_event_pause_count) )
> -            sync_vcpu_execstate(v);

... a "replacement" for this, as the state of "current" doesn't reflect
whether register state has been saved (that's this_cpu(curr_vcpu)
instead). Nor would a comparison of domains seem to be the right
thing - a comparison of vcpus ought to suffice (and be less strict,
allowing for something hypothetical like self-introspection).

Jan
Razvan Cojocaru April 27, 2017, 8:11 a.m. UTC | #3
On 04/27/17 11:01, Jan Beulich wrote:
>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>> The introspection agent can reply to a vm_event faster than
>> vmx_vmexit_handler() can complete in some cases, where it is then
>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
> 
> This needs more explanation: I cannot see the connection between
> vmx_vmexit_handler() completing and (un)safety of modification of
> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
> gone through __context_switch(), and the former doesn't call that
> function directly or indirectly (i.e. I think it is more than just
> vmx_vmexit_handler() which needs to be completed by the time
> register modification becomes safe to do).

Indeed, this is exactly the case (the vCPU hasn't yet gone through
__context_switch()). I'll re-word the commit message.

>> This patch ensures that vm_event_resume() code only sets per-VCPU
>> data to be used for the actual setting of registers only in
>> hvm_do_resume() (similar to the model used to control setting of CRs
>> and MSRs).
> 
> I think the second "only" would better be "later".

Yes, it's actually redundant (sorry). I'll change it to "later".

>> The patch additionally removes the sync_vcpu_execstate(v) call from
>> vm_event_resume(), which is no longer necessary, which removes the
>> associated broadcast TLB flush (read: performance improvement).
> 
> Depending on the better explanation above, it may or may not be
> further necessary to also better explain the "no longer necessary"
> part here.
> 
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>      return hvm_funcs.get_pending_event(v, info);
>>  }
>>  
>> +static void hvm_vm_event_set_registers(struct vcpu *v)
> 
> This could be const, as *v itself isn't being modified, but maybe it's
> better to leave it non-const indeed.

I have no problem changing it if there are no objections.

>> +{
>> +    ASSERT(v == current);
>> +
>> +    if ( v->arch.vm_event->set_gprs )
> 
> I think we will want an unlikely() here. We anyway can only hope for
> the compiler to always inline this function, such that non-VM-event
> setups don't get penalized by the extra call here. Strictly speaking
> the function doesn't belong into this file ...

Should I move it to the x86 vm_event code?

>> +    {
>> +        struct cpu_user_regs *regs = guest_cpu_user_regs();
>> +
>> +        regs->rax = v->arch.vm_event->gprs.rax;
>> +        regs->rbx = v->arch.vm_event->gprs.rbx;
>> +        regs->rcx = v->arch.vm_event->gprs.rcx;
>> +        regs->rdx = v->arch.vm_event->gprs.rdx;
>> +        regs->rsp = v->arch.vm_event->gprs.rsp;
>> +        regs->rbp = v->arch.vm_event->gprs.rbp;
>> +        regs->rsi = v->arch.vm_event->gprs.rsi;
>> +        regs->rdi = v->arch.vm_event->gprs.rdi;
>> +
>> +        regs->r8 = v->arch.vm_event->gprs.r8;
>> +        regs->r9 = v->arch.vm_event->gprs.r9;
>> +        regs->r10 = v->arch.vm_event->gprs.r10;
>> +        regs->r11 = v->arch.vm_event->gprs.r11;
>> +        regs->r12 = v->arch.vm_event->gprs.r12;
>> +        regs->r13 = v->arch.vm_event->gprs.r13;
>> +        regs->r14 = v->arch.vm_event->gprs.r14;
>> +        regs->r15 = v->arch.vm_event->gprs.r15;
>> +
>> +        regs->rflags = v->arch.vm_event->gprs.rflags;
>> +        regs->rip = v->arch.vm_event->gprs.rip;
>> +
>> +        v->arch.vm_event->set_gprs = 0;
> 
> false (and true/bool elsewhere)

I'll change it to bool.

>> --- a/xen/common/vm_event.c
>> +++ b/xen/common/vm_event.c
>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>  {
>>      vm_event_response_t rsp;
>>  
>> +    /*
>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>> +     * guaranteed not to be the subject of vm_event responses.
>> +     */
>> +    ASSERT(d != current->domain);
> 
> What is this meant to guard against? It surely isn't ...
> 
>> @@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>          v = d->vcpu[rsp.vcpu_id];
>>  
>>          /*
>> -         * Make sure the vCPU state has been synchronized for the custom
>> -         * handlers.
>> -         */
>> -        if ( atomic_read(&v->vm_event_pause_count) )
>> -            sync_vcpu_execstate(v);
> 
> ... a "replacement" for this, as the state of "current" doesn't reflect
> whether register state has been saved (that's this_cpu(curr_vcpu)
> instead). Nor would a comparison of domains seem to be the right
> thing - a comparison of vcpus ought to suffice (and be less strict,
> allowing for something hypothetical like self-introspection).

This part (the ASSERT and comment) is from Andrew, he can help us here.


Thanks,
Razvan
Jan Beulich April 27, 2017, 8:18 a.m. UTC | #4
>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
> On 04/27/17 11:01, Jan Beulich wrote:
>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>      return hvm_funcs.get_pending_event(v, info);
>>>  }
>>>  
>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>> +{
>>> +    ASSERT(v == current);
>>> +
>>> +    if ( v->arch.vm_event->set_gprs )
>> 
>> I think we will want an unlikely() here. We anyway can only hope for
>> the compiler to always inline this function, such that non-VM-event
>> setups don't get penalized by the extra call here. Strictly speaking
>> the function doesn't belong into this file ...
> 
> Should I move it to the x86 vm_event code?

If you do, then you'll want to have an inline wrapper with the if()
in it, so the actual call at the machine level would be avoided in the
common case.

Jan
Andrew Cooper April 27, 2017, 8:34 a.m. UTC | #5
On 27/04/2017 09:01, Jan Beulich wrote:
>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>> The introspection agent can reply to a vm_event faster than
>> vmx_vmexit_handler() can complete in some cases, where it is then
>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
> This needs more explanation: I cannot see the connection between
> vmx_vmexit_handler() completing and (un)safety of modification of
> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
> gone through __context_switch(), and the former doesn't call that
> function directly or indirectly (i.e. I think it is more than just
> vmx_vmexit_handler() which needs to be completed by the time
> register modification becomes safe to do).

The test scenario here was to step over an int3 breakpoint by setting
rip += 1.  This is a very quick reply and tended to complete before the
vcpu triggering the introspection event had properly paused and been
descheduled.

If the reply occurs before __context_switch() happens,
__context_switch() clobbers the reply by overwriting v->arch.user_regs
from the stack.  If the reply occurs after __context_switch(), but the
pcpu has gone idle and keeps v in context, v->arch.user_regs is not
restored to the stack, and the update is similarly missed.  (There is a
very narrow race condition where both the reply and __context_switch()
are updating v->arch.user_regs, and who knows what will happen, given
our memcpy implementation.)
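
Sketched as interleavings (an editorial illustration only; the memcpy()
line abbreviates __context_switch()'s register save, which copies
CTXT_SWITCH_STACK_BYTES from the stack frame):

    /* Case 1: the reply lands before __context_switch(). */
    v->arch.user_regs.rip += 1;                     /* consumer's reply */
    memcpy(&v->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
                                                    /* reply clobbered  */

    /* Case 2: the pcpu goes idle without __context_switch(); the reply
     * is written into v->arch.user_regs but never copied back onto the
     * stack frame used to re-enter the guest, so it is silently lost. */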

>
>> The patch additionally removes the sync_vcpu_execstate(v) call from
>> vm_event_resume(), which is no longer necessary, which removes the
>> associated broadcast TLB flush (read: performance improvement).
> Depending on the better explanation above, it may or may not be
> further necessary to also better explain the "no longer necessary"
> part here.

As I understand, it was a previous attempt to fix this bug.

There is nothing (now) in vm_event_resume() which does anything other
than new updates into v->arch.vm_event and drop the pause reference. 
All updates are applied in context, in the return-to-guest path, which
is race free.

There is no need for the IPI, or to force the target vcpu out of context
if the pcpu is currently idle.

>
>> --- a/xen/common/vm_event.c
>> +++ b/xen/common/vm_event.c
>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>  {
>>      vm_event_response_t rsp;
>>  
>> +    /*
>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>> +     * guaranteed not to be the subject of vm_event responses.
>> +     */
>> +    ASSERT(d != current->domain);
> What is this meant to guard against?

Documentation, as much as anything else.  It took a long time to
untangle the contexts involved here; I'm not convinced the logic is safe
to run in context, and it certainly doesn't need to.

~Andrew
Razvan Cojocaru April 27, 2017, 8:34 a.m. UTC | #6
On 04/27/17 11:18, Jan Beulich wrote:
>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>  }
>>>>  
>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>> +{
>>>> +    ASSERT(v == current);
>>>> +
>>>> +    if ( v->arch.vm_event->set_gprs )
>>>
>>> I think we will want an unlikely() here. We anyway can only hope for
>>> the compiler to always inline this function, such that non-VM-event
>>> setups don't get penalized by the extra call here. Strictly speaking
>>> the function doesn't belong into this file ...
>>
>> Should I move it to the x86 vm_event code?
> 
> If you do, then you'll want to have an inline wrapper with the if()
> in it, so the actual call at the machine level would be avoided in the
> common case.

Sorry, I'm not sure I understand: if moving it is not required, I'd
prefer to leave it where it is, as we already have
vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
complicate matters there (I'd have to perhaps prepend "hvm_" to it in
which case it wouldn't really belong in vm_event.c either). But if it's
necessary, I'll move it - do you want me to move it?


Thanks,
Razvan
Jan Beulich April 27, 2017, 8:52 a.m. UTC | #7
>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>> The introspection agent can reply to a vm_event faster than
>>> vmx_vmexit_handler() can complete in some cases, where it is then
>>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>> This needs more explanation: I cannot see the connection between
>> vmx_vmexit_handler() completing and (un)safety of modification of
>> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
>> gone through __context_switch(), and the former doesn't call that
>> function directly or indirectly (i.e. I think it is more than just
>> vmx_vmexit_handler() which needs to be completed by the time
>> register modification becomes safe to do).
> 
> The test scenario here was to step over an int3 breakpoint by setting
> rip += 1.  This is a very quick reply and tended to complete before the
> vcpu triggering the introspection event had properly paused and been
> descheduled.
> 
> If the reply occurs before __context_switch() happens,
> __context_switch() clobbers the reply by overwriting v->arch.user_regs
> from the stack.  If the reply occurs after __context_switch(), but the
> pcpu has gone idle and keeps v in context, v->arch.user_regs is not
> restored to the stack, and the update is similarly missed.

This second case is not properly described, I think: v won't be kept in
context once we've gone through __context_switch(). If the pCPU
goes idle, __context_switch() simply will be deferred until another
vCPU gets scheduled onto this pCPU, and be avoided altogether if
it's the original vCPU that gets scheduled back onto this pCPU. But
I do get the point. I think much if not all of the above (suitably
adjusted) needs to go into the commit message.

>  (There is a
> very narrow race condition where both the reply and __context_switch()
> are updating v->arch.user_regs, and who knows what will happen, given
> our memcpy implementation.)

Right, but with the proper serialization of events done now this
isn't a problem anymore either, aiui.

>>> The patch additionally removes the sync_vcpu_execstate(v) call from
>>> vm_event_resume(), which is no longer necessary, which removes the
>>> associated broadcast TLB flush (read: performance improvement).
>> Depending on the better explanation above, it may or may not be
>> further necessary to also better explain the "no longer necessary"
>> part here.
> 
> As I understand, it was a previous attempt to fix this bug.
> 
> There is nothing (now) in vm_event_resume() which does anything other
> than new updates into v->arch.vm_event and drop the pause reference. 
> All updates are applied in context, in the return-to-guest path, which
> is race free.
> 
> There is no need for the IPI, or to force the target vcpu out of context
> if the pcpu is currently idle.

I agree, and (as indicated) the better explanation earlier on is
probably sufficient then.

>>> --- a/xen/common/vm_event.c
>>> +++ b/xen/common/vm_event.c
>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>  {
>>>      vm_event_response_t rsp;
>>>  
>>> +    /*
>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>>> +     * guaranteed not to be the subject of vm_event responses.
>>> +     */
>>> +    ASSERT(d != current->domain);
>> What is this meant to guard against?
> 
> Documentation, as much as anything else.  It took a long time to
> untangle the contexts involved here; I'm not convinced the logic is safe
> to run in context, and it certainly doesn't need to.

Well, as said - I think it is too strict now: You only need the vCPU
not be current afaict, and it really matters here which of the two
"current"-s you actually mean (and it looks to me as if you mean
the other one, guaranteeing register state to no longer be on the
stack).

Jan
Jan Beulich April 27, 2017, 9 a.m. UTC | #8
>>> On 27.04.17 at 10:34, <rcojocaru@bitdefender.com> wrote:
> On 04/27/17 11:18, Jan Beulich wrote:
>>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, 
> struct x86_event *info)
>>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>>  }
>>>>>  
>>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>>> +{
>>>>> +    ASSERT(v == current);
>>>>> +
>>>>> +    if ( v->arch.vm_event->set_gprs )
>>>>
>>>> I think we will want an unlikely() here. We anyway can only hope for
>>>> the compiler to always inline this function, such that non-VM-event
>>>> setups don't get penalized by the extra call here. Strictly speaking
>>>> the function doesn't belong into this file ...
>>>
>>> Should I move it to the x86 vm_event code?
>> 
>> If you do, then you'll want to have an inline wrapper with the if()
>> in it, so the actual call at the machine level would be avoided in the
>> common case.
> 
> Sorry, I'm not sure I understand: if moving it is not required, I'd
> prefer to leave it where it is, as we already have
> vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
> complicate matters there (I'd have to perhaps prepend "hvm_" to it in
> which case it wouldn't really belong in vm_event.c either). But if it's
> necessary, I'll move it - do you want me to move it?

Well, logically this is a VM event function, so belongs into vm_event.c.
So I'd _prefer_ for it to be moved, but I don't insist (in the end, by
leaving it where it is, you and Tamas [wrongly imo] are not the
maintainers of it). But I agree, x86/vm_event.c isn't an ideal place
either, better would be x86/hvm/vm_event.c.

Otoh hvm_do_resume() open codes all the CR and MSR writes
(getting the order wrong as it seems, btw, as some MSR writes
depend on certain CR settings), so maybe a separate function isn't
needed at all here? In fact the bulk of the function is VM event
code, so one might also view it the other way around, and most or
all of this should move into a new function, for example
hvm_vm_event_do_resume().
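
A rough sketch of that shape (the file and function name follow the
suggestion above; the body is abridged and illustrative only, with the
MSR write reordered last, and assumes the register-setting helper moves
along with it):

    /* xen/arch/x86/hvm/vm_event.c */
    void hvm_vm_event_do_resume(struct vcpu *v)
    {
        struct monitor_write_data *w = &v->arch.vm_event->write_data;

        hvm_vm_event_set_registers(v);

        /* ... emulation-request handling elided ... */

        if ( w->do_write.cr0 )
        {
            hvm_set_cr0(w->cr0, 0);
            w->do_write.cr0 = 0;
        }

        /* cr3 and cr4 handled likewise ... */

        if ( w->do_write.msr )  /* last: MSR writes may depend on CRs */
        {
            hvm_msr_write_intercept(w->msr, w->value, 0);
            w->do_write.msr = 0;
        }
    }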

Jan
Razvan Cojocaru April 27, 2017, 9:56 a.m. UTC | #9
On 04/27/17 12:00, Jan Beulich wrote:
>>>> On 27.04.17 at 10:34, <rcojocaru@bitdefender.com> wrote:
>> On 04/27/17 11:18, Jan Beulich wrote:
>>>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>>>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, 
>> struct x86_event *info)
>>>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>>>  }
>>>>>>  
>>>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>>>> +{
>>>>>> +    ASSERT(v == current);
>>>>>> +
>>>>>> +    if ( v->arch.vm_event->set_gprs )
>>>>>
>>>>> I think we will want an unlikely() here. We anyway can only hope for
>>>>> the compiler to always inline this function, such that non-VM-event
>>>>> setups don't get penalized by the extra call here. Strictly speaking
>>>>> the function doesn't belong into this file ...
>>>>
>>>> Should I move it to the x86 vm_event code?
>>>
>>> If you do, then you'll want to have an inline wrapper with the if()
>>> in it, so the actual call at the machine level would be avoided in the
>>> common case.
>>
>> Sorry, I'm not sure I understand: if moving it is not required, I'd
>> prefer to leave it where it is, as we already have
>> vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
>> complicate matters there (I'd have to perhaps prepend "hvm_" to it in
>> which case it wouldn't really belong in vm_event.c either). But if it's
>> necessary, I'll move it - do you want me to move it?
> 
> Well, logically this is a VM event function, so belongs into vm_event.c.
> So I'd _prefer_ for it to be moved, but I don't insist (in the end, by
> leaving it where it is, you and Tamas [wrongly imo] are not the
> maintainers of it). But I agree, x86/vm_event.c isn't an ideal place
> either, better would be x86/hvm/vm_event.c.
> 
> Otoh hvm_do_resume() open codes all the CR and MSR writes
> (getting the order wrong as it seems, btw, as some MSR writes
> depend on certain CR settings), so maybe a separate function isn't
> needed at all here? In fact the bulk of the function is VM event
> code, so one might also view it the other way around, and most or
> all of this should move into a new function, for example
> hvm_vm_event_do_resume().

All fair points. If there are no objections, I'll add x86/hvm/vm_event.c
and asm-x86/hvm/vm_event.h, and place hvm_vm_event_do_resume() in it. I'll
also add the files to MAINTAINERS under vm_event.

In the process of moving the code, I'll also put the MSR write last.

This will then become a 2-patch series, with the 1st patch doing the
above, and the second patch will be this one.


Thanks,
Razvan
Andrew Cooper April 27, 2017, 5:31 p.m. UTC | #10
On 27/04/17 09:52, Jan Beulich wrote:
>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>> The introspection agent can reply to a vm_event faster than
>>>> vmx_vmexit_handler() can complete in some cases, where it is then
>>>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>>> This needs more explanation: I cannot see the connection between
>>> vmx_vmexit_handler() completing and (un)safety of modification of
>>> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
>>> gone through __context_switch(), and the former doesn't call that
>>> function directly or indirectly (i.e. I think it is more than just
>>> vmx_vmexit_handler() which needs to be completed by the time
>>> register modification becomes safe to do).
>> The test scenario here was to step over an int3 breakpoint by setting
>> rip += 1.  This is a very quick reply and tended to complete before the
>> vcpu triggering the introspection event had properly paused and been
>> descheduled.
>>
>> If the reply occurs before __context_switch() happens,
>> __context_switch() clobbers the reply by overwriting v->arch.user_regs
>> from the stack.  If the reply occurs after __context_switch(), but the
>> pcpu has gone idle and keeps v in context, v->arch.user_regs is not
>> restored to the stack, and the update is similarly missed.
> This second case is not properly described, I think: v won't be kept in
> context once we've gone through __context_switch(). If the pCPU
> goes idle, __context_switch() simply will be deferred until another
> vCPU gets scheduled onto this pCPU, and be avoided altogether if
> it's the original vCPU that gets scheduled back onto this pCPU. But
> I do get the point. I think much if not all of the above (suitably
> adjusted) needs to go into the commit message.

Yes.  I did make a slight error.  The latter case is that we don't pass
through __context_switch() when transitioning to idle.

>
>>  (There is a
>> very narrow race condition where both the reply and __context_switch()
>> are updating v->arch.user_regs, and who knows what will happen, given
>> our memcpy implementation.)
> Right, but with the proper serialization of events done now this
> isn't a problem anymore either, aiui.

Correct.

>
>>>> The patch additionally removes the sync_vcpu_execstate(v) call from
>>>> vm_event_resume(), which is no longer necessary, which removes the
>>>> associated broadcast TLB flush (read: performance improvement).
>>> Depending on the better explanation above, it may or may not be
>>> further necessary to also better explain the "no longer necessary"
>>> part here.
>> As I understand, it was a previous attempt to fix this bug.
>>
>> There is nothing (now) in vm_event_resume() which does anything other
>> than new updates into v->arch.vm_event and drop the pause reference. 
>> All updates are applied in context, in the return-to-guest path, which 
>> is race free.
>>
>> There is no need for the IPI, or to force the target vcpu out of context
>> if the pcpu is currently idle.
> I agree, and (as indicated) the better explanation earlier on is
> probably sufficient then.
>
>>>> --- a/xen/common/vm_event.c
>>>> +++ b/xen/common/vm_event.c
>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>  {
>>>>      vm_event_response_t rsp;
>>>>  
>>>> +    /*
>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>> +     */
>>>> +    ASSERT(d != current->domain);
>>> What is this meant to guard against?
>> Documentation, as much as anything else.  It took a long time to
>> untangle the contexts involved here; I'm not convinced the logic is safe
>> to run in context, and it certainly doesn't need to.
> Well, as said - I think it is too strict now: You only need the vCPU
> not be current afaict, and it really matters here which of the two
> "current"-s you actually mean (and it looks to me as if you mean
> the other one, guaranteeing register state to no longer be on the
> stack).

We don't know the vcpu at this point; that information comes out of the
replies on the ring.

We also may process multiple replies in this one loop.  In the worst
case, we process one reply for every vcpu in d.

If the ASSERT() were to be deferred until the middle of the request
loop, we could ASSERT(v != current) for every vcpu in d, but that is
identical to this single assertion.
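
I.e. (a sketch of the equivalent per-response form; the vcpu_id bounds
check done earlier in the real loop is elided):

    /* Pull all responses off the ring. */
    while ( vm_event_get_response(d, ved, &rsp) )
    {
        struct vcpu *v = d->vcpu[rsp.vcpu_id];

        ASSERT(v != current);  /* once per reply, for each vCPU named */
        /* ... */
    }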

~Andrew
Jan Beulich April 28, 2017, 6:25 a.m. UTC | #11
>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
> On 27/04/17 09:52, Jan Beulich wrote:
>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>> --- a/xen/common/vm_event.c
>>>>> +++ b/xen/common/vm_event.c
>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>>  {
>>>>>      vm_event_response_t rsp;
>>>>>  
>>>>> +    /*
>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>> +     */
>>>>> +    ASSERT(d != current->domain);
>>>> What is this meant to guard against?
>>> Documentation, as much as anything else.  It took a long time to
>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>> to run in context, and it certainly doesn't need to.
>> Well, as said - I think it is too strict now: You only need the vCPU
>> not be current afaict, and it really matters here which of the two
>> "current"-s you actually mean (and it looks to me as if you mean
>> the other one, guaranteeing register state to no longer be on the
>> stack).
> 
> We don't know the vcpu at this point; that information comes out of the
> replies on the ring.
> 
> We also may process multiple replies in this one loop.  In the worst
> case, we process one reply for every vcpu in d.
> 
> If the ASSERT() were to be deferred until the middle of the request
> loop, we could ASSERT(v != current) for every vcpu in d, but that is
> identical to this single assertion.

No, it isn't. It would only be if you iterated over all vCPU-s of that
domain (which as you say may or may not happen).

Jan
Razvan Cojocaru April 28, 2017, 6:45 a.m. UTC | #12
On 04/28/2017 09:25 AM, Jan Beulich wrote:
>>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
>> On 27/04/17 09:52, Jan Beulich wrote:
>>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>> --- a/xen/common/vm_event.c
>>>>>> +++ b/xen/common/vm_event.c
>>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>>>  {
>>>>>>      vm_event_response_t rsp;
>>>>>>  
>>>>>> +    /*
>>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>>> +     */
>>>>>> +    ASSERT(d != current->domain);
>>>>> What is this meant to guard against?
>>>> Documentation, as much as anything else.  It took a long time to
>>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>>> to run in context, and it certainly doesn't need to.
>>> Well, as said - I think it is too strict now: You only need the vCPU
>>> not be current afaict, and it really matters here which of the two
>>> "current"-s you actually mean (and it looks to me as if you mean
>>> the other one, guaranteeing register state to no longer be on the
>>> stack).
>>
>> We don't know the vcpu at this point; that information comes out of the
>> replies on the ring.
>>
>> We also may process multiple replies in this one loop.  In the worst
>> case, we process one reply for every vcpu in d.
>>
>> If the ASSERT() were to be deferred until the middle of the request
>> loop, we could ASSERT(v != current) for every vcpu in d, but that is
>> identical to this single assertion.
> 
> No, it isn't. It would only be if you iterated over all vCPU-s of that
> domain (which as you say may or may not happen).

I will modify the code to whatever you and Andrew decide is best, but
just in case this helps decide, in my experience iterating over all
VCPUs here will happen more often than not - even looking at
xen-access.c today, it poll()s with a timeout, processes all collected
events (which, if they are sync events - and they always are with our
application - there can be several of them only if they correspond to
different VCPUs), and only then signals the event channel.

But there are of course cases in which less than all the VCPUs have
placed events in the ring buffer.


Thanks,
Razvan
Jan Beulich April 28, 2017, 7:35 a.m. UTC | #13
>>> On 28.04.17 at 08:45, <rcojocaru@bitdefender.com> wrote:
> On 04/28/2017 09:25 AM, Jan Beulich wrote:
>>>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
>>> On 27/04/17 09:52, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>>> --- a/xen/common/vm_event.c
>>>>>>> +++ b/xen/common/vm_event.c
>>>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct 
> vm_event_domain *ved)
>>>>>>>  {
>>>>>>>      vm_event_response_t rsp;
>>>>>>>  
>>>>>>> +    /*
>>>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>>>> +     * EVTCHN_send from the introspection consumer.  Both contexts are
>>>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>>>> +     */
>>>>>>> +    ASSERT(d != current->domain);
>>>>>> What is this meant to guard against?
>>>>> Documentation, as much as anything else.  It took a long time to
>>>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>>>> to run in context, and it certainly doesn't need to.
>>>> Well, as said - I think it is too strict now: You only need the vCPU
>>>> not be current afaict, and it really matters here which of the two
>>>> "current"-s you actually mean (and it looks to me as if you mean
>>>> the other one, guaranteeing register state to no longer be on the
>>>> stack).
>>>
>>> We don't know the vcpu at this point; that information comes out of the
>>> replies on the ring.
>>>
>>> We also may process multiple replies in this one loop.  In the worse
>>> case, we process one reply for every vcpu in d.
>>>
>>> If the ASSERT() were to be deferred until the middle of the request
>>> loop, we could ASSERT(v != current) for every vcpu in d, but that is
>>> identical to this single assertion.
>> 
>> No, it isn't. It would only be if you iterated over all vCPU-s of that
>> domain (which as you say may or may not happen).
> 
> I will modify the code to whatever you and Andrew decide is best, but
> just in case this helps decide, in my experience iterating over all
> VCPUs here will happen more often than not - even looking at
> xen-access.c today, it poll()s with a timeout, processes all collected
> events (which, if they are sync events - and they always are with our
> application - there can be several of them only if they correspond to
> different VCPUs), and only then signals the event channel.
> 
> But there are of course cases in which less than all the VCPUs have
> placed events in the ring buffer.

So just to clarify - I'm not entirely opposed to the ASSERT() staying
as is, but then the comment next to it should clarify that it is slightly
more strict than absolutely necessary.
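
For instance (one possible wording only):

    /*
     * vm_event_resume() runs in either XEN_DOMCTL_VM_EVENT_OP_*, or
     * EVTCHN_send context from the introspection consumer.  Both contexts
     * are guaranteed not to be the subject of vm_event responses.  This
     * is stricter than strictly needed: only "v != current" matters for
     * each vCPU named in a response.
     */
    ASSERT(d != current->domain);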

Jan

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a441955..520942a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
     return hvm_funcs.get_pending_event(v, info);
 }
 
+static void hvm_vm_event_set_registers(struct vcpu *v)
+{
+    ASSERT(v == current);
+
+    if ( v->arch.vm_event->set_gprs )
+    {
+        struct cpu_user_regs *regs = guest_cpu_user_regs();
+
+        regs->rax = v->arch.vm_event->gprs.rax;
+        regs->rbx = v->arch.vm_event->gprs.rbx;
+        regs->rcx = v->arch.vm_event->gprs.rcx;
+        regs->rdx = v->arch.vm_event->gprs.rdx;
+        regs->rsp = v->arch.vm_event->gprs.rsp;
+        regs->rbp = v->arch.vm_event->gprs.rbp;
+        regs->rsi = v->arch.vm_event->gprs.rsi;
+        regs->rdi = v->arch.vm_event->gprs.rdi;
+
+        regs->r8 = v->arch.vm_event->gprs.r8;
+        regs->r9 = v->arch.vm_event->gprs.r9;
+        regs->r10 = v->arch.vm_event->gprs.r10;
+        regs->r11 = v->arch.vm_event->gprs.r11;
+        regs->r12 = v->arch.vm_event->gprs.r12;
+        regs->r13 = v->arch.vm_event->gprs.r13;
+        regs->r14 = v->arch.vm_event->gprs.r14;
+        regs->r15 = v->arch.vm_event->gprs.r15;
+
+        regs->rflags = v->arch.vm_event->gprs.rflags;
+        regs->rip = v->arch.vm_event->gprs.rip;
+
+        v->arch.vm_event->set_gprs = 0;
+    }
+}
+
 void hvm_do_resume(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -487,6 +520,8 @@ void hvm_do_resume(struct vcpu *v)
     {
         struct monitor_write_data *w = &v->arch.vm_event->write_data;
 
+        hvm_vm_event_set_registers(v);
+
         if ( unlikely(v->arch.vm_event->emulate_flags) )
         {
             enum emul_kind kind = EMUL_KIND_NORMAL;
diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
index 502ccc2..f66780a 100644
--- a/xen/arch/x86/vm_event.c
+++ b/xen/arch/x86/vm_event.c
@@ -113,26 +113,8 @@ void vm_event_set_registers(struct vcpu *v, vm_event_response_t *rsp)
 {
     ASSERT(atomic_read(&v->vm_event_pause_count));
 
-    v->arch.user_regs.rax = rsp->data.regs.x86.rax;
-    v->arch.user_regs.rbx = rsp->data.regs.x86.rbx;
-    v->arch.user_regs.rcx = rsp->data.regs.x86.rcx;
-    v->arch.user_regs.rdx = rsp->data.regs.x86.rdx;
-    v->arch.user_regs.rsp = rsp->data.regs.x86.rsp;
-    v->arch.user_regs.rbp = rsp->data.regs.x86.rbp;
-    v->arch.user_regs.rsi = rsp->data.regs.x86.rsi;
-    v->arch.user_regs.rdi = rsp->data.regs.x86.rdi;
-
-    v->arch.user_regs.r8 = rsp->data.regs.x86.r8;
-    v->arch.user_regs.r9 = rsp->data.regs.x86.r9;
-    v->arch.user_regs.r10 = rsp->data.regs.x86.r10;
-    v->arch.user_regs.r11 = rsp->data.regs.x86.r11;
-    v->arch.user_regs.r12 = rsp->data.regs.x86.r12;
-    v->arch.user_regs.r13 = rsp->data.regs.x86.r13;
-    v->arch.user_regs.r14 = rsp->data.regs.x86.r14;
-    v->arch.user_regs.r15 = rsp->data.regs.x86.r15;
-
-    v->arch.user_regs.rflags = rsp->data.regs.x86.rflags;
-    v->arch.user_regs.rip = rsp->data.regs.x86.rip;
+    v->arch.vm_event->gprs = rsp->data.regs.x86;
+    v->arch.vm_event->set_gprs = 1;
 }
 
 void vm_event_monitor_next_interrupt(struct vcpu *v)
diff --git a/xen/common/vm_event.c b/xen/common/vm_event.c
index 0fe9a53..498749b 100644
--- a/xen/common/vm_event.c
+++ b/xen/common/vm_event.c
@@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
 {
     vm_event_response_t rsp;
 
+    /*
+     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
+     * EVTCHN_send from the introspection consumer.  Both contexts are
+     * guaranteed not to be the subject of vm_event responses.
+     */
+    ASSERT(d != current->domain);
+
     /* Pull all responses off the ring. */
     while ( vm_event_get_response(d, ved, &rsp) )
     {
@@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
         v = d->vcpu[rsp.vcpu_id];
 
         /*
-         * Make sure the vCPU state has been synchronized for the custom
-         * handlers.
-         */
-        if ( atomic_read(&v->vm_event_pause_count) )
-            sync_vcpu_execstate(v);
-
-        /*
          * In some cases the response type needs extra handling, so here
          * we call the appropriate handlers.
          */
diff --git a/xen/include/asm-x86/vm_event.h b/xen/include/asm-x86/vm_event.h
index ca73f99..f703170 100644
--- a/xen/include/asm-x86/vm_event.h
+++ b/xen/include/asm-x86/vm_event.h
@@ -32,6 +32,8 @@ struct arch_vm_event {
         struct vm_event_emul_insn_data insn;
     } emul;
     struct monitor_write_data write_data;
+    struct vm_event_regs_x86 gprs;
+    bool_t set_gprs;
 };
 
 int vm_event_init_domain(struct domain *d);