
xc_hvm_inject_trap() races

Message ID 6430868f-6eb8-5c07-3a79-6f94101ca600@bitdefender.com (mailing list archive)
State New, archived

Commit Message

Razvan Cojocaru Nov. 7, 2016, 2:34 p.m. UTC
On 11/07/2016 03:53 PM, Jan Beulich wrote:
>>>> On 07.11.16 at 14:44, <rcojocaru@bitdefender.com> wrote:
>> On 11/07/2016 03:20 PM, Jan Beulich wrote:
>>>>>> On 07.11.16 at 13:27, <rcojocaru@bitdefender.com> wrote:
>>>> On 11/07/2016 02:23 PM, Jan Beulich wrote:
>>>>>> At this point it looks like the best solution is to simply discard the
>>>>>>> non-architectural event if there's a pending architectural one, and add
>>>>>>> a new vm_event, as suggested by Tamas, that notifies the application
>>>>>>> that a trap has failed, or succeeded, and let it do the best it can with
>>>>>>> that information.
>>>>> Well, if the two of you think this is something which fits into the VM
>>>>> event model, then I guess that's an option. I, for one, am not
>>>>> convinced: It simply seems too special purpose. If this was a more
>>>>> generic event ("interruption delivered", perhaps with a type or
>>>>> vector qualifier) that can be subscribed to, and the app did that
>>>>> only for such transient periods of time, this would look more
>>>>> reasonable to me.
>>>>
>>>> Indeed, that's the way I think of it: the user is free to subscribe to
>>>> the event or not, and the event is:
>>>>
>>>> struct vm_event_injection_result {
>>>>     uint32_t vector;
>>>>     uint32_t type;
>>>>     uint32_t error_code;
>>>>     uint64_t cr2;
>>>>     uint32_t injected;
>>>> };
>>>>
>>>> with injected 0 for failure and 1 for success. It's as generic as possible.
>>>
>>> Wait, no, that's not what I described. I'm not talking about the
>>> "result" of an injection (through a hypercall), but about the delivery
>>> of events (of any origin). Hence there can't be any "failure" here.
>>> IOW, what I'm proposing is a "VM is about to see this interruption"
>>> event.
>>
>> But a success-only event is hard to interpret with regard to failure,
>> which is what we're really interested in (specifically, failure to
>> inject via hypercall). Do we count it as a failure if we don't receive a
>> "VM is about to see this interruption" event immediately after the
>> vm_event that caused the injection, on the same processor, with the same
>> trap vector? That's a tricky commitment to make.
> 
> If you subscribed to all interruptions, you'd simply see one next
> which is different from the one you've tried to inject.
> 
>> That would also decrease performance, likely noticeably, for a
>> vm_event-based application, as we'd get many such events (most of which
>> we're not interested in at all) which would require treatment while the
>> respective VCPU is paused.
> 
> You should limit the periods of time when you enable the
> subscription (e.g. only from when you inject an event until
> you did see it arrive). Perhaps such a subscription should then
> be per-vCPU ...

I think there might be a design argument against this model: a generic
event implies that there are (or might be) users interested in following
interrupt delivery events long-term. That, however, is clearly
impractical, since doing so would effectively render the guest unusable.

So the realistic use case is, as you say, to enable it sparingly, for
just a few events - but then the event is never meant to convey more
than whether, for example, an outside injection failed, which a single,
dedicated event accomplishes more readily.

My proposal was simply something along the lines of:



Thanks,
Razvan

Comments

Jan Beulich Nov. 7, 2016, 2:59 p.m. UTC | #1
>>> On 07.11.16 at 15:34, <rcojocaru@bitdefender.com> wrote:
> My proposal was simply something along the lines of:
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 704fd64..58f5ae4 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -535,8 +535,22 @@ void hvm_do_resume(struct vcpu *v)
>      /* Inject pending hw/sw trap */
>      if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
>      {
> -        hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
> +        unsigned int success = 0;
> +
> +        /* Check for already pending interrupts (races). */
> +        if ( !hvm_event_pending(v) )
> +        {
> +            hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
> +            success = 1;
> +        }
> +
> +        hvm_monitor_injection_result(v->arch.hvm_vcpu.inject_trap.vector,
> +                                     v->arch.hvm_vcpu.inject_trap.type,
> +                                     v->arch.hvm_vcpu.inject_trap.error_code,
> +                                     v->arch.hvm_vcpu.inject_trap.cr2,
> +                                     success);
> +
>          v->arch.hvm_vcpu.inject_trap.vector = -1;
>      }
>  }
> 

But you realize that injection isn't really VM event related; it's an
independent interface. Hence my "too special-cased" complaint.
What if the event I proposed were a one-shot one, which you enable
right before (or after) doing your event injection (it could also be
an injection flag)? You'd then see whether the next event the vCPU
gets is the one you tried to inject.

Jan
Razvan Cojocaru Nov. 7, 2016, 3:24 p.m. UTC | #2
On 11/07/2016 04:59 PM, Jan Beulich wrote:
>>>> On 07.11.16 at 15:34, <rcojocaru@bitdefender.com> wrote:
>> My proposal was simply something along the lines of:
>>
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 704fd64..58f5ae4 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -535,8 +535,22 @@ void hvm_do_resume(struct vcpu *v)
>>      /* Inject pending hw/sw trap */
>>      if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
>>      {
>> -        hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
>> +        unsigned int success = 0;
>> +
>> +        /* Check for already pending interrupts (races). */
>> +        if ( !hvm_event_pending(v) )
>> +        {
>> +            hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
>> +            success = 1;
>> +        }
>> +
>> +        hvm_monitor_injection_result(v->arch.hvm_vcpu.inject_trap.vector,
>> +                                     v->arch.hvm_vcpu.inject_trap.type,
>> +                                     v->arch.hvm_vcpu.inject_trap.error_code,
>> +                                     v->arch.hvm_vcpu.inject_trap.cr2,
>> +                                     success);
>> +
>>          v->arch.hvm_vcpu.inject_trap.vector = -1;
>>      }
>>  }
>>
> 
> But you realize that injection isn't really VM event related; it's an
> independent interface. Hence my "too special-cased" complaint.
> What if the event I proposed were a one-shot one, which you enable
> right before (or after) doing your event injection (it could also be
> an injection flag)? You'd then see whether the next event the vCPU
> gets is the one you tried to inject.

Not only do I realize that - the irony is that this was my initial
reply to Tamas' suggestion. :) Unfortunately, after looking carefully at
all the alternatives, this one has turned out to be the simplest and
most effective: since it's not possible to know whether the injection
will succeed by the time xc_hvm_inject_trap() returns, the only way to
find out is for the application to be notified asynchronously once the
hypervisor knows, and the simplest way to do that is via a vm_event.

The one-shot vm_event does sound reasonable. I could set a flag
per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
fire a single event from hvm_inject_trap() if it's set (then unset it) -
the flag would be set via an xc_monitor_next_interrupt() call in libxc.
If nobody objects, I'll test that and see how it goes.


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 4:10 p.m. UTC | #3
>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
> The one-shot vm_event does sound reasonable. I could set a flag
> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
> fire a single event from hvm_inject_trap() if it's set (then unset it) -
> the flag would be set via an xc_monitor_next_interrupt() call in libxc.

Doing this in hvm_inject_trap() would not cover all cases afaict.
I'd suggest doing this from hvm_do_resume() _after_ the
(conditional) call to hvm_inject_trap(), if there is _any_ event
pending.

Jan
Razvan Cojocaru Nov. 7, 2016, 5:01 p.m. UTC | #4
On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>> The one-shot vm_event does sound reasonable. I could set a flag
>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
> 
> Doing this in hvm_inject_trap() would not cover all cases afaict.
> I'd suggest doing this from hvm_do_resume() _after_ the
> (conditional) call to hvm_inject_trap(), if there is _any_ event
> pending.

But that would only cover the hypercall-injected traps. The condition in
hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
and inject_trap.vector seems to only ever be set by the hypercall:

    case HVMOP_inject_trap:
    {
        xen_hvm_inject_trap_t tr;
        struct domain *d;
        struct vcpu *v;

        if ( copy_from_guest(&tr, arg, 1 ) )
            return -EFAULT;

        rc = rcu_lock_remote_domain_by_id(tr.domid, &d);
        if ( rc != 0 )
            return rc;

        rc = -EINVAL;
        if ( !is_hvm_domain(d) )
            goto injtrap_fail;

        rc = xsm_hvm_control(XSM_DM_PRIV, d, op);
        if ( rc )
            goto injtrap_fail;

        rc = -ENOENT;
        if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
            goto injtrap_fail;

        if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
            rc = -EBUSY;
        else
        {
            v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
            v->arch.hvm_vcpu.inject_trap.type = tr.type;
            v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
            v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
            v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
            rc = 0;
        }

    injtrap_fail:
        rcu_unlock_domain(d);
        break;
    }

So if the next interrupt is not caused by the hypercall, we'll never get
another event. Am I reading the code wrong?


Thanks,
Razvan
Jan Beulich Nov. 8, 2016, 8:15 a.m. UTC | #5
>>> On 07.11.16 at 18:01, <rcojocaru@bitdefender.com> wrote:
> On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>>> The one-shot vm_event does sound reasonable. I could set a flag
>>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
>> 
>> Doing this in hvm_inject_trap() would not cover all cases afaict.
>> I'd suggest doing this from hvm_do_resume() _after_ the
>> (conditional) call to hvm_inject_trap(), if there is _any_ event
>> pending.
> 
> But that would only cover the hypercall-injected traps. The condition in
> hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
> and inject_trap.vector seems to only ever be set by the hypercall:
>[...]
> So if the next interrupt is not caused by the hypercall, we'll never get
> another event. Am I reading the code wrong?

No, maybe I expressed myself ambiguously: I meant to say that the
event should be delivered from hvm_do_resume(), but _outside_ the
conditional guarding the call to hvm_inject_trap(). Otherwise things
would have been worse than when doing it inside hvm_inject_trap().

Jan
Razvan Cojocaru Nov. 8, 2016, 8:22 a.m. UTC | #6
On 11/08/2016 10:15 AM, Jan Beulich wrote:
>>>> On 07.11.16 at 18:01, <rcojocaru@bitdefender.com> wrote:
>> On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>>>> The one-shot vm_event does sound reasonable. I could set a flag
>>>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>>>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>>>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
>>>
>>> Doing this in hvm_inject_trap() would not cover all cases afaict.
>>> I'd suggest doing this from hvm_do_resume() _after_ the
>>> (conditional) call to hvm_inject_trap(), if there is _any_ event
>>> pending.
>>
>> But that would only cover the hypercall-injected traps. The condition in
>> hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
>> and inject_trap.vector seems to only ever be set by the hypercall:
>> [...]
>> So if the next interrupt is not caused by the hypercall, we'll never get
>> another event. Am I reading the code wrong?
> 
> No, maybe I expressed myself ambiguously: I meant to say that the
> event should be delivered from hvm_do_resume(), but _outside_ the
> conditional guarding the call to hvm_inject_trap(). Otherwise things
> would have been worse than when doing it inside hvm_inject_trap().

Right, definitely, I'll do that.


Thanks,
Razvan
Razvan Cojocaru Nov. 8, 2016, 9:19 a.m. UTC | #7
On 11/08/2016 10:15 AM, Jan Beulich wrote:
>>>> On 07.11.16 at 18:01, <rcojocaru@bitdefender.com> wrote:
>> On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>>>> The one-shot vm_event does sound reasonable. I could set a flag
>>>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>>>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>>>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
>>>
>>> Doing this in hvm_inject_trap() would not cover all cases afaict.
>>> I'd suggest doing this from hvm_do_resume() _after_ the
>>> (conditional) call to hvm_inject_trap(), if there is _any_ event
>>> pending.
>>
>> But that would only cover the hypercall-injected traps. The condition in
>> hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
>> and inject_trap.vector seems to only ever be set by the hypercall:
>> [...]
>> So if the next interrupt is not caused by the hypercall, we'll never get
>> another event. Am I reading the code wrong?
> 
> No, maybe I expressed myself ambiguously: I meant to say that the
> event should be delivered from hvm_do_resume(), but _outside_ the
> conditional guarding the call to hvm_inject_trap(). Otherwise things
> would have been worse than when doing it inside hvm_inject_trap().

While working on this patch, I've had a new idea which would require
fewer changes and, if validated, fix the problem more elegantly.
Looking at vmx_idtv_reinject(), the real problem seems to be that
VM_ENTRY_INTR_INFO is written to directly:

static void vmx_idtv_reinject(unsigned long idtv_info)
{
    /* Event delivery caused this intercept? Queue for redelivery. */
    if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) )
    {
        if ( hvm_event_needs_reinjection(MASK_EXTR(idtv_info,
                                                   INTR_INFO_INTR_TYPE_MASK),
                                         idtv_info & INTR_INFO_VECTOR_MASK) )
        {
            /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
            __vmwrite(VM_ENTRY_INTR_INFO,
                      idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
            if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
            {
                unsigned long ec;

                __vmread(IDT_VECTORING_ERROR_CODE, &ec);
                __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, ec);
            }
        }

        /*
         * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
         * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
         */
        if ( cpu_has_vmx_vnmi &&
             ((idtv_info & INTR_INFO_INTR_TYPE_MASK) ==
              MASK_INSR(X86_EVENTTYPE_NMI, INTR_INFO_INTR_TYPE_MASK)) )
        {
            unsigned long intr_info;

            __vmread(GUEST_INTERRUPTIBILITY_INFO, &intr_info);
            __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
                      intr_info & ~VMX_INTR_SHADOW_NMI);
        }
    }
}

whereas the hypercall looks only at v->arch.hvm_vcpu.inject_trap.vector.
Then we notice that the hypercall _fails immediately_ with -EBUSY if
v->arch.hvm_vcpu.inject_trap.vector is already set:

    case HVMOP_inject_trap:
    {
        xen_hvm_inject_trap_t tr;
        struct domain *d;
        struct vcpu *v;

        if ( copy_from_guest(&tr, arg, 1 ) )
            return -EFAULT;

        rc = rcu_lock_remote_domain_by_id(tr.domid, &d);
        if ( rc != 0 )
            return rc;

        rc = -EINVAL;
        if ( !is_hvm_domain(d) )
            goto injtrap_fail;

        rc = xsm_hvm_control(XSM_DM_PRIV, d, op);
        if ( rc )
            goto injtrap_fail;

        rc = -ENOENT;
        if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
            goto injtrap_fail;

        if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
            rc = -EBUSY;
        else
        {
            v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
            v->arch.hvm_vcpu.inject_trap.type = tr.type;
            v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
            v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
            v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
            rc = 0;
        }

    injtrap_fail:
        rcu_unlock_domain(d);
        break;
    }

The conclusion: instead of __vmwrite(VM_ENTRY_INTR_INFO, ...);
__vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, ...) in vmx_idtv_reinject(),
simply do what the hypercall would have done, i.e. write
inject_trap.vector, inject_trap.type, etc.

This way, the hypercall will fail immediately if there's a pending
interrupt set on exit.

Is this too good to be true? :)


Thanks,
Razvan
Razvan Cojocaru Nov. 8, 2016, 9:39 a.m. UTC | #8
On 11/08/2016 11:19 AM, Razvan Cojocaru wrote:
> On 11/08/2016 10:15 AM, Jan Beulich wrote:
>>>>> On 07.11.16 at 18:01, <rcojocaru@bitdefender.com> wrote:
>>> On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>>>>> The one-shot vm_event does sound reasonable. I could set a flag
>>>>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>>>>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>>>>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
>>>>
>>>> Doing this in hvm_inject_trap() would not cover all cases afaict.
>>>> I'd suggest doing this from hvm_do_resume() _after_ the
>>>> (conditional) call to hvm_inject_trap(), if there is _any_ event
>>>> pending.
>>>
>>> But that would only cover the hypercall-injected traps. The condition in
>>> hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
>>> and inject_trap.vector seems to only ever be set by the hypercall:
>>> [...]
>>> So if the next interrupt is not caused by the hypercall, we'll never get
>>> another event. Am I reading the code wrong?
>>
>> No, maybe I expressed myself ambiguously: I meant to say that the
>> event should be delivered from hvm_do_resume(), but _outside_ the
>> conditional guarding the call to hvm_inject_trap(). Otherwise things
>> would have been worse than when doing it inside hvm_inject_trap().
> 
> While working on this patch, I've had a new idea, which would require
> less changes and fix the problem in a more elegant manner if validated.
> Looking at vmx_idtv_reinject(), the real problem seems to be that
> VM_ENTRY_INTR_INFO is being written to directly:
> 
> static void vmx_idtv_reinject(unsigned long idtv_info)
> {
>     /* Event delivery caused this intercept? Queue for redelivery. */
>     if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) )
>     {
>         if ( hvm_event_needs_reinjection(MASK_EXTR(idtv_info,
>                                                    INTR_INFO_INTR_TYPE_MASK),
>                                          idtv_info & INTR_INFO_VECTOR_MASK) )
>         {
>             /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
>             __vmwrite(VM_ENTRY_INTR_INFO,
>                       idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
>             if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
>             {
>                 unsigned long ec;
>
>                 __vmread(IDT_VECTORING_ERROR_CODE, &ec);
>                 __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, ec);
>             }
>         }
>
>         /*
>          * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
>          * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
>          */
>         if ( cpu_has_vmx_vnmi &&
>              ((idtv_info & INTR_INFO_INTR_TYPE_MASK) ==
>               MASK_INSR(X86_EVENTTYPE_NMI, INTR_INFO_INTR_TYPE_MASK)) )
>         {
>             unsigned long intr_info;
>
>             __vmread(GUEST_INTERRUPTIBILITY_INFO, &intr_info);
>             __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
>                       intr_info & ~VMX_INTR_SHADOW_NMI);
>         }
>     }
> }
> 
> where the hypercall looks at v->arch.hvm_vcpu.inject_trap.vector only.
> Then we notice that the hypercall _fails immediately_ with -EBUSY if
> v->arch.hvm_vcpu.inject_trap.vector is already set:
> 
>     case HVMOP_inject_trap:
>     {
>         xen_hvm_inject_trap_t tr;
>         struct domain *d;
>         struct vcpu *v;
>
>         if ( copy_from_guest(&tr, arg, 1 ) )
>             return -EFAULT;
>
>         rc = rcu_lock_remote_domain_by_id(tr.domid, &d);
>         if ( rc != 0 )
>             return rc;
>
>         rc = -EINVAL;
>         if ( !is_hvm_domain(d) )
>             goto injtrap_fail;
>
>         rc = xsm_hvm_control(XSM_DM_PRIV, d, op);
>         if ( rc )
>             goto injtrap_fail;
>
>         rc = -ENOENT;
>         if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
>             goto injtrap_fail;
>
>         if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
>             rc = -EBUSY;

Actually, the fix could be even simpler than that: we can add
" || hvm_event_pending(v)" to this if().

Objections?


Thanks,
Razvan
Razvan Cojocaru Nov. 8, 2016, 9:46 a.m. UTC | #9
On 11/08/2016 11:39 AM, Razvan Cojocaru wrote:
> On 11/08/2016 11:19 AM, Razvan Cojocaru wrote:
>> On 11/08/2016 10:15 AM, Jan Beulich wrote:
>>>>>> On 07.11.16 at 18:01, <rcojocaru@bitdefender.com> wrote:
>>>> On 11/07/2016 06:10 PM, Jan Beulich wrote:
>>>>>>>> On 07.11.16 at 16:24, <rcojocaru@bitdefender.com> wrote:
>>>>>> The one-shot vm_event does sound reasonable. I could set a flag
>>>>>> per-VCPU, basically similar to v->arch.hvm_vcpu.inject_trap.vector, and
>>>>>> fire a single event from hvm_inject_trap() if it's set (then unset it) -
>>>>>> the flag would be set via an xc_monitor_next_interrupt() call in libxc.
>>>>>
>>>>> Doing this in hvm_inject_trap() would not cover all cases afaict.
>>>>> I'd suggest doing this from hvm_do_resume() _after_ the
>>>>> (conditional) call to hvm_inject_trap(), if there is _any_ event
>>>>> pending.
>>>>
>>>> But that would only cover the hypercall-injected traps. The condition in
>>>> hvm_do_resume() is "if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )",
>>>> and inject_trap.vector seems to only ever be set by the hypercall:
>>>> [...]
>>>> So if the next interrupt is not caused by the hypercall, we'll never get
>>>> another event. Am I reading the code wrong?
>>>
>>> No, maybe I expressed myself ambiguously: I meant to say that the
>>> event should be delivered from hvm_do_resume(), but _outside_ the
>>> conditional guarding the call to hvm_inject_trap(). Otherwise things
>>> would have been worse than when doing it inside hvm_inject_trap().
>>
>> While working on this patch, I've had a new idea, which would require
>> less changes and fix the problem in a more elegant manner if validated.
>> Looking at vmx_idtv_reinject(), the real problem seems to be that
>> VM_ENTRY_INTR_INFO is being written to directly:
>>
>> static void vmx_idtv_reinject(unsigned long idtv_info)
>> {
>>     /* Event delivery caused this intercept? Queue for redelivery. */
>>     if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) )
>>     {
>>         if ( hvm_event_needs_reinjection(MASK_EXTR(idtv_info,
>>                                                    INTR_INFO_INTR_TYPE_MASK),
>>                                          idtv_info & INTR_INFO_VECTOR_MASK) )
>>         {
>>             /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
>>             __vmwrite(VM_ENTRY_INTR_INFO,
>>                       idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
>>             if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
>>             {
>>                 unsigned long ec;
>>
>>                 __vmread(IDT_VECTORING_ERROR_CODE, &ec);
>>                 __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, ec);
>>             }
>>         }
>>
>>         /*
>>          * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
>>          * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
>>          */
>>         if ( cpu_has_vmx_vnmi &&
>>              ((idtv_info & INTR_INFO_INTR_TYPE_MASK) ==
>>               MASK_INSR(X86_EVENTTYPE_NMI, INTR_INFO_INTR_TYPE_MASK)) )
>>         {
>>             unsigned long intr_info;
>>
>>             __vmread(GUEST_INTERRUPTIBILITY_INFO, &intr_info);
>>             __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
>>                       intr_info & ~VMX_INTR_SHADOW_NMI);
>>         }
>>     }
>> }
>>
>> where the hypercall looks at v->arch.hvm_vcpu.inject_trap.vector only.
>> Then we notice that the hypercall _fails immediately_ with -EBUSY if
>> v->arch.hvm_vcpu.inject_trap.vector is already set:
>>
>>     case HVMOP_inject_trap:
>>     {
>>         xen_hvm_inject_trap_t tr;
>>         struct domain *d;
>>         struct vcpu *v;
>>
>>         if ( copy_from_guest(&tr, arg, 1 ) )
>>             return -EFAULT;
>>
>>         rc = rcu_lock_remote_domain_by_id(tr.domid, &d);
>>         if ( rc != 0 )
>>             return rc;
>>
>>         rc = -EINVAL;
>>         if ( !is_hvm_domain(d) )
>>             goto injtrap_fail;
>>
>>         rc = xsm_hvm_control(XSM_DM_PRIV, d, op);
>>         if ( rc )
>>             goto injtrap_fail;
>>
>>         rc = -ENOENT;
>>         if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
>>             goto injtrap_fail;
>>
>>         if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
>>             rc = -EBUSY;
> 
> Actually the fix should be even simpler than that: we can add to this if
> " || hvm_event_pending(v)".
> 
> Objections?

Never mind: vmx_event_pending() has an ASSERT() requiring that v == current:

static int vmx_event_pending(struct vcpu *v)
{
    unsigned long intr_info;

    ASSERT(v == current);
    __vmread(VM_ENTRY_INTR_INFO, &intr_info);

    return intr_info & INTR_INFO_VALID_MASK;
}

The inject_trap.vector solution seems to be the only plausible one.
Sorry for the spam.


Thanks,
Razvan

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 704fd64..58f5ae4 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -535,8 +535,22 @@  void hvm_do_resume(struct vcpu *v)
     /* Inject pending hw/sw trap */
     if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
     {
-        hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
+        unsigned int success = 0;
+
+        /* Check for already pending interrupts (races). */
+        if ( !hvm_event_pending(v) )
+        {
+            hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
+            success = 1;
+        }
+
+        hvm_monitor_injection_result(v->arch.hvm_vcpu.inject_trap.vector,
+                                     v->arch.hvm_vcpu.inject_trap.type,
+                                     v->arch.hvm_vcpu.inject_trap.error_code,
+                                     v->arch.hvm_vcpu.inject_trap.cr2,
+                                     success);
+
         v->arch.hvm_vcpu.inject_trap.vector = -1;
     }
 }