Message ID: 1493277744-9032-1-git-send-email-rcojocaru@bitdefender.com (mailing list archive)
State: New, archived
On 27/04/2017 08:22, Razvan Cojocaru wrote:
> The introspection agent can reply to a vm_event faster than
> vmx_vmexit_handler() can complete in some cases, where it is then
> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>
> This patch ensures that vm_event_resume() code only sets per-VCPU
> data to be used for the actual setting of registers only in
> hvm_do_resume() (similar to the model used to control setting of CRs
> and MSRs).
>
> The patch additionally removes the sync_vcpu_execstate(v) call from
> vm_event_resume(), which is no longer necessary, which removes the
> associated broadcast TLB flush (read: performance improvement).
>
> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

FWIW, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (this has
taken an embarrassingly long time to get to the root cause of,
considering the eventual simplicity of the patch), with one trivial
correction I have just spotted below.

CC'ing Julien and Ian in their RM capacity, as this bugfix should be
taken for 4.9.

> diff --git a/xen/common/vm_event.c b/xen/common/vm_event.c
> index 0fe9a53..498749b 100644
> --- a/xen/common/vm_event.c
> +++ b/xen/common/vm_event.c
> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>  {
>      vm_event_response_t rsp;
>
> +    /*
> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
> +     * EVTCHN_send from the introspection consumer. Both contexts are

I have just spotted that this should read "runs in either
XEN_DOMCTL_VM_EVENT_OP_*, or EVTCHN_send context from". I can fix up on
commit, if there are no other issues.

~Andrew

> +     * guaranteed not to be the subject of vm_event responses.
> +     */
> +    ASSERT(d != current->domain);
> +
>      /* Pull all responses off the ring. */
>      while ( vm_event_get_response(d, ved, &rsp) )
>      {
>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
> The introspection agent can reply to a vm_event faster than
> vmx_vmexit_handler() can complete in some cases, where it is then
> not safe for vm_event_set_registers() to modify v->arch.user_regs.

This needs more explanation: I cannot see the connection between
vmx_vmexit_handler() completing and (un)safety of modification of
v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
gone through __context_switch(), and the former doesn't call that
function directly or indirectly (i.e. I think it is more than just
vmx_vmexit_handler() which needs to be completed by the time
register modification becomes safe to do).

> This patch ensures that vm_event_resume() code only sets per-VCPU
> data to be used for the actual setting of registers only in
> hvm_do_resume() (similar to the model used to control setting of CRs
> and MSRs).

I think the second "only" would better be "later".

> The patch additionally removes the sync_vcpu_execstate(v) call from
> vm_event_resume(), which is no longer necessary, which removes the
> associated broadcast TLB flush (read: performance improvement).

Depending on the better explanation above, it may or may not be
further necessary to also better explain the "no longer necessary"
part here.

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>      return hvm_funcs.get_pending_event(v, info);
>  }
>
> +static void hvm_vm_event_set_registers(struct vcpu *v)

This could be const, as *v itself isn't being modified, but maybe it's
better to leave it non-const indeed.

> +{
> +    ASSERT(v == current);
> +
> +    if ( v->arch.vm_event->set_gprs )

I think we will want an unlikely() here. We anyway can only hope for
the compiler to always inline this function, such that non-VM-event
setups don't get penalized by the extra call here. Strictly speaking
the function doesn't belong into this file ...

> +    {
> +        struct cpu_user_regs *regs = guest_cpu_user_regs();
> +
> +        regs->rax = v->arch.vm_event->gprs.rax;
> +        regs->rbx = v->arch.vm_event->gprs.rbx;
> +        regs->rcx = v->arch.vm_event->gprs.rcx;
> +        regs->rdx = v->arch.vm_event->gprs.rdx;
> +        regs->rsp = v->arch.vm_event->gprs.rsp;
> +        regs->rbp = v->arch.vm_event->gprs.rbp;
> +        regs->rsi = v->arch.vm_event->gprs.rsi;
> +        regs->rdi = v->arch.vm_event->gprs.rdi;
> +
> +        regs->r8 = v->arch.vm_event->gprs.r8;
> +        regs->r9 = v->arch.vm_event->gprs.r9;
> +        regs->r10 = v->arch.vm_event->gprs.r10;
> +        regs->r11 = v->arch.vm_event->gprs.r11;
> +        regs->r12 = v->arch.vm_event->gprs.r12;
> +        regs->r13 = v->arch.vm_event->gprs.r13;
> +        regs->r14 = v->arch.vm_event->gprs.r14;
> +        regs->r15 = v->arch.vm_event->gprs.r15;
> +
> +        regs->rflags = v->arch.vm_event->gprs.rflags;
> +        regs->rip = v->arch.vm_event->gprs.rip;
> +
> +        v->arch.vm_event->set_gprs = 0;

false (and true/bool elsewhere)

> --- a/xen/common/vm_event.c
> +++ b/xen/common/vm_event.c
> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>  {
>      vm_event_response_t rsp;
>
> +    /*
> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
> +     * EVTCHN_send from the introspection consumer. Both contexts are
> +     * guaranteed not to be the subject of vm_event responses.
> +     */
> +    ASSERT(d != current->domain);

What is this meant to guard against? It surely isn't ...
> @@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>          v = d->vcpu[rsp.vcpu_id];
>
>          /*
> -         * Make sure the vCPU state has been synchronized for the custom
> -         * handlers.
> -         */
> -        if ( atomic_read(&v->vm_event_pause_count) )
> -            sync_vcpu_execstate(v);

... a "replacement" for this, as the state of "current" doesn't reflect
whether register state has been saved (that's this_cpu(curr_vcpu)
instead). Nor would a comparison of domains seem to be the right
thing - a comparison of vcpus ought to suffice (and be less strict,
allowing for something hypothetical like self-introspection).

Jan
On 04/27/17 11:01, Jan Beulich wrote:
>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>> The introspection agent can reply to a vm_event faster than
>> vmx_vmexit_handler() can complete in some cases, where it is then
>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>
> This needs more explanation: I cannot see the connection between
> vmx_vmexit_handler() completing and (un)safety of modification of
> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
> gone through __context_switch(), and the former doesn't call that
> function directly or indirectly (i.e. I think it is more than just
> vmx_vmexit_handler() which needs to be completed by the time
> register modification becomes safe to do).

Indeed, this is exactly the case (the vCPU hasn't yet gone through
__context_switch()). I'll re-word the commit message.

>> This patch ensures that vm_event_resume() code only sets per-VCPU
>> data to be used for the actual setting of registers only in
>> hvm_do_resume() (similar to the model used to control setting of CRs
>> and MSRs).
>
> I think the second "only" would better be "later".

Yes, it's actually redundant (sorry). I'll change it to "later".

>> The patch additionally removes the sync_vcpu_execstate(v) call from
>> vm_event_resume(), which is no longer necessary, which removes the
>> associated broadcast TLB flush (read: performance improvement).
>
> Depending on the better explanation above, it may or may not be
> further necessary to also better explain the "no longer necessary"
> part here.
>
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>      return hvm_funcs.get_pending_event(v, info);
>>  }
>>
>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>
> This could be const, as *v itself isn't being modified, but maybe it's
> better to leave it non-const indeed.

I have no problem changing it if there are no objections.

>> +{
>> +    ASSERT(v == current);
>> +
>> +    if ( v->arch.vm_event->set_gprs )
>
> I think we will want an unlikely() here. We anyway can only hope for
> the compiler to always inline this function, such that non-VM-event
> setups don't get penalized by the extra call here. Strictly speaking
> the function doesn't belong into this file ...

Should I move it to the x86 vm_event code?

>> +    {
>> +        struct cpu_user_regs *regs = guest_cpu_user_regs();
>> +
>> +        regs->rax = v->arch.vm_event->gprs.rax;
>> +        regs->rbx = v->arch.vm_event->gprs.rbx;
>> +        regs->rcx = v->arch.vm_event->gprs.rcx;
>> +        regs->rdx = v->arch.vm_event->gprs.rdx;
>> +        regs->rsp = v->arch.vm_event->gprs.rsp;
>> +        regs->rbp = v->arch.vm_event->gprs.rbp;
>> +        regs->rsi = v->arch.vm_event->gprs.rsi;
>> +        regs->rdi = v->arch.vm_event->gprs.rdi;
>> +
>> +        regs->r8 = v->arch.vm_event->gprs.r8;
>> +        regs->r9 = v->arch.vm_event->gprs.r9;
>> +        regs->r10 = v->arch.vm_event->gprs.r10;
>> +        regs->r11 = v->arch.vm_event->gprs.r11;
>> +        regs->r12 = v->arch.vm_event->gprs.r12;
>> +        regs->r13 = v->arch.vm_event->gprs.r13;
>> +        regs->r14 = v->arch.vm_event->gprs.r14;
>> +        regs->r15 = v->arch.vm_event->gprs.r15;
>> +
>> +        regs->rflags = v->arch.vm_event->gprs.rflags;
>> +        regs->rip = v->arch.vm_event->gprs.rip;
>> +
>> +        v->arch.vm_event->set_gprs = 0;
>
> false (and true/bool elsewhere)

I'll change it to bool.
>> --- a/xen/common/vm_event.c
>> +++ b/xen/common/vm_event.c
>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>  {
>>      vm_event_response_t rsp;
>>
>> +    /*
>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>> +     * guaranteed not to be the subject of vm_event responses.
>> +     */
>> +    ASSERT(d != current->domain);
>
> What is this meant to guard against? It surely isn't ...
>
>> @@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>          v = d->vcpu[rsp.vcpu_id];
>>
>>          /*
>> -         * Make sure the vCPU state has been synchronized for the custom
>> -         * handlers.
>> -         */
>> -        if ( atomic_read(&v->vm_event_pause_count) )
>> -            sync_vcpu_execstate(v);
>
> ... a "replacement" for this, as the state of "current" doesn't reflect
> whether register state has been saved (that's this_cpu(curr_vcpu)
> instead). Nor would a comparison of domains seem to be the right
> thing - a comparison of vcpus ought to suffice (and be less strict,
> allowing for something hypothetical like self-introspection).

This part (the ASSERT and comment) is from Andrew, he can help us here.

Thanks,
Razvan
>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
> On 04/27/17 11:01, Jan Beulich wrote:
>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>      return hvm_funcs.get_pending_event(v, info);
>>>  }
>>>
>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>> +{
>>> +    ASSERT(v == current);
>>> +
>>> +    if ( v->arch.vm_event->set_gprs )
>>
>> I think we will want an unlikely() here. We anyway can only hope for
>> the compiler to always inline this function, such that non-VM-event
>> setups don't get penalized by the extra call here. Strictly speaking
>> the function doesn't belong into this file ...
>
> Should I move it to the x86 vm_event code?

If you do, then you'll want to have an inline wrapper with the if()
in it, so the actual call at the machine level would be avoided in the
common case.

Jan
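[To make Jan's suggestion concrete, here is a minimal sketch of the
inline-wrapper arrangement, assuming the body moves out of hvm.c. The
names vm_event_sync_set_gprs() and __vm_event_set_registers() are
hypothetical, not the committed interface, and the caller is assumed to
have already checked v->arch.vm_event for NULL, as hvm_do_resume()
does:

/* Out-of-line body, e.g. in xen/arch/x86/vm_event.c: */
void __vm_event_set_registers(struct vcpu *v);

/* Inline wrapper, e.g. in xen/include/asm-x86/vm_event.h: */
static inline void vm_event_sync_set_gprs(struct vcpu *v)
{
    /*
     * The cheap flag test stays inline, so in the common case (no
     * pending register update) no call is emitted at the machine
     * level. The caller must have verified v->arch.vm_event != NULL.
     */
    if ( unlikely(v->arch.vm_event->set_gprs) )
        __vm_event_set_registers(v);
}

This way only VM-event setups with a pending register update pay for an
out-of-line call.]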
On 27/04/2017 09:01, Jan Beulich wrote:
>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>> The introspection agent can reply to a vm_event faster than
>> vmx_vmexit_handler() can complete in some cases, where it is then
>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
> This needs more explanation: I cannot see the connection between
> vmx_vmexit_handler() completing and (un)safety of modification of
> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
> gone through __context_switch(), and the former doesn't call that
> function directly or indirectly (i.e. I think it is more than just
> vmx_vmexit_handler() which needs to be completed by the time
> register modification becomes safe to do).

The test scenario here was to step over an int3 breakpoint by setting
rip += 1. This is a very quick reply and tended to complete before the
vcpu triggering the introspection event had properly paused and been
descheduled.

If the reply occurs before __context_switch() happens,
__context_switch() clobbers the reply by overwriting v->arch.user_regs
from the stack. If the reply occurs after __context_switch(), but the
pcpu has gone idle and keeps v in context, v->arch.user_regs is not
restored to the stack, and the update is similarly missed.

(There is a very narrow race condition where both the reply and
__context_switch() are updating v->arch.user_regs, and who knows what
will happen, given our memcpy implementation.)

>> The patch additionally removes the sync_vcpu_execstate(v) call from
>> vm_event_resume(), which is no longer necessary, which removes the
>> associated broadcast TLB flush (read: performance improvement).
> Depending on the better explanation above, it may or may not be
> further necessary to also better explain the "no longer necessary"
> part here.

As I understand, it was a previous attempt to fix this bug.

There is nothing (now) in vm_event_resume() which does anything other
than make new updates into v->arch.vm_event and drop the pause
reference. All updates are applied in context, in the return-to-guest
path, which is race free.

There is no need for the IPI, or to force the target vcpu out of
context if the pcpu is currently idle.

>> --- a/xen/common/vm_event.c
>> +++ b/xen/common/vm_event.c
>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>  {
>>      vm_event_response_t rsp;
>>
>> +    /*
>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>> +     * guaranteed not to be the subject of vm_event responses.
>> +     */
>> +    ASSERT(d != current->domain);
> What is this meant to guard against?

Documentation, as much as anything else. It took a long time to
untangle the contexts involved here; I'm not convinced the logic is
safe to run in context, and it certainly doesn't need to.

~Andrew
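[For context, the consumer side of the int3 step-over scenario Andrew
describes looks roughly like the sketch below, built on the public
vm_event interface (VM_EVENT_FLAG_SET_REGISTERS, VM_EVENT_FLAG_VCPU_PAUSED
and vm_event_response_t come from xen/public/vm_event.h); the ring and
event-channel plumbing is omitted:

/* Sketch of an int3 step-over reply; ring handling omitted. */
vm_event_response_t rsp = {};

rsp.version = VM_EVENT_INTERFACE_VERSION;
rsp.vcpu_id = req.vcpu_id;
rsp.reason  = req.reason;
rsp.flags   = (req.flags & VM_EVENT_FLAG_VCPU_PAUSED) |
              VM_EVENT_FLAG_SET_REGISTERS;

rsp.data.regs.x86 = req.data.regs.x86; /* start from the reported state */
rsp.data.regs.x86.rip += 1;            /* step over the 1-byte int3 */

Because such a reply can be pushed onto the ring and EVTCHN_send issued
very quickly, prior to this patch the resulting register update could
race with __context_switch() on the vCPU being resumed.]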
On 04/27/17 11:18, Jan Beulich wrote:
>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>  }
>>>>
>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>> +{
>>>> +    ASSERT(v == current);
>>>> +
>>>> +    if ( v->arch.vm_event->set_gprs )
>>>
>>> I think we will want an unlikely() here. We anyway can only hope for
>>> the compiler to always inline this function, such that non-VM-event
>>> setups don't get penalized by the extra call here. Strictly speaking
>>> the function doesn't belong into this file ...
>>
>> Should I move it to the x86 vm_event code?
>
> If you do, then you'll want to have an inline wrapper with the if()
> in it, so the actual call at the machine level would be avoided in the
> common case.

Sorry, I'm not sure I understand: if moving it is not required, I'd
prefer to leave it where it is, as we already have
vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
complicate matters there (I'd have to perhaps prepend "hvm_" to it, in
which case it wouldn't really belong in vm_event.c either). But if it's
necessary, I'll move it - do you want me to move it?

Thanks,
Razvan
>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>> The introspection agent can reply to a vm_event faster than
>>> vmx_vmexit_handler() can complete in some cases, where it is then
>>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>> This needs more explanation: I cannot see the connection between
>> vmx_vmexit_handler() completing and (un)safety of modification of
>> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
>> gone through __context_switch(), and the former doesn't call that
>> function directly or indirectly (i.e. I think it is more than just
>> vmx_vmexit_handler() which needs to be completed by the time
>> register modification becomes safe to do).
>
> The test scenario here was to step over an int3 breakpoint by setting
> rip += 1. This is a very quick reply and tended to complete before the
> vcpu triggering the introspection event had properly paused and been
> descheduled.
>
> If the reply occurs before __context_switch() happens,
> __context_switch() clobbers the reply by overwriting v->arch.user_regs
> from the stack. If the reply occurs after __context_switch(), but the
> pcpu has gone idle and keeps v in context, v->arch.user_regs is not
> restored to the stack, and the update is similarly missed.

This second case is not properly described, I think: v won't be kept in
context once we've gone through __context_switch(). If the pCPU
goes idle, __context_switch() simply will be deferred until another
vCPU gets scheduled onto this pCPU, and be avoided altogether if
it's the original vCPU that gets scheduled back onto this pCPU. But
I do get the point. I think much if not all of the above (suitably
adjusted) needs to go into the commit message.

> (There is a very narrow race condition where both the reply and
> __context_switch() are updating v->arch.user_regs, and who knows what
> will happen, given our memcpy implementation.)

Right, but with the proper serialization of events done now this
isn't a problem anymore either, aiui.

>>> The patch additionally removes the sync_vcpu_execstate(v) call from
>>> vm_event_resume(), which is no longer necessary, which removes the
>>> associated broadcast TLB flush (read: performance improvement).
>> Depending on the better explanation above, it may or may not be
>> further necessary to also better explain the "no longer necessary"
>> part here.
>
> As I understand, it was a previous attempt to fix this bug.
>
> There is nothing (now) in vm_event_resume() which does anything other
> than make new updates into v->arch.vm_event and drop the pause
> reference. All updates are applied in context, in the return-to-guest
> path, which is race free.
>
> There is no need for the IPI, or to force the target vcpu out of
> context if the pcpu is currently idle.

I agree, and (as indicated) the better explanation earlier on is
probably sufficient then.

>>> --- a/xen/common/vm_event.c
>>> +++ b/xen/common/vm_event.c
>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>  {
>>>      vm_event_response_t rsp;
>>>
>>> +    /*
>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>>> +     * guaranteed not to be the subject of vm_event responses.
>>> +     */
>>> +    ASSERT(d != current->domain);
>> What is this meant to guard against?
>
> Documentation, as much as anything else. It took a long time to
> untangle the contexts involved here; I'm not convinced the logic is safe
> to run in context, and it certainly doesn't need to.

Well, as said - I think it is too strict now: You only need the vCPU
not be current afaict, and it really matters here which of the two
"current"-s you actually mean (and it looks to me as if you mean
the other one, guaranteeing register state to no longer be on the
stack).

Jan
>>> On 27.04.17 at 10:34, <rcojocaru@bitdefender.com> wrote:
> On 04/27/17 11:18, Jan Beulich wrote:
>>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>>  }
>>>>>
>>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>>> +{
>>>>> +    ASSERT(v == current);
>>>>> +
>>>>> +    if ( v->arch.vm_event->set_gprs )
>>>>
>>>> I think we will want an unlikely() here. We anyway can only hope for
>>>> the compiler to always inline this function, such that non-VM-event
>>>> setups don't get penalized by the extra call here. Strictly speaking
>>>> the function doesn't belong into this file ...
>>>
>>> Should I move it to the x86 vm_event code?
>>
>> If you do, then you'll want to have an inline wrapper with the if()
>> in it, so the actual call at the machine level would be avoided in the
>> common case.
>
> Sorry, I'm not sure I understand: if moving it is not required, I'd
> prefer to leave it where it is, as we already have
> vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
> complicate matters there (I'd have to perhaps prepend "hvm_" to it, in
> which case it wouldn't really belong in vm_event.c either). But if it's
> necessary, I'll move it - do you want me to move it?

Well, logically this is a VM event function, so belongs into vm_event.c.
So I'd _prefer_ for it to be moved, but I don't insist (in the end, by
leaving it where it is, you and Tamas [wrongly imo] are not the
maintainers of it). But I agree, x86/vm_event.c isn't an ideal place
either; better would be x86/hvm/vm_event.c.

Otoh hvm_do_resume() open codes all the CR and MSR writes
(getting the order wrong as it seems, btw, as some MSR writes
depend on certain CR settings), so maybe a separate function isn't
needed at all here? In fact the bulk of the function is VM event
code, so one might also view it the other way around, and most or
all of this should move into a new function, for example
hvm_vm_event_do_resume().

Jan
On 04/27/17 12:00, Jan Beulich wrote:
>>>> On 27.04.17 at 10:34, <rcojocaru@bitdefender.com> wrote:
>> On 04/27/17 11:18, Jan Beulich wrote:
>>>>>> On 27.04.17 at 10:11, <rcojocaru@bitdefender.com> wrote:
>>>> On 04/27/17 11:01, Jan Beulich wrote:
>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>> @@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
>>>>>>      return hvm_funcs.get_pending_event(v, info);
>>>>>>  }
>>>>>>
>>>>>> +static void hvm_vm_event_set_registers(struct vcpu *v)
>>>>>> +{
>>>>>> +    ASSERT(v == current);
>>>>>> +
>>>>>> +    if ( v->arch.vm_event->set_gprs )
>>>>>
>>>>> I think we will want an unlikely() here. We anyway can only hope for
>>>>> the compiler to always inline this function, such that non-VM-event
>>>>> setups don't get penalized by the extra call here. Strictly speaking
>>>>> the function doesn't belong into this file ...
>>>>
>>>> Should I move it to the x86 vm_event code?
>>>
>>> If you do, then you'll want to have an inline wrapper with the if()
>>> in it, so the actual call at the machine level would be avoided in the
>>> common case.
>>
>> Sorry, I'm not sure I understand: if moving it is not required, I'd
>> prefer to leave it where it is, as we already have
>> vm_event_set_registers() in arch/x86/vm_event.c and I feel it would
>> complicate matters there (I'd have to perhaps prepend "hvm_" to it, in
>> which case it wouldn't really belong in vm_event.c either). But if it's
>> necessary, I'll move it - do you want me to move it?
>
> Well, logically this is a VM event function, so belongs into vm_event.c.
> So I'd _prefer_ for it to be moved, but I don't insist (in the end, by
> leaving it where it is, you and Tamas [wrongly imo] are not the
> maintainers of it). But I agree, x86/vm_event.c isn't an ideal place
> either; better would be x86/hvm/vm_event.c.
>
> Otoh hvm_do_resume() open codes all the CR and MSR writes
> (getting the order wrong as it seems, btw, as some MSR writes
> depend on certain CR settings), so maybe a separate function isn't
> needed at all here? In fact the bulk of the function is VM event
> code, so one might also view it the other way around, and most or
> all of this should move into a new function, for example
> hvm_vm_event_do_resume().

All fair points. If there are no objections, I'll add x86/hvm/vm_event.c
and asm-x86/hvm/vm_event.h, and place hvm_vm_event_do_resume() in it.
I'll also add the files to MAINTAINERS under vm_event. In the process of
moving the code, I'll also put the MSR write last.

This will then become a 2-patch series, with the 1st patch doing the
above, and the second patch being this one.

Thanks,
Razvan
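[A rough sketch of what the moved code could look like, under the plan
agreed above. The CR/MSR bodies mirror what hvm_do_resume() open-codes
at this point in the tree, reordered so the MSR write comes last, with
the emulate_flags handling elided; treat the call signatures as
illustrative of the idea rather than the final patch:

/* Hypothetical xen/arch/x86/hvm/vm_event.c (sketch). */
void hvm_vm_event_do_resume(struct vcpu *v)
{
    struct monitor_write_data *w = &v->arch.vm_event->write_data;

    hvm_vm_event_set_registers(v);

    if ( w->do_write.cr0 )
    {
        hvm_set_cr0(w->cr0, 0);
        w->do_write.cr0 = 0;
    }

    if ( w->do_write.cr4 )
    {
        hvm_set_cr4(w->cr4, 0);
        w->do_write.cr4 = 0;
    }

    if ( w->do_write.cr3 )
    {
        hvm_set_cr3(w->cr3, 0);
        w->do_write.cr3 = 0;
    }

    /* MSR write last: some MSR writes depend on the CR state above. */
    if ( w->do_write.msr )
    {
        hvm_msr_write_intercept(w->msr, w->value, 0);
        w->do_write.msr = 0;
    }
}

hvm_do_resume() would then shrink to a single call into this function
inside its existing if ( unlikely(v->arch.vm_event) ) block.]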
On 27/04/17 09:52, Jan Beulich wrote:
>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>> The introspection agent can reply to a vm_event faster than
>>>> vmx_vmexit_handler() can complete in some cases, where it is then
>>>> not safe for vm_event_set_registers() to modify v->arch.user_regs.
>>> This needs more explanation: I cannot see the connection between
>>> vmx_vmexit_handler() completing and (un)safety of modification of
>>> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't
>>> gone through __context_switch(), and the former doesn't call that
>>> function directly or indirectly (i.e. I think it is more than just
>>> vmx_vmexit_handler() which needs to be completed by the time
>>> register modification becomes safe to do).
>> The test scenario here was to step over an int3 breakpoint by setting
>> rip += 1. This is a very quick reply and tended to complete before the
>> vcpu triggering the introspection event had properly paused and been
>> descheduled.
>>
>> If the reply occurs before __context_switch() happens,
>> __context_switch() clobbers the reply by overwriting v->arch.user_regs
>> from the stack. If the reply occurs after __context_switch(), but the
>> pcpu has gone idle and keeps v in context, v->arch.user_regs is not
>> restored to the stack, and the update is similarly missed.
> This second case is not properly described, I think: v won't be kept in
> context once we've gone through __context_switch(). If the pCPU
> goes idle, __context_switch() simply will be deferred until another
> vCPU gets scheduled onto this pCPU, and be avoided altogether if
> it's the original vCPU that gets scheduled back onto this pCPU. But
> I do get the point. I think much if not all of the above (suitably
> adjusted) needs to go into the commit message.

Yes. I did make a slight error. The latter case is that we don't pass
through __context_switch() when transitioning to idle.

>> (There is a very narrow race condition where both the reply and
>> __context_switch() are updating v->arch.user_regs, and who knows what
>> will happen, given our memcpy implementation.)
> Right, but with the proper serialization of events done now this
> isn't a problem anymore either, aiui.

Correct.

>>>> The patch additionally removes the sync_vcpu_execstate(v) call from
>>>> vm_event_resume(), which is no longer necessary, which removes the
>>>> associated broadcast TLB flush (read: performance improvement).
>>> Depending on the better explanation above, it may or may not be
>>> further necessary to also better explain the "no longer necessary"
>>> part here.
>> As I understand, it was a previous attempt to fix this bug.
>>
>> There is nothing (now) in vm_event_resume() which does anything other
>> than make new updates into v->arch.vm_event and drop the pause
>> reference. All updates are applied in context, in the return-to-guest
>> path, which is race free.
>>
>> There is no need for the IPI, or to force the target vcpu out of
>> context if the pcpu is currently idle.
> I agree, and (as indicated) the better explanation earlier on is
> probably sufficient then.

>>>> --- a/xen/common/vm_event.c
>>>> +++ b/xen/common/vm_event.c
>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>  {
>>>>      vm_event_response_t rsp;
>>>>
>>>> +    /*
>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>> +     */
>>>> +    ASSERT(d != current->domain);
>>> What is this meant to guard against?
>> Documentation, as much as anything else. It took a long time to
>> untangle the contexts involved here; I'm not convinced the logic is safe
>> to run in context, and it certainly doesn't need to.
> Well, as said - I think it is too strict now: You only need the vCPU
> not be current afaict, and it really matters here which of the two
> "current"-s you actually mean (and it looks to me as if you mean
> the other one, guaranteeing register state to no longer be on the
> stack).

We don't know the vcpu at this point; that information comes out of the
replies on the ring.

We also may process multiple replies in this one loop. In the worst
case, we process one reply for every vcpu in d.

If the ASSERT() were to be deferred until the middle of the request
loop, we could ASSERT(v != current) for every vcpu in d, but that is
identical to this single assertion.

~Andrew
>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
> On 27/04/17 09:52, Jan Beulich wrote:
>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>> --- a/xen/common/vm_event.c
>>>>> +++ b/xen/common/vm_event.c
>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>>  {
>>>>>      vm_event_response_t rsp;
>>>>>
>>>>> +    /*
>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>> +     */
>>>>> +    ASSERT(d != current->domain);
>>>> What is this meant to guard against?
>>> Documentation, as much as anything else. It took a long time to
>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>> to run in context, and it certainly doesn't need to.
>> Well, as said - I think it is too strict now: You only need the vCPU
>> not be current afaict, and it really matters here which of the two
>> "current"-s you actually mean (and it looks to me as if you mean
>> the other one, guaranteeing register state to no longer be on the
>> stack).
>
> We don't know the vcpu at this point; that information comes out of the
> replies on the ring.
>
> We also may process multiple replies in this one loop. In the worst
> case, we process one reply for every vcpu in d.
>
> If the ASSERT() were to be deferred until the middle of the request
> loop, we could ASSERT(v != current) for every vcpu in d, but that is
> identical to this single assertion.

No, it isn't. It would only be if you iterated over all vCPU-s of that
domain (which as you say may or may not happen).

Jan
On 04/28/2017 09:25 AM, Jan Beulich wrote:
>>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
>> On 27/04/17 09:52, Jan Beulich wrote:
>>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>> --- a/xen/common/vm_event.c
>>>>>> +++ b/xen/common/vm_event.c
>>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>>>  {
>>>>>>      vm_event_response_t rsp;
>>>>>>
>>>>>> +    /*
>>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>>> +     */
>>>>>> +    ASSERT(d != current->domain);
>>>>> What is this meant to guard against?
>>>> Documentation, as much as anything else. It took a long time to
>>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>>> to run in context, and it certainly doesn't need to.
>>> Well, as said - I think it is too strict now: You only need the vCPU
>>> not be current afaict, and it really matters here which of the two
>>> "current"-s you actually mean (and it looks to me as if you mean
>>> the other one, guaranteeing register state to no longer be on the
>>> stack).
>>
>> We don't know the vcpu at this point; that information comes out of the
>> replies on the ring.
>>
>> We also may process multiple replies in this one loop. In the worst
>> case, we process one reply for every vcpu in d.
>>
>> If the ASSERT() were to be deferred until the middle of the request
>> loop, we could ASSERT(v != current) for every vcpu in d, but that is
>> identical to this single assertion.
>
> No, it isn't. It would only be if you iterated over all vCPU-s of that
> domain (which as you say may or may not happen).

I will modify the code to whatever you and Andrew decide is best, but
just in case this helps decide: in my experience, iterating over all
VCPUs here will happen more often than not. Even looking at
xen-access.c today, it poll()s with a timeout, processes all collected
events (which, being sync events - and they always are with our
application - can only be several if they correspond to different
VCPUs), and only then signals the event channel.

But there are of course cases in which fewer than all the VCPUs have
placed events in the ring buffer.

Thanks,
Razvan
>>> On 28.04.17 at 08:45, <rcojocaru@bitdefender.com> wrote:
> On 04/28/2017 09:25 AM, Jan Beulich wrote:
>>>>> On 27.04.17 at 19:31, <andrew.cooper3@citrix.com> wrote:
>>> On 27/04/17 09:52, Jan Beulich wrote:
>>>>>>> On 27.04.17 at 10:34, <andrew.cooper3@citrix.com> wrote:
>>>>> On 27/04/2017 09:01, Jan Beulich wrote:
>>>>>>>>> On 27.04.17 at 09:22, <rcojocaru@bitdefender.com> wrote:
>>>>>>> --- a/xen/common/vm_event.c
>>>>>>> +++ b/xen/common/vm_event.c
>>>>>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
>>>>>>>  {
>>>>>>>      vm_event_response_t rsp;
>>>>>>>
>>>>>>> +    /*
>>>>>>> +     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
>>>>>>> +     * EVTCHN_send from the introspection consumer. Both contexts are
>>>>>>> +     * guaranteed not to be the subject of vm_event responses.
>>>>>>> +     */
>>>>>>> +    ASSERT(d != current->domain);
>>>>>> What is this meant to guard against?
>>>>> Documentation, as much as anything else. It took a long time to
>>>>> untangle the contexts involved here; I'm not convinced the logic is safe
>>>>> to run in context, and it certainly doesn't need to.
>>>> Well, as said - I think it is too strict now: You only need the vCPU
>>>> not be current afaict, and it really matters here which of the two
>>>> "current"-s you actually mean (and it looks to me as if you mean
>>>> the other one, guaranteeing register state to no longer be on the
>>>> stack).
>>>
>>> We don't know the vcpu at this point; that information comes out of the
>>> replies on the ring.
>>>
>>> We also may process multiple replies in this one loop. In the worst
>>> case, we process one reply for every vcpu in d.
>>>
>>> If the ASSERT() were to be deferred until the middle of the request
>>> loop, we could ASSERT(v != current) for every vcpu in d, but that is
>>> identical to this single assertion.
>>
>> No, it isn't. It would only be if you iterated over all vCPU-s of that
>> domain (which as you say may or may not happen).
>
> I will modify the code to whatever you and Andrew decide is best, but
> just in case this helps decide: in my experience, iterating over all
> VCPUs here will happen more often than not. Even looking at
> xen-access.c today, it poll()s with a timeout, processes all collected
> events (which, being sync events - and they always are with our
> application - can only be several if they correspond to different
> VCPUs), and only then signals the event channel.
>
> But there are of course cases in which fewer than all the VCPUs have
> placed events in the ring buffer.

So just to clarify - I'm not entirely opposed to the ASSERT() staying
as is, but then the comment next to it should clarify that it is
slightly more strict than absolutely necessary.

Jan
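[For reference, the less strict per-response variant Jan describes
would move the check into the loop, once the vCPU is known - roughly as
sketched below; this is illustrative only, not what the thread settled
on:

/* Sketch of the per-response alternative discussed above. */
while ( vm_event_get_response(d, ved, &rsp) )
{
    struct vcpu *v;

    if ( rsp.vcpu_id >= d->max_vcpus )
        continue; /* the real loop validates and warns here */

    v = d->vcpu[rsp.vcpu_id];

    /*
     * Weaker than ASSERT(d != current->domain): only the vCPU being
     * responded to must not be running, which would leave room for
     * hypothetical self-introspection from another vCPU.
     */
    ASSERT(v != current);

    /* ... remainder of the response handling ... */
}
]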
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a441955..520942a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -473,6 +473,39 @@ static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
     return hvm_funcs.get_pending_event(v, info);
 }
 
+static void hvm_vm_event_set_registers(struct vcpu *v)
+{
+    ASSERT(v == current);
+
+    if ( v->arch.vm_event->set_gprs )
+    {
+        struct cpu_user_regs *regs = guest_cpu_user_regs();
+
+        regs->rax = v->arch.vm_event->gprs.rax;
+        regs->rbx = v->arch.vm_event->gprs.rbx;
+        regs->rcx = v->arch.vm_event->gprs.rcx;
+        regs->rdx = v->arch.vm_event->gprs.rdx;
+        regs->rsp = v->arch.vm_event->gprs.rsp;
+        regs->rbp = v->arch.vm_event->gprs.rbp;
+        regs->rsi = v->arch.vm_event->gprs.rsi;
+        regs->rdi = v->arch.vm_event->gprs.rdi;
+
+        regs->r8 = v->arch.vm_event->gprs.r8;
+        regs->r9 = v->arch.vm_event->gprs.r9;
+        regs->r10 = v->arch.vm_event->gprs.r10;
+        regs->r11 = v->arch.vm_event->gprs.r11;
+        regs->r12 = v->arch.vm_event->gprs.r12;
+        regs->r13 = v->arch.vm_event->gprs.r13;
+        regs->r14 = v->arch.vm_event->gprs.r14;
+        regs->r15 = v->arch.vm_event->gprs.r15;
+
+        regs->rflags = v->arch.vm_event->gprs.rflags;
+        regs->rip = v->arch.vm_event->gprs.rip;
+
+        v->arch.vm_event->set_gprs = 0;
+    }
+}
+
 void hvm_do_resume(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -487,6 +520,8 @@ void hvm_do_resume(struct vcpu *v)
     {
         struct monitor_write_data *w = &v->arch.vm_event->write_data;
 
+        hvm_vm_event_set_registers(v);
+
         if ( unlikely(v->arch.vm_event->emulate_flags) )
         {
             enum emul_kind kind = EMUL_KIND_NORMAL;
diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
index 502ccc2..f66780a 100644
--- a/xen/arch/x86/vm_event.c
+++ b/xen/arch/x86/vm_event.c
@@ -113,26 +113,8 @@ void vm_event_set_registers(struct vcpu *v, vm_event_response_t *rsp)
 {
     ASSERT(atomic_read(&v->vm_event_pause_count));
 
-    v->arch.user_regs.rax = rsp->data.regs.x86.rax;
-    v->arch.user_regs.rbx = rsp->data.regs.x86.rbx;
-    v->arch.user_regs.rcx = rsp->data.regs.x86.rcx;
-    v->arch.user_regs.rdx = rsp->data.regs.x86.rdx;
-    v->arch.user_regs.rsp = rsp->data.regs.x86.rsp;
-    v->arch.user_regs.rbp = rsp->data.regs.x86.rbp;
-    v->arch.user_regs.rsi = rsp->data.regs.x86.rsi;
-    v->arch.user_regs.rdi = rsp->data.regs.x86.rdi;
-
-    v->arch.user_regs.r8 = rsp->data.regs.x86.r8;
-    v->arch.user_regs.r9 = rsp->data.regs.x86.r9;
-    v->arch.user_regs.r10 = rsp->data.regs.x86.r10;
-    v->arch.user_regs.r11 = rsp->data.regs.x86.r11;
-    v->arch.user_regs.r12 = rsp->data.regs.x86.r12;
-    v->arch.user_regs.r13 = rsp->data.regs.x86.r13;
-    v->arch.user_regs.r14 = rsp->data.regs.x86.r14;
-    v->arch.user_regs.r15 = rsp->data.regs.x86.r15;
-
-    v->arch.user_regs.rflags = rsp->data.regs.x86.rflags;
-    v->arch.user_regs.rip = rsp->data.regs.x86.rip;
+    v->arch.vm_event->gprs = rsp->data.regs.x86;
+    v->arch.vm_event->set_gprs = 1;
 }
 
 void vm_event_monitor_next_interrupt(struct vcpu *v)
diff --git a/xen/common/vm_event.c b/xen/common/vm_event.c
index 0fe9a53..498749b 100644
--- a/xen/common/vm_event.c
+++ b/xen/common/vm_event.c
@@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
 {
     vm_event_response_t rsp;
 
+    /*
+     * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or
+     * EVTCHN_send from the introspection consumer. Both contexts are
+     * guaranteed not to be the subject of vm_event responses.
+     */
+    ASSERT(d != current->domain);
+
     /* Pull all responses off the ring. */
     while ( vm_event_get_response(d, ved, &rsp) )
     {
@@ -375,13 +382,6 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved)
         v = d->vcpu[rsp.vcpu_id];
 
         /*
-         * Make sure the vCPU state has been synchronized for the custom
-         * handlers.
-         */
-        if ( atomic_read(&v->vm_event_pause_count) )
-            sync_vcpu_execstate(v);
-
-        /*
          * In some cases the response type needs extra handling, so here
          * we call the appropriate handlers.
          */
diff --git a/xen/include/asm-x86/vm_event.h b/xen/include/asm-x86/vm_event.h
index ca73f99..f703170 100644
--- a/xen/include/asm-x86/vm_event.h
+++ b/xen/include/asm-x86/vm_event.h
@@ -32,6 +32,8 @@ struct arch_vm_event {
         struct vm_event_emul_insn_data insn;
     } emul;
     struct monitor_write_data write_data;
+    struct vm_event_regs_x86 gprs;
+    bool_t set_gprs;
 };
 
 int vm_event_init_domain(struct domain *d);