Message ID | 1558721577-13958-3-git-send-email-andrii.anisov@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v3] xen: introduce VCPUOP_register_runstate_phys_memory_area hypercall | expand |
>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote: > From: Andrii Anisov <andrii_anisov@epam.com> > > Existing interface to register runstate are with its virtual address > is prone to issues which became more obvious with KPTI enablement in > guests. The nature of those issues is the fact that the guest could > be interrupted by the hypervisor at any time, and there is no guarantee > to have the registered virtual address translated with the currently > available guest's page tables. Before the KPTI such a situation was > possible in case the guest is caught in the middle of PT processing > (e.g. superpage shattering). With the KPTI this happens also when the > guest runs userspace, so has a pretty high probability. Except when there's no need for KPTI in the guest in the first place, as is the case for x86-64 PV guests. I think this is worthwhile clarifying. > So it was agreed to register runstate with the guest's physical address > so that its mapping is permanent from the hypervisor point of view. [1] > > The hypercall employs the same vcpu_register_runstate_memory_area > structure for the interface, but requires a registered area to not > cross a page boundary. > > [1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html > > Signed-off-by: Andrii Anisov <andrii_anisov@epam.com> I would have really hoped that you would outline the intended interaction between both interfaces, such that while reviewing one can judge whether you're actually matching the goal. I think it is actually mandatory to make explicit in the public header what level of mixing is permitted, what is not permitted, and what mix of requests is simply undefined. In particular, while we did work out during prior discussions that some level of mixing has to be permitted, I'm unconvinced that arbitrary mixing has to be allowed. For example, switching from one model to another could be permitted only with just a single active vCPU. This might allow to do away again with the runstate_in_use field you add. > --- a/xen/arch/arm/domain.c > +++ b/xen/arch/arm/domain.c > @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n) > virt_timer_restore(n); > } > > -/* Update per-VCPU guest runstate shared memory area (if registered). */ > -static void update_runstate_area(struct vcpu *v) > +static void update_runstate_by_gvaddr(struct vcpu *v) > { > void __user *guest_handle = NULL; > > - if ( guest_handle_is_null(runstate_guest(v)) ) > - return; > + ASSERT(!guest_handle_is_null(runstate_guest_virt(v))); > > if ( VM_ASSIST(v->domain, runstate_update_flag) ) > { > - guest_handle = &v->runstate_guest.p->state_entry_time + 1; > + guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1; > guest_handle--; > v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > __raw_copy_to_guest(guest_handle, > @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v) > smp_wmb(); > } > > - __copy_to_guest(runstate_guest(v), &v->runstate, 1); > + __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1); In a prior version you did the mechanical part of adjusting the VA-based code in a prereq patch, aiding review. Is there a particular reason you folded everything into one patch now? > @@ -303,6 +301,53 @@ static void update_runstate_area(struct vcpu *v) > } > } > > +static void update_runstate_by_gpaddr(struct vcpu *v) > +{ > + struct vcpu_runstate_info *runstate = > + (struct vcpu_runstate_info *)v->runstate_guest.phys; What's the cast for here? > @@ -1614,6 +1613,92 @@ bool update_runstate_area(struct vcpu *v) > return rc; > } > > +static bool update_runstate_by_gpaddr_native(struct vcpu *v) > +{ > + struct vcpu_runstate_info *runstate = > + (struct vcpu_runstate_info *)v->runstate_guest.phys; > + > + ASSERT(runstate != NULL); > + > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + { > + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; > + smp_wmb(); > + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > + } > + > + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); > + > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + { > + runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE; > + smp_wmb(); > + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; > + } > + > + return true; > +} I can't help thinking that this matches the Arm code. Can common things please be put in common code? I may be asking too much, but if the pre-existing implementations are similar enough (I didn't check) perhaps they could be folded in a first step, too? > +static bool update_runstate_by_gpaddr_compat(struct vcpu *v) > +{ > + struct compat_vcpu_runstate_info *runstate = > + (struct compat_vcpu_runstate_info *)v->runstate_guest.phys; > + > + ASSERT(runstate != NULL); > + > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + { > + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; > + smp_wmb(); > + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > + } > + > + { > + struct compat_vcpu_runstate_info info; > + XLAT_vcpu_runstate_info(&info, &v->runstate); > + memcpy(v->runstate_guest.phys, &info, sizeof(info)); > + } > + else > + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); This "else" does not seem to be paired with an if(). Does this build at all? > --- a/xen/arch/x86/x86_64/domain.c > +++ b/xen/arch/x86/x86_64/domain.c > @@ -12,6 +12,8 @@ > CHECK_vcpu_get_physid; > #undef xen_vcpu_get_physid > > +extern void discard_runstate_area(struct vcpu *v); No, this is not allowed. The declaration must be visible to both consumer and producer, so that when either side is changed things won't build until the other side gets changed too. > @@ -35,8 +37,16 @@ arch_compat_vcpu_op( > !compat_handle_okay(area.addr.h, 1) ) > break; > > + while( xchg(&v->runstate_in_use, 1) == 0); At the very least such loops want a cpu_relax() in their bodies. But this being on a hypercall path - are there theoretical guarantees that a guest can't abuse this to lock up a CPU? Furthermore I don't understand how this is supposed to work in the first place. The first xchg() will store 1, no matter what. Thus on the second iteration you'll find the wanted value unless the other side stored 0. Yet the other side is a simple xchg() too. Hence its first attempt would fail, but its second attempt (which we didn't exit the critical region here yet) would succeed. Also - style. > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -698,6 +698,74 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) > return 0; > } > > +static void unmap_runstate_area(struct vcpu *v) > +{ > + mfn_t mfn; > + > + if ( ! v->runstate_guest.phys ) Stray blank after ! . > + return; > + > + mfn = domain_page_map_to_mfn(v->runstate_guest.phys); > + > + unmap_domain_page_global((void *) > + ((unsigned long)v->runstate_guest.phys & > + PAGE_MASK)); > + > + v->runstate_guest.phys = NULL; I think you would better store NULL before unmapping. > @@ -734,7 +802,10 @@ int domain_kill(struct domain *d) > if ( cpupool_move_domain(d, cpupool0) ) > return -ERESTART; > for_each_vcpu ( d, v ) > + { > + discard_runstate_area_locked(v); > unmap_vcpu_info(v); > + } What is the "locked" aspect here about? The guest itself is dead (i.e. fully paused) at this point, isn't it? And it won't ever be unpaused again. > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -163,17 +163,31 @@ struct vcpu > void *sched_priv; /* scheduler-specific data */ > > struct vcpu_runstate_info runstate; > + > + enum { > + RUNSTATE_NONE = 0, > + RUNSTATE_PADDR = 1, > + RUNSTATE_VADDR = 2, > + } runstate_guest_type; > + > + unsigned long runstate_in_use; Why "unsigned long"? Isn't a bool all you need? Also these would now all want to be grouped in a sub-structure named "runstate", rather than having "runstate_" prefixes. > + void* phys; Bad ordering between * and the blanks. There ought to be a blank ahead of the *, and I don't see why you need any after it. Jan
Hi Jan, On 07/06/2019 15:23, Jan Beulich wrote: >>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote: >> From: Andrii Anisov <andrii_anisov@epam.com> >> >> Existing interface to register runstate are with its virtual address >> is prone to issues which became more obvious with KPTI enablement in >> guests. The nature of those issues is the fact that the guest could >> be interrupted by the hypervisor at any time, and there is no guarantee >> to have the registered virtual address translated with the currently >> available guest's page tables. Before the KPTI such a situation was >> possible in case the guest is caught in the middle of PT processing >> (e.g. superpage shattering). With the KPTI this happens also when the >> guest runs userspace, so has a pretty high probability. > > Except when there's no need for KPTI in the guest in the first place, > as is the case for x86-64 PV guests. I think this is worthwhile clarifying. I am not sure what is your point here. At least on Arm, using virtual address is not safe at all (whether KPTI is used or not). A guest can genuinely decides to shatter the mapping where the virtual address is. On Arm, this require to use the break-before-make sequence. It means the translation VA -> PA may fail is you happen to do it while the guest is using the sequence. Some of the intermittent issues I have seen on the Arndale in the past [1] might be related to using virtual address. I am not 100% sure because even if the debug, the error does not make sense. But this is the most plausible reason for the failure. I want to discuss this in part of the bigger attempt to rework the hypercall ABI during Xen Summit in July. [...] >> @@ -35,8 +37,16 @@ arch_compat_vcpu_op( >> !compat_handle_okay(area.addr.h, 1) ) >> break; >> >> + while( xchg(&v->runstate_in_use, 1) == 0); > > At the very least such loops want a cpu_relax() in their bodies. > But this being on a hypercall path - are there theoretical guarantees > that a guest can't abuse this to lock up a CPU? Hmmm, I suggested this but it looks like a guest may call the hypercall multiple time from different vCPU. So this could be a way to delay work on the CPU. I wanted to make the context switch mostly lockless and therefore avoiding to introduce a spinlock. [1] https://lists.xen.org/archives/html/xen-devel/2017-11/msg00942.html
>>> On 10.06.19 at 13:44, <julien.grall@arm.com> wrote: > Hi Jan, > > On 07/06/2019 15:23, Jan Beulich wrote: >>>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote: >>> From: Andrii Anisov <andrii_anisov@epam.com> >>> >>> Existing interface to register runstate are with its virtual address >>> is prone to issues which became more obvious with KPTI enablement in >>> guests. The nature of those issues is the fact that the guest could >>> be interrupted by the hypervisor at any time, and there is no guarantee >>> to have the registered virtual address translated with the currently >>> available guest's page tables. Before the KPTI such a situation was >>> possible in case the guest is caught in the middle of PT processing >>> (e.g. superpage shattering). With the KPTI this happens also when the >>> guest runs userspace, so has a pretty high probability. >> >> Except when there's no need for KPTI in the guest in the first place, >> as is the case for x86-64 PV guests. I think this is worthwhile clarifying. > > I am not sure what is your point here. At least on Arm, using virtual address is > not safe at all (whether KPTI is used or not). A guest can genuinely decides to > shatter the mapping where the virtual address is. On Arm, this require to use > the break-before-make sequence. It means the translation VA -> PA may fail is > you happen to do it while the guest is using the sequence. > > Some of the intermittent issues I have seen on the Arndale in the past [1] might > be related to using virtual address. I am not 100% sure because even if the > debug, the error does not make sense. But this is the most plausible reason for > the failure. All fine, but Arm-specific. The point of my comment was to ask to call out that there is one environment (x86-64 PV) where this KPTI discussion is entirely inapplicable. >>> @@ -35,8 +37,16 @@ arch_compat_vcpu_op( >>> !compat_handle_okay(area.addr.h, 1) ) >>> break; >>> >>> + while( xchg(&v->runstate_in_use, 1) == 0); >> >> At the very least such loops want a cpu_relax() in their bodies. >> But this being on a hypercall path - are there theoretical guarantees >> that a guest can't abuse this to lock up a CPU? > Hmmm, I suggested this but it looks like a guest may call the hypercall multiple > time from different vCPU. So this could be a way to delay work on the CPU. > > I wanted to make the context switch mostly lockless and therefore avoiding to > introduce a spinlock. Well, constructs like the above are trying to mimic a spinlock without actually using a spinlock. There are extremely rare situation in which this may indeed be warranted, but here it falls in the common "makes things worse overall" bucket, I think. To not unduly penalize the actual update paths, I think using a r/w lock would be appropriate here. Jan
Hello Jan, On 11.06.19 12:10, Jan Beulich wrote: >>> Except when there's no need for KPTI in the guest in the first place, >>> as is the case for x86-64 PV guests. I think this is worthwhile clarifying. >> >> I am not sure what is your point here. At least on Arm, using virtual address is >> not safe at all (whether KPTI is used or not). A guest can genuinely decides to >> shatter the mapping where the virtual address is. On Arm, this require to use >> the break-before-make sequence. It means the translation VA -> PA may fail is >> you happen to do it while the guest is using the sequence. >> >> Some of the intermittent issues I have seen on the Arndale in the past [1] might >> be related to using virtual address. I am not 100% sure because even if the >> debug, the error does not make sense. But this is the most plausible reason for >> the failure. > > All fine, but Arm-specific. The point of my comment was to ask to call > out that there is one environment (x86-64 PV) where this KPTI > discussion is entirely inapplicable. I admit that x86 specifics are quite unclear to me so clarifications and corrections in that domain are desirable. > >>>> @@ -35,8 +37,16 @@ arch_compat_vcpu_op( >>>> !compat_handle_okay(area.addr.h, 1) ) >>>> break; >>>> >>>> + while( xchg(&v->runstate_in_use, 1) == 0); >>> >>> At the very least such loops want a cpu_relax() in their bodies. >>> But this being on a hypercall path - are there theoretical guarantees >>> that a guest can't abuse this to lock up a CPU? >> Hmmm, I suggested this but it looks like a guest may call the hypercall multiple >> time from different vCPU. So this could be a way to delay work on the CPU. Julien, I'm not sure I understand how work on (p?)CPU could be delayed. We are here with interrupts enabled, so here guest would just spend his own vcpu time budget. On the runstate update there is a kind-of trylock, so we would not hang there either. So what the problem? >> I wanted to make the context switch mostly lockless and therefore avoiding to >> introduce a spinlock.> > Well, constructs like the above are trying to mimic a spinlock > without actually using a spinlock. There are extremely rare > situation in which this may indeed be warranted, but here it > falls in the common "makes things worse overall" bucket, I > think. To not unduly penalize the actual update paths, I think > using a r/w lock would be appropriate here. Firstly I did not think r/w lock specifics are needed here. We have only one reader path - runstate update on vcpu scheduling. And this path can have only one instance at the time. But it seems to be more efficient than the spinlock for the case we are not locking.
Hi, On 11/06/2019 11:22, Andrii Anisov wrote: > On 11.06.19 12:10, Jan Beulich wrote: >>>>> @@ -35,8 +37,16 @@ arch_compat_vcpu_op( >>>>> !compat_handle_okay(area.addr.h, 1) ) >>>>> break; >>>>> + while( xchg(&v->runstate_in_use, 1) == 0); >>>> >>>> At the very least such loops want a cpu_relax() in their bodies. >>>> But this being on a hypercall path - are there theoretical guarantees >>>> that a guest can't abuse this to lock up a CPU? >>> Hmmm, I suggested this but it looks like a guest may call the hypercall multiple >>> time from different vCPU. So this could be a way to delay work on the CPU. > > Julien, I'm not sure I understand how work on (p?)CPU could be delayed. We are > here with interrupts enabled, so here guest would just spend his own vcpu time > budget. Xen only supports only voluntary preemption. This means that if the hypercall run for a long time, you have no way to interrupt it and schedule another vCPU. In other words, the rest of the work on that specific pCPU would be delayed. In this context, I think Jan is seeking guarantees that the lock can be taken in timely manner. If not, then we want to use a try-lock style. Cheers,
On 11.06.19 15:12, Julien Grall wrote: >> Julien, I'm not sure I understand how work on (p?)CPU could be delayed. We are here with interrupts enabled, so here guest would just spend his own vcpu time budget. > > Xen only supports only voluntary preemption. Oh, really? Let me look into it a bit closer.
On 11/06/2019 13:26, Andrii Anisov wrote: > > > On 11.06.19 15:12, Julien Grall wrote: >>> Julien, I'm not sure I understand how work on (p?)CPU could be delayed. We >>> are here with interrupts enabled, so here guest would just spend his own vcpu >>> time budget. >> >> Xen only supports only voluntary preemption. > > Oh, really? Let me look into it a bit closer. Did you expect Xen to be fully preemptible? The function to do the scheduling is schedule(). This is either call from a softirq or directly in a few places (e.g wait()). The only place in this path where do_softirq() will be called is on return to the guest (see leave_hypervisor_tail). Cheers,
On 11.06.19 15:32, Julien Grall wrote: > Did you expect Xen to be fully preemptible? > > The function to do the scheduling is schedule(). This is either call from a softirq or directly in a few places (e.g wait()). > > The only place in this path where do_softirq() will be called is on return to the guest (see leave_hypervisor_tail). Right you are! I forgot that bit. We will not pass through it on serving timer irq interrupted us in hyp mode.
Hello Jan, On 07.06.19 17:23, Jan Beulich wrote: >>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote: >> From: Andrii Anisov <andrii_anisov@epam.com> >> >> Existing interface to register runstate are with its virtual address >> is prone to issues which became more obvious with KPTI enablement in >> guests. The nature of those issues is the fact that the guest could >> be interrupted by the hypervisor at any time, and there is no guarantee >> to have the registered virtual address translated with the currently >> available guest's page tables. Before the KPTI such a situation was >> possible in case the guest is caught in the middle of PT processing >> (e.g. superpage shattering). With the KPTI this happens also when the >> guest runs userspace, so has a pretty high probability. > > Except when there's no need for KPTI in the guest in the first place, > as is the case for x86-64 PV guests. I think this is worthwhile clarifying. > >> So it was agreed to register runstate with the guest's physical address >> so that its mapping is permanent from the hypervisor point of view. [1] >> >> The hypercall employs the same vcpu_register_runstate_memory_area >> structure for the interface, but requires a registered area to not >> cross a page boundary. >> >> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html >> >> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com> > > I would have really hoped that you would outline the intended interaction > between both interfaces, I'd suppose no specific interaction between interfaces. I hardly imagine realistic use-cases where this might be needed. > such that while reviewing one can judge whether > you're actually matching the goal. I think it is actually mandatory to make > explicit in the public header what level of mixing is permitted, what is not > permitted, and what mix of requests is simply undefined.> > In particular, while we did work out during prior discussions that some > level of mixing has to be permitted, I'm unconvinced that arbitrary > mixing has to be allowed. For example, switching from one model to > another could be permitted only with just a single active vCPU. This > might allow to do away again with the runstate_in_use field you add. Well, my point here is to left it as is, maybe add more documentation. If one likes shooting his leg, we should only care about avoiding ricochet harms hypervisor or other guests. If you disagree, please suggest your interaction model, I'll be happy to implement it. > >> --- a/xen/arch/arm/domain.c >> +++ b/xen/arch/arm/domain.c >> @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n) >> virt_timer_restore(n); >> } >> >> -/* Update per-VCPU guest runstate shared memory area (if registered). */ >> -static void update_runstate_area(struct vcpu *v) >> +static void update_runstate_by_gvaddr(struct vcpu *v) >> { >> void __user *guest_handle = NULL; >> >> - if ( guest_handle_is_null(runstate_guest(v)) ) >> - return; >> + ASSERT(!guest_handle_is_null(runstate_guest_virt(v))); >> >> if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> { >> - guest_handle = &v->runstate_guest.p->state_entry_time + 1; >> + guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1; >> guest_handle--; >> v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >> __raw_copy_to_guest(guest_handle, >> @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v) >> smp_wmb(); >> } >> >> - __copy_to_guest(runstate_guest(v), &v->runstate, 1); >> + __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1); > > In a prior version you did the mechanical part of adjusting the VA-based > code in a prereq patch, aiding review. Is there a particular reason you > folded everything into one patch now? I silently followed suggestion from George [1]. Any objections? > >> @@ -303,6 +301,53 @@ static void update_runstate_area(struct vcpu *v) >> } >> } >> >> +static void update_runstate_by_gpaddr(struct vcpu *v) >> +{ >> + struct vcpu_runstate_info *runstate = >> + (struct vcpu_runstate_info *)v->runstate_guest.phys; > > What's the cast for here? > >> @@ -1614,6 +1613,92 @@ bool update_runstate_area(struct vcpu *v) >> return rc; >> } >> >> +static bool update_runstate_by_gpaddr_native(struct vcpu *v) >> +{ >> + struct vcpu_runstate_info *runstate = >> + (struct vcpu_runstate_info *)v->runstate_guest.phys; >> + >> + ASSERT(runstate != NULL); >> + >> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + { >> + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; >> + smp_wmb(); >> + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >> + } >> + >> + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); >> + >> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + { >> + runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE; >> + smp_wmb(); >> + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; >> + } >> + >> + return true; >> +} > > I can't help thinking that this matches the Arm code. Can common things > please be put in common code? I may be asking too much, but if the > pre-existing implementations are similar enough (I didn't check) perhaps > they could be folded in a first step, too? The problem thought to be the interface: on x86 update_runstate_area() returns bool, but on ARM it is void. But for the common function update_runstate_by_gpaddr_~native~() it would not be a problem, because we will return proper bool from the refactored update_runstate_area(). Thank you for the point. > >> +static bool update_runstate_by_gpaddr_compat(struct vcpu *v) >> +{ >> + struct compat_vcpu_runstate_info *runstate = >> + (struct compat_vcpu_runstate_info *)v->runstate_guest.phys; >> + >> + ASSERT(runstate != NULL); >> + >> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + { >> + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; >> + smp_wmb(); >> + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >> + } >> + >> + { >> + struct compat_vcpu_runstate_info info; >> + XLAT_vcpu_runstate_info(&info, &v->runstate); >> + memcpy(v->runstate_guest.phys, &info, sizeof(info)); >> + } >> + else >> + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); > > This "else" does not seem to be paired with an if(). Does this build > at all? This particular - not! And it is really strange, I remember I checked patch compilation for x86. Looking through git reflog to realize at what amend it became broken. But also I do not completely understand the meaning of "_compat" and if it should be supported here? > >> --- a/xen/arch/x86/x86_64/domain.c >> +++ b/xen/arch/x86/x86_64/domain.c >> @@ -12,6 +12,8 @@ >> CHECK_vcpu_get_physid; >> #undef xen_vcpu_get_physid >> >> +extern void discard_runstate_area(struct vcpu *v); > > No, this is not allowed. The declaration must be visible to both consumer > and producer, so that when either side is changed things won't build until > the other side gets changed too. > >> @@ -35,8 +37,16 @@ arch_compat_vcpu_op( >> !compat_handle_okay(area.addr.h, 1) ) >> break; >> >> + while( xchg(&v->runstate_in_use, 1) == 0); > > At the very least such loops want a cpu_relax() in their bodies. > But this being on a hypercall path - are there theoretical guarantees > that a guest can't abuse this to lock up a CPU? > > Furthermore I don't understand how this is supposed to work in > the first place. The first xchg() will store 1, no matter what. Thus > on the second iteration you'll find the wanted value unless the > other side stored 0. Yet the other side is a simple xchg() too. > Hence its first attempt would fail, but its second attempt (which > we didn't exit the critical region here yet) would succeed. > > Also - style. Will leave this part for locking issue discussion. > >> --- a/xen/common/domain.c >> +++ b/xen/common/domain.c >> @@ -698,6 +698,74 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) >> return 0; >> } >> >> +static void unmap_runstate_area(struct vcpu *v) >> +{ >> + mfn_t mfn; >> + >> + if ( ! v->runstate_guest.phys ) > > Stray blank after ! . OK. > >> + return; >> + >> + mfn = domain_page_map_to_mfn(v->runstate_guest.phys); >> + >> + unmap_domain_page_global((void *) >> + ((unsigned long)v->runstate_guest.phys & >> + PAGE_MASK)); >> + >> + v->runstate_guest.phys = NULL; > > I think you would better store NULL before unmapping. Do you mean using local variable to pass address to unmap_domain_page_global(), and setting v->runstate_guest.phys to NULL prior to unmap? > >> @@ -734,7 +802,10 @@ int domain_kill(struct domain *d) >> if ( cpupool_move_domain(d, cpupool0) ) >> return -ERESTART; >> for_each_vcpu ( d, v ) >> + { >> + discard_runstate_area_locked(v); >> unmap_vcpu_info(v); >> + } > > What is the "locked" aspect here about? The guest itself is dead (i.e. > fully paused) at this point, isn't it? And it won't ever be unpaused > again. You are right. All vcpus here are stopped. We can discard runstate area without locking. > >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -163,17 +163,31 @@ struct vcpu >> void *sched_priv; /* scheduler-specific data */ >> >> struct vcpu_runstate_info runstate; >> + >> + enum { >> + RUNSTATE_NONE = 0, >> + RUNSTATE_PADDR = 1, >> + RUNSTATE_VADDR = 2, >> + } runstate_guest_type; >> + >> + unsigned long runstate_in_use; > > Why "unsigned long"? Isn't a bool all you need? Bool should be enough. But it seems we will have a lock here. > Also these would now all want to be grouped in a sub-structure named > "runstate", rather than having "runstate_" prefixes. Member `runstate` has already a type of `struct vcpu_runstate_info` which is an interface type. `runstate_guest` is a union. I'd not like moving `runstate_guest` union into another substructure. Because we would have long lines like `v->struct.runstate_guest.virt.p->state_entry_time`. > >> + void* phys; > > Bad ordering between * and the blanks. There ought to be a blank > ahead of the *, and I don't see why you need any after it. OK. [1] https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg00599.html
Sorry, Missed one comment On 07.06.19 17:23, Jan Beulich wrote: >> +static void update_runstate_by_gpaddr(struct vcpu *v) >> +{ >> + struct vcpu_runstate_info *runstate = >> + (struct vcpu_runstate_info *)v->runstate_guest.phys; > > What's the cast for here? Seems to be not needed.
>>> On 11.06.19 at 18:09, <andrii.anisov@gmail.com> wrote: > On 07.06.19 17:23, Jan Beulich wrote: >>>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote: >>> From: Andrii Anisov <andrii_anisov@epam.com> >>> >>> Existing interface to register runstate are with its virtual address >>> is prone to issues which became more obvious with KPTI enablement in >>> guests. The nature of those issues is the fact that the guest could >>> be interrupted by the hypervisor at any time, and there is no guarantee >>> to have the registered virtual address translated with the currently >>> available guest's page tables. Before the KPTI such a situation was >>> possible in case the guest is caught in the middle of PT processing >>> (e.g. superpage shattering). With the KPTI this happens also when the >>> guest runs userspace, so has a pretty high probability. >> >> Except when there's no need for KPTI in the guest in the first place, >> as is the case for x86-64 PV guests. I think this is worthwhile clarifying. >> >>> So it was agreed to register runstate with the guest's physical address >>> so that its mapping is permanent from the hypervisor point of view. [1] >>> >>> The hypercall employs the same vcpu_register_runstate_memory_area >>> structure for the interface, but requires a registered area to not >>> cross a page boundary. >>> >>> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html >>> >>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com> >> >> I would have really hoped that you would outline the intended interaction >> between both interfaces, > > I'd suppose no specific interaction between interfaces. I hardly imagine > realistic use-cases where this might be needed. > >> such that while reviewing one can judge whether >> you're actually matching the goal. I think it is actually mandatory to make >> explicit in the public header what level of mixing is permitted, what is not >> permitted, and what mix of requests is simply undefined.> >> In particular, while we did work out during prior discussions that some >> level of mixing has to be permitted, I'm unconvinced that arbitrary >> mixing has to be allowed. For example, switching from one model to >> another could be permitted only with just a single active vCPU. This >> might allow to do away again with the runstate_in_use field you add. > > Well, my point here is to left it as is, maybe add more documentation. If > one likes shooting his leg, we should only care about avoiding ricochet harms > hypervisor or other guests. > If you disagree, please suggest your interaction model, I'll be happy to > implement it. Well, if "mix as you like" is fine for guests to follow, then okay. But we need to be _really_ certain there's no issue with this. Relaxing the interface is always possible, while tightening an interface is almost always at least a problem, if possible at all. >>> --- a/xen/arch/arm/domain.c >>> +++ b/xen/arch/arm/domain.c >>> @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n) >>> virt_timer_restore(n); >>> } >>> >>> -/* Update per-VCPU guest runstate shared memory area (if registered). */ >>> -static void update_runstate_area(struct vcpu *v) >>> +static void update_runstate_by_gvaddr(struct vcpu *v) >>> { >>> void __user *guest_handle = NULL; >>> >>> - if ( guest_handle_is_null(runstate_guest(v)) ) >>> - return; >>> + ASSERT(!guest_handle_is_null(runstate_guest_virt(v))); >>> >>> if ( VM_ASSIST(v->domain, runstate_update_flag) ) >>> { >>> - guest_handle = &v->runstate_guest.p->state_entry_time + 1; >>> + guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1; >>> guest_handle--; >>> v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >>> __raw_copy_to_guest(guest_handle, >>> @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v) >>> smp_wmb(); >>> } >>> >>> - __copy_to_guest(runstate_guest(v), &v->runstate, 1); >>> + __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1); >> >> In a prior version you did the mechanical part of adjusting the VA-based >> code in a prereq patch, aiding review. Is there a particular reason you >> folded everything into one patch now? > > I silently followed suggestion from George [1]. Any objections? Hmm, I can't read this into George's suggestion. Aiui he did suggest not to split the definition of the new interface from its implementation. But that doesn't necessarily mean to squash _everything_ in one patch. >>> +static bool update_runstate_by_gpaddr_compat(struct vcpu *v) >>> +{ >>> + struct compat_vcpu_runstate_info *runstate = >>> + (struct compat_vcpu_runstate_info *)v->runstate_guest.phys; >>> + >>> + ASSERT(runstate != NULL); >>> + >>> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >>> + { >>> + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; >>> + smp_wmb(); >>> + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >>> + } >>> + >>> + { >>> + struct compat_vcpu_runstate_info info; >>> + XLAT_vcpu_runstate_info(&info, &v->runstate); >>> + memcpy(v->runstate_guest.phys, &info, sizeof(info)); >>> + } >>> + else >>> + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); >> >> This "else" does not seem to be paired with an if(). Does this build >> at all? > > This particular - not! > And it is really strange, I remember I checked patch compilation for x86. > Looking through git reflog to realize at what amend it became broken. > But also I do not completely understand the meaning of "_compat" and if it > should be supported here? Well, I'm afraid I don't understand what you're after. Of course compat mode guests need to continue to be supported, and the new interface would also better be available to them. And it is a fact that their runstate area layout differs from that of 64-bit guests. >>> + mfn = domain_page_map_to_mfn(v->runstate_guest.phys); >>> + >>> + unmap_domain_page_global((void *) >>> + ((unsigned long)v->runstate_guest.phys & >>> + PAGE_MASK)); >>> + >>> + v->runstate_guest.phys = NULL; >> >> I think you would better store NULL before unmapping. > > Do you mean using local variable to pass address to > unmap_domain_page_global(), and setting v->runstate_guest.phys to NULL prior > to unmap? Yes. >>> --- a/xen/include/xen/sched.h >>> +++ b/xen/include/xen/sched.h >>> @@ -163,17 +163,31 @@ struct vcpu >>> void *sched_priv; /* scheduler-specific data */ >>> >>> struct vcpu_runstate_info runstate; >>> + >>> + enum { >>> + RUNSTATE_NONE = 0, >>> + RUNSTATE_PADDR = 1, >>> + RUNSTATE_VADDR = 2, >>> + } runstate_guest_type; >>> + >>> + unsigned long runstate_in_use; >> >> Why "unsigned long"? Isn't a bool all you need? > > Bool should be enough. But it seems we will have a lock here. > >> Also these would now all want to be grouped in a sub-structure named >> "runstate", rather than having "runstate_" prefixes. > > Member `runstate` has already a type of `struct vcpu_runstate_info` which is > an interface type. > `runstate_guest` is a union. I'd not like moving `runstate_guest` union into > another substructure. Because we would have long lines like > `v->struct.runstate_guest.virt.p->state_entry_time`. You didn't get my point then: What I'm after is struct { struct vcpu_runstate_info info; enum { RUNSTATE_NONE, RUNSTATE_PADDR, RUNSTATE_VADDR, } guest_type; bool in_use; } runstate; (and of course the transformation to runstate.info broken out into a separate prerreq patch). Jan
Jan, On 12.06.19 10:27, Jan Beulich wrote: >> Well, my point here is to left it as is, maybe add more documentation. If >> one likes shooting his leg, we should only care about avoiding ricochet harms >> hypervisor or other guests. >> If you disagree, please suggest your interaction model, I'll be happy to >> implement it. > > Well, if "mix as you like" is fine for guests to follow, then okay. But > we need to be _really_ certain there's no issue with this. I'm not aware about potential problems from the guest side. Do you have any ideas about it? > Relaxing > the interface is always possible, while tightening an interface is > almost always at least a problem, if possible at all. True. >>> In a prior version you did the mechanical part of adjusting the VA-based >>> code in a prereq patch, aiding review. Is there a particular reason you >>> folded everything into one patch now? >> >> I silently followed suggestion from George [1]. Any objections? > > Hmm, I can't read this into George's suggestion. Aiui he did suggest > not to split the definition of the new interface from its implementation. > But that doesn't necessarily mean to squash _everything_ in one > patch. OK. It looks that what you said firstly is closer to V1 of this stuff. Will keep this in mind for the next version. > Well, I'm afraid I don't understand what you're after. Of course > compat mode guests need to continue to be supported, and the > new interface would also better be available to them. And it is > a fact that their runstate area layout differs from that of 64-bit > guests. OK. >>>> --- a/xen/include/xen/sched.h >>>> +++ b/xen/include/xen/sched.h >>>> @@ -163,17 +163,31 @@ struct vcpu >>>> void *sched_priv; /* scheduler-specific data */ >>>> >>>> struct vcpu_runstate_info runstate; >>>> + >>>> + enum { >>>> + RUNSTATE_NONE = 0, >>>> + RUNSTATE_PADDR = 1, >>>> + RUNSTATE_VADDR = 2, >>>> + } runstate_guest_type; >>>> + >>>> + unsigned long runstate_in_use; >>> >>> Why "unsigned long"? Isn't a bool all you need? >> >> Bool should be enough. But it seems we will have a lock here. >> >>> Also these would now all want to be grouped in a sub-structure named >>> "runstate", rather than having "runstate_" prefixes. >> >> Member `runstate` has already a type of `struct vcpu_runstate_info` which is >> an interface type. >> `runstate_guest` is a union. I'd not like moving `runstate_guest` union into >> another substructure. Because we would have long lines like >> `v->struct.runstate_guest.virt.p->state_entry_time`. > > You didn't get my point then: What I'm after is > > struct { > struct vcpu_runstate_info info; > enum { > RUNSTATE_NONE, > RUNSTATE_PADDR, > RUNSTATE_VADDR, > } guest_type; > bool in_use; > } runstate; Did you miss runstate_guest as a member of that struct?
Jan, On 11.06.19 13:22, Andrii Anisov wrote: > Hello Jan, > > On 11.06.19 12:10, Jan Beulich wrote: >>>> Except when there's no need for KPTI in the guest in the first place, >>>> as is the case for x86-64 PV guests. I think this is worthwhile clarifying. >>> >>> I am not sure what is your point here. At least on Arm, using virtual address is >>> not safe at all (whether KPTI is used or not). A guest can genuinely decides to >>> shatter the mapping where the virtual address is. On Arm, this require to use >>> the break-before-make sequence. It means the translation VA -> PA may fail is >>> you happen to do it while the guest is using the sequence. >>> >>> Some of the intermittent issues I have seen on the Arndale in the past [1] might >>> be related to using virtual address. I am not 100% sure because even if the >>> debug, the error does not make sense. But this is the most plausible reason for >>> the failure. >> >> All fine, but Arm-specific. The point of my comment was to ask to call >> out that there is one environment (x86-64 PV) where this KPTI >> discussion is entirely inapplicable. > > I admit that x86 specifics are quite unclear to me so clarifications and corrections in that domain are desirable. Could you please elaborate more about this? Do you mean that more words should be added to the commit message about x86? If so, please provide what is proper from your point of view.
Jan, Julien, On 11.06.19 12:10, Jan Beulich wrote: >>> At the very least such loops want a cpu_relax() in their bodies. >>> But this being on a hypercall path - are there theoretical guarantees >>> that a guest can't abuse this to lock up a CPU? >> Hmmm, I suggested this but it looks like a guest may call the hypercall multiple >> time from different vCPU. So this could be a way to delay work on the CPU. >> >> I wanted to make the context switch mostly lockless and therefore avoiding to >> introduce a spinlock. > > Well, constructs like the above are trying to mimic a spinlock > without actually using a spinlock. There are extremely rare > situation in which this may indeed be warranted, but here it > falls in the common "makes things worse overall" bucket, I > think. To not unduly penalize the actual update paths, I think > using a r/w lock would be appropriate here. So what is the conclusion here? Should we go with trylock and hypercall_create_continuation() in order to avoid locking but still not fail to the guest?
>>> On 13.06.19 at 14:17, <andrii.anisov@gmail.com> wrote: > On 12.06.19 10:27, Jan Beulich wrote: >>> Well, my point here is to left it as is, maybe add more documentation. If >>> one likes shooting his leg, we should only care about avoiding ricochet > harms >>> hypervisor or other guests. >>> If you disagree, please suggest your interaction model, I'll be happy to >>> implement it. >> >> Well, if "mix as you like" is fine for guests to follow, then okay. But >> we need to be _really_ certain there's no issue with this. > > I'm not aware about potential problems from the guest side. Do you have any > ideas about it? I didn't spend time trying to figure something out, but ... >> Relaxing >> the interface is always possible, while tightening an interface is >> almost always at least a problem, if possible at all. > > True. ... you agreeing here suggests someone should, and this would better not (only) be the reviewer(s). >>>>> --- a/xen/include/xen/sched.h >>>>> +++ b/xen/include/xen/sched.h >>>>> @@ -163,17 +163,31 @@ struct vcpu >>>>> void *sched_priv; /* scheduler-specific data */ >>>>> >>>>> struct vcpu_runstate_info runstate; >>>>> + >>>>> + enum { >>>>> + RUNSTATE_NONE = 0, >>>>> + RUNSTATE_PADDR = 1, >>>>> + RUNSTATE_VADDR = 2, >>>>> + } runstate_guest_type; >>>>> + >>>>> + unsigned long runstate_in_use; >>>> >>>> Why "unsigned long"? Isn't a bool all you need? >>> >>> Bool should be enough. But it seems we will have a lock here. >>> >>>> Also these would now all want to be grouped in a sub-structure named >>>> "runstate", rather than having "runstate_" prefixes. >>> >>> Member `runstate` has already a type of `struct vcpu_runstate_info` which is >>> an interface type. >>> `runstate_guest` is a union. I'd not like moving `runstate_guest` union into >>> another substructure. Because we would have long lines like >>> `v->struct.runstate_guest.virt.p->state_entry_time`. >> >> You didn't get my point then: What I'm after is >> >> struct { >> struct vcpu_runstate_info info; >> enum { >> RUNSTATE_NONE, >> RUNSTATE_PADDR, >> RUNSTATE_VADDR, >> } guest_type; >> bool in_use; >> } runstate; > > Did you miss runstate_guest as a member of that struct? Quite possible. I've outlined it only anyway, for you to get the idea. Jan
>>> On 13.06.19 at 14:21, <andrii.anisov@gmail.com> wrote: > On 11.06.19 13:22, Andrii Anisov wrote: >> On 11.06.19 12:10, Jan Beulich wrote: >>>>> Except when there's no need for KPTI in the guest in the first place, >>>>> as is the case for x86-64 PV guests. I think this is worthwhile clarifying. >>>> >>>> I am not sure what is your point here. At least on Arm, using virtual address is >>>> not safe at all (whether KPTI is used or not). A guest can genuinely decides to >>>> shatter the mapping where the virtual address is. On Arm, this require to use >>>> the break-before-make sequence. It means the translation VA -> PA may fail is >>>> you happen to do it while the guest is using the sequence. >>>> >>>> Some of the intermittent issues I have seen on the Arndale in the past [1] might >>>> be related to using virtual address. I am not 100% sure because even if the >>>> debug, the error does not make sense. But this is the most plausible reason for >>>> the failure. >>> >>> All fine, but Arm-specific. The point of my comment was to ask to call >>> out that there is one environment (x86-64 PV) where this KPTI >>> discussion is entirely inapplicable. >> >> I admit that x86 specifics are quite unclear to me so clarifications and > corrections in that domain are desirable. > > Could you please elaborate more about this? > Do you mean that more words should be added to the commit message about x86? > If so, please provide what is proper from your point of view. I still think my initial response (still visible in context) was sufficient. All I'm after is that you slightly soften your bold statement in the description (no longer visible in context). Jan
>>> On 13.06.19 at 14:32, <andrii.anisov@gmail.com> wrote: > Jan, Julien, > > On 11.06.19 12:10, Jan Beulich wrote: >>>> At the very least such loops want a cpu_relax() in their bodies. >>>> But this being on a hypercall path - are there theoretical guarantees >>>> that a guest can't abuse this to lock up a CPU? >>> Hmmm, I suggested this but it looks like a guest may call the hypercall > multiple >>> time from different vCPU. So this could be a way to delay work on the CPU. >>> >>> I wanted to make the context switch mostly lockless and therefore avoiding > to >>> introduce a spinlock. >> >> Well, constructs like the above are trying to mimic a spinlock >> without actually using a spinlock. There are extremely rare >> situation in which this may indeed be warranted, but here it >> falls in the common "makes things worse overall" bucket, I >> think. To not unduly penalize the actual update paths, I think >> using a r/w lock would be appropriate here. > > So what is the conclusion here? Should we go with trylock and > hypercall_create_continuation() in order to avoid locking but still not fail > to the guest? I'm not convinced a "trylock" approach is needed - that's something Julien suggested. I'm pretty sure we're acquiring other locks in hypercall context without going the trylock route. I am convinced though that the pseudo-lock you've used needs to be replaced by a real (and perhaps r/w) one, _if_ there is any need for locking in the first place. Jan
Hi Jan, On 13/06/2019 13:41, Jan Beulich wrote: >>>> On 13.06.19 at 14:32, <andrii.anisov@gmail.com> wrote: >> Jan, Julien, >> >> On 11.06.19 12:10, Jan Beulich wrote: >>>>> At the very least such loops want a cpu_relax() in their bodies. >>>>> But this being on a hypercall path - are there theoretical guarantees >>>>> that a guest can't abuse this to lock up a CPU? >>>> Hmmm, I suggested this but it looks like a guest may call the hypercall >> multiple >>>> time from different vCPU. So this could be a way to delay work on the CPU. >>>> >>>> I wanted to make the context switch mostly lockless and therefore avoiding >> to >>>> introduce a spinlock. >>> >>> Well, constructs like the above are trying to mimic a spinlock >>> without actually using a spinlock. There are extremely rare >>> situation in which this may indeed be warranted, but here it >>> falls in the common "makes things worse overall" bucket, I >>> think. To not unduly penalize the actual update paths, I think >>> using a r/w lock would be appropriate here. >> >> So what is the conclusion here? Should we go with trylock and >> hypercall_create_continuation() in order to avoid locking but still not fail >> to the guest? > > I'm not convinced a "trylock" approach is needed - that's > something Julien suggested. I think the trylock in the context switch is a must. Otherwise you would delay context switch if the information get updated. > I'm pretty sure we're acquiring other > locks in hypercall context without going the trylock route. I am > convinced though that the pseudo-lock you've used needs to be > replaced by a real (and perhaps r/w) one, _if_ there is any need > for locking in the first place. You were the one asking for theoretical guarantees that a guest can't abuse this to lock up a CPU. There are no way to guarantee that as multiple vCPUs could call the hypercall and take the same lock potentially delaying significantly the work. Regarding the need of the lock, I still can't see how you can make it safe without it as you may have concurrent call. Feel free to suggest a way. Cheers,
>>> On 13.06.19 at 14:48, <julien.grall@arm.com> wrote: > Hi Jan, > > On 13/06/2019 13:41, Jan Beulich wrote: >>>>> On 13.06.19 at 14:32, <andrii.anisov@gmail.com> wrote: >>> Jan, Julien, >>> >>> On 11.06.19 12:10, Jan Beulich wrote: >>>>>> At the very least such loops want a cpu_relax() in their bodies. >>>>>> But this being on a hypercall path - are there theoretical guarantees >>>>>> that a guest can't abuse this to lock up a CPU? >>>>> Hmmm, I suggested this but it looks like a guest may call the hypercall >>> multiple >>>>> time from different vCPU. So this could be a way to delay work on the CPU. >>>>> >>>>> I wanted to make the context switch mostly lockless and therefore avoiding >>> to >>>>> introduce a spinlock. >>>> >>>> Well, constructs like the above are trying to mimic a spinlock >>>> without actually using a spinlock. There are extremely rare >>>> situation in which this may indeed be warranted, but here it >>>> falls in the common "makes things worse overall" bucket, I >>>> think. To not unduly penalize the actual update paths, I think >>>> using a r/w lock would be appropriate here. >>> >>> So what is the conclusion here? Should we go with trylock and >>> hypercall_create_continuation() in order to avoid locking but still not fail >>> to the guest? >> >> I'm not convinced a "trylock" approach is needed - that's >> something Julien suggested. > > I think the trylock in the context switch is a must. Otherwise you would delay > context switch if the information get updated. Delay in what way? I.e. how would this be an issue other than for the guest itself (which shouldn't be constantly updating the address for the region)? >> I'm pretty sure we're acquiring other >> locks in hypercall context without going the trylock route. I am >> convinced though that the pseudo-lock you've used needs to be >> replaced by a real (and perhaps r/w) one, _if_ there is any need >> for locking in the first place. > > You were the one asking for theoretical guarantees that a guest can't abuse this > to lock up a CPU. There are no way to guarantee that as multiple vCPUs could > call the hypercall and take the same lock potentially delaying significantly the > work. Well, I may have gone a little too far with my original response. It just was so odd to see this pseudo lock used. > Regarding the need of the lock, I still can't see how you can make it safe > without it as you may have concurrent call. > > Feel free to suggest a way. Well, if none can be found, then fine. I don't have the time or interest here to try and think about a lockless approach; it just doesn't _feel_ like this ought to strictly require use of a lock. This gut feeling of mine may well be wrong. Jan
Hi Jan, On 13/06/2019 13:58, Jan Beulich wrote: >>>> On 13.06.19 at 14:48, <julien.grall@arm.com> wrote: >> Hi Jan, >> >> On 13/06/2019 13:41, Jan Beulich wrote: >>>>>> On 13.06.19 at 14:32, <andrii.anisov@gmail.com> wrote: >>>> Jan, Julien, >>>> >>>> On 11.06.19 12:10, Jan Beulich wrote: >>>>>>> At the very least such loops want a cpu_relax() in their bodies. >>>>>>> But this being on a hypercall path - are there theoretical guarantees >>>>>>> that a guest can't abuse this to lock up a CPU? >>>>>> Hmmm, I suggested this but it looks like a guest may call the hypercall >>>> multiple >>>>>> time from different vCPU. So this could be a way to delay work on the CPU. >>>>>> >>>>>> I wanted to make the context switch mostly lockless and therefore avoiding >>>> to >>>>>> introduce a spinlock. >>>>> >>>>> Well, constructs like the above are trying to mimic a spinlock >>>>> without actually using a spinlock. There are extremely rare >>>>> situation in which this may indeed be warranted, but here it >>>>> falls in the common "makes things worse overall" bucket, I >>>>> think. To not unduly penalize the actual update paths, I think >>>>> using a r/w lock would be appropriate here. >>>> >>>> So what is the conclusion here? Should we go with trylock and >>>> hypercall_create_continuation() in order to avoid locking but still not fail >>>> to the guest? >>> >>> I'm not convinced a "trylock" approach is needed - that's >>> something Julien suggested. >> >> I think the trylock in the context switch is a must. Otherwise you would delay >> context switch if the information get updated. > > Delay in what way? I.e. how would this be an issue other than for > the guest itself (which shouldn't be constantly updating the > address for the region)? Why would it only be an issue with the guest itself? Any wait on lock in Xen implies that you can't schedule another vCPU as we are not preemptible. As the lock is taken in the context switch, I am worry that a guest continuously trying to call the hypercall and therefore use the lock may actually delay the end of the context switch. And therefore delay the rest of the work. I suggested the trylock here, so the context switch could avoid updating the runstate if we are in the hypercall. > >>> I'm pretty sure we're acquiring other >>> locks in hypercall context without going the trylock route. I am >>> convinced though that the pseudo-lock you've used needs to be >>> replaced by a real (and perhaps r/w) one, _if_ there is any need >>> for locking in the first place. >> >> You were the one asking for theoretical guarantees that a guest can't abuse this >> to lock up a CPU. There are no way to guarantee that as multiple vCPUs could >> call the hypercall and take the same lock potentially delaying significantly the >> work. > > Well, I may have gone a little too far with my original response. It > just was so odd to see this pseudo lock used. > >> Regarding the need of the lock, I still can't see how you can make it safe >> without it as you may have concurrent call. >> >> Feel free to suggest a way. > > Well, if none can be found, then fine. I don't have the time or interest > here to try and think about a lockless approach; it just doesn't _feel_ > like this ought to strictly require use of a lock. This gut feeling of mine > may well be wrong. I am not asking you to spend a lot of time on it. But if you have a gut feeling this can be done, then a little help would be extremely useful... Otherwise, I will consider that the lock is the best way to go. Cheers,
>>> On 13.06.19 at 15:14, <julien.grall@arm.com> wrote: > On 13/06/2019 13:58, Jan Beulich wrote: >>>>> On 13.06.19 at 14:48, <julien.grall@arm.com> wrote: >>> On 13/06/2019 13:41, Jan Beulich wrote: >>>>>>> On 13.06.19 at 14:32, <andrii.anisov@gmail.com> wrote: >>>>> On 11.06.19 12:10, Jan Beulich wrote: >>>>>>>> At the very least such loops want a cpu_relax() in their bodies. >>>>>>>> But this being on a hypercall path - are there theoretical guarantees >>>>>>>> that a guest can't abuse this to lock up a CPU? >>>>>>> Hmmm, I suggested this but it looks like a guest may call the hypercall >>>>> multiple >>>>>>> time from different vCPU. So this could be a way to delay work on the CPU. >>>>>>> >>>>>>> I wanted to make the context switch mostly lockless and therefore avoiding >>>>> to >>>>>>> introduce a spinlock. >>>>>> >>>>>> Well, constructs like the above are trying to mimic a spinlock >>>>>> without actually using a spinlock. There are extremely rare >>>>>> situation in which this may indeed be warranted, but here it >>>>>> falls in the common "makes things worse overall" bucket, I >>>>>> think. To not unduly penalize the actual update paths, I think >>>>>> using a r/w lock would be appropriate here. >>>>> >>>>> So what is the conclusion here? Should we go with trylock and >>>>> hypercall_create_continuation() in order to avoid locking but still not fail >>>>> to the guest? >>>> >>>> I'm not convinced a "trylock" approach is needed - that's >>>> something Julien suggested. >>> >>> I think the trylock in the context switch is a must. Otherwise you would delay >>> context switch if the information get updated. >> >> Delay in what way? I.e. how would this be an issue other than for >> the guest itself (which shouldn't be constantly updating the >> address for the region)? > > Why would it only be an issue with the guest itself? Any wait on lock in Xen > implies that you can't schedule another vCPU as we are not preemptible. For one I initially (wrongly) understood you want the trylock in the hypercall handler. And then, for context switch, wasting the target (i.e. being switched in) vCPU's time slice is not an issue here. Of course if there's a chance that acquiring the lock could require more than a full time slice, then yes, some try-lock-ery may be needed. However, ... >>> Regarding the need of the lock, I still can't see how you can make it safe >>> without it as you may have concurrent call. >>> >>> Feel free to suggest a way. >> >> Well, if none can be found, then fine. I don't have the time or interest >> here to try and think about a lockless approach; it just doesn't _feel_ >> like this ought to strictly require use of a lock. This gut feeling of mine >> may well be wrong. > > I am not asking you to spend a lot of time on it. But if you have a gut feeling > this can be done, then a little help would be extremely useful... ... I thought I had already outlined a model: Allow cross-vCPU updates only while the target vCPU is still offline. Once online, a vCPU can only itself update its runstate area address. I think you can get away without any locks in this case; there may be a corner case with a vCPU being onlined right at that point in time, so there may need to be a more strict condition (like "only one online vCPU" instead of "the target vCPU is offline"). Jan
Hi Jan, On 13/06/2019 14:40, Jan Beulich wrote: >>>> On 13.06.19 at 15:14, <julien.grall@arm.com> wrote: >> I am not asking you to spend a lot of time on it. But if you have a gut feeling >> this can be done, then a little help would be extremely useful... > > ... I thought I had already outlined a model: Allow cross-vCPU updates > only while the target vCPU is still offline. Once online, a vCPU can only > itself update its runstate area address. I think you can get away > without any locks in this case; there may be a corner case with a vCPU > being onlined right at that point in time, so there may need to be a more > strict condition (like "only one online vCPU" instead of "the target vCPU > is offline"). Sorry I may have missed it. We can't really restrict the usage of the current hypercall (it is pretty lax). So I think any lockless solution would require to allow the hypercall to be used together (which I want to avoid). If we agree to allow the two hypercalls to be used together, then if we protect the update with domain_lock() then you should be able to avoid any race with the update path as onlining a vCPU requires to take the domain_lock() (see do_vcpu_op for x86 and do_common_cpu_on for Arm). Cheers,
Hello Julien, On 13.06.19 17:41, Julien Grall wrote: > Sorry I may have missed it. We can't really restrict the usage of the current hypercall (it is pretty lax). So I think any lockless solution would require to allow the hypercall > to be used together (which I want to avoid). I'd better say here allowing using phys and virt registered runstates together (and independently). And me personally for this approach, for sure not encouraging users (guests) to do so. > If we agree to allow the two hypercalls to be used together, then if we protect the update with domain_lock() then you should be able to avoid any race with the update path as onlining a vCPU requires to take the domain_lock() (see do_vcpu_op for x86 and do_common_cpu_on for Arm). Could you please clarify are you saying about protection runstate mapping update or runstate values update? I'd not use the common lock for all vcpus runstate value update, unless it is an rw lock or doing trylock on it. BTW, I'm a bit confused, are you OK with lock (not trylock) existing in hypercall?
On 14/06/2019 15:36, Andrii Anisov wrote: > Hello Julien, Hi, > On 13.06.19 17:41, Julien Grall wrote: >> Sorry I may have missed it. We can't really restrict the usage of the current >> hypercall (it is pretty lax). So I think any lockless solution would require >> to allow the hypercall >> to be used together (which I want to avoid). > > I'd better say here allowing using phys and virt registered runstates together > (and independently). > And me personally for this approach, for sure not encouraging users (guests) to > do so. Why? What are the benefits for a guest to use the two interface together? After all they have exactly the same data... > >> If we agree to allow the two hypercalls to be used together, then if we >> protect the update with domain_lock() then you should be able to avoid any >> race with the update path as onlining a vCPU requires to take the >> domain_lock() (see do_vcpu_op for x86 and do_common_cpu_on for Arm). > > Could you please clarify are you saying about protection runstate mapping update > or runstate values update? runstate mapping. > BTW, I'm a bit confused, are you OK with lock (not trylock) existing in hypercall? This is still in discussion. Cheers,
On 14.06.19 17:39, Julien Grall wrote: > Why? What are the benefits for a guest to use the two interface together? I do not say the guest has to use both interfaces simultaneously. It is logically odd, doing so will only reflect in increasing of hypervisor overhead. But such an implementation will have a simpler code, which expected to be (a bit) faster. So the code simplicity would be a benefit for us. Lower hypervisor overhead is a benefit for sane guests, which use only one interface. BTW, dropping the old interface implementation will be much easier in future if it will not clash with the new one. > After all they have exactly the same data... Yes, but normal guests should use only one interface. >> BTW, I'm a bit confused, are you OK with lock (not trylock) existing in hypercall? > > This is still in discussion. I see. So I'll think about the continuation implementation in meanwhile.
Hi Andrii, On 14/06/2019 16:11, Andrii Anisov wrote: > > > On 14.06.19 17:39, Julien Grall wrote: >> Why? What are the benefits for a guest to use the two interface together? > > I do not say the guest has to use both interfaces simultaneously. It is > logically odd, doing so will only reflect in increasing of hypervisor overhead. > But such an implementation will have a simpler code, which expected to be (a > bit) faster. > So the code simplicity would be a benefit for us. Lower hypervisor overhead is a > benefit for sane guests, which use only one interface. I hope you are aware that speaking about speed here is quite irrelevant. The difference would be clear lost in the noise of the rest of the context switch. But, if you allow something, then most likely someone will use it. However, you have to differentiate implementation vs documentation. In this case, I don't think the implementation should dictate what is going to be exposed. If you document that it can't happen, then you have room to forbid the mix in the future (assuming this can't be done now). In other word, the more lax is the interface, the more difficult it is tighten in the future. I am not going to push for an implementation that forbid the mix. But I am strongly going to push for any documentation of the expected interaction. So we don't make our life miserable later on. > > BTW, dropping the old interface implementation will be much easier in future if > it will not clash with the new one. I am afraid we will never be able to remove the old interface. > >> After all they have exactly the same data... > > Yes, but normal guests should use only one interface. See above. Cheers,
>>> On 14.06.19 at 17:11, <andrii.anisov@gmail.com> wrote: > On 14.06.19 17:39, Julien Grall wrote: >> After all they have exactly the same data... > > Yes, but normal guests should use only one interface. I thought it had been clarified already that normal guests can very well use both interfaces, just not both at the same time: Boot loader and OS could disagree in this regard, as the prime example. Jan
On 14.06.19 18:24, Julien Grall wrote: > I hope you are aware that speaking about speed here is quite irrelevant. The difference would be clear lost in the noise of the rest of the context switch. Mmm... I have that understanding. Yet I'd rather try to not increase the noise, if not reduce. BTW, I'll remember that to you on your next conditional branch removal ;) > But, if you allow something, then most likely someone will use it. However, you have to differentiate implementation vs documentation. > > In this case, I don't think the implementation should dictate what is going to be exposed. > > If you document that it can't happen, then you have room to forbid the mix in the future (assuming this can't be done now). > > In other word, the more lax is the interface, the more difficult it is tighten in the future. > > I am not going to push for an implementation that forbid the mix. But I am strongly going to push for any documentation of the expected interaction. So we don't make our life miserable later on. I do not encourage using both interfaces simultaneously, it is pointless. If you are saying that this matter could be solved with the appropriate documentation, it's OK with me. >> BTW, dropping the old interface implementation will be much easier in future if it will not clash with the new one. > I am afraid we will never be able to remove the old interface. Maybe.
On 14/06/2019 17:11, Andrii Anisov wrote: > On 14.06.19 18:24, Julien Grall wrote: >> But, if you allow something, then most likely someone will use it. However, >> you have to differentiate implementation vs documentation. >> >> In this case, I don't think the implementation should dictate what is going to >> be exposed. >> >> If you document that it can't happen, then you have room to forbid the mix in >> the future (assuming this can't be done now). >> >> In other word, the more lax is the interface, the more difficult it is tighten >> in the future. >> >> I am not going to push for an implementation that forbid the mix. But I am >> strongly going to push for any documentation of the expected interaction. So >> we don't make our life miserable later on. > > I do not encourage using both interfaces simultaneously, it is pointless. > If you are saying that this matter could be solved with the appropriate > documentation, it's OK with me. > >>> BTW, dropping the old interface implementation will be much easier in future >>> if it will not clash with the new one. >> I am afraid we will never be able to remove the old interface. > > Maybe. Well, that a stable ABI... Even if I would love to remove it, you can't get rid of old guests that easily... Cheers, >
Hello Jan, On 14.06.19 18:42, Jan Beulich wrote: >>>> On 14.06.19 at 17:11, <andrii.anisov@gmail.com> wrote: >> On 14.06.19 17:39, Julien Grall wrote: >>> After all they have exactly the same data... >> >> Yes, but normal guests should use only one interface. > > I thought it had been clarified already that normal guests can very > well use both interfaces, just not both at the same time: Boot loader > and OS could disagree in this regard, as the prime example. I missed "at the same time". We may require existing runstate area unregistering if the system is aware of it. But it is for the new interface. The old one has no documentation about the unregistering. The implicit way is known to us, but not known to users. How to solve the clash?
On 14.06.19 19:20, Julien Grall wrote:
> Well, that a stable ABI... Even if I would love to remove it, you can't get rid of old guests that easily...
In 5 years, as XEN did for LK 3.18?
>>> On 14.06.19 at 18:25, <andrii.anisov@gmail.com> wrote: > On 14.06.19 19:20, Julien Grall wrote: >> Well, that a stable ABI... Even if I would love to remove it, you can't get > rid of old guests that easily... > > In 5 years, as XEN did for LK 3.18? I'm afraid I don't even know how to best word a reply to this. What has the Linux kernel version got to do with the interfaces provided by Xen? Is your reply meant to say that something was removed from the Linux kernel in that version? That's fine - consumers can stop consuming interfaces they don't like to use. But Xen (the producer of interfaces like the one discussed here) can't lightly stop providing interfaces, irrespective of their use in upstream Linux. Jan
>>> On 14.06.19 at 18:23, <andrii.anisov@gmail.com> wrote: > On 14.06.19 18:42, Jan Beulich wrote: >>>>> On 14.06.19 at 17:11, <andrii.anisov@gmail.com> wrote: >>> On 14.06.19 17:39, Julien Grall wrote: >>>> After all they have exactly the same data... >>> >>> Yes, but normal guests should use only one interface. >> >> I thought it had been clarified already that normal guests can very >> well use both interfaces, just not both at the same time: Boot loader >> and OS could disagree in this regard, as the prime example. > > I missed "at the same time". > > We may require existing runstate area unregistering if the system is aware > of it. But it is for the new interface. > The old one has no documentation about the unregistering. The implicit way > is known to us, but not known to users. > How to solve the clash? And once again I'm not sure what to answer, considering that I've already outlined a possible model (without any explicit unregistration). Jan
Hello Jan, On 17.06.19 09:28, Jan Beulich wrote: >> We may require existing runstate area unregistering if the system is aware >> of it. But it is for the new interface. >> The old one has no documentation about the unregistering. The implicit way >> is known to us, but not known to users. >> How to solve the clash? > > And once again I'm not sure what to answer, considering that I've > already outlined a possible model (without any explicit unregistration). Just to be sure, "the model" you are talking about is following: > I thought it had been clarified already that normal guests can very > well use both interfaces, just not both at the same time: Boot loader > and OS could disagree in this regard, as the prime example. Is it correct? But with the current interface (VA) that model is already broken without unregistration. On change between entities with different VA spaces the hypervisor definitely has a chance to spoil the new VA space at the old address. IMHO it should be fixed (at least documented) for the old interface.
>>> On 18.06.19 at 17:32, <andrii.anisov@gmail.com> wrote: > On 17.06.19 09:28, Jan Beulich wrote: >>> We may require existing runstate area unregistering if the system is aware >>> of it. But it is for the new interface. >>> The old one has no documentation about the unregistering. The implicit way >>> is known to us, but not known to users. >>> How to solve the clash? >> >> And once again I'm not sure what to answer, considering that I've >> already outlined a possible model (without any explicit unregistration). > > Just to be sure, "the model" you are talking about is following: > >> I thought it had been clarified already that normal guests can very >> well use both interfaces, just not both at the same time: Boot loader >> and OS could disagree in this regard, as the prime example. > > Is it correct? Not really - what you quote is a statement, not the outline of a model. The basic idea for enforcement of a restriction was to allow switching modes only when just one vCPU is online in a guest. > But with the current interface (VA) that model is already broken without > unregistration. On change between entities with different VA spaces the > hypervisor definitely has a chance to spoil the new VA space at the old address. > IMHO it should be fixed (at least documented) for the old interface. Document - sure, feel free. Fix - I don't see how you would do this. Every component handing control onto another one would be in charge on its own. That's not under our control. Jan
diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index ff330b3..ecedf1c 100644 --- a/xen/arch/arm/domain.c +++ b/xen/arch/arm/domain.c @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n) virt_timer_restore(n); } -/* Update per-VCPU guest runstate shared memory area (if registered). */ -static void update_runstate_area(struct vcpu *v) +static void update_runstate_by_gvaddr(struct vcpu *v) { void __user *guest_handle = NULL; - if ( guest_handle_is_null(runstate_guest(v)) ) - return; + ASSERT(!guest_handle_is_null(runstate_guest_virt(v))); if ( VM_ASSIST(v->domain, runstate_update_flag) ) { - guest_handle = &v->runstate_guest.p->state_entry_time + 1; + guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1; guest_handle--; v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; __raw_copy_to_guest(guest_handle, @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v) smp_wmb(); } - __copy_to_guest(runstate_guest(v), &v->runstate, 1); + __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1); if ( guest_handle ) { @@ -303,6 +301,53 @@ static void update_runstate_area(struct vcpu *v) } } +static void update_runstate_by_gpaddr(struct vcpu *v) +{ + struct vcpu_runstate_info *runstate = + (struct vcpu_runstate_info *)v->runstate_guest.phys; + + ASSERT(runstate != NULL); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; + } + + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; + } +} + +/* Update per-VCPU guest runstate shared memory area (if registered). */ +static void update_runstate_area(struct vcpu *v) +{ + if ( xchg(&v->runstate_in_use, 1) ) + return; + + switch ( v->runstate_guest_type ) + { + case RUNSTATE_NONE: + break; + + case RUNSTATE_VADDR: + update_runstate_by_gvaddr(v); + break; + + case RUNSTATE_PADDR: + update_runstate_by_gpaddr(v); + break; + } + + xchg(&v->runstate_in_use, 0); +} + static void schedule_tail(struct vcpu *prev) { ASSERT(prev != current); @@ -998,6 +1043,7 @@ long do_arm_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) a { case VCPUOP_register_vcpu_info: case VCPUOP_register_runstate_memory_area: + case VCPUOP_register_runstate_phys_memory_area: return do_vcpu_op(cmd, vcpuid, arg); default: return -EINVAL; diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index ac960dd..fe71776 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -1566,22 +1566,21 @@ void paravirt_ctxt_switch_to(struct vcpu *v) } /* Update per-VCPU guest runstate shared memory area (if registered). */ -bool update_runstate_area(struct vcpu *v) +static bool update_runstate_by_gvaddr(struct vcpu *v) { bool rc; struct guest_memory_policy policy = { .nested_guest_mode = false }; void __user *guest_handle = NULL; - if ( guest_handle_is_null(runstate_guest(v)) ) - return true; + ASSERT(!guest_handle_is_null(runstate_guest_virt(v))); update_guest_memory_policy(v, &policy); if ( VM_ASSIST(v->domain, runstate_update_flag) ) { guest_handle = has_32bit_shinfo(v->domain) - ? &v->runstate_guest.compat.p->state_entry_time + 1 - : &v->runstate_guest.native.p->state_entry_time + 1; + ? &v->runstate_guest.virt.compat.p->state_entry_time + 1 + : &v->runstate_guest.virt.native.p->state_entry_time + 1; guest_handle--; v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; __raw_copy_to_guest(guest_handle, @@ -1594,11 +1593,11 @@ bool update_runstate_area(struct vcpu *v) struct compat_vcpu_runstate_info info; XLAT_vcpu_runstate_info(&info, &v->runstate); - __copy_to_guest(v->runstate_guest.compat, &info, 1); + __copy_to_guest(v->runstate_guest.virt.compat, &info, 1); rc = true; } else - rc = __copy_to_guest(runstate_guest(v), &v->runstate, 1) != + rc = __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1) != sizeof(v->runstate); if ( guest_handle ) @@ -1614,6 +1613,92 @@ bool update_runstate_area(struct vcpu *v) return rc; } +static bool update_runstate_by_gpaddr_native(struct vcpu *v) +{ + struct vcpu_runstate_info *runstate = + (struct vcpu_runstate_info *)v->runstate_guest.phys; + + ASSERT(runstate != NULL); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; + } + + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; + } + + return true; +} + +static bool update_runstate_by_gpaddr_compat(struct vcpu *v) +{ + struct compat_vcpu_runstate_info *runstate = + (struct compat_vcpu_runstate_info *)v->runstate_guest.phys; + + ASSERT(runstate != NULL); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time |= XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; + } + + { + struct compat_vcpu_runstate_info info; + XLAT_vcpu_runstate_info(&info, &v->runstate); + memcpy(v->runstate_guest.phys, &info, sizeof(info)); + } + else + memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate)); + + if ( VM_ASSIST(v->domain, runstate_update_flag) ) + { + runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE; + smp_wmb(); + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; + } + + return true; +} + +bool update_runstate_area(struct vcpu *v) +{ + bool rc = true; + + if ( xchg(&v->runstate_in_use, 1) ) + return rc; + + switch ( v->runstate_guest_type ) + { + case RUNSTATE_NONE: + break; + + case RUNSTATE_VADDR: + rc = update_runstate_by_gvaddr(v); + break; + + case RUNSTATE_PADDR: + if ( has_32bit_shinfo(v->domain) ) + rc = update_runstate_by_gpaddr_compat(v); + else + rc = update_runstate_by_gpaddr_native(v); + break; + } + + xchg(&v->runstate_in_use, 0); + return rc; +} + static void _update_runstate_area(struct vcpu *v) { if ( !update_runstate_area(v) && is_pv_vcpu(v) && diff --git a/xen/arch/x86/x86_64/domain.c b/xen/arch/x86/x86_64/domain.c index c46dccc..85d0072 100644 --- a/xen/arch/x86/x86_64/domain.c +++ b/xen/arch/x86/x86_64/domain.c @@ -12,6 +12,8 @@ CHECK_vcpu_get_physid; #undef xen_vcpu_get_physid +extern void discard_runstate_area(struct vcpu *v); + int arch_compat_vcpu_op( int cmd, struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg) @@ -35,8 +37,16 @@ arch_compat_vcpu_op( !compat_handle_okay(area.addr.h, 1) ) break; + while( xchg(&v->runstate_in_use, 1) == 0); + + discard_runstate_area(v); + rc = 0; - guest_from_compat_handle(v->runstate_guest.compat, area.addr.h); + + guest_from_compat_handle(v->runstate_guest.virt.compat, + area.addr.h); + + v->runstate_guest_type = RUNSTATE_VADDR; if ( v == current ) { @@ -49,7 +59,9 @@ arch_compat_vcpu_op( vcpu_runstate_get(v, &runstate); XLAT_vcpu_runstate_info(&info, &runstate); } - __copy_to_guest(v->runstate_guest.compat, &info, 1); + __copy_to_guest(v->runstate_guest.virt.compat, &info, 1); + + xchg(&v->runstate_in_use, 0); break; } diff --git a/xen/common/domain.c b/xen/common/domain.c index 90c6607..d276b87 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -698,6 +698,74 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) return 0; } +static void unmap_runstate_area(struct vcpu *v) +{ + mfn_t mfn; + + if ( ! v->runstate_guest.phys ) + return; + + mfn = domain_page_map_to_mfn(v->runstate_guest.phys); + + unmap_domain_page_global((void *) + ((unsigned long)v->runstate_guest.phys & + PAGE_MASK)); + + v->runstate_guest.phys = NULL; + put_page_and_type(mfn_to_page(mfn)); +} + +static int map_runstate_area(struct vcpu *v, + struct vcpu_register_runstate_memory_area *area) +{ + unsigned long offset = area->addr.p & ~PAGE_MASK; + gfn_t gfn = gaddr_to_gfn(area->addr.p); + struct domain *d = v->domain; + void *mapping; + struct page_info *page; + size_t size = sizeof(struct vcpu_runstate_info); + + if ( offset > (PAGE_SIZE - size) ) + return -EINVAL; + + page = get_page_from_gfn(d, gfn_x(gfn), NULL, P2M_ALLOC); + if ( !page ) + return -EINVAL; + + if ( !get_page_type(page, PGT_writable_page) ) + { + put_page(page); + return -EINVAL; + } + + mapping = __map_domain_page_global(page); + + if ( mapping == NULL ) + { + put_page_and_type(page); + return -ENOMEM; + } + + v->runstate_guest.phys = mapping + offset; + + return 0; +} + +void discard_runstate_area(struct vcpu *v) +{ + if ( v->runstate_guest_type == RUNSTATE_PADDR ) + unmap_runstate_area(v); + + v->runstate_guest_type = RUNSTATE_NONE; +} + +static void discard_runstate_area_locked(struct vcpu *v) +{ + while ( xchg(&v->runstate_in_use, 1) ); + discard_runstate_area(v); + xchg(&v->runstate_in_use, 0); +} + int domain_kill(struct domain *d) { int rc = 0; @@ -734,7 +802,10 @@ int domain_kill(struct domain *d) if ( cpupool_move_domain(d, cpupool0) ) return -ERESTART; for_each_vcpu ( d, v ) + { + discard_runstate_area_locked(v); unmap_vcpu_info(v); + } d->is_dying = DOMDYING_dead; /* Mem event cleanup has to go here because the rings * have to be put before we call put_domain. */ @@ -1188,7 +1259,7 @@ int domain_soft_reset(struct domain *d) for_each_vcpu ( d, v ) { - set_xen_guest_handle(runstate_guest(v), NULL); + discard_runstate_area_locked(v); unmap_vcpu_info(v); } @@ -1518,18 +1589,50 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg) break; rc = 0; - runstate_guest(v) = area.addr.h; - if ( v == current ) - { - __copy_to_guest(runstate_guest(v), &v->runstate, 1); - } - else + while( xchg(&v->runstate_in_use, 1) == 0); + + discard_runstate_area(v); + + if ( !guest_handle_is_null(runstate_guest_virt(v)) ) { - vcpu_runstate_get(v, &runstate); - __copy_to_guest(runstate_guest(v), &runstate, 1); + runstate_guest_virt(v) = area.addr.h; + v->runstate_guest_type = RUNSTATE_VADDR; + + if ( v == current ) + { + __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1); + } + else + { + vcpu_runstate_get(v, &runstate); + __copy_to_guest(runstate_guest_virt(v), &runstate, 1); + } } + xchg(&v->runstate_in_use, 0); + + break; + } + + case VCPUOP_register_runstate_phys_memory_area: + { + struct vcpu_register_runstate_memory_area area; + + rc = -EFAULT; + if ( copy_from_guest(&area, arg, 1) ) + break; + + while( xchg(&v->runstate_in_use, 1) == 0); + + discard_runstate_area(v); + + rc = map_runstate_area(v, &area); + if ( !rc ) + v->runstate_guest_type = RUNSTATE_PADDR; + + xchg(&v->runstate_in_use, 0); + break; } diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h index 3623af9..d7da4a3 100644 --- a/xen/include/public/vcpu.h +++ b/xen/include/public/vcpu.h @@ -235,6 +235,21 @@ struct vcpu_register_time_memory_area { typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t; DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t); +/* + * Register a shared memory area from which the guest may obtain its own + * runstate information without needing to execute a hypercall. + * Notes: + * 1. The registered address must be guest's physical address. + * 2. The registered runstate area should not cross page boundary. + * 3. Only one shared area may be registered per VCPU. The shared area is + * updated by the hypervisor each time the VCPU is scheduled. Thus + * runstate.state will always be RUNSTATE_running and + * runstate.state_entry_time will indicate the system time at which the + * VCPU was last scheduled to run. + * @extra_arg == pointer to vcpu_register_runstate_memory_area structure. + */ +#define VCPUOP_register_runstate_phys_memory_area 14 + #endif /* __XEN_PUBLIC_VCPU_H__ */ /* diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 2201fac..6c8de8f 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -163,17 +163,31 @@ struct vcpu void *sched_priv; /* scheduler-specific data */ struct vcpu_runstate_info runstate; + + enum { + RUNSTATE_NONE = 0, + RUNSTATE_PADDR = 1, + RUNSTATE_VADDR = 2, + } runstate_guest_type; + + unsigned long runstate_in_use; + + union + { #ifndef CONFIG_COMPAT -# define runstate_guest(v) ((v)->runstate_guest) - XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */ +# define runstate_guest_virt(v) ((v)->runstate_guest.virt) + XEN_GUEST_HANDLE(vcpu_runstate_info_t) virt; /* guest address */ #else -# define runstate_guest(v) ((v)->runstate_guest.native) - union { - XEN_GUEST_HANDLE(vcpu_runstate_info_t) native; - XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat; - } runstate_guest; /* guest address */ +# define runstate_guest_virt(v) ((v)->runstate_guest.virt.native) + union { + XEN_GUEST_HANDLE(vcpu_runstate_info_t) native; + XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat; + } virt; /* guest address */ #endif + void* phys; + } runstate_guest; + /* last time when vCPU is scheduled out */ uint64_t last_run_time;