Message ID | f92f62bf-2f8d-34db-4be5-d3e6a4b9d580@suse.com (mailing list archive)
State      | New, archived
Series     | x86/HVM: refine when to send mapcache invalidation request to qemu
On 28.09.2020 13:25, Paul Durrant wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 28 September 2020 11:44
>>
>> For one it was wrong to send the request only upon a completed
>> hypercall: Even if only part of it completed before getting preempted,
>> invalidation ought to occur. Therefore fold the two return statements.
>>
>> And then XENMEM_decrease_reservation isn't the only means by which pages
>> can get removed from a guest, yet all removals ought to be signaled to
>> qemu. Put setting of the flag into the central p2m_remove_page()
>> underlying all respective hypercalls.
>
> Actually, does this catch all the cases? Would it be better to push the
> check into p2m_set_entry() where we displace a ram entry? (I think it
> would be possible, for instance, for map_mmio_regions() to directly
> displace ram).

I did consider this, but my fear of going too far (and adding perhaps
many false positives) was too large. For the specific example of
map_mmio_regions() I'd expect it's the device model anyway which calls
it, and it hence shouldn't be in need of telling. But yes, if there are
multiple emulators, others than the one doing the change may need
knowing.

But perhaps PoD is in even more immediate need of this, when
transitioning a GFN from RAM to PoD, especially outside of the context
of p2m_pod_decrease_reservation()? In this case things happen behind
the guest's back, i.e. in particular outside of a hypercall, and hence
there would again be the problem of from where to properly send the
request. IOW I'm struggling to ascertain whether putting another
instance of the sending onto the tail of hvm_hap_nested_page_fault()
would work correctly, or what the conditions are for this to be done
without interfering with other requests which may be pending.

>> Plus finally there's no point sending the request for the local domain
>> when the domain acted upon is a different one. If anything that domain's
>> qemu's mapcache may need invalidating, but it's unclear how useful this
>> would be: That remote domain may not execute hypercalls at all, and
>> hence may never make it to the point where the request actually gets
>> issued. I guess the assumption is that such manipulation is not supposed
>> to happen anymore once the guest has been started?
>
> Either that or the code was based on a false assumption that it was
> always occurring on the current domain.

When the code was added, I think this assumption was reasonable. There
was not even a remote idea at that point of a HVM(-like) domain
managing other domains. It was later on when it was missed to update
this logic.

> I also wonder whether the test-and-clear in hvm_hypercall() is really
> the right thing to do. E.g. if two vcpus decrease reservation at the
> same time, only one would end up waiting on the send_invalidate_req().

Hmm, yes, the flag wants to be per-vCPU. Which will make remote
manipulation (if such was permitted to occur) even more awkward.

> Also, the pages will have been freed back to heap before invalidate
> is sent so I guess it is possible for a domain to use its qemu
> process to attack pages that may have been re-assigned?

Not for a PV domain - there page refs will be held by the mappings in
the dm domain(s). For PVH (which isn't supported yet for anything other
than being an ordinary DomU) this would indeed be a problem, for there
not being any similar ref-counting when inserting into / removing from
the p2m.

Jan
I came across the same issue when QEMU was holding an extra reference
to a page removed from the p2m, via XENMEM_add_to_physmap. Please tell
me if I am talking nonsense, since my knowledge around QEMU
invalidation is limited.

On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
> For one it was wrong to send the request only upon a completed
> hypercall: Even if only part of it completed before getting
> preempted, invalidation ought to occur. Therefore fold the two
> return statements.
>
> And then XENMEM_decrease_reservation isn't the only means by which
> pages can get removed from a guest, yet all removals ought to be
> signaled to qemu. Put setting of the flag into the central
> p2m_remove_page() underlying all respective hypercalls.

Sounds good. I know this still does not catch everything, but it is at
least a nice improvement over before.

> Plus finally there's no point sending the request for the local
> domain when the domain acted upon is a different one. If anything
> that domain's qemu's mapcache may need invalidating, but it's unclear
> how useful this would be: That remote domain may not execute
> hypercalls at all, and hence may never make it to the point where the
> request actually gets issued. I guess the assumption is that such
> manipulation is not supposed to happen anymore once the guest has
> been started?

I may still want to set the invalidation signal to true even if the
domain acted on is not the local domain. I know the remote domain may
never reach the point to issue the invalidate, but it sounds to me
that the problem is not whether we should set the signal but whether
we can change where the signal is checked to make sure the point of
issue can be reliably triggered, and the latter can be done in a
future patch.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Putting the check in guest_physmap_remove_page() might also suffice,
> but then a separate is_hvm_domain() would need adding again.
>
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -31,7 +31,6 @@
>
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
> -    const struct vcpu *curr = current;
>      long rc;
>
>      switch ( cmd & MEMOP_CMD_MASK )
> @@ -41,14 +40,11 @@ static long hvm_memory_op(int cmd, XEN_G
>          return -ENOSYS;
>      }
>
> -    if ( !curr->hcall_compat )
> +    if ( !current->hcall_compat )
>          rc = do_memory_op(cmd, arg);
>      else
>          rc = compat_memory_op(cmd, arg);
>
> -    if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
> -        curr->domain->arch.hvm.qemu_mapcache_invalidate = true;
> -
>      return rc;
>  }
>
> @@ -326,14 +322,11 @@ int hvm_hypercall(struct cpu_user_regs *
>
>      HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
>
> -    if ( curr->hcall_preempted )
> -        return HVM_HCALL_preempted;
> -
>      if ( unlikely(currd->arch.hvm.qemu_mapcache_invalidate) &&
>           test_and_clear_bool(currd->arch.hvm.qemu_mapcache_invalidate) )
>          send_invalidate_req();
>
> -    return HVM_HCALL_completed;
> +    return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
>  }
>
>  /*
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -812,6 +812,9 @@ p2m_remove_page(struct p2m_domain *p2m,
>          }
>      }
>
> +    if ( p2m->domain == current->domain )
> +        p2m->domain->arch.hvm.qemu_mapcache_invalidate = true;
> +

So my comment above makes me want to remove the condition.

>      return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
>                           p2m->default_access);
>  }

Hongyan
On 10.12.2020 14:09, Hongyan Xia wrote:
> On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
>> Plus finally there's no point sending the request for the local
>> domain when the domain acted upon is a different one. If anything
>> that domain's qemu's mapcache may need invalidating, but it's
>> unclear how useful this would be: That remote domain may not execute
>> hypercalls at all, and hence may never make it to the point where
>> the request actually gets issued. I guess the assumption is that
>> such manipulation is not supposed to happen anymore once the guest
>> has been started?
>
> I may still want to set the invalidation signal to true even if the
> domain acted on is not the local domain. I know the remote domain may
> never reach the point to issue the invalidate, but it sounds to me
> that the problem is not whether we should set the signal but whether
> we can change where the signal is checked to make sure the point of
> issue can be reliably triggered, and the latter can be done in a
> future patch.

One of Paul's replies was quite helpful here: The main thing to worry
about is for the vCPU to not continue running before the invalidation
request was signaled (or else, aiui, qemu may serve a subsequent
emulation request by the guest incorrectly, because of using the stale
mapping). Hence I believe for a non-paused guest remote operations
simply cannot be allowed when they may lead to the need for
invalidation. Therefore yes, if we assume the guest is paused in such
cases, we could drop the "is current" check, but we'd then still need
to arrange for actual signaling before the guest gets to run again. I
wonder whether handle_hvm_io_completion() (or its caller,
hvm_do_resume(), right after that other call) wouldn't be a good place
to do so.

Jan
On Thu, 2020-12-10 at 14:37 +0100, Jan Beulich wrote:
> On 10.12.2020 14:09, Hongyan Xia wrote:
>> On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
>>> Plus finally there's no point sending the request for the local
>>> domain when the domain acted upon is a different one. If anything
>>> that domain's qemu's mapcache may need invalidating, but it's
>>> unclear how useful this would be: That remote domain may not
>>> execute hypercalls at all, and hence may never make it to the point
>>> where the request actually gets issued. I guess the assumption is
>>> that such manipulation is not supposed to happen anymore once the
>>> guest has been started?
>>
>> I may still want to set the invalidation signal to true even if the
>> domain acted on is not the local domain. I know the remote domain
>> may never reach the point to issue the invalidate, but it sounds to
>> me that the problem is not whether we should set the signal but
>> whether we can change where the signal is checked to make sure the
>> point of issue can be reliably triggered, and the latter can be done
>> in a future patch.
>
> One of Paul's replies was quite helpful here: The main thing to

Hmm, I seem to not be able to see the whole thread...

> worry about is for the vCPU to not continue running before the
> invalidation request was signaled (or else, aiui, qemu may serve
> a subsequent emulation request by the guest incorrectly, because
> of using the stale mapping). Hence I believe for a non-paused
> guest remote operations simply cannot be allowed when they may
> lead to the need for invalidation. Therefore yes, if we assume
> the guest is paused in such cases, we could drop the "is current"
> check, but we'd then still need to arrange for actual signaling
> before the guest gets to run again. I wonder whether
> handle_hvm_io_completion() (or its caller, hvm_do_resume(),
> right after that other call) wouldn't be a good place to do so.

Actually, the existing code must assume that when QEMU is up, the only
one that manipulates the p2m is the guest itself, like you said. If the
caller is XENMEM_decrease_reservation, the code does not even check
which p2m this is for and unconditionally sets the QEMU invalidate
flag for the current domain. Although this assumption may simply be
wrong now, so I agree care should be taken for remote p2m ops (I may
need to read the code more to know how this should be done).

Hongyan
On 10.12.2020 16:55, Hongyan Xia wrote:
> On Thu, 2020-12-10 at 14:37 +0100, Jan Beulich wrote:
>> On 10.12.2020 14:09, Hongyan Xia wrote:
>>> On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
>>>> Plus finally there's no point sending the request for the local
>>>> domain when the domain acted upon is a different one. If anything
>>>> that domain's qemu's mapcache may need invalidating, but it's
>>>> unclear how useful this would be: That remote domain may not
>>>> execute hypercalls at all, and hence may never make it to the
>>>> point where the request actually gets issued. I guess the
>>>> assumption is that such manipulation is not supposed to happen
>>>> anymore once the guest has been started?
>>>
>>> I may still want to set the invalidation signal to true even if the
>>> domain acted on is not the local domain. I know the remote domain
>>> may never reach the point to issue the invalidate, but it sounds to
>>> me that the problem is not whether we should set the signal but
>>> whether we can change where the signal is checked to make sure the
>>> point of issue can be reliably triggered, and the latter can be
>>> done in a future patch.
>>
>> One of Paul's replies was quite helpful here: The main thing to
>
> Hmm, I seem to not be able to see the whole thread...

This may have been on the thread which had prompted the creation of
this patch.

>> worry about is for the vCPU to not continue running before the
>> invalidation request was signaled (or else, aiui, qemu may serve
>> a subsequent emulation request by the guest incorrectly, because
>> of using the stale mapping). Hence I believe for a non-paused
>> guest remote operations simply cannot be allowed when they may
>> lead to the need for invalidation. Therefore yes, if we assume
>> the guest is paused in such cases, we could drop the "is current"
>> check, but we'd then still need to arrange for actual signaling
>> before the guest gets to run again. I wonder whether
>> handle_hvm_io_completion() (or its caller, hvm_do_resume(),
>> right after that other call) wouldn't be a good place to do so.
>
> Actually, the existing code must assume that when QEMU is up, the
> only one that manipulates the p2m is the guest itself like you said.

Not sure what "that" you mean the existing code must assume, and why
you would outright exclude external p2m manipulation when the guest is
up. At the very least in a hypothetical memory-hot-unplug scenario
manipulation would still necessarily occur from outside the guest (yet
without real need for pausing it). And obviously mem-sharing and
mem-paging are also doing external manipulations, albeit maybe they
pause the guest for every change.

> If the caller is XENMEM_decrease_reservation, the code does not even
> check which p2m this is for and unconditionally sets the QEMU
> invalidate flag for the current domain.

This observation is part of what had prompted the change here.

> Although this assumption may simply be wrong now, so I agree care
> should be taken for remote p2m ops (I may need to read the code more
> to know how this should be done).

I believe the assumption stems from the time where the controlling
domain would necessarily be PV, and hence a decrease-reservation
request by a HVM domain could only have been for itself.

Jan
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -31,7 +31,6 @@

 static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
-    const struct vcpu *curr = current;
     long rc;

     switch ( cmd & MEMOP_CMD_MASK )
@@ -41,14 +40,11 @@ static long hvm_memory_op(int cmd, XEN_G
         return -ENOSYS;
     }

-    if ( !curr->hcall_compat )
+    if ( !current->hcall_compat )
         rc = do_memory_op(cmd, arg);
     else
         rc = compat_memory_op(cmd, arg);

-    if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
-        curr->domain->arch.hvm.qemu_mapcache_invalidate = true;
-
     return rc;
 }

@@ -326,14 +322,11 @@ int hvm_hypercall(struct cpu_user_regs *

     HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);

-    if ( curr->hcall_preempted )
-        return HVM_HCALL_preempted;
-
     if ( unlikely(currd->arch.hvm.qemu_mapcache_invalidate) &&
          test_and_clear_bool(currd->arch.hvm.qemu_mapcache_invalidate) )
         send_invalidate_req();

-    return HVM_HCALL_completed;
+    return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
 }

 /*
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -812,6 +812,9 @@ p2m_remove_page(struct p2m_domain *p2m,
         }
     }

+    if ( p2m->domain == current->domain )
+        p2m->domain->arch.hvm.qemu_mapcache_invalidate = true;
+
     return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
                          p2m->default_access);
 }
For one it was wrong to send the request only upon a completed
hypercall: Even if only part of it completed before getting preempted,
invalidation ought to occur. Therefore fold the two return statements.

And then XENMEM_decrease_reservation isn't the only means by which
pages can get removed from a guest, yet all removals ought to be
signaled to qemu. Put setting of the flag into the central
p2m_remove_page() underlying all respective hypercalls.

Plus finally there's no point sending the request for the local domain
when the domain acted upon is a different one. If anything that
domain's qemu's mapcache may need invalidating, but it's unclear how
useful this would be: That remote domain may not execute hypercalls at
all, and hence may never make it to the point where the request
actually gets issued. I guess the assumption is that such manipulation
is not supposed to happen anymore once the guest has been started?

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Putting the check in guest_physmap_remove_page() might also suffice,
but then a separate is_hvm_domain() would need adding again.