
x86/HVM: refine when to send mapcache invalidation request to qemu

Message ID f92f62bf-2f8d-34db-4be5-d3e6a4b9d580@suse.com (mailing list archive)
State New, archived
Series x86/HVM: refine when to send mapcache invalidation request to qemu

Commit Message

Jan Beulich Sept. 28, 2020, 10:44 a.m. UTC
For one it was wrong to send the request only upon a completed
hypercall: Even if only part of it completed before getting preempted,
invalidation ought to occur. Therefore fold the two return statements.

And then XENMEM_decrease_reservation isn't the only means by which pages
can get removed from a guest, yet all removals ought to be signaled to
qemu. Put setting of the flag into the central p2m_remove_page()
underlying all respective hypercalls.

Plus finally there's no point sending the request for the local domain
when the domain acted upon is a different one. If anything that domain's
qemu's mapcache may need invalidating, but it's unclear how useful this
would be: That remote domain may not execute hypercalls at all, and
hence may never make it to the point where the request actually gets
issued. I guess the assumption is that such manipulation is not supposed
to happen anymore once the guest has been started?

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Putting the check in guest_physmap_remove_page() might also suffice,
but then a separate is_hvm_domain() would need adding again.
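For illustration only, such an alternative might look roughly like this
(untested sketch; not what the patch below does):

    int guest_physmap_remove_page(struct domain *d, gfn_t gfn, mfn_t mfn,
                                  unsigned int page_order)
    {
        /* ... existing removal logic ... */

        /* The separate is_hvm_domain() check becomes necessary again
           here, as this path is shared with PV guests. */
        if ( is_hvm_domain(d) && d == current->domain )
            d->arch.hvm.qemu_mapcache_invalidate = true;

        /* ... */
    }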

Comments

Jan Beulich Sept. 29, 2020, 8:43 a.m. UTC | #1
On 28.09.2020 13:25, Paul Durrant wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 28 September 2020 11:44
>>
>> For one it was wrong to send the request only upon a completed
>> hypercall: Even if only part of it completed before getting preempted,
>> invalidation ought to occur. Therefore fold the two return statements.
>>
>> And then XENMEM_decrease_reservation isn't the only means by which pages
>> can get removed from a guest, yet all removals ought to be signaled to
>> qemu. Put setting of the flag into the central p2m_remove_page()
>> underlying all respective hypercalls.
> 
> Actually, does this catch all the cases? Would it be better to push the
> check into p2m_set_entry() where we displace a ram entry? (I think it
> would be possible, for instance, for map_mmio_regions() to directly
> displace ram).

I did consider this, but I was wary of going too far (and perhaps
adding many false positives). For the specific example of
map_mmio_regions() I'd expect it's the device model which calls it
anyway, and hence it shouldn't need to be told. But yes, if there
are multiple emulators, ones other than the one doing the change
may need to know.

But perhaps PoD is in even more immediate need of this, when
transitioning a GFN from RAM to PoD, especially outside of the
context of p2m_pod_decrease_reservation()? In this case things
happen behind the guest's back, i.e. in particular outside of a
hypercall, and hence there would again be the problem of where to
properly send the request from. IOW I'm struggling to ascertain whether
putting another instance of the sending onto the tail of
hvm_hap_nested_page_fault() would work correctly, or what the
conditions are for this to be done without interfering with other
requests which may be pending.
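
Just to make the idea concrete, an untested sketch of what such an
instance at the tail of hvm_hap_nested_page_fault() could look like
(mirroring the check at the end of hvm_hypercall()):

    int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
                                  struct npfec npfec)
    {
        struct domain *currd = current->domain;
        int rc;

        /* ... existing fault handling, setting rc ... */

        /* Signal qemu about removals done behind the guest's back,
           e.g. a RAM -> PoD transition, before returning to guest. */
        if ( unlikely(currd->arch.hvm.qemu_mapcache_invalidate) &&
             test_and_clear_bool(currd->arch.hvm.qemu_mapcache_invalidate) )
            send_invalidate_req();

        return rc;
    }

Whether issuing the request from this context is actually safe wrt
other pending requests is exactly the part I'm unsure about.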

>> Plus finally there's no point sending the request for the local domain
>> when the domain acted upon is a different one. If anything that domain's
>> qemu's mapcache may need invalidating, but it's unclear how useful this
>> would be: That remote domain may not execute hypercalls at all, and
>> hence may never make it to the point where the request actually gets
>> issued. I guess the assumption is that such manipulation is not supposed
>> to happen anymore once the guest has been started?
> 
> Either that or the code was based on a false assumption that it was
> always occurring on the current domain.

When the code was added, I think this assumption was reasonable.
There was not even a remote idea at that point of an HVM(-like)
domain managing other domains. It was only later that updating this
logic was missed.

> I also wonder whether the test-and-clear in hvm_hypercall() is really
> the right thing to do. E.g. if two vcpus decrease reservation at the
> same time, only one would end up waiting on the send_invalidate_req().

Hmm, yes, the flag wants to be per vCPU. Which will make remote
manipulation (if such was permitted to occur) even more awkward.
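
Roughly like this, I suppose (illustrative only - the field name is
made up and no such change is part of the patch as posted):

    /* In struct hvm_vcpu rather than struct hvm_domain: */
    bool mapcache_invalidate;

    /* ... and then at the end of hvm_hypercall(): */
    if ( unlikely(curr->arch.hvm.mapcache_invalidate) &&
         test_and_clear_bool(curr->arch.hvm.mapcache_invalidate) )
        send_invalidate_req();

That way each vCPU waits for its own invalidation to be sent, instead
of one of two racing vCPUs losing the flag to the other.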

> Also, the pages will have been freed back to heap before invalidate
> is sent so I guess it is possible for a domain to use its qemu
> process to attack pages that may have been re-assigned?

Not for a PV domain - there, page refs will be held by the mappings in
the dm domain(s). For PVH (which isn't supported yet for anything
other than being an ordinary DomU) this would indeed be a problem,
as there's no similar ref-counting when inserting into / removing
from the p2m.

Jan
Hongyan Xia Dec. 10, 2020, 1:09 p.m. UTC | #2
I came across the same issue when QEMU was holding an extra reference
to a page removed from p2m, via XENMEM_add_to_physmap. Please tell me
if I am talking nonsense since my knowledge around QEMU invalidation is
limited.

On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
> For one it was wrong to send the request only upon a completed
> hypercall: Even if only part of it completed before getting
> preempted,
> invalidation ought to occur. Therefore fold the two return
> statements.
> 
> And then XENMEM_decrease_reservation isn't the only means by which
> pages
> can get removed from a guest, yet all removals ought to be signaled
> to
> qemu. Put setting of the flag into the central p2m_remove_page()
> underlying all respective hypercalls.

Sounds good. I know this still does not catch everything, but it's at
least a nice improvement over what was there before.

> 
> Plus finally there's no point sending the request for the local
> domain
> when the domain acted upon is a different one. If anything that
> domain's
> qemu's mapcache may need invalidating, but it's unclear how useful
> this
> would be: That remote domain may not execute hypercalls at all, and
> hence may never make it to the point where the request actually gets
> issued. I guess the assumption is that such manipulation is not
> supposed
> to happen anymore once the guest has been started?

I may still want to set the invalidation signal to true even if the
domain acted on is not the local domain. I know the remote domain may
never reach the point of issuing the invalidate, but it sounds to me that
the problem is not whether we should set the signal but whether we can
change where the signal is checked to make sure the point of issue can
be reliably triggered, and the latter can be done in a future patch.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Putting the check in guest_physmap_remove_page() might also suffice,
> but then a separate is_hvm_domain() would need adding again.
> 
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -31,7 +31,6 @@
>  
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
> -    const struct vcpu *curr = current;
>      long rc;
>  
>      switch ( cmd & MEMOP_CMD_MASK )
> @@ -41,14 +40,11 @@ static long hvm_memory_op(int cmd, XEN_G
>          return -ENOSYS;
>      }
>  
> -    if ( !curr->hcall_compat )
> +    if ( !current->hcall_compat )
>          rc = do_memory_op(cmd, arg);
>      else
>          rc = compat_memory_op(cmd, arg);
>  
> -    if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
> -        curr->domain->arch.hvm.qemu_mapcache_invalidate = true;
> -
>      return rc;
>  }
>  
> @@ -326,14 +322,11 @@ int hvm_hypercall(struct cpu_user_regs *
>  
>      HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
>  
> -    if ( curr->hcall_preempted )
> -        return HVM_HCALL_preempted;
> -
>      if ( unlikely(currd->arch.hvm.qemu_mapcache_invalidate) &&
>           test_and_clear_bool(currd->arch.hvm.qemu_mapcache_invalidate) )
>          send_invalidate_req();
>  
> -    return HVM_HCALL_completed;
> +    return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
>  }
>  
>  /*
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -812,6 +812,9 @@ p2m_remove_page(struct p2m_domain *p2m,
>          }
>      }
>  
> +    if ( p2m->domain == current->domain )
> +        p2m->domain->arch.hvm.qemu_mapcache_invalidate = true;
> +

So my comment above makes me want to remove the condition.

>      return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
>                           p2m->default_access);
>  }
> 

Hongyan
Jan Beulich Dec. 10, 2020, 1:37 p.m. UTC | #3
On 10.12.2020 14:09, Hongyan Xia wrote:
> On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
>> Plus finally there's no point sending the request for the local
>> domain
>> when the domain acted upon is a different one. If anything that
>> domain's
>> qemu's mapcache may need invalidating, but it's unclear how useful
>> this
>> would be: That remote domain may not execute hypercalls at all, and
>> hence may never make it to the point where the request actually gets
>> issued. I guess the assumption is that such manipulation is not
>> supposed
>> to happen anymore once the guest has been started?
> 
> I may still want to set the invalidation signal to true even if the
> domain acted on is not the local domain. I know the remote domain may
> never reach the point to issue the invalidate, but it sounds to me that
> the problem is not whether we should set the signal but whether we can
> change where the signal is checked to make sure the point of issue can
> be reliably triggered, and the latter can be done in a future patch.

One of Paul's replies was quite helpful here: The main thing to
worry about is for the vCPU to not continue running before the
invalidation request was signaled (or else, aiui, qemu may serve
a subsequent emulation request by the guest incorrectly, because
of using the stale mapping). Hence I believe for a non-paused
guest remote operations simply cannot be allowed when they may
lead to the need for invalidation. Therefore yes, if we assume
the guest is paused in such cases, we could drop the "is current"
check, but we'd then still need to arrange for actual signaling
before the guest gets to run again. I wonder whether
handle_hvm_io_completion() (or its caller, hvm_do_resume(),
right after that other call) wouldn't be a good place to do so.
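
As an untested sketch of that placement (field and function names as
in the existing code):

    void hvm_do_resume(struct vcpu *v)
    {
        /* ... existing code, including the handle_hvm_io_completion(v)
           call mentioned above ... */

        /* Send any pending invalidation before the vCPU re-enters guest
           context, so a stale mapcache entry can't serve a subsequent
           emulation request. */
        if ( unlikely(v->domain->arch.hvm.qemu_mapcache_invalidate) &&
             test_and_clear_bool(v->domain->arch.hvm.qemu_mapcache_invalidate) )
            send_invalidate_req();

        /* ... */
    }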

Jan
Hongyan Xia Dec. 10, 2020, 3:55 p.m. UTC | #4
On Thu, 2020-12-10 at 14:37 +0100, Jan Beulich wrote:
> On 10.12.2020 14:09, Hongyan Xia wrote:
> > On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
> > > Plus finally there's no point sending the request for the local
> > > domain
> > > when the domain acted upon is a different one. If anything that
> > > domain's
> > > qemu's mapcache may need invalidating, but it's unclear how
> > > useful
> > > this
> > > would be: That remote domain may not execute hypercalls at all,
> > > and
> > > hence may never make it to the point where the request actually
> > > gets
> > > issued. I guess the assumption is that such manipulation is not
> > > supposed
> > > to happen anymore once the guest has been started?
> > 
> > I may still want to set the invalidation signal to true even if the
> > domain acted on is not the local domain. I know the remote domain
> > may
> > never reach the point to issue the invalidate, but it sounds to me
> > that
> > the problem is not whether we should set the signal but whether we
> > can
> > change where the signal is checked to make sure the point of issue
> > can
> > be reliably triggered, and the latter can be done in a future
> > patch.
> 
> One of Paul's replies was quite helpful here: The main thing to

Hmm, I seem to not be able to see the whole thread...

> worry about is for the vCPU to not continue running before the
> invalidation request was signaled (or else, aiui, qemu may serve
> a subsequent emulation request by the guest incorrectly, because
> of using the stale mapping). Hence I believe for a non-paused
> guest remote operations simply cannot be allowed when they may
> lead to the need for invalidation. Therefore yes, if we assume
> the guest is paused in such cases, we could drop the "is current"
> check, but we'd then still need to arrange for actual signaling
> before the guest gets to run again. I wonder whether
> handle_hvm_io_completion() (or its caller, hvm_do_resume(),
> right after that other call) wouldn't be a good place to do so.

Actually, the existing code must assume that when QEMU is up, the only
one that manipulates the p2m is the guest itself like you said. If the
caller is XENMEM_decrease_reservation, the code does not even check
which p2m this is for and unconditionally sets the QEMU invalidate flag
for the current domain. Although this assumption may simply be wrong
now, so I agree care should be taken for remote p2m ops (I may need to
read the code more to know how this should be done).

Hongyan
Jan Beulich Dec. 11, 2020, 9:43 a.m. UTC | #5
On 10.12.2020 16:55, Hongyan Xia wrote:
> On Thu, 2020-12-10 at 14:37 +0100, Jan Beulich wrote:
>> On 10.12.2020 14:09, Hongyan Xia wrote:
>>> On Mon, 2020-09-28 at 12:44 +0200, Jan Beulich wrote:
>>>> Plus finally there's no point sending the request for the local
>>>> domain
>>>> when the domain acted upon is a different one. If anything that
>>>> domain's
>>>> qemu's mapcache may need invalidating, but it's unclear how
>>>> useful
>>>> this
>>>> would be: That remote domain may not execute hypercalls at all,
>>>> and
>>>> hence may never make it to the point where the request actually
>>>> gets
>>>> issued. I guess the assumption is that such manipulation is not
>>>> supposed
>>>> to happen anymore once the guest has been started?
>>>
>>> I may still want to set the invalidation signal to true even if the
>>> domain acted on is not the local domain. I know the remote domain
>>> may
>>> never reach the point to issue the invalidate, but it sounds to me
>>> that
>>> the problem is not whether we should set the signal but whether we
>>> can
>>> change where the signal is checked to make sure the point of issue
>>> can
>>> be reliably triggered, and the latter can be done in a future
>>> patch.
>>
>> One of Paul's replies was quite helpful here: The main thing to
> 
> Hmm, I seem to not be able to see the whole thread...

This may have been on the thread which had prompted the creation
of this patch.

>> worry about is for the vCPU to not continue running before the
>> invalidation request was signaled (or else, aiui, qemu may serve
>> a subsequent emulation request by the guest incorrectly, because
>> of using the stale mapping). Hence I believe for a non-paused
>> guest remote operations simply cannot be allowed when they may
>> lead to the need for invalidation. Therefore yes, if we assume
>> the guest is paused in such cases, we could drop the "is current"
>> check, but we'd then still need to arrange for actual signaling
>> before the guest gets to run again. I wonder whether
>> handle_hvm_io_completion() (or its caller, hvm_do_resume(),
>> right after that other call) wouldn't be a good place to do so.
> 
> Actually, the existing code must assume that when QEMU is up, the only
> one that manipulates the p2m is the guest itself like you said.

Not sure which "that" you mean the existing code must assume, and why
you would outright exclude external p2m manipulation when the
guest is up. At the very least in a hypothetical memory-hot-
unplug scenario manipulation would still necessarily occur from
outside the guest (yet without real need for pausing it). And
obviously mem-sharing and mem-paging are also doing external
manipulations, albeit maybe they pause the guest for every
change.

> If the
> caller is XENMEM_decrease_reservation, the code does not even check
> which p2m this is for and unconditionally sets the QEMU invalidate flag
> for the current domain.

This observation is part of what had prompted the change here.

> Although this assumption may simply be wrong
> now, so I agree care should be taken for remote p2m ops (I may need to
> read the code more to know how this should be done).

I believe the assumption stems from the time when the controlling
domain would necessarily be PV, and hence a decrease-reservation
request by a HVM domain could only have been for itself.

Jan

Patch

--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -31,7 +31,6 @@ 
 
 static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
-    const struct vcpu *curr = current;
     long rc;
 
     switch ( cmd & MEMOP_CMD_MASK )
@@ -41,14 +40,11 @@  static long hvm_memory_op(int cmd, XEN_G
         return -ENOSYS;
     }
 
-    if ( !curr->hcall_compat )
+    if ( !current->hcall_compat )
         rc = do_memory_op(cmd, arg);
     else
         rc = compat_memory_op(cmd, arg);
 
-    if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
-        curr->domain->arch.hvm.qemu_mapcache_invalidate = true;
-
     return rc;
 }
 
@@ -326,14 +322,11 @@  int hvm_hypercall(struct cpu_user_regs *
 
     HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
 
-    if ( curr->hcall_preempted )
-        return HVM_HCALL_preempted;
-
     if ( unlikely(currd->arch.hvm.qemu_mapcache_invalidate) &&
          test_and_clear_bool(currd->arch.hvm.qemu_mapcache_invalidate) )
         send_invalidate_req();
 
-    return HVM_HCALL_completed;
+    return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
 }
 
 /*
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -812,6 +812,9 @@  p2m_remove_page(struct p2m_domain *p2m,
         }
     }
 
+    if ( p2m->domain == current->domain )
+        p2m->domain->arch.hvm.qemu_mapcache_invalidate = true;
+
     return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
                          p2m->default_access);
 }