[v7,1/9] KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL

Message ID 20221031003621.164306-2-gshan@redhat.com (mailing list archive)
State New, archived
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Oct. 31, 2022, 12:36 a.m. UTC
The VCPU isn't expected to be runnable when the dirty ring becomes soft
full, until the dirty pages are harvested and the dirty ring is reset
from userspace. So there is a check on each guest entry to see whether
the dirty ring is soft full, and the VCPU is stopped from running if it
is. A similar check will be needed when the feature is supported on
ARM64. As Marc Zyngier suggested, a new event avoids the pointless
overhead of checking the size of the dirty ring
('vcpu->kvm->dirty_ring_size') on every guest entry.

Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
becomes soft full in kvm_dirty_ring_push(). The event is cleared in the
check, done in the newly added helper kvm_dirty_ring_check_request(), or
when the dirty ring is reset by userspace. Since the VCPU is not runnable
when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL
event is always set again in the check to prevent the VCPU from running
until the dirty pages are harvested and the dirty ring is reset by
userspace.

kvm_dirty_ring_soft_full() becomes a private function with the newly added
helper kvm_dirty_ring_check_request(). The alignment of the various event
definitions in kvm_host.h is changed to tabs along the way. In order
to avoid using 'container_of()', the argument @ring is replaced by @vcpu
in kvm_dirty_ring_push() and kvm_dirty_ring_reset(). The argument @kvm to
kvm_dirty_ring_reset() is dropped since it can be retrieved from the VCPU.

Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/x86.c             | 15 ++++++---------
 include/linux/kvm_dirty_ring.h | 17 ++++++-----------
 include/linux/kvm_host.h       |  9 +++++----
 virt/kvm/dirty_ring.c          | 34 +++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c            |  5 ++---
 5 files changed, 50 insertions(+), 30 deletions(-)

Comments

Sean Christopherson Nov. 1, 2022, 7:39 p.m. UTC | #1
On Mon, Oct 31, 2022, Gavin Shan wrote:
> The VCPU isn't expected to be runnable when the dirty ring becomes soft
> full, until the dirty pages are harvested and the dirty ring is reset
> from userspace. So there is a check in each guest's entrace to see if
> the dirty ring is soft full or not. The VCPU is stopped from running if
> its dirty ring has been soft full. The similar check will be needed when
> the feature is going to be supported on ARM64. As Marc Zyngier suggested,
> a new event will avoid pointless overhead to check the size of the dirty
> ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance.
> 
> Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
> becomes soft full in kvm_dirty_ring_push(). The event is cleared in the
> check, done in the newly added helper kvm_dirty_ring_check_request(), or
> when the dirty ring is reset by userspace. Since the VCPU is not runnable
> when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL
> event is always set to prevent the VCPU from running until the dirty pages
> are harvested and the dirty ring is reset by userspace.
> 
> kvm_dirty_ring_soft_full() becomes a private function with the newly added
> helper kvm_dirty_ring_check_request(). The alignment for the various event
> definitions in kvm_host.h is changed to tab character by the way. In order
> to avoid using 'container_of()', the argument @ring is replaced by @vcpu
> in kvm_dirty_ring_push() and kvm_dirty_ring_reset(). The argument @kvm to
> kvm_dirty_ring_reset() is dropped since it can be retrieved from the VCPU.
> 
> Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

> @@ -142,13 +144,17 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>  
>  	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
>  
> +	if (!kvm_dirty_ring_soft_full(ring))
> +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> +

Marc, Peter, and/or Paolo, can you confirm that clearing the request here won't
cause ordering problems?  Logically, this makes perfect sense (to me, since I
suggested it), but I'm mildly concerned I'm overlooking an edge case where KVM
could end up with a soft-full ring but no pending request.

>  	trace_kvm_dirty_ring_reset(ring);
>  
>  	return count;
>  }
>
Peter Xu Nov. 2, 2022, 2:29 p.m. UTC | #2
On Tue, Nov 01, 2022 at 07:39:25PM +0000, Sean Christopherson wrote:
> > @@ -142,13 +144,17 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >  
> >  	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> >  
> > +	if (!kvm_dirty_ring_soft_full(ring))
> > +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> > +
> 
> Marc, Peter, and/or Paolo, can you confirm that clearing the request here won't
> cause ordering problems?  Logically, this makes perfect sense (to me, since I
> suggested it), but I'm mildly concerned I'm overlooking an edge case where KVM
> could end up with a soft-full ring but no pending request.

I don't see an ordering issue here, as long as kvm_clear_request() is using
the atomic version of bit clear; afaict that's a genuine RMW and should
always imply a full memory barrier (on any arch?) between the soft full
check and the bit clear.  At least for x86 the lock prefix is applied.

However I don't see anything that stops a simple "race" like the one below
from triggering:

          recycle thread                   vcpu thread
          --------------                   -----------
      if (!dirty_ring_soft_full)                                   <--- not full
                                        dirty_ring_push();
                                        if (dirty_ring_soft_full)  <--- full due to the push
                                            set_request(SOFT_FULL);
          clear_request(SOFT_FULL);                                <--- can wrongly clear the request?

But I don't think that's a huge matter, as it'll just let the vcpu have
one more chance to do another round of KVM_RUN.  Normally I think it means
there can be one more dirty GFN (perhaps there are cases where it can push
>1 gfns for one KVM_RUN cycle?  I never figured out the details here, but
still..) pushed to the ring, so we get closer to the hard limit, but we
have a buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I
assume that's still fine, but maybe worth a short comment here?
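
For concreteness, a minimal sketch of that buffer zone (purely illustrative
and not taken from the patch; the ring size is a made-up example, while the
64-entry reservation is the real KVM_DIRTY_RING_RSVD_ENTRIES value):

#include <stdbool.h>

#define EXAMPLE_RING_SIZE		4096	/* hypothetical per-vcpu ring size, in entries */
#define KVM_DIRTY_RING_RSVD_ENTRIES	64	/* reserved headroom above the soft limit */

/* soft limit: 4096 - 64 = 4032 entries */
static inline unsigned int example_soft_limit(void)
{
	return EXAMPLE_RING_SIZE - KVM_DIRTY_RING_RSVD_ENTRIES;
}

/*
 * "Soft full" triggers at the soft limit, so a vcpu that pushes a few
 * more GFNs after crossing it still has up to 64 reserved entries left
 * before the ring is genuinely (hard) full.
 */
static inline bool example_soft_full(unsigned int used)
{
	return used >= example_soft_limit();
}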

I never knew what the maximum possible number of GFNs dirtied in one
KVM_RUN cycle is.  It would be good if anyone can provide an answer to that.
Marc Zyngier Nov. 2, 2022, 2:31 p.m. UTC | #3
On Tue, 01 Nov 2022 19:39:25 +0000,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Oct 31, 2022, Gavin Shan wrote:
> > The VCPU isn't expected to be runnable when the dirty ring becomes soft
> > full, until the dirty pages are harvested and the dirty ring is reset
> > from userspace. So there is a check in each guest's entrace to see if
> > the dirty ring is soft full or not. The VCPU is stopped from running if
> > its dirty ring has been soft full. The similar check will be needed when
> > the feature is going to be supported on ARM64. As Marc Zyngier suggested,
> > a new event will avoid pointless overhead to check the size of the dirty
> > ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance.
> > 
> > Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
> > becomes soft full in kvm_dirty_ring_push(). The event is cleared in the
> > check, done in the newly added helper kvm_dirty_ring_check_request(), or
> > when the dirty ring is reset by userspace. Since the VCPU is not runnable
> > when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL
> > event is always set to prevent the VCPU from running until the dirty pages
> > are harvested and the dirty ring is reset by userspace.
> > 
> > kvm_dirty_ring_soft_full() becomes a private function with the newly added
> > helper kvm_dirty_ring_check_request(). The alignment for the various event
> > definitions in kvm_host.h is changed to tab character by the way. In order
> > to avoid using 'container_of()', the argument @ring is replaced by @vcpu
> > in kvm_dirty_ring_push() and kvm_dirty_ring_reset(). The argument @kvm to
> > kvm_dirty_ring_reset() is dropped since it can be retrieved from the VCPU.
> > 
> > Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> 
> > @@ -142,13 +144,17 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >  
> >  	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> >  
> > +	if (!kvm_dirty_ring_soft_full(ring))
> > +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> > +
> 
> Marc, Peter, and/or Paolo, can you confirm that clearing the request
> here won't cause ordering problems?  Logically, this makes perfect
> sense (to me, since I suggested it), but I'm mildly concerned I'm
> overlooking an edge case where KVM could end up with a soft-full
> ring but no pending request.

I don't think you'll end up with a soft-full ring and no pending request,
as kvm_make_request() enforces ordering, and you're making the request
on the vcpu itself. Even on arm64, this is guaranteed to be ordered
(same CPU, same address).

However, resetting the ring and clearing the request are not ordered,
which can lead to a slightly odd situation where the two events are
out of sync. But kvm_dirty_ring_check_request() requires both the
request to be set and the ring to be full to take any action. This
works around the lack of ordering.

I'd be much happier if kvm_clear_request() were fully ordered, but I
otherwise don't think we have an issue here.

Thanks,

	M.
Marc Zyngier Nov. 2, 2022, 3:58 p.m. UTC | #4
On Wed, 02 Nov 2022 14:29:26 +0000,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Tue, Nov 01, 2022 at 07:39:25PM +0000, Sean Christopherson wrote:
> > > @@ -142,13 +144,17 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > >  
> > >  	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > >  
> > > +	if (!kvm_dirty_ring_soft_full(ring))
> > > +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> > > +
> > 
> > Marc, Peter, and/or Paolo, can you confirm that clearing the
> > request here won't cause ordering problems?  Logically, this makes
> > perfect sense (to me, since I suggested it), but I'm mildly
> > concerned I'm overlooking an edge case where KVM could end up with
> > a soft-full ring but no pending request.
> 
> I don't see an ordering issue here, as long as kvm_clear_request() is using
> atomic version of bit clear, afaict that's genuine RMW and should always
> imply a full memory barrier (on any arch?) between the soft full check and
> the bit clear.  At least for x86 the lock prefix was applied.

No, clear_bit() is not a full barrier. It is only atomic, and thus
completely unordered (see Documentation/atomic_bitops.txt). If you
want a full barrier, you need to use test_and_clear_bit().
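
To spell out the distinction, a minimal sketch (an illustrative fragment in
kernel context; 'requests' and 'req' are placeholder names, not the KVM
ones):

#include <linux/bitops.h>

/* atomic RMW with no return value: atomic, but gives no ordering */
static void clear_req_unordered(unsigned long *requests, int req)
{
	clear_bit(req, requests);
}

/* value-returning atomic RMW: atomic and implies a full memory barrier */
static bool clear_req_ordered(unsigned long *requests, int req)
{
	return test_and_clear_bit(req, requests);
}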

> 
> However I don't see anything stops a simple "race" to trigger like below:
> 
>           recycle thread                   vcpu thread
>           --------------                   -----------
>       if (!dirty_ring_soft_full)                                   <--- not full
>                                         dirty_ring_push();
>                                         if (dirty_ring_soft_full)  <--- full due to the push
>                                             set_request(SOFT_FULL);
>           clear_request(SOFT_FULL);                                <--- can wrongly clear the request?
>

Hmmm, well spotted. That's another ugly effect of the recycle thread
playing with someone else's toys.

> But I don't think that's a huge matter, as it'll just let the vcpu to have
> one more chance to do another round of KVM_RUN.  Normally I think it means
> there can be one more dirty GFN (perhaps there're cases that it can push >1
> gfns for one KVM_RUN cycle?  I never figured out the details here, but
> still..) pushed to the ring so closer to the hard limit, but we have had a
> buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I assume
> that's still fine, but maybe worth a short comment here?
> 
> I never know what's the maximum possible GFNs being dirtied for a KVM_RUN
> cycle.  It would be good if there's an answer to that from anyone.

This is dangerous, and I'd rather not go there.

It is starting to look like we need the recycle thread to get out of
the way. And to be honest:

+	if (!kvm_dirty_ring_soft_full(ring))
+		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);

seems rather superfluous. Only clearing the flag in the vcpu entry
path feels much saner, and I can't see anything that would break.

Thoughts?

	M.
Sean Christopherson Nov. 2, 2022, 4:11 p.m. UTC | #5
On Wed, Nov 02, 2022, Marc Zyngier wrote:
> On Wed, 02 Nov 2022 14:29:26 +0000, Peter Xu <peterx@redhat.com> wrote:
> > However I don't see anything stops a simple "race" to trigger like below:
> > 
> >           recycle thread                   vcpu thread
> >           --------------                   -----------
> >       if (!dirty_ring_soft_full)                                   <--- not full
> >                                         dirty_ring_push();
> >                                         if (dirty_ring_soft_full)  <--- full due to the push
> >                                             set_request(SOFT_FULL);
> >           clear_request(SOFT_FULL);                                <--- can wrongly clear the request?
> >
> 
> Hmmm, well spotted. That's another ugly effect of the recycle thread
> playing with someone else's toys.
> 
> > But I don't think that's a huge matter, as it'll just let the vcpu to have
> > one more chance to do another round of KVM_RUN.  Normally I think it means
> > there can be one more dirty GFN (perhaps there're cases that it can push >1
> > gfns for one KVM_RUN cycle?  I never figured out the details here, but
> > still..) pushed to the ring so closer to the hard limit, but we have had a
> > buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I assume
> > that's still fine, but maybe worth a short comment here?
> > 
> > I never know what's the maximum possible GFNs being dirtied for a KVM_RUN
> > cycle.  It would be good if there's an answer to that from anyone.
> 
> This is dangerous, and I'd rather not go there.
> 
> It is starting to look like we need the recycle thread to get out of
> the way. And to be honest:
> 
> +	if (!kvm_dirty_ring_soft_full(ring))
> +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> 
> seems rather superfluous. Only clearing the flag in the vcpu entry
> path feels much saner, and I can't see anything that would break.
> 
> Thoughts?

I've no objections to dropping the clear on reset; I suggested it primarily so
that it would be easier to understand what action causes the dirty ring to become
not-full.  I agree that the explicit clear is unnecessary from a functional
perspective.
Peter Xu Nov. 2, 2022, 4:23 p.m. UTC | #6
On Wed, Nov 02, 2022 at 03:58:43PM +0000, Marc Zyngier wrote:
> On Wed, 02 Nov 2022 14:29:26 +0000,
> Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Tue, Nov 01, 2022 at 07:39:25PM +0000, Sean Christopherson wrote:
> > > > @@ -142,13 +144,17 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > >  
> > > >  	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > >  
> > > > +	if (!kvm_dirty_ring_soft_full(ring))
> > > > +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> > > > +
> > > 
> > > Marc, Peter, and/or Paolo, can you confirm that clearing the
> > > request here won't cause ordering problems?  Logically, this makes
> > > perfect sense (to me, since I suggested it), but I'm mildly
> > > concerned I'm overlooking an edge case where KVM could end up with
> > > a soft-full ring but no pending request.
> > 
> > I don't see an ordering issue here, as long as kvm_clear_request() is using
> > atomic version of bit clear, afaict that's genuine RMW and should always
> > imply a full memory barrier (on any arch?) between the soft full check and
> > the bit clear.  At least for x86 the lock prefix was applied.
> 
> No, clear_bit() is not a full barrier. It only atomic, and thus
> completely unordered (see Documentation/atomic_bitops.txt). If you
> want a full barrier, you need to use test_and_clear_bit().

Right, I mixed it up again. :(  It's a genuine RMW indeed (unlike _set/_read),
but I forgot it needs to return a value to imply the memory barriers.

Quoting atomic_t.rst:

---8<---
ORDERING  (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.
---8<---

So bit clear is unordered.
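
Which means that, per those rules, a fully ordered clear would need either
the value-returning form (test_and_clear_bit()) or explicit barriers around
the plain clear, roughly like this sketch (assuming the usual kernel barrier
primitives are in scope; names are placeholders):

#include <linux/atomic.h>
#include <linux/bitops.h>

static void clear_req_with_barriers(unsigned long *requests, int req)
{
	smp_mb__before_atomic();	/* order earlier accesses against the clear */
	clear_bit(req, requests);
	smp_mb__after_atomic();		/* order the clear against later accesses */
}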

> 
> > 
> > However I don't see anything stops a simple "race" to trigger like below:
> > 
> >           recycle thread                   vcpu thread
> >           --------------                   -----------
> >       if (!dirty_ring_soft_full)                                   <--- not full
> >                                         dirty_ring_push();
> >                                         if (dirty_ring_soft_full)  <--- full due to the push
> >                                             set_request(SOFT_FULL);
> >           clear_request(SOFT_FULL);                                <--- can wrongly clear the request?
> >
> 
> Hmmm, well spotted. That's another ugly effect of the recycle thread
> playing with someone else's toys.
> 
> > But I don't think that's a huge matter, as it'll just let the vcpu to have
> > one more chance to do another round of KVM_RUN.  Normally I think it means
> > there can be one more dirty GFN (perhaps there're cases that it can push >1
> > gfns for one KVM_RUN cycle?  I never figured out the details here, but
> > still..) pushed to the ring so closer to the hard limit, but we have had a
> > buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I assume
> > that's still fine, but maybe worth a short comment here?
> > 
> > I never know what's the maximum possible GFNs being dirtied for a KVM_RUN
> > cycle.  It would be good if there's an answer to that from anyone.
> 
> This is dangerous, and I'd rather not go there.
> 
> It is starting to look like we need the recycle thread to get out of
> the way. And to be honest:
> 
> +	if (!kvm_dirty_ring_soft_full(ring))
> +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> 
> seems rather superfluous. Only clearing the flag in the vcpu entry
> path feels much saner, and I can't see anything that would break.
> 
> Thoughts?

Sounds good here.

Might be slightly off-topic: I didn't quickly spot how we guarantee that two
threads can't do the KVM_RUN ioctl on the same vcpu fd concurrently.  I know
that's insane and could corrupt things, but I just want to make sure that
e.g. even a malicious guest app won't be able to trigger host warnings.
Sean Christopherson Nov. 2, 2022, 4:33 p.m. UTC | #7
On Wed, Nov 02, 2022, Peter Xu wrote:
> Might be slightly off-topic: I didn't quickly spot how do we guarantee two
> threads doing KVM_RUN ioctl on the same vcpu fd concurrently.  I know
> that's insane and could have corrupted things, but I just want to make sure
> e.g. even a malicious guest app won't be able to trigger host warnings.

kvm_vcpu_ioctl() takes the vCPU's mutex:

static long kvm_vcpu_ioctl(struct file *filp,
			   unsigned int ioctl, unsigned long arg)
{
	...

	/*
	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
	 * execution; mutex_lock() would break them.
	 */
	r = kvm_arch_vcpu_async_ioctl(filp, ioctl, arg);
	if (r != -ENOIOCTLCMD)
		return r;

	if (mutex_lock_killable(&vcpu->mutex))
		return -EINTR;
	switch (ioctl) {
	case KVM_RUN: {
Peter Xu Nov. 2, 2022, 4:43 p.m. UTC | #8
On Wed, Nov 02, 2022 at 04:33:15PM +0000, Sean Christopherson wrote:
> On Wed, Nov 02, 2022, Peter Xu wrote:
> > Might be slightly off-topic: I didn't quickly spot how do we guarantee two
> > threads doing KVM_RUN ioctl on the same vcpu fd concurrently.  I know
> > that's insane and could have corrupted things, but I just want to make sure
> > e.g. even a malicious guest app won't be able to trigger host warnings.
> 
> kvm_vcpu_ioctl() takes the vCPU's mutex:
> 
> static long kvm_vcpu_ioctl(struct file *filp,
> 			   unsigned int ioctl, unsigned long arg)
> {
> 	...
> 
> 	/*
> 	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
> 	 * execution; mutex_lock() would break them.
> 	 */
> 	r = kvm_arch_vcpu_async_ioctl(filp, ioctl, arg);
> 	if (r != -ENOIOCTLCMD)
> 		return r;
> 
> 	if (mutex_lock_killable(&vcpu->mutex))
> 		return -EINTR;
> 	switch (ioctl) {
> 	case KVM_RUN: {

Ah, makes sense, thanks!
Marc Zyngier Nov. 2, 2022, 4:44 p.m. UTC | #9
On Wed, 02 Nov 2022 16:11:07 +0000,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Wed, Nov 02, 2022, Marc Zyngier wrote:
> > On Wed, 02 Nov 2022 14:29:26 +0000, Peter Xu <peterx@redhat.com> wrote:
> > > However I don't see anything stops a simple "race" to trigger like below:
> > > 
> > >           recycle thread                   vcpu thread
> > >           --------------                   -----------
> > >       if (!dirty_ring_soft_full)                                   <--- not full
> > >                                         dirty_ring_push();
> > >                                         if (dirty_ring_soft_full)  <--- full due to the push
> > >                                             set_request(SOFT_FULL);
> > >           clear_request(SOFT_FULL);                                <--- can wrongly clear the request?
> > >
> > 
> > Hmmm, well spotted. That's another ugly effect of the recycle thread
> > playing with someone else's toys.
> > 
> > > But I don't think that's a huge matter, as it'll just let the vcpu to have
> > > one more chance to do another round of KVM_RUN.  Normally I think it means
> > > there can be one more dirty GFN (perhaps there're cases that it can push >1
> > > gfns for one KVM_RUN cycle?  I never figured out the details here, but
> > > still..) pushed to the ring so closer to the hard limit, but we have had a
> > > buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I assume
> > > that's still fine, but maybe worth a short comment here?
> > > 
> > > I never know what's the maximum possible GFNs being dirtied for a KVM_RUN
> > > cycle.  It would be good if there's an answer to that from anyone.
> > 
> > This is dangerous, and I'd rather not go there.
> > 
> > It is starting to look like we need the recycle thread to get out of
> > the way. And to be honest:
> > 
> > +	if (!kvm_dirty_ring_soft_full(ring))
> > +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
> > 
> > seems rather superfluous. Only clearing the flag in the vcpu entry
> > path feels much saner, and I can't see anything that would break.
> > 
> > Thoughts?
> 
> I've no objections to dropping the clear on reset, I suggested it
> primarily so that it would be easier to understand what action
> causes the dirty ring to become not-full.  I agree that the explicit
> clear is unnecessary from a functional perspective.

The core of the issue is that the whole request mechanism is a
producer/consumer model, where consuming a request is a CLEAR
action. The standard model is that the vcpu thread is the consumer,
and that any thread (including the vcpu itself) can be a producer.

With this flag clearing being on a non-vcpu thread, you end up with
two consumers, and things can go subtly wrong.
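
For readers less familiar with the request machinery, a minimal sketch of
that producer/consumer shape (kvm_make_request()/kvm_check_request() are the
real kvm_host.h helpers; the wrapper functions here are hypothetical):

#include <linux/kvm_host.h>

/* producer: any thread may raise a request against a vcpu */
static void example_producer(struct kvm_vcpu *vcpu)
{
	kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
}

/*
 * consumer: the vcpu thread, on its way into the guest; checking the
 * request is what clears it, i.e. consuming a request is a CLEAR action.
 */
static bool example_consumer(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu)) {
		/* handle the condition, e.g. exit to userspace */
		return true;
	}
	return false;
}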

I'd suggest replacing this hunk with a comment saying that the request
will be cleared by the vcpu thread next time it enters the guest.

	M.
Marc Zyngier Nov. 2, 2022, 4:48 p.m. UTC | #10
On Wed, 02 Nov 2022 16:23:16 +0000,
Peter Xu <peterx@redhat.com> wrote:
> 
> Might be slightly off-topic: I didn't quickly spot how do we guarantee two
> threads doing KVM_RUN ioctl on the same vcpu fd concurrently.  I know
> that's insane and could have corrupted things, but I just want to make sure
> e.g. even a malicious guest app won't be able to trigger host warnings.

In kvm_vcpu_ioctl():

	if (mutex_lock_killable(&vcpu->mutex)) <----- this
		return -EINTR;
	switch (ioctl) {
	case KVM_RUN: {
		struct pid *oldpid;
		r = -EINVAL;
		if (arg)

We simply don't allow two concurrent ioctls to the same vcpu, let
alone two KVM_RUN.

	M.
Gavin Shan Nov. 3, 2022, 12:44 a.m. UTC | #11
On 11/3/22 12:44 AM, Marc Zyngier wrote:
> On Wed, 02 Nov 2022 16:11:07 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Wed, Nov 02, 2022, Marc Zyngier wrote:
>>> On Wed, 02 Nov 2022 14:29:26 +0000, Peter Xu <peterx@redhat.com> wrote:
>>>> However I don't see anything stops a simple "race" to trigger like below:
>>>>
>>>>            recycle thread                   vcpu thread
>>>>            --------------                   -----------
>>>>        if (!dirty_ring_soft_full)                                   <--- not full
>>>>                                          dirty_ring_push();
>>>>                                          if (dirty_ring_soft_full)  <--- full due to the push
>>>>                                              set_request(SOFT_FULL);
>>>>            clear_request(SOFT_FULL);                                <--- can wrongly clear the request?
>>>>
>>>
>>> Hmmm, well spotted. That's another ugly effect of the recycle thread
>>> playing with someone else's toys.
>>>
>>>> But I don't think that's a huge matter, as it'll just let the vcpu to have
>>>> one more chance to do another round of KVM_RUN.  Normally I think it means
>>>> there can be one more dirty GFN (perhaps there're cases that it can push >1
>>>> gfns for one KVM_RUN cycle?  I never figured out the details here, but
>>>> still..) pushed to the ring so closer to the hard limit, but we have had a
>>>> buffer zone of KVM_DIRTY_RING_RSVD_ENTRIES (64) entries.  So I assume
>>>> that's still fine, but maybe worth a short comment here?
>>>>
>>>> I never know what's the maximum possible GFNs being dirtied for a KVM_RUN
>>>> cycle.  It would be good if there's an answer to that from anyone.
>>>
>>> This is dangerous, and I'd rather not go there.
>>>
>>> It is starting to look like we need the recycle thread to get out of
>>> the way. And to be honest:
>>>
>>> +	if (!kvm_dirty_ring_soft_full(ring))
>>> +		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
>>>
>>> seems rather superfluous. Only clearing the flag in the vcpu entry
>>> path feels much saner, and I can't see anything that would break.
>>>
>>> Thoughts?
>>
>> I've no objections to dropping the clear on reset, I suggested it
>> primarily so that it would be easier to understand what action
>> causes the dirty ring to become not-full.  I agree that the explicit
>> clear is unnecessary from a functional perspective.
> 
> The core of the issue is that the whole request mechanism is a
> producer/consumer model, where consuming a request is a CLEAR
> action. The standard model is that the vcpu thread is the consumer,
> and that any thread (including the vcpu itself) can be a producer.
> 
> With this flag clearing being on a non-vcpu thread, you end-up with
> two consumers, and things can go subtly wrong.
> 
> I'd suggest replacing this hunk with a comment saying that the request
> will be cleared by the vcpu thread next time it enters the guest.
> 

Thanks, Marc. I will replace the hunk of code with the following comment,
as you suggested, in the next respin.

     /*
      * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
      * by the VCPU thread the next time it enters the guest.
      */

I will post v8 after Peter/Sean/Oliver take a look at [PATCH v7 4/9].
I think we're settled on the other patches.

Thanks,
Gavin

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cf1ba865562..d0d32e67ebf3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10499,20 +10499,17 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	bool req_immediate_exit = false;
 
-	/* Forbid vmenter if vcpu dirty ring is soft-full */
-	if (unlikely(vcpu->kvm->dirty_ring_size &&
-		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
-		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
-		trace_kvm_dirty_ring_exit(vcpu);
-		r = 0;
-		goto out;
-	}
-
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
 			r = -EIO;
 			goto out;
 		}
+
+		if (kvm_dirty_ring_check_request(vcpu)) {
+			r = 0;
+			goto out;
+		}
+
 		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
 				r = 0;
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 906f899813dc..53a36f38d15e 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -43,13 +43,12 @@  static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 	return 0;
 }
 
-static inline int kvm_dirty_ring_reset(struct kvm *kvm,
-				       struct kvm_dirty_ring *ring)
+static inline int kvm_dirty_ring_reset(struct kvm_vcpu *vcpu)
 {
 	return 0;
 }
 
-static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
 				       u32 slot, u64 offset)
 {
 }
@@ -64,11 +63,6 @@  static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 {
 }
 
-static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
-{
-	return true;
-}
-
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 u32 kvm_dirty_ring_get_rsvd_entries(void);
@@ -78,19 +72,20 @@  int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
  * called with kvm->slots_lock held, returns the number of
  * processed pages.
  */
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+int kvm_dirty_ring_reset(struct kvm_vcpu *vcpu);
 
 /*
  * returns =0: successfully pushed
  *         <0: unable to push, need to wait
  */
-void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset);
+
+bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu);
 
 /* for use in vm_operations_struct */
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
 
 void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
-bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
 
 #endif /* CONFIG_HAVE_KVM_DIRTY_RING */
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 00c3448ba7f8..648d663f32c4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,10 +153,11 @@  static inline bool is_error_page(struct page *page)
  * Architecture-independent vcpu->requests bit members
  * Bits 3-7 are reserved for more arch-independent bits.
  */
-#define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_UNBLOCK           2
-#define KVM_REQUEST_ARCH_BASE     8
+#define KVM_REQ_TLB_FLUSH		(0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_VM_DEAD			(1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UNBLOCK			2
+#define KVM_REQ_DIRTY_RING_SOFT_FULL	3
+#define KVM_REQUEST_ARCH_BASE		8
 
 /*
  * KVM_REQ_OUTSIDE_GUEST_MODE exists is purely as way to force the vCPU to
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index d6fabf238032..6091e1403bc8 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -26,7 +26,7 @@  static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
 }
 
-bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+static bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
 {
 	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
 }
@@ -87,8 +87,10 @@  static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }
 
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+int kvm_dirty_ring_reset(struct kvm_vcpu *vcpu)
 {
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
 	u32 cur_slot, next_slot;
 	u64 cur_offset, next_offset;
 	unsigned long mask;
@@ -142,13 +144,17 @@  int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
 
+	if (!kvm_dirty_ring_soft_full(ring))
+		kvm_clear_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
+
 	trace_kvm_dirty_ring_reset(ring);
 
 	return count;
 }
 
-void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset)
 {
+	struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +172,28 @@  void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(ring))
+		kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
+}
+
+bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * The VCPU isn't runnable when the dirty ring becomes soft full.
+	 * The KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent
+	 * the VCPU from running until the dirty pages are harvested and
+	 * the dirty ring is reset by userspace.
+	 */
+	if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu) &&
+	    kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) {
+		kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		return true;
+	}
+
+	return false;
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1376a47fedee..30ff73931e1c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3314,8 +3314,7 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
 		if (kvm->dirty_ring_size)
-			kvm_dirty_ring_push(&vcpu->dirty_ring,
-					    slot, rel_gfn);
+			kvm_dirty_ring_push(vcpu, slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
@@ -4543,7 +4542,7 @@  static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 	mutex_lock(&kvm->slots_lock);
 
 	kvm_for_each_vcpu(i, vcpu, kvm)
-		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+		cleared += kvm_dirty_ring_reset(vcpu);
 
 	mutex_unlock(&kvm->slots_lock);