
[v2,1/1] KVM: Implement dirty quota-based throttling of vcpus

Message ID 20211220055722.204341-2-shivam.kumar1@nutanix.com
State New, archived
Series KVM: Dirty quota-based throttling

Commit Message

Shivam Kumar Dec. 20, 2021, 5:57 a.m. UTC
Define variables to track and throttle memory dirtying for every vcpu.

dirty_count:    Number of pages the vcpu has dirtied since its creation,
                while dirty logging is enabled.
dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
                more, it needs to request more quota by exiting to
                userspace.

Implement the flow for throttling based on dirty quota.

i) Increment dirty_count for the vcpu whenever it dirties a page.
ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
count equals/exceeds dirty quota) to request more dirty quota.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/x86/kvm/x86.c        | 17 +++++++++++++++++
 include/linux/kvm_types.h |  5 +++++
 include/uapi/linux/kvm.h  | 12 ++++++++++++
 virt/kvm/kvm_main.c       |  4 ++++
 4 files changed, 38 insertions(+)

Comments

Sean Christopherson Jan. 10, 2022, 6:08 p.m. UTC | #1
On Mon, Dec 20, 2021, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a2972fdae82..723f24909314 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10042,6 +10042,11 @@ static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
>  		!vcpu->arch.apf.halted);
>  }
>  
> +static inline bool is_dirty_quota_full(struct kvm_vcpu *vcpu)
> +{
> +	return (vcpu->stat.generic.dirty_count >= vcpu->run->dirty_quota);
> +}
> +
>  static int vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	int r;
> @@ -10079,6 +10084,18 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  				return r;
>  			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>  		}
> +
> +		/*
> +		 * Exit to userspace when dirty quota is full (if dirty quota
> +		 * throttling is enabled, i.e. dirty quota is non-zero).
> +		 */
> +		if (vcpu->run->dirty_quota > 0 && is_dirty_quota_full(vcpu)) {

Kernel style is to omit the "> 0" when checking for non-zero.  It matters here
because the "> 0" suggests dirty_quota can be negative, which it can't.

To allow userspace to modify dirty_quota on the fly, run->dirty_quota should be
read with READ_ONCE(), with the result used for both the !0 and >= checks.  And
then also capture the effective dirty_quota in the exit union struct (free from
a memory perspective because the exit union is padded to 256 bytes).  That way,
if userspace wants to modify the dirty_quota while the vCPU is running, it will
get coherent data even though the behavior is somewhat non-deterministic.

And then to simplify the code and also make this logic reusable for other
architectures, move it all into the helper and put the helper in kvm_host.h.

For other architectures, unless the arch maintainers explicitly don't want to
support this, I would prefer we enable at least arm64 right away to prevent this
from becoming a de facto x86-only feature.  s390 also appears to be easy to support.
I almost suggested moving the check to generic code, but then I looked at MIPS
and PPC and lost all hope :-/

> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;

"FULL" is a bit of a misnomer; there's nothing being filled.  REACHED is better,
though not perfect because the quota can be exceeded if multiple pages are dirtied
in a single run.  Maybe just KVM_EXIT_DIRTY_QUOTA?

> +			vcpu->run->dqt.dirty_count = vcpu->stat.generic.dirty_count;
> +			r = 0;
> +			break;
> +		}

The dirty quota should be checked _before_ running the vCPU, otherwise KVM_RUN
with count >= quota will let the vCPU make forward progress and possibly dirty
more pages despite being over the quota.

> +
>  	}
>  
>  	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 234eab059839..01f3726c0e09 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -87,6 +87,11 @@ struct kvm_vcpu_stat_generic {
>  	u64 halt_poll_success_hist[HALT_POLL_HIST_COUNT];
>  	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>  	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
> +	/*
> +	 * Number of pages the vCPU has dirtied since its creation, while dirty
> +	 * logging is enabled.

This stat should count regardless of whether dirty logging is enabled.  First and
foremost, counting only while dirty logging is enabled creates funky semantics for
KVM_EXIT_DIRTY_QUOTA, e.g. a vCPU can take exits even though no memslots have dirty
logging enabled (due to past counts), and a vCPU can dirty enormous amounts of
memory without exiting due to the relevant memslot not being dirty logged.

Second, the stat could be useful for determining initial quotas or just for debugging.
There's the huge caveat that the counts may be misleading if there's nothing clearing
the dirty bits, but I suspect the info would still be helpful.

Speaking of caveats, this needs documentation in Documentation/virt/kvm/api.rst.
One thing that absolutely needs to be covered is that this quota is not a hard limit,
and that it is enforced opportunistically, e.g. with VMX's PML enabled, a vCPU can go
up to 511 (or 510? I hate math) counts over its quota.

> +	 */
> +	u64 dirty_count;

This doesn't say _what_ is dirty.  I'm not a fan of "count", it's redundant to
some extent (these are numerical stats after all) and "count" is often used for
a value that can be reset at arbitrary times, which isn't true in this case.

Maybe pages_dirtied?  With that, I think you can delete the above comment.

>  };
>  
>  #define KVM_STATS_NAME_SIZE	48
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1daa45268de2..632b29a22778 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_X86_BUS_LOCK     33
>  #define KVM_EXIT_XEN              34
>  #define KVM_EXIT_RISCV_SBI        35
> +#define KVM_EXIT_DIRTY_QUOTA_FULL 36
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -307,6 +308,10 @@ struct kvm_run {
>  	__u64 psw_addr; /* psw lower half */
>  #endif
>  	union {
> +		/* KVM_EXIT_DIRTY_QUOTA_FULL */
> +		struct {
> +			__u64 dirty_count;
> +		} dqt;

"dqt" is a bit opaque.  I wouldn't worry too much about keeping the name short.
Having to avoid a collision with the standalone "dirty_quota" is annoying though.
Maybe dirty_quota_exit?

>  		/* KVM_EXIT_UNKNOWN */
>  		struct {
>  			__u64 hardware_exit_reason;
> @@ -508,6 +513,13 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +	/*
> +	 * Number of pages the vCPU is allowed to dirty (if dirty quota
> +	 * throttling is enabled).

The "(if dirty quota throttling is enabled)" is stale; this field is the enable.

> To dirty more, it needs to request more
> +	 * quota by exiting to userspace (with exit reason
> +	 * KVM_EXIT_DIRTY_QUOTA_FULL).

The blurb about "requesting" more quota is bizarre, the vCPU isn't requesting
anything, it's simply honoring a limit.  For this comment, I think it's better to
simply state the KVM behavior, and then use the documentation entry to describe
how userspace can react to the exit.  There are other subtleties that need to be
addressed, e.g. behavior with respect to clearing of dirty bits, so there should
be a natural segue into using the knob.  E.g. for this:

	/*
	 * Number of pages the vCPU is allowed to have dirtied over its entire
	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA if the quota is
	 * reached/exceeded.
	 */


> +	 */
> +	__u64 dirty_quota;
>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 72c4e6b39389..f557d91459fb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3025,12 +3025,16 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
> +		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>  
>  		if (kvm->dirty_ring_size)
>  			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
>  					    slot, rel_gfn);
>  		else
>  			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +
> +		if (!WARN_ON_ONCE(!vcpu))
> +			vcpu->stat.generic.dirty_count++;

This all becomes a lot simpler because of commit 2efd61a608b0 ("KVM: Warn if
mark_page_dirty() is called without an active vCPU").

Here's a modified patch with most of the feedback incorporated.

---
 arch/arm64/kvm/arm.c      |  4 ++++
 arch/s390/kvm/kvm-s390.c  |  4 ++++
 arch/x86/kvm/x86.c        |  4 ++++
 include/linux/kvm_host.h  | 14 ++++++++++++++
 include/linux/kvm_types.h |  1 +
 include/uapi/linux/kvm.h  | 12 ++++++++++++
 virt/kvm/kvm_main.c       |  7 ++++++-
 7 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 868109cf96b4..c02a00237879 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -823,6 +823,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	ret = 1;
 	run->exit_reason = KVM_EXIT_UNKNOWN;
 	while (ret > 0) {
+		ret = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!ret)
+			break;
+
 		/*
 		 * Check conditions before entering the guest
 		 */
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 9c6d45d0d345..ab73a4bf6327 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3995,6 +3995,10 @@ static int vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	int rc, cpuflags;

+	rc = kvm_vcpu_check_dirty_quota(vcpu);
+	if (!rc)
+		return -EREMOTE;
+
 	/*
 	 * On s390 notifications for arriving pages will be delivered directly
 	 * to the guest but the house keeping for completed pfaults is
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c194a8cbd25f..fe583efe91c0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10119,6 +10119,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.l1tf_flush_l1d = true;

 	for (;;) {
+		r = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!r)
+			break;
+
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f079820f52b5..7449b9748ddf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -424,6 +424,20 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
 }

+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+	struct kvm_run *run = vcpu->run;
+
+	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA;
+	run->dirty_quota_exit.count = vcpu->stat.generic.pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
+	return 0;
+}
+
 /*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index dceac12c1ce5..7f42486b0405 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
 	u64 blocking;
+	u64 pages_dirtied;
 };

 #define KVM_STATS_NAME_SIZE	48
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fbfd70d965c6..a7416c56ab37 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_DIRTY_QUOTA	  36

 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -487,6 +488,11 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_DIRTY_QUOTA */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -508,6 +514,12 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA if the quota is
+	 * reached/exceeded.
+	 */
+	__u64 dirty_quota;
 };

 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 168d0ab93c88..aa526b5b5518 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
 		return;

-	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
+	if (!memslot)
+		return;
+
+	vcpu->stat.generic.pages_dirtied++;
+
+	if (kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;

--
Sean Christopherson Jan. 11, 2022, 4:59 p.m. UTC | #2
On Tue, Jan 11, 2022, Shivam Kumar wrote:
> On 10/01/22 11:38 pm, Sean Christopherson wrote:
> > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
> > "FULL" is a bit of a misnomer, there's nothing being filled.  REACHED is a better,
> > though not perfect because the quota can be exceeded if multiple pages are dirtied
> > in a single run.  Maybe just KVM_EXIT_DIRTY_QUOTA?
> 
> Absolutely. Does KVM_EXIT_DIRTY_QUOTA_EXHAUSTED look good? Or should I keep it just
> KVM_EXIT_DIRTY_QUOTA?

EXHAUSTED is good.

> > This stat should count regardless of whether dirty logging is enabled.  First and
> > foremost, counting only while dirty logging is enabled creates funky semantics for
> > KVM_EXIT_DIRTY_QUOTA, e.g. a vCPU can take exits even though no memslots have dirty
> > logging enabled (due to past counts), and a vCPU can dirty enormous amounts of
> > memory without exiting due to the relevant memslot not being dirty logged.
> > 
> > Second, the stat could be useful for determining initial quotas or just for debugging.
> > There's the huge caveat that the counts may be misleading if there's nothing clearing
> > the dirty bits, but I suspect the info would still be helpful.
> > 
> > Speaking of caveats, this needs documentation in Documentation/virt/kvm/api.rst.
> > One thing that absolutely needs to be covered is that this quota is not a hard limit,
> > and that it is enforced opportunistically, e.g. with VMX's PML enabled, a vCPU can go
> > up to 511 (or 510? I hate math) counts over its quota.
> 
> I think section 5 (kvm run structure) will be the right place to document this. Please
> confirm once.

Yep.

> > The "(if dirty quota throttling is enabled)" is stale, this is the enable.
> 
> dirty_quota will be reset to zero at the end of live migration. I added this
> to capture this scenario.

That's irrelevant with respect to KVM's responsibilities.  More below.

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f079820f52b5..7449b9748ddf 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -424,6 +424,20 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> >   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
> >   }
> > 
> > +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> > +{
> > +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> 
> I think we can store vcpu->stat.generic.pages_dirtied too in some variable here, and
> use that in the next statements.

Yep.

> > @@ -508,6 +514,12 @@ struct kvm_run {
> >   		struct kvm_sync_regs regs;
> >   		char padding[SYNC_REGS_SIZE_BYTES];
> >   	} s;
> > +	/*
> > +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> > +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA if the quota is
> > +	 * reached/exceeded.
> > +	 */
> > +	__u64 dirty_quota;
> >   };
> 
> As mentioned above, this doesn't capture resetting of dirty_quota. Please let
> me know if that can be ignored here and captured in the documentation.

Capture it in the documentation.  Similar to the "reset to zero at the end of
live migration" comment, that's the behavior of _one_ userspace VMM, and isn't
fully representative of KVM's responsibilities.  From a KVM perspective, we can't
make any assumptions about how exactly userspace will utilize a feature, what
matters is what is/isn't supported by KVM's ABI.  E.g. for this particular comment,
KVM needs to play nice with userspace setting dirty_quota to any arbitrary value,
and functionally at any time as well since the quota is checked inside the run
loop, hence suggestion to use READ_ONCE().

It's certainly helpful to document how a feature can be used, especially for users
and VMM developers, but that belongs in the "official" documentation, not internal
code comments.

Speaking of VMM behavior, this also needs a selftest.  I'm guessing it'll be easier
and cleaner to add a new test instead of enhancing something like dirty_log_test,
but whatever works.

> >   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 168d0ab93c88..aa526b5b5518 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> >   	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> >   		return;
> > 
> > -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> > +	if (!memslot)
> > +		return;
> > +
> > +	vcpu->stat.generic.pages_dirtied++;
> > +
> > +	if (kvm_slot_dirty_track_enabled(memslot)) {
> >   		unsigned long rel_gfn = gfn - memslot->base_gfn;
> >   		u32 slot = (memslot->as_id << 16) | memslot->id;
> > 
> > --
> > 
> This looks superb. I am very thankful to you for your reviews. As a beginner in linux
> kernel development, I have learnt a lot from your suggestions.
> 
> I'll be sending this with the documentation bit shortly. Please let me know if I can
> add you in the "Reviewed-by" list in the next patch.

No :-)  A Reviewed-by can be speculatively/conditionally provided, but generally
speaking that's only done if the expected delta is minimal and/or unlikely to have
any impact on the functionality.  That's also a good guideline for carrying reviews
across revisions of a patch: it's usually ok to retain a R-b if you rebase a patch,
tweak a comment/changelog, make minor revisions such as renaming a variable, etc...,
but you should drop a R-b (or explicitly ask as you've done here) if you make
non-trivial changes to a patch and/or modify the functionality in some way.
Shivam Kumar Jan. 23, 2022, 6:58 p.m. UTC | #3
On 10/01/22 11:38 pm, Sean Christopherson wrote:
> On Mon, Dec 20, 2021, Shivam Kumar wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 9a2972fdae82..723f24909314 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10042,6 +10042,11 @@ static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
>>   		!vcpu->arch.apf.halted);
>>   }
>>   
>> +static inline bool is_dirty_quota_full(struct kvm_vcpu *vcpu)
>> +{
>> +	return (vcpu->stat.generic.dirty_count >= vcpu->run->dirty_quota);
>> +}
>> +
>>   static int vcpu_run(struct kvm_vcpu *vcpu)
>>   {
>>   	int r;
>> @@ -10079,6 +10084,18 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>>   				return r;
>>   			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>   		}
>> +
>> +		/*
>> +		 * Exit to userspace when dirty quota is full (if dirty quota
>> +		 * throttling is enabled, i.e. dirty quota is non-zero).
>> +		 */
>> +		if (vcpu->run->dirty_quota > 0 && is_dirty_quota_full(vcpu)) {
> Kernel style is to omit the "> 0" when checking for non-zero.  It matters here
> because the "> 0" suggests dirty_quota can be negative, which it can't.
>
> To allow userspace to modify dirty_quota on the fly, run->dirty_quota should be
> READ_ONCE() with the result used for both the !0 and >= checks.  And then also
> capture the effective dirty_quota in the exit union struct (free from a memory
> perspective because the exit union is padded to 256 bytes).   That way if userspace
> wants to modify the dirty_quota while the vCPU running it will get coherent data
> even though the behavior is somewhat non-deterministic.
>
> And then to simplify the code and also make this logic reusable for other
> architectures, move it all into the helper and put the helper in kvm_host.h.
>
> For other architectures, unless the arch maintainers explicitly don't want to
> support this, I would prefer we enable at least arm64 right away to prevent this
> from becoming a de facto x86-only feature.  s390 also appears to be easy to support.
> I almost suggested moving the check to generic code, but then I looked at MIPS
> and PPC and lost all hope :-/
>
>> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
>>
>>
>> --
>>
I am not able to test this on arm64 and s390 as I don't have access to 
arm64 and s390 hardware. Looking forward to your suggestions. Thank you!
Sean Christopherson Jan. 26, 2022, 9:04 p.m. UTC | #4
On Mon, Jan 24, 2022, Shivam Kumar wrote:
> 
> On 10/01/22 11:38 pm, Sean Christopherson wrote:
> > On Mon, Dec 20, 2021, Shivam Kumar wrote:
> > For other architectures, unless the arch maintainers explicitly don't want to
> > support this, I would prefer we enable at least arm64 right away to prevent this
> > from becoming a de facto x86-only feature.  s390 also appears to be easy to support.
> > I almost suggested moving the check to generic code, but then I looked at MIPS
> > and PPC and lost all hope :-/
> > 
> > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
> > > 
> > > 
> > > --
> > > 
> I am not able to test this on arm64 and s390 as I don't have access to arm64
> and s390 hardware. Looking forward to your suggestions. Thank you!

Posting patches for other architectures that are compile-tested only is ok, just
be sure to note as much in the cover letter (or ignored part of the patch if it's
a single patchy).  I don't think there's anyone that has access to _all_ KVM
ports, though there might be someone that has x86, arm64, and s390.  In other words,
inability to test other architectures is very normal.  The arm64 and s390 maintainers
will need to ack the change in any case.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a2972fdae82..723f24909314 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10042,6 +10042,11 @@  static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 		!vcpu->arch.apf.halted);
 }
 
+static inline bool is_dirty_quota_full(struct kvm_vcpu *vcpu)
+{
+	return (vcpu->stat.generic.dirty_count >= vcpu->run->dirty_quota);
+}
+
 static int vcpu_run(struct kvm_vcpu *vcpu)
 {
 	int r;
@@ -10079,6 +10084,18 @@  static int vcpu_run(struct kvm_vcpu *vcpu)
 				return r;
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 		}
+
+		/*
+		 * Exit to userspace when dirty quota is full (if dirty quota
+		 * throttling is enabled, i.e. dirty quota is non-zero).
+		 */
+		if (vcpu->run->dirty_quota > 0 && is_dirty_quota_full(vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
+			vcpu->run->dqt.dirty_count = vcpu->stat.generic.dirty_count;
+			r = 0;
+			break;
+		}
+
 	}
 
 	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 234eab059839..01f3726c0e09 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -87,6 +87,11 @@  struct kvm_vcpu_stat_generic {
 	u64 halt_poll_success_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
+	/*
+	 * Number of pages the vCPU has dirtied since its creation, while dirty
+	 * logging is enabled.
+	 */
+	u64 dirty_count;
 };
 
 #define KVM_STATS_NAME_SIZE	48
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..632b29a22778 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_DIRTY_QUOTA_FULL 36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -307,6 +308,10 @@  struct kvm_run {
 	__u64 psw_addr; /* psw lower half */
 #endif
 	union {
+		/* KVM_EXIT_DIRTY_QUOTA_FULL */
+		struct {
+			__u64 dirty_count;
+		} dqt;
 		/* KVM_EXIT_UNKNOWN */
 		struct {
 			__u64 hardware_exit_reason;
@@ -508,6 +513,13 @@  struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to dirty (if dirty quota
+	 * throttling is enabled). To dirty more, it needs to request more
+	 * quota by exiting to userspace (with exit reason
+	 * KVM_EXIT_DIRTY_QUOTA_FULL).
+	 */
+	__u64 dirty_quota;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 72c4e6b39389..f557d91459fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3025,12 +3025,16 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
+		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 		if (kvm->dirty_ring_size)
 			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
 					    slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
+
+		if (!WARN_ON_ONCE(!vcpu))
+			vcpu->stat.generic.dirty_count++;
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);