[v6,1/5] KVM: Implement dirty quota-based throttling of vcpus

Message ID 20220915101049.187325-2-shivam.kumar1@nutanix.com (mailing list archive)
State New, archived
Series KVM: Dirty quota-based throttling

Commit Message

Shivam Kumar Sept. 15, 2022, 10:10 a.m. UTC
Define variables to track and throttle memory dirtying for every vcpu.

dirty_count:    Number of pages the vcpu has dirtied since its creation,
                while dirty logging is enabled.
dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
                more, it needs to request more quota by exiting to
                userspace.

Implement the flow for throttling based on dirty quota.

i) Increment dirty_count for the vcpu whenever it dirties a page.
ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
count equals/exceeds dirty quota) to request more dirty quota.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h       | 20 ++++++++++++++++++-
 include/linux/kvm_types.h      |  1 +
 include/uapi/linux/kvm.h       | 12 ++++++++++++
 virt/kvm/kvm_main.c            | 26 ++++++++++++++++++++++---
 5 files changed, 90 insertions(+), 4 deletions(-)
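
For illustration, a minimal sketch of how a VMM might drive this interface
from its KVM_RUN loop. Only run->dirty_quota, run->dirty_quota_exit and
KVM_EXIT_DIRTY_QUOTA_EXHAUSTED come from this patch; the grant size, the
sleep policy and the helper name are illustrative assumptions.

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define QUOTA_GRANT	512	/* pages granted per top-up (assumption) */

static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
	/* A non-zero quota arms throttling; 0 leaves the vCPU unthrottled. */
	run->dirty_quota = QUOTA_GRANT;

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);

		if (run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
			/* Throttle, then grant quota on top of the count. */
			usleep(1000);
			run->dirty_quota = run->dirty_quota_exit.count +
					   QUOTA_GRANT;
			continue;
		}
		/* ... handle the other exit reasons ... */
	}
}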

Comments

Christian Borntraeger Sept. 15, 2022, 1:21 p.m. UTC | #1
On 15.09.22 12:10, Shivam Kumar wrote:
> Define variables to track and throttle memory dirtying for every vcpu.
> 
> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>                  while dirty logging is enabled.
> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>                  more, it needs to request more quota by exiting to
>                  userspace.
> 
> Implement the flow for throttling based on dirty quota.
> 
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
> count equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
[...]

I am wondering if this will work on s390. On s390 we only call
mark_page_dirty_in_slot for the kvm_read/write functions but not
for those done by the guest on fault. We do account those lazily in
kvm_arch_sync_dirty_log (like x86 in the past).

> +
>   void mark_page_dirty_in_slot(struct kvm *kvm,
>   			     const struct kvm_memory_slot *memslot,
>   		 	     gfn_t gfn)
>   {
>   	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>   
> -#ifdef CONFIG_HAVE_KVM_DIRTY_RING
>   	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>   		return;
> -#endif

This will trigger on s390 for the interrupt payload written from vhost,
if I recall correctly. Please do not enable this warning.

>   
> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +	if (!memslot)
> +		return;
> +
> +	WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++);
> +
> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>   		u32 slot = (memslot->as_id << 16) | memslot->id;
>   
> @@ -3318,6 +3336,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>   					    slot, rel_gfn);
>   		else
>   			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +
> +		kvm_vcpu_is_dirty_quota_exhausted(vcpu);
>   	}
>   }
>   EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
Christian Borntraeger Sept. 15, 2022, 2:34 p.m. UTC | #2
On 15.09.22 15:21, Christian Borntraeger wrote:
> 
> 
> On 15.09.22 12:10, Shivam Kumar wrote:
>> Define variables to track and throttle memory dirtying for every vcpu.
>>
>> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>>                  while dirty logging is enabled.
>> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>>                  more, it needs to request more quota by exiting to
>>                  userspace.
>>
>> Implement the flow for throttling based on dirty quota.
>>
>> i) Increment dirty_count for the vcpu whenever it dirties a page.
>> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>> count equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> [...]
> 
> I am wondering if this will work on s390. On s390 we only call
> mark_page_dirty_in_slot for the kvm_read/write functions but not
> for those done by the guest on fault. We do account those lazily in
> kvm_arch_sync_dirty_log (like x86 in the past).
> 

I think we need to rework the page fault handling on s390 to actually make
use of this. This has to happen anyway at some point (as indicated by the guest
enter/exit rework from Mark). Right now we handle KVM page faults directly
in the normal system fault handler. It seems we need to make a side turn
into KVM for page faults on guests in the long run.

Until this is done, dirty logging is really not per-CPU on s390, so we
need to defer this feature for now. CC'ing the other s390 maintainers.
The other use case for a page fault handler rework was Mark Rutland's
"kvm: fix latent guest entry/exit bugs" series, which did not work on s390.
Sean Christopherson Oct. 7, 2022, 6:18 p.m. UTC | #3
On Thu, Sep 15, 2022, Christian Borntraeger wrote:
> On 15.09.22 15:21, Christian Borntraeger wrote:
> > I am wondering if this will work on s390. On s390 we only call
> > mark_page_dirty_in_slot for the kvm_read/write functions but not
> > for those done by the guest on fault. We do account those lazily in
> > kvm_arch_sync_dirty_log (like x86 in the past).
> > 
> 
> I think we need to rework the page fault handling on s390 to actually make
> use of this. This has to happen anyway at some point (as indicated by the guest
> enter/exit rework from Mark). Right now we handle KVM page faults directly
> in the normal system fault handler. It seems we need to make a side turn
> into KVM for page faults on guests in the long run.

s390's page table management came up in conversation at KVM Forum in the context
of a common MMU (or MMUs)[*].  If/when we get an MMU that supports multiple
architectures, that would be a perfect opportunity for s390 to switch.  I don't
know that it'll be feasible to make a common MMU support s390 from the get-go,
but at the very least we can keep y'all in the loop and not make it unnecessarily
difficult to support s390.

[*] https://kvmforum2022.sched.com/event/15jJk
Sean Christopherson Oct. 7, 2022, 7:08 p.m. UTC | #4
On Thu, Sep 15, 2022, Shivam Kumar wrote:
> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>  }
>  
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;
> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;

Dead code.

> +}

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..f315af50037d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>  }
>  EXPORT_SYMBOL_GPL(kvm_clear_guest);
>  
> +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)

Ouch, sorry.  I suspect you got this name from me[*].  That was a goof on my end,
I'm 99% certain I copy-pasted stale code, i.e. didn't intend to suggest a
rename.

Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.

[*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com

> +{
> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> +
> +	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> +		return;
> +
> +	/*
> +	 * Snapshot the quota to report it to userspace.  The dirty count will be
> +	 * captured when the request is processed.
> +	 */
> +	vcpu->dirty_quota = dirty_quota;
> +	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);

Making the request needs to be guarded with an arch opt-in.  Pending requests
prevent KVM from entering the guest, and so making a request that an arch isn't
aware of will effectively hang the vCPU.  Obviously userspace would be shooting
itself in the foot by setting run->dirty_quota in this case, but KVM shouldn't
hand userspace a loaded gun and help them aim.

My suggestion from v1[*] about not forcing architectures to opt-in was in the
context of a request-less implementation where dirty_quota was a nop until the
arch took action.

And regardless of arch opt-in, I think this needs a capability so that userspace
can detect KVM support.

I don't see any reason to wrap the request or vcpu_run field, e.g. something like
this should suffice:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ea5847d22aff..93362441215b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3298,8 +3298,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
 {
+#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
        u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
 
        if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
@@ -3311,6 +3312,7 @@ static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
         */
        vcpu->dirty_quota = dirty_quota;
        kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
+#endif
 }
 
 void mark_page_dirty_in_slot(struct kvm *kvm,
@@ -4507,6 +4509,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
        case KVM_CAP_BINARY_STATS_FD:
        case KVM_CAP_SYSTEM_EVENT_DATA:
                return 1;
+       case KVM_CAP_DIRTY_QUOTA:
+               return !!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA);
        default:
                break;
        }


[*] https://lore.kernel.org/all/YZaUENi0ZyQi%2F9M0@google.com
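
For completeness, the userspace side of the suggested capability check is a
plain KVM_CHECK_EXTENSION probe; KVM_CAP_DIRTY_QUOTA is the name proposed in
the diff above, with its value to come from the final uapi headers.

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static bool dirty_quota_supported(int kvm_fd)
{
	/* KVM_CHECK_EXTENSION returns a positive value if supported. */
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_QUOTA) > 0;
}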
Sean Christopherson Oct. 7, 2022, 7:20 p.m. UTC | #5
On Fri, Oct 07, 2022, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Shivam Kumar wrote:
> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
> 
> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com

Actually, I take that back.  The code snippet itself is also flawed.  If userspace
increases the quota (or disables it entirely) between KVM snapshotting the quota
and making the request, then there's no need for KVM to exit to userspace.

So I think this can be:

static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);

	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
#else
	return false;
#endif
}

and the usage becomes:

		if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
			kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);

More thoughts in the x86 patch.
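
For reference, a sketch of how an opted-in architecture might service the
request on the way into the guest, modelled on x86's request processing in
vcpu_enter_guest(); the exact hook and the surrounding r/out plumbing are
placeholders for whatever the arch patch uses.

	if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
		struct kvm_run *run = vcpu->run;

		/* Capture the dirty count at processing time, per the changelog. */
		run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
		run->dirty_quota_exit.count = vcpu->stat.generic.pages_dirtied;
		run->dirty_quota_exit.quota = READ_ONCE(run->dirty_quota);
		r = 0;
		goto out;
	}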
Shivam Kumar Oct. 9, 2022, 6:36 p.m. UTC | #6
On 07/10/22 11:48 pm, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Christian Borntraeger wrote:
>> On 15.09.22 15:21, Christian Borntraeger wrote:
>>> I am wondering if this will work on s390. On s390 we only call
>>> mark_page_dirty_in_slot for the kvm_read/write functions but not
>>> for those done by the guest on fault. We do account those lazily in
>>> kvm_arch_sync_dirty_log (like x86 in the past).
>>>
>>
>> I think we need to rework the page fault handling on s390 to actually make
>> use of this. This has to happen anyway at some point (as indicated by the guest
>> enter/exit rework from Mark). Right now we handle KVM page faults directly
>> in the normal system fault handler. It seems we need to make a side turn
>> into KVM for page faults on guests in the long run.
> 
> s390's page table management came up in conversation at KVM Forum in the context
> of a common MMU (or MMUs)[*].  If/when we get an MMU that supports multiple
> architectures, that would be a perfect opportunity for s390 to switch.  I don't
> know that it'll be feasible to make a common MMU support s390 from the get-go,
> but at the very least we can keep y'all in the loop and not make it unnecessarily
> difficult to support s390.
> 
> [*] https://kvmforum2022.sched.com/event/15jJk
Thank you Sean for the reference to the KVM Forum talk. Should I skip 
the support for s390 (patch no. 4) for the next patchset?
Shivam Kumar Oct. 9, 2022, 6:49 p.m. UTC | #7
On 08/10/22 12:50 am, Sean Christopherson wrote:
> On Fri, Oct 07, 2022, Sean Christopherson wrote:
>> On Thu, Sep 15, 2022, Shivam Kumar wrote:
>> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
>>
>> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
> 
> Actually, I take that back.  The code snippet itself is also flawed.  If userspace
> increases the quota (or disables it entirely) between KVM snapshotting the quota
> and making the request, then there's no need for KVM to exit to userspace.
> 
> So I think this can be:
> 
> static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> {
> #ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> 
> 	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
> #else
> 	return false;
> #endif
> }
> 
> and the usage becomes:
> 
> 		if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
> 			kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> 
> More thoughts in the x86 patch.

Snapshotting is not a requirement for now anyway. We have plans to 
lazily update the quota, i.e. only when it needs to do more dirtying. 
This helps us prevent overthrottling of the VM due to skewed cases where 
some vcpus are mostly reading and the others are mostly writing.

So this looks good from every angle. The name 
kvm_vcpu_is_dirty_quota_exhausted, though verbose, works for me.
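
In sketch form, such a lazy top-up on the userspace side could look like the
following; the helper name and the per-vCPU budget are assumptions, only the
kvm_run fields come from the patch.

static void grant_more_quota(struct kvm_run *run, uint64_t per_vcpu_budget)
{
	/*
	 * The quota is a lifetime count, so each grant stacks on top of
	 * the dirty count reported at the exit; vCPUs that mostly read
	 * simply never come back for more.
	 */
	run->dirty_quota = run->dirty_quota_exit.count + per_vcpu_budget;
}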
Shivam Kumar Oct. 9, 2022, 7:30 p.m. UTC | #8
On 08/10/22 12:38 am, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Shivam Kumar wrote:
>> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>>   }
>>   
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_run *run = vcpu->run;
>> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
>> +
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
>> +	return 0;
> 
> Dead code.
> 
>> +}
> 
> ...
> 
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 584a5bab3af3..f315af50037d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_clear_guest);
>>   
>> +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> 
> Ouch, sorry.  I suspect you got this name from me[*].  That was a goof on my end,
> I'm 99% certain I copy-pasted stale code, i.e. didn't intend to suggest a
> rename.
> 
> Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
> 
> [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
> 
>> +{
>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>> +
>> +	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
>> +		return;
>> +
>> +	/*
>> +	 * Snapshot the quota to report it to userspace.  The dirty count will be
>> +	 * captured when the request is processed.
>> +	 */
>> +	vcpu->dirty_quota = dirty_quota;
>> +	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> 
> Making the request needs to be guarded with an arch opt-in.  Pending requests
> prevent KVM from entering the guest, and so making a request that an arch isn't
> aware of will effectively hang the vCPU.  Obviously userspace would be shooting
> itself in the foot by setting run->dirty_quota in this case, but KVM shouldn't
> hand userspace a loaded gun and help them aim.
> 
> My suggestion from v1[*] about not forcing architectures to opt-in was in the
> context of a request-less implementation where dirty_quota was a nop until the
> arch took action.
> 
> And regardless of arch opt-in, I think this needs a capability so that userspace
> can detect KVM support.
> 
> I don't see any reason to wrap the request or vcpu_run field, e.g. something like
> this should suffice:
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ea5847d22aff..93362441215b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3298,8 +3298,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>   }
>   EXPORT_SYMBOL_GPL(kvm_clear_guest);
>   
> -static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> +static void kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>   {
> +#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
>          u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>   
>          if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> @@ -3311,6 +3312,7 @@ static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>           */
>          vcpu->dirty_quota = dirty_quota;
>          kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> +#endif
>   }
>   
>   void mark_page_dirty_in_slot(struct kvm *kvm,
> @@ -4507,6 +4509,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>          case KVM_CAP_BINARY_STATS_FD:
>          case KVM_CAP_SYSTEM_EVENT_DATA:
>                  return 1;
> +       case KVM_CAP_DIRTY_QUOTA:
> +               return !!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA);
>          default:
>                  break;
>          }
> 
> 
> [*] https://lore.kernel.org/all/YZaUENi0ZyQi%2F9M0@google.com

Will add. Thank you so much for the comments.
Shivam Kumar Oct. 10, 2022, 5:41 a.m. UTC | #9
On 15/09/22 3:40 pm, Shivam Kumar wrote:
> Define variables to track and throttle memory dirtying for every vcpu.
> 
> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>                  while dirty logging is enabled.
> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>                  more, it needs to request more quota by exiting to
>                  userspace.
> 
> Implement the flow for throttling based on dirty quota.
> 
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
> count equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>   Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
>   include/linux/kvm_host.h       | 20 ++++++++++++++++++-
>   include/linux/kvm_types.h      |  1 +
>   include/uapi/linux/kvm.h       | 12 ++++++++++++
>   virt/kvm/kvm_main.c            | 26 ++++++++++++++++++++++---
>   5 files changed, 90 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..97030a6a35b4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6614,6 +6614,26 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
> +
> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> +makes the following information available to the userspace:
> +    count: the current count of pages dirtied by the VCPU, can be
> +    skewed based on the size of the pages accessed by each vCPU.
> +    quota: the observed dirty quota just before the exit to userspace.
> +
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> @@ -6668,6 +6688,21 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>   
>   ::
>   
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> +	 * is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
> +
> +Please note that enforcing the quota is best effort, as the guest may dirty
> +multiple pages before KVM can recheck the quota.  However, unless KVM is using
> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> +KVM will detect quota exhaustion within a handful of dirtied pages.  If a
> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
> +(512 entries for PML).
> +
> +::
>     };
>   
>   
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f4519d3689e1..9acb28635d94 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -151,12 +151,13 @@ static inline bool is_error_page(struct page *page)
>   #define KVM_REQUEST_NO_ACTION      BIT(10)
>   /*
>    * Architecture-independent vcpu->requests bit members
> - * Bits 4-7 are reserved for more arch-independent bits.
> + * Bits 5-7 are reserved for more arch-independent bits.
>    */
>   #define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_UNBLOCK           2
>   #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_DIRTY_QUOTA_EXIT  4
>   #define KVM_REQUEST_ARCH_BASE     8
>   
>   /*
> @@ -380,6 +381,8 @@ struct kvm_vcpu {
>   	 */
>   	struct kvm_memory_slot *last_used_slot;
>   	u64 last_used_slot_gen;
> +
> +	u64 dirty_quota;
>   };
>   
>   /*
> @@ -542,6 +545,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>   }
>   
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;
> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;
> +}
> +
>   /*
>    * Some of the bitops functions do not support too long bitmaps.
>    * This number must be determined not to exceed such limits.
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 3ca3db020e0e..263a588f3cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
>   	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>   	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>   	u64 blocking;
> +	u64 pages_dirtied;
I am reworking the QEMU patches and I am not sure how I can access the
pages_dirtied info from the userspace side when the migration starts, i.e.
without a dirty quota exit.

I need this info to initialise the dirty quota. This is what I am looking
to do on the userspace side at the start of dirty quota migration:
	dirty_quota = pages_dirtied + some initial quota

Hoping you can help, Sean. Thanks in advance.
>   #define KVM_STATS_NAME_SIZE	48
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index eed0315a77a6..4c4a65b0f0a5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -272,6 +272,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -510,6 +511,11 @@ struct kvm_run {
>   #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>   			__u32 flags;
>   		} notify;
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> @@ -531,6 +537,12 @@ struct kvm_run {
>   		struct kvm_sync_regs regs;
>   		char padding[SYNC_REGS_SIZE_BYTES];
>   	} s;
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> +	 * quota is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
>   };
>   
>   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..f315af50037d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3298,18 +3298,36 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>   }
>   EXPORT_SYMBOL_GPL(kvm_clear_guest);
>   
> +static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> +{
> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> +
> +	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> +		return;
> +
> +	/*
> +	 * Snapshot the quota to report it to userspace.  The dirty count will be
> +	 * captured when the request is processed.
> +	 */
> +	vcpu->dirty_quota = dirty_quota;
> +	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> +}
> +
>   void mark_page_dirty_in_slot(struct kvm *kvm,
>   			     const struct kvm_memory_slot *memslot,
>   		 	     gfn_t gfn)
>   {
>   	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>   
> -#ifdef CONFIG_HAVE_KVM_DIRTY_RING
>   	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>   		return;
> -#endif
>   
> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +	if (!memslot)
> +		return;
> +
> +	WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++);
> +
> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>   		u32 slot = (memslot->as_id << 16) | memslot->id;
>   
> @@ -3318,6 +3336,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>   					    slot, rel_gfn);
>   		else
>   			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +
> +		kvm_vcpu_is_dirty_quota_exhausted(vcpu);
>   	}
>   }
>   EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
Christian Borntraeger Oct. 10, 2022, 6:12 a.m. UTC | #10
On 09.10.22 20:36, Shivam Kumar wrote:
> 
> 
> On 07/10/22 11:48 pm, Sean Christopherson wrote:
>> On Thu, Sep 15, 2022, Christian Borntraeger wrote:
>>> On 15.09.22 15:21, Christian Borntraeger wrote:
>>>> I am wondering if this will work on s390. On s390 we only call
>>>> mark_page_dirty_in_slot for the kvm_read/write functions but not
>>>> for those done by the guest on fault. We do account those lazily in
>>>> kvm_arch_sync_dirty_log (like x86 in the past).
>>>>
>>>
>>> I think we need to rework the page fault handling on s390 to actually make
>>> use of this. This has to happen anyway at some point (as indicated by the guest
>>> enter/exit rework from Mark). Right now we handle KVM page faults directly
>>> in the normal system fault handler. It seems we need to make a side turn
>>> into KVM for page faults on guests in the long run.
>>
>> s390's page table management came up in conversation at KVM Forum in the context
>> of a common MMU (or MMUs)[*].  If/when we get an MMU that supports multiple
>> architectures, that would be a perfect opportunity for s390 to switch.  I don't
>> know that it'll be feasible to make a common MMU support s390 from the get-go,
>> but at the very least we can keep y'all in the loop and not make it unnecessarily
>> difficult to support s390.
>>
>> [*] https://kvmforum2022.sched.com/event/15jJk
> Thank you Sean for the reference to the KVM Forum talk. Should I skip the support for s390 (patch no. 4) for the next patchset?

yes please.
Sean Christopherson Oct. 10, 2022, 4:09 p.m. UTC | #11
On Mon, Oct 10, 2022, Shivam Kumar wrote:
> 
> 
> On 08/10/22 12:50 am, Sean Christopherson wrote:
> > On Fri, Oct 07, 2022, Sean Christopherson wrote:
> > > On Thu, Sep 15, 2022, Shivam Kumar wrote:
> > > Let's keep kvm_vcpu_check_dirty_quota(), IMO that's still the least awful name.
> > > 
> > > [*] https://lore.kernel.org/all/Yo+82LjHSOdyxKzT@google.com
> > 
> > Actually, I take that back.  The code snippet itself is also flawed.  If userspace
> > increases the quota (or disables it entirely) between KVM snapshotting the quota
> > and making the request, then there's no need for KVM to exit to userspace.
> > 
> > So I think this can be:
> > 
> > static bool kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
> > {
> > #ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> > 	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> > 
> > 	return dirty_quota && (vcpu->stat.generic.pages_dirtied >= dirty_quota);
> > #else
> > 	return false;
> > #endif
> > }
> > 
> > and the usage becomes:
> > 
> > 		if (kvm_vcpu_is_dirty_quota_exhausted(vcpu))
> > 			kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
> > 
> > More thoughts in the x86 patch.
> 
> Snapshotting is not a requirement for now anyway. We have plans to lazily
> update the quota, i.e. only when it needs to do more dirtying. This helps us
> prevent overthrottling of the VM due to skewed cases where some vcpus are
> mostly reading and the others are mostly writing.

I don't see how snapshotting can ever be a sane requirement.  Requiring KVM to
exit if KVM detects an exhausted quota even if userspace changes the quota is
nonsensical, as the resulting behavior is 100% non-deterministic unless userspace
is spying on the number of dirty pages.  And if userspace is constantly polling
the number of dirty pages, what's the point of the exit?  Requiring KVM to exit
in this case puts extra burden on KVM without any meaningful benefit.

In other words, we need to consider how KVM's behavior impacts KVM's uABI, not
just what userspace "needs".
Shivam Kumar Oct. 17, 2022, 5:28 a.m. UTC | #12
On 10/10/22 11:11 am, Shivam Kumar wrote:
> 
> 
> On 15/09/22 3:40 pm, Shivam Kumar wrote:
>> Define variables to track and throttle memory dirtying for every vcpu.
>>
>> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>>                  while dirty logging is enabled.
>> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>>                  more, it needs to request more quota by exiting to
>>                  userspace.
>>
>> Implement the flow for throttling based on dirty quota.
>>
>> i) Increment dirty_count for the vcpu whenever it dirties a page.
>> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>> count equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>   Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
>>   include/linux/kvm_host.h       | 20 ++++++++++++++++++-
>>   include/linux/kvm_types.h      |  1 +
>>   include/uapi/linux/kvm.h       | 12 ++++++++++++
>>   virt/kvm/kvm_main.c            | 26 ++++++++++++++++++++++---
>>   5 files changed, 90 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst 
>> b/Documentation/virt/kvm/api.rst
>> index abd7c32126ce..97030a6a35b4 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6614,6 +6614,26 @@ array field represents return values. The 
>> userspace should update the return
>>   values of SBI call before resuming the VCPU. For more details on 
>> RISC-V SBI
>>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>> +::
>> +
>> +        /* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
>> +        struct {
>> +            __u64 count;
>> +            __u64 quota;
>> +        } dirty_quota_exit;
>> +
>> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that 
>> the VCPU has
>> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run 
>> structure
>> +makes the following information available to the userspace:
>> +    count: the current count of pages dirtied by the VCPU, can be
>> +    skewed based on the size of the pages accessed by each vCPU.
>> +    quota: the observed dirty quota just before the exit to userspace.
>> +
>> +The userspace can design a strategy to allocate the overall scope of 
>> dirtying
>> +for the VM among the vcpus. Based on the strategy and the current 
>> state of dirty
>> +quota throttling, the userspace can make a decision to either update 
>> (increase)
>> +the quota or to put the VCPU to sleep for some time.
>> +
>>   ::
>>       /* KVM_EXIT_NOTIFY */
>> @@ -6668,6 +6688,21 @@ values in kvm_run even if the corresponding bit 
>> in kvm_dirty_regs is not set.
>>   ::
>> +    /*
>> +     * Number of pages the vCPU is allowed to have dirtied over its 
>> entire
>> +     * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 
>> if the quota
>> +     * is reached/exceeded.
>> +     */
>> +    __u64 dirty_quota;
>> +
>> +Please note that enforcing the quota is best effort, as the guest may 
>> dirty
>> +multiple pages before KVM can recheck the quota.  However, unless KVM 
>> is using
>> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification 
>> Logging,
>> +KVM will detect quota exhaustion within a handful of dirtied pages.  
>> If a
>> +hardware ring buffer is used, the overrun is bounded by the size of 
>> the buffer
>> +(512 entries for PML).
>> +
>> +::
>>     };
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index f4519d3689e1..9acb28635d94 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -151,12 +151,13 @@ static inline bool is_error_page(struct page *page)
>>   #define KVM_REQUEST_NO_ACTION      BIT(10)
>>   /*
>>    * Architecture-independent vcpu->requests bit members
>> - * Bits 4-7 are reserved for more arch-independent bits.
>> + * Bits 5-7 are reserved for more arch-independent bits.
>>    */
>>   #define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | 
>> KVM_REQUEST_NO_WAKEUP)
>>   #define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | 
>> KVM_REQUEST_NO_WAKEUP)
>>   #define KVM_REQ_UNBLOCK           2
>>   #define KVM_REQ_UNHALT            3
>> +#define KVM_REQ_DIRTY_QUOTA_EXIT  4
>>   #define KVM_REQUEST_ARCH_BASE     8
>>   /*
>> @@ -380,6 +381,8 @@ struct kvm_vcpu {
>>        */
>>       struct kvm_memory_slot *last_used_slot;
>>       u64 last_used_slot_gen;
>> +
>> +    u64 dirty_quota;
>>   };
>>   /*
>> @@ -542,6 +545,21 @@ static inline int 
>> kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>       return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>>   }
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +    struct kvm_run *run = vcpu->run;
>> +    u64 dirty_quota = READ_ONCE(run->dirty_quota);
>> +    u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +
>> +    if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +        return 1;
>> +
>> +    run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +    run->dirty_quota_exit.count = pages_dirtied;
>> +    run->dirty_quota_exit.quota = dirty_quota;
>> +    return 0;
>> +}
>> +
>>   /*
>>    * Some of the bitops functions do not support too long bitmaps.
>>    * This number must be determined not to exceed such limits.
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index 3ca3db020e0e..263a588f3cd3 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
>>       u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>>       u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>>       u64 blocking;
>> +    u64 pages_dirtied;
> I am reworking the QEMU patches and I am not sure how I can access the
> pages_dirtied info from the userspace side when the migration starts, i.e.
> without a dirty quota exit.
> 
> I need this info to initialise the dirty quota. This is what I am looking
> to do on the userspace side at the start of dirty quota migration:
>      dirty_quota = pages_dirtied + some initial quota
> 
> Hoping you can help, Sean. Thanks in advance.
I think I can set dirty_quota initially to 1 and let the vCPU exit with 
exit reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. Then, I can set the quota.
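
In sketch form (hypothetical helper that relies only on the uAPI in this
patch; see the reply below for a way to avoid the extra exit entirely):

static void bootstrap_dirty_quota(struct kvm_run *run)
{
	/*
	 * Any non-zero quota at or below pages_dirtied forces an immediate
	 * KVM_EXIT_DIRTY_QUOTA_EXHAUSTED; dirty_quota_exit.count then seeds
	 * the real quota as count + an initial grant.
	 */
	run->dirty_quota = 1;
}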
Sean Christopherson Oct. 19, 2022, 4:01 p.m. UTC | #13
On Mon, Oct 17, 2022, Shivam Kumar wrote:
> 
> On 10/10/22 11:11 am, Shivam Kumar wrote:
> > 
> > On 15/09/22 3:40 pm, Shivam Kumar wrote:
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 3ca3db020e0e..263a588f3cd3 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -118,6 +118,7 @@ struct kvm_vcpu_stat_generic {
> > >       u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
> > >       u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
> > >       u64 blocking;
> > > +    u64 pages_dirtied;
> > I am reworking the QEMU patches and I am not sure how I can access the
> > pages_dirtied info from the userspace side when the migration starts, i.e.
> > without a dirty quota exit.
> > 
> > I need this info to initialise the dirty quota. This is what I am looking
> > to do on the userspace side at the start of dirty quota migration:
> >      dirty_quota = pages_dirtied + some initial quota
> > 
> > Hoping you can help, Sean. Thanks in advance.
> I think I can set dirty_quota initially to 1 and let the vCPU exit with exit
> reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. Then, I can set the quota.

The vCPU doesn't need to be paused to read stats; pages_dirtied can be read while
the vCPU is running via KVM_GET_STATS_FD.  Though at a glance, QEMU doesn't yet
utilize KVM_GET_STATS_FD, i.e. QEMU would need a decent chunk of infrastructure
improvements to avoid the unnecessary exit.
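
A sketch of that read against the binary stats ABI (error handling omitted;
pages_dirtied is the stat this series adds, the rest is standard
KVM_GET_STATS_FD usage):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static uint64_t read_pages_dirtied(int vcpu_fd)
{
	int stats_fd = ioctl(vcpu_fd, KVM_GET_STATS_FD, NULL);
	struct kvm_stats_header hdr;
	struct kvm_stats_desc *desc;
	uint64_t val = 0;
	size_t desc_sz;
	char *descs;
	unsigned int i;

	pread(stats_fd, &hdr, sizeof(hdr), 0);

	/* Each descriptor is followed by a name_size-byte name field. */
	desc_sz = sizeof(*desc) + hdr.name_size;
	descs = malloc(hdr.num_desc * desc_sz);
	pread(stats_fd, descs, hdr.num_desc * desc_sz, hdr.desc_offset);

	for (i = 0; i < hdr.num_desc; i++) {
		desc = (struct kvm_stats_desc *)(descs + i * desc_sz);
		if (!strcmp(desc->name, "pages_dirtied")) {
			pread(stats_fd, &val, sizeof(val),
			      hdr.data_offset + desc->offset);
			break;
		}
	}
	free(descs);
	close(stats_fd);
	return val;
}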
Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..97030a6a35b4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6614,6 +6614,26 @@  array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
+
+If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
+exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
+makes the following information available to the userspace:
+    count: the current count of pages dirtied by the VCPU, can be
+    skewed based on the size of the pages accessed by each vCPU.
+    quota: the observed dirty quota just before the exit to userspace.
+
+The userspace can design a strategy to allocate the overall scope of dirtying
+for the VM among the vcpus. Based on the strategy and the current state of dirty
+quota throttling, the userspace can make a decision to either update (increase)
+the quota or to put the VCPU to sleep for some time.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -6668,6 +6688,21 @@  values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
+	 * is reached/exceeded.
+	 */
+	__u64 dirty_quota;
+
+Please note that enforcing the quota is best effort, as the guest may dirty
+multiple pages before KVM can recheck the quota.  However, unless KVM is using
+a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
+KVM will detect quota exhaustion within a handful of dirtied pages.  If a
+hardware ring buffer is used, the overrun is bounded by the size of the buffer
+(512 entries for PML).
+
+::
   };
 
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f4519d3689e1..9acb28635d94 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,12 +151,13 @@  static inline bool is_error_page(struct page *page)
 #define KVM_REQUEST_NO_ACTION      BIT(10)
 /*
  * Architecture-independent vcpu->requests bit members
- * Bits 4-7 are reserved for more arch-independent bits.
+ * Bits 5-7 are reserved for more arch-independent bits.
  */
 #define KVM_REQ_TLB_FLUSH         (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_VM_DEAD           (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
+#define KVM_REQ_DIRTY_QUOTA_EXIT  4
 #define KVM_REQUEST_ARCH_BASE     8
 
 /*
@@ -380,6 +381,8 @@  struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+
+	u64 dirty_quota;
 };
 
 /*
@@ -542,6 +545,21 @@  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
 }
 
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	u64 dirty_quota = READ_ONCE(run->dirty_quota);
+	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+
+	if (!dirty_quota || (pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+	run->dirty_quota_exit.count = pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
+	return 0;
+}
+
 /*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 3ca3db020e0e..263a588f3cd3 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -118,6 +118,7 @@  struct kvm_vcpu_stat_generic {
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
 	u64 blocking;
+	u64 pages_dirtied;
 };
 
 #define KVM_STATS_NAME_SIZE	48
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..4c4a65b0f0a5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -272,6 +272,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -510,6 +511,11 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -531,6 +537,12 @@  struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
+	 * quota is reached/exceeded.
+	 */
+	__u64 dirty_quota;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..f315af50037d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3298,18 +3298,36 @@  int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
+static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
+{
+	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+
+	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
+		return;
+
+	/*
+	 * Snapshot the quota to report it to userspace.  The dirty count will be
+	 * captured when the request is processed.
+	 */
+	vcpu->dirty_quota = dirty_quota;
+	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
+}
+
 void mark_page_dirty_in_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *memslot,
 		 	     gfn_t gfn)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-#ifdef CONFIG_HAVE_KVM_DIRTY_RING
 	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
 		return;
-#endif
 
-	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
+	if (!memslot)
+		return;
+
+	WARN_ON_ONCE(!vcpu->stat.generic.pages_dirtied++);
+
+	if (kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
@@ -3318,6 +3336,8 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 					    slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
+
+		kvm_vcpu_is_dirty_quota_exhausted(vcpu);
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);