[3/6] Add KVM_CAP_DIRTY_QUOTA_MIGRATION and handle vCPU page faults.

Message ID 20211114145721.209219-4-shivam.kumar1@nutanix.com (mailing list archive)
State New, archived
Series KVM: Dirty Quota-Based VM Live Migration Auto-Converge

Commit Message

Shivam Kumar Nov. 14, 2021, 2:57 p.m. UTC
When a dirty quota migration is initiated from the QEMU side, the following
things happen:

1. An mmap ioctl is called for each vCPU to mmap the dirty quota context.
This results in a vCPU page fault, which needs to be handled.
2. An ioctl to start dirty quota migration is called from QEMU and must be
handled. This happens once QEMU is ready to start the migration.
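
For reference, a minimal userspace-side sketch of these two steps (error
handling elided; KVM_DIRTY_QUOTA_PAGE_OFFSET, KVM_CAP_DIRTY_QUOTA_MIGRATION
and struct vCPUDirtyQuotaContext are the definitions introduced by this
series, not existing uAPI):

	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/kvm.h>

	/* Step 1: mmap the per-vCPU dirty quota context page. The first
	 * access faults into kvm_vcpu_fault(), which this patch handles.
	 */
	static struct vCPUDirtyQuotaContext *map_dirty_quota_ctx(int vcpu_fd,
			long page_size)
	{
		return mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, vcpu_fd,
			    KVM_DIRTY_QUOTA_PAGE_OFFSET * page_size);
	}

	/* Step 2: toggle dirty quota migration once QEMU is ready. */
	static int set_dirty_quota_migration(int vm_fd, bool enable)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_DIRTY_QUOTA_MIGRATION,
			.args[0] = enable,
		};

		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}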

Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
Signed-off-by: Shaju Abraham <shaju.abraham@nutanix.com>
Signed-off-by: Manish Mishra <manish.mishra@nutanix.com>
---
 Documentation/virt/kvm/api.rst        | 39 +++++++++++++++++++++++++++
 include/linux/dirty_quota_migration.h |  8 ++++++
 include/uapi/linux/kvm.h              |  1 +
 virt/kvm/dirty_quota_migration.c      |  6 +++++
 virt/kvm/kvm_main.c                   | 37 +++++++++++++++++++++++++
 5 files changed, 91 insertions(+)

Comments

Sean Christopherson Nov. 18, 2021, 5:57 p.m. UTC | #1
On Sun, Nov 14, 2021, Shivam Kumar wrote:
> +static int kvm_vm_ioctl_enable_dirty_quota_migration(struct kvm *kvm,
> +		bool enabled)
> +{
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)

I don't think we should force architectures to opt in.  It would be trivial to
add 

		if (kvm_dirty_quota_is_full(vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
			r = 0;
			break;
		}

in the run loops of each architecture.  And we can do that in incremental patches
without #ifdeffery since it's only the exiting aspect that requires arch help.
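
kvm_dirty_quota_is_full() doesn't exist anywhere in this series yet; one
plausible sketch, reusing the vCPUDirtyQuotaContext fields from this
patchset, would be:

	/* Hypothetical helper, not part of this series. */
	static inline bool kvm_dirty_quota_is_full(struct kvm_vcpu *vcpu)
	{
		struct vCPUDirtyQuotaContext *dqctx = vcpu->vCPUdqctx;

		/* A zero quota means throttling is not in effect. */
		return dqctx && dqctx->dirty_quota &&
		       dqctx->dirty_counter >= dqctx->dirty_quota;
	}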

> +		return -EINVAL;
> +
> +	/*
> +	 * For now, dirty quota migration works with dirty bitmap so don't
> +	 * enable it if dirty ring interface is enabled. In future, dirty
> +	 * quota migration may work with dirty ring interface as well.
> +	 */

Why does KVM care?  This is a very simple concept.  QEMU not using it for the
dirty ring doesn't mean KVM can't support it.

> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	/* Return if no change */
> +	if (kvm->dirty_quota_migration_enabled == enabled)

Needs to be checked under lock.

> +		return -EINVAL;

Probably more idiomatic to return 0 if the desired value is the current value.

> +	mutex_lock(&kvm->lock);
> +	kvm->dirty_quota_migration_enabled = enabled;

Needs to check vCPU creation.

> +	mutex_unlock(&kvm->lock);
> +
> +	return 0;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -4305,6 +4339,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  	}
>  	case KVM_CAP_DIRTY_LOG_RING:
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_QUOTA_MIGRATION:
> +		return kvm_vm_ioctl_enable_dirty_quota_migration(kvm,
> +				cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> -- 
> 2.22.3
>
Shivam Kumar Nov. 19, 2021, 8:03 p.m. UTC | #2
On 18/11/21 11:27 pm, Sean Christopherson wrote:
> On Sun, Nov 14, 2021, Shivam Kumar wrote:
>> +static int kvm_vm_ioctl_enable_dirty_quota_migration(struct kvm *kvm,
>> +		bool enabled)
>> +{
>> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> I don't think we should force architectures to opt in.  It would be trivial to
> add
>
> 		if (kvm_dirty_quota_is_full(vcpu)) {
> 			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
> 			r = 0;
> 			break;
> 		}
>
> in the run loops of each architecture.  And we can do that in incremental patches
> without #ifdeffery since it's only the exiting aspect that requires arch help.
Noted. Thanks.
>
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * For now, dirty quota migration works with dirty bitmap so don't
>> +	 * enable it if dirty ring interface is enabled. In future, dirty
>> +	 * quota migration may work with dirty ring interface as well.
>> +	 */
> Why does KVM care?  This is a very simple concept.  QEMU not using it for the
> dirty ring doesn't mean KVM can't support it.
>

The dirty ring interface, if enabled, bypasses the path that updates the 
dirty bitmap, and our current implementation depends on that path. We were 
planning to make the required changes for it to work with the dirty ring 
as well in upcoming patches. We will explore the possibility of doing it 
in the next patchset itself.
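
For context, a simplified sketch of mark_page_dirty_in_slot() (from
virt/kvm/kvm_main.c around the v5.16 window; details vary by kernel
version) shows the split: with the dirty ring enabled, the bitmap update
that our counter relies on is skipped entirely.

	void mark_page_dirty_in_slot(struct kvm *kvm,
				     struct kvm_memory_slot *memslot, gfn_t gfn)
	{
		if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
			unsigned long rel_gfn = gfn - memslot->base_gfn;
			u32 slot = (memslot->as_id << 16) | memslot->id;

			if (kvm->dirty_ring_size)
				/* Ring path: the bitmap below is never touched. */
				kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
						    slot, rel_gfn);
			else
				set_bit_le(rel_gfn, memslot->dirty_bitmap);
		}
	}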


>> +	if (kvm->dirty_ring_size)
>> +		return -EINVAL;
>> +
>> +	/* Return if no change */
>> +	if (kvm->dirty_quota_migration_enabled == enabled)
> Needs to be checked under lock.


Noted. Thanks.


>
>> +		return -EINVAL;
> Probably more idiomatic to return 0 if the desired value is the current value.


Keeping in mind the case where userspace tries to enable it while a 
migration is already in progress (which shouldn't happen), we are 
returning -EINVAL. Please let me know if 0 still makes more sense.


>
>> +	mutex_lock(&kvm->lock);
>> +	kvm->dirty_quota_migration_enabled = enabled;
> Needs to check vCPU creation.


In our current implementation, we are using the 
KVM_CAP_DIRTY_QUOTA_MIGRATION ioctl to start dirty logging (through the 
dirty counter) on the kernel side. This ioctl is called each time a new 
migration starts and again when it ends.

The dirty quota context of each vCPU is stored in two variables, the 
dirty counter and the dirty quota, which we allocate at vCPU creation 
and free at vCPU destruction.


>
>> +	mutex_unlock(&kvm->lock);
>> +
>> +	return 0;
>> +}
>> +
>>   int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>   						  struct kvm_enable_cap *cap)
>>   {
>> @@ -4305,6 +4339,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>   	}
>>   	case KVM_CAP_DIRTY_LOG_RING:
>>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_QUOTA_MIGRATION:
>> +		return kvm_vm_ioctl_enable_dirty_quota_migration(kvm,
>> +				cap->args[0]);
>>   	default:
>>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>>   	}
>> -- 
>> 2.22.3
>>
Thank you very much for your feedback.
Shivam Kumar Nov. 19, 2021, 8:21 p.m. UTC | #3
On 20/11/21 1:36 am, Sean Christopherson wrote:
> On Sat, Nov 20, 2021, Shivam Kumar wrote:
>> On 18/11/21 11:27 pm, Sean Christopherson wrote:
>>>> +		return -EINVAL;
>>> Probably more idiomatic to return 0 if the desired value is the current value.
>> Keeping in mind the case where userspace tries to enable it while a
>> migration is already in progress (which shouldn't happen), we are returning
>> -EINVAL. Please let me know if 0 still makes more sense.
> If the semantics are not "enable/disable", but rather "(re)set the quota",
> then it makes sense to allow changing the quota arbitrarily.
I agree that the semantics are not apt. Will modify it. Thanks.
>
>>>> +	mutex_lock(&kvm->lock);
>>>> +	kvm->dirty_quota_migration_enabled = enabled;
>>> Needs to check vCPU creation.
>> In our current implementation, we are using the
>> KVM_CAP_DIRTY_QUOTA_MIGRATION ioctl to start dirty logging (through the
>> dirty counter) on the kernel side. This ioctl is called each time a new
>> migration starts and again when it ends.
> Ah, and from the cover letter discussion, you want the count and quota to be
> reset when a new migration occurs.  That makes sense.
>
> Actually, if we go the route of using kvm_run to report and update the count/quota,
> we don't even need a capability.  Userspace can signal each vCPU to induce an exit
> to userspace, e.g. at the start of migration, then set the desired quota/count in
> vcpu->kvm_run and stuff exit_reason so that KVM updates the quota/count on the
> subsequent KVM_RUN.  No locking or requests needed, and userspace can reset the
> count at will, it just requires a signal.
>
> It's a little weird to overload exit_reason like that, but if that's a sticking
> point we could add a flag in kvm_run somewhere.  Requiring an exit to userspace
> at the start of migration doesn't seem too onerous.
Yes, this path looks flaw-free. We will explore the complexity and how 
we can simplify its implementation.
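
In concrete terms, the flow Sean describes might look like this (the
fields below are hypothetical and not part of this series or mainline;
KVM_EXIT_DIRTY_QUOTA_FULL is the exit reason this series proposes):

	#include <stdint.h>

	/* Hypothetical kvm_run extension for the scheme above. */
	struct kvm_run_dirty_quota {
		uint64_t dirty_counter;	/* pages dirtied; bumped by KVM */
		uint64_t dirty_quota;	/* limit; writable by userspace */
	};

	/* Per-vCPU flow:
	 * 1. Userspace signals the vCPU thread so KVM_RUN returns.
	 * 2. Userspace writes the desired quota (and resets the counter).
	 * 3. On re-entry, KVM picks up the new values from kvm_run.
	 * 4. When dirty_counter reaches dirty_quota, KVM exits with the
	 *    "quota full" reason and userspace throttles or re-arms.
	 */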
Shivam Kumar Nov. 25, 2021, 8:43 a.m. UTC | #4
On 20/11/21 1:51 am, Shivam Kumar wrote:
>
> On 20/11/21 1:36 am, Sean Christopherson wrote:
>> On Sat, Nov 20, 2021, Shivam Kumar wrote:
>>> On 18/11/21 11:27 pm, Sean Christopherson wrote:
>>>>> +        return -EINVAL;
>>>> Probably more idiomatic to return 0 if the desired value is the 
>>>> current value.
>>> Keeping in mind the case where userspace tries to enable it while a
>>> migration is already in progress (which shouldn't happen), we are
>>> returning -EINVAL. Please let me know if 0 still makes more sense.
>> If the semantics are not "enable/disable", but rather "(re)set the 
>> quota",
>> then it makes sense to allow changing the quota arbitrarily.
> I agree that the semantics are not apt. Will modify it. Thanks.
>>
>>>>> +    mutex_lock(&kvm->lock);
>>>>> +    kvm->dirty_quota_migration_enabled = enabled;
>>>> Needs to check vCPU creation.
>>> In our current implementation, we are using the
>>> KVM_CAP_DIRTY_QUOTA_MIGRATION ioctl to start dirty logging (through
>>> the dirty counter) on the kernel side. This ioctl is called each time
>>> a new migration starts and again when it ends.
>> Ah, and from the cover letter discussion, you want the count and 
>> quota to be
>> reset when a new migration occurs.  That makes sense.
>>
>> Actually, if we go the route of using kvm_run to report and update 
>> the count/quota,
>> we don't even need a capability.  Userspace can signal each vCPU to 
>> induce an exit
>> to userspace, e.g. at the start of migration, then set the desired 
>> quota/count in
>> vcpu->kvm_run and stuff exit_reason so that KVM updates the 
>> quota/count on the
>> subsequent KVM_RUN.  No locking or requests needed, and userspace can 
>> reset the
>> count at will, it just requires a signal.
>>
>> It's a little weird to overload exit_reason like that, but if that's 
>> a sticking
>> point we could add a flag in kvm_run somewhere.  Requiring an exit to 
>> userspace
>> at the start of migration doesn't seem too onerous.
> Yes, this path looks flaw-free. We will explore the complexity and how 
> we can simplify its implementation.
Is it okay to define the per-vCPU dirty quota and dirty count in the 
kvm_run structure itself? It can save space and reduce the complexity of 
the implementation by a large margin.
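
For illustration, if both values lived in kvm_run, the check on the
page-dirtying path could reduce to something like the following sketch
(not code from this series; field names assumed):

	/* Hypothetical: returns false when the vCPU must exit to
	 * userspace because its dirty quota is exhausted.
	 */
	static bool kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
	{
		struct kvm_run *run = vcpu->run;

		if (!run->dirty_quota)
			return true;	/* no quota set; throttling off */

		run->dirty_counter++;
		if (run->dirty_counter < run->dirty_quota)
			return true;	/* still under quota */

		run->exit_reason = KVM_EXIT_DIRTY_QUOTA_FULL;
		return false;
	}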
Sean Christopherson Dec. 1, 2021, 5:22 p.m. UTC | #5
On Thu, Nov 25, 2021, Shivam Kumar wrote:
> 
> On 20/11/21 1:51 am, Shivam Kumar wrote:
> > 
> > On 20/11/21 1:36 am, Sean Christopherson wrote:
> > > Actually, if we go the route of using kvm_run to report and update the
> > > count/quota, we don't even need a capability.  Userspace can signal each
> > > vCPU to induce an exit to userspace, e.g. at the start of migration, then
> > > set the desired quota/count in vcpu->kvm_run and stuff exit_reason so
> > > that KVM updates the quota/count on the subsequent KVM_RUN.  No locking
> > > or requests needed, and userspace can reset the count at will, it just
> > > requires a signal.
> > > 
> > > It's a little weird to overload exit_reason like that, but if that's a
> > > sticking point we could add a flag in kvm_run somewhere.  Requiring an
> > > exit to userspace at the start of migration doesn't seem too onerous.
> >
> > Yes, this path looks flaw-free. We will explore the complexity and how
> > we can simplify its implementation.
>
> Is it okay to define the per-vcpu dirty quota and dirty count in the kvm_run
> structure itself? It can save space and reduce the complexity of the
> implemenation by large margin.

Paolo, I'm guessing this question is directed at you since I made the suggestion :-)

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..6679bceee649 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -277,6 +277,10 @@  the VCPU file descriptor can be mmap-ed, including:
   KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
   KVM_CAP_DIRTY_LOG_RING, see section 8.3.
 
+- if KVM_CAP_DIRTY_QUOTA_MIGRATION is available, a number of pages at
+  KVM_DIRTY_QUOTA_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_QUOTA_MIGRATION, see section 8.35.
+
 
 4.6 KVM_SET_MEMORY_REGION
 -------------------------
@@ -7484,3 +7488,38 @@  The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
 of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
 the hypercalls whose corresponding bit is in the argument, and return
 ENOSYS for the others.
+
+8.35 KVM_CAP_DIRTY_QUOTA_MIGRATION
+----------------------------------
+
+:Architectures: x86
+:Parameters: args[0] - boolean value specifying whether to enable or
+  disable dirty quota migration (true and false respectively)
+
+With dirty quota migration, memory dirtying is throttled by setting a
+limit on the number of pages a vCPU can dirty within fixed, very short
+time intervals. This limit depends on the network throughput calculated
+over the last few intervals, so that the vCPUs are throttled based on
+available network bandwidth. We refer to this limit as the "dirty quota"
+of a vCPU and to the fixed-size intervals as the "dirty quota
+intervals".
+
+vCPUDirtyQuotaContext keeps the dirty quota context for each vCPU. It
+holds the number of pages the vCPU has dirtied in the ongoing dirty
+quota interval (dirty_counter) and the maximum number of dirties allowed
+for the vCPU in that interval (dirty_quota).
+
+  struct vCPUDirtyQuotaContext {
+          u64 dirty_counter;
+          u64 dirty_quota;
+  };
+
+The flag dirty_quota_migration_enabled determines whether dirty quota-
+based throttling is enabled for an ongoing migration.
+
+When the guest tries to dirty a page, it leads to a vmexit as each page
+is write-protected. In the vmexit path, we increment the dirty_counter
+for the corresponding vCPU. Then, we check if the vCPU has exceeded its
+quota. If yes, we exit to userspace with a new exit reason
+KVM_EXIT_DIRTY_QUOTA_FULL. This "quota full" event is further handled on
+the userspace side.
diff --git a/include/linux/dirty_quota_migration.h b/include/linux/dirty_quota_migration.h
index 8c12fa428436..b6c6f5f896dd 100644
--- a/include/linux/dirty_quota_migration.h
+++ b/include/linux/dirty_quota_migration.h
@@ -24,9 +24,17 @@  static inline int kvm_vcpu_dirty_quota_alloc(struct vCPUDirtyQuotaContext **vCPU
 	return 0;
 }
 
+static inline struct page *kvm_dirty_quota_context_get_page(
+		struct vCPUDirtyQuotaContext *vCPUdqctx, u32 offset)
+{
+	return NULL;
+}
+
 #else /* KVM_DIRTY_QUOTA_PAGE_OFFSET == 0 */
 
 int kvm_vcpu_dirty_quota_alloc(struct vCPUDirtyQuotaContext **vCPUdqctx);
+struct page *kvm_dirty_quota_context_get_page(
+		struct vCPUDirtyQuotaContext *vCPUdqctx, u32 offset);
 
 #endif /* KVM_DIRTY_QUOTA_PAGE_OFFSET == 0 */
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 647f7e1a04dc..a6785644bf47 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1131,6 +1131,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
 #define KVM_CAP_ARM_MTE 205
 #define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
+#define KVM_CAP_DIRTY_QUOTA_MIGRATION 207
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/dirty_quota_migration.c b/virt/kvm/dirty_quota_migration.c
index 262f071aac0c..7e9ace760939 100644
--- a/virt/kvm/dirty_quota_migration.c
+++ b/virt/kvm/dirty_quota_migration.c
@@ -12,3 +12,9 @@  int kvm_vcpu_dirty_quota_alloc(struct vCPUDirtyQuotaContext **vCPUdqctx)
 	memset((*vCPUdqctx), 0, size);
 	return 0;
 }
+
+struct page *kvm_dirty_quota_context_get_page(
+		struct vCPUDirtyQuotaContext *vCPUdqctx, u32 offset)
+{
+	return vmalloc_to_page((void *)vCPUdqctx + offset * PAGE_SIZE);
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5626ae1b92ce..1564d3a3f608 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3519,6 +3519,9 @@  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 		page = kvm_dirty_ring_get_page(
 		    &vcpu->dirty_ring,
 		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
+	else if (vmf->pgoff == KVM_DIRTY_QUOTA_PAGE_OFFSET)
+		page = kvm_dirty_quota_context_get_page(vcpu->vCPUdqctx,
+				vmf->pgoff - KVM_DIRTY_QUOTA_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -4207,6 +4210,12 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 		return 1;
+	case KVM_CAP_DIRTY_QUOTA_MIGRATION:
+#if KVM_DIRTY_QUOTA_PAGE_OFFSET > 0
+		return 1;
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
@@ -4273,6 +4282,31 @@  static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 	return cleared;
 }
 
+static int kvm_vm_ioctl_enable_dirty_quota_migration(struct kvm *kvm,
+		bool enabled)
+{
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return -EINVAL;
+
+	/*
+	 * For now, dirty quota migration works with dirty bitmap so don't
+	 * enable it if dirty ring interface is enabled. In future, dirty
+	 * quota migration may work with dirty ring interface as well.
+	 */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	/* Return if no change */
+	if (kvm->dirty_quota_migration_enabled == enabled)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+	kvm->dirty_quota_migration_enabled = enabled;
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -4305,6 +4339,9 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 	}
 	case KVM_CAP_DIRTY_LOG_RING:
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_QUOTA_MIGRATION:
+		return kvm_vm_ioctl_enable_dirty_quota_migration(kvm,
+				cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}