
[v6,3/8] KVM: Add support for using dirty ring in conjunction with bitmap

Message ID 20221011061447.131531-4-gshan@redhat.com (mailing list archive)
State New, archived
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Oct. 11, 2022, 6:14 a.m. UTC
Some architectures (such as arm64) need to dirty memory outside of the
context of a vCPU. Of course, this simply doesn't fit with the UAPI of
KVM's per-vCPU dirty ring.

Introduce a new flavor of dirty ring that requires the use of both vCPU
dirty rings and a dirty bitmap. The expectation is that for non-vCPU
sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before
migrating the VM to the target.

Use an additional capability to advertise this behavior and require
explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
not allow userspace to enable dirty ring if it hasn't also enabled the
ring && bitmap capability, as a VM is likely DOA without the pages
marked in the bitmap.

Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst | 17 ++++++++---------
 include/linux/kvm_dirty_ring.h |  6 ++++++
 include/linux/kvm_host.h       |  1 +
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/Kconfig               |  8 ++++++++
 virt/kvm/dirty_ring.c          |  5 +++++
 virt/kvm/kvm_main.c            | 34 +++++++++++++++++++++++++---------
 7 files changed, 54 insertions(+), 18 deletions(-)
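
The opt-in ordering the patch enforces (the WITH_BITMAP capability must be acknowledged before the dirty ring capability can be enabled) can be sketched with a standalone mock of the two enable paths. The `mock_*` names are illustrative only; the real logic lives in kvm_vm_ioctl_enable_cap_generic() and kvm_vm_ioctl_enable_dirty_log_ring(), which also perform further checks omitted here:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Minimal stand-in for the relevant struct kvm fields. */
struct mock_kvm {
	unsigned int dirty_ring_size;
	bool dirty_ring_with_bitmap;
};

/* Mirrors enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: a plain opt-in flag. */
static int mock_enable_ring_with_bitmap(struct mock_kvm *kvm)
{
	kvm->dirty_ring_with_bitmap = true;
	return 0;
}

/*
 * Mirrors kvm_vm_ioctl_enable_dirty_log_ring() on an arch that selects
 * HAVE_KVM_DIRTY_RING_WITH_BITMAP: enabling the ring fails unless the
 * bitmap capability was acknowledged first (other checks omitted).
 */
static int mock_enable_dirty_ring(struct mock_kvm *kvm, unsigned int size)
{
	if (!kvm->dirty_ring_with_bitmap)
		return -EINVAL;
	kvm->dirty_ring_size = size;
	return 0;
}
```

Enabling the ring first fails with -EINVAL; acknowledging the bitmap capability first lets the ring enable succeed.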

Comments

Peter Xu Oct. 18, 2022, 4:07 p.m. UTC | #1
On Tue, Oct 11, 2022 at 02:14:42PM +0800, Gavin Shan wrote:
> Some architectures (such as arm64) need to dirty memory outside of the
> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> KVM's per-vCPU dirty ring.
> 
> Introduce a new flavor of dirty ring that requires the use of both vCPU
> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> the dirty bitmap. Userspace should scan the dirty bitmap before
> migrating the VM to the target.
> 
> Use an additional capability to advertize this behavior and require
> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> not allow userspace to enable dirty ring if it hasn't also enabled the
> ring && bitmap capability, as a VM is likely DOA without the pages
> marked in the bitmap.
> 
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst | 17 ++++++++---------
>  include/linux/kvm_dirty_ring.h |  6 ++++++
>  include/linux/kvm_host.h       |  1 +
>  include/uapi/linux/kvm.h       |  1 +
>  virt/kvm/Kconfig               |  8 ++++++++
>  virt/kvm/dirty_ring.c          |  5 +++++
>  virt/kvm/kvm_main.c            | 34 +++++++++++++++++++++++++---------
>  7 files changed, 54 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 32427ea160df..09fa6c491c1b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8019,8 +8019,8 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
>  (0x40000001). Otherwise, a guest may use the paravirtual features
>  regardless of what has actually been exposed through the CPUID leaf.
>  
> -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> -----------------------------------------------------------
> +8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_WITH_BITMAP}

Shall we open a new section for RING_WITH_BITMAP?  Otherwise here it still
looks like these are three interchangeable flavors of the ring.

Perhaps RING_WITH_BITMAP doesn't merit a section at all, so we can avoid
mentioning it here to avoid confusion.

> +-------------------------------------------------------------
>  
>  :Architectures: x86
>  :Parameters: args[0] - size of the dirty log ring
> @@ -8104,13 +8104,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
>  needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
>  vmexit ensures that all dirty GFNs are flushed to the dirty rings.
>  
> -NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
> -ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
> -KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
> -KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
> -machine will switch to ring-buffer dirty page tracking and further
> -KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
> -
>  NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
>  should be exposed by weakly ordered architecture, in order to indicate
>  the additional memory ordering requirements imposed on userspace when
> @@ -8119,6 +8112,12 @@ Architecture with TSO-like ordering (such as x86) are allowed to
>  expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>  to userspace.
>  
> +NOTE: There is no running vcpu and available vcpu dirty ring when pages

IMHO it'll be great to start with something like below to describe the
userspace's responsibility to proactively detect the WITH_BITMAP cap:

  Before using the dirty rings, the userspace needs to detect the cap of
  KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
  need to be backed by per-slot bitmaps.

  When KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP returns 1, it means the arch can
  dirty guest pages without vcpu/ring context, so that some of the dirty
  information will still be maintained in the bitmap structure.

  Note that the bitmap here is only a backup of the ring structure, and it
  doesn't need to be collected until the final switch-over of migration
  process.  Normally the bitmap should only contain a very small amount of
  dirty pages, which need to be transferred during VM downtime.

  To collect dirty bits in the backup bitmap, the userspace can use the
  same KVM_GET_DIRTY_LOG ioctl.  Since it's always the last phase of
  migration that needs the fetching of dirty bitmap, KVM_CLEAR_DIRTY_LOG
  ioctl should not be needed in this case and its behavior is undefined.

That's how I understand this new cap, but let me know if you think any of
the above is improper.
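
The backup-bitmap harvest described above boils down to a plain bit scan of the bitmap returned by KVM_GET_DIRTY_LOG (one bit per guest page of the memslot). A minimal userspace-side sketch, assuming the kernel's layout where bit i of the unsigned long array covers page i of the slot; `harvest_dirty_bitmap` is a hypothetical helper name:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Walk a dirty bitmap covering 'npages' guest pages and invoke 'cb' once
 * per dirty page with its page offset within the memslot. Returns the
 * number of dirty pages found.
 */
static size_t harvest_dirty_bitmap(const unsigned long *bitmap, size_t npages,
				   void (*cb)(size_t pfn_offset, void *opaque),
				   void *opaque)
{
	size_t found = 0;

	for (size_t i = 0; i < npages; i++) {
		if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))) {
			if (cb)
				cb(i, opaque);
			found++;
		}
	}
	return found;
}
```

In a real VMM the callback would queue the page for transfer during the final switch-over phase.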

> +becomes dirty in some cases. One example is to save arm64's vgic/its
> +tables during migration.

Nit: it'll be great to mention the exact arm ioctl here just in case anyone
would like to further reference the code.

> The dirty bitmap is still used to track those
> +dirty pages, indicated by KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The ditry
> +bitmap is visited by KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG ioctls.
> +
>  8.30 KVM_CAP_XEN_HVM
>  --------------------
>  
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> index fe5982b46424..23b2b466aa0f 100644
> --- a/include/linux/kvm_dirty_ring.h
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -28,6 +28,11 @@ struct kvm_dirty_ring {
>  };
>  
>  #ifndef CONFIG_HAVE_KVM_DIRTY_RING
> +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> +{
> +	return false;
> +}
> +
>  /*
>   * If CONFIG_HAVE_HVM_DIRTY_RING not defined, kvm_dirty_ring.o should
>   * not be included as well, so define these nop functions for the arch.
> @@ -66,6 +71,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
>  
>  #else /* CONFIG_HAVE_KVM_DIRTY_RING */
>  
> +bool kvm_dirty_ring_exclusive(struct kvm *kvm);
>  int kvm_cpu_dirty_log_size(void);
>  u32 kvm_dirty_ring_get_rsvd_entries(void);
>  int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 53fa3134fee0..a3fae111f25c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -780,6 +780,7 @@ struct kvm {
>  	pid_t userspace_pid;
>  	unsigned int max_halt_poll_ns;
>  	u32 dirty_ring_size;
> +	bool dirty_ring_with_bitmap;
>  	bool vm_bugged;
>  	bool vm_dead;
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0d5d4419139a..c87b5882d7ae 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_ZPCI_OP 221
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> +#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..228be1145cf3 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
>         bool
>         select HAVE_KVM_DIRTY_RING
>  
> +# Only architectures that need to dirty memory outside of a vCPU
> +# context should select this, advertising to userspace the
> +# requirement to use a dirty bitmap in addition to the vCPU dirty
> +# ring.
> +config HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	bool
> +	depends on HAVE_KVM_DIRTY_RING
> +
>  config HAVE_KVM_EVENTFD
>         bool
>         select EVENTFD
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f68d75026bc0..9cc60af291ef 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -11,6 +11,11 @@
>  #include <trace/events/kvm.h>
>  #include "kvm_mm.h"
>  
> +bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> +{
> +	return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
> +}
> +
>  int __weak kvm_cpu_dirty_log_size(void)
>  {
>  	return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5b064dbadaf4..8915dcefcefd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>  			new->dirty_bitmap = NULL;
>  		else if (old && old->dirty_bitmap)
>  			new->dirty_bitmap = old->dirty_bitmap;
> -		else if (!kvm->dirty_ring_size) {
> +		else if (!kvm_dirty_ring_exclusive(kvm)) {
>  			r = kvm_alloc_dirty_bitmap(new);
>  			if (r)
>  				return r;
> @@ -2060,8 +2060,8 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>  	unsigned long n;
>  	unsigned long any = 0;
>  
> -	/* Dirty ring tracking is exclusive to dirty log tracking */
> -	if (kvm->dirty_ring_size)
> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
> +	if (kvm_dirty_ring_exclusive(kvm))
>  		return -ENXIO;
>  
>  	*memslot = NULL;
> @@ -2125,8 +2125,8 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>  	unsigned long *dirty_bitmap_buffer;
>  	bool flush;
>  
> -	/* Dirty ring tracking is exclusive to dirty log tracking */
> -	if (kvm->dirty_ring_size)
> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
> +	if (kvm_dirty_ring_exclusive(kvm))
>  		return -ENXIO;
>  
>  	as_id = log->slot >> 16;
> @@ -2237,8 +2237,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>  	unsigned long *dirty_bitmap_buffer;
>  	bool flush;
>  
> -	/* Dirty ring tracking is exclusive to dirty log tracking */
> -	if (kvm->dirty_ring_size)
> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
> +	if (kvm_dirty_ring_exclusive(kvm))
>  		return -ENXIO;
>  
>  	as_id = log->slot >> 16;
> @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>  
>  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>  		return;
> +
> +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;
> +#endif
>  #endif
>  
>  	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -		if (kvm->dirty_ring_size)
> +		if (vcpu && kvm->dirty_ring_size)
>  			kvm_dirty_ring_push(&vcpu->dirty_ring,
>  					    slot, rel_gfn);
>  		else
> @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
>  #else
>  		return 0;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>  #endif
>  	case KVM_CAP_BINARY_STATS_FD:
>  	case KVM_CAP_SYSTEM_EVENT_DATA:
> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>  {
>  	int r;
>  
> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	if (!kvm->dirty_ring_with_bitmap)
> +		return -EINVAL;
> +#endif
> +
>  	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
>  		return -EINVAL;
>  
> @@ -4588,6 +4601,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  	case KVM_CAP_DIRTY_LOG_RING:
>  	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> +		kvm->dirty_ring_with_bitmap = true;

IIUC what Oliver wanted to suggest is that we can avoid the enablement of
this cap; then we don't need the dirty_ring_with_bitmap field and can
instead check against CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP when needed.

I think that'll make sense, because without the bitmap the ring won't work
on arm64, so it's never valid to leave it disabled.  But good to double
check with Oliver too.

The rest looks good to me, thanks,

> +		return 0;
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> -- 
> 2.23.0
>
Gavin Shan Oct. 18, 2022, 10:20 p.m. UTC | #2
Hi Peter,

On 10/19/22 12:07 AM, Peter Xu wrote:
> On Tue, Oct 11, 2022 at 02:14:42PM +0800, Gavin Shan wrote:
>> Some architectures (such as arm64) need to dirty memory outside of the
>> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
>> KVM's per-vCPU dirty ring.
>>
>> Introduce a new flavor of dirty ring that requires the use of both vCPU
>> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
>> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
>> the dirty bitmap. Userspace should scan the dirty bitmap before
>> migrating the VM to the target.
>>
>> Use an additional capability to advertize this behavior and require
>> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
>> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
>> not allow userspace to enable dirty ring if it hasn't also enabled the
>> ring && bitmap capability, as a VM is likely DOA without the pages
>> marked in the bitmap.
>>
>> Suggested-by: Marc Zyngier <maz@kernel.org>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   Documentation/virt/kvm/api.rst | 17 ++++++++---------
>>   include/linux/kvm_dirty_ring.h |  6 ++++++
>>   include/linux/kvm_host.h       |  1 +
>>   include/uapi/linux/kvm.h       |  1 +
>>   virt/kvm/Kconfig               |  8 ++++++++
>>   virt/kvm/dirty_ring.c          |  5 +++++
>>   virt/kvm/kvm_main.c            | 34 +++++++++++++++++++++++++---------
>>   7 files changed, 54 insertions(+), 18 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 32427ea160df..09fa6c491c1b 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -8019,8 +8019,8 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
>>   (0x40000001). Otherwise, a guest may use the paravirtual features
>>   regardless of what has actually been exposed through the CPUID leaf.
>>   
>> -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>> -----------------------------------------------------------
>> +8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_WITH_BITMAP}
> 
> Shall we open a new section for RING_WITH_BITMAP?  Otherwise here it still
> looks like these are three options for the rings.
> 
> Perhaps RING_WITH_BITMAP doesn't worth a section at all, so we can avoid
> mentioning it here to avoid confusing.
> 

Let's avoid mentioning it in the section title since it's an add-on
functionality and not all architectures need it.

>> +-------------------------------------------------------------
>>   
>>   :Architectures: x86
>>   :Parameters: args[0] - size of the dirty log ring
>> @@ -8104,13 +8104,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
>>   needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
>>   vmexit ensures that all dirty GFNs are flushed to the dirty rings.
>>   
>> -NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
>> -ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
>> -KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
>> -KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
>> -machine will switch to ring-buffer dirty page tracking and further
>> -KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
>> -
>>   NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
>>   should be exposed by weakly ordered architecture, in order to indicate
>>   the additional memory ordering requirements imposed on userspace when
>> @@ -8119,6 +8112,12 @@ Architecture with TSO-like ordering (such as x86) are allowed to
>>   expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>>   to userspace.
>>   
>> +NOTE: There is no running vcpu and available vcpu dirty ring when pages
> 
> IMHO it'll be great to start with something like below to describe the
> userspace's responsibility to proactively detect the WITH_BITMAP cap:
> 
>    Before using the dirty rings, the userspace needs to detect the cap of
>    KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
>    need to be backed by per-slot bitmaps.
> 
>    When KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP returns 1, it means the arch can
>    dirty guest pages without vcpu/ring context, so that some of the dirty
>    information will still be maintained in the bitmap structure.
> 
>    Note that the bitmap here is only a backup of the ring structure, and it
>    doesn't need to be collected until the final switch-over of migration
>    process.  Normally the bitmap should only contain a very small amount of
>    dirty pages only, which needs to be transferred during VM downtime.
> 
>    To collect dirty bits in the backup bitmap, the userspace can use the
>    same KVM_GET_DIRTY_LOG ioctl.  Since it's always the last phase of
>    migration that needs the fetching of dirty bitmap, KVM_CLEAR_DIRTY_LOG
>    ioctl should not be needed in this case and its behavior undefined.
> 
> That's how I understand this new cap, but let me know if you think any of
> above is inproper.
> 

Yes, it looks much better to describe how KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
is used. However, the missing part is that the capability still needs to be
enabled prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64, meaning the
capability has to be acknowledged (confirmed) by user space. Otherwise,
KVM_CAP_DIRTY_LOG_RING_ACQ_REL can't be enabled successfully. It seems
Oliver, you and I aren't on the same page on this part. Please refer to the
reply below for more discussion. After the discussion is finalized, I can
amend the description here accordingly.

>> +becomes dirty in some cases. One example is to save arm64's vgic/its
>> +tables during migration.
> 
> Nit: it'll be great to mention the exact arm ioctl here just in case anyone
> would like to further reference the code.
> 

Yep, good point. I will mention the ioctl here.

>> The dirty bitmap is still used to track those
>> +dirty pages, indicated by KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The ditry
>> +bitmap is visited by KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG ioctls.
>> +
>>   8.30 KVM_CAP_XEN_HVM
>>   --------------------
>>   
>> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
>> index fe5982b46424..23b2b466aa0f 100644
>> --- a/include/linux/kvm_dirty_ring.h
>> +++ b/include/linux/kvm_dirty_ring.h
>> @@ -28,6 +28,11 @@ struct kvm_dirty_ring {
>>   };
>>   
>>   #ifndef CONFIG_HAVE_KVM_DIRTY_RING
>> +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
>> +{
>> +	return false;
>> +}
>> +
>>   /*
>>    * If CONFIG_HAVE_HVM_DIRTY_RING not defined, kvm_dirty_ring.o should
>>    * not be included as well, so define these nop functions for the arch.
>> @@ -66,6 +71,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
>>   
>>   #else /* CONFIG_HAVE_KVM_DIRTY_RING */
>>   
>> +bool kvm_dirty_ring_exclusive(struct kvm *kvm);
>>   int kvm_cpu_dirty_log_size(void);
>>   u32 kvm_dirty_ring_get_rsvd_entries(void);
>>   int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 53fa3134fee0..a3fae111f25c 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -780,6 +780,7 @@ struct kvm {
>>   	pid_t userspace_pid;
>>   	unsigned int max_halt_poll_ns;
>>   	u32 dirty_ring_size;
>> +	bool dirty_ring_with_bitmap;
>>   	bool vm_bugged;
>>   	bool vm_dead;
>>   
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 0d5d4419139a..c87b5882d7ae 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
>>   #define KVM_CAP_S390_ZPCI_OP 221
>>   #define KVM_CAP_S390_CPU_TOPOLOGY 222
>>   #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>> +#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
>>   
>>   #ifdef KVM_CAP_IRQ_ROUTING
>>   
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index 800f9470e36b..228be1145cf3 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -33,6 +33,14 @@ config HAVE_KVM_DIRTY_RING_ACQ_REL
>>          bool
>>          select HAVE_KVM_DIRTY_RING
>>   
>> +# Only architectures that need to dirty memory outside of a vCPU
>> +# context should select this, advertising to userspace the
>> +# requirement to use a dirty bitmap in addition to the vCPU dirty
>> +# ring.
>> +config HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	bool
>> +	depends on HAVE_KVM_DIRTY_RING
>> +
>>   config HAVE_KVM_EVENTFD
>>          bool
>>          select EVENTFD
>> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
>> index f68d75026bc0..9cc60af291ef 100644
>> --- a/virt/kvm/dirty_ring.c
>> +++ b/virt/kvm/dirty_ring.c
>> @@ -11,6 +11,11 @@
>>   #include <trace/events/kvm.h>
>>   #include "kvm_mm.h"
>>   
>> +bool kvm_dirty_ring_exclusive(struct kvm *kvm)
>> +{
>> +	return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
>> +}
>> +
>>   int __weak kvm_cpu_dirty_log_size(void)
>>   {
>>   	return 0;
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 5b064dbadaf4..8915dcefcefd 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>>   			new->dirty_bitmap = NULL;
>>   		else if (old && old->dirty_bitmap)
>>   			new->dirty_bitmap = old->dirty_bitmap;
>> -		else if (!kvm->dirty_ring_size) {
>> +		else if (!kvm_dirty_ring_exclusive(kvm)) {
>>   			r = kvm_alloc_dirty_bitmap(new);
>>   			if (r)
>>   				return r;
>> @@ -2060,8 +2060,8 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>>   	unsigned long n;
>>   	unsigned long any = 0;
>>   
>> -	/* Dirty ring tracking is exclusive to dirty log tracking */
>> -	if (kvm->dirty_ring_size)
>> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
>> +	if (kvm_dirty_ring_exclusive(kvm))
>>   		return -ENXIO;
>>   
>>   	*memslot = NULL;
>> @@ -2125,8 +2125,8 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>>   	unsigned long *dirty_bitmap_buffer;
>>   	bool flush;
>>   
>> -	/* Dirty ring tracking is exclusive to dirty log tracking */
>> -	if (kvm->dirty_ring_size)
>> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
>> +	if (kvm_dirty_ring_exclusive(kvm))
>>   		return -ENXIO;
>>   
>>   	as_id = log->slot >> 16;
>> @@ -2237,8 +2237,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>>   	unsigned long *dirty_bitmap_buffer;
>>   	bool flush;
>>   
>> -	/* Dirty ring tracking is exclusive to dirty log tracking */
>> -	if (kvm->dirty_ring_size)
>> +	/* Dirty ring tracking may be exclusive to dirty log tracking */
>> +	if (kvm_dirty_ring_exclusive(kvm))
>>   		return -ENXIO;
>>   
>>   	as_id = log->slot >> 16;
>> @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>   	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>>   
>>   #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>> -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>> +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>>   		return;
>> +
>> +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	if (WARN_ON_ONCE(!vcpu))
>> +		return;
>> +#endif
>>   #endif
>>   
>>   	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>>   		u32 slot = (memslot->as_id << 16) | memslot->id;
>>   
>> -		if (kvm->dirty_ring_size)
>> +		if (vcpu && kvm->dirty_ring_size)
>>   			kvm_dirty_ring_push(&vcpu->dirty_ring,
>>   					    slot, rel_gfn);
>>   		else
>> @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>>   		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
>>   #else
>>   		return 0;
>> +#endif
>> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>>   #endif
>>   	case KVM_CAP_BINARY_STATS_FD:
>>   	case KVM_CAP_SYSTEM_EVENT_DATA:
>> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>>   {
>>   	int r;
>>   
>> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	if (!kvm->dirty_ring_with_bitmap)
>> +		return -EINVAL;
>> +#endif
>> +
>>   	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
>>   		return -EINVAL;
>>   
>> @@ -4588,6 +4601,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>   	case KVM_CAP_DIRTY_LOG_RING:
>>   	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
>>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>> +		kvm->dirty_ring_with_bitmap = true;
> 
> IIUC what Oliver wanted to suggest is we can avoid enabling of this cap,
> then we don't need dirty_ring_with_bitmap field but instead we can check
> against CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP when needed.
> 
> I think that'll make sense, because without the bitmap the ring won't work
> with arm64, so not valid to not enable it at all.  But good to double check
> with Oliver too.
> 
> The rest looks good to me, thanks,
> 

It was Oliver who suggested exposing KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. User
space also needs to enable the capability prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL
on ARM64. I may be missing something, since Oliver and you had lots of discussion
on this particular new capability.

I'm fine with dropping the bits that enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP.
That would mean the capability is exposed to user space on ARM64, but user space
need __not__ enable it prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL. I would like
Oliver to help confirm before I post v7.

>> +		return 0;
>>   	default:
>>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>>   	}

Thanks,
Gavin
Oliver Upton Oct. 20, 2022, 6:58 p.m. UTC | #3
On Wed, Oct 19, 2022 at 06:20:32AM +0800, Gavin Shan wrote:
> Hi Peter,
> 
> On 10/19/22 12:07 AM, Peter Xu wrote:
> > On Tue, Oct 11, 2022 at 02:14:42PM +0800, Gavin Shan wrote:

[...]

> > IMHO it'll be great to start with something like below to describe the
> > userspace's responsibility to proactively detect the WITH_BITMAP cap:
> > 
> >    Before using the dirty rings, the userspace needs to detect the cap of
> >    KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures
> >    need to be backed by per-slot bitmaps.
> > 
> >    When KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP returns 1, it means the arch can
> >    dirty guest pages without vcpu/ring context, so that some of the dirty
> >    information will still be maintained in the bitmap structure.
> > 
> >    Note that the bitmap here is only a backup of the ring structure, and it
> >    doesn't need to be collected until the final switch-over of migration
> >    process.  Normally the bitmap should only contain a very small amount of
> >    dirty pages only, which needs to be transferred during VM downtime.
> > 
> >    To collect dirty bits in the backup bitmap, the userspace can use the
> >    same KVM_GET_DIRTY_LOG ioctl.  Since it's always the last phase of
> >    migration that needs the fetching of dirty bitmap, KVM_CLEAR_DIRTY_LOG
> >    ioctl should not be needed in this case and its behavior undefined.
> > 
> > That's how I understand this new cap, but let me know if you think any of
> > above is inproper.
> > 
> 
> Yes, It looks much better to describe how KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> is used. However, the missed part is the capability is still need to be enabled
> prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. It means the capability needs
> to be acknowledged (confirmed) by user space. Otherwise, KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> can't be enabled successfully. It seems Oliver, you and I aren't on same page for
> this part. Please refer to below reply for more discussion. After the discussion
> is finalized, I can amend the description accordingly here.

I'll follow up on the details of the CAP below, but wanted to explicitly
note some stuff for documentation:

Collecting the dirty bitmap should be the very last thing that the VMM
does before transmitting state to the target VMM. You'll want to make
sure that the dirty state is final and avoid missing dirty pages from
another ioctl ordered after bitmap collection.

[...]

> > > +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> > > +		kvm->dirty_ring_with_bitmap = true;
> > 
> > IIUC what Oliver wanted to suggest is we can avoid enabling of this cap,
> > then we don't need dirty_ring_with_bitmap field but instead we can check
> > against CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP when needed.
> > 
> > I think that'll make sense, because without the bitmap the ring won't work
> > with arm64, so not valid to not enable it at all.  But good to double check
> > with Oliver too.
> > 
> > The rest looks good to me, thanks,
> > 
> 
> It was suggested by Oliver to expose KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The
> user space also needs to enable the capability prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> on ARM64. I may be missing something since Oliver and you had lots of discussion
> on this particular new capability.
> 
> I'm fine to drop the bits to enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. It means
> the capability is exposed to user space on ARM64 and user space need __not__ to
> enable it prior to KVM_CAP_DIRTY_LOG_RING_ACQ_REL. I would like Oliver helps to
> confirm before I'm able to post v7.

IMO you really want the explicit buy-in from userspace, as failure to
collect the dirty bitmap will result in a dead VM on the other side of
migration. Fundamentally we're changing the ABI of
KVM_CAP_DIRTY_LOG_RING[_ACQ_REL].

--
Thanks,
Oliver
Sean Christopherson Oct. 20, 2022, 11:44 p.m. UTC | #4
On Tue, Oct 11, 2022, Gavin Shan wrote:
> Some architectures (such as arm64) need to dirty memory outside of the
> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> KVM's per-vCPU dirty ring.

What is the point of using the dirty ring in this case?  KVM still burns a pile
of memory for the bitmap.  Is the benefit that userspace can get away with
scanning the bitmap fewer times, e.g. scan it once just before blackout under
the assumption that very few pages will dirty the bitmap?

Why not add a global ring to @kvm?  I assume thread safety is a problem, but the
memory overhead of the dirty_bitmap also seems like a fairly big problem.

> Introduce a new flavor of dirty ring that requires the use of both vCPU
> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> the dirty bitmap. Userspace should scan the dirty bitmap before
> migrating the VM to the target.
> 
> Use an additional capability to advertise this behavior and require
> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> not allow userspace to enable dirty ring if it hasn't also enabled the
> ring && bitmap capability, as a VM is likely DOA without the pages
> marked in the bitmap.
> 
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>

Co-developed-by needs Oliver's SoB.

>  #ifndef CONFIG_HAVE_KVM_DIRTY_RING
> +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)

What about inverting the naming to better capture that this is about the dirty
bitmap, and less so about the dirty ring?  It's not obvious what "exclusive"
means, e.g. I saw this stub before reading the changelog and assumed it was
making a dirty ring exclusive to something.

Something like this?

bool kvm_use_dirty_bitmap(struct kvm *kvm)
{
	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
}

> @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>  
>  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>  		return;
> +
> +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	if (WARN_ON_ONCE(!vcpu))

To cut down on the #ifdefs, this can be:

	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && !vcpu)

though that's arguably even harder to read.  Blech.

> +		return;
> +#endif
>  #endif
>  
>  	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -		if (kvm->dirty_ring_size)
> +		if (vcpu && kvm->dirty_ring_size)
>  			kvm_dirty_ring_push(&vcpu->dirty_ring,
>  					    slot, rel_gfn);
>  		else
> @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
>  #else
>  		return 0;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>  #endif
>  	case KVM_CAP_BINARY_STATS_FD:
>  	case KVM_CAP_SYSTEM_EVENT_DATA:
> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>  {
>  	int r;
>  
> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> +	if (!kvm->dirty_ring_with_bitmap)
> +		return -EINVAL;
> +#endif

This one at least is prettier with IS_ENABLED

	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
	    !kvm->dirty_ring_with_bitmap)
		return -EINVAL;

But dirty_ring_with_bitmap really shouldn't need to exist.  It's mandatory for
architectures that have HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for
architectures that don't.  In other words, the API for enabling the dirty ring
is a bit ugly.

Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been officially
released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP on top, what about
usurping bits 63:32 of cap->args[0] for flags?  E.g.

Ideally we'd use cap->flags directly, but we screwed up with KVM_CAP_DIRTY_LOG_RING
and didn't require flags to be zero :-(

Actually, what's the point of allowing KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be
enabled?  I get why KVM would enumerate this info, i.e. allowing checking, but I
don't see any value in supporting a second method for enabling the dirty ring.

The acquire-release thing is irrelevant for x86, and no other architecture
supports the dirty ring until this series, i.e. there's no need for KVM to detect
that userspace has been updated to gain acquire-release semantics, because the
fact that userspace is enabling the dirty ring on arm64 means userspace has been
updated.

Same goes for the "with bitmap" capability.  There are no existing arm64 users,
so there's no risk of breaking existing userspace by suddenly shoving stuff into
the dirty bitmap.

KVM doesn't even get the enabling checks right, e.g. KVM_CAP_DIRTY_LOG_RING can be
enabled on architectures that select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL but not
CONFIG_HAVE_KVM_DIRTY_RING.  The reverse is also true (ignoring that x86 selects both and
is the only arch that selects the TSO variant).

Ditto for KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP...
> +
>  	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
>  		return -EINVAL;
>  
> @@ -4588,6 +4601,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  	case KVM_CAP_DIRTY_LOG_RING:
>  	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
>  		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:

... as this should return -EINVAL if CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP=n.

So rather than add a rather useless flag and increase KVM's API surface, why not
make the capabilities informational-only?

---
 include/linux/kvm_dirty_ring.h |  6 +++---
 include/linux/kvm_host.h       |  1 -
 virt/kvm/dirty_ring.c          |  5 +++--
 virt/kvm/kvm_main.c            | 20 ++++----------------
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 23b2b466aa0f..f49db42bc26a 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -28,9 +28,9 @@ struct kvm_dirty_ring {
 };
 
 #ifndef CONFIG_HAVE_KVM_DIRTY_RING
-static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
 {
-	return false;
+	return true;
 }
 
 /*
@@ -71,7 +71,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
-bool kvm_dirty_ring_exclusive(struct kvm *kvm);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
 int kvm_cpu_dirty_log_size(void);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d06fbf3e5e95..eb7b1310146d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -779,7 +779,6 @@ struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
-	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 9cc60af291ef..53802513de79 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -11,9 +11,10 @@
 #include <trace/events/kvm.h>
 #include "kvm_mm.h"
 
-bool kvm_dirty_ring_exclusive(struct kvm *kvm)
+bool kvm_use_dirty_bitmap(struct kvm *kvm)
 {
-	return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
+	return IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
+	       !kvm->dirty_ring_size;
 }
 
 int __weak kvm_cpu_dirty_log_size(void)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd52b8e42307..0e8aaac5a222 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm_dirty_ring_exclusive(kvm)) {
+		else if (kvm_use_dirty_bitmap(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,8 +2060,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking may be exclusive to dirty log tracking */
-	if (kvm_dirty_ring_exclusive(kvm))
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
@@ -2125,8 +2124,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking may be exclusive to dirty log tracking */
-	if (kvm_dirty_ring_exclusive(kvm))
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -2237,8 +2235,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking may be exclusive to dirty log tracking */
-	if (kvm_dirty_ring_exclusive(kvm))
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -4505,11 +4502,6 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
 {
 	int r;
 
-#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
-	if (!kvm->dirty_ring_with_bitmap)
-		return -EINVAL;
-#endif
-
 	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
 		return -EINVAL;
 
@@ -4597,11 +4589,7 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		return 0;
 	}
 	case KVM_CAP_DIRTY_LOG_RING:
-	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
-	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
-		kvm->dirty_ring_with_bitmap = true;
-		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}

base-commit: 4826e54f82ded9f54782f8e9d6bc36c7bae06c1f
--
Marc Zyngier Oct. 21, 2022, 8:06 a.m. UTC | #5
On Fri, 21 Oct 2022 00:44:51 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Oct 11, 2022, Gavin Shan wrote:
> > Some architectures (such as arm64) need to dirty memory outside of the
> > context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> > KVM's per-vCPU dirty ring.
> 
> What is the point of using the dirty ring in this case?  KVM still
> burns a pile of memory for the bitmap.  Is the benefit that
> userspace can get away with scanning the bitmap fewer times,
> e.g. scan it once just before blackout under the assumption that
> very few pages will dirty the bitmap?

Apparently, the throttling effect of the ring makes it easier to
converge. Someone who actually uses the feature should be able to
tell you. But that's a policy decision, and I don't see why we should
be prescriptive.

> Why not add a global ring to @kvm?  I assume thread safety is a
> problem, but the memory overhead of the dirty_bitmap also seems like
> a fairly big problem.

Because we already have a stupidly bloated API surface, and we
could do without yet another one based on a sample of *one*? Because
dirtying memory outside of a vcpu context makes it incredibly awkward
to handle a "ring full" condition?

> 
> > Introduce a new flavor of dirty ring that requires the use of both vCPU
> > dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> > sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> > the dirty bitmap. Userspace should scan the dirty bitmap before
> > migrating the VM to the target.
> > 
> > Use an additional capability to advertise this behavior and require
> > explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> > you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> > not allow userspace to enable dirty ring if it hasn't also enabled the
> > ring && bitmap capability, as a VM is likely DOA without the pages
> > marked in the bitmap.

This is wrong. The *only* case this is useful is when there is an
in-kernel producer of data outside of the context of a vcpu, which is
so far only the ITS save mechanism. No ITS? No need for this.
Userspace knows what it has created in the first place, and should be in
charge of it (i.e. I want to be able to migrate my GICv2 and
GICv3-without-ITS VMs with the rings only).

> > 
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> 
> Co-developed-by needs Oliver's SoB.
> 
> >  #ifndef CONFIG_HAVE_KVM_DIRTY_RING
> > +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> 
> What about inverting the naming to better capture that this is about the dirty
> bitmap, and less so about the dirty ring?  It's not obvious what "exclusive"
> means, e.g. I saw this stub before reading the changelog and assumed it was
> making a dirty ring exclusive to something.
> 
> Something like this?
> 
> bool kvm_use_dirty_bitmap(struct kvm *kvm)
> {
> 	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
> }
> 
> > @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> >  	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> >  
> >  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> > +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> >  		return;
> > +
> > +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > +	if (WARN_ON_ONCE(!vcpu))
> 
> To cut down on the #ifdefs, this can be:
> 
> 	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && !vcpu)
> 
> though that's arguably even harder to read.  Blech.
> 
> > +		return;
> > +#endif
> >  #endif
> >  
> >  	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> >  		u32 slot = (memslot->as_id << 16) | memslot->id;
> >  
> > -		if (kvm->dirty_ring_size)
> > +		if (vcpu && kvm->dirty_ring_size)
> >  			kvm_dirty_ring_push(&vcpu->dirty_ring,
> >  					    slot, rel_gfn);
> >  		else
> > @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
> >  #else
> >  		return 0;
> > +#endif
> > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> >  #endif
> >  	case KVM_CAP_BINARY_STATS_FD:
> >  	case KVM_CAP_SYSTEM_EVENT_DATA:
> > @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> >  {
> >  	int r;
> >  
> > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > +	if (!kvm->dirty_ring_with_bitmap)
> > +		return -EINVAL;
> > +#endif
> 
> This one at least is prettier with IS_ENABLED
> 
> 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> 	    !kvm->dirty_ring_with_bitmap)
> 		return -EINVAL;
> 
> But dirty_ring_with_bitmap really shouldn't need to exist.  It's
> mandatory for architectures that have
> HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
> that don't.  In other words, the API for enabling the dirty ring is
> a bit ugly.
> 
> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
> officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> on top, what about usurping bits 63:32 of cap->args[0] for flags?
> E.g.
> 
> Ideally we'd use cap->flags directly, but we screwed up with
> KVM_CAP_DIRTY_LOG_RING and didn't require flags to be zero :-(
>
> Actually, what's the point of allowing
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be enabled?  I get why KVM would
> enumerate this info, i.e. allowing checking, but I don't see any
> value in supporting a second method for enabling the dirty ring.
> 
> The acquire-release thing is irrelevant for x86, and no other
> architecture supports the dirty ring until this series, i.e. there's
> no need for KVM to detect that userspace has been updated to gain
> acquire-release semantics, because the fact that userspace is
> enabling the dirty ring on arm64 means userspace has been updated.

Do we really need to make the API more awkward? There is an
established pattern of "enable what is advertised". Some level of
uniformity wouldn't hurt, really.

	M.
Gavin Shan Oct. 21, 2022, 10:13 a.m. UTC | #6
Hi Sean,

On 10/21/22 7:44 AM, Sean Christopherson wrote:
> On Tue, Oct 11, 2022, Gavin Shan wrote:
>> Some architectures (such as arm64) need to dirty memory outside of the
>> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
>> KVM's per-vCPU dirty ring.
> 
> What is the point of using the dirty ring in this case?  KVM still burns a pile
> of memory for the bitmap.  Is the benefit that userspace can get away with
> scanning the bitmap fewer times, e.g. scan it once just before blackout under
> the assumption that very few pages will dirty the bitmap?
> 
> Why not add a global ring to @kvm?  I assume thread safety is a problem, but the
> memory overhead of the dirty_bitmap also seems like a fairly big problem.
> 

Most of the dirty pages are tracked by the per-vCPU rings in this particular
case, meaning only a minority of the dirty pages are tracked by the bitmap.
The trade-off is the coexistence of dirty ring and bitmap. The advantage of
the ring is its discrete property compared to the bitmap: with the bitmap alone,
userspace (QEMU) has to copy the whole in-kernel bitmap on every pass, while
the ring is efficient in low-dirty-rate scenarios.

We previously discussed bitmap versus per-VM ring [1]. A per-VM ring is just
too complicated. The bitmap does use extra memory to track dirty pages, but
it will only be consulted in two cases (migration and a quiescent system), so
it is retrieved a limited number of times and the time spent parsing it is
bounded.

[1] https://lore.kernel.org/kvmarm/320005d1-fe88-fd6a-be91-ddb56f1aa80f@redhat.com/

>> Introduce a new flavor of dirty ring that requires the use of both vCPU
>> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
>> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
>> the dirty bitmap. Userspace should scan the dirty bitmap before
>> migrating the VM to the target.
>>
>> Use an additional capability to advertise this behavior and require
>> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
>> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
>> not allow userspace to enable dirty ring if it hasn't also enabled the
>> ring && bitmap capability, as a VM is likely DOA without the pages
>> marked in the bitmap.
>>
>> Suggested-by: Marc Zyngier <maz@kernel.org>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> 
> Co-developed-by needs Oliver's SoB.
> 

Sure.

>>   #ifndef CONFIG_HAVE_KVM_DIRTY_RING
>> +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> 
> What about inverting the naming to better capture that this is about the dirty
> bitmap, and less so about the dirty ring?  It's not obvious what "exclusive"
> means, e.g. I saw this stub before reading the changelog and assumed it was
> making a dirty ring exclusive to something.
> 
> Something like this?
> 
> bool kvm_use_dirty_bitmap(struct kvm *kvm)
> {
> 	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
> }
> 

If you agree, I would rename it to kvm_dirty_ring_use_bitmap(). In this way,
we will have a "kvm_dirty_ring" prefix for the function name, consistent with
other functions from the same module.

>> @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>   	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>>   
>>   #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>> -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>> +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>>   		return;
>> +
>> +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	if (WARN_ON_ONCE(!vcpu))
> 
> To cut down on the #ifdefs, this can be:
> 
> 	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && !vcpu)
> 
> though that's arguably even harder to read.  Blech.
> 

Your suggestion counts :)

>> +		return;
>> +#endif
>>   #endif
>>   
>>   	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>>   		u32 slot = (memslot->as_id << 16) | memslot->id;
>>   
>> -		if (kvm->dirty_ring_size)
>> +		if (vcpu && kvm->dirty_ring_size)
>>   			kvm_dirty_ring_push(&vcpu->dirty_ring,
>>   					    slot, rel_gfn);
>>   		else
>> @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>>   		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
>>   #else
>>   		return 0;
>> +#endif
>> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
>>   #endif
>>   	case KVM_CAP_BINARY_STATS_FD:
>>   	case KVM_CAP_SYSTEM_EVENT_DATA:
>> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>>   {
>>   	int r;
>>   
>> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>> +	if (!kvm->dirty_ring_with_bitmap)
>> +		return -EINVAL;
>> +#endif
> 
> This one at least is prettier with IS_ENABLED
> 
> 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> 	    !kvm->dirty_ring_with_bitmap)
> 		return -EINVAL;
> 
> But dirty_ring_with_bitmap really shouldn't need to exist.  It's mandatory for
> architectures that have HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for
> architectures that don't.  In other words, the API for enabling the dirty ring
> is a bit ugly.
> 
> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been officially
> released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP on top, what about
> usurping bits 63:32 of cap->args[0] for flags?  E.g.
> 
> Ideally we'd use cap->flags directly, but we screwed up with KVM_CAP_DIRTY_LOG_RING
> and didn't require flags to be zero :-(
> 
> Actually, what's the point of allowing KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be
> enabled?  I get why KVM would enumerate this info, i.e. allowing checking, but I
> don't see any value in supporting a second method for enabling the dirty ring.
> 
> The acquire-release thing is irrelevant for x86, and no other architecture
> supports the dirty ring until this series, i.e. there's no need for KVM to detect
> that userspace has been updated to gain acquire-release semantics, because the
> fact that userspace is enabling the dirty ring on arm64 means userspace has been
> updated.
> 
> Same goes for the "with bitmap" capability.  There are no existing arm64 users,
> so there's no risk of breaking existing userspace by suddenly shoving stuff into
> the dirty bitmap.
> 
> KVM doesn't even get the enabling checks right, e.g. KVM_CAP_DIRTY_LOG_RING can be
> enabled on architectures that select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL but not
> CONFIG_HAVE_KVM_DIRTY_RING.  The reverse is also true (ignoring that x86 selects both and
> is the only arch that selects the TSO variant).
> 
> Ditto for KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP...

If I didn't miss anything in the previous discussions, we don't want to make
KVM_CAP_DIRTY_LOG_RING_ACQ_REL and KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP architecture
dependent. If they became architecture dependent, userspace would need different
stubs (x86, arm64, and other architectures supporting the dirty ring in future) to
enable those capabilities, which isn't friendly to userspace. So I tend to prefer
the existing pattern: advertise, then enable. Enabling a capability without knowing
whether it's supported sounds a bit weird to me.

I think it's a good idea to enable KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP} as
flags, instead of standalone capabilities. In this way, those two capabilities can
be treated as sub-capabilities of KVM_CAP_DIRTY_LOG_RING. The question is how these
two flags can be exposed by kvm_vm_ioctl_check_extension_generic(), if we really
want to expose those two flags.

I don't understand your point about KVM getting the enabling checks wrong when
KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are enabled.


>> +
>>   	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
>>   		return -EINVAL;
>>   
>> @@ -4588,6 +4601,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>>   	case KVM_CAP_DIRTY_LOG_RING:
>>   	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
>>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>> +	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> 
> ... as this should return -EINVAL if CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP=n.
> 

Yes.

> So rather than add a rather useless flag and increase KVM's API surface, why not
> make the capabilities informational-only?
> 

Please refer to the reply above. I still tend to prefer the pattern of advertising
and enabling. If architecture-dependent stubs to enable those capabilities aren't a
concern to us, I think we can make those capabilities informational-only, but I
guess Oliver might have different ideas?

> ---
>   include/linux/kvm_dirty_ring.h |  6 +++---
>   include/linux/kvm_host.h       |  1 -
>   virt/kvm/dirty_ring.c          |  5 +++--
>   virt/kvm/kvm_main.c            | 20 ++++----------------
>   4 files changed, 10 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> index 23b2b466aa0f..f49db42bc26a 100644
> --- a/include/linux/kvm_dirty_ring.h
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -28,9 +28,9 @@ struct kvm_dirty_ring {
>   };
>   
>   #ifndef CONFIG_HAVE_KVM_DIRTY_RING
> -static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> +static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
>   {
> -	return false;
> +	return true;
>   }
>   
>   /*
> @@ -71,7 +71,7 @@ static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
>   
>   #else /* CONFIG_HAVE_KVM_DIRTY_RING */
>   
> -bool kvm_dirty_ring_exclusive(struct kvm *kvm);
> +bool kvm_use_dirty_bitmap(struct kvm *kvm);
>   int kvm_cpu_dirty_log_size(void);
>   u32 kvm_dirty_ring_get_rsvd_entries(void);
>   int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d06fbf3e5e95..eb7b1310146d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -779,7 +779,6 @@ struct kvm {
>   	pid_t userspace_pid;
>   	unsigned int max_halt_poll_ns;
>   	u32 dirty_ring_size;
> -	bool dirty_ring_with_bitmap;
>   	bool vm_bugged;
>   	bool vm_dead;
>   
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 9cc60af291ef..53802513de79 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -11,9 +11,10 @@
>   #include <trace/events/kvm.h>
>   #include "kvm_mm.h"
>   
> -bool kvm_dirty_ring_exclusive(struct kvm *kvm)
> +bool kvm_use_dirty_bitmap(struct kvm *kvm)
>   {
> -	return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
> +	return IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) ||
> +	       !kvm->dirty_ring_size;
>   }
>   
>   int __weak kvm_cpu_dirty_log_size(void)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dd52b8e42307..0e8aaac5a222 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1617,7 +1617,7 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>   			new->dirty_bitmap = NULL;
>   		else if (old && old->dirty_bitmap)
>   			new->dirty_bitmap = old->dirty_bitmap;
> -		else if (!kvm_dirty_ring_exclusive(kvm)) {
> +		else if (kvm_use_dirty_bitmap(kvm)) {
>   			r = kvm_alloc_dirty_bitmap(new);
>   			if (r)
>   				return r;
> @@ -2060,8 +2060,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>   	unsigned long n;
>   	unsigned long any = 0;
>   
> -	/* Dirty ring tracking may be exclusive to dirty log tracking */
> -	if (kvm_dirty_ring_exclusive(kvm))
> +	if (!kvm_use_dirty_bitmap(kvm))
>   		return -ENXIO;
>   
>   	*memslot = NULL;
> @@ -2125,8 +2124,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>   	unsigned long *dirty_bitmap_buffer;
>   	bool flush;
>   
> -	/* Dirty ring tracking may be exclusive to dirty log tracking */
> -	if (kvm_dirty_ring_exclusive(kvm))
> +	if (!kvm_use_dirty_bitmap(kvm))
>   		return -ENXIO;
>   
>   	as_id = log->slot >> 16;
> @@ -2237,8 +2235,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>   	unsigned long *dirty_bitmap_buffer;
>   	bool flush;
>   
> -	/* Dirty ring tracking may be exclusive to dirty log tracking */
> -	if (kvm_dirty_ring_exclusive(kvm))
> +	if (!kvm_use_dirty_bitmap(kvm))
>   		return -ENXIO;
>   
>   	as_id = log->slot >> 16;
> @@ -4505,11 +4502,6 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>   {
>   	int r;
>   
> -#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> -	if (!kvm->dirty_ring_with_bitmap)
> -		return -EINVAL;
> -#endif
> -
>   	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
>   		return -EINVAL;
>   
> @@ -4597,11 +4589,7 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>   		return 0;
>   	}
>   	case KVM_CAP_DIRTY_LOG_RING:
> -	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
>   		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> -	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
> -		kvm->dirty_ring_with_bitmap = true;
> -		return 0;
>   	default:
>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>   	}
> 
> base-commit: 4826e54f82ded9f54782f8e9d6bc36c7bae06c1f
> 

Thanks,
Gavin
Sean Christopherson Oct. 21, 2022, 4:05 p.m. UTC | #7
On Fri, Oct 21, 2022, Marc Zyngier wrote:
> On Fri, 21 Oct 2022 00:44:51 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Tue, Oct 11, 2022, Gavin Shan wrote:
> > > Some architectures (such as arm64) need to dirty memory outside of the
> > > context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> > > KVM's per-vCPU dirty ring.
> > 
> > What is the point of using the dirty ring in this case?  KVM still
> > burns a pile of memory for the bitmap.  Is the benefit that
> > userspace can get away with scanning the bitmap fewer times,
> > e.g. scan it once just before blackout under the assumption that
> > very few pages will dirty the bitmap?
> 
> Apparently, the throttling effect of the ring makes it easier to
> converge. Someone who actually uses the feature should be able to
> tell you. But that's a policy decision, and I don't see why we should
> be prescriptive.

I wasn't suggesting we be prescriptive; it was an honest question.

> > Why not add a global ring to @kvm?  I assume thread safety is a
> > problem, but the memory overhead of the dirty_bitmap also seems like
> > a fairly big problem.
> 
> Because we already have a stupidly bloated API surface, and we
> could do without yet another one based on a sample of *one*?

But we're adding a new API regardless.  A per-VM ring would definitely be a bigger
addition, but if using the dirty_bitmap won't actually meet the needs of userspace,
then we'll have added a new API and still not have solved the problem.  That's why
I was asking why/when userspace would want to use dirty_ring+dirty_bitmap.

> Because dirtying memory outside of a vcpu context makes it incredibly awkward
> to handle a "ring full" condition?

Kicking all vCPUs with the soft-full request isn't _that_ awkward.  It's certainly
sub-optimal, but if inserting into the per-VM ring is relatively rare, then in
practice it's unlikely to impact guest performance.

> > > Introduce a new flavor of dirty ring that requires the use of both vCPU
> > > dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> > > sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> > > the dirty bitmap. Userspace should scan the dirty bitmap before
> > > migrating the VM to the target.
> > > 
> > > Use an additional capability to advertise this behavior and require
> > > explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> > > you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> > > not allow userspace to enable dirty ring if it hasn't also enabled the
> > > ring && bitmap capability, as a VM is likely DOA without the pages
> > > marked in the bitmap.
> 
> This is wrong. The *only* case this is useful is when there is an
> in-kernel producer of data outside of the context of a vcpu, which is
> so far only the ITS save mechanism. No ITS? No need for this.

How large is the ITS?  If it's a fixed, small size, could we treat the ITS as a
one-off case for now?  E.g. do something gross like shove entries into vcpu0's
dirty ring?

> Userspace knows what it has created the first place, and should be in
> charge of it (i.e. I want to be able to migrate my GICv2 and
> GICv3-without-ITS VMs with the rings only).

Ah, so enabling the dirty bitmap isn't strictly required.  That means this patch
is wrong, and it also means that we need to figure out how we want to handle the
case where mark_page_dirty_in_slot() is invoked without a running vCPU on a memslot
without a dirty_bitmap.

I.e. what's an appropriate action in the below sequence:

void mark_page_dirty_in_slot(struct kvm *kvm,
			     const struct kvm_memory_slot *memslot,
		 	     gfn_t gfn)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

#ifdef CONFIG_HAVE_KVM_DIRTY_RING
	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
		return;

#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
	if (WARN_ON_ONCE(!vcpu))
		return;
#endif
#endif

	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
		unsigned long rel_gfn = gfn - memslot->base_gfn;
		u32 slot = (memslot->as_id << 16) | memslot->id;

		if (vcpu && kvm->dirty_ring_size)
			kvm_dirty_ring_push(&vcpu->dirty_ring,
					    slot, rel_gfn);
		else if (memslot->dirty_bitmap)
			set_bit_le(rel_gfn, memslot->dirty_bitmap);
		else
			???? <=================================================
	}
}


Would it be possible to require a dirty bitmap when an ITS is created?  That would
allow treating the above condition as a KVM bug.
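For what it's worth, the three-way dispatch above can be modeled outside the kernel to enumerate which path each configuration takes. The sketch below uses stub types and a hypothetical dirty_path() helper (none of this is KVM code); the NO_DESTINATION case corresponds to the "????" branch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stub model of the relevant state; these are not KVM's real types. */
struct slot_model { bool has_dirty_bitmap; };
struct vm_model   { uint32_t dirty_ring_size; };

enum dirty_path { PUSH_TO_RING, SET_BITMAP_BIT, NO_DESTINATION };

/*
 * Mirrors the if/else chain in mark_page_dirty_in_slot(): a running
 * vCPU with a ring enabled wins, the slot's dirty_bitmap is the
 * fallback, and NO_DESTINATION is only reachable with no running
 * vCPU and no bitmap -- the open question in the thread.
 */
static enum dirty_path dirty_path(const struct vm_model *kvm,
				  bool has_running_vcpu,
				  const struct slot_model *slot)
{
	if (has_running_vcpu && kvm->dirty_ring_size)
		return PUSH_TO_RING;
	if (slot->has_dirty_bitmap)
		return SET_BITMAP_BIT;
	return NO_DESTINATION;
}
```

Requiring a bitmap whenever an ITS exists would make NO_DESTINATION unreachable, which is what would justify treating it as a WARN-able bug.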

> > > @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > >  {
> > >  	int r;
> > >  
> > > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > > +	if (!kvm->dirty_ring_with_bitmap)
> > > +		return -EINVAL;
> > > +#endif
> > 
> > This one at least is prettier with IS_ENABLED
> > 
> > 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> > 	    !kvm->dirty_ring_with_bitmap)
> > 		return -EINVAL;
> > 
> > But dirty_ring_with_bitmap really shouldn't need to exist.  It's
> > mandatory for architectures that have
> > HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
> > that don't.  In other words, the API for enabling the dirty ring is
> > a bit ugly.
> > 
> > Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
> > officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> > on top, what about usurping bits 63:32 of cap->args[0] for flags?
> > E.g.

For posterity, filling in my missing idea...

Since the size is restricted to be well below a 32-bit value, and it's unlikely
that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
flags:

  static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
  {
	u32 flags = arg0 >> 32;
	u32 size = arg0;

However, since it sounds like enabling dirty_bitmap isn't strictly required, I
have no objection to enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection
was purely that KVM was adding a per-VM flag just to sanity check the configuration.
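A stand-alone sketch of that args[0] split (the flag name below is illustrative, not from the UAPI headers): the low 32 bits keep the existing ring size, so legacy userspace that passes only a size ends up with flags == 0:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bit; KVM never defined this value. */
#define DIRTY_RING_F_WITH_BITMAP (1u << 0)

/* Split a 64-bit cap->args[0] into flags (bits 63:32) and size (bits 31:0). */
static void split_dirty_ring_arg(uint64_t arg0, uint32_t *flags, uint32_t *size)
{
	*flags = (uint32_t)(arg0 >> 32);
	*size  = (uint32_t)arg0;
}
```

Since existing callers never set the upper bits, the scheme is backward compatible by construction.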

> > Ideally we'd use cap->flags directly, but we screwed up with
> > KVM_CAP_DIRTY_LOG_RING and didn't require flags to be zero :-(
> >
> > Actually, what's the point of allowing
> > KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be enabled?  I get why KVM would
> > enumerate this info, i.e. allowing checking, but I don't see any
> > value in supporting a second method for enabling the dirty ring.
> > 
> > The acquire-release thing is irrelevant for x86, and no other
> > architecture supports the dirty ring until this series, i.e. there's
> > no need for KVM to detect that userspace has been updated to gain
> > acquire-release semantics, because the fact that userspace is
> > enabling the dirty ring on arm64 means userspace has been updated.
> 
> Do we really need to make the API more awkward? There is an
> established pattern of "enable what is advertised". Some level of
> uniformity wouldn't hurt, really.

I agree that uniformity would be nice, but for capabilities I don't think that's
ever going to happen.  I'm pretty sure supporting enabling is actually in the
minority.  E.g. of the 20 capabilities handled in kvm_vm_ioctl_check_extension_generic(),
I believe only 3 support enabling (KVM_CAP_HALT_POLL, KVM_CAP_DIRTY_LOG_RING, and
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
Sean Christopherson Oct. 21, 2022, 11:20 p.m. UTC | #8
On Fri, Oct 21, 2022, Gavin Shan wrote:
> > What about inverting the naming to better capture that this is about the dirty
> > bitmap, and less so about the dirty ring?  It's not obvious what "exclusive"
> > means, e.g. I saw this stub before reading the changelog and assumed it was
> > making a dirty ring exclusive to something.
> > 
> > Something like this?
> > 
> > bool kvm_use_dirty_bitmap(struct kvm *kvm)
> > {
> > 	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
> > }
> > 
> 
> If you agree, I would rename it to kvm_dirty_ring_use_bitmap(). In this way,
> we will have "kvm_dirty_ring" prefix for the function name, consistent with
> other functions from same module.

I'd prefer to avoid "ring" in the name at all, because in the common case (well,
legacy case at least) the dirty ring has nothing to do with using the dirty
bitmap, e.g. this code ends up being very confusing because the "dirty_ring"
part implies that KVM _doesn't_ need to allocate the bitmap when the dirty ring
isn't being used.

		if (!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
			new->dirty_bitmap = NULL;
		else if (old && old->dirty_bitmap)
			new->dirty_bitmap = old->dirty_bitmap;
> > 		else if (kvm_dirty_ring_use_bitmap(kvm)) {
			r = kvm_alloc_dirty_bitmap(new);
			if (r)
				return r;

			if (kvm_dirty_log_manual_protect_and_init_set(kvm))
				bitmap_set(new->dirty_bitmap, 0, new->npages);
		}

The helper exists because the dirty ring exists, but the helper is fundamentally
about the dirty bitmap, not the ring.

> > But dirty_ring_with_bitmap really shouldn't need to exist.  It's mandatory for
> > architectures that have HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for
> > architectures that don't.  In other words, the API for enabling the dirty ring
> > is a bit ugly.
> > 
> > Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been officially
> > released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP on top, what about
> > usurping bits 63:32 of cap->args[0] for flags?  E.g.
> > 
> > Ideally we'd use cap->flags directly, but we screwed up with KVM_CAP_DIRTY_LOG_RING
> > and didn't require flags to be zero :-(
> > 
> > Actually, what's the point of allowing KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be
> > enabled?  I get why KVM would enumerate this info, i.e. allowing checking, but I
> > don't see any value in supporting a second method for enabling the dirty ring.
> > 
> > The acquire-release thing is irrelevant for x86, and no other architecture
> > supports the dirty ring until this series, i.e. there's no need for KVM to detect
> > that userspace has been updated to gain acquire-release semantics, because the
> > fact that userspace is enabling the dirty ring on arm64 means userspace has been
> > updated.
> > 
> > Same goes for the "with bitmap" capability.  There are no existing arm64 users,
> > so there's no risk of breaking existing userspace by suddenly shoving stuff into
> > the dirty bitmap.
> > 
> > KVM doesn't even get the enabling checks right, e.g. KVM_CAP_DIRTY_LOG_RING can be
> > enabled on architectures that select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL but not
> > KVM_CAP_DIRTY_LOG_RING.  The reverse is true (ignoring that x86 selects both and
> > is the only arch that selects the TSO variant).
> > 
> > Ditto for KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP...
> 
> If I didn't miss anything in the previous discussions, we don't want to make
> KVM_CAP_DIRTY_LOG_RING_ACQ_REL and KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> architecture dependent. If they become architecture dependent, the userspace
> will have different stubs (x86, arm64, other architectures to support
> dirty-ring in future) to enable those capabilities. It's not friendly to
> userspace. So I intend to prefer the existing pattern: advertise, enable. To
> enable a capability without knowing if it's supported sounds a bit weird to
> me.

Enabling without KVM advertising that it's supported would indeed be odd.  Ugh,
and QEMU doesn't have existing checks to restrict the dirty ring to x86, i.e. we
can't make the ACQ_REL capability a true attribute without breaking userspace.

Rats.

> I think it's a good idea to enable KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP} as
> flags, instead of standalone capabilities. In this way, those two capabilities can
> be treated as sub-capability of KVM_CAP_DIRTY_LOG_RING. The question is how these
> two flags can be exposed by kvm_vm_ioctl_check_extension_generic(), if we really
> want to expose those two flags.
> 
> I don't understand your question on how KVM has wrong checks when KVM_CAP_DIRTY_LOG_RING
> and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are enabled.

In the current code base, KVM only checks that _a_ form of dirty ring is supported,
by way of kvm_vm_ioctl_enable_dirty_log_ring()'s check on KVM_DIRTY_LOG_PAGE_OFFSET.

The callers don't verify that the "correct" capability is enabled.

	case KVM_CAP_DIRTY_LOG_RING:
	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);

E.g. userspace could do

	if (kvm_check(KVM_CAP_DIRTY_LOG_RING_ACQ_REL))
		kvm_enable(KVM_CAP_DIRTY_LOG_RING)

and KVM would happily enable the dirty ring.  Functionally it doesn't cause
problems, it's just weird.

Heh, we can fix without more ifdeffery by using the check internally.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e30f1b4ecfa5..300489a0eba5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4585,6 +4585,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
        }
        case KVM_CAP_DIRTY_LOG_RING:
        case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
+               if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
+                       return -EINVAL;
                return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
        default:
                return kvm_vm_ioctl_enable_cap(kvm, cap);
Gavin Shan Oct. 22, 2022, 12:33 a.m. UTC | #9
Hi Sean,

On 10/22/22 7:20 AM, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Gavin Shan wrote:
>>> What about inverting the naming to better capture that this is about the dirty
>>> bitmap, and less so about the dirty ring?  It's not obvious what "exclusive"
>>> means, e.g. I saw this stub before reading the changelog and assumed it was
>>> making a dirty ring exclusive to something.
>>>
>>> Something like this?
>>>
>>> bool kvm_use_dirty_bitmap(struct kvm *kvm)
>>> {
>>> 	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
>>> }
>>>
>>
>> If you agree, I would rename it to kvm_dirty_ring_use_bitmap(). In this way,
>> we will have "kvm_dirty_ring" prefix for the function name, consistent with
>> other functions from same module.
> 
> I'd prefer to avoid "ring" in the name at all, because in the common case (well,
> legacy case at least) the dirty ring has nothing to do with using the dirty
> bitmap, e.g. this code ends up being very confusing because the "dirty_ring"
> part implies that KVM _doesn't_ need to allocate the bitmap when the dirty ring
> isn't being used.
> 
> 		if (!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> 			new->dirty_bitmap = NULL;
> 		else if (old && old->dirty_bitmap)
> 			new->dirty_bitmap = old->dirty_bitmap;
> 		else if (kvm_dirty_ring_use_bitmap(kvm)) {
> 			r = kvm_alloc_dirty_bitmap(new);
> 			if (r)
> 				return r;
> 
> 			if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> 				bitmap_set(new->dirty_bitmap, 0, new->npages);
> 		}
> 
> The helper exists because the dirty ring exists, but the helper is fundamentally
> about the dirty bitmap, not the ring.
> 

Thanks for the details. Yeah, it makes sense to avoid "ring" then. Let's use
the name kvm_use_dirty_bitmap() for the function.
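The agreed predicate is small enough that its truth table can be checked exhaustively in a stand-alone model (a stub struct, not the kernel's struct kvm): the bitmap is in use either when the ring is off entirely, or when the ring-with-bitmap mode was opted into:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the two fields the helper reads. */
struct vm_model {
	uint32_t dirty_ring_size;
	bool dirty_ring_with_bitmap;
};

static bool kvm_use_dirty_bitmap(const struct vm_model *kvm)
{
	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
}
```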

>>> But dirty_ring_with_bitmap really shouldn't need to exist.  It's mandatory for
>>> architectures that have HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for
>>> architectures that don't.  In other words, the API for enabling the dirty ring
>>> is a bit ugly.
>>>
>>> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been officially
>>> released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP on top, what about
>>> usurping bits 63:32 of cap->args[0] for flags?  E.g.
>>>
>>> Ideally we'd use cap->flags directly, but we screwed up with KVM_CAP_DIRTY_LOG_RING
>>> and didn't require flags to be zero :-(
>>>
>>> Actually, what's the point of allowing KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be
>>> enabled?  I get why KVM would enumerate this info, i.e. allowing checking, but I
>>> don't see any value in supporting a second method for enabling the dirty ring.
>>>
>>> The acquire-release thing is irrelevant for x86, and no other architecture
>>> supports the dirty ring until this series, i.e. there's no need for KVM to detect
>>> that userspace has been updated to gain acquire-release semantics, because the
>>> fact that userspace is enabling the dirty ring on arm64 means userspace has been
>>> updated.
>>>
>>> Same goes for the "with bitmap" capability.  There are no existing arm64 users,
>>> so there's no risk of breaking existing userspace by suddenly shoving stuff into
>>> the dirty bitmap.
>>>
>>> KVM doesn't even get the enabling checks right, e.g. KVM_CAP_DIRTY_LOG_RING can be
>>> enabled on architectures that select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL but not
>>> KVM_CAP_DIRTY_LOG_RING.  The reverse is true (ignoring that x86 selects both and
>>> is the only arch that selects the TSO variant).
>>>
>>> Ditto for KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP...
>>
>> If I didn't miss anything in the previous discussions, we don't want to make
>> KVM_CAP_DIRTY_LOG_RING_ACQ_REL and KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>> architecture dependent. If they become architecture dependent, the userspace
>> will have different stubs (x86, arm64, other architectures to support
>> dirty-ring in future) to enable those capabilities. It's not friendly to
>> userspace. So I intend to prefer the existing pattern: advertise, enable. To
>> enable a capability without knowing if it's supported sounds a bit weird to
>> me.
> 
> Enabling without KVM advertising that it's supported would indeed be odd.  Ugh,
> and QEMU doesn't have existing checks to restrict the dirty ring to x86, i.e. we
> can't make the ACQ_REL capability a true attribute without breaking userspace.
> 
> Rats.
> 

Currently, QEMU doesn't use ACQ_REL or WITH_BITMAP. After both capabilities are
supported by KVM, we need to go ahead and change QEMU so that the two
capabilities can be enabled there.

>> I think it's a good idea to enable KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP} as
>> flags, instead of standalone capabilities. In this way, those two capabilities can
>> be treated as sub-capability of KVM_CAP_DIRTY_LOG_RING. The question is how these
>> two flags can be exposed by kvm_vm_ioctl_check_extension_generic(), if we really
>> want to expose those two flags.
>>
>> I don't understand your question on how KVM has wrong checks when KVM_CAP_DIRTY_LOG_RING
>> and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are enabled.
> 
> In the current code base, KVM only checks that _a_ form of dirty ring is supported,
> by way of kvm_vm_ioctl_enable_dirty_log_ring()'s check on KVM_DIRTY_LOG_PAGE_OFFSET.
> 
> The callers don't verify that the "correct" capability is enabled.
> 
> 	case KVM_CAP_DIRTY_LOG_RING:
> 	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
> 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> 
> E.g. userspace could do
> 
> 	if (kvm_check(KVM_CAP_DIRTY_LOG_RING_ACQ_REL))
> 		kvm_enable(KVM_CAP_DIRTY_LOG_RING)
> 
> and KVM would happily enable the dirty ring.  Functionally it doesn't cause
> problems, it's just weird.
> 
> Heh, we can fix without more ifdeffery by using the check internally.
> 

Hmm, nice catch! Let's fix it up in a separate patch.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e30f1b4ecfa5..300489a0eba5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4585,6 +4585,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>          }
>          case KVM_CAP_DIRTY_LOG_RING:
>          case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
> +               if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
> +                       return -EINVAL;
>                  return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>          default:
>                  return kvm_vm_ioctl_enable_cap(kvm, cap);
> 

Thanks,
Gavin
Gavin Shan Oct. 22, 2022, 8:27 a.m. UTC | #10
Hi Sean,

On 10/22/22 12:05 AM, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Marc Zyngier wrote:
>> On Fri, 21 Oct 2022 00:44:51 +0100,
>> Sean Christopherson <seanjc@google.com> wrote:
>>>
>>> On Tue, Oct 11, 2022, Gavin Shan wrote:
>>>> Some architectures (such as arm64) need to dirty memory outside of the
>>>> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
>>>> KVM's per-vCPU dirty ring.
>>>
>>> What is the point of using the dirty ring in this case?  KVM still
>>> burns a pile of memory for the bitmap.  Is the benefit that
>>> userspace can get away with scanning the bitmap fewer times,
>>> e.g. scan it once just before blackout under the assumption that
>>> very few pages will dirty the bitmap?
>>
>> Apparently, the throttling effect of the ring makes it easier to
>> converge. Someone who actually uses the feature should be able to
>> tell you. But that's a policy decision, and I don't see why we should
>> be prescriptive.
> 
> I wasn't suggesting we be prescriptive, it was an honest question.
> 
>>> Why not add a global ring to @kvm?  I assume thread safety is a
>>> problem, but the memory overhead of the dirty_bitmap also seems like
>>> a fairly big problem.
>>
>> Because we already have a stupidly bloated API surface, and that we
>> could do without yet another one based on a sample of *one*?
> 
> But we're adding a new API regardless.  A per-VM ring would definitely be a bigger
> addition, but if using the dirty_bitmap won't actually meet the needs of userspace,
> then we'll have added a new API and still not have solved the problem.  That's why
> I was asking why/when userspace would want to use dirty_ring+dirty_bitmap.
> 

The bitmap can solve the issue, but its extra memory consumption is a concern,
as you mentioned previously. More information about the issue can be found
here [1]. On arm64, multiple guest physical pages are used by the VGIC/ITS to
store its state during migration or system shutdown.

[1] https://lore.kernel.org/kvmarm/320005d1-fe88-fd6a-be91-ddb56f1aa80f@redhat.com/

>> Because dirtying memory outside of a vcpu context makes it incredibly awkward
>> to handle a "ring full" condition?
> 
> Kicking all vCPUs with the soft-full request isn't _that_ awkward.  It's certainly
> sub-optimal, but if inserting into the per-VM ring is relatively rare, then in
> practice it's unlikely to impact guest performance.
> 

It's still possible for a per-vcpu ring to become hard-full before the vCPUs
can be kicked. A per-VM ring has other issues, one of which is the
synchronization needed between KVM and userspace to avoid overrunning the
ring. The bitmap was selected for its simplicity.

>>>> Introduce a new flavor of dirty ring that requires the use of both vCPU
>>>> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
>>>> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
>>>> the dirty bitmap. Userspace should scan the dirty bitmap before
>>>> migrating the VM to the target.
>>>>
>>>> Use an additional capability to advertise this behavior and require
>>>> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
>>>> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
>>>> not allow userspace to enable dirty ring if it hasn't also enabled the
>>>> ring && bitmap capability, as a VM is likely DOA without the pages
>>>> marked in the bitmap.
>>
>> This is wrong. The *only* case this is useful is when there is an
>> in-kernel producer of data outside of the context of a vcpu, which is
>> so far only the ITS save mechanism. No ITS? No need for this.
> 
> How large is the ITS?  If it's a fixed, small size, could we treat the ITS as a
> one-off case for now?  E.g. do something gross like shove entries into vcpu0's
> dirty ring?
> 

There are several VGIC/ITS tables involved in the issue. I checked the
specification and the implementation. As the device ID is 16 bits wide, the
maximum number of devices is 0x10000. Each device has its own ITT (Interrupt
Translation Table), looked up by a 32-bit event ID. The memory used for the
ITTs can be arbitrarily large in theory.

     Register       Description           Max-size   Entry-size  Max-entries
     -----------------------------------------------------------------------
     GITS_BASER0    ITS Device Table      512KB      8-bytes     0x10000
     GITS_BASER1    ITS Collection Table  512KB      8-bytes     0x10000
     GITS_BASER2    (GICv4) ITS VPE Table 512KB      8-bytes(?)  0x10000

     max-devices * (1UL << event_id_shift) * entry_size =
     0x10000 * (1UL << 32) * 8                          = 2^51 bytes (2PiB)
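That worst case is straightforward arithmetic: 2^16 devices times 2^32 events times 8-byte entries is 2^51 bytes, i.e. 2PiB (a theoretical ceiling assuming the full 32-bit event ID space; real guests use far fewer ID bits). A quick check:

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case ITT footprint: one entry per (device ID, event ID) pair. */
static uint64_t its_worst_case_bytes(unsigned dev_id_bits,
				     unsigned event_id_bits,
				     uint64_t entry_size)
{
	return (UINT64_C(1) << dev_id_bits) *
	       (UINT64_C(1) << event_id_bits) * entry_size;
}
```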

>> Userspace knows what it has created the first place, and should be in
>> charge of it (i.e. I want to be able to migrate my GICv2 and
>> GICv3-without-ITS VMs with the rings only).
> 
> Ah, so enabling the dirty bitmap isn't strictly required.  That means this patch
> is wrong, and it also means that we need to figure out how we want to handle the
> case where mark_page_dirty_in_slot() is invoked without a running vCPU on a memslot
> without a dirty_bitmap.
> 
> I.e. what's an appropriate action in the below sequence:
> 
> void mark_page_dirty_in_slot(struct kvm *kvm,
> 			     const struct kvm_memory_slot *memslot,
> 		 	     gfn_t gfn)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 
> #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> 	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> 		return;
> 
> #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> 	if (WARN_ON_ONCE(!vcpu))
> 		return;
> #endif
> #endif
> 
> 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> 		unsigned long rel_gfn = gfn - memslot->base_gfn;
> 		u32 slot = (memslot->as_id << 16) | memslot->id;
> 
> 		if (vcpu && kvm->dirty_ring_size)
> 			kvm_dirty_ring_push(&vcpu->dirty_ring,
> 					    slot, rel_gfn);
> 		else if (memslot->dirty_bitmap)
> 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> 		else
> 			???? <=================================================
> 	}
> }
> 
> 
> Would it be possible to require a dirty bitmap when an ITS is created?  That would
> allow treating the above condition as a KVM bug.
> 

According to the above calculation, it's impossible to determine the memory size for
the bitmap in advance. The memory used by ITE (Interrupt Translation Entry) tables
can, in theory, consume all of the guest's system memory. The ITE tables are
scattered through guest memory, their locations aren't known in advance, and they
are created dynamically on request from the guest.

However, I think it's a good idea to enable the bitmap only when "arm-its-kvm" is
really used in userspace (QEMU). For example, the machine and (kvm) accelerator are
initialized like below. It's unknown if "arm-its-kvm" is used until (c). So we can
enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP in (d) and the bitmap is created in (e) by
KVM.

   main
     qemu_init
       qemu_create_machine                   (a) machine instance is created
       configure_accelerators
         do_configure_accelerator
           accel_init_machine
             kvm_init                        (b) KVM is initialized
       :
       qmp_x_exit_preconfig
         qemu_init_board
           machine_run_board_init            (c) The board is initialized
       :
       accel_setup_post                      (d) KVM is post initialized
       :
       <migration>                           (e) Migration starts

In order to record whether the bitmap is really needed, "struct kvm::dirty_ring_with_bitmap"
still has to exist.

    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is advertised when CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
      is selected.

    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled in (d) only when "arm-its-kvm"
      is used in QEMU. After the capability is enabled, "struct kvm::dirty_ring_with_bitmap"
      is set to 1.

    - The bitmap is created by KVM in (e).

If the above analysis makes sense, I don't see anything missing from the patch.
Of course, KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP} need to be enabled separately
and must not depend on each other. The description added to "Documentation/virt/kvm/api.rst"
needs to be improved as Peter and Oliver suggested, and kvm_dirty_ring_exclusive() needs
to be renamed to kvm_use_dirty_bitmap() with the "#ifdef"s cut down, as Sean suggested.


>>>> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
>>>>   {
>>>>   	int r;
>>>>   
>>>> +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
>>>> +	if (!kvm->dirty_ring_with_bitmap)
>>>> +		return -EINVAL;
>>>> +#endif
>>>
>>> This one at least is prettier with IS_ENABLED
>>>
>>> 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
>>> 	    !kvm->dirty_ring_with_bitmap)
>>> 		return -EINVAL;
>>>
>>> But dirty_ring_with_bitmap really shouldn't need to exist.  It's
>>> mandatory for architectures that have
>>> HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
>>> that don't.  In other words, the API for enabling the dirty ring is
>>> a bit ugly.
>>>
>>> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
>>> officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
>>> on top, what about usurping bits 63:32 of cap->args[0] for flags?
>>> E.g.
> 
> For posterity, filling in my missing idea...
> 
> Since the size is restricted to be well below a 32-bit value, and it's unlikely
> that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
> flags:
> 
>    static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
>    {
> 	u32 flags = arg0 >> 32;
> 	u32 size = arg0;
> 
> However, since it sounds like enabling dirty_bitmap isn't strictly required, I
> have no objection to enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection
> was purely that KVM was adding a per-VM flag just to sanity check the configuration.
> 

If KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled for "arm-its-kvm", it'd be better
to allow those two capabilities (ACQ_REL and WITH_BITMAP) to be enabled separately,
as I explained above. Userspace (QEMU) will gain flexibility if the two capabilities
can be enabled independently.

To QEMU, KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are accelerator
properties, while KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is a board property. Relaxing
their dependency will give QEMU more flexibility.


[...]

Thanks,
Gavin
Marc Zyngier Oct. 22, 2022, 10:33 a.m. UTC | #11
On Fri, 21 Oct 2022 17:05:26 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Oct 21, 2022, Marc Zyngier wrote:
> > On Fri, 21 Oct 2022 00:44:51 +0100,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Tue, Oct 11, 2022, Gavin Shan wrote:
> > > > Some architectures (such as arm64) need to dirty memory outside of the
> > > > context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> > > > KVM's per-vCPU dirty ring.
> > > 
> > > What is the point of using the dirty ring in this case?  KVM still
> > > burns a pile of memory for the bitmap.  Is the benefit that
> > > userspace can get away with scanning the bitmap fewer times,
> > > e.g. scan it once just before blackout under the assumption that
> > > very few pages will dirty the bitmap?
> > 
> > Apparently, the throttling effect of the ring makes it easier to
> > converge. Someone who actually uses the feature should be able to
> > tell you. But that's a policy decision, and I don't see why we should
> > be prescriptive.
> 
> I wasn't suggesting we be prescriptive, it was an honest question.
> 
> > > Why not add a global ring to @kvm?  I assume thread safety is a
> > > problem, but the memory overhead of the dirty_bitmap also seems like
> > > a fairly big problem.
> > 
> > Because we already have a stupidly bloated API surface, and that we
> > could do without yet another one based on a sample of *one*?
> 
> But we're adding a new API regardless.  A per-VM ring would
> definitely be a bigger addition, but if using the dirty_bitmap won't
> actually meet the needs of userspace, then we'll have added a new
> API and still not have solved the problem.  That's why I was asking
> why/when userspace would want to use dirty_ring+dirty_bitmap.

Whenever dirty pages can be generated outside of the context of a vcpu
running. And that's anything that comes from *devices*.

> 
> > Because dirtying memory outside of a vcpu context makes it
> > incredibly awkward to handle a "ring full" condition?
> 
> Kicking all vCPUs with the soft-full request isn't _that_ awkward.
> It's certainly sub-optimal, but if inserting into the per-VM ring is
> relatively rare, then in practice it's unlikely to impact guest
> performance.

But there is *nothing* to kick here. The kernel is dirtying pages,
devices are dirtying pages (DMA), and there is no context associated
with that. Which is why a finite ring is the wrong abstraction.

> 
> > > > Introduce a new flavor of dirty ring that requires the use of both vCPU
> > > > dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> > > > sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> > > > the dirty bitmap. Userspace should scan the dirty bitmap before
> > > > migrating the VM to the target.
> > > > 
> > > > Use an additional capability to advertise this behavior and require
> > > > explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> > > > you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> > > > not allow userspace to enable dirty ring if it hasn't also enabled the
> > > > ring && bitmap capability, as a VM is likely DOA without the pages
> > > > marked in the bitmap.
> > 
> > This is wrong. The *only* case this is useful is when there is an
> > in-kernel producer of data outside of the context of a vcpu, which is
> > so far only the ITS save mechanism. No ITS? No need for this.
> 
> How large is the ITS?  If it's a fixed, small size, could we treat
> the ITS as a one-off case for now?  E.g. do something gross like
> shove entries into vcpu0's dirty ring?

The tables can be arbitrarily large, sparse, and are under control of
the guest anyway. And no, I'm not entertaining anything that gross.
I'm actually quite happy with not supporting the dirty ring and sticking
with the bitmap, which doesn't have any of these problems.

> 
> > Userspace knows what it has created the first place, and should be in
> > charge of it (i.e. I want to be able to migrate my GICv2 and
> > GICv3-without-ITS VMs with the rings only).
> 
> Ah, so enabling the dirty bitmap isn't strictly required.  That
> means this patch is wrong, and it also means that we need to figure
> out how we want to handle the case where mark_page_dirty_in_slot()
> is invoked without a running vCPU on a memslot without a
> dirty_bitmap.
> 
> I.e. what's an appropriate action in the below sequence:
> 
> void mark_page_dirty_in_slot(struct kvm *kvm,
> 			     const struct kvm_memory_slot *memslot,
> 		 	     gfn_t gfn)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 
> #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> 	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> 		return;
> 
> #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> 	if (WARN_ON_ONCE(!vcpu))
> 		return;
> #endif
> #endif
> 
> 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> 		unsigned long rel_gfn = gfn - memslot->base_gfn;
> 		u32 slot = (memslot->as_id << 16) | memslot->id;
> 
> 		if (vcpu && kvm->dirty_ring_size)
> 			kvm_dirty_ring_push(&vcpu->dirty_ring,
> 					    slot, rel_gfn);
> 		else if (memslot->dirty_bitmap)
> 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> 		else
> 			???? <=================================================
> 	}
> }
> 
> 
> Would it be possible to require a dirty bitmap when an ITS is
> created?  That would allow treating the above condition as a KVM
> bug.

No. This should be optional. Everything about migration should be
absolutely optional (I run plenty of concurrent VMs on sub-2GB
systems). You want to migrate a VM that has an ITS or will collect
dirty bits originating from a SMMU with HTTU, you enable the dirty
bitmap. You want to have *vcpu* based dirty rings, you enable them.

In short, there shouldn't be any reason for the two to be either
mandatory or conflated. Both should be optional and independent, because
they cover completely disjoint use cases. *userspace* should be in
charge of deciding this.

> 
> > > > @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > > >  {
> > > >  	int r;
> > > >  
> > > > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > > > +	if (!kvm->dirty_ring_with_bitmap)
> > > > +		return -EINVAL;
> > > > +#endif
> > > 
> > > This one at least is prettier with IS_ENABLED
> > > 
> > > 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> > > 	    !kvm->dirty_ring_with_bitmap)
> > > 		return -EINVAL;
> > > 
> > > But dirty_ring_with_bitmap really shouldn't need to exist.  It's
> > > mandatory for architectures that have
> > > HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
> > > that don't.  In other words, the API for enabling the dirty ring is
> > > a bit ugly.
> > > 
> > > Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
> > > officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> > > on top, what about usurping bits 63:32 of cap->args[0] for flags?
> > > E.g.
> 
> For posterity, filling in my missing idea...
> 
> Since the size is restricted to be well below a 32-bit value, and it's unlikely
> that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
> flags:
> 
>   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
>   {
> 	u32 flags = arg0 >> 32;
> 	u32 size = arg0;
> 
> However, since it sounds like enabling dirty_bitmap isn't strictly required, I
> have no objection to enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection
> was purely that KVM was adding a per-VM flag just to sanity check the configuration.
> 
> > > Ideally we'd use cap->flags directly, but we screwed up with
> > > KVM_CAP_DIRTY_LOG_RING and didn't require flags to be zero :-(
> > >
> > > Actually, what's the point of allowing
> > > KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be enabled?  I get why KVM would
> > > enumerate this info, i.e. allowing checking, but I don't seen any
> > > value in supporting a second method for enabling the dirty ring.
> > > 
> > > The acquire-release thing is irrelevant for x86, and no other
> > > architecture supports the dirty ring until this series, i.e. there's
> > > no need for KVM to detect that userspace has been updated to gain
> > > acquire-release semantics, because the fact that userspace is
> > > enabling the dirty ring on arm64 means userspace has been updated.
> > 
> > Do we really need to make the API more awkward? There is an
> > established pattern of "enable what is advertised". Some level of
> > uniformity wouldn't hurt, really.
> 
> I agree that uniformity would be nice, but for capabilities I don't
> think that's ever going to happen.  I'm pretty sure supporting
> enabling is actually in the minority.  E.g. of the 20 capabilities
> handled in kvm_vm_ioctl_check_extension_generic(), I believe only 3
> support enabling (KVM_CAP_HALT_POLL, KVM_CAP_DIRTY_LOG_RING, and
> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

I understood that you were advocating that a check for KVM_CAP_FOO
could result in enabling KVM_CAP_BAR. That I definitely object to.

	M.
Marc Zyngier Oct. 22, 2022, 10:54 a.m. UTC | #12
On Sat, 22 Oct 2022 09:27:41 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> Hi Sean,
> 
> On 10/22/22 12:05 AM, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Marc Zyngier wrote:
> >> On Fri, 21 Oct 2022 00:44:51 +0100,
> >> Sean Christopherson <seanjc@google.com> wrote:
> >>> 
> >>> On Tue, Oct 11, 2022, Gavin Shan wrote:
> >>>> Some architectures (such as arm64) need to dirty memory outside of the
> >>>> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> >>>> KVM's per-vCPU dirty ring.
> >>> 
> >>> What is the point of using the dirty ring in this case?  KVM still
> >>> burns a pile of memory for the bitmap.  Is the benefit that
> >>> userspace can get away with scanning the bitmap fewer times,
> >>> e.g. scan it once just before blackout under the assumption that
> >>> very few pages will dirty the bitmap?
> >> 
> >> Apparently, the throttling effect of the ring makes it easier to
> >> converge. Someone who actually uses the feature should be able to
> >> tell you. But that's a policy decision, and I don't see why we should
> >> be prescriptive.
> > 
> > I wasn't suggesting we be prescriptive, it was an honest question.
> > 
> >>> Why not add a global ring to @kvm?  I assume thread safety is a
> >>> problem, but the memory overhead of the dirty_bitmap also seems like
> >>> a fairly big problem.
> >> 
> >> Because we already have a stupidly bloated API surface, and that we
> >> could do without yet another one based on a sample of *one*?
> > 
> > But we're adding a new API regardless.  A per-VM ring would definitely be a bigger
> > addition, but if using the dirty_bitmap won't actually meet the needs of userspace,
> > then we'll have added a new API and still not have solved the problem.  That's why
> > I was asking why/when userspace would want to use dirty_ring+dirty_bitmap.
> > 
> 
> The bitmap can help solve the issue, but the extra memory consumption due to
> the bitmap is a concern, as you mentioned previously. More information about
> the issue can be found here [1]. On ARM64, several of the guest's physical
> pages are used by the VGIC/ITS to store its state during migration or system
> shutdown.
> 
> [1] https://lore.kernel.org/kvmarm/320005d1-fe88-fd6a-be91-ddb56f1aa80f@redhat.com/
> 
> >> Because dirtying memory outside of a vcpu context makes it incredibly awkward
> >> to handle a "ring full" condition?
> > 
> > Kicking all vCPUs with the soft-full request isn't _that_ awkward.  It's certainly
> > sub-optimal, but if inserting into the per-VM ring is relatively rare, then in
> > practice it's unlikely to impact guest performance.
> > 
> 
> It's still possible for the per-vCPU ring to become hard-full before it can
> be kicked off. A per-VM ring has other issues, one of which is the
> synchronization between KVM and userspace needed to avoid overrunning the
> per-VM ring. The bitmap was selected due to its simplicity.

Exactly. And once you overflow a ring because the device generates too
much data, what do you do? Return an error to the device?

> 
> >>>> Introduce a new flavor of dirty ring that requires the use of both vCPU
> >>>> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> >>>> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> >>>> the dirty bitmap. Userspace should scan the dirty bitmap before
> >>>> migrating the VM to the target.
> >>>> 
> >>>> Use an additional capability to advertise this behavior and require
> >>>> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> >>>> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> >>>> not allow userspace to enable dirty ring if it hasn't also enabled the
> >>>> ring && bitmap capability, as a VM is likely DOA without the pages
> >>>> marked in the bitmap.
> >> 
> >> This is wrong. The *only* case this is useful is when there is an
> >> in-kernel producer of data outside of the context of a vcpu, which is
> >> so far only the ITS save mechanism. No ITS? No need for this.
> > 
> > How large is the ITS?  If it's a fixed, small size, could we treat the ITS as a
> > one-off case for now?  E.g. do something gross like shove entries into vcpu0's
> > dirty ring?
> > 
> 
> There are several VGIC/ITS tables involved in the issue. I checked the
> specification and the implementation. As the device ID is 16 bits, there
> can be up to 0x10000 devices. Each device has its own ITT (Interrupt
> Translation Table), looked up by a 32-bit event ID. The memory used for
> the ITTs can be arbitrarily large in theory.
> 
>     Register       Description           Max-size   Entry-size  Max-entries
>     -----------------------------------------------------------------------
>     GITS_BASER0    ITS Device Table      512KB      8-bytes     0x10000
>     GITS_BASER1    ITS Collection Table  512KB      8-bytes     0x10000

Both can be two levels. So you can multiply the max size by 64K. The
entry size also depends on the revision of the ABI and can be changed
anytime we see fit.

>     GITS_BASER2    (GICv4) ITS VPE Table 512KB      8-bytes(?)  0x10000

We don't virtualise GICv4. We use GICv4 to virtualise a GICv3. So this
table will never be saved (the guest never sees it, and only KVM
manages it).

>     max-devices * (1UL << event_id_shift) * entry_size =
>     0x10000 * (1UL << 32) * 8                          = 2^51 bytes (2 PiB)
> 
> >> Userspace knows what it has created the first place, and should be in
> >> charge of it (i.e. I want to be able to migrate my GICv2 and
> >> GICv3-without-ITS VMs with the rings only).
> > 
> > Ah, so enabling the dirty bitmap isn't strictly required.  That means this patch
> > is wrong, and it also means that we need to figure out how we want to handle the
> > case where mark_page_dirty_in_slot() is invoked without a running vCPU on a memslot
> > without a dirty_bitmap.
> > 
> > I.e. what's an appropriate action in the below sequence:
> > 
> > void mark_page_dirty_in_slot(struct kvm *kvm,
> > 			     const struct kvm_memory_slot *memslot,
> > 		 	     gfn_t gfn)
> > {
> > 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > 
> > #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > 	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> > 		return;
> > 
> > #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > 	if (WARN_ON_ONCE(!vcpu))
> > 		return;
> > #endif
> > #endif
> > 
> > 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> > 		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > 		u32 slot = (memslot->as_id << 16) | memslot->id;
> > 
> > 		if (vcpu && kvm->dirty_ring_size)
> > 			kvm_dirty_ring_push(&vcpu->dirty_ring,
> > 					    slot, rel_gfn);
> > 		else if (memslot->dirty_bitmap)
> > 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > 		else
> > 			???? <=================================================
> > 	}
> > }
> > 
> > 
> > Would it be possible to require a dirty bitmap when an ITS is
> > created?  That would allow treating the above condition as a KVM
> > bug.
> > 
> 
> According to the above calculation, it's impossible to determine the
> memory size for the bitmap in advance. The memory used by ITE
> (Interrupt Translation Entry) tables can, in theory, consume all of the
> guest's system memory. ITE tables are scattered in the guest's system
> memory, and we don't know their locations in advance; ITE tables
> are created dynamically on request from the guest.
> 
> However, I think it's a good idea to enable the bitmap only when
> "arm-its-kvm" is really used in userspace (QEMU). For example, the
> machine and (kvm) accelerator are initialized like below. It's
> unknown if "arm-its-kvm" is used until (c). So we can enable
> KVM_CAP_DIRTY_RING_WITH_BITMAP in (d) and the bitmap is created in
> (e) by KVM.
> 
>   main
>     qemu_init
>       qemu_create_machine                   (a) machine instance is created
>       configure_accelerators
>         do_configure_accelerator
>           accel_init_machine
>             kvm_init                        (b) KVM is initialized
>       :
>       qmp_x_exit_preconfig
>         qemu_init_board
>           machine_run_board_init            (c) The board is initialized
>       :
>       accel_setup_post                      (d) KVM is post initialized
>       :
>       <migration>                           (e) Migration starts
> 
> In order to record whether the bitmap is really needed, "struct
> kvm::dirty_ring_with_bitmap" is still required.
> 
>    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is advertised when
>      CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP is selected.
> 
>    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled in (d) only when
>      "arm-its-kvm" is used in QEMU. After the capability is enabled,
>      "struct kvm::dirty_ring_with_bitmap" is set to 1.
> 
>    - The bitmap is created by KVM in (e).
> 
> If the above analysis makes sense, I don't see that anything is
> missing from the patch. Of course, KVM_CAP_DIRTY_LOG_RING_{ACQ_REL,
> WITH_BITMAP} need to be enabled separately and shouldn't depend on each
> other. The description added to "Documentation/virt/kvm/abi.rst"
> needs to be improved as Peter and Oliver suggested.
> kvm_dirty_ring_exclusive() needs to be renamed to
> kvm_use_dirty_bitmap() and the "#ifdef"s need to be cut down as Sean
> suggested.

Frankly, I really hate the "mayo and ketchup" approach. Both dirty
tracking approaches serve different purposes, and I really don't see
the point in merging them behind a single cap. Userspace should be
able to chose if and when it wants to use a logging method or another.
We should document how they interact, but that's about it.


> 
> 
> >>>> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> >>>>   {
> >>>>   	int r;
> >>>>   +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> >>>> +	if (!kvm->dirty_ring_with_bitmap)
> >>>> +		return -EINVAL;
> >>>> +#endif
> >>> 
> >>> This one at least is prettier with IS_ENABLED
> >>> 
> >>> 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> >>> 	    !kvm->dirty_ring_with_bitmap)
> >>> 		return -EINVAL;
> >>> 
> >>> But dirty_ring_with_bitmap really shouldn't need to exist.  It's
> >>> mandatory for architectures that have
> >>> HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
> >>> that don't.  In other words, the API for enabling the dirty ring is
> >>> a bit ugly.
> >>> 
> >>> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
> >>> officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> >>> on top, what about usurping bits 63:32 of cap->args[0] for flags?
> >>> E.g.
> > 
> > For posterity, filling in my missing idea...
> > 
> > Since the size is restricted to be well below a 32-bit value, and it's unlikely
> > that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
> > flags:
> > 
> >    static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
> >    {
> > 	u32 flags = arg0 >> 32;
> > 	u32 size = arg0;
> > 
> > However, since it sounds like enabling dirty_bitmap isn't strictly
> > required, I have no objection to enabling
> > KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection was purely that
> > KVM was adding a per-VM flag just to sanity check the
> > configuration.
> > 
> 
> If KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled for "arm-its-kvm",
> it'd be better to allow enabling those two capabilities (ACQ_REL and
> WITH_BITMAP) separately, as I explained above. Userspace (QEMU) will
> gain flexibility if these two capabilities can be enabled
> separately.
> 
> To QEMU, KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> are accelerator properties. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is a
> board property. Relaxing their dependency will give flexibility to
> QEMU.

That.

	M.
Sean Christopherson Oct. 24, 2022, 11:50 p.m. UTC | #13
On Sat, Oct 22, 2022, Marc Zyngier wrote:
> On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Fri, Oct 21, 2022, Marc Zyngier wrote:
> > > Because dirtying memory outside of a vcpu context makes it
> > > incredibly awkward to handle a "ring full" condition?
> > 
> > Kicking all vCPUs with the soft-full request isn't _that_ awkward.
> > It's certainly sub-optimal, but if inserting into the per-VM ring is
> > relatively rare, then in practice it's unlikely to impact guest
> > performance.
> 
> But there is *nothing* to kick here. The kernel is dirtying pages,
> devices are dirtying pages (DMA), and there is no context associated
> with that. Which is why a finite ring is the wrong abstraction.

I don't follow.  If there's a VM, KVM can always kick all vCPUs.  Again, might
be far from optimal, but it's an option.  If there's literally no VM, then KVM
isn't involved at all and there's no "ring vs. bitmap" decision.

> > Would it be possible to require a dirty bitmap when an ITS is
> > created?  That would allow treating the above condition as a KVM
> > bug.
> 
> No. This should be optional. Everything about migration should be
> absolutely optional (I run plenty of concurrent VMs on sub-2GB
> systems). You want to migrate a VM that has an ITS or will collect
> dirty bits originating from a SMMU with HTTU, you enable the dirty
> bitmap. You want to have *vcpu* based dirty rings, you enable them.
> 
> In short, there shouldn't be any reason for the two to be either
> mandatory or conflated. Both should be optional and independent, because
> they cover completely disjoint use cases. *userspace* should be in
> charge of deciding this.

I agree about userspace being in control, what I want to avoid is letting userspace
put KVM into a bad state without any indication from KVM that the setup is wrong
until something actually dirties a page.

Specifically, if mark_page_dirty_in_slot() is invoked without a running vCPU, on
a memslot with dirty tracking enabled but without a dirty bitmap, then the migration
is doomed.  Dropping the dirty page isn't a sane response as that'd all but
guarantee memory corruption in the guest.  At best, KVM could kick all vCPUs out
to userspace with a new exit reason, but that's not a very good experience for
userspace as either the VM is unexpectedly unmigratable or the VM ends up being
killed (or I suppose userspace could treat the exit as a per-VM dirty ring of
size '1').

That's why I asked if it's possible for KVM to require a dirty_bitmap when KVM
might end up collecting dirty information without a vCPU.  KVM is still
technically prescribing a solution to userspace, but only because there's only
one solution.

> > > > The acquire-release thing is irrelevant for x86, and no other
> > > > architecture supports the dirty ring until this series, i.e. there's
> > > > no need for KVM to detect that userspace has been updated to gain
> > > > acquire-release semantics, because the fact that userspace is
> > > > enabling the dirty ring on arm64 means userspace has been updated.
> > > 
> > > Do we really need to make the API more awkward? There is an
> > > established pattern of "enable what is advertised". Some level of
> > > uniformity wouldn't hurt, really.
> > 
> > I agree that uniformity would be nice, but for capabilities I don't
> > think that's ever going to happen.  I'm pretty sure supporting
> > enabling is actually in the minority.  E.g. of the 20 capabilities
> > handled in kvm_vm_ioctl_check_extension_generic(), I believe only 3
> > support enabling (KVM_CAP_HALT_POLL, KVM_CAP_DIRTY_LOG_RING, and
> > KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
> 
> I understood that you were advocating that a check for KVM_CAP_FOO
> could result in enabling KVM_CAP_BAR. That I definitely object to.

I was hoping KVM could make the ACQ_REL capability an extension of DIRTY_LOG_RING,
i.e. userspace would DIRTY_LOG_RING _and_ DIRTY_LOG_RING_ACQ_REL for ARM and other
architectures, e.g.

  int enable_dirty_ring(void)
  {
	if (!kvm_check(KVM_CAP_DIRTY_LOG_RING))
		return -EINVAL;

	if (!tso && !kvm_check(KVM_CAP_DIRTY_LOG_RING_ACQ_REL))
		return -EINVAL;

	return kvm_enable(KVM_CAP_DIRTY_LOG_RING);
  }

But I failed to consider that userspace might try to enable DIRTY_LOG_RING on
all architectures, i.e. wouldn't arbitrarily restrict DIRTY_LOG_RING to x86.
Sean Christopherson Oct. 25, 2022, 12:08 a.m. UTC | #14
On Mon, Oct 24, 2022, Sean Christopherson wrote:
> On Sat, Oct 22, 2022, Marc Zyngier wrote:
> > On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Fri, Oct 21, 2022, Marc Zyngier wrote:
> > > > Because dirtying memory outside of a vcpu context makes it
> > > > incredibly awkward to handle a "ring full" condition?
> > > 
> > > Kicking all vCPUs with the soft-full request isn't _that_ awkward.
> > > It's certainly sub-optimal, but if inserting into the per-VM ring is
> > > relatively rare, then in practice it's unlikely to impact guest
> > > performance.
> > 
> > But there is *nothing* to kick here. The kernel is dirtying pages,
> > devices are dirtying pages (DMA), and there is no context associated
> > with that. Which is why a finite ring is the wrong abstraction.
> 
> I don't follow.  If there's a VM, KVM can always kick all vCPUs.  Again, might
> be far from optimal, but it's an option.  If there's literally no VM, then KVM
> isn't involved at all and there's no "ring vs. bitmap" decision.

Finally caught up in the other part of the thread that calls out that the devices
can't be stalled.

https://lore.kernel.org/all/87czakgmc0.wl-maz@kernel.org
Oliver Upton Oct. 25, 2022, 12:24 a.m. UTC | #15
On Mon, Oct 24, 2022 at 11:50:29PM +0000, Sean Christopherson wrote:
> On Sat, Oct 22, 2022, Marc Zyngier wrote:
> > On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@google.com> wrote:

[...]

> > > Would it be possible to require a dirty bitmap when an ITS is
> > > created?  That would allow treating the above condition as a KVM
> > > bug.
> > 
> > No. This should be optional. Everything about migration should be
> > absolutely optional (I run plenty of concurrent VMs on sub-2GB
> > systems). You want to migrate a VM that has an ITS or will collect
> > dirty bits originating from a SMMU with HTTU, you enable the dirty
> > bitmap. You want to have *vcpu* based dirty rings, you enable them.
> > 
> > In short, there shouldn't be any reason for the two to be either
> > mandatory or conflated. Both should be optional and independent, because
> > they cover completely disjoint use cases. *userspace* should be in
> > charge of deciding this.
> 
> I agree about userspace being in control, what I want to avoid is letting userspace
> put KVM into a bad state without any indication from KVM that the setup is wrong
> until something actually dirties a page.
> 
> Specifically, if mark_page_dirty_in_slot() is invoked without a running vCPU, on
> a memslot with dirty tracking enabled but without a dirty bitmap, then the migration
> is doomed.  Dropping the dirty page isn't a sane response as that'd all but
> > guarantee memory corruption in the guest.  At best, KVM could kick all vCPUs out
> to userspace with a new exit reason, but that's not a very good experience for
> userspace as either the VM is unexpectedly unmigratable or the VM ends up being
> killed (or I suppose userspace could treat the exit as a per-VM dirty ring of
> size '1').

This only works on the assumption that the VM is in fact running. In the
case of the GIC ITS, I would expect that the VM has already been paused
in preparation for serialization. So, there would never be a vCPU thread
that would ever detect the kick.

> That's why I asked if it's possible for KVM to require a dirty_bitmap when KVM
> might end up collecting dirty information without a vCPU.  KVM is still
> technically prescribing a solution to userspace, but only because there's only
> one solution.

I was trying to allude to something like this by flat-out requiring
ring + bitmap on arm64.

Otherwise, we'd either need to:

 (1) Document the features that explicitly depend on ring + bitmap (i.e.
 GIC ITS, whatever else may come) such that userspace sets up the
>  correct configuration based on what it's using. The combined likelihood
 of both KVM and userspace getting this right seems low.

 (2) Outright reject the use of features that require ring + bitmap.
 This pulls in ordering around caps and other UAPI.

> > > > > The acquire-release thing is irrelevant for x86, and no other
> > > > > architecture supports the dirty ring until this series, i.e. there's
> > > > > no need for KVM to detect that userspace has been updated to gain
> > > > > acquire-release semantics, because the fact that userspace is
> > > > > enabling the dirty ring on arm64 means userspace has been updated.
> > > > 
> > > > Do we really need to make the API more awkward? There is an
> > > > established pattern of "enable what is advertised". Some level of
> > > > uniformity wouldn't hurt, really.
> > > 
> > > I agree that uniformity would be nice, but for capabilities I don't
> > > think that's ever going to happen.  I'm pretty sure supporting
> > > enabling is actually in the minority.  E.g. of the 20 capabilities
> > > handled in kvm_vm_ioctl_check_extension_generic(), I believe only 3
> > > support enabling (KVM_CAP_HALT_POLL, KVM_CAP_DIRTY_LOG_RING, and
> > > KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
> > 
> > I understood that you were advocating that a check for KVM_CAP_FOO
> > could result in enabling KVM_CAP_BAR. That I definitely object to.
> 
> I was hoping KVM could make the ACQ_REL capability an extension of DIRTY_LOG_RING,
> i.e. userspace would DIRTY_LOG_RING _and_ DIRTY_LOG_RING_ACQ_REL for ARM and other
> architectures, e.g.
> 
>   int enable_dirty_ring(void)
>   {
> 	if (!kvm_check(KVM_CAP_DIRTY_LOG_RING))
> 		return -EINVAL;
> 
> 	if (!tso && !kvm_check(KVM_CAP_DIRTY_LOG_RING_ACQ_REL))
> 		return -EINVAL;
> 
> 	return kvm_enable(KVM_CAP_DIRTY_LOG_RING);
>   }
> 
> But I failed to consider that userspace might try to enable DIRTY_LOG_RING on
> all architectures, i.e. wouldn't arbitrarily restrict DIRTY_LOG_RING to x86.

The third option would be to toss DIRTY_LOG_RING_ACQ_REL this release
and instead add DIRTY_LOG_RING2, this time checking the flags.

--
Thanks,
Oliver
Marc Zyngier Oct. 25, 2022, 7:22 a.m. UTC | #16
On Tue, 25 Oct 2022 00:50:29 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Sat, Oct 22, 2022, Marc Zyngier wrote:
> > On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Fri, Oct 21, 2022, Marc Zyngier wrote:
> > > > Because dirtying memory outside of a vcpu context makes it
> > > > incredibly awkward to handle a "ring full" condition?
> > > 
> > > Kicking all vCPUs with the soft-full request isn't _that_ awkward.
> > > It's certainly sub-optimal, but if inserting into the per-VM ring is
> > > relatively rare, then in practice it's unlikely to impact guest
> > > performance.
> > 
> > But there is *nothing* to kick here. The kernel is dirtying pages,
> > devices are dirtying pages (DMA), and there is no context associated
> > with that. Which is why a finite ring is the wrong abstraction.
> 
> I don't follow.  If there's a VM, KVM can always kick all vCPUs.
> Again, might be far from optimal, but it's an option.  If there's
> literally no VM, then KVM isn't involved at all and there's no "ring
> vs. bitmap" decision.

The key word is *device*. No vcpu is involved here. Actually, we
actively prevent save/restore of the ITS while vcpus are running. How
could you even expect to snapshot a consistent state if the interrupt
state is changing under your feet?

> 
> > > Would it be possible to require a dirty bitmap when an ITS is
> > > created?  That would allow treating the above condition as a KVM
> > > bug.
> > 
> > No. This should be optional. Everything about migration should be
> > absolutely optional (I run plenty of concurrent VMs on sub-2GB
> > systems). You want to migrate a VM that has an ITS or will collect
> > dirty bits originating from a SMMU with HTTU, you enable the dirty
> > bitmap. You want to have *vcpu* based dirty rings, you enable them.
> > 
> > In short, there shouldn't be any reason for the two to be either
> > mandatory or conflated. Both should be optional and independent, because
> > they cover completely disjoint use cases. *userspace* should be in
> > charge of deciding this.
> 
> I agree about userspace being in control, what I want to avoid is
> letting userspace put KVM into a bad state without any indication
> from KVM that the setup is wrong until something actually dirties a
> page.

I can't see how that can result in a bad state for KVM itself. All you
lack is a way for userspace to *track* the dirtying. Just like we
don't have a way to track the dirtying of a page from the VMM.

> Specifically, if mark_page_dirty_in_slot() is invoked without a
> running vCPU, on a memslot with dirty tracking enabled but without a
> dirty bitmap, then the migration is doomed.

Yup, and that's a luser error. Too bad. Userspace can still transfer
all the memory, and all will be fine.

> Dropping the dirty page isn't a sane response as that'd all but
> guaranatee memory corruption in the guest.

Again, user error. Userspace can readily write over all the guest
memory (virtio), and no amount of KVM-side tracking will help. What
are you going to do about it?

At the end of the day, what are you trying to do? All the dirty
tracking muck (bitmap and ring) is only a way for userspace to track
dirty pages more easily and accelerate the transfer. If userspace
doesn't tell KVM to track these writes, tough luck. If the author of a
VMM doesn't understand that, then maybe they shouldn't be in charge of
the VMM. Worst case, they can still transfer the whole thing, no harm
done.

> At best, KVM could kick all vCPUs out to userspace
> with a new exit reason, but that's not a very good experience for
> userspace as either the VM is unexpectedly unmigratable or the VM
> ends up being killed (or I suppose userspace could treat the exit as
> a per-VM dirty ring of size '1').

Can we please stop the exit nonsense? There is no vcpu involved
here. This is a device (emulated or not) writing to memory, triggered
by an ioctl from userspace. If you're thinking vcpu, you have the
wrong end of the stick.

Memory gets dirtied system wide, not just by CPUs, and no amount of
per-vcpu resource is going to solve this problem. VM-based rings can
help with this if they provide a way to recover from an overflow. But that
obviously doesn't work here as we can't checkpoint and restart the
saving process on overflow.

	M.
Marc Zyngier Oct. 25, 2022, 7:31 a.m. UTC | #17
On Tue, 25 Oct 2022 01:24:19 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> On Mon, Oct 24, 2022 at 11:50:29PM +0000, Sean Christopherson wrote:
> > On Sat, Oct 22, 2022, Marc Zyngier wrote:
> > > On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@google.com> wrote:
> 
> [...]
> 
> > > > Would it be possible to require a dirty bitmap when an ITS is
> > > > created?  That would allow treating the above condition as a KVM
> > > > bug.
> > > 
> > > No. This should be optional. Everything about migration should be
> > > absolutely optional (I run plenty of concurrent VMs on sub-2GB
> > > systems). You want to migrate a VM that has an ITS or will collect
> > > dirty bits originating from a SMMU with HTTU, you enable the dirty
> > > bitmap. You want to have *vcpu* based dirty rings, you enable them.
> > > 
> > > In short, there shouldn't be any reason for the two to be either
> > > mandatory or conflated. Both should be optional and independent, because
> > > they cover completely disjoint use cases. *userspace* should be in
> > > charge of deciding this.
> > 
> > I agree about userspace being in control, what I want to avoid is letting userspace
> > put KVM into a bad state without any indication from KVM that the setup is wrong
> > until something actually dirties a page.
> > 
> > Specifically, if mark_page_dirty_in_slot() is invoked without a running vCPU, on
> > a memslot with dirty tracking enabled but without a dirty bitmap, then the migration
> > is doomed.  Dropping the dirty page isn't a sane response as that'd all but
> > guarantee memory corruption in the guest.  At best, KVM could kick all vCPUs out
> > to userspace with a new exit reason, but that's not a very good experience for
> > userspace as either the VM is unexpectedly unmigratable or the VM ends up being
> > killed (or I suppose userspace could treat the exit as a per-VM dirty ring of
> > size '1').
> 
> This only works on the assumption that the VM is in fact running. In the
> case of the GIC ITS, I would expect that the VM has already been paused
> in preparation for serialization. So, there would never be a vCPU thread
> that would ever detect the kick.

This is indeed the case. The ioctl will return -EBUSY if any vcpu is
running.

> 
> > That's why I asked if it's possible for KVM to require a dirty_bitmap when KVM
> > might end up collecting dirty information without a vCPU.  KVM is still
> > technically prescribing a solution to userspace, but only because there's only
> > one solution.
> 
> I was trying to allude to something like this by flat-out requiring
> ring + bitmap on arm64.

And I claim that this is wrong. It may suit a particular use case, but
that's definitely not a universal truth.

> 
> Otherwise, we'd either need to:
> 
>  (1) Document the features that explicitly depend on ring + bitmap (i.e.
>  GIC ITS, whatever else may come) such that userspace sets up the
>  correct configuration based on what it's using. The combined likelihood
>  of both KVM and userspace getting this right seems low.

But what is there to get wrong? Absolutely nothing. Today, you can
save/restore a GICv3-ITS VM without a bitmap at all. Just dump all of
the memory. The bitmap only allows you to do it while the vcpus are
running. Do you want a dirty ring because it makes things faster?
Fine. But you need to understand what this does.

Yes, this may require some additional documentation. But more
importantly, it requires VMM authors to pay attention to what is
happening. At least the QEMU folks are doing that.

>  (2) Outright reject the use of features that require ring + bitmap.
>  This pulls in ordering around caps and other UAPI.

I don't think this makes any sense. Neither bitmap nor ring should be
a prerequisite for *anything*.

	M.
Sean Christopherson Oct. 25, 2022, 5:47 p.m. UTC | #18
On Tue, Oct 25, 2022, Marc Zyngier wrote:
> On Tue, 25 Oct 2022 01:24:19 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > > That's why I asked if it's possible for KVM to require a dirty_bitmap when KVM
> > > might end up collecting dirty information without a vCPU.  KVM is still
> > > technically prescribing a solution to userspace, but only because there's only
> > > one solution.
> > 
> > I was trying to allude to something like this by flat-out requiring
> > ring + bitmap on arm64.
> 
> And I claim that this is wrong. It may suit a particular use case, but
> that's definitely not a universal truth.

Agreed, KVM should not unconditionally require a dirty bitmap for arm64.

> > Otherwise, we'd either need to:
> > 
> >  (1) Document the features that explicitly depend on ring + bitmap (i.e.
> >  GIC ITS, whatever else may come) such that userspace sets up the
> >  correct configuration based on what it's using. The combined likelihood
> >  of both KVM and userspace getting this right seems low.
> 
> But what is there to get wrong? Absolutely nothing.

I strongly disagree.  On x86, we've had two bugs escape where KVM attempted to
mark a page dirty without an active vCPU.

  2efd61a608b0 ("KVM: Warn if mark_page_dirty() is called without an active vCPU") 
  42dcbe7d8bac ("KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU")

Call us incompetent, but I have zero confidence that KVM will never unintentionally
add a path that invokes mark_page_dirty_in_slot() without a running vCPU.

By completely dropping the rule that KVM must have an active vCPU on architectures
that support ring+bitmap, those types of bugs will go silently unnoticed, and will
manifest as guest data corruption after live migration.

And ideally such bugs would be detected without relying on userspace
dirty logging, e.g. the Hyper-V bug lurked for quite some time and was only found
when mark_page_dirty_in_slot() started WARNing.

I'm ok if arm64 wants to let userspace shoot itself in the foot with the ITS, but
I'm not ok dropping the protections in the common mark_page_dirty_in_slot().

One somewhat gross idea would be to let architectures override the "there must be
a running vCPU" rule, e.g. arm64 could toggle a flag in kvm->arch in its
kvm_write_guest_lock() to note that an expected write without a vCPU is in-progress:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8c5c69ba47a7..d1da8914f749 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
+               return;
+
+       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
                return;
 #endif
 
@@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
                unsigned long rel_gfn = gfn - memslot->base_gfn;
                u32 slot = (memslot->as_id << 16) | memslot->id;
 
-               if (kvm->dirty_ring_size)
+               if (kvm->dirty_ring_size && vcpu)
                        kvm_dirty_ring_push(&vcpu->dirty_ring,
                                            slot, rel_gfn);
-               else
+               else if (memslot->dirty_bitmap)
                        set_bit_le(rel_gfn, memslot->dirty_bitmap);
        }
 }
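
As a sanity check on the arbitration the diff above implements, here is a toy,
userspace-only model of the interesting cases. All struct and function names
below are invented for illustration; none of this is the real KVM code, and
the arch hook is just the hypothetical opt-out being discussed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a dirtied page ends up, per the proposed dispatch. */
enum sink { SINK_DROPPED, SINK_RING, SINK_BITMAP };

struct toy_vcpu { int unused; };

struct toy_kvm {
	bool dirty_ring_enabled;	/* models kvm->dirty_ring_size != 0 */
	bool arch_allows_no_vcpu;	/* models the proposed arch opt-out hook */
	bool slot_has_bitmap;		/* models memslot->dirty_bitmap != NULL */
};

enum sink mark_page_dirty_model(const struct toy_kvm *kvm,
				const struct toy_vcpu *vcpu)
{
	/* No vCPU and the architecture did not opt out: WARN and drop. */
	if (!kvm->arch_allows_no_vcpu && !vcpu)
		return SINK_DROPPED;

	/* A running vCPU with a ring: push into the per-vCPU dirty ring. */
	if (kvm->dirty_ring_enabled && vcpu)
		return SINK_RING;

	/* Otherwise sink into the slot's bitmap, if one was allocated. */
	if (kvm->slot_has_bitmap)
		return SINK_BITMAP;

	return SINK_DROPPED;
}
```

Note that the vCPU-less case only lands in the bitmap when the arch hook opts
out *and* a bitmap exists, which is what the ring+bitmap capability is meant
to guarantee.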
Marc Zyngier Oct. 27, 2022, 8:29 a.m. UTC | #19
On Tue, 25 Oct 2022 18:47:12 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Oct 25, 2022, Marc Zyngier wrote:
> > On Tue, 25 Oct 2022 01:24:19 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > That's why I asked if it's possible for KVM to require a dirty_bitmap when KVM
> > > > might end up collecting dirty information without a vCPU.  KVM is still
> > > > technically prescribing a solution to userspace, but only because there's only
> > > > one solution.
> > > 
> > > I was trying to allude to something like this by flat-out requiring
> > > ring + bitmap on arm64.
> > 
> > And I claim that this is wrong. It may suit a particular use case, but
> > that's definitely not a universal truth.
> 
> Agreed, KVM should not unconditionally require a dirty bitmap for arm64.
> 
> > > Otherwise, we'd either need to:
> > > 
> > >  (1) Document the features that explicitly depend on ring + bitmap (i.e.
> > >  GIC ITS, whatever else may come) such that userspace sets up the
> > >  correct configuration based on what it's using. The combined likelihood
> > >  of both KVM and userspace getting this right seems low.
> > 
> > But what is there to get wrong? Absolutely nothing.
> 
> I strongly disagree.  On x86, we've had two bugs escape where KVM
> attempted to mark a page dirty without an active vCPU.
> 
>   2efd61a608b0 ("KVM: Warn if mark_page_dirty() is called without an active vCPU") 
>   42dcbe7d8bac ("KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU")
> 
> Call us incompetent, but I have zero confidence that KVM will never
> unintentionally add a path that invokes mark_page_dirty_in_slot()
> without a running vCPU.

Well, maybe it is time that KVM acknowledges there is a purpose to
dirtying memory outside of a vcpu context, and that if a write happens
in a vcpu context, this vcpu must be explicitly passed down rather
than obtained from kvm_get_running_vcpu(). Yes, this requires some
heavy surgery.

> By completely dropping the rule that KVM must have an active vCPU on
> architectures that support ring+bitmap, those types of bugs will go
> silently unnoticed, and will manifest as guest data corruption after
> live migration.

The elephant in the room is still userspace writing to its view of the
guest memory for device emulation. Do they get it right? I doubt it.

> And ideally such bugs would be detected without relying on userspace
> enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
> time and was only found when mark_page_dirty_in_slot() started
> WARNing.
> 
> I'm ok if arm64 wants to let userspace shoot itself in the foot with
> the ITS, but I'm not ok dropping the protections in the common
> mark_page_dirty_in_slot().
> 
> One somewhat gross idea would be to let architectures override the
> "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
> in kvm->arch in its kvm_write_guest_lock() to note that an expected
> write without a vCPU is in-progress:
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8c5c69ba47a7..d1da8914f749 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>         struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>  
>  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
> +               return;
> +
> +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>                 return;
>  #endif
>  
> @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>                 unsigned long rel_gfn = gfn - memslot->base_gfn;
>                 u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -               if (kvm->dirty_ring_size)
> +               if (kvm->dirty_ring_size && vcpu)
>                         kvm_dirty_ring_push(&vcpu->dirty_ring,
>                                             slot, rel_gfn);
> -               else
> +               else if (memslot->dirty_bitmap)
>                         set_bit_le(rel_gfn, memslot->dirty_bitmap);
>         }
>  }

I think this is equally wrong. Writes occur from both CPUs and devices
*concurrently*, and I don't see why KVM should keep ignoring this
pretty obvious fact.

Yes, your patch papers over the problem, and it can probably work if
the kvm->arch flag only gets set in the ITS saving code, which is
already exclusive of vcpus running.

But in the long run, with dirty bits being collected from the IOMMU
page tables or directly from devices, we will need a way to reconcile
the dirty tracking. The above doesn't quite cut it, unfortunately.

	M.
Sean Christopherson Oct. 27, 2022, 5:44 p.m. UTC | #20
On Thu, Oct 27, 2022, Marc Zyngier wrote:
> On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > Call us incompetent, but I have zero confidence that KVM will never
> > unintentionally add a path that invokes mark_page_dirty_in_slot()
> > without a running vCPU.
> 
> Well, maybe it is time that KVM acknowledges there is a purpose to
> dirtying memory outside of a vcpu context, and that if a write happens
> in a vcpu context, this vcpu must be explicitly passed down rather
> than obtained from kvm_get_running_vcpu(). Yes, this requires some
> heavy surgery.

Heh, preaching to the choir on this one.

  On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
  > IMO, adding kvm_get_running_vcpu() is a hack that is just asking for future
  > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring() look
  > extremely fragile.

I'm all in favor of not using kvm_get_running_vcpu() in this path.

That said, it's somewhat of an orthogonal issue, as I would still want a sanity
check in mark_page_dirty_in_slot() that a vCPU is provided when there is no
dirty bitmap.

> > By completely dropping the rule that KVM must have an active vCPU on
> > architectures that support ring+bitmap, those types of bugs will go
> > silently unnoticed, and will manifest as guest data corruption after
> > live migration.
> 
> The elephant in the room is still userspace writing to its view of the
> guest memory for device emulation. Do they get it right? I doubt it.

I don't see what that has to do with KVM though.  There are many things userspace
needs to get right, but that doesn't mean that KVM shouldn't strive to provide
safeguards for the functionality that KVM provides.

> > And ideally such bugs would be detected without relying on userspace
> > enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
> > time and was only found when mark_page_dirty_in_slot() started
> > WARNing.
> > 
> > I'm ok if arm64 wants to let userspace shoot itself in the foot with
> > the ITS, but I'm not ok dropping the protections in the common
> > mark_page_dirty_in_slot().
> > 
> > One somewhat gross idea would be to let architectures override the
> > "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
> > in kvm->arch in its kvm_write_guest_lock() to note that an expected
> > write without a vCPU is in-progress:
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 8c5c69ba47a7..d1da8914f749 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> >         struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> >  
> >  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> > +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
> > +               return;
> > +
> > +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> >                 return;
> >  #endif
> >  
> > @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> >                 unsigned long rel_gfn = gfn - memslot->base_gfn;
> >                 u32 slot = (memslot->as_id << 16) | memslot->id;
> >  
> > -               if (kvm->dirty_ring_size)
> > +               if (kvm->dirty_ring_size && vcpu)
> >                         kvm_dirty_ring_push(&vcpu->dirty_ring,
> >                                             slot, rel_gfn);
> > -               else
> > +               else if (memslot->dirty_bitmap)
> >                         set_bit_le(rel_gfn, memslot->dirty_bitmap);
> >         }
> >  }
> 
> I think this is equally wrong. Writes occur from both CPUs and devices
> *concurrently*, and I don't see why KVM should keep ignoring this
> pretty obvious fact.
>
> Yes, your patch papers over the problem, and it can probably work if
> the kvm->arch flag only gets set in the ITS saving code, which is
> already exclusive of vcpus running.
> 
> But in the long run, with dirty bits being collected from the IOMMU
> page tables or directly from devices, we will need a way to reconcile
> the dirty tracking. The above doesn't quite cut it, unfortunately.

Oooh, are you referring to IOMMU page tables and devices _in the guest_?  E.g. if
KVM itself were to emulate a vIOMMU, then KVM would be responsible for updating
dirty bits in the vIOMMU page tables.

Not that it really matters, but do we actually expect KVM to ever emulate a vIOMMU?
On x86 at least, in-kernel acceleration of vIOMMU emulation seems more like VFIO
territory.

Regardless, I don't think the above idea makes it any more difficult to support
in-KVM emulation of non-CPU stuff, which IIUC is the ITS case.  I 100% agree that
the above is a hack, but that's largely due to the use of kvm_get_running_vcpu().

A slightly different alternative would be to have a completely separate API for writing
guest memory without an associated vCPU.  I.e. start building up proper device emulation
support.  Then the vCPU-based APIs could yell if a vCPU isn't provided (or there
is no running vCPU in the current mess).  And the device-based API could be
provided if and only if the architecture actually supports emulating writes from
devices, i.e. x86 would not opt-in and so would not even have access to the API.
Marc Zyngier Oct. 27, 2022, 6:30 p.m. UTC | #21
On Thu, 27 Oct 2022 18:44:51 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Oct 27, 2022, Marc Zyngier wrote:
> > On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > Call us incompetent, but I have zero confidence that KVM will never
> > > unintentionally add a path that invokes mark_page_dirty_in_slot()
> > > without a running vCPU.
> > 
> > Well, maybe it is time that KVM acknowledges there is a purpose to
> > dirtying memory outside of a vcpu context, and that if a write happens
> > in a vcpu context, this vcpu must be explicitly passed down rather
> > than obtained from kvm_get_running_vcpu(). Yes, this requires some
> > heavy surgery.
> 
> Heh, preaching to the choir on this one.
> 
>   On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
>   > IMO, adding kvm_get_running_vcpu() is a hack that is just asking for future
>   > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring() look
>   > extremely fragile.
> 
> I'm all in favor of not using kvm_get_running_vcpu() in this path.
> 
> That said, it's somewhat of an orthogonal issue, as I would still
> want a sanity check in mark_page_dirty_in_slot() that a vCPU is
> provided when there is no dirty bitmap.

If we have a separate context and/or API, then all these checks become
a lot less controversial, and we can start reasoning about these
things. At the moment, this is just a mess.

> 
> > > By completely dropping the rule that KVM must have an active vCPU on
> > > architectures that support ring+bitmap, those types of bugs will go
> > > silently unnoticed, and will manifest as guest data corruption after
> > > live migration.
> > 
> > The elephant in the room is still userspace writing to its view of the
> > guest memory for device emulation. Do they get it right? I doubt it.
> 
> I don't see what that has to do with KVM though.  There are many
> things userspace needs to get right, but that doesn't mean that KVM
> shouldn't strive to provide safeguards for the functionality that
> KVM provides.

I guess we have different expectations of what KVM should provide. My
take is that userspace doesn't need a nanny, and that a decent level
of documentation should make it obvious what feature captures which
state.

But we've argued for a while now, and I don't see that we're getting
any closer to a resolution. So let's at least make some forward
progress with the opt-out mechanism you mentioned, and arm64 will buy
into it when snapshotting the ITS.

> 
> > > And ideally such bugs would be detected without relying on userspace
> > > enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
> > > time and was only found when mark_page_dirty_in_slot() started
> > > WARNing.
> > > 
> > > I'm ok if arm64 wants to let userspace shoot itself in the foot with
> > > the ITS, but I'm not ok dropping the protections in the common
> > > mark_page_dirty_in_slot().
> > > 
> > > One somewhat gross idea would be to let architectures override the
> > > "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
> > > in kvm->arch in its kvm_write_guest_lock() to note that an expected
> > > write without a vCPU is in-progress:
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 8c5c69ba47a7..d1da8914f749 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > >         struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > >  
> > >  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > > -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> > > +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
> > > +               return;
> > > +
> > > +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> > >                 return;
> > >  #endif
> > >  
> > > @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > >                 unsigned long rel_gfn = gfn - memslot->base_gfn;
> > >                 u32 slot = (memslot->as_id << 16) | memslot->id;
> > >  
> > > -               if (kvm->dirty_ring_size)
> > > +               if (kvm->dirty_ring_size && vcpu)
> > >                         kvm_dirty_ring_push(&vcpu->dirty_ring,
> > >                                             slot, rel_gfn);
> > > -               else
> > > +               else if (memslot->dirty_bitmap)
> > >                         set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > >         }
> > >  }
> > 
> > I think this is equally wrong. Writes occur from both CPUs and devices
> > *concurrently*, and I don't see why KVM should keep ignoring this
> > pretty obvious fact.
> >
> > Yes, your patch papers over the problem, and it can probably work if
> > the kvm->arch flag only gets set in the ITS saving code, which is
> > already exclusive of vcpus running.
> > 
> > But in the long run, with dirty bits being collected from the IOMMU
> > page tables or directly from devices, we will need a way to reconcile
> > the dirty tracking. The above doesn't quite cut it, unfortunately.
> 
> Oooh, are you referring to IOMMU page tables and devices _in the
> guest_?  E.g. if KVM itself were to emulate a vIOMMU, then KVM would
> be responsible for updating dirty bits in the vIOMMU page tables.

No. I'm talking about the *physical* IOMMU, which is (with the correct
architecture revision and feature set) capable of providing its own
set of dirty bits, on a per-device, per-PTE basis. Once we enable
that, we'll need to be able to sink these bits into the bitmap and
provide a unified view of the dirty state to userspace.

> Not that it really matters, but do we actually expect KVM to ever
> emulate a vIOMMU?  On x86 at least, in-kernel acceleration of vIOMMU
> emulation seems more like VFIO territory.

I don't expect KVM/arm64 to fully emulate an IOMMU, but at least to
eventually provide the required filtering to enable a stage-1 SMMU to
be passed to a guest. This is the sort of thing pKVM needs to
implement for the host anyway, and going the extra mile to support
arbitrary guests outside of the pKVM context isn't much more work.

> Regardless, I don't think the above idea makes it any more difficult
> to support in-KVM emulation of non-CPU stuff, which IIUC is the ITS
> case.  I 100% agree that the above is a hack, but that's largely due
> to the use of kvm_get_running_vcpu().

That I agree.

> A slightly different alternative would be to have a completely separate
> API for writing guest memory without an associated vCPU.  I.e. start
> building up proper device emulation support.  Then the vCPU-based
> APIs could yell if a vCPU isn't provided (or there is no running
> vCPU in the current mess).  And the device-based API could be
> provided if and only if the architecture actually supports emulating
> writes from devices, i.e. x86 would not opt-in and so would not even
> have access to the API.

Which is what I was putting under the "major surgery" label in my
previous email.

Anyhow, for the purpose of unblocking Gavin's series, I suggest
adopting your per-arch opt-out suggestion as a stop-gap measure, and we
will then be able to bike-shed for weeks on what the shape of the
device-originated memory dirtying API should be.
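
To make the stop-gap concrete, here is a minimal userspace-only sketch of the
flag being scoped to the save path. Every name below is invented for
illustration; the real hook name and flag placement are exactly what remains
to be decided:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The architecture permits vCPU-less dirtying only while a known
 * vCPU-less writer (e.g. the ITS table save, which already runs with
 * all vCPUs stopped) is in progress.
 */
struct stopgap_kvm {
	bool allow_write_without_vcpu;	/* models a flag in kvm->arch */
	int bitmap_dirties;		/* pages sunk into the bitmap */
	int warned_drops;		/* writes the sanity check rejected */
};

/* models mark_page_dirty_in_slot() invoked with no running vCPU */
void stopgap_mark_dirty_no_vcpu(struct stopgap_kvm *kvm)
{
	if (!kvm->allow_write_without_vcpu) {
		kvm->warned_drops++;	/* the kernel would WARN here */
		return;
	}
	kvm->bitmap_dirties++;
}

/* models the ITS save path: the flag is set only around the save */
void stopgap_its_save(struct stopgap_kvm *kvm, int npages)
{
	int i;

	kvm->allow_write_without_vcpu = true;
	for (i = 0; i < npages; i++)
		stopgap_mark_dirty_no_vcpu(kvm);
	kvm->allow_write_without_vcpu = false;
}
```

Because the flag is only raised around a section that is already exclusive of
running vCPUs, stray vCPU-less writes elsewhere still trip the sanity check.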

	M.
Sean Christopherson Oct. 27, 2022, 7:09 p.m. UTC | #22
On Thu, Oct 27, 2022, Marc Zyngier wrote:
> On Thu, 27 Oct 2022 18:44:51 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Thu, Oct 27, 2022, Marc Zyngier wrote:
> > > But in the long run, with dirty bits being collected from the IOMMU
> > > page tables or directly from devices, we will need a way to reconcile
> > > the dirty tracking. The above doesn't quite cut it, unfortunately.
> > 
> > Oooh, are you referring to IOMMU page tables and devices _in the
> > guest_?  E.g. if KVM itself were to emulate a vIOMMU, then KVM would
> > be responsible for updating dirty bits in the vIOMMU page tables.
> 
> No. I'm talking about the *physical* IOMMU, which is (with the correct
> architecture revision and feature set) capable of providing its own
> set of dirty bits, on a per-device, per-PTE basis. Once we enable
> that, we'll need to be able to sink these bits into the bitmap and
> provide a unified view of the dirty state to userspace.

Isn't that already handled by VFIO, e.g. via VFIO_IOMMU_DIRTY_PAGES?  There may
be "duplicate" information if a page is dirty in both the IOMMU page tables and
the CPU page tables, but that's ok in that the worst case scenario is that the
VMM performs a redundant unnecessary transfer.

A unified dirty bitmap would potentially reduce the memory footprint needed for
dirty logging, but presumably IOMMU-mapped memory is a small subset of CPU-mapped
memory in most use cases.
Gavin Shan Oct. 28, 2022, 6:43 a.m. UTC | #23
Hi Sean and Marc,

On 10/28/22 2:30 AM, Marc Zyngier wrote:
> On Thu, 27 Oct 2022 18:44:51 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Thu, Oct 27, 2022, Marc Zyngier wrote:
>>> On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:

[...]
  
>>
>>>> And ideally such bugs would be detected without relying on userspace
>>>> enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
>>>> time and was only found when mark_page_dirty_in_slot() started
>>>> WARNing.
>>>>
>>>> I'm ok if arm64 wants to let userspace shoot itself in the foot with
>>>> the ITS, but I'm not ok dropping the protections in the common
>>>> mark_page_dirty_in_slot().
>>>>
>>>> One somewhat gross idea would be to let architectures override the
>>>> "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
>>>> in kvm->arch in its kvm_write_guest_lock() to note that an expected
>>>> write without a vCPU is in-progress:
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 8c5c69ba47a7..d1da8914f749 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>>>          struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>>>>   
>>>>   #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>>>> -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>>>> +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
>>>> +               return;
>>>> +
>>>> +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>>>>                  return;
>>>>   #endif
>>>>   
>>>> @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>>>                  unsigned long rel_gfn = gfn - memslot->base_gfn;
>>>>                  u32 slot = (memslot->as_id << 16) | memslot->id;
>>>>   
>>>> -               if (kvm->dirty_ring_size)
>>>> +               if (kvm->dirty_ring_size && vcpu)
>>>>                          kvm_dirty_ring_push(&vcpu->dirty_ring,
>>>>                                              slot, rel_gfn);
>>>> -               else
>>>> +               else if (memslot->dirty_bitmap)
>>>>                          set_bit_le(rel_gfn, memslot->dirty_bitmap);
>>>>          }
>>>>   }
>>>
>>> I think this is equally wrong. Writes occur from both CPUs and devices
>>> *concurrently*, and I don't see why KVM should keep ignoring this
>>> pretty obvious fact.
>>>
>>> Yes, your patch papers over the problem, and it can probably work if
>>> the kvm->arch flag only gets set in the ITS saving code, which is
>>> already exclusive of vcpus running.
>>>
>>> But in the long run, with dirty bits being collected from the IOMMU
>>> page tables or directly from devices, we will need a way to reconcile
>>> the dirty tracking. The above doesn't quite cut it, unfortunately.
>>
>> Oooh, are you referring to IOMMU page tables and devices _in the
>> guest_?  E.g. if KVM itself were to emulate a vIOMMU, then KVM would
>> be responsible for updating dirty bits in the vIOMMU page tables.
> 
> No. I'm talking about the *physical* IOMMU, which is (with the correct
> architecture revision and feature set) capable of providing its own
> set of dirty bits, on a per-device, per-PTE basis. Once we enable
> that, we'll need to be able to sink these bits into the bitmap and
> provide a unified view of the dirty state to userspace.
> 
>> Not that it really matters, but do we actually expect KVM to ever
>> emulate a vIOMMU?  On x86 at least, in-kernel acceleration of vIOMMU
>> emulation seems more like VFIO territory.
> 
> I don't expect KVM/arm64 to fully emulate an IOMMU, but at least to
> eventually provide the required filtering to enable a stage-1 SMMU to
> be passed to a guest. This is the sort of things pKVM needs to
> implement for the host anyway, and going the extra mile to support
> arbitrary guests outside of the pKVM context isn't much more work.
> 
>> Regardless, I don't think the above idea makes it any more difficult
>> to support in-KVM emulation of non-CPU stuff, which IIUC is the ITS
>> case.  I 100% agree that the above is a hack, but that's largely due
>> to the use of kvm_get_running_vcpu().
> 
> That I agree.
> 
>> A slightly different alternative would be to have a completely separate
>> API for writing guest memory without an associated vCPU.  I.e. start
>> building up proper device emulation support.  Then the vCPU-based
>> APIs could yell if a vCPU isn't provided (or there is no running
>> vCPU in the current mess).  And the device-based API could be
>> provided if and only if the architecture actually supports emulating
>> writes from devices, i.e. x86 would not opt-in and so would not even
>> have access to the API.
> 
> Which is what I was putting under the "major surgery" label in my
> previous email.
> 
> Anyhow, for the purpose of unblocking Gavin's series, I suggest to
> adopt your per-arch opt-out suggestion as a stop gap measure, and we
> will then be able to bike-shed for weeks on what the shape of the
> device-originated memory dirtying API should be.
> 

It's really 'major surgery' and I would like to make sure I fully understand
'a completely separate API for writing guest memory without an associated vCPU'
before I start working on v7 of this series.

There are 7 functions and 2 macros involved, listed below. I assume Sean is suggesting
we add another argument, perhaps named 'has_vcpu', to these functions and macros?
Sean, could you please confirm?

If I'm understanding correctly, a 'has_vcpu' argument will be added to these
functions and macros. Except for the call sites in vgic/its, 'has_vcpu' is set to
'true' when passed to these functions, meaning we need to pass 'false' for the
argument at the vgic/its call sites. Please correct me if I'm wrong.

   int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
                            int offset, int len);
   int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
                       unsigned long len);
   int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                              void *data, unsigned long len);
   int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                                     void *data, unsigned int offset,
                                     unsigned long len);

   void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
   void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
   void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);

   #define __kvm_put_guest(kvm, gfn, offset, v)
   #define kvm_put_guest(kvm, gpa, v)
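
For comparison, the "completely separate API" direction could be pictured with
a toy sketch like the one below, rather than threading a flag through every
writer above. Everything here is invented, userspace-only illustration; the
real API shape is precisely what is under discussion:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_result { SK_RING, SK_BITMAP, SK_EINVAL };

struct sk_kvm {
	bool arch_has_device_writes;	/* e.g. arm64 yes, x86 no */
	bool slot_has_bitmap;
};

struct sk_vcpu { struct sk_kvm *kvm; };

/* vCPU-based write: dirty info goes to the caller's per-vCPU ring */
enum sk_result sk_vcpu_write_guest(struct sk_vcpu *vcpu)
{
	if (!vcpu)
		return SK_EINVAL;	/* this API yells without a vCPU */
	return SK_RING;
}

/* device-based write: only offered where the arch opted in */
enum sk_result sk_device_write_guest(struct sk_kvm *kvm)
{
	if (!kvm->arch_has_device_writes)
		return SK_EINVAL;	/* x86 would not even expose this */
	if (!kvm->slot_has_bitmap)
		return SK_EINVAL;	/* vCPU-less dirtying needs the bitmap */
	return SK_BITMAP;
}
```

With a split like this, the vgic/its code would call the device-based entry
point and everything else would stay on the vCPU-based one, with no extra
argument at existing call sites.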
   
Thanks,
Gavin
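[Editorial note: one point worth keeping in mind while weighing the above is that the common code already derives the vCPU from the calling context via kvm_get_running_vcpu(), so helpers like mark_page_dirty() need not grow a new argument to know whether a vCPU is involved. A toy, self-contained model of that context-detection pattern; every name here is an illustrative stand-in for the kernel's per-CPU machinery, not real KVM code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for struct kvm / struct kvm_vcpu; not the real kernel types. */
struct kvm { int id; };
struct kvm_vcpu { struct kvm *kvm; };

/*
 * The kernel tracks the vCPU currently inside KVM_RUN in a per-CPU
 * variable; a thread-local is the closest userspace analogue.
 */
static _Thread_local struct kvm_vcpu *running_vcpu;

static struct kvm_vcpu *kvm_get_running_vcpu(void)
{
	return running_vcpu;
}

/* Modeled after vcpu_load()/vcpu_put() in the real code. */
static void vcpu_load(struct kvm_vcpu *vcpu) { running_vcpu = vcpu; }
static void vcpu_put(void)                   { running_vcpu = NULL; }

/*
 * A guest-memory write helper can detect vCPU context on its own, with
 * no 'has_vcpu' parameter rippling through every caller.
 */
static bool write_came_from_vcpu(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	return vcpu && vcpu->kvm == kvm;
}
```

[The vgic/its save path runs with no vCPU loaded, so in this model it naturally hits the "no running vCPU" case without any call-site changes.]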
Sean Christopherson Oct. 28, 2022, 4:51 p.m. UTC | #24
On Fri, Oct 28, 2022, Gavin Shan wrote:
> Hi Sean and Marc,
> 
> On 10/28/22 2:30 AM, Marc Zyngier wrote:
> > On Thu, 27 Oct 2022 18:44:51 +0100,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Thu, Oct 27, 2022, Marc Zyngier wrote:
> > > > On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:
> 
> [...]
> > > 
> > > > > And ideally such bugs would detected without relying on userspace to
> > > > > enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
> > > > > time and was only found when mark_page_dirty_in_slot() started
> > > > > WARNing.
> > > > > 
> > > > > I'm ok if arm64 wants to let userspace shoot itself in the foot with
> > > > > the ITS, but I'm not ok dropping the protections in the common
> > > > > mark_page_dirty_in_slot().
> > > > > 
> > > > > One somewhat gross idea would be to let architectures override the
> > > > > "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
> > > > > in kvm->arch in its kvm_write_guest_lock() to note that an expected
> > > > > write without a vCPU is in-progress:
> > > > > 
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index 8c5c69ba47a7..d1da8914f749 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > > > >          struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > > > >   #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > > > > -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> > > > > +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
> > > > > +               return;
> > > > > +
> > > > > +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> > > > >                  return;
> > > > >   #endif
> > > > > @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > > > >                  unsigned long rel_gfn = gfn - memslot->base_gfn;
> > > > >                  u32 slot = (memslot->as_id << 16) | memslot->id;
> > > > > -               if (kvm->dirty_ring_size)
> > > > > +               if (kvm->dirty_ring_size && vcpu)
> > > > >                          kvm_dirty_ring_push(&vcpu->dirty_ring,
> > > > >                                              slot, rel_gfn);
> > > > > -               else
> > > > > +               else if (memslot->dirty_bitmap)
> > > > >                          set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > > > >          }
> > > > >   }

...

> > > A slightly different alternative would be have a completely separate
> > > API for writing guest memory without an associated vCPU.  I.e. start
> > > building up proper device emulation support.  Then the vCPU-based
> > > APIs could yell if a vCPU isn't provided (or there is no running
> > > vCPU in the current mess).  And the deviced-based API could be
> > > provided if and only if the architecture actually supports emulating
> > > writes from devices, i.e. x86 would not opt-in and so would not even
> > > have access to the API.
> > 
> > Which is what I was putting under the "major surgery" label in my
> > previous email.
> > 
> > Anyhow, for the purpose of unblocking Gavin's series, I suggest to
> > adopt your per-arch opt-out suggestion as a stop gap measure, and we
> > will then be able to bike-shed for weeks on what the shape of the
> > device-originated memory dirtying API should be.
> > 
> 
> It's really a 'major surgery' and I would like to make sure I fully understand
> 'a completely separate API for writing guest memory without an associated vCPU',
> before I start working on v7 for this.
>
> There are 7 functions and 2 macros involved as below. I assume Sean is suggesting
> to add another argument, whose name can be 'has_vcpu', for these functions and macros?

No.

As March suggested, for your series just implement the hacky arch opt-out, don't
try and do surgery at this time as that's likely going to be a months-long effort
that touches a lot of cross-arch code.

E.g. I believe the ARM opt-out (opt-in?) for the above hack would be

bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	return vgic_has_its(kvm);
}
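[Editorial note: the opt-out hack Sean sketches above can be modeled in self-contained form to make the resulting routing explicit. Simplified types, with a boolean field standing in for the per-arch hook; a sketch, not kernel code:]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures. */
struct kvm {
	unsigned int dirty_ring_size;
	bool allow_write_without_running_vcpu;	/* models the arch hook */
};

enum dirty_sink { DROPPED, PUSHED_TO_RING, SET_IN_BITMAP };

/* Mirrors the routing in the hacked mark_page_dirty_in_slot() above. */
static enum dirty_sink mark_page_dirty(struct kvm *kvm, bool have_vcpu,
				       bool have_bitmap)
{
	if (!kvm->allow_write_without_running_vcpu && !have_vcpu)
		return DROPPED;			/* the WARN_ON_ONCE(!vcpu) path */

	if (kvm->dirty_ring_size && have_vcpu)
		return PUSHED_TO_RING;		/* kvm_dirty_ring_push() */
	if (have_bitmap)
		return SET_IN_BITMAP;		/* set_bit_le() */
	return DROPPED;
}
```

[With this shape, x86 keeps the existing "writes require a running vCPU" sanity check, while an arch that opts out lets vCPU-less writes fall through to the bitmap.]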
Gavin Shan Oct. 31, 2022, 3:37 a.m. UTC | #25
Hi Sean,

On 10/29/22 12:51 AM, Sean Christopherson wrote:
> On Fri, Oct 28, 2022, Gavin Shan wrote:
>> On 10/28/22 2:30 AM, Marc Zyngier wrote:
>>> On Thu, 27 Oct 2022 18:44:51 +0100,
>>> Sean Christopherson <seanjc@google.com> wrote:
>>>>
>>>> On Thu, Oct 27, 2022, Marc Zyngier wrote:
>>>>> On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:
>>
>> [...]
>>>>
>>>>>> And ideally such bugs would be detected without relying on userspace
>>>>>> enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
>>>>>> time and was only found when mark_page_dirty_in_slot() started
>>>>>> WARNing.
>>>>>>
>>>>>> I'm ok if arm64 wants to let userspace shoot itself in the foot with
>>>>>> the ITS, but I'm not ok dropping the protections in the common
>>>>>> mark_page_dirty_in_slot().
>>>>>>
>>>>>> One somewhat gross idea would be to let architectures override the
>>>>>> "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
>>>>>> in kvm->arch in its kvm_write_guest_lock() to note that an expected
>>>>>> write without a vCPU is in-progress:
>>>>>>
>>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>>> index 8c5c69ba47a7..d1da8914f749 100644
>>>>>> --- a/virt/kvm/kvm_main.c
>>>>>> +++ b/virt/kvm/kvm_main.c
>>>>>> @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>>>>>           struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>>>>>>    #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>>>>>> -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>>>>>> +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
>>>>>> +               return;
>>>>>> +
>>>>>> +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>>>>>>                   return;
>>>>>>    #endif
>>>>>> @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>>>>>                   unsigned long rel_gfn = gfn - memslot->base_gfn;
>>>>>>                   u32 slot = (memslot->as_id << 16) | memslot->id;
>>>>>> -               if (kvm->dirty_ring_size)
>>>>>> +               if (kvm->dirty_ring_size && vcpu)
>>>>>>                           kvm_dirty_ring_push(&vcpu->dirty_ring,
>>>>>>                                               slot, rel_gfn);
>>>>>> -               else
>>>>>> +               else if (memslot->dirty_bitmap)
>>>>>>                           set_bit_le(rel_gfn, memslot->dirty_bitmap);
>>>>>>           }
>>>>>>    }
> 
> ...
> 
>>>> A slightly different alternative would be have a completely separate
>>>> API for writing guest memory without an associated vCPU.  I.e. start
>>>> building up proper device emulation support.  Then the vCPU-based
>>>> APIs could yell if a vCPU isn't provided (or there is no running
>>>> vCPU in the current mess).  And the deviced-based API could be
>>>> provided if and only if the architecture actually supports emulating
>>>> writes from devices, i.e. x86 would not opt-in and so would not even
>>>> have access to the API.
>>>
>>> Which is what I was putting under the "major surgery" label in my
>>> previous email.
>>>
>>> Anyhow, for the purpose of unblocking Gavin's series, I suggest to
>>> adopt your per-arch opt-out suggestion as a stop gap measure, and we
>>> will then be able to bike-shed for weeks on what the shape of the
>>> device-originated memory dirtying API should be.
>>>
>>
>> It's really a 'major surgery' and I would like to make sure I fully understand
>> 'a completely separate API for writing guest memory without an associated vCPU',
>> before I start working on v7 for this.
>>
>> There are 7 functions and 2 macros involved as below. I assume Sean is suggesting
>> to add another argument, whose name can be 'has_vcpu', for these functions and macros?
> 
> No.
> 
> As March suggested, for your series just implement the hacky arch opt-out, don't
> try and do surgery at this time as that's likely going to be a months-long effort
> that touches a lot of cross-arch code.
> 
> E.g. I believe the ARM opt-out (opt-in?) for the above hack would be
> 
> bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> {
> 	return vgic_has_its(kvm);
> }
> 

Ok, Thanks for your confirm. v7 was just posted to address comments from Marc,
Peter, Oliver and you. Please help to review when you get a chance.

https://lore.kernel.org/kvmarm/20221031003621.164306-1-gshan@redhat.com/T/#t

Thanks,
Gavin
Marc Zyngier Oct. 31, 2022, 9:08 a.m. UTC | #26
On 2022-10-28 17:51, Sean Christopherson wrote:
> On Fri, Oct 28, 2022, Gavin Shan wrote:
>> Hi Sean and Marc,
>> 
>> On 10/28/22 2:30 AM, Marc Zyngier wrote:
>> > On Thu, 27 Oct 2022 18:44:51 +0100,
>> > Sean Christopherson <seanjc@google.com> wrote:
>> > >
>> > > On Thu, Oct 27, 2022, Marc Zyngier wrote:
>> > > > On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:
>> 
>> [...]
>> > >
>> > > > > And ideally such bugs would be detected without relying on userspace
>> > > > > enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
>> > > > > time and was only found when mark_page_dirty_in_slot() started
>> > > > > WARNing.
>> > > > >
>> > > > > I'm ok if arm64 wants to let userspace shoot itself in the foot with
>> > > > > the ITS, but I'm not ok dropping the protections in the common
>> > > > > mark_page_dirty_in_slot().
>> > > > >
>> > > > > One somewhat gross idea would be to let architectures override the
>> > > > > "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
>> > > > > in kvm->arch in its kvm_write_guest_lock() to note that an expected
>> > > > > write without a vCPU is in-progress:
>> > > > >
>> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> > > > > index 8c5c69ba47a7..d1da8914f749 100644
>> > > > > --- a/virt/kvm/kvm_main.c
>> > > > > +++ b/virt/kvm/kvm_main.c
>> > > > > @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>> > > > >          struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>> > > > >   #ifdef CONFIG_HAVE_KVM_DIRTY_RING
>> > > > > -       if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
>> > > > > +       if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
>> > > > > +               return;
>> > > > > +
>> > > > > +       if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
>> > > > >                  return;
>> > > > >   #endif
>> > > > > @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>> > > > >                  unsigned long rel_gfn = gfn - memslot->base_gfn;
>> > > > >                  u32 slot = (memslot->as_id << 16) | memslot->id;
>> > > > > -               if (kvm->dirty_ring_size)
>> > > > > +               if (kvm->dirty_ring_size && vcpu)
>> > > > >                          kvm_dirty_ring_push(&vcpu->dirty_ring,
>> > > > >                                              slot, rel_gfn);
>> > > > > -               else
>> > > > > +               else if (memslot->dirty_bitmap)
>> > > > >                          set_bit_le(rel_gfn, memslot->dirty_bitmap);
>> > > > >          }
>> > > > >   }
> 
> ...
> 
>> > > A slightly different alternative would be have a completely separate
>> > > API for writing guest memory without an associated vCPU.  I.e. start
>> > > building up proper device emulation support.  Then the vCPU-based
>> > > APIs could yell if a vCPU isn't provided (or there is no running
>> > > vCPU in the current mess).  And the deviced-based API could be
>> > > provided if and only if the architecture actually supports emulating
>> > > writes from devices, i.e. x86 would not opt-in and so would not even
>> > > have access to the API.
>> >
>> > Which is what I was putting under the "major surgery" label in my
>> > previous email.
>> >
>> > Anyhow, for the purpose of unblocking Gavin's series, I suggest to
>> > adopt your per-arch opt-out suggestion as a stop gap measure, and we
>> > will then be able to bike-shed for weeks on what the shape of the
>> > device-originated memory dirtying API should be.
>> >
>> 
>> It's really a 'major surgery' and I would like to make sure I fully
>> understand
>> 'a completely separate API for writing guest memory without an
>> associated vCPU',
>> before I start working on v7 for this.
>> 
>> There are 7 functions and 2 macros involved as below. I assume Sean is 
>> suggesting
>> to add another argument, whose name can be 'has_vcpu', for these 
>> functions and macros?
> 
> No.
> 
> As March suggested, for your series just implement the hacky arch 
> opt-out, don't

Please call me April.

> try and do surgery at this time as that's likely going to be a
> months-long effort
> that touches a lot of cross-arch code.
> 
> E.g. I believe the ARM opt-out (opt-in?) for the above hack would be
> 
> bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> {
> 	return vgic_has_its(kvm);
> }

Although that will probably lead to the expected effect,
this helper should only return true when the ITS is actively
dumped.

         M.
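[Editorial note: Marc's refinement presumably means bracketing the ITS table save with a per-VM flag instead of keying off vgic_has_its() alone. A minimal sketch, assuming a hypothetical 'saving_its_tables' flag; the real field name and plumbing were still under discussion at this point:]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VM state; not the real struct kvm. */
struct kvm {
	bool saving_its_tables;
};

static bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	/* True only while the ITS tables are actively being dumped. */
	return kvm->saving_its_tables;
}

/*
 * Sketch of bracketing the save path; save_its_tables() stands in for
 * the real ITS table walker that writes guest memory via
 * kvm_write_guest_lock().
 */
static void save_its_tables(struct kvm *kvm)
{
	kvm->saving_its_tables = true;
	/* ... walk and write ITS tables, dirtying the bitmap ... */
	kvm->saving_its_tables = false;
}
```

[Outside the save window the hook returns false, so the common WARN still catches any unexpected vCPU-less write.]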
Gavin Shan Oct. 31, 2022, 10:48 p.m. UTC | #27
On 10/31/22 5:08 PM, Marc Zyngier wrote:
> On 2022-10-28 17:51, Sean Christopherson wrote:
>> On Fri, Oct 28, 2022, Gavin Shan wrote:
>>> On 10/28/22 2:30 AM, Marc Zyngier wrote:
>>> > On Thu, 27 Oct 2022 18:44:51 +0100,
>>> > > On Thu, Oct 27, 2022, Marc Zyngier wrote:
>>> > > > On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@google.com> wrote:

[...]

>>>
>>> It's really a 'major surgery' and I would like to make sure I fully understand
>>> 'a completely separate API for writing guest memory without an associated vCPU',
>>> before I start working on v7 for this.
>>>
>>> There are 7 functions and 2 macros involved as below. I assume Sean is suggesting
>>> to add another argument, whose name can be 'has_vcpu', for these functions and macros?
>>
>> No.
>>
>> As March suggested, for your series just implement the hacky arch opt-out, don't
> 
> Please call me April.
> 
>> try and do surgery at this time as that's likely going to be a
>> months-long effort
>> that touches a lot of cross-arch code.
>>
>> E.g. I believe the ARM opt-out (opt-in?) for the above hack would be
>>
>> bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
>> {
>>     return vgic_has_its(kvm);
>> }
> 
> Although that will probably lead to the expected effect,
> this helper should only return true when the ITS is actively
> dumped.
> 

Thanks, Marc. It makes sense to return true only when vgic/its tables
are being saved. Let's have more discussion in PATCH[v7 5/9] since Oliver
has other concerns there :)

Thanks,
Gavin
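[Editorial note: for completeness, the ordering the patch below enforces, namely that userspace must enable KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP before the ring capability itself, can be modeled like this. Simplified types; the real kvm_vm_ioctl_enable_dirty_log_ring() also validates the ring size against the page size and KVM_DIRTY_RING_MAX_ENTRIES:]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant struct kvm fields. */
struct kvm {
	unsigned int dirty_ring_size;
	bool dirty_ring_with_bitmap;
};

/*
 * Models kvm_vm_ioctl_enable_dirty_log_ring() on an arch that selects
 * HAVE_KVM_DIRTY_RING_WITH_BITMAP: the ring cannot be enabled until the
 * bitmap capability has been, as the VM is likely DOA without the pages
 * marked in the bitmap.
 */
static int enable_dirty_log_ring(struct kvm *kvm, unsigned int size)
{
	if (!kvm->dirty_ring_with_bitmap)
		return -EINVAL;		/* bitmap capability must come first */
	if (!size || (size & (size - 1)))
		return -EINVAL;		/* simplified size sanity check */
	kvm->dirty_ring_size = size;
	return 0;
}

/* Models enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. */
static int enable_ring_with_bitmap(struct kvm *kvm)
{
	kvm->dirty_ring_with_bitmap = true;
	return 0;
}
```

[Enabling the capabilities in the wrong order fails, which is how the series avoids silently losing the vgic/its dirty pages.]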

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 32427ea160df..09fa6c491c1b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8019,8 +8019,8 @@  guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
 (0x40000001). Otherwise, a guest may use the paravirtual features
 regardless of what has actually been exposed through the CPUID leaf.
 
-8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
-----------------------------------------------------------
+8.29 KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL, RING_WITH_BITMAP}
+-------------------------------------------------------------
 
 :Architectures: x86
 :Parameters: args[0] - size of the dirty log ring
@@ -8104,13 +8104,6 @@  flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
 needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
@@ -8119,6 +8112,12 @@  Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
 
+NOTE: In some cases there is no running vCPU, and thus no available vCPU
+dirty ring, when pages become dirty; one example is saving arm64's vgic/its
+tables during migration. The dirty bitmap is still used to track those
+dirty pages, as indicated by KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP. The
+dirty bitmap is accessed via KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.
+
 8.30 KVM_CAP_XEN_HVM
 --------------------
 
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index fe5982b46424..23b2b466aa0f 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -28,6 +28,11 @@  struct kvm_dirty_ring {
 };
 
 #ifndef CONFIG_HAVE_KVM_DIRTY_RING
+static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm)
+{
+	return false;
+}
+
 /*
  * If CONFIG_HAVE_HVM_DIRTY_RING not defined, kvm_dirty_ring.o should
  * not be included as well, so define these nop functions for the arch.
@@ -66,6 +71,7 @@  static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
 
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
+bool kvm_dirty_ring_exclusive(struct kvm *kvm);
 int kvm_cpu_dirty_log_size(void);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 53fa3134fee0..a3fae111f25c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -780,6 +780,7 @@  struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..c87b5882d7ae 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..228be1145cf3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -33,6 +33,14 @@  config HAVE_KVM_DIRTY_RING_ACQ_REL
        bool
        select HAVE_KVM_DIRTY_RING
 
+# Only architectures that need to dirty memory outside of a vCPU
+# context should select this, advertising to userspace the
+# requirement to use a dirty bitmap in addition to the vCPU dirty
+# ring.
+config HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	bool
+	depends on HAVE_KVM_DIRTY_RING
+
 config HAVE_KVM_EVENTFD
        bool
        select EVENTFD
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f68d75026bc0..9cc60af291ef 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -11,6 +11,11 @@ 
 #include <trace/events/kvm.h>
 #include "kvm_mm.h"
 
+bool kvm_dirty_ring_exclusive(struct kvm *kvm)
+{
+	return kvm->dirty_ring_size && !kvm->dirty_ring_with_bitmap;
+}
+
 int __weak kvm_cpu_dirty_log_size(void)
 {
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b064dbadaf4..8915dcefcefd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1617,7 +1617,7 @@  static int kvm_prepare_memory_region(struct kvm *kvm,
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else if (!kvm_dirty_ring_exclusive(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
@@ -2060,8 +2060,8 @@  int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (kvm_dirty_ring_exclusive(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
@@ -2125,8 +2125,8 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (kvm_dirty_ring_exclusive(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -2237,8 +2237,8 @@  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (kvm_dirty_ring_exclusive(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
@@ -3305,15 +3305,20 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
 		return;
+
+#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	if (WARN_ON_ONCE(!vcpu))
+		return;
+#endif
 #endif
 
 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (vcpu && kvm->dirty_ring_size)
 			kvm_dirty_ring_push(&vcpu->dirty_ring,
 					    slot, rel_gfn);
 		else
@@ -4485,6 +4490,9 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
 		return 0;
+#endif
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
 #endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
@@ -4499,6 +4507,11 @@  static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
 {
 	int r;
 
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	if (!kvm->dirty_ring_with_bitmap)
+		return -EINVAL;
+#endif
+
 	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
 		return -EINVAL;
 
@@ -4588,6 +4601,9 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 	case KVM_CAP_DIRTY_LOG_RING:
 	case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
+		kvm->dirty_ring_with_bitmap = true;
+		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}