
[v1,1/5] KVM: arm64: Enable ring-based dirty memory tracking

Message ID 20220819005601.198436-2-gshan@redhat.com (mailing list archive)
State New
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Aug. 19, 2022, 12:55 a.m. UTC
The ring-based dirty memory tracking has been available and enabled
on x86 for a while. The feature is beneficial when the number of
dirty pages is small in a checkpointing system or live migration
scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
Implement ring-based dirty memory tracking").

This enables the ring-based dirty memory tracking on ARM64. It's
notable that no extra reserved ring entries are needed on ARM64
because the huge pages are always split into base pages when page
dirty tracking is enabled.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst    | 2 +-
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 arch/arm64/kvm/Kconfig            | 1 +
 arch/arm64/kvm/arm.c              | 8 ++++++++
 4 files changed, 11 insertions(+), 1 deletion(-)

Comments

Marc Zyngier Aug. 19, 2022, 8 a.m. UTC | #1
On Fri, 19 Aug 2022 01:55:57 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> The ring-based dirty memory tracking has been available and enabled
> on x86 for a while. The feature is beneficial when the number of
> dirty pages is small in a checkpointing system or live migration
> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> Implement ring-based dirty memory tracking").
> 
> This enables the ring-based dirty memory tracking on ARM64. It's
> notable that no extra reserved ring entries are needed on ARM64
> because the huge pages are always split into base pages when page
> dirty tracking is enabled.

Can you please elaborate on this? Adding a per-CPU ring of course
results in extra memory allocation, so there must be a subtle
x86-specific detail that I'm not aware of...

> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst    | 2 +-
>  arch/arm64/include/uapi/asm/kvm.h | 1 +
>  arch/arm64/kvm/Kconfig            | 1 +
>  arch/arm64/kvm/arm.c              | 8 ++++++++
>  4 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..19fa1ac017ed 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>  8.29 KVM_CAP_DIRTY_LOG_RING
>  ---------------------------
>  
> -:Architectures: x86
> +:Architectures: x86, arm64
>  :Parameters: args[0] - size of the dirty log ring
>  
>  KVM is capable of tracking dirty memory using ring buffers that are
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 3bb134355874..7e04b0b8d2b2 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -43,6 +43,7 @@
>  #define __KVM_HAVE_VCPU_EVENTS
>  
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64

For context, the documentation says:

<quote>
- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
</quote>

What is the reason for picking this particular value?


>  
>  #define KVM_REG_SIZE(id)						\
>  	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 815cc118c675..0309b2d0f2da 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -32,6 +32,7 @@ menuconfig KVM
>  	select KVM_VFIO
>  	select HAVE_KVM_EVENTFD
>  	select HAVE_KVM_IRQFD
> +	select HAVE_KVM_DIRTY_RING
>  	select HAVE_KVM_MSI
>  	select HAVE_KVM_IRQCHIP
>  	select HAVE_KVM_IRQ_ROUTING
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..3de6b9b39db7 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		if (!ret)
>  			ret = 1;
>  
> +		/* Force vcpu exit if its dirty ring is soft-full */
> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +			trace_kvm_dirty_ring_exit(vcpu);
> +			ret = 0;
> +		}
> +

Why can't this be moved to kvm_vcpu_exit_request() instead? I would
also very much like the check to be made a common helper with x86.

A seemingly better approach would be to make this a request on dirty log
insertion, and avoid the whole "check the log size" on every run,
which adds pointless overhead to unsuspecting users (aka everyone).

Thanks,

	M.
Gavin Shan Aug. 22, 2022, 1:58 a.m. UTC | #2
Hi Marc,

On 8/19/22 6:00 PM, Marc Zyngier wrote:
> On Fri, 19 Aug 2022 01:55:57 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> The ring-based dirty memory tracking has been available and enabled
>> on x86 for a while. The feature is beneficial when the number of
>> dirty pages is small in a checkpointing system or live migration
>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>> Implement ring-based dirty memory tracking").
>>
>> This enables the ring-based dirty memory tracking on ARM64. It's
>> notable that no extra reserved ring entries are needed on ARM64
>> because the huge pages are always split into base pages when page
>> dirty tracking is enabled.
> 
> Can you please elaborate on this? Adding a per-CPU ring of course
> results in extra memory allocation, so there must be a subtle
> x86-specific detail that I'm not aware of...
> 

Sure. I guess it's helpful to explain how it works in the next revision.
Something like below:

This enables the ring-based dirty memory tracking on ARM64. The feature
is enabled by CONFIG_HAVE_KVM_DIRTY_RING, and detected and enabled by
userspace through KVM_CAP_DIRTY_LOG_RING. A ring buffer is created on
every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The
ring buffer is pushed to by the host when a page becomes dirty and
pulled from by userspace. A vcpu exit is forced when the ring buffer
becomes full. The ring buffers on all vcpus can be reset by the ioctl
command KVM_RESET_DIRTY_RINGS.

Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
However, it avoids synchronization among multiple vcpus when dirty
pages are produced on multiple vcpus. More discussion can be found in [1]

[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
(comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
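
For illustration, here is a minimal sketch of the userspace side of that
protocol. It's not part of this series; it only assumes the kvm_dirty_gfn
layout and the KVM_DIRTY_GFN_F_* flags documented for KVM_CAP_DIRTY_LOG_RING.
The collect() callback and fetch_index are made-up stand-ins for whatever the
VMM does with a dirty page and for its per-vcpu harvest position:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* One harvesting pass over a single vcpu's mmap'ed dirty ring. */
    static int harvest_dirty_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                                  uint32_t nr_entries, uint32_t *fetch_index,
                                  void (*collect)(uint32_t slot, uint64_t offset))
    {
            int collected = 0;

            for (;;) {
                    struct kvm_dirty_gfn *e = &ring[*fetch_index % nr_entries];

                    /* Stop at the first entry KVM hasn't published yet */
                    if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
                            break;

                    /* (slot, offset) identifies the dirty page */
                    collect(e->slot, e->offset);

                    /* Hand the entry back so KVM can recycle it */
                    e->flags = KVM_DIRTY_GFN_F_RESET;
                    (*fetch_index)++;
                    collected++;
            }

            /* Ask KVM to recycle the harvested entries on all vcpu rings */
            if (collected)
                    ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);

            return collected;
    }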


>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   Documentation/virt/kvm/api.rst    | 2 +-
>>   arch/arm64/include/uapi/asm/kvm.h | 1 +
>>   arch/arm64/kvm/Kconfig            | 1 +
>>   arch/arm64/kvm/arm.c              | 8 ++++++++
>>   4 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index abd7c32126ce..19fa1ac017ed 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>>   8.29 KVM_CAP_DIRTY_LOG_RING
>>   ---------------------------
>>   
>> -:Architectures: x86
>> +:Architectures: x86, arm64
>>   :Parameters: args[0] - size of the dirty log ring
>>   
>>   KVM is capable of tracking dirty memory using ring buffers that are
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 3bb134355874..7e04b0b8d2b2 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -43,6 +43,7 @@
>>   #define __KVM_HAVE_VCPU_EVENTS
>>   
>>   #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> 
> For context, the documentation says:
> 
> <quote>
> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> </quote>
> 
> What is the reason for picking this particular value?
> 

It's inherited from x86. I don't think it has to be this particular value.
The value is used to distinguish the owners of the vcpu regions, like kvm_run,
KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
KVM_DIRTY_LOG_PAGE_OFFSET.

How about having 2 for KVM_DIRTY_LOG_PAGE_OFFSET in the next revision?
Then again, the virtual area is cheap, so it would also be nice to follow
x86's pattern and keep 64 for KVM_DIRTY_LOG_PAGE_OFFSET.

     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
     #define KVM_DIRTY_LOG_PAGE_OFFSET        2
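
Either way, the value only matters as the mmap position inside the vcpu fd.
A sketch of how userspace would consume it (map_dirty_ring() is hypothetical;
it just assumes the uapi header exports KVM_DIRTY_LOG_PAGE_OFFSET, whatever
value ends up being picked):

    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    /*
     * kvm_run sits at page offset 0, the coalesced MMIO page at offset 1,
     * and the dirty ring at KVM_DIRTY_LOG_PAGE_OFFSET pages into the vcpu fd.
     */
    static struct kvm_dirty_gfn *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
    {
            long page_size = sysconf(_SC_PAGESIZE);

            return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                        vcpu_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
    }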

>>   
>>   #define KVM_REG_SIZE(id)						\
>>   	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 815cc118c675..0309b2d0f2da 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -32,6 +32,7 @@ menuconfig KVM
>>   	select KVM_VFIO
>>   	select HAVE_KVM_EVENTFD
>>   	select HAVE_KVM_IRQFD
>> +	select HAVE_KVM_DIRTY_RING
>>   	select HAVE_KVM_MSI
>>   	select HAVE_KVM_IRQCHIP
>>   	select HAVE_KVM_IRQ_ROUTING
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 986cee6fbc7f..3de6b9b39db7 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>   		if (!ret)
>>   			ret = 1;
>>   
>> +		/* Force vcpu exit if its dirty ring is soft-full */
>> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
>> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>> +			trace_kvm_dirty_ring_exit(vcpu);
>> +			ret = 0;
>> +		}
>> +
> 
> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> also very much like the check to be made a common helper with x86.
> 
> A seemingly approach would be to make this a request on dirty log
> insertion, and avoid the whole "check the log size" on every run,
> which adds pointless overhead to unsuspecting users (aka everyone).
> 

I thought of having the check in kvm_vcpu_exit_request(). The various
exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
highest priority and ARM64 just follows that. I don't think it really
matters. I will improve it accordingly in the next revision:

- Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c

   bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
   {
        struct kvm *kvm = vcpu->kvm;
        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

        if (unlikely(kvm->dirty_ring_size &&
                     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
            trace_kvm_dirty_ring_exit(vcpu);
            return true;
        }

        return false;
   }

- Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().

Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
Otherwise, the dirty log in the ring buffer would be overwritten. I'm not
sure if there is anything else I missed?
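
For completeness, the VMM side of that contract would look roughly like the
fragment below (a sketch only; vcpu_thread() is hypothetical and it reuses
the harvest helper sketched earlier in the thread):

    /* Run loop fragment: treat KVM_EXIT_DIRTY_RING_FULL as "collect now". */
    static void vcpu_thread(int vm_fd, int vcpu_fd, struct kvm_run *run,
                            struct kvm_dirty_gfn *ring, uint32_t nr_entries,
                            void (*collect)(uint32_t slot, uint64_t offset))
    {
            uint32_t fetch_index = 0;

            for (;;) {
                    ioctl(vcpu_fd, KVM_RUN, 0);

                    if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
                            /* Drain and reset before re-entering the guest */
                            harvest_dirty_ring(vm_fd, ring, nr_entries,
                                               &fetch_index, collect);
                            continue;
                    }

                    /* ... handle the other exit reasons ... */
            }
    }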

Thanks,
Gavin
Peter Xu Aug. 22, 2022, 6:55 p.m. UTC | #3
Hi, Gavin,

On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
> > For context, the documentation says:
> > 
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> >    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> > 
> > What is the reason for picking this particular value?
> > 
> 
> It's inherited from x86. I don't think it has to be this particular value.
> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET,
> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
> 
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
> 
>     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>     #define KVM_DIRTY_LOG_PAGE_OFFSET        2

It was chosen not to be contiguous with the previously used offsets because
it'll be the first vcpu region that can cover multiple, and a dynamic number
of, pages.  I wanted to leave offsets 3-63 (x86 already used offset 2) for
small fields so they can stay contiguous, which looks a little bit cleaner.

Currently, how many pages it'll use depends on the size set by the user,
though there's a maximum size limited by KVM_DIRTY_RING_MAX_ENTRIES, which
works out to a maximum of 1MB of memory.

So I think setting it to 2 is okay, as long as we keep the rest of the 1MB
address space for the per-vcpu ring structure, so that any new vcpu fields
(even if only 1 page is needed) are placed after that maximum size of the
ring.
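
To put a number on that (taking KVM_DIRTY_RING_MAX_ENTRIES = 65536 and a
16-byte kvm_dirty_gfn as assumptions here, matching the 1MB figure above):

    /*
     * KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn)
     *   = 65536 * 16 bytes
     *   = 1 MiB
     *   = 256 pages with 4K pages
     *
     * i.e. a maximally sized ring spans 256 consecutive page offsets in the
     * vcpu fd, so whatever base offset is chosen, new per-vcpu regions have
     * to be placed beyond that window.
     */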

Thanks,
Marc Zyngier Aug. 22, 2022, 9:42 p.m. UTC | #4
Hi Gavin,

On Mon, 22 Aug 2022 02:58:20 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > On Fri, 19 Aug 2022 01:55:57 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >> 
> >> The ring-based dirty memory tracking has been available and enabled
> >> on x86 for a while. The feature is beneficial when the number of
> >> dirty pages is small in a checkpointing system or live migration
> >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> >> Implement ring-based dirty memory tracking").
> >> 
> >> This enables the ring-based dirty memory tracking on ARM64. It's
> >> notable that no extra reserved ring entries are needed on ARM64
> >> because the huge pages are always split into base pages when page
> >> dirty tracking is enabled.
> > 
> > Can you please elaborate on this? Adding a per-CPU ring of course
> > results in extra memory allocation, so there must be a subtle
> > x86-specific detail that I'm not aware of...
> > 
> 
> Sure. I guess it's helpful to explain how it works in next revision.
> Something like below:
> 
> This enables the ring-based dirty memory tracking on ARM64. The feature
> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> pushed by host when page becomes dirty and pulled by userspace. A vcpu
> exit is forced when the ring buffer becomes full. The ring buffers on
> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> 
> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> However, it's avoiding synchronization among multiple vcpus when dirty
> pages happen on multiple vcpus. More discussion can be found from [1]

Oh, I totally buy the relaxation of the synchronisation (though I
doubt this will have any visible effect until we have something like
Oliver's patches to allow parallel faulting).

But it is the "no extra reserved ring entries are needed on ARM64"
argument that I don't get yet.


>
> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
> 
> 
> >> 
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   Documentation/virt/kvm/api.rst    | 2 +-
> >>   arch/arm64/include/uapi/asm/kvm.h | 1 +
> >>   arch/arm64/kvm/Kconfig            | 1 +
> >>   arch/arm64/kvm/arm.c              | 8 ++++++++
> >>   4 files changed, 11 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index abd7c32126ce..19fa1ac017ed 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
> >>   8.29 KVM_CAP_DIRTY_LOG_RING
> >>   ---------------------------
> >>   -:Architectures: x86
> >> +:Architectures: x86, arm64
> >>   :Parameters: args[0] - size of the dirty log ring
> >>     KVM is capable of tracking dirty memory using ring buffers that
> >> are
> >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> >> index 3bb134355874..7e04b0b8d2b2 100644
> >> --- a/arch/arm64/include/uapi/asm/kvm.h
> >> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >> @@ -43,6 +43,7 @@
> >>   #define __KVM_HAVE_VCPU_EVENTS
> >>     #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> >> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> > 
> > For context, the documentation says:
> > 
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> >    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> > 
> > What is the reason for picking this particular value?
> > 
> 
> It's inherited from x86. I don't think it has to be this particular
> value.  The value is used to distinguish the region's owners like
> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
> KVM_DIRTY_LOG_PAGE_OFFSET.
> 
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
> 
>     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>     #define KVM_DIRTY_LOG_PAGE_OFFSET        2

Given that this is just an offset in the vcpu "file", I don't think it
matters that much. 64 definitely allows for some struct vcpu growth,
and it doesn't hurt to be compatible with x86 (for once...).

> 
> >>     #define KVM_REG_SIZE(id)
> >> \
> >>   	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> >> index 815cc118c675..0309b2d0f2da 100644
> >> --- a/arch/arm64/kvm/Kconfig
> >> +++ b/arch/arm64/kvm/Kconfig
> >> @@ -32,6 +32,7 @@ menuconfig KVM
> >>   	select KVM_VFIO
> >>   	select HAVE_KVM_EVENTFD
> >>   	select HAVE_KVM_IRQFD
> >> +	select HAVE_KVM_DIRTY_RING
> >>   	select HAVE_KVM_MSI
> >>   	select HAVE_KVM_IRQCHIP
> >>   	select HAVE_KVM_IRQ_ROUTING
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index 986cee6fbc7f..3de6b9b39db7 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >>   		if (!ret)
> >>   			ret = 1;
> >>   +		/* Force vcpu exit if its dirty ring is soft-full */
> >> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
> >> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> >> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >> +			trace_kvm_dirty_ring_exit(vcpu);
> >> +			ret = 0;
> >> +		}
> >> +
> > 
> > Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> > also very much like the check to be made a common helper with x86.
> > 
> > A seemingly approach would be to make this a request on dirty log
> > insertion, and avoid the whole "check the log size" on every run,
> > which adds pointless overhead to unsuspecting users (aka everyone).
> > 
> 
> I though of having the check in kvm_vcpu_exit_request(). The various
> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
> highest priority and ARM64 is just to follow. I don't think it really
> matters. I will improve it accordingly in next revision:
> 
> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
> 
>   bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
>   {
>        struct kvm *kvm = vcpu->vcpu;
>        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
> 
>        if (unlikely(kvm->dirty_ring_size &&
>                     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
>            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>            trace_kvm_dirty_ring_exit(vcpu);
>            return true;
>        }
> 
>        return false;
>   }
> 
> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
> 
> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
> sure if anything else I missed?

I'm fine with the above, but what I really wanted is a request from
the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request.
Something like this (which obviously doesn't compile, but you'll get
the idea):

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)

Thanks,

	M.
Gavin Shan Aug. 23, 2022, 3:19 a.m. UTC | #5
Hi Peter,

On 8/23/22 4:55 AM, Peter Xu wrote:
> On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
>>> For context, the documentation says:
>>>
>>> <quote>
>>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>>>     KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
>>> </quote>
>>>
>>> What is the reason for picking this particular value?
>>>
>>
>> It's inherited from x86. I don't think it has to be this particular value.
>> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET,
>> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
>> The virtual area is cheap, I guess it's also nice to use x86's
>> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>>      #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>>      #define KVM_DIRTY_LOG_PAGE_OFFSET        2
> 
> It was chosen not to be continuous of previous used offset because it'll be
> the 1st vcpu region that can cover multiple & dynamic number of pages.  I
> wanted to leave the 3-63 (x86 used offset 2 already) for small fields so
> they can be continuous, which looks a little bit cleaner.
> 
> Currently how many pages it'll use depends on the size set by the user,
> though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which is a
> maximum of 1MB memory.
> 
> So I think setting it to 2 is okay, as long as we keep the rest 1MB address
> space for the per-vcpu ring structure, so any new vcpu fields (even if only
> 1 page will be needed) need to be after that maximum size of the ring.
> 

Thanks for the details. I think I'd keep the layout since the virtual area
is really cheap. So let's keep it as it is if Marc doesn't object. In
this way, a new area, which is usually one page in size, can go right after
KVM_COALESCED_MMIO_PAGE_OFFSET.

#define KVM_DIRTY_LOG_PAGE_OFFSET 64

Thanks,
Gavin
Gavin Shan Aug. 23, 2022, 5:22 a.m. UTC | #6
Hi Marc,

On 8/23/22 7:42 AM, Marc Zyngier wrote:
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 8/19/22 6:00 PM, Marc Zyngier wrote:
>>> On Fri, 19 Aug 2022 01:55:57 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> The ring-based dirty memory tracking has been available and enabled
>>>> on x86 for a while. The feature is beneficial when the number of
>>>> dirty pages is small in a checkpointing system or live migration
>>>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>>>> Implement ring-based dirty memory tracking").
>>>>
>>>> This enables the ring-based dirty memory tracking on ARM64. It's
>>>> notable that no extra reserved ring entries are needed on ARM64
>>>> because the huge pages are always split into base pages when page
>>>> dirty tracking is enabled.
>>>
>>> Can you please elaborate on this? Adding a per-CPU ring of course
>>> results in extra memory allocation, so there must be a subtle
>>> x86-specific detail that I'm not aware of...
>>>
>>
>> Sure. I guess it's helpful to explain how it works in next revision.
>> Something like below:
>>
>> This enables the ring-based dirty memory tracking on ARM64. The feature
>> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
>> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
>> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
>> pushed by host when page becomes dirty and pulled by userspace. A vcpu
>> exit is forced when the ring buffer becomes full. The ring buffers on
>> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
>>
>> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
>> However, it's avoiding synchronization among multiple vcpus when dirty
>> pages happen on multiple vcpus. More discussion can be found from [1]
> 
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
> 
> But it is the "no extra reserved ring entries are needed on ARM64"
> argument that I don't get yet.
> 

Ok. The extra reserved ring entries are x86 specific. When x86's PML
(Page Modification Logging) hardware capability is enabled, the vcpu
exits due to a full PML buffer, which holds 512 entries. All the
information in the PML buffer is pushed to the dirty ring buffer in one
shot. To avoid overrunning the dirty ring buffer, 512 extra entries are
reserved.

   === include/linux/kvm_host.h

   #define KVM_DIRTY_RING_RSVD_ENTRIES    64     // fixed and reserved ring entries

   === virt/kvm/dirty_ring.c

   int __weak kvm_cpu_dirty_log_size(void)
   {
         return 0;
   }

   u32 kvm_dirty_ring_get_rsvd_entries(void)
   {
         return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
   }

   === arch/x86/kvm/mmu/mmu.c

   int kvm_cpu_dirty_log_size(void)
   {
         return kvm_x86_ops.cpu_dirty_log_size;    // Set to 512 when PML is enabled
   }


kvm_cpu_dirty_log_size() isn't overridden by ARM64, meaning it returns
zero on ARM64. On x86, it returns 512 when PML is enabled.
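
In other words, the reserved entries only shrink the point at which the
soft-full exit triggers. Paraphrasing the ring allocation in
virt/kvm/dirty_ring.c (so treat the exact shape as an approximation):

    /* ring->size is the number of kvm_dirty_gfn entries in the ring */
    ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();

    /*
     * x86 with PML enabled: rsvd = 64 + 512, so the soft-full exit leaves
     * room for one final flush of up to 512 PML entries.
     * ARM64: rsvd = 64 + 0, hence no extra reserved entries are needed.
     */
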

>>
>> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
>> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
>>
>>
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    Documentation/virt/kvm/api.rst    | 2 +-
>>>>    arch/arm64/include/uapi/asm/kvm.h | 1 +
>>>>    arch/arm64/kvm/Kconfig            | 1 +
>>>>    arch/arm64/kvm/arm.c              | 8 ++++++++
>>>>    4 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>>> index abd7c32126ce..19fa1ac017ed 100644
>>>> --- a/Documentation/virt/kvm/api.rst
>>>> +++ b/Documentation/virt/kvm/api.rst
>>>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>>>>    8.29 KVM_CAP_DIRTY_LOG_RING
>>>>    ---------------------------
>>>>    -:Architectures: x86
>>>> +:Architectures: x86, arm64
>>>>    :Parameters: args[0] - size of the dirty log ring
>>>>      KVM is capable of tracking dirty memory using ring buffers that
>>>> are
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 3bb134355874..7e04b0b8d2b2 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -43,6 +43,7 @@
>>>>    #define __KVM_HAVE_VCPU_EVENTS
>>>>      #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>>>
>>> For context, the documentation says:
>>>
>>> <quote>
>>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>>>     KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
>>> </quote>
>>>
>>> What is the reason for picking this particular value?
>>>
>>
>> It's inherited from x86. I don't think it has to be this particular
>> value.  The value is used to distinguish the region's owners like
>> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
>> KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
>> The virtual area is cheap, I guess it's also nice to use x86's
>> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>>      #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>>      #define KVM_DIRTY_LOG_PAGE_OFFSET        2
> 
> Given that this is just an offset in the vcpu "file", I don't think it
> matters that much. 64 definitely allows for some struct vcpu growth,
> and it doesn't hurt to be compatible with x86 (for once...).
> 

Sure, thanks. I think it'd be better to keep the same pattern as x86 anyway.

>>
>>>>      #define KVM_REG_SIZE(id)
>>>> \
>>>>    	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>>>> index 815cc118c675..0309b2d0f2da 100644
>>>> --- a/arch/arm64/kvm/Kconfig
>>>> +++ b/arch/arm64/kvm/Kconfig
>>>> @@ -32,6 +32,7 @@ menuconfig KVM
>>>>    	select KVM_VFIO
>>>>    	select HAVE_KVM_EVENTFD
>>>>    	select HAVE_KVM_IRQFD
>>>> +	select HAVE_KVM_DIRTY_RING
>>>>    	select HAVE_KVM_MSI
>>>>    	select HAVE_KVM_IRQCHIP
>>>>    	select HAVE_KVM_IRQ_ROUTING
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 986cee6fbc7f..3de6b9b39db7 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>>>    		if (!ret)
>>>>    			ret = 1;
>>>>    +		/* Force vcpu exit if its dirty ring is soft-full */
>>>> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
>>>> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>>>> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>>> +			trace_kvm_dirty_ring_exit(vcpu);
>>>> +			ret = 0;
>>>> +		}
>>>> +
>>>
>>> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
>>> also very much like the check to be made a common helper with x86.
>>>
>>> A seemingly approach would be to make this a request on dirty log
>>> insertion, and avoid the whole "check the log size" on every run,
>>> which adds pointless overhead to unsuspecting users (aka everyone).
>>>
>>
>> I though of having the check in kvm_vcpu_exit_request(). The various
>> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
>> highest priority and ARM64 is just to follow. I don't think it really
>> matters. I will improve it accordingly in next revision:
>>
>> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
>>
>>    bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
>>    {
>>         struct kvm *kvm = vcpu->vcpu;
>>         struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
>>
>>         if (unlikely(kvm->dirty_ring_size &&
>>                      kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
>>             vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>             trace_kvm_dirty_ring_exit(vcpu);
>>             return true;
>>         }
>>
>>         return false;
>>    }
>>
>> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
>>
>> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
>> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
>> sure if anything else I missed?
> 
> I'm fine with the above, but what I really wanted is a request from
> the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request.
> Something like this (which obviously doesn't compile, but you'll get
> the idea):
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..0b41feb6fb7d 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>   
>   		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
>   			return kvm_vcpu_suspend(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +			trace_kvm_dirty_ring_exit(vcpu);
> +			return 0;
> +		}
>   	}
>   
>   	return 1;
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f4c2a6eb1666..08b2f01164fa 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>   
>   void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>   {
> +	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
>   	struct kvm_dirty_gfn *entry;
>   
>   	/* It should never get full */
> @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>   	kvm_dirty_gfn_set_dirtied(entry);
>   	ring->dirty_index++;
>   	trace_kvm_dirty_ring_push(ring, slot, offset);
> +
> +	if (kvm_dirty_ring_soft_full(vcpu))
> +		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
>   }
>   
>   struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> 

Ok, thanks for the details, Marc. I will adopt your code in the next revision :)

Thanks,
Gavin
Peter Xu Aug. 23, 2022, 1:58 p.m. UTC | #7
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 986cee6fbc7f..0b41feb6fb7d 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> >   		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> >   			return kvm_vcpu_suspend(vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > +			trace_kvm_dirty_ring_exit(vcpu);
> > +			return 0;
> > +		}
> >   	}
> >   	return 1;
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index f4c2a6eb1666..08b2f01164fa 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >   void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> >   {
> > +	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> >   	struct kvm_dirty_gfn *entry;
> >   	/* It should never get full */
> > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> >   	kvm_dirty_gfn_set_dirtied(entry);
> >   	ring->dirty_index++;
> >   	trace_kvm_dirty_ring_push(ring, slot, offset);
> > +
> > +	if (kvm_dirty_ring_soft_full(vcpu))
> > +		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> >   }
> >   struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> > 
> 
> Ok, thanks for the details, Marc. I will adopt your code in next revision :)

Note that there can be a slight difference between the old and new code, in
that an (especially malicious) userspace app can logically ignore the
DIRTY_RING_FULL vmexit and keep kicking KVM_RUN with the new code.

Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code
until the next dirty pfn is pushed to the ring, at which point it'll request
a ring-full exit again.

Each time it exits, the ring grows by 1.

In the end, IIUC, it can easily hit the ring-full condition and trigger the
warning at the entry of kvm_dirty_ring_push():

	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));

We did that because kvm_dirty_ring_push() was designed not to be able to
fail at all (e.g., in the old bitmap world we could never fail either).
It can't fail because we can't lose any dirty page, or migration could
silently fail (consider the case where we do a user exit due to ring full
and migration has just completed; there could be unsynced pages on src/dst).

So even though the old approach needs to read kvm->dirty_ring_size on
every entry, which is a pity, it avoids the issue above.

Side note: for x86 the dirty ring check was put at the entrance not because
it needs to be the highest priority - it should really be the same as
checking kvm requests. It's just that it's the fastest way to fail properly,
if needed, before loading the mmu, disabling preemption/irq, etc.

Thanks,
Oliver Upton Aug. 23, 2022, 2:44 p.m. UTC | #8
On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> Hi Gavin,
> 
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
> > 
> > Hi Marc,
> > 
> > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > Gavin Shan <gshan@redhat.com> wrote:
> > >> 
> > >> The ring-based dirty memory tracking has been available and enabled
> > >> on x86 for a while. The feature is beneficial when the number of
> > >> dirty pages is small in a checkpointing system or live migration
> > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > >> Implement ring-based dirty memory tracking").
> > >> 
> > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > >> notable that no extra reserved ring entries are needed on ARM64
> > >> because the huge pages are always split into base pages when page
> > >> dirty tracking is enabled.
> > > 
> > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > results in extra memory allocation, so there must be a subtle
> > > x86-specific detail that I'm not aware of...
> > > 
> > 
> > Sure. I guess it's helpful to explain how it works in next revision.
> > Something like below:
> > 
> > This enables the ring-based dirty memory tracking on ARM64. The feature
> > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > pushed by host when page becomes dirty and pulled by userspace. A vcpu
> > exit is forced when the ring buffer becomes full. The ring buffers on
> > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> > 
> > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > However, it's avoiding synchronization among multiple vcpus when dirty
> > pages happen on multiple vcpus. More discussion can be found from [1]
> 
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
> 

Heh, yeah I need to get that out the door. I'll also note that Gavin's
changes are still relevant without that series, as we do write unprotect
in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
Add fast path to handle permission relaxation during dirty logging").

--
Thanks,
Oliver
Marc Zyngier Aug. 23, 2022, 7:17 p.m. UTC | #9
On Tue, 23 Aug 2022 14:58:19 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 986cee6fbc7f..0b41feb6fb7d 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > >   		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > >   			return kvm_vcpu_suspend(vcpu);
> > > +
> > > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > +			trace_kvm_dirty_ring_exit(vcpu);
> > > +			return 0;
> > > +		}
> > >   	}
> > >   	return 1;
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index f4c2a6eb1666..08b2f01164fa 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > >   void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > >   {
> > > +	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> > >   	struct kvm_dirty_gfn *entry;
> > >   	/* It should never get full */
> > > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > >   	kvm_dirty_gfn_set_dirtied(entry);
> > >   	ring->dirty_index++;
> > >   	trace_kvm_dirty_ring_push(ring, slot, offset);
> > > +
> > > +	if (kvm_dirty_ring_soft_full(vcpu))
> > > +		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > >   }
> > >   struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> > > 
> > 
> > Ok, thanks for the details, Marc. I will adopt your code in next revision :)
> 
> Note that there can be a slight difference with the old/new code, in that
> an (especially malicious) userapp can logically ignore the DIRTY_RING_FULL
> vmexit and keep kicking VCPU_RUN with the new code.
> 
> Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code
> until the next dirty pfn being pushed to the ring, then it'll request ring
> full exit again.
> 
> Each time it exits the ring grows 1.
> 
> At last iiuc it can easily hit the ring full and trigger the warning at the
> entry of kvm_dirty_ring_push():
> 
> 	/* It should never get full */
> 	WARN_ON_ONCE(kvm_dirty_ring_full(ring));

Hmmm, yes. Well spotted.

> We did that because kvm_dirty_ring_push() was previously designed to not be
> able to fail at all (e.g., in the old bitmap world we never will fail too).
> We can't because we can't lose any dirty page or migration could silently
> fail too (consider when we do user exit due to ring full and migration just
> completed; there could be unsynced pages on src/dst).
> 
> So even though the old approach will need to read kvm->dirty_ring_size for
> every entrance which is a pity, it will avoid issue above.

I don't think we really need this check on the hot path. All we need
is to make the request sticky until userspace gets their act together
and consumes elements in the ring. Something like:

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+		    kvm_dirty_ring_soft_full(vcpu)) {
+			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;


However, I'm a bit concerned by the reset side of things. It iterates
over the vcpus and expects the view of each ring to be consistent,
even if userspace is hacking at it from another CPU. For example, I
can't see what guarantees that the kernel observes the writes from
userspace in the order they are being performed (the documentation
provides no requirements other than "it must collect the dirty GFNs in
sequence", which doesn't mean much from an ordering perspective).

I can see that working on a strongly ordered architecture, but on
something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
stuff that isn't explicitly ordered. I have the feeling that a CAS
operation on both sides would be enough, but someone who actually
understands how this works should have a look...

Thanks,

	M.
Marc Zyngier Aug. 23, 2022, 8:35 p.m. UTC | #10
On Tue, 23 Aug 2022 15:44:42 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> > Hi Gavin,
> > 
> > On Mon, 22 Aug 2022 02:58:20 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > > Gavin Shan <gshan@redhat.com> wrote:
> > > >> 
> > > >> The ring-based dirty memory tracking has been available and enabled
> > > >> on x86 for a while. The feature is beneficial when the number of
> > > >> dirty pages is small in a checkpointing system or live migration
> > > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > > >> Implement ring-based dirty memory tracking").
> > > >> 
> > > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > > >> notable that no extra reserved ring entries are needed on ARM64
> > > >> because the huge pages are always split into base pages when page
> > > >> dirty tracking is enabled.
> > > > 
> > > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > > results in extra memory allocation, so there must be a subtle
> > > > x86-specific detail that I'm not aware of...
> > > > 
> > > 
> > > Sure. I guess it's helpful to explain how it works in next revision.
> > > Something like below:
> > > 
> > > This enables the ring-based dirty memory tracking on ARM64. The feature
> > > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> > > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> > > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > > pushed by host when page becomes dirty and pulled by userspace. A vcpu
> > > exit is forced when the ring buffer becomes full. The ring buffers on
> > > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> > > 
> > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > > However, it's avoiding synchronization among multiple vcpus when dirty
> > > pages happen on multiple vcpus. More discussion can be found from [1]
> > 
> > Oh, I totally buy the relaxation of the synchronisation (though I
> > doubt this will have any visible effect until we have something like
> > Oliver's patches to allow parallel faulting).
> > 
> 
> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> changes are still relevant without that series, as we do write unprotect
> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> Add fast path to handle permission relaxation during dirty logging").

Ah, true. Now if only someone could explain how the whole
producer-consumer thing works without a trace of a barrier, that'd be
great...

Thanks,

	M.
Peter Xu Aug. 23, 2022, 9:20 p.m. UTC | #11
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> I don't think we really need this check on the hot path. All we need
> is to make the request sticky until userspace gets their act together
> and consumes elements in the ring. Something like:
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..e8ed5e1af159 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
>  			return kvm_vcpu_suspend(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> +		    kvm_dirty_ring_soft_full(vcpu)) {
> +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +			trace_kvm_dirty_ring_exit(vcpu);
> +			return 0;
> +		}
>  	}
>  
>  	return 1;

Right, this seems to work.  We can also use kvm_test_request() here.

> 
> 
> However, I'm a bit concerned by the reset side of things. It iterates
> over the vcpus and expects the view of each ring to be consistent,
> even if userspace is hacking at it from another CPU. For example, I
> can't see what guarantees that the kernel observes the writes from
> userspace in the order they are being performed (the documentation
> provides no requirements other than "it must collect the dirty GFNs in
> sequence", which doesn't mean much from an ordering perspective).
> 
> I can see that working on a strongly ordered architecture, but on
> something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> stuff that isn't explicitly ordered. I have the feeling that a CAS
> operation on both sides would be enough, but someone who actually
> understands how this works should have a look...

I definitely don't think I 100% understand all the ordering things since
they're complicated... but my understanding is that the reset procedure
didn't need a memory barrier (unlike pushing, where we have an explicit wmb),
because we assumed the userspace app is not hostile, so logically it should
only modify the flags, which is a 32-bit field, with atomicity guaranteed.

IIRC we used to discuss similar questions about "what if the user is hostile
and wants to hack the process by messing with the ring", and our conclusion
was that as long as the process can't mess up anything outside itself, it
should be okay. E.g. it should not be able to either cause the host to
malfunction or trigger kernel warnings in dmesg, etc.

Thanks,
Marc Zyngier Aug. 23, 2022, 10:47 p.m. UTC | #12
On Tue, 23 Aug 2022 22:20:32 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > I don't think we really need this check on the hot path. All we need
> > is to make the request sticky until userspace gets their act together
> > and consumes elements in the ring. Something like:
> > 
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 986cee6fbc7f..e8ed5e1af159 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> >  
> >  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> >  			return kvm_vcpu_suspend(vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > +		    kvm_dirty_ring_soft_full(vcpu)) {
> > +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > +			trace_kvm_dirty_ring_exit(vcpu);
> > +			return 0;
> > +		}
> >  	}
> >  
> >  	return 1;
> 
> Right, this seems working.  We can also use kvm_test_request() here.
> 
> > 
> > 
> > However, I'm a bit concerned by the reset side of things. It iterates
> > over the vcpus and expects the view of each ring to be consistent,
> > even if userspace is hacking at it from another CPU. For example, I
> > can't see what guarantees that the kernel observes the writes from
> > userspace in the order they are being performed (the documentation
> > provides no requirements other than "it must collect the dirty GFNs in
> > sequence", which doesn't mean much from an ordering perspective).
> > 
> > I can see that working on a strongly ordered architecture, but on
> > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > operation on both sides would be enough, but someone who actually
> > understands how this works should have a look...
> 
> I definitely don't think I 100% understand all the ordering things since
> they're complicated.. but my understanding is that the reset procedure
> didn't need memory barrier (unlike pushing, where we have explicit wmb),
> because we assumed the userapp is not hostile so logically it should only
> modify the flags which is a 32bit field, assuming atomicity guaranteed.

Atomicity doesn't guarantee ordering, unfortunately. Take the
following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
D that exist in the ring in that order, and CPU1 performs an ioctl to
reset the page state.

CPU0:
    write_flag(A, KVM_DIRTY_GFN_F_RESET)
    write_flag(B, KVM_DIRTY_GFN_F_RESET)
    write_flag(C, KVM_DIRTY_GFN_F_RESET)
    write_flag(D, KVM_DIRTY_GFN_F_RESET)
    [...]

CPU1:
   ioctl(KVM_RESET_DIRTY_RINGS)

Since CPU0 writes do not have any ordering, CPU1 can observe the
writes in a sequence that have nothing to do with program order, and
could for example observe that GFN A and D have been reset, but not B
and C. This in turn breaks the logic in the reset code (B, C, and D
don't get reset), despite userspace having followed the spec to the
letter. If each was a store-release (which is the case on x86), it
wouldn't be a problem, but nothing calls it in the documentation.

Maybe that's not a big deal if it is expected that each CPU will issue
a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
writes. But expecting this to work across CPUs without any barrier is
wishful thinking.
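
For illustration only, the kind of explicit ordering being asked for here
would be a release store on the userspace side paired with an acquire load
in the kernel's reset walk (reset_entry() is a made-up helper, and this is
not what the current code or documentation mandates):

    /*
     * Userspace: publish the RESET flag with release semantics, so that a
     * reader using an acquire load cannot observe this entry as reset
     * without also observing the flag writes made before it.
     */
    static void reset_entry(struct kvm_dirty_gfn *gfn)
    {
            __atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET,
                             __ATOMIC_RELEASE);
    }

    /* Kernel reset walk: the matching acquire load would look something like
     *      if (!(smp_load_acquire(&entry->flags) & KVM_DIRTY_GFN_F_RESET))
     *              break;
     */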

> IIRC we used to discuss similar questions on "what if the user is hostile
> and wants to hack the process by messing up with the ring", and our
> conclusion was as long as the process wouldn't mess up anything outside
> itself it should be okay. E.g. It should not be able to either cause the
> host to misfunction, or trigger kernel warnings in dmesg, etc..

I'm not even discussing safety here. I'm purely discussing the
interactions between userspace and kernel based on the documentation
and the existing kernel code.

Thanks,

	M.
Peter Xu Aug. 23, 2022, 11:19 p.m. UTC | #13
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> On Tue, 23 Aug 2022 22:20:32 +0100,
> Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > I don't think we really need this check on the hot path. All we need
> > > is to make the request sticky until userspace gets their act together
> > > and consumes elements in the ring. Something like:
> > > 
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > >  
> > >  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > >  			return kvm_vcpu_suspend(vcpu);
> > > +
> > > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > +		    kvm_dirty_ring_soft_full(vcpu)) {
> > > +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > +			trace_kvm_dirty_ring_exit(vcpu);
> > > +			return 0;
> > > +		}
> > >  	}
> > >  
> > >  	return 1;
> > 
> > Right, this seems working.  We can also use kvm_test_request() here.
> > 
> > > 
> > > 
> > > However, I'm a bit concerned by the reset side of things. It iterates
> > > over the vcpus and expects the view of each ring to be consistent,
> > > even if userspace is hacking at it from another CPU. For example, I
> > > can't see what guarantees that the kernel observes the writes from
> > > userspace in the order they are being performed (the documentation
> > > provides no requirements other than "it must collect the dirty GFNs in
> > > sequence", which doesn't mean much from an ordering perspective).
> > > 
> > > I can see that working on a strongly ordered architecture, but on
> > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > operation on both sides would be enough, but someone who actually
> > > understands how this works should have a look...
> > 
> > I definitely don't think I 100% understand all the ordering things since
> > they're complicated.. but my understanding is that the reset procedure
> > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > because we assumed the userapp is not hostile so logically it should only
> > modify the flags which is a 32bit field, assuming atomicity guaranteed.
> 
> Atomicity doesn't guarantee ordering, unfortunately.

Right, sorry to be misleading.  By the "atomicity" part I was trying to say
that the kernel will always see a consistent update of the fields.

The ordering should also be guaranteed, because things must happen in the
sequence below:

  (1) kernel publish dirty GFN data (slot, offset)
  (2) kernel publish dirty GFN flag (set to DIRTY)
  (3) user sees DIRTY, collects (slots, offset)
  (4) user sets it to RESET
  (5) kernel reads RESET

So the ordering within a single entry is guaranteed, in that when (5)
happens it must be after (1+2) have stabilized.
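
For reference, that ordering of (1) against (2) on the push side comes from
an explicit barrier; paraphrasing kvm_dirty_ring_push() in
virt/kvm/dirty_ring.c (approximate, not a verbatim quote):

    /* (1) publish the dirty GFN data ... */
    entry->slot = slot;
    entry->offset = offset;

    /* ... and only then (2) publish the DIRTY flag to userspace */
    smp_wmb();
    kvm_dirty_gfn_set_dirtied(entry);
    ring->dirty_index++;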

> Take the
> following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> D that exist in the ring in that order, and CPU1 performs an ioctl to
> reset the page state.
> 
> CPU0:
>     write_flag(A, KVM_DIRTY_GFN_F_RESET)
>     write_flag(B, KVM_DIRTY_GFN_F_RESET)
>     write_flag(C, KVM_DIRTY_GFN_F_RESET)
>     write_flag(D, KVM_DIRTY_GFN_F_RESET)
>     [...]
> 
> CPU1:
>    ioctl(KVM_RESET_DIRTY_RINGS)
> 
> Since CPU0 writes do not have any ordering, CPU1 can observe the
> writes in a sequence that have nothing to do with program order, and
> could for example observe that GFN A and D have been reset, but not B
> and C. This in turn breaks the logic in the reset code (B, C, and D
> don't get reset), despite userspace having followed the spec to the
> letter. If each was a store-release (which is the case on x86), it
> wouldn't be a problem, but nothing calls it in the documentation.
> 
> Maybe that's not a big deal if it is expected that each CPU will issue
> a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> writes. But expecting this to work across CPUs without any barrier is
> wishful thinking.

I see what you meant...

Firstly I'm actually curious whether that'll really happen if the gfns are
collected in something like a for loop:

  for(i = 0; i < N; i++)
    collect_dirty_gfn(ring, i);

Because all the gfns to be read will depend on variable "i", IIUC no
reordering should happen, but I'm not really sure, so more of a pure
question.

Besides, the other thing to mention is that I think it is fine if the RESET
ioctl doesn't recycle all the gfns that got set to the reset state.  Taking
the above example of GFNs A-D, if only A & D's flags are updated when the
RESET ioctl is reached, the ioctl will recycle gfn A but stop at gfn B,
assuming B-D are not reset.  But IMHO it's okay because it means we reset
part of the gfns, not all of them, and it's safe to do so.  It means the
next ring-full event can come earlier because we recycled less, but that's
functionally safe to me.

> 
> > IIRC we used to discuss similar questions on "what if the user is hostile
> > and wants to hack the process by messing up with the ring", and our
> > conclusion was as long as the process wouldn't mess up anything outside
> > itself it should be okay. E.g. It should not be able to either cause the
> > host to misfunction, or trigger kernel warnings in dmesg, etc..
> 
> I'm not even discussing safety here. I'm purely discussing the
> interactions between userspace and kernel based on the documentation
> and the existing kernel code.

I see.  Please let me know if above makes sense.

Thanks!
Marc Zyngier Aug. 24, 2022, 2:45 p.m. UTC | #14
On Wed, 24 Aug 2022 00:19:04 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > On Tue, 23 Aug 2022 22:20:32 +0100,
> > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > > I don't think we really need this check on the hot path. All we need
> > > > is to make the request sticky until userspace gets their act together
> > > > and consumes elements in the ring. Something like:
> > > > 
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > >  
> > > >  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > >  			return kvm_vcpu_suspend(vcpu);
> > > > +
> > > > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > > +		    kvm_dirty_ring_soft_full(vcpu)) {
> > > > +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > > +			trace_kvm_dirty_ring_exit(vcpu);
> > > > +			return 0;
> > > > +		}
> > > >  	}
> > > >  
> > > >  	return 1;
> > > 
> > > Right, this seems working.  We can also use kvm_test_request() here.
> > > 
> > > > 
> > > > 
> > > > However, I'm a bit concerned by the reset side of things. It iterates
> > > > over the vcpus and expects the view of each ring to be consistent,
> > > > even if userspace is hacking at it from another CPU. For example, I
> > > > can't see what guarantees that the kernel observes the writes from
> > > > userspace in the order they are being performed (the documentation
> > > > provides no requirements other than "it must collect the dirty GFNs in
> > > > sequence", which doesn't mean much from an ordering perspective).
> > > > 
> > > > I can see that working on a strongly ordered architecture, but on
> > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > > operation on both sides would be enough, but someone who actually
> > > > understands how this works should have a look...
> > > 
> > > I definitely don't think I 100% understand all the ordering things since
> > > they're complicated.. but my understanding is that the reset procedure
> > > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > > because we assumed the userapp is not hostile so logically it should only
> > > modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > 
> > Atomicity doesn't guarantee ordering, unfortunately.
> 
> Right, sorry to be misleading.  The "atomicity" part I was trying to say
> the kernel will always see consistent update on the fields.
>
> The ordering should also be guaranteed, because things must happen with
> below sequence:
> 
>   (1) kernel publish dirty GFN data (slot, offset)
>   (2) kernel publish dirty GFN flag (set to DIRTY)
>   (3) user sees DIRTY, collects (slots, offset)
>   (4) user sets it to RESET
>   (5) kernel reads RESET

Maybe. Maybe not. The reset could well be sitting in the CPU write
buffer for as long as it wants and not be seen by the kernel if the
read occurs on another CPU. And that's the crucial bit: single-CPU is
fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
on collection, and global on reset (this seems like a bad decision,
but it is too late to fix this).

> 
> So the ordering of single-entry is guaranteed in that when (5) happens it
> must come after (1+2) have stabilized.
> 
> > Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> > 
> > CPU0:
> >     write_flag(A, KVM_DIRTY_GFN_F_RESET)
> >     write_flag(B, KVM_DIRTY_GFN_F_RESET)
> >     write_flag(C, KVM_DIRTY_GFN_F_RESET)
> >     write_flag(D, KVM_DIRTY_GFN_F_RESET)
> >     [...]
> > 
> > CPU1:
> >    ioctl(KVM_RESET_DIRTY_RINGS)
> > 
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that have nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it in the documentation.
> > 
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
> 
> I see what you meant...
> 
> Firstly I'm actually curious whether that'll really happen if the gfns are
> collected in something like a for loop:
> 
>   for(i = 0; i < N; i++)
>     collect_dirty_gfn(ring, i);
> 
> Because all the gfns to be read will depend on variable "i", IIUC no
> reordering should happen, but I'm not really sure, so more of a pure
> question.

'i' has no influence on the write ordering. Each write targets a
different address; there are no inter-write dependencies (this concept
doesn't exist other than for writes to the same address), so they can
be reordered at will.

If you want a proof of this, head to http://diy.inria.fr/www/ and run
the MP.litmus test (which conveniently gives you a reduction of this
problem) on both the x86 and AArch64 models. You will see that the
reordering isn't allowed on x86, but definitely allowed on arm64.
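
For readers without the diy toolchain at hand, the shape of that MP
(message passing) test, transcribed roughly into the Linux-kernel memory
model's C litmus syntax rather than the per-architecture assembly that
diy generates, is:

C MP

{}

P0(int *data, int *flag)
{
	WRITE_ONCE(*data, 1);	/* "fill the entry" */
	WRITE_ONCE(*flag, 1);	/* "publish the flag" */
}

P1(int *data, int *flag)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	r1 = READ_ONCE(*data);
}

exists (1:r0=1 /\ 1:r1=0)

The exists clause asks whether P1 can observe the flag set while still
reading stale data; that outcome is forbidden on x86-TSO but allowed on
arm64 unless explicit ordering is added.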

> Besides, the other thing to mention is that I think it is fine if the RESET
> ioctl doesn't recycle all the gfns that got set to the reset state.  Taking
> the above example of GFNs A-D, if only A & D's flags are updated when the
> RESET ioctl is reached, the ioctl will recycle gfn A but stop at gfn B,
> assuming B-D are not reset.  But IMHO it's okay because it means we reset
> part of the gfns, not all of them, and it's safe to do so.  It means the
> next ring-full event can come earlier because we recycled less, but that's
> functionally safe to me.

It may be safe, but it isn't what the userspace API promises. In other
words, without further straightening of the API, this doesn't work as
expected on relaxed memory architectures. So before this gets enabled
on arm64, this whole ordering issue must be addressed.

Thanks,

	M.
Peter Xu Aug. 24, 2022, 4:21 p.m. UTC | #15
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 00:19:04 +0100,
> Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > On Tue, 23 Aug 2022 22:20:32 +0100,
> > > Peter Xu <peterx@redhat.com> wrote:
> > > > 
> > > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > > > I don't think we really need this check on the hot path. All we need
> > > > > is to make the request sticky until userspace gets their act together
> > > > > and consumes elements in the ring. Something like:
> > > > > 
> > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > > > --- a/arch/arm64/kvm/arm.c
> > > > > +++ b/arch/arm64/kvm/arm.c
> > > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > >  
> > > > >  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > > >  			return kvm_vcpu_suspend(vcpu);
> > > > > +
> > > > > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > > > +		    kvm_dirty_ring_soft_full(vcpu)) {
> > > > > +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > > > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > > > +			trace_kvm_dirty_ring_exit(vcpu);
> > > > > +			return 0;
> > > > > +		}
> > > > >  	}
> > > > >  
> > > > >  	return 1;
> > > > 
> > > > Right, this seems working.  We can also use kvm_test_request() here.
> > > > 
> > > > > 
> > > > > 
> > > > > However, I'm a bit concerned by the reset side of things. It iterates
> > > > > over the vcpus and expects the view of each ring to be consistent,
> > > > > even if userspace is hacking at it from another CPU. For example, I
> > > > > can't see what guarantees that the kernel observes the writes from
> > > > > userspace in the order they are being performed (the documentation
> > > > > provides no requirements other than "it must collect the dirty GFNs in
> > > > > sequence", which doesn't mean much from an ordering perspective).
> > > > > 
> > > > > I can see that working on a strongly ordered architecture, but on
> > > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > > > operation on both sides would be enough, but someone who actually
> > > > > understands how this works should have a look...
> > > > 
> > > > I definitely don't think I 100% understand all the ordering things since
> > > > they're complicated.. but my understanding is that the reset procedure
> > > > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > > > because we assumed the userapp is not hostile so logically it should only
> > > > modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > > 
> > > Atomicity doesn't guarantee ordering, unfortunately.
> > 
> > Right, sorry to be misleading.  The "atomicity" part I was trying to say
> > the kernel will always see consistent update on the fields.
> >
> > The ordering should also be guaranteed, because things must happen with
> > below sequence:
> > 
> >   (1) kernel publish dirty GFN data (slot, offset)
> >   (2) kernel publish dirty GFN flag (set to DIRTY)
> >   (3) user sees DIRTY, collects (slots, offset)
> >   (4) user sets it to RESET
> >   (5) kernel reads RESET
> 
> Maybe. Maybe not. The reset could well be sitting in the CPU write
> buffer for as long as it wants and not be seen by the kernel if the
> read occurs on another CPU. And that's the crucial bit: single-CPU is
> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> on collection, and global on reset (this seems like a bad decision,
> but it is too late to fix this).

Regarding the last statement, that's something I had questions about too and
discussed with Paolo, even though at that time it wasn't an outcome of
discussing memory ordering issues.

IIUC the initial design was trying to avoid tlb flush flood when vcpu
number is large (each RESET per ring even for one page will need all vcpus
to flush, so O(N^2) flushing needed).  With global RESET it's O(N).  So
it's kind of a trade-off, and indeed until now I'm not sure which one is
better.  E.g., with per-ring reset, we can have locality too in userspace,
e.g. the vcpu thread might be able to recycle without holding global locks.

Regarding safety, I hope I covered that below in my previous reply.

> 
> > 
> > So the ordering of single-entry is guaranteed in that when (5) happens it
> > must come after (1+2) have stabilized.
> > 
> > > Take the
> > > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > > reset the page state.
> > > 
> > > CPU0:
> > >     write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > >     write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > >     write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > >     write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > >     [...]
> > > 
> > > CPU1:
> > >    ioctl(KVM_RESET_DIRTY_RINGS)
> > > 
> > > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > > writes in a sequence that have nothing to do with program order, and
> > > could for example observe that GFN A and D have been reset, but not B
> > > and C. This in turn breaks the logic in the reset code (B, C, and D
> > > don't get reset), despite userspace having followed the spec to the
> > > letter. If each was a store-release (which is the case on x86), it
> > > wouldn't be a problem, but nothing calls it in the documentation.
> > > 
> > > Maybe that's not a big deal if it is expected that each CPU will issue
> > > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> > > writes. But expecting this to work across CPUs without any barrier is
> > > wishful thinking.
> > 
> > I see what you meant...
> > 
> > Firstly I'm actually curious whether that'll really happen if the gfns are
> > collected in something like a for loop:
> > 
> >   for(i = 0; i < N; i++)
> >     collect_dirty_gfn(ring, i);
> > 
> > Because all the gfns to be read will depend on variable "i", IIUC no
> > reordering should happen, but I'm not really sure, so more of a pure
> > question.
> 
> 'i' has no influence on the write ordering. Each write targets a
> different address; there are no inter-write dependencies (this concept
> doesn't exist other than for writes to the same address), so they can
> be reordered at will.
> 
> If you want a proof of this, head to http://diy.inria.fr/www/ and run
> the MP.litmus test (which conveniently gives you a reduction of this
> problem) on both the x86 and AArch64 models. You will see that the
> reordering isn't allowed on x86, but definitely allowed on arm64.
> 
> > Besides, the other thing to mention is that I think it is fine if the RESET
> > ioctl doesn't recycle all the gfns that got set to the reset state.  Taking
> > the above example of GFNs A-D, if only A & D's flags are updated when the
> > RESET ioctl is reached, the ioctl will recycle gfn A but stop at gfn B,
> > assuming B-D are not reset.  But IMHO it's okay because it means we reset
> > part of the gfns, not all of them, and it's safe to do so.  It means the
> > next ring-full event can come earlier because we recycled less, but that's
> > functionally safe to me.
> 
> It may be safe, but it isn't what the userspace API promises.

The document says:

  After processing one or more entries in the ring buffer, userspace calls
  the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
  the kernel will reprotect those collected GFNs.  Therefore, the ioctl
  must be called *before* reading the content of the dirty pages.

I'd say it's not an explicit promise, but I think I agree with you that at
least it's unclear on the behavior.

Since we have a global recycle mechanism, most likely the app (e.g. current
qemu impl) will use the same thread to collect/reset dirty GFNs, and
trigger the RESET ioctl().  In that case it's safe, IIUC, because no
cross-core ops.

QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):

    if (total) {
        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
        assert(ret == total);
    }

I think the assert() should never trigger as mentioned above.  But ideally
maybe it should just be a loop until cleared gfns match total.
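
Something like the sketch below, reusing the names from the QEMU snippet
above (kvm_vm_ioctl(), s, total); this is only an illustration of the idea,
not QEMU code:

    if (total) {
        int cleared = 0;

        /* Keep asking the kernel to recycle until it has cleared as many
         * entries as we collected; stop if nothing was cleared, or on
         * error, rather than spinning. */
        while (cleared < total) {
            int ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);

            if (ret <= 0) {
                break;
            }
            cleared += ret;
        }
    }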

> In other words, without further straightening of the API, this doesn't
> work as expected on relaxed memory architectures. So before this gets
> enabled on arm64, this whole ordering issue must be addressed.

How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
possibility of recycling only part of the pages, especially when collection
and the ioctl() aren't done from the same thread?

Any suggestions will be greatly welcomed.

Thanks,
Marc Zyngier Aug. 24, 2022, 8:57 p.m. UTC | #16
On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2022 00:19:04 +0100,
> > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > > Atomicity doesn't guarantee ordering, unfortunately.
> > > 
> > > Right, sorry to be misleading.  The "atomicity" part I was trying to say
> > > the kernel will always see consistent update on the fields.
> > >
> > > The ordering should also be guaranteed, because things must happen with
> > > below sequence:
> > > 
> > >   (1) kernel publish dirty GFN data (slot, offset)
> > >   (2) kernel publish dirty GFN flag (set to DIRTY)
> > >   (3) user sees DIRTY, collects (slots, offset)
> > >   (4) user sets it to RESET
> > >   (5) kernel reads RESET
> > 
> > Maybe. Maybe not. The reset could well be sitting in the CPU write
> > buffer for as long as it wants and not be seen by the kernel if the
> > read occurs on another CPU. And that's the crucial bit: single-CPU is
> > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> > on collection, and global on reset (this seems like a bad decision,
> > but it is too late to fix this).
> 
> Regarding the last statement, that's something I had questions about too and
> discussed with Paolo, even though at that time it wasn't an outcome of
> discussing memory ordering issues.
> 
> IIUC the initial design was trying to avoid tlb flush flood when vcpu
> number is large (each RESET per ring even for one page will need all vcpus
> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
> it's kind of a trade-off, and indeed until now I'm not sure which one is
> better.  E.g., with per-ring reset, we can have locality too in userspace,
> e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

But anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

> > It may be safe, but it isn't what the userspace API promises.
> 
> The document says:
> 
>   After processing one or more entries in the ring buffer, userspace calls
>   the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>   the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>   must be called *before* reading the content of the dirty pages.
> 
> I'd say it's not an explicit promise, but I think I agree with you that at
> least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).
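
Concretely, something along the lines of the sketch below (the helper
names are made up for illustration; gfn->flags and KVM_DIRTY_GFN_F_RESET
are the existing UAPI):

/* kernel, reset path: the load-acquire pairs with the userspace
 * store-release below, so seeing RESET implies userspace has finished
 * with the entry. */
static inline bool dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
{
	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
}

/* userspace: publish RESET only after the entry has been consumed */
static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
{
	__atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
}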

> Since we have a global recycle mechanism, most likely the app (e.g. current
> qemu impl) will use the same thread to collect/reset dirty GFNs, and
> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
> cross-core ops.
> 
> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
> 
>     if (total) {
>         ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>         assert(ret == total);
>     }
> 
> I think the assert() should never trigger as mentioned above.  But ideally
> maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.

>
> > In other words, without further straightening of the API, this doesn't
> > work as expected on relaxed memory architectures. So before this gets
> > enabled on arm64, this whole ordering issue must be addressed.
> 
> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
> possibility of recycling only part of the pages, especially when collection
> and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

> Any suggestions will be greatly welcomed.

I'll write a couple of patches when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.

	M.
Gavin Shan Aug. 26, 2022, 6:05 a.m. UTC | #17
Hi Marc,

On 8/25/22 6:57 AM, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 17:21:50 +0100,
> Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
>>> On Wed, 24 Aug 2022 00:19:04 +0100,
>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
>>>>> Atomicity doesn't guarantee ordering, unfortunately.
>>>>
>>>> Right, sorry to be misleading.  The "atomicity" part I was trying to say
>>>> the kernel will always see consistent update on the fields.
>>>>
>>>> The ordering should also be guaranteed, because things must happen with
>>>> below sequence:
>>>>
>>>>    (1) kernel publish dirty GFN data (slot, offset)
>>>>    (2) kernel publish dirty GFN flag (set to DIRTY)
>>>>    (3) user sees DIRTY, collects (slots, offset)
>>>>    (4) user sets it to RESET
>>>>    (5) kernel reads RESET
>>>
>>> Maybe. Maybe not. The reset could well be sitting in the CPU write
>>> buffer for as long as it wants and not be seen by the kernel if the
>>> read occurs on another CPU. And that's the crucial bit: single-CPU is
>>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
>>> on collection, and global on reset (this seems like a bad decision,
>>> but it is too late to fix this).
>>
>> Regarding the last statement, that's something I had questions about too and
>> discussed with Paolo, even though at that time it wasn't an outcome of
>> discussing memory ordering issues.
>>
>> IIUC the initial design was trying to avoid tlb flush flood when vcpu
>> number is large (each RESET per ring even for one page will need all vcpus
>> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
>> it's kind of a trade-off, and indeed until now I'm not sure which one is
>> better.  E.g., with per-ring reset, we can have locality too in userspace,
>> e.g. the vcpu thread might be able to recycle without holding global locks.
> 
> I don't get that. On x86, each CPU must perform the TLB invalidation
> (there is an IPI for that). So whether you do a per-CPU scan of the
> ring or a global scan is irrelevant: each entry you find in any of the
> rings must result in a global invalidation, since you've updated the
> PTE to make the page RO.
> 
> The same is true on ARM, except that the broadcast is done in HW
> instead of being tracked in SW.
> 
> But anyway, this is all moot. The API is what it is, and it isn't
> going to change any time soon. All we can do is add some
> clarifications to the API for the more relaxed architectures, and make
> sure the kernel behaves accordingly.
> 
> [...]
> 
>>> It may be safe, but it isn't what the userspace API promises.
>>
>> The document says:
>>
>>    After processing one or more entries in the ring buffer, userspace calls
>>    the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>>    the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>>    must be called *before* reading the content of the dirty pages.
>>
>> I'd say it's not an explicit promise, but I think I agree with you that at
>> least it's unclear on the behavior.
> 
> This is the least problematic part of the documentation. The bit I
> literally choke on is this:
> 
> <quote>
> It's not necessary for userspace to harvest the all dirty GFNs at once.
> However it must collect the dirty GFNs in sequence, i.e., the userspace
> program cannot skip one dirty GFN to collect the one next to it.
> </quote>
> 
> This is the core of the issue. Without ordering rules, the consumer on
> the other side cannot observe the updates correctly, even if userspace
> follows the above to the letter. Of course, the kernel itself must do
> the right thing (I guess it amounts to the kernel doing a
> load-acquire, and userspace doing a store-release -- effectively
> emulating x86...).
> 
>> Since we have a global recycle mechanism, most likely the app (e.g. current
>> qemu impl) will use the same thread to collect/reset dirty GFNs, and
>> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
>> cross-core ops.
>>
>> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>>
>>      if (total) {
>>          ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>>          assert(ret == total);
>>      }
>>
>> I think the assert() should never trigger as mentioned above.  But ideally
>> maybe it should just be a loop until cleared gfns match total.
> 
> Right. If userspace calls the ioctl on every vcpu, then things should
> work correctly. It is only that the overhead is higher than what it
> should be if multiple vcpus perform a reset at the same time.
> 
>>
>>> In other words, without further straightening of the API, this doesn't
>>> work as expected on relaxed memory architectures. So before this gets
>>> enabled on arm64, this whole ordering issue must be addressed.
>>
>> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
>> possibility of recycling only part of the pages, especially when collection
>> and the ioctl() aren't done from the same thread?
> 
> I'd rather tell people about the ordering rules. That will come at
> zero cost on x86.
> 
>> Any suggestions will be greatly welcomed.
> 
> I'll write a couple of patches when I get the time, most likely next
> week. Gavin will hopefully be able to take them as part of his series.
> 

Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include the patches into this series in the
next revision :)

Thanks,
Gavin
Paolo Bonzini Aug. 26, 2022, 10:50 a.m. UTC | #18
On 8/24/22 00:47, Marc Zyngier wrote:
>> I definitely don't think I 100% understand all the ordering things since
>> they're complicated.. but my understanding is that the reset procedure
>> didn't need memory barrier (unlike pushing, where we have explicit wmb),
>> because we assumed the userapp is not hostile so logically it should only
>> modify the flags which is a 32bit field, assuming atomicity guaranteed.
> Atomicity doesn't guarantee ordering, unfortunately. Take the
> following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> D that exist in the ring in that order, and CPU1 performs an ioctl to
> reset the page state.
> 
> CPU0:
>      write_flag(A, KVM_DIRTY_GFN_F_RESET)
>      write_flag(B, KVM_DIRTY_GFN_F_RESET)
>      write_flag(C, KVM_DIRTY_GFN_F_RESET)
>      write_flag(D, KVM_DIRTY_GFN_F_RESET)
>      [...]
> 
> CPU1:
>     ioctl(KVM_RESET_DIRTY_RINGS)
> 
> Since CPU0 writes do not have any ordering, CPU1 can observe the
> writes in a sequence that have nothing to do with program order, and
> could for example observe that GFN A and D have been reset, but not B
> and C. This in turn breaks the logic in the reset code (B, C, and D
> don't get reset), despite userspace having followed the spec to the
> letter. If each was a store-release (which is the case on x86), it
> wouldn't be a problem, but nothing calls it in the documentation.
> 
> Maybe that's not a big deal if it is expected that each CPU will issue
> a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> writes. But expecting this to work across CPUs without any barrier is
> wishful thinking.

Agreed, but that's a problem for userspace to solve.  If userspace wants 
to reset the fields in different CPUs, it has to synchronize with its 
own invoking of the ioctl.

That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done 
after (in the memory-ordering sense) its last write_flag(D, 
KVM_DIRTY_GFN_F_RESET).  If there's no such ordering, there's no 
guarantee that the write_flag will have any effect.

The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was 
because it takes kvm->slots_lock so the execution would be serialized 
anyway.  Turning slots_lock into an rwsem would be even worse because it 
also takes kvm->mmu_lock (since slots_lock is a mutex, at least two 
concurrent invocations won't clash with each other on the mmu_lock).

Paolo
Paolo Bonzini Aug. 26, 2022, 10:58 a.m. UTC | #19
On 8/23/22 22:35, Marc Zyngier wrote:
>> Heh, yeah I need to get that out the door. I'll also note that Gavin's
>> changes are still relevant without that series, as we do write unprotect
>> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
>> Add fast path to handle permission relaxation during dirty logging").
>
> Ah, true. Now if only someone could explain how the whole
> producer-consumer thing works without a trace of a barrier, that'd be
> great...

Do you mean this?

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
         struct kvm_dirty_gfn *entry;

         /* It should never get full */
         WARN_ON_ONCE(kvm_dirty_ring_full(ring));

         entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

         entry->slot = slot;
         entry->offset = offset;
         /*
          * Make sure the data is filled in before we publish this to
          * the userspace program.  There's no paired kernel-side reader.
          */
         smp_wmb();
         kvm_dirty_gfn_set_dirtied(entry);
         ring->dirty_index++;
         trace_kvm_dirty_ring_push(ring, slot, offset);
}

The matching smp_rmb() is in userspace.
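
Roughly, that userspace side looks like this (a sketch only, not QEMU's
actual code; an acquire load can stand in for the plain read plus the
smp_rmb()):

/* Collect one entry: only trust slot/offset after the DIRTY flag has
 * been observed with acquire semantics. */
static bool dirty_gfn_collect(struct kvm_dirty_gfn *gfn, __u32 *slot, __u64 *offset)
{
	__u32 flags = __atomic_load_n(&gfn->flags, __ATOMIC_ACQUIRE);

	if (!(flags & KVM_DIRTY_GFN_F_DIRTY))
		return false;

	*slot = gfn->slot;
	*offset = gfn->offset;
	return true;
}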

Paolo
Marc Zyngier Aug. 26, 2022, 3:28 p.m. UTC | #20
On Fri, 26 Aug 2022 11:58:08 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 8/23/22 22:35, Marc Zyngier wrote:
> >> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> >> changes are still relevant without that series, as we do write unprotect
> >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> >> Add fast path to handle permission relaxation during dirty logging").
> > 
> > Ah, true. Now if only someone could explain how the whole
> > producer-consumer thing works without a trace of a barrier, that'd be
> > great...
> 
> Do you mean this?
>
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)

Of course not. I mean this:

static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
{
	unsigned long i;
	struct kvm_vcpu *vcpu;
	int cleared = 0;

	if (!kvm->dirty_ring_size)
		return -EINVAL;

	mutex_lock(&kvm->slots_lock);

	kvm_for_each_vcpu(i, vcpu, kvm)
		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
[...]
}

and this

int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
{
	u32 cur_slot, next_slot;
	u64 cur_offset, next_offset;
	unsigned long mask;
	int count = 0;
	struct kvm_dirty_gfn *entry;
	bool first_round = true;

	/* This is only needed to make compilers happy */
	cur_slot = cur_offset = mask = 0;

	while (true) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];

		if (!kvm_dirty_gfn_harvested(entry))
			break;
[...]

}

which provides no ordering whatsoever when a ring is updated from one
CPU and reset from another.

	M.
Marc Zyngier Aug. 26, 2022, 3:49 p.m. UTC | #21
On Fri, 26 Aug 2022 11:50:24 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 8/24/22 00:47, Marc Zyngier wrote:
> >> I definitely don't think I 100% understand all the ordering things since
> >> they're complicated.. but my understanding is that the reset procedure
> >> didn't need memory barrier (unlike pushing, where we have explicit wmb),
> >> because we assumed the userapp is not hostile so logically it should only
> >> modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > Atomicity doesn't guarantee ordering, unfortunately. Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> > 
> > CPU0:
> >      write_flag(A, KVM_DIRTY_GFN_F_RESET)
> >      write_flag(B, KVM_DIRTY_GFN_F_RESET)
> >      write_flag(C, KVM_DIRTY_GFN_F_RESET)
> >      write_flag(D, KVM_DIRTY_GFN_F_RESET)
> >      [...]
> > 
> > CPU1:
> >     ioctl(KVM_RESET_DIRTY_RINGS)
> > 
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that have nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it in the documentation.
> > 
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
> 
> Agreed, but that's a problem for userspace to solve.  If userspace
> wants to reset the fields in different CPUs, it has to synchronize
> with its own invoking of the ioctl.

userspace has no choice. It cannot order on its own the reads that the
kernel will do to *other* rings.

> That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done
> after (in the memory-ordering sense) its last write_flag(D,
> KVM_DIRTY_GFN_F_RESET).  If there's no such ordering, there's no
> guarantee that the write_flag will have any effect.

The problem isn't on CPU0. The problem is that CPU1 does observe
inconsistent data on arm64, and I don't think this difference in
behaviour is acceptable. Nothing documents this, and there is a baked-in
assumption that there is a strong ordering between writes as well
as between writes and reads.

> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
> was because it takes kvm->slots_lock so the execution would be
> serialized anyway.  Turning slots_lock into an rwsem would be even
> worse because it also takes kvm->mmu_lock (since slots_lock is a
> mutex, at least two concurrent invocations won't clash with each other
> on the mmu_lock).

Whatever the reason, the behaviour should be identical on all
architectures. As it is, it only really works on x86, and I contend
this is a bug that needs fixing.

Thankfully, this can be done at zero cost for x86, and at that of a
set of load-acquires on other architectures.

	M.
Paolo Bonzini Aug. 27, 2022, 8:27 a.m. UTC | #22
On 8/26/22 17:49, Marc Zyngier wrote:
>> Agreed, but that's a problem for userspace to solve.  If userspace
>> wants to reset the fields in different CPUs, it has to synchronize
>> with its own invoking of the ioctl.
> 
> userspace has no choice. It cannot order on its own the reads that the
> kernel will do to *other* rings.

Those reads will never see KVM_DIRTY_GFN_F_RESET in the flags however, 
if userspace has never interacted with the ring.  So there will be 
exactly one read on those rings, and there's nothing to reorder.

If that's too tricky and you want to add a load-acquire I have no 
objection though.  It also helps avoid read-read reordering between
one entry's flags and the next one's, so it's a good idea to have it anyway.

>> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
>> was because it takes kvm->slots_lock so the execution would be
>> serialized anyway.  Turning slots_lock into an rwsem would be even
>> worse because it also takes kvm->mmu_lock (since slots_lock is a
>> mutex, at least two concurrent invocations won't clash with each other
>> on the mmu_lock).
> 
> Whatever the reason, the behaviour should be identical on all
> architectures. As it is, it only really works on x86, and I contend
> this is a bug that needs fixing.
> 
> Thankfully, this can be done at zero cost for x86, and at that of a
> set of load-acquires on other architectures.

Yes, the global-ness of the API is orthogonal to the memory ordering 
issue.  I just wanted to explain why a per-vCPU API probably isn't going 
to work great.

Paolo
Peter Xu Aug. 30, 2022, 2:42 p.m. UTC | #23
On Fri, Aug 26, 2022 at 04:28:51PM +0100, Marc Zyngier wrote:
> On Fri, 26 Aug 2022 11:58:08 +0100,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > On 8/23/22 22:35, Marc Zyngier wrote:
> > >> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> > >> changes are still relevant without that series, as we do write unprotect
> > >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> > >> Add fast path to handle permission relaxation during dirty logging").
> > > 
> > > Ah, true. Now if only someone could explain how the whole
> > > producer-consumer thing works without a trace of a barrier, that'd be
> > > great...
> > 
> > Do you mean this?
> >
> > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> 
> Of course not. I mean this:
> 
> static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> {
> 	unsigned long i;
> 	struct kvm_vcpu *vcpu;
> 	int cleared = 0;
> 
> 	if (!kvm->dirty_ring_size)
> 		return -EINVAL;
> 
> 	mutex_lock(&kvm->slots_lock);
> 
> 	kvm_for_each_vcpu(i, vcpu, kvm)
> 		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> [...]
> }
> 
> and this
> 
> int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> {
> 	u32 cur_slot, next_slot;
> 	u64 cur_offset, next_offset;
> 	unsigned long mask;
> 	int count = 0;
> 	struct kvm_dirty_gfn *entry;
> 	bool first_round = true;
> 
> 	/* This is only needed to make compilers happy */
> 	cur_slot = cur_offset = mask = 0;
> 
> 	while (true) {
> 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> 
> 		if (!kvm_dirty_gfn_harvested(entry))
> 			break;
> [...]
> 
> }
> 
> which provides no ordering whatsoever when a ring is updated from one
> CPU and reset from another.

Marc,

I thought we won't hit this as long as we properly take care of other
orderings of (a) gfn push, and (b) gfn collect, but after a second thought
I think it's indeed logically possible that with a reversed ordering here
we can be reading some garbage gfn before (a) happens but also read the
valid flag after (b).

It seems we must have all the barriers correctly applied always.  If that's
correct, do you perhaps mean something like this to just add the last piece
of barrier?

===8<===
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..ea620bfb012d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
 
 static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 {
-       return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+       return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }
 
 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
===8<===

Thanks,
Paolo Bonzini Sept. 2, 2022, 12:19 a.m. UTC | #24
On 8/30/22 16:42, Peter Xu wrote:
> Marc,
> 
> I thought we won't hit this as long as we properly take care of other
> orderings of (a) gfn push, and (b) gfn collect, but after a second thought
> I think it's indeed logically possible that with a reversed ordering here
> we can be reading some garbage gfn before (a) happens but also read the
> valid flag after (b).
> 
> It seems we must have all the barriers correctly applied always.  If that's
> correct, do you perhaps mean something like this to just add the last piece
> of barrier?

Okay, so I thought about it some more and it's quite tricky.

Strictly speaking, the synchronization is just between userspace and 
kernel. The fact that the actual producer of dirty pages is in another 
CPU is a red herring, because reset only cares about harvested pages.

In other words, the dirty page ring is essentially two ring buffers in 
one and we only care about the "harvested ring", not the "produced ring".

On the other hand, it may happen that userspace has set more RESET flags 
while the ioctl is ongoing:


     CPU0                     CPU1               CPU2
                                                 fill gfn0
                                                 store-rel flags for gfn0
                                                 fill gfn1
                                                 store-rel flags for gfn1
     load-acq flags for gfn0
     set RESET for gfn0
     load-acq flags for gfn1
     set RESET for gfn1
     do ioctl! ----------->
                              ioctl(RESET_RINGS)
                                                 fill gfn2
                                                 store-rel flags for gfn2
     load-acq flags for gfn2
     set RESET for gfn2
                              process gfn0
                              process gfn1
                              process gfn2
     do ioctl!
     etc.

The three load-acquires in CPU0 synchronize with the three store-releases 
in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1, and CPU1 may 
miss gfn2's fields other than flags.

The kernel must be able to cope with invalid values of the fields, and 
userspace will invoke the ioctl once more.  However, once the RESET flag 
is cleared on gfn2, it is lost forever; therefore, in the above scenario 
CPU1 must read the correct value of gfn2's fields.

Therefore RESET must be set with a store-release, which will synchronize 
with a load-acquire in CPU1, as you suggested.

Paolo

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f4c2a6eb1666..ea620bfb012d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
>  
>  static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
>  {
> -       return gfn->flags & KVM_DIRTY_GFN_F_RESET;
> +       return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
>  }
>  
>  int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> ===8<===
> 
> Thanks,
> 
> -- 
> Peter Xu
>
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..19fa1ac017ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8022,7 +8022,7 @@  regardless of what has actually been exposed through the CPUID leaf.
 8.29 KVM_CAP_DIRTY_LOG_RING
 ---------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Parameters: args[0] - size of the dirty log ring
 
 KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 3bb134355874..7e04b0b8d2b2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -43,6 +43,7 @@ 
 #define __KVM_HAVE_VCPU_EVENTS
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 815cc118c675..0309b2d0f2da 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,7 @@  menuconfig KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select HAVE_KVM_DIRTY_RING
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..3de6b9b39db7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -866,6 +866,14 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (!ret)
 			ret = 1;
 
+		/* Force vcpu exit if its dirty ring is soft-full */
+		if (unlikely(vcpu->kvm->dirty_ring_size &&
+			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			ret = 0;
+		}
+
 		if (ret > 0)
 			ret = check_vcpu_requests(vcpu);